[QUOTE=kriesel;493721]UID is optional; modify the ini file and restart to include that. You'll still need to log in before pasting the results into the manual results submission form.[/QUOTE]
Thank you. I don't know which config file option I must use to specify the user.
[QUOTE=SELROC;493723]Thank you, I don't know which config file option I must use to specify the user..[/QUOTE]In mfakto.ini
[CODE]
# if V5UserID and ComputerID are specified, then the result lines in the results file will
# have the prefix "UID: user/host, " - the same way as prime95 does it.
# default: none (unset)
V5UserID=kriesel
ComputerID=condorella-rx550
[/CODE]
[QUOTE=kriesel;493751]In mfakto.ini
[CODE]
# if V5UserID and ComputerID are specified, then the result lines in the results file will
# have the prefix "UID: user/host, " - the same way as prime95 does it.
# default: none (unset)
V5UserID=kriesel
ComputerID=condorella-rx550
[/CODE][/QUOTE] Thank you very much.
[QUOTE=preda;493669]Try it. If it passes the self-test, it's good to go.[/QUOTE]
It took a while to figure out that device numbering differs between programs: in gpuOwl device numbering starts at 0, while in mfakto it starts at 1, which can be confusing. Anyway, the GPU is under heavy load here, having got a batch of 100 TF assignments.

A side note: something must be done about the mfakto wiki page [url]http://mersennewiki.org/index.php?title=Mfakto&printable=yes[/url], which shows an error:
[CODE]
Action failed

Could not open lock file for "mwstore://local-backend/local-public/6/67/Gimps.gif". Make sure your upload directory is configured correctly and your web server has permission to write to that directory. See [URL]https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:$wgUploadDirectory[/URL] for more information.

Return to [URL="http://mersennewiki.org/index.php/Main_Page"]Main Page[/URL].
Retrieved from "[URL]http://mersennewiki.org/index.php?title=Mfakto&oldid=8290[/URL]"
[/CODE]
When I run cudaowl on a GTX-960, on Linux, cudaowl fully occupies one CPU core. Does anyone else see gpuowl/cudaowl using large amounts of CPU time?
I found I can reduce the CPU usage significantly by adding a cudaDeviceSynchronize() call in modSqLoop. I suspect the CPU usage is due to the CPU busy-waiting on the GPU to finish; with an explicit synchronize call the CPU thread sleeps instead, freeing up the core. GitHub issue here: [URL]https://github.com/preda/gpuowl/issues/13[/URL]. Btw, I saw the same behaviour running gpuowl before the CUDA version.
[QUOTE=Fredrik;493785]When I run cudaowl on a GTX-960, on Linux, cudaowl fully occupies one CPU core. Does anyone else see gpuowl/cudaowl using large amounts of CPU time?
I found I can reduce the CPU usage significantly by adding a cudaDeviceSynchronize() call in the modSqLoop. I suspect the CPU usage is due to the CPU busy-waiting on the GPU to finish. With an explicit synchronize call the CPU thread sleeps instead, freeing up the core. github issue here: [URL]https://github.com/preda/gpuowl/issues/13[/URL] . Btw I saw the same behaviour running gpuowl before the CUDA version.[/QUOTE] I don't know about CUDA, but GpuOwl (the OpenCL variant) does not occupy a full CPU core; instead it goes into a sleep phase and CPU consumption is minimal. I have run 10 instances of gpuOwl in tandem and did not see this problem.
[QUOTE=Fredrik;493785]When I run cudaowl on a GTX-960, on Linux, cudaowl fully occupies one CPU core. Does anyone else see gpuowl/cudaowl using large amounts of CPU time?
I found I can reduce the CPU usage significantly by adding a cudaDeviceSynchronize() call in the modSqLoop. I suspect the CPU usage is due to the CPU busy-waiting on the GPU to finish. With an explicit synchronize call the CPU thread sleeps instead, freeing up the core. github issue here: [URL]https://github.com/preda/gpuowl/issues/13[/URL] . Btw I saw the same behaviour running gpuowl before the CUDA version.[/QUOTE] I'm not seeing that effect on Windows 7, on any of:
gpuOwL v1.9
gpuOwL v2.0
OpenOwL v3.3
OpenOwL v3.5
(I haven't built cudaOwL yet, or the build environment for it.) But checking was a useful exercise, because I found MSI Live Update 6 sucking up 10% of the 12 cores' throughput.
[QUOTE=Fredrik;493785]When I run cudaowl on a GTX-960, on Linux, cudaowl fully occupies one CPU core. Does anyone else see gpuowl/cudaowl using large amounts of CPU time?
I found I can reduce the CPU usage significantly by adding a cudaDeviceSynchronize() call in the modSqLoop. I suspect the CPU usage is due to the CPU busy-waiting on the GPU to finish. With an explicit synchronize call the CPU thread sleeps instead, freeing up the core. github issue here: [URL]https://github.com/preda/gpuowl/issues/13[/URL] . Btw I saw the same behaviour running gpuowl before the CUDA version.[/QUOTE] Could you please tell me the version reported by cudaowl? And the CUDA toolkit version?

This is an annoying issue on CUDA. I already tried to get rid of the busy-wait, but it seems I need to try more. Will look into this in the following days.
v3.6 openOwL Win64 build
In msys2/mingw64:
[CODE]
$ g++ -DREV=\"v20180810\" -O2 -c gpuowl.cpp -o gpuowl.o
$ g++ -O2 -c OpenGpu.cpp -o OpenGpu.o
$ g++ -O2 -c common.cpp -o common.o
$ g++ -O2 -c Gpu.cpp -o Gpu.o
$ g++ -o openowl-V36-v20180810-W64.exe OpenGpu.o common.o Gpu.o gpuowl.o -lOpenCL -static
[/CODE]
It seems to continue (a copy of) a v2.0 exponent in progress OK, with two notable differences: it is faster, at 3.85 ms/it on an RX 480 with fft size 4608K autoselected for a 79M exponent, vs. v2.0's 5.25 ms/it at 5000K fft size; and log and checkpoint writes are 250,000 iterations apart (continuing with v2.0's blocksize 500; 500^2 = 250,000), rather than the v3.x default blocksize of 400, which yields checkpoint and log writes every 400^2 = 160,000 iterations.

No busy-wait behavior in this build either, though that's a reported issue with CUDA, not OpenCL: prime95 99% CPU, openowl v3.6 0%, gpuowl v1.9 0% CPU.

Running overlapping iterations from the point of copy for 100k+ iterations on v2.0 and comparing to the v3.6 res64 showed exact matches at each of the available matching total iteration points.

Is there any reason to believe the speed of v3.6 would be different from v3.5? v3.6 seems to match v3.5 at 4608K within plausible measurement error.

Compressed executable for Win64 attached. As usual, user assumes risk of downloaded software.
[QUOTE=preda;493831]Could you please tell me the version reported by cudaowl? And the CUDA toolkit version?
This is an annoying issue on CUDA. I already tried to get rid of the busy-wait, but it seems I need to try more. Will look into this in the following days.[/QUOTE] My versions are these (I updated gpuowl from github yesterday):
[CODE]
./cudaowl --version
gpuowl-CUDA 3.6-f7c3865-mod

Cuda version reported by mfaktc:
CUDA runtime version 9.20
CUDA driver version 9.20
[/CODE]
The gpuowl version says -mod because I inserted a cudaDeviceSynchronize() call.

My theory at the moment is that the flag cudaDeviceScheduleBlockingSync, which you set, only takes effect for explicit cudaDeviceSynchronize() calls. In the compute-intensive modSqLoop function there is no synchronization call; cudaowl just issues a long sequence of kernel calls, which are executed in order by being in the same stream. Somehow, this seems to cause a busy wait. I tried different cudaDeviceScheduleBlockingSync values, but couldn't find any that helped. What did help for me is this modification in CudaGpu.h:
[CODE]
  void modSqLoop(int *&bufIn, int *&bufOut, int nIters, bool doMul3) {
    u32 baseBits = E / N;
    preWeight<<<N/4/256, 256>>>(baseBits, (int4 *) bufIn, (double4 *) bufA, (double4 *) bufBig1);

    for (int i = 0; i < nIters - 1; ++i) {
      CC(cufftExecD2Z(plan1, (double *) bufBig1, (double2 *) bufBig2));
      square<<<N/2/256, 256>>>((double2 *) bufBig2);
      CC(cufftExecZ2D(plan2, (double2 *) bufBig2, (double *) bufBig1));
      carryA<1><<<N/4/256, 256>>>(baseBits, (double4 *) bufBig1, (int4 *) bufOut, (long *) bufBig2, (double4 *) bufI);
      carryB<<<N/4/256, 256>>>(baseBits, (int4 *) bufOut, (long *) bufBig2, (double4 *) bufBig1, (double4 *) bufA);
      cudaDeviceSynchronize(); // <----------- added this
    }
[/CODE]
I don't understand why a single synchronization call is sufficient, but it was. Maybe the GPU can queue a few kernel calls?
There may be a difference in behavior between Linux and Windows: I found a 10-year-old [URL="https://devtalk.nvidia.com/default/topic/384311/cuda-programming-and-performance/do-the-non-async-calls-sleep-or-burn-cpu-/1"]discussion[/URL] claiming that CUDA on Windows yields more readily, while on Linux it tends to busy-wait.
[QUOTE=preda;493831]Could you please tell me the version reported by cudaowl? And the CUDA toolkit version?
This is an annoying issue on CUDA. I already tried to get rid of the busy-wait, but it seems I need to try more. Will look into this in the following days.[/QUOTE] I found this; I don't know if it helps in any way: [url]http://www.ece.neu.edu/groups/nucar/GPGPU4/files/thall.pdf[/url]