[QUOTE=Prime95;525711]The first step should be to look at the code and convince oneself that 2047 ought to work. No one has done that. So.....
I took a look at the code. I see no reason why a setting of 2047 would not work. In fact, changing gpu_sieve_size to an unsigned int might allow for values up to 4095. Changing to unsigned long long could allow much higher values. There would also be some typecasts required to avoid compiler warnings. The real limit is imposed by CUDA on this code line: [CODE]SegSieve<<<(sieve_size + block_size - 1) / block_size, threadsPerBlock>>>((uint8 *)mystuff->d_bitarray, (uint8 *)mystuff->d_sieve_info, primes_per_thread);[/CODE]What is CUDA's limit on the first parameter ((sieve_size + block_size - 1) / block_size)?[/QUOTE]That's good news. The ini file says gpusievesize is in units of megaBITS there. If the relevant parameter is Maximum Pitch, that is ~2047 MBytes. There's a daunting variety of CUDA parameters. Following is GPU-Z 2.25.0's Advanced / CUDA output for an older gpu:[CODE]General
CUDA Device Name                  Tesla C2075
Compute Capability                2.0
Processor Count                   14
Cores per Processor               32
GPU Clock Rate                    1147.0 MHz
Memory Clock Rate                 1566.0 MHz
Memory Bus Width                  384
L2 Cache Size                     768 KB
Global Memory Size                5316 MB
Async Engines                     2
SP to DP Ratio                    1:2
ECC Supported                     Yes
Using TCC Driver                  Yes
Compute Mode                      Default
Multi-GPU Board                   No (0)
PCI ID                            Bus 6, Dev 0, Domain 0
Threads per Multiprocessor        1536
Max Shmem per Multiprocessor      48 KB
Execute Multiple Kernels          Yes
Preemption Supported              No

Memory
Native Atomic Supported           No
Unified Address Space             Yes
Integrated w/ Host Memory         No
Can map Host Memory               Yes
Can allocate Managed Memory       No
Pageable Memory Access            No
Concurrent Managed Memory Access  No
Can use Host Memory Pointers      No
Supports Stream Priorities        No
Can Cache Globals in L1           Yes
Can Cache Locals in L1            Yes
Max Block Size                    1024 x 1024 x 64
Max # of Threads per Block        1024
Max Shmem per Block               48 KB
Max Grid Size                     65535 x 65535 x 65535
Max Registers per Block           32768
Total Constant Memory             64 KB
Warp Size                         32 Threads
[B]Maximum Pitch                     2097151 KB[/B]
Texture Alignment                 0 KB
Surface Alignment                 512
Texture Pitch Alignment           32
GPU Overlap                       Yes
Kernel Runtime Limit              No

Size Constraints
1D Texture Size                   65536
1D Layered Texture Size           16384 x 2048
2D Texture Size                   65536 x 65535
2D Layered Texture Size           16384 x 16384 x 2048
2D Texture Size Gather            16384 x 16384
3D Texture Size                   2048 x 2048 x 2048
3D Texture Size Alt               0 x 0 x 0
Cubemap Texture Size              16384 x 16384
Layered Cubemap Texture Size      32768 x 32768 x 2046
1D Surface Size                   65536
1D Layered Surface Size           65536 x 2048
2D Surface Size                   65536 x 32768
2D Layered Texture Size           65536 x 65536 x 2048
3D Surface Size                   65536 x 32768 x 2048
Cubemap Surface Size              32768 x 32768
Cubemap Layered Surface Size      32768 x 32768 x 2046
[B]1D Linear Texture Size            134217728[/B]
2D Linear Texture Size            65000 x 65000
2D Linear Texture Pitch           1048544
1D Mipmapped Texture Size         16384
2D Mipmapped Texture Size         16384 x 16384[/CODE][URL]https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications[/URL]
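The quoted kernel launch divides the sieve into blocks, so the grid's x-dimension is what CUDA actually limits. Here is a quick back-of-the-envelope check in shell, using an illustrative block_size of 256 bits per block (an assumption made for the arithmetic; mfaktc's actual value may differ):

```shell
# GPUSieveSize is in Mbits; 2047 is the proposed maximum.
# block_size here is an assumed illustrative value, not necessarily mfaktc's.
gpusievesize_mbits=2047
block_size=256

sieve_bits=$(( gpusievesize_mbits * 1024 * 1024 ))
# Same rounding-up division as in the SegSieve launch line quoted above.
grid_x=$(( (sieve_bits + block_size - 1) / block_size ))
echo "grid x-dimension: $grid_x blocks"
```

With these numbers the grid's x-dimension comes out to 8,384,512 blocks: far beyond the 65535-per-axis limit of the compute-capability-2.0 card shown above, but comfortably under the 2^31-1 x-dimension limit introduced with compute capability 3.0.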
I thought I'd share some scripts that users might find useful:
1. For those with multi-GPU systems, [C]start-mfaktc.sh[/C] allows multiple mfaktc instances to be launched from a single copy of mfaktc. You'll first need to create a folder called [C]device-<device number>[/C] for each device. For example, on a system with four GPUs you'll need to create the sub-folders [C]device-0[/C] through [C]device-3[/C] in your mfaktc folder. You can then run [C]./start-mfaktc.sh <device number>[/C] to launch an instance. If no device number is specified, the script simply calls mfaktc from its root folder. The script works by creating symbolic links to your mfaktc executable and configuration file, and it uses a lock file to prevent you from accidentally running more than one instance on the same device. Once the mfaktc instance terminates, the symbolic links and lock file are deleted. Set the [C]SHARED_SETTINGS[/C] variable to [C]false[/C] if you want a separate [C]mfaktc.ini[/C] file for each GPU; this is useful if you want each device to use specific settings. I'll try to port this script to Windows if I ever get access to a Windows system with multiple GPUs.

2. [C]archive-mfaktc.sh[/C] backs up your mfaktc results and is intended for people who use remote systems. You can then use SCP to copy the .zip file to a local computer, etc.
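For anyone curious how such a launcher works, here is a minimal sketch of the mechanism described above (not the attached script itself; names and paths are illustrative, and most error handling is omitted):

```shell
#!/bin/sh
# Minimal sketch of a per-device mfaktc launcher: one shared binary,
# per-device working folders, and a lock file per device.
dev="${1:-0}"                 # GPU device number (defaults to 0 here)
dir="device-$dev"             # per-device folder
lock="$dir/mfaktc.lock"

mkdir -p "$dir"
ln -sf ../mfaktc "$dir/mfaktc"            # share one executable via symlink
ln -sf ../mfaktc.ini "$dir/mfaktc.ini"    # omit for per-device ini files

if [ -e "$lock" ]; then
    echo "an instance already appears to be running on device $dev" >&2
else
    touch "$lock"
    # mfaktc's -d flag selects the CUDA device; skip launch if binary absent
    ( cd "$dir" && [ -x mfaktc ] && ./mfaktc -d "$dev" ) || true
    rm -f "$lock" "$dir/mfaktc" "$dir/mfaktc.ini"   # clean up afterwards
fi
```

The lock file is a simple guard, not a race-free one; it is enough to stop an accidental double launch by hand, which is all the described script claims to do.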
CPU slows GPU ... should it?
When I run my GTX-2080Ti all alone doing mfaktc TF, it runs at 4,300 GHz-days/day.
However, if I start up all 8 cores of the corresponding i5-7820x doing RAM-intensive P-1, the GPU drops to under 4,000. In other words, it drops by more than the entire GHz-days throughput of the CPU. Is this normal, or is there a GPU setting I simply need to change? Thanks. Relevant GPU config parms: [CODE]SieveOnGPU=1
GPUSievePrimes=82486
GPUSieveSize=128
GPUSieveProcessSize=16[/CODE]
[QUOTE=petrw1;526872]Is this normal or is there a GPU setting I simply need to change?[/QUOTE]I see the same thing (with much smaller numbers) with my RX 480 if there's something taking up all cores. I normally let P-1 use 5 of 6 cores, leaving one free for mfakto and general system responsiveness. Even though mfakt[i]x[/i] takes up almost no CPU resources, if it doesn't get them immediately when it wants them, throughput seems to suffer.
[QUOTE=petrw1;526872]When I run my GTX-2080Ti all alone doing mfaktc TF, it runs at 4,300 GHz-days/day.
However, if I start up all 8 cores of the corresponding i5-7820x doing RAM-intensive P-1, the GPU drops to under 4,000. In other words, it drops by more than the entire GHz-days throughput of the CPU. Is this normal, or is there a GPU setting I simply need to change? Thanks. Relevant GPU config parms: [CODE]SieveOnGPU=1
GPUSievePrimes=82486
GPUSieveSize=128
GPUSieveProcessSize=16[/CODE][/QUOTE]Do you see this effect only with P-1, not with PRP or LL? It looks like the only parameter you've changed from the initial mfaktc.ini defaults is GPUSieveSize, from 64 to 128. On a GTX 1080 Ti I found these improved throughput: GPUSievePrimes=90000, GPUSieveSize=128, GPUSieveProcessSize=32. Running two instances simultaneously on the GPU added a couple of percent by itself; three instances gave more throughput than one, but less overall than two. All of that was with prime95 saturating all CPU cores. The effect of multiple instances is stronger the faster the GPU is, but it increased total throughput on a GTX 1050 Ti as well. (Presumably an instance waits while the result of a class is written to display and disk. When two instances are run, provided they are not synchronized, one can keep the GPU busy while the other waits.) Others with RTX 20xx cards have found gains of several percent with recompiled versions of mfaktc using GPUSieveSize up to 2047.
I've only tried it with P-1 on the CPU.
I don't have the tools (maybe the mental tools) to recompile mfaktc. Larger GPUSievePrimes or GPUSieveProcessSize values seem to have negligible impact for me. Thanks. [QUOTE=kriesel;526890]Do you see this effect only with P-1, not with PRP or LL? It looks like the only parameter you've changed from the initial mfaktc.ini defaults is GPUSieveSize, from 64 to 128. On a GTX 1080 Ti I found these improved throughput: GPUSievePrimes=90000, GPUSieveSize=128, GPUSieveProcessSize=32. Running two instances simultaneously on the GPU added a couple of percent by itself; three instances gave more throughput than one, but less overall than two. All of that was with prime95 saturating all CPU cores. The effect of multiple instances is stronger the faster the GPU is, but it increased total throughput on a GTX 1050 Ti as well. (Presumably an instance waits while the result of a class is written to display and disk. When two instances are run, provided they are not synchronized, one can keep the GPU busy while the other waits.) Others with RTX 20xx cards have found gains of several percent with recompiled versions of mfaktc using GPUSieveSize up to 2047.[/QUOTE]
[QUOTE=petrw1;526915]
I don't have the tools (maybe the mental tools) to recompile mfaktc. [/QUOTE]If you're running Windows, see [URL]https://www.mersenneforum.org/showpost.php?p=508680&postcount=116[/URL] or for linux [URL]https://www.mersenneforum.org/showpost.php?p=525178&postcount=3189[/URL] |
[QUOTE=kriesel;526922]If you're running Windows, see [URL]https://www.mersenneforum.org/showpost.php?p=508680&postcount=116[/URL]
or for linux [URL]https://www.mersenneforum.org/showpost.php?p=525178&postcount=3189[/URL][/QUOTE] I've been running the above Windows version on a 1080-Ti since February 2019 and it gave me a nice boost when I switched. |
Wow, too bad I didn't see this months ago.
Throughput changed from 3,900 to 4,500 on my 2080Ti doing TF to 74 in the 4xM ranges: a 15% improvement. When would I use the Less Classes version?
[QUOTE=petrw1;526935]When would I used the Less Classes version?[/QUOTE]When assignment runtime is ridiculously short, like a few seconds per exponent (e.g. [URL="https://www.mersenne.ca/tf1G.php"]TF>1G[/URL]).
On my 2080, assignments that take 9.3 seconds with normal mfaktc take 8.8 seconds with Less Classes, so the cutoff is somewhere near that.
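Those two timings can be turned into a rough per-class overhead estimate. Assuming the usual mfaktc class counts (960 of 4620 classes actually tested in the normal build, 96 of 420 in the Less Classes build; these are taken from my reading of mfaktc's params.h and should be verified against your build), the 0.5-second difference implies roughly half a millisecond of fixed overhead per class:

```shell
# Timings from the post above, in ms. Class counts (960/4620 normal,
# 96/420 Less Classes) are an assumption - check params.h for your build.
t_normal_ms=9300
t_less_ms=8800
extra_classes=$(( 960 - 96 ))
overhead_us=$(( (t_normal_ms - t_less_ms) * 1000 / extra_classes ))
echo "~${overhead_us} microseconds of fixed overhead per class"
```

That fixed cost is why Less Classes only pays off when the per-exponent runtime is very short: the saved overhead is constant, while the sieving work itself dominates on longer assignments.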