mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

kriesel 2019-09-12 14:40

[QUOTE=Prime95;525711]The first step should be to look at the code and convince oneself that 2047 ought to work. No one has done that. So.....

I took a look at the code. I see no reason why a setting of 2047 would not work. In fact, changing gpu_sieve_size to an unsigned int might allow for values up to 4095. Changing to unsigned long long could allow much higher values. There would also be some typecasts required to avoid compiler warnings.

The real limit is imposed by CUDA on this code line:

[CODE] SegSieve<<<(sieve_size + block_size - 1) / block_size, threadsPerBlock>>>((uint8 *)mystuff->d_bitarray, (uint8 *)mystuff->d_sieve_info, primes_per_thread);
[/CODE]What is CUDA's limit on the first parameter ((sieve_size + block_size - 1) / block_size)?[/QUOTE]That's good news.

The ini file says GPUSieveSize is in units of megaBITS. If the relevant parameter is maximum pitch, that is ~2047 MBytes. There's a daunting variety of CUDA parameters. Following is GPU-Z 2.25.0's Advanced / CUDA output for an older GPU:[CODE]General
CUDA Device Name Tesla C2075
Compute Capability 2.0
Processor Count 14
Cores per Processor 32
GPU Clock Rate 1147.0 MHz
Memory Clock Rate 1566.0 MHz
Memory Bus Width 384
L2 Cache Size 768 KB
Global Memory Size 5316 MB
Async Engines 2
SP to DP Ratio 1:2
ECC Supported Yes
Using TCC Driver Yes
Compute Mode Default
Multi-GPU Board No (0)
PCI ID Bus 6, Dev 0, Domain 0
Threads per Multiprocessor 1536
Max Shmem per Multiprocessor 48 KB
Execute Multiple Kernels Yes
Preemption Supported No

Memory
Native Atomic Supported No
Unified Address Space Yes
Integrated w/ Host Memory No
Can map Host Memory Yes
Can allocate Managed Memory No
Pageable Memory Access No
Concurrent Managed Memory Access No
Can use Host Memory Pointers No
Supports Stream Priorities No
Can Cache Globals in L1 Yes
Can Cache Locals in L1 Yes
Max Block Size 1024 x 1024 x 64
Max # of Threads per Block 1024
Max Shmem per Block 48 KB
Max Grid Size 65535 x 65535 x 65535
Max Registers per Block 32768
Total Constant Memory 64 KB
Warp Size 32 Threads
[B]Maximum Pitch 2097151 KB[/B]
Texture Alignment 0 KB
Surface Alignment 512
Texture Pitch Alignment 32
GPU Overlap Yes
Kernel Runtime Limit No

Size Constraints
1D Texture Size 65536
1D Layered Texture Size 16384 x 2048
2D Texture Size 65536 x 65535
2D Layered Texture Size 16384 x 16384 x 2048
2D Texture Size Gather 16384 x 16384
3D Texture Size 2048 x 2048 x 2048
3D Texture Size Alt 0 x 0 x 0
Cubemap Texture Size 16384 x 16384
Layered Cubemap Texture Size 32768 x 32768 x 2046
1D Surface Size 65536
1D Layered Surface Size 65536 x 2048
2D Surface Size 65536 x 32768
2D Layered Texture Size 65536 x 65536 x 2048
3D Surface Size 65536 x 32768 x 2048
Cubemap Surface Size 32768 x 32768
Cubemap Layered Surface Size 32768 x 32768 x 2046
[B]1D Linear Texture Size 134217728[/B]
2D Linear Texture Size 65000 x 65000
2D Linear Texture Pitch 1048544
1D Mipmapped Texture Size 16384
2D Mipmapped Texture Size 16384 x 16384
[/CODE][URL]https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications[/URL]
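The grid-dimension question in the quote can be sanity-checked with simple arithmetic. A minimal sketch, assuming an illustrative block_size of 8192 bits (not mfaktc's actual value, which depends on the build); the CUDA grid.x limit is 65535 through compute capability 2.x and 2^31 - 1 from 3.0 onward:

```shell
#!/bin/bash
# Sanity-check the SegSieve grid dimension for GPUSieveSize=2047 (Mbits).
# block_size is an assumed, illustrative value; mfaktc's real value differs.
gpusievesize_mbits=2047
sieve_size=$(( gpusievesize_mbits * 1024 * 1024 ))   # sieve size in bits
block_size=8192                                      # bits sieved per block (assumed)
blocks=$(( (sieve_size + block_size - 1) / block_size ))
echo "grid.x = $blocks"
# grid.x limit: 65535 through compute capability 2.x, 2147483647 from 3.0 on
if [ "$blocks" -le 65535 ]; then
    echo "fits any compute capability"
else
    echo "needs compute capability 3.0+"
fi
```

With these assumed numbers, even 2047 Mbits stays far below the 2^31 - 1 grid limit of compute capability 3.0+, consistent with the conclusion that the setting ought to work on modern cards.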

ixfd64 2019-09-26 22:27

1 Attachment(s)
I thought I'd share some scripts that users might find useful:

1. For those with multi-GPU systems, [C]start-mfaktc.sh[/C] allows multiple mfaktc instances to be launched from a single copy of mfaktc. You'll first need to create a folder called [C]device-<device number>[/C] for each device. For example, you'll need to create the sub-folders [C]device-0[/C] through [C]device-3[/C] in your mfaktc folder on a system with four GPUs. You can then run [C]./start-mfaktc.sh <device number>[/C] to launch an instance. If no device number is specified, the script will simply call mfaktc from its root folder.

This script works by creating symbolic links to your mfaktc executable and configuration file. It also uses a lock file to prevent the user from accidentally running more than one instance on the same device. Once the mfaktc instance terminates, the symbolic links and lock file are deleted. You can set the [c]SHARED_SETTINGS[/c] variable to [C]false[/C] if you want a separate [C]mfaktc.ini[/C] file for each GPU. This is useful if you want each device to use specific settings.

I'll try to port this script to Windows if I ever get access to a Windows system with multiple GPUs.

2. [C]archive-mfaktc.sh[/C] backs up your mfaktc results and is intended for people who use remote systems. You can then use SCP to copy the .zip file to a local computer, etc.
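For readers who don't open the attachment, the mechanism described above (per-device folder, symlinks to a shared executable and ini file, a lock file removed on exit) can be sketched roughly like this. This is a hypothetical reconstruction, not the attached script; the binary name [C]mfaktc[/C] and its [C]-d <device>[/C] flag are the standard mfaktc conventions:

```shell
#!/bin/bash
# Hypothetical sketch in the spirit of start-mfaktc.sh (not the attached script).
# Runs one mfaktc instance per GPU out of device-<n> sub-folders that share a
# single executable and (optionally) a single mfaktc.ini via symbolic links.
start_instance() {
    dev="${1:?usage: start_instance <device-number>}"
    dir="device-$dev"
    mkdir -p "$dir"
    lock="$dir/mfaktc.lock"
    # The lock file prevents accidentally starting two instances on one device.
    if [ -e "$lock" ]; then
        echo "device $dev already has a running instance" >&2
        return 1
    fi
    touch "$lock"
    ln -sf "$PWD/mfaktc"     "$dir/mfaktc"
    ln -sf "$PWD/mfaktc.ini" "$dir/mfaktc.ini"   # skip this line for per-device inis
    ( cd "$dir" && ./mfaktc -d "$dev" )          # -d selects the CUDA device
    rc=$?
    # Clean up the symlinks and lock file once the instance terminates.
    rm -f "$lock" "$dir/mfaktc" "$dir/mfaktc.ini"
    return $rc
}
```

Keeping each instance in its own folder also keeps the checkpoint and results files per device, which is the point of the layout.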

petrw1 2019-09-29 06:19

CPU slows GPU ... should it?
 
When I run my GTX 2080 Ti all alone doing mfaktc TF, it runs at 4,300 GHz-days/day.

However, if I start up all 8 cores of the corresponding i7-7820X doing RAM-intensive P-1, the GPU drops to under 4,000.
In other words, it drops by more than the entire GHz-days throughput of the CPU.

Is this normal, or is there a GPU setting I simply need to change?

Thanks

Relevant GPU Config parms:

[CODE]SieveOnGPU=1
GPUSievePrimes=82486
GPUSieveSize=128
GPUSieveProcessSize=16[/CODE]

James Heinrich 2019-09-29 12:15

[QUOTE=petrw1;526872]Is this normal, or is there a GPU setting I simply need to change?[/QUOTE]I see the same thing (with much smaller numbers) on my RX 480 if something is taking up all cores. I normally let P-1 use 5 of 6 cores, leaving one free for mfakto and general system responsiveness. Even though mfakt[i]x[/i] takes up almost no CPU resources, if it doesn't get them immediately when it wants them, throughput seems to suffer.

kriesel 2019-09-29 13:56

[QUOTE=petrw1;526872]When I run my GTX 2080 Ti all alone doing mfaktc TF, it runs at 4,300 GHz-days/day.

However, if I start up all 8 cores of the corresponding i7-7820X doing RAM-intensive P-1, the GPU drops to under 4,000.
In other words, it drops by more than the entire GHz-days throughput of the CPU.

Is this normal, or is there a GPU setting I simply need to change?

Thanks

Relevant GPU Config parms:

[CODE]SieveOnGPU=1
GPUSievePrimes=82486
GPUSieveSize=128
GPUSieveProcessSize=16[/CODE][/QUOTE]Do you see this effect only with P-1, not with PRP or LL?
Looks like the only parameter you've changed from the initial mfaktc.ini defaults is GPUSieveSize, from 64 to 128. On a GTX 1080 Ti I found these improved throughput:
GPUSievePrimes=90000
GPUSieveSize=128
GPUSieveProcessSize=32
Running 2 instances simultaneously on the GPU added a couple percent by itself; 3 instances was better than one, but gave less overall throughput than 2. All of that was with prime95 saturating all CPU cores. The effect of multiple instances is stronger the faster the GPU is, but it increased total throughput on a GTX 1050 Ti too. (Presumably one instance waits while the result of a class is written to display and disk. When two instances run, provided they are not synchronized, one can keep the GPU busy while the other waits.)

Others with RTX 20xx cards have found gains of several percent with recompiled versions of mfaktc using GPUSieveSize up to 2047.

petrw1 2019-09-29 19:33

I've only tried it with P-1 on the CPU.
I don't have the tools (maybe the mental tools) to recompile mfaktc.

Larger GPUSievePrimes or GPUSieveProcessSize seem to have negligible impact for me.

Thanks

[QUOTE=kriesel;526890]Do you see this effect only with P-1, not with PRP or LL?
Looks like the only parameter you've changed from the initial mfaktc.ini defaults is GPUSieveSize, from 64 to 128. On a GTX 1080 Ti I found these improved throughput:
GPUSievePrimes=90000
GPUSieveSize=128
GPUSieveProcessSize=32
Running 2 instances simultaneously on the GPU added a couple percent by itself; 3 instances was better than one, but gave less overall throughput than 2. All of that was with prime95 saturating all CPU cores. The effect of multiple instances is stronger the faster the GPU is, but it increased total throughput on a GTX 1050 Ti too. (Presumably one instance waits while the result of a class is written to display and disk. When two instances run, provided they are not synchronized, one can keep the GPU busy while the other waits.)

Others with RTX20xx have found gains of several percent with recompiled versions of mfaktc using gpusievesize up to 2047.[/QUOTE]

kriesel 2019-09-29 20:54

[QUOTE=petrw1;526915]
I don't have the tools (maybe the mental tools) to recompile mfaktc.
[/QUOTE]If you're running Windows, see [URL]https://www.mersenneforum.org/showpost.php?p=508680&postcount=116[/URL]
or for linux [URL]https://www.mersenneforum.org/showpost.php?p=525178&postcount=3189[/URL]

Chuck 2019-09-29 22:17

[QUOTE=kriesel;526922]If you're running Windows, see [URL]https://www.mersenneforum.org/showpost.php?p=508680&postcount=116[/URL]
or for linux [URL]https://www.mersenneforum.org/showpost.php?p=525178&postcount=3189[/URL][/QUOTE]

I've been running the above Windows version on a 1080-Ti since February 2019 and it gave me a nice boost when I switched.

petrw1 2019-09-29 23:38

Wow, too bad I didn't see this months ago.
Throughput changed from 3,900 to 4,500 on my 2080 Ti doing TF to 74 bits in the 4xM ranges: a 15% improvement.

When would I use the Less Classes version?

James Heinrich 2019-09-29 23:48

[QUOTE=petrw1;526935]When would I use the Less Classes version?[/QUOTE]When assignment runtime is ridiculously short, like a few seconds per exponent (e.g. [URL="https://www.mersenne.ca/tf1G.php"]TF>1G[/URL]).

ATH 2019-09-30 01:09

On my 2080, assignments that take 9.3 seconds with normal mfaktc take 8.8 seconds with Less Classes, so the cutoff is somewhere near there.
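Those two timings give a rough feel for the per-class overhead. A stock mfaktc build runs up to 960 of its 4620 classes per assignment, while a Less Classes build runs up to 96 of 420 (the usual mfaktc figures, worth verifying for your build), so the 0.5 s saved is spread over roughly 864 fewer class launches:

```shell
# Back-of-envelope per-class overhead from ATH's 9.3 s vs 8.8 s timings.
# Class counts (960 of 4620 vs 96 of 420) are the usual mfaktc figures.
awk 'BEGIN {
    saved = 9.3 - 8.8      # seconds saved per assignment
    extra = 960 - 96       # extra class launches in the stock build
    printf "~%.2f ms overhead per class\n", 1000 * saved / extra
}'
```

At roughly half a millisecond per class, the overhead only matters when the per-class work itself is of the same order, i.e. assignments of a few seconds, which matches the cutoff observed here.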


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.