mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2019-09-12, 14:40   #3202
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

152C16 Posts
Default

Quote:
Originally Posted by Prime95 View Post
The first step should be to look at the code and convince oneself that 2047 ought to work. No one has done that. So.....

I took a look at the code. I see no reason why a setting of 2047 would not work. In fact, changing gpu_sieve_size to an unsigned int might allow for values up to 4095. Changing to unsigned long long could allow much higher values. There would also be some typecasts required to avoid compiler warnings.

The real limit is imposed by CUDA on this code line:

Code:
    SegSieve<<<(sieve_size + block_size - 1) / block_size, threadsPerBlock>>>((uint8 *)mystuff->d_bitarray, (uint8 *)mystuff->d_sieve_info, primes_per_thread);
What is CUDA's limit on the first parameter ((sieve_size + block_size - 1) / block_size)?
That's good news.
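The signed-int reading can be sanity-checked with a little arithmetic. This is a sketch assuming the sieve size in bits is held in a signed 32-bit int (an assumption about the code, not confirmed here):

```shell
#!/bin/sh
# Assumption: mfaktc holds the sieve size in bits in a signed 32-bit int.
# GPUSieveSize is given in Mbits (2^20 bits), so the largest whole-Mbit
# value that stays at or below INT_MAX = 2147483647 is 2047.
INT_MAX=2147483647
bits_2047=$(( 2047 * 1048576 ))   # 2146435072 bits, fits
bits_2048=$(( 2048 * 1048576 ))   # 2147483648 bits, one past INT_MAX
[ "$bits_2047" -le "$INT_MAX" ] && echo "2047 Mbits fits in a signed int"
[ "$bits_2048" -gt "$INT_MAX" ] && echo "2048 Mbits would overflow"
```

The same arithmetic against an unsigned 32-bit ceiling of 4294967295 gives 4095 Mbits as the largest whole-Mbit value, which matches the unsigned-int suggestion in the quoted post.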

The ini file says GPUSieveSize is in units of megaBITS. If the relevant parameter is maximum pitch, that is ~2047 MBYTES. There's a daunting variety of CUDA parameters. Following is GPU-Z 2.25.0's Advanced > CUDA output for an older GPU:
Code:
General
CUDA Device Name    Tesla C2075
Compute Capability    2.0
Processor Count    14
Cores per Processor    32
GPU Clock Rate    1147.0 MHz
Memory Clock Rate    1566.0 MHz
Memory Bus Width    384
L2 Cache Size    768 KB
Global Memory Size    5316 MB
Async Engines    2
SP to DP Ratio    1:2
ECC Supported    Yes
Using TCC Driver    Yes
Compute Mode    Default
Multi-GPU Board    No (0)
PCI ID    Bus 6, Dev 0, Domain 0
Threads per Multiprocessor    1536
Max Shmem per Multiprocessor    48 KB
Execute Multiple Kernels    Yes
Preemption Supported    No

Memory
Native Atomic Supported    No
Unified Address Space    Yes
Integrated w/ Host Memory    No
Can map Host Memory    Yes
Can allocate Managed Memory    No
Pageable Memory Access    No
Concurrent Managed Memory Access    No
Can use Host Memory Pointers    No
Supports Stream Priorities    No
Can Cache Globals in L1    Yes
Can Cache Locals in L1    Yes
Max Block Size    1024 x 1024 x 64
Max # of Threads per Block    1024
Max Shmem per Block    48 KB
Max Grid Size    65535 x 65535 x 65535
Max Registers per Block    32768
Total Constant Memory    64 KB
Warp Size    32 Threads
Maximum Pitch    2097151 KB
Texture Alignment    0 KB
Surface Alignment    512
Texture Pitch Alignment    32
GPU Overlap    Yes
Kernel Runtime Limit    No

Size Constraints
1D Texture Size    65536
1D Layered Texture Size    16384 x 2048
2D Texture Size    65536 x 65535
2D Layered Texture Size    16384 x 16384 x 2048
2D Texture Size Gather    16384 x 16384
3D Texture Size    2048 x 2048 x 2048
3D Texture Size Alt    0 x 0 x 0
Cubemap Texture Size    16384 x 16384
Layered Cubemap Texture Size    32768 x 32768 x 2046
1D Surface Size    65536
1D Layered Surface Size    65536 x 2048
2D Surface Size    65536 x 32768
2D Layered Surface Size    65536 x 65536 x 2048
3D Surface Size    65536 x 32768 x 2048
Cubemap Surface Size    32768 x 32768
Cubemap Layered Surface Size    32768 x 32768 x 2046
1D Linear Texture Size    134217728
2D Linear Texture Size    65000 x 65000
2D Linear Texture Pitch    1048544
1D Mipmapped Texture Size    16384
2D Mipmapped Texture Size    16384 x 16384
https://en.wikipedia.org/wiki/CUDA#V...specifications
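If maximum pitch really is the binding limit, a quick units check (using the 2097151 KB Maximum Pitch figure from the GPU-Z dump above) shows how much headroom a megabit-denominated sieve setting has:

```shell
#!/bin/sh
# GPUSieveSize is in Mbits, but Maximum Pitch is reported in KB.
# Even the largest proposed setting, 2047 Mbits, occupies only ~256 MB
# as a byte array, far under the ~2047 MB pitch limit reported by GPU-Z.
max_pitch_bytes=$(( 2097151 * 1024 ))    # 2147482624 bytes, ~2047 MB
sieve_bytes=$(( 2047 * 1048576 / 8 ))    # 268304384 bytes,  ~256 MB
echo "maximum pitch:   $max_pitch_bytes bytes"
echo "2047 Mbit sieve: $sieve_bytes bytes"
[ "$sieve_bytes" -lt "$max_pitch_bytes" ] && echo "sieve fits with room to spare"
```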
Old 2019-09-26, 22:27   #3203
ixfd64
Bemusing Prompter
 
 
"Danny"
Dec 2002
California

5·479 Posts
Default

I thought I'd share some scripts that users might find useful:

1. For those with multi-GPU systems, start-mfaktc.sh allows multiple mfaktc instances to be launched from a single copy of mfaktc. You'll first need to create a folder called device-<device number> for each device. For example, on a system with four GPUs you'll need to create the sub-folders device-0 through device-3 in your mfaktc folder. You can then run ./start-mfaktc.sh <device number> to launch an instance. If no device number is specified, the script will simply call mfaktc from its root folder.

This script works by creating symbolic links to your mfaktc executable and configuration file. It also uses a lock file to prevent the user from accidentally running more than one instance on the same device. Once the mfaktc instance terminates, the symbolic links and lock file are deleted. You can set the SHARED_SETTINGS variable to false if you want a separate mfaktc.ini file for each GPU. This is useful if you want each device to use specific settings.
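The symlink-plus-lockfile flow described above can be sketched roughly as follows (illustrative names and layout only; the attached script is the authoritative version):

```shell
#!/bin/sh
# Hypothetical sketch of the start-mfaktc.sh approach; the real attached
# script may differ. Expects ./mfaktc and ./mfaktc.ini in the current
# directory and a pre-created device-<N> folder per GPU.
start_mfaktc() {
    dev="$1"
    if [ -z "$dev" ]; then
        ./mfaktc                    # no device given: run from the root folder
        return
    fi
    dir="device-$dev"
    lock="$dir/lock"
    if [ -e "$lock" ]; then         # refuse a second instance on this device
        echo "device $dev already has a running instance" >&2
        return 1
    fi
    touch "$lock"
    ln -sf ../mfaktc "$dir/mfaktc"
    ln -sf ../mfaktc.ini "$dir/mfaktc.ini"   # shared settings; copy a
                                             # per-device ini here instead
                                             # for separate tuning
    ( cd "$dir" && ./mfaktc -d "$dev" )      # mfaktc's -d selects the GPU
    rm -f "$lock" "$dir/mfaktc" "$dir/mfaktc.ini"
}
# Example: start_mfaktc 0
```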

I'll try to port this script to Windows if I ever get access to a Windows system with multiple GPUs.

2. archive-mfaktc.sh backs up your mfaktc results and is intended for people who use remote systems. You can then use SCP to copy the .zip file to a local computer, etc.
Attached Files
File Type: zip mfaktc scripts.zip (802 Bytes, 134 views)

Last fiddled with by ixfd64 on 2019-09-26 at 22:40
Old 2019-09-29, 06:19   #3204
petrw1
1976 Toyota Corona years forever!
 
 
"Wayne"
Nov 2006
Saskatchewan, Canada

469210 Posts
Default CPU slows GPU ... should it?

When I run my GTX 2080 Ti all alone doing mfaktc TF, it runs at 4,300 GHz-days/day.

However, if I start up all 8 cores of the corresponding i5-7820x doing RAM-intensive P-1, the GPU drops to under 4,000. In other words, it drops by more than the entire GHz-days throughput of the CPU.

Is this normal, or is there a GPU setting I simply need to change?

Thanks

Relevant GPU config params:

Code:
SieveOnGPU=1
GPUSievePrimes=82486
GPUSieveSize=128
GPUSieveProcessSize=16
Old 2019-09-29, 12:15   #3205
James Heinrich
 
 
"James Heinrich"
May 2004
ex-Northern Ontario

11×311 Posts
Default

Quote:
Originally Posted by petrw1 View Post
Is this normal, or is there a GPU setting I simply need to change?
I see the same thing (with much smaller numbers) with my RX 480 if there's something taking up all cores. I normally let P-1 use 5 of 6 cores, leaving one free for mfakto and general system responsiveness. Even though mfaktx takes up almost no CPU resources, if it doesn't get CPU time immediately when it wants it, throughput seems to suffer.
Old 2019-09-29, 13:56   #3206
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

152C16 Posts
Default

Quote:
Originally Posted by petrw1 View Post
When I run my GTX 2080 Ti all alone doing mfaktc TF, it runs at 4,300 GHz-days/day.

However, if I start up all 8 cores of the corresponding i5-7820x doing RAM-intensive P-1, the GPU drops to under 4,000. In other words, it drops by more than the entire GHz-days throughput of the CPU.

Is this normal, or is there a GPU setting I simply need to change?

Thanks

Relevant GPU config params:

Code:
SieveOnGPU=1
GPUSievePrimes=82486
GPUSieveSize=128
GPUSieveProcessSize=16
Do you see this effect only with P-1, not with PRP or LL?
Looks like the only parameter you've changed from the initial mfaktc.ini defaults is GPUSieveSize, from 64 to 128. On a GTX 1080 Ti I found these improved throughput:
GPUSievePrimes=90000
GPUSieveSize=128
GPUSieveProcessSize=32
Running 2 instances simultaneously on the GPU added a couple percent by itself; 3 instances was better than 1, but gave less overall throughput than 2. All of that was with prime95 saturating all CPU cores. The effect of multiple instances is stronger the faster the GPU is, but it increased total throughput on a GTX 1050 Ti as well. (Presumably a single instance waits for the result of a class to be written to display and disk; when two instances run, provided they are not synchronized, one can keep the GPU busy while the other waits.)

Others with RTX20xx have found gains of several percent with recompiled versions of mfaktc using gpusievesize up to 2047.
Old 2019-09-29, 19:33   #3207
petrw1
1976 Toyota Corona years forever!
 
 
"Wayne"
Nov 2006
Saskatchewan, Canada

22×3×17×23 Posts
Default

I've only tried it with P-1 on the CPU.
I don't have the tools (maybe the mental tools) to recompile mfaktc.

Larger GPUSievePrimes or GPUSieveProcessSize seems to have negligible impact for me.

Thanks

Quote:
Originally Posted by kriesel View Post
Do you see this effect only with P-1, not with PRP or LL?
Looks like the only parameter you've changed from the initial mfaktc.ini defaults is GPUSieveSize, from 64 to 128. On a GTX 1080 Ti I found these improved throughput:
GPUSievePrimes=90000
GPUSieveSize=128
GPUSieveProcessSize=32
Running 2 instances simultaneously on the GPU added a couple percent by itself; 3 instances was better than 1, but gave less overall throughput than 2. All of that was with prime95 saturating all CPU cores. The effect of multiple instances is stronger the faster the GPU is, but it increased total throughput on a GTX 1050 Ti as well. (Presumably a single instance waits for the result of a class to be written to display and disk; when two instances run, provided they are not synchronized, one can keep the GPU busy while the other waits.)

Others with RTX20xx have found gains of several percent with recompiled versions of mfaktc using gpusievesize up to 2047.
Old 2019-09-29, 20:54   #3208
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

124548 Posts
Default

Quote:
Originally Posted by petrw1 View Post
I don't have the tools (maybe the mental tools) to recompile mfaktc.
If you're running Windows, see https://www.mersenneforum.org/showpo...&postcount=116
or for linux https://www.mersenneforum.org/showpo...postcount=3189

Last fiddled with by kriesel on 2019-09-29 at 20:54
Old 2019-09-29, 22:17   #3209
Chuck
 
 
May 2011
Orange Park, FL

3×5×59 Posts
Default

Quote:
Originally Posted by kriesel View Post
I've been running the above Windows version on a 1080-Ti since February 2019 and it gave me a nice boost when I switched.
Old 2019-09-29, 23:38   #3210
petrw1
1976 Toyota Corona years forever!
 
 
"Wayne"
Nov 2006
Saskatchewan, Canada

22×3×17×23 Posts
Default

Wow, too bad I didn't see this months ago.
Throughput changed from 3,900 to 4,500 on my 2080 Ti doing TF to 74 in the 4xM ranges, a 15% improvement.

When would I use the Less Classes version?
Old 2019-09-29, 23:48   #3211
James Heinrich
 
 
"James Heinrich"
May 2004
ex-Northern Ontario

342110 Posts
Default

Quote:
Originally Posted by petrw1 View Post
When would I use the Less Classes version?
When assignment runtime is ridiculously short, like a few seconds per exponent (e.g. TF>1G).
Old 2019-09-30, 01:09   #3212
ATH
Einyen
 
 
Dec 2003
Denmark

315810 Posts
Default

On my 2080, assignments that take 9.3 seconds with normal mfaktc take 8.8 seconds with Less Classes, so the cutoff is somewhere near that.

Last fiddled with by ATH on 2019-09-30 at 01:10

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.