mersenneforum.org Tips for building mfaktc 0.21 on Ubuntu 20.04

 2021-05-13, 20:22 #1
drkirkby
"David Kirkby" Jan 2021 Althorne, Essex, UK
2²·107 Posts

Tips for building mfaktc 0.21 on Ubuntu 20.04

I'm not sure where best to post this. It is mainly a software issue, but while there is a GPU-specific forum for hardware, there is none for software. However, there is a bit of hardware-specific material, so I guess here is okay.

I tried to build mfaktc 0.21 (https://www.mersenneforum.org/mfaktc...tc-0.21.tar.gz) on a Dell 7920 tower workstation with an Nvidia Quadro P2200 graphics card running Ubuntu 20.04 Linux. It would not build, essentially because Nvidia has dropped support for early compute capabilities, which means mfaktc will not build with the latest Nvidia CUDA development tools without some changes. These tips might help.

1) When one downloads the development kit from Nvidia, it puts most/all files in /usr/local/cuda-11.3. The Makefile expects them to be in /usr/local/cuda, but that's obviously easy to change.

2) The Makefile has these lines:

Code:
NVCCFLAGS += --generate-code arch=compute_11,code=sm_11 # CC 1.1, 1.2 and 1.3 GPUs will use this code (1.0 is not possible for mfaktc)
NVCCFLAGS += --generate-code arch=compute_20,code=sm_20 # CC 2.x GPUs will use this code, one code fits all!
NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code
NVCCFLAGS += --generate-code arch=compute_35,code=sm_35 # but CC 3.5 (3.2?) _CAN_ use funnel shift which is useful for mfaktc
NVCCFLAGS += --generate-code arch=compute_50,code=sm_50 # CC 5.x GPUs will use this code

It appears the author was adding support for as many cards as possible, which makes sense. However, the executable fails to build, as nvcc (the NVIDIA CUDA compiler driver) will not accept the old options the Makefile gives it.
Code:
gcc -Wall -Wextra -O2 -I/usr/local/cuda-11.3/include/ -malign-double -c output.c -o output.o
nvcc -I/usr/local/cuda-11.3/include/ --ptxas-options=-v --generate-code arch=compute_11,code=sm_11 --generate-code arch=compute_20,code=sm_20 --generate-code arch=compute_30,code=sm_30 --generate-code arch=compute_35,code=sm_35 --generate-code arch=compute_50,code=sm_50 --compiler-options=-Wall -c tf_72bit.cu -o tf_72bit.o
nvcc fatal : Unsupported gpu architecture 'compute_11'
make: *** [Makefile:56: tf_72bit.o] Error 1

As far as I can ascertain, compute_11 (compute capability 1.1) is just too old for the current development system, so I commented out that line. The build failed again, this time because compute_20 is also too old, so I commented that out as well. Then it failed because of compute_30, so I commented that line out too. Finally it built.

There are two weird things about the build:
a) The executable has the .exe extension, which is most unusual on a Linux system.
b) The executable is put in the directory above the location of the source code, which I have never seen before. There was an exe there already, so the build overwrites it.

However, although the executable runs, it would not work with my card. I think both CC 3.5 and CC 5.0 are too old for my Nvidia P2200. I get the following error message when its self-test runs.
Code:
drkirkby@jackdaw:~/mfaktc-0.21$ ./mfaktc.exe
mfaktc v0.21 (64bit built)

Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  SievePrimesMin            5000
  SievePrimesMax            100000
  NumStreams                3
  CPUStreams                3
  GridSize                  3
  GPU Sieving               enabled
  GPUSievePrimes            82486
  GPUSieveSize              64Mi bits
  GPUSieveProcessSize       16Ki bits
  Checkpoints               enabled
  CheckpointDelay           30s
  WorkFileAddDelay          600s
  Stages                    enabled
  StopAfterFactor           bitlevel
  PrintMode                 full
  V5UserID                  (none)
  ComputerID                (none)
  AllowSleep                no
  TimeStampInResults        no

CUDA version info
  binary compiled for CUDA  11.30
  CUDA runtime version      11.30
  CUDA driver version       11.30

CUDA device info
  name                      NVIDIA Quadro P2200
  compute capability        6.1
  max threads per block     1024
  max shared memory per MP  98304 byte
  number of multiprocessors 10
  clock rate (CUDA cores)   1493MHz
  memory clock rate:        5005MHz
  memory bus width:         160 bit

Automatic parameters
  threads per grid          655360
  GPUSievePrimes (adjusted) 82486
  GPUsieve minimum exponent 1055144

running a simple selftest...
ERROR: cudaGetLastError() returned 209: no kernel image is available for execution on the device

Notice the line "compute capability 6.1". At that point, I took a guess that adding this line to the Makefile would add the 6.1 capability my card possibly needs:

Code:
NVCCFLAGS += --generate-code arch=compute_61,code=sm_61

Finally that built and works with my card, but I do see these warnings:

Code:
The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
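Incidentally, rather than discovering the supported architectures by commenting lines out until the build succeeds, I believe newer nvcc releases can list them directly (these flags exist in the CUDA 11.x toolkits; check `nvcc --help` on your install):

```shell
# Ask this nvcc which architectures it can still target. On a CUDA 11.3
# install, compute_11/20/30 should be absent, which explains the failures.
nvcc --list-gpu-arch   # virtual architectures (compute_XX)
nvcc --list-gpu-code   # real architectures (sm_XX)
```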
Here's a diff showing the differences between the original Makefile (which I renamed to Makefile.bak) and the new one, which I called Makefile:

Code:
drkirkby@jackdaw:~/mfaktc-0.21/src$ diff Makefile Makefile.bak
2c2
< CUDA_DIR = /usr/local/cuda-11.3
---
> CUDA_DIR = /usr/local/cuda
16,18c16,18
< #NVCCFLAGS += --generate-code arch=compute_11,code=sm_11 # CC 1.1, 1.2 and 1.3 GPUs will use this code (1.0 is not possible for mfaktc)
< #NVCCFLAGS += --generate-code arch=compute_20,code=sm_20 # CC 2.x GPUs will use this code, one code fits all!
< #NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code
---
> NVCCFLAGS += --generate-code arch=compute_11,code=sm_11 # CC 1.1, 1.2 and 1.3 GPUs will use this code (1.0 is not possible for mfaktc)
> NVCCFLAGS += --generate-code arch=compute_20,code=sm_20 # CC 2.x GPUs will use this code, one code fits all!
> NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code
21d20
< NVCCFLAGS += --generate-code arch=compute_61,code=sm_61 # Needed for my Nvidia P2200 with compute capability 6.1

I ran a quick check,

Code:
./mfaktc.exe -tf 754454689 72 73

as I knew there was a factor of 2^754454689-1 between 2^72 and 2^73 (https://www.mersenne.org/report_expo...exp_hi=&full=1). mfaktc found the factor 7136025663302317823497 okay. The reported speed is around 191 GHz-days/day.

Code:
Date   Time  | class Pct  | time  ETA   | GHz-d/day  Sieve  Wait
May 13 21:04 |     0 0.1% | 0.595 9m31s |    191.77  82485  n.a.%
May 13 21:04 |    11 0.2% | 0.580 9m16s |    196.73  82485  n.a.%
May 13 21:04 |    15 0.3% | 0.593 9m28s |    192.42  82485  n.a.%
May 13 21:04 |    20 0.4% | 0.588 9m22s |    194.05  82485  n.a.%

I've not used my Xeons much for trial factoring, and have done no trial-factoring benchmarks, but for PRP tests an exponent around 103 million (425 GHz-days) takes about 44 hours (1.83 days), so I'm guessing that means one Intel Xeon 8167M is doing 425/1.83 ≈ 232 GHz-days/day.
So whilst the Xeons were about 6/5 times faster than the GPU on PRP tests, on trial factoring the GPU gives similar performance to the Xeons. (I've not benchmarked the Xeons properly on trial factoring; I'm just assuming they do as well there as on PRP tests.) There are a lot of assumptions in that, but my feeling is the GPU is not totally useless when it comes to trial factoring. Based on my interests, I will not be shelling out for a new GPU.

Dave

Last fiddled with by drkirkby on 2021-05-13 at 20:22
 2021-05-13, 20:53 #2
drkirkby
"David Kirkby" Jan 2021 Althorne, Essex, UK
2²·107 Posts

Oops, I just realised that gpuowl was running on the graphics card at the same time as mfaktc! Now the speed of mfaktc looks much more impressive next to the Intel Xeon Platinum 8167M. In fact, I have a PRP test of 103777013 to do, but I'm going to trial factor it to 77 bits on the GPU first, as that will only take a few hours. If it finds a factor, I will not bother with the PRP test. https://www.mersenne.org/report_expo...3777013&full=1

Code:
Date   Time   Pct  ETA   | Exponent  Bits  | GHz-d/day  Sieve  Wait
May 13 21:45  0.2  8h22m | 103777013 76-77 |    421.78  82485  n.a.%
May 13 21:46  0.3  8h24m | 103777013 76-77 |    419.59  82485  n.a.%

Last fiddled with by drkirkby on 2021-05-13 at 20:56
 2021-05-14, 04:54 #3
kriesel
"TF79LL86GIMPS96gpu17" Mar 2017 US midwest
2⁴×3×113 Posts

The Quadro P2200's TF GHzD/day performance, while much higher than its PRP, LL or P-1 performance, is modest. An RTX 2080 (by no means the fastest):

Code:
got assignment: exp=114021059 bit_min=75 bit_max=76 (67.11 GHz-days)
Starting trial factoring M114021059 from 2^75 to 2^76 (67.11 GHz-days)
k_min = 165666466326900
k_max = 331332932655512
Using GPU kernel "barrett76_mul32_gs"
Date   Time  | class Pct  | time  ETA    | GHz-d/day  Sieve  Wait
May 13 23:22 |     0 0.1% | 1.921 30m42s |   3144.20 106037  n.a.%

An alternative to the iterative-compile approach to finding which compute levels are supported is to read the release notes. Or https://www.mersenneforum.org/showpo...1&postcount=11

Next step is to read about, and try, tuning mfaktc for your card and workload.

Last fiddled with by kriesel on 2021-05-14 at 04:56
2021-05-14, 10:28   #4
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

1AC₁₆ Posts

Quote:
 Originally Posted by kriesel
The Quadro P2200's TF GHzD/day performance, while much higher than its PRP, LL or P-1 performance, is modest. An RTX 2080 (by no means the fastest):

Code:
got assignment: exp=114021059 bit_min=75 bit_max=76 (67.11 GHz-days)
Starting trial factoring M114021059 from 2^75 to 2^76 (67.11 GHz-days)
k_min = 165666466326900
k_max = 331332932655512
Using GPU kernel "barrett76_mul32_gs"
Date   Time  | class Pct  | time  ETA    | GHz-d/day  Sieve  Wait
May 13 23:22 |     0 0.1% | 1.921 30m42s |   3144.20 106037  n.a.%

An alternative to the iterative-compile approach to finding which compute levels are supported is to read the release notes. Or https://www.mersenneforum.org/showpo...1&postcount=11 Next step is to read about, and try, tuning mfaktc for your card and workload.
The iterative approach to compute levels was pretty quick, even though that was the first CUDA application I had ever built. I don't think reading the documentation would have been quicker.

Quadro cards are generally over-priced compared to the more mainstream cards, but the over-priced software I used (£21,000 per year for use of just 4 cores) was optimised for Quadro cards. I had limited time to use that software, so I just went with the supported operating system (CentOS) and graphics card (Quadro). FWIW, I did let that trial factoring run complete.
Code:
May 14 06:14   99.7  1m36s | 103777013  76-77 |    414.84    82485    n.a.%
May 14 06:15   99.8  1m04s | 103777013  76-77 |    414.29    82485    n.a.%
May 14 06:15   99.9  0m32s | 103777013  76-77 |    414.84    82485    n.a.%
May 14 06:16  100.0  0m00s | 103777013  76-77 |    415.10    82485    n.a.%
no factor for M103777013 from 2^76 to 2^77 [mfaktc 0.21 barrett87_mul32_gs]
tf(): time spent since restart:    8h 30m 42.972s
estimated total time spent:  8h 31m 14.925s
Is there any point in me reading the documentation to find the other compute levels supported, then building a revised source and Linux binary for others to use? If so, where should it be put? Six-year-old source code that will not build with the latest development tools, along with a six-year-old binary that will not work with recent cards, is perhaps worth updating.
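For anyone rebuilding for their own card, the compute capability can also be read programmatically via the standard CUDA runtime call cudaGetDeviceProperties, rather than by running mfaktc's self-test to failure. A minimal sketch (the file name is arbitrary; compile with nvcc, and it needs the toolkit plus a working driver):

```c
/* query_cc.cu: print each CUDA device's name and compute capability,
   e.g. "NVIDIA Quadro P2200: compute capability 6.1".
   Build: nvcc query_cc.cu -o query_cc */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        printf("no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < n; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("%s: compute capability %d.%d\n",
               prop.name, prop.major, prop.minor);
    }
    return 0;
}
```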

 2021-05-14, 12:42 #5
drkirkby
"David Kirkby" Jan 2021 Althorne, Essex, UK
2²×107 Posts

Given I have a PRP test reserved on 103777013, but no trial-factoring assignment ID, can the results be usefully uploaded?

Code:
no factor for M103777013 from 2^76 to 2^77 [mfaktc 0.21 barrett87_mul32_gs]

Obviously, in the highly unlikely event 103777013 turns out to be prime, that fact would be irrelevant. I can see that this would be very open to abuse, with people uploading false results, but with open-source code one can't really prevent that anyway. I don't have an ego, so I've no need to collect CPU-days, although I would like to find the trick to getting allocated category 0 assignments. I've had one of them. I can get category 1 assignments easily enough, and am completing more than one per day, but I can't seem to find a way to get category 0 assignments.

Last fiddled with by drkirkby on 2021-05-14 at 12:42
 2021-05-14, 13:23 #6
slandrum
Jan 2021 California
99₁₆ Posts

Cat 0 assignments are always fully assigned. They only become available when an assignment within the Cat 0 range expires, or TF/P-1 completes on an assignment in the Cat 0 range, and they will only remain unassigned for a few minutes unless a large number expire at once. You shouldn't worry about getting Cat 0 assignments; the category only exists to make sure that assignments at the trailing edge eventually get cleared.

Last fiddled with by slandrum on 2021-05-14 at 13:26
2021-05-14, 16:00   #7
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1530₁₆ Posts

Quote:
 Originally Posted by drkirkby Is there any point in me reading the documentation to find the other compute levels supported, then building a revised source and linux binary for others to use?
If you were inclined to do such builds, do it with the 2047 Mbit sieve-size version. Or figure out how to change the program to use unsigned int, allowing up to 4095 Mbit sieve sizes, which some of the fastest cards would benefit from.
Quote:
 If so, where should it be put? A 6-year old source code that will not build with the latest software development system, along with a 6-year old binary that will not work with recent cards, is perhaps worth updating.
Put any resulting builds in the mfaktc original thread. I suggest .TGZ files, one per well-labelled build type for Linux, and .ZIP for modified source. The forum software supports up to 5 attachments per post. Such content will likely get added to the mersenne.ca mirror. I suggest Ubuntu 18.04 also get some attention.

Quote:
 Originally Posted by drkirkby Given I have a PRP test reserved on 103777013, but no trial-factor assignment ID, can the results be usefully uploaded?
Yes, submit the result. It will be useful to the additional-factor hunters someday, when 103M is considered a small exponent.
Quote:
 I would like to find the trick to getting allocated category 0 assignments. I've had one of them.
Lots of reliable throughput, and patience. There are ~5000 GIMPS participants and only 200 Cat 0 assignments at any time. Completing exponents before they become Cat 0 is a good thing. And it's not only 200 Cat 0 first tests; there are also 200 Cat 0 double checks (or PRP-with-proof runs serving as DC for LL first tests).

There's a very good chance of completing double checking up to Mp#48* this year.
There's also the strategic rechecking list that Uncwilly updates regularly, of exponents with conflicting results, which would benefit from some quick tie-breaker runs on those 26-core CPUs.

