mersenneforum.org  

Old 2021-05-13, 20:22   #1
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

2^2·107 Posts
Tips for building mfaktc 0.21 on Ubuntu 20.04

I'm not sure where best to post this. It is mainly a software issue, but there is no GPU-specific forum for software, only one for hardware. There is a bit of hardware-specific stuff, though, so I guess here is okay.

I tried to build mfaktc 0.21
https://www.mersenneforum.org/mfaktc...tc-0.21.tar.gz

on a Dell 7920 tower workstation with an Nvidia Quadro P2200 graphics card running Ubuntu 20.04 Linux. It would not build, essentially because Nvidia has dropped support for early compute capabilities. That means mfaktc will not build with the latest Nvidia CUDA development tools without some changes. These tips might help.

1) When one downloads the development kit from Nvidia, it puts most/all files in /usr/local/cuda-11.3. The Makefile expects them to be in /usr/local/cuda, but that's obviously easy to change.

2) The Makefile has these lines
Code:
NVCCFLAGS += --generate-code arch=compute_11,code=sm_11 # CC 1.1, 1.2 and 1.3 GPUs will use this code (1.0 is not possible for mfaktc)
NVCCFLAGS += --generate-code arch=compute_20,code=sm_20 # CC 2.x GPUs will use this code, one code fits all!
NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code 
NVCCFLAGS += --generate-code arch=compute_35,code=sm_35 # but CC 3.5 (3.2?) _CAN_ use funnel shift which is useful for mfaktc
NVCCFLAGS += --generate-code arch=compute_50,code=sm_50 # CC 5.x GPUs will use this code
It appears the author is adding support for as many cards as possible, which makes sense. However, the executable fails to build, as nvcc (the NVIDIA CUDA compiler driver) will not accept the old options that the Makefile gives it.
Code:
gcc -Wall -Wextra -O2 -I/usr/local/cuda-11.3/include/ -malign-double -c output.c -o output.o
nvcc -I/usr/local/cuda-11.3/include/ --ptxas-options=-v --generate-code arch=compute_11,code=sm_11  --generate-code arch=compute_20,code=sm_20  --generate-code arch=compute_30,code=sm_30  --generate-code arch=compute_35,code=sm_35  --generate-code arch=compute_50,code=sm_50  --compiler-options=-Wall -c tf_72bit.cu -o tf_72bit.o
nvcc fatal   : Unsupported gpu architecture 'compute_11'
make: *** [Makefile:56: tf_72bit.o] Error 1
As far as I can ascertain, compute_11 (compute capability 1.1) is just too old for the current development system. So I commented out that line. Again it fails, but this time because compute_20 is too old too. So I commented out that line. Then it fails because of compute_30, so I commented that line out. Finally it built. There are two weird things about the build.

a) The executable has the .exe extension - most unusual on a Linux system.
b) The executable is put in the directory above the location of the source code - I have never seen this before. There was an exe there before, so the build overwrites that.

However, although the executable will run, it would not work with my card. I think the two remaining targets, CC 3.5 and 5.0, are both too old for my Nvidia Quadro P2200 graphics card. I get the following error message when its self-test runs.
Code:
drkirkby@jackdaw:~/mfaktc-0.21$ ./mfaktc.exe
mfaktc v0.21 (64bit built)

Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  SievePrimesMin            5000
  SievePrimesMax            100000
  NumStreams                3
  CPUStreams                3
  GridSize                  3
  GPU Sieving               enabled
  GPUSievePrimes            82486
  GPUSieveSize              64Mi bits
  GPUSieveProcessSize       16Ki bits
  Checkpoints               enabled
  CheckpointDelay           30s
  WorkFileAddDelay          600s
  Stages                    enabled
  StopAfterFactor           bitlevel
  PrintMode                 full
  V5UserID                  (none)
  ComputerID                (none)
  AllowSleep                no
  TimeStampInResults        no

CUDA version info
  binary compiled for CUDA  11.30
  CUDA runtime version      11.30
  CUDA driver version       11.30

CUDA device info
  name                      NVIDIA Quadro P2200
  compute capability        6.1
  max threads per block     1024
  max shared memory per MP  98304 byte
  number of multiprocessors 10
  clock rate (CUDA cores)   1493MHz
  memory clock rate:        5005MHz
  memory bus width:         160 bit

Automatic parameters
  threads per grid          655360
  GPUSievePrimes (adjusted) 82486
  GPUsieve minimum exponent 1055144

running a simple selftest...
ERROR: cudaGetLastError() returned 209: no kernel image is available for execution on the device
Notice the line
compute capability 6.1
At that point, I took a guess that adding this line in the Makefile would add the 6.1 capability my card possibly needs.
Code:
NVCCFLAGS += --generate-code arch=compute_61,code=sm_61
Finally that built and works with my card, but I do see these warnings.

Code:
The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
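For anyone with a different card, the flag can be derived from the "compute capability" line that mfaktc prints. This is only a minimal sketch in Python, not part of mfaktc; the function name `gencode_flag` is made up for illustration, and it assumes the same `--generate-code arch=compute_XY,code=sm_XY` pattern the Makefile already uses.

```python
def gencode_flag(compute_capability: str) -> str:
    """Turn a compute capability such as "6.1" into an nvcc --generate-code flag.

    Hypothetical helper: it only formats a string in the style already used
    by the mfaktc Makefile and does not check that nvcc accepts the target.
    """
    major, minor = compute_capability.strip().split(".")
    cc = f"{int(major)}{int(minor)}"
    return f"--generate-code arch=compute_{cc},code=sm_{cc}"


if __name__ == "__main__":
    # "6.1" is the value reported for the Quadro P2200 in the output above.
    print("NVCCFLAGS += " + gencode_flag("6.1"))
```

Whether a given target is accepted still depends on the installed CUDA toolkit, so the result is a starting point to try rather than a guarantee.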
Here's a diff, showing the differences between the original Makefile which I renamed to Makefile.bak, and the new one, which I called Makefile

Code:
drkirkby@jackdaw:~/mfaktc-0.21/src$ diff Makefile Makefile.bak
2c2
< CUDA_DIR = /usr/local/cuda-11.3
---
> CUDA_DIR = /usr/local/cuda
16,18c16,18
< #NVCCFLAGS += --generate-code arch=compute_11,code=sm_11 # CC 1.1, 1.2 and 1.3 GPUs will use this code (1.0 is not possible for mfaktc)
< #NVCCFLAGS += --generate-code arch=compute_20,code=sm_20 # CC 2.x GPUs will use this code, one code fits all!
< #NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code 
---
> NVCCFLAGS += --generate-code arch=compute_11,code=sm_11 # CC 1.1, 1.2 and 1.3 GPUs will use this code (1.0 is not possible for mfaktc)
> NVCCFLAGS += --generate-code arch=compute_20,code=sm_20 # CC 2.x GPUs will use this code, one code fits all!
> NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code 
21d20
< NVCCFLAGS += --generate-code arch=compute_61,code=sm_61 # Needed for my Nvidia P2200 with compute capability 6.1
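The hand edits in the diff could also be scripted. The following is a rough sketch under stated assumptions, not something shipped with mfaktc: the function name `comment_out_old_archs` and the `min_cc` parameter are invented here, and the default of 35 simply reflects that compute_35 was the oldest target nvcc accepted in this build with CUDA 11.3.

```python
import re

# Matches active Makefile lines such as:
#   NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # comment
# Lines that are already commented out start with "#" and do not match.
ARCH_LINE = re.compile(r"^\s*NVCCFLAGS\s*\+=.*arch=compute_(\d+)")


def comment_out_old_archs(makefile_text: str, min_cc: int = 35) -> str:
    """Comment out --generate-code lines whose compute level is below min_cc.

    min_cc is an assumption: 35 is the oldest target accepted in this thread.
    """
    out = []
    for line in makefile_text.splitlines(keepends=True):
        match = ARCH_LINE.match(line)
        if match and int(match.group(1)) < min_cc:
            # Same style as the manual edit: a "#" directly before NVCCFLAGS.
            line = "#" + line.lstrip()
        out.append(line)
    return "".join(out)


if __name__ == "__main__":
    sample = (
        "NVCCFLAGS += --generate-code arch=compute_11,code=sm_11 # CC 1.x\n"
        "NVCCFLAGS += --generate-code arch=compute_35,code=sm_35 # CC 3.5\n"
    )
    print(comment_out_old_archs(sample), end="")
```

Running the function on text it has already processed leaves it unchanged, so it is safe to apply more than once.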
I ran a quick check,
Code:
./mfaktc.exe -tf 754454689 72 73
as I knew there was a factor of 2^754454689 - 1, with the factor being between 2^72 and 2^73.

https://www.mersenne.org/report_expo...exp_hi=&full=1

mfaktc found the factor 7136025663302317823497 okay. The reported speed is around 191 GHz-days/day.

Code:
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
May 13 21:04 |    0   0.1% |  0.595   9m31s |    191.77    82485    n.a.%
May 13 21:04 |   11   0.2% |  0.580   9m16s |    196.73    82485    n.a.%
May 13 21:04 |   15   0.3% |  0.593   9m28s |    192.42    82485    n.a.%
May 13 21:04 |   20   0.4% |  0.588   9m22s |    194.05    82485    n.a.%
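The factor quoted in this post can be checked independently of mfaktc. A number q divides 2^p - 1 exactly when 2^p leaves remainder 1 modulo q, and Python's built-in three-argument `pow` does that modular exponentiation directly. This is a small sketch for checking the result; the function name `divides_mersenne` is made up here.

```python
p = 754454689                    # exponent from the -tf command line
q = 7136025663302317823497       # factor reported by mfaktc


def divides_mersenne(q: int, p: int) -> bool:
    """Return True if q divides 2**p - 1, i.e. 2**p is congruent to 1 mod q."""
    return pow(2, p, q) == 1


if __name__ == "__main__":
    print("q divides 2^p - 1:     ", divides_mersenne(q, p))
    print("2^72 < q < 2^73:       ", 2**72 < q < 2**73)
    # Factors of a Mersenne number with odd prime exponent have the form 2*k*p + 1.
    print("q has the form 2kp + 1:", (q - 1) % (2 * p) == 0)
```

All three lines should print True if the factor and exponent are as quoted.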
I've not used my Xeons much for trial factoring, and have done no benchmarks for trial factoring, but for PRP tests, an exponent around 103 million (425 GHz-days) takes about 44 hours (1.83 days), so I'm guessing that means one Intel Xeon 8167M is doing 425/1.83 ≈ 232 GHz-days/day. So the Xeon's PRP throughput is about 6/5 of what the GPU manages on trial factoring, and if the Xeons trial factor as well as they run PRP tests, the GPU's performance is similar to theirs. (I've not benchmarked the Xeons properly with trial factoring; that is just an assumption.) There are a lot of assumptions there, but my thoughts are the GPU is not totally useless when it comes to trial factoring.
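The back-of-envelope comparison can be written out as a few lines of Python. The inputs are the figures from this post (425 GHz-days, about 44 hours, and the first mfaktc line at 191.77 GHz-d/day); the variable names are just for illustration.

```python
prp_work_ghz_days = 425.0     # credit for a PRP test near 103 million
prp_time_hours = 44.0         # approximate wall time on one Xeon 8167M
gpu_tf_rate = 191.77          # GHz-days/day from the first mfaktc line

# Throughput = work done divided by elapsed time in days.
xeon_prp_rate = prp_work_ghz_days / (prp_time_hours / 24.0)
ratio = xeon_prp_rate / gpu_tf_rate

print(f"Xeon PRP throughput: {xeon_prp_rate:.1f} GHz-days/day")
print(f"Xeon PRP / GPU TF:   {ratio:.2f}")
```

Using the unrounded 44 hours gives about 231.8 GHz-days/day, and the ratio of roughly 1.2 is where the "6/5" figure comes from. The comparison mixes PRP credit on the CPU with trial-factoring credit on the GPU, so it is only indicative.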

Based on my interests, I will not be shelling out the cost of a new GPU.

Dave

Last fiddled with by drkirkby on 2021-05-13 at 20:22
Old 2021-05-13, 20:53   #2
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

2^2·107 Posts

Oops, I just realised that gpuowl was running on the graphics card at the same time as mfaktc! Now the speed of mfaktc seems much more impressive than the Intel Xeon Platinum 8167M. In fact, I have a PRP test of 103777013 to do, but I'm going to do a trial factor to 77 bits on the GPU, as it will only take a few hours. If it finds a factor I will not bother with the PRP test.

https://www.mersenne.org/report_expo...3777013&full=1

Code:
Date   Time     Pct    ETA | Exponent    Bits | GHz-d/day    Sieve     Wait
May 13 21:45    0.2  8h22m | 103777013  76-77 |    421.78    82485    n.a.%
May 13 21:46    0.3  8h24m | 103777013  76-77 |    419.59    82485    n.a.%

Last fiddled with by drkirkby on 2021-05-13 at 20:56
Old 2021-05-14, 04:54   #3
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2^4×3×113 Posts

The Quadro P2200 TF GHz-days/day performance, while much higher than its PRP or LL or P-1 performance, is modest.

RTX2080 (by no means fastest) :
got assignment: exp=114021059 bit_min=75 bit_max=76 (67.11 GHz-days)
Starting trial factoring M114021059 from 2^75 to 2^76 (67.11 GHz-days)
k_min = 165666466326900
k_max = 331332932655512
Using GPU kernel "barrett76_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
May 13 23:22 |    0   0.1% |  1.921  30m42s |   3144.20   106037    n.a.%
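For readers wondering where k_min and k_max in this log come from: any factor q of a Mersenne number 2^p - 1 with odd prime p has the form q = 2kp + 1, so a bit range translates into a range of k. The sketch below is an illustration only and is not taken from mfaktc; the variable names are invented, and the small gap between these bounds and the logged values is presumably down to how mfaktc rounds the range, which is an assumption.

```python
p = 114021059                  # exponent from the log
bit_min, bit_max = 75, 76      # trial-factoring range from the log

# q = 2*k*p + 1 lies in [2^bit_min, 2^bit_max) roughly when
# k lies in [2^bit_min / (2p), 2^bit_max / (2p)).
k_lo = 2**bit_min // (2 * p)
k_hi = 2**bit_max // (2 * p)

logged_k_min = 165666466326900
logged_k_max = 331332932655512

print("k_lo =", k_lo, " logged k_min =", logged_k_min)
print("k_hi =", k_hi, " logged k_max =", logged_k_max)

# The ETA follows from the work estimate and the reported rate.
work_ghz_days = 67.11
rate_ghz_days_per_day = 3144.20
eta_minutes = work_ghz_days / rate_ghz_days_per_day * 24 * 60
print(f"estimated run time: {eta_minutes:.1f} minutes")
```

The estimated run time of a little under 31 minutes is in line with the 30m42s ETA in the log.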

An alternative to the iterative compile approach to finding what compute levels are supported is to read the release notes. Or https://www.mersenneforum.org/showpo...1&postcount=11

Next step is to read about and try tuning mfaktc for your card and workload.

Last fiddled with by kriesel on 2021-05-14 at 04:56
Old 2021-05-14, 10:28   #4
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

1AC_16 Posts

Quote:
Originally Posted by kriesel
The Quadro P2200 TF GHz-days/day performance, while much higher than its PRP or LL or P-1 performance, is modest.

RTX2080 (by no means fastest) :
got assignment: exp=114021059 bit_min=75 bit_max=76 (67.11 GHz-days)
Starting trial factoring M114021059 from 2^75 to 2^76 (67.11 GHz-days)
k_min = 165666466326900
k_max = 331332932655512
Using GPU kernel "barrett76_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
May 13 23:22 |    0   0.1% |  1.921  30m42s |   3144.20   106037    n.a.%

An alternative to the iterative compile approach to finding what compute levels are supported is to read the release notes. Or https://www.mersenneforum.org/showpo...1&postcount=11

Next step is to read about and try tuning mfaktc for your card and workload.
The iterative approach to compute levels was pretty quick, even though that was the first CUDA application I had ever built. I don't think reading the documentation would have been quicker.

Quadro cards are generally over-priced compared to the more mainstream cards, but the over-priced (£21,000 per year for use of just 4 cores) software I used was optimised for Quadro cards. I had a limited time to use that software, so I just went for a supported operating system (CentOS) and graphics card (Quadro). FWIW, I did let that trial factor complete:
Code:
May 14 06:14   99.7  1m36s | 103777013  76-77 |    414.84    82485    n.a.%
May 14 06:15   99.8  1m04s | 103777013  76-77 |    414.29    82485    n.a.%
May 14 06:15   99.9  0m32s | 103777013  76-77 |    414.84    82485    n.a.%
May 14 06:16  100.0  0m00s | 103777013  76-77 |    415.10    82485    n.a.%
no factor for M103777013 from 2^76 to 2^77 [mfaktc 0.21 barrett87_mul32_gs]
tf(): time spent since restart:    8h 30m 42.972s
      estimated total time spent:  8h 31m 14.925s
Is there any point in me reading the documentation to find the other compute levels supported, then building a revised source and Linux binary for others to use? If so, where should it be put? A 6-year-old source code that will not build with the latest software development system, along with a 6-year-old binary that will not work with recent cards, is perhaps worth updating.
Old 2021-05-14, 12:42   #5
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

2^2×107 Posts

Given I have a PRP test reserved on 103777013, but no trial-factor assignment ID, can the results be usefully uploaded?
Code:
no factor for M103777013 from 2^76 to 2^77 [mfaktc 0.21 barrett87_mul32_gs]
Obviously, in the highly unlikely event that M103777013 turns out to be prime, the no-factor result would be irrelevant. I can see that accepting results without an assignment is open to abuse, with people uploading false results, but with open-source code one can't really prevent that anyway.

I don't have an ego that needs me to collect CPU days, although I would like to find the trick to getting allocated category 0 assignments. I've had one of them. I can get category 1 assignments easily enough, and am completing more than one per day, but I can't seem to find the way to get category 0 assignments.

Last fiddled with by drkirkby on 2021-05-14 at 12:42
Old 2021-05-14, 13:23   #6
slandrum
 
Jan 2021
California

99_16 Posts

Cat 0 assignments are always fully assigned. They can only become available when an assignment within the cat 0 range expires or TF/PM1 completes on an assignment in cat 0 range, and will only remain unassigned for a few minutes unless a large number expire at once.

You shouldn't worry about getting cat 0 assignments. The category only exists to make sure that the assignments at the trailing edge will get cleared eventually.

Last fiddled with by slandrum on 2021-05-14 at 13:26
Old 2021-05-14, 16:00   #7
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1530_16 Posts

Quote:
Originally Posted by drkirkby
Is there any point in me reading the documentation to find the other compute levels supported, then building a revised source and linux binary for others to use?
If you were inclined to do such builds, do it with the 2047Mbit sieve size version. Or figure out how to change the program to unsigned int to allow up to 4095 Mbits sieve size, which some of the fastest cards would benefit from.
Quote:
If so, where should it be put? A 6-year old source code that will not build with the latest software development system, along with a 6-year old binary that will not work with recent cards, is perhaps worth updating.
Put any resulting builds in the Mfaktc original thread. TGZ files one per well labeled build type for Linux, .ZIP for modified source suggested. Forum software supports up to 5 attachments per post. Such content will likely get added to the mersenne.ca mirror. I suggest Ubuntu 18.04 get some attention.

Quote:
Originally Posted by drkirkby
Given I have a PRP test reserved on 103777013, but no trial-factor assignment ID, can the results be usefully uploaded?
Yes, submit the result. It will be useful to the additional-factor-hunters someday, when 103M is considered a small exponent.
Quote:
I would like to find the trick to getting allocated category 0 assignments. I've had one of them.
Lots of reliable throughput and patience. There are ~5000 GIMPS participants, and only 200 Cat 0 at any time. Completing exponents before they become Cat 0 is a good thing. There are not only 200 Cat 0 first tests, but also 200 cat 0 double checks or PRP with proof as DC for LL first tests.

There's a very good chance of completing double checking up to Mp#48* this year.
There's also the strategic rechecking list Uncwilly updates regularly, of exponents with conflicting results, which would benefit from some quick tie breaker runs from those 26-core cpus.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.