mersenneforum.org  

Old 2023-01-02, 20:14   #3598
DeleteNull
 
 
Mar 2017
Germany, Wolfsburg

22 Posts

The main difference is that the library file "libcudart.so.12" included in the zip is now loaded from the local directory, via the linker's rpath setting (LD = clang -Wl,-rpath,.). The Makefile for this is:


Code:
# where is the CUDA Toolkit installed?
CUDA_DIR = /usr/local/cuda
CUDA_INCLUDE = -I$(CUDA_DIR)/include/
CUDA_LIB = -L$(CUDA_DIR)/lib64/

# Compiler settings for .c files (CPU)
CC = clang -static-libgcc -static-libstdc++
CFLAGS = -Wall -Wextra -O2 $(CUDA_INCLUDE) -malign-double
CFLAGS_EXTRA_SIEVE = -funroll-all-loops
# Compiler settings for .cu files (CPU/GPU)
NVCC = nvcc
NVCCFLAGS = $(CUDA_INCLUDE) --ptxas-options=-v

# generate code for various compute capabilities
#NVCCFLAGS += --generate-code arch=compute_11,code=sm_11 # CC 1.1, 1.2 and 1.3 GPUs will use this code (1.0 is not possible for mfaktc)
#NVCCFLAGS += --generate-code arch=compute_20,code=sm_20 # CC 2.x GPUs will use this code, one code fits all!
#NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code
#NVCCFLAGS += --generate-code arch=compute_35,code=sm_35 # but CC 3.5 (3.2?) _CAN_ use funnel shift which is useful for mfaktc
#NVCCFLAGS += --generate-code arch=compute_50,code=sm_50 # CC 5.x GPUs will use this code
NVCCFLAGS += --generate-code arch=compute_61,code=sm_61 # CC 6.1 GPUs (Pascal) will use this code
NVCCFLAGS += --generate-code arch=compute_75,code=sm_75 # CC 7.5 GPUs (Turing) will use this code
NVCCFLAGS += --generate-code arch=compute_86,code=sm_86 # CC 8.6 GPUs (Ampere) will use this code
NVCCFLAGS += --generate-code arch=compute_89,code=sm_89 # CC 8.9 GPUs (Ada Lovelace) will use this code
NVCCFLAGS += --generate-code arch=compute_90,code=sm_90 # CC 9.0 GPUs (Hopper) will use this code
# pass some options to the C host compiler (e.g. gcc on Linux)
NVCCFLAGS += --compiler-options=-Wall

# Linker
LD = clang -Wl,-rpath,.
LDFLAGS = -fPIC $(CUDA_LIB) -lcudart -lm -lstdc++

##############################################################################

CSRC = sieve.c timer.c parse.c read_config.c mfaktc.c checkpoint.c \
signal_handler.c output.c
CUSRC = tf_72bit.cu tf_96bit.cu tf_barrett96.cu tf_barrett96_gs.cu gpusieve.cu

COBJS = $(CSRC:.c=.o)
CUOBJS = $(CUSRC:.cu=.o) tf_75bit.o

##############################################################################

all: ../mfaktc.exe

../mfaktc.exe : $(COBJS) $(CUOBJS)
	$(LD) $^ -o $@ $(LDFLAGS)

clean :
	rm -f *.o *~

sieve.o : sieve.c
	$(CC) $(CFLAGS) $(CFLAGS_EXTRA_SIEVE) -c $< -o $@

tf_75bit.o : tf_96bit.cu
	$(NVCC) $(NVCCFLAGS) -c $< -o $@ -DSHORTCUT_75BIT

%.o : %.cu
	$(NVCC) $(NVCCFLAGS) -c $< -o $@

%.o : %.c
	$(CC) $(CFLAGS) -c $< -o $@

##############################################################################

# dependencies generated by cpp -MM
checkpoint.o: checkpoint.c params.h

# manually add selftest-data-mersenne.c or selftest-data-wagstaff.c
mfaktc.o: mfaktc.c params.h my_types.h compatibility.h sieve.h \
read_config.h parse.h timer.h tf_72bit.h tf_96bit.h tf_barrett96.h \
checkpoint.h signal_handler.h output.h gpusieve.h \
selftest-data-mersenne.c selftest-data-wagstaff.c

output.o: output.c params.h my_types.h output.h compatibility.h

parse.o: parse.c compatibility.h parse.h params.h

read_config.o: read_config.c params.h my_types.h

sieve.o: sieve.c params.h compatibility.h

signal_handler.o: signal_handler.c params.h my_types.h compatibility.h

timer.o: timer.c timer.h compatibility.h

tf_72bit.o: tf_72bit.cu params.h my_types.h compatibility.h \
my_intrinsics.h sieve.h timer.h output.h tf_debug.h tf_common.cu

tf_96bit.o: tf_96bit.cu params.h my_types.h compatibility.h \
my_intrinsics.h sieve.h timer.h output.h tf_debug.h \
tf_96bit_base_math.cu tf_96bit_helper.cu gpusieve_helper.cu tf_common.cu \
tf_common_gs.cu gpusieve.h

tf_barrett96.o: tf_barrett96.cu params.h my_types.h compatibility.h \
my_intrinsics.h sieve.h timer.h output.h tf_debug.h \
tf_96bit_base_math.cu tf_96bit_helper.cu tf_barrett96_div.cu \
tf_barrett96_core.cu tf_common.cu

tf_barrett96_gs.o: tf_barrett96_gs.cu params.h my_types.h compatibility.h \
my_intrinsics.h sieve.h timer.h output.h tf_debug.h \
tf_96bit_base_math.cu tf_96bit_helper.cu tf_barrett96_div.cu \
tf_barrett96_core.cu gpusieve_helper.cu tf_common_gs.cu gpusieve.h

gpusieve.o: gpusieve.cu params.h my_types.h compatibility.h \
my_intrinsics.h gpusieve.h

# manually generated dependency

tf_75bit.o: tf_96bit.cu params.h my_types.h compatibility.h \
my_intrinsics.h sieve.h timer.h output.h tf_debug.h \
tf_96bit_base_math.cu tf_96bit_helper.cu gpusieve_helper.cu tf_common.cu \
tf_common_gs.cu gpusieve.h
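
To confirm the rpath took effect after building, something like the following should show "." in the search path and the bundled runtime resolving locally (a quick check, not part of the build):

Code:
readelf -d ../mfaktc.exe | grep -iE 'rpath|runpath'
ldd ../mfaktc.exe | grep cudart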

Last fiddled with by DeleteNull on 2023-01-02 at 20:15
Old 2023-01-03, 06:56   #3599
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/

2⁴×199 Posts

I did a little more experimenting with my 1070s and 332M exponents: it turns out increasing GPUSievePrimes to 150,000 was a benefit, as was increasing GPUSieveSize to 1024. I got about 10% more performance from those changes.

I'll have to try increasing GPUSievePrimes on the 3070s later.
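
For reference, the corresponding mfaktc.ini entries would be (values from the experiment above):

Code:
GPUSievePrimes=150000
GPUSieveSize=1024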
Old 2023-01-03, 11:34   #3600
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/

2⁴·199 Posts

Quote:
Originally Posted by Mark Rose
So I spent the last nine days going over this, learning CUDA, etc., and it turns out that using floats would take 32% more time on compute 3.x hardware. For those not familiar, compute 3.x hardware can do 6 floating point multiply-adds but only 1 integer multiply-add per cycle. I'm mainly posting this in case anyone else has thought of pursuing the idea of using floating point.

The current barrett76 algorithm does 20 integer FMAs per loop, costing 20 cycles at one per cycle. It also must spend 2 cycles doing addition and subtraction, so 22 cycles total.

The hypothetical floating point algorithm requires 2*7*7 FMAs for the basic multiplications. The high bits of each float are found by multiplying by 1/(2^11) and adding 1 x 2^23 as an FMA, rounding down to shift away the fraction, for another 2 * 14 FMAs. The 1 x 2^23 is then subtracted out for 2 * 14 subs. The high bits are then subtracted away for another 2 * 14 subs. Finally 7 subs are done to find the remainder.

That's a total of 53 FMAs for 9 cycles, 28 subs for 5 cycles, 53 FMAs for 9 cycles, then 35 subs for 6 cycles, for a total of 29 cycles. That's about 32% slower than the integer version, not taking into consideration register pressure, etc.

The new Maxwell chips, compute 5.x, keep a similar floating-point-to-integer instruction ratio to the 3.x chips, so there's no win there either.
I think it may be worth revisiting an FP32 implementation.
  • There is no advantage for Pascal (1000 series) and earlier
  • Turing (2000 series) can run equal FP32 and INT32 concurrently
  • Ampere (3000 series) and Lovelace (4000 series) can run equal FP32 and INT32 concurrently, or all FP32

With Turing, Ampere, and Lovelace, half the hardware goes unused.

For Turing, the trick would be writing an algorithm that can do some of the math efficiently in FP32.

For Ampere and Lovelace, an entirely FP32 algorithm may be simpler.

Turing also introduced Tensor Cores, which are good for INT8. There may be an advantage to using these as well.
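
To make the splitting trick from the quoted analysis concrete, here is a minimal CUDA sketch (my illustration, not mfaktc code) of peeling the high bits off a limb held exactly in a float, using an FMA against 2^23:

Code:
// Illustrative only: x is a non-negative integer held exactly in a float.
// Returns floor(x / 2^11) using one FMA and one subtraction.
__device__ float high_bits(float x)
{
    // x * 2^-11 + 2^23, rounded toward zero: near 2^23 the float
    // mantissa has no room for a fraction, so the low 11 bits are
    // discarded by the rounding itself.
    float hi = __fmaf_rz(x, 1.0f / 2048.0f, 8388608.0f); // 8388608 = 2^23
    return hi - 8388608.0f; // remove the offset
}
// The low part is then x - high_bits(x) * 2048.0f, matching the
// "high bits are then subtracted away" step in the quote.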
Old 2023-01-04, 00:13   #3601
Rodrigo
 
 
Jun 2010
Pennsylvania

947 Posts

Happy New Year, everyone.

Ref. post https://www.mersenneforum.org/showpo...postcount=2744:

Yesterday I upgraded that Kubuntu box from 20.04 LTS to 22.04 LTS, and now mfaktc isn't working anymore. (The previous transition, from 18.04 LTS to 20.04 LTS, didn't seem to have this nasty effect.) Here's the output of mfaktc.exe -d0:

Code:
Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  SievePrimesMin            5000
  SievePrimesMax            100000
  NumStreams                3
  CPUStreams                3
  GridSize                  3
  GPU Sieving               enabled
  GPUSievePrimes            82486
  GPUSieveSize              64Mi bits
  GPUSieveProcessSize       16Ki bits
  Checkpoints               enabled
  CheckpointDelay           30s
  WorkFileAddDelay          600s
  Stages                    enabled
  StopAfterFactor           bitlevel
  PrintMode                 full
  V5UserID                  (none)
  ComputerID                (none)
  AllowSleep                no
  TimeStampInResults        no

CUDA version info
  binary compiled for CUDA  8.0
  CUDA runtime version      32.31
  CUDA driver version       9.10
ERROR: CUDA runtime version must match the CUDA toolkit version used during compile!
Strangely, the reported CUDA runtime version goes up a little each time mfaktc is run (e.g., 32.18, then 32.31, followed by 32.57, and now 32.86). However, these aren't the astronomically high figures (>50000) I saw once before, years ago.

Some more information:
GPU: GeForce GTX 1050
NVIDIA Driver Version: 390.157

I'll be happy to provide additional details that might help to home in on a solution.
Old 2023-01-04, 00:25   #3602
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/

2⁴×199 Posts

apt-get install nvidia-driver-510 nvidia-cuda-dev nvidia-cuda-toolkit, reboot, then recompile mfaktc.
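
Spelled out, with the make step a sketch that assumes the stock mfaktc source tree:

Code:
sudo apt-get install nvidia-driver-510 nvidia-cuda-dev nvidia-cuda-toolkit
sudo reboot
# then, from the mfaktc src directory:
make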
Old 2023-01-04, 06:39   #3603
Luminescence
 
"Florian"
Oct 2021
Germany

2·103 Posts

I've compiled a Windows version with CUDA 12.0, CC 8.9, more classes, and a GPU sieve size limit of 2047.

Instead of cudart.lib I linked cudart_static.lib and it generated the executable and two additional files.

Can anyone check if it runs without any additional CUDA DLL file? Obviously I can't, since I have the full toolkit installed.
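
For anyone reproducing this, the change is only which runtime library is named at link time. A sketch (illustrative; variable name hypothetical, not the project's exact Makefile.win lines):

Code:
#LIBS = cudart.lib        # dynamic runtime: exe needs cudart64_12.dll next to it
LIBS = cudart_static.lib  # static runtime: baked into the executable
# (when linking with nvcc, --cudart=static is the equivalent switch)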
Attached Files
File Type: zip mfaktc-0.21-win64-cuda12.0-CC89-2047.zip (819.3 KB, 52 views)
Old 2023-01-04, 06:40   #3604
Rodrigo
 
 
Jun 2010
Pennsylvania

947 Posts

Quote:
Originally Posted by Mark Rose
apt-get install nvidia-driver-510 nvidia-cuda-dev nvidia-cuda-toolkit, reboot, then recompile mfaktc.
Thank you. I've never compiled a program before; I've only used the ready-made executables available for download. I'm not unwilling to give it a shot, but it's hard to avoid the sense that this would be way above my pay grade.

Is there a comprehensive guide here to compiling mfaktc for Kubuntu (Ubuntu)?
Old 2023-01-04, 12:55   #3605
rebirther
 
 
Sep 2011
Germany

3605₁₀ Posts

Quote:
Originally Posted by Luminescence
I've compiled a Windows version with CUDA 12.0, CC 8.9, more classes, and a GPU sieve size limit of 2047.

Instead of cudart.lib I linked cudart_static.lib and it generated the executable and two additional files.

Can anyone check if it runs without any additional CUDA DLL file? Obviously I can't, since I have the full toolkit installed.
Can you please also include CC 6.1 through CC 9.0, and a short guide on how you compiled it?
Old 2023-01-04, 14:20   #3606
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7824₁₀ Posts

Quote:
Originally Posted by rebirther
Can you please also include CC 6.1 through CC 9.0, and a short guide on how you compiled it?
+2. And PTX, all in one executable, plus documentation of setup and compilation. The whole project could get by with a few executables if they included support for the full feasible list of CC values for a given SDK version; SDK 12 and SDK 8 together could cover CC 2.0-9.0, GPUs released over a ~12-year span. Two executables per OS. A sketch of the flags is below.
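
A sketch of what those flags might look like with the CUDA 12 toolkit (one --generate-code per CC, plus a PTX fallback that the driver can JIT for even newer GPUs; CC 2.0-5.x would need the older toolkit):

Code:
NVCCFLAGS += --generate-code arch=compute_61,code=sm_61
NVCCFLAGS += --generate-code arch=compute_70,code=sm_70
NVCCFLAGS += --generate-code arch=compute_75,code=sm_75
NVCCFLAGS += --generate-code arch=compute_80,code=sm_80
NVCCFLAGS += --generate-code arch=compute_86,code=sm_86
NVCCFLAGS += --generate-code arch=compute_89,code=sm_89
NVCCFLAGS += --generate-code arch=compute_90,code=sm_90
# PTX fallback, JIT-compiled by the driver for CC > 9.0
NVCCFLAGS += --generate-code arch=compute_90,code=compute_90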

Last fiddled with by kriesel on 2023-01-04 at 15:04
Old 2023-01-04, 20:35   #3607
lalera
 
 
Jul 2003

1200₈ Posts

Hi,
to Mr. James Heinrich:
Why is the file from post #3572 not included in your downloads anymore?
It works and includes the source and the Makefile.
You need to have the CUDA Toolkit 12 installed; it is not made for BOINC.
Old 2023-01-04, 21:10   #3608
James Heinrich
 
 
"James Heinrich"
May 2004
ex-Northern Ontario

7·13·47 Posts

Quote:
Originally Posted by lalera
Why is the file from post #3572 not included in your downloads anymore?
It works and includes the source and the Makefile.
You need to have the CUDA Toolkit 12 installed; it is not made for BOINC.
I assumed the versions posted in #3593 and/or #3595 supersede that. What does #3572 offer that the later two don't?