mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

firejuggler 2014-04-14 16:46

reporting 750 ti
[code]
mfaktc v0.20 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 193154bits
SIEVE_SPLIT 250
MORE_CLASSES enabled

Runtime options
SievePrimes 25000
SievePrimesAdjust 1
SievePrimesMin 5000
SievePrimesMax 100000
NumStreams 3
CPUStreams 3
GridSize 3
GPU Sieving enabled
GPUSievePrimes 82485
GPUSieveSize 64Mi bits
GPUSieveProcessSize 16Ki bits
WorkFile worktodo.txt
Checkpoints enabled
CheckpointDelay 30s
Stages enabled
StopAfterFactor bitlevel
PrintMode compact
V5UserID (none)
ComputerID (none)
AllowSleep no
TimeStampInResults yes

CUDA version info
binary compiled for CUDA 4.20
CUDA runtime version 4.20
CUDA driver version 6.0

CUDA device info
name GeForce GTX 750 Ti
compute capability 5.0
maximum threads per block 1024
number of mutliprocessors 5 (unknown number of shader cores)
clock rate 1110MHz

Automatic parameters
threads per grid 655360

running a simple selftest...
ERROR: cudaGetLastError() returned 8: invalid device function
[/code]
edit: just spotted [url]http://www.mersenneforum.org/showthread.php?p=369671[/url], so I'm not alone

Flow 2014-04-17 21:19

A small bug might be present in mfaktc when finding factors.

I have made sure my ini file has StopAfterFactor=1, and it has always worked before. This time the work was resumed after a Ctrl+C break, and upon relaunching mfaktc it found a factor straight away yet decided to keep working until the end of the bit level.

picture for support...
[url]http://rapidshare.com/share/22572B26ECFF1C96608EED7AED3F28B8[/url]

James Heinrich 2014-04-17 22:32

[QUOTE=Flow;371465]stopafterfactor = 1 ... found a factor straight away yet decided to keep working until the end of the bit level.[/QUOTE]Sounds like it's working as intended:[code]# possible values for StopAfterFactor:
# 0: Do not stop the current assignment after a factor was found.
# 1: When a factor was found for the current assignment stop after the
# current bitlevel. This makes only sense when Stages is enabled.
# 2: When a factor was found for the current assignment stop after the
# current class.
#
# Default: StopAfterFactor=1[/code]If you want it to stop immediately on finding a factor then you should have [b]StopAfterFactor=[color=red]2[/color][/b]

Flow 2014-04-17 23:52

[QUOTE=James Heinrich;371469]If you want it to stop immediately on finding a factor then you should have [b]StopAfterFactor=[color=red]2[/color][/b][/QUOTE]

I could've sworn I was seeing factor finds stop the run right away. Anyway, that's what I get for not reading the documentation. Care to enlighten me as to the purpose of finishing the bit level? I can only assume another factor could be possible, but is it actually beneficial to the project to know more than one?

LaurV 2014-04-18 04:10

It is not necessary for [U]this[/U] project (i.e. GIMPS) to finish the bitlevel once a factor is found. The goal of GIMPS is to find primes, and once a Mersenne number is proved composite by finding at least one of its factors, the corresponding exponent is of no further interest to the project. So it is safe to stop after the class. However, some people argue that it is polite to finish the bitlevel, the reason being that other projects/people may have goals besides finding primes, like finding factors, whatever. Finding factors is cool. :D

If a factor is found, the information about how that factor was found, what program was used, how many classes were done, etc., is lost (not stored in the database). Because different programs split the bitlevel differently, an eventual "factor hunter" will not know which candidates were checked and which were not on the bitlevel where the factor was found, so he will have to redo that whole bitlevel, duplicating the effort of the initial searcher. Hence some argue that finishing the bitlevel is "polite" at least, but here opinions vary. Most people stop after the class, because if another guy starts factoring years from now, he will not know which bitlevels were done anyway; he will start from scratch, and at the speed his computer will have, he will need only a fraction of the time we spend now to factor everything from scratch. Who knows where computer speed will be in a few years...

Flow 2014-04-18 15:34

[QUOTE=LaurV;371503]It is not necessary for [U]this[/U] project (i.e. GIMPS) to finish the bitlevel, once a factor is found. [...] So, some argue that doing the bitlevel is "polite" at least, but here the opinions vary [...][/QUOTE]

Thank you. For the record, I understand the information is not only lost, but PrimeNet does not even grant credit for finishing the bit level, even though my result file specifically indicates it. Putting two and two together leaves little incentive to actually finish the assigned work.

tapion64 2014-04-18 17:38

@LaurV, isn't the bit depth for TF remembered on GIMPS? If you look at an exponent's status and select "show full details", it says whether or not it's prime, whether a factor has been found, the known factors, LL residues if any, the TF bit depth, and P-1/ECM bounds.

EDIT: Never mind, mersenne.org does discard that information once a factor is found. It keeps it until a factor is found, even if the number is known composite by LL test. However, mersenne.ca at least keeps the information.

bayanne 2014-04-25 17:27

[QUOTE=Flow;371535]Thank you. For the record, I understand the information is not only lost but primenet does not even grant credit for finishing the bit level even though my result file specifically indicates it. Putting two and two together leaves little incentive to actually terminate the assigned work.[/QUOTE]
I am now stopping at class level, when a factor is found, rather than bit level ...

James Heinrich 2014-04-25 17:49

[QUOTE=Flow;371535]...leaves little incentive to actually terminate the assigned work.[/QUOTE]My interpretation of GIMPS TF assignments is that they say [color=darkgreen]see if there are any factors between 2[sup]x[/sup] and 2[sup]y[/sup][/color], so you can legitimately complete the assignment by [i]either[/i] finding a factor or verifying there are no factors in that range. Stopping after finding the factor doesn't make the assignment any less complete.

James Heinrich 2014-05-03 00:55

I finally got ahold of my SLI bridge from storage, so I enabled SLI mode on my now two GTX 670s. Since I wasn't quite with it and plugged the SLI bridge in backwards, I couldn't get SLI detected at first, so I tried reinstalling the current NVIDIA driver (335.23), then figured out what I did wrong and SLI went on just fine.

However, I broke mfaktc. While it has been running fine on 335.23 since that came out (early March), mfaktc won't start now:[code]CUDA version info
binary compiled for CUDA 4.20
CUDA runtime version 0.0
CUDA driver version 0.77
ERROR: current CUDA driver version is lower than the CUDA toolkit version used during compile!
Please update your graphics driver.[/code]How did I delete my CUDA runtime version, and how do I get it back? :unsure:

flashjh 2014-05-03 00:57

Do you have an integrated card also?

James Heinrich 2014-05-03 00:59

[QUOTE=flashjh;372510]Do you have an integrated card also?[/QUOTE]Nope, i7-3930K has no built-in GPU.

flashjh 2014-05-03 01:01

Well, SLI can change the video card numbers. Try using
-d00 and -d01 to start mfaktc.

kladner 2014-05-03 01:05

I may be confused, but I thought that there was no benefit from SLI for mfaktc. Don't x90 dual cards still run just one instance on each GPU?

The driver issue gives me pause, though. I'm running 332.21 without problems. Perhaps "not broken, don't fix" is a good philosophy for now, especially since I don't game, and these are 500 series cards.

flashjh 2014-05-03 01:07

Yes, that's right, but it would benefit other things when not using mfaktc.

I use the newest drivers and don't have any problems with 580s.

James Heinrich 2014-05-03 01:29

[QUOTE=flashjh;372512]Well, SLI can change the video card numbers. Try using
-d00 and -d01 to start mfaktc.[/QUOTE]That's not the issue. These two cards were working just fine for the last 3 weeks as two independent cards, running mfaktc on both with -d0/-d1, it's only since I messed up the drivers that it fails to run.

edit: SLI enabled or disabled makes no difference, it's failing because of the non-detected CUDA version.

edit2: 337.50-beta drivers also failed to fix anything

edit3: for what it's worth this is what the NVIDIA System Information box says:
[url]http://s11.postimg.org/l5lbe1ber/nvidia.png[/url]

James Heinrich 2014-05-03 01:41

[QUOTE=James Heinrich;372508][code]CUDA version info
binary compiled for CUDA 4.20
CUDA runtime version 0.0
CUDA driver version 0.77[/code][/QUOTE]Hmm... That was for the LessClasses version of mfaktc I've been running as my personal project for the last long while, but on a whim I tried the regular version of mfaktc and it played nice:[code]CUDA version info
binary compiled for CUDA 4.20
CUDA runtime version 4.20
CUDA driver version 6.0[/code]Doesn't solve my problem, and in fact makes me more confused, but it's a new aspect to the problem.

To add more confusion, I tried an old backup installation of mfaktc-LessClasses from 2012, and I got even weirder detected CUDA driver version:[code]CUDA version info
binary compiled for CUDA 4.20
CUDA runtime version 0.0
CUDA driver version 4068.80[/code]

James Heinrich 2014-05-03 02:00

[QUOTE=flashjh;372512]Well, SLI can change the video card numbers. Try using
-d00 and -d01 to start mfaktc.[/QUOTE]For me, at least, with SLI enabled it's weird. Starting any single instance of mfaktc always runs on the second GPU no matter what -d parameter is set. Starting two instances (with -d0/-d1 that works nicely without SLI) makes the system unusable: the GUI is so sluggish it's hard to click on what you want, and GPU usage for the first GPU is solid 100% (presumably overtaxed, hence the slow GUI) while second GPU usage fluctuates wildly between 10% and 90%. It's like both instances are fighting for the first GPU and some leftovers are spilling to the second.

In short, it doesn't play nice. I guess for now I'll keep SLI disabled, I don't really have many (any?) games that benefit from it anyways.

TheJudger 2014-05-03 12:00

Hi James,

sounds just like a wrong CUDA runtime DLL is being loaded.
There is a function inside the runtime DLL which reports its version... and it reports bullsh*t when it is not the correct version. I tried to file a bug against it, but nvidia said: "Works as designed, not a bug! Just use the correct runtime version." :sad:
So version detection only works reliably if the correct runtime DLL (correct version) is used...
Back to your issue: do you have the CUDA 4.2 runtimes in the working directories?
[LIST][*]CUDA runtime DLL [B]must match[/B] the version used during compile (CUDA toolkit version)[*]CUDA driver must support [B]at least[/B] the version from the CUDA toolkit[/LIST]So for the CUDA 4.20 binary you must use CUDA runtime 4.20 and your driver must support at least 4.20.
Oliver

James Heinrich 2014-05-03 12:28

[QUOTE=TheJudger;372541]do you have the CUDA 4.2 runtimes in the working directories?[/quote]Of course -- I have cudart64_42_9.dll in the LessClasses installation (64-bit, faster for CPU-sieving), and cudart32_42_9.dll in the "regular" mfaktc installation (32-bit, faster for GPU-sieving). Both were working fine for many months, until my latest attempt at driver installation which broke things, apparently.

Compiled for 4.2, got the 4.2 DLL in the folder, and driver supports 6.0

If this was a new driver I might think NVIDIA broke something, but I've been on 335.23 since March without problems, until yesterday.
Just to be sure something didn't get corrupted I just copied a mfaktc installation from my other computer, which is happily crunching on my GTX580, but I get the same nonsense (runtime 0.0, driver 0.77).

aaronhaviland 2014-05-03 12:56

[QUOTE=James Heinrich;372543]Of course -- I have cudart64_42_9.dll in the LessClasses installation (64-bit, faster for CPU-sieving), and cudart32_42_9.dll in the "regular" mfaktc installation (32-bit, faster for GPU-sieving). Both were working fine for many months, until my latest attempt at driver installation which broke things, apparently.[/QUOTE]

Using 32-bit binaries on a 64-bit OS has been a problem for at least one person in the past:
[URL]https://devtalk.nvidia.com/default/topic/469002/[/URL]

James Heinrich 2014-05-03 13:12

[QUOTE=aaronhaviland;372545]Using 32-bit binaries on a 64-bit OS has been a problem for at least one person in the past[/QUOTE]It's the 32-bit version that runs fine, the 64-bit version that's giving issues.

TheJudger 2014-05-03 19:24

Hi James,

can you check whether there are other CUDA runtimes (64bit) on your system or not?

Oliver

James Heinrich 2014-05-03 19:51

[QUOTE=TheJudger;372557]can you check whether there are other CUDA runtimes (64bit) on your system or not?[/QUOTE]How? Do you mean like cudart64*.dll? There's a number of them scattered through various directories, including "C:\NVIDIA\" (default unpack location for temp files for driver install) and "C:\Program Files\NVIDIA Corporation\Installer2\" (looks like a similar pre-install repository). Not counting stuff I copied manually into GIMPS-related folders, I see (in various places)[quote]cudart64_41_0.dll
cudart64_55.dll
cudart64_60.dll[/quote]Full list (cudart*.dll), if relevant:[code]C:\NVIDIA\DisplayDriver\335.23\Win8_WinVista_Win7_64\English\GFExperience.NvStreamC\cudart32_41_0.dll
C:\NVIDIA\DisplayDriver\335.23\Win8_WinVista_Win7_64\English\GFExperience.NvStreamSrv\amd64\server\cudart64_41_0.dll
C:\NVIDIA\DisplayDriver\335.23\Win8_WinVista_Win7_64\English\GFExperience.NvStreamSrv\x86\server\cudart32_41_0.dll
C:\NVIDIA\DisplayDriver\335.23\Win8_WinVista_Win7_64\English\ShadowPlay\cudart32_55.dll
C:\NVIDIA\DisplayDriver\335.23\Win8_WinVista_Win7_64\English\ShadowPlay\cudart64_55.dll
C:\NVIDIA\DisplayDriver\337.50\Win8_WinVista_Win7_64\English\GFExperience.NvStreamSrv\amd64\server\cudart64_41_0.dll
C:\NVIDIA\DisplayDriver\337.50\Win8_WinVista_Win7_64\English\GFExperience.NvStreamSrv\x86\server\cudart32_41_0.dll
C:\NVIDIA\DisplayDriver\337.50\Win8_WinVista_Win7_64\English\ShadowPlay\cudart32_55.dll
C:\NVIDIA\DisplayDriver\337.50\Win8_WinVista_Win7_64\English\ShadowPlay\cudart64_55.dll
C:\Prime95\cudalucas\cudart64_42_9.dll
C:\Prime95\cudapm1\cudart64_55.dll
C:\Prime95\mfaktc\_archive\1\cudart64_42_9.dll
C:\Prime95\mfaktc\_archive\2\cudart64_42_9.dll
C:\Prime95\mfaktc\_archive\3\cudart64_42_9.dll
C:\Prime95\mfaktc\_archive\4\cudart64_42_9.dll
C:\Prime95\mfaktc\_archive\5\cudart64_42_9.dll
C:\Prime95\mfaktc\_archive\6\cudart64_42_9.dll
C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart32_60.dll
C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_60.dll
C:\Program Files\NVIDIA Corporation\Installer2\GFExperience.NvStreamC.{49C76DBD-6D64-4E73-8D48-9B7EA264D8B3}\cudart32_41_0.dll
C:\Program Files\NVIDIA Corporation\Installer2\GFExperience.NvStreamC.{E76E2E52-96FD-4075-897B-061FB8C61607}\cudart32_41_0.dll
C:\Program Files\NVIDIA Corporation\Installer2\GFExperience.NvStreamSrv.{0ECD3D52-E3C1-42C9-8A83-B3EC0354FD57}\amd64\server\cudart64_41_0.dll
C:\Program Files\NVIDIA Corporation\Installer2\GFExperience.NvStreamSrv.{0ECD3D52-E3C1-42C9-8A83-B3EC0354FD57}\x86\server\cudart32_41_0.dll
C:\Program Files\NVIDIA Corporation\Installer2\GFExperience.NvStreamSrv.{12B5216B-714A-4831-B4CE-04FD1A0B55D2}\amd64\server\cudart64_41_0.dll
C:\Program Files\NVIDIA Corporation\Installer2\GFExperience.NvStreamSrv.{12B5216B-714A-4831-B4CE-04FD1A0B55D2}\x86\server\cudart32_41_0.dll
C:\Program Files\NVIDIA Corporation\Installer2\ShadowPlay.{A8945935-7B10-4CEE-9F43-777C614EA9DB}\cudart32_55.dll
C:\Program Files\NVIDIA Corporation\Installer2\ShadowPlay.{A8945935-7B10-4CEE-9F43-777C614EA9DB}\cudart64_55.dll
C:\Program Files\NVIDIA Corporation\Installer2\ShadowPlay.{AC36DD82-64BA-4542-A601-0575C7937360}\cudart32_55.dll
C:\Program Files\NVIDIA Corporation\Installer2\ShadowPlay.{AC36DD82-64BA-4542-A601-0575C7937360}\cudart64_55.dll
C:\Program Files\NVIDIA Corporation\Installer2\ShadowPlay.{C7FFF38C-8B7D-4178-A583-1CEFE383AFF1}\cudart32_55.dll
C:\Program Files\NVIDIA Corporation\Installer2\ShadowPlay.{C7FFF38C-8B7D-4178-A583-1CEFE383AFF1}\cudart64_55.dll
C:\Program Files\NVIDIA Corporation\NvStreamSrv\cudart64_41_0.dll
C:\Program Files\NVIDIA Corporation\ShadowPlay\cudart64_55.dll
C:\Users\User\AppData\Local\Temp\NVIDIA\GeForceExperienceSelfUpdate\12.4.55.0\GFExperience.NvStreamSrv\amd64\server\cudart64_41_0.dll
C:\Users\User\AppData\Local\Temp\NVIDIA\GeForceExperienceSelfUpdate\12.4.55.0\GFExperience.NvStreamSrv\x86\server\cudart32_41_0.dll
C:\Users\User\AppData\Local\Temp\NVIDIA\GeForceExperienceSelfUpdate\12.4.55.0\ShadowPlay\cudart32_55.dll
C:\Users\User\AppData\Local\Temp\NVIDIA\GeForceExperienceSelfUpdate\12.4.55.0\ShadowPlay\cudart64_55.dll[/code]

TheJudger 2014-05-04 10:59

Hi James,

can you (temporarily) remove all CUDA DLLs from your mfaktc folder and try again? The symptoms are exactly what I would expect from a wrong CUDA runtime DLL. If you remove all CUDA DLLs from your mfaktc folder and still get the same issues, then you can be sure that somehow you're using the wrong DLLs.

This is from mfaktc 0.20 src/mfaktc.c:[CODE]
  int drv_ver, rt_ver;
  if(mystuff.verbosity >= 1)printf("\nCUDA version info\n");
  if(mystuff.verbosity >= 1)printf("  binary compiled for CUDA %d.%d\n", CUDART_VERSION/1000, CUDART_VERSION%100);
#if CUDART_VERSION >= 2020
  cudaRuntimeGetVersion(&rt_ver);
  if(mystuff.verbosity >= 1)printf("  CUDA runtime version %d.%d\n", rt_ver/1000, rt_ver%100);
  cudaDriverGetVersion(&drv_ver);
  if(mystuff.verbosity >= 1)printf("  CUDA driver version %d.%d\n", drv_ver/1000, drv_ver%100);

  if(drv_ver < CUDART_VERSION)
  {
    printf("ERROR: current CUDA driver version is lower than the CUDA toolkit version used during compile!\n");
    printf("       Please update your graphics driver.\n");
    return 1;
  }
  if(rt_ver != CUDART_VERSION)
  {
    printf("ERROR: CUDA runtime version must match the CUDA toolkit version used during compile!\n");
    return 1;
  }
#endif
[/CODE]
This is the first call to any CUDA-related function in mfaktc.

Oliver

James Heinrich 2014-05-04 11:18

[QUOTE=TheJudger;372606]can you (temporarily) remove all CUDA DLLs from your mfaktc folder and try again?[/QUOTE]That gives a totally different type of error: a Windows popup error box saying[quote]The program can't start because cudart32_42_9.dll is missing from your computer. Try reinstalling the program to fix this problem.[/quote](Same thing for the 64-bit version, with cudart64_42_9.dll). mfaktc doesn't start at all, no CLI output, just the Windows popup.

With the DLLs in the folder then mfaktc starts and complains with my previously-reported errors about CUDA versions. Without them mfaktc doesn't start at all.

TheJudger 2014-05-04 11:32

Hi James,

the good news is that there is a high chance that you're using the correct CUDA runtimes. The bad news is that I've no clue what's wrong there.

Oliver

James Heinrich 2014-05-04 11:57

[QUOTE=TheJudger;372610]the good news is that there is a high chance that you're using the correct CUDA runtimes. The bad news is that I've no clue what's wrong there.[/QUOTE]They're the DLLs that I've been using for years, and are the ones you distributed with mfaktc. :smile:

I am also confused, which is why I posted here. :ermm:

TheJudger 2014-05-12 17:57

OK, thanks to a friend of mine (not a GIMPS participant) I could get my hands on a Maxwell GPU. He bought a GTX 750Ti for his PC and then (instead of driving home to put his new GPU into his own system) he visited me and gave me the GPU for a quick test. :smile: Obviously I didn't spend time on optimizing the code (I'm not even sure whether any Maxwell-specific optimizations are possible/feasible in mfaktc).

For those who don't know what a (CUDA-)multiprocessor (nvidia speaking) is: stop reading here! When I say Maxwell I'm talking about CC 5.0 and observations made on a GTX 750Ti.

Talking about mfaktc:[LIST][*]performance per multiprocessor per clock seems to be a little bit [B]below[/B] my Kepler (GTX 680 aka GK104)[*]while browsing the [URL="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput"]CUDA documentation[/URL] I got worried about the integer multiplication throughput on Maxwell: is there no [I]native[/I] integer multiplication on Maxwell?[*]comparing chips of the same die size I *guess* 28nm Maxwell wins by a small margin over 28nm Kepler[*]when talking about performance per watt I *guess* 28nm Maxwell chips are the first choice[/LIST]
Oliver

P.S. I can't wait for the [I]big[/I] Maxwells, no matter if 28nm or 20nm.

blip 2014-05-13 05:36

[CODE]May 13 07:08 | 604 13.4% | 1.700 23m33s | 1157.78 82485 n.a.%
ERROR: cudaGetLastError() returned 73: an illegal instruction was encountered[/CODE]
Some background info:
[CODE]mfaktc v0.20 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 193154bits
SIEVE_SPLIT 250
MORE_CLASSES enabled

Runtime options
SievePrimes 25000
SievePrimesAdjust 1
SievePrimesMin 5000
SievePrimesMax 100000
NumStreams 3
CPUStreams 3
GridSize 3
GPU Sieving enabled
GPUSievePrimes 82486
GPUSieveSize 64Mi bits
GPUSieveProcessSize 16Ki bits
WorkFile worktodo.txt
Checkpoints enabled
CheckpointDelay 30s
Stages enabled
StopAfterFactor bitlevel
PrintMode full
V5UserID blip
ComputerID evclient004
AllowSleep no
TimeStampInResults no

CUDA version info
binary compiled for CUDA 6.0
CUDA runtime version 6.0
CUDA driver version 6.0

CUDA device info
name GeForce GTX 590
compute capability 2.0
maximum threads per block 1024
number of multiprocessors 16 (512 shader cores)
clock rate 1215MHz

Automatic parameters
threads per grid 1048576

running a simple selftest...
Selftest statistics
number of tests 92
successfull tests 92

selftest PASSED!

got assignment: exp=87475907 bit_min=73 bit_max=74 (21.87 GHz-days)
Starting trial factoring M87475907 from 2^73 to 2^74 (21.87 GHz-days)
k_min = 53984767285680
k_max = 107969534579839
Using GPU kernel "barrett76_mul32_gs"

found a valid checkpoint file!
last finished class was: 384
found 0 factor(s) already[/CODE]

blip 2014-05-13 05:56

The GPU hangs at the self-test after a restart.
After switching jobs with the second GPU, both continue.

TheJudger 2014-05-13 18:59

blip: might be defective hardware? It runs for a while and then throws "cudaGetLastError() returned 73: an illegal instruction was encountered"... there is no JIT compiling in mfaktc; after startup everything is static.

Oliver

kracker 2014-05-14 19:03

So, do you think it is a step up or down from Kepler?

firejuggler 2014-05-14 20:13

or a sidestep?

James Heinrich 2014-05-27 12:36

[QUOTE=James Heinrich;372508]However, I broke mfaktc. While it has been running fine on 335.23 since that came out (early March), mfaktc won't start now[/QUOTE]The good news is that drivers v337.88 released today fixed the problem.

TheJudger 2014-09-19 19:53

Seems we have a new high score for energy-efficient trial factoring:

Stock/reference GTX 980
[CODE] ./mfaktc.exe -tf 66362159 72 73
mfaktc v0.21-pre6 (64bit built)
[...]
CUDA device info
name GeForce GTX 980
compute capability 5.2
maximum threads per block 1024
number of mutliprocessors 16 (unknown number of shader cores)
clock rate 1215MHz
[...]
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Sep 19 21:52 | 256 5.6% | 2.391 36m06s | 542.54 82485 n.a.%
Sep 19 21:52 | 261 5.7% | 2.373 35m48s | 546.66 82485 n.a.%
Sep 19 21:52 | 264 5.8% | 2.379 35m51s | 545.28 82485 n.a.%
Sep 19 21:52 | 265 5.9% | 2.380 35m49s | 545.05 82485 n.a.%
Sep 19 21:52 | 276 6.0% | 2.377 35m44s | 545.74 82485 n.a.%
[...]
[/CODE]

Oliver

Mark Rose 2014-09-19 20:00

What exactly determines/affects mfaktc performance on a given GPU?

Dubslow 2014-09-19 20:07

[QUOTE=TheJudger;383456]
Stock/reference GTX 980[/QUOTE]

Where did you get that? It was only launched today!

VictordeHolland 2014-09-19 20:54

[QUOTE=Mark Rose;383458]What exactly determines/effects mfaktc performance on a given GPU?[/QUOTE]
From what I can tell:
- Compute capability (higher is not always better)
- Number of CUDA cores
- Core/shader Clockspeed

Memory clock/bandwidth has little to no effect.

But I guess you already knew that and want a more specific/ architectural answer?

James Heinrich 2014-09-19 20:56

[QUOTE=TheJudger;383456]Seems we have new highscore for energy efficient trial factoring: Stock/reference GTX 980[/QUOTE]I just added the 980 to my [url=http://www.mersenne.ca/mfaktc.php]benchmark chart[/url] yesterday, but your numbers are exactly 20% higher than predicted by my lack-of-data (expected: 420.5 * (1215/1126) = 453GHd/d).
What Compute version does the 980 claim to be? (NVIDIA hasn't updated [url=https://developer.nvidia.com/cuda-gpus]their chart[/url] yet)
If you can, a benchmark submission would be most welcome:
[url]http://www.mersenne.ca/mfaktc.php#benchmark[/url]

Mark Rose 2014-09-19 21:01

I figured it was CUDA cores x core clock. What's worse about the higher compute capability/versions though? Do instructions on later version sometimes take more clock cycles? Do later compute versions allow anything to be done more efficiently?

James Heinrich 2014-09-19 22:08

I can't tell you what's better or worse about the different versions, but in terms of performance this is how many GFLOPS you need to get 1GHz-day/day of throughput (therefore lower is better):[code]NVIDIA:
1.x => 14.00 // horrible
2.0 => 3.65 // awesome
2.1 => 5.35 // pretty good
3.0 => 10.50 // not great
3.5 => 11.20 // getting worse

AMD:
VLIW5 => 11.3
VLIW4 => 10.5
GCN => 9.3[/code]So in terms of compute throughput NVIDIA seems to get worse with each revision (except, as noted above, the GTX 980 seems to have jumped 20% in the good direction from what I was expecting based on the previous generation). Which is why the relatively ancient GTX 580 (Compute 2.0 is still very competitive in terms of single-GPU throughput so many years later). AMD on the other hand seems to get more mfakto-efficient with each generation.

TheJudger 2014-09-19 23:52

[QUOTE=Mark Rose;383458]What exactly determines/effects mfaktc performance on a given GPU?[/QUOTE]

Integer instruction throughput (see some pages back in this thread).

[QUOTE=Dubslow;383459]Where did you get that? It was only launched today![/QUOTE]

It was a hard launch, just bought in local shop here.

[QUOTE=James Heinrich;383463]What Compute version does the 980 claim to be?[/QUOTE]
5.2

Oliver

kladner 2014-09-19 23:54

[QUOTE=Dubslow;383459]Where did you get that? It was only launched today![/QUOTE]

Long time no see! :smile:

Mark Rose 2014-09-20 00:05

[QUOTE=James Heinrich;383469]I can't tell you what's better or worse about the different versions, but in terms of performance this is how many GFLOPS you need to get 1GHz-day/day of throughput (therefore lower is better):[code]NVIDIA:
1.x => 14.00 // horrible
2.0 => 3.65 // awesome
2.1 => 5.35 // pretty good
3.0 => 10.50 // not great
3.5 => 11.20 // getting worse

AMD:
VLIW5 => 11.3
VLIW4 => 10.5
GCN => 9.3[/code][/quote]

Thanks for that table! I was curious what the factors were.

[quote]
So in terms of compute throughput NVIDIA seems to get worse with each revision (except, as noted above, the GTX 980 seems to have jumped 20% in the good direction from what I was expecting based on the previous generation). Which is why the relatively ancient GTX 580 (Compute 2.0 is still very competitive in terms of single-GPU throughput so many years later). AMD on the other hand seems to get more mfakto-efficient with each generation.[/QUOTE]

Over the last two months I bought a couple of used GTX 580's to contribute to the project ([url=http://www.gpu72.com/reports/worker_exact/ea39a75de82cd896610be22735054fc5/]see the bumps[/url]) because I saw they were so awesome. It seemed strange that 4 year old cards were still some of the best, but hey, cheap to acquire ($130 and $150). It also explains why the equally old GT 430's and GT 520's (both compute 2.1) I have crunching away are still worth bothering with (160 GHz-d/day total).

I'm really tempted to sell my GTX 760 in my home desktop (might get $150) and replace it with a GTX 980. The power requirements are basically the same and I wouldn't need to upgrade anything else. I find the GTX 760 struggles to keep up with 2560x1440 resolution in games.

ET_ 2014-09-20 11:30

I read that the 980 has 96KB of shared memory instead of the 48-64KB of previous generations.

I don't know if this would account for the augmented efficiency, as I suppose that mfaktc doesn't dynamically check for the shared memory presence/quantity.

BTW, has anybody done benchmarks for CUDALucas/CUDAP-1? From what I see there should still be the issue of 1/16 between doubles and floats, but maybe cc5.2 and the higher number of cores and clocks may show some interesting surprises... :smile:

Luigi

kracker 2014-09-20 14:19

[QUOTE=ET_;383527]I read that the 980 has 96KB of shared memory instead of 48K-64K of the previous versions. [...] From what I see, there should still be the issue of 1/16 between doubles and floats [...][/QUOTE]

Maxwell has 1/32 DP.

Mark Rose 2014-09-20 23:36

[QUOTE=TheJudger;383476]Integer instruction throughput (some pages ago in this thread).
[/QUOTE]

So I spent six hours today reading the whole thread. It cleared up a lot.

From what I read it's possible to create a kernel that uses floating point instructions instead. Is it still worth investigating?

TheJudger 2014-09-21 14:43

[QUOTE=James Heinrich;383463]If you can, a benchmark submission would be most welcome:
[url]http://www.mersenne.ca/mfaktc.php#benchmark[/url][/QUOTE]

Done!

[QUOTE=James Heinrich;383469][code]NVIDIA:
1.x => 14.00 // horrible
2.0 => 3.65 // awesome
2.1 => 5.35 // pretty good
3.0 => 10.50 // not great
3.5 => 11.20 // getting worse
[...]
[/code]So in terms of compute throughput NVIDIA seems to get worse with each revision (except, as noted above, the GTX 980 seems to have jumped 20% in the good direction from what I was expecting based on the previous generation). Which is why the relatively ancient GTX 580[...].[/QUOTE]

Yes, but does anyone really care about this? I mean, this just shows the relative speed of theoretical single precision floating point throughput (multiply-adds) versus mfaktc performance. For me performance per watt is a very important metric, and each generation of GPUs usually improves there. Remembering my stock/reference GTX 470... 50-55% of the mfaktc performance of my GTX 980 now... but with the 470 my PC sounded like a jumbo jet taking off. I bet while running mfaktc the power consumption of my GTX 980 stays well below its 165W TDP. :smile:
Don't know how you and others act, but I buy my GPUs for playing PC games; GPU computing (mfaktc) is not really a concern when buying GPUs, except that I only choose nvidia GPUs, for two reasons:[LIST][*]CUDA[*]I've used nvidia GPUs for a long time now; I'm lazy and don't want to learn my way around another vendor's drivers[/LIST]

[QUOTE=ET_;383527]I read that the 980 has 96KB of shared memory instead of 48K-64K of the previous versions.

I don't know if this would account for the augmented efficiency, as I suppose that mfaktc doesn't dynamically check for the shared memory presence/quantity.[/QUOTE]

Well, mfaktc 0.21 (not released, don't ask for a timeframe...) checks this. This might be the reason for some reported mfaktc crashes: a huge GPU sieve with low SievePrimes triggers the issue. I found it while enabling GPU sieving for CC 1.x devices, which have a smaller amount of shared memory.

[QUOTE=Mark Rose;383577]So I spent six hours today reading the whole thread. It cleared up a lot.

From what I read it's possible to create a kernel that uses floating point instructions instead. Is it still worth investigating?[/QUOTE]

Don't really know, but my feeling tells me: no, not worth testing.
Integer: we can easily do a 32x32 multiplication with a 64-bit result, so for a 96/192-bit number we need 3/6 ints, and a full 96x96 -> 192-bit multiplication needs 3*3*2 = 18 multiplications; the 2 is for the lower/upper 32-bit halves.
For SP floats we have a 23-bit mantissa, so just a guess: we can use up to 11 bits of data per chunk (11x11 -> 22-bit result), so for "only" 77/154 bits we need 7*7 = 49 multiplications; not sure how efficiently one can do the adds and shifts, too. I guess this is worse than for ints.
If we can use only 10 bits per chunk it's even worse. Might run out of register space, too.

Oliver

Mark Rose 2014-09-24 03:15

[QUOTE=TheJudger;383607]Don't really know but my feeling tells me: no, not worth testing.
Integer: we can easily do 32x32 multiplication with 64bit result so for 96/192 bit number we need 3/6 ints and a full 96x96->192bit multiplication needs 3*3*2 = 18 multiplications, the 2 is the lower/higher 32 bit part.
For SP floats we have a 23bit mantissa, so just a guess: we can use up to 11 bits of data per chunk (11x11 -> 22 bit result), so for "only" 77/144 bit we need 7*7=49 multiplications, not sure how efficient one can do adds and shifts, too. I guess this is worse than for ints.
If we can use only 10 bits per chunk it's even worse. Might run out of register space, too.[/QUOTE]

This inspired me to read mfaktc to see how everything was being done. It took a bit to wrap my head around everything, but I think I've made sense of it now. I had the hubris to think I might be able to find some unscavenged optimization somewhere but I found none in many hours of studying the code over the last two days. mfaktc is the tightest code I've ever looked at.

James Heinrich 2014-09-25 22:51

[QUOTE=TheJudger;383607]Yes, but does anyone really care about this?[/QUOTE]No, they shouldn't. I care about it only in the sense of having some basis for predicting mfakt_ performance for my chart. Overall performance, performance per watt, and performance per dollar (hardware+power) are the really useful metrics.

Mark Rose 2014-10-01 04:13

[QUOTE=TheJudger;383607]
Don't really know but my feeling tells me: no, not worth testing.
Integer: we can easily do 32x32 multiplication with 64bit result so for 96/192 bit number we need 3/6 ints and a full 96x96->192bit multiplication needs 3*3*2 = 18 multiplications, the 2 is the lower/higher 32 bit part.
For SP floats we have a 23bit mantissa, so just a guess: we can use up to 11 bits of data per chunk (11x11 -> 22 bit result), so for "only" 77/144 bit we need 7*7=49 multiplications, not sure how efficient one can do adds and shifts, too. I guess this is worse than for ints.
If we can use only 10 bits per chunk it's even worse. Might run out of register space, too.[/QUOTE]

So I spent the last nine days going over this, learning CUDA, etc., and it turns out that using floats would take 32% more time on compute 3.x hardware. For those not familiar, compute 3.x hardware can do 6 floating point multiply-adds but only 1 integer multiply-add per cycle. I'm mainly posting this in case anyone else has thought of pursuing the idea of using floating point.

The current barrett76 algorithm does 20 integer FMA's per loop, which take 20 cycles. It also must spend 2 cycles doing addition and subtraction. So 22 cycles.

The hypothetical floating point algorithm requires 2*7*7 FMA's for the basic multiplications. The high bits of each float are found by multiplying by 1/(2^11) and adding 1 x 2^23 as an FMA, rounding down to shift away the fraction, for another 2 * 14 FMA's. The 1 x 2^23 is then subtracted out for 2 * 14 subs. The high bits are then subtracted away for another 2 * 14 subs. Finally 7 subs are done to find the remainder.

That's a total of 53 FMA's for 9 cycles, 28 subs for 5 cycles, 53 FMA's for 9 cycles, then 35 subs for 6 cycles, for a total of 29 cycles. That's about 32% slower than the integer version, not taking into consideration register pressure, etc.

The new Maxwell chips, compute 5.x, keep a similar ratio to the 3.x chips in floating point to integer instructions, so there's no win there either.

LaurV 2014-10-01 05:15

:goodposting: Excellent post Mark Rose! Very well put and explained.

Karl M Johnson 2014-10-01 05:51

[QUOTE=Mark Rose;384137]The new Maxwell chips, compute 5.x, keep a similar ratio to the 3.x chips in floating point to integer instructions, so there's no win there either.[/QUOTE]
There's still hope for "proper" high-end Maxwell GPUs, which will be based on GM200.
First and foremost, it should have a better DPFP performance per SMM, crowning it king of the LL tests.
Nvidia also mentioned a "high-end Maxwell with an ARM cpu onboard"; their purpose is either to create an independent device (like Intel did with Xeon Phi) or to surprise us with new goodies.
Probably a bit of both.
However, I have a feeling the beast will only be sold as a Tesla GPU.

Mark Rose 2014-10-01 07:21

I wish I could edit the typos I missed earlier.

I'm pretty sure it will be a Tesla-only part, too. The current Maxwells have awful DPFP throughput. They've never increased the DPFP performance for the consumer parts in the past.

One nice thing about the Maxwells is the reduced instruction latency. That frees up a lot of registers because fewer threads are needed to get ideal occupancy of the SMMs.

Karl M Johnson 2014-10-01 12:04

[QUOTE=Mark Rose;384147]They've never increased the DPFP performance for the consumer parts in the past[/QUOTE]
*cough* *cough* GTX 580, Titan *cough* *cough*

Mark Rose 2014-10-01 14:49

[QUOTE=Karl M Johnson;384154]*cough* *cough* GTX 580, Titan *cough* *cough*[/QUOTE]

The GTX 580 had the same ratio as the other GF110 consumer cards.

You're right about the Titan. I stand corrected.

And considering that, I'd like to go back on what I said earlier: there's a good chance a consumer card will be released with better DPFP performance.

I shouldn't post hours past my bedtime lol

Karl M Johnson 2014-10-01 17:53

Our wait should not be long, as some sources suggest that a better, faster [STRIKE]and greener[/STRIKE] GeForce card will be released in Q4 2014.
Back to our topic, does CuLu actually use DPFP calculations anywhere in the code?
As far as I remember, it's about int performance along with memory latencies.

owftheevil 2014-10-01 22:25

The rounding and carrying kernel (~8-10% of the iteration time) is mostly integer arithmetic, but the FFTs and the pointwise multiplication are dpfp.

chalsall 2014-10-01 22:55

[QUOTE=owftheevil;384188]The rounding and carrying kernel (~8-10% of the iteration time) is mostly integer arithmetic, but the ffts and the pointwise multiplication are dpfp.[/QUOTE]

Did you know the Z80 only had a 4 bit ALU?

F' me over a bread board! No wonder a 6502 beat its pants off for real-world work!

owftheevil 2014-10-02 03:00

[QUOTE=chalsall;384189]Did you know the Z80 only had a 4 bit ALU?
![/QUOTE]

No I didn't.

[QUOTE]
F' me over a bread board! ... [/QUOTE]

No thanks, not my thing, but thanks for the offer :-)

kracker 2014-10-02 03:02

[QUOTE=chalsall;384189]Did you know the Z80 only had a 4 bit ALU?

F' me over a bread board! No wonder a 6502 beat its pants off for real-world work![/QUOTE]

...Is that you davieddy?

kladner 2014-10-02 03:53

[QUOTE=kracker;384197]...Is that you davieddy?[/QUOTE]
:devil::goodposting::stirpot: Wonder if he's still on this plane of existence. :davieddy:

Karl M Johnson 2014-10-02 05:12

[QUOTE=owftheevil;384188]The rounding and carrying kernel (~8-10% of the iteration time) is mostly integer arithmetic, but the ffts and the pointwise multiplication are dpfp.[/QUOTE]
BAH!
I meant mfaktc.
CuLu is, obviously, mostly about dpfp.

Mark Rose 2014-10-02 15:20

[QUOTE=Karl M Johnson;384201]BAH!
I meant mfaktc.
CuLu is, obviously, mostly about dpfp.[/QUOTE]

Oh, in that case, no. mfaktc doesn't use any DPFP in CUDA.

Ralf Recker 2014-10-16 11:34

[QUOTE=TheJudger;383607]Don't know how you and other act: but I buy my GPUs for playing PC games, GPU computing (mfaktc) is not really a concern when buying GPUs, except that I only choose nvidia GPUs for two reasons:[LIST][*]CUDA[*]used to use nvidia GPU for a long time now, I'm lazy and don't want to teach myself with another driver[/LIST][/QUOTE]
I hope next time NVIDIA tests their cards and drivers on boxes with 32 GB RAM installed.

I had to use msconfig to tell windows to use only 30 GB on boot to avoid a bunch of crashes.

flashjh 2014-10-16 13:35

I don't have any problems with 32gb or more.

James Heinrich 2014-10-16 13:47

[QUOTE=Ralf Recker;385319]I hope next time NVIDIA tests their cards and drivers on boxes with 32 GB RAM installed.[/QUOTE]I've had 64GB in my machine for the last 3 years with (at various times) a GTX 570, a GTX 670, two GTX 670s, even an 8800GT, and not experienced any issues.

Ralf Recker 2014-10-16 14:20

[QUOTE=James Heinrich;385323]I've had 64GB in my machine for the last 3 years with (at various times) a GTX 570, GTX 670, two GTX 670s, even an 8800GT and not experienced any issues.[/QUOTE]

I had no problems either until I installed a 4GB Maxwell card and tested it with a program that can use more than 2GB of video memory. At least it's an already-known bug and not defective hardware.

Mark Rose 2014-10-16 19:10

32 GB of RAM is a common setup these days with RAM being so cheap. I'm surprised they weren't testing that configuration.

TheJudger 2014-10-16 22:02

Hi,

some CUDA internal functions (cudaHostAlloc()?) fail if, well, I don't know exactly, physical memory addresses are above 1TB?
I see issues on big iron: pin the CUDA process to the lowest available NUMA node and it runs fine... pin it to another NUMA node (whose memory range is above 1TB) and it fails immediately. Even worse: I see silent data corruption when the process is moved from lower to higher addresses in this case.

Oliver

TheJudger 2014-10-16 22:06

help needed
 
Hi,

anyone able to build Windows binaries with CUDA 6.0 or 6.5? Currently I don't feel like f*cking up my Windows installation with Visual Studio or so.
It would be nice if someone could[LIST][*](shortterm/once) build CUDA 6.x binaries of mfaktc 0.20[*](longterm(?)/repeated) build future mfaktc binaries, including pre-releases and testing[/LIST]
If you want to/can help (both cases): Just drop me a note.

Oliver

firejuggler 2014-10-17 11:09

Check the "can't run on 980" thread, there is a compiled version. My 750 Ti works with it.

flashjh 2014-10-17 16:52

[QUOTE=TheJudger;385355]Hi,

anyone able to build Windows binaries with CUDA 6.0 or 6.5? Currently I don't feel like f*cking up my Windows installation with Visual Studio or so.
It would be nice if someone[LIST][*](shortterm/once) build CUDA 6.x binaries of mfaktc 0.20[*](longterm(?)/repeated) build future mfaktc binaries, including pre-releases and testing[/LIST]
If you want to/can help (both cases): Just drop me a note.

Oliver[/QUOTE]

Hey Oliver,

I should be available to compile shortterm/longterm for you. I don't always track the forums anymore, so you can PM me. I see those quickly.

I compiled mfaktc 0.20 with CUDA 6.5 for Win32 and Win64.

I used "[URL="https://developer.nvidia.com/cuda-downloads-geforce-gtx9xx"]CUDA 6.5 Production Release with Support for GeForce GTX9xx GPUs[/URL]"

The compilation architectures currently supported ([URL="http://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf"]by NVCC[/URL]) and included in this build are:

virtual architectures: compute_11, compute_12, compute_13, compute_20, compute_30, compute_32, compute_35, compute_37, compute_50, compute_52;

and GPU architectures: sm_11, sm_12, sm_13, sm_20, sm_21, sm_30, sm_32, sm_35, sm_37, sm_50, sm_52.

compute_37/sm_37 is not documented, but it's supported, so I included it in the build.

I emailed the build to you for upload. If you need anything else built, let me know.

Jerry

edit: if anyone needs the latest lib files for mfaktc, see [URL="https://sourceforge.net/projects/cudalucas/files/CUDA%20Libs/"]here[/URL]

TheJudger 2014-10-17 18:34

Hi Jerry,

thank you for your help. :smile:

[QUOTE=flashjh;385410]edit: if anyone need the latest lib files for mfaktc, see [URL="https://sourceforge.net/projects/cudalucas/files/CUDA%20Libs/"]here[/URL][/QUOTE]
I'll include these libs in the zipfile just as I did before. So it will be "all inclusive".

Oliver

TheJudger 2014-10-18 22:33

Hi all,

[B][U]thanks to Jerry[/U][/B] we have mfaktc 0.20 compiled with CUDA 6.5 for Windows now.
You can find it [URL="http://www.mersenneforum.org/mfaktc/mfaktc-0.20/"]here[/URL] - [URL="http://www.mersenneforum.org/mfaktc/mfaktc-0.20/mfaktc-0.20.win.cuda65.zip"]mfaktc-0.20.win.cuda65.zip[/URL][LIST][*][B]will run on [I]Maxwell[/I] GPUs, e.g. GTX 750 (Ti), GTX 970 and GTX 980[/B][*]code is unchanged to the CUDA 4.2 version, just recompiled (and enabled code generation for [I]Maxwell[/I] GPUs)[*]will read savefiles from the CUDA 4.2 version[*]there is no need to upgrade if the CUDA 4.2 version is running fine for you[/LIST]
I think you need an nvidia 340-series driver or newer.

Oliver

firejuggler 2014-10-19 19:00

3 Attachment(s)
James, you have underestimated GTX 750 Ti throughput by at least a third.
This is with default settings

James Heinrich 2014-10-19 20:24

[QUOTE=firejuggler;385533]James, you have underestimated GTX 750 Ti throughput by at least a third.[/QUOTE]Please send me [url=http://www.mersenne.ca/mfaktc.php#benchmark]a benchmark[/url], including both that GPU-Z screenshot as well as one from the Sensors tab (the only place I've found where the proper boosted clock speed is displayed).

I did have the wrong GFLOPS value for the 750ti, so thanks for catching there was a problem. But I still lack sufficient benchmarks for Compute 5.0 (and 5.2) cards. Based on the single benchmark I have (and my now-corrected data) I suspect your card is running at ~1220MHz. Your benchmark will help refine my data.

firejuggler 2014-10-19 21:06

Benchmark sent. 1176 MHz, it seems.

Ralf Recker 2014-10-20 11:43

[QUOTE=James Heinrich;385539]Please send me [URL="http://www.mersenne.ca/mfaktc.php#benchmark"]a benchmark[/URL], including both that GPU-Z screenshot as well as one from the Sensors tab (the only place I've found where the proper boosted clock speed is displayed).

I did have the wrong GFLOPS value for the 750ti, so thanks for catching there was a problem. But I still lack sufficient benchmarks for Compute 5.0 (and 5.2) cards. Based on the single benchmark I have (and my now-corrected data) I suspect your card is running at ~1220MHz. Your benchmark will help refine my data.[/QUOTE]
I've sent a benchmark for a GTX 970 (two submissions because of a typo in the bitlevel field. 71-72 is the right one. 71-73 is wrong).

TheJudger 2014-10-22 19:07

decision(s)
 
Hi all,

after a longer break I started CUDA coding (mfaktc) again.
For mfaktc 0.21 some decisions have to be made.

Enabling the GPU sieve for lower factor sizes AND (relatively low) exponents is problematic; see the additional information [URL="http://www.mersenneforum.org/showpost.php?p=363143&postcount=2290"]here[/URL].
For mfakt[B]c[/B] 0.20 this isn't a real issue (it can miss [B][U]composite[/U][/B] factors of relatively low exponents, depending on GPUSievePrimes). It affects only composite factors because the lower FC size limit is 2[SUP]64[/SUP] for GPU sieving in mfaktc 0.20. AFAIK mfakt[B]o[/B] has the same problems.

In mfaktc 0.21-preX I've generalized the GPU sieving code so it is very easy to adapt GPU sieving for (almost) all kernels. So I did for "75bit_mul32" -> "75bit_mul32_gs" and "95bit_mul32" -> "95bit_mul32_gs". Those kernels can handle FCs starting at very low numbers; in mfaktc 0.20 this is 2kp+1 where p is the minimum exponent (1.000.000), so those kernels handle FCs starting at ~2.000.000. Here the problem starts (remember [URL="http://www.mersenneforum.org/showpost.php?p=363143&postcount=2290"]post #2290[/URL]!). GPU sieving supports incredibly high values of GPUSievePrimes. With GPUSievePrimes around 149.000 the prime base will contain primes up to a bit above 2.000.000.

A good example is [URL="http://www.mersenne.org/report_exponent/?exp_lo=1000151&exp_hi="]M1.000.151[/URL]: mfaktc 0.20 will miss the [B][U]composite[/U][/B] factor 1285410593336863915299551 (2000303 * 642607941565284817) when GPU sieving is used AND GPUSievePrimes is set above ~148.000.

For mfaktc 0.21 I plan to support exponents as low as M100.000 (even if not really useful for GIMPS finding Mersenne primes, it was requested a couple of times). This (and the fact that GPU sieving is enabled for kernels which can TF below 2[SUP]64[/SUP]) means that mfaktc [B]0.21-preX[/B] currently misses [B][U]prime[/U][/B] factors (on low exponents), depending on the setting of GPUSievePrimes.

The simple [I]trick[/I] for the CPU sieve (description in [URL="http://www.mersenneforum.org/showpost.php?p=363143&postcount=2290"]post #2290[/URL]) isn't as easy for GPU sieving because the GPU sieve works with prime distances instead of absolute values (prime gaps stored in 7-bit integers). Removing primes from the prime base might overflow the prime gaps! I don't say that it is impossible, but I don't want to spend (much) time on [I]not-useful-for-GIMPS[/I] features which, on the other hand, might complicate the code and introduce possible bugs now and in the future.

Possible other solutions:
[LIST=1][*]Keep the exponent minimum at 1.000.000 for GPU sieving AND limit GPUSievePrimes to ~140.000. - simplest, but not a smart solution[*]Require a min factor size of e.g. 2[SUP]40[/SUP] for GPU-sieve-enabled kernels (if they, in theory, support lower FCs). - simple but will miss [B][U]composite[/U][/B] factors. [B]I don't like this.[/B][*]Dynamically lower GPUSievePrimes - leads to other problems; lower GPUSievePrimes needs more shared memory on the GPU (check [URL="http://www.mersenneforum.org/showpost.php?p=365377&postcount=2302"]post #2302[/URL]). I don't really like this solution.[*]Depending on the user's setting of GPUSievePrimes, calculate the minimum valid exponent size for GPU kernels.[LIST][*]currently [B]my preferred solution[/B][*]no additional code in the GPU kernels (performance-critical code), only some code on mfaktc startup to check the exponent size before entering the performance-critical code. So future optimizations of that code won't need to think about this corner case![*]how to handle it if the exponent is too low for the GPU sieving/GPUSievePrimes setting (but valid for CPU sieving)?[LIST=A][*]write a WARNING (including a hint that lowering GPUSievePrimes might be an option) and use the CPU sieve[*]write a WARNING (including a hint that lowering GPUSievePrimes might be an option) and ignore the assignment[*]write an ERROR (including a hint that lowering GPUSievePrimes might be an option) and exit[/LIST][/LIST][/LIST]
Because users who do TF on low exponents [B]should[/B] know what they are doing, I prefer 4C for now! Currently the wavefront (TF around M70.000.000+) is not affected.

Comments?

Oliver

P.S. I really prefer 4C; if you want another solution, it must be clever/smart and feasible, and/or you need good arguments against 4C!

James Heinrich 2014-10-22 20:19

Does this mean GPU sieving will be available below 2^64 for large exponents, such as my own pet project of 1000M-4296M? I only need support down to 2^52 since everything has been done below that, but if GPU sieving could be enabled... that would be of tremendous benefit :smile:

Bdot 2014-10-22 20:55

mfakto's _gs kernels start at 2[SUP]60[/SUP], also for your pet project, James.

So far, I was not approached to lower that limit, and I don't have that on my plan (yet).

[code]
got assignment: exp=4201971233 bit_min=66 bit_max=67 (0.00 GHz-days)
Starting trial factoring M4201971233 from 2^66 to 2^67 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 22:42 | 4611 100.0% | 0.002 0m00s | 160.06 80181 0.00%
no factor for M4201971233 from 2^66 to 2^67 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 2.751s (111.71 GHz-days / day)
[/code]Maybe you have a factor (> 2[sup]60[/sup]) for me that I could verify by finding it?

edit2: Using less-classes, that same thing is even more fun:
[code]
got assignment: exp=4201971233 bit_min=66 bit_max=67 (0.00 GHz-days)
Starting trial factoring M4201971233 from 2^66 to 2^67 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:02 | 412 100.0% | 0.009 0m00s | 355.68 80181 0.00%
no factor for M4201971233 from 2^66 to 2^67 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.995s (308.85 GHz-days / day)
[/code]

James Heinrich 2014-10-22 21:06

[QUOTE=Bdot;385788]Maybe you have a factor (> 2[sup]60[/sup]) for me that I could verify to find?[/QUOTE]The list of [URL="http://www.mersenne.ca/manyfactors.php?exp_min=1000000000&exp_max=4294967295&fac_min=7&fac_max=10"]exponents-with-many-factors[/URL] has many entries you can pick from. I can give you a pile of more specific exponents if you're interested. Here's a tiny sample with factors slightly larger than 60 bits:[code]Factor=2001862367,60,61
Factor=2000098873,60,61
Factor=2004561407,60,61
Factor=2005844293,60,61
Factor=2009094883,60,61
Factor=2003270579,60,61
Factor=2006109223,60,61
Factor=2008886611,60,61
Factor=2004315961,60,61
Factor=2001388097,60,61[/code]

Bdot 2014-10-22 21:11

[code]
got assignment: exp=2001862367 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2001862367 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 417 100.0% | 0.001 0m00s | 104.99 80181 0.00%
M2001862367 has a factor: 1153068867805081159

found 1 factor for M2001862367 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.167s (60.35 GHz-days / day)

got assignment: exp=2000098873 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2000098873 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 75 18.8% | 0.001 0m00s | 105.08 80181 0.00%
M2000098873 has a factor: 1153427718610610551
Oct 22 23:08 | 416 100.0% | 0.001 0m00s | 105.08 80181 0.00%
found 1 factor for M2000098873 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.149s (67.70 GHz-days / day)

got assignment: exp=2004561407 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2004561407 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 232 56.3% | 0.001 0m00s | 104.85 80181 0.00%
M2004561407 has a factor: 1153386835577909609
Oct 22 23:08 | 417 100.0% | 0.001 0m00s | 104.85 80181 0.00%
found 1 factor for M2004561407 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.147s (68.47 GHz-days / day)

got assignment: exp=2005844293 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2005844293 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 116 29.2% | 0.001 0m00s | 104.78 80181 0.00%
M2005844293 has a factor: 1153405062405321977
Oct 22 23:08 | 416 100.0% | 0.001 0m00s | 104.78 80181 0.00%
found 1 factor for M2005844293 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.145s (69.37 GHz-days / day)

got assignment: exp=2009094883 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2009094883 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 53 14.6% | 0.001 0m00s | 104.61 80181 0.00%
M2009094883 has a factor: 1153540764802817159
Oct 22 23:08 | 417 100.0% | 0.001 0m00s | 104.61 80181 0.00%
found 1 factor for M2009094883 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.146s (68.78 GHz-days / day)

got assignment: exp=2003270579 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2003270579 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 300 70.8% | 0.001 0m00s | 104.91 80181 0.00%
M2003270579 has a factor: 1153716299952772441
Oct 22 23:08 | 417 100.0% | 0.001 0m00s | 104.91 80181 0.00%
found 1 factor for M2003270579 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.146s (68.98 GHz-days / day)

got assignment: exp=2006109223 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2006109223 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 152 35.4% | 0.001 0m00s | 104.77 80181 0.00%
M2006109223 has a factor: 1153764818690030153
Oct 22 23:08 | 416 100.0% | 0.001 0m00s | 104.77 80181 0.00%
found 1 factor for M2006109223 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.145s (69.36 GHz-days / day)

got assignment: exp=2008886611 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2008886611 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 189 45.8% | 0.001 0m00s | 104.62 80181 0.00%
M2008886611 has a factor: 1153863845653352359
Oct 22 23:08 | 416 100.0% | 0.001 0m00s | 104.62 80181 0.00%
found 1 factor for M2008886611 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.148s (67.86 GHz-days / day)

got assignment: exp=2004315961 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2004315961 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 291 70.8% | 0.001 0m00s | 104.86 80181 0.00%
M2004315961 has a factor: 1154230767950158183
Oct 22 23:08 | 416 100.0% | 0.001 0m00s | 104.86 80181 0.00%
found 1 factor for M2004315961 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.146s (68.95 GHz-days / day)

got assignment: exp=2001388097 bit_min=60 bit_max=61 (0.00 GHz-days)
Starting trial factoring M2001388097 from 2^60 to 2^61 (0.00GHz-days)
Using GPU kernel "cl_barrett15_69_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 22 23:08 | 163 39.6% | 0.001 0m00s | 105.01 80181 0.00%
M2001388097 has a factor: 1154536360470905183
Oct 22 23:08 | 415 100.0% | 0.001 0m00s | 105.01 80181 0.00%
found 1 factor for M2001388097 from 2^60 to 2^61 [mfakto 0.15pre4-Win cl_barrett15_69_gs_2]
tf(): total time spent: 0.146s (69.05 GHz-days / day)
[/code]

TheJudger 2014-10-22 21:55

Hi James,

[QUOTE=James Heinrich;385785]Does this mean GPU sieving will be available below 2^64 for large exponents, such as my own pet project of 1000M-4296M? I only need support down to 2^52 since everything has been done below that, but if GPU sieving could be enabled... that would be of tremendous benefit :smile:[/QUOTE]

Yes. mfaktc 0.21 will use the "slow" 75-bit schoolbook division kernel below 2[SUP]64[/SUP], but anyway, GPU sieving is possible and feasible.
A slowish GT 630 (GK208) @900MHz:
[CODE]
M2001862367 has a factor: 1153068867805081159
found 1 factor for M2001862367 from 2^40 to 2^64 [mfaktc 0.21-pre7 75bit_mul32_gs]
tf(): total time spent: 8.702s

no factor for M2001862367 from 2^64 to 2^66 [mfaktc 0.21-pre7 barrett76_mul32_gs]
tf(): total time spent: 11.213s
[/CODE]

Oliver

Bdot 2014-10-22 22:12

But back to Oliver's question ...[QUOTE=TheJudger;385780]
Comments?
[/QUOTE]

I'd vote for 3.

In combination with your other plan to drop CC 1.x support, there should be sufficient shared memory to at least run, maybe not at best performance. Insufficient shared memory can also be detected, and lowering GPUSieveProcessSize can counter that.

But if you go for 4C (or B), the user also has to lower GPUSievePrimes, and may run into the same trouble.

Ralf Recker 2014-10-24 15:44

[QUOTE=Ralf Recker;385319]I hope next time NVIDIA tests their cards and drivers on boxes with 32 GB RAM installed.

I had to use msconfig to tell windows to use only 30 GB on boot to avoid a bunch of crashes.[/QUOTE]
Seems that the 344.48 driver fixed the issue.

Karl M Johnson 2014-10-28 07:24

[QUOTE=Ralf Recker;385973]Seems that the 344.48 driver fixed the issue.[/QUOTE]
Another confirmation from me.

wombatman 2014-11-05 05:26

How do I enable less classes? I can compile for Windows 7 64-bit with no issue. Just wasn't clear what I needed to enable/disable to get that and try it out. Thanks!

Mark Rose 2014-11-05 05:39

[QUOTE=wombatman;386895]How do I enable less classes? I can compile for Windows 7 64-bit with no issue. Just wasn't clear what I needed to enable/disable to get that and try it out. Thanks![/QUOTE]

Remove [code]
#define MORE_CLASSES
[/code] in params.h and recompile.

wombatman 2014-11-05 13:18

Much appreciated!

kladner 2014-11-23 02:12

I have a bit of suspicion about GeForce driver '344.75-desktop-win8-win7-winvista-64bit-international-whql'. Since I installed it, I have had a few instances of finding mfaktc not running. When I restart it, I find that the 580 has locked down to 400 MHz, requiring a reboot. It must be a couple of years since I saw such a problem.

I have now rolled back to the next previous 344.65, but haven't been using it long enough to establish stability. I also have earlier versions like 344.48, in which I have confidence.

Has anyone else noticed a change in behavior with 344.75?

Chuck 2014-11-23 16:09

[QUOTE=kladner;388258]I have a bit of suspicion about GeForce driver '344.75-desktop-win8-win7-winvista-64bit-international-whql'. Since I installed it, I have had a few instances of finding mfaktc not running. When I restart it, I find that the 580 has locked down to 400 MHz, requiring a reboot. It must be a couple of years since I saw such a problem.

I have now rolled back to the next previous 344.65, but haven't been using it long enough to establish stability. I also have earlier versions like 344.48, in which I have confidence.[/QUOTE]

I moved to 344.65 on Nov 11 and haven't noticed any problems with it. I don't see any performance changes from earlier versions (I don't do any gaming).

kladner 2014-11-23 16:55

[QUOTE=Chuck;388269]I moved to 344.65 on Nov 11 and haven't noticed any problems with it. I don't see any performance changes from earlier versions (I don't do any gaming).[/QUOTE]

So far, so good, here, with 344.65. I am thinking that there may be little reason to keep up with the latest drivers for something as old as the 500 series, if the drivers' main purpose is to tune the latest cards to the latest games.

henryzz 2014-11-23 17:57

[QUOTE=kladner;388271]So far, so good, here, with 344.65. I am thinking that there may be little reason to keep up with the latest drivers for something as old as the 500 series, if the drivers' main purpose is to tune the latest cards to the latest games.[/QUOTE]

I still update occasionally for an 8000 series

Chuck 2014-11-24 00:31

[QUOTE=kladner;388271]So far, so good, here, with 344.65. I am thinking that there may be little reason to keep up with the latest drivers for something as old as the 500 series, if the drivers' main purpose is to tune the latest cards to the latest games.[/QUOTE]

Agreed. I don't think driver updates have ever made any significant difference on mfaktc performance.


All times are UTC. The time now is 22:00.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.