
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

James Heinrich 2019-07-06 02:20

[QUOTE=hansl;520834]Also strangely I seem to get the better performance from my laptop Quadro M1000M vs a GTX 780. I would have expected the GTX to be faster, having more than 4x the CUDA core count of the mobile Quadro.[/QUOTE]What actual performance are you getting for both? Because I would [url=https://www.mersenne.ca/mfaktc.php?filter=m1000m%7Cgtx+780]expect[/url] that any version of a GTX 780 (even the 780M) should handily beat the M1000M.

For short-running exponents the answer is there's not much you can do -- mfakt[i]x[/i] just doesn't scale all that well to micro assignments. Even with less_classes and tweaks, if your runtime is less than a second you're leaving lots of performance unused. You may be able to recoup some of it by running multiple mfaktc instances simultaneously to try to maximize GPU load (keep adding instances until the sum of your throughput stops increasing).

hansl 2019-07-06 03:01

[QUOTE=James Heinrich;520838]What actual performance are you getting for both, because I would [url=https://www.mersenne.ca/mfaktc.php?filter=m1000m%7Cgtx+780]expect[/url] that any version of a GTX 780 (even the 780M) should handily beat the M1000M.

For short-running exponents the answer is there's not much you can do -- mfakt[i]x[/i] just doesn't scale all that well to micro assignments. Even with less_classes and tweaks, if your runtime is less than a second you're leaving lots of performance unused. You may be able to recoup some of it by running multiple mfaktc instances simultaneously to try to maximize GPU load (keep adding instances until the sum of your throughput stops increasing).[/QUOTE]
I'm not sure of the GHz-d/day values, because I commented out a lot of the print statements in this build to speed things along, since these are ultra-short runs. But I've chunked the work into 10,000 exponents per "batch" in separate worktodo files; the Quadro does a batch in about 9 minutes, while the GTX takes more like 12 minutes (1-55 bits, StopAfterFactor=0).

[spoiler]I guess I'm basically outing myself as mersenne.ca's TJAOI-copycat here, but James already knew that anyway :) [/spoiler]

I had tried before to do much larger batches of ~1 million exponents at a time, but the rewriting of such a large worktodo file seemed to cause massive overhead. Maybe that's still a big factor here, limiting by CPU/disk I/O rather than GPU, and I should be doing even smaller batches than 10k.

Also, somewhat off-topic, but on the GTX I can't see the GPU utilization % using the "nvidia-smi" utility on Linux (it says "N/A"). Anyone know if that was just not implemented for the GTX 700 series or older GPUs?

James Heinrich 2019-07-06 03:21

[QUOTE=hansl;520842]I had tried before to do much larger batches of like 1million exponents at a time, but it seemed that the rewriting of such a large worktodo was causing massive overhead, maybe that's still a big factor here possibly limiting by CPU/disk IO rather than GPU, that I should be doing even smaller batches than 10k.[/QUOTE]I haven't looked at the code, but I believe the entire worktodo.txt is rewritten line by line after each assignment (not simply copying all except the first line; notably the Windows build always rewrites worktodo.txt with \r\n even if the original has Unix linebreaks), so there's plenty of CPU overhead for that, plus disk I/O. You will likely get some improvement by running both worktodo.txt and results.txt from a RAMdisk. You will probably also benefit greatly from working with smaller batches (e.g. 1k assignments instead of 10k or 1M). You could even try 100 assignments at a time if the rest of your infrastructure can cope with such fast turnover.
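To illustrate why this hurts, here's a minimal Python sketch of the pattern described above (my own illustration, not mfaktc's actual C code; the function name is made up): completing one assignment means writing every remaining line back out.

```python
def complete_first_assignment(path="worktodo.txt"):
    """Hypothetical sketch: pop the first assignment by rewriting the file."""
    with open(path) as f:
        lines = f.readlines()
    if not lines:
        return None
    done, rest = lines[0], lines[1:]
    # The entire remainder is written back after every assignment, so a
    # batch of N assignments costs O(N^2) line writes over its lifetime.
    with open(path, "w") as f:
        f.writelines(rest)
    return done.strip()
```

Shrinking the batch size shrinks the N in that O(N^2), which is why 1k (or even 100) assignments per file can help.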

kriesel 2019-07-06 13:08

[QUOTE=James Heinrich;520845]I haven't looked at the code, but I believe the entire worktodo.txt is completely rewritten line by line (not simply copy all-except-the-first line[/QUOTE]It definitely rewrites even the first line, in the case of a multiple-bit-level entry.
Factor=exponent,74,78
becomes
Factor=exponent,75,78
then
Factor=exponent,76,78
etc. as bit levels are completed. Only after 77,78 is completed would such a line be removed. At higher exponents and bit levels there's ample time to observe this behavior, since run times can be days or weeks per bit level.
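To make the progression concrete, a tiny Python sketch (mine, not mfaktc code) generating the sequence of worktodo lines described above:

```python
def bit_level_lines(exponent, b_start, b_end):
    """Yield the Factor= line as it reads after each rewrite:
    the lower bound advances one bit level at a time until only
    b_end-1,b_end remains, after which the line is removed entirely."""
    for b in range(b_start, b_end):
        yield f"Factor={exponent},{b},{b_end}"
```

For Factor=exponent,74,78 this yields the 74, 75, 76 and 77 variants in turn.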

hansl 2019-07-06 19:22

[QUOTE=kriesel;520879]It definitely rewrites even the first line, in the case of a multiple-bit-level entry.
Factor=exponent,74,78
becomes
Factor=exponent,75,78
then
Factor=exponent,76,78
etc as bit levels are completed. It's only after 77,78 is completed that such a line would be removed. In the higher exponents and bit levels there's ample time to observe such behavior since run times can be days or weeks per bit level.[/QUOTE]
I think this doesn't occur with Stages=0, though, which definitely makes sense for my use case, so that was one of the first things I configured.

I did some testing with batches of 500, and here is a condensed version of my mfaktc.ini that I settled on:
[CODE]
SievePrimes=2000
SievePrimesAdjust=0
SievePrimesMin=2000
SievePrimesMax=100000

NumStreams=1
CPUStreams=1
GridSize=0

Checkpoints=0
CheckpointDelay=0
WorkFileAddDelay=0

Stages=0
StopAfterFactor=0
PrintMode=0
AllowSleep=1
TimeStampInResults=0

SieveOnGPU=1
GPUSievePrimes=4000
GPUSieveSize=512
GPUSieveProcessSize=8
[/CODE]
Am I right in my understanding that none of the "Sieve*" settings at the top make a difference when SieveOnGPU=1?

Tweaking GPUSievePrimes for these workloads seemed to have some of the most noticeable effects (again this is just for exponents around 1e9 range and only to 55 bits).

Note: I also raised GPU_SIEVE_SIZE_MAX in params.h. It doesn't make a lot of difference, but GPUSieveSize=512 seemed maybe a couple percent faster than the default max of 128.

I've basically commented out all the status-printing lines in the source now (but carefully leaving in the ones that print to the results file!), so I wouldn't think PrintMode should make a difference at this point. Still, it looked maybe slightly faster set to 0 rather than 1 (though possibly within the margin of error).

All in all I went from ~26s/500 exponents down to ~20-21s/500 after testing various changes here (on my Quadro).

I still see slower times on the GTX 780 @ ~29s/500, but I haven't tested ramdisk yet.

Running 2 instances on the GTX gives me ~44s/500 per instance, which averages out to ~22s/500 throughput. That's a definite improvement over a single instance, but still *just* slower than the Quadro.
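The throughput arithmetic here is just batches-per-second summed across instances; a one-line helper (mine, not from mfaktc) to make it explicit:

```python
def seconds_per_batch_aggregate(per_instance_seconds, n_instances):
    # n instances each completing a batch every t seconds deliver
    # n batches per t seconds, i.e. t/n seconds per batch overall.
    return per_instance_seconds / n_instances
```

seconds_per_batch_aggregate(44, 2) gives the ~22 s/500 quoted above.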

Maybe notable: the GTX 780 is being fed by older dual-socket Xeon E5-2697 v2s, which run consistently at 3.0 GHz, vs my laptop's i7-6820HQ, which holds a constant boost of 3.2 GHz.

Running 2 instances on the Quadro gave me ~40s/500, which is maybe only a fraction of a second better per-batch throughput vs a single instance.

I'll try benchmarking on ramdisk soon and see how that fares.

hansl 2019-07-06 19:36

One small suggestion based on the issue of rewriting large worktodo.txt files.
I think it may be more efficient to work backwards from the last line; then it would only need to drop one line at a time instead of rewriting the whole file on each update. I totally understand if that's not considered worth the effort for this extremely niche case, though.
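A sketch of what I mean (Python pseudocode for the idea, not a patch against mfaktc): if assignments are consumed from the end of the file, finishing one is just a truncate.

```python
def pop_last_assignment(path="worktodo.txt"):
    """Hypothetical sketch of the suggestion above, not mfaktc code."""
    with open(path, "rb+") as f:
        data = f.read()
        lines = data.splitlines(keepends=True)
        if not lines:
            return None
        last = lines[-1]
        # Drop just the final line: one truncate call instead of
        # rewriting every remaining line of the file.
        f.truncate(len(data) - len(last))
    return last.decode().strip()
```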

kriesel 2019-07-08 20:12

Probability of multiple factors in same class and bit level
 
I recall seeing a reference to finding two or more factors in the same class of the same bit level being rare, and that it is perhaps not necessary to handle both factors in that case. (I think it was in this thread, but I haven't been patient enough to find it again.)

Related are
"In some cases it misses factors when there are multiple factors in one class close together but this is not critical. The is a known problem since the first version... This has nothing to do with the calculations itself, it is just how the results are returned from the GPU to the CPU." [URL]https://www.mersenneforum.org/showpost.php?p=205332&postcount=131[/URL]
and
"There is no result bitmap in mfaktc, just a small array of integers (32x 4 Byte) after each class is finished. The array can hold up to 10 factors per class." [URL]https://www.mersenneforum.org/showpost.php?p=410784&postcount=35[/URL]

I've put together some calculated estimates for two to four factors in the same bit level and same class, at [URL]https://www.mersenneforum.org/showpost.php?p=520982&postcount=5[/URL].
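For anyone wanting a quick sanity check of such estimates: if factor hits within one (class, bit level) cell are modeled as Poisson with mean lam (an assumption of mine, not necessarily the model used in the linked post), the chance of two or more factors sharing a cell is:

```python
import math

def p_two_or_more(lam):
    # P(X >= 2) for X ~ Poisson(lam): one minus P(X=0) and P(X=1).
    return 1.0 - math.exp(-lam) * (1.0 + lam)
```

For small lam this is roughly lam^2 / 2, which is why same-class multiple factors are so rare.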

SethTro 2019-07-10 08:03

I modified mfaktc to return k's where pow(2, P-1, 2*P*k+1) is very small so that NF-TF results still have some verifiable output. More details and code links in [url]https://www.mersenneforum.org/showpost.php?p=520512&postcount=199[/url]
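A minimal Python sketch of that idea (the threshold, function name, and output format here are my own; see SethTro's linked post for the real implementation): during TF of M[sub]P[/sub], record any candidate q = 2*P*k + 1 whose residue 2^(P-1) mod q falls below some bound, so a no-factor run still emits output that can be spot-checked.

```python
def small_residue_ks(P, ks, threshold=1 << 24):
    """Record (k, residue) pairs with unusually small 2^(P-1) mod q,
    where q = 2*P*k + 1 is a trial-factoring candidate for M_P."""
    hits = []
    for k in ks:
        q = 2 * P * k + 1
        r = pow(2, P - 1, q)  # modular exponentiation, as in TF
        if r < threshold:
            hits.append((k, r))
    return hits
```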

TheJudger 2019-07-21 20:17

Hi there,

I was recently able to "reproduce" the issue where mfaktc reports 38814612911305349835664385407 as a (false) factor of M<insert prime number here>. While the origin of the factor is well known (last factor in the small selftest before doing some real work) it is still unknown why it shows up "randomly".

I have some evidence that this is related to hardware errors. In my case the OS reported:[CODE][150260.974505] NVRM: GPU at PCI:0000:5e:00: GPU-<some UID>
[150260.974510] NVRM: GPU Board Serial Number:
[150260.974514] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 4, SM 0): Out Of Range Address
[150260.974521] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ESR 0x51e730=0xc04000e 0x51e734=0x0 0x51e728=0x4c1eb72 0x51e72c=0x174
[150260.974764] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 0, SM 0): Out Of Range Address
[150260.974769] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ESR 0x524730=0xc05000e 0x524734=0x0 0x524728=0x4c1eb72 0x52472c=0x174
[150260.974857] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 1, SM 1): Out Of Range Address
[150260.974863] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ESR 0x524fb0=0xc05000e 0x524fb4=0x20 0x524fa8=0x4c1eb72 0x524fac=0x174
[150260.974954] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 1): Out Of Range Address
[150260.974959] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ESR 0x5257b0=0xc05000e 0x5257b4=0x20 0x5257a8=0x4c1eb72 0x5257ac=0x174
[150260.975044] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 3, SM 0): Out Of Range Address
[150260.975050] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ESR 0x525f30=0xc06000e 0x525f34=0x20 0x525f28=0x4c1eb72 0x525f2c=0x174
[150260.975118] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 3, SM 1): Out Of Range Address
[150260.975123] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ESR 0x525fb0=0xc06000e 0x525fb4=0x20 0x525fa8=0x4c1eb72 0x525fac=0x174
[150260.975201] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 4, SM 0): Out Of Range Address
[150260.975206] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Global Exception on (GPC 4, TPC 4, SM 0): Multiple Warp Errors
[150260.975211] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ESR 0x526730=0xc04000e 0x526734=0x24 0x526728=0x4c1eb72 0x52672c=0x174
[150260.975280] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 4, SM 1): Out Of Range Address
[150260.975284] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ESR 0x5267b0=0xc04000e 0x5267b4=0x20 0x5267a8=0x4c1eb72 0x5267ac=0x174
[150260.984693] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ChID 0010, Class 0000c5c0, Offset 00000000, Data 00000000
[169206.556478] NVRM: Xid (PCI:0000:5e:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 2, SM 1): Out Of Range Address
[169206.556485] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ESR 0x52d7b0=0xc07000e 0x52d7b4=0x0 0x52d7a8=0x4c1eb72 0x52d7ac=0x174
[169206.557301] NVRM: Xid (PCI:0000:5e:00): 13, Graphics Exception: ChID 0010, Class 0000c5c0, Offset 00000000, Data 00000000
[169206.624363] NVRM: Xid (PCI:0000:5e:00): 62, 0cb5(2d50) 8503d428 ffffff80
[/CODE]

around the time the false factor was reported... a few minutes later the GPU was completely broken (as in the driver was unable to initialize the GPU).

Oliver

kriesel 2019-07-22 01:58

[QUOTE=TheJudger;522041]Hi there,

I was recently able to "reproduce" the issue where mfaktc reports 38814612911305349835664385407 as a (false) factor of M<insert prime number here>. While the origin of the factor is well known (last factor in the small selftest before doing some real work) it is still unknown why it shows up "randomly".

I have some evidence that this is related to hardware errors. [/QUOTE]
The false factor is also 1:1 reproducible with Windows TDRs. A slow card such as the NVS295 and default TdrDelay 2 seconds reliably produced the false factor with each TDR, 12 out of 12. See [url]https://www.mersenneforum.org/showpost.php?p=520615&postcount=3167[/url]

storm5510 2019-07-24 15:22

I have a 250GB Samsung SSD coming tomorrow. I have been running James Heinrich's project using a RAM-drive. I feel that I should continue to run it this way. Would this be a correct statement?

Thanks. :smile:


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.