[QUOTE=kruoli;557116]That's weird. Why does it have an .exe file extension? That's usually not the case in Linux.[/QUOTE]
For as long as I've been using mfaktc, the Linux version has always been distributed thusly. [QUOTE=kruoli;557116]And while you can omit the .exe extension in CMD, that's not valid for BASH etc. So if you have your mfaktc named with extension, you'll have to write [c]./mfaktc.exe[/c].[/QUOTE] Yup. Exactly the way we like it! True Geeks don't like the command line guessing what it thinks we want to do... :smile: |
[QUOTE=TheJudger;551813]Hi,
seems like mfaktc runs fine with CUDA 11 on Ampere (no specific changes for Ampere except Makefile). :smile:
[CODE]mfaktc v0.22-pre8 (64bit built)
[...]
CUDA version info
  binary compiled for CUDA  11.0
  CUDA runtime version      11.0
  CUDA driver version       11.0

CUDA device info
  name                      [COLOR="Red"][B]A100-SXM4-40GB[/B][/COLOR]
  compute capability        8.0
  max threads per block     1024
  max shared memory per MP  167936 byte
  number of multiprocessors 108
  clock rate (CUDA cores)   1410MHz
  memory clock rate:        1215MHz
  memory bus width:         5120 bit
[...]
Starting trial factoring M66362159 from 2^74 to 2^75 (57.65 GHz-days)
 k_min =  142321062303420
 k_max =  284642124610180
Using GPU kernel "barrett76_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Jul 19 21:19 |    0   0.1% |  0.829  13m15s |   6259.18    82485    n.a.%
Jul 19 21:19 |    4   0.2% |  0.779  12m26s |   6660.92    82485    n.a.%
Jul 19 21:19 |    9   0.3% |  0.780  12m26s |   6652.38    82485    n.a.%
[...]
Jul 19 21:31 | 4617 100.0% |  0.780   0m00s |   6652.38    82485    n.a.%
no factor for [COLOR="red"][B]M66362159 from 2^74 to 2^75[/B][/COLOR] [mfaktc 0.22-pre8 barrett76_mul32_gs CUDA 11.0 arch 8.0] 51D74917
tf(): total time spent: [COLOR="red"][B]12m 32.323s[/B][/COLOR]
[/CODE]
New absolute performance champion and I guess best performance per watt, too! :smile:
Older benchmark data for Turing (RTX 2080 Ti): [URL="https://mersenneforum.org/showpost.php?p=497430&postcount=2912"]https://mersenneforum.org/showpost.php?p=497430&postcount=2912[/URL]
Oliver[/QUOTE]
Sorry I looked all over but is 0.22 available anywhere to download? Or any prebuilt version compiled in Win64 with CUDA 11? Thanks! |
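As a rough sanity check on the quoted A100 log, the reported run time can be estimated two ways from numbers in the log itself: from the assignment size divided by the steady-state GHz-d/day rate, and from the per-class time multiplied by the number of classes tested. This is only a sketch. The figure of 960 tested classes is an assumption (it is not stated in the post, though the 0.1% progress step per class is consistent with it), and the function names are made up for illustration.

```python
# Figures copied from the quoted A100 log.
WORK_GHZ_DAYS = 57.65                 # "(57.65 GHz-days)"
RATE_GHZ_DAYS_PER_DAY = 6652.38       # steady-state "GHz-d/day" column
SECONDS_PER_CLASS = 0.780             # steady-state "time" column
REPORTED_SECONDS = 12 * 60 + 32.323   # "total time spent: 12m 32.323s"

# Assumption, not stated in the post: 960 classes are actually tested.
CLASSES_TESTED = 960

SECONDS_PER_DAY = 86400


def runtime_from_rate(work_ghz_days, rate_ghz_days_per_day):
    """Estimated run time in seconds from credit and throughput."""
    return work_ghz_days / rate_ghz_days_per_day * SECONDS_PER_DAY


def runtime_from_classes(classes_tested, seconds_per_class):
    """Estimated run time in seconds from per-class timing."""
    return classes_tested * seconds_per_class


est_from_rate = runtime_from_rate(WORK_GHZ_DAYS, RATE_GHZ_DAYS_PER_DAY)
est_from_classes = runtime_from_classes(CLASSES_TESTED, SECONDS_PER_CLASS)

for label, seconds in [
    ("from GHz-d/day", est_from_rate),
    ("from class time", est_from_classes),
    ("reported", REPORTED_SECONDS),
]:
    minutes, rest = divmod(seconds, 60)
    print(f"{label:>16}: {int(minutes)}m {rest:.1f}s")
```

Both estimates land within about one percent of the reported total, with the small gap plausibly coming from the slower first class and startup overhead.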
[QUOTE=TheJudger;551813]Hi,
seems like mfaktc runs fine with CUDA 11 on Ampere (no specific changes for Ampere except Makefile). :smile:
[CODE]mfaktc v0.22-pre8 (64bit built)
[...]
CUDA version info
  binary compiled for CUDA  11.0
  CUDA runtime version      11.0
  CUDA driver version       11.0

CUDA device info
  name                      [COLOR=Red][B]A100-SXM4-40GB[/B][/COLOR]
  compute capability        8.0
  max threads per block     1024
  max shared memory per MP  167936 byte
  number of multiprocessors 108
  clock rate (CUDA cores)   1410MHz
  memory clock rate:        1215MHz
  memory bus width:         5120 bit
[...]
Starting trial factoring M66362159 from 2^74 to 2^75 (57.65 GHz-days)
 k_min =  142321062303420
 k_max =  284642124610180
Using GPU kernel "barrett76_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Jul 19 21:19 |    0   0.1% |  0.829  13m15s |   6259.18    82485    n.a.%
Jul 19 21:19 |    4   0.2% |  0.779  12m26s |   6660.92    82485    n.a.%
Jul 19 21:19 |    9   0.3% |  0.780  12m26s |   6652.38    82485    n.a.%
[...]
Jul 19 21:31 | 4617 100.0% |  0.780   0m00s |   6652.38    82485    n.a.%
no factor for [COLOR=red][B]M66362159 from 2^74 to 2^75[/B][/COLOR] [mfaktc 0.22-pre8 barrett76_mul32_gs CUDA 11.0 arch 8.0] 51D74917
tf(): total time spent: [COLOR=red][B]12m 32.323s[/B][/COLOR]
[/CODE]
New absolute performance champion and I guess best performance per watt, too! :smile:
Older benchmark data for Turing (RTX 2080 Ti): [URL]https://mersenneforum.org/showpost.php?p=497430&postcount=2912[/URL]
Oliver[/QUOTE]
With something like this, a person could do a lot of LL-DC work using [I]gpuOwl[/I]. There are many 10's of 1000's needing to be done. IMHO, using this for TF is a [U]waste[/U]. :ermm: |
[QUOTE=storm5510;557377]With something like this, a person could do a lot of LL-DC work using [I]gpuOwl[/I]. There are many 10's of 1000's needing to be done. IMHO, using this for TF is a [U]waste[/U]. :ermm:[/QUOTE]
Care to elaborate? I couldn't find any benchmarks from gpuowl for this card. |
[QUOTE=kracker;557379]Care to elaborate? I couldn't find any benchmarks from gpuowl for this card.[/QUOTE]Neither can I. I expect we'll see some RTX 3080 data for gpuowl sometime in the somewhat-near future, but few people have access to an A100. I expect mfaktc would run fairly similar between the two, but gpuowl performance may differ significantly. If Oliver still has access to that A100 a quick benchmark of gpuowl (and possibly cudalucas) would be nice, as always.
But still, I don't think there's anything wrong with the developer of mfaktc spending 12 minutes testing that his program works on a new generation of hardware. :smile: |
[QUOTE=James Heinrich;557380]Neither can I. I expect we'll see some RTX 3080 data for gpuowl sometime in the somewhat-near future, but few people have access to an A100. I expect mfaktc would run fairly similar between the two, but gpuowl performance may differ significantly. If Oliver still has access to that A100 a quick benchmark of gpuowl (and possibly cudalucas) would be nice, as always.
But still, I don't think there's anything wrong with the developer of mfaktc spending 12 minutes testing that his program works on a new generation of hardware. :smile:[/QUOTE] My guess is that (at least at the consumer level) Ampere cards will perform quite poorly for LL/PRP and the like... according to techpowerup's GPU database, the [URL="https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621"]3080[/URL] has a [I][U]1:64[/U][/I] ratio for DP... (compare with 1:32 for the RTX 2080 Ti) |
[QUOTE=kracker;557384]My guess is that (at least at the consumer level) Ampere cards will perform quite poorly for LL/PRP and the like... according to techpowerup's GPU database, the [URL="https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621"]3080[/URL] has a [I][U]1:64[/U][/I] ratio for DP... (compare with 1:32 for the RTX 2080 Ti)[/QUOTE]
That's more a case of the formerly INT32-only cores also now supporting FP32. Both Ampere and Turing have 2 FP64 cores per Streaming Multiprocessor (SM) block. The RTX 2080 Ti has 68 SMs and the RTX 3080 also has 68 SMs, so clock speed being equal they should perform similarly. |
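To put numbers on that argument, here is a small sketch. The SM counts and the 2 FP64 units per SM come from the post above. The FP32 lane counts per SM (64 for Turing, 128 for Ampere) and the 1.5 GHz clock are my own assumptions, added only to show how the 1:32 and 1:64 ratios can arise while the absolute FP64 rate stays the same.

```python
from fractions import Fraction

# SM counts and FP64 units per SM are as stated in the post.
# fp32_per_sm values are an assumption added for illustration.
GPUS = {
    "RTX 2080 Ti (Turing)": {"sms": 68, "fp64_per_sm": 2, "fp32_per_sm": 64},
    "RTX 3080 (Ampere)": {"sms": 68, "fp64_per_sm": 2, "fp32_per_sm": 128},
}

# Hypothetical common clock, "clock speed being equal".
CLOCK_GHZ = 1.5


def fp64_to_fp32_ratio(gpu):
    """FP64:FP32 throughput ratio implied by the unit counts per SM."""
    return Fraction(gpu["fp64_per_sm"], gpu["fp32_per_sm"])


def peak_fp64_gflops(gpu, clock_ghz):
    """Peak FP64 GFLOPS, counting one fused multiply-add as 2 operations."""
    return gpu["sms"] * gpu["fp64_per_sm"] * 2 * clock_ghz


for name, gpu in GPUS.items():
    ratio = fp64_to_fp32_ratio(gpu)
    gflops = peak_fp64_gflops(gpu, CLOCK_GHZ)
    print(f"{name}: DP ratio {ratio}, peak FP64 about {gflops:.0f} GFLOPS")
```

Under these assumptions the ratio halves only because the FP32 denominator doubles; the FP64 numerator is unchanged. Real-world LL/PRP throughput also depends on memory bandwidth and actual clocks, which this sketch ignores.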
[QUOTE=James Heinrich;557380]Neither can I. I expect we'll see some RTX 3080 data for gpuowl sometime in the somewhat-near future, but few people have access to an A100. I expect mfaktc would run fairly similar between the two, but gpuowl performance may differ significantly. If Oliver still has access to that A100 a quick benchmark of gpuowl (and possibly cudalucas) would be nice, as always.
But still, I don't think there's anything wrong with the developer of mfaktc spending 12 minutes testing that his program works on a new generation of hardware. :smile:[/QUOTE] That was pure speculation, and I was referring to the A100 he used for his test. A web site I looked at says [URL="https://www.pcmag.com/news/nvidia-signals-rtx-3080-founders-edition-will-be-back-in-stock-next-week"]RTX 3080's[/URL] will be back in stock before the end of this week. Not coming from the horse's mouth, I don't know how reliable it is. Until now, I never knew who the author of [I]mfaktc[/I] was. His TF GHz-d/day figure is 6x what I can do, for now. Something like this doesn't always translate into other work types. Even so, it still should be pretty good. |
1 Attachment(s)
[B]A100[/B], photo attached. I've never seen anything like these before. Nvidia calls them "Data Center" GPU's. TDP, 400W on the left and 250W on the right. Most of the specs are the same for both.
|
[QUOTE=storm5510;557465][B]A100[/B], photo attached. I've never seen anything like these before. Nvidia calls them "Data Center" GPU's. TDP, 400W on the left and 250W on the right. Most of the specs are the same for both.[/QUOTE]
They have been around since Pascal. They are SXM modules (Pascal), and SXM2 and SXM4 for Volta and Ampere respectively. For data center machines, they are mounted on a carrier that can hold 4 or 8 of these and are connected via NVLink rather than PCIe. They will, of course, have a passive heat sink attached to their tops. The carrier boards are then attached to a "pizza box" server, usually just above it and holding 2 Xeon CPUs. Then multiple servers are put in a rack, etc. This is how supercomputers are made these days. When Jensen Huang announced Ampere in May there was a ridiculous video Nvidia made of his pulling a populated carrier out of an ordinary oven exclaiming "Look what we have cooked up!" The board on the right has one of these SXM4 modules within, mounted on a board that has PCIe interface circuitry, and the SXM4 will have a heatsink on its top that is different from the other ones I mentioned. These boards are also passively cooled and are for workstations. Since their cooling is less effective than that of the bare SXM4 modules, they are de-tuned to keep from overheating. Hence, they take 150 watts less than the datacenter SXM4 modules. They are usually referred to as Tesla boards. |
[QUOTE=storm5510;557377]With something like this, a person could do a lot of LL-DC work using [I]gpuOwl[/I]. There are many 10's of 1000's needing to be done. IMHO, using this for TF is a [U]waste[/U]. :ermm:[/QUOTE]
A 3080 is enough for TF. I borrowed a 3080 card and tested its performance: it delivers about 2/3 of the A100's TF performance at only about 1/10 of the price.
[CODE]
Starting trial factoring M210230299 from 2^72 to 2^73 (4.55 GHz-days)
 k_min =  11231412658620
 k_max =  22462825317437
Using GPU kernel "barrett76_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Sep 26 23:57 |    0   0.1% |  0.088    n.a. |   4653.23    82485    n.a.%
Sep 26 23:57 |    5   0.2% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |    9   0.3% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |   12   0.4% |  0.090    n.a. |   4549.82    82485    n.a.%
Sep 26 23:57 |   17   0.5% |  0.088    n.a. |   4653.23    82485    n.a.%
Sep 26 23:57 |   20   0.6% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |   21   0.7% |  0.090    n.a. |   4549.82    82485    n.a.%
Sep 26 23:57 |   24   0.8% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |   29   0.9% |  0.091    n.a. |   4499.83    82485    n.a.%
Sep 26 23:57 |   36   1.0% |  0.092    n.a. |   4450.91    82485    n.a.%
Sep 26 23:57 |   41   1.1% |  0.092    n.a. |   4450.91    82485    n.a.%
Sep 26 23:57 |   44   1.2% |  0.091    n.a. |   4499.83    82485    n.a.%
Sep 26 23:57 |   45   1.4% |  0.092    n.a. |   4450.91    82485    n.a.%
Sep 26 23:57 |   56   1.5% |  0.092    n.a. |   4450.91    82485    n.a.%
Sep 26 23:57 |   57   1.6% |  0.092    n.a. |   4450.91    82485    n.a.%
Sep 26 23:57 |   65   1.7% |  0.092    n.a. |   4450.91    82485    n.a.%
Sep 26 23:57 |   69   1.8% |  0.092    n.a. |   4450.91    82485    n.a.%
Sep 26 23:57 |   72   1.9% |  0.091    n.a. |   4499.83    82485    n.a.%
Sep 26 23:57 |   77   2.0% |  0.092    n.a. |   4450.91    82485    n.a.%
Sep 26 23:57 |   80   2.1% |  0.092    n.a. |   4450.91    82485    n.a.%
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Sep 26 23:57 |   84   2.2% |  0.091    n.a. |   4499.83    82485    n.a.%
Sep 26 23:57 |   89   2.3% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |   96   2.4% |  0.090    n.a. |   4549.82    82485    n.a.%
Sep 26 23:57 |  101   2.5% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |  104   2.6% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |  105   2.7% |  0.090    n.a. |   4549.82    82485    n.a.%
Sep 26 23:57 |  117   2.8% |  0.090    n.a. |   4549.82    82485    n.a.%
Sep 26 23:57 |  120   2.9% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |  129   3.0% |  0.090    n.a. |   4549.82    82485    n.a.%
Sep 26 23:57 |  132   3.1% |  0.090    n.a. |   4549.82    82485    n.a.%
Sep 26 23:57 |  140   3.2% |  0.091    n.a. |   4499.83    82485    n.a.%
Sep 26 23:57 |  141   3.3% |  0.092    n.a. |   4450.91    82485    n.a.%
Sep 26 23:57 |  149   3.4% |  0.091    n.a. |   4499.83    82485    n.a.%
Sep 26 23:57 |  152   3.5% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |  156   3.6% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |  161   3.8% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |  164   3.9% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |  176   4.0% |  0.090    n.a. |   4549.82    82485    n.a.%
Sep 26 23:57 |  177   4.1% |  0.089    n.a. |   4600.94    82485    n.a.%
Sep 26 23:57 |  185   4.2% |  0.089    n.a. |   4600.94    82485    n.a.%[/CODE] |
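The "2/3 of the A100" figure can be checked against the two logs in this thread. The sketch below compares the distinct GHz-d/day values seen in the 3080 log with the A100's steady-state rate. It is only a rough comparison, since the two runs used different exponents and bit levels. The 1/10 price ratio is the poster's estimate, not a measured figure, and the variable names are invented for illustration.

```python
# Steady-state A100 rate from the earlier log in this thread.
A100_RATE = 6652.38

# Distinct GHz-d/day values that appear in the 3080 log above.
RTX3080_RATES = [4653.23, 4600.94, 4549.82, 4499.83, 4450.91]

# The poster's estimate of the 3080 price relative to the A100 (an assumption).
PRICE_RATIO = 0.1

low_ratio = min(RTX3080_RATES) / A100_RATE
high_ratio = max(RTX3080_RATES) / A100_RATE

print(f"3080 / A100 TF throughput: {low_ratio:.2f} to {high_ratio:.2f}")
print(
    "3080 TF throughput per unit price relative to A100: "
    f"{low_ratio / PRICE_RATIO:.1f}x to {high_ratio / PRICE_RATIO:.1f}x"
)
```

On these numbers the 3080 sits at roughly two thirds of the A100's TF rate, which matches the claim in the post.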