Old 2021-03-27, 18:40   #29
kriesel
 
P-1 speed, 103M, V6.11-380 vs. v7.2-53

Test environment description:

XFX Radeon VII, GPU RAM clock 1150 MHz,
Windows 10 x64 build 19041.867, Celeron G1840 CPU, 16 GB system RAM. The GPUs on this system drive no display; video output is handled by the motherboard's VGA circuitry.

V6.11-380-g79ea0cc
config.txt: -user kriesel -cpu asr2/radeonvii4 -d 4 -use NO_ASM -maxAlloc 15000 -cleanup
B=1000000,B2=30000000;PFactor=AID,1,2,103218151,-1,76,2
P-1 start stage 1, 2021-03-21 11:17:00
P-1 begin stage 2, 2021-03-21 11:37:15, gpu stage 1 time 0:20:15
P-1 begin stage 2 GCD, 2021-03-21 12:02:56, gpu stage 2 time 25:41; gpu both stages time 0:45:56
P-1 finish last stage 2 GCD, 2021-03-21 12:03:50, total elapsed time 0:46:50
This program version can perform 24hr/0:45:56 = 31.35 wavefront P-1 to B1=1M, B2=30M per day on this gpu.
Code:
2021-03-21 11:16:54 asr2/radeonvii4 103218151 FFT: 5.50M 1K:11:256 (17.90 bpw)
2021-03-21 11:16:54 asr2/radeonvii4 Expected maximum carry32: 41980000
2021-03-21 11:16:55 asr2/radeonvii4 OpenCL args "-DEXP=103218151u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DPM1=1 -DAMDGPU=1 -DCARRYM64=1 -DWEIGHT_STEP_MINUS_1=0x9.6bad53f9e1ae8p-7 -DIWEIGHT_STEP_MINUS_1=-0x8.c6595b0a85358p-7 -DNO_ASM=1  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2021-03-21 11:16:59 asr2/radeonvii4 OpenCL compilation in 4.06 s
2021-03-21 11:17:00 asr2/radeonvii4 103218151 P1 B1=1000000, B2=30000000; 1442134 bits; starting at 0
2021-03-21 11:17:08 asr2/radeonvii4 103218151 P1    10000   0.69%;  853 us/it; ETA 0d 00:20; 72f84500b0ce1bff
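The per-day figure quoted above for V6.11-380 can be reproduced from the logged timestamps. The following is a minimal sketch; the helper names elapsed and runs_per_day are illustrative only and are not part of gpuowl.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp layout used in the notes above


def elapsed(start: str, end: str) -> timedelta:
    """Wall-clock time between two timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def runs_per_day(duration: timedelta) -> float:
    """How many runs of the given duration fit into 24 hours."""
    return timedelta(days=1) / duration


# V6.11-380: stage 1 start to start of the stage 2 GCD (gpu time, both stages)
gpu_time = elapsed("2021-03-21 11:17:00", "2021-03-21 12:02:56")
print(gpu_time, round(runs_per_day(gpu_time), 2))
```

The same two calls can be applied to the timestamps of the other runs in this post to check their per-day figures.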
V7.2-53-ge27846f -maxalloc 15G
config.txt: -user kriesel -cpu asr2/radeonvii4 -d 4 -maxAlloc 15G -proof 9 -use NO_ASM -autoverify 10
B1=1000000,B2=30000000;PRP=AID,1,2,103276501,-1,76,2
P-1 start stage 1, 2021-03-25 17:34:31
P-1 begin stage 2, 2021-03-25 18:01:12, gpu stage 1 time 0:26:41
P-1 begin last stage 2 GCD, 2021-03-25 21:31:20, gpu stage 2 time 3:30:18, while no PRP progress occurs; gpu both stages time 3:56:59
P-1 finish last stage 2 GCD, 2021-03-25 21:32:17, total elapsed time 3:57:46
This program version can perform 24hr/3:56:59 = 6.08 wavefront P-1 to B1=1M, B2=30M per day on this gpu.
Even if the stage 1 time is ignored, on the assumption that the PRP progress made concurrently with stage 1 is useful, the drop in P-1 productivity is considerable:
24hr/3:30:18 = 6.85/day on stage 2 time alone, 21.8% of the V6.11-380 rate for both stages.
Is stage 2 being slowed by using too much gpu ram, causing paging? Retest with -maxalloc 14G.

System restart and v7.2-53 changed to -maxalloc 14G, retest:
B1=1000000,B2=30000000;PRP=AID,1,2,103281509,-1,76,2
start program 2021-03-26 07:20:37 (don't count startup time which is incurred once)
P-1 stage 1 start 2021-03-26 07:20:55
P-1 stage 1 end gpu compute, start stage 2 2021-03-26 07:46:35; stage 1 gpu time 0:25:40
P-1 stage 2 end gpu compute 2021-03-26 08:21:30, stage 2 gpu time 0:34:55, combined gpu time 1:00:35
P-1 stage 2 final gcd done & result written 2021-03-26 08:22:35 total elapsed time 1:01:40
24hr/1:00:35 = 23.77 wavefront P-1 to B1=1M, B2=30M per day on this gpu as configured.
v7.2-53 needed a lower -maxAlloc than v6.11-380 to reach what may be its full performance; 23.77/31.35 = 75.82% of v6.11-380 throughput.
Code:
2021-03-26 07:20:37 GpuOwl VERSION v7.2-53-ge27846f
2021-03-26 07:20:37 config: -user kriesel -cpu asr2/radeonvii4 -d 4 -maxAlloc 14G -proof 9 -use NO_ASM -autoverify 10
2021-03-26 07:20:37 device 4, unique id ''
2021-03-26 07:20:37 asr2/radeonvii4 103281509 FFT: 5.50M 1K:11:256 (17.91 bpw)
2021-03-26 07:20:37 asr2/radeonvii4 103281509 OpenCL args "-DEXP=103281509u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0.065454243866181372 -DIWEIGHT_STEP_MINUS_1=-0.061433181427547327 -DIWEIGHTS={0,-0.061433181427547327,-0.1190923270747847,-0.17320928796651797,-0.22400167178148192,-0.27167371786640354,-0.31641711849516779,-0.35841178967541976,-0.39782659460206504,-0.43482002266672043,-0.46954082675345288,-0.004257242766829498,-0.065428888227101079,-0.12284256489359018,-0.17672913674500432,-0.22730528505136197,} -DNO_ASM=1  -cl-std=CL2.0 -cl-finite-math-only "
2021-03-26 07:20:41 asr2/radeonvii4 103281509 OpenCL compilation in 3.91 s
2021-03-26 07:20:41 asr2/radeonvii4 103281509 trig table : 65 points, cos 73.77 bits, sin 73.34 bits
2021-03-26 07:20:41 asr2/radeonvii4 103281509 trig table : 353 points, cos 72.91 bits, sin 73.05 bits
2021-03-26 07:20:42 asr2/radeonvii4 103281509 trig table : 360449 points, cos 72.51 bits, sin 72.42 bits
2021-03-26 07:20:43 asr2/radeonvii4 103281509 maxAlloc: 14.0 GB
2021-03-26 07:20:43 asr2/radeonvii4 103281509 P1(1M) 1442134 bits
2021-03-26 07:20:43 asr2/radeonvii4 103281509 PRP starting from beginning
2021-03-26 07:20:43 asr2/radeonvii4 103281509 Acquired memory lock 'memlock-4'
2021-03-26 07:20:43 asr2/radeonvii4 103281509 P1(1M) using 636 buffers
2021-03-26 07:20:55 asr2/radeonvii4 103281509 OK         0 on-load: blockSize 400, 0000000000000003
2021-03-26 07:20:55 asr2/radeonvii4 103281509 validating proof residues for power 9
2021-03-26 07:20:55 asr2/radeonvii4 103281509 Proof using power 9
2021-03-26 07:21:05 asr2/radeonvii4 103281509        10000   0.01% 16587bc1b5984c5e 1006 us/it
 ...
2021-03-26 07:46:30 asr2/radeonvii4 103281509      1440000   1.39% 9879e6d9e5f6f57a 1002 us/it
2021-03-26 07:46:35 asr2/radeonvii4 103281509 P1(1M) releasing 636 buffers
2021-03-26 07:46:35 asr2/radeonvii4 103281509 Released memory lock 'memlock-4'
2021-03-26 07:46:36 asr2/radeonvii4 103281509 OK   1442400   1.40% 2399e963ff197b84  989 us/it + check 0.54s + save 2.99s; ETA 1d 03:59
2021-03-26 07:47:33 asr2/radeonvii4 103281509 P1 Jacobi OK @ 1442400 86bb88b6716d5a4e
2021-03-26 07:47:35 asr2/radeonvii4 103281509 OK   1446000   1.40% 4f9562fe18bd70dc 16098 us/it + check 0.53s + save 0.40s; ETA 18d 23:22
2021-03-26 07:47:35 asr2/radeonvii4 103281509 P2(1M,30M) D=330, nBuf=315
2021-03-26 07:47:35 asr2/radeonvii4 103281509 P2(1M,30M) Generating P2 plan, please wait..
2021-03-26 07:47:44 asr2/radeonvii4 103281509 P2(1M,30M) D=330: 1779361 primes in [1000003, 29999999]: cost 1.22M (pair: 715271, single: 348819, (80% paired), blocks: 77916)
2021-03-26 07:47:44 asr2/radeonvii4 103281509 P2(1M,30M) 77916 blocks: 12991 - 90906; start from 12991
2021-03-26 07:47:44 asr2/radeonvii4 103281509 P2(1M,30M) Acquired memory lock 'memlock-4'
2021-03-26 07:47:44 asr2/radeonvii4 103281509 P2(1M,30M) Allocated 315 buffers
2021-03-26 07:47:44 asr2/radeonvii4 103281509 P2(1M,30M) Starting P1 GCD
2021-03-26 07:48:10 asr2/radeonvii4 103281509 P2(1M,30M) Setup 315 P2 buffers in 26.7s
2021-03-26 07:48:11 asr2/radeonvii4 103281509 P2(1M,30M) OK @12991: 0b4553ba507d47d5 (0.2s)
2021-03-26 07:48:11 asr2/radeonvii4 103281509 P2(1M,30M) MULs: done 0, left 1219922; 0.0%
2021-03-26 07:48:18 asr2/radeonvii4 103281509 P2(1M,30M)   0.3%  3211 muls, 2231 us/mul, ETA 00:45
2021-03-26 07:48:29 asr2/radeonvii4 103281509 P2(1M,30M)   0.8%  6141 muls, 1891 us/mul, ETA 00:38
2021-03-26 07:48:39 asr2/radeonvii4 103281509 P2(1M,30M) GCD : no factor
2021-03-26 07:48:40 asr2/radeonvii4 103281509 P2(1M,30M)   1.3%  6151 muls, 1694 us/mul, ETA 00:34
V6.11-380 stage 1 took 20:15/25:40 = 78.9% as long as V7.2-53.
V6.11-380 stage 2 took 25:41/34:55 = 73.6% as long as V7.2-53.
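The stage-by-stage ratios above are plain duration arithmetic. As a quick check, here is a small sketch that converts the m:ss and h:mm:ss strings from this post to seconds and divides them; to_seconds and percent_of are hypothetical helper names.

```python
def to_seconds(hms: str) -> int:
    """Convert 'mm:ss' or 'h:mm:ss' to seconds."""
    total = 0
    for part in hms.split(":"):
        total = total * 60 + int(part)
    return total


def percent_of(a: str, b: str) -> float:
    """Duration a as a percentage of duration b, rounded to one decimal."""
    return round(100 * to_seconds(a) / to_seconds(b), 1)


# (V6.11-380 time, V7.2-53 time with -maxAlloc 14G), taken from the runs above
timings = {
    "stage 1": ("20:15", "25:40"),
    "stage 2": ("25:41", "34:55"),
    "combined": ("0:45:56", "1:00:35"),
}
for name, (v6, v7) in timings.items():
    print(f"{name}: V6.11-380 took {percent_of(v6, v7)}% as long as V7.2-53")
```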

V7.2-53 P-1 comes out ahead only if its stage 1 cost is neglected, which applies when the PRP test is also run to completion, since stage 1 then overlaps useful PRP iterations.
V6.11-380 P-1 is faster for standalone runs, in each stage and combined.

V7.2-53 apparently does not support B2 extension after initial stage 2 completion,
despite https://mersenneforum.org/showpost.p...8&postcount=61. Test method:
Place B1=1000000,B2=30000000;PRP=AID,1,2,103276501,-1,76,2 in worktodo.
Run until those bounds are complete and the results file contains the P-1 result line, then halt the program.
Observe that the worktodo entry remains B1=1000000,B2=30000000;PRP=AID,1,2,103276501,-1,76,2; gpuowl did not change it to ...,76,0.
Modify the entry to B1=1000000,B2=40000000;PRP=AID,1,2,103276501,-1,76,2 and restart the program.
The program continues PRP only, without running P-1 stage 2 from 30M to 40M or from 1M to 40M.
Perhaps the extension only works during the time window before B2=30000000 is reached.
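The worktodo edit in the test method can be scripted. This is only a sketch of the text substitution, assuming the entry layout shown above; with_b2 is a hypothetical helper, not a gpuowl feature, and as noted above, changing the entry this way did not trigger a stage 2 extension in v7.2-53.

```python
import re


def with_b2(entry: str, new_b2: int) -> str:
    """Return a worktodo entry with its B2 bound replaced.

    Assumes the 'B1=...,B2=...;PRP=...' layout used in this post.
    """
    updated, count = re.subn(r"\bB2=\d+", f"B2={new_b2}", entry, count=1)
    if count == 0:
        raise ValueError("no B2= bound found in entry")
    return updated


entry = "B1=1000000,B2=30000000;PRP=AID,1,2,103276501,-1,76,2"
print(with_b2(entry, 40_000_000))
```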


Conclusion: stay with V6.11-380 gpuowl for wavefront production P-1 factoring, which is standalone P-1.


Top of this reference thread: https://www.mersenneforum.org/showthread.php?t=23391
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-07-12 at 05:08