I started a new thread: [url="http://www.mersenneforum.org/showthread.php?t=22413"]5000e15 - 5500e15[/url]
Previous: [URL="http://www.mersenneforum.org/showpost.php?p=460275&postcount=163"]4300-4399[/URL] (complete) [URL="http://www.mersenneforum.org/showpost.php?p=461106&postcount=229"]4700-4799[/URL] (complete) [URL="http://www.mersenneforum.org/showpost.php?p=461691&postcount=301"]4520-4539[/URL] (3 days left) [URL="http://www.mersenneforum.org/showpost.php?p=461818&postcount=325"]4850-4899[/URL] (4-6 days left) |
Dana, can you please start a new thread for the range 5e18-6e18?
|
1 Attachment(s)
Results for 4540-4545.
I concatenated all of the gap_solutions.txt files, I hope that's OK. I can reformat to whatever is preferred if needed :) The ten largest gaps: [code]1158 4544981293636392709
1160 4543141211462563781
1170 4540065569807950537
1176 4543645113625191593
1190 4544720841843709097
1200 4544595694865919163
1216 4540814097792317827
1238 4543497368344753163
1292 4543949541317727059
1336 4544946810033612391[/code] |
[B]STATISTICS[/B]
Over 2/3 now!! 671 ranges, to be precise.

[B]4000-4500[/B]
Of the first 500 ranges, only 28 still need to be completed, and all of them are reserved: 4285-4290, 4293-4300, 4484-4500.

[B]4500-5000[/B]
In the second 500 ranges, only range 4700-4800 is complete. The rest vary from 10% to 35% completion, and there are still a lot of ranges not reserved.

There have been 11 "records" so far: 1 per every 61 ranges checked. |
2 Attachment(s)
Here is the latest version of the C code (version 0.05.g), along with Windows executables (now reduced to two versions: one for pre-Haswell processors, the other for Haswell and later, as discussed in earlier posts).
There are only minor changes in this version: [CODE]// version 0.05.g Fixed quick sort, added optimized bubble sort. Antonio Key
//               Replaced add30 & sub30 constant arrays with a single
//               wheel30 constant array - they became identical in 0.05.
//               Added ETA to status display line.
//               Removed some of the old commented-out code.
//               Modified the smart check on input data.
// version 0.05.f next_prime, prec_prime tuning. Antonio Key
// version 0.05.e Hensel lifting. Robert Gerbicz
// version 0.05.d Fermat & Euler-Plumb PRP tests using Montgomery math. Dana Jacobsen
// version 0.05   Bug fix. Robert Gerbicz[/CODE]The compiled versions use the optimized bubble sort; I don't think there is any advantage to using quick sort for the small number of items we have to order at the end of each stage, and the bubble sort may be faster. The modification to the input-data smart check came about as a result of my further investigation into tuning the code. I have found that for my i5-3570k running 4 threads I can increase the throughput to 27.5e9 n/sec using bs=18, but the old smart check would issue a warning for bs > 16, which didn't seem reasonable. The smart check now only displays a warning if sb < 16 or (sb - bs) < 3. My input parameters have now changed from (ignore the n1 and n2 parameters, which are my test-range values only):[CODE]gap5_f_ivybridge -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 150 -sb 24 -bs 16 -t 4 -mem 12
throughput 26.4e9 n/sec.[/CODE]to:[CODE]gap5_g -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 155 -sb 24 -bs 18 -t 4 -mem 12
throughput 27.5e9 n/sec.[/CODE] |
Thank you Antonio.
PS (loved seeing the ETA for the current work load, nice...) |
2 Attachment(s)
Here are some Linux binaries (Nehalem and Haswell) for Antonio's latest version gap5_g.
|
BTW: It seems that there is actually no benefit from manual loop unrolling like in the code section found in [I]sieve_small_primes[/I] given below. The gcc compiler doesn't generate any parallelized code from it.
[CODE]if(i==0){
    for(k=0;k+15<LEN64;){
        arr2[k]=off2[k];k++; arr2[k]=off2[k];k++;
        arr2[k]=off2[k];k++; arr2[k]=off2[k];k++;
        arr2[k]=off2[k];k++; arr2[k]=off2[k];k++;
        arr2[k]=off2[k];k++; arr2[k]=off2[k];k++;
        arr2[k]=off2[k];k++; arr2[k]=off2[k];k++;
        arr2[k]=off2[k];k++; arr2[k]=off2[k];k++;
        arr2[k]=off2[k];k++; arr2[k]=off2[k];k++;
        arr2[k]=off2[k];k++; arr2[k]=off2[k];k++;
    }
    for(;k<LEN64;k++)arr2[k]=off2[k];
}
else{
    for(k=0;k+15<LEN64;){
        arr2[k]&=off2[k];k++; arr2[k]&=off2[k];k++;
        arr2[k]&=off2[k];k++; arr2[k]&=off2[k];k++;
        arr2[k]&=off2[k];k++; arr2[k]&=off2[k];k++;
        arr2[k]&=off2[k];k++; arr2[k]&=off2[k];k++;
        arr2[k]&=off2[k];k++; arr2[k]&=off2[k];k++;
        arr2[k]&=off2[k];k++; arr2[k]&=off2[k];k++;
        arr2[k]&=off2[k];k++; arr2[k]&=off2[k];k++;
        arr2[k]&=off2[k];k++; arr2[k]&=off2[k];k++;
    }
    for(;k<LEN64;k++)arr2[k]&=off2[k];
}[/CODE] Thus, the above could be replaced by: [CODE]if(i==0){
    for(k=0;k<LEN64;k++) arr2[k]=off2[k];
}
else{
    for(k=0;k<LEN64;k++) arr2[k]&=off2[k];
}[/CODE] |
[QUOTE=Antonio;462112]The modification to the input data smart check came about as a result of my further investigation into tuning the code. I have found that for my i5-3570k running 4 threads I can increase the throughput to 27.5e9 n/sec using bs= 18, the old smart check would issue a warning for bs > 16, which didn't seem reasonable. The smart check now only displays a warning if sb <16 or (sb - bs) < 3.
My input parameters have now changed from (ignore the n1 and n2 parameters - which are my test range values only):[CODE]gap5_f_ivybridge -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 150 -sb 24 -bs 16 -t 4 -mem 12 throughput 26.4e9 n/sec.[/CODE]to:[CODE]gap5_g -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 155 -sb 24 -bs 18 -t 4 -mem 12 throughput 27.5e9 n/sec.[/CODE][/QUOTE] I noticed a similar ([I]apparent[/I]) improvement from 16.9e9 n/sec to 18.6e9 n/sec (already for version 5_f) when using your test range and bs=18, 19, or 20 (with sb=22 or 23). However, when running a full range, the jobs with bs=18/19 actually took longer (up to one hour) than the jobs with bs=16. I will try again with version 5_g to see what happens... |
1 Attachment(s)
4500-4510 done by my friend Steve Cole in one day.
|
[QUOTE=Thomas11;462142]BTW: It seems that there is actually no benefit from manual loop unrolling like in the code section found in [I]sieve_small_primes[/I] given below. The gcc compiler doesn't generate any parallelized code from it.
[/QUOTE] Is this a slow portion of the program? Might be worth tweaking the code until it does produce parallelized code or adding our own asm. Is the following easier for the compiler(with 16 rather than 4)? I thought compilers were supposed to be good at unrolling the loop themselves these days. [CODE] for(k=0;k+3<LEN64;k+=4){ arr2[k]=off2[k]; arr2[k+1]=off2[k+1]; arr2[k+2]=off2[k+2]; arr2[k+3]=off2[k+3]; }[/CODE] What size integers are arr2 and off2? If they are smaller than 64 bits then multiple could be moved in one instruction. |