mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Prime Gap Searches (https://www.mersenneforum.org/forumdisplay.php?f=131)
-   -   4e18-5e18 (https://www.mersenneforum.org/showthread.php?t=22187)

danaj 2017-06-25 19:12

I started a new thread: [url="http://www.mersenneforum.org/showthread.php?t=22413"]5000e15 - 5500e15[/url]

Previous:

[URL="http://www.mersenneforum.org/showpost.php?p=460275&postcount=163"]4300-4399[/URL] (complete)
[URL="http://www.mersenneforum.org/showpost.php?p=461106&postcount=229"]4700-4799[/URL] (complete)
[URL="http://www.mersenneforum.org/showpost.php?p=461691&postcount=301"]4520-4539[/URL] (3 days left)
[URL="http://www.mersenneforum.org/showpost.php?p=461818&postcount=325"]4850-4899[/URL] (4-6 days left)

pinhodecarlos 2017-06-25 19:27

Dana, can you please start a new thread for the range 5e18-6e18?

ldesnogu 2017-06-26 12:57

1 Attachment(s)
Results for 4540-4545.

I concatenated all of the gap_solutions.txt, I hope it's OK. I can reformat to whatever is preferred if needed :)

The ten largest gaps:
[code]1158 4544981293636392709
1160 4543141211462563781
1170 4540065569807950537
1176 4543645113625191593
1190 4544720841843709097
1200 4544595694865919163
1216 4540814097792317827
1238 4543497368344753163
1292 4543949541317727059
1336 4544946810033612391[/code]

rudy235 2017-06-26 15:39

[B]STATISTICS[/B]
Over 2/3 now!! 671 to be precise.

[B]4000-4500[/B]

Of the first 500 Ranges only 28 Ranges need to be completed and are all reserved.
4285-4290 , 4293-4300 , 4484-4500

[B]4500-5000[/B]

As to the second 500 ranges only Range 4700-4800 is completed. The rest of the ranges vary from 10% to 35% completion.

There is still a lot of Ranges not reserved.

There have been 11 "records". 1 per every 61 Ranges checked.

Antonio 2017-06-26 15:44

2 Attachment(s)
Here is the latest version of the c code (version 0.05.g), along with Windows executables (now reduced to two versions - one for pre-Haswell the other for Haswell and later processors, as discussed in earlier posts).
There are only minor changes in this version:

[CODE]// version 0.05.g Fixed quick sort, added optimized bubble sort. Antonio Key
// Replaced add30 & sub30 constant arrays with a single
// wheel30 constant array - they became identical in 0.05.
// Added ETA to status display line.
// Removed some of the old commented-out code.
// Modified the smart check on input data.

// version 0.05.f next_prime, prec_prime tuning. Antonio Key

// version 0.05.e Hensel lifting. Robert Gerbicz

// version 0.05.d Fermat & Euler-Plumb PRP tests using Montgomery math. Dana Jacobsen

// version 0.05 Bug fix. Robert Gerbicz
[/CODE]The compiled versions use the optimized bubble sort, I don't think there is any advantage to using the quick sort for the small number of items we have to order at the end of each stage and the bubble sort may be faster.
The modification to the input data smart check came about as a result of my further investigation into tuning the code. I have found that for my i5-3570k running 4 threads I can increase the throughput to 27.5e9 n/sec using bs= 18, the old smart check would issue a warning for bs > 16, which didn't seem reasonable. The smart check now only displays a warning if sb <16 or (sb - bs) < 3.
My input parameters have now changed from (ignore the n1 and n2 parameters - which are my test range values only):[CODE]gap5_f_ivybridge -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 150 -sb 24 -bs 16 -t 4 -mem 12
throughput 26.4e9 n/sec.[/CODE]to:[CODE]gap5_g -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 155 -sb 24 -bs 18 -t 4 -mem 12
throughput 27.5e9 n/sec.[/CODE]

pinhodecarlos 2017-06-26 15:56

Thank you Antonio.

PS (loved showing the ETA for the current load work, nice...)

Thomas11 2017-06-27 08:58

2 Attachment(s)
Here are some Linux binaries (Nehalem and Haswell) for Antonio's latest version gap5_g.

Thomas11 2017-06-27 09:13

BTW: It seems that there is actually no benefit from manual loop unrolling like in the code section found in [I]sieve_small_primes[/I] given below. The gcc compiler doesn't generate any parallelized code from it.

[CODE]
if(i==0){
for(k=0;k+15<LEN64;){
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
arr2[k]=off2[k];k++;
}
for(;k<LEN64;k++)arr2[k]=off2[k];}
else{
for(k=0;k+15<LEN64;){
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
arr2[k]&=off2[k];k++;
}
for(;k<LEN64;k++)arr2[k]&=off2[k];}
[/CODE]

Thus, the above could be replaced by:
[CODE]
if(i==0){
for(k=0;k<LEN64;k++)
arr2[k]=off2[k];
}
else{
for(k=0;k<LEN64;k++)
arr2[k]&=off2[k];
}
[/CODE]

Thomas11 2017-06-27 09:24

[QUOTE=Antonio;462112]The modification to the input data smart check came about as a result of my further investigation into tuning the code. I have found that for my i5-3570k running 4 threads I can increase the throughput to 27.5e9 n/sec using bs= 18, the old smart check would issue a warning for bs > 16, which didn't seem reasonable. The smart check now only displays a warning if sb <16 or (sb - bs) < 3.
My input parameters have now changed from (ignore the n1 and n2 parameters - which are my test range values only):[CODE]gap5_f_ivybridge -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 150 -sb 24 -bs 16 -t 4 -mem 12
throughput 26.4e9 n/sec.[/CODE]to:[CODE]gap5_g -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 155 -sb 24 -bs 18 -t 4 -mem 12
throughput 27.5e9 n/sec.[/CODE][/QUOTE]

I noticed a similar ([I]apparent[/I]) improvement from 16.9e9 n/sec to 18.6e9 n/sec (already for version 5_f) when using your test range and bs=18, 19, or 20 (with sb=22 or 23).
However, when running a full range, the jobs with bs=18/19 actually took longer (up to one hour) than the jobs with bs=16.
I will try again with version 5_g to see what happens...

pinhodecarlos 2017-06-27 10:30

1 Attachment(s)
4500-4510 done by my friend Steve Cole in one day.

henryzz 2017-06-27 11:28

[QUOTE=Thomas11;462142]BTW: It seems that there is actually no benefit from manual loop unrolling like in the code section found in [I]sieve_small_primes[/I] given below. The gcc compiler doesn't generate any parallelized code from it.
[/QUOTE]

Is this a slow portion of the program? Might be worth tweaking the code until it does produce parallelized code or adding our own asm.
Is the following easier for the compiler(with 16 rather than 4)? I thought compilers were supposed to be good at unrolling the loop themselves these days.
[CODE] for(k=0;k+3<LEN64;k+=4){
arr2[k]=off2[k];
arr2[k+1]=off2[k+1];
arr2[k+2]=off2[k+2];
arr2[k+3]=off2[k+3];
}[/CODE]
What size integers are arr2 and off2? If they are smaller than 64 bits then multiple could be moved in one instruction.


All times are UTC. The time now is 15:30.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.