mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Prime Gap Searches (https://www.mersenneforum.org/forumdisplay.php?f=131)
-   -   4e18-5e18 (https://www.mersenneforum.org/showthread.php?t=22187)

rudy235 2017-06-27 11:42

Dr. Nicely has published new tables today June 27th. [URL="http://trnicely.net/index.html#TPG"]http://trnicely.net/index.html#TPG[/URL]

However the Gaps page [URL="http://trnicely.net/gaps/gaplist.html"]First Occurrence Prime Gaps[/URL] remains the same as the June 12 update.

Thomas11 2017-06-27 13:31

[QUOTE=henryzz;462147]Is this a slow portion of the program? Might be worth tweaking the code until it does produce parallelized code or adding our own asm.
Is the following easier for the compiler(with 16 rather than 4)? I thought compilers were supposed to be good at unrolling the loop themselves these days.
[CODE] for(k=0;k+3<LEN64;k+=4){
arr2[k]=off2[k];
arr2[k+1]=off2[k+1];
arr2[k+2]=off2[k+2];
arr2[k+3]=off2[k+3];
}[/CODE]
What size integers are arr2 and off2? If they are smaller than 64 bits then multiple could be moved in one instruction.[/QUOTE]

These are 64 bit unsigned integers.
Of course, the first iteration (i==0) is just a copy operation which could be replaced by a simple memcpy. The later iterations are bitwise AND (&) operations. In principle these could be done "4 in 1" using the 256 bit AVX registers and the VPAND operation. But I'm afraid that the necessary loading and storing operations will negate any speed-up.

rudy235 2017-06-27 14:03

Dr Nicely has updated the First Occurrence Prime Gaps
[CODE]All prime gaps in 0 < x < 4.285e18 have now been analyzed. The upper bound of exhaustive analysis of gaps has been extended successively, as follows.
UPPER BOUND ATTAINED BY DATE
----------- ----------- -----------
4285e15 Robert W. Smith et al. 27 Jun 2017
4000e15 Silva & Herzog & Pardi 04 Apr 2012[/CODE]

Also the first 3 C?C have been changed to CFC
[CODE]
1368 CFC LLnhardy 2011 31.92 19 4105079953458040849
1440 CFC LLnhardy 2014 33.57 19 4253027105513399527
1372 CFC LLnhardy 2014 31.99 19 4219088970046367161
[/CODE]

ldesnogu 2017-06-27 14:13

[QUOTE=henryzz;462147]Is this a slow portion of the program? Might be worth tweaking the code until it does produce parallelized code or adding our own asm.
Is the following easier for the compiler(with 16 rather than 4)? I thought compilers were supposed to be good at unrolling the loop themselves these days.

After running this:
[CODE]./gap -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 155 -sb 24 -bs 18 -t 1 -mem 8[/CODE]
Here is the gcov output:

[CODE] for(k=0;k+3<LEN64;k+=4){
arr2[k]=off2[k];
arr2[k+1]=off2[k+1];
arr2[k+2]=off2[k+2];
arr2[k+3]=off2[k+3];
}[/CODE]What size integers are arr2 and off2? If they are smaller than 64 bits then multiple could be moved in one instruction.[/QUOTE]
A plain memcpy might be good enough. This code is executed about 6 times less than the other branch of the if it is within.
[code]
280: 672:static void sieve_small_primes(ui64 *sieve_array,bucket *C,ui32 num_intervals,
280: 673: ui64 first_k,ui32 res,ui32 mod,ui32 p1,ui32 p2){
-: 674:
280: 675: assert(cnt_offsets>0);
280: 676: ui32 cnt2,g,h,i,j,k,o,p,st[cnt_offsets];
-: 677: ui64 *arr=sieve_array;
-: 678:
280: 679: if(LEN<131072)cnt2=0;
280: 680: else cnt2=PPI_131072;// PPI_131072 number of primes not used in small sieves up to 2^17
-: 681:
1960: 682: for(i=0;i<cnt_offsets;i++){
1960: 683: ui32 P2=PROD[i];
1960: 684: ui64 inv=single_modinv(mod,P2);
3920: 685: ui64 pos=((first_k%P2)+mulmod(inv,res%P2,P2))%P2;
1960: 686: k=((64-(pos&63))*inv64[P2&63])&63;// P2 is odd
1960: 687: pos+=(ui64)k*P2;
1960: 688: assert(pos%64==0);
1960: 689: pos>>=6;
1960: 690: st[i]=pos;
-: 691: }
-: 692:
8960: 693: for(h=0;h<num_intervals;arr+=LEN64,h++){
-: 694: ui64 sh=0;
62720: 695: for(i=0;i<cnt_offsets;i++){
-: 696: ui64 *arr2=arr;
62720: 697: ui64 *off2=offsets+sh+st[i];
62720: 698: sh+=SIZE[i]>>6;
62720: 699: if(i==0){
146809600: 700: for(k=0;k+15<LEN64;){
146800640: 701: arr2[k]=off2[k];k++;
146800640: 702: arr2[k]=off2[k];k++;
146800640: 703: arr2[k]=off2[k];k++;
146800640: 704: arr2[k]=off2[k];k++;
146800640: 705: arr2[k]=off2[k];k++;
146800640: 706: arr2[k]=off2[k];k++;
146800640: 707: arr2[k]=off2[k];k++;
146800640: 708: arr2[k]=off2[k];k++;
146800640: 709: arr2[k]=off2[k];k++;
146800640: 710: arr2[k]=off2[k];k++;
146800640: 711: arr2[k]=off2[k];k++;
146800640: 712: arr2[k]=off2[k];k++;
146800640: 713: arr2[k]=off2[k];k++;
146800640: 714: arr2[k]=off2[k];k++;
146800640: 715: arr2[k]=off2[k];k++;
146800640: 716: arr2[k]=off2[k];k++;
-: 717: }
#####: 718: for(;k<LEN64;k++)arr2[k]=off2[k];}
-: 719: else{
880857600: 720: for(k=0;k+15<LEN64;){
880803840: 721: arr2[k]&=off2[k];k++;
880803840: 722: arr2[k]&=off2[k];k++;
880803840: 723: arr2[k]&=off2[k];k++;
880803840: 724: arr2[k]&=off2[k];k++;
880803840: 725: arr2[k]&=off2[k];k++;
880803840: 726: arr2[k]&=off2[k];k++;
880803840: 727: arr2[k]&=off2[k];k++;
880803840: 728: arr2[k]&=off2[k];k++;
880803840: 729: arr2[k]&=off2[k];k++;
880803840: 730: arr2[k]&=off2[k];k++;
880803840: 731: arr2[k]&=off2[k];k++;
880803840: 732: arr2[k]&=off2[k];k++;
880803840: 733: arr2[k]&=off2[k];k++;
880803840: 734: arr2[k]&=off2[k];k++;
880803840: 735: arr2[k]&=off2[k];k++;
880803840: 736: arr2[k]&=off2[k];k++;
-: 737: }
#####: 738: for(;k<LEN64;k++)arr2[k]&=off2[k];}
62720: 739: st[i]=(st[i]+LEN64)%PROD[i];
-: 740: }
-: 741:
-: 742: ui64 *arr2=arr;
1146880: 743: for(g=0;g<(LEN>>17);arr2+=2048,g++){
14014873600: 744: for(j=0;j<PPI_131072;j++){
14014873600: 745: o=C[j].offset;
14014873600: 746: p=C[j].pr;
130895597957: 747: for(k=o;k<131072;k+=p)arr2[k>>6]&=InvBits[k&63];
14014873600: 748: C[j].offset=k-131072;
-: 749: }}
9547955200: 750: for(j=cnt2;j<PPI_LEN;j++){
9547955200: 751: o=C[j].offset;
9547955200: 752: p=C[j].pr;
51804791973: 753: for(k=o;k<LEN;k+=p)arr[k>>6]&=InvBits[k&63];
9547955200: 754: C[j].offset=k-LEN;
-: 755: }
-: 756: }
-: 757:
-: 758: // correction the offsets for the next block of array
301795200: 759: for(j=0;j<PPI_LEN;j++){
301795200: 760: C[j].offset+=res_table[j];
301795200: 761: if(C[j].offset>=C[j].pr)C[j].offset-=C[j].pr;
-: 762: }
-: 763:
280: 764: return;
-: 765:}
[/code]Unrolling and specializing the line 747 might be more beneficial.

ldesnogu 2017-06-27 14:16

For some reason, in the above post the command I used to get the gcov output isn't displayed. So here it is:
[code]gap -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 155 -sb 24 -bs 18 -t 1 -mem 8
[/code]

danaj 2017-06-27 14:33

Definitely memcpy (on assumption they don't overlap). memcpy is more succinct, it is more obvious what the intention is, and it's almost certainly faster. memset and memcpy are hand-written assembly for the individual architecture.

Add: The compiler might make it moot, but for the AND section it might be worth playing with using fixed offsets instead of incrementing every operation, e.g. "arr[k+0] &= off2[k+0]; arr[k+1] &= off2[k+1];" etc. which is the usual way those are unrolled. Of course include the k += 16 in the for loop.

pinhodecarlos 2017-06-27 16:38

[QUOTE=rudy235;462110][B]STATISTICS[/B]


[B]4000-4500[/B]

Of the first 500 Ranges only 28 Ranges need to be completed and are all reserved.
4285-4290 , 4293-4300 , 4484-4500

.[/QUOTE]

With regards to 4484-4500 range it is already at 4487 with an ETA of 188 hours to completion.

R. Gerbicz 2017-06-27 17:26

[QUOTE=Antonio;462112]There are only minor changes in this version:
The modification to the input data smart check came about as a result of my further investigation into tuning the code. I have found that for my i5-3570k running 4 threads I can increase the throughput to 27.5e9 n/sec using bs= 18, the old smart check would issue a warning for bs > 16, which didn't seem reasonable. The smart check now only displays a warning if sb <16 or (sb - bs) < 3.
My input parameters have now changed from (ignore the n1 and n2 parameters - which are my test range values only):[CODE]gap5_f_ivybridge -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 150 -sb 24 -bs 16 -t 4 -mem 12
throughput 26.4e9 n/sec.[/CODE]to:[CODE]gap5_g -n1 4248785e12 -n2 4248789e12 -gap 1250 -delta 155 -sb 24 -bs 18 -t 4 -mem 12
throughput 27.5e9 n/sec.[/CODE][/QUOTE]

It is better to test the default 1e15 range, you can stop the program at the first sieve rate print. (you have to wait for it roughly 30 minutes on a fast core-i7).
For me using sb=24;bs=18 gives a speedup of 1%. (note the reduced sb value, but sb=25 has giving close rate).

Investigating the mentioned ideas.

rudy235 2017-06-27 23:59

We are at almost 700 ranges completed. Reservations are close to 175
No gaps ≥ 1346 except those previously discovered [URL="http://www.mersenneforum.org/showpost.php?p=461965&postcount=348"] Post #348[/URL]

danaj 2017-06-28 02:15

[QUOTE=rudy235;462215]
No gaps ≥ 1346 except those previously discovered [URL="http://www.mersenneforum.org/showpost.php?p=461965&postcount=348"] Post #348[/URL][/QUOTE]

4883011923347099963 +1454

Does not beat the CFC 3219107182492871783 found by TOeS et al.

axn 2017-06-28 02:51

I'd like to reserve 4995-5000 range.


All times are UTC. The time now is 15:30.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.