mersenneforum.org > Prime Search Projects > Prime Gap Searches

Old 2017-10-03, 19:19   #12
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

2²·227 Posts

I guess a good thing to decide would be what the goal is.

Lowest energy use? Most efficient energy use? Most efficient cluster energy use? Easiest to set up? Highest raw performance under some budget threshold? etc.

In my case it'd be nice to have something that also excels at apps like Primo and other parallel searches, hence my interest in many more cores.

My daughter's mini-PC is an i5-4210U (2 cores / 4 threads); probably lower power use, but I bet the performance isn't great. Maybe I'll annoy her and run something on it. I can also try my i7-6700HQ laptop. I suspect it's higher power use for not enough performance gain, as it's a gaming laptop not optimized for battery life.
Old 2017-10-03, 20:05   #13
pinhodecarlos
 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

5×7×139 Posts

I'm an energy consultant, so I need to stick with energy efficiency... but the UK is so cold for my body that we need the USA to increase GHG emissions so we can have more heat waves here. So I have contradictory feelings...

Last fiddled with by pinhodecarlos on 2017-10-03 at 20:05
Old 2017-10-04, 10:27   #14
robert44444uk
 
Jun 2003
Oxford, UK

2²·13·37 Posts

I think that when this search reaches 2^64 we will return to parallel processes such as a*b#/c - d (a variable, b#/c fixed, d = prevprime), and hence large numbers of cores which run threads efficiently would be the best solution for gap hunters. (A minimal sketch of this form is at the end of this post.)

Energy efficiency needs to be decent, but perhaps from a benchmarking point of view it would be good to look at throughput per unit of energy for a primorial series with variable a.

Is there a way to rewrite danaj's standard primorial code to run more efficiently as well, or is it already optimal in its use of threads?
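
For reference, a minimal GMP sketch of the a*b#/c - d form described above. It is illustrative only: b = 97, c = 30 and the range of a are made-up values, and prevprime is done naively rather than the way a real search would do it.

Code:
#include <stdio.h>
#include <gmp.h>

/* naive prevprime: step down from n-1 until a probable prime is found */
static void prevprime(mpz_t p, const mpz_t n)
{
    mpz_sub_ui(p, n, 1);
    while (!mpz_probab_prime_p(p, 25))
        mpz_sub_ui(p, p, 1);
}

int main(void)
{
    mpz_t primorial, q, centre, d;
    mpz_inits(primorial, q, centre, d, NULL);

    mpz_primorial_ui(primorial, 97);        /* b# with b = 97 */
    mpz_divexact_ui(q, primorial, 2*3*5);   /* fixed part b#/c, here c = 30 */

    for (unsigned long a = 1; a <= 29; a += 2) {     /* variable a */
        mpz_mul_ui(centre, q, a);                    /* centre = a*b#/c */
        prevprime(d, centre);
        mpz_sub(d, centre, d);                       /* distance down to prevprime */
        gmp_printf("a = %2lu: distance down to previous prime = %Zd\n", a, d);
    }

    mpz_clears(primorial, q, centre, d, NULL);
    return 0;
}

Each value of a is an independent piece of work, which is why lots of cores that thread well pay off for this kind of search.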
Old 2017-10-04, 16:41   #15
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

1614₈ Posts

Quote:
Originally Posted by robert44444uk
Is there a way to rewrite danaj's standard primorial code to run more efficiently as well, or is it already optimal in its use of threads?
There are a few questions there.

1. Does it use threading well? It's not that bad, but for n threads each thread gets a fixed share of the range up front, and if one thread finishes its share first it just waits (see the sketch after this list). This isn't ideal, but in practice it isn't as bad as I thought. IIRC, some people run their searches in a different order, so there are variants out there.

2. Could we optimize the search strategy? I haven't seen anything actionable, but there are threads on covering sets and the like that seem related to this.

3. Can the main computational task be faster? I'm sure it could be. It's been discussed before, and it has gotten faster over the past few years. I still see some things that could be investigated. surround_primes looks good to me, so some possibilities include a mod-6 or mod-30 wheel replacing the mod-2 bit mask, handling 3 primes at a time to elide ~60000 mpz remainders (also sketched below), and lots of prime_iterator optimizations (ensure everything is fast for the initial primes; amortized bulk number extraction; use the new, slightly faster sieve in MPU). Or perhaps we could get an entirely better method from someone like JKA or RG.
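
On point 1, a sketch of the behaviour described there (not the actual gap code): with a static OpenMP split each thread gets its share of the range up front and simply waits at the implicit barrier once it is done, whereas dynamic scheduling hands out work as threads free up.

Code:
#include <stdio.h>
#include <omp.h>

/* stand-in for testing one candidate; later candidates cost more, so a
   static split leaves the low-numbered threads idle at the end */
static double test_candidate(long i)
{
    double x = 0;
    for (long j = 0; j < i; j++) x += 1.0 / (j + 1);
    return x;
}

int main(void)
{
    const long range = 30000;
    double sink = 0;

    /* schedule(static): each thread gets a fixed share of the range up front
       and waits at the implicit barrier once its share is done.
       schedule(dynamic, 64) would hand out work on demand instead. */
    #pragma omp parallel for schedule(static) reduction(+:sink)
    for (long i = 0; i < range; i++)
        sink += test_candidate(i);

    printf("threads = %d, sink = %g\n", omp_get_max_threads(), sink);
    return 0;
}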
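
And on the 3-primes-at-a-time idea in point 3, a sketch with arbitrary small primes: one big-number remainder by the product of three sieving primes replaces three mpz calls, because the individual residues then fall out with cheap word arithmetic.

Code:
#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpz_t n;
    mpz_init(n);
    mpz_ui_pow_ui(n, 10, 1000);          /* stand-in for a large candidate centre */

    unsigned long p = 1009, q = 1013, s = 1019;
    unsigned long r = mpz_fdiv_ui(n, p * q * s);   /* one GMP remainder */

    /* the three residues now come from machine-word arithmetic */
    printf("n mod %lu = %lu\n", p, r % p);
    printf("n mod %lu = %lu\n", q, r % q);
    printf("n mod %lu = %lu\n", s, r % s);

    mpz_clear(n);
    return 0;
}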
Old 2017-10-05, 06:38   #16
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

2²×227 Posts

So ... turns out there really is AVX2 code in the PGS source (in sieve_small_primes) and it's set on by default. Now I'm interested in seeing what difference it makes on vs. off.

My laptop, i7-6700HQ 2.6GHz, 4 threads inside VirtualBox, is getting 25.6 n/s running one of the 9-9.25 ranges. I should test it under Windows.
Old 2017-10-05, 06:39   #17
robert44444uk
 
Jun 2003
Oxford, UK

2²×13×37 Posts

Quote:
Originally Posted by danaj
Or perhaps we could get an entirely better method from someone like JKA or RG.
Maybe we could check in with Robert G. He may have ideas.
Old 2017-10-05, 09:19   #18
R. Gerbicz
 
"Robert Gerbicz"
Oct 2005
Hungary

2×7×103 Posts

Quote:
Originally Posted by danaj
So ... turns out there really is AVX2 code in the PGS source (in sieve_small_primes) and it's set on by default. Now I'm interested in seeing what difference it makes on vs. off.
We spent a little time there; the AVX2 code only makes things a few percent faster. Of course you can disable it with #define USE_AVX2 0 to see the effect of this part.
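
Purely as an illustration of that kind of switch (the real sieve_small_primes code is of course laid out differently): a preprocessor gate around a vector and a scalar version of the same inner loop, where the vector path needs -mavx2 and the scalar path can be forced with -DUSE_AVX2=0.

Code:
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#ifndef USE_AVX2
#define USE_AVX2 1            /* build with -DUSE_AVX2=0 to force the scalar path */
#endif

#if USE_AVX2
#include <immintrin.h>
/* AND a repeating 32-byte mask into the sieve buffer, 32 bytes per step */
static void apply_mask(uint8_t *sieve, size_t len, const uint8_t mask[32])
{
    __m256i m = _mm256_loadu_si256((const __m256i *)mask);
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(sieve + i));
        _mm256_storeu_si256((__m256i *)(sieve + i), _mm256_and_si256(v, m));
    }
    for (; i < len; i++) sieve[i] &= mask[i % 32];   /* scalar tail */
}
#else
/* scalar fallback with identical behaviour */
static void apply_mask(uint8_t *sieve, size_t len, const uint8_t mask[32])
{
    for (size_t i = 0; i < len; i++) sieve[i] &= mask[i % 32];
}
#endif

int main(void)
{
    uint8_t sieve[64], mask[32];
    for (int i = 0; i < 64; i++) sieve[i] = 0xFF;                  /* all candidates live */
    for (int i = 0; i < 32; i++) mask[i] = (i % 2) ? 0xFF : 0x00;  /* toy pattern */
    apply_mask(sieve, sizeof sieve, mask);
    printf("sieve[0]=%u sieve[1]=%u\n", sieve[0], sieve[1]);
    return 0;
}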
Old 2017-10-09, 06:52   #19
Antonio
 
"Antonio Key"
Sep 2011
UK

3²·59 Posts

Quote:
Originally Posted by danaj
We have not chosen anything. Mostly people have reported n/s for current runs on their hardware. The reported n/s typically takes a while to settle down so it takes a couple hours to get something comparable. Sounds like a benchmark mode would be a good addition (pre-chosen n1,n2,res1,res2,m1,m2,numcoprime parameters leaving sb,bs,mem,t to user; ignore or separate report for all startup activity).

The main sieving operation, if I understand correctly, uses a variant of TOeS's cache friendly sieve, similar to primesieve. This uses craptons of memory while running but the access pattern is not terrible and it has good performance at this range compared to others (which was TOeS's design goal, and why primesieve uses three different sieve implementations depending on range).

From what I can see, almost all of the elapsed time is in the sieve. There is a search step that is more CPU-bound.

51.2 n/s AWS c4.4xlarge, 16 threads, 28GB (30GB of ???)
46.8 n/s i7-6700K 4.2GHz, 8 threads, 13GB (4x8GB DDR4 2800 CL15 rank 2 1.2V)
34.8 n/s i7-4770K 4.3GHz, 8 threads, 13GB (2x8GB DDR3 2133 CL11 rank 2 1.5V)

The AWS machine produced 55.1 n/s when two 8 thread 14GB tasks were run at the same time. The 6700K vs. 4770K numbers above make me wonder if memory bandwidth is making more of a difference. Perhaps I need to do some tests with 1,2,4,6,8 threads on each to see.
Just to add some information, on my system:
33.51e9 n/s i5-3570k 4.4GHz, 4 threads, 12.5GB (2x8GB DDR3 1600 CL9 rank 2)
36.24e9 n/s i5-3570k 4.4GHz, 4 threads, 12.5GB (2x8GB DDR3 2133 CL11 rank 2)
29.66e9 n/s i5-3570k 4.4GHz, 3 threads, 12.5GB (2x8GB DDR3 2133 CL11 rank 2)
20.43e9 n/s i5-3570k 4.4GHz, 2 threads, 12.5GB (2x8GB DDR3 2133 CL11 rank 2)
10.67e9 n/s i5-3570k 4.4GHz, 1 thread, 12.5GB (2x8GB DDR3 2133 CL11 rank 2)


Tests used a Windows batch file consisting of:
gap11 -n1 9e18 -n2 925e16 -n 9e18 -res1 0 -res2 10 -res 0 -m1 1190 -m2 8151 -unknowngap 1382 -numcoprime 27 -sb 24 -bs 18 -t %1 -mem 12.5

The figures quoted above are those displayed at the end of the run.
The ratio of sieve time to search time is approx. 20:7.
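
For anyone wondering what the cache-friendly sieve mentioned in the quote actually does, the basic segmented idea (a bare-bones sketch, nowhere near the TOeS/PGS implementation) is: find the base primes up to sqrt(hi) once, then sieve one cache-sized window at a time with them.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

#define SEG (1 << 18)                    /* 256 KB segment, sized for L2 cache */

int main(void)
{
    uint64_t hi = 100000000;             /* count primes below 1e8 */
    uint32_t root = (uint32_t)sqrt((double)hi) + 1;

    /* base primes up to sqrt(hi), found once with a simple sieve */
    uint8_t  *is_comp = calloc(root + 1, 1);
    uint32_t *base = malloc((root + 1) * sizeof *base);
    uint32_t nbase = 0;
    for (uint32_t p = 2; p <= root; p++) {
        if (is_comp[p]) continue;
        base[nbase++] = p;
        for (uint64_t m = (uint64_t)p * p; m <= root; m += p) is_comp[m] = 1;
    }

    /* sieve each cache-sized window [lo, lo+len) with the base primes */
    uint8_t *seg = malloc(SEG);
    uint64_t count = 0;
    for (uint64_t lo = 2; lo < hi; lo += SEG) {
        uint64_t len = (hi - lo < SEG) ? hi - lo : SEG;
        memset(seg, 0, len);
        for (uint32_t i = 0; i < nbase; i++) {
            uint64_t p = base[i];
            uint64_t start = (lo + p - 1) / p * p;
            if (start < p * p) start = p * p;
            for (uint64_t m = start; m < lo + len; m += p) seg[m - lo] = 1;
        }
        for (uint64_t j = 0; j < len; j++) count += !seg[j];
    }

    printf("primes below %llu: %llu\n",
           (unsigned long long)hi, (unsigned long long)count);
    free(is_comp); free(base); free(seg);
    return 0;
}

Keeping the window small enough to stay in cache is what keeps the memory access pattern from being terrible, even though the total memory in use is large.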

Last fiddled with by Antonio on 2017-10-09 at 06:58
Old 2017-10-09, 08:11   #20
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

1614₈ Posts

Nice results, Antonio!

I'll try running something similar on my 4770K when it finishes this batch. I'll make sure I run gap11 as well, and we're already using the same sieve parameters. I'm on Linux while you're on Windows, so different compilers and schedulers. But it looks like the machines are pretty similar in some ways.
Old 2017-10-12, 15:55   #21
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

2²·227 Posts

i7-6700K 4.2GHz, 4x8GB DDR4 2800 CL15 rank 2 1.2V, Fedora 23, gcc 5.3.1

1 thr 10.95e9 n/sec.; time=8404 sec
2 thr 21.07e9 n/sec.; time=4367 sec
3 thr 30.72e9 n/sec.; time=2995 sec
4 thr 38.68e9 n/sec.; time=2379 sec
6 thr 37.28e9 n/sec.; time=2468 sec
8 thr 43.73e9 n/sec.; time=2104 sec

gcc -m64 -fopenmp -O2 -frename-registers -fomit-frame-pointer -flto -mavx2 -march=native -o g11 gap11.c -lm
./g11 -n1 9e18 -n2 9.25e18 -n 9e18 -res1 7300 -res2 7303 -res 7300 -m1 1190 -m2 8151 -numcoprime 27 -sb 25 -bs 18 -mem 13 -t $i
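
Since the n/s scaling flattens out above 4 threads, a crude way to check whether it is a memory-bandwidth ceiling rather than a core limit (illustrative only, not part of gap11) is to stream a large array with 1-8 threads and see where GB/s stops growing.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1u << 27)          /* 128M doubles = 1 GiB of streaming reads */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = 1.0;

    for (int t = 1; t <= 8; t++) {
        omp_set_num_threads(t);
        double sum = 0.0, t0 = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (size_t i = 0; i < N; i++) sum += a[i];
        double dt = omp_get_wtime() - t0;
        printf("%d threads: %.1f GB/s (checksum %.0f)\n",
               t, N * sizeof(double) / dt / 1e9, sum);
    }
    free(a);
    return 0;
}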
Old 2017-10-12, 17:34   #22
Antonio
 
"Antonio Key"
Sep 2011
UK

3²·59 Posts

Quote:
Originally Posted by danaj
i7-6700K 4.2GHz, 4x8GB DDR4 2800 CL15 rank 2 1.2V, Fedora 23, gcc 5.3.1

1 thr 10.95e9 n/sec.; time=8404 sec
2 thr 21.07e9 n/sec.; time=4367 sec
3 thr 30.72e9 n/sec.; time=2995 sec
4 thr 38.68e9 n/sec.; time=2379 sec
6 thr 37.28e9 n/sec.; time=2468 sec
8 thr 43.73e9 n/sec.; time=2104 sec

gcc -m64 -fopenmp -O2 -frename-registers -fomit-frame-pointer -flto -mavx2 -march=native -o g11 gap11.c -lm
./g11 -n1 9e18 -n2 9.25e18 -n 9e18 -res1 7300 -res2 7303 -res 7300 -m1 1190 -m2 8151 -numcoprime 27 -sb 25 -bs 18 -mem 13 -t $i
Interesting!
Is that dip in performance at 6 threads consistent?
Have you tried >4 threads with -sb 24 -bs 17? It may improve the performance, and it would be interesting to see if it makes a difference.