mersenneforum.org Hardware

 2017-10-03, 19:19 #12 danaj   "Dana Jacobsen" Feb 2011 Bangkok, TH 2²·227 Posts I guess a good thing to decide would be what the goal is. Lowest energy use? Most efficient energy use? Most efficient cluster energy use? Easiest to set up? Highest raw performance under some budget threshold? Etc. In my case it'd be nice to have something that excels at apps like Primo and other parallel searches as well, hence looking at many more cores. My daughter's mini-PC is an i5-4210U (2/4 core), probably lower power use, but I bet the performance isn't good. Maybe I'll annoy her and run something on it. I can also try my i7-6700HQ laptop. I suspect it's higher power use for not enough performance gain, as it's a gaming laptop not optimized for battery life.
 2017-10-03, 20:05 #13 pinhodecarlos     "Carlos Pinho" Oct 2011 Milton Keynes, UK 5×7×139 Posts I'm an energy consultant, so I need to stick with energy efficiency... but the UK is so cold for my body that we need the USA to increase GHG so we can have more heat waves here. So I have contradictory feelings... Last fiddled with by pinhodecarlos on 2017-10-03 at 20:05
 2017-10-04, 10:27 #14 robert44444uk     Jun 2003 Oxford, UK 2²·13·37 Posts I think that when this search reaches 2^64 we will return to parallel processes such as a*b#/c-d (a variable, b#/c fixed, d = prevprime), and hence large numbers of cores which run threads efficiently would be the best solution for gap hunters. Energy efficiency needs to be decent, but from a benchmark-test point of view it would perhaps be good to look at throughput per unit of energy for a primorial series, with variable a. Is there a way to rewrite danaj's standard primorial code to run more efficiently as well, or is it optimal in its use of threads?
2017-10-04, 16:41 #15 danaj   "Dana Jacobsen" Feb 2011 Bangkok, TH 1614₈ Posts

Quote:
 Originally Posted by robert44444uk Is there a way to rewrite danaj's standard primorial code to run more efficiently as well or is it optimal in its use of threads?
There are two questions there (at least).

1. Does it use threading well? It's not that bad, but for n threads each thread takes every n-th number of the range, and a thread that finishes its share first simply waits for the rest. This isn't ideal, but in practice the imbalance isn't as bad as I thought. IIRC, some people run their searches in a different order, so there are variants out there.

2. Could we optimize the search strategy. I haven't seen anything actionable, but there are threads on covering sets and the like that seem to have something to do with this.

3. Can the main computational task be faster? I'm sure it could be. It's been discussed before, and it has gotten faster over the past few years. I still see a few things worth investigating. surround_primes looks good to me, so some possibilities include: replacing the mod-2 bit mask with mod-6 or mod-30; testing 3 primes at a time to elide ~60000 mpz remainder operations; and various prime_iterator optimizations (ensure everything is fast for the initial primes; amortized bulk extraction; use the new, slightly faster sieve in MPU). Or perhaps we could get an entirely better method from someone like JKA or RG.

 2017-10-05, 06:38 #16 danaj   "Dana Jacobsen" Feb 2011 Bangkok, TH 2²×227 Posts So ... it turns out there really is AVX2 code in the PGS source (in sieve_small_primes) and it's enabled by default. Now I'm interested in seeing what difference it makes on vs. off. My laptop, i7-6700HQ 2.6GHz, 4 threads inside VirtualBox, is getting 25.6 n/s running one of the 9-9.25 ranges. I should test it under Windows.
2017-10-05, 06:39 #17 robert44444uk   Jun 2003 Oxford, UK 2²×13×37 Posts

Quote:
 Originally Posted by danaj Or perhaps we could get an entirely better method from someone like JKA or RG.
Maybe we could check in with Robert G. He may have ideas.

2017-10-05, 09:19 #18 R. Gerbicz   "Robert Gerbicz" Oct 2005 Hungary 2×7×103 Posts

Quote:
 Originally Posted by danaj So ... turns out there really is AVX2 code in the PGS source (in sieve_small_primes) and it's set on by default. Now I'm interested in seeing what difference it makes on vs. off.
We spent only a little time there; the AVX2 code is only a few percent faster. Of course you can disable it with #define USE_AVX2 0 to see the effect of that part.

2017-10-09, 06:52 #19 Antonio   "Antonio Key" Sep 2011 UK 3²·59 Posts

Quote:
 Originally Posted by danaj We have not chosen anything. Mostly people have reported n/s for current runs on their hardware. The reported n/s typically takes a while to settle down, so it takes a couple of hours to get something comparable. Sounds like a benchmark mode would be a good addition (pre-chosen n1,n2,res1,res2,m1,m2,numcoprime parameters, leaving sb,bs,mem,t to the user; ignore, or report separately, all startup activity).

The main sieving operation, if I understand correctly, uses a variant of TOeS's cache-friendly sieve, similar to primesieve. This uses craptons of memory while running, but the access pattern is not terrible and it has good performance at this range compared to others (which was TOeS's design goal, and why primesieve uses three different sieve implementations depending on range). From what I can see, almost all of the extended time is in the sieve. There is a search step that is more CPU bound.

51.2 n/s AWS c4.4xlarge, 16 threads, 28GB (30GB of ???)
46.8 n/s i7-6700K 4.2GHz, 8 threads, 13GB (4x8GB DDR4 2800 CL15 rank 2 1.2V)
34.8 n/s i7-4770K 4.3GHz, 8 threads, 13GB (2x8GB DDR3 2133 CL11 rank 2 1.5V)

The AWS machine produced 55.1 n/s when two 8-thread 14GB tasks were run at the same time. The 6700K vs. 4770K numbers above make me wonder whether memory bandwidth is making more of a difference. Perhaps I need to do some tests with 1, 2, 4, 6, 8 threads on each to see.
Just to add some information, on my system:
33.51e9 n/s i5-3570k 4.4GHz, 4 threads, 12.5GB (2x8GB DDR3 1600 CL9 rank 2)
36.24e9 n/s i5-3570k 4.4GHz, 4 threads, 12.5GB (2x8GB DDR3 2133 CL11 rank 2)
29.66e9 n/s i5-3570k 4.4GHz, 3 threads, 12.5GB (2x8GB DDR3 2133 CL11 rank 2)
20.43e9 n/s i5-3570k 4.4GHz, 2 threads, 12.5GB (2x8GB DDR3 2133 CL11 rank 2)
10.67e9 n/s i5-3570k 4.4GHz, 1 thread, 12.5GB (2x8GB DDR3 2133 CL11 rank 2)

Tests used a windows batch file consisting of:
gap11 -n1 9e18 -n2 925e16 -n 9e18 -res1 0 -res2 10 -res 0 -m1 1190 -m2 8151 -unknowngap 1382 -numcoprime 27 -sb 24 -bs 18 -t %1 -mem 12.5

The figures quoted above were those displayed at the end of the run.
The ratio of sieve to search times is approx. 20:7

Last fiddled with by Antonio on 2017-10-09 at 06:58

 2017-10-09, 08:11 #20 danaj   "Dana Jacobsen" Feb 2011 Bangkok, TH 1614₈ Posts Nice results, Antonio! I'll try running something similar on my 4770K when it finishes this batch. I'll make sure I run gap11 as well, and we're already using the same sieve parameters. I'm on Linux while you're on Windows, so different compilers and schedulers. But it looks like the machines are pretty similar in some ways.
 2017-10-12, 15:55 #21 danaj   "Dana Jacobsen" Feb 2011 Bangkok, TH 2²·227 Posts
i7-6700K 4.2GHz, 4x8GB DDR4 2800 CL15 rank 2 1.2V, Fedora 23, gcc 5.3.1
1 thr 10.95e9 n/sec.; time=8404 sec
2 thr 21.07e9 n/sec.; time=4367 sec
3 thr 30.72e9 n/sec.; time=2995 sec
4 thr 38.68e9 n/sec.; time=2379 sec
6 thr 37.28e9 n/sec.; time=2468 sec
8 thr 43.73e9 n/sec.; time=2104 sec
gcc -m64 -fopenmp -O2 -frename-registers -fomit-frame-pointer -flto -mavx2 -march=native -o g11 gap11.c -lm
./g11 -n1 9e18 -n2 9.25e18 -n 9e18 -res1 7300 -res2 7303 -res 7300 -m1 1190 -m2 8151 -numcoprime 27 -sb 25 -bs 18 -mem 13 -t $i

2017-10-12, 17:34 #22 Antonio   "Antonio Key" Sep 2011 UK 3²·59 Posts
Quote:
 Originally Posted by danaj i7-6700K 4.2GHz, 4x8GB DDR4 2800 CL15 rank 2 1.2V, Fedora 23, gcc 5.3.1
1 thr 10.95e9 n/sec.; time=8404 sec
2 thr 21.07e9 n/sec.; time=4367 sec
3 thr 30.72e9 n/sec.; time=2995 sec
4 thr 38.68e9 n/sec.; time=2379 sec
6 thr 37.28e9 n/sec.; time=2468 sec
8 thr 43.73e9 n/sec.; time=2104 sec
gcc -m64 -fopenmp -O2 -frename-registers -fomit-frame-pointer -flto -mavx2 -march=native -o g11 gap11.c -lm
./g11 -n1 9e18 -n2 9.25e18 -n 9e18 -res1 7300 -res2 7303 -res 7300 -m1 1190 -m2 8151 -numcoprime 27 -sb 25 -bs 18 -mem 13 -t $i
Interesting!
Is that dip in performance at 6 threads consistent?
Have you tried >4 threads with -sb 24 -bs 17? It may improve performance, and it would be interesting to see whether it makes a difference.

