mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
Thread Tools
Old 2019-05-24, 16:27   #133
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

3·2,447 Posts
Default

I have done some very limited tests with srsieve2. Here are some numbers comparing srsieve to srsieve2:

Code:
srsieve -n25000 -N50000 -P1e8 -m1e8 r56.in
srsieve 1.1.4 -- A sieve for integer sequences in n of the form k*b^n+c.
Read 200 sequences from input file `r56.in'.
srsieve started: 25000 <= n <= 50000, 3 <= p <= 100000000
Split 200 base 3 sequences into 1471 base 3^24 subsequences.
Sieving 3 <= p <= 100000000 eliminated 4838822 terms, 161378 remain.
Wrote 161378  terms for 200 sequences to srsieve format file `srsieve.out'.
srsieve stopped: at p=100000000 because --pmax was reached.
Processor time: 555.30 sec
Code:
srsieve2 -n25000 -N50000 -P1e8 -fA -sr56.in
srsieve2 v2.0.0, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieve started: 2 < p < 1e8 with 5000200 terms (25000 < n < 50000, k*3^n+c) (expecting 4812049 factors)
Split 200 base 3 sequences into 1470 base 3^24 sequences.
Sieve completed at p=100000007.
Processor time: 340.83 sec. (0.00 sieving) (1.00 cores)
161378 terms written to b3_n.pfgw
Primes tested: 4761456.  Factors found: 4838822.  Remaining terms: 161378.  Time: 341.24 seconds.
srsieve2 is nearly 40% faster than srsieve for this range.

I do not know why there is a different number of subsequences, but I will look into that. It doesn't seem to be impacting the results.

I have not incorporated the logic from sr1sieve or sr2sieve into it yet. I really want to focus on making sure this code is solid before I start working on that. I will release the code "as is" once I feel confident in what I've written.

I have to run more tests before I release it. To help me in my testing, could people send me (via PM) sequences that they would like me to run my tests with? Whether it is one sequence or a thousand, I don't care. I will limit the upper bound of sieving based upon the sequences. The key to the testing is ti compare the outputs and times between the two programs.

Last fiddled with by rogue on 2019-05-24 at 16:27
rogue is online now   Reply With Quote
Old 2019-05-24, 20:15   #134
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

3×2,447 Posts
Default

Here is a comparison between sr2sieve and srsieve2:

Code:
sr2sieve -p1e7 -P4e7 -q -i sr_3.abcd
sr2sieve 2.0.0 -- A sieve for multiple sequences k*b^n+/-1 or b^n+/-k.
Read 2489826 terms for 1007 sequences from ABCD format file `sr_3.abcd'.
Split 1007 base 3 sequences into 7082 base 3^12 subsequences.
Expecting to find factors for about 197186.63 terms in this range.
sr2sieve 2.0.0 started: 25000 <= n <= 50000, 10000000 <= p <= 40000000
p=35035009, 95979 p/sec, 179938 factors, 83.5% done, 0 sec/factor, ETA 24 May 14:53
sr2sieve 2.0.0 stopped: at p=40000000 because range is complete.
Found factors for 197251 terms in 513.223 sec. (expected about 197186.63)
Code:
srsieve2 -ib3_n.pfgw -ob3.out -p1e7 -P4e7
srsieve2 v2.0.0, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Split 1007 base 3 sequences into 3028 base 3^4 sequences.
Sieve started: 1e7 < p < 4e7 with 2489826 terms (25000 < n < 50000, k*3^n+c) (expecting 197187 factors)
  p=26274431, 4.040K p/sec, 141072 factors found at 435.4 f/sec, 54.2% done. ETC 2019-05-24 15:01
Sieve completed at p=40000003.
Processor time: 441.67 sec. (0.03 sieving) (1.00 cores)
2292575 terms written to b3.out
Primes tested: 1769076.  Factors found: 197251.  Remaining terms: 2292575.  Time: 439.72 seconds.
Note that the above run of sr2sieve is using Legendre tables and srsieve2 does not use Legendre tables. The extra time for sr2sieve is the time needed to build the Legendre tables. The speeds are actually much closer (well under 10%) if one excludes that time. sr2sieve without Legendre tables is much slower than srsieve2. I will need to run more varied ranges to verify that. If that proves to be true, then that limits the code I need to copy from sr2sieve to srsieve2. srsieve2 will automatically build Legendre tables when p > max(k, n, c) (but only if max(k, n, c) < 2^32). I'll add an option to not build them when I get there.
rogue is online now   Reply With Quote
Old 2019-05-24, 23:30   #135
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

11100101011012 Posts
Default

I forgot that I had mixe FPU and SSE code for testing. Switching to SSE only, the numbers look even better:

Code:
Primes tested: 1769076.  Factors found: 197251.  Remaining terms: 2292575.  Time: 341.86 seconds.
It is possible that srsieve2 is now faster than sr2sieve even with Legendre symbols. That will verified with more testing.
rogue is online now   Reply With Quote
Old 2019-06-11, 23:12   #136
hansl
 
hansl's Avatar
 
Apr 2019

5×41 Posts
Default

I built this on Linux, but couldn't figure out how to get it to link with OpenCL, so I ended up settings "ENABLE_GPU=no".

Also had this build error which forced me to remove srsieve2 from the list of programs in the makefile, after which I was able to build:
Code:
make: *** No rule to make target 'srsieve2', needed by 'all'.  Stop.
Also the "f/sec" or "sec per factor" calculation seems buggy (for psieve at least)

I got this line at one point
Code:
p=10519166057, 3.179K p/sec, 140559 factors found at 15.29 sec per factor, 0.0% done. ETC 4219-12-25 02:25
Then later in the same run:
Code:
p=36020724671, 3.132K p/sec, 150902 factors found at 7.623 f/sec, 0.0% done. ETC 3946-08-09 02:00
Finally after Ctrl-c:
Code:
Primes tested: 1552746720.  Factors found: 150913.  Remaining terms: 191421.  Time: 487670.27 seconds.
So it seems the overall timing should have been 487670.27 / 150913 ~= 3.23 sec per factor

Last fiddled with by hansl on 2019-06-11 at 23:25
hansl is offline   Reply With Quote
Old 2019-06-11, 23:53   #137
hansl
 
hansl's Avatar
 
Apr 2019

5·41 Posts
Default

Another note (*emphasis mine)
Quote:
Originally Posted by https://mersenneforum.org/rogue/mtsieve.html
This is the sieveing source from primesieve 6.3, a library whose sole purpose is to generate a list of primes as quickly as possible. Per the primsieve license, no changes should be made to this code.
I found this statement odd and it made me curious enough to look up the license being used by primesieve:
https://github.com/kimwalisch/primes...master/COPYING
It seems to me that modification is 100% ok, as far as licensing is concerned. However that "COPYING" file should probably be included in your "sieve" directory(if not your top level), especially since all the code comments make reference to it as such.

Last fiddled with by hansl on 2019-06-11 at 23:56
hansl is offline   Reply With Quote
Old 2019-06-12, 12:53   #138
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

734110 Posts
Default

srsieve2 is still a "work in progress". It is not officially released yet.

The calculation of the factoring rate is rather complex. The first few minutes that the sieve is running significantly skew that rate so the code tries to avoid using data from the first few minutes in the calculation. That being written, I'm sure there is more room for improvement. Anyone is welcome to submit suggestions for improvement.

I put a copy of COPYING into the folder with the primesieve code.
rogue is online now   Reply With Quote
Old 2019-06-12, 18:24   #139
hansl
 
hansl's Avatar
 
Apr 2019

5·41 Posts
Default

Question about using gfndsieve: how do you determine a good pmax -P setting? At what point is it optimal to move over to PFGW?
hansl is offline   Reply With Quote
Old 2019-06-12, 18:30   #140
pepi37
 
pepi37's Avatar
 
Dec 2011
After 1.58M nines:)

1,699 Posts
Default

Quote:
Originally Posted by hansl View Post
Question about using gfndsieve: how do you determine a good pmax -P setting? At what point is it optimal to move over to PFGW?
I think as any other sieve: when removal rate is lower then processed rate from PFGW
pepi37 is online now   Reply With Quote
Old 2019-06-12, 18:54   #141
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

1CAD16 Posts
Default

Quote:
Originally Posted by pepi37 View Post
I think as any other sieve: when removal rate is lower then processed rate from PFGW
Correct. Assuming that the range being sieved has all terms of relatively the same size (within 20% the number of bits), then use pfgw to do a PRP test on one of the terms. You want the factoring rate to be close to that time.

Of course if any of the mtsieve programs is outputting bad factoring rates, that can be a problem.
rogue is online now   Reply With Quote
Old 2019-06-14, 04:14   #142
hansl
 
hansl's Avatar
 
Apr 2019

110011012 Posts
Default

Quote:
Originally Posted by hansl View Post
I built this on Linux, but couldn't figure out how to get it to link with OpenCL, so I ended up settings "ENABLE_GPU=no".
Well I feel a little silly to say this, but all I had to do was add "OPENCL_LIBS=-lOpenCL" in the makefile and that got it working. I guess I was previously assuming that the makefile was already set up for this and that I just hadn't installed the correct library/package for it to work.

So I have been testing out GPU capability in gfndsieve, and as was mentioned earlier in the thread by pepi37, it doesn't do much. I have been trying a bunch of different settings combinations to see if any particular setup is more efficient.
Starting at p=3 I don't see any noticeable load, but it does seem to increase with larger p. So trying pmin=2^34 and pmax=2^35 just as a benchmark, I see 60-75% load on my laptop's Quadro M1000M. Not sure if its really doing more work or just dealing with much more overhead in some way. Nothing I did to try tuning -G or -g at this point would max the load though.

Its difficult to measure what optimal settings are, since the CPU throughput dwarfs the GPU, so maybe its a moot point to try tuning at this stage, but would it be possible to support -W0 when -G is nonzero to make GPU-only tuning more testable?

Anyways I don't know a whole lot about optimizing OpenCL or CUDA configurations, but I'm wondering if there's a general rule of thumb for setting -G and -g? Does it make most sense to set -G equal to the number of "Compute Units" aka "Streaming Multiprocessors"? Maybe add 1 or 2 to help keep the work queue filled or does that not help?
Then I guess set -g to whatever still fits comfortably in GPU memory?

Here's a portion of output from "clinfo" for what its worth.
Code:
  Device Name                                     Quadro M1000M           
  Device Vendor                                   NVIDIA Corporation                                                                                                                                                                          
  Device Vendor ID                                0x10de                                                                                                                                                                                      
  Device Version                                  OpenCL 1.2 CUDA                    
  Driver Version                                  418.39      
  Device OpenCL C Version                         OpenCL C 1.2      
  Device Type                                     GPU      
  Device Topology (NV)                            PCI-E, 01:00.0
  Device Profile                                  FULL_PROFILE     
  Device Available                                Yes             
  Compiler Available                              Yes                  
  Linker Available                                Yes                  
  Max compute units                               4                           
  Max clock frequency                             1071MHz                        
  Compute Capability (NV)                         5.0                       
  Device Partition                                (core)                                       
    Max number of sub-devices                     1           
    Supported partition types                     None              
  Max work item dimensions                        3                    
  Max work item sizes                             1024x1024x64          
  Max work group size                             1024        
  Preferred work group size multiple              32                
  Warp size (NV)                                  32                 
  Preferred / native vector sizes                                 
    char                                                 1 / 1     
    short                                                1 / 1   
    int                                                  1 / 1     
    long                                                 1 / 1     
    half                                                 0 / 0        (n/a)
    float                                                1 / 1    
    double                                               1 / 1        (cl_khr_fp64)
The GPU has 512 total "cuda cores" by the way, since clinfo doesn't seem to provide that number in its output anywhere.
hansl is offline   Reply With Quote
Old 2019-06-14, 12:55   #143
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

3·2,447 Posts
Default

I could allow for -W0, but some sieves have other constraints which require a minimum p before using the GPU.

It is certainly possible that someone could write better OpenCL code than I. Some sieves will see a noticeable gain when using the GPU and others will not.
rogue is online now   Reply With Quote
Reply



All times are UTC. The time now is 14:03.


Fri Jul 7 14:03:31 UTC 2023 up 323 days, 11:32, 0 users, load averages: 0.89, 1.02, 1.09

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔