2007-08-26, 21:45  #1 
(loop (#_fork))
Feb 2006
Cambridge, England
6,353 Posts 
Sieving on multicore machines
On a two-core machine, two copies of gnfs-lasieve14e run twice as fast as one, so doing anything more sophisticated seems a waste of effort.
I haven't seen timings for four copies of gnfs-lasieve14e on a single-chip quad-core, though I did some tests on our dual dual-core Opteron at work, and it seems that four copies run four times as fast as one there. On the other hand, that Opteron has one memory controller per dual-core chip, and it's memory bandwidth which you'd expect to be the bottleneck, so timings on a Q6600 would be interesting.

For larger problems I wonder whether it would make sense to have a multi-threaded siever, to avoid needing 2GB of memory times four to sieve a 65536x32768 area. You have a single sieving area, four full sets of buckets for accumulating updates, and two gangs of four threads each.

Initially, four threads sieve small primes into their own sub-areas. Then each thread sieves larger primes, accumulating updates into its own set of buckets (in the standard way, where an update to byte N goes in bucket N/2^B). When one thread notices that it has filled a bucket, it tells all the siever threads to stop after finishing their current prime. A control thread notices that all the siever threads have stopped and starts up the distributor threads. Each of the four distributor threads is responsible for a quarter of the sieve area (that is, for a quarter of each sieve thread's buckets) and applies the updates one bucket at a time in the normal cache-friendly way. Once all the updates are done, the control thread starts the siever threads again.

This scheme avoids having the same data updated by more than one thread. There is a change of ownership when the distributor threads start, but I think that's unavoidable, and the buckets are read-only at that point.

Is there something obviously theoretically wrong with this, or should I start coding it up when I get a quad-core? 
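The two-gang scheme above can be sketched in miniature. This is only an illustration of the ownership hand-off, not a real siever: the area size, thread count, and the arithmetic-progression "update stream" standing in for prime hits are all my own toy assumptions, and the bucket index `n >> B` matches the N/2^B rule from the post.

```python
# Toy sketch of the two-gang bucket scheme: sievers fill private buckets,
# then distributors each take ownership of one quarter of the shared area.
# All sizes and the update stream are illustrative assumptions.
import threading

NTHREADS = 4
LOG_AREA = 16                    # toy 2^16-byte sieve area
AREA = 1 << LOG_AREA
B = LOG_AREA - 2                 # each bucket covers a quarter of the area

sieve = bytearray(AREA)          # the single shared sieve area
# buckets[t][q] holds siever t's pending updates for quarter q
buckets = [[[] for _ in range(NTHREADS)] for _ in range(NTHREADS)]

def siever(t):
    # Phase 1: accumulate updates into this thread's own buckets only.
    # An update to byte n goes into bucket n >> B, as in the post.
    for n in range(t, AREA, t + 2):      # toy stand-in for prime hits
        buckets[t][n >> B].append(n)

def distributor(q):
    # Phase 2: distributor q owns quarter q of the area and drains that
    # bucket from every siever, so no byte is touched by two threads.
    for t in range(NTHREADS):
        for n in buckets[t][q]:
            sieve[n] += 1

# The "control thread": run the siever gang to completion, then the
# distributor gang (the joins here play the role of the barrier).
for target in (siever, distributor):
    gang = [threading.Thread(target=target, args=(i,)) for i in range(NTHREADS)]
    for th in gang:
        th.start()
    for th in gang:
        th.join()

print("updates applied:", sum(sieve))
```

A real implementation would trigger the hand-off whenever any bucket fills rather than once per pass, but the ownership pattern (private buckets during sieving, one quarter per distributor afterwards) is the same.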
2007-08-26, 22:19  #2 
Jun 2003
The Texas Hill Country
3^{2}×11^{2} Posts 
I suspect that the inefficiency associated with re-syncing threads is going to hurt you significantly. Also, if I understand your proposed methodology, each siever will be working on a smaller sub-space and will therefore be hurt by the greater number of "edges" in the sieving area.
Something you might consider is having some cores do the sieving and others do the follow-up verification that the observed candidates are truly smooth. OTOH, since resieving is an efficient way to find many of the factors, this strategy may not help you reduce the RAM/core ratio needed to remain efficient.

When Paul was doing LA on the cluster at MSFT, he found that the additional CPUs did not really help all that much. He could dedicate one of the CPUs to a different, compute-intensive task and still saturate the limiting resource. Even on a single multi-core processor, doing LA which can be partitioned in an efficient manner, we found that the sync time significantly reduced the effective scaling factor. 
2007-08-26, 23:47  #3  
Nov 2003
2^{2}·5·373 Posts 
I have divided the sieve work into 3 pieces:

(1) Select a special-q, compute the sieve start points (this is integer-arithmetic intensive) and the sieve boundaries, pass control to (2), then prepare for the next special-q. [Note: this takes a lot of memory!]
(2) Sieve, identify smooth candidates, then resieve for the smooth candidates.
(3) Factor and output the candidates.

I will have to do some experiments to get good balance. I may break (2) into two separate parts, and if (3) doesn't keep a core busy, I may have it help out with (1). This is sort of like a pipeline implementation: each stage/core does its work, then passes to the next stage. 
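The pipeline described above amounts to one thread per stage with queues between them. A minimal sketch, where the per-stage "work" is a placeholder and the special-q values and field names are made up for illustration:

```python
# Toy three-stage pipeline: (1) special-q preparation -> (2) sieve/resieve
# -> (3) factor and output, one thread per stage, bounded queues between.
# The per-stage work here is placeholder arithmetic, not real sieving.
import queue
import threading

q12 = queue.Queue(maxsize=4)     # stage 1 -> stage 2 hand-off
q23 = queue.Queue(maxsize=4)     # stage 2 -> stage 3 hand-off
STOP = object()                  # sentinel to shut the pipeline down
results = []

def stage1_prepare(special_qs):
    # (1) Pick a special-q, compute start points and boundaries, hand the
    # job to stage 2, then immediately prepare the next special-q.
    for sq in special_qs:
        q12.put({"special_q": sq, "starts": [sq % 7]})   # placeholder data
    q12.put(STOP)

def stage2_sieve():
    # (2) Sieve, identify smooth candidates, resieve for their factors.
    while (job := q12.get()) is not STOP:
        job["candidates"] = [job["special_q"] * 2]       # placeholder
        q23.put(job)
    q23.put(STOP)

def stage3_factor():
    # (3) Factor the surviving candidates and output relations.
    while (job := q23.get()) is not STOP:
        results.extend(job["candidates"])

threads = [threading.Thread(target=stage1_prepare, args=([11, 13, 17],)),
           threading.Thread(target=stage2_sieve),
           threading.Thread(target=stage3_factor)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

The bounded queues give the balance knob mentioned above: if stage 2 is the bottleneck, stage 1 blocks on `put` instead of racing ahead and eating memory, and splitting (2) into two stages is just one more queue.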
