mersenneforum.org  

2007-08-26, 21:45   #1
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

6,353 Posts

Sieving on multicore machines

On a two-core machine two copies of gnfs-lasieve14e run twice as fast as one, so doing anything more sophisticated seems a waste of effort.

I haven't seen timings for four copies of gnfs-lasieve14e on a single-chip quad-core, though I did some tests on our dual dual-core Opteron at work, and it seems that four copies run four times as fast as one there. On the other hand, the dual dual-core Opteron has one memory controller per dual-core chip, and it's memory bandwidth that you'd expect to be the bottleneck, so timings on a Q6600 (where all four cores share a single front-side bus to one memory controller) would be interesting.

For larger problems I wonder whether it would make sense to have a multi-threaded siever, to avoid needing 2GB of memory per process, times four, for sieving a 65536x32768 area. You would have a single sieving area, four full sets of buckets for accumulating updates, and two gangs of four threads each.
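(A quick sanity check on that figure - assuming, purely for illustration, one byte per sieve location:)

Code:
/* Back-of-envelope check, assuming one byte per sieve location. */
#include <stdio.h>

int main(void)
{
    unsigned long long bytes = 65536ULL * 32768ULL;   /* 2^16 * 2^15 = 2^31 */
    printf("one sieve region        : %llu bytes = %llu GiB\n", bytes, bytes >> 30);
    printf("four independent sievers: %llu GiB\n", (4 * bytes) >> 30);
    return 0;
}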

Initially, four threads sieve small primes into their own sub-areas.

Then, each thread sieves larger primes, accumulating updates into its own set of buckets (in the standard way where updates to byte N go in bucket N/2^B). When one thread observes that it's filled a bucket, it tells all the siever threads to stop after finishing their current prime. A control thread notices that all the siever threads have stopped and starts up the distributor threads.

Each of the four distributor threads is responsible for a quarter of the arrays - that is, for a quarter of each of the sieve threads' buckets - and does the updates one bucket at a time in the normal cache-friendly way. Once all the updates are done, the control thread starts up the siever threads again.

This process avoids having the same data updated by more than one thread - there is a change of ownership when the distributor threads start, but I think that's unavoidable, and the handed-over buckets are read-only from that point on.
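For concreteness, here is a rough pthreads sketch of the scheme (nothing here is real siever code: the constants and the fake "hits" are invented, a barrier stands in for the control thread and the bucket-full signalling, and, as a simplification, one gang of four threads plays the siever role and then the distributor role rather than there being two separate gangs):

Code:
/* Sketch only: four worker threads alternate between a sieve phase
 * (each filling its own buckets) and a distribute phase (each applying
 * its quarter of every thread's buckets to the shared sieve region).
 * Real code would stop sieving when a bucket fills, loop over many
 * phases, use real factor-base primes, etc. */
#include <pthread.h>
#include <stdint.h>

#define NTHREADS     4
#define LOG_BUCKET   16                     /* B: a bucket covers 2^16 bytes */
#define REGION_BYTES (1u << 24)             /* toy region, 16 MiB not 2 GiB  */
#define NBUCKETS     (REGION_BYTES >> LOG_BUCKET)
#define BUCKET_CAP   4096

typedef struct { uint32_t offset; uint8_t logp; } update_t;

static uint8_t  sieve_region[REGION_BYTES];              /* one shared copy */
static update_t buckets[NTHREADS][NBUCKETS][BUCKET_CAP];
static unsigned fill[NTHREADS][NBUCKETS];
static pthread_barrier_t phase;

/* Sieve phase: thread t generates updates for its share of the primes;
 * an update to byte N goes into bucket N >> LOG_BUCKET (i.e. N/2^B). */
static void sieve_phase(int t)
{
    for (uint32_t n = t; n < REGION_BYTES; n += 997) {   /* fake "hits" */
        uint32_t b = n >> LOG_BUCKET;
        if (fill[t][b] < BUCKET_CAP)
            buckets[t][b][fill[t][b]++] = (update_t){ n, 1 };
    }
}

/* Distribute phase: thread t owns a fixed quarter of the bucket range,
 * taken from *every* siever, so no byte is written by two threads. */
static void distribute_phase(int t)
{
    uint32_t lo = t * (NBUCKETS / NTHREADS), hi = lo + NBUCKETS / NTHREADS;
    for (uint32_t b = lo; b < hi; b++)
        for (int s = 0; s < NTHREADS; s++) {
            for (unsigned i = 0; i < fill[s][b]; i++)
                sieve_region[buckets[s][b][i].offset] += buckets[s][b][i].logp;
            fill[s][b] = 0;                  /* bucket is empty again */
        }
}

static void *worker(void *arg)
{
    int t = (int)(intptr_t)arg;
    sieve_phase(t);
    pthread_barrier_wait(&phase);            /* ownership changes hands here */
    distribute_phase(t);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&phase, NULL, NTHREADS);
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)(intptr_t)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&phase);
    return 0;
}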

Is there something obviously theoretically wrong with this, or should I start coding it up when I get a quad-core?
2007-08-26, 22:19   #2
Wacky

Jun 2003
The Texas Hill Country

3²×11² Posts

I suspect that the inefficiency associated with re-syncing threads is going to hurt you significantly. Also, if I understand your proposed methodology, each sieve will be working a smaller subspace and therefore be hurt by the greater number of "edges" in the sieving area.

Something you might consider is having some cores doing the sieving and others doing the follow-up verification that the observed candidates are truly smooth. OTOH, since re-sieving is an efficient way to find many of the factors, this strategy may not help you in reducing the RAM/core ratio that is needed to remain efficient.
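(A toy illustration of the re-sieving idea, not code from any real siever: after the first pass has marked candidate locations, run over the factor base again and record which primes hit the survivors, instead of adding logs. The primes, roots and survivors below are made up.)

Code:
#include <stdint.h>
#include <stdio.h>

#define REGION 1000000u

static uint8_t candidate[REGION];          /* marked by the first sieve pass */

/* Second pass for one prime p with sieve root `root`: report p as a
 * factor of every surviving location it hits. */
static void resieve_prime(uint32_t p, uint32_t root)
{
    for (uint32_t n = root; n < REGION; n += p)
        if (candidate[n])
            printf("location %u has factor %u\n", n, p);
}

int main(void)
{
    candidate[17] = candidate[627] = 1;            /* pretend survivors */
    uint32_t primes[] = { 5, 11, 19 }, roots[] = { 2, 0, 17 };
    for (int i = 0; i < 3; i++)
        resieve_prime(primes[i], roots[i]);
    return 0;
}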

When Paul was doing LA (linear algebra) on the cluster at MSFT, he found that the additional CPUs did not really help all that much. He could dedicate one of the CPUs to a different, compute-intensive task and still saturate the limiting resource.

Even on a single multi-core processor, doing LA that can be partitioned in an efficient manner, we found that the sync time significantly reduced the effective scaling factor.
2007-08-26, 23:47   #3
R.D. Silverman

Nov 2003

2²·5·373 Posts

Quote:
Originally Posted by Wacky
I suspect that the inefficiency associated with re-syncing threads is going to hurt you significantly. Also, if I understand your proposed methodology, each sieve will be working a smaller subspace and therefore be hurt by the greater number of "edges" in the sieving area.

Something you might consider is having some cores doing the sieving and others doing the follow-up verification that the observed candidates are truly smooth.
I agree with Richard. I am working on a multi-threaded implementation. I have divided the sieve work into three pieces:

(1) Select a special-q, compute the sieve start points (this is integer-arithmetic intensive), and compute the sieve boundaries. Pass control to (2), then prepare for the next special-q. [Note! This takes a lot of memory.]
(2) Sieve, identify smooth candidates, then resieve for the smooth candidates.
(3) Factor and output the candidates.

I will have to do some experiments to get good balance. I may break (2) into two separate parts. If (3) doesn't keep a core busy, I may have it help out with (1). This is sort of like a pipeline implementation. Each stage/core does its work, then passes to the next stage.
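A bare-bones sketch of that kind of pipeline, purely to show the hand-off structure (the stage bodies, the queue size and the special-q values below are placeholders, not anything from an actual siever):

Code:
/* Sketch only: three stage threads connected by bounded queues.
 * Stage 1 picks special-q values (and would compute sieve start points),
 * stage 2 "sieves" them, stage 3 "factors and outputs".  Work items are
 * just integers here; only the plumbing is the point. */
#include <pthread.h>
#include <stdio.h>

#define QSIZE         4
#define NUM_SPECIAL_Q 10

typedef struct {
    int buf[QSIZE];
    int head, count, closed;
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
} queue;

static void q_push(queue *q, int v)
{
    pthread_mutex_lock(&q->mtx);
    while (q->count == QSIZE) pthread_cond_wait(&q->cv, &q->mtx);
    q->buf[(q->head + q->count++) % QSIZE] = v;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mtx);
}

static void q_close(queue *q)                    /* producer is finished */
{
    pthread_mutex_lock(&q->mtx);
    q->closed = 1;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mtx);
}

static int q_pop(queue *q, int *v)               /* returns 0 when drained */
{
    pthread_mutex_lock(&q->mtx);
    while (q->count == 0 && !q->closed) pthread_cond_wait(&q->cv, &q->mtx);
    int ok = q->count > 0;
    if (ok) { *v = q->buf[q->head]; q->head = (q->head + 1) % QSIZE; q->count--; }
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mtx);
    return ok;
}

static queue q12 = { .mtx = PTHREAD_MUTEX_INITIALIZER, .cv = PTHREAD_COND_INITIALIZER };
static queue q23 = { .mtx = PTHREAD_MUTEX_INITIALIZER, .cv = PTHREAD_COND_INITIALIZER };

static void *stage1(void *arg)    /* (1) pick special-q, compute start points */
{
    (void)arg;
    for (int i = 0; i < NUM_SPECIAL_Q; i++) q_push(&q12, 1000003 + i);
    q_close(&q12);
    return NULL;
}

static void *stage2(void *arg)    /* (2) sieve, resieve, keep smooth candidates */
{
    (void)arg;
    int sq;
    while (q_pop(&q12, &sq)) q_push(&q23, sq);   /* pretend every q survives */
    q_close(&q23);
    return NULL;
}

static void *stage3(void *arg)    /* (3) factor and output the candidates */
{
    (void)arg;
    int sq;
    while (q_pop(&q23, &sq)) printf("relations for special-q %d\n", sq);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2, t3;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_create(&t3, NULL, stage3, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
    return 0;
}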