mersenneforum.org  

2012-12-05, 10:34   #1
poily

Nov 2010

2×5² Posts

Distributed NFS postprocessing

I recently started working on NFS postprocessing for DLOG and came across an interesting technique called MapReduce. I noticed that the duplicate and singleton removal stages of the postprocessing fit the MapReduce distributed programming paradigm very well. Has anyone tried to implement distributed postprocessing before? It could be quite interesting for all the distributed sieving projects, since with MapReduce one generally doesn't need to collect all the data from all the sources in one place, but can instead do point-to-point exchanges between peers.
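As a toy illustration of the fit (hypothetical sketch, not msieve code): duplicate removal is exactly the MapReduce word-count pattern. Here a relation is identified by an assumed (a, b) pair; the map stage keys each relation on (a, b), the shuffle sorts by key, and the reduce stage keeps one relation per key.

```python
# Toy sketch of duplicate removal as MapReduce (illustration only).
from itertools import groupby

def map_stage(relations):
    # emit (key, relation) pairs keyed on the (a, b) coordinates
    return [((a, b), (a, b, rest)) for a, b, rest in relations]

def reduce_stage(keyed):
    keyed.sort(key=lambda kv: kv[0])              # shuffle/sort by key
    return [next(grp)[1]                          # keep one relation per key
            for _, grp in groupby(keyed, key=lambda kv: kv[0])]

relations = [(1, 2, "x"), (3, 4, "y"), (1, 2, "x")]   # one duplicate
unique = reduce_stage(map_stage(relations))            # duplicate dropped
```

In a distributed run, the sort-by-key step is where the point-to-point shuffle between peers happens; each peer only ever sees the keys it owns.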
2012-12-05, 11:18   #2
em99010pepe

Sep 2004

2830₁₀ Posts

I didn't read the link you gave, but I was wondering how it would manage to organize the relations file in parallel with all the computers connected.
For a 31 LP-bit task the zipped relation file is ~14 GB. For example, 2,1019- had a relation file of more than 100 GB and a matrix of ~25 GB. The point-to-point exchanges between peers would amount to a lot of gigabytes.

2012-12-05, 11:43   #3
poily

Nov 2010

2·5² Posts

Say you have N relations on P peers, each holding N/P relations locally. If you want to collect all the data in one place, you need O((P−1)·N/P) communication time, and then O(N) time to process the data.

In MapReduce both times can be about P times smaller: after the Map stage each peer has O(N/P) key-value pairs, which it sorts and then sends O(N/P²) of to each other peer. It sends the first O(N/P²) key-value pairs to its nearest neighbour, the next O(N/P²) to the next nearest, and so on. Each peer must be able to connect to any other peer, but only to one at a time. Since all the send/receives run in parallel, the overall communication time is O((P−1)·N/P²). After the exchange finishes, each peer again holds O(N/P) key-value pairs to process in the Reduce stage.

Of course, in real life not all peers hold equal amounts of data and key-value pairs, but in general there should be a very noticeable speedup.
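A toy simulation of the exchange described above (illustration only, with made-up numbers): P peers each hold N/P key-value pairs; each peer sorts locally, then routes each pair directly to the peer that owns its key, roughly N/P² pairs per link, with all links active in parallel.

```python
# Simulated all-to-all shuffle between P peers (toy example).
P = 4
peers = [[(k, "peer%d" % i) for k in range(i * 5, i * 5 + 5)]
         for i in range(P)]                       # N/P = 5 pairs per peer

# outboxes[i][j] = what peer i will send to peer j
outboxes = [[[] for _ in range(P)] for _ in range(P)]
for i, local in enumerate(peers):
    for key, val in sorted(local):                # local sort before sending
        outboxes[i][key % P].append((key, val))   # destination chosen by key

# after the point-to-point exchanges, peer j holds every pair for its keys
after_shuffle = [[kv for i in range(P) for kv in outboxes[i][j]]
                 for j in range(P)]
```

The key-to-peer map here is a simple modulus; any deterministic hash agreed on by all peers works, since the only requirement is that both copies of a duplicate key land on the same peer.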
2012-12-05, 12:14   #4
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

6376₁₀ Posts

Dedupe and singleton removal were done on a cluster for RSA768.

For anything smaller than that, they're a trivial amount of work (100GB of disc and a few GB of memory) and basically limited by disc I/O speed.
2012-12-05, 12:21   #5
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

2×3×7×137 Posts

Quote:
Originally Posted by fivemack
Dedupe and singleton removal were done on a cluster for RSA768.

For anything smaller than that, they're a trivial amount of work (100GB of disc and a few GB of memory) and basically limited by disc I/O speed.
So in other words, do it on an SSD.
2012-12-05, 12:35   #6
jasonp
Tribal Bullet

Oct 2004

3532₁₀ Posts

MapReduce works extremely well on these kinds of problems, but the implicit assumption when using it is that you have an HPC-class distributed filesystem. Otherwise the main operation, shuffling keys around, finishes very quickly and disk IO becomes the bottleneck.

I have considered an MPI decomposition of the problem across a few machines, so that duplicate and singleton removal split the large hashtables involved across several CPU memories. MPI even has primitives for remote access to a distributed hashtable.
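In plain Python the ownership scheme behind that decomposition might look like the following toy sketch (no real MPI here; an actual implementation could use mpi4py point-to-point calls or MPI one-sided primitives for the remote table updates). Each prime ideal is owned by rank (ideal mod nranks), so every rank's hash table holds only its own ideals and fits in that rank's memory.

```python
# Toy sketch of a hash table partitioned across "ranks" (no real MPI).
from collections import Counter

nranks = 3
relations = [("r1", [101, 103]), ("r2", [101, 103]), ("r3", [109])]

tables = [Counter() for _ in range(nranks)]      # one table per rank
for _, ideals in relations:
    for p in ideals:
        tables[p % nranks][p] += 1               # a remote update under MPI

def survives(ideals):
    # singleton removal: drop relations containing an ideal seen only once
    return all(tables[p % nranks][p] > 1 for p in ideals)

survivors = [rid for rid, ideals in relations if survives(ideals)]
```

The point of the partitioning is that no single rank ever needs the whole table resident, which is exactly the memory bottleneck in single-machine duplicate and singleton removal.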
2012-12-45, 12:45   #7
poily

Nov 2010

32₁₆ Posts

And there's a working MPI MapReduce implementation. ;)

I believe one can also do naive large-prime removal using double MapReduce.
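One possible reading of that "double MapReduce" idea, as a toy (my interpretation, with made-up data): pass 1 counts how often each large prime appears across all relations; pass 2 filters out relations containing a prime that appears only once. A real implementation would iterate, since removing relations can create new singletons.

```python
# Two MapReduce passes for naive singleton large-prime removal (toy).
from collections import Counter

relations = {"r1": [1009, 1013], "r2": [1009], "r3": [1013, 1019]}

# pass 1: map relations to their primes, reduce by counting occurrences
prime_count = Counter(p for primes in relations.values() for p in primes)

# pass 2: map each relation against the counts of its primes, reduce by
# keeping only relations whose primes all occur at least twice
kept = {rid: primes for rid, primes in relations.items()
        if all(prime_count[p] >= 2 for p in primes)}
```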