mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve with GNFS support (https://www.mersenneforum.org/showthread.php?t=5413)

R.D. Silverman 2008-05-03 11:23

[QUOTE=bdodson;132633]
Uhm, that's 14 threads, four each on the first two matrices,
six on the new one, 10,257+. A monster, by the way, 15.479M^2,
despite having sieved the same region. Switching from x^6 -10
to x^6+10 seems to have been a radical alteration in the roots;
even the ratings of the poly look different:

...)[/QUOTE]

Note that x^6 + 10 has no real roots....

jasonp 2008-05-03 13:12

[QUOTE=bdodson;132633]
Uhm, that's 14 threads, four each on the first two matrices,
six on the new one, 10,257+. A monster, by the way, 15.479M^2,
despite having sieved the same region. Switching from x^6 -10
to x^6+10 seems to have been a radical alteration in the roots;
even the ratings of the poly look different:

size score = 1.839597e-12, Murphy alpha = 1.635261, combined =
1.152956e-12

for 257+ and

size score = 2.036089e-12, Murphy alpha = 1.321650, combined =
1.395729e-12

for 257- despite having the same "difficulty"[/QUOTE]
With three huge matrix jobs going, I really hope you don't run into mysterious problems that cause trivial dependencies to be found. Out of curiosity, what were the final reported matrix densities before the Lanczos iterations started?

The difference in combined scores above should mean that 257- yields 1.39/1.15 ≈ 121% as many relations for a siever to find as 257+ does.
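The ratio is easy to check directly; the combined Murphy E scores below are the ones from the quoted post:

```python
# Combined Murphy E scores quoted above.
e_257m = 1.395729e-12  # 10,257-
e_257p = 1.152956e-12  # 10,257+

# A higher combined score predicts proportionally more relations
# per unit of sieving effort.
ratio = e_257m / e_257p
print(round(ratio, 2))  # 1.21
```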

bdodson 2008-05-03 14:21

[QUOTE=R.D. Silverman;132646]Note that x^6 + 10 has no real roots....[/QUOTE]

I was presuming this would be a softball, directly in Bob's "wheel house";
thanks for the confirmation. There's been substantial speculation about
how msieve would run with additional threads on an SMP machine. (We
paid extra for the SGI, 8 quadcores, way more per core than for the 40
dual-Xeon quadcore "whitebox" nodes; so 32 cores, some 29 of them
available, where msieve is running, versus 320 cores with exclusive
scheduling under Condor -- no logins to the compute nodes -- where 3/4
of the gnfs-lasieve jobs are running; the other 1/4 run on some 100+
Opterons, including the Condor master for both clusters. Peak sieving at
c. 350 cores; several days at 0 under heavy use by others.) I'm appending
three text files below, with commentary and data, to support my current
view(s).

One of my friends reports an analysis that msieve might not perform
better than a generic program without much parallelism; so that one
needs to double the number of threads to reduce the runtime by a factor
of 1/2. That's not the way I see it when doubling 4 threads to 8, or tripling
to -t 12; neither of which seemed to me to run well. There appears
to be a master thread, which subcontracts some of its computation to
a 2nd thread; or to 3 extra threads (-t 4); or more, I checked up to
11 extra threads. As near as I could tell (recall that I don't code, and
in this case haven't even opened Jason's files --- although he did get
me through a compile), the master thread appears to have a limited
amount of possible computation to sub-contract out to additional
threads. So for a fixed matrix size and hardware configuration there
will be a "sweet spot" ("local max", if one prefers) where reducing the
number of threads slows down speed/performance, but adding threads
(much less doubling them!) doesn't improve matters much. I'm just
speculating here (since all of my matrices are large), but it seems to me
that for small problems 2-threads (total) might be better than 4, or at
least there would be no substantial improvement at 4 [corrections and/or
objections?]. The data I'm reporting suggests that for matrices in the
range of 15M^2 (on this hardware), using 6-threads is an improvement
over 4-threads; but the current msieve structure doesn't need/want
8-threads (at least, not 12-threads).
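The "sweet spot" behaviour described above is what Amdahl's law predicts when a fixed fraction of the work stays serialized in the master thread; a minimal sketch (the 40% serial fraction is an arbitrary illustrative number, not a measurement of msieve):

```python
def amdahl_speedup(serial_fraction, threads):
    """Ideal speedup when a fixed fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)

# With, say, 40% of the work stuck in the master thread, doubling
# from 4 to 8 threads gains little:
for t in (1, 2, 4, 6, 8, 12):
    print(t, round(amdahl_speedup(0.4, t), 2))
```

In this model, going from 6 to 12 threads improves the speedup by barely 10%, consistent with the flat readings reported here.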

I'll start with why I concluded that 6-threads is worthwhile.

Here's -t 6:

[code]
21413 bad0 25 0 5326m 5.2g 476 R 505 4.7 2542:14 msieve
24569 bad0 25 0 4661m 4.5g 476 R 398 4.1 10949:28 msieve
24614 bad0 25 0 4766m 4.6g 476 S 272 4.2 10739:29 msieve
---

showing threads:

Tasks: 473 total, 14 running,
Mem: 114835776k total, 98371768k used, 16464008k free, 96128k buffers
Swap: 125009788k total, 0k used, 125009788k free, 72856876k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21449 bad0 25 0 5326m 5.2g 476 R 103 4.7 385:12.11 msieve
24569 bad0 25 0 4661m 4.5g 476 R 103 4.1 4040:18 msieve [1]
24614 bad0 25 0 4766m 4.6g 476 R 103 4.2 3881:42 msieve [2]
21413 bad0 25 0 5326m 5.2g 476 R 103 4.7 823:11.22 msieve [3]
21453 bad0 25 0 5326m 5.2g 476 R 102 4.7 353:47.64 msieve
24642 bad0 23 0 4766m 4.6g 476 R 88 4.2 2124:01 msieve
24645 bad0 23 0 4766m 4.6g 476 R 86 4.2 2385:25 msieve
24646 bad0 20 0 4766m 4.6g 476 R 61 4.2 2354:26 msieve
21452 bad0 22 0 5326m 5.2g 476 S 41 4.7 334:38.59 msieve
21424 bad0 25 0 5326m 5.2g 476 S 40 4.7 352:26.46 msieve
24617 bad0 19 0 4661m 4.5g 476 R 28 4.1 2220:36 msieve
21423 bad0 21 0 5326m 5.2g 476 S 0 4.7 300:33.46 msieve
24620 bad0 25 0 4661m 4.5g 476 S 0 4.1 2326:51 msieve
24624 bad0 25 0 4661m 4.5g 476 S 0 4.1 2367:39 msieve
[/code]

These are my three matrices. There are two longer-running ones with
4-threads, and at this point I've added a third with 6-threads. The
ones showing "4.5 g, 4.1%" memory are from the small matrix (13.9M^2,
from 10,257- difficulty 257); the ones with "4.6g, 4.2%" from the
midsize 14.2M^2 (3,536+ c252); and the ones with "5.2g 4.7%" are
from the new/large matrix, with the 6-threads. The first listing (without
showing threads) gives the total time by all threads for each of the
three computations. In the listing showing threads I've labeled what
I'm referring to as the "master thread" for each matrix as [1], [2] & [3].

The calculation I found informative compares the total amount of
computation done on the slave threads with twice the time spent by
the master thread:

[code]
385+353+334+352+300 = 1724 > 2*823 [3]

2124+2385+2354 = 6863 < 2*3881 [2]

2220+2326+2367 = 6913 < 2*4040 [1] [/code]

It seems to me that the extra time spent by the two additional
threads is doing something useful.
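The comparison is easy to script; the cpu-minute figures are the ones read off the top listings (with the master time for [2] taken as 3881 from the thread listing):

```python
# Slave-thread totals vs. twice the master-thread time, in cpu-minutes,
# for each of the three matrices.
runs = {
    "[3] 10,257+ (-t 6)": ([385, 353, 334, 352, 300], 823),
    "[2] 3,536+  (-t 4)": ([2124, 2385, 2354], 3881),
    "[1] 10,257- (-t 4)": ([2220, 2326, 2367], 4040),
}
for label, (slaves, master) in runs.items():
    total = sum(slaves)
    sign = ">" if total > 2 * master else "<"
    print(f"{label}: {total} {sign} {2 * master}")
```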

Next, let's look at what happens with 12-threads (done first). I
took two readings, both showing the threads. This time the threads
for the large matrix [3'] are using "5.8g, 5.3%", a substantial jump in
memory, to begin with. The first 75 min or so of the master thread's
time was spent on the initial matrix build (once only, provided one
remembers -ncr instead of -nc2, which overwrites the matrix data and
makes saved checkpoints useless ... I guess the matrix data ought to be
saved as well, for those of us thinking about saving things. Peter
alternates chk-odd and chk-even, so as not to lose the checkpoint when
the computer goes down for a reboot "in 2 minutes ...").
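The alternating-checkpoint trick credited to Peter can be sketched like this (a hypothetical illustration of the idea, not msieve's actual checkpoint code; the chk-odd/chk-even names follow the post):

```python
import os

def save_checkpoint(state: bytes, iteration: int, basename="chk"):
    """Write the checkpoint to chk-odd or chk-even depending on the
    iteration parity, so an interrupted write never destroys the
    only surviving copy."""
    target = f"{basename}-{'odd' if iteration % 2 else 'even'}"
    tmp = target + ".tmp"
    with open(tmp, "wb") as f:
        f.write(state)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, target)  # atomic on POSIX filesystems
    return target
```

Whichever file is mid-write when the machine goes down, the other parity's copy survives intact.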

What bothers me here is the % cpu use by the other 11 threads
("slaves"? a better term?). The 28% was a relatively large reading
[I'm not claiming any of the displays are random; more like peak].
Just for completeness, the 2nd reading is a way-out outlier, which
hits 44%; but at no time did I observe the master thread having
enough tasks to get the other threads above 50%; and the other
peak uses seemed to cluster around 27%/28%. That's not good
cpu use, even when there aren't other users that might want the
cycles. I didn't see much difference for 8-threads, not enough
to even have saved a reading (sorry). So here's 12-threads:

[code]
Tasks: 472 total, 15 running, 457 sleeping, 0 stopped, 0 zombie
Cpu(s): 24.4%us, 0.4%sy, 0.0%ni, 75.3%id, 0.0%wa, ...
Mem: 114835776k total, 94949716k used, 19886060k free, 79708k buffers
Swap: 125009788k total, 0k used, 125009788k free, 77057740k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

11556 bad0 25 0 6074m 5.8g 484 R 103 5.3 159:24.40 msieve [3']
24569 bad0 25 0 4661m 4.5g 476 R 101 4.1 2747:42 msieve
24614 bad0 19 0 4766m 4.6g 476 R 100 4.2 2648:28 msieve
24624 bad0 25 0 4661m 4.5g 476 S 67 4.1 1590:22 msieve
24620 bad0 25 0 4661m 4.5g 476 S 56 4.1 1576:26 msieve
24617 bad0 25 0 4661m 4.5g 476 S 52 4.1 1518:06 msieve
12153 bad0 19 0 6074m 5.8g 484 R 28 5.3 13:26.47 msieve
12154 bad0 19 0 6074m 5.8g 484 R 28 5.3 15:29.80 msieve
12155 bad0 19 0 6074m 5.8g 484 R 28 5.3 10:20.64 msieve
12157 bad0 19 0 6074m 5.8g 484 R 28 5.3 10:46.38 msieve
12158 bad0 19 0 6074m 5.8g 484 R 28 5.3 15:22.78 msieve
12162 bad0 19 0 6074m 5.8g 484 R 28 5.3 14:39.83 msieve
12164 bad0 19 0 6074m 5.8g 484 R 28 5.3 16:09.36 msieve
12165 bad0 19 0 6074m 5.8g 484 R 28 5.3 15:31.10 msieve
12166 bad0 19 0 6074m 5.8g 484 R 28 5.3 13:39.44 msieve
12167 bad0 19 0 6074m 5.8g 484 R 28 5.3 10:48.01 msieve
12169 bad0 19 0 6074m 5.8g 484 R 28 5.3 14:41.54 msieve
24645 bad0 25 0 4766m 4.6g 476 S 4 4.2 1614:29 msieve
...
24642 bad0 25 0 4766m 4.6g 476 S 0 4.2 1427:52 msieve
24646 bad0 25 0 4766m 4.6g 476 S 0 4.2 1595:21 msieve
------

c. 20 cpuminutes later:

24569 bad0 25 0 4661m 4.5g 476 R 99 4.1 2769:57 msieve
24614 bad0 25 0 4766m 4.6g 476 R 99 4.2 2670:49 msieve
11556 bad0 25 0 6074m 5.8g 484 R 99 5.3 183:20.23 msieve
24646 bad0 25 0 4766m 4.6g 476 S 64 4.2 1607:02 msieve
24620 bad0 25 0 4661m 4.5g 476 S 55 4.1 1589:19 msieve
24645 bad0 25 0 4766m 4.6g 476 S 48 4.2 1627:04 msieve
24617 bad0 25 0 4661m 4.5g 476 S 46 4.1 1530:30 msieve
24624 bad0 25 0 4661m 4.5g 476 S 46 4.1 1602:15 msieve
12153 bad0 19 0 6074m 5.8g 484 R 44 5.3 17:28.40 msieve
12154 bad0 19 0 6074m 5.8g 484 R 44 5.3 19:41.78 msieve
12155 bad0 20 0 6074m 5.8g 484 R 44 5.3 12:58.44 msieve
12157 bad0 20 0 6074m 5.8g 484 R 44 5.3 13:34.84 msieve
12158 bad0 19 0 6074m 5.8g 484 R 43 5.3 19:31.82 msieve
12165 bad0 19 0 6074m 5.8g 484 R 43 5.3 19:55.62 msieve
12166 bad0 19 0 6074m 5.8g 484 R 43 5.3 17:27.61 msieve
12162 bad0 19 0 6074m 5.8g 484 R 43 5.3 18:41.72 msieve
12164 bad0 19 0 6074m 5.8g 484 R 43 5.3 20:55.56 msieve
12167 bad0 19 0 6074m 5.8g 484 R 43 5.3 13:32.75 msieve
12169 bad0 19 0 6074m 5.8g 484 R 43 5.3 18:48.96 msieve
24642 bad0 25 0 4766m 4.6g 476 S 0 4.2 1438:41 msieve
[/code]

One more set with 6-threads, this time with all three matrices previously
built, a clean start following a reboot:

[code]

Friday pm, re-start

10129 bad0 25 0 5326m 5.2g 476 S 591 4.7 65:45.73 msieve [3'']
10123 bad0 25 0 4661m 4.5g 476 S 398 4.1 65:02.85 msieve
10105 bad0 18 0 4766m 4.6g 476 S 310 4.2 57:28.75 msieve
---------

sat am, may03:

10129 bad0 25 0 5326m 5.2g 476 S 499 4.7 1840:09 msieve [3'']
10105 bad0 25 0 4766m 4.6g 476 R 384 4.2 1554:44 msieve
10123 bad0 25 0 4661m 4.5g 476 R 277 4.1 1587:03 msieve
--------

threads:

10169 bad0 25 0 5326m 5.2g 476 R 102 4.7 234:11.45 msieve
10172 bad0 25 0 5326m 5.2g 476 R 101 4.7 217:03.43 msieve
10177 bad0 25 0 5326m 5.2g 476 S 84 4.7 233:12.79 msieve
10173 bad0 25 0 5326m 5.2g 476 S 83 4.7 245:12.68 msieve
10129 bad0 25 0 5326m 5.2g 476 S 77 4.7 527:39.62 msieve [3'']
10105 bad0 20 0 4766m 4.6g 476 R 75 4.2 498:41.07 msieve [2'']
10178 bad0 25 0 5326m 5.2g 476 S 74 4.7 240:16.63 msieve
10123 bad0 19 0 4661m 4.5g 476 R 60 4.1 514:17.28 msieve [1'']
10135 bad0 25 0 4661m 4.5g 476 S 58 4.1 336:47.76 msieve
10168 bad0 25 0 4766m 4.6g 476 S 57 4.2 302:31.43 msieve
10171 bad0 25 0 4766m 4.6g 476 S 42 4.2 310:40.66 msieve
10138 bad0 25 0 4661m 4.5g 476 S 42 4.1 303:04.09 msieve
10174 bad0 25 0 4766m 4.6g 476 S 41 4.2 325:08.25 msieve
10137 bad0 25 0 4661m 4.5g 476 S 33 4.1 314:00.02 msieve
------------

5 min later:

10174 bad0 25 0 4766m 4.6g 476 R 100 4.2 327:48.39 msieve
10123 bad0 25 0 4661m 4.5g 476 R 100 4.1 517:35.61 msieve
10105 bad0 25 0 4766m 4.6g 476 R 100 4.2 502:19.29 msieve
10129 bad0 25 0 5326m 5.2g 476 R 98 4.7 531:01.45 msieve
10168 bad0 25 0 4766m 4.6g 476 S 82 4.2 305:28.98 msieve
10171 bad0 25 0 4766m 4.6g 476 S 74 4.2 313:14.15 msieve
10135 bad0 21 0 4661m 4.5g 476 R 66 4.1 339:48.34 msieve
10137 bad0 21 0 4661m 4.5g 476 R 65 4.1 316:29.86 msieve
10138 bad0 21 0 4661m 4.5g 476 R 62 4.1 305:46.22 msieve
10169 bad0 19 0 5326m 5.2g 476 R 34 4.7 236:45.32 msieve
10172 bad0 18 0 5326m 5.2g 476 R 10 4.7 219:38.43 msieve
10173 bad0 18 0 5326m 5.2g 476 R 6 4.7 247:07.66 msieve
10177 bad0 18 0 5326m 5.2g 476 R 5 4.7 235:08.08 msieve
10178 bad0 18 0 5326m 5.2g 476 R 3 4.7 242:03.25 msieve
---

another 5 min:

10168 bad0 25 0 4766m 4.6g 476 R 100 4.2 307:36.20 msieve
10171 bad0 25 0 4766m 4.6g 476 R 100 4.2 315:09.34 msieve
10174 bad0 25 0 4766m 4.6g 476 R 100 4.2 329:51.74 msieve
10135 bad0 25 0 4661m 4.5g 476 R 100 4.1 341:59.71 msieve
10137 bad0 25 0 4661m 4.5g 476 R 100 4.1 318:24.65 msieve
10129 bad0 23 0 5326m 5.2g 476 R 100 4.7 533:36.07 msieve
10138 bad0 25 0 4661m 4.5g 476 R 100 4.1 307:48.47 msieve
10105 bad0 25 0 4766m 4.6g 476 S 98 4.2 505:06.21 msieve
10123 bad0 25 0 4661m 4.5g 476 S 93 4.1 520:03.12 msieve
10169 bad0 21 0 5326m 5.2g 476 R 66 4.7 238:24.80 msieve
10177 bad0 21 0 5326m 5.2g 476 R 66 4.7 236:36.33 msieve
10172 bad0 21 0 5326m 5.2g 476 R 66 4.7 221:29.64 msieve
10173 bad0 21 0 5326m 5.2g 476 R 66 4.7 248:35.45 msieve
10178 bad0 21 0 5326m 5.2g 476 R 66 4.7 243:24.18 msieve
[/code]

At least the peaks look better, and the big matrix [3''] is getting in
some extra time; good considering that the mid-size one is past 50%,
well on its way to 4 weeks, walltime. Have I mentioned that this is
my first msieve use/matrix? It seems consistent with my first quadratic
sieve (RSA120, the first time, with a report at the crypto conference
just before the start of RSA129; both being the first use of "larger
large primes") and my first gnfs (a c119 partition number, the first
application of gnfs to a crypto-sized number, 384 bits at the time; so
early that it was before Lanczos and the Montgomery sqrt --- those came
on a CWI c106 snfs, iirc --- our paper did two c116's, before-and-after
Peter's new improvements, and set a c112 crossover point for qs/gnfs
for the first time). Looking forward to a first msieve factor, as well
as the factors from the c260 base-12 of Greg/Richard/bd. -bd

frmky 2008-05-03 16:27

Our 8-cpu quad-core Barcelona arrived yesterday, so I can throw in another data point. Like Bruce, I haven't measured the total time to complete a matrix (yet) but have only looked at the reports from top. On this computer, it appears that additional threads are doing (at least some) additional work up to 16 threads.

[CODE] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
21863 childers 25 0 3266m 3.0g 860 R 918 4.8 5062:10 21 ./msieve -ncr -nc3 -t 16 -v
21675 childers 25 0 4870m 4.6g 904 R 765 7.3 2694:02 3 ./msieve -nc -t 16 -v [/CODE]

Adding additional threads beyond 16 actually slows the computation down, presumably due to the communication overhead in the synchronization. Because of the NUMA architecture of the Opterons, it's very important to have these 16 threads running on only 4 cpus. By default, Linux doesn't seem to care which cpus it launches the threads on, and if they are spread across the cpus in the machine, it slows down dramatically.
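On Linux the pinning can be done from inside the process with the standard library; a minimal sketch (treating the first four available logical CPUs as one socket is an assumption about the machine's topology, and `taskset -c 0-3 ./msieve ...` from the shell would do the same job):

```python
import os

# Linux-specific: restrict this process, and any threads it spawns,
# to the first four available logical CPUs, standing in for the four
# cores of one socket so the threads stay near their NUMA-local memory.
avail = sorted(os.sched_getaffinity(0))
one_socket = set(avail[:4])
os.sched_setaffinity(0, one_socket)  # 0 means "the calling process"

print(sorted(os.sched_getaffinity(0)))
```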

Greg

jasonp 2008-05-03 16:57

[QUOTE=bdodson;132653]
There appears to be a master thread, which subcontracts some of its computation to a 2nd thread; or to 3 extra threads (-t 4); or more, I checked up to 11 extra threads. As near as I could tell (recall that I don't code, and in this case haven't even opened Jason's files --- although he did get me through a compile), the master thread appears to have a limited amount of possible computation to sub-contract out to additional threads. So for a fixed matrix size and hardware configuration there will be a "sweet spot" ("local max", if one prefers) where reducing the number of threads slows down speed/performance, but adding threads (much less doubling them!) doesn't improve matters much.
[/QUOTE]
When using more than one thread, the first acts as master and farms out work to the others. The matrix is statically partitioned into slabs of columns, one slab per thread, with the master activating threads from the pool in round-robin fashion and taking the last slab of columns for itself (this saves two synchronize operations per matrix multiply). The extra memory use is also expected; each thread needs a scratch vector for temporary results that are XORed together after the matrix-vector product is completed. Multiplying by the transpose of the matrix is easier, each thread just fills in a range of the output vector. The extra memory isn't strictly necessary, but avoiding it would require giving each thread a slab of rows rather than columns, and since the density of nonzeros across rows is highly nonuniform it becomes harder to statically partition the matrix.
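A toy model of that partitioning, with the matrix over GF(2) stored as per-column lists of nonzero row indices and one scratch vector per thread, XORed together at the end (an illustration of the scheme described above, not msieve's actual code):

```python
import threading

def spmv_gf2(cols, x, nrows, nthreads=4):
    """y = B*x over GF(2): cols[j] lists the nonzero row indices of
    column j, and x[j] is a machine word (64 bits of vector per column,
    as in block Lanczos). Columns are statically cut into one slab per
    thread; each thread accumulates into its own scratch vector."""
    scratch = [[0] * nrows for _ in range(nthreads)]
    ncols = len(cols)

    def worker(t):
        y = scratch[t]
        for j in range(t * ncols // nthreads, (t + 1) * ncols // nthreads):
            w = x[j]
            for r in cols[j]:
                y[r] ^= w

    pool = [threading.Thread(target=worker, args=(t,))
            for t in range(nthreads - 1)]
    for th in pool:
        th.start()
    worker(nthreads - 1)  # the master takes the last slab itself
    for th in pool:
        th.join()

    # XOR the per-thread partial results into one output vector.
    y = scratch[0]
    for t in range(1, nthreads):
        for r in range(nrows):
            y[r] ^= scratch[t][r]
    return y
```

The final XOR pass is the price of the per-thread scratch vectors; partitioning by rows instead would avoid it, but, as noted above, the nonuniform row density makes static row slabs hard to balance.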

Previous experience with multiple threads definitely shows there's a point of diminishing returns, since the matrix entries have to stream in from RAM even when the vectors to be multiplied are decomposed into blocks that fit into cache. Running two threads on a dual-CPU machine produces a larger speedup than on a dual-core machine, so there is an implicit dependence on the number of bus wires that can be brought to bear on the problem.

I originally assumed that the best runtime would require the largest possible blocks of work between thread synchronizations, but lately I wonder if it would be better to treat each block of the matrix as a work unit to be scheduled, perhaps allocated using a 2-D space-filling tiling of the plane. The load balancing would be much easier, and maybe the cache performance could be improved. I also have not experimented with local allocation of the matrix entries, and that could make a big difference on a NUMA architecture.
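The alternative floated above, treating each matrix block as a schedulable work unit, could look like this toy dynamic scheduler (a hypothetical sketch; a real version would order the (i, j) blocks along a space-filling curve for cache reuse):

```python
import queue
import threading

def schedule_blocks(nrow_blocks, ncol_blocks, nthreads, process_block):
    """Hand out 2-D matrix blocks to a thread pool dynamically, instead
    of statically assigning one column slab per thread; load balancing
    then comes for free."""
    work = queue.Queue()
    for i in range(nrow_blocks):
        for j in range(ncol_blocks):
            work.put((i, j))

    def worker():
        while True:
            try:
                i, j = work.get_nowait()
            except queue.Empty:
                return
            process_block(i, j)

    pool = [threading.Thread(target=worker) for _ in range(nthreads)]
    for th in pool:
        th.start()
    for th in pool:
        th.join()
```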

PS: Greg, the changes to make each thread allocate matrix entries for itself are pretty straightforward, I'll try to get those changes into the next version

jasonp 2008-05-03 19:45

multithreading with node-local memory
 
1 Attachment(s)
The two files in the attached zip implement node-local thread memory allocation when running multithreaded linear algebra, as well as a few other minor changes. Just overwrite the corresponding files with the new version and recompile the library. Anyone who feels brave can give them a try, though I don't recommend relying on them for big long-running jobs.

bdodson 2008-05-05 13:29

[QUOTE=jasonp;132650] Out of curiosity, what were the final reported matrix densities before the Lanczos iterations started?
[/QUOTE]

Leaving aside the first 48 matrix rows,

3,536+ "sparse part has weight 874437208 (61.45/col)"

10,257- "sparse part has weight 858020543 (61.56/col)" and

10,257+ "sparse part has weight 952964006 (61.56/col)".
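Dividing each sparse weight by its per-column average recovers the matrix dimensions, which match the sizes quoted earlier in the thread (14.2M^2, 13.9M^2 and 15.479M^2):

```python
# Sparse weight and average nonzeros per column, from the figures above.
matrices = {
    "3,536+":  (874437208, 61.45),
    "10,257-": (858020543, 61.56),
    "10,257+": (952964006, 61.56),
}
for name, (weight, per_col) in matrices.items():
    print(name, round(weight / per_col / 1e6, 2), "M columns")
# 3,536+ 14.23, 10,257- 13.94, 10,257+ 15.48 -- the quoted matrix sizes
```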

Current status is 57.4%, 29.6% and 6.5%. Maybe a month
and a half for 10,257+, if I'm estimating the walltime less
reboots correctly. -Bruce

PS -- with the first 48 rows, that was

"weight 1266010783 (88.97/col) sparse part 898702280 (63.16/col)"
"weight 1239088034 (88.90/col) sparse part 879366766 (63.09/col)"
"weight 1382748757 (89.33/col) sparse part 983086477 (63.51/col)"

How do these compare with the known cases that produced
false dependencies?

fivemack 2008-05-07 07:11

I would be reasonably confident that those will work; false dependencies in my experience have come from a sparse-part weight below 57 or so.

frmky 2008-05-07 22:38

[QUOTE=frmky;132658]Our 8-cpu quad-core Barcelona arrived yesterday, so I can throw in another data point. Like Bruce, I haven't measured the total time to complete a matrix (yet) but have only looked at the reports from top. On this computer, it appears that additional threads are doing (at least some) additional work up to 16 threads.[/QUOTE]

Nope. All wrong. Top lies.

I inserted code in msieve to output the total time to solve the matrix given the current time elapsed and dimensions solved. It turns out that the fastest is two threads, each on a single core of a separate processor. Second fastest is a single thread running alone on a processor. Adding a second thread to a quadcore processor slows the runtime. This surprises me, as I expected memory bandwidth on an Opteron Barcelona not to be a bottleneck for a second thread. Anything beyond 2 threads slows the runtime, no matter how they are allocated.

Each core of a Barcelona processor has a 512 KB L2 cache, but all 4 cores share a 2 MB L3 cache. msieve chooses a cache blocksize based on the L2 cache size, but increasing the blocksize to the maximum (by telling msieve that the L2 cache was 2 MB) almost halved the matrix solve time.

I've done only a few tests with Jason's Lanczos changes for local memory, but in those it didn't really seem to make any significant difference in the runtime.

Greg

jasonp 2008-05-07 22:46

[QUOTE=frmky;132979]
Each core of a Barcelona processor has a 512 KB L2 cache, but all 4 cores share a 2 MB L3 cache. msieve chooses a cache blocksize based on the L2 cache size, but increasing the blocksize to the maximum (by telling msieve that the L2 cache was 2 MB) almost halved the matrix solve time.
[/QUOTE]
For Intel processors the library will take the maximum of the L2 and L3 cache sizes. AMD processors had only L1 and L2 caches until recently; the changes to find the L3 size should be pretty easy.

R.D. Silverman 2008-05-08 01:31

[QUOTE=jasonp;132980]For Intel processors the library will take the maximum of the L2 and L3 cache sizes. AMD processors had only L1 and L2 caches until recently; the changes to find the L3 size should be pretty easy.[/QUOTE]

I'd like to try out the msieve solver, but my code emits and works
with the CWI format. Is there a CWI --> msieve relation converter?

