[QUOTE=bdodson;132633]
Uhm, that's 14 threads: four each on the first two matrices, six on the new one, 10,257+. A monster, by the way, at 15.479M^2, despite having sieved the same region. Switching from x^6 - 10 to x^6 + 10 seems to have been a radical alteration in the roots; even the ratings of the polynomials look different: ...[/QUOTE] Note that x^6 + 10 has no real roots....
[QUOTE=bdodson;132633]
Uhm, that's 14 threads: four each on the first two matrices, six on the new one, 10,257+. A monster, by the way, at 15.479M^2, despite having sieved the same region. Switching from x^6 - 10 to x^6 + 10 seems to have been a radical alteration in the roots; even the ratings of the polynomials look different: size score = 1.839597e-12, Murphy alpha = 1.635261, combined = 1.152956e-12 for 257+, and size score = 2.036089e-12, Murphy alpha = 1.321650, combined = 1.395729e-12 for 257-, despite the two having the same "difficulty".[/QUOTE] With three huge matrix jobs going, I really hope you don't run into the mysterious problems that cause trivial dependencies to be found. Out of curiosity, what were the final reported matrix densities before the Lanczos iterations started? The difference in combined scores above is supposed to indicate that 257- will give a siever about 1.39/1.15 = 1.21 times as many relations to find as 257+.
[QUOTE=R.D. Silverman;132646]Note that x^6 + 10 has no real roots....[/QUOTE]
I was presuming this would be a softball, directly in Bob's "wheelhouse"; thanks for the confirmation.

There's been substantial speculation about how msieve would run with additional threads on an SMP machine. (We paid extra for the SGI, 8 quad-cores -- way more per core than for the 40 dual-Xeon quad-core "whitebox" nodes; so 32 cores, with some 29 available, where msieve is running, vs. 320 cores with exclusive scheduling under Condor -- no logins to the compute nodes -- where 3/4 of the gnfs-lasieve jobs are running, the other 1/4 on some 100+ Opterons, including the Condor master for both clusters. Peak sieving was at c. 350 cores; several days at 0 under heavy use by others.) I'm appending three text files below, with commentary, with data to support my current view(s).

One of my friends reports an analysis suggesting that msieve might not perform much better than a generic program without much parallelism, so that one would need to double the number of threads just to cut the runtime in half. That's not what I see when doubling 4 threads to 8, or tripling to -t 12; neither of which seemed to me to be running well. There appears to be a master thread, which subcontracts some of its computation to a 2nd thread, or to 3 extra threads (-t 4), or more; I checked up to 11 extra threads. As near as I could tell (recall that I don't code, and in this case haven't even opened Jason's files --- although he did get me through a compile), the master thread appears to have a limited amount of computation available to subcontract out to additional threads. So for a fixed matrix size and hardware configuration there will be a "sweet spot" ("local max", if one prefers) where reducing the number of threads hurts performance, but adding threads (much less doubling them!) doesn't improve matters much.
I'm just speculating here (since all of my matrices are large), but it seems to me that for small problems 2 threads (total) might be better than 4, or at least there would be no substantial improvement at 4 [corrections and/or objections?]. The data I'm reporting suggests that for matrices in the range of 15M^2 (on this hardware), using 6 threads is an improvement over 4 threads; but the current msieve structure doesn't need/want 8 threads (at least, not 12 threads). I'll start with why I concluded that 6 threads is worthwhile. Here's -t 6:

[code]
21413 bad0 25 0 5326m 5.2g 476 R 505 4.7 2542:14 msieve
24569 bad0 25 0 4661m 4.5g 476 R 398 4.1 10949:28 msieve
24614 bad0 25 0 4766m 4.6g 476 S 272 4.2 10739:29 msieve

--- showing threads:

Tasks: 473 total, 14 running,
Mem: 114835776k total, 98371768k used, 16464008k free, 96128k buffers
Swap: 125009788k total, 0k used, 125009788k free, 72856876k cached

 PID  USER PR NI VIRT  RES  SHR S %CPU %MEM TIME+     COMMAND
21449 bad0 25 0 5326m 5.2g 476 R 103  4.7  385:12.11 msieve
24569 bad0 25 0 4661m 4.5g 476 R 103  4.1 4040:18    msieve [1]
24614 bad0 25 0 4766m 4.6g 476 R 103  4.2 3881:42    msieve [2]
21413 bad0 25 0 5326m 5.2g 476 R 103  4.7  823:11.22 msieve [3]
21453 bad0 25 0 5326m 5.2g 476 R 102  4.7  353:47.64 msieve
24642 bad0 23 0 4766m 4.6g 476 R  88  4.2 2124:01    msieve
24645 bad0 23 0 4766m 4.6g 476 R  86  4.2 2385:25    msieve
24646 bad0 20 0 4766m 4.6g 476 R  61  4.2 2354:26    msieve
21452 bad0 22 0 5326m 5.2g 476 S  41  4.7  334:38.59 msieve
21424 bad0 25 0 5326m 5.2g 476 S  40  4.7  352:26.46 msieve
24617 bad0 19 0 4661m 4.5g 476 R  28  4.1 2220:36    msieve
21423 bad0 21 0 5326m 5.2g 476 S   0  4.7  300:33.46 msieve
24620 bad0 25 0 4661m 4.5g 476 S   0  4.1 2326:51    msieve
24624 bad0 25 0 4661m 4.5g 476 S   0  4.1 2367:39    msieve
[/code]

These are my three matrices. There are two longer-running ones with 4 threads, and at this point I've added a third with 6 threads.
The ones showing "4.5g, 4.1%" memory are from the small matrix (13.9M^2, from 10,257-, difficulty 257); the ones with "4.6g, 4.2%" are from the midsize 14.2M^2 (3,536+ c252); and the ones with "5.2g, 4.7%" are from the new/large matrix, the one with 6 threads. The first listing (without showing threads) gives the total time over all threads for each of the three computations. In the listing showing threads I've labeled what I'm referring to as the "master thread" for each matrix as [1], [2] & [3]. The calculation I found informative compares the total amount of computation done by the slave threads with twice the time spent by the master thread:

[code]
385+353+334+352+300 = 1724 > 2*823   [3]
2124+2385+2354      = 6863 < 2*3881  [2]
2220+2326+2367      = 6913 < 2*4040  [1]
[/code]

Seems to me that the extra time spent by the additional two threads is doing something useful [IMHO].

Next, let's look at what happens with 12 threads (done first). I took two readings, both showing the threads. This time the threads for the large matrix [3'] are using "5.8g, 5.3%", a substantial jump in memory to begin with. The first 75 min or so of the master thread's time was spent on the initial matrix build (once only, provided one remembers -ncr instead of -nc2 --- which overwrites the matrix data, making saved checkpoints of no use ... I guess that file ought to be saved as well, for those of us thinking about saving things. Peter alternates chk-odd and chk-even, so as not to lose the checkpoint when the computer is going down for a reboot "in 2 minutes ..."). What bothers me here is the % CPU use by the other 11 threads ("slaves"? is there a better term?). The 28% was a relatively large reading [I'm not claiming any of the displays are random; more like peaks]. Just for completeness, the 2nd reading is a way-out outlier, which hits 44%; but at no time did I observe the master thread having enough tasks to get the other threads above 50%, and the other peak uses seemed to cluster around 27%/28%.
That's not good CPU use, even when there aren't other users who might want the cycles. I didn't see much difference with 8 threads, not enough that I even saved a reading (sorry). So here's 12 threads:

[code]
Tasks: 472 total, 15 running, 457 sleeping, 0 stopped, 0 zombie
Cpu(s): 24.4%us, 0.4%sy, 0.0%ni, 75.3%id, 0.0%wa, ...
Mem: 114835776k total, 94949716k used, 19886060k free, 79708k buffers
Swap: 125009788k total, 0k used, 125009788k free, 77057740k cached

 PID  USER PR NI VIRT  RES  SHR S %CPU %MEM TIME+     COMMAND
11556 bad0 25 0 6074m 5.8g 484 R 103  5.3  159:24.40 msieve [3']
24569 bad0 25 0 4661m 4.5g 476 R 101  4.1 2747:42    msieve
24614 bad0 19 0 4766m 4.6g 476 R 100  4.2 2648:28    msieve
24624 bad0 25 0 4661m 4.5g 476 S  67  4.1 1590:22    msieve
24620 bad0 25 0 4661m 4.5g 476 S  56  4.1 1576:26    msieve
24617 bad0 25 0 4661m 4.5g 476 S  52  4.1 1518:06    msieve
12153 bad0 19 0 6074m 5.8g 484 R  28  5.3   13:26.47 msieve
12154 bad0 19 0 6074m 5.8g 484 R  28  5.3   15:29.80 msieve
12155 bad0 19 0 6074m 5.8g 484 R  28  5.3   10:20.64 msieve
12157 bad0 19 0 6074m 5.8g 484 R  28  5.3   10:46.38 msieve
12158 bad0 19 0 6074m 5.8g 484 R  28  5.3   15:22.78 msieve
12162 bad0 19 0 6074m 5.8g 484 R  28  5.3   14:39.83 msieve
12164 bad0 19 0 6074m 5.8g 484 R  28  5.3   16:09.36 msieve
12165 bad0 19 0 6074m 5.8g 484 R  28  5.3   15:31.10 msieve
12166 bad0 19 0 6074m 5.8g 484 R  28  5.3   13:39.44 msieve
12167 bad0 19 0 6074m 5.8g 484 R  28  5.3   10:48.01 msieve
12169 bad0 19 0 6074m 5.8g 484 R  28  5.3   14:41.54 msieve
24645 bad0 25 0 4766m 4.6g 476 S   4  4.2 1614:29    msieve
...
24642 bad0 25 0 4766m 4.6g 476 S   0  4.2 1427:52    msieve
24646 bad0 25 0 4766m 4.6g 476 S   0  4.2 1595:21    msieve

------ c. 20 cpu-minutes later:

24569 bad0 25 0 4661m 4.5g 476 R  99  4.1 2769:57    msieve
24614 bad0 25 0 4766m 4.6g 476 R  99  4.2 2670:49    msieve
11556 bad0 25 0 6074m 5.8g 484 R  99  5.3  183:20.23 msieve
24646 bad0 25 0 4766m 4.6g 476 S  64  4.2 1607:02    msieve
24620 bad0 25 0 4661m 4.5g 476 S  55  4.1 1589:19    msieve
24645 bad0 25 0 4766m 4.6g 476 S  48  4.2 1627:04    msieve
24617 bad0 25 0 4661m 4.5g 476 S  46  4.1 1530:30    msieve
24624 bad0 25 0 4661m 4.5g 476 S  46  4.1 1602:15    msieve
12153 bad0 19 0 6074m 5.8g 484 R  44  5.3   17:28.40 msieve
12154 bad0 19 0 6074m 5.8g 484 R  44  5.3   19:41.78 msieve
12155 bad0 20 0 6074m 5.8g 484 R  44  5.3   12:58.44 msieve
12157 bad0 20 0 6074m 5.8g 484 R  44  5.3   13:34.84 msieve
12158 bad0 19 0 6074m 5.8g 484 R  43  5.3   19:31.82 msieve
12165 bad0 19 0 6074m 5.8g 484 R  43  5.3   19:55.62 msieve
12166 bad0 19 0 6074m 5.8g 484 R  43  5.3   17:27.61 msieve
12162 bad0 19 0 6074m 5.8g 484 R  43  5.3   18:41.72 msieve
12164 bad0 19 0 6074m 5.8g 484 R  43  5.3   20:55.56 msieve
12167 bad0 19 0 6074m 5.8g 484 R  43  5.3   13:32.75 msieve
12169 bad0 19 0 6074m 5.8g 484 R  43  5.3   18:48.96 msieve
24642 bad0 25 0 4766m 4.6g 476 S   0  4.2 1438:41    msieve
[/code]

One more set with 6 threads, this time with all three matrices previously built, a clean start following a reboot:

[code]
Friday pm, re-start:

10129 bad0 25 0 5326m 5.2g 476 S 591  4.7   65:45.73 msieve [3'']
10123 bad0 25 0 4661m 4.5g 476 S 398  4.1   65:02.85 msieve
10105 bad0 18 0 4766m 4.6g 476 S 310  4.2   57:28.75 msieve

--------- sat am, may03:

10129 bad0 25 0 5326m 5.2g 476 S 499  4.7 1840:09    msieve [3'']
10105 bad0 25 0 4766m 4.6g 476 R 384  4.2 1554:44    msieve
10123 bad0 25 0 4661m 4.5g 476 R 277  4.1 1587:03    msieve

-------- threads:

10169 bad0 25 0 5326m 5.2g 476 R 102  4.7  234:11.45 msieve
10172 bad0 25 0 5326m 5.2g 476 R 101  4.7  217:03.43 msieve
10177 bad0 25 0 5326m 5.2g 476 S  84  4.7  233:12.79 msieve
10173 bad0 25 0 5326m 5.2g 476 S  83  4.7  245:12.68 msieve
10129 bad0 25 0 5326m 5.2g 476 S  77  4.7  527:39.62 msieve [3'']
10105 bad0 20 0 4766m 4.6g 476 R  75  4.2  498:41.07 msieve [2'']
10178 bad0 25 0 5326m 5.2g 476 S  74  4.7  240:16.63 msieve
10123 bad0 19 0 4661m 4.5g 476 R  60  4.1  514:17.28 msieve [1'']
10135 bad0 25 0 4661m 4.5g 476 S  58  4.1  336:47.76 msieve
10168 bad0 25 0 4766m 4.6g 476 S  57  4.2  302:31.43 msieve
10171 bad0 25 0 4766m 4.6g 476 S  42  4.2  310:40.66 msieve
10138 bad0 25 0 4661m 4.5g 476 S  42  4.1  303:04.09 msieve
10174 bad0 25 0 4766m 4.6g 476 S  41  4.2  325:08.25 msieve
10137 bad0 25 0 4661m 4.5g 476 S  33  4.1  314:00.02 msieve

------------ 5 min later:

10174 bad0 25 0 4766m 4.6g 476 R 100  4.2  327:48.39 msieve
10123 bad0 25 0 4661m 4.5g 476 R 100  4.1  517:35.61 msieve
10105 bad0 25 0 4766m 4.6g 476 R 100  4.2  502:19.29 msieve
10129 bad0 25 0 5326m 5.2g 476 R  98  4.7  531:01.45 msieve
10168 bad0 25 0 4766m 4.6g 476 S  82  4.2  305:28.98 msieve
10171 bad0 25 0 4766m 4.6g 476 S  74  4.2  313:14.15 msieve
10135 bad0 21 0 4661m 4.5g 476 R  66  4.1  339:48.34 msieve
10137 bad0 21 0 4661m 4.5g 476 R  65  4.1  316:29.86 msieve
10138 bad0 21 0 4661m 4.5g 476 R  62  4.1  305:46.22 msieve
10169 bad0 19 0 5326m 5.2g 476 R  34  4.7  236:45.32 msieve
10172 bad0 18 0 5326m 5.2g 476 R  10  4.7  219:38.43 msieve
10173 bad0 18 0 5326m 5.2g 476 R   6  4.7  247:07.66 msieve
10177 bad0 18 0 5326m 5.2g 476 R   5  4.7  235:08.08 msieve
10178 bad0 18 0 5326m 5.2g 476 R   3  4.7  242:03.25 msieve

--- another 5 min:

10168 bad0 25 0 4766m 4.6g 476 R 100  4.2  307:36.20 msieve
10171 bad0 25 0 4766m 4.6g 476 R 100  4.2  315:09.34 msieve
10174 bad0 25 0 4766m 4.6g 476 R 100  4.2  329:51.74 msieve
10135 bad0 25 0 4661m 4.5g 476 R 100  4.1  341:59.71 msieve
10137 bad0 25 0 4661m 4.5g 476 R 100  4.1  318:24.65 msieve
10129 bad0 23 0 5326m 5.2g 476 R 100  4.7  533:36.07 msieve
10138 bad0 25 0 4661m 4.5g 476 R 100  4.1  307:48.47 msieve
10105 bad0 25 0 4766m 4.6g 476 S  98  4.2  505:06.21 msieve
10123 bad0 25 0 4661m 4.5g 476 S  93  4.1  520:03.12 msieve
10169 bad0 21 0 5326m 5.2g 476 R  66  4.7  238:24.80 msieve
10177 bad0 21 0 5326m 5.2g 476 R  66  4.7  236:36.33 msieve
10172 bad0 21 0 5326m 5.2g 476 R  66  4.7  221:29.64 msieve
10173 bad0 21 0 5326m 5.2g 476 R  66  4.7  248:35.45 msieve
10178 bad0 21 0 5326m 5.2g 476 R  66  4.7  243:24.18 msieve
[/code]

At least the peaks look better, and the big matrix [3''] is getting in some extra time; good, considering that the mid-size one is past 50% and well on its way to 4 weeks of walltime. Have I mentioned that this is my first msieve use/matrix? Seems consistent with my first quadratic sieve (RSA120, the first time, with a report at the Crypto just before the start of RSA129; both being the first use of "larger large primes") and my first gnfs (a c119 partition number, the first application of gnfs to a crypto-sized number, 384 bits at the time; so early that it was before Lanczos and the Montgomery sqrt --- those came on a CWI c106 snfs, iirc --- our paper did two c116's, before-and-after Peter's new improvements, and set a c112 crossover point for qs/gnfs for the first time). Looking forward to a first msieve factor, as well as the factors from the c260 base-12 of Greg/Richard/bd. -bd
Our 8-CPU quad-core Barcelona arrived yesterday, so I can throw in another data point. Like Bruce, I haven't measured the total time to complete a matrix (yet) but have only looked at the reports from top. On this computer, it appears that additional threads are doing (at least some) additional work up to 16 threads.
[CODE]
 PID  USER     PR NI VIRT  RES  SHR S %CPU %MEM TIME+    P COMMAND
21863 childers 25 0 3266m 3.0g 860 R 918  4.8 5062:10 21 ./msieve -ncr -nc3 -t 16 -v
21675 childers 25 0 4870m 4.6g 904 R 765  7.3 2694:02  3 ./msieve -nc -t 16 -v
[/CODE]

Adding threads beyond 16 actually slows the computation down, presumably due to the communication overhead in the synchronization. Because of the NUMA architecture of the Opterons, it's very important to have these 16 threads running on only 4 CPUs. By default, Linux doesn't seem to care which CPUs it launches the threads on, and if they are spread across the CPUs in the machine, it slows down dramatically.

Greg
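Greg's point about thread placement can be illustrated with the GNU affinity API. This is only an illustrative fragment, not anything from msieve itself, and `pin_to_core` is an invented helper name:

```c
/* Sketch of pinning the calling thread to one core with the GNU
 * affinity extension, so NUMA-local data stays near the thread
 * that uses it. pin_to_core is an invented helper, not msieve code. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core; returns 0 on success. */
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}
```

Calling something like this from each worker thread at startup (with a distinct core per thread) is one way to stop the kernel from spreading the threads across sockets.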
[QUOTE=bdodson;132653]
There appears to be a master thread, which subcontracts some of its computation to a 2nd thread, or to 3 extra threads (-t 4), or more; I checked up to 11 extra threads. As near as I could tell (recall that I don't code, and in this case haven't even opened Jason's files --- although he did get me through a compile), the master thread appears to have a limited amount of computation available to subcontract out to additional threads. So for a fixed matrix size and hardware configuration there will be a "sweet spot" ("local max", if one prefers) where reducing the number of threads hurts performance, but adding threads (much less doubling them!) doesn't improve matters much. [/QUOTE] When using more than one thread, the first acts as master and farms out work to the others. The matrix is statically partitioned into slabs of columns, one slab per thread, with the master activating threads from the pool in round-robin fashion and taking the last slab of columns for itself (this saves two synchronization operations per matrix multiply). The extra memory use is also expected: each thread needs a scratch vector for temporary results that are XORed together after the matrix-vector product is completed. Multiplying by the transpose of the matrix is easier; each thread just fills in a range of the output vector. The extra memory isn't strictly necessary, but avoiding it would require giving each thread a slab of rows rather than columns, and since the density of nonzeros across rows is highly nonuniform, it becomes harder to statically partition the matrix. Previous experience with multiple threads definitely shows there's a point of diminishing returns, since the matrix entries have to stream in from RAM even when the vectors to be multiplied are decomposed into blocks that fit in cache.
Running two threads on a dual-CPU machine produces a larger speedup than on a dual-core machine, so there is an implicit dependence on the number of bus wires that can be brought to bear on the problem. I originally assumed that the best runtime would require the largest possible blocks of work between thread synchronizations, but lately I wonder if it would be better to treat each block of the matrix as a work unit to be scheduled, perhaps allocated using a 2-D space-filling tiling of the plane. The load balancing would be much easier, and maybe the cache performance could be improved. I also have not experimented with local allocation of the matrix entries, and that could make a big difference on a NUMA architecture.

PS: Greg, the changes to make each thread allocate matrix entries for itself are pretty straightforward; I'll try to get those changes into the next version.
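The column-slab scheme described above might look roughly like this toy sketch. The matrix layout and every name here are invented for illustration; this is not msieve's code. Each worker owns a slab of columns and a private scratch bit vector, the master takes the last slab itself, and the partial results are XORed together once every thread has finished:

```c
/* Toy sketch of a column-slab GF(2) matrix-vector product with
 * per-thread scratch vectors (invented layout, not msieve's code). */
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NROWS 128
#define NCOLS 128
#define NWORDS (NROWS / 64)
#define NTHREADS 4

typedef struct {
    const uint32_t *col_rows;   /* row indices, concatenated column-major */
    const uint32_t *col_start;  /* column j = col_rows[col_start[j] .. col_start[j+1]) */
    const uint64_t *x;          /* input bit vector, one bit per column */
    uint64_t scratch[NWORDS];   /* this thread's partial result */
    int lo, hi;                 /* slab of columns owned by this thread */
} slab_t;

static void *mul_slab(void *arg) {
    slab_t *s = arg;
    memset(s->scratch, 0, sizeof(s->scratch));
    for (int j = s->lo; j < s->hi; j++) {
        if (!((s->x[j / 64] >> (j % 64)) & 1))
            continue;           /* bit j of x is 0: column j contributes nothing */
        for (uint32_t k = s->col_start[j]; k < s->col_start[j + 1]; k++) {
            uint32_t r = s->col_rows[k];
            s->scratch[r / 64] ^= 1ULL << (r % 64);
        }
    }
    return NULL;
}

/* y = A*x over GF(2); columns statically partitioned into NTHREADS slabs */
void matvec(const uint32_t *col_rows, const uint32_t *col_start,
            const uint64_t *x, uint64_t *y) {
    pthread_t tid[NTHREADS];
    static slab_t s[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        s[t].col_rows = col_rows;
        s[t].col_start = col_start;
        s[t].x = x;
        s[t].lo = t * NCOLS / NTHREADS;
        s[t].hi = (t + 1) * NCOLS / NTHREADS;
        if (t < NTHREADS - 1)
            pthread_create(&tid[t], NULL, mul_slab, &s[t]);
    }
    mul_slab(&s[NTHREADS - 1]);      /* master works on the last slab itself */
    memset(y, 0, NWORDS * sizeof(uint64_t));
    for (int t = 0; t < NTHREADS; t++) {
        if (t < NTHREADS - 1)
            pthread_join(tid[t], NULL);
        for (int w = 0; w < NWORDS; w++)
            y[w] ^= s[t].scratch[w]; /* combine the per-thread partials */
    }
}
```

Because each thread writes only its own scratch vector, no locking is needed during the multiply; the only synchronization is the join-and-XOR at the end, which is the cost the slab partition is designed to minimize.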
multithreading with node-local memory
1 Attachment(s)
The two files in the attached zip implement node-local thread memory allocation when running multithreaded linear algebra, as well as a few other minor changes. Just overwrite the corresponding files with the new version and recompile the library. Anyone who feels brave can give them a try, though I don't recommend relying on them for big long-running jobs.
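For anyone wondering what node-local allocation typically means in practice on Linux: one common idiom is "first touch" placement, where the kernel puts each page on the NUMA node of the thread that first writes it. I'm only guessing that the patch works along similar lines; the sketch below (all names invented) just demonstrates the idiom itself:

```c
/* First-touch sketch (an assumption about the approach, not the
 * attached patch): each worker memsets the slab of the buffer it will
 * later read, so the kernel places those pages on that worker's node. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 16

typedef struct { unsigned char *base; size_t len; } ft_slab_t;

static void *first_touch(void *arg) {
    ft_slab_t *s = arg;
    /* Writing the pages from this thread faults them in locally. */
    memset(s->base, 0, s->len);
    return NULL;
}

/* Allocate 'total' bytes and let each of 'nthreads' workers touch the
 * slab it will later operate on. Returns NULL on failure. */
unsigned char *alloc_node_local(size_t total, int nthreads) {
    unsigned char *buf = malloc(total);
    pthread_t tid[MAX_THREADS];
    ft_slab_t s[MAX_THREADS];
    if (buf == NULL || nthreads < 1 || nthreads > MAX_THREADS)
        return NULL;
    size_t slab = total / nthreads;
    for (int t = 0; t < nthreads; t++) {
        s[t].base = buf + t * slab;
        s[t].len = (t == nthreads - 1) ? total - t * slab : slab;
        pthread_create(&tid[t], NULL, first_touch, &s[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return buf;
}
```

This only helps, of course, if the threads are also pinned so that the thread that touched a slab keeps running on the same node afterwards.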
[QUOTE=jasonp;132650] Out of curiosity, what were the final reported matrix densities before the Lanczos iterations started?
[/QUOTE] Leaving aside the first 48 matrix rows:

3,536+  "sparse part has weight 874437208 (61.45/col)"
10,257- "sparse part has weight 858020543 (61.56/col)"
10,257+ "sparse part has weight 952964006 (61.56/col)"

Current status is 57.4%, 29.6% and 6.5%, respectively. Maybe a month and a half for 10,257+, if I'm estimating the walltime less reboots correctly. -Bruce

PS -- with the first 48 rows, those were

"weight 1266010783 (88.97/col) sparse part 898702280 (63.16/col)"
"weight 1239088034 (88.90/col) sparse part 879366766 (63.09/col)"
"weight 1382748757 (89.33/col) sparse part 983086477 (63.51/col)"

Is there a comparison between these and the known cases with false dependencies?
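As a sanity check on these figures: dividing each total sparse weight by its per-column average recovers the matrix dimensions quoted earlier in the thread (13.9M^2, 14.2M^2 and 15.479M^2). A throwaway helper, not anything from msieve:

```c
/* Consistency check on the reported densities: total nonzeros divided
 * by the average weight per column gives the number of columns. */
double matrix_dim(double sparse_weight, double weight_per_col) {
    return sparse_weight / weight_per_col;
}
/* matrix_dim(858020543, 61.56) ~ 13.94e6  (10,257-, "13.9M^2")
   matrix_dim(874437208, 61.45) ~ 14.23e6  (3,536+,  "14.2M^2")
   matrix_dim(952964006, 61.56) ~ 15.48e6  (10,257+, "15.479M^2") */
```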
I would be reasonably confident that those will work; in my experience, false dependencies have come from a sparse-part weight below 57 or so.
[QUOTE=frmky;132658]Our 8-cpu quad-core Barcelona arrived yesterday, so I can throw in another data point. As Bruce, I haven't measured the total time to complete a matrix (yet) but have only looked at the reports by top. On this computer, it appears that additional threads are doing (at least some) additional work up to 16 threads.[/QUOTE]
Nope. All wrong. Top lies. I inserted code in msieve to output the projected total time to solve the matrix, given the elapsed time and the dimensions solved so far. It turns out that the fastest is two threads, each on a single core of a separate processor. Second fastest is a single thread running alone on a processor. Adding a second thread to a quad-core processor increases the runtime. This surprises me, as I expected memory bandwidth on a Barcelona Opteron not to be a bottleneck for a second thread. Anything beyond 2 threads slows the runtime, no matter how they are allocated.

Each core of a Barcelona processor has a 512 KB L2 cache, but all 4 cores share a 2 MB L3 cache. msieve chooses a cache blocksize based on the L2 cache size, but increasing the blocksize to the maximum (by telling msieve that the L2 cache was 2 MB) almost halved the matrix solve time. I've done only a few tests with Jason's Lanczos changes for local memory, but in those it didn't really seem to make any significant difference in the runtime.

Greg
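The trade-off behind Greg's blocksize experiment can be stated as a tiny helper. This is an invented heuristic for illustration only, not msieve's actual blocking logic: convert a cache size into a number of 64-bit vector words to keep resident, reserving the other half of the cache for the matrix entries streaming through:

```c
/* Invented illustration of cache blocking (not msieve's heuristic):
 * how many 64-bit words of the x/y bit vectors to keep cache-resident,
 * leaving half the cache for the streamed matrix entries. */
#include <stdint.h>
#include <stddef.h>

size_t block_words(size_t cache_bytes) {
    return (cache_bytes / 2) / sizeof(uint64_t);
}
/* With the 512 KB L2 this gives 32768 words per block; pretending the
 * cache is the 2 MB L3 quadruples the block to 131072 words, so each
 * pass over the matrix crosses far fewer block boundaries. */
```

A bigger block means fewer boundary crossings (and fewer synchronizations) per multiply, which is one plausible reason lying to msieve about the L2 size helped so much on this chip.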
[QUOTE=frmky;132979]
Each core of a Barcelona processor has a 512 KB L2 cache, but all 4 cores share a 2 MB L3 cache. msieve chooses a cache blocksize based on the L2 cache size, but increasing the blocksize to the maximum (by telling msieve that the L2 cache was 2 MB) almost halved the matrix solve time. [/QUOTE] For Intel processors the library will take the maximum of the L2 and L3 cache sizes. AMD processors had only L1 and L2 caches until recently; the changes to find the L3 size should be pretty easy.
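On Linux/glibc there is a simple (if glibc-specific) way to express "take the maximum of the L2 and L3 cache sizes" via sysconf. This is only a sketch, not msieve's own cache-detection code:

```c
/* glibc-only sketch: sysconf reports per-level cache sizes, so picking
 * the larger of L2 and L3 is a two-line policy. (Not msieve's own
 * detection code.) */
#include <unistd.h>

long cache_block_size(void) {
    long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);
    if (l2 < 0) l2 = 0;   /* sysconf returns -1 when the size is unknown */
    if (l3 < 0) l3 = 0;
    return l3 > l2 ? l3 : l2;
}
```

On a Barcelona this would return the 2 MB shared L3 rather than the 512 KB per-core L2, matching the setting Greg found fastest.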
[QUOTE=jasonp;132980]For Intel processors the library will take the maximum of the L2 and L3 cache sizes. Modern AMD processors only had L1 and L2 caches until recently, the changes to find the L3 size should be pretty easy.[/QUOTE]
I'd like to try out the msieve solver, but my code emits and works with the CWI format. Is there a CWI --> msieve relation converter?
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.