HT helps a lot on LA, at least for me.
|
[QUOTE=VBCurtis;537677]-nc1 was run with target-density 134. After remdups and adding freerels in, msieve states 99.3M unique relations. Matrix came out 4.57M dimensions. TD=140 did not complete filtering.[/QUOTE]
Machine: Xeon E5-2680 v3 (Haswell generation), 12 cores @ 2.5GHz, 48GB memory on 4 channels DDR4 (4×4GB + 4×8GB). VBITS=128 on an otherwise idle machine.

ETA after 1% of job:
6 threads: 14h 34m
12 threads: 8h 26m
18 threads: 9h 15m
24 threads: 8h 27m

These times look rather slow; I just installed the extra 32GB of memory today, so perhaps filling all 8 slots slows memory access a bunch. Some time I'll remove the original 16GB and see if 4 sticks are faster than 8. |
1 Attachment(s)
[QUOTE=VBCurtis;539303]While unlikely, it is possible that 20 or 24 threads yields a bit of improvement. Hyperthreads don't always help on matrix solving, but since this is a benchmark thread it might be nice to demonstrate that.
I suggest 20 as an alternative because using every possible HT might be impacted by any background process, but that effect should be reduced if we leave a few HTs 'open'. I've found situations where using N-1 cores runs faster than N cores, for what I presume are similar reasons.[/QUOTE]We ran a 24-thread test last night. It was 1.09% faster than the 12-thread job. During the run, the CPU reported roughly 1700% utilization, so there must be a lot of overhead and/or bottlenecks. We are currently running a 20-thread test that we will post later. Note that we only count the LA phase in our calculations. :mike: |
1 Attachment(s)
[QUOTE=Xyzzy;543938]We are currently running a 20 thread test that we will post later.[/QUOTE]The 20 thread run somehow ended up slower than the 12 thread run.
12 threads = 8h04m50s
20 threads = 8h32m31s
24 threads = 7h59m33s

:mike: |
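Converting those wall-clock times to seconds makes the 24-vs-12-thread comparison above concrete. A quick shell sketch of the arithmetic (the three timings are taken from the posts; integer math only, so the last digit is truncated):

```shell
# LA wall-clock times reported above, converted to seconds.
t12=$((8*3600 + 4*60 + 50))    # 12 threads: 8h04m50s = 29090 s
t20=$((8*3600 + 32*60 + 31))   # 20 threads: 8h32m31s = 30751 s
t24=$((7*3600 + 59*60 + 33))   # 24 threads: 7h59m33s = 28773 s

# Improvement of 24 threads over 12, in hundredths of a percent.
gain=$(( (t12 - t24) * 10000 / t12 ))
echo "$gain"   # prints 108, i.e. roughly the 1.09% reported
```

So doubling the thread count via hyperthreading buys only about 1% here, consistent with the LA phase being memory-bandwidth-bound rather than compute-bound.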
1 Attachment(s)
[C]CPU = i7-8565U
RAM = 2×16GB DDR4-2400
CMD = ./msieve -v -nc -t 8
LA = 47884s[/C] :mike: |
1 Attachment(s)
[C]CPU = i7-8565U
RAM = 2×16GB DDR4-2400
CMD = ./msieve -v -nc -t 4
LA = 51662s[/C] :mike: |
1 Attachment(s)
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 27180s[/C] :mike: |
In my experience, throughput depends on whether another program (e.g. gmp-ecm) is running alongside:

machine: i7-7820X, 8 cores + HT
matrix: 49M × 49M
memory: 64 GB

msieve .... -t16 solo: ~55% (power according to Task Manager)
msieve .... -t16 plus gmp-ecm (priority: low): ~78% (power according to Task Manager)

With msieve + mprime/Prime95 the effectiveness is a little lower.

Kurt |
[QUOTE=bsquared;537791]Machine: 2 sockets of 20-core Cascade-Lake Xeon
Just used the default density. Matrix is 5149968 x 5150142 (1913.9 MB) with weight 597210677 (115.96/col).

Here is a basic 40-threaded job across both sockets (actually, I guess it is thread-limited to 32 threads):
4 hrs 58 min: ./msieve -v -nc2 -t 40

Using MPI helps a lot. Here are various configurations using different VBITS settings (timings after 1% elapsed): [/QUOTE]

I know this is late, but if you still have this data set up, try

mpirun -np 2 msieve -nc2 1,2 -v -t 20 |
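For anyone who wants to try that suggestion, here's a minimal sketch of the invocation (assumptions: msieve was built with MPI support enabled, and the `1,2` argument after `-nc2` splits the matrix over a 1×2 process grid, one rank per socket; this is an illustrative command, not a verified recipe):

```shell
# Sketch: 2 MPI ranks, 20 threads each, matrix split across a 1x2 grid.
# Binding each rank to one socket keeps its matrix slice in local memory.
mpirun -np 2 --map-by socket ./msieve -nc2 1,2 -v -t 20
```

The `--map-by socket` flag is an Open MPI option; other MPI implementations have their own binding syntax.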
Here's a bench using compute nodes with one Xeon E5-2650 v4 Broadwell cpu with 12-cores, 24 threads.
1 node: 7h 40m
2 nodes: 2h 45m
4 nodes: 1h 35m
8 nodes: 1h 10m

Not sure why the time for one node is so high compared to the others? Perhaps the smaller per-node matrices fit into cache better. |
[QUOTE=frmky;553739]Here's a bench using compute nodes with one Xeon E5-2650 v4 Broadwell cpu with 12-cores, 24 threads.
1 node: 7h 40m
2 nodes: 2h 45m
4 nodes: 1h 35m
8 nodes: 1h 10m

Not sure why the time for one node is so high compared to the others? Perhaps the smaller per-node matrices fit into cache better.[/QUOTE] It never occurred to me that MPI on 2 nodes could be more than twice as fast, under any test. Neat! Now, if only Ubuntu would fix MPI.... |
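The superlinear jump is easy to quantify from the timings above (a quick integer-math sketch; times are from frmky's post, everything else is illustrative):

```shell
# LA wall times per node count, in minutes.
t1=$((7*60 + 40))   # 1 node: 460 min
t2=$((2*60 + 45))   # 2 nodes: 165 min
t4=$((1*60 + 35))   # 4 nodes: 95 min
t8=$((1*60 + 10))   # 8 nodes: 70 min

# Speedup relative to 1 node, in hundredths (integer math).
for t in $t2 $t4 $t8; do
  echo $(( t1 * 100 / t ))
done
# prints 278, 484, 657 -> 2.78x, 4.84x, 6.57x
```

Two nodes are ~2.78x faster than one, which fits the cache explanation: once the matrix is split, each node's slice fits in cache much better, so the first doubling pays off twice.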