Old 2016-09-28, 12:21   #1
fivemack
(loop (#_fork))
Feb 2006, Cambridge, England
1100100101001₂ Posts

Quad Opteron sub-optimality

Running a 2x3 grid of six-thread jobs (each locked to one set-of-cores-on-a-piece-of-silicon) on a 5.8M matrix:

Code:
17554 driver    20   0 1188848 618260   4644 S 532.4  0.9   1060:35 msieve                             
17558 driver    20   0 1027572 470928   4632 R 322.4  0.7 798:53.65 msieve                             
17557 driver    20   0 1059040 503536   4616 R 272.5  0.8 772:02.11 msieve                             
17556 driver    20   0 1121032 489504   4640 R 257.6  0.7 725:01.56 msieve                             
17555 driver    20   0 1192980 585680   4632 R 245.0  0.9 850:02.20 msieve                             
17553 driver    20   0 1378408 824820   4824 R 237.4  1.3 859:21.04 msieve
So the memory use and CPU efficiency are quite non-uniform over the grid; within individual pieces of silicon I see:
Code:
Average:      24   31.24    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00   68.66
Average:      25   31.34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   68.66
Average:      26   31.27    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   68.53
Average:      27   31.31    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   68.39
Average:      28   95.40    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.60
Average:      29   31.10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   68.90

Average:      30   95.68    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.32
Average:      31   30.16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   69.84
Average:      32   30.40    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   69.60
Average:      33   30.74    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   69.26
Average:      34   30.33    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   69.37
Average:      35   30.57    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   69.13

Average:      36   33.13    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   66.57
Average:      37   33.53    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00   66.37
Average:      38   95.40    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.60
Average:      39   33.67    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   66.33
Average:      40   33.47    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   66.53
Average:      41   33.50    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   66.20

Average:      42   95.70    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00    4.20
Average:      43   44.34    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00   55.56
Average:      44   44.36    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00   55.54
Average:      45   43.93    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   55.87
Average:      46   44.10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   55.90
Average:      47   44.34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   55.66
So that's one hot core and five cold cores in each cluster. The ETA on 36 cores is about 40% worse than on six cores of a single i7-4930K machine.
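For reference, the per-core figures above look like sysstat's mpstat output (ten percentage columns ending in %idle), and top's per-thread view shows which msieve thread owns a given core. A minimal sketch of collecting both, assuming the sysstat and util-linux packages and using the first PID from the listing above:

Code:
# Per-core utilization: 10-second samples, six of them,
# ending with an Average: block like the one quoted above
mpstat -P ALL 10 6

# Per-thread view of one msieve process, to spot the hot thread
top -H -p 17554

# Show which cores each of that process's threads may run on
for t in /proc/17554/task/*; do taskset -cp "$(basename $t)"; done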

This is with the msieve SVN version checked out this morning.

Last fiddled with by fivemack on 2016-09-28 at 12:37
Old 2016-09-29, 10:02   #2
fivemack
(loop (#_fork))
Feb 2006, Cambridge, England
3·19·113 Posts

I was also running the filtering stage of a very large job on CPU 0 at the time, large enough that its memory didn't all fit in the RAM attached to that socket (i.e. NUMA nodes 0-1, cores 0-11), so maybe that was an issue; certainly the filtering was running much more slowly than I expected.

I've rebooted the machine and am rerunning the previous job as a 2x4 grid, so it runs on all four sockets, with numactl --membind used to make each of the eight jobs allocate from the RAM attached to its own node's memory controller; we'll see how that goes.
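For reference, a minimal sketch of one way to get that binding when the grid is launched through MPI, assuming Open MPI (the wrapper script is mine, and the actual msieve arguments are elided):

Code:
#!/bin/bash
# numa_wrap.sh -- hypothetical wrapper: bind each MPI rank's CPUs
# and memory allocations to one NUMA node (the rank variable is
# Open MPI's name; other MPI implementations use different ones)
node=$OMPI_COMM_WORLD_RANK
exec numactl --cpunodebind=$node --membind=$node "$@"

# launched as something like:
#   mpirun -np 8 ./numa_wrap.sh msieve -nc2 ... -t 6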

On the 2x3 grid the job took 80438 seconds (vs 44770 seconds on six cores of the i7-4930K).

The ETA at the start of the job on the 2x4 grid was 10h42m (38520 seconds); the actual runtime was 37305 seconds.

Code:
 2320 driver    20   0 1121188 524888   4848 R 425.0  0.8 496:36.29 msieve                             
 2317 driver    20   0 1075440 477640   4840 R 423.6  0.7 498:33.41 msieve                             
 2325 driver    20   0  946104 391668   4960 R 417.3  0.6 485:40.94 msieve                             
 2323 driver    20   0  946160 392112   4944 R 414.0  0.6 488:12.00 msieve                             
 2315 driver    20   0 1249292 656012   5148 R 388.6  1.0 464:32.91 msieve                             
 2326 driver    20   0 1086968 472064   4852 R 387.9  0.7 460:47.93 msieve                             
 2324 driver    20   0 1029964 378188   4956 R 381.0  0.6 451:44.38 msieve                             
 2327 driver    20   0  919400 363928   4952 R 379.3  0.6 451:00.21 msieve
so that's 3216.7% CPU out of a possible 4800%, i.e. about 67% machine utilization. I'm still seeing one hot and five cold cores per node, but the coldest cores are now at about 56%.

Since the quad-socket Opteron is about 1.8 times the speed of one i7-4930K for sieving, comparative advantage says I should use it for sieving all the time and do the linear algebra elsewhere.

Last fiddled with by fivemack on 2016-09-29 at 18:51 Reason: add actual runtime
Old 2016-09-29, 17:00   #3
bgbeuning
Dec 2014
3·5·17 Posts

I have a two-socket Opteron with 12 cores per socket. Each socket has two NUMA nodes, so altogether there are four NUMA nodes with six cores per node. Mine has 8 DIMMs, so 2 DIMMs per node.
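That layout is easy to confirm from the shell; a quick sketch, assuming the numactl package is installed:

Code:
numactl --hardware   # nodes, their CPUs and memory sizes, distance matrix
numastat             # per-node counters (numa_hit, numa_miss, ...)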

The general rule is that accessing memory on another node is about 50% slower. When I ran prime95 without thread affinity, it ran half as fast as with affinity.
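As a concrete example of that kind of pinning, here is a minimal sketch using taskset from util-linux (the core numbers assume the six-cores-per-node layout above, and the program name and PID are placeholders; note that taskset binds CPUs only, not memory, unlike numactl):

Code:
# Start a job restricted to node 0's cores
taskset -c 0-5 ./myprogram

# Or re-pin an already-running process by PID
taskset -pc 0-5 12345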