Quad Opteron sub-optimality
Running a 2x3 grid of 6-thread jobs (each locked to one set-of-cores-on-a-piece-of-silicon) on a 5.8M matrix:
[code]
  PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
17554 driver    20   0 1188848 618260  4644 S 532.4  0.9   1060:35 msieve
17558 driver    20   0 1027572 470928  4632 R 322.4  0.7 798:53.65 msieve
17557 driver    20   0 1059040 503536  4616 R 272.5  0.8 772:02.11 msieve
17556 driver    20   0 1121032 489504  4640 R 257.6  0.7 725:01.56 msieve
17555 driver    20   0 1192980 585680  4632 R 245.0  0.9 850:02.20 msieve
17553 driver    20   0 1378408 824820  4824 R 237.4  1.3 859:21.04 msieve
[/code]
So the memory use and CPU efficiency are quite non-uniform over the grid; within individual pieces-of-silicon I see
[code]
Average:  CPU   %usr  %nice   %sys %iowait  %irq  %soft %steal %guest %gnice  %idle
Average:   24  31.24   0.00   0.10    0.00  0.00   0.00   0.00   0.00   0.00  68.66
Average:   25  31.34   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00  68.66
Average:   26  31.27   0.00   0.20    0.00  0.00   0.00   0.00   0.00   0.00  68.53
Average:   27  31.31   0.00   0.30    0.00  0.00   0.00   0.00   0.00   0.00  68.39
Average:   28  95.40   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00   4.60
Average:   29  31.10   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00  68.90
Average:   30  95.68   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00   4.32
Average:   31  30.16   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00  69.84
Average:   32  30.40   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00  69.60
Average:   33  30.74   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00  69.26
Average:   34  30.33   0.00   0.30    0.00  0.00   0.00   0.00   0.00   0.00  69.37
Average:   35  30.57   0.00   0.30    0.00  0.00   0.00   0.00   0.00   0.00  69.13
Average:   36  33.13   0.00   0.30    0.00  0.00   0.00   0.00   0.00   0.00  66.57
Average:   37  33.53   0.00   0.10    0.00  0.00   0.00   0.00   0.00   0.00  66.37
Average:   38  95.40   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00   4.60
Average:   39  33.67   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00  66.33
Average:   40  33.47   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00  66.53
Average:   41  33.50   0.00   0.30    0.00  0.00   0.00   0.00   0.00   0.00  66.20
Average:   42  95.70   0.00   0.10    0.00  0.00   0.00   0.00   0.00   0.00   4.20
Average:   43  44.34   0.00   0.10    0.00  0.00   0.00   0.00   0.00   0.00  55.56
Average:   44  44.36   0.00   0.10    0.00  0.00   0.00   0.00   0.00   0.00  55.54
Average:   45  43.93   0.00   0.20    0.00  0.00   0.00   0.00   0.00   0.00  55.87
Average:   46  44.10   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00  55.90
Average:   47  44.34   0.00   0.00    0.00  0.00   0.00   0.00   0.00   0.00  55.66
[/code]
So: one hot core and five cold cores in each cluster. The ETA on 36 cores is about 40% slower than on six cores of one i7/4930K machine. This is with an svn checkout from this morning.
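For concreteness, "locked to one set-of-cores-on-a-piece-of-silicon" means plain CPU affinity; a minimal sketch of that kind of launch using taskset, where the core ranges match the mpstat output above and ./worker is a placeholder for the real msieve invocation:
[code]
# Pin one 6-thread worker to each die; cores 24-29, 30-35, 36-41
# and 42-47 are four of the 6-core dies on this 48-core box.
for first in 24 30 36 42; do
    taskset -c $first-$(($first + 5)) ./worker &
done
wait
[/code]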
I was also running the filtering stage of a very large job on CPU 0 at the time, large enough that its memory didn't all fit in the RAM attached to that socket (i.e. NUMA nodes 0-1, cores 0-11), so maybe that was an issue; certainly the filtering was running much more slowly than I expected.
I've rebooted the machine and am rerunning the previous job with a 2x4 grid, so running on all four sockets, with numactl --membind to get each of the eight jobs to use the RAM attached to its node's memory controller; we'll see how that goes. On the 2x3 grid the job took 80438 seconds (vs 44770 on six cores of the i7/4930K). The ETA at the start of the 2x4-grid job was 10h42m (38520 seconds); the actual runtime was 37305 seconds.
[code]
 PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
2320 driver    20   0 1121188 524888  4848 R 425.0  0.8 496:36.29 msieve
2317 driver    20   0 1075440 477640  4840 R 423.6  0.7 498:33.41 msieve
2325 driver    20   0  946104 391668  4960 R 417.3  0.6 485:40.94 msieve
2323 driver    20   0  946160 392112  4944 R 414.0  0.6 488:12.00 msieve
2315 driver    20   0 1249292 656012  5148 R 388.6  1.0 464:32.91 msieve
2326 driver    20   0 1086968 472064  4852 R 387.9  0.7 460:47.93 msieve
2324 driver    20   0 1029964 378188  4956 R 381.0  0.6 451:44.38 msieve
2327 driver    20   0  919400 363928  4952 R 379.3  0.6 451:00.21 msieve
[/code]
That sums to 3216.7% CPU, i.e. about 67% utilization of the 48-core machine. I'm still seeing one hot core and five cold cores per node, but the coldest cores are now at about 56%. Since the quad-socket Opteron is about 1.8 times the speed of one i7/4930K for sieving, comparative advantage says I should use it for sieving all the time and do the linear algebra elsewhere.
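The per-node launch looks roughly like this; a sketch assuming eight NUMA nodes numbered 0-7 with six cores each, and with $MSIEVE_ARGS a placeholder for the actual msieve arguments:
[code]
# One 6-thread job per NUMA node: --cpunodebind keeps the threads on
# that node's cores, --membind keeps its allocations in the RAM
# attached to that node's memory controller.
for node in 0 1 2 3 4 5 6 7; do
    numactl --cpunodebind=$node --membind=$node ./msieve $MSIEVE_ARGS &
done
wait
[/code]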
I have a two-socket Opteron, with 12 cores per socket.
Each socket has two NUMA nodes, so altogether it has four NUMA nodes with six cores per node. Mine has 8 DIMMs, so 2 DIMMs per node. The general rule is that accessing cross-node memory is about 50% slower. When I ran prime95 without thread affinity, it ran at half the speed it gets with affinity.
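You can check the node/core layout (and the relative cost of remote access) on any such box with numactl, for example:
[code]
# Prints the cores and RAM attached to each NUMA node, plus the node
# distance matrix: 10 means local, larger values (e.g. 16 or 22)
# mean progressively more expensive remote access.
numactl --hardware
[/code]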