#1
(loop (#_fork))
Feb 2006
Cambridge, England
2·7·461 Posts
I really don't understand the figures I'm getting from the code attached (compile with -O3 -lpthread).

It ought to allocate some memory, fill it with encrypted values using a routine that takes lots of CPU and negligible memory bandwidth, then run over it so that each thread streams through its own chunk while doing some matmul-like work in cache. So I would expect that, if I use no more threads than I have CPUs, the initial fill will take time proportional to 1/Nthreads on each thread, and the streaming part likewise.

Observed results on a Q6600, with other jobs on the CPUs but at nice 19, are that both parts take the same amount of elapsed time no matter how many threads I use! I fear I'm missing something absolutely critical about memory systems ... have I screwed up something really obvious in the code? Do other people get the same shape of results on things other than a Q6600?

Code:
Started initialising with 1 threads at 0
Clearing from 0 to 268435456
Finished crypt setup after 55000000
Finished scanning after 74320000
74320000
Started initialising with 2 threads at 129320000
Clearing from 0 to 134217728
Clearing from 134217728 to 268435456
Finished crypt setup after 54130000
Finished scanning after 77490000
76790000 77480000
Started initialising with 3 threads at 260940000
Clearing from 0 to 89478485
Clearing from 89478485 to 178956970
Clearing from 178956970 to 268435455
Finished crypt setup after 54060000
Finished scanning after 77560000
76890000 77560000 77240000
Started initialising with 4 threads at 392560000
Clearing from 0 to 67108864
Clearing from 67108864 to 134217728
Clearing from 134217728 to 201326592
Clearing from 201326592 to 268435456
Finished crypt setup after 54120000
Finished scanning after 79870000
77870000 79600000 79770000 78850000
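For readers without the attachment, here's a minimal sketch of the shape described above. Everything in it is hypothetical: the names are mine, and the crypt routine and matmul-like work are stubbed out with trivial stand-ins.

Code:
/* sketch.c -- a stand-in for the attached benchmark, not the real code.
   Compile with: gcc -O3 sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE 268435456UL            /* 256 MB, matching the logs */

static unsigned char *buf;
static long nthreads;

static void *worker(void *arg)
{
    long id = (long)arg;
    unsigned long lo = SIZE / nthreads * id;
    unsigned long hi = (id == nthreads - 1) ? SIZE : SIZE / nthreads * (id + 1);
    printf("Clearing from %lu to %lu\n", lo, hi);

    /* Phase 1: CPU-bound fill, a trivial stand-in for the crypt routine. */
    for (unsigned long i = lo; i < hi; i++)
        buf[i] = (unsigned char)((i * 2654435761UL) >> 24);

    /* (The real code has a barrier and per-phase timing here.) */

    /* Phase 2: stream through the chunk, stand-in for the in-cache work. */
    unsigned long acc = 0;
    for (unsigned long i = lo; i < hi; i++)
        acc += buf[i];
    return (void *)acc;   /* return acc so -O3 keeps the streaming loop */
}

int main(void)
{
    buf = malloc(SIZE);
    if (!buf) return 1;
    for (nthreads = 1; nthreads <= 4; nthreads++) {
        pthread_t tid[4];
        printf("Started initialising with %ld threads\n", nthreads);
        for (long t = 0; t < nthreads; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
    }
    free(buf);
    return 0;
}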
#2
(loop (#_fork))
Feb 2006
Cambridge, England
14466₈ Posts
Ah, I'm a fool; clock() tells me something like the total amount of CPU time used by all the threads in the program; adding loads of system("date") calls gave me results like
Code:
Started initialising with 2 threads at 129160000
Sun Jun 8 23:17:01 BST 2008
Clearing from 0 to 134217728
Sun Jun 8 23:17:01 BST 2008
Clearing from 134217728 to 268435456
Sun Jun 8 23:17:01 BST 2008
Sun Jun 8 23:17:28 BST 2008
Finished crypt setup after 55010000

What should I be using on linux for a reasonable-granularity (milliseconds would be good, microseconds would be lovely) system-wide clock? The RDTSC instruction has ideal granularity but isn't synced between processors, so I get odd effects if the thread moves between cores; clock() doesn't behave as I want.
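To make the clock() behaviour concrete, here's a toy demonstration of my own (not the benchmark code): two busy threads running for about a second of wall time show up as roughly two seconds through clock().

Code:
/* clockdemo.c -- gcc -O3 clockdemo.c -lpthread
   Shows that clock() reports CPU time summed over all threads. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

static void *spin(void *arg)
{
    (void)arg;
    volatile unsigned long x = 0;           /* volatile so -O3 keeps the loop */
    for (unsigned long i = 0; i < 500000000UL; i++)
        x += i;
    return NULL;
}

int main(void)
{
    struct timeval w0, w1;
    clock_t c0 = clock();
    gettimeofday(&w0, NULL);

    pthread_t a, b;
    pthread_create(&a, NULL, spin, NULL);
    pthread_create(&b, NULL, spin, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    clock_t c1 = clock();
    gettimeofday(&w1, NULL);

    /* On two free cores, CPU time comes out at roughly twice the wall time. */
    printf("clock(): %.2f s CPU\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    printf("wall:    %.2f s elapsed\n",
           (w1.tv_sec - w0.tv_sec) + (w1.tv_usec - w0.tv_usec) / 1e6);
    return 0;
}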
#3
Jul 2003
So Cal
101000100110₂ Posts
I think clock_gettime() will give you what you want. Add -lrt to the compile command.
Code:
int usecs;
struct timespec tsstart, tsstop;

clock_gettime (CLOCK_REALTIME, &tsstart);
... code ...
clock_gettime (CLOCK_REALTIME, &tsstop);

usecs = ((tsstop.tv_sec - tsstart.tv_sec) * 1000000)
      + ((tsstop.tv_nsec - tsstart.tv_nsec) / 1000);
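One caveat worth noting: CLOCK_REALTIME jumps if anything steps the system clock (ntp, a root date call), so for interval timing CLOCK_MONOTONIC is generally the safer choice. A small helper along those lines, my sketch rather than the snippet above, with the same -lrt requirement:

Code:
#include <time.h>

/* Seconds elapsed since *start; also resets *start to now.
   CLOCK_MONOTONIC never steps backwards, unlike CLOCK_REALTIME. */
static double elapsed_secs(struct timespec *start)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double dt = (now.tv_sec - start->tv_sec)
              + (now.tv_nsec - start->tv_nsec) / 1e9;
    *start = now;
    return dt;
}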
#4
Jul 2003
So Cal
2·3·433 Posts
I modified main to use clock_gettime(), and it seems to work correctly.
Code:
Started initialising with 1 threads at 0
Clearing from 0 to 268435456
Finished crypt setup after 53681925
Finished scanning after 153898136
153870000
Started initialising with 2 threads at 207530000
Clearing from 0 to 134217728
Clearing from 134217728 to 268435456
Finished crypt setup after 26359791
Finished scanning after 81944416
159760000 155670000
Started initialising with 3 threads at 420000000
Clearing from 0 to 89478485
Clearing from 89478485 to 178956970
Clearing from 178956970 to 268435455
Finished crypt setup after 17573829
Finished scanning after 56027011
158230000 152730000 153520000
Started initialising with 4 threads at 630940000
Clearing from 0 to 67108864
Clearing from 67108864 to 134217728
Clearing from 134217728 to 201326592
Clearing from 201326592 to 268435456
Finished crypt setup after 13213695
Finished scanning after 42539267
159320000 153620000 156240000 155210000
Started initialising with 5 threads at 843000000
Clearing from 0 to 53687091
Clearing from 53687091 to 107374182
Clearing from 107374182 to 161061273
Clearing from 214748364 to 268435455
Clearing from 161061273 to 214748364
Finished crypt setup after 10544309
Finished scanning after 34475833
168610000 168310000 168370000 163720000 164630000
Started initialising with 6 threads at 1064320000
Clearing from 0 to 44739242
Clearing from 44739242 to 89478484
Clearing from 89478485 to 134217727
Clearing from 134217728 to 178956970
Clearing from 223696213 to 268435455
Clearing from 178956970 to 223696212
Finished crypt setup after 8789439
Finished scanning after 29550770
166060000 161380000 164540000 164330000 161220000 160590000
Started initialising with 7 threads at 1283090000
Clearing from 0 to 38347922
Clearing from 38347922 to 76695844
Clearing from 76695844 to 115043766
Clearing from 115043766 to 153391688
Clearing from 230087533 to 268435455
Clearing from 191739611 to 230087533
Clearing from 153391689 to 191739611
Finished crypt setup after 7548236
Finished scanning after 25014537
166460000 163690000 162110000 165890000 164650000 162300000 162810000
Started initialising with 8 threads at 1502270000
Clearing from 0 to 33554432
Clearing from 33554432 to 67108864
Clearing from 134217728 to 167772160
Clearing from 167772160 to 201326592
Clearing from 201326592 to 234881024
Clearing from 67108864 to 100663296
Clearing from 100663296 to 134217728
Clearing from 234881024 to 268435456
Finished crypt setup after 6609198
Finished scanning after 22518269
170590000 169660000 164540000 159810000 169990000 168710000 167650000 166740000
Started initialising with 9 threads at 1725600000
Clearing from 0 to 29826161
Clearing from 29826161 to 59652322
Clearing from 208783132 to 238609293
Clearing from 59652323 to 89478484
Clearing from 178956970 to 208783131
Clearing from 89478485 to 119304646
Clearing from 119304647 to 149130808
Clearing from 149130808 to 178956969
Clearing from -238609294 to -208783133
Segmentation fault

Greg
#5
Jul 2003
So Cal
101000100110₂ Posts
I moved the i to the end, which keeps the chunk-bound arithmetic from overflowing a 32-bit int; that overflow is what produced the negative offsets and the segfault above (a reconstruction follows the output). Here's the result on the quad-core Barcelona. The first step seems to have near-linear speedup up to 14 threads; not sure what happened there with 15 threads. The second step isn't quite so good, but is ok up to 12 threads. Beyond 12, though, things slow down. Perhaps this is in part due to the NUMA architecture...
Code:
Started initialising with 1 threads at 0
Clearing from 0 to 268435456
Finished crypt setup after 53696751
Finished scanning after 153868902
153840000
Started initialising with 2 threads at 207520000
Clearing from 0 to 134217728
Clearing from 134217728 to 268435456
Finished crypt setup after 26419972
Finished scanning after 82081925
159310000 154490000
Started initialising with 3 threads at 419650000
Clearing from 0 to 89478485
Clearing from 89478485 to 178956970
Clearing from 178956970 to 268435455
Finished crypt setup after 17618014
Finished scanning after 55813765
157650000 151940000 153040000
Started initialising with 4 threads at 630130000
Clearing from 0 to 67108864
Clearing from 67108864 to 134217728
Clearing from 134217728 to 201326592
Clearing from 201326592 to 268435456
Finished crypt setup after 13212915
Finished scanning after 43587470
165670000 157040000 161180000 164280000
Started initialising with 5 threads at 848620000
Clearing from 0 to 53687091
Clearing from 53687091 to 107374182
Clearing from 107374182 to 161061273
Clearing from 161061273 to 214748364
Clearing from 214748364 to 268435455
Finished crypt setup after 10572543
Finished scanning after 33951399
163450000 156580000 162460000 162980000 157330000
Started initialising with 6 threads at 1064900000
Clearing from 0 to 44739242
Clearing from 44739242 to 89478484
Clearing from 89478484 to 134217726
Clearing from 134217726 to 178956968
Clearing from 223696210 to 268435452
Clearing from 178956968 to 223696210
Finished crypt setup after 8810310
Finished scanning after 29447003
166510000 160700000 163260000 163770000 165770000 157490000
Started initialising with 7 threads at 1284250000
Clearing from 0 to 38347922
Clearing from 38347922 to 76695844
Clearing from 76695844 to 115043766
Clearing from 230087532 to 268435454
Clearing from 191739610 to 230087532
Clearing from 153391688 to 191739610
Clearing from 115043766 to 153391688
Finished crypt setup after 7553484
Finished scanning after 25862978
174560000 174130000 173760000 173890000 172110000 167760000 166990000
Started initialising with 8 threads at 1511620000
Clearing from 0 to 33554432
Clearing from 33554432 to 67108864
Clearing from 67108864 to 100663296
Clearing from 234881024 to 268435456
Clearing from 201326592 to 234881024
Clearing from 167772160 to 201326592
Clearing from 134217728 to 167772160
Clearing from 100663296 to 134217728
Finished crypt setup after 6611317
Finished scanning after 22663095
167770000 167610000 164150000 166910000 145270000 162130000 159510000 158960000
Started initialising with 9 threads at 1732220000
Clearing from 0 to 29826161
Clearing from 29826161 to 59652322
Clearing from 59652322 to 89478483
Clearing from 208783127 to 238609288
Clearing from 89478483 to 119304644
Clearing from 178956966 to 208783127
Clearing from 149130805 to 178956966
Clearing from 119304644 to 149130805
Clearing from 238609288 to 268435449
Finished crypt setup after 5877745
Finished scanning after 21875875
174680000 177530000 173600000 176700000 170500000 166270000 164520000 164210000 175280000
Started initialising with 10 threads at 1962560000
Clearing from 0 to 26843545
Clearing from 26843545 to 53687090
Clearing from 80530635 to 107374180
Clearing from 53687090 to 80530635
Clearing from 107374180 to 134217725
Clearing from 241591905 to 268435450
Clearing from 214748360 to 241591905
Clearing from 187904815 to 214748360
Clearing from 161061270 to 187904815
Clearing from 134217725 to 161061270
Finished crypt setup after 5309065
Finished scanning after 21787666
190470000 194330000 191280000 192950000 184650000 189020000 192540000 189630000 179410000 171860000
Started initialising with 11 threads at -2085257296
Clearing from 0 to 24403223
Clearing from 24403223 to 48806446
Clearing from 97612892 to 122016115
Clearing from 48806446 to 73209669
Clearing from 73209669 to 97612892
Clearing from 122016115 to 146419338
Clearing from 244032230 to 268435453
Clearing from 219629007 to 244032230
Clearing from 195225784 to 219629007
Clearing from 170822561 to 195225784
Clearing from 146419338 to 170822561
Finished crypt setup after 4922136
Finished scanning after 18890423
193390000 193890000 192310000 188750000 185470000 191740000 186920000 194140000 188130000 190110000 179570000
Started initialising with 12 threads at -1838287296
Clearing from 0 to 22369621
Clearing from 22369621 to 44739242
Clearing from 178956968 to 201326589
Clearing from 44739242 to 67108863
Clearing from 156587347 to 178956968
Clearing from 134217726 to 156587347
Clearing from 111848105 to 134217726
Clearing from 89478484 to 111848105
Clearing from 67108863 to 89478484
Clearing from 201326589 to 223696210
Clearing from 246065831 to 268435452
Clearing from 223696210 to 246065831
Finished crypt setup after 4410209
Finished scanning after 18158087
200450000 201780000 197510000 201230000 192090000 201350000 188900000 197290000 200630000 194440000 185130000 190680000
Started initialising with 13 threads at -1583697296
Clearing from 0 to 20648881
Clearing from 20648881 to 41297762
Clearing from 41297762 to 61946643
Clearing from 144542167 to 165191048
Clearing from 61946643 to 82595524
Clearing from 123893286 to 144542167
Clearing from 103244405 to 123893286
Clearing from 82595524 to 103244405
Clearing from 165191048 to 185839929
Clearing from 185839929 to 206488810
Clearing from 206488810 to 227137691
Clearing from 247786572 to 268435453
Clearing from 227137691 to 247786572
Finished crypt setup after 4101279
Finished scanning after 18832256
215520000 219340000 219720000 216210000 217750000 218230000 213370000 218400000 206500000 217310000 185260000 183990000 183760000
Started initialising with 14 threads at -1311167296
Clearing from 0 to 19173961
Clearing from 19173961 to 38347922
Clearing from 38347922 to 57521883
Clearing from 153391688 to 172565649
Clearing from 57521883 to 76695844
Clearing from 134217727 to 153391688
Clearing from 115043766 to 134217727
Clearing from 95869805 to 115043766
Clearing from 76695844 to 95869805
Clearing from 172565649 to 191739610
Clearing from 191739610 to 210913571
Clearing from 249261493 to 268435454
Clearing from 230087532 to 249261493
Clearing from 210913571 to 230087532
Finished crypt setup after 3803282
Finished scanning after 20826194
208080000 210270000 208100000 206320000 206140000 217580000 206200000 205870000 203430000 209840000 207590000 181190000 215610000 193170000
Started initialising with 15 threads at -1040767296
Clearing from 0 to 17895697
Clearing from 17895697 to 35791394
Clearing from 71582788 to 89478485
Clearing from 35791394 to 53687091
Clearing from 53687091 to 71582788
Clearing from 89478485 to 107374182
Clearing from 107374182 to 125269879
Clearing from 125269879 to 143165576
Clearing from 143165576 to 161061273
Clearing from 161061273 to 178956970
Clearing from 178956970 to 196852667
Clearing from 196852667 to 214748364
Clearing from 250539758 to 268435455
Clearing from 214748364 to 232644061
Clearing from 232644061 to 250539758
Finished crypt setup after 5329815
Finished scanning after 20002748
214290000 214470000 212210000 208610000 208940000 212690000 209020000 211850000 183190000 210660000 220880000 221580000 210490000 179710000 207040000
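For the record, the negative offsets and the segfault in the earlier 9-thread run look exactly like signed 32-bit overflow in the chunk-bound arithmetic. Assuming the bounds were computed as size*i/nthreads in an int (my assumption; the attachment isn't shown here), this sketch reproduces the logged -238609294:

Code:
#include <stdio.h>

int main(void)
{
    int size = 268435456, nthreads = 9;
    for (int i = 0; i <= nthreads; i++) {
        /* size*i exceeds INT_MAX once i >= 8: signed overflow
           (undefined behaviour, wraps negative in practice).
           For i=8 this yields -238609294, as in the log above. */
        int bad  = size * i / nthreads;
        /* "i moved to the end": the intermediate value stays small. */
        int good = size / nthreads * i;
        printf("i=%2d  bad=%11d  good=%11d\n", i, bad, good);
    }
    return 0;
}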
#6
(loop (#_fork))
Feb 2006
Cambridge, England
2×7×461 Posts
Thanks for the help and for the runs on the large machine. I suspect this is a case where I see good scaling because the single thread is badly written and very slow - it's getting 665MB/sec.
I can get a pretty good speedup by unrolling the loop four times; I hoped I could get a better speedup by using SSE2 instructions to read the addresses, but using
Code:
__m128i A, B;
A = _mm_load_si128 ((__m128i*)j);
B = _mm_shuffle_epi32 (A, 0x1E);
a = _mm_cvtsi128_si32 (A);
b = _mm_cvtsi128_si32 (_mm_srli_epi64 (A, 32));
c = _mm_cvtsi128_si32 (B);
d = _mm_cvtsi128_si32 (_mm_srli_epi64 (B, 32));
on a dual-quad-core Xeon with the new timing routines and the more efficient innermost loop (attached), I get
Code:
nproc/setup/loop/implied bw
1 58.26 59.45 1.682GB/sec
2 28.75 31.62 3.162GB/sec
3 19.02 26.78 3.734GB/sec
4 14.49 25.77 3.880GB/sec
Code:
nproc/setup/loop/implied bw/best thread/implied theory
1 49.21 64.78 1.544GB/sec 64.78 1.544
2 24.04 43.32 2.308GB/sec 29.56 3.382
3 16.31 30.52 3.277GB/sec 22.32 4.480
4 12.20 24.71 4.047GB/sec 18.37 5.444

Last fiddled with by fivemack on 2008-06-09 at 13:45
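In case it helps anyone slot that snippet into place, here's a hypothetical innermost loop built around it (stand-in names and memory layout, not the attached code; idx must be 16-byte aligned for _mm_load_si128):

Code:
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Pull four 32-bit indices per 16-byte load, then touch the big
   array at each of them. All names here are stand-ins. */
static unsigned long scan(const unsigned *idx, const unsigned char *mem,
                          unsigned long n)   /* n is a multiple of 4 */
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; i += 4) {
        __m128i A = _mm_load_si128((const __m128i *)(idx + i));
        __m128i B = _mm_shuffle_epi32(A, 0x1E);   /* B = {A[2],A[3],A[1],A[0]} */
        unsigned a = _mm_cvtsi128_si32(A);                      /* A[0] */
        unsigned b = _mm_cvtsi128_si32(_mm_srli_epi64(A, 32));  /* A[1] */
        unsigned c = _mm_cvtsi128_si32(B);                      /* A[2] */
        unsigned d = _mm_cvtsi128_si32(_mm_srli_epi64(B, 32));  /* A[3] */
        acc += mem[a] + mem[b] + mem[c] + mem[d];
    }
    return acc;
}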
#7
Jul 2003
So Cal
A26₁₆ Posts

Code:
NOSSE
1 59.809 111.76 0.895 GB/s
2 29.735 57.522 1.738 GB/s
3 19.988 41.221 2.426 GB/s
4 14.963 31.942 3.131 GB/s

Using SSE2
1 59.742 98.558 1.105 GB/s
2 29.797 48.809 2.049 GB/s
3 19.904 33.326 3.000 GB/s
4 15.014 24.534 4.076 GB/s
#8
Jul 2003
So Cal
2×3×433 Posts
Something interesting ... removing the three if (stream==0) printf's doubled the SSE2 code's bandwidth but didn't affect the non-SSE2 results much at all. I removed the printf's simply to make the output more compact. I have no clue why it has such a huge effect on the SSE2 code's speed.
I'm seeing near-linear speedup to 11 threads, and then again at 14 threads. This was a single run with other processes niced in the background, which may be part of the reason for the jumps above 11 threads. Code:
SSE2
 1 60.123 45.733  2.187 GB/s
 2 29.546 23.760  4.209 GB/s
 3 19.624 15.190  6.583 GB/s
 4 14.783 11.529  8.674 GB/s
 5 11.989  9.575 10.444 GB/s
 6  9.973  7.821 12.786 GB/s
 7  8.555  6.850 14.599 GB/s
 8  7.517  5.923 16.883 GB/s
 9  6.870  5.522 18.109 GB/s
10  6.188  4.858 20.585 GB/s
11  6.887  4.425 22.599 GB/s
12  5.238  5.577 17.931 GB/s
13  4.931  3.955 25.284 GB/s
14  5.197  3.474 28.785 GB/s
15  4.124  4.123 24.254 GB/s
16  5.596  4.182 23.912 GB/s
#9
"Bob Silverman"
Nov 2003
North of Boston
2²·1,877 Posts
Perhaps the printf's were causing a pipeline stall? |
#10
Jul 2003
So Cal
2·3·433 Posts
Looking at the AMD website, one of the improvements in the Opteron Barcelona CPU was doubled throughput for SSE/SSE2/SSE3 instructions.
Greg |
#11
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
26612₈ Posts
A recent thread on the GMP developers' list was concerned with how very subtle code-alignment issues can lead to large differences in execution speed. Further, the speed differences under different code alignments were very processor-dependent. It doesn't surprise me at all that a far-from-subtle change, such as removing complex function calls, would have a similar effect.
Paul |