[QUOTE=ET_;425606]Can you tell us anything about its performances? I guess it's running with 2 threads...
Luigi[/QUOTE]
Not fast :brian-e:
[CODE]
448 msec/iter = 12.80 ROE[avg,max] = [0.224609375, 0.250000000] radices = 56 16 16
480 msec/iter = 14.04 ROE[avg,max] = [0.210880824, 0.250000000] radices = 60 16 16
512 msec/iter = 14.23 ROE[avg,max] = [0.281250000, 0.281250000] radices = 128 8 16
576 msec/iter = 16.14 ROE[avg,max] = [0.208354841, 0.250000000] radices = 144 8 16
640 msec/iter = 19.46 ROE[avg,max] = [0.257421875, 0.312500000] radices = 160 8 16
704 msec/iter = 21.52 ROE[avg,max] = [0.274654715, 0.343750000] radices = 176 8 16
768 msec/iter = 21.99 ROE[avg,max] = [0.209895543, 0.250000000] radices = 48 16 16
832 msec/iter = 24.75 ROE[avg,max] = [0.239439174, 0.312500000] radices = 208 8 16
896 msec/iter = 25.59 ROE[avg,max] = [0.227832031, 0.312500000] radices = 56 16 16
960 msec/iter = 28.33 ROE[avg,max] = [0.212360491, 0.250000000] radices = 60 16 16
1024 msec/iter = 28.24 ROE[avg,max] = [0.312500000, 0.312500000] radices = 128 16 16
1152 msec/iter = 32.89 ROE[avg,max] = [0.208562687, 0.253906250] radices = 144 16 16
1280 msec/iter = 40.32 ROE[avg,max] = [0.235714286, 0.312500000] radices = 20 8 16
1408 msec/iter = 42.28 ROE[avg,max] = [0.273688616, 0.343750000] radices = 176 16 16
1536 msec/iter = 44.80 ROE[avg,max] = [0.223493304, 0.281250000] radices = 192 16 16
1664 msec/iter = 48.48 ROE[avg,max] = [0.246149554, 0.312500000] radices = 208 16 16
1792 msec/iter = 51.94 ROE[avg,max] = [0.220703125, 0.281250000] radices = 224 16 16
1920 msec/iter = 61.24 ROE[avg,max] = [0.212430246, 0.257812500] radices = 60 16 32
2048 msec/iter = 56.56 ROE[avg,max] = [0.312500000, 0.312500000] radices = 128 16 16
2304 msec/iter = 65.73 ROE[avg,max] = [0.208895438, 0.250000000] radices = 144 16 16
2560 msec/iter = 79.33 ROE[avg,max] = [0.245312500, 0.281250000] radices = 20 16 16
2816 msec/iter = 85.93 ROE[avg,max] = [0.272896903, 0.343750000] radices = 176 16 16
3072 msec/iter = 91.91 ROE[avg,max] = [0.225892857, 0.281250000] radices = 192 16 16
3328 msec/iter = 97.41 ROE[avg,max] = [0.241322545, 0.281250000] radices = 208 16 16
3584 msec/iter = 105.64 ROE[avg,max] = [0.220870536, 0.250000000] radices = 224 16 16
3840 msec/iter = 132.28 ROE[avg,max] = [0.213867188, 0.242187500] radices = 60 32 32
4096 msec/iter = 116.38 ROE[avg,max] = [0.224023438, 0.250000000] radices = 16 16 16
4608 msec/iter = 141.80 ROE[avg,max] = [0.201425498, 0.250000000] radices = 144 16 32
5120 msec/iter = 162.11 ROE[avg,max] = [0.236607143, 0.281250000] radices = 20 16 16
5632 msec/iter = 186.77 ROE[avg,max] = [0.277120536, 0.312500000] radices = 44 16 16
6144 msec/iter = 192.85 ROE[avg,max] = [0.214425223, 0.250000000] radices = 48 16 16
6656 msec/iter = 223.12 ROE[avg,max] = [0.242299107, 0.281250000] radices = 208 16 32
7168 msec/iter = 230.10 ROE[avg,max] = [0.223437500, 0.281250000] radices = 56 16 16
7680 msec/iter = 253.42 ROE[avg,max] = [0.219891357, 0.250000000] radices = 60 16 16
8192 msec/iter = 252.43 ROE[avg,max] = [0.282589286, 0.312500000] radices = 1024 16 16
9216 msec/iter = 306.68 ROE[avg,max] = [0.208818163, 0.265625000] radices = 144 32 32
10240 msec/iter = 371.75 ROE[avg,max] = [0.248660714, 0.312500000] radices = 160 32 32
11264 msec/iter = 409.54 ROE[avg,max] = [0.275306920, 0.328125000] radices = 176 32 32
12288 msec/iter = 423.42 ROE[avg,max] = [0.209234401, 0.234375000] radices = 48 16 16
13312 msec/iter = 493.18 ROE[avg,max] = [0.236830357, 0.281250000] radices = 208 32 32
14336 msec/iter = 476.82 ROE[avg,max] = [0.218526786, 0.250000000] radices = 56 16 16
15360 msec/iter = 535.51 ROE[avg,max] = [0.217006138, 0.250000000] radices = 60 16 16
16384 msec/iter = 530.52 ROE[avg,max] = [0.276339286, 0.281250000] radices = 1024 16 16
18432 msec/iter = 606.73 ROE[avg,max] = [0.212458147, 0.250000000] radices = 144 16 16
20480 msec/iter = 745.91 ROE[avg,max] = [0.251116071, 0.281250000] radices = 160 16 16
22528 msec/iter = 822.14 ROE[avg,max] = [0.283984375, 0.328125000] radices = 176 16 16
24576 msec/iter = 833.16 ROE[avg,max] = [0.225502232, 0.250000000] radices = 192 16 16
26624 msec/iter = 975.42 ROE[avg,max] = [0.251785714, 0.281250000] radices = 208 16 16
28672 msec/iter = 971.73 ROE[avg,max] = [0.219098772, 0.250000000] radices = 224 16 16
30720 msec/iter = 1162.44 ROE[avg,max] = [0.242522321, 0.281250000] radices = 960 16 32
32768 msec/iter = 1075.11 ROE[avg,max] = [0.281250000, 0.281250000] radices = 1024 16 32
[/CODE] |
CPU Load
[CODE]
top - 07:33:43 up 4 days, 3:39, 2 users, load average: 0,80, 0,58, 1,27
Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98,8 us, 0,2 sy, 0,0 ni, 0,8 id, 0,0 wa, 0,0 hi, 0,0 si, 0,2 st
KiB Mem : 2042848 total, 1076884 free, 111976 used, 853988 buff/cache
KiB Swap: 501740 total, 501740 free, 0 used. 1853956 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5580 linux1 20 0 89624 36140 1120 S 197,3 1,8 0:59.77 mlucas
[/CODE] |
But it's great that Mlucas is working on a mainframe!!! :smile:
:primenet: |
[QUOTE=Lorenzo;425611]Not fast :brian-e:[/QUOTE]
Thanks, Lorenzo - it seems you truncated the rightmost column of radices in posting your excerpt from the mlucas.cfg file (e.g. in the first line, 448 means 448 Kdoubles => complex FFT of length 224K = 56*16^3, i.e. there is a trailing 16 missing) - but those are easily inferred.

Just as a point of 'slow' reference, the 32768K timing is roughly what I get on my aged Core2Duo running 2-threaded (1 thread per core) using the SSE2 version of the x86_64 build. My Haswell quad (4-threaded AVX2 build) is 10x faster.

Aside from the overall slowness, the various non-power-of-2 lengths perform decently well, with the notable exception of FFT lengths of the form 15*2^n, which are uniformly dismal - the compiler really doesn't like my scalar-double radix-15 DFT macros, it seems. I guess the only positive thing I can say (as with politics and economics it's all about the optimistic PR spin, you know) is that the scaling to larger runlengths is quite good - compare the 32768K and 1024K timings, for instance, with what one expects based on the asymptotic O(n log n) FFT opcount scaling.

-------------------

Also, to repeat my earlier question: Do we have any way of seeing what kind of hardware is running underneath things? IBM's version of PowerPC? It would be silly if it were actually x86_64 and the cloud setup were masking that from users. |
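For anyone who wants to check the two numerical points in the post above, here is a rough Python sketch. It assumes, as the post states, that the cfg FFT length is in Kdoubles and that the complex FFT length is half of that, and it models cost as n*log2(n). The function names are made up for illustration and are not part of Mlucas; the two timings are copied from Lorenzo's table.

```python
import math

def complex_fft_length(kdoubles):
    """Complex FFT length for a real-data length given in Kdoubles (assumed: half the doubles)."""
    return kdoubles * 1024 // 2

def nlogn_ratio(k_big, k_small):
    """Expected cost ratio under a plain n*log2(n) model; lengths are in Kdoubles."""
    n_big = complex_fft_length(k_big)
    n_small = complex_fft_length(k_small)
    return (n_big * math.log2(n_big)) / (n_small * math.log2(n_small))

# First cfg line: 448 Kdoubles, posted radices 56 16 16 plus the inferred trailing 16.
assert complex_fft_length(448) == math.prod([56, 16, 16, 16])

expected = nlogn_ratio(32768, 1024)  # model prediction
observed = 1075.11 / 28.24           # msec/iter values from the posted table
print(f"expected {expected:.1f}x, observed {observed:.1f}x")
# prints: expected 40.4x, observed 38.1x
```

On this crude model the 32768K timing comes in slightly below the n log n prediction, which is consistent with the "scaling is quite good" remark.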
cat /proc/cpuinfo should reveal some details.
|
[QUOTE=Mark Rose;425680]cat /proc/cpuinfo should reveal some details.[/QUOTE]
Not much info ...
[CODE]
[linux1@lorenzoibm ~]$ cat /proc/cpuinfo
vendor_id : IBM/S390
# processors : 2
bogomips per cpu: 20325.00
features : esan3 zarch stfle msa ldisp eimm dfp etf3eh highgprs
cache0 : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1 : level=1 type=Instruction scope=Private size=96K line_size=256 associativity=6
cache2 : level=2 type=Data scope=Private size=2048K line_size=256 associativity=8
cache3 : level=2 type=Instruction scope=Private size=2048K line_size=256 associativity=8
cache4 : level=3 type=Unified scope=Shared size=65536K line_size=256 associativity=16
cache5 : level=4 type=Unified scope=Shared size=491520K line_size=256 associativity=30
processor 0: version = FF, identification = 016A77, machine = 2964
processor 1: version = FF, identification = 016A77, machine = 2964
[/CODE]
[QUOTE]LinuxOne is a specialised Z13 IBM mainframe for Linux. You can run up to 8000 VMs simultaneously on it. It is a powerful beast, as IBM machines are: the top stuff.[/QUOTE]
So it's an [URL="https://en.wikipedia.org/wiki/IBM_z13_(microprocessor)"]IBM Z13 CPU[/URL]. You can find many more details in the [URL="http://www.redbooks.ibm.com/redbooks/pdfs/sg248251.pdf"]Technical Guide[/URL]. I'm no expert, but I think it's not Power architecture. It's something special ... |
[QUOTE=Mark Rose;425680]cat /proc/cpuinfo should reveal some details.[/QUOTE]
Yes, this usually works very well, but the proc filesystem is Linux-specific, so it may not work on other kernels. There is also the lscpu command, which I believe simply reads /proc/cpuinfo and displays it in a nicer way (using your locale setting). For FreeBSD, I found this [URL="https://stackoverflow.com/questions/4083848/what-is-the-equivalent-of-proc-cpuinfo-on-freebsd-v8-1?rq=1"]post[/URL]. |
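As a small illustration of the "Linux-specific" point above, here is a minimal Python sketch that reads /proc/cpuinfo when it exists and otherwise falls back to what the standard `platform` module reports. The helper names `parse_cpuinfo` and `describe_cpu` are hypothetical, and the parsing only assumes "key : value" lines like the ones in Lorenzo's output.

```python
import os
import platform

def parse_cpuinfo(text):
    """Parse 'key : value' lines into a dict; a repeated key overwrites the earlier one."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info

def describe_cpu(path="/proc/cpuinfo"):
    """Return parsed cpuinfo on Linux-like systems, else a minimal portable fallback."""
    if os.path.exists(path):
        with open(path, encoding="utf-8", errors="replace") as fh:
            return parse_cpuinfo(fh.read())
    return {"machine": platform.machine(), "system": platform.system()}

# Sample lines copied from the s390x output posted in this thread.
sample = "vendor_id : IBM/S390\n# processors : 2\nbogomips per cpu: 20325.00\n"
print(parse_cpuinfo(sample)["vendor_id"])  # prints: IBM/S390
```

On a machine without procfs, `describe_cpu()` still returns something useful, since `platform.machine()` and `platform.system()` do not depend on /proc.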
[QUOTE=Lorenzo;425707]Not much info ...
[CODE]
[linux1@lorenzoibm ~]$ cat /proc/cpuinfo
vendor_id : IBM/S390
# processors : 2
bogomips per cpu: 20325.00
features : esan3 zarch stfle msa ldisp eimm dfp etf3eh highgprs
cache0 : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1 : level=1 type=Instruction scope=Private size=96K line_size=256 associativity=6
cache2 : level=2 type=Data scope=Private size=2048K line_size=256 associativity=8
cache3 : level=2 type=Instruction scope=Private size=2048K line_size=256 associativity=8
cache4 : level=3 type=Unified scope=Shared size=65536K line_size=256 associativity=16
cache5 : level=4 type=Unified scope=Shared size=491520K line_size=256 associativity=30
processor 0: version = FF, identification = 016A77, machine = 2964
processor 1: version = FF, identification = 016A77, machine = 2964
[/CODE]
So it's [URL="https://en.wikipedia.org/wiki/IBM_z13_(microprocessor)"]IBM Z13 CPU[/URL]. Much more details you can find in [URL="http://www.redbooks.ibm.com/redbooks/pdfs/sg248251.pdf"]Technical Guide[/URL]. And i'm not expert but i think it's not Power architecture. It's something special ...[/QUOTE]
Wow! That is a lot of cache!

L1 (per core)
- 96 KB instruction
- 128 KB data
L2 (per core)
- 2 MB instruction
- 2 MB data
L3 (shared)
64 MB eDRAM
L4 (off die, on storage controller chip)
480 MB

[quote]The processor chip has an eight-core design, with either six, seven, or eight active cores, and operates at 5.0 GHz. Depending on the CPC drawer version (39 PU or 42 PU), 39 - 168 PUs are available on 1 - 4 CPC drawers.[/quote]
IBM calls it a PU; we would call it a CPU core. |
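The cache summary above can also be derived mechanically from the cacheN lines. A minimal Python sketch, assuming every size in the output ends in "K" (true for the lines posted here, not checked against other kernels); the names `parse_caches` and `CACHE_LINE` are made up for illustration:

```python
import re

CACHE_LINE = re.compile(r"^cache\d+\s*:\s*(.*)$")

def parse_caches(cpuinfo_text):
    """Extract level, type, scope and size (in KiB) from s390-style cacheN lines."""
    caches = []
    for line in cpuinfo_text.splitlines():
        match = CACHE_LINE.match(line.strip())
        if not match:
            continue
        fields = dict(item.split("=", 1) for item in match.group(1).split())
        caches.append({
            "level": int(fields["level"]),
            "type": fields["type"],
            "scope": fields["scope"],
            "size_kib": int(fields["size"].rstrip("K")),  # assumes a trailing 'K'
        })
    return caches

# Cache lines copied from the /proc/cpuinfo output quoted above.
sample = """\
cache0 : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1 : level=1 type=Instruction scope=Private size=96K line_size=256 associativity=6
cache2 : level=2 type=Data scope=Private size=2048K line_size=256 associativity=8
cache3 : level=2 type=Instruction scope=Private size=2048K line_size=256 associativity=8
cache4 : level=3 type=Unified scope=Shared size=65536K line_size=256 associativity=16
cache5 : level=4 type=Unified scope=Shared size=491520K line_size=256 associativity=30
"""

for cache in parse_caches(sample):
    size = cache["size_kib"]
    pretty = f"{size // 1024} MB" if size >= 1024 else f"{size} KB"
    print(f"L{cache['level']} {cache['type']:<11} {cache['scope']:<7} {pretty}")
```

This reproduces the 128 KB / 96 KB / 2 MB / 2 MB / 64 MB / 480 MB figures listed in the post.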
[QUOTE=VictordeHolland;425720]Wow! That is a lot of cache!

L1 (per core)
- 96 KB instruction
- 128 KB data
L2 (per core)
- 2 MB instruction
- 2 MB data
L3 (shared)
64 MB eDRAM
L4 (off die, on storage controller chip)
480 MB

IBM calls it a PU; we would call it a CPU core.[/QUOTE]
That explains the excellent timing-scaling in going to larger FFT lengths which we see in Lorenzo's cfg-file results.

If we had some relatively efficient way to map x86_64 SIMD code to this arch's SIMD, things could get rather interesting. I shall have a look at the PDF Lorenzo linked later today. |
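To put rough numbers on the cache explanation, here is a back-of-the-envelope Python sketch. It assumes 8 bytes per double and counts only the main FFT data array, ignoring twiddle tables, other scratch data, and the fact that the shared L3 and L4 are also used by other cores and VMs. The cache sizes are the ones listed above; the function names are hypothetical.

```python
# Shared cache sizes from the post above, in bytes (L1/L2 are far too small for these lengths).
CACHE_BYTES = {"L3": 64 * 1024**2, "L4": 480 * 1024**2}

def fft_data_bytes(kdoubles):
    """Size of the main data array for an FFT length given in Kdoubles (8 bytes per double)."""
    return kdoubles * 1024 * 8

def smallest_fitting_cache(kdoubles):
    """Name of the smallest listed cache level that could hold the whole array, or None."""
    need = fft_data_bytes(kdoubles)
    for name, size in CACHE_BYTES.items():  # dicts keep insertion order: L3, then L4
        if need <= size:
            return name
    return None

for k in (1024, 4096, 32768):
    mb = fft_data_bytes(k) // 1024**2
    print(f"{k}K doubles = {mb} MB -> {smallest_fitting_cache(k)}")
```

By this estimate even the largest length in the table, 32768K doubles (256 MB), could in principle sit entirely in the 480 MB L4, which fits the observation that timings do not fall off a cliff at large FFT lengths.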
[QUOTE=ewmayer;425751]If we had some relatively efficient way to map x86_64 SIMD code to this arch's SIMD, things could get rather interesting. I shall have a look at the PDF Lorenzo linked later today.[/QUOTE]
Had a look - see nothing actually resembling an instruction set reference in there. Could someone point me to one? With just 139 SIMD instructions it wouldn't have taken up more than a decent-sized chapter or appendix in such a document.

I did note this, however (Chapter 3. Central processor complex system design, p91), which mentions no floating-point among the SIMD - that would be a curious omission if indeed such are supported:
[i]Here are some examples of SIMD instructions:
o Integer byte to quadword add, sub, and compare
o Integer byte to doubleword min, max, and average
o Integer byte to word multiply
o String find 8-bits, 16-bits, and 32-bits
o String range compare
o String find any equal
o String load to block boundaries and load/store with length[/i] |
[QUOTE=ewmayer;425791]Had a look - see nothing actually resembling an instruction set reference in there. Could someone point me to one? With just 139 SIMD instructions it wouldn't have taken up more than a decent-sized chapter or appendix in such a document.

I did note this, however (Chapter 3. Central processor complex system design, p91), which mentions no floating-point among the SIMD - that would be a curious omission if indeed such are supported:
[I]Here are some examples of SIMD instructions:
o Integer byte to quadword add, sub, and compare
o Integer byte to doubleword min, max, and average
o Integer byte to word multiply
o String find 8-bits, 16-bits, and 32-bits
o String range compare
o String find any equal
o String load to block boundaries and load/store with length[/I][/QUOTE]
I just found this documentation on the internet: [URL="https://www-304.ibm.com/support/docview.wss?uid=isg29c69415c1e82603c852576700058075a&aid=1"]z/architecture reference summary[/URL]. Pages 22 to 25 list the 139 vector instructions (of course I did not actually count them!), for example VMAH (vector multiply and add high).

Also, I have created an s390x testing branch (there are only 2 commits). Anyone interested is encouraged to test whether it builds and passes the tests. The instructions are as follows:
$ git clone [URL]https://gitlab.com/mlucas-ll/mlucas.git[/URL]
$ cd mlucas && touch * && git checkout s390x
$ mkdir build && cd build && ../configure && make -j && make -j check
(Of course, you must have git, gcc and make installed!) |