mersenneforum.org

MLucas on IBM Mainframe (https://www.mersenneforum.org/showthread.php?t=20962)

Lorenzo 2016-02-08 12:30

[QUOTE=ET_;425606]Can you tell us anything about its performances? I guess it's running with 2 threads...

Luigi[/QUOTE]
Not fast :brian-e:

[CODE] 448 msec/iter = 12.80 ROE[avg,max] = [0.224609375, 0.250000000] radices = 56 16 16
480 msec/iter = 14.04 ROE[avg,max] = [0.210880824, 0.250000000] radices = 60 16 16
512 msec/iter = 14.23 ROE[avg,max] = [0.281250000, 0.281250000] radices = 128 8 16
576 msec/iter = 16.14 ROE[avg,max] = [0.208354841, 0.250000000] radices = 144 8 16
640 msec/iter = 19.46 ROE[avg,max] = [0.257421875, 0.312500000] radices = 160 8 16
704 msec/iter = 21.52 ROE[avg,max] = [0.274654715, 0.343750000] radices = 176 8 16
768 msec/iter = 21.99 ROE[avg,max] = [0.209895543, 0.250000000] radices = 48 16 16
832 msec/iter = 24.75 ROE[avg,max] = [0.239439174, 0.312500000] radices = 208 8 16
896 msec/iter = 25.59 ROE[avg,max] = [0.227832031, 0.312500000] radices = 56 16 16
960 msec/iter = 28.33 ROE[avg,max] = [0.212360491, 0.250000000] radices = 60 16 16
1024 msec/iter = 28.24 ROE[avg,max] = [0.312500000, 0.312500000] radices = 128 16 16
1152 msec/iter = 32.89 ROE[avg,max] = [0.208562687, 0.253906250] radices = 144 16 16
1280 msec/iter = 40.32 ROE[avg,max] = [0.235714286, 0.312500000] radices = 20 8 16
1408 msec/iter = 42.28 ROE[avg,max] = [0.273688616, 0.343750000] radices = 176 16 16
1536 msec/iter = 44.80 ROE[avg,max] = [0.223493304, 0.281250000] radices = 192 16 16
1664 msec/iter = 48.48 ROE[avg,max] = [0.246149554, 0.312500000] radices = 208 16 16
1792 msec/iter = 51.94 ROE[avg,max] = [0.220703125, 0.281250000] radices = 224 16 16
1920 msec/iter = 61.24 ROE[avg,max] = [0.212430246, 0.257812500] radices = 60 16 32
2048 msec/iter = 56.56 ROE[avg,max] = [0.312500000, 0.312500000] radices = 128 16 16
2304 msec/iter = 65.73 ROE[avg,max] = [0.208895438, 0.250000000] radices = 144 16 16
2560 msec/iter = 79.33 ROE[avg,max] = [0.245312500, 0.281250000] radices = 20 16 16
2816 msec/iter = 85.93 ROE[avg,max] = [0.272896903, 0.343750000] radices = 176 16 16
3072 msec/iter = 91.91 ROE[avg,max] = [0.225892857, 0.281250000] radices = 192 16 16
3328 msec/iter = 97.41 ROE[avg,max] = [0.241322545, 0.281250000] radices = 208 16 16
3584 msec/iter = 105.64 ROE[avg,max] = [0.220870536, 0.250000000] radices = 224 16 16
3840 msec/iter = 132.28 ROE[avg,max] = [0.213867188, 0.242187500] radices = 60 32 32
4096 msec/iter = 116.38 ROE[avg,max] = [0.224023438, 0.250000000] radices = 16 16 16
4608 msec/iter = 141.80 ROE[avg,max] = [0.201425498, 0.250000000] radices = 144 16 32
5120 msec/iter = 162.11 ROE[avg,max] = [0.236607143, 0.281250000] radices = 20 16 16
5632 msec/iter = 186.77 ROE[avg,max] = [0.277120536, 0.312500000] radices = 44 16 16
6144 msec/iter = 192.85 ROE[avg,max] = [0.214425223, 0.250000000] radices = 48 16 16
6656 msec/iter = 223.12 ROE[avg,max] = [0.242299107, 0.281250000] radices = 208 16 32
7168 msec/iter = 230.10 ROE[avg,max] = [0.223437500, 0.281250000] radices = 56 16 16
7680 msec/iter = 253.42 ROE[avg,max] = [0.219891357, 0.250000000] radices = 60 16 16
8192 msec/iter = 252.43 ROE[avg,max] = [0.282589286, 0.312500000] radices = 1024 16 16
9216 msec/iter = 306.68 ROE[avg,max] = [0.208818163, 0.265625000] radices = 144 32 32
10240 msec/iter = 371.75 ROE[avg,max] = [0.248660714, 0.312500000] radices = 160 32 32
11264 msec/iter = 409.54 ROE[avg,max] = [0.275306920, 0.328125000] radices = 176 32 32
12288 msec/iter = 423.42 ROE[avg,max] = [0.209234401, 0.234375000] radices = 48 16 16
13312 msec/iter = 493.18 ROE[avg,max] = [0.236830357, 0.281250000] radices = 208 32 32
14336 msec/iter = 476.82 ROE[avg,max] = [0.218526786, 0.250000000] radices = 56 16 16
15360 msec/iter = 535.51 ROE[avg,max] = [0.217006138, 0.250000000] radices = 60 16 16
16384 msec/iter = 530.52 ROE[avg,max] = [0.276339286, 0.281250000] radices = 1024 16 16
18432 msec/iter = 606.73 ROE[avg,max] = [0.212458147, 0.250000000] radices = 144 16 16
20480 msec/iter = 745.91 ROE[avg,max] = [0.251116071, 0.281250000] radices = 160 16 16
22528 msec/iter = 822.14 ROE[avg,max] = [0.283984375, 0.328125000] radices = 176 16 16
24576 msec/iter = 833.16 ROE[avg,max] = [0.225502232, 0.250000000] radices = 192 16 16
26624 msec/iter = 975.42 ROE[avg,max] = [0.251785714, 0.281250000] radices = 208 16 16
28672 msec/iter = 971.73 ROE[avg,max] = [0.219098772, 0.250000000] radices = 224 16 16
30720 msec/iter = 1162.44 ROE[avg,max] = [0.242522321, 0.281250000] radices = 960 16 32
32768 msec/iter = 1075.11 ROE[avg,max] = [0.281250000, 0.281250000] radices = 1024 16 32
[/CODE]

Lorenzo 2016-02-08 12:35

CPU Load
[CODE]top - 07:33:43 up 4 days, 3:39, 2 users, load average: 0,80, 0,58, 1,27
Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98,8 us, 0,2 sy, 0,0 ni, 0,8 id, 0,0 wa, 0,0 hi, 0,0 si, 0,2 st
KiB Mem : 2042848 total, 1076884 free, 111976 used, 853988 buff/cache
KiB Swap: 501740 total, 501740 free, 0 used. 1853956 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5580 linux1 20 0 89624 36140 1120 S 197,3 1,8 0:59.77 mlucas
[/CODE]

Lorenzo 2016-02-08 12:39

But it's great that Mlucas is working on a mainframe!!! :smile:

:primenet:

ewmayer 2016-02-08 22:22

[QUOTE=Lorenzo;425611]Not fast :brian-e:[/QUOTE]

Thanks, Lorenzo - it seems the rightmost radix columns were truncated when you posted your excerpt from the mlucas.cfg file (e.g. in the first line, 448 means 448 Kdoubles => complex FFT of length 224K = 56*16^3, i.e. there is a trailing 16 missing) - but those are easily inferred.
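
For anyone who wants to redo that inference on the other lines, here is a minimal sketch. It assumes, as in the 448K example above, that the leading number is the FFT length in Kdoubles (K = 1024) and that the complex length (half of that) equals the product of all radices. The function name is made up for illustration and is not part of Mlucas.

```python
# Infer the product of the radices that were cut off in a posted cfg line.
# Assumption: complex FFT length = (Kdoubles * 1024) / 2 = product of all radices.

def missing_radix_product(fft_kdoubles, shown_radices):
    """Return the product of the radices missing from a truncated cfg line."""
    complex_len = fft_kdoubles * 1024 // 2
    shown = 1
    for r in shown_radices:
        shown *= r
    if complex_len % shown != 0:
        raise ValueError("shown radices do not divide the complex FFT length")
    return complex_len // shown

# A few lines taken from the table posted above.
examples = {
    448: [56, 16, 16],
    512: [128, 8, 16],
    1280: [20, 8, 16],
}
for k, radices in examples.items():
    print(k, radices, "missing product:", missing_radix_product(k, radices))
```

A result larger than the biggest single radix means more than one radix was cut off; how that product splits into individual radices cannot be recovered from the product alone.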

Just as a point of 'slow' reference, the 32768K timing is roughly what I get on my aged Core2Duo running 2-threaded (1 thread per core) using the SSE2 version of the x86_64 build. My Haswell quad (4-threaded AVX2 build) is 10x faster.

Aside from the overall slowness, the various non-power-of-2 lengths perform decently well, with the notable exception of FFT lengths of the form 15*2^n, which are uniformly dismal - the compiler really doesn't like my scalar-double radix-15 DFT macros, it seems. I guess the only positive thing I can say (as with politics and economics, it's all about the optimistic PR spin, you know) is that the scaling to larger runlengths is quite good - compare the 32768K and 1024K timings, for instance, with what one expects based on the asymptotic O(n log n) FFT opcount scaling.
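
The scaling comparison can be worked through as a small sketch. It takes n to be the number of doubles (an assumption; using the complex length n/2 instead shifts the expected ratio only slightly) and uses the 1024K and 32768K timings from Lorenzo's table.

```python
import math

def nlogn_ratio(n_small, n_large):
    """Ratio of n*log2(n) operation counts between two transform lengths."""
    return (n_large * math.log2(n_large)) / (n_small * math.log2(n_small))

K = 1024
n_small = 1024 * K      # 1024K doubles
n_large = 32768 * K     # 32768K doubles

# msec/iter values copied from the posted cfg excerpt.
t_small = 28.24
t_large = 1075.11

expected = nlogn_ratio(n_small, n_large)
observed = t_large / t_small

print(f"expected from n log n: {expected:.2f}x")
print(f"observed timing ratio: {observed:.2f}x")
```

An observed ratio below the n log n prediction is what "scaling is quite good" means here: the per-operation cost does not get worse as the transform grows.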

-------------------

Also, to repeat my earlier question: Do we have any way of seeing what kind of hardware is running underneath things? IBM's version of PowerPC? It would be silly if it were actually x86_64 and the cloud setup were masking that from users.

Mark Rose 2016-02-08 22:37

cat /proc/cpuinfo should reveal some details.

Lorenzo 2016-02-09 07:57

[QUOTE=Mark Rose;425680]cat /proc/cpuinfo should reveal some details.[/QUOTE]
Not much info ...
[CODE][linux1@lorenzoibm ~]$ cat /proc/cpuinfo
vendor_id : IBM/S390
# processors : 2
bogomips per cpu: 20325.00
features : esan3 zarch stfle msa ldisp eimm dfp etf3eh highgprs
cache0 : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1 : level=1 type=Instruction scope=Private size=96K line_size=256 associativity=6
cache2 : level=2 type=Data scope=Private size=2048K line_size=256 associativity=8
cache3 : level=2 type=Instruction scope=Private size=2048K line_size=256 associativity=8
cache4 : level=3 type=Unified scope=Shared size=65536K line_size=256 associativity=16
cache5 : level=4 type=Unified scope=Shared size=491520K line_size=256 associativity=30
processor 0: version = FF, identification = 016A77, machine = 2964
processor 1: version = FF, identification = 016A77, machine = 2964[/CODE]
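
The cache lines in that output are regular enough to parse mechanically. A minimal sketch follows, assuming the `cacheN : key=value ...` layout shown above and that the `K` suffix means KiB; the function name is made up for illustration.

```python
import re

# Cache lines copied from the /proc/cpuinfo output posted above.
CPUINFO_EXCERPT = """\
cache0 : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1 : level=1 type=Instruction scope=Private size=96K line_size=256 associativity=6
cache2 : level=2 type=Data scope=Private size=2048K line_size=256 associativity=8
cache3 : level=2 type=Instruction scope=Private size=2048K line_size=256 associativity=8
cache4 : level=3 type=Unified scope=Shared size=65536K line_size=256 associativity=16
cache5 : level=4 type=Unified scope=Shared size=491520K line_size=256 associativity=30
"""

def parse_caches(text):
    """Parse 'cacheN : key=value ...' lines into a list of dicts."""
    caches = []
    for line in text.splitlines():
        match = re.match(r"cache\d+\s*:\s*(.+)", line.strip())
        if match is None:
            continue
        fields = dict(item.split("=", 1) for item in match.group(1).split())
        # Sizes are printed as e.g. "128K"; assume the K suffix means KiB.
        fields["size_kib"] = int(fields.pop("size").rstrip("K"))
        caches.append(fields)
    return caches

caches = parse_caches(CPUINFO_EXCERPT)
for c in caches:
    print(f"L{c['level']} {c['type']:<11} {c['scope']:<7} {c['size_kib'] / 1024:g} MiB")
```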

[QUOTE]LinuxOne is a specialised Z13 IBM mainframe for Linux. You can run up to 8000 VMs simultaneously on it. It is a powerful beast like IBM does, the top stuff.[/QUOTE]
So it's an [URL="https://en.wikipedia.org/wiki/IBM_z13_(microprocessor)"]IBM Z13 CPU[/URL]. Many more details can be found in the [URL="http://www.redbooks.ibm.com/redbooks/pdfs/sg248251.pdf"]Technical Guide[/URL]. I'm no expert, but I think it's not the Power architecture. It's something special ...

alexvong1995 2016-02-09 10:02

[QUOTE=Mark Rose;425680]cat /proc/cpuinfo should reveal some details.[/QUOTE]
Yes, this usually works very well, but the proc filesystem is Linux-specific, so it may not work on other kernels.
There is also the lscpu command, which I believe simply reads /proc/cpuinfo and displays it in a nicer way (using your locale setting).
For FreeBSD, I found this [URL="https://stackoverflow.com/questions/4083848/what-is-the-equivalent-of-proc-cpuinfo-on-freebsd-v8-1?rq=1"]post[/URL].
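
If Python happens to be installed, a kernel-independent way to get at least the architecture name is the standard `platform` module. This is only a sketch: it reports far less than /proc/cpuinfo, and the value I would expect on this guest ("s390x") is an assumption, not something verified in this thread.

```python
import os
import platform

def describe_host():
    """Collect a few portable facts about the machine we are running on."""
    return {
        "machine": platform.machine(),   # architecture name as reported by the OS
        "system": platform.system(),     # e.g. "Linux" or "FreeBSD"
        "logical_cpus": os.cpu_count(),  # may be None if it cannot be determined
    }

info = describe_host()
for key, value in info.items():
    print(f"{key}: {value}")
```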

VictordeHolland 2016-02-09 13:00

[QUOTE=Lorenzo;425707]Not much info ...
[CODE][linux1@lorenzoibm ~]$ cat /proc/cpuinfo
vendor_id : IBM/S390
# processors : 2
bogomips per cpu: 20325.00
features : esan3 zarch stfle msa ldisp eimm dfp etf3eh highgprs
cache0 : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1 : level=1 type=Instruction scope=Private size=96K line_size=256 associativity=6
cache2 : level=2 type=Data scope=Private size=2048K line_size=256 associativity=8
cache3 : level=2 type=Instruction scope=Private size=2048K line_size=256 associativity=8
cache4 : level=3 type=Unified scope=Shared size=65536K line_size=256 associativity=16
cache5 : level=4 type=Unified scope=Shared size=491520K line_size=256 associativity=30
processor 0: version = FF, identification = 016A77, machine = 2964
processor 1: version = FF, identification = 016A77, machine = 2964[/CODE]So it's [URL="https://en.wikipedia.org/wiki/IBM_z13_(microprocessor)"]IBM Z13 CPU[/URL]. Much more details you can find in [URL="http://www.redbooks.ibm.com/redbooks/pdfs/sg248251.pdf"]Technical Guide[/URL]. And i'm not expert but i think it's not Power architecture. It's something special ...[/QUOTE]
Wow! That is a lot of cache!

L1 (per core)
- 96 KB instruction
- 128 KB data
L2 (per core)
- 2 MB instruction
- 2 MB data
L3 (shared)
- 64 MB eDRAM
L4 (off die, on storage controller chip)
- 480 MB
[quote]The processor chip has an eight-core design, with either six, seven, or eight active cores, and operates at 5.0 GHz. Depending on the CPC drawer version (39 PU or 42 PU), 39 - 168 PUs are available on 1 - 4 CPC drawers.[/quote]
IBM calls it a PU; we would call it a CPU core.

ewmayer 2016-02-09 20:17

[QUOTE=VictordeHolland;425720]Wauw! That is a lot of cache!

L1 (per core)
-96 KB instruction
-128 KB Data
L2 (per core)
-2 MB instruction
-2 MB Data
L3 (shared)
64 MB eDRAM
L4 (off die, on storage controller chip)
480 MB
IBM names it a PU, we would call it a CPUcore.[/QUOTE]

That explains the excellent timing-scaling in going to larger FFT lengths which we see in Lorenzo's cfg-file results.
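
As a rough back-of-envelope check of that explanation, the sketch below compares the size of the main FFT array with the cache capacities quoted above. It is deliberately crude: it counts only the array of 8-byte doubles, ignores auxiliary tables, and ignores the fact that the shared L3 and L4 are also used by other guests on the machine.

```python
# Cache capacities in bytes, taken from the figures quoted above
# (data-side sizes for L1 and L2).
KIB = 1024
CACHE_LEVELS = [
    ("L1", 128 * KIB),
    ("L2", 2048 * KIB),
    ("L3", 65536 * KIB),
    ("L4", 491520 * KIB),
]

def fft_data_bytes(fft_kdoubles):
    """Bytes occupied by the main FFT array alone (8-byte doubles)."""
    return fft_kdoubles * 1024 * 8

def smallest_level_holding(fft_kdoubles):
    """Name of the smallest cache level that could hold the whole array."""
    nbytes = fft_data_bytes(fft_kdoubles)
    for name, capacity in CACHE_LEVELS:
        if nbytes <= capacity:
            return name
    return "RAM"

for k in (448, 1024, 8192, 32768):
    mib = fft_data_bytes(k) / (1024 * 1024)
    print(f"{k:>6}K doubles = {mib:6.1f} MiB -> {smallest_level_holding(k)}")
```

Under these assumptions even the largest length in the table, 32768K, has an array that is smaller than the L4.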

If we had some relatively efficient way to map x86_64 SIMD code to this arch's SIMD, things could get rather interesting. I shall have a look at the PDF Lorenzo linked later today.

ewmayer 2016-02-10 03:51

[QUOTE=ewmayer;425751]If we had some relatively efficient way to map x86_64 SIMD code to this arch's SIMD, things could get rather interesting. I shall have a look at the PDF Lorenzo linked later today.[/QUOTE]

Had a look - see nothing actually resembling an instruction set reference in there. Could someone point me to one? With just 139 SIMD instructions it wouldn't have taken up more than a decent-sized chapter or appendix in such a document.

I did note this, however (Chapter 3. Central processor complex system design, p91), which mentions no floating-point among the SIMD - that would be a curious omission if indeed such are supported:
[i]
Here are some examples of SIMD instructions:
o Integer byte to quadword add, sub, and compare
o Integer byte to doubleword min, max, and average
o Integer byte to word multiply
o String find 8-bits, 16-bits, and 32-bits
o String range compare
o String find any equal
o String load to block boundaries and load/store with length[/i]

alexvong1995 2016-02-10 05:24

[QUOTE=ewmayer;425791]Had a look - see nothing actually resembling an instruction set reference in there. Could someone point me to one? With just 139 SIMD instructions it wouldn't have taken up more than a decent-sized chapter or appendix in such a document.

I did note this, however (Chapter 3. Central processor complex system design, p91), which mentions no floating-point among the SIMD - that would be a curious omission if indeed such are supported:
[I]
Here are some examples of SIMD instructions:
o Integer byte to quadword add, sub, and compare
o Integer byte to doubleword min, max, and average
o Integer byte to word multiply
o String find 8-bits, 16-bits, and 32-bits
o String range compare
o String find any equal
o String load to block boundaries and load/store with length[/I][/QUOTE]
I just found this document, the [URL="https://www-304.ibm.com/support/docview.wss?uid=isg29c69415c1e82603c852576700058075a&aid=1"]z/architecture reference summary[/URL], on the internet. Pages 22 to 25 list the 139 vector instructions (of course I did not actually count them!), for example VMAH (vector multiply and add high)...

I have also created an s390x testing branch (there are only 2 commits). Anyone interested is encouraged to check whether it builds and passes the tests. The instructions are as follows:
$ git clone [URL]https://gitlab.com/mlucas-ll/mlucas.git[/URL]
$ cd mlucas && touch * && git checkout s390x
$ mkdir build && cd build && ../configure && make -j && make -j check
(of course you must have git, gcc and make installed!)


All times are UTC.
