mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve benchmarking (https://www.mersenneforum.org/showthread.php?t=25169)

Xyzzy 2020-02-03 22:39

Msieve benchmarking
 
2 Attachment(s)
We have uploaded a data file to use for msieve benchmarking.

[URL]https://www.dropbox.com/s/si1kyxq7yerahcw/benchmark.tar.gz[/URL]

[C]$ md5sum benchmark.tar.gz
ba398f2ff3c2aff9a6c6f8cbb6a96a93 benchmark.tar.gz[/C]
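(If you want to double-check the download before running anything, the usual GNU coreutils/tar steps would be something like this; the checksum string is just the one above.)

[C]$ echo "ba398f2ff3c2aff9a6c6f8cbb6a96a93  benchmark.tar.gz" | md5sum -c
benchmark.tar.gz: OK
$ tar -xzf benchmark.tar.gz[/C]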

It would be cool if timings for various setups were posted here.

If you need help, please ask!

:mike:

VBCurtis 2020-02-15 23:29

Machine: HP Z620, dual 10-core Ivy Bridge Xeon @2.6ish ghz
64GB memory per socket
-nc1 was run with target-density 134. After remdups and adding freerels in, msieve states 99.3M unique relations. Matrix came out 4.57M dimensions. TD=140 did not complete filtering.
Using taskset -c 10-19 to lock to socket #2 with VBITS=128 msieve compilation option.
10-threaded ETA after 1% of job: 5 hr 25 min.
5-threaded ETA after 1% of job: 9 hr 30 min.

Future tests will explore ETAs of smaller target densities, as well as splitting the job over two sockets.
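For anyone wanting to reproduce this kind of run, the socket-pinned invocation described above amounts to something like the following (a sketch; the core range and thread count depend on your machine, and the binary needs to have been built with VBITS=128):

[CODE]# pin the linear algebra to the ten cores of socket #2, ten threads
taskset -c 10-19 ./msieve -v -nc2 -t 10[/CODE]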

bsquared 2020-02-17 22:20

Machine: 2 sockets of 20-core Cascade-Lake Xeon
Just used the default density.
matrix is 5149968 x 5150142 (1913.9 MB) with weight 597210677 (115.96/col)

Here is a basic 40 threaded job across both sockets (actually, I guess it is thread-limited to 32 threads):
4 hrs 58 min: /msieve -v -nc2 -t 40

Using MPI helps a lot. Here are various configurations using different VBITS settings (timings after 1% elapsed):

[CODE]2x20 core VBITS=64
2 hrs 30 min: mpirun -np 4 msieve -nc2 2,2 -v -t 10
2 hrs 43 min: mpirun -np 8 msieve -nc2 2,4 -v -t 5
3 hrs 1 min: mpirun -np 20 msieve -nc2 4,5 -v -t 2
3 hrs 23 min: mpirun -np 40 msieve -nc2 5,8 -v
3 hrs 23 min: mpirun -np 40 msieve -nc2 8,5 -v

2x20 core VBITS=128
2 hrs 32 min: mpirun -np 8 msieve -nc2 2,4 -v -t 5
2 hrs 36 min: mpirun -np 20 msieve -nc2 4,5 -v -t 2
2 hrs 45 min: mpirun -np 40 msieve -nc2 5,8 -v
2 hrs 47 min: mpirun -np 4 msieve -nc2 2,2 -v -t 10
2 hrs 54 min: mpirun -np 40 msieve -nc2 8,5 -v

2x20 core VBITS=256
2 hrs 43 min: mpirun -np 40 msieve -nc2 5,8 -v
2 hrs 44 min: mpirun -np 8 msieve -nc2 2,4 -v -t 5
2 hrs 47 min: mpirun -np 40 msieve -nc2 8,5 -v
2 hrs 49 min: mpirun -np 4 msieve -nc2 2,2 -v -t 10
3 hrs 2 min: mpirun -np 20 msieve -nc2 4,5 -v -t 2[/CODE]

VBITS=128 seems to be the most consistently fast.

Grids that were significantly less square (e.g., 2x20 or 4x1 -t 10) didn't do as well.

RichD 2020-02-18 01:50

I have a Sandy Bridge Core-i5 (4 cores, no HT). Would this small machine be useful for a benchmark? I have two versions of Msieve: one with VBITS=128 and an older one without VBITS that I use for poly search (GPU enabled). When I built the VBITS=128 version I noticed about a 10% boost (or more) in my post-processing speed.

VBCurtis 2020-02-18 01:58

RichD- yes! I'd like to see how various generations of hardware compare, regular desktop or Xeon-grade. This also helps others see if perhaps their msieve copy isn't as fast as it could be (e.g. compiling it oneself can prove *much* faster if the binary one finds online isn't compiled for the same architecture).

Xyzzy 2020-02-18 03:28

Is there any way to tell how many cores and what target density was used by viewing the log file?

Maybe we are looking in the wrong place? Or maybe we can patch the source to include this info?

:mike:

VBCurtis 2020-02-18 04:51

Target density is listed in the log just below the polynomial, before msieve begins reading relations. If no density line is evident, then none was specified by the user and the default density of 70 was used.
I believe the number of cores is listed when the -nc2 phase begins; something like "8 threads" usually appears in the lines just before the first ETA is printed to the log.
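A quick way to pull those lines out of a log without scrolling (assuming GNU grep and a log named msieve.log; the exact wording can vary between msieve versions):

[C]$ grep -i "density" msieve.log
$ grep -i "threads" msieve.log[/C]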

Xyzzy 2020-02-18 12:53

2 Attachment(s)
Since there is no mention of threads or target density in the log files, these runs must have been done with the default target density and one thread.

linux.log.gz is an AMD 1920X CPU with quad-channel DDR4-2666 memory.
windows.log.gz is an Intel i7-9700K CPU with dual-channel DDR4-3200.

We will re-run these later with various settings to tune our systems better.

:mike:

bsquared 2020-02-18 14:44

Looks like 2 threads for 43.9 hrs in the linux case:
[QUOTE=linux.log.gz]

[U]Thu Jan 30 18:22:25 2020 commencing Lanczos iteration (2 threads)[/U]
Thu Jan 30 18:22:25 2020 memory use: 1762.9 MB
Thu Jan 30 18:23:17 2020 linear algebra at 0.0%, ETA 46h11m
Thu Jan 30 18:23:34 2020 checkpointing every 120000 dimensions
Sat Feb 1 14:08:28 2020 lanczos halted after 81439 iterations (dim = 5149917)
Sat Feb 1 14:08:33 2020 recovered 25 nontrivial dependencies
Sat Feb 1 14:08:33 2020 BLanczosTime: 158039

[/QUOTE]

Summary:
43.9 hrs: 2 threads Linux AMD 1920X CPU with quad-channel DDR4-2666 memory
36.6 hrs: 1 thread Windows Intel i7-9700K CPU with dual-channel DDR4-3200

Xyzzy 2020-03-10 12:45

2 Attachment(s)
Here are benchmarks for 1 through 12 cores on our 1920X and a pretty chart.

The blue line in the chart represents perfect additional core utilization. For example, two cores would be twice as fast as one.

We graphed the linear algebra times.

All benchmarks were done on an otherwise idle system. IRL, with lots of stuff running, things slow down dramatically.

:mike:

VBCurtis 2020-03-10 18:07

Xyzzy-
While unlikely, it is possible that 20 or 24 threads yields a bit of improvement. Hyperthreads don't always help on matrix solving, but since this is a benchmark thread it might be nice to demonstrate that.

I suggest 20 as an alternative because using every possible HT might be impacted by any background process, but that effect should be reduced if we leave a few HTs 'open'. I've found situations where using N-1 cores runs faster than N cores, for what I presume are similar reasons.

pinhodecarlos 2020-03-10 18:17

HT helps a lot on LA, at least for me.

VBCurtis 2020-03-12 23:36

[QUOTE=VBCurtis;537677]-nc1 was run with target-density 134. After remdups and adding freerels in, msieve states 99.3M unique relations. Matrix came out 4.57M dimensions. TD=140 did not complete filtering.[/QUOTE]

Machine: Xeon 2680v3 Haswell generation 12x2.5ghz, 48GB memory on 4 channels DDR4 (4x4GB+4x8GB).

VBITS=128 on otherwise idle machine. ETA after 1% of job:
6-threaded 14hr 34 min
12-threads 8 hr 26 min
18-threads 9 hr 15 min
24-threads 8 hr 27 min
These times look rather slow; I just installed the extra 32GB memory today, so perhaps filling all 8 slots slows memory access a bunch. At some point I'll remove the original 16GB and see if 4 sticks are faster than 8.

Xyzzy 2020-04-27 12:16

1 Attachment(s)
[QUOTE=VBCurtis;539303]While unlikely, it is possible that 20 or 24 threads yields a bit of improvement. Hyperthreads don't always help on matrix solving, but since this is a benchmark thread it might be nice to demonstrate that.

I suggest 20 as an alternative because using every possible HT might be impacted by any background process, but that effect should be reduced if we leave a few HTs 'open'. I've found situations where using N-1 cores runs faster than N cores, for what I presume are similar reasons.[/QUOTE]We ran a 24 thread test last night. It was 1.09% faster than the 12 thread job. During the run, the CPU reported roughly 1700% utilization, so there must be a lot of overhead and/or bottlenecks. We are currently running a 20 thread test that we will post later.

Note that we only count the LA phase in our calculations.

:mike:

Xyzzy 2020-04-27 22:12

1 Attachment(s)
[QUOTE=Xyzzy;543938]We are currently running a 20 thread test that we will post later.[/QUOTE]The 20 thread run somehow ended up slower than the 12 thread run.

12 = 8h04m50s
20 = 8h32m31s
24 = 7h59m33s

:mike:

Xyzzy 2020-08-07 16:31

1 Attachment(s)
[C]CPU = i7-8565U
RAM = 2×16GB DDR4-2400
CMD = ./msieve -v -nc -t 8
LA = 47884s[/C]

:mike:

Xyzzy 2020-08-08 17:00

1 Attachment(s)
[C]CPU = i7-8565U
RAM = 2×16GB DDR4-2400
CMD = ./msieve -v -nc -t 4
LA = 51662s[/C]

:mike:

Xyzzy 2020-08-12 17:59

1 Attachment(s)
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 27180s[/C]

:mike:

kurtb 2020-08-14 07:53

In my experience, the throughput depends on whether another program (e.g. gmp-ecm) is running at the same time.

machine: i7-7820X - 8 cores + HT
matrix: 49M * 49M
memory: 64 GB

msieve .... -t16 solo ~ 55% (power according to Task Manager)
msieve .... -t16 and gmp-ecm (priority: low) ~ 78% -"-

With msieve + mprime/Prime95 the effectiveness is a little lower.
Kurt

frmky 2020-08-14 18:06

[QUOTE=bsquared;537791]Machine: 2 sockets of 20-core Cascade-Lake Xeon
Just used the default density.
matrix is 5149968 x 5150142 (1913.9 MB) with weight 597210677 (115.96/col)

Here is a basic 40 threaded job across both sockets (actually, I guess it is thread-limited to 32 threads):
4 hrs 58 min: /msieve -v -nc2 -t 40

Using MPI helps a lot. Here are various configurations using different VBITS settings (timings after 1% elapsed):
[/QUOTE]

I know this is late, but if you still have this data set up, try
mpirun -np 2 msieve -nc2 1,2 -v -t 20

frmky 2020-08-15 07:23

Here's a bench using compute nodes with one Xeon E5-2650 v4 Broadwell cpu with 12-cores, 24 threads.

1 node 7h 40m
2 nodes 2h 45m
4 nodes 1h 35m
8 nodes 1h 10m

Not sure why the time for one node is so high compared to the others? Perhaps something fitting into the cache with the smaller matrices on each node?

VBCurtis 2020-08-15 15:58

[QUOTE=frmky;553739]Here's a bench using compute nodes with one Xeon E5-2650 v4 Broadwell cpu with 12-cores, 24 threads.

1 node 7h 40m
2 nodes 2h 45m
4 nodes 1h 35m
8 nodes 1h 10m

Not sure why the time for one node is so high compared to the others? Perhaps something fitting into the cache with the smaller matrices on each node?[/QUOTE]

It never occurred to me that MPI with 2 nodes would be more than twice as fast, under any test. Neat! Now, if only Ubuntu would fix MPI....

bsquared 2020-08-18 13:15

[QUOTE=frmky;553671]I know this is late, but if you still have this data set up, try
mpirun -np 2 msieve -nc2 1,2 -v -t 20[/QUOTE]

After 1% elapsed, the ETA is:

[CODE]
-np 2 1x2 -t 20: 3 hrs 9 min
-np 4 1x4 -t 10: 2 hrs 48 min
-np 5 1x5 -t 8: 3 hrs 49 min
-np 8 1x8 -t 5: 2 hrs 50 min
[/CODE]

The 1x5 time is not surprising as one of the processes is split across sockets. Of the others that split evenly, more processes with fewer threads each appear to be a bit better.
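For what it's worth, when experimenting with splits like these it can help to make the socket placement explicit to the launcher. A sketch for the even 1x4 case, assuming OpenMPI (other MPI implementations use different binding options):

[CODE]mpirun -np 4 --map-by socket --bind-to socket msieve -nc2 1,4 -v -t 10[/CODE]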

Xyzzy 2020-08-30 09:03

3 Attachment(s)
Here are binaries for 64-bit Linux with various "VBITS" flags set.

:mike:

Xyzzy 2020-10-30 10:53

1 Attachment(s)
[C]CPU = i5-10600K
RAM = 2×8GB DDR4-3200
CMD = ./msieve -v -nc -t 6
LA = 21988s[/C]

:mike:

Xyzzy 2020-10-30 11:08

Given that the 1920X and 3950X are pretty serious CPUs, does the result for the i5 seem abnormally fast?

[C]CPU = 1920X
RAM = 4×16GB DDR4-2666
CMD = ./msieve -v -nc -t 24
LA = 7h 58m 53s

CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s

CPU = i5-10600K
RAM = 2×8GB DDR4-3200
CMD = ./msieve -v -nc -t 6
LA = 6h 06m 28s[/C]

Gimarel 2020-10-30 12:38

[QUOTE=Xyzzy;561535]
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s
[/C][/QUOTE]

My timings for an AMD Ryzen 9 3950X, 2x32GB DDR4-3600:

-nc1: ~0h 43m 18s
-nc2: ~0h 5m 15s until the multithreaded LA starts


Timings for the multithreaded part:

-nc2: estimated 3h 24m msieve compiled with gcc-9.3
-nc2: estimated 3h 25m msieve compiled with gcc-10.0
-nc2: estimated 3h 22m msieve compiled with clang-9
-nc2: estimated 3h 24m msieve compiled with clang-10

Fastest total without -nc3: ~4h 21m

All runs with VBITS=256 and 32 threads. All other versions were slower.
I tried the objects for each compiler twice, to ensure that the clang-9 one is indeed the fastest.

Xyzzy 2020-11-17 14:47

1 Attachment(s)
[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 14805s[/C]

:mike:

Xyzzy 2020-11-17 14:56

[C]CPU = 1920X
RAM = 4×16GB DDR4-2666
CMD = ./msieve -v -nc -t 24
LA = 7h 58m 53s[/C]

[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C]

[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 4h 6m 45s[/C]

We have used the same binary and the same setup/method for every benchmark we have posted.

This 5600X result just doesn't seem right unless we had the 1920X and 3950X set up wrong or something.

:confused2:

axn 2020-11-17 15:19

[QUOTE=Xyzzy;563491]
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C]

[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 4h 6m 45s[/C]

We have used the same binary and the same setup/method for every benchmark we have posted.

This 5600X result just doesn't seem right unless we had the 1920X and 3950X set up wrong or something.

:confused2:[/QUOTE]

Is it possible that the 8GB modules in the 3950X are single rank while the 16GB modules in the 5600X are dual rank? Try swapping the RAM between the systems.

Xyzzy 2020-11-17 17:13

[QUOTE=axn;563497]Is it possible that the 8GB modules in the 3950X are single rank while the 16GB modules in the 5600X are dual rank? Try swapping the RAM between the systems.[/QUOTE]They were [URL="https://www.mersenneforum.org/showpost.php?p=553372&postcount=30"]single rank[/URL] sticks.

We don't have the 3950X anymore so we can't retest it.

:sad:

Xyzzy 2021-04-13 22:36

[C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 18
LA = 16343s[/C]

[C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 14709s[/C]

:mike:

frmky 2021-05-09 08:57

Each node has a Fujitsu A64FX 64-bit ARM processor with 48 cores and 32 GB HBM memory divided into 4 NUMA regions.

VBITS = 128
1 node 3h 30m
2 nodes 1h 58m
4 nodes 1h 10m
8 nodes 0h 41m

VBITS makes a big difference for this processor
1 node
VBITS = 64 4h 5m
VBITS = 128 3h 30m
VBITS = 256 5h 40m

Two notes about compiling: The cache size must be set in the source since msieve doesn't detect it for ARM processors and the default is quite small. And removing the manual loop unrolling in the files in common/lanczos/cpu/ gives a small but consistent 1.5-2% improvement on this processor.

Xyzzy 2021-05-09 19:09

[QUOTE=Xyzzy;575862][C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 14709s[/C][/QUOTE][C]CPU = 10980XE (999W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 12758s[/C]

:mike:

Xyzzy 2021-05-09 19:12

[QUOTE=Xyzzy;578098][C]LA = 12758s[/C][/QUOTE]3h32m38s is a new record, for us!

But look at this weird message:

[C]Msieve v. 1.54 (SVN 1030)
...
commencing linear algebra
...
commencing Lanczos iteration (32 threads)
...[/C]

We specified 36 threads but msieve only used 32 threads.

:help:

frmky 2021-05-09 20:16

[QUOTE=Xyzzy;578099]We specified 36 threads but msieve only used 32 threads.[/QUOTE]
common/lanczos/cpu/lanczos_cpu.h:#define MAX_THREADS 32
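So a request for 36 threads is silently capped at 32. Presumably the cap could be raised and msieve rebuilt, along these lines (untested sketch; whether more than 32 threads actually helps here is a separate question):

[CODE]/* common/lanczos/cpu/lanczos_cpu.h */
#define MAX_THREADS 64   /* raised from 32; requires recompiling msieve */[/CODE]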

frmky 2021-05-09 22:24

One node with 2 x Cavium ThunderX2 CN9980 32-core 64-bit ARM cpus and DDR4 memory.

VBITS = 64 2h 57m
VBITS = 128 2h 1m
VBITS = 256 2h 2m

frmky 2021-05-10 04:38

nVidia Tesla V100 with now old CUDA code that only supports 64-bit vectors
[STRIKE]1h 26m[/STRIKE]
53 minutes after many code changes.

jasonp 2021-05-12 12:06

[QUOTE=frmky;578114]One node with 2 x Cavium ThunderX2 CN9980 32-core 64-bit ARM cpus and DDR4 memory.

VBITS = 64 2h 57m
VBITS = 128 2h 1m
VBITS = 256 2h 2m[/QUOTE]
[batman]Where does he get those wonderful toys??[/batman]

How difficult was the porting effort needed to run on ARM?

frmky 2021-05-12 18:47

[QUOTE=jasonp;578254]How difficult was the porting effort needed to run on ARM?[/QUOTE]
Set the cache size in the source, optionally remove the loop unrolling, set the optimization flags for the machine in the Makefile (really, just -Ofast -mcpu=native is usually fine), and compile. In the end I just used OpenMPI and GCC 10. I also tried the Arm and Cray compilers, but GCC 10 was just as fast.
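As a concrete illustration of the flags part (the variable that carries the optimization flags in your copy of the Makefile may be named differently, so treat this as a sketch):

[CODE]# in msieve's Makefile, for a native ARM build
CFLAGS += -Ofast -mcpu=native[/CODE]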

This exercise shattered my presumption that ARM cpus were efficient but slow.

drkirkby 2021-05-12 19:03

[QUOTE=VBCurtis;537807] This also helps others see if perhaps their msieve copy isn't as fast as it could be (e.g. compiling it oneself can prove *much* faster if the binary one finds online isn't compiled for the same architecture).[/QUOTE]


That's a bit of a problem with open-source benchmarks. The performance depends on the compiler as well as the computer. One really needs to compare hardware using the same binary. It would be worth having the source code report the compiler version used; I believe there are some pre-defined values in GCC that indicate the compiler version. I guess the benchmark could also check its own md5 checksum and report that when it runs. Then at least one would know if the exact same binary is being reported each time.
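GCC and Clang do predefine such values; here is a minimal sketch of the kind of self-report meant above (the md5-of-self part is omitted, since it depends on how the binary locates its own file):

[CODE]#include <stdio.h>

int main(void) {
    /* __VERSION__ describes the compiler that built this binary;
       __GNUC__ etc. give the GCC-compatible version numbers. */
#ifdef __VERSION__
    printf("compiler: %s\n", __VERSION__);
#endif
#ifdef __GNUC__
    printf("gcc-compatible version: %d.%d.%d\n",
           __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#endif
    return 0;
}[/CODE]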

VBCurtis 2021-05-12 21:32

Well, no- in this case, that's the *benefit*, not the problem. If my compiler is vastly faster than yours, these benchmarks can show you that maybe you could try compiling yourself / with my compiler to get more speed.

We aren't benchmarking to compare hardware nearly as much as we're trying to share info on how to make msieve run faster.

We'd much rather compare various compilations of msieve than have one standardized binary that might not be fastest just for the sake of comparing hardware.

That said, there's a place for directly comparing hardware without software variations, as you suggest; but in the context of this thread it's a secondary priority. There are just too many instructions available on some chips but not others; if we used a binary that runs on v2-era DDR3 Xeons, it would leave modern CPUs with more advanced instruction sets crippled compared to their potential speed. That's not a helpful comparison.

frmky 2021-07-30 21:56

[QUOTE=frmky;578128]nVidia Tesla V100 with now old CUDA code that only supports 64-bit vectors
[STRIKE]1h 26m[/STRIKE]
53 minutes after many code changes.[/QUOTE]
After reimplementing the CUDA SpMV with CUB, the Tesla V100 now takes 36 minutes.

Xyzzy 2021-08-02 15:20

1 Attachment(s)
[C]GPU = Quadro RTX 8000
LA = 3331s[/C]

Now that we can run msieve on a GPU we have no reason to ever run it on a CPU again. This result is 3.8× faster (!) than the best CPU time we ever recorded!

:ouch:

Thanks to frmky for the instructions to get it working. We had to do a few extra steps but if we were able to figure it out anybody can![CODE]+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 8000 Off | 00000000:17:00.0 On | 0 |
| 0% 54C P2 260W / 260W | 4888MiB / 45550MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2257 G /usr/libexec/Xorg 166MiB |
| 0 N/A N/A 2632 G /usr/bin/gnome-shell 73MiB |
| 0 N/A N/A 3313 G /usr/lib64/firefox/firefox 3MiB |
| 0 N/A N/A 19361 G /usr/lib64/firefox/firefox 53MiB |
| 0 N/A N/A 73091 G /usr/lib64/firefox/firefox 53MiB |
| 0 N/A N/A 73137 G /usr/lib64/firefox/firefox 3MiB |
| 0 N/A N/A 215174 C ./msieve 4529MiB |
+-----------------------------------------------------------------------------+[/CODE]

:mike:

ewmayer 2021-08-07 00:03

[QUOTE=frmky;578114]One node with 2 x Cavium ThunderX2 CN9980 32-core 64-bit ARM cpus and DDR4 memory.

VBITS = 64 2h 57m
VBITS = 128 2h 1m
VBITS = 256 2h 2m[/QUOTE]

By VBITS do you mean #threads running on the CPU? (The 9980 is like the Intel KNL, in supporting up to 4 threads per core).

Just curious, how much might such a beast cost? (Need not be new.) The [url=https://www.ebay.com/itm/184212692979]bare CPUs[/url] can be had for reasonably low cost, but what kind of motherboard are we talking? Looks like it uses standard server-type DDR4 RAM, at least.

frmky 2021-08-07 02:20

No, VBITS is the number of bits in each vector entry used in the block Lanczos iteration. It's adjustable at compile time to be 64, 128, or 256 (or 512 in my hosted version). In the code, it's implemented as a struct of 1, 2, or 4 (or 8) uint64's. Using "wider" vectors does more work per iteration but requires fewer iterations. x86_64 cpus typically work well with either VBITS of 64 or 128 with little difference in runtime. ARM and nVidia GPU architectures typically work better with 128 or 256.
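To make the struct description concrete, here is a toy sketch of what a VBITS=128 vector entry and the per-word XOR work look like (illustrative only; these are not msieve's actual type or function names):

[CODE]#include <stdint.h>

/* One vector entry for a VBITS=128 build: two packed 64-bit words. */
typedef struct {
    uint64_t w[2];
} v128_t;

/* XOR-accumulate one entry, the elementary operation of the GF(2)
   matrix-vector products inside block Lanczos.  Wider entries
   (4 or 8 words) do more work per iteration, so fewer iterations
   are needed overall. */
static inline void v128_xor(v128_t *dst, const v128_t *src)
{
    dst->w[0] ^= src->w[0];
    dst->w[1] ^= src->w[1];
}[/CODE]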

The Cavium systems are usually enterprise servers and are priced as such. I'm not aware of anyone using them in workstations or consumer-grade hardware. I'm not sure how much they might go for in the used market.

ewmayer 2021-08-08 00:05

[QUOTE=frmky;585034]The Cavium systems are usually enterprise servers and are priced as such. I'm not aware of anyone using them in workstations or consumer-grade hardware. I'm not sure how much they might go for in the used market.[/QUOTE]

Quickly found this from early 2018:

[url=https://www.zdnet.com/article/caviums-thunderx2-processor-powers-first-64-bit-armv8-workstation/]Cavium's ThunderX2 processor powers first 64-bit ARMv8 workstation[/url] | ZDNet

Another from the same timeframe on Anandtech:

[url]https://www.anandtech.com/show/12571/gigabyte-thunderxstation-cavium-thunderx2-socs[/url]

Nothing about pricing in either. Anandtech says "GIGABYTE does not publish pricing of the ThunderXStation, but we have reached out to PhoenicsElectronics and will update the story once we get more information on the matter", but the article shows no such update. Still, it's cool to know such gear is a thing. ~4 years ago when they were new, KNL workstations sold new for $5000; now you can get them cheaply in the used/refurb market.

