Thread: Intel Xeon PHI?
2020-11-29, 23:22   #133
ewmayer
2ω=0
 
 
Sep 2002
República de California

10110101110000₂ Posts

Quote:
Originally Posted by ewmayer
So if the 53.0C for the KNL is to be believed - and the fact that a similar run using 'only' 32 cores gives a cooler 44.0C indicates so - that water cooling is working very well indeed.
Spoke too soon - I neglected to mention that that temperature was taken with the case side panel on the CPU side of the mobo removed. I put the panel back on last night and the temp quickly rose by over 10C, into the 65-70C range. When I rechecked just now I saw it at 70C, but with an added ALARM (CRIT) at the end of the sensors output line - not sure precisely what temp triggers that, because it was already at 70C last night without said alarm message; it probably rose a few degrees higher at some point in the last 15 hours and tripped the alarm. It seems to be a "once tripped, the alarm message persists" deal, because after I took the side panel back off the temp quickly dropped back to ~60C, but the message still shows. I checked the manpage to see if the 'sensors' command has a 'clear alarm' option, but found none.
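In the meantime, to pin down what temp actually trips the CRIT alarm, something like this little poll-and-log loop would do. A minimal Python sketch, assuming lm-sensors is installed; the 65C threshold is an illustrative guess (not the documented CRIT limit), and the regex assumes sensors prints readings in the usual '+NN.N°C' form:
Code:
# Poll `sensors` every 10s and log whenever any reported temp crosses a
# threshold - the 65.0 below is an illustrative guess, not the real CRIT limit.
import re, subprocess, time

THRESHOLD_C = 65.0

while True:
    out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
    # grab every '+NN.N°C' reading from the sensors output
    temps = [float(t) for t in re.findall(r"\+(\d+\.\d+)°C", out)]
    if temps and max(temps) >= THRESHOLD_C:
        print(time.strftime("%F %T"), "max temp:", max(temps), "C")
    time.sleep(10)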

Will look into replacing the side panel in question with a fine-perforated metal-mesh one, similar to the one on top of the case covering the 2 water-cooler vent fans.

Here are some Mlucas avx-512 build timings at 64M-FFT - more below on why that large FFT length is of special interest ATM - on the KNL, all at the same FFT length, 1-thread-per-core (I found no benefit from any combination of hyperthreading I tried), #threads from 1-64. Parallel scaling is good through 16 threads but falls off a cliff beyond that:
Code:
64M FFT, 1-thread-per-core, #threads from 1-64:              #thread:	|| scaling (vs 1-thr):
     65536  msec/iter = 1765.36  radices =  16 16 16 16 16 32	 1	1.00
     65536  msec/iter =  943.43  radices =  16 16 16 16 16 32	 2	.936
     65536  msec/iter =  496.24  radices =  16 16 16 16 16 32	 4	.889
     65536  msec/iter =  259.18  radices =  16 16 16 16 16 32	 8	.851
     65536  msec/iter =  125.93  radices =  16 16 16 16 16 32	16	.876
     65536  msec/iter =   85.70  radices = 256 16 16 16 32  	32	.644
     65536  msec/iter =   69.06  radices = 256 16 16 16 32  	64	.399
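For reference, the || scaling column is just t1/(N*tN); a quick Python sketch (numbers copied from the table above) to recompute it:
Code:
# Parallel scaling vs. 1 thread: scaling(N) = t_1 / (N * t_N),
# with t_N = msec/iter at N threads, from the 64M-FFT table above.
timings = {1: 1765.36, 2: 943.43, 4: 496.24, 8: 259.18,
           16: 125.93, 32: 85.70, 64: 69.06}
t1 = timings[1]
for n, tn in timings.items():
    print(f"{n:2d} threads: {t1 / (n * tn):.3f}")
# matches the || scaling column: 1.000, .936, .889, .851, .876, .644, .399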
The actual runtimes for a production run, once things settle down after a few minutes, are 5-10% faster - I'm getting ~64 ms/iter at 64 threads for the 64M-FFT run described below. Here are results - just representative examples, I did many more experiments - of several supplemental timing tests, illustrating the ineffectiveness of hyperthreading and the total-throughput boost from running multiple jobs, each using 16 or 32 threads on nonoverlapping sets of cores (the throughput ratios are recomputed in the sketch after this list):

[A] Hyperthreading: Physical-cores 0-15, 2-threads-per-core, radices 256,16,16,16,32: 131.97 ms/iter, slower than 16-thr/1-per-core;

[B] 2 side-by-side runs, each using 16-thr: Each nets 136 ms/iter, 1.85x total throughput of one 16-thr job, 1.26x total throughput of one 32-thr job;

[C] 4 side-by-side runs, each using 16-thr: Each nets 170 ms/iter, 2.96x total throughput of one 16-thr job, 1.62x total throughput of one 64-thr job.
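For the curious, here is how those ratios fall out - total throughput is (number of jobs)/(ms/iter per job), normalized by the single-job figures from the table above:
Code:
# Total-throughput ratios for the side-by-side runs, in iter/msec terms.
# Single-job baselines from the scaling table:
#   125.93 ms/iter (16-thr), 85.70 (32-thr), 69.06 (64-thr).
print(2 * 125.93 / 136)   # [B] vs one 16-thread job: ~1.85x
print(2 * 85.70 / 136)    # [B] vs one 32-thread job: ~1.26x
print(4 * 125.93 / 170)   # [C] vs one 16-thread job: ~2.96x
print(4 * 69.06 / 170)    # [C] vs one 64-thread job: ~1.63x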

First task I've set the KNL on is to complete the 64M-FFT one of the pair of primality tests of F30 I started several years ago. I did ~2/3 of the needed 2^30-1 = 1073741823 iterations of said test on a pair of machines: one at 60M-FFT on a 32-core AVX2 Xeon server, the other at 64M on the GIMPS KNL. Both machines were physically hosted by David Stanfill, who went AWOL early this year. Ryan Propper was kind enough to pick up the 60M run and complete it on a manycore virtual machine he had access to, but the 64M one remained in need of completion. I picked that up at iteration 730M last night; based on timings so far, the ETA for completion is a little over 8 months. Again, per the above table, this is getting less than half the total throughput the CPU is capable of.
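That ETA is simple arithmetic on the remaining iteration count at the ~64 ms/iter production rate (the 30.44 days/month is just an average-month figure I plugged in):
Code:
# Remaining F30 iterations from the 730M restart point, at ~64 ms/iter:
remaining = 2**30 - 1 - 730_000_000       # ~343.7M iterations to go
seconds = remaining * 0.064
print(seconds / 86400)                    # ~255 days
print(seconds / 86400 / 30.44)            # ~8.4 months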

The above multiple-job results [B] and [C] indicate that for much better total throughput, as soon as the in-development Mlucas v20 has a working p-1 Stage 1 with restart-from-savefile capability, I should switch the above F30-test completion to 32-threaded and fire up a second 32-threaded job, in the form of a deep p-1 Stage 1 on F33. By deep I mean something on the order of a year's runtime. At that point - assuming none of the occasional GCDs during Stage 1 turns up a factor, which is the expected result given the TF depth to date - the Stage 1 residue can be distributed to volunteers in possession of bigmem systems - any kind of halfway-fast Stage 2 will need at least 128GB of RAM - to run various Stage 2 subintervals in hopes that one finds a factor. Said Stage 1 will of course slow the finishing-off of the F30@64M job, but we already know that number to be composite via a small TF-found factor; the primality test is to generate a residue for cofactor PRP-checking.
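For readers unfamiliar with p-1: Stage 1 computes x = base^E (mod N) with E the product of all prime powers up to the bound B1, and any factor p of N with B1-smooth p-1 then shows up in gcd(x-1, N). Here is a toy Python sketch of that idea - illustrative only, nothing like Mlucas's FFT-based implementation, and the function names, the gcd interval, and the tiny demo bound are all mine:
Code:
from math import gcd

def primes_upto(n):
    """Minimal Eratosthenes sieve - fine for toy bounds."""
    s = bytearray([1]) * (n + 1)
    s[:2] = b'\x00\x00'
    for i in range(2, int(n ** 0.5) + 1):
        if s[i]:
            s[i*i::i] = bytearray(len(range(i*i, n + 1, i)))
    return [i for i in range(2, n + 1) if s[i]]

def pm1_stage1(N, B1, base=3, gcd_every=100):
    """Toy p-1 Stage 1: accumulate x = base^E mod N, with E the product of
    all prime powers q^k <= B1, taking an occasional gcd(x - 1, N) so a
    factor can surface before the stage completes."""
    x = base
    for idx, q in enumerate(primes_upto(B1), 1):
        qk = q
        while qk * q <= B1:    # largest power of q not exceeding B1
            qk *= q
        x = pow(x, qk, N)
        if idx % gcd_every == 0:
            g = gcd(x - 1, N)
            if 1 < g < N:
                return g       # factor found mid-stage
    g = gcd(x - 1, N)
    return g if 1 < g < N else None

# Demo on F5 = 2^32 + 1: its factor 641 has 641 - 1 = 2^7 * 5, which is
# 128-smooth, so Stage 1 with B1 = 128 should recover it.
print(pm1_stage1(2**32 + 1, 128))   # expect 641
The real thing is the same idea at gargantuan scale: the modular powering is done via the same FFT-based multiply as the primality test, with B1 in the billions rather than 128.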