mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Skylake X - random read penalty (https://www.mersenneforum.org/showthread.php?t=23482)

Prime95 2018-07-01 23:04

Skylake X - random read penalty
 
Background: A typical in-place two-pass FFT must do large stride reads and writes in one of its passes. My current code is much slower in the pass that does large strides. Prefetching is unable to hide this penalty.

I decided to run some benchmarks somewhat simulating a 9600K FFT (78MB). I timed 5 iterations of (1) in-place sequential read-write, (2) in-place large-stride sequential read-write, (3) sequential read, large-stride write to different memory area. Prime95 does (1) and (2) for its two passes, but could be rewritten to do two passes of (3). I also did two cases of (3). (3a) uses vmovapd to store and (3b) uses vmovntpd to store.
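The three cases above can be sketched as cache-line address traces. This is a hedged illustration, not George's actual benchmark: sizes are toy values, far smaller than the real 78MB working set, and the `burst`/`hop_lines` parameters are stand-ins.

```python
# Hedged sketch of the three access patterns, as cache-line address traces.
LINE = 64  # bytes per cache line

def seq_rw(n_lines):
    """(1) in-place sequential read-write: touch lines in address order."""
    return [i * LINE for i in range(n_lines)]

def strided_rw(n_lines, burst=4, hop_lines=1200):
    """(2) in-place large-stride read-write: touch `burst` consecutive
    lines, hop `hop_lines` lines forward, wrap until every line is seen."""
    order = []
    for col in range(0, hop_lines, burst):           # starting offset
        for row in range(col, n_lines, hop_lines):   # the large hops
            order += [(row + k) * LINE for k in range(burst)]
    return order

def seq_read_strided_write(n_lines, burst=4, hop_lines=1200):
    """(3) sequential reads from one area paired with strided writes to
    another: (read_addr, write_addr) pairs, one per line."""
    return list(zip(seq_rw(n_lines), strided_rw(n_lines, burst, hop_lines)))
```

Note that (1) and (2) touch exactly the same set of lines; only the order differs, which is where the clock-count gap in the data below comes from.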

The data:
(1) 85M clocks, (2) 133M clocks, (3a) 201M clocks, (3b) 130M clocks.

I then added 50 clocks of "work" after every 4 cache lines read in:
(1) 142M clocks, (2) 341M clocks, (3a) 196M clocks, (3b) 139M clocks.

I then added prefetching after reading the current data and before doing the 50 clocks of work:
(1) 122M clocks, (2) 227M clocks, (3a) 194M clocks, (3b) 134M clocks.

There are some conclusions we can reach and some rather inexplicable results.
Conclusion 1: Large-stride reads are very costly -- about 50M extra clocks if you do no work, and about 100M if you do work and use prefetching.
Conclusion 2: If you are doing large-stride writes to a fresh memory block, use vmovntpd.
Conclusion 3: Sequential prefetching may or may not be beneficial. (1) went from 142M to 122M, but (3b) only went from 139M to 134M. Perhaps the hardware prefetcher is doing just as good a job as software prefetching.

Inexplicable result: how did case (2) go from 133M to 341M clocks with the addition of only 95M clocks of work? (I did time the work by itself, with no memory activity.)

This data may be of use to Ernst (may prompt changing the Mlucas memory layout) and Mysticial (probably confirms what he already knows -- no code changes needed). Assembly code is available for inspection if you want to benchmark on different hardware.

Next up for me is to implement large-stride writes to a scratch area in prime95 to see if it makes a substantial difference in either the single-core or all-core case. This is no minor undertaking.

Mysticial 2018-07-02 04:34

Have you tried: in-place large-stride sequential read-write using NT-stores? IOW, you NT-store back to the same memory location you loaded from.

What does your "large stride sequential" look like? What's the block size before you "hop" to the next block? I've been seeing effects that seem to hint at cache-associativity issues, but I haven't been able (nor have I tried really hard) to test it. Or perhaps this is moot since you already insert enough padding to avoid it?

In your case 2, does your "work" have any memory accesses that go beyond L1?


[QUOTE]This data may be of use to Ernst (may prompt changing Mlucas memory layout) and Mysticial (probably confirms what he already knows -- no code changes needed). [/QUOTE] I actually did not know about this. But I'm not surprised.

My own FFT/NTT code uses (2) both in memory and on disk. But it can't easily be changed, since it needs to work with any # of passes over memory/disk and needs to be done in place to keep the memory/disk consumption down (since we're talking gigabytes/terabytes here).

Prime95 2018-07-02 14:58

[QUOTE=Mysticial;490991]Have you tried: in-place large-stride sequential read-write using NT-stores? IOW, you NT-store back to the same memory location you loaded from.[/quote]

I have tried this in the past, but not in this particular benchmark. I'll perform the benchmark today.

[quote]What does your "large stride sequential" look like? What's the block size before you "hop" to the next block?[/quote]

The test read 4 cache lines then "hopped" 7680*8+29*64 bytes. I can test some different paddings.
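A back-of-envelope check on that hop distance, under the assumption of a Skylake-style 32KB, 8-way L1D with 64-byte lines (64 sets; these parameters are mine, not from the post):

```python
# Does the benchmark's hop distance spread bursts across L1 sets?
HOP = 7680 * 8 + 29 * 64   # = 63,296 bytes, the hop quoted above

def l1_set(addr, line=64, sets=64):
    """L1 set index of a byte address (assumed 64 sets of 64B lines)."""
    return (addr // line) % sets

# The hop is deliberately not a multiple of 4KB...
assert HOP % 4096 != 0
# ...and 63,296B = 989 lines, with 989 mod 64 = 29, which is coprime
# to 64 -- so consecutive hops walk through distinct L1 sets instead
# of piling onto one set.
sets_hit = [l1_set(i * HOP) for i in range(16)]
```

Under these assumptions, 16 consecutive hop targets land in 16 different sets, whereas a hop that were a multiple of 4KB would map every target to the same set.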

[quote]I've been seeing effects that seem to hint at cache associativity issues. But I haven't been able to (nor have I tried really hard) to test it. Or perhaps this is moot since you already insert enough padding to avoid it?[/quote]

I've seen strange results from using different large-stride padding in all Intel chips, but I've not come up with a useful theory to explain anything.

My code does avoid 4KB distances as much as possible as that's really bad for the L1 cache.

[quote] In your case 2, does your "work" have any memory accesses that go beyond L1?[/quote]

"Work" was simply 50 "vaddpd zmm0, zmm0, zmm0".

Hmmm, that's 200 clocks of work -- I'll go fix that.

[quote]My own FFT/NTT code uses (2) both in memory and on disk. But it's not easily be changed since they need to work with any # of passes over memory/disk and they need to be done in place to keep the memory/disk consumption down. (since we're talking gigabytes/terabytes here)[/QUOTE]

It could be partially done, but I sympathize that it is not easy. I'm sure once you load a chunk into memory you do as much FFT as possible which must involve two (or more) passes for optimal L2/L3 cache usage.

The first in-memory pass does large-stride loads but writes data to a scratch area so that the next pass can do sequential reads. The next pass benefits from sequential reads from the scratch area and writes back to your original memory area. Is it worthwhile to reduce your usable memory to allocate a scratch area? Does the benefit of sequential reads disappear when all cores are working? I don't know.
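The scratch-area scheme above can be sketched with a plain matrix view of the array (a hedged stand-in for the real FFT passes; `rows`/`cols` model N and the large stride):

```python
# Hedged sketch: pass 1 does large-stride (column-order) loads and
# sequential writes to a scratch area; pass 2 reads the scratch area
# sequentially and writes back to the original memory area.
def two_pass_with_scratch(data, rows, cols):
    # Pass 1: strided loads from `data`, sequential writes into scratch.
    scratch = [data[r * cols + c] for c in range(cols) for r in range(rows)]
    # (the FFT work on each chunk would happen here)
    # Pass 2: sequential reads from scratch, writes back to the
    # original array, undoing the reordering.
    out = data[:]
    i = 0
    for c in range(cols):
        for r in range(rows):
            out[r * cols + c] = scratch[i]
            i += 1
    return out
```

The round trip leaves the data unchanged; the point is only that pass 2's reads become sequential, at the price of a scratch buffer and strided writes.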

Mysticial 2018-07-02 16:51

[QUOTE=Prime95;491008]The test read 4 cache lines then "hopped" 7680*8+29*64 bytes. I can test some different paddings.
[/QUOTE]

That seems really small. The block sizes that I deal with are on the order of 1 - 4k - enough to flush the entire OOE window. So there are never multiple blocks in the OOE window, nor in much of the L1 cache (including the strided memory accesses), except at the boundaries when switching from one block to the next.

Admittedly, using larger blocks will put pressure on the reduction sizes that can fit into the caches, as it's usually not possible to 2-pass a 1-billion-bit convolution length. But the block sizes can't be reduced much further anyway, as they'll start splitting cache lines due to alignment issues.

I've been thinking of experimenting with special-case algorithms that can do super-small block sizes like what you're doing now with only 4 cache lines. But I think you might've scared me off from that.

Though my priorities have been shifting elsewhere anyway, since optimizing the FFT/NTT at this point is beating a dead horse in the face of insufficient memory bandwidth.


[QUOTE]I've seen strange results from using different large-stride padding in all Intel chips, but I've not come up with a useful theory to explain anything.

My code does avoid 4KB distances as much as possible as that's really bad for the L1 cache.

It could be partially done, but I sympathize that it is not easy. I'm sure once you load a chunk into memory you do as much FFT as possible which must involve two (or more) passes for optimal L2/L3 cache usage.

The first in-memory pass does large-stride loads but writes data to a scratch area so that the next pass can do sequential reads. The next pass benefits from sequential reads from the scratch area and writes back to your original memory area. Is it worthwhile to reduce your usable memory to allocate a scratch area? Does the benefit of sequential reads disappear when all cores are working? I don't know.[/QUOTE]

For both legacy and technical reasons, I don't pad out the "master copy" of the FFT/NTT data that sits in memory. Linear padding schemes get defeated by even larger strides. And non-linear padding schemes will complicate multi-level/multi-pass logic as well as disk logic. I tried it years ago and decided it wasn't worth the mess.

So yes, I do read blocks from the master copy into a small buffer that fits comfortably into the caches. Once the work is done, it's NT-stored directly back to the master copy in memory.

Once the data gets into the scratch buffer, it's (mostly) all good. But even though the work done in the scratch buffer is several times more than the work done at the boundaries, all the time (according to the profiler) is spent at the boundaries stalling on the memory accesses to the master copy.

The store side is bandwidth bound. The read side is latency bound. Prefetching doesn't fix everything as we've discussed in the other thread. The read side is probably associativity bound as well - prefetching doesn't help with associativity since now you have additional overlapping streams with critical stride to deal with.

I gave up trying to get a good overlap between the boundary memory stalls and the scratch-buffer work in the middle. So I rely almost entirely on HT.

Prime95 2018-07-02 16:53

Corrected numbers. The large-stride hop was corrected to 7680*16+29*64 bytes, the 50 clocks of work changed to 50 "vandpd zmm0, zmm0, zmm0" instructions, and I've added vmovntpd timings in all cases.

(1) In-place seq. read/write, (2) in-place strided read/write, (3) seq. read, strided write
(a) vmovapd (b) vmovntpd

Data only
(1a) 89M (1b) 155M (2a) 149M (2b) 246M (3a) 211M (3b) 131M

50 clocks of work
(1a) 104M (1b) 141M (2a) 252M (2b) 271M (3a) 197M (3b) 124M

50 clocks of work, prefetching
(1a) 95M (1b) 142M (2a) 184M (2b) 263M (3a) 197M (3b) 128M


vmovntpd is harmful for in-place FFTs, helpful for moving data to a new memory location.
Strided reads are costly.

Prime95 now does one pass of (1a) and one of (2a) -- 279M clocks in the benchmark. Prime95 could instead do two passes of (3b) -- 256M clocks in the benchmark.

The 50 clocks of work every 4 cache lines translates to a total of 77M clocks of work. Next up I'll double the work to see which method best hides the memory latency.

Prime95 2018-07-02 17:35

Nuts, "vandpd zmm0, zmm0, zmm0" is optimized away by the chip. I replaced it with "vandpd zmm0, zmm0, zmm1".

Corrected data (it did not change much) + new data for 100 clocks of work every 4 cache lines (CPU bound instead of memory bound).

50 clocks of work
(1a) 107M (1b) 146M (2a) 267M (2b) 282M (3a) 197M (3b) 131M

50 clocks of work, prefetching
(1a) 96M (1b) 143M (2a) 191M (2b) 265M (3a) 199M (3b) 129M

100 clocks of work, prefetching
(1a) 151M (2a) 282M (3b) 150M


Ugh. This confirms my earlier observations that large-stride prefetching is not very effective at hiding memory latencies.

ewmayer 2018-07-03 03:19

Thanks for the data, George - just for curiosity's sake you might consider using something like the following for your work-simulation code:

[CODE]vaddpd zmm0, zmm0, zmm0
vaddpd zmm1, zmm1, zmm1
vaddpd zmm2, zmm2, zmm2
...[/CODE]

with as many independent-register ADDs as are needed to hide the latency ... based on Agner's table this would be 8 ADDs for Skylake-X. (KNL can do the same 2 per clock but has 6-cycle latency here.)
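The count falls out of a standard rule of thumb (hedged; latency/throughput numbers are from Agner Fog's instruction tables, as cited above): you need roughly latency × throughput independent dependency chains in flight to keep the FP unit saturated.

```python
# Independent dependency chains needed to hide FP-add latency.
def chains_needed(latency_cycles, ops_per_clock):
    return latency_cycles * ops_per_clock

skx_vaddpd = chains_needed(4, 2)  # Skylake-X: 4-cycle latency, 2/clock -> 8
knl_vaddpd = chains_needed(6, 2)  # KNL: 6-cycle latency, 2/clock -> 12
```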

My code does the large-stride phase in the fused final-iFFT-radix-pass / carry-step / initial-fFFT-radix-pass as follows, for an N-double array and R being the radix in question. (I have multiple combinations of radices that get tried during the post-build self-test step users must run, which writes the best radix combo for each FFT length on the user's hardware to the mlucas.cfg file the code uses for production runs.)

o Loop index i runs over 0 <= i < N/R; each loop pass processes R complex (vector-complex for SIMD builds) data separated by stride N/R;
o Radix-R iDFT on those large-stride inputs writes outputs to a small block of local contiguous memory, upon which the carry step operates;
o Radix-R fDFT on contiguous-mem inputs (the carry-step outputs) writes its outputs back to the same large-stride-separated addresses of the main data array as were the inputs for the opening iDFT.

All data reads & writes in my case use MOVAPD (actually MOVAPS, per Agner's optimization notes, to save an instruction byte). For multiple threads, each gets an equal-sized subportion of the above data processing; e.g. for 2 threads, thread 1's loop index runs over 0 <= i < N/(2*R) and thread 2's over N/(2*R) <= i < N/R, i.e. each thread handles half of the large-strided data, with a small 'cleanup' step to propagate carries across data boundaries.
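The loop structure and thread split described above can be sketched as index arithmetic (a hedged illustration; function names are mine, not Mlucas's):

```python
# N doubles, radix R, main loop index i over [0, N/R): each pass gathers
# R elements at stride N/R into contiguous local memory, works on them,
# and scatters the results back to the same addresses.
def gather_indices(i, N, R):
    """Array indices touched by loop pass i (the large-stride inputs)."""
    return [i + j * (N // R) for j in range(R)]

def thread_ranges(N, R, nthreads):
    """Equal-sized subranges of the [0, N/R) loop, one per thread."""
    per = (N // R) // nthreads
    return [(t * per, (t + 1) * per) for t in range(nthreads)]
```

For example, with N=32 and R=4 the stride is N/R=8, so pass i=3 touches indices 3, 11, 19, 27; two threads split the loop into [0,4) and [4,8), matching the half-each division above.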

So far I have only done an avx-512 build & test on KNL, where I got no discernible benefit (nor performance hit) from prefetching. I only tried prefetches separated by roughly 100 instructions, worked into the carry step.

tServo 2018-07-04 03:33

This might be worth a shot ( maybe )
 
George,
I was wondering if you have tried reaching out to Intel to see if you can talk to some engineers who would be able to answer your questions?
I suggest this based on actual experience I had doing this a while ago.
From 1989 to 1995 I was working for a small software start-up that was using Intel 80386- and 80486-based machines to implement a virtual-memory paging system so we could run lots of images of DOS at once. In a 1-meg machine, we could run about 8 - 16 images. We had some customers who were hot for this.
We were writing our code in protected-mode assembler from a company called Phar Lap. Quite often, we just couldn't get stuff to work, even though the principles of operation said it should. One day our prez walked in and gave us a few contacts at Intel. He explained he had spent the last couple of days calling around. As it turned out, we contacted the very engineers who worked on the internal microcode for their processors. When we explained what we were doing, they were quite enthused because we were the first to try these instructions. Essentially, we were debugging the processors for them. They fixed several bugs by issuing new steppings, and we got the corrected processors first!
We developed a good relationship with these guys that was invaluable and saved us enormous amounts of time and debugging.

However, I realize that was 25 years ago and times change and it might be impossible to speak with anyone like that again. But you never know ......

kriesel 2018-07-04 15:39

[QUOTE=tServo;491103]George,
I was wondering if you have tried reaching out to Intel to see if you can talk to some engineers who would be able to answer your questions?
I suggest this based on actual experience I had doing this a while ago.
From 1989 to 1995 I was working for a small software start-up that was using Intel 80386 and 80486 based machine to implement a virtual memory paging system so we could run lots of images of Dos at once. In a 1 meg machine, we could run about 8 - 16 images. We had some customers who were hot for this.
We were writing our code in protected mode assembler from a company called Phar Lap. Quite often, we just couldn't get stuff to work, even tho the principles of operation said it should. One day our prez walked in and gave us a few contacts at Intel. he explained he had spent the last couple of days calling around. As it turned out, we contacted the very engineers who worked on the internal micro code for their processors. When we explained what we were doing, they were quite enthused because we were the first to try these instructions. Essentially, we were debugging the processors for them. They fixed several bugs by issuing new steppings and we got the corrected processors first !
We developed a good relationship with these guys that was invaluable and saved us enormous amounts of time and debugging.

However, I realize that was 25 years ago and times change and it might be impossible to speak with anyone like that again. But you never know ......[/QUOTE]
It seems worth a shot. In the early 1980s I contacted some guys at NASA who had developed finite-element-analysis software on the CDC 6600 and subsequently ported it to numerous other computers, including the DEC VAX 11/780. They were quite helpful and gracious about fixing a bug or two I found in the VAX VMS port. Similarly, in the late 1980s or so, puzzling over interoperability problems between VAX smtp mail service and PC-based Eudora, I got direct access to the DEC software engineer supporting their port of IUPOP to VMS, and quick turnarounds of bug fixes relating to buffer overflows truncating records transmitted to POP3 clients. If I could get that sort of support from NASA or DEC as a young unknown engineer, George, with some name recognition, could have a good chance at Intel. And if they instead snub him, this very minor stockholder would be miffed. Sharp people at such organizations recognize the value to them of what amounts to a free skilled and motivated tester.
These days there are more channels available for this sort of thing. Back then it was telephone, snail mail, and maybe DECNET email. Now, there's also online developer forums, company web site, independent sites like stack overflow, well crafted internet searches, etc.

Prime95 2018-07-17 02:40

Sigh, I spent a significant amount of time coding up a sequential read, scatter write version of the FFTs. Results are quite disappointing to say the least.

My strategy for pass 1 was to read a chunk of the sequential data from memory and copy it to a small scratch area, do lots of computation while prefetching the next sequential chunk, scatter write the chunk using movntpd to a big scratch area. Pass 2 is similar, sequential read a chunk of the big scratch area to the small scratch area, do lots of computation while prefetching the next sequential chunk, scatter write the chunk to memory using movntpd.

In theory, this should read the input memory, write the big scratch area, read the big scratch area, and write the results back to original memory. The small scratch area should always remain in the L2 cache. This is the same memory bandwidth requirement as the old code.
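A quick accounting of that claim (a sketch under the stated assumption that the small scratch area stays L2-resident), counting main-memory traffic in units of the data size N:

```python
# Main-memory traffic per FFT, in units of the data size N.
def old_scheme_traffic(N):
    # two in-place passes, each reading and writing the whole array
    return {"reads": 2 * N, "writes": 2 * N}

def new_scheme_traffic(N):
    # pass 1: read input (N), scatter-write big scratch (N);
    # pass 2: read big scratch (N), scatter-write result (N)
    return {"reads": 2 * N, "writes": 2 * N}
```

So in theory the two schemes move exactly the same number of bytes, which is what makes the all-core result so surprising.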

The result for a 9600K FFT using one core: 32.2 ms for new code vs. 28.5 ms using old code

The throughput results using all 8 cores are stunning: 98.1 iter/s for the new code vs. 161.3 iter/s for the old.

While I cannot guarantee I have not made any coding errors, it sure appears that the new code is using much more memory bandwidth than it should.

A detailed look at the single core case allows us to gain some more insights:

In this pass, the old code reads and writes a contiguous chunk, no scratch area used. Timer 4 is the section that writes to the small scratch area in the new code. Timers 5/9/13 do the computation and prefetching. Timer 14 is the section that scatter writes to the big scratch area in the new code.
[CODE]Old code New code:
timer 4: 6087660 timer 4: 7789814
timer 5: 5314400 timer 5: 5014060
timer 9: 15397940 timer 9: 15372574
timer 13: 5583210 timer 13: 5502810
timer 14: 6310106 timer 14: 21199092
[/CODE]

In this pass, the old code reads cache lines using a large stride to a small scratch area and writes data back to the same cache lines. It also prefetches using a large stride. Timer 16 reads the data and writes to the small scratch area. Timers 18/20/22/24 compute and prefetch next chunk. Timer 26 writes the results.
[CODE]Old code New code:
timer 16: 12380328 timer 16: 9545178
timer 18: 6552980 timer 18: 4829716
timer 20: 9978084 timer 20: 7264882
timer 22: 8967734 timer 22: 7033084
timer 24: 8373384 timer 24: 5103364
timer 26: 10401596 timer 26: 20117636
[/CODE]

From timers 14 and 26 we see that using vmovntpd for our stores "overwhelms" the store unit. The old code used vmovapd, which lets the chip write the results back to memory "at its leisure". Timer 4 shows there is a minor cost to writing to a small scratch area as opposed to operating in place. Timers 16/18/20/22/24 show the huge penalty the old code pays for the large-stride reads and prefetching.

Mysticial 2018-07-17 03:26

[QUOTE=Prime95;491957]Sigh, I spent a significant amount of time coding up a sequential read, scatter write version of the FFTs. Results are quite disappointing to say the least.[/QUOTE]

At least it's interesting though. So it's all for science.

I couldn't tell from your description, but did you do an apples-to-apples comparison between the old scheme and new scheme with the same settings: yes/no prefetch and yes/no NT-store?

Some shot-in-the-dark guesses:
[LIST][*]Something is getting transferred to/from memory twice. Bad prefetching will do this, though it can be hard to spot/confirm without a profiler.[*]From your description, it sounds like your memory read/write patterns are bursty, since you bunch them together under the same timer. How large are these bursts? Are they large enough to "overwhelm" the load/store buffers?[/LIST]

