mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   How to thread matmul on systems with L3 cache (https://www.mersenneforum.org/showthread.php?t=10381)

fivemack 2008-06-09 16:16

How to thread matmul on systems with L3 cache
 
Barcelona has four 512K L2 caches and a 2M L3 cache; Nehalem will have four 256K L2 caches and an 8M L3 cache; it seems to make sense to treat a Q6600 as having four 2048K L2 caches and its Penryn equivalent as having four 3072K L2 caches.

My present idea is that, for systems without L3 caches, you tile the matrix into blocks whose vector segments fill half the L2 cache, then feed the tiles to threads on a grab-a-tile-when-your-last-one-is-finished basis.

The slight problem is that, on the Q6600, the ideal vector length is 128k entries, so each coordinate pair within a tile wants to be 34 bits long (two 17-bit indices). I was pondering using 256 bits to store seven 36-bit quantities, but unpacking integers from SSE registers takes more time than I had expected, and 36 bits is already too short for one-thread-per-socket on Nehalem. Three 42-bit quantities per 128 bits might be wiser.
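A minimal sketch of the three-42-bit-quantities-per-128-bits idea, using two 64-bit words rather than an SSE register; the field layout and all names here are illustrative choices, not anything from an actual solver:

```c
#include <stdint.h>

/* Three 42-bit fields packed into 128 bits (two uint64_t words):
   field 0 occupies bits 0-41, field 1 bits 42-83 (split across the
   word boundary), field 2 bits 84-125; two bits are left spare. */
typedef struct { uint64_t lo, hi; } packed3_t;

packed3_t pack3(uint64_t a, uint64_t b, uint64_t c)
{
    /* each argument must already fit in 42 bits */
    packed3_t p;
    p.lo = a | (b << 42);          /* a, plus the low 22 bits of b */
    p.hi = (b >> 22) | (c << 20);  /* the high 20 bits of b, then c */
    return p;
}

uint64_t unpack3(packed3_t p, int i)
{
    const uint64_t mask = ((uint64_t)1 << 42) - 1;
    switch (i) {
    case 0:  return p.lo & mask;
    case 1:  return ((p.lo >> 42) | (p.hi << 22)) & mask;
    default: return (p.hi >> 20) & mask;
    }
}
```

The straddling field (case 1) costs two shifts and an OR to extract, which is the kind of unpacking overhead complained about above, but at least it stays in integer registers.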

I suppose I ought to be using prefetchnta on the coordinate stream to keep it from displacing things from the L2 cache - I know CPUs have clever prefetch, but I don't think they can notice that a stream is read-only and refrain from caching it.
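For what it's worth, GCC and Clang expose this portably via `__builtin_prefetch(addr, rw, locality)`, where locality 0 maps to prefetchnta on x86. A sketch of hinting the coordinate stream ahead of use; the function name, the stand-in workload, and the 64-entry prefetch distance are all illustrative assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Walk a read-once coordinate stream, hinting the hardware not to keep
   it resident: __builtin_prefetch's third argument 0 means "no temporal
   locality" (prefetchnta on x86).  The summation is a stand-in for the
   real per-entry matmul work. */
uint64_t sum_coords(const uint32_t *coords, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&coords[i + 64], 0 /* read */, 0 /* NTA */);
        acc += coords[i];
    }
    return acc;
}
```

The right prefetch distance depends on memory latency and per-entry work, so it would need tuning on the actual machine.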

For systems with L3 caches, I suspect the answer is to tile into blocks such that the input and output vector together fit in L3 cache, then break each of those tiles into L2-cache-sized subtiles and feed those to an appropriate number of threads. But that feels as if it'll be spending nearly all its time creating and destroying threads (or on inter-thread communication to start and stop persistent threads).
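The thread-churn worry can be sidestepped with persistent workers pulling tile indices from a shared atomic counter, so per-tile cost is one fetch-and-add rather than a thread create/destroy. A sketch with pthreads and C11 atomics; the tile count, the squaring "work", and all names are placeholders:

```c
#include <pthread.h>
#include <stdatomic.h>

#define NTILES 64   /* illustrative tile count */

typedef struct {
    atomic_int next_tile;   /* index of the next unclaimed tile */
    long partial[NTILES];   /* one result slot per tile */
} pool_t;

/* Each persistent worker grabs tiles until none remain. */
static void *worker(void *arg)
{
    pool_t *p = arg;
    for (;;) {
        int t = atomic_fetch_add(&p->next_tile, 1);
        if (t >= NTILES)
            break;                       /* all tiles claimed */
        p->partial[t] = (long)t * t;     /* stand-in for the sub-tile multiply */
    }
    return NULL;
}

long run_pool(int nthreads)   /* nthreads assumed <= 16 */
{
    pool_t p = { .next_tile = 0 };       /* partial[] zero-filled */
    pthread_t th[16];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&th[i], NULL, worker, &p);
    for (int i = 0; i < nthreads; i++)
        pthread_join(th[i], NULL);
    long total = 0;
    for (int t = 0; t < NTILES; t++)
        total += p.partial[t];
    return total;
}
```

For the L3 scheme, the same counter would just range over the L2-sized subtiles of the current L3-sized tile, with one barrier per L3 tile rather than per subtile.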

I'm ignoring L1 cache entirely here; my reasoning is that the L1 cache might hold a vector of length 2^11, but, for a matrix of, say, 2^24 side length and 2^6 entries per row, i.e. a density of 2^-18 non-zero entries, any given 2^11 x 2^11 tile will hold only about a dozen matrix entries, whilst a 2^17 x 2^17 tile will hold about 2^16.

What of this is nonsense, what have I forgotten to think about?

jasonp 2008-06-09 17:35

[QUOTE=fivemack;135528]
What of this is nonsense, what have I forgotten to think about?[/QUOTE]
This is extremely well-trodden ground. See [url="http://www.cs.utexas.edu/users/flame/libFLAME/ReleaseNotes.html"]here[/url] for a start on modern algorithmic thinking about fast matrix multiplies. I have one of their earlier papers in my archives; they gain a measurable speedup by accounting for TLB misses and by assuming that the bandwidth to L2 cache is large enough that excessive prefetching is unnecessary. Of course, that's for dense matrix multiplies.

Maybe it's time to measure my own work in this area more carefully; I've noticed some matrix-multiply behavior that's a little counterintuitive, like transpose matrix multiplies slowing down when the coordinates are traversed in reverse order.

Another possibility is to rearrange the blocks using a 2-D space-filling curve ordering like a Hilbert order; you can even rearrange the entries in a single block the same way, though I tried to do that early on and it didn't make a difference at all.
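The Hilbert ordering mentioned above is easy to generate with the standard xy-to-distance conversion; sorting block coordinates by this key gives the space-filling traversal. This is the textbook algorithm, not code from anyone's solver:

```c
#include <stdint.h>

/* Reflect/swap step of the standard Hilbert-curve transform. */
static void hilbert_rot(uint32_t n, uint32_t *x, uint32_t *y,
                        uint32_t rx, uint32_t ry)
{
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        uint32_t t = *x; *x = *y; *y = t;   /* swap x and y */
    }
}

/* Map block coordinates (x,y) in an n-by-n grid (n a power of two)
   to their distance along the Hilbert curve. */
uint64_t hilbert_d(uint32_t n, uint32_t x, uint32_t y)
{
    uint64_t d = 0;
    for (uint32_t s = n / 2; s > 0; s /= 2) {
        uint32_t rx = (x & s) > 0;
        uint32_t ry = (y & s) > 0;
        d += (uint64_t)s * s * ((3 * rx) ^ ry);
        hilbert_rot(n, &x, &y, rx, ry);
    }
    return d;
}
```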

In my experience, non-temporal prefetch/load/store is a big win for streams of writes but a big loss for streams of reads. Depending on the machine architecture, you can simulate a stream being read-only by having fewer streams than the cache's associativity, since LRU-like replacement will favor keeping recently written cache lines at the expense of older read-only cache lines.

fivemack 2008-06-09 22:47

[url]http://www.cs.berkeley.edu/~samw/research/talks/cscads07.pdf[/url] and its accompanying paper [url]http://www.cs.berkeley.edu/~samw/research/papers/sc07.pdf[/url] are roughly where I think the state of the art in sparse double-precision matmul is, though (I suspect for the logistical reasons of carrying the matrices around) their matrices are all either much smaller or much sparser than anything factorisation is interested in, and none of their machines has an L3, though the shapes of Niagara and Cell are close.

I wouldn't have thought of TLBs as an issue, though they clearly are; my habit of interleaving the bits of the X and Y addresses then sorting is quite close to the space-filling-curve approach you mentioned, though I'm sure I could make it more efficient.
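The interleave-the-X-and-Y-bits-then-sort habit described above is Morton (Z-order) encoding; sorting entries by this key groups them into nested quadrants, which is what makes it close to the Hilbert approach. A plain-C sketch (a real implementation would use table lookups or PDEP, but the bit-by-bit version shows the idea):

```c
#include <stdint.h>

/* Interleave the bits of x and y into a 64-bit Morton key:
   bit i of x lands at bit 2i, bit i of y at bit 2i+1. */
uint64_t morton64(uint32_t x, uint32_t y)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++) {
        r |= ((uint64_t)((x >> i) & 1)) << (2 * i);
        r |= ((uint64_t)((y >> i) & 1)) << (2 * i + 1);
    }
    return r;
}
```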

Cell is clearly a very nice platform for things small enough to fit into the PS3's minuscule main memory.

fivemack 2008-06-09 22:51

I also have a stupid linear-algebra question.

Block Wiedemann involves calculating x, Mx, M^2x, ... so obviously makes sense only for square matrices. For the slightly over-square N*(N+epsilon) matrices from GNFS, is anything dreadful likely to happen if I just trim the longer dimension down to the shorter one and plan to recover the last few rows by dense linear algebra having got more than epsilon dependencies?

It would be terribly nice not to have to do M^TM, both for coding reasons and because just using M ought to be twice as fast.

jasonp 2008-06-10 03:23

[QUOTE=fivemack;135575]I also have a stupid linear-algebra question.

Block Wiedemann involves calculating x, Mx, M^2x, ... so obviously makes sense only for square matrices. For the slightly over-square N*(N+epsilon) matrices from GNFS, is anything dreadful likely to happen if I just trim the longer dimension down to the shorter one and plan to recover the last few rows by dense linear algebra having got more than epsilon dependencies?

It would be terribly nice not to have to do M^TM, both for coding reasons and because just using M ought to be twice as fast.[/QUOTE]
Block Lanczos uses M'M because the matrix must be symmetric for the iteration to work; one of Chris Monico's students published a thesis that suggests using MM' instead, since this lets you do twice as many operations on a block of columns of M before requiring a global synchronize. I've tried to do that, and I know the iteration works correctly, but there is a modification required to convert a vector in the nullspace of MM' into a vector in the nullspace of M that I don't quite understand.

If block Wiedemann can work with unsymmetric matrices then you don't have to worry about the extra multiply.

The problem with using algorithms from the sparse floating point literature is that the dimension is much smaller, and factorization matrices are hugely more dense and more unsymmetric than ordinary sparse matrix problems.

