
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   News (https://www.mersenneforum.org/forumdisplay.php?f=151)
-   -   Holy new Mersenne prime, Batman! (M47 related) (https://www.mersenneforum.org/showthread.php?t=10564)

T.Rex 2008-09-10 20:24

[QUOTE=ewmayer;141787]Regarding parallel scalability [which Tony Reix also asked me about offline][/QUOTE]Here are some data gathered with Glucas on a Harpertown 2.8 GHz machine (2 x 4 cores), using the August 23rd prime:

[CODE]#cores   sec/iter   scalability
-------------------------------------
  1      0.0670     100%
  2      0.0366     91.5% of 2 cores
  4      0.0251     66.7% of 4 cores
  8      0.0173     48.4% of 8 cores[/CODE]

This shows awful scalability: with 8 cores you get the power of only about 3.87 cores, whereas Ernst got 6.24 effective cores on his SPARC machine.
And I remember getting 92% with 8 cores (7.4 effective cores) and 83% with 16 cores (13.3 effective cores) with Glucas on a Power machine!
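For reference, the scalability figures above are just ratios of the sec/iter timings; a quick sketch (Python, numbers taken from the table):

```python
# Effective cores = 1-core time / N-core time; scalability = effective / N.
def effective_cores(t1, tn):
    return t1 / tn

def scalability(t1, tn, n):
    return effective_cores(t1, tn) / n

# sec/iter from the Harpertown run above
times = {1: 0.0670, 2: 0.0366, 4: 0.0251, 8: 0.0173}
for n, t in times.items():
    print(f"{n} cores: {effective_cores(times[1], t):.2f} effective cores, "
          f"{scalability(times[1], t, n):.1%}")
```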

So, where is the problem? Software or hardware?

It would be nice to be able to run Mlucas and Glucas on the same hardware, compare their scalability and performance, and understand why one runs faster and why one scales better than the other.

Tony

T.Rex 2008-09-10 20:42

Scalability
 
I see one main reason why programs like Prime95, Glucas and Mlucas do not scale well on machines with a NUMA architecture.

Since, for all three, the parallel version was built by extending the single-threaded version, they share the same problem. They do:
1) Allocate memory,
2) Divide the FFT into N parts,
3) Create and run the N threads on this one big memory zone,
4) Compute the subtraction of 2 and go to the next iteration.

This is bad! This way, the memory is allocated near the CPU where the initial thread started. Thus, with 8 or more cores, the NUMA effect creates a bottleneck: all threads contend for the same region of memory.

There are sometimes ways to tell the operating system to spread the allocation across all the nodes (interleaving). That helps, but it is NOT a solution, because the memory zone used by a thread can still sit at the opposite end of the NUMA machine.

The solution would be:
1) Divide the FFT into N parts,
2) Create the N Threads,
3) Allocate memory for each thread,
4) Run the threads

This way, I think all threads will be able to use memory allocated close to them, so there is no bottleneck and caches stay full of useful data.
Only when the N parts of the FFT must be used as one is it necessary to rebuild a single zone and perform the subtraction of 2.
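A minimal sketch of the proposed ordering (Python used purely for illustration; in Glucas/Mlucas this would be C with pthreads, where the point is that with a first-touch page policy each thread's allocation lands on its own NUMA node):

```python
import threading

N = 4                 # number of worker threads (illustrative)
SLICE = 1 << 12       # elements per FFT slice (illustrative)
slices = [None] * N   # filled in by the workers themselves

def worker(i):
    # The allocation happens *inside* the thread, after it is running,
    # so a first-touch policy would place these pages on the local node.
    buf = [0.0] * SLICE
    # ... per-slice FFT work would happen here, on local memory ...
    slices[i] = buf

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Only now, for the subtraction of 2, are the N slices treated as one array.
```

The contrast with the current model is only the ordering: memory is allocated after the threads exist, not before.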

When I described how Glucas works (the first model), an expert on our HPC team said: "Oh! Yes, still the same story: they allocate the whole buffer and then all the threads use it. Bad."

On (VERY expensive) Power machines made by IBM, the NUMA effect is so small that the machine looks like a simple SMP, and scalability is very good!

On not-very-expensive Intel machines, the NUMA effect is high, so much of the work must be done on the software side.
On future inexpensive many-core machines, such software will run fast.

George, Ernst, Guillermo, your opinion ?

I know that you would have to rewrite a big part of the FFT code... but it should be worth it for machines with a NUMA effect.

Tony

ewmayer 2008-09-10 21:13

[QUOTE=T.Rex;141817]I know that you would have to rewrite a big part of the FFT code... but it should be worth it for machines with a NUMA effect.[/QUOTE]

Interesting - I'll have to look at just how much coding might be needed here.

But there are a couple of problems with allocating thread-local memory:

1. Since the FFT is fundamentally a data-nonlocal algorithm, each thread can only use "its" local memory chunk for part of each FFT. In Mlucas, the nonlocality of the FFT is dealt with by doing 2 multithreaded substeps for each iteration. Each substep accesses the big data array in differently-sliced ways. Even if one were to create thread-local memory the way you outline, at some point on each iteration you'd need to copy the thread-local data to some global dataset and then rebroadcast the merged data to the new set of threads in step 2. Major pain in the ass, and the need for a thread-shared dataset is still present. But perhaps if this merge/rebroadcast step only takes a small amount of time it might be worth it, i.e. if the main NUMA bottleneck is having the various threads accessing the global data pool all the time.

2. The whole idea of packages like OpenMP is to minimize the burden of the kinds of architecture-specific hackery you describe on the programmer - in other words, the threading protocols should be able to do precisely the kinds of architecture-specific thread/memory interaction management and hide the details behind a common level of abstraction. Perhaps the NUMA vendors should be working with the threading-software folks to accomplish this, rather than telling thousands of programmers that they need to do major code rewrites in order to get decent performance.
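The merge/rebroadcast requirement from point 1 amounts to a synchronization point between the two differently-sliced substeps; a hypothetical Python sketch of that structure:

```python
import threading

N = 4
barrier = threading.Barrier(N)   # all threads must reach it before any proceeds
log = []                         # records (substep, thread) in completion order

def worker(i):
    # Substep 1: each thread works on one slicing of the shared array.
    log.append((1, i))
    barrier.wait()               # merge point: no thread starts substep 2 early
    # Substep 2: the same threads work on a different slicing.
    log.append((2, i))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every substep-1 entry precedes every substep-2 entry in the log.
```

If the time spent at the barrier (plus any data movement there) is small relative to the substeps themselves, per-thread local allocation could still pay off.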

willmore 2008-09-10 23:13

[QUOTE=ewmayer;141787]Regarding parallel scalability [which Tony Reix also asked me about offline], here are the numbers for Mlucas on the 2.1 GHz Sparc VI [dual-core CPUs] M5000 server, at an FFT length of 4096K:
[code]
#threads   iteration    speedup vs   CPU
           time (sec)   1-thread     utilization
--------   ----------   ----------   -----------
    1        0.237        1.00x        100%
    2        0.138        1.72x         86%
    4        0.065        3.65x         91%
    8        0.038        6.24x         78%
   16        0.022       10.8 x         68%
[/code]
Interesting what happens at 2 and 4-threaded ... CPU utilization drops to 86% at 2-threads, then rises to 91% at 4 threads before dropping again.

Also, I should point out that on at least one architecture [Itanium2], Mlucas actually showed a slightly *superlinear* scaling for up to 4 threads [see [url=http://mersenneforum.org/showthread.php?t=7292]here[/url], especially post #19 in the thread], likely due to the working set size per thread for 2-4 threads fitting into each processor core's cache better than the full working set size for the FFT being done. So there is in fact potentially an absolute throughput gain to be had from running multithreaded. Not that such information is "useful" or "interesting" or anything...[/QUOTE]

As I argued, probably unsuccessfully, on comp.arch back a decade or so ago, local cache capacity increases with increasing processor count. This can cause an apparent supralinear increase *because you're no longer doing the same task*. You're doing half the job *twice*--in parallel. So, if you can find a job that more than doubles in speed when you halve it in size, you *may* find a task that shows supralinear speedup.

For your odd bump between 2 and 4 cores, were both threads of the 2-thread job on one physical CPU? What happens if you pin them to separate CPUs vs. leaving them on one? That could be L3 (or whatever your package-level cache is) effects. Or it could be memory bandwidth. I don't know much about current SPARC CPUs--I lost interest when they dropped below pocket-calculator speed back in the mid-'90s. But do these chips have local memory, or do they share a global pool? If the former, then main memory bandwidth increases with the number of processors, so you can expect a chance of supralinear speedup there for similar reasons.

But, that's all IIRC and IMHO. I'm years out of date in processor design.

Oh, and (waves) hi, Bob! Long time no NFS.

Cheers,
David

CRGreathouse 2008-09-11 07:12

[QUOTE=sylvester;141798]The way the currencies and central banks are doing nowadays would probably mean that in 15 years $150K will be just enough for a couple of postcard stamps and maybe a pitcher of beer.[/QUOTE]

So you're predicting, say, [url=http://www.google.com/search?q=(150000/20)^(1/15)]80%[/url] inflation? I calculate a [url=http://www.bls.gov/data/inflation_calculator.htm]4.05%[/url] average since the end of WW2, so that would be a 20-fold increase. Following that historical average, you should still be able to buy a cheap house or a sports car with $150,000 in 2023.
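The two linked figures can be checked directly (the 1945 and 2023 endpoints are my assumption for "since the end of WW2"):

```python
# ~80%/yr inflation implied if $150,000 buys only ~$20 of goods in 15 years
implied = (150000 / 20) ** (1 / 15) - 1   # about 0.81

# 4.05%/yr compounded 1945 -> 2023 gives roughly a 20-fold price increase
historical = 1.0405 ** (2023 - 1945)      # about 22
```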

CRGreathouse 2008-09-11 07:30

[QUOTE=R.D. Silverman;141746]Tell me the scientific (or even mathematical) problem that is being solved by actually finding Mersenne Primes.......[/QUOTE]

I don't know if you were serious in your request, but here's one. Yekhanin [1] shows that given a Mersenne prime [tex]M_p[/tex], one can construct an [i]n[/i]-bit locally-decodable code of size [TEX]\exp(n^{1/p}).[/TEX] This was a vast improvement over the previous best-known bounds, [tex]\exp(\sqrt n).[/tex]

LDCs are extremely efficient error-correcting codes for parallel (say, database) data access in corruptible environments. With cloud computing and high-transfer Internet connections, this seems like a practical tool -- though admittedly, I've never heard of them being used yet.

Reference
1. Yekhanin, Sergey. [url=http://research.microsoft.com/users/yekhanin/Papers/phdthesis.pdf]Locally Decodable Codes and Private Information Retrieval Schemes[/url].

T.Rex 2008-09-11 08:37

What about Mersennes with 100 and 1000 Mdigits?
 
The following measurements were done on an 8x Itanium2 1.6 GHz machine, with Glucas.
This machine is quite old now, but it has about the same scalability with 8 cores (74% = 5.92x) as the machine (78% = 6.24x) now used by Ernst's team to verify M45 and M46.

There are strange non-linear performance jumps between the 1M-10M-100M-1000M digit Mersenne numbers... but it clearly shows that progress must be made before one can check a 100 Mdigit Mersenne number at home!

[CODE]Mdigits     Exponent   Mem size   sec/iter   ratio vs previous   time required
--------------------------------------------------------------------------------
   1         3321937      78 MB     0.0030                        2.7 hours
  10        33219281     103 MB     0.0202    prev x  6.7         7.8 days
 100       332192831     393 MB     0.4112    prev x 20.4         4.3 years
1000      3321928097    3085 MB     5.2008    prev x 12.7         5.5 centuries

43.6       145000033     191 MB     0.1309    (8MB FFT)           220 days[/CODE]
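The "time required" column follows from sec/iter multiplied by the number of Lucas-Lehmer iterations (p - 2 squarings for M_p); a quick check:

```python
def ll_test_seconds(p, sec_per_iter):
    # A Lucas-Lehmer test of M_p takes p - 2 squaring iterations.
    return (p - 2) * sec_per_iter

DAY = 86400.0
print(ll_test_seconds(3321937, 0.0030) / 3600)          # hours: ~2.8
print(ll_test_seconds(33219281, 0.0202) / DAY)          # days: ~7.8
print(ll_test_seconds(332192831, 0.4112) / (365 * DAY)) # years: ~4.3
print(ll_test_seconds(145000033, 0.1309) / DAY)         # days: ~220
```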

Tony

T.Rex 2008-09-11 09:27

Intel MKL FFT
 
It would be nice to compare with and learn from what others are doing about FFTs, e.g. the [URL="http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm"]Intel MKL FFT library[/URL].
Tony

ixfd64 2008-09-11 11:07

The Intel MKL isn't open source, though. :sad:

ldesnogu 2008-09-11 11:31

[quote=T.Rex;141905]Would be nice to compare and learn from what/how others are doing about FFT, like: [URL="http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm"]Intel MKL FFT library[/URL].[/quote]
According to [URL]http://softwarecommunity.intel.com/isn/downloads/softwareproducts/images/FFT_Xeon_MKL10_FFTW_1D_350.jpg[/URL], the Intel library has the same speed as FFTW for 512K complex single-precision points, and according to Ernst, Mlucas is twice as fast as FFTW ([URL]http://hogranch.com/mayer/README.html[/URL]).

I know this doesn't prove the Intel library is less efficient (Ernst's README is two years old, for instance), but it doesn't look that good.

ixfd64 2008-09-11 16:45

In an attempt to get this thread back on topic: the GIMPS home page has been updated. It looks like we will have to wait at least three more days.

On the bright side, both of Tony's verifications should be done by then, so we'll have three verifications for each number!

