mersenneforum.org  

2012-03-01, 05:34   #23
LaurV
Romulan Interpreter
Jun 2011
Thailand
2⁶·151 Posts

Quote:
Originally Posted by flashjh View Post
SB-E right? Is that 1 thread or 4 threads?
Right. 4 workers, each single-threaded, with no helper thread. I have a 4.4GHz OC in the config emily asked about and I get 10ms/iter (more or less); I just applied the rule of proportion to her/his (I still think emily is a "he" :P) requested 3.9GHz (I didn't set the clock, just did the calculation).
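
A minimal sketch of that proportion rule, assuming iteration time scales inversely with clock speed (which ignores memory bandwidth and everything else that does not scale with the core clock). The function name `scale_iteration_time` is made up for illustration:

```python
def scale_iteration_time(ms_per_iter: float, measured_ghz: float, target_ghz: float) -> float:
    """Rescale a per-iteration time, assuming time is inversely proportional to clock speed."""
    return ms_per_iter * measured_ghz / target_ghz


# Figures from the post: about 10 ms/iter measured at 4.4 GHz, requested clock 3.9 GHz.
estimate_ms = scale_iteration_time(10.0, 4.4, 3.9)
print(f"estimated {estimate_ms:.2f} ms/iter at 3.9 GHz")
```

With those inputs the estimate comes out a little over 11 ms per iteration.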

2012-03-01, 05:36   #24
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3·29·83 Posts

What about the other two cores? mfaktc?

2012-03-01, 06:06   #25
flashjh
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts

Quote:
Originally Posted by LaurV View Post
Right. 4 workers, each single-threaded, with no helper thread. I have a 4.4GHz OC in the config emily asked about and I get 10ms/iter (more or less); I just applied the rule of proportion to her/his (I still think emily is a "he" :P) requested 3.9GHz (I didn't set the clock, just did the calculation).
Quote:
Originally Posted by Dubslow View Post
What about the other two cores? mfaktc?
Those CPUs are amazing! LaurV, that's running 27.3 with AVX, correct? Maybe soon I'll get one

Last fiddled with by flashjh on 2012-03-01 at 06:06

2012-03-01, 06:38   #26
LaurV
Romulan Interpreter
Jun 2011
Thailand
10010111000000₂ Posts

Quote:
Originally Posted by Dubslow View Post
What about the other two cores? mfaktc?
I looked and looked for another 2 cores, but did not find any! Do you think the 4 visible cores conspired together to hide the other 2 cores from me and forbid me to find them? That would be odd...

Now, seriously speaking, the 3820 has only 4 cores... I have not yet been lucky enough to touch the hex-core "heavy artillery" like the 3930 and 3960. The 3820 is an eval board received a few days ago, and we have to return it. Cooling is quite bad (we are interested in passive cooling for one of our industrial PC terminals: no fan, no noise, no dust trouble; heatpipes will be connected to the machine's case, which is a mountain of iron. But this CPU really eats twice the power of its predecessors, and our cooling design does not seem to be very effective).

For a few (30?) dollars more you get a 2600k or even a 2700k, which can be overclocked higher while staying stable and consuming less power. [edit: and do exactly the same job]

Last fiddled with by LaurV on 2012-03-01 at 06:43

2012-03-01, 06:48   #27
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1110000110101₂ Posts

Indeed. I saw 3xxx and thought 'hex core!'. I didn't even know the 38xx had come out yet.

2012-03-01, 07:36   #28
LaurV
Romulan Interpreter
Jun 2011
Thailand
2⁶·151 Posts

To come back to the theoretical speed (thanks Dubslow for pointing it out) for the case where memory is no limitation: well, the prospects are not so rosy, either.

Say, for example, we run a pure 64-bit CPU that can read and write any memory cell at any time, with zero delay from the request, and that can multiply its registers in a single clock to get either the lower or the higher half of the result. So you would need two clocks to multiply two (or square one) 64-bit numbers: in fact you do the multiplication twice, once to read the lower 64 bits of the result and once to read the higher 64 bits. Anyhow, two clocks for a 64-bit multiplication is better than most CPUs can do, but right now I have the Neon coprocessor on my desk (Texas Instruments' Cortex A8), and that is how it handles the msb/lsb parts of the result; my thinking is highly influenced by the documentation I am reading at this very moment.
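
To make the "multiply twice, once for each half" idea concrete, here is a small sketch using Python integers to imitate 64-bit registers. The helper names `mul_lo` and `mul_hi` are hypothetical and do not correspond to any real instruction set:

```python
MASK64 = (1 << 64) - 1  # keeps only the low 64 bits of a value


def mul_lo(a: int, b: int) -> int:
    """Low 64 bits of the 128-bit product of two 64-bit values."""
    return (a * b) & MASK64


def mul_hi(a: int, b: int) -> int:
    """High 64 bits of the 128-bit product of two 64-bit values."""
    return (a * b) >> 64


a = 0xFFFFFFFFFFFFFFFF
b = 0xFFFFFFFFFFFFFFFF
lo, hi = mul_lo(a, b), mul_hi(a, b)

# Gluing the two halves back together recovers the full 128-bit product.
assert (hi << 64) | lo == a * b
print(hex(hi), hex(lo))
```

In the idealized CPU described above, each of the two calls would cost one clock, giving the two clocks per full 64-bit multiplication used in the estimates.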

To get an idea of how much time per iteration this would take, and how much the FFT improves things, let's further assume that we have a 4GHz CPU and that we are dealing with a 26-30M exponent. Memory operations are instant, with no delay. We can read any 64-bit location whenever we want, and we can even read several, or all, at the same time. But we can still only multiply 64 bits at a time, and we need 2 clocks for it.

So, working in base 2^64, we have 26M bits divided by 64, that is, n=406250 "units" to multiply.

Using school-grade multiplication, we have to do n^2 operations, that is, 165,039,062,500 operations, taking 82.5195313 seconds per iteration.
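
The school-grade count can be checked with a short sketch. The routine below multiplies two numbers stored as little-endian lists of 64-bit "units" and counts the unit products; the constants are the ones assumed in this post (4GHz, 2 clocks per multiplication, 26M bits), and the function names are made up for illustration:

```python
BASE = 1 << 64  # one "unit" is a 64-bit limb


def schoolbook_multiply(a, b):
    """Multiply little-endian limb lists; return (product limbs, limb products used)."""
    result = [0] * (len(a) + len(b))
    mults = 0
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            total = result[i + j] + ai * bj + carry
            result[i + j] = total % BASE
            carry = total // BASE
            mults += 1
        result[i + len(b)] += carry
    return result, mults


def limbs_to_int(limbs):
    """Convert a little-endian limb list back to a Python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


a_limbs = [BASE - 1, BASE - 1]
b_limbs = [BASE - 1, 5]
product, mults = schoolbook_multiply(a_limbs, b_limbs)
assert limbs_to_int(product) == limbs_to_int(a_limbs) * limbs_to_int(b_limbs)

# Estimate for the idealized CPU described above.
CLOCK_HZ = 4_000_000_000
CLOCKS_PER_MUL = 2
n = 26_000_000 // 64
schoolbook_ops = n * n
seconds = schoolbook_ops * CLOCKS_PER_MUL / CLOCK_HZ
print(n, schoolbook_ops, f"{seconds:.4f} s")
```

Two limbs times two limbs costs four unit products, which is the n^2 behaviour that the estimate scales up to n=406250.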

Using Karatsuba-like multiplication, we would only have to do n^log(3,2) operations, which is n^1.58496, or 775,759,623 operations, and this would take, amazingly, only 387.8798 milliseconds on average. That is about 210 times faster. Even in the worst-case scenario, where we would need exactly 3*n^log(3,2) operations, we get 1.1636394 seconds per iteration, which is about 70 times faster than school-grade.
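
As a rough illustration of where n^log(3,2) comes from, here is a sketch of Karatsuba recursion that counts single-unit products. It is a simplification, not production code: the carry bit produced by the additions is ignored when counting, and the `limbs` parameter and function name are assumptions made for this example:

```python
import math


def karatsuba(x: int, y: int, limbs: int, limb_bits: int = 64):
    """Return (x * y, number of single-limb products) for operands of `limbs` limbs."""
    if limbs <= 1:
        return x * y, 1  # counted as one unit product
    half = limbs // 2
    shift = half * limb_bits
    mask = (1 << shift) - 1
    x_lo, x_hi = x & mask, x >> shift
    y_lo, y_hi = y & mask, y >> shift
    # Three recursive products instead of the four that schoolbook would need.
    low, c_low = karatsuba(x_lo, y_lo, half, limb_bits)
    high, c_high = karatsuba(x_hi, y_hi, limbs - half, limb_bits)
    cross, c_cross = karatsuba(x_lo + x_hi, y_lo + y_hi, limbs - half, limb_bits)
    middle = cross - low - high
    return (high << (2 * shift)) + (middle << shift) + low, c_low + c_high + c_cross


x = (1 << 512) - 12345
y = (1 << 511) + 6789
product, products_used = karatsuba(x, y, limbs=8)
assert product == x * y
print(products_used)  # 27 unit products for 8 limbs, versus 64 for schoolbook

n = 26_000_000 // 64
estimate = n ** math.log2(3)
print(f"{estimate:,.0f} unit products for n = {n}")
```

Eight limbs need 3^3 = 27 products, which is 8^log(3,2); the same exponent applied to n=406250 gives the operation count quoted above.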

We can improve further with different variations of Toom-Cook, the best of them needing about 164 million operations and only 82.3576 milliseconds, an improvement of almost exactly a thousandfold.
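
The 164 million figure appears to match n^log(5,3), the exponent of the Toom-3 variant. Treating it as Toom-3 is an assumption, since the post does not name the variant. Under that assumption, and with the same idealized 4GHz, 2-clocks-per-multiplication CPU, the estimate can be reproduced like this:

```python
import math

CLOCK_HZ = 4_000_000_000   # assumed 4GHz clock from the post
CLOCKS_PER_MUL = 2         # assumed cost of one full 64-bit multiplication
n = 26_000_000 // 64       # number of 64-bit units

# Toom-3 replaces 9 sub-products by 5, giving the exponent log base 3 of 5.
toom3_exponent = math.log(5) / math.log(3)
toom3_ops = n ** toom3_exponent
toom3_ms = toom3_ops * CLOCKS_PER_MUL / CLOCK_HZ * 1000

schoolbook_ms = n * n * CLOCKS_PER_MUL / CLOCK_HZ * 1000
speedup = schoolbook_ms / toom3_ms

print(f"exponent {toom3_exponent:.5f}")
print(f"{toom3_ops:,.0f} operations, {toom3_ms:.4f} ms, {speedup:.0f}x faster than school-grade")
```

These counts ignore the additions and the small divisions that Toom-Cook needs for interpolation, so they are a lower bound on real work rather than a timing prediction.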

Adding FFT into the equation, Schönhage–Strassen runs in O(n log(n) log(log(n))), which would need only 13,422,750 operations and be done in 6.7114 milliseconds. So, even with a pure 64-bit ALU and unlimited memory bandwidth, we can't go much below this, assuming we run a single core, with no parallel tricks such as SSE or AVX, and assuming our ALU is clever enough to do a full 64-bit multiplication in only two clocks.
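
A quick way to reproduce that estimate, assuming the logarithms are natural logarithms (that choice reproduces the operation count above) and that the constant hidden in the O() is 1, which is a strong simplification:

```python
import math

CLOCK_HZ = 4_000_000_000   # assumed 4GHz clock from the post
CLOCKS_PER_MUL = 2         # assumed cost of one full 64-bit multiplication
n = 26_000_000 // 64       # number of 64-bit units

# n * log(n) * log(log(n)) with natural logarithms and a constant factor of 1.
ss_ops = n * math.log(n) * math.log(math.log(n))
ss_ms = ss_ops * CLOCKS_PER_MUL / CLOCK_HZ * 1000

print(f"{ss_ops:,.0f} operations, {ss_ms:.4f} ms per iteration")
```

Because big-O notation drops constant factors, the result is only an order-of-magnitude figure for this imaginary CPU.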

Theoretically, a GPU can do much better in this case, a thousand times better, but the problem with it is exactly... the bandwidth limitation. When it comes to accessing external memory, which is shared by all the "ALU"s, they need to be polite and wait for each other, so they become quite slow.

Some improvement in timing could come only if we had TONS of such "unlimited bandwidth" memory and pre-computed multiplication tables for a higher number of bits. With a few billion terabytes of such fast, instantly accessible memory filled with precomputed multiplication tables, the multiplication algorithm becomes more or less linear: just table lookups and a few additions.
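
A toy version of the lookup-table idea, using a tiny 4-bit table so that it fits in memory. The digit width `K`, the helper names and the storage assumption (each entry rounded up to whole bytes) are all choices made for this sketch:

```python
K = 4                                   # bits per table index (tiny, for illustration)
DIGIT_MASK = (1 << K) - 1
# Precomputed products of every pair of K-bit digits.
TABLE = [[a * b for b in range(1 << K)] for a in range(1 << K)]


def digits(value: int):
    """Split a non-negative integer into little-endian K-bit digits."""
    out = []
    while value:
        out.append(value & DIGIT_MASK)
        value >>= K
    return out


def multiply_with_table(x: int, y: int) -> int:
    """Multiply using only table lookups, shifts and additions."""
    result = 0
    for i, xd in enumerate(digits(x)):
        for j, yd in enumerate(digits(y)):
            result += TABLE[xd][yd] << (K * (i + j))
    return result


def table_bytes(k_bits: int) -> int:
    """Size of a full k-bit by k-bit product table, one 2k-bit entry per pair."""
    entries = 1 << (2 * k_bits)
    bytes_per_entry = (2 * k_bits + 7) // 8
    return entries * bytes_per_entry


assert multiply_with_table(12345, 6789) == 12345 * 6789
for k in (8, 16, 32):
    print(k, "bits:", table_bytes(k), "bytes")
```

The table size grows as 2^(2k), which is why widening the table only a little already needs an absurd amount of memory; note also that with a small table the digit loop is still quadratic in the number of digits.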

2012-03-01, 15:27   #29
bcp19
Oct 2011
7·97 Posts
Are those DC exponents? I ask cause mine is at 9ms on DC using a 2500k running at 4.25GHz.

2012-03-01, 18:30   #30
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1C35₁₆ Posts

That sounds about right. You'd get about 10 ms if you slowed it down a bit, whereas LaurV predicts 6 ms for a 4 GHz CPU at two cycles to multiply two 64-bit numbers. Add in the memory/bandwidth limitation and you get significantly slower than 10 ms, but add back in parallel operations like SSE/AVX and you get back down to 10 ms. It seems about right.

2012-03-05, 06:16   #31
firejuggler
Apr 2010
Over the rainbow
A2E₁₆ Posts

Good news about RAM speed, and it needs almost no power, here:

Quote:
Originally Posted by Tomshardware
According to a research paper published in Nature Photonics, the prototype has a capacity of 4 bits and transfers data at 40 Gbps. It features extremely low power consumption at just 30 nW. While it is far from a commercial product, the researchers believe that it is a foundation for the development of far more capable o-RAM devices with a storage capacity in the range of Kb or Mb. The NTT researchers believe that a 100 Kb o-Ram for all optical network routers device could be built by 2020. A 1 Mb o-RAM chip could be available by 2025

Last fiddled with by firejuggler on 2012-03-05 at 06:16

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.