#23
Romulan Interpreter
Jun 2011
Thailand
26·151 Posts
Right. With 4 workers, each single-threaded and with no helper thread, at my 4.4 GHz overclock in the configuration emily asked about, I get about 10 ms/iter, and I just applied the rule of proportion for her/his (I still think emily is a "he" :P) requested 3.9 GHz (I didn't actually set that clock, I just did the arithmetic).
#24
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3·29·83 Posts
What about the other two cores? mfaktc?
#25
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts
Quote:
Last fiddled with by flashjh on 2012-03-01 at 06:06
#26
Romulan Interpreter
Jun 2011
Thailand
10010111000000₂ Posts
I looked and looked for another 2 cores, but did not find any! Do you think the 4 visible cores conspired to hide the other 2 from me and forbid me to find them? That would be odd...

Now, seriously speaking, the 3820 has 4 cores only. The hex-core "heavy artillery" stuff like the 3930 and 3960 I have not been lucky enough to touch yet. The 3820 is an eval board received a few days ago, and we have to return it. Cooling is quite bad (we are interested in passive cooling for one of our industrial PC terminals: no fan, no noise, no dust trouble, heatpipes connected to the machine's case, which is a mountain of iron; but this CPU really digests double the amount of power compared with its predecessors, and our cooling design seems to be not very effective). For a few (30?) dollars more you get a 2600K or even a 2700K, which can be overclocked higher, still be stable, and consume less power. [edit: and do exactly the same job]

Last fiddled with by LaurV on 2012-03-01 at 06:43
#27
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1110000110101₂ Posts
Indeed. I saw 3xxx and thought 'hex core!'. I didn't even know the 38xx had come out yet.
#28
Romulan Interpreter
Jun 2011
Thailand
26·151 Posts
To come back to the theoretical speed (thanks, Dubslow, for pointing it out) for the case where memory is no limitation: well, the outlook is not so rosy either.
Say, for example, we run a pure 64-bit CPU that can read and write any memory cell with zero latency (you can read any memory cell at any time), and which can multiply its registers in a single clock, yielding either the lower or the higher half of the result. So you would need two clocks to multiply two (or square one) 64-bit numbers: in effect you do the multiplication twice, once to read the lower 64 bits of the result and once to read the higher 64 bits, and the whole story takes two clocks. Two clocks for a 64-bit multiplication is better than most CPUs can do, but I now have the Neon coprocessor on my desk (Texas Instruments' Cortex-A8) and that is how it handles the msb/lsb halves of the result, so my thinking is highly influenced by the documentation I am reading at this very moment.

To get an idea of how much time per iteration this would take, and how much the FFT improves things, let's assume we have a 4 GHz CPU and are dealing with a 26-30M exponent. Memory operations are instant, no delay: we can read any 64-bit location whenever we want, and we can even read more of them, or all of them, at the same time. But we can still only multiply 64 bits at a time, and we need 2 clocks for it. So, working in 64-bit units, we have 26M bits divided by 64, that is n = 406,250 "units" to multiply. Using school-grade multiplication, we have to do n² operations, which is 165,039,062,500 operations, taking 82.52 seconds per iteration. Using Karatsuba-like multiplication, we would only have to do n^log2(3) operations, which is n^1.58496, or 775,759,623 operations, and this would take, amazingly, only 387.88 milliseconds on average. That is over 200 times faster. Even in the worst-case scenario, when we would need exactly 3·n^log2(3) operations, we get 1.16 seconds per iteration, which is still about 70 times faster than school-grade.
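The operation counts above are easy to check with a short script (a sketch only; the 4 GHz clock and 2-clocks-per-multiply figures are the assumptions stated in the post):

```python
import math

CLOCK_HZ = 4e9        # assumed 4 GHz CPU
CLOCKS_PER_MUL = 2    # assumed: full 64x64 -> 128-bit multiply in 2 clocks

n = 26_000_000 // 64  # 26M-bit number split into 64-bit "units" -> 406,250

school = n ** 2                    # schoolbook: n^2 unit multiplications
karatsuba = n ** math.log2(3)      # Karatsuba: ~n^1.585 unit multiplications

def seconds(ops):
    """Time for `ops` multiplications at 2 clocks each on a 4 GHz clock."""
    return ops * CLOCKS_PER_MUL / CLOCK_HZ

print(f"schoolbook: {school:,} ops, {seconds(school):.2f} s/iter")
print(f"karatsuba:  {karatsuba:,.0f} ops, {seconds(karatsuba)*1000:.1f} ms/iter")
print(f"speedup:    {school / karatsuba:.0f}x")
```

Running it reproduces the 82.5 s and ~388 ms figures, and shows the schoolbook/Karatsuba ratio is a bit over 200.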
We can improve further with different variants of Toom-Cook, the best of which gets down to about 164 million operations, needing only 82.36 milliseconds, an improvement of almost exactly a thousandfold. Adding the FFT into the equation, Schönhage–Strassen runs in O(n·log(n)·log(log(n))), which would need only 13,422,750 operations and be done in 6.71 milliseconds.

So, even with a pure 64-bit ALU and unlimited memory bandwidth, we can't go much lower than this, assuming we run a single core with no parallel tricks like SSE or AVX, and assuming our ALU is clever enough to do a full 64-bit multiplication in only two clocks. Theoretically a GPU could do much better here, a thousand times better, but its problem is exactly... the bandwidth limitation. When it comes to accessing external memory, which is shared among all the "ALU"s, they have to be polite and wait for each other, so they become quite slow. Further improvement in timing could only come if we had TONS of such "unlimited bandwidth" memory and pre-computed multiplication tables for a higher number of bits. With a few billion terabytes of such fast, instantly-accessible memory filled with precomputed multiplication tables, the multiplication algorithm becomes essentially linear: just table lookups and a few additions.
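Interpreting the logs in the Schönhage–Strassen bound as natural logarithms reproduces the post's operation count (that interpretation, and ignoring the big-O constant, are assumptions of this sketch):

```python
import math

n = 26_000_000 // 64  # 406,250 units of 64 bits, as in the post

# Schönhage–Strassen cost model: n * ln(n) * ln(ln(n)) unit multiplications.
ssa_ops = n * math.log(n) * math.log(math.log(n))

# At 2 clocks per 64-bit multiply on a 4 GHz clock:
ssa_ms = ssa_ops * 2 / 4e9 * 1000

print(f"SSA: ~{ssa_ops:,.0f} ops, ~{ssa_ms:.2f} ms/iter")
```

This lands on roughly 13.4 million operations and about 6.7 ms per iteration, matching the figures above.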
#29
Oct 2011
7·97 Posts
Are those DC exponents? I ask because mine is at 9 ms on DC using a 2500K running at 4.25 GHz.
#30
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1C35₁₆ Posts
That sounds about right. You'd get around 10 ms if you slowed it down a bit, whereas LaurV predicts ~6 ms for a 4 GHz CPU at two cycles to multiply two 64-bit numbers. Add in the memory/bandwidth limitation and you get significantly slower than 10 ms, but add back in parallel operations like SSE/AVX and you get back down to about 10 ms. So it all seems consistent.
#31
Apr 2010
Over the rainbow
A2E₁₆ Posts
Good news about RAM speed, and it needs almost no power, here:
Quote:
Last fiddled with by firejuggler on 2012-03-05 at 06:16
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| A Theoretical (vs. Proficient/Practical) Deterministic Primality Test | a1call | Miscellaneous Math | 194 | 2018-03-19 05:54 |
| The Limitations of the Cranial GPU | Flatlander | Science & Technology | 3 | 2013-06-13 13:34 |
| Maximum theoretical MPG | Mini-Geek | Lounge | 9 | 2008-07-14 22:45 |
| Custom test for maximum CPU stress | Torpedo | Information & Answers | 10 | 2007-10-05 17:33 |
| LL test speed up? | jebeagles | Miscellaneous Math | 16 | 2006-01-04 02:43 |