mersenneforum.org  

Old 2011-04-13, 04:59   #34
chrisjp
 
Jul 2010

2×5 Posts
Default

Quote:
Originally Posted by diep View Post
Code:
x = a+b;
instruction using 'x' here
So the code uses 'x' directly after it wrote to 'x'. This will give a huge penalty - say 4 cycles or so.
A way to avoid that is to have another wavefront already started on the GPU that runs while this one waits for the result of 'x'. Yet in prime-number code it might be important to schedule the code so that we avoid needing another thread to run there, because then we lose registers, which we need to store the prime base for the sieve that generates factor candidates - a bigger prime base means we waste less system time on testing composites of the form 2np+1 with trial factoring.
This is simply not true; there is zero penalty for accessing recently written data. The register file is built to support this: all instructions execute over 4 cycles, with no penalty for reading a result that was just written (in fact, in the instruction language it can even spare a register). The dependent instruction just won't get co-scheduled. Short of trying to access 4 recently modified variables at once, the GPU will have no trouble.

Last fiddled with by chrisjp on 2011-04-13 at 05:00
chrisjp is offline   Reply With Quote
Old 2011-04-13, 08:28   #35
diep
 
 
Sep 2006
The Netherlands

807₁₀ Posts
Default

Quote:
Originally Posted by chrisjp View Post
This is simply not true; there is zero penalty for accessing recently written data. The register file is built to support this: all instructions execute over 4 cycles, with no penalty for reading a result that was just written (in fact, in the instruction language it can even spare a register). The dependent instruction just won't get co-scheduled. Short of trying to access 4 recently modified variables at once, the GPU will have no trouble.
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

On page 7 he gives this example:
Code:
x = a + b; // 20 cycles to execute
y = a + b; // independent
(stall)
z = x + d; // dependent, must wait for completion

Last fiddled with by diep on 2011-04-13 at 08:29
diep is offline   Reply With Quote
Old 2011-04-13, 22:22   #36
chrisjp
 
Jul 2010

2×5 Posts
Default

Quote:
Originally Posted by diep View Post
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

On page 7 he gives this example:
Code:
x = a + b; // 20 cycles to execute
y = a + b; // independent
(stall)
z = x + d; // dependent, must wait for completion
Okay... I'm not really sure of the relevance of a CUDA article about NVIDIA, but I can assure you that isn't correct on ATI :)
chrisjp is offline   Reply With Quote
Old 2011-04-13, 23:31   #37
diep
 
 
Sep 2006
The Netherlands

3×269 Posts
Default

Quote:
Originally Posted by chrisjp View Post
Okay... I'm not really sure of the relevance of a CUDA article about NVIDIA, but I can assure you that isn't correct on ATI :)
The relevance is that the principle of occupancy works the same way for AMD GPUs.

100% exactly the same.

So this is relevant for AMD as well.

And yes, I had already tested it: in many cases more threads help to hide those latencies.

And obviously with good reason: GPUs are first and foremost throughput monsters. See them as big buses, with CPUs being cars. A CPU can drive fast, and each of its 4 doors can take 1 passenger or driver.

Passengers can get in and out of the car quickly and drive fast to their destination.

Yet transporting many people in the same car takes a long time.

The GPU is a bus. It has just 2 doors, and getting people in and out is a process that takes time. It drives more slowly, but it carries more people, so in total throughput the bus always beats the car.

But someone sitting in the bus isn't at his destination yet, unlike in that fast car, where he arrives a lot sooner.

So the only relevant thing to figure out for AMD GPUs is what causes these stalls in AMD's case and for how many cycles. At first sight it seems AMD executes 2 wavefronts in turn on every SIMD.

Vincent

Last fiddled with by diep on 2011-04-13 at 23:37
diep is offline   Reply With Quote
Old 2011-04-14, 07:07   #38
Ralf Recker
 
 
Oct 2010

191 Posts
Default

Volkov et al. did two things to increase the performance of code running on GPUs. They

- reduced the number of threads / the occupancy.
- increased the amount of work that was done per thread to hide memory access latencies and to keep the GPU cores busy.

These changes helped to reach peak performance at very low occupancies, at a time when it was commonly believed that maximizing occupancy was necessary to achieve maximum performance.
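
Roughly, the "more work per thread" idea looks like this (a minimal CUDA sketch of the general technique, not Volkov's actual code - the kernel and all names are made up):
Code:
// Minimal sketch of "more work per thread": each thread handles
// ELEMS_PER_THREAD independent elements, so their additions can overlap
// each other's latency even when few threads are resident.
#define ELEMS_PER_THREAD 4

__global__ void add_ilp(const float *a, const float *b, float *c, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMS_PER_THREAD;

    float x[ELEMS_PER_THREAD];                    // lives in registers
    #pragma unroll
    for (int i = 0; i < ELEMS_PER_THREAD; i++)
        if (base + i < n)
            x[i] = a[base + i] + b[base + i];     // independent operations

    #pragma unroll
    for (int i = 0; i < ELEMS_PER_THREAD; i++)
        if (base + i < n)
            c[base + i] = x[i];
}

Launch it with n / (threads per block * ELEMS_PER_THREAD) blocks, rounded up; the only point is that independent work inside one thread hides latency that would otherwise require extra threads.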

Last fiddled with by Ralf Recker on 2011-04-14 at 07:29
Ralf Recker is offline   Reply With Quote
Old 2011-04-14, 08:11   #39
diep
 
 
Sep 2006
The Netherlands

3·269 Posts
Default

Quote:
Originally Posted by Ralf Recker View Post
Volkov et al. did two things to increase the performance of code running on GPUs. They

- reduced the number of threads / the occupancy.
- increased the amount of work that was done per thread to hide memory access latencies and to keep the GPU cores busy.

These changes helped to reach peak performance at very low occupancies, at a time when it was commonly believed that maximizing occupancy was necessary to achieve maximum performance.
It's not just memory lookups you need to hide; you need to hide everything that is 'in flight', as Volkov points out. These cores are really simple low-power cores; unlike the out-of-order x64 processors, they simply do not have the transistors to use a result before its latency has elapsed.

As for multiple threads, it seems there is no escape from that: in the case of AMD Cayman, the system has been built to alternate 2 wavefronts.

It really is a throughput-optimized chip.

I'm already assuming there is no escape from alternating at least 2 wavefronts on the chip. That might not be a problem in the case of mfaktc, yet I intend to also run a sieve in addition to that.

That sieve produces results; try optimizing throughput when at any given moment a new wavefront can launch. As Volkov points out, it reduces the number of GPRs you have available, which is a BIG problem.

In the case of Nvidia, Volkov figured out that the instructions have a 4-cycle latency; now the big question is whether on AMD simple instructions are also in flight for 4 cycles.

Last fiddled with by diep on 2011-04-14 at 08:12
diep is offline   Reply With Quote
Old 2011-04-29, 00:55   #40
diep
 
 
Sep 2006
The Netherlands

3·269 Posts
Default

Quote:
Originally Posted by chrisjp View Post
Hey guys.

Oliver - I had seen your thread and tried out your code! It's fantastic! I had a similar version based on floating-point reciprocals working a few months back. I tried switching over to integer-based code since I thought it might be faster given AMD's processing units.

The lack of some of the hardware instructions being passed through in OpenCL is unfortunately a big downside. This includes some of the 58XX features, including rotates, bit counts, and 24-bit multiplies. Unfortunately there isn't a good "PTX hack" for AMD OpenCL.

Paul - I'm happy to answer any FPGA questions; it's a big hobby of mine. The code is currently fairly elementary and should be a good starter for you. It does use most of the Xilinx "blocks", including RAMs/multipliers/DCMs, so it should be a good example.

I'm at work but I will try to throw both designs up somewhere early this week (probably tomorrow)

Chris
Hi Chris,

A reply to an old posting.

It seems that OpenCL maps very well onto the hardware. I am keeping in mind the possibility that the instructions which are not exposed are actually rather slow. For example, completely crucial for fast speed is obtaining the top 16 bits of a 24x24-bit multiplication. In hardware that instruction is called MULHI_UINT24 on the 6900 series (and probably on the 5000 and 4000 series before it as well), yet it might be the case that it is mapped onto the 32x32-bit mul_hi, which means it has to use all 4 PEs on the 6000 series and 5 PEs on the 5000 series. Throughput is then 1 per cycle per stream core, which is rather slow compared to the 24x24-to-32-bit mul24, which has a throughput of 1 per cycle per PE, and of course the 24x24-bit multiply-add is also 1 per cycle per PE.

Now there is some confusion, quite deliberately created it seems, about what throughput that MULHI_UINT24 has, so I'm sitting and waiting until AMD answers that question. Given that it got obfuscated later on in the forum there, I assume for now the scenario that it's a slow instruction and therefore was not exposed in OpenCL (just google who wrote OpenCL; it's an ex-ATI guy).

If this suspicion is correct - and from OpenCL we are not able to check it - then TF on AMD is horrible at small bit levels.
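
To make concrete what I mean by those top bits, here is how it looks on the CUDA side (illustration only - __umul24 and __umulhi are Nvidia intrinsics; on the AMD/OpenCL side exactly this is what we cannot express cheaply if the suspicion holds):
Code:
// For a, b < 2^24 the full product fits in 48 bits. __umul24(a,b) returns
// the low 32 bits of that product; since bits 48..63 of the 32x32 product
// are zero, __umulhi(a,b) returns exactly the top 16 bits we want. The
// question on AMD is whether this maps to the native MULHI_UINT24 or falls
// back to the slow 32x32 mul_hi path.
__device__ void mul24_wide(unsigned int a, unsigned int b,
                           unsigned int *lo32, unsigned int *hi16)
{
    *lo32 = __umul24(a, b);   // bits  0..31 of the 48-bit product
    *hi16 = __umulhi(a, b);   // bits 32..47 (upper 16 bits come out zero)
}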

Considering you're using the - with 100% certainty very slow - mul_hi a lot, it's amazing you achieved 97M/s with it. Do you have the actual code that calls and compiles this kernel somewhere, and can you mail it to me?

diep at xs4all dot nl
diep@xs4all.nl

It means the 6000 series gets very good throughput: as I calculated, your implementation has a theoretical peak of around 155M/s on your GPU, and you achieve 97M/s of that, which for a GPU is in fact great efficiency. Much depends, however, on the test numbers you used, so having that test code would be interesting. The choice of the test factor candidates can of course influence this big time.

But of course we already knew that Fermi and the 6000 series achieve much better efficiency thanks to all sorts of tricks they use, and that really seems to be the case.

Even if there were good news and the top 16 bits of the 24x24-bit multiplication could be obtained very fast (I don't count on it for now), it would take months before a compiler update that has this inside gets released.

Also, good news usually comes quickly from manufacturers, and no messenger has brought good news so far; there is just one posting there from possibly an AMD guy who claims it is mapped onto mul_hi, but it could be someone else as well - he is anonymous there.

Later he wrote some confusing postings there covering this up. Yet of course it makes sense if you think about it: don't put something in OpenCL which is hellishly slow on AMD.

If they were to fix that, it would speed things up *way over a factor of 2* - maybe a factor of 3 or so.

As for me, I calculated that I want to move Wagstaff (which is 100% similar to Mersenne) up from 61 bits to around 67-68 bits on the GPU; it is very difficult to find a fast implementation especially for those bit levels.

I calculated that storing 14 bits per unit and using 5 units gives 70 bits.
Then it's possible to use the 24x24-bit multiply-add delivering a 32-bit result, with a throughput of 1536 instructions per GPU per cycle.

Yet obviously that's far slower than if the top bits were available, as then you could do with 3 units.
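
As a rough sketch of what that 5x14-bit representation looks like (CUDA syntax just for illustration; the limb counts and names are mine, and a real kernel would fold this into the 24x24 multiply-add):
Code:
// Sketch: a 70-bit number stored as 5 limbs of 14 bits (radix 2^14).
// A 14x14-bit limb product is at most 28 bits, so it fits in the 32-bit
// result of a 24x24-bit multiply, and a column of 5 such products plus a
// carry still fits in 32 bits (5 * 2^28 + carry < 2^32).
#define LIMBS     5
#define LIMB_BITS 14
#define LIMB_MASK ((1u << LIMB_BITS) - 1)

// r[0..9] = x[0..4] * y[0..4], schoolbook; all input limbs are < 2^14.
__device__ void mul70(const unsigned int x[LIMBS],
                      const unsigned int y[LIMBS],
                      unsigned int r[2 * LIMBS])
{
    unsigned int col[2 * LIMBS] = { 0 };

    for (int i = 0; i < LIMBS; i++)
        for (int j = 0; j < LIMBS; j++)
            col[i + j] += __umul24(x[i], y[j]);   // 28-bit partial products

    unsigned int carry = 0;
    for (int k = 0; k < 2 * LIMBS; k++) {
        unsigned int t = col[k] + carry;          // still fits in 32 bits
        r[k]  = t & LIMB_MASK;
        carry = t >> LIMB_BITS;
    }
}

With the top bits available you could use wider limbs and get away with 3 units, which is where the big speedup would come from.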

Theoretical throughput there of course depends on an efficient Barrett implementation, which is a tad more complicated with 5 units; there seem to be several attractive solutions there (if I can get rid of the pre-shift and still calculate accurately in one go, that would speed things up). This gives a theoretical peak I calculated of around 270M-300M/s just for FC testing per GPU (the 6990 of course has 2 GPUs; my card here has just 1 GPU - let's first get it confirmed by AMD that the 6990 actually has OpenCL working on both GPUs, as they don't officially support it on both GPUs of the 5970 either). Of course it is very unlikely to achieve that perfect IPC.
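
For reference, on a single 64-bit word the Barrett idea I mean looks like this (a generic CUDA sketch, not the 5-limb version; q is just any odd factor candidate > 1 and mu is its precomputed reciprocal):
Code:
// Single-word Barrett reduction sketch: reduce x mod q for an odd q > 1
// using mu = floor(2^64 / q), precomputed once per candidate. The quotient
// estimate est = floor(x * mu / 2^64) equals floor(x/q) or floor(x/q) - 1,
// so at most one correcting subtraction is needed.
__device__ unsigned long long barrett_mod(unsigned long long x,
                                          unsigned long long q,
                                          unsigned long long mu)
{
    unsigned long long est = __umul64hi(x, mu);   // ~ floor(x / q)
    unsigned long long r   = x - est * q;         // 0 <= r < 2*q
    if (r >= q)
        r -= q;
    return r;
}

// Precomputation (odd q > 1, so floor((2^64 - 1) / q) == floor(2^64 / q)):
//   unsigned long long mu = 0xFFFFFFFFFFFFFFFFull / q;

The 67-68 bit case needs a multi-limb variant of exactly this, which is where the pre-shift question comes in.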

In addition to that, my aim was of course to also build the generation of FCs into the GPU, as I want to free up the 16 CPU cores here so they can do more useful stuff and do the entire TF inside the GPU. That will lower the FCs per second quite a bit, yet it offloads everything to the GPU, which of course is cool.
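
The shape of that FC-generation kernel would be something like this (a hypothetical CUDA sketch of the usual 2kp+1 sieve idea for Mersenne-style TF, not my actual code - the prime table, the k offset and the output layout are all placeholders, and q is kept at 64 bits for simplicity; at 67-68 bits q itself needs the multi-limb representation):
Code:
// Hypothetical candidate generator: factors of 2^p - 1 have the form
// q = 2*k*p + 1 and satisfy q = 1 or 7 (mod 8); anything divisible by a
// small prime is dropped before the expensive powmod test.
#define NUM_SMALL_PRIMES 8
__constant__ unsigned int small_primes[NUM_SMALL_PRIMES] =
    { 3, 5, 7, 11, 13, 17, 19, 23 };

__global__ void gen_candidates(unsigned long long p,
                               unsigned long long k_start,
                               unsigned long long *out,
                               unsigned int *out_count,
                               unsigned int n)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n)
        return;

    unsigned long long k = k_start + tid;
    unsigned long long q = 2ull * k * p + 1ull;

    unsigned long long m8 = q & 7ull;              // q mod 8
    if (m8 != 1ull && m8 != 7ull)
        return;                                    // cannot divide 2^p - 1

    for (int i = 0; i < NUM_SMALL_PRIMES; i++)
        if (q % small_primes[i] == 0)
            return;                                // composite candidate

    out[atomicAdd(out_count, 1u)] = q;             // survivors go to TF
}

For Wagstaff the 2kp+1 form is the same but the (mod 8) filter differs, and of course a real sieve would use a much bigger prime base than 8 primes.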

As I see it, this 270M-300M/s is not much above what you achieve over there - just a factor of 2 at most or so above the peak I calculated for your code when I quickly counted the theoretical cycles by hand (and I'm probably off by some), and it's easy to lose a factor of 2.

What I'm not so sure of with the AMD GPUs is how mixing different instructions influences theoretical throughput. The idea is that every instruction I issue will have a throughput of 1 per cycle per PE, whereas you mix them a lot; 4 PEs must then cooperate to execute 1 instruction (mul_hi, for example), and yet that seems to give you good IPC. Most interesting. On CPUs that would be total suicide. An interesting achievement of GPUs, which of course is a result of all cores within a compute unit executing the same code.

Already calculating at the design level I can speed it up by at most a factor of 2 or so; that's not much. All this because of only being able to put 14 bits per integer. That really is so 80s.

The AMD GPUs really seem to have been optimized mainly for floating point. I definitely know a very big improvement they can achieve when moving to 22 nm, if the suspicion turns out to be true :)

Regards,
Vincent
diep is offline   Reply With Quote