mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

Bdot 2011-06-07 13:24

mfakto: an OpenCL program for Mersenne prefactoring
 
This is an early announcement that I have ported parts of Oliver's (aka TheJudger) mfaktc to OpenCL.

Currently, I have only the Win64 binary, running an adapted version of Oliver's 71-bit mul24 kernel. It is not yet optimized and does not yet make use of the vectors available in OpenCL. A very simple (and slow) 95-bit kernel is there as well, so that the complete selftest finishes successfully on my box.

On my HD5750 it runs at about 60M/s in the 50M exponent range - certainly a lot of headroom :smile:

As I have only this one ATI GPU, I wanted to see if anyone would be willing to help test on different hardware.

Current requirements: OpenCL 1.1 (i.e. only ATI GPUs), Windows 64-bit.

There's still a lot of work until I may eventually release this to the public, but I'm optimistic for the summer.

Next steps (unordered):
[LIST][*]Linux port (Is Windows 32-bit needed too?)[*]check, if [URL]http://mersenneforum.org/showpost.php?p=258140&postcount=7[/URL] can be used (looks like it's way faster)[*]fast 92/95-bit kernels (barrett)[*]use of vector data types[*]various other performance/optimization tests&enhancements[*]of course, bug fixes:boxer:[*]docs and licensing stuff :yucky:[*]clarify if/how this new kid may contribute to primenet[/LIST]Bdot

stefano.c 2011-06-07 14:40

Hi, I can run it on my hardware.
I have an AMD Radeon HD 6950 1GB, OpenCL 1.1, Windows 7 64-bit.

diep 2011-06-07 18:12

[QUOTE=Bdot;263174]This is an early announcement that I have ported parts of Olivers (aka TheJudger) mfaktc to OpenCL.

Currently, I have only the Win64 binary, running an adapted version of Olivers 71-bit-mul24 kernel. Not yet optimized, not yet making use of the vectors available in OpenCL. A very simple (and slow) 95-bit kernel is there as well so that the complete selftest finished successfully on my box.

On my HD5750 it runs about 60M/s in the 50M exponent range - certainly a lot of headroom :smile:

As I have only this one ATI GPU I wanted to see if anyone would be willing to help testing on different hardware.

Current requirements: OpenCL 1.1 (i.e. only ATI GPUs), Windows 64-bit.

There's still a lot of work until I may eventually release this to the public, but I'm optimistic for the summer.

Next steps (unordered):
[LIST][*]Linux port (Is Windows 32-bit needed too?)[*]check, if [URL]http://mersenneforum.org/showpost.php?p=258140&postcount=7[/URL] can be used (looks like it's way faster)[*]fast 92/95-bit kernels (barrett)[*]use of vector data types[*]various other performance/optimization tests&enhancements[*]of course, bug fixes:boxer:[*]docs and licensing stuff :yucky:[*]clarify if/how this new kid may contribute to primenet[/LIST]Bdot[/QUOTE]

If you have written it in OpenCL, what do you still need a Linux port for? OpenCL works on Linux, doesn't it?

Are you talking about the sieve?

Note that last week I wrote a CPU sieve that generates FCs (factor candidates). It's for Wagstaff, yet it's really a one-byte change to have it generate Mersenne FCs.

Its speed is bad though, and I don't see how I can get it much faster in C.

With an 80k prime base it generates 17M/s on a 2.3 GHz Barcelona core.
With 5000 primes as the prime base it generates somewhere around 40M/s,
yet this is not relevant, as 5000 is too few; the majority of what it generates are composites,
by a factor of 2.5 or so (I posted exact statistics on this somewhere).

The idea was to write this first to get some experience writing an FC sieve, and then write the sieve for the GPU.

17M/s is of course far too slow to feed a GPU, yet there are no legal issues with this code as I wrote it myself :)
If it would help you, I can put a GPL header on top of it. Would you mind shipping me the OpenCL you wrote,
so I can directly hook it up for Wagstaff here :)

P.S. Do you also test in the same manner as TheJudger, just multiplying zeros?

Vincent

diep 2011-06-07 18:28

Heh Bdot, what speed does your card run at?

My main setup here is an XFX HD6970 running OpenCL under Linux.
The machine is a 16-core 2.3 GHz Opteron box (Barcelona) with 10 GB of RAM.

Total machine price including the GPU is 1300 euros.

Yes, it has no case; that would double the price of the machine!

Obviously it's easy to test on this.

What's your setup there?

Regards,
Vincent

vsuite 2011-06-07 20:40

Hi, would it work with an integrated ATI GPU?

How soon will the Windows 32-bit version be available?

Cheers

diep 2011-06-07 21:09

[QUOTE=vsuite;263212]Hi, would it work with an integrated ATI GPU?

How soon the windows 32bit version?

Cheers[/QUOTE]

The GPU needs to be 4000 series or newer to work, I guess.
The 5970 is only supported as 1 GPU, not 2 (very bad from AMD; I guess because otherwise this GPU would be the fastest GPU on planet earth for GPGPU, especially from a price viewpoint).

I do not know about OpenCL drivers on Windows; I am typing this from an OS X laptop and the production machines here use Linux. Windows Server 2003 32-bit will not work, as they only have drivers for Vista and newer. Not for the server versions, AFAIK.

There are a lot of horror reports regarding OpenCL on both NVIDIA and AMD. Especially NVIDIA. Yet they have CUDA with TheJudger's fast code.

Christenson 2011-06-07 21:43

And Bdot's code is essentially the judger's, I imagine, except for OpenCL versus CUDA. I don't think the Linux port will be too bad, as mfaktc seems to be pretty much the same on either platform; there is no GUI yet.

I have an integrated ATI Radeon on my six-core AMD Linux64 box, too.

Let's keep these codebases in touch so we have a common effort on the non-factoring parts of the problem.

And diep, are these horrors for speed, for correctness, or for setting cards on fire? mfakto will be the first OpenCL app I have any contact with. I now have enough results from mfaktc that I can help test mfakto.

diep 2011-06-07 21:45

[QUOTE=Christenson;263215]And Bdot's code is essentially the judger's, I imagine, except for OpenCL versus CUDA. I don't think the Linux port will be too bad, as mfaktc seems to be pretty much the same on either platform; there is no GUI yet.

I have an integrated ATI Radion on my six-core AMD Linux64 box, too.

Do let's keep these codes in touch so we have a common effort on the non-factoring parts of the problem.

And diep, are these horrors for speed, for correctness, or for setting cards on fire? mfakto will be the first OpenCL app I have any contact with. I now have enough results from mfaktc that I can help test mfakto.[/QUOTE]

Very good that you have experience with mfackt. Have fun helping out this guy.

Note my codebase is called mfockt

vsuite 2011-06-07 22:51

[QUOTE=diep;263213]the gpu needs to be 4000 series or newer to work i guess.[/QUOTE] Awwww. Too bad. ATI Radeon 3000 Graphics

KingKurly 2011-06-08 02:09

Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.

[URL]http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2[/URL]

Linux's 'lspci' says:
02:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450]

Christenson 2011-06-08 02:37

[QUOTE=diep;263217]Very good that you have experience with mfackt. Have fun helping out this guy.

Note my codebase is called mfockt[/QUOTE]

That's a terrible name, especially if it is good. Call it something like TF-noCPU, or CPUFreeTF. And stick a C in there for CUDA and an O in there for OpenCL.

*******
Actually, all of this is small enough to live within one code that goes out and finds out what is available if we want it to.
********

diep 2011-06-08 09:44

[QUOTE=KingKurly;263233]Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.

[URL]http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2[/URL]

Linux's 'lspci' says:
02:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450][/QUOTE]

Of course this one will work fine under Linux. Linux has a great driver.

diep 2011-06-08 09:59

[QUOTE=KingKurly;263233]Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.

[URL]http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2[/URL]

Linux's 'lspci' says:
02:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450][/QUOTE]

Note that besides the driver you also need to install the APP SDK 2.4;
both are on AMD's site.

Bdot 2011-06-08 11:49

Thanks for the replies and help offers, I'll contact you directly with more info.

[QUOTE=diep;263199]If you have written OpenCL, where do you need a linux port for, as OpenCL is working for linux isn't it?
[/QUOTE]

The sieve, handling of command line parameters, reading config files and writing result files are all taken over from Oliver and should be no trouble on Linux. What I need to port is all the OpenCL initialization and driving the GPU. I don't expect a lot of issues, yet it has to be done. Plus, I don't even have the AMD SDK installed on Linux yet :smile:.

[QUOTE=diep;263199]
p.s. you also test in same manner like TheJudger, just multiplying zero's?
[/QUOTE]

Hehe, yes. Just multiplying zeros. Very fast, I tell you!

[QUOTE=diep;263200]heh Bdot what speed does your card run at?
[/QUOTE]

To OpenCL it claims to have 10 compute cores (which would be 800 PEs), but AMD docs for HD 5750 say 9 cores (720 PEs) ... not sure about this. Claimed clock speed is 850 MHz.

[QUOTE=vsuite;263212]Hi, would it work with an integrated ATI GPU?
[/QUOTE]

Actually, I don't know. If you manage to install the AMD APP ([URL]http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx[/URL]) then it will run, but it may ignore the GPU and run on the CPU instead. You may end up with 1.5M/s :smile:

[QUOTE=vsuite;263212]How soon the windows 32bit version?
[/QUOTE]

Hmm, I did not think that was still necessary. But I just compiled the 32-bit version and it is running the self test just fine. So no big deal if needed.

[QUOTE=KingKurly;263233]Would the Radeon HD 5450 work for this, eventually? I am using Linux, so I would have to wait for that port to be ready.
[/QUOTE]

It would certainly work, but comparing the specs I'd expect a speed of ~4M/s with the current kernel. mprime on your CPU is probably way faster (but those 4M/s would be on top of it at almost no CPU cost). Worth testing anyway.

diep 2011-06-08 16:30

[email]diep@xs4all.nl[/email] for email.
diepchess on a range of messengers, ranging from Skype to AIM to Yahoo, etc.
Also on some IRC servers sometimes (seldom). I had also shipped you that in a private message, but maybe you didn't get a notification there.

Let me ask TheJudger whether he has a GPL header on top of his code as well; I didn't check out his latest code there.

diep 2011-06-08 16:42

[QUOTE=Bdot;263268]Thanks for the replies and help offers, I'll contact you directly with more info.



The sieve, handling command line parameters, reading config files, writing result files is all taken over from Oliver and should be no trouble on Linux. What I need to port is all the OpenCL initialization, and driving the GPU. I don't expect a lot of issues, yet is has to be done. Plus, I don't even have the AMD SDK installed on Linux yet :smile:.



Hehe, yes. Just multiplying zeros. Very fast, I tell you!



To OpenCL it claims to have 10 compute cores (which would be 800 PEs), but AMD docs for HD 5750 say 9 cores (720 PEs) ... not sure about this. Claimed clock speed is 850 MHz.



Actually, I don't know. If you manage to install the AMD APP ([URL]http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx[/URL]) then it will run, but it may ignore the GPU and run on the CPU instead. You may end up with 1.5M/s :smile:



Hmm, I did not think that was still necessary. But I just compiled the 32-bit version and it is running the self test just fine. So no big deal if needed.



It would certainly work, but comparing the specs I'd expect a speed of ~4M/s with the current kernel. mprime on your CPU is probably way faster (but those 4M/s would be on top of it at almost no CPU cost). Worth testing anyway.[/QUOTE]

The 5450 has 1 SIMD array. Or better: 1 compute unit.

[url]http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5450-overview/pages/hd-5450-overview.aspx#2[/url]

Quoted on most sites as having a 650 MHz core frequency.

In the long run I'd expect its speed for the 72-bit kernel
to be around 8M/s - 12M/s when APP SDK 2.5 releases, not counting the sieving
to generate FCs.

Which is the same speed as 2 AMD64 cores achieve.

The theoretical calculation is not so complicated, I'd say. It delivers 650 MHz * 80 cores == 52,000 M instructions per second. A naive guess of the cost of a single FC test would be that one needs at most around 4500 instructions for it; that's a pretty safe guess, if I may say so. 52,000 M/s divided by roughly 4500 instructions per FC gives the 8M/s - 12M/s range above.

Not achieving that would be very bad in OpenCL. If the compiler somehow messes up there, it might be necessary to wait until the next SDK for AMD to fix that.

Vincent

stefano.c 2011-06-08 19:22

[QUOTE=Bdot;263268]Thanks for the replies and help offers, I'll contact you directly with more info.[/QUOTE]
For email: [email]sigma_gimps_team@yahoo.it[/email]

Ken_g6 2011-06-09 00:52

[QUOTE=Bdot;263268]If you manage to install the AMD APP ([URL]http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx[/URL]) then it will run, but it may ignore the GPU and run on the CPU instead. You may end up with 1.5M/s :smile:
[/QUOTE]

I dealt with that! I eventually found this selects only GPUs:
[code]
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
(cl_context_properties)platform,
0
};

context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);[/code]

Feel free to browse [url=https://github.com/Ken-g6/PSieve-CUDA/tree/redcl]my ppsieve-cl source[/url]; just remember that it's GPLed if you want to use any of the rest of it.

Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL. He used assembly, and I found assembly to be useful in my CUDA app as well; but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right 32-bits be parsed as a 0-cycles-required no-op (which it ought to be.)

diep 2011-06-09 09:20

[QUOTE=Ken_g6;263327]I dealt with that! I eventually found this selects only GPUs:
[code]
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
(cl_context_properties)platform,
0
};

context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);[/code]

Feel free to browse [url=https://github.com/Ken-g6/PSieve-CUDA/tree/redcl]my ppsieve-cl source[/url]; just remember that it's GPLed if you want to use any of the rest of it.

Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL. He used assembly, and I found assembly to be useful in my CUDA app as well; but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right 32-bits be parsed as a 0-cycles-required no-op (which it ought to be.)[/QUOTE]

See my other postings elsewhere. I figured out that AMD had forgotten to expose in OpenCL the instruction that returns the top bits of a 24x24-bit multiply.

Initially, as usual, someone posted disinformation that this would be a slow instruction. It was posted on the AMD forum. It seemed to be an AMD guy, but I'm not sure. You never know in that forum.

I then filed an official question to AMD asking how fast it would be. How many people would have done that if someone had already answered it in the forum?

The answer after some weeks was: it runs at full speed.

The logical request then was to include it in OpenCL. It's possible they'll do that for APP SDK 2.5.

So until then it takes 5 cycles to get the full 48 bits: 1 instruction for the low 32 bits and 4 cycles for the top bits using the 32x32 top-bits instruction (and we assume of course that you have no more than 24 bits of information per operand, otherwise you'll need another 2 ANDs).

This means that with APP SDK 2.5 the 24x24==48-bit multiplies will run at unrivalled speed on AMD GPUs in OpenCL.

How many of the guys who 'toyed' with those GPUs in OpenCL have been asleep?

I don't know, sir, but if I look at the AMD helpdesk, their official ticket system only progresses a few questions a week. My initial questions, though asked over a period of a few days, had sequential numbers.

Nearly no one asks them official questions. I find it weird that it never improved; they hardly get any official feedback!

In those forums, just like here, people just shout something.

You can throw away all those benchmarks on how fast those GPUs are for integers in OpenCL.

Obviously one needs patience after figuring out how to do something faster.

Yet this is an immense speedup. Chrisjp's quick port of TheJudger's code to OpenCL uses this slow junk, you know.

It is not so hard to write down what is fastest until the release of the 2.5 APP SDK.

That's using a 70-bit implementation storing 14 bits per limb. Use 5 limbs.

With multiply-add it then eats 25 cycles for the multiplication, and with a bit of shifting added you soon end up at nearly 50 cycles; that's what my guess was based upon that it would take around 4500 cycles to do the complete FC test.

These AMD GPUs deliver many teraflops, however; completely unrivalled.

I would guess that modifying Bdot's code to use 5 limbs @ 70 bits would be peanuts to do. That will get a high speed.

So the peak of the 5870 and especially the 6970 theoretically lies somewhere around 1.351 Tflop / 4.5k = 1351M / 4.5 = 300M/s.

Not based upon multiplying zeros, yet it's a theoretical number. Don't forget the problem with theoretical numbers :)

Then when APP SDK 2.5 releases, we can suddenly build a fast 72-bit kernel with quite a bit higher peak performance :)

diep 2011-06-09 09:40

[QUOTE=Ken_g6;263327]I dealt with that! I eventually found this selects only GPUs:
[code]
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
(cl_context_properties)platform,
0
};

context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);[/code]

Feel free to browse [url=https://github.com/Ken-g6/PSieve-CUDA/tree/redcl]my ppsieve-cl source[/url]; just remember that it's GPLed if you want to use any of the rest of it.

Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL. He used assembly, and I found assembly to be useful in my CUDA app as well; but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right 32-bits be parsed as a 0-cycles-required no-op (which it ought to be.)[/QUOTE]

The 0-cycles-required no-op is nonsense. This hardware delivers teraflops. There is nothing else that delivers that much which you can buy for $90 on eBay.

It is already great if everything eats 1 cycle. There is no possible competition against those AMD GPUs at the 24x24=48-bit level.

It's just that most 'testers' are not clever enough to actually check the architecture manual to see whether there is a possibility to do it faster.

Forget 64-bit thinking; these GPUs are inherently 32-bit entities. If you want to force them to throw in extra transistors to do 64-bit arithmetic, that's going to penalize our TF speed, as we of course prefer a petaflop in 32 bits.

If you take a good look at TheJudger's code, the fastest kernel is based upon 24-bit multiplications.

That is for a good reason. It's naive to believe that the GPUs will go 64-bit.

Graphics in itself can make do with 16 bits. So having 24 bits or even 32 bits is already a luxury.

64-bit integers will always be too slow on those GPUs, of course.

Instead of extra transistors thrown away on that, I'd prefer another few thousand PEs.

In the first place, those GPUs deliver all this calculation power because all those kids who game on them pay for us. Only when sold in huge quantities, billions of those GPUs, can the price stay at its current level.

The "fast high-end HPC" chips, take POWER6 or POWER7; what is it, like $200k per unit or so?

A single GPU has the same performance, yes even at double precision, as such an entire POWER7.

It's just that programming for it is a lot more complicated, and that's because AMD and NVIDIA simply have no good explanation of the internal workings of the GPUs. Very silly, yet it's the truth.

The factories that produce this hardware have building costs of many billions. Intel was expecting that by 2020 a new factory would have a price of $20 billion. With each new process technology, the price of the factories also goes up. Intel calls that Moore's second law; google it.

This means that only processors produced in HUGE QUANTITIES have a chance of being sold at an economic price.

Right now all those gamers pay for that development, as I said earlier, and what's fastest for them doesn't involve any 64-bit logic.

Vincent

TheJudger 2011-06-09 11:03

Hi Vincent,

[QUOTE=diep;263348]If you take a good look at TheJudgers code the fastest kernel is based upon 24 bits multiplications.[/QUOTE]

this is not 100% true. For compute capability 1.x GPUs this is true but the "barrett79_mul32" kernel comes very close to the speed of the "71bit_mul24" kernel. For compute capability 2.x GPUs the "71bit_mul24" kernel is the slowest of the five kernels. Only the "71bit_mul24" kernel uses 24 bit multiplications.

Oliver

diep 2011-06-09 13:05

[QUOTE=TheJudger;263353]Hi Vincent,



this is not 100% true. For compute capability 1.x GPUs this is true but the "barrett79_mul32" kernel comes very close to the speed of the "71bit_mul24" kernel. For compute capability 2.x GPUs the "71bit_mul24" kernel is the slowest of the five kernels. Only the "71bit_mul24" kernel uses 24 bit multiplications.

Oliver[/QUOTE]

And that isn't 64-bit code either; that's my most important point. It's all 32-bit code.

Now, where AMD has a similar number of 'units' providing 32x32-bit multiplications to NVIDIA, for 24x24==48 bits it has 4 times that amount (in fact, in the 6000 series it uses precisely 4 simple units to emulate 32x32-bit, 64-bit or double-precision operations).

Now it might be easier to code things in 64 bits (we would be missing a 64x64 == 128-bit multiplication then), yet we must face the fact that the GPUs stay 32-bit.

As for AMD GPUs: in the time it takes to multiply 32x32 == 64 bits, which is 2 instructions, you can push through 8 simple instructions.

8 simple instructions will do 4 x 24x24==48-bit multiplies ==> that outputs 192 bits in total, whereas 2 x 32-bit multiplies output 64 bits in the same timespan.

For NVIDIA that break-even point is of course a tad different, as you correctly indicate.

So per cycle a single AMD GPU can output 1536 * 48 * 0.5 = 36864 bits.

The older AMD GPUs on paper even a tad more than that.

Good code really achieves a very high IPC on those modern GPUs (Fermi, 5000 or 6000 series).

How many integer bits can a GTX580 output per cycle?
What was it, using 32x32-bit multiplications, something like:

512 * 64 * 0.5 = 16384 bits

Is that correct?

Or was it 2 cycles per instruction, making it 8192 bits per cycle?

Bdot 2011-06-09 13:16

[QUOTE=Ken_g6;263327]
[code]
context = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);[/code][/QUOTE]
True, I do that as well, but fall back to the CPU on purpose if no suitable GPU is found. Plus, I added (ported) support for the -d option to specifically select any device (including CPU).
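
(For the curious, a minimal sketch of such a GPU-first selection with CPU fallback; a simplified, hypothetical helper, not mfakto's actual code:)

[code]
#include <CL/cl.h>

/* Sketch only: prefer a GPU device on the given platform and fall back
   to the CPU if no usable GPU is found. Error handling is minimal. */
cl_device_id select_device(cl_platform_id platform)
{
  cl_device_id dev = NULL;
  cl_int status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
  if (status != CL_SUCCESS)   /* e.g. CL_DEVICE_NOT_FOUND */
    status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &dev, NULL);
  return (status == CL_SUCCESS) ? dev : NULL;
}
[/code]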

[QUOTE=Ken_g6;263327]
Meanwhile, having seen Oliver's source code, I'd like to know how you handle 24*24=48-bit integer multiplies quickly in OpenCL.
[/QUOTE]

Used within my kernel:

[code]
void mul_24_48(uint *res_hi, uint *res_lo, uint a, uint b)
/* res_hi*(2^24) + res_lo = a * b */
{
*res_lo = mul24(a,b);
*res_hi = (mul_hi(a,b) << 8) | (*res_lo >> 24);
*res_lo &= 0xFFFFFF;
}
[/code]The assembly code of this looks already quite nicely packed, as the compiler optimizes (and inlines!) this whole function into 3 cycles (the w: step of cycle 20 is already the next instruction, but 21 w: still belongs to this function):

[code]

mul_24_48(&(a.d1),&(a.d0),k_tab[tid],4620); // NUM_CLASSES

becomes

19 z: MUL_UINT24 ____, R1.x, (0x0000120C, 6.473998905e-42f).x
t: MULHI_UINT ____, R1.x, (0x0000120C, 6.473998905e-42f).x
20 x: LSHL ____, PS19, (0x00000008, 1.121038771e-44f).x
y: LSHR ____, PV19.z, (0x00000018, 3.363116314e-44f).y
z: AND_INT ____, PV19.z, (0x00FFFFFF, 2.350988562e-38f).z
w: ADD_INT ____, KC0[1].x, PV19.z
21 x: ADD_INT ____, KC0[1].x, PV20.z
z: AND_INT R4.z, PV20.w, (0x00FFFFFF, 2.350988562e-38f).x
w: OR_INT ____, PV20.y, PV20.x

[/code][QUOTE=Ken_g6;263327]
He used assembly, and I found assembly to be useful in my CUDA app as well; but I just couldn't find an equivalent in OpenCL. That, and I couldn't make shifting the 64-bit integers right 32-bits be parsed as a 0-cycles-required no-op (which it ought to be.)[/QUOTE]

Yes, assembly is missing for the 32-bit operations (it does not hurt that much in 24-bit ops). Shifting 64-bit values by 32 bits works if you make sure that all values are 64-bit. And that includes the "32":

<any 64-bit> >> 32 = 0 (of type int)
<any 64-bit> >> 32ULL = <upper half> (of type long)

Took me a while to find that ... but you may have meant something different.
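
(A tiny illustration of that pitfall, with hypothetical values; the behaviour of the plain ">> 32" is as observed with the compiler discussed here:)

[code]
ulong x = 0x123456789ABCDEF0UL;        /* some 64-bit value                     */
uint hi_bad  = (uint)(x >> 32);        /* int-typed shift count: yielded 0 here */
uint hi_good = (uint)(x >> 32ULL);     /* 64-bit shift count: the upper half    */
[/code]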


And to those still waiting for the real thing - it's coming now: I just added Oliver's pre-release signal handler, which is really required for OpenCL. Otherwise the graphics driver crashes Windows 7 when there are still kernels in the queue but the process is already gone.

And yes, it may crash your machine too - do not use it on production machines yet.

Just remember: this is a first test version. Though performance figures may be interesting, I'm more interested to see if it runs at all, stable and with or without odd side-effects. [B]Do not attempt to upload "no factor" results to primenet yet.

[/B]

diep 2011-06-09 14:10

[QUOTE=Bdot;263359]True, I do that as well, but fall back to the CPU on purpose if no suitable GPU is found. Plus, I added (ported) support for the -d option to specifically select any device (including CPU).



Used within my kernel:

[code]
void mul_24_48(uint *res_hi, uint *res_lo, uint a, uint b)
/* res_hi*(2^24) + res_lo = a * b */
{
*res_lo = mul24(a,b);
*res_hi = (mul_hi(a,b) << 8) | (*res_lo >> 24);
*res_lo &= 0xFFFFFF;
}
[/code]The assembly code of this looks already quite nicely packed, as the compiler optimizes (and inlines!) this whole function into 3 cycles (the w: step of cycle 20 is already the next instruction, but 21 w: still belongs to this function):

[code]

mul_24_48(&(a.d1),&(a.d0),k_tab[tid],4620); // NUM_CLASSES

becomes

19 z: MUL_UINT24 ____, R1.x, (0x0000120C, 6.473998905e-42f).x
t: MULHI_UINT ____, R1.x, (0x0000120C, 6.473998905e-42f).x

[/code]
[/quote]

MULHI_UINT, as you see, is on the T unit, so that means it requires work from all the other units. In your case that means 5 units (for the AMD HD5000 series), and in the case of the HD6000 series it requires 4 units.

So this instruction eats 5 cycles if you look at it from that viewpoint (or, in fact, it eats 1 cycle on each of 5 execution units).

The fast instruction is MULHI_UINT24, which runs at the full speed of 1351 Gflops (on the HD6970 series, and I suppose on the 5000 series as well).

As you can see, it doesn't generate that one.

So this is a piece of turtle-slow assembler code, for several reasons. Not just because of the T unit; there are other issues as well.

[quote]
[code]

20 x: LSHL ____, PS19, (0x00000008, 1.121038771e-44f).x
y: LSHR ____, PV19.z, (0x00000018, 3.363116314e-44f).y
z: AND_INT ____, PV19.z, (0x00FFFFFF, 2.350988562e-38f).z
w: ADD_INT ____, KC0[1].x, PV19.z
21 x: ADD_INT ____, KC0[1].x, PV20.z
z: AND_INT R4.z, PV20.w, (0x00FFFFFF, 2.350988562e-38f).x
w: OR_INT ____, PV20.y, PV20.x

[/code]

Yes, assembly is missing for the 32-bit operations (it does not hurt that much in 24-bit ops). Shifting 64-bit values by 32 bits works if you make sure that all values are 64-bit. And that includes the "32":

<any 64-bit> >> 32 = 0 (of type int)
<any 64-bit> >> 32ULL = <upper half> (of type long)

Took me a while to find that ... but you may have meant something different.


And to those still waiting for the real thing - it comes now, I just added Oliver's pre-release signal handler which is really required for OpenCL. Otherwise the graphics driver crashes Windows 7 when there are still kernels in the queue but the process is already gone.

And yes, it may crash your machine too - do not use it on productive machines, yet.

Just remember: this is a first test version. Though performance figures may be interesting, I'm more interested to see if it runs at all, stable and with or without odd side-effects. [B]Do not attempt to upload "no factor" results to primenet yet.

[/B][/QUOTE]

diep 2011-06-09 14:20

[code]
*res_lo = mul24(a,b);
*res_hi = (mul_hi(a,b) << 8) | (*res_lo >> 24);
*res_lo &= 0xFFFFFF;
[/code]

Even when AMD adds a function mul_hi16 (or something) that generates the top 16 bits in a fast manner using MULHI_UINT24, this piece of programming will still be turtle slow for GPUs.

Let me explain.

You generate the result of res_lo using a mul24. This is a fast instruction.

The problem comes after this. Directly after this. The GPUs are not out-of-order processors; they focus upon throughput. So I hope I don't formulate it too badly if I say that after the execution unit issues the mul24, it takes a cycle or 8 before the result is available.

However, you immediately need that result to be shifted right by 24. That's too quick.

The GPU already hides a little bit of the latency by running multiple threads, but that doesn't cover everything.

The mul_hi result gets shifted left by 8. Now a good compiler would hide this latency,
yet that's not what you can expect here. Directly after that it needs the result for an OR with the shifted part of res_lo.

So the mul_hi itself eats 5 cycles on the 5000 series, or rather it requires all units at the same time; then another 8 cycles later it is available to be OR-ed with the other half, which we can expect to be available at the same time, those 8 cycles later.

So in reality we have a few operations that are 'in flight' at the same time here. The shift-left after the mul_hi and the shift-right are 'in flight' at the same time.

Yet they are not yet available, as that eats another 8 cycles. So directly issuing an OR then is not so clever.

The AND with res_lo we can safely assume to be available by then, of course, as we already needed it in the line above.

So from a programming viewpoint this is utter beginner's code, for two reasons.

diep 2011-06-09 14:45

[QUOTE=Bdot;263359]

Just remember: this is a first test version. Though performance figures may be interesting, I'm more interested to see if it runs at all, stable and with or without odd side-effects. [B]Do not attempt to upload "no factor" results to primenet yet.

[/B][/QUOTE]

Where is that code?

My email : [email]diep@xs4all.nl[/email]

diep 2011-06-10 00:32

[QUOTE=Bdot;263174]This is an early announcement that I have ported parts of Olivers (aka TheJudger) mfaktc to OpenCL.

Currently, I have only the Win64 binary, running an adapted version of Olivers 71-bit-mul24 kernel. Not yet optimized, not yet making use of the vectors available in OpenCL. A very simple (and slow) 95-bit kernel is there as well so that the complete selftest finished successfully on my box.

On my HD5750 it runs about 60M/s in the 50M exponent range - certainly a lot of headroom :smile:

As I have only this one ATI GPU I wanted to see if anyone would be willing to help testing on different hardware.

Current requirements: OpenCL 1.1 (i.e. only ATI GPUs), Windows 64-bit.

There's still a lot of work until I may eventually release this to the public, but I'm optimistic for the summer.

Next steps (unordered):
[LIST][*]Linux port (Is Windows 32-bit needed too?)[*]check, if [URL]http://mersenneforum.org/showpost.php?p=258140&postcount=7[/URL] can be used (looks like it's way faster)[/LIST][/quote]

Bdot, I'm looking at the code you shipped me and into your kernel.
The square_72_144 multiplication basically seems like an optimized
version of what Chrisjp posted over here.

So with near 100% certainty your code is faster, especially given
how much you have tested it.

So I wonder why you posted this comment?

Can you explain what you wrote over here?

Thanks,
Vincent

[quote][LIST][*]fast 92/95-bit kernels (barrett)[*]use of vector data types[*]various other performance/optimization tests&enhancements[*]of course, bug fixes:boxer:[*]docs and licensing stuff :yucky:[*]clarify if/how this new kid may contribute to primenet[/LIST]Bdot[/QUOTE]

Bdot 2011-06-10 11:19

[QUOTE=diep;263429]Bdot i'm looking at the code you shipped me and into your kernel.
the square_72_144 multiplicatoin basically seems like an optimized
version of what Chrisjp posted over here.

So with near 100% certainty your code is faster. Especially if we see
how much you tested it to be.

So i wonder why you posted this comment?

Can you explain what you wrote over here?

Thanks,
Vincent[/QUOTE]

First of all, I did not even check out Chrisjp's code in detail, so I did not know if it was faster or not.
Mainly I'm looking for a faster modulus, because that is where ~3/4 of the TF effort is currently spent (squaring is less than 20%). There must be some reason why Chrisjp had better performance figures. I want to see why (probably due to barrett), and take over what makes sense. And having a fast 84-bit kernel is of course another advantage over 72 bits.

Colt45ws 2011-06-13 07:33

My primary machine here has an unlocked (6970 equiv.) HD6950 2GB. I would be willing to test this stuff.

Bdot 2011-06-13 19:32

[QUOTE=Colt45ws;263659]My primary machine here has a unlocked (6970 equiv.) HD6950 2GB. I would be willing to test this stuff.[/QUOTE]

Send me a PM with your email, and you should have it. I received test results from only one tester so far ...

I'm not sure I will send out version 0.04 again as I just built a vectorized version of the 71-bit-kernel. It may take another few days to be finalized. My first tests showed a speedup of 30-40%. You'll probably get this one when it's ready.

Bdot 2011-06-18 23:12

I just sent out mfakto version 0.05 to a few interested people.

Main highlight is the use of vector data types, which on my GPU raises throughput from 60M/s to 100M/s when using multiple instances, and from 36M/s to 88M/s for single instance (on a HD 5750).

Colt45ws 2011-06-19 03:46

I tested this with my 6970, single instance.
I adjusted SievePrimes for maximum performance, 55k.
This resulted in a 90M/s rate on an M57 exponent, 68 to 69 bits.
Then I played with the other settings.
Vector size, best to worst, was 4, 8, 2, 1, 16.
16 had a HUGE performance drop, down to 14M/s.
I could only just tell that 4 was better than 8; any closer and it would probably be within the normal range it runs at. 2 and 1 both ran in the 7xM/s range.
Best GridSize was 4, then 3, 2, 1, 0.
Curiously, the GPU never kicked into high-performance mode; it stayed at 500 MHz.
??

Bdot 2011-06-19 10:35

Thanks for testing so quickly!

I guess SievePrimes is at 5k, not 55k?

Are the gridsize differences big enough to try even bigger ones? For my GPU the UI was not usable anymore with the next bigger one ... but for fast GPUs I could check if bigger grids would always fit ...

Did you monitor the CPU load? On my box I see it never go really high (max 50%).

I'm afraid, for now, the only way to fully utilize such a capable GPU is many instances of mfakto (working on different exponents). Which on my machine has the nasty issue of freezing the machine sometimes ... working on it.

Maybe it's time for a multi-threaded siever ... until the GPU-siever comes.

Colt45ws 2011-06-19 11:34

1 Attachment(s)
No, 55000
I have to make an amendment to my previous post, I must have made a mistake when I was keeping track of which GridSize I was running. 3, 2, and 1 are identical. Maybe leaning towards 2 almost imperceptibly. Then 4 and 0.

CPU load is around 13%, or about 65% of a single core.
I'm running an i7-920 @ 4 GHz.

Christenson 2011-06-19 15:31

mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.

henryzz 2011-06-19 17:17

[QUOTE=Christenson;264155]mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.[/QUOTE]
Can I suggest that if you can get receiving assignments working sooner than both of the others, then you should. It is fine to only report results occasionally, but running out of work is bad.

davieddy 2011-06-19 18:15

[QUOTE=henryzz;264166]but running out of work is bad.[/QUOTE]

That happens to be one of my favourite occupations.

But if picking the low-hanging fruit floats your boat,
go ahead with Breadth First.
OTOH if you get bored with finding new factors (or "getting work"),
try making it as easy for us CPU-bound, patient,
LL-testing prime searchers as possible.

TFing X to X+1 is 1/7th of X+3 effort.*

David

*Open to correction, but you get the idea.
1+2+4 = 7

Christenson 2011-06-19 18:41

Henry:
Once automatic reporting begins to work, it will come all at once....I'm having issues with learning my tools (eclipse) right now, just have to sit and work at it...then add the mutex and thread management stuff and call the appropriate parts of P95.

As for you CPU-bound, LL-testing types (which, incidentally, includes myself), don't worry. The way I look at it is that TF and P-1 both have as their goal making as many LL tests as possible unnecessary. Odds of finding a factor for a given exponent, for the current bit level of 70, are about 1/70. Supposing the GPUs are 128 times faster than the CPUs, then we can do 7 extra bit levels, which will factor about 10% of the candidates that wouldn't have been factored by CPU. This helps, but the real speed-up in finding M48 and beyond will be in freed-up CPUs not doing TF and in the GPU LL tests.

Bdot 2011-06-19 18:50

[QUOTE=henryzz;264166]Can I suggest that if you can get recieving assignments working faster than both then you should. It is fine to only report results occasionally but running out of work is bad.[/QUOTE]

Hehe, if receiving the assignment takes longer than the task itself, then we don't need to optimize the GPU kernels anymore ...

Ken_g6 2011-06-19 21:45

[QUOTE=Christenson;264155]mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.[/QUOTE]

Don't forget about [thread=11900]this thread[/thread]. It looks like some work has been done on this kind of sieving this year! :smile:

Bdot 2011-06-23 02:07

This missing carry flag is driving me nuts ...

Does anyone have a better idea for the carry propagation:

[code]
typedef struct _int96_t
{
  uint d0, d1, d2;
} int96_t;

void sub_96(int96_t *res, int96_t a, int96_t b)
/* a must be greater than or equal to b!
   res = a - b */
{
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
[B]  res->d2 = a.d2 - b.d2 - (((res->d1 > a.d1) || ((res->d1 == a.d1) && carry)) ? 1 : 0);
[/B]}
[/code]I also need this for an int192 (6x32 bit). Then the above logic would become quite lengthy ... Do I really need to use something like this:

[code]
uint carry = (b.d0 > a.d0);

res->d0 = a.d0 - b.d0;
res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);

carry = (res->d1 > a.d1) || ((res->d1 == a.d1) && carry);
res->d2 = a.d2 - b.d2 - (carry ? 1 : 0);

carry = (res->d2 > a.d2) || ((res->d2 == a.d2) && carry);
res->d3 = a.d3 - b.d3 - (carry ? 1 : 0);

carry = (res->d3 > a.d3) || ((res->d3 == a.d3) && carry);
res->d4 = a.d4 - b.d4 - (carry ? 1 : 0);

...
[/code]

Ken_g6 2011-06-23 04:21

Getting the carries would be a lot simpler if the number were 4x24-bit numbers instead of 3x32. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, when 24-bit multiplies are faster than 32-bit ones.
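
(A rough sketch of that idea, assuming the 96-bit value is stored as four 24-bit limbs in the low bits of each uint; a hypothetical helper, not from mfakto:)

[code]
/* Sketch only: res = a - b for 4 x 24-bit limbs, assuming a >= b.
   The borrow is extracted with a shift instead of a comparison. */
void sub_96_4x24(uint *res, const uint *a, const uint *b)
{
  uint borrow = 0;
  for (int i = 0; i < 4; i++)
  {
    uint t = a[i] - b[i] - borrow;  /* wraps into the high bits on underflow */
    borrow = t >> 31;               /* 1 iff this limb underflowed           */
    res[i] = t & 0xFFFFFF;          /* keep the low 24 bits                  */
  }
}
[/code]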

Bdot 2011-06-23 09:14

[QUOTE=Ken_g6;264468]Getting the carries would be a lot simpler if the number were 4x24-bit numbers instead of 3x32. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, when 24-bit multiplies are faster than 32-bit ones.[/QUOTE]

Sure, the 24-bit stuff works quite well. I just wanted to get a 32-bit kernel running in order to compare exactly that.

BTW, conditional loads are not slow (1st cycle: eval condition and prepare the two possible load values, 2nd cycle: load it), they run at full speed. Only branches having a different control flow have that big penalty, which consists of executing both branches plus some overhead to mask out one of the executions.

ldesnogu 2011-06-23 09:44

Warning: I don't know anything about OpenCL...

Why do you use ||, && and ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1? If so, I would have written:

[code]uint carry = (b.d0 > a.d0);

res->d0 = a.d0 - b.d0;
res->d1 = a.d1 - b.d1 - carry;
res->d2 = a.d2 - b.d2 - ((res->d1 > a.d1) | ((res->d1 == a.d1) & carry));[/code]and:

[code]uint carry = (b.d0 > a.d0);

res->d0 = a.d0 - b.d0;
res->d1 = a.d1 - b.d1 - carry;

carry = (res->d1 > a.d1) | ((res->d1 == a.d1) & carry);
res->d2 = a.d2 - b.d2 - carry;
...[/code]

Bdot 2011-06-23 17:34

[QUOTE=ldesnogu;264485]Warning: I don't know anything about OpenCL...

Why do you use ||, && et ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1? [/QUOTE]

For scalar data types that is true. I could have saved a conditional load, but I guess the compiler will optimize that out anyway.

I already have in mind using the same code for a vector of data (just replacing all uint by uint4, for instance), and then the result of a comparison is 0 or -1 (all bits set).

What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ...
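
(As a side note on the vector case: since a vector compare already yields 0 or -1 per component, the borrow can be applied with an add instead of a conditional. A hypothetical, untested sketch, using an int96_v struct whose dN members are uint4:)

[code]
/* Sketch: first two limbs of a vectorized subtraction. (b.d0 > a.d0) is
   an int4 with -1 in each component that needs a borrow, so adding its
   reinterpretation as uint4 subtracts 1 exactly where required. */
void sub_lo_vec(int96_v *res, int96_v a, int96_v b)
{
  uint4 borrow = as_uint4(b.d0 > a.d0);   /* 0xFFFFFFFF where a borrow occurs */
  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 + borrow;         /* adding -1 == subtracting 1       */
}
[/code]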

Ken_g6 2011-06-23 22:15

Would the "ulong" data type (a 64-bit unsigned integer) help?

bsquared 2011-06-24 00:18

If you cast the arguments to uint64 then the borrow can be computed by shifting...

tmp = (uint64)a - (uint64)b;
sub = (uint32)tmp;
borrow = tmp >> 63;
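
(In OpenCL terms, where the unsigned 64-bit type is ulong, that idea extended to also consume an incoming borrow might look like this; an untested sketch, just illustrating the cast-and-shift trick:)

[code]
/* Sketch only: one 32-bit subtract-with-borrow step via a 64-bit
   intermediate. borrow_in and the returned borrow are 0 or 1. */
uint sub32_with_borrow(uint a, uint b, uint borrow_in, uint *borrow_out)
{
  ulong t = (ulong)a - b - borrow_in;  /* wraps into the high bits on underflow */
  *borrow_out = (uint)(t >> 63);       /* 1 if the subtraction underflowed      */
  return (uint)t;                      /* low 32 bits are the result limb       */
}
[/code]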

ldesnogu 2011-06-24 08:41

[QUOTE=Bdot;264505]What I was really hoping for is something that propagates the carry with a little less logic, as that will really slow down additions and subtractions ...[/QUOTE]
Two other random ideas (again sorry if it's not applicable...):

- do as many computations as you can without taking care of carries and do a specific pass for handling them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor

- another way to compute carries is bit arithmetic; let's say you want the carry from a - b
[code]res = a - b;
carry = ((b & ~a) | (res & ~a) | (b & res)) >> (bitsize-1);
[/code]where bitsize is the number of bits of a, b and res. Again that could be slower than your original code.

Bdot 2011-06-24 16:27

Thanks for all your suggestions so far. I'll definitely try the (ulong) conversion and compare it to my current bunch of logic. I still welcome suggestions ;-)

Here's the task again:

Inputs: uint a, uint b, uint carry (which is the borrow of the previous (lower) 32 bits)
Output: uint res=a-b, carry (which should be the borrow for the next (higher) 32 bits)

currently this basically looks like
[code]
res = a - b - carry;
carry = (res > a) || ((res == a) && carry);
[/code]I'm looking for something simpler for the evaluation of the new carry. Something like
[code]
carry = (res >= a) || carry;
[/code](Yes, I know this one is not correct)

We have available all logical operators, +, -, and the [URL="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/integerFunctions.html"]OpenCL Integer Functions[/URL]. But maybe a total of 6 operations for one 32-bit subtraction with borrow is already the minimum for OpenCL?

I did not quite understand how evaluating carries afterwards can save something. Access to the operands is no problem, it's all in registers. The bit-wise operations lead to a total of 10 instructions (?) for one subtraction ... less likely to be an acceleration ;-)

ldesnogu 2011-06-24 16:43

[QUOTE=Bdot;264555]I did not quite understand how evaluating carries afterwards can save something. Access to the operands is no problem, it's all in registers. The bit-wise operations lead to a total of 10 instructions (?) for one subtraction ... less likely to be an acceleration ;-)[/QUOTE]
Well that all depends on two things: what your compiler is able to find depending on the form of your program and what your micro-architecture is able to do.

The post-pass carry evaluation, for instance, is very useful for vector-like architectures. And it might be possible on some micro-architectures that the result of a comparison blocks a pipeline while logical operations don't, thus making the logical variant faster even though it uses more operations.

But then again I don't know anything about your target and OpenCL, so I'm probably completely off track :smile:

apsen 2011-07-03 15:42

[QUOTE=Bdot;263174]

As I have only this one ATI GPU I wanted to see if anyone would be willing to help testing on different hardware.

Current requirements: OpenCL 1.1 (i.e. only ATI GPUs), Windows 64-bit.

[/QUOTE]

I have HD 4550, Windows 2008 R2 x64. Would that work?

Andriy

Ken_g6 2011-07-03 18:52

Bdot, which did you wind up finding does the math fastest, 32-bit numbers or 24-bit numbers? And what form of math? Or are you still working on it?

Bdot 2011-07-04 11:35

I played around with this a little ...

[QUOTE=ldesnogu;264544]Two other random ideas (again sorry if it's not applicable...):

- do as many computations as you can without taking care of carries and do a specific pass for handling them; of course that could lead to a big slowdown if you have to reload values and memory is the limiting factor
[/QUOTE]

I have at most 5-6 operations that I can do before checking the carries. The runtime stays exactly the same, and the reason is that the compiler reorders the instructions anyway as it sees fit. Even a few repeated steps that were necessary for the carry ops did not influence the runtime, as they were optimized out :smile:

[QUOTE=ldesnogu;264544]
- another way to compute carries is bit arithmetic; let's say you want the carry from a - b
[code]res = a - b;
carry = ((b & ~a) | (res & ~a) | (b & res)) >> (bitsize-1);
[/code]where bitsize is the number of bits of a, b and res. Again that could be slower than your original code.[/QUOTE]

I was really surprised by that one. Even though this is way more operations than my original code, it runs at the same speed! Not a bit faster, not a bit slower with the single-vector kernel. Comparing the assembly, it turns out that many of the additional operations are "hidden" in otherwise unused slots. Following up with the vector version of the kernel, which has fewer unused slots, I did see the kernel take 3 cycles more - with a total of ~700 cycles that's less than 0.5% ...

Here's the current performance analysis of the 79-bit barrett kernel:

[code]
Name Throughput
Radeon HD 5870 135 M Threads\Sec
Radeon HD 6970 118 M Threads\Sec
Radeon HD 6870 100 M Threads\Sec
Radeon HD 5770 68 M Threads\Sec
Radeon HD 4890 66 M Threads\Sec
Radeon HD 4870 58 M Threads\Sec
FireStream 9270 58 M Threads\Sec
FireStream 9250 48 M Threads\Sec
Radeon HD 4770 46 M Threads\Sec
Radeon HD 6670 38 M Threads\Sec
Radeon HD 4670 35 M Threads\Sec
Radeon HD 5670 23 M Threads\Sec
Radeon HD 6450 10 M Threads\Sec
Radeon HD 4550 7 M Threads\Sec
Radeon HD 5450 6 M Threads\Sec
[/code]This is the peak performance of the kernel, given enough CPU power to feed the factor candidates fast enough. Empirically, 1M Threads/sec translates to 1.2 - 1.5 GHz-days/day.

Unfortunately I currently have a problem where some of the kernel's code is skipped unless I enable kernel tracing. I need that fixed before I can get you another version for testing.

Bdot 2011-07-04 12:54

[QUOTE=Ken_g6;265320]Bdot, what did you wind up finding did the fastest math, 32-bit numbers or 24-bit numbers? And what form of math? Or are you still working on it?[/QUOTE]
Currently the fastest kernel is a 24-bit based kernel working on a vector of 4 factor candidates at once. Here's the list of kernels I currently have, along with the performance on a HD5770:

76 M/s mfakto_cl_71_4: 3x24-bit, 4-vectored kernel
68 M/s mfakto_cl_barrett79: 2.5x32-bit unvectored barrett kernel
53 M/s mfakto_cl_barrett92: 3x32-bit unvectored barrett kernel
44 M/s mfakto_cl_71: 3x24-bit unvectored kernel

The barrett kernels currently need a nasty workaround for a compiler bug, costing ~3%. I'm still working on vectorizing the barretts; a similar speedup as for the 24-bit kernel can be expected, so the 32-bit based kernels should end up a lot faster than the 24-bit one.

A 24-bit based barrett kernel that was suggested on the forum runs at 75M/s, but as it is using vectors for the representation of the factors, it cannot (easily) be enhanced to run on a vector of candidates. If that were easily possible, then the 24-bit kernel might run for the crown again. But it would not be far ahead of the 32-bit kernel. And the 32-bit one has the advantage of handling FCs up to 79 bits instead of 71 bits.

Bdot 2011-07-05 08:12

Oh boy, I finally found out why the barrett kernels sometimes behaved oddly ...

OpenCL does bit-shifts only up to the number of bits in the target type, and for that it only evaluates the necessary number of bits of the shift count. So for a bit-shift a >> b with 32-bit values, only the lowest 5 bits of b are used and the rest is ignored ... Therefore the code that used bit-shifts of 32 or more positions to implicitly zero the target did not deliver the expected result.

The fix for that comes without a performance penalty ... a little more testing and version 0.06 will be ready.
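
(Just to illustrate one way to get the intended zeroing; Bdot's actual fix may look different:)

[code]
/* Illustration only: with a variable shift count that may reach 32, an
   explicit guard restores the intended "shift to zero" behaviour, since
   a >> b only uses the low 5 bits of b for 32-bit operands. */
uint shr_checked(uint a, uint b)
{
  return (b >= 32) ? 0 : (a >> b);
}
[/code]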

jasonp 2011-07-05 11:01

Shifts are 'modular' like this on any processor architecture you're likely to encounter. The only exception is the PowerPC line of processors, which allow one more bit to figure into the shift amount, so that e.g. a shift greater than 31 will zero a register.

ldesnogu 2011-07-05 11:34

[QUOTE=jasonp;265493]Shifts are 'modular' like this on any processor architecture you're likely to encounter.[/QUOTE]
ARM doesn't behave that way, but that perhaps doesn't fall into the "likely to encounter" category.

FWIW I had a bug in some MP code when compiled with MS C compiler where a right shift by 32 was considered as a NOP. That can be considered as a legal treatment given that the ANSI C standard leaves shift amounts >= size of operands as undefined.

Bdot 2011-07-05 23:13

[QUOTE=jasonp;265493]Shifts are 'modular' like this on any processor architecture you're likely to encounter. The only exception is the PowerPC line of processors, which allow one more bit to figure into the shift amount, so that e.g. a shift greater than 31 will zero a register.[/QUOTE]

Well, I took that over from mfaktc, and we all know that it works there ... so you have another exception :smile:

And I learned something :grin:

[QUOTE=ldesnogu;265496]ARM doesn't behave that way, but that perhaps doesn't fall into the "likely to encounter" category.
[/QUOTE]

I also happen to program for ARM a little, but at a higher level. I don't think I have had to use bit-shifts there so far ...

But most importantly, I completed first tests with the vectorized barretts. Extending the previous list:
[QUOTE=Bdot;265388]
76 M/s mfakto_cl_71_4: 3x24-bit, 4-vectored kernel
68 M/s mfakto_cl_barrett79: 2.5x32-bit unvectored barrett kernel
53 M/s mfakto_cl_barrett92: 3x32-bit unvectored barrett kernel
44 M/s mfakto_cl_71: 3x24-bit unvectored kernel
[/QUOTE]
96 M/s mfakto_cl_barrett79_8: 2.5x32-bit 8-vectored barrett kernel
92 M/s mfakto_cl_barrett79_2: 2.5x32-bit 2-vectored barrett kernel
88 M/s mfakto_cl_barrett79_4: 2.5x32-bit 4-vectored barrett kernel
72 M/s mfakto_cl_barrett92_8: 3x32-bit 8-vectored barrett kernel
71 M/s mfakto_cl_barrett92_4: 3x32-bit 4-vectored barrett kernel
70 M/s mfakto_cl_barrett92_2: 3x32-bit 2-vectored barrett kernel

Now it would be interesting how a 24-bit vectored barrett kernel would do ...

Bdot 2011-07-12 09:03

Anyone running a HD4xxx?
 
I'd like to see what capabilities an HD4xxx has. Would anyone in possession of such a GPU please post the output of clinfo? I'm mostly interested in the "Extensions:" part, but if you could PM me the full output, that'd be nice too.

clinfo is part of the AMD-APP-SDK ...

Thanks a lot ...

apsen 2011-07-12 13:51

[QUOTE=Bdot;266188]I'd like to see what capabilities a HD4xxx has. Would anyone in possession of such a GPU please post the output of clinfo? I'm mostly interested in the "Extensions:" part, but if you could pm me the full output that'd be nice too.

clinfo is part of the AMD-APP-SDK ...

Thanks a lot ...[/QUOTE]

HD4550:

[CODE]

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (650.9)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing


Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Max compute units: 2
Max work items dimensions: 3
Max work items[0]: 128
Max work items[1]: 128
Max work items[2]: 128
Max work group size: 128
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 600Mhz
Address bits: 32
Max memory allocation: 134217728
Image support: No
Max size of kernel argument: 1024
Alignment (bits) of base address: 32768
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 268435456
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 16384
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 000000000173B118
Name: ATI RV710
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.0
Driver version: CAL 1.4.1332
Profile: FULL_PROFILE
Version: OpenCL 1.0 AMD-APP-SDK-v2.4 (650.9)
Extensions: cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing


Device Type: CL_DEVICE_TYPE_CPU
Device ID: 4098
Max compute units: 2
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 2400Mhz
Address bits: 64
Max memory allocation: 2147483648
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: No
Cache type: Read/Write
Cache line size: 64
Cache size: 32768
Global memory size: 4294287360
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Kernel Preferred work group size multiple: 1
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 426
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 000000000173B118
Name: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Vendor: GenuineIntel
Device OpenCL C version: OpenCL C 1.1
Driver version: 2.0
Profile: FULL_PROFILE
Version: OpenCL 1.1 AMD-APP-SDK-v2.4 (650.9)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_media_ops cl_amd_popcnt cl_amd_printf cl_khr_d3d10_sharing



[/CODE]

Bdot 2011-07-13 08:33

[QUOTE=apsen;266198]HD4550:

[CODE]

Extensions: cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing

[/CODE][/QUOTE]

Thanks, so there really are no atomics available on that GPU. I'll try to make mfakto adjust to that automatically ...
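
The adjustment could boil down to a check like this at startup - just a rough sketch, not mfakto's actual code, and the helper name is made up:

[CODE]
#include <CL/cl.h>
#include <string.h>

/* Hypothetical helper: report whether a device advertises global int32
 * atomics, so a kernel variant that avoids atomics can be chosen for
 * older GPUs like the RV710 above, whose extension list has none. */
static int device_has_global_atomics(cl_device_id dev)
{
    char ext[8192];

    if (clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL) != CL_SUCCESS)
        return 0;
    return strstr(ext, "cl_khr_global_int32_base_atomics") != NULL;
}
[/CODE]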

Bdot 2011-07-19 08:35

Status
 
Just a short update (so you don't think I've lost interest :smile:):

After I upgraded my ATI drivers to a [URL="http://developer.amd.com/tools/gDEBugger/Pages/default.aspx#download"]pre-release version of Catalyst 11.7[/URL], the barrett92 kernel does not find any factors anymore (72-bit and barrett79 are still fine). I'm currently (re-)introducing Oliver's MODBASECASE checks, which I had skipped so far.

George has already enabled primenet's manual page for mfakto's results, but as (parts of) mfakto are broken with the new driver (or compiler?), I need to delay mfakto's "official release".

Going back to 11.6 is of no use, as it has serious issues once the kernel files grow bigger, and 11.7 no longer locks up my machine - so there are improvements ...

Bdot 2011-08-15 20:35

mfakto Release!
 
1 Attachment(s)
After I found and fixed the last (?) serious bugs in mfakto, all tests finished as they should. Therefore:

[B]mfakto 0.07 released[/B]

This is the first version going public; let me know of any issues.

Attached are the Windows 32-bit and 64-bit binaries. The source will follow right away; the Linux build (SuSE 11.4) will come tomorrow.

Bdot 2011-08-15 20:37

1 Attachment(s)
mfakto 0.07 sources

monst 2011-08-15 20:59

Can you also please post the correct versions of OpenCL.dll and any other dll's that are required? Thanks.

firejuggler 2011-08-15 20:59

Isn't OpenCL supposed to work on both Nvidia and Radeon cards?

Bdot 2011-08-16 06:38

[QUOTE=monst;269174]Can you also please post the correct versions of OpenCL.dll and any other dll's that are required? Thanks.[/QUOTE]

I recommend installing [URL="http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx"]AMD APP 2.5[/URL]. For Windows, the latest [URL="http://support.amd.com/us/gpudownload/Pages/index.aspx"]Catalyst[/URL] driver should also contain OpenCL - if you have that already, then you may just need to extend the PATH.

Thinking about it, maybe both are required ... I have both of them installed on my boxes ...

Bdot 2011-08-16 07:43

[QUOTE=firejuggler;269175]isn' t open-cl supposed to work on nvidia and radeon card?[/QUOTE]
Nvidia supports OpenCL 1.0, but we need 1.1. For now the OpenCL version requires an AMD(ATI) GPU. If you install the above SDK and Catalyst drivers on an Nvidia box, then you will at least be able to run mfakto on your CPU ;-)

Most likely there is not much sense in running the OpenCL version on Nvidia when there is a CUDA version of the program (which in this case is the original). Maybe for a performance comparison - but then you would only be comparing Nvidia's CUDA compiler against Nvidia's OpenCL compiler.
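
If you want to check up front whether a given device qualifies, the reported OpenCL version can be queried at runtime - a small sketch only (the helper name is made up, this is not mfakto code):

[CODE]
#include <CL/cl.h>
#include <stdio.h>

/* The CL_DEVICE_VERSION string has the form
 * "OpenCL <major>.<minor> <vendor-specific info>". */
static int device_supports_opencl_1_1(cl_device_id dev)
{
    char ver[128];
    int major = 0, minor = 0;

    if (clGetDeviceInfo(dev, CL_DEVICE_VERSION, sizeof(ver), ver, NULL) != CL_SUCCESS)
        return 0;
    if (sscanf(ver, "OpenCL %d.%d", &major, &minor) != 2)
        return 0;
    return (major > 1) || (major == 1 && minor >= 1);
}
[/CODE]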

Bdot 2011-08-16 08:55

1 Attachment(s)
mfakto 0.07 for Linux-64

KingKurly 2011-08-16 21:13

[QUOTE=Bdot;269221]mfakto 0.07 for Linux-64[/QUOTE]
I posted earlier in this thread saying I was interested. Still am. I installed AMD APP 2.5 and downloaded the mfakto 0.07 binary. It seems to run, but claims not to find my GPU so it uses the CPU instead. Any suggestions? I finally found that the help option is -h, but even that is not too clear. (I had expected --help to work, but it silently ignored me and continued doing what it was doing.)

(Note that I don't expect my lousy GPU to actually be very valuable to the project, but it's better than having it run idle, especially if it won't siphon CPU cycles away from what they're doing now.)

My card is a Radeon HD 5450.

Bdot 2011-08-17 08:22

[QUOTE=KingKurly;269267]I posted earlier in this thread saying I was interested. Still am. I installed AMD APP 2.5 and downloaded the mfakto 0.07 binary. It seems to run, but claims not to find my GPU so it uses the CPU instead.
[/quote]

If it runs (using the CPU) all libs are found OK.
The APP SDK contains a clinfo binary (<APP SDK path>/bin/x86_64/clinfo). Run it and see whether it reports both the CPU and the GPU (one block CL_DEVICE_TYPE_GPU, one CL_DEVICE_TYPE_CPU). If the GPU block is there, paste it in here; maybe I can spot something. You can also play around with the "-d" option: try "-d 1", "-d 2", "-d 11", "-d 21" and see if any of them picks the GPU.

If the GPU is missing from the clinfo output, you are probably running the open "radeon" graphics driver. Try installing the Catalyst graphics driver mentioned in an earlier post. If that helps, I need to update the documentation ...

[QUOTE=KingKurly;269267] I finally found that the help option is -h, but even that is not too clear. (I had expected --help to work, but it silently ignored me and continued doing what it was doing.)[/QUOTE]

I take that as an enhancement request for the next version. I'll also try to implement a "-d g" option that forces running on the GPU or fails. There is already a "-d c" option forcing it to run on the CPU ...

[QUOTE=KingKurly;269267](Note that I don't expect my lousy GPU to actually be very valuable to the project, but it's better than having it run idle, especially if it won't siphon CPU cycles away from what they're doing now.)

My card is a Radeon HD 5450.[/QUOTE]

I would expect this card to deliver about 10-12M factor candidates per second, which should be somewhere between 12 and 18 GHz-days per day, and to consume roughly 300 MHz worth of CPU power. Very rough estimates - once you get it running, let me know what reality looks like ;-)

KingKurly 2011-08-17 14:49

[QUOTE=Bdot;269301]If it runs (using the CPU) all libs are found OK.
The APP SDK contains a clinfo binary. (APP SDK path/bin/x86_64/clinfo) Run it and see if it is reporting both the CPU and the GPU (one block CL_DEVICE_TYPE_GPU, one CL_DEVICE_TYPE_CPU). If the GPU block is there, paste it in here, maybe I can spot something. You can also play around with the "-d" option. Try "-d 1", "-d 2", "-d 11", "-d 21". See if any of them picks the GPU.[/QUOTE]
I was able to run clinfo okay, and I will include the output below. I tried playing around with the -d option with no success; I got several different error messages.

[CODE]Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Device Topology: PCI[ B#0, D#0, F#0 ]
Max compute units: 2
Max work items dimensions: 3
Max work items[0]: 128
Max work items[1]: 128
Max work items[2]: 128
Max work group size: 128
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 0Mhz
Address bits: 32
Max memory allocation: 134217728
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 32768
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 536870912
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x7f1bb5145060
Name: Cedar
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.1
Driver version: CAL 1.4.1353
Profile: FULL_PROFILE
Version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt


Device Type: CL_DEVICE_TYPE_CPU
Device ID: 4098
Max compute units: 6
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 2700Mhz
Address bits: 64
Max memory allocation: 4149655552
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: No
Cache type: Read/Write
Cache line size: 64
Cache size: 65536
Global memory size: 16598622208
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Kernel Preferred work group size multiple: 1
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x7f1bb5145060
Name: AMD Phenom(tm) II X6 1045T Processor
Vendor: AuthenticAMD
Device OpenCL C version: OpenCL C 1.1
Driver version: 2.0
Profile: FULL_PROFILE
Version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_media_ops cl_amd_popcnt cl_amd_printf

[/CODE]

Thank you for your help and your code, and I look forward to hearing back from you. :smile:

Bdot 2011-08-17 21:10

The clinfo output looks good (except for the "Max clock frequency: 0Mhz" part :smile: ). Based on this, the correct device number for the GPU should be 11, and for the CPU 12.

Could you please run mfakto -d 11 --CLtest and paste the output? --CLtest will display any errors but continue to invoke a small test kernel. Depending on what errors occurred this may lead to mfakto hanging (use Ctrl-C) or crashing - that is kind of expected.

Also, what exactly is the error when running mfakto -d 11?

I still wonder whether installing the latest Catalyst driver would solve this. Do you see a chance to upgrade it?
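
For reference, the numbering above comes from enumerating the OpenCL platforms and their devices. Here is an illustrative sketch of that enumeration - my assumption is first digit = platform, second digit = device, which may not be exactly how mfakto parses "-d":

[CODE]
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id plat[4];
    cl_uint nplat = 0;

    clGetPlatformIDs(4, plat, &nplat);

    for (cl_uint p = 0; p < nplat; p++) {
        cl_device_id dev[8];
        cl_uint ndev = 0;

        if (clGetDeviceIDs(plat[p], CL_DEVICE_TYPE_ALL, 8, dev, &ndev) != CL_SUCCESS)
            continue;
        for (cl_uint d = 0; d < ndev; d++) {
            char name[256];
            cl_device_type type;

            clGetDeviceInfo(dev[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(dev[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
            /* print the index the way I assume "-d" expects it */
            printf("-d %u%u : %s (%s)\n", p + 1, d + 1, name,
                   (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "CPU");
        }
    }
    return 0;
}
[/CODE]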

KingKurly 2011-08-17 22:36

[CODE]
kurly@hex:~/mfakto/mfakto-0.07 - Linux/x86_64$ ./mfakto -d 11 --CLtest

Runtime options
SievePrimes 50000
SievePrimesAdjust 0
NumStreams 10
GridSize 3
WorkFile worktodo.txt
Checkpoints enabled
Stages enabled
StopAfterFactor class
PrintMode full
AllowSleep yes
VectorSize 4
No protocol specified
OpenCL Platform 1/1: Advanced Micro Devices, Inc., Version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)
GPU not found, fallback to CPU.
Device 1/1: AMD Phenom(tm) II X6 1045T Processor (AuthenticAMD),
device version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213), driver version: 2.0
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_media_ops cl_amd_popcnt cl_amd_printf
Global memory:16598622208, Global memory cache: 65536, local memory: 32768, workgroup size: 1024, Work dimensions: 3[1024, 1024, 1024, 0, 0] , Max clock speed:2700, compute units:6
loop 0:
Error -7 in clGetEventProfilingInfo.(startTime)
0 threads: RES (32): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
loop 1:
1 threads: RES (32): 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
loop 2:
2 threads: RES (32): 2 2 2 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
loop 3:
3 threads: RES (32): 3 2 2 2 3 3 3 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
loop 4:
4 threads: RES (32): 4 2 2 2 3 3 3 4 4 4 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
loop 5:
5 threads: RES (32): 5 2 2 2 3 3 3 4 4 4 5 5 5 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
loop 6:
6 threads: RES (32): 6 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
loop 7:
7 threads: RES (32): 7 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 1 1 1 0 0 0 0 0 0 0 0 0 0
loop 8:
8 threads: RES (32): 8 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 1 1 1 0 0 0 0 0 0 0
loop 9:
9 threads: RES (32): 9 2 2 2 3 3 3 4 4 4 6 6 6 7 7 7 5 5 5 8 8 8 9 9 9 1 1 1 0 0 0 0
loop 10: [/CODE]I would be willing to upgrade the drivers, although I would think that Ubuntu 11.04 (latest release) would be up to date. I generally use the computer without a monitor, so if you have a way to do it from the command-line, that would be best.

Edit: Also, it did not crash, nor did I need to Ctrl-C.

Bdot 2011-08-18 20:46

[QUOTE=KingKurly;269365]I would be willing to upgrade the drivers, although I would think that Ubuntu 11.04 (latest release) would be up to date. I generally use the computer without a monitor, so if you have a way to do it from the command-line, that would be best.

Edit: Also, it did not crash, nor did I need to Ctrl-C.[/QUOTE]

That leaves two possible issues: in the past the AMD GPU drivers required a running X-Server - do you have that?

And second, as the AMD GPU driver is closed source, many distributions ship the open "radeon" driver, which will not work with the AMD APP.
"lsmod | grep radeon" should return empty, while "lsmod|grep fglrx" should list at least one line if you have the AMD driver.

I just verified that the driver version you have (CAL 1.4.1353) is even lower than what comes with Catalyst 11.6, which is the minimum supported by AMD APP 2.4 and 2.5. So even if Ubuntu 11.04 ships the proprietary driver, it is too old.

Just yesterday AMD released Catalyst 11.8, so I tried to install it via a remote ssh session. Just run "sh ati-driver-installer-11-8-x86.x86_64.run" and it all worked well. You should reboot afterwards, that's it. According to the change log a running X-Server is no longer required, but I did not test that (I need X anyway).

BTW, your --CLtest output is totally correct, except that it was the CPU which calculated it. The one error line is expected as the test tried to start a kernel with 0 threads ;-)

KingKurly 2011-08-19 01:11

[QUOTE=Bdot;269450]That leaves two possible issues: in the past the AMD GPU drivers required a running X-Server - do you have that?
[/QUOTE]Yes, I do.

[QUOTE=Bdot;269450]And second, as the AMD GPU driver is closed source, many distributions ship the open "radeon" driver, which will not work with the AMD APP.
"lsmod | grep radeon" should return empty, while "lsmod|grep fglrx" should list at least one line if you have the AMD driver.
[/QUOTE]I checked, and radeon was empty and fglrx had one line. (Continue reading, there's more!)

[QUOTE=Bdot;269450] I just verified that the driver version you have (CAL 1.4.1353) is even lower than what comes with Catalyst 11.6, which is the minimum supported with AMP APP 2.4 and 2.5. So even if Ubuntu 11.04 ships the proprietary driver, it is too old.

Just yesterday AMD released Catalyst 11.8, and so I tried to install it via a remote ssh session. Just run "sh ati-driver-installer-11-8-x86.x86_64.run" and it all worked well. You should reboot afterwards, that's it. According to the change log they now don't even need a running X-Server anymore, but I did not test that (I need X anyway).
[/QUOTE]I downloaded 11.8 and upgraded to it. I believe I did it correctly, but I am still not able to make mfakto use the GPU. I have tried many parameters for -d, but none of them work; they only use the CPU. This is the new output from clinfo:
[CODE]
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 2
Max work items dimensions: 3
Max work items[0]: 128
Max work items[1]: 128
Max work items[2]: 128
Max work group size: 128
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 0
Max clock frequency: 0Mhz
Address bits: 32
Max memory allocation: 134217728
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 32768
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 536870912
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x7f8a50e37060
Name: Cedar
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.1
Driver version: CAL 1.4.1523
Profile: FULL_PROFILE
Version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt
[/CODE]Any more ideas?

KingKurly 2011-08-19 01:25

Sorry to make back-to-back posts, but there was enough new information that I thought it warranted making a new post.

I found that if I plug a monitor and keyboard into that computer and then log in locally, the video card is found and can be used just fine. It would be a bit of a burden to always have to log in locally, but I guess I can do that until a better solution is found.

That said, I do have a new problem to report:

[CODE]
kurly@hex:~/mfakto/mfakto-0.07 - Linux/x86_64$ ./mfakto -d 11 -CLtest
mfakto 0.07 (64bit build)


Runtime options
SievePrimes 50000
SievePrimesAdjust 0
NumStreams 10
GridSize 3
WorkFile worktodo.txt
Checkpoints enabled
Stages enabled
StopAfterFactor class
PrintMode full
AllowSleep yes
VectorSize 4
Compiletime options
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 64kiB
SIEVE_SIZE 482885bits
SIEVE_SPLIT 250
MORE_CLASSES enabled
Select device - Get device info - Compiling kernels..........

OpenCL device info
name Cedar (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213) (CAL 1.4.1523)
maximum threads per block 128
maximum threads per grid 2097152
number of multiprocessors 2 (160 compute elements(estimate for ATI GPUs))
clock rate 0MHz

ERROR: THREADS_PER_BLOCK (256) > deviceinfo.maxThreadsPerBlock
[/CODE]As you can see, my GPU can only do 128 threads per block at most, but you have the build compiled for 256. I will try downloading the source from a previous post in this thread and see if I can work around this issue myself, but I wanted to let you know that I cleared one hurdle - now on to the next one! :smile:

-----------------------------------------------------------------------------
***Edit: I was able to build the source with THREADS_PER_BLOCK changed from 256 to 128, but it fails selftest 1-5 and 9-11. See below:

[CODE]
kurly@hex:~/mfakto/mfakto-0.07 - Linux/x86_64$ ./mfakto -d 11 -CLtest
mfakto 0.07 (64bit build)


Runtime options
SievePrimes 50000
SievePrimesAdjust 0
NumStreams 10
GridSize 3
WorkFile worktodo.txt
Checkpoints enabled
Stages enabled
StopAfterFactor class
PrintMode full
AllowSleep yes
VectorSize 4
Compiletime options
THREADS_PER_BLOCK 128
SIEVE_SIZE_LIMIT 64kiB
SIEVE_SIZE 482885bits
SIEVE_SPLIT 250
MORE_CLASSES enabled
Select device - Get device info - Compiling kernels..........

OpenCL device info
name Cedar (Advanced Micro Devices, Inc.)
device (driver) version OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213) (CAL 1.4.1523)
maximum threads per block 128
maximum threads per grid 2097152
number of multiprocessors 2 (160 compute elements(estimate for ATI GPUs))
clock rate 0MHz

Automatic parameters
threads per grid 1048576

running a simple selftest...
########## testcase 1/14 ##########
ERROR: selftest failed for M49635893 (mfakto_cl_71)
no factor found
########## testcase 2/14 ##########
ERROR: selftest failed for M51375383 (mfakto_cl_71)
no factor found
########## testcase 3/14 ##########
ERROR: selftest failed for M47644171 (mfakto_cl_71)
no factor found
########## testcase 4/14 ##########
ERROR: selftest failed for M51038681 (mfakto_cl_71)
no factor found
########## testcase 5/14 ##########
ERROR: selftest failed for M49717271 (mfakto_cl_71)
no factor found
########## testcase 6/14 ##########
########## testcase 7/14 ##########
########## testcase 8/14 ##########
########## testcase 9/14 ##########
ERROR: selftest failed for M60009109 (mfakto_cl_71)
no factor found
########## testcase 10/14 ##########
ERROR: selftest failed for M60002273 (mfakto_cl_71)
no factor found
########## testcase 11/14 ##########
ERROR: selftest failed for M60004333 (mfakto_cl_71)
no factor found
########## testcase 12/14 ##########
########## testcase 13/14 ##########
########## testcase 14/14 ##########
Selftest statistics
number of tests 52
successfull tests 44
no factor found 8

selftest FAILED!
[/CODE]The GPU does not make very much noise at all during the test, but I assume it is doing something!

Bdot 2011-08-19 13:44

[QUOTE=KingKurly;269472]
I found that if I plug in a monitor and keyboard to that computer and then login to the computer locally, the video card is found and can be used just fine. It would be a bit of a burden to have to always login locally, but I guess I can do that until a better solution is determined.
[/quote]

Well, it appears the dependency on a running X-Server has not yet been dropped (or some additional work is necessary). You need to be logged in in order to start the X-Server. But then you can lock the screen and run mfakto remotely.

I'll check if we can get rid of that.

[QUOTE=KingKurly;269472]
That said, I do have a new problem to report:
ERROR: THREADS_PER_BLOCK (256) > deviceinfo.maxThreadsPerBlock
[/quote]

I'll check what implications that has and whether we can drop this check altogether, as OpenCL calculates the threads a little differently. (A sketch of the kind of adjustment I have in mind is below.)
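
Something along these lines - just a sketch, not a patch against mfakto, and the helper name is made up:

[CODE]
#include <CL/cl.h>

/* Hypothetical helper: clamp the requested work-group size to what the
 * device reports, instead of aborting like the 0.07 check does. */
static size_t clamp_threads_per_block(cl_device_id dev, size_t requested)
{
    size_t max_wg = 0;

    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg), &max_wg, NULL);
    return (requested > max_wg) ? max_wg : requested;  /* e.g. 256 -> 128 on Cedar */
}
[/CODE]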

[QUOTE=KingKurly;269472]
it fails selftest 1-5 and 9-11. See below:
[/quote]

Now that is odd! The 72-bit kernel fails, but the vectored versions of the same kernel succeed! I just compared the kernels, but there are no code differences.
Plus, I can reproduce it now on my Linux box: I still had LD_LIBRARY_PATH pointing to 2.4, and that runs fine. When pointing it to 2.5, the problem appears. Looks like an AMD APP issue; I'll check what I can do about it. Running 2.5 on the CPU also works fine ...

I already wanted to drop the single kernel because it is so much slower ...

As you built your own binary anyway, go to mfaktc.c and comment out line 487 (removing the _71BIT_MUL24 kernel). Don't submit results with that to primenet, just use it to check what your GPU can do :smile:. You can then run the full selftest (-st), if you want the GPU to work for a while. There, you also see the speed of the different kernels for different tasks.

KingKurly 2011-08-20 03:00

I rebuilt the program with the change you recommended. All the tests pass, including the large selftest. The card seems to do about 5-10M/s in the "lower" ranges (like below 75M) and is about 10% of that in the 332M+ range.

[CODE]
Selftest statistics
number of tests 3637
successfull tests 3637

selftest PASSED!
[/CODE]

I look forward to future versions, and I will not use the program to submit any "no factor" results until you have indicated that it is safe for me to do so. If I happen to find factors, I might submit those, but I do not expect to use the program for much production work until things have stabilized a bit more.

Thanks again! :smile:

KingKurly 2011-08-20 17:18

The very first test I ran saved me an LL test, and of course saved someone else the LL-D down the road.

[CODE]
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3760/4620 | 159.38M | 16.721s | 9.53M/s | 50000 | 49m53s | 90889us
3765/4620 | 159.38M | 16.696s | 9.55M/s | 50000 | 49m32s | 90729us
Result[00]: M40660811 has a factor: 490782599517282826471
found 1 factor(s) for M40660811 from 2^68 to 2^69 (partially tested) [mfakto 0.07 mfakto_cl_barrett79]
tf(): total time spent: 3h 53m 44.575s
[/CODE]

I had the exponent queued up for a first-time LL test, but I've since removed it from my worktodo because it's not necessary!

Bdot 2011-08-21 14:49

[QUOTE=KingKurly;269623]The very first test I ran saved me an LL test, and of course saved someone else the LL-D down the road.

[CODE]
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3760/4620 | 159.38M | 16.721s | 9.53M/s | 50000 | 49m53s | 90889us
3765/4620 | 159.38M | 16.696s | 9.55M/s | 50000 | 49m32s | 90729us
Result[00]: M40660811 has a factor: 490782599517282826471
found 1 factor(s) for M40660811 from 2^68 to 2^69 (partially tested) [mfakto 0.07 mfakto_cl_barrett79]
tf(): total time spent: 3h 53m 44.575s
[/CODE]I had the exponent queued up for a first-time LL test, but I've since removed it from my worktodo because it's not necessary![/QUOTE]

What a start! While I have already found a lot of factors with mfakto, almost all of them were known before :smile:.

BTW, at the expense of a little more CPU you can speed up the tests a bit: set SievePrimes to 200000 and the siever will eliminate some more candidates, so the GPU will not have to test them. What is mfakto's CPU load right now, and with SievePrimes at 200k?

9.5 M/s is also not bad for an entry-level GPU - I guess it is at least twice as fast as one of your CPU cores.

Congrats also on the successful selftest. The speed of the tests does not depend much on the size of the exponent, but mainly on the kernel being used. The selftest will run each test with all kernels that can handle the required factor length. If you still have the output of the selftest, you should see that mfakto_cl_barrett79 is always close to 10 M/s, most others a bit below that, and mfakto_cl_95 slowly crawling along.

Bdot 2011-08-24 20:55

Did anyone else give mfakto a try? Any experiences to share (anything strange happening, suggestions you'd like to get included or excluded for the next versions, performance figures for other GPUs, ...)?

I'm running this version on a SuSE 11.4 box with AMD APP SDK 2.4, and when multiple instances are running I occasionally see one instance hang. It completely occupies one CPU core but no GPU resources. It is looping inside some kernel code and is immune to kill, kill -9, or attempts to attach a debugger or gcore. So far, a reboot is the only way I know of to get rid of it. How can I find out where that hang occurs? And what else could I try to kick such a process without a reboot?

apsen 2011-08-25 17:29

[QUOTE=Bdot;270043]Did anyone else give mfakto a try?[/QUOTE]

I had the same experience as another poster: I had to recompile to reduce the number of threads per block and disable one kernel. Apart from that, AMD APP refused to install on Win2008, so I had to swap the graphics cards between two machines so the AMD one would be on Windows 7. The performance is about 20% of what I get out of a GeForce 8800 GTS (around 6 M/s compared to 29 M/s). I haven't played with the sieve parameter much - I just had to disable the auto-adjust, as it raises the setting to the limit, slowing the testing to a crawl. If I lower it below the default I would probably get better overall performance.

Bdot 2011-08-25 20:19

[QUOTE=apsen;270091]The performance is about 20% of what I get out of GeForce 8800 GTS (around 6 M/s comparing to 29 M/s). I haven't played with sieve parameter much - just had to disable auto adjust as it will raise the setting to the limit slowing the testing to a crawl. If I'll lower it to below default I would probably get better overall performance.[/QUOTE]

That is a bit slower than I had expected. Which kernel and bitlevel was that? If raising SievePrimes slows down the tests, then the tests are CPU-limited and the GPU is not running at full load. If you want, build a test binary with CL_PERFORMANCE_INFO defined (params.h) - this will tell you the memory transfer rate and the pure kernel speed, without accounting for the siever.
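
In case it helps, here is a rough sketch of the kind of measurement CL_PERFORMANCE_INFO enables: timing one buffer upload and one kernel run via OpenCL profiling events. Illustrative only - the function and variable names are placeholders rather than mfakto's, and the command queue must have been created with CL_QUEUE_PROFILING_ENABLE:

[CODE]
#include <CL/cl.h>
#include <stdio.h>

/* Time one host->device copy and one kernel launch using profiling events. */
static void profile_block(cl_command_queue queue, cl_kernel kernel,
                          cl_mem buf, const void *host_data, size_t bytes,
                          size_t global, size_t local)
{
    cl_event copy_ev, run_ev;
    cl_ulong t0, t1;

    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_data,
                         0, NULL, &copy_ev);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, &run_ev);
    clFinish(queue);

    clGetEventProfilingInfo(copy_ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(copy_ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
    printf("copy:   %.3f ms (%.2f GB/s)\n", (t1 - t0) / 1e6,
           bytes / (double)(t1 - t0));   /* bytes per nanosecond == GB/s */

    clGetEventProfilingInfo(run_ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(run_ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
    printf("kernel: %.3f ms\n", (t1 - t0) / 1e6);

    clReleaseEvent(copy_ev);
    clReleaseEvent(run_ev);
}
[/CODE]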

According to [URL="http://www.hwcompare.com/3268/geforce-8800-gts-g80-320mb-vs-radeon-hd-4550-256mb/"]hwcompare[/URL], the 8800 GTS should be 3-4 times faster, so 8-10 M/s would be expected if OpenCL and my port were as efficient as Oliver's CUDA implementation.

Chaichontat 2011-08-26 11:55

Hi, I'm running mfakto on my HD6950 @ 912MHz with Catalyst 11.8 and SDK 2.5. One thing I've seen is that it only uses approx. 30 percent of my GPU and gives about 50M/s. Does anyone know how to make it use the GPU fully?
Thanks.

apsen 2011-08-26 14:30

[QUOTE=Bdot;270107]That is a bit slower than I had expected. Which kernel and bitlevel was that? But if raising SievePrimes slows down the tests, then the tests are CPU-limited, and the GPU not running at full load. If you want, build a test binary with CL_PERFORMANCE_INFO defined (params.h) - this will tell you the memory transfer rate and pure kernel speed, without accounting for the siever.

According to [URL="http://www.hwcompare.com/3268/geforce-8800-gts-g80-320mb-vs-radeon-hd-4550-256mb/"]hwcompare[/URL], the 8800 GTS should be 3-4 times faster, so 8-10 M/s would be expected if OpenCL and my port were as efficient as Oliver's CUDA implementation.[/QUOTE]

I did some more testing and it looks like the problem is getting enough CPU. When I run it alone I'm getting about 7.3 M/s and the CPU usage is 50-56%(!) on a two-core machine. When I start prime95 the CPU usage drops to about 10% on average and I'm getting about 6.5 M/s, even though prime95 runs at default priority and mfakto at normal. Also, the average wait is always in the teens of milliseconds (12000-15000 microseconds). It is lower without prime95 running.

MrHappy 2011-08-27 12:44

I get ~28M/s on my HD5670 / Phenom II X4 925, with 2 cores on P-1 tests, 1 core on LL-D, and 1 core busy with video editing. I'll look again when the video job is done.

Christenson 2011-08-28 14:18

[QUOTE=Chaichontat;270138]Hi, I'm running mfakto on my HD6950 @912MHz, Catalyst 11.8 SDK 2.5, one thing that I seen is that it uses approx. 30 percent of my GPU utilization and gives about 50M/s. Does anyone knows how to make it fully use the GPU?
Thanks.[/QUOTE]

At the current stage of development, mfaktc/mfakto sieves the factor candidates on the CPU side (removing those with small prime factors) before passing the survivors to the GPU for checking. Make sure that SievePrimes on your machine has gone down to 10,000. Beyond that, at the moment, you have to throw more CPU at it, in the form of running a second copy of mfakto on a different core.

50M/s is doing a bit better than my GTX440 under mfaktc, incidentally.

Setting up both mfaktc and mfakto to sieve on the GPU is at least a dream for the developers.

Bdot 2011-08-29 22:07

Thanks a lot for your reports, good to hear some people would actually use it :smile:

I did some detailed testing on the CPU demands of mfakto vs. mprime/prime95 (SievePrimes was fixed for this test).

My HD 5770 can reach about 120M/s total with 2 instances running and no other big consumer. Then, mfakto's CPU load is about 320%. Yes, only two instances, single-threaded, will occupy a little more than 3 cores. A single instance will reach about 105M/s, at 195% CPU.

Starting mprime (mostly LL tests) will drastically decrease mfakto's CPU load. Throughput also drops, but to a lesser degree - with ~100% CPU a single instance still reaches 75M/s. The conclusion is that the OpenCL runtime has quite a bit of busy-waiting behind the user-level events ...

And what timings change inside mfakto when mprime starts? Well, the pure kernel runtime is absolutely unchanged (as expected). The siever slows down by 15-20% (370 -> 433 ms for 20 blocks of 1.25M), even though mfakto runs at normal priority while mprime is the "nicest" of all. But what may matter far more is the time required to copy the blocks to the GPU. While the transfer rate is normally above 3 GB/s, it starts fluctuating a lot and averages 1.55 GB/s when mprime runs. The worst case was 14.7 ms to copy a block, and 9.4 ms to process it on the GPU. Unlike the longer sieving times, the longer transfer times will not be hidden by parallelism: OpenCL does not yet support copying data to the device while another kernel is still running, so mfakto copies and processes blocks strictly alternately.
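
A minimal sketch of that strictly alternating flow - placeholder names throughout, not mfakto's actual host loop:

[CODE]
#include <CL/cl.h>

/* Upload one block of sieved factor candidates, then run the
 * trial-factoring kernel on it, before the next upload starts. */
static void process_blocks(cl_command_queue queue, cl_kernel tf_kernel,
                           cl_mem fc_buf, unsigned char **host_fc,
                           int num_blocks, size_t block_bytes,
                           size_t global, size_t local)
{
    for (int b = 0; b < num_blocks; b++) {
        /* copy one block of candidates to the GPU (blocking write) */
        clEnqueueWriteBuffer(queue, fc_buf, CL_TRUE, 0, block_bytes,
                             host_fc[b], 0, NULL, NULL);
        /* only now start the factoring kernel on that block */
        clEnqueueNDRangeKernel(queue, tf_kernel, 1, NULL, &global, &local,
                               0, NULL, NULL);
        clFinish(queue);   /* wait before copying the next block - no overlap */
    }
}
[/CODE]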

Conclusion? Both mprime and mfakto put quite some stress on the memory bus (and mfakto not yet being optimized to be cache-friendly). When a CPU waits for data from memory, this is counted as "CPU busy" towards the application, even though the CPU has to wait lots of cycles.

I'll see if I can make mfakto a bit "cache-friendlier", but the ultimate solution to this problem will be when the siever runs on the GPU.

Regarding the maximum throughput of your cards:

Chaichontat's HD6850 (at this speed rather a 6870!) should achieve around 160M/s. For that you'll need at least 2, probably 3 instances running on at least 4 CPU-cores.

MrHappy's HD5670 should have its max at ~40M/s. Maybe 2 instances are needed here too in order to keep the GPU at 99%.

@Christenson: keep dreaming, one day it will come true ...

Christenson 2011-08-30 03:23

I do keep dreaming...the question is whether I will be the implementer, or someone else....

Bdot 2011-08-30 12:22

GPU sieving for Trial Factoring
 
[QUOTE=Christenson;270374]I do keep dreaming...the question is whether I will be the implementer, or someone else....[/QUOTE]
I'm still at the stage of collecting ideas on how to distribute the work onto multiple threads.

The easiest would be to give each thread a different exponent to work on. This would eliminate the need for threads to communicate with each other; each could work in the fast private storage ... However, you'd need at least 64 exponents to work on, and for high-end GPUs up to 1024. The factoring progress of each would be about 2-4M/s, leading to huge runtimes even for medium bitlevels.

Each thread could also process a fixed block of sieve input. This would require sieve initialization for each block, as you cannot build upon the state of the previous block. Therefore each block needs to be of a good size to make the initialization less prominent. An extra step (i.e. an extra kernel) would be needed to combine the output of all the threads into the sieve output. And only after that step would we know whether we have enough FCs to fill a block for the GPU factoring.

Similarly, we could let each thread prepare a whole block of sieve-output factor candidates. This would require good estimates of where each block will start. Usually you don't know where a certain block starts until the previous block has finished sieving. It can be estimated, but to be safe there needs to be a certain overlap, some checks, and maybe re-runs of the sieving if gaps are detected.

We could split the primes that are used to sieve a block across the threads (a rough kernel sketch of this variant is below). Disadvantages include different run lengths for the loops, lots of (slow) global memory operations, and synchronization for access to the block of FCs (not sure about that). Maybe that could be optimized by using workgroup-sized blocks in local memory, which is considerably faster, and combining that into global memory later.

Maybe the best would be to split the task (factor M[SUB]exp[/SUB] from 2[SUP]n[/SUP] to 2[SUP]m[/SUP]) into <workgroup> equally sized blocks and run the sieving and factoring of those blocks in independent threads. Again, lots of initializations, plus maybe too many private resources required ... Preferred workgroup numbers seem to be 32 to 256, depending on the GPU.
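
To make the prime-splitting variant a bit more concrete, here is a very rough OpenCL kernel sketch. Purely illustrative - not one of mfakto's kernels; all names and the data layout (one bit per factor candidate, a per-prime start offset) are made up:

[CODE]
/* Each work item takes one sieving prime and clears that prime's multiples
 * from a shared bit array of factor candidates, using global atomics for
 * the shared writes (the slow part mentioned above). */
__kernel void sieve_block(__global uint *sieve_bits,       /* 1 bit per candidate      */
                          __global const uint *primes,     /* sieving primes           */
                          __global const uint *first_mult, /* first multiple per prime */
                          const uint block_bits)
{
    uint i = get_global_id(0);
    uint p = primes[i];

    /* walk this prime's multiples through the block and clear their bits */
    for (uint pos = first_mult[i]; pos < block_bits; pos += p) {
        atomic_and(&sieve_bits[pos >> 5], ~(1u << (pos & 31)));
    }
}
[/CODE]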

More suggestions, votes, comments?

MrHappy 2011-08-30 18:27

With Prime95 stopped mfakto reaches ~50M/s on the HD5670.

AldoA 2011-08-30 18:52

Hi everyone. I wanted to help this project with my ATI Radeon HD 4650.
So I downloaded OpenCL and mfakto. I installed OpenCL, but when I start mfakto it says "The application could not be started because MSVCR100.dll was not found; reinstalling the program might fix the problem".
Can anyone tell me what I have to do or install? Thanks

Bdot 2011-08-30 21:09

Hi AldoA,

this is the Microsoft Visual C++ runtime, which you can download from MS (the links below are for the German version, but you can switch the language there):
[URL="http://www.microsoft.com/downloads/details.aspx?FamilyID=a7b7a05e-6de6-4d3a-a423-37bf0912db84&displayLang=de"]Microsoft Visual C++ 2010 Redistributable Package (x86)[/URL]
[URL="http://www.microsoft.com/downloads/details.aspx?familyid=BD512D9E-43C8-4655-81BF-9350143D5867&displaylang=de"]Microsoft Visual C++ 2010 Redistributable Package (x64)[/URL]

I'll add it to the list of dependencies in the README.

Bdot 2011-08-30 21:18

[QUOTE=MrHappy;270416]With Prime95 stopped mfakto reaches ~50M/s on the HD5670.[/QUOTE]

Did you try a single instance only, or also two separate invocations (two different exponents)? That would certainly add something on the totals line.

Bdot 2011-08-30 21:48

[QUOTE=apsen;270091]Apart from that AMD_APP refused to install on Win2008 [/QUOTE]

I posted that on the [URL="http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=154048&enterthread=y"]AMD Forum[/URL] and they want to know what exactly the error is. Could you please try again and tell me?

AldoA 2011-08-31 10:03

[QUOTE=Bdot;270422]Hi AldoA,

this is the Microsoft Visual C++ runtime, to download from MS (the below links are for the German version, but there you can also change the language):
[URL="http://www.microsoft.com/downloads/details.aspx?FamilyID=a7b7a05e-6de6-4d3a-a423-37bf0912db84&displayLang=de"]Microsoft Visual C++ 2010 Redistributable Package (x86)[/URL]
[URL="http://www.microsoft.com/downloads/details.aspx?familyid=BD512D9E-43C8-4655-81BF-9350143D5867&displaylang=de"]Microsoft Visual C++ 2010 Redistributable Package (x64)[/URL]

I'll add it to the list of dependencies in the README.[/QUOTE]

Thanks. Now I can open mfakto, but I think it's using the CPU because it says "select device - GPU not found - fallback to CPU". What should I do? Anyway, I ran the selftest and it passed. What else can I do? (Sorry for the questions, but I'm not really into computing.)

