mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

potonono 2014-07-25 23:53

I haven't had much chance to test yet, except I can confirm VectorSize=4 is best on mine too.

legendarymudkip 2014-07-26 00:00

VectorSize=4 increases throughput by about 1 GHz-d/Day on my end as well.

Bdot 2014-07-27 11:46

[QUOTE=kracker;378955]I'll do/try that. :smile:

On another note...
[code]

Selftest statistics
number of tests 287351
successful tests 287350
no factor found 1

selftest FAILED!

ERROR: selftest failed, exiting.
[/code][/QUOTE]
Oh, that's sad. Was that the IntelHD? And again it was a really good choice to add so many test cases ...

It must be a different rounding that leads to a higher-than-expected error. I cannot reproduce this error on my H/W. Can you reproduce it reliably?

The factor is not particularly close to the limit of this kernel, and the exponent does not have a long run of ones in its binary form ... I don't see why this test should fail.
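For reference, "a long run of ones" is easy to check - a small Python sketch (illustrative only, not part of mfakto; the exponent below is just an example, not necessarily the failing case):

```python
def longest_one_run(n: int) -> int:
    """Length of the longest run of consecutive 1-bits in n's binary form."""
    best = run = 0
    while n:
        run = run + 1 if n & 1 else 0
        best = max(best, run)
        n >>= 1
    return best

# Example: 59000521 = 0b11100001000100011011001001, longest run of ones is 3.
print(longest_one_run(59000521))
```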

I'll provide you with a special test version so we can analyze this failure.

kracker 2014-07-28 01:12

Yep, just ran part of -st2 again on my HD4600. Same failure.

Rodrigo 2014-07-29 12:05

[QUOTE=LaurV;376585]If you want to squeeze more performance from mfakto, try to factor lower than 73 bits only. Contrary to mfaktc, where there is no big drop in performance for higher bit levels (or, say, no big gain in performance for lower bit levels), for mfakto, especially on higher-end GCN cards, the "shorter" kernels are much faster. For example, I get something like 450 GHzD/D from my HD7970 GHz edition when factoring 6xM to 74, but 500 GHzD/D when factoring to 73 only, and so on. Decreasing the bit level increases the "gain" (but helps GIMPS less), and decreasing the exponent also increases the "gain", but only a little. For example, the same card I described above gives 630-650 GHzD/D when factoring 4xM exponents to 69 bits. Right now, Chris made them unavailable from GPU72, to channel the workers toward Cat 4 exponents, but there are still [URL="http://www.gpu72.com/reports/current_level/"]35 thousand[/URL] of them (44-47M) at 68 bits; you can take them to 69 directly from PrimeNet, or ask Chris to make them available. For this range of exponents and bit level, the performance of the card (kernel) is about 50% higher.[/QUOTE]

How does one go about getting these kinds of assignments? I just went into PrimeNet to request manual assignments in that range using the specified range option, and the server took forever to respond, finally issuing a timeout error.

Can I enter, say "45000000" to "47000000", or do I have to enter specific starting/ending exponents?

(I was able to get manual assignments in the normal manner if no range was indicated.)

Rodrigo

LaurV 2014-07-29 13:29

[QUOTE=Rodrigo;379296]How does one go about getting these kinds of assignments?[/QUOTE]
GPU72

chalsall 2014-07-29 16:26

[QUOTE=Rodrigo;379296]Can I enter, say "45000000" to "47000000", or do I have to enter specific starting/ending exponents?[/QUOTE]

I can't speak to how one might get such assignments from Primenet using the manual assignment page there, since GPU72 uses different techniques to reserve candidates for TF'ing beyond the nominal CPU "bit levels".

But such work is easily available from GPU72 -- both low DCTF'ing and low LLTF'ing. Please note that LLTF'ing is the most needed at the moment, and the deeper the better. But those whose cards are better optimized for lower bit levels are encouraged to work there.

Rodrigo 2014-07-29 23:20

Thanks, LaurV and Chris.

I guess I got sidetracked by this part:

[QUOTE]but there are still [URL="http://www.gpu72.com/reports/current_level/"]35 thousands[/URL] of them (44-47M) at 68 bits, you can take them to 69 directly from PrimeNet[/QUOTE]It's time to get around to investigating GPU72. :smile:

Rodrigo

potonono 2014-08-03 16:35

[QUOTE=kracker;379188]Yep, just ran part of -st2 again on my HD4600. Same failure.[/QUOTE]

I get the same failure on my HD2500.

[QUOTE]######### testcase 2584/32927 (M59000521[82-83]) #########
Starting trial factoring M59000521 from 2^82 to 2^83 (16600.99GHz-days)
Using GPU kernel "cl_barrett15_83_gs_4"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Aug 03 08:00 | 3828 0.1% | 4.565 n.a. | n.a. 81205 0.00%
no factor for M59000521 from 2^82 to 2^83 [mfakto 0.15pre1-Win cl_barrett15_83_gs_4]
ERROR: selftest failed for M59000521 (cl_barrett15_83_gs)
no factor found
tf(): total time spent: 4.565s[/QUOTE]

Bdot 2014-08-05 20:56

[QUOTE=potonono;379619]I get the same failure on my HD2500.[/QUOTE]
Thanks for the confirmation.

I'm working with kracker to troubleshoot this, but don't have enough time to make good progress on it. The next step would be to build a version where tracing will show exactly where the calculations go wrong ...

Jayder 2014-08-11 13:43

[QUOTE=Bdot;371459]I have released [URL="http://mersenneforum.org/mfakto/mfakto-0.14/"]mfakto 0.14.[/URL]

As most people seem to use the GPU sieve, I no longer created the versions with different (CPU-) sieve sizes. If anyone still needs them, just let me know.[/QUOTE]
I know it's been a while since release, but if you can be bothered to, would you mind making a 64kB version if not also (optionally) a -var version? The GPU sieve is nice, but I think I am willing to switch back as the CPU sieve results in almost twice the speed on my APU. The standard 36kB sieve size limit is also quite a bit slower than 64kB.

Feel free to say no or to put it at the end of your to-do list. :smile: I can stick with the GPU sieve for a while longer. I seem to be the only one wanting it, and I don't expect you to go out of your way or anything.

Bdot 2014-08-13 00:50

[QUOTE=Jayder;380174]I know it's been a while since release, but if you can be bothered to, would you mind making a 64kB version if not also (optionally) a -var version? The GPU sieve is nice, but I think I am willing to switch back as the CPU sieve results in almost twice the speed on my APU. The standard 36kB sieve size limit is also quite a bit slower than 64kB.

Feel free to say no or to put it at the end of your to-do list. :smile: I can stick with the GPU sieve for a while longer. I seem to be the only one wanting it, and I don't expect you to go out of your way or anything.[/QUOTE]

No problem, I think I can build them within the next few days.


Just a quick update about the HD4600/2500 selftest failure:
I analyzed kracker's data and the code. The reason is that the HD4600 has a slightly different rounding behavior.

I noticed that the code walks dangerously close to the edge of the available precision on AMD devices as well. Even though all tests succeed, the warning lights that Oliver once built into the code (CHECKS_MODBASECASE) do light up in the 15_82/15_83 and 15_88 kernels.

In order to fix that, I finally made an attempt that had been waiting for a long time: basing the initial division on double instead of float. This allows the div_180_90 to be done in two steps instead of five, with only one big multiplication in between instead of four.

Result: no more CHECKS_MODBASECASE issues (plenty of safety bits), and 1.5% faster overall (on HD7950), even though doubles are processed at only a 4:1 rate. I will run a few tests overnight and then probably send out 0.15pre2 for testing - it should at least fix the IntelHD issue.
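To illustrate why doubles need fewer steps (my own sketch in Python, not the actual OpenCL code): a float carries a 24-bit significand, a double 53 bits, so a double-based reciprocal pins down roughly twice as many quotient bits per correction step.

```python
import struct

def to_f32(x: float) -> float:
    # Round a Python double (binary64) to IEEE-754 single precision (binary32).
    return struct.unpack('f', struct.pack('f', x))[0]

# A 24-bit significand cannot distinguish 2^24 from 2^24 + 1 ...
assert to_f32(2.0**24) == to_f32(2.0**24 + 1)
# ... while a 53-bit significand still can.
assert 2.0**24 != 2.0**24 + 1

# Rough step counts for the ~90-bit quotient of a div_180_90:
# float:  about 90 / 24 -> 4-5 steps;  double: about 90 / 53 -> 2 steps.
```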

I'll then test whether the smaller kernels would also benefit from using doubles, and what the performance looks like on mid- and lower-end GPUs where the rate for doubles is just 16:1.

Well, maybe after my vacation :motorhome:

Bdot 2014-08-13 20:07

mfakto-0.15pre2 ready for testing
 
Dear mfakto-testers,

I have now put the windows/64 version of mfakto-0.15pre2 on the [URL="http://mersenneforum.org/mfakto/mfakto-0.15pre2/mfakto-0.15pre2.zip"]ftp[/URL]. I'd appreciate it if you could test it on the various systems you have access to:

[LIST][*] does mfakto detect the devices automatically or are switches (like -d 11) required[*]does it correctly identify the devices and their device type[*] is 'mfakto -st' reporting success (on fast systems, or when you have lots of time, 'mfakto -st[B]2[/B]') - if testing is too long, you can always interrupt by pressing 'q' or Ctrl-C.[*] use a normal trial factoring task and try to optimize the ini-file settings: try VectorSize=2 and =4 (1, 3, 8 and 16 are possible as well) to see which is faster, then use the +/-, s/S, p/P keys to get the best possible GHz-days: what was the TF job, and which settings (VectorSize, SievePrimes, SieveSize, SieveProcessSize) gave the best performance for the specific device?[*] any problems/suggestions?[/LIST]
Additional performance-testing:

As the new division algorithm is based on double precision, I'd need to get performance results from as many different devices as possible:
[LIST][*]Modify the ini-file to use the best VectorSize (see above)[*]Switch to CPU sieving: SieveOnGPU=0[*]make sure CPU and GPU are idle[*]run "mfakto-pi.exe -st > st-pi.log"[*]keep it running for one or two minutes, then press q (or Ctrl-C)[*]have a look at st-pi.log: is the detected clock speed correct (it rarely is on AMD - please let me know the correct one)[*]send me the log[/LIST]Thanks a lot for any help you can provide - even if the complete checklist is too long for you: any partial result is also appreciated.
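For reference, the relevant mfakto.ini lines for this test might look like the following (the option names are the ones above, but the numbers are examples only - tune them per device as described):

```ini
VectorSize=4        ; try 1, 2, 3, 4, 8 and 16; keep the fastest
SieveOnGPU=0        ; switch to CPU sieving for the -pi performance test
SievePrimes=80000   ; example value, adjust at runtime
SieveSize=36        ; example value
SieveProcessSize=16 ; example value
```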

kracker 2014-08-13 20:52

Thanks! Will try it out.

On another point... Intel iGPU's [URL="https://software.intel.com/en-us/forums/topic/393241"]do not have[/URL] double precision... :sad:

LaurV 2014-08-14 09:13

1 Attachment(s)
"MSVCR110.DLL missing from your system". Didn't need it for the old one.

[QUOTE=Bdot;380305]
[LIST][*] does mfakto detect the devices automatically or are switches (like -d 11) required[/LIST][/QUOTE]
Yes, (after installing the redistributable thing) there is only one HD card here, successfully detected.

[QUOTE]
[LIST][*]does it correctly identify the devices and their device type[/LIST][/QUOTE]Yes. What's with the big "elf" file? Can it be deleted?

[QUOTE]
[LIST][*] is 'mfakto -st' reporting success (on fast systems, or when you have lots of time, 'mfakto -st[B]2[/B]') - if testing is too long, you can always interrupt by pressing 'q' or Ctrl-C.[/LIST][/QUOTE]3092/3092 successful tests. Or I could say that something is odd... because all 3092 exponents picked had factors... Hm... :razz:

-st2 works fine, no fail. Good job!

[QUOTE]
[LIST][*] use a normal trial factoring task and try to optimize the ini-file settings: try VectorSize=2 and =4 (1, 3, 8 and 16 are possible as well) to see which is faster, then use the +/-, s/S, p/P keys to get the best possible GHz-days: what was the TF job, and which settings (VectorSize, SievePrimes, SieveSize, SieveProcessSize) gave the best performance for the specific device?[/LIST][/QUOTE]Not so much to do here, GCN card, VS=2 still works best, still playing with it.

[QUOTE]
[LIST][*] any problems/suggestions?[/LIST] [/QUOTE]Cosmetic: I ran it with "-i" switch with no file parameter (just "mfakto -i", by reflex, I was looking for card "info" hehe) and it crashes ugly.

[QUOTE]
Additional performance-testing:

As the new division algorithm is based on double precision, I'd need to get performance results from as many different devices as possible:
[LIST][*]Modify the ini-file to use the best VectorSize (see above)[*]Switch to CPU sieving: SieveOnGPU=0[*]make sure CPU and GPU are idle[*]run "mfakto-pi.exe -st > st-pi.log"[*]keep it running for one or two minutes, then press q (or Ctrl-C)[*]have a look at st-pi.log: is the detected clock speed correct (it rarely is on AMD - please let me know the correct one)[*]send me the log[/LIST]Thanks a lot for any help you can provide - even if the complete checklist is too long for you: any partial result is also appreciated.[/QUOTE]Tried to do that. I have the file(s) (from -pi and from --perftest). Where can I put them? [edit: solved, didn't know the quota limit for zip is larger]

Antonio 2014-08-14 10:08

Tried this on my system (i5-3570K) with 2 NVIDIA graphics cards and the integrated HD4000 enabled, Windows 7.
With GPUType = AUTO or CPU :- Windows reports 'mfakto.exe has stopped working' during the kernel compile.
With GPUType=INTEL :- program compiles the kernel and runs on the CPU successfully.

Bdot 2014-08-14 19:10

[QUOTE=kracker;380308]Thanks! Will try it out.

On another point... Intel iGPU's [URL="https://software.intel.com/en-us/forums/topic/393241"]do not have[/URL] double precision... :sad:[/QUOTE]
Ohh ... :blush:

Good thing I implemented a check for that ... you should see a notice, and the kernels in question should be skipped.

Bdot 2014-08-14 21:16

Thanks for your tests - quite a bit of this is news to me:
[QUOTE=LaurV;380351]"MSVCR110.DLL missing from your system". Didn't need it for the old one.
[/QUOTE]
I did not remember it being different; I thought I had moved to VS12 before 0.14 ... but [URL="https://github.com/Bdot42/mfakto/commits/master?page=2"]git[/URL] tells otherwise ... So this needs to be added to the requirements list.
[QUOTE=LaurV;380351]
Yes. What's with the big "elf" file? Can it be deleted?
[/QUOTE]
It's the kernels compiled for your device. You can delete it, and mfakto will not recreate it if you set UseBinFile to empty. If mfakto finds the file during startup, it will skip kernel recompilation, improving startup time a lot.
[QUOTE=LaurV;380351]
Cosmetic: I ran it with "-i" switch with no file parameter (just "mfakto -i", by reflex, I was looking for card "info" hehe) and it crashes ugly.
[/QUOTE]
Very good. It's exactly reports like these that I'm looking for. Fixed.
[QUOTE=LaurV;380351]
Tried to do that. I have the file(s) (from -pi and from --perftest). Where I can put them? [edit: solved, didn't know the quota limit for zip is larger][/QUOTE]
Your card does not even spin up to full clock speed for the -pi test - your CPU is just too slow :razz: I need to see how I can improve GPU utilisation for this test.
Also the --perftest shows that my old PhenomII is between 2 and 4 times as fast as your CPU ... did you keep prime95 running?
The GPU part of --perftest thinks the optimal GPUSievePrimes is a little above 110k. It will depend on the TF task, though. As the card has plenty of memory with relatively large caches, GPUSieveSize and GPUSieveProcessingSize maxed out are probably best as well.

Bdot 2014-08-14 21:31

[QUOTE=Antonio;380352]Tried this on my system (i5, 3570k) 2 * NVidia graphics cards and integrated HD4000 enabled, Windows 7.
With GPUType = AUTO or CPU :- Windows reports 'mfakto.exe has stopped working' during the kernel compile.
With GPUType=INTEL :- program compiles the kernel and runs on the CPU successfully.[/QUOTE]
Can you tell a bit more about your system:
[LIST][*]Which Graphics drivers (AMD, Intel and/or NVIDIA, and which version)
I see a crash during compile as well when trying to run it on my Quadro FX 880M with NV drivers 334.something. It's an NV driver bug, it used to work with older drivers.[*]Interesting detail that GPUType=INTEL make it work ... that one skips optimization and enables a few workarounds in the code.[*]Does mfakto -d 11 / -d 12 / -d 13 / -d 21 / 22 / 23 / ... try to use other devices? (keep increasing the two digits separately until mfakto tells something like "Error: Only 1 platforms found. Cannot use platform 3..." or "Error: Only 1 devices found. Cannot use device 3..:" Does any of the settings select the HD4000? Is the HD4000 listed in the output of "clinfo"?[*]How did you check the HD4000 is enabled? Does it have a monitor connected?[/LIST]

Bdot 2014-08-15 01:33

[QUOTE=Jayder;380174]I know it's been a while since release, but if you can be bothered to, would you mind making a 64kB version if not also (optionally) a -var version? The GPU sieve is nice, but I think I am willing to switch back as the CPU sieve results in almost twice the speed on my APU. The standard 36kB sieve size limit is also quite a bit slower than 64kB.

Feel free to say no or to put it at the end of your to-do list. :smile: I can stick with the GPU sieve for a while longer. I seem to be the only one wanting it, and I don't expect you to go out of your way or anything.[/QUOTE]

I've added -64k and -var versions to the current version at the [URL="http://mersenneforum.org/mfakto/mfakto-0.15pre2/mfakto-0.15pre2.zip"]ftp[/URL]. I have tested this version extensively and LaurV also reported successful tests. I'd ask you to run the -st2 selftest with the settings you intended to use, then feel free to use it for your normal TF tasks.

LaurV 2014-08-15 01:38

[QUOTE=Bdot;380408]It's the kernels compiled for your device. You can delete it, and mfakto will not recreate it if you set UseBinFile to empty. If mfakto finds the file during startup, it will skip kernel recompilation, improving startup time a lot.
[/QUOTE]
I figured that much out by looking into the new ini file after I made my previous post.

[QUOTE]Your card does not even spin up to full clock speed for the -pi test - your CPU is just too slow :razz: I need to see how I can improve GPU utilisation for this test.[/QUOTE]Indeed, I was going to say, that wheelbarrow has an old Core 2 CPU, with a 7970 on it, it took me a while to find the suitable mobo (with new PCIE and old Socket 775, haha) and it is not used for anything else except mfakto. You may remember when I was asking here about win32 and after a struggle with it, I installed win64. The monitor, till today, still shows the "black screen of death", with the "you are victim of piracy" window in the middle, which is always covered by the misfit window, haha. I don't use the computer for other things.

Performance-wise: the new mfakto seems a bit faster, but the computer is also less responsive. I decreased the GPUSieveSize to 64 and the ProcessSize to 16; that seems best. BTW, I remember there was a bug long ago that missed some factors when the ProcessSize was 24 - is that fixed? (I have only used 16 and 32 since then, and I see that the default is now set to 24.)

Jayder 2014-08-15 07:36

[QUOTE=Bdot;380430]I've added -64k and -var versions to the current version at the [URL="http://mersenneforum.org/mfakto/mfakto-0.15pre2/mfakto-0.15pre2.zip"]ftp[/URL]. I have tested this version extensively and LaurV also reported successful tests. I'd ask you to run the -st2 selftest with the settings you intended to use, then feel free to use it for your normal TF tasks.[/QUOTE]
I can't properly express my thanks. :smile: I will definitely test it thoroughly, and I will report back for good measure.

Jayder 2014-08-15 12:18

1 Attachment(s)
There are a few things I've noticed already. I did a little searching, but please forgive me if these are already known or are not real issues. In all of my tests I am using the standard 64-bit version of mfakto, not one of the special versions.

The first issue appears to be an old one (present in 0.14): SievePrimes doesn't seem to adjust past a certain point (or in some cases at all) in certain bit ranges. NumThreads is somewhat involved, but is probably not the culprit. I've pasted some of my outputs below; descriptions come before the snippets.


In the following, SievePrimes climbs from 50k and gets stuck somewhere before 182656. The CPU idle is low, but whether it gets even lower or much higher, SievePrimes stays at 182656. Note the "n.a.%".
[CODE][date time] exponent [TF bits]: percent class #, seq GHz/d time | ETA | #FCs | rate | SieveP. | CPU idle
[Aug 15 02:10] M4412033 [63-64]: 21.3% 975/4620,204/960 31.37 1.215s | 15m19s | 44.04M | 36.25M/s | 144321 | 11708us = 20.24%
[Aug 15 02:10] M4412033 [63-64]: 21.4% 976/4620,205/960 30.76 1.239s | 15m35s | 44.04M | 35.54M/s | 162361 | 10475us = 17.76%
[Aug 15 02:10] M4412033 [63-64]: 21.5% 987/4620,206/960 32.68 1.166s | 14m39s | 41.94M | 35.97M/s | 182656 | 4000us = n.a.%
[Aug 15 02:11] M4412033 [63-64]: 21.6% 991/4620,207/960 32.71 1.165s | 14m37s | 41.94M | 36.00M/s | 182656 | 3993us = n.a.%
[Aug 15 02:11] M4412033 [63-64]: 21.7% 1000/4620,208/960 32.71 1.165s | 14m36s | 41.94M | 36.00M/s | 182656 | 3651us = n.a.%[/CODE]


If I set SievePrimes to be higher than 182656, it will not lower itself, even if it is set much higher.
[code][date time] exponent [TF bits]: percent class #, seq GHz/d time | ETA | #FCs | rate | SieveP. | CPU idle
[Aug 15 02:12] M4412033 [63-64]: 23.3% 1068/4620,224/960 26.39 1.444s | 17m43s | 41.94M | 29.05M/s | 300000 | 104us = n.a.%
[Aug 15 02:12] M4412033 [63-64]: 23.4% 1071/4620,225/960 27.59 1.381s | 16m55s | 41.94M | 30.37M/s | 300000 | 105us = n.a.%
[Aug 15 02:12] M4412033 [63-64]: 23.5% 1075/4620,226/960 27.56 1.383s | 16m55s | 41.94M | 30.33M/s | 300000 | 102us = n.a.%
[Aug 15 02:12] M4412033 [63-64]: 23.6% 1080/4620,227/960 27.46 1.388s | 16m57s | 41.94M | 30.22M/s | 300000 | 104us = n.a.%
[Aug 15 02:12] M4412033 [63-64]: 23.8% 1083/4620,228/960 27.54 1.384s | 16m53s | 41.94M | 30.31M/s | 300000 | 92us = n.a.%[/code]
As I mentioned, it's only certain bit ranges, but it also depends on the exponent. I tested both this 4M exponent (above) and an 85M exponent. For the 4M exponent, SievePrimes has trouble adjusting up to 64 bits (64-65 adjusts fine), and the 85M exponent has trouble adjusting up to 68 bits (68-69 adjusts fine).

I noticed all of this first during the selftest (st). Attached are some files. Jayder-NS3 shows that with NumStreams 3 (or less, but not shown here) SPrimes climbs for a while but stops. Jayder-NS4 shows that with NumStreams 4 (or greater, but not shown) SPrimes doesn't change at all.

+/-, s/S, and p/P seem to work as intended, even when SPrimes is stuck as above, but it does not unstick it.


The second thing I noticed is that the time per class for my 4M exponent at 63-64 bits has increased by at least 7%. The other two files in the archive contain brief logs showing this. There seemed to be no difference with the 85M exponent I tested. Settings all the same, computer idle.

Finally, I'm told that my "device does not support double precision operations." I don't know enough to know if this is right or not (it probably is), but I thought I'd check. I have an A4-3420 (with HD 6410D). I know the GPU does not have DP, but your description makes it sound like the DP is for the CPU. I don't know, me dumb. :cmd:

I hope I have helped more than hindered. Thank you again (and kracker, and the many others who've helped).

kracker 2014-08-15 17:29

1 Attachment(s)
-st2 passed on Llano APU(6550D)

Also, -pi info for it. 7770 and HD4600 coming after I finish these assignments... :razz:

Also, I still cannot get my HD4600 detected in any way other than -d 11.
(System with two AMD(7770) cards and the "integrated" one.)

potonono 2014-08-15 23:52

1 Attachment(s)
I'm getting a failure on the selftest for an HD2500.

ERROR: selftest failed for M60004333 (mfakto_cl_63)##

kracker 2014-08-16 01:09

1 Attachment(s)
Radeon HD 7770, passed -st2.

kracker 2014-08-16 15:00

[QUOTE=potonono;380490]I'm getting a failure on the selftest for an HD2500.

ERROR: selftest failed for M60004333 (mfakto_cl_63)##[/QUOTE]

Hmm, looks like cl_mg62 and mfakto_cl_63 in general are failing for me too...(HD4600)

Bdot 2014-08-19 10:51

[QUOTE=kracker;380522]Hmm, looks like cl_mg62 and mfakto_cl_63 in general are failing for me too...(HD4600)[/QUOTE]
It looks like [URL="https://github.com/Bdot42/mfakto/commit/de4dbe2fd1f32f357dcfbc7054d2a54467769589"]this change[/URL] was premature; I'm rolling it back. The Montgomery kernel, however, has not been changed for a while - I'm not sure why that one would be affected.

I'll not release anything within the next two weeks as I have no access to my test machines ...

Antonio 2014-08-26 07:08

[QUOTE=Bdot;380409]Can you tell a bit more about your system:
[LIST][*]Which Graphics drivers (AMD, Intel and/or NVIDIA, and which version)
I see a crash during compile as well when trying to run it on my Quadro FX 880M with NV drivers 334.something. It's an NV driver bug, it used to work with older drivers.[*]Interesting detail that GPUType=INTEL make it work ... that one skips optimization and enables a few workarounds in the code.[*]Does mfakto -d 11 / -d 12 / -d 13 / -d 21 / 22 / 23 / ... try to use other devices? (keep increasing the two digits separately until mfakto tells something like "Error: Only 1 platforms found. Cannot use platform 3..." or "Error: Only 1 devices found. Cannot use device 3..:" Does any of the settings select the HD4000? Is the HD4000 listed in the output of "clinfo"?[*]How did you check the HD4000 is enabled? Does it have a monitor connected?[/LIST][/QUOTE]

Sorry, my fault - at some point the HD4000 had become disabled, once I enabled it again everything was fine.
Also sorry for the delay, was away from my test machine for some time.

Bdot 2014-09-04 14:46

Sorry for the delay on this ... I was on vacation, and had a lot of other stuff to do after returning ...

Thank you for your reports:
[QUOTE=Jayder;380461]The first issue appears to be an old one (present in 0.14): SievePrimes doesn't seem to adjust after a certain point (or in some cases at all) in certain bit ranges. NumThreads is somewhat involved, but is probably not the culprit.
...

As I mentioned, it's only certain bit ranges, but it also depends on the exponent. I tested both this 4M exponent (above) as well as an 85M exponent. For the 4M, the SievePrimes has trouble adjusting up to 64 bits (64-65 adjusting fine) and the 85M exponent has trouble adjusting up to 68 bits (68-69 adjusting fine).

I noticed all of this first during the selftest (st). Attached are some files. Jayder-NS3 shows that with NumStreams 3 (or less, but not shown here) SPrimes climbs for a while but stops. Jayder-NS4 shows that with NumStreams 4 (or greater, but not shown) SPrimes doesn't change at all.

+/-, s/S, and p/P seem to work as intended, even when SPrimes is stuck as above, but it does not unstick it.
[/QUOTE]
Good observation, and an easy explanation: if not every one of the NumStreams has sent at least 2 blocks of factor candidates to the GPU, the resulting timing information is regarded as unreliable and no SievePrimes adjustment is done. It basically means the job is too small to tell anything about CPU utilization during the trial factoring, because CPU and GPU were not running in parallel: each stream prepares a block of factor candidates, sends it off to the GPU, and then starts preparing the second block. Only once the second block is being prepared does the GPU have a chance to run in parallel.

Lowering GridSize will help, as that reduces the block size. If you regularly run such small tasks this may be a good idea anyway, as on average half a block of FCs is wasted per class - if you have only 2 blocks per class, that is 25% wasted. An even better approach might be to run the GPU sieve with MoreClasses=0 for such tasks.
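In other words (a toy Python model of the two rules above, not mfakto's actual code):

```python
def wasted_fraction(blocks_per_class: int) -> float:
    # On average the last block of a class is only half filled with
    # factor candidates, so half a block is wasted per class.
    return 0.5 / blocks_per_class

def may_adjust_sieve_primes(blocks_sent_per_stream: list) -> bool:
    # Timing counts as reliable only once every stream has sent at
    # least 2 blocks to the GPU, i.e. CPU and GPU ran in parallel.
    return all(sent >= 2 for sent in blocks_sent_per_stream)

print(wasted_fraction(2))                  # 0.25 -> the 25% figure above
print(may_adjust_sieve_primes([2, 2, 1]))  # False: one stream sent too few blocks
```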


[QUOTE=Jayder;380461] The second thing which I noticed is that time per class for my 4M exponent, 63-64 bits, has increased by at least 7%. The other two files in the archive contain brief logs showing this. There seemed to be no difference with the 85M exponent I tested. Settings all the same, computer idle.
[/QUOTE]
I need to see if the same kernel is selected as before ... Maybe I did something wrong with the kernel precedence for APUs ... I'll check your logs and come back to that separately.

[QUOTE=Jayder;380461]Finally, I'm told that my "device does not support double precision operations." I don't know enough to know if this is right or not (it probably is), but I thought I'd check. I have an A4-3420 (with HD 6410D). I know the GPU does not have DP, but your description makes it sound like the DP is for the CPU. I don't know, me dumb. :cmd:
[/QUOTE]
"Device" in this case means the device where the OpenCL kernels run, i.e. the GPU part of your APU - and that one is VLIW5, without DP support. No error here, and no problem for running mfakto either. I changed the "WARNING" into an "INFO" for the next version.

[QUOTE=Jayder;380461]I hope I have helped more than hindered. Thank you again (and kracker, and the many others who've helped).[/QUOTE]

I'm really thankful for all the feedback I can get - there's no way for me to test on all the possible devices, so I need your help with that. The same goes for unclear descriptions or behavior: it is not enough that something is clear to me, since I have a special view on mfakto. Please do ask; others may have the same question :smile:

Bdot 2014-09-06 22:47

Finally I managed to create a 74-bit kernel that helps straighten out the performance of mfakto when the factor sizes increase (it moves the big drop out by one more bit). My HD7950@1100MHz now runs 100M candidates:

bits : GHz-days/day
67-68: 448
68-69: 476
69-70: 459
70-71: 416
71-72: 417
72-73: 418
[COLOR=DarkGreen]73-74: 408[/COLOR] <== the new one, was 361 before
74-82: 361

Attempts to achieve this using a new 5x16-bit kernel or an improved Montgomery kernel turned out slow. The solution is a "4x15-bit + 1x16-bit" kernel ...
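For the curious: 4x15 + 1x16 = 76 bits, so the representation comfortably covers 74-bit factor candidates. A little Python sketch of such a split (the exact limb layout here is my illustration, not necessarily the kernel's):

```python
def split_limbs(n: int) -> list:
    # Four low limbs of 15 bits each, plus one top limb of up to 16 bits:
    # 4*15 + 16 = 76 bits, enough for a 74-bit factor candidate.
    limbs = [(n >> (15 * i)) & 0x7FFF for i in range(4)]
    limbs.append(n >> 60)
    assert limbs[4] < (1 << 16), "value exceeds 76 bits"
    return limbs

def join_limbs(limbs: list) -> int:
    # Reassemble the value from its limbs.
    return sum(limb << (15 * i) for i, limb in enumerate(limbs[:4])) + (limbs[4] << 60)

candidate = (1 << 73) + 12345          # a 74-bit candidate
assert join_limbs(split_limbs(candidate)) == candidate
```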

Rodrigo 2014-09-06 23:04

Wonderful, thank you! :tu:

I'll be watching this channel for news of the new version's official release.

Rodrigo

kracker 2014-09-07 03:01

[QUOTE=Bdot;382330]Finally I managed to create a 74-bit kernel that helps straightening out the performance of mfakto when the factor sizes increase (it moves out the big drop one more bit). My HD7950@1100MHz now runs 100M candidates:

bits : GHz-days/day
67-68: 448
68-69: 476
69-70: 459
70-71: 416
71-72: 417
72-73: 418
[COLOR=DarkGreen]73-74: 408[/COLOR] <== the new one, was 361 before
74-82: 361

Attempts of achieving this using a new 5x16-bit kernel or an improved montgomery kernel yielded slow results. The solution is a "4x15-bit + 1x16-bit" kernel ...[/QUOTE]

Niice :smile:
If you need any testing/ers, I'm up for it :razz:

Jayder 2014-09-07 03:18

Thanks for the reply and the great work, Bdot. :tu:

LaurV 2014-09-07 04:07

I'm available for [testing/playing with a beta] too. Eager to raise the limit of my Misfit from 73 to 74 :wink:
Very good job, Bdot! (as usual)

Bdot 2014-09-09 08:42

[QUOTE=Bdot;382095]
I need to see if the same kernel is selected as before ... Maybe I did something wrong with the kernel precedence for APUs ... I'll check your logs and come back to that separately.
[/QUOTE]
The correct kernel was selected, and everything else also seems to be OK. More detailed testing has shown that a performance improvement for GCN has adverse effects on VLIW5 :rant:

Thanks again, Jayder, for pointing out this issue to me. I'm not yet sure how to address this, though ...

This goes in line with another observation regarding the use of double precision: on my HD7950 it improves performance by 5%; on the HD7850, performance drops by 7%. It looks like a lot of device-dependent #ifdefs will need to go into the kernel files, which I have tried to avoid so far (the IntelHD bugs were the first to require that). I may also need to create a separate device class for the high-end GCNs because of their faster DP performance.

Thank you all for your offers to test ... with these additional changes coming, I think it makes no sense to send out a test version right now.

I'll come back to you ...

VictordeHolland 2014-09-09 13:23

[QUOTE=Bdot;382330]Finally I managed to create a 74-bit kernel that helps straightening out the performance of mfakto when the factor sizes increase (it moves out the big drop one more bit). My HD7950@1100MHz now runs 100M candidates:

bits : GHz-days/day
67-68: 448
68-69: 476
69-70: 459
70-71: 416
71-72: 417
72-73: 418
[COLOR=DarkGreen]73-74: 408[/COLOR] <== the new one, was 361 before
74-82: 361

Attempts of achieving this using a new 5x16-bit kernel or an improved montgomery kernel yielded slow results. The solution is a "4x15-bit + 1x16-bit" kernel ...[/QUOTE]
Great news, well done!

AK76 2014-10-07 13:52

[QUOTE=Bdot;382330]My HD7950@1100MHz now runs 100M candidate.[/QUOTE]

Does 100M mean 2^100M or 100M digits?

snme2pm1 2014-10-08 06:14

[QUOTE=AK76;384579]Does 100M mean 2^100M or 100M digits?[/QUOTE]

The production rate figures quoted for that HD7950 1100MHz are not too far removed from rates for my HD7950 1000MHz working on exponents in the 118 million space, (i.e. 2 power p minus 1).
I've been watching this space anticipating a new release for 74 bit exploration, but can probably only use that out of hours, since I've never been able to adequately duck lack of responsiveness issues.

Bdot 2014-10-08 07:42

[QUOTE=AK76;384579]Does 100M mean 2^100M or 100M digits?[/QUOTE]
I used a 100M exponent for that test.
[QUOTE=snme2pm1;384659]The production rate figures quoted for that HD7950 1100MHz are not too far removed from rates for my HD7950 1000MHz working on exponents in the 118 million space, (i.e. 2 power p minus 1).
I've been watching this space anticipating a new release for 74 bit exploration, but can probably only use that out of hours, since I've never been able to adequately duck lack of responsiveness issues.[/QUOTE]
There are a few parameters you can try to tweak for better responsiveness: low but non-zero FlushInterval (3, 2, 1), lower GPUSieveSize and lower GPUSieveProcessSize should each help. I'd try tweaking them in this order for best responsiveness-gain per performance-loss ratio.
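
For reference, the three knobs mentioned above live in mfakto.ini. A minimal sketch of a responsiveness-oriented configuration follows; the values are illustrative starting points taken from this discussion, not tuned recommendations for any particular card:

```ini
; mfakto.ini fragment (sketch) -- trade a little throughput for responsiveness
FlushInterval=2        ; low but non-zero: flush queued work to the GPU more often
GPUSieveSize=64        ; smaller GPU sieve => shorter kernel runs
GPUSieveProcessSize=16 ; smaller per-kernel work chunk
```

Per Bdot's ordering, FlushInterval is worth trying first, then GPUSieveSize, then GPUSieveProcessSize.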

The new release will have to wait a bit longer as I currently have very little time to work on it (and there are still a few things to do).

AK76 2014-10-09 17:42

On my ATI R9 290 I use FlushInterval=0; 3, 2 and 1 all work much worse than 0.

GPUSieveSize=5 or 6

GPUSieveProcessSize=16

GPUSievePrimes= between 30000 and 80000

For example: today I ran a 70M exponent at bit level 71-72 at a rate of 2500M/s, 430 GHz-d/day.

Soon I will test 100M candidates at different bit levels, to compare my GPU performance with Bdot's 7950.

VictordeHolland 2014-10-09 19:01

[QUOTE=AK76;384792]On my ATI R9 290 i use FlushInterval=0. 3,2,1 works much worse than "0".
[/QUOTE]
Less responsive or less throughput (or both)???

AK76 2014-10-09 20:27

Less throughput.

Bdot 2014-10-10 22:27

Ah, thanks for that clarification!

Yes, of course, the best throughput is achieved when the GPU is not shared with anything, especially not 3D-Games or screen-updates in general :smile:.

My suggestion was meant towards a more responsive system at the cost of as little as possible throughput.

Regarding your performance measurements: throughput should scale linearly with the GPU clock speed (or shader clock). Memory clock has very little influence.
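
As a worked illustration of that linear-scaling claim (my own back-of-envelope sketch, not mfakto code), projecting the ~408 GHz-d/day figure quoted earlier for an HD7950 @ 1100 MHz down to a stock 1000 MHz card:

```python
def scaled_throughput(rate_ghzd_per_day, clock_from_mhz, clock_to_mhz):
    """Project TF throughput to a different core clock, assuming linear scaling."""
    return rate_ghzd_per_day * clock_to_mhz / clock_from_mhz

# HD7950 @ 1100 MHz at ~408 GHz-d/day (73-74 bits, from earlier in this thread),
# projected to 1000 MHz:
print(round(scaled_throughput(408.0, 1100.0, 1000.0), 1))  # ~370.9
```

That is roughly in line with snme2pm1's observation that the two cards' rates are "not too far removed" from each other.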

snme2pm1 2014-10-11 08:10

[QUOTE=Bdot;384939]My suggestion was meant towards a more responsive system at the cost of as little as possible throughput.[/QUOTE]

I've read that over several times, but can't convince myself that it can not be entirely misunderstood.
I recognise that the writer is not necessarily english first.
I suspect that various minor inflections might have conveyed a more intended meaning.
One such moderation would be to suggest an intention of a more responsive system at the cost of as little as possible throughput [B]reduction[/B].

Bdot 2014-10-11 12:31

Oh .. I see :blush:. Thank you for trying to extract what I really meant (I wish all forum members did that consistently). I probably wanted to say something like "the responsiveness improvement should cost you as little performance as possible" ...

And you're totally right about English not being my first language. It was actually the third I started to learn.

Thinking about this all again, it would probably be much easier for me to aim for as little throughput as possible than what I was trying over the past few years :gah:

AK76 2014-10-11 21:02

[QUOTE=AK76;384792]Soon I will test 100M candidates at different bit levels, to compare my GPU performance with Bdot's 7950.[/QUOTE]

My Gigabyte R9 290 results:

TF 100M exponent 65-66 - 2550 M/s - 430 GHz-days/day
TF 100M exponent 66-67 - 2680 M/s - 440 GHz-days/day
TF 100M exponent 67-68 - 2850 M/s - 480 GHz-days/day
TF 100M exponent 68-69 - 2880 M/s - 484 GHz-days/day
TF 100M exponent 69-70 - 2690 M/s - 450 GHz-days/day
TF 100M exponent 70-71 - 2420 M/s - 405 GHz-days/day
TF 100M exponent 71-72 - 2400 M/s - 400 GHz-days/day
TF 100M exponent 72-74 - 2450 M/s - 410 GHz-days/day

GPUSieveSize=4
GPUSieveProcessSize=8
GPUSievePrimes=30000

Bdot 2014-10-15 19:53

[QUOTE=AK76;384991]My Gigabyte R9 290 results:

TF 100M exponent 65-66 - 2550 M/s - 430 GHz-days/day
TF 100M exponent 66-67 - 2680 M/s - 440 GHz-days/day
TF 100M exponent 67-68 - 2850 M/s - 480 GHz-days/day
TF 100M exponent 68-69 - 2880 M/s - 484 GHz-days/day
TF 100M exponent 69-70 - 2690 M/s - 450 GHz-days/day
TF 100M exponent 70-71 - 2420 M/s - 405 GHz-days/day
TF 100M exponent 71-72 - 2400 M/s - 400 GHz-days/day
TF 100M exponent 72-74 - 2450 M/s - 410 GHz-days/day

GPUSieveSize=4
GPUSieveProcessSize=8
GPUSievePrimes=30000[/QUOTE]

Are these the settings that you use for best screen responsiveness? I'd also be interested to see what you get for high-performance settings, e.g.
GPUSieveSize=126
GPUSieveProcessSize=24
GPUSievePrimes=67000
FlushInterval=0

And I think it's a typo that you got the same 410 GHz-d/day for both 72-73 and 73-74 ... Or did you download the mfakto sources from GitHub and compile it yourself?

Anyway, I don't have sufficient time right now to track down the remaining performance issue. So I will post another test version soon so that a few more people can run these tests and tell me whether it got better or worse for them ... I have the feeling that I reached some limit of the optimizer because I added so much code. Maybe it is not trying as hard as before ...

AK76 2014-10-15 20:34

[QUOTE=Bdot;385277]Are these the settings that you use for best screen responsiveness? I'd also be interested to see what you get for high-performance settings, e.g.
GPUSieveSize=126
GPUSieveProcessSize=24
GPUSievePrimes=67000
FlushInterval=0[/QUOTE]
I get the biggest GHz-d/day with these settings, but I will try your numbers in a few days.

[QUOTE]And I think it's a typo that you got the same 410 GHz for 72-73 and 73-74 ... Or did you download the mfakto sources from GitHub and compile it yourself?[/QUOTE]
That was my error; of course it should be 72-73. I don't know how to compile mfakto from sources.

Mark Rose 2014-10-22 05:54

Has anyone got the iGPU stuff working on Linux? I haven't tried yet, but I have access to a few Haswells and Ivy Bridges...

Prime95 2014-10-22 14:25

[QUOTE=Mark Rose;385747]Has anyone got the iGPU stuff working on Linux? I haven't tried yet, but I have access to a few Haswells and Ivy Bridges...[/QUOTE]

Last I read Intel had no plans to support OpenCL on Linux for their integrated GPUs.

Mark Rose 2014-10-22 15:06

[QUOTE=Prime95;385761]Last I read Intel had no plans to support OpenCL on Linux for their integrated GPUs.[/QUOTE]

Intel released the first support in January 2013. Apparently the guy who first created it left the company, but the work was continued by Intel China.

Prime95 2014-10-22 22:18

[QUOTE=Mark Rose;385764]Intel released the first support in January 2013. Apparently the guy who first created it left the company, but the work was continued by Intel China.[/QUOTE]

I looked again at Intel's web site - still no love for Linux that I could see: [url]https://software.intel.com/en-us/intel-opencl/details#devices[/url]

Can you point us to a place to get Linux drivers for OpenCL support with the integrated GPU?

potonono 2014-10-22 23:00

[QUOTE=Prime95;385795]I looked again at Intel's web site - still no love for Linux that I could see: [url]https://software.intel.com/en-us/intel-opencl/details#devices[/url]

Can you point us to a place to get Linux drivers for OpenCL support with the integrated GPU?[/QUOTE]

Maybe [URL="https://software.intel.com/en-us/articles/opencl-drivers"]https://software.intel.com/en-us/articles/opencl-drivers[/URL] here?

AK76 2014-10-23 13:57

[QUOTE=AK76;385283]I get the biggest GHz-d/day with these settings, but I will try your numbers in a few days.
That was my error; of course it should be 72-73. I don't know how to compile mfakto from sources.[/QUOTE]

I compared both settings, mine and the suggested ones. On my 290, mine are better.

Prime95 2014-10-23 21:10

[QUOTE=potonono;385800]Maybe [URL="https://software.intel.com/en-us/articles/opencl-drivers"]https://software.intel.com/en-us/articles/opencl-drivers[/URL] here?[/QUOTE]

I don't see any Linux drivers there for HD graphics.

Bdot 2014-10-24 11:38

[QUOTE=AK76;385840]I compared both settings, mine and the suggested ones. On my 290, mine are better.[/QUOTE]
OK, that is really interesting, because it means that the GCN chips in the R290 are really different from their predecessors. Or in other words, I would need one in order to optimize for it.

Or ... OK. I will post another pre-release version on the weekend that has an enhanced --perftest mode that exactly measures each kernel. Together with a script and a set of ini files (and maybe some alternative kernel files) I should be able to provide an automatic test, if you'd be willing to run such a thing ...

kracker 2014-10-24 14:30

[QUOTE=Bdot;385956]OK, that is really interesting because it means that the GCN chips in the R290 are really different from its predecessors. Or in other words, I would need one in order to optimize for it.

Or ... OK. I will post another pre-release version on the weekend that has an enhanced --perftest mode that exactly measures each kernel. Together with a script and a set of ini files (and maybe some alternative kernels files) I should be able to provide an automatic test, if you'd be willing to run such a thing ...[/QUOTE]

Also... the low-end "older" GCN cards (77xx, 78xx etc.) have 1/16 DP, the high-end ones have 1/4 DP, and the "new" GCN cards (290(X)) have 1/8 DP.... :mike:

AK76 2014-10-24 18:49

2 questions:

1. How do I factor bitlevels below 64 with mfakto?

2. How do I LLR-test with an ATI GPU?

Bdot 2014-10-24 19:14

[QUOTE=kracker;385967]Also... the low end "older" GCN cards(77xx,78xx etc) have 1/16 DP, the high end ones have 1/4 DP, and the "new" GCN cards (290(X)) have 1/8 DP.... :mike:[/QUOTE]
Indeed. It would be good to see whether 1/8 is sufficient to show an improvement over SP. 1/16 is definitely too slow. I'll add that to the tests I'm going to prepare.

Bdot 2014-10-24 19:31

[QUOTE=AK76;386003]2 questions:

1. How do I factor bitlevels below 64 with mfakto?

2. How do I LLR-test with an ATI GPU?[/QUOTE]

1. From 2[sup]60[/sup] upwards you just use it like the higher levels. For large exponents it may be necessary to list each bitlevel separately in the worktodo file; otherwise mfakto may decide to combine the bitlevels (e.g. it tries 2[sup]60[/sup] to 2[sup]63[/sup] in one go but then fails to find a kernel that can handle multiple bitlevels at once). For very short tests, setting MoreClasses=0 will give a good performance improvement.
Below 2[sup]60[/sup] you need to switch to CPU sieving, because no GPU-sieve-enabled kernel can handle those. MoreClasses=0 is not available for the CPU sieve, so mfakto may not be the best choice for such tests.

2. Maybe the [URL="http://mersenneforum.org/showthread.php?t=16142"]GPU LL Testing FAQ[/URL] needs an update to mention clLucas? There's an [URL="http://www.mersenneforum.org/showthread.php?t=18297&page=26"]LL with OpenCL[/URL] thread for it.
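
To list each bitlevel separately as described in point 1, a worktodo.txt might look like the sketch below. The `Factor=` syntax is assumed to follow the mfaktc convention, with `N/A` standing in for a PrimeNet assignment key, and the exponent is just the one discussed earlier in this thread:

```ini
Factor=N/A,66362159,60,61
Factor=N/A,66362159,61,62
Factor=N/A,66362159,62,63
```

With one line per bitlevel, mfakto cannot combine them into a single multi-bit run.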

AK76 2014-10-25 20:44

[QUOTE=Bdot;386006]There's an [URL="http://www.mersenneforum.org/showthread.php?t=18297&page=26"]LL with OpenCL[/URL] thread for it.[/QUOTE]

Thx for this info. My first LLR test on ATI took 5h 30min.

M( 64847711 )C, 0xffffffff80000000, n = 131072, clLucas v1.02

VBCurtis 2014-10-26 03:24

[QUOTE=AK76;386082]Thx for this info. My first LLR test on Ati took 5h and 30min.

M( 64847711 )C, 0xffffffff80000000, n = 131072, clLucas v1.02[/QUOTE]

If you read through that thread, you'll see that the FFT size your run used was much too small, making your result meaningless. A 64M test will not finish in 5 hrs on any card.
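
To quantify why that FFT size is far too small (a back-of-envelope rule of thumb of my own, not clLucas logic): an LL residue for M(p) has p bits, so an FFT of length n stores roughly p/n bits per word, and much beyond ~20 bits/word the round-off error in double-precision FFT multiplication makes the result meaningless.

```python
def bits_per_word(p, fft_len):
    """Rough bits stored per FFT word when squaring a p-bit LL residue."""
    return p / fft_len

# The run quoted above: p = 64847711 with n = 131072
print(round(bits_per_word(64847711, 131072), 1))  # ~494.7 -- hundreds of bits per word, far too small an FFT
```

A sane FFT length for a 64M exponent would be in the multi-million range, which is why such a test cannot finish in 5 hours.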

AK76 2014-10-26 10:17

I did not set the -f parameter myself; clLucas chose this number. I will run the same test one more time. My ATI is not overclocked.

Bdot 2014-10-26 17:35

mfakto-0.15pre4 available
 
There's a [URL="http://mersenneforum.org/mfakto/mfakto-0.15pre4/mfakto-0.15pre4.zip"]new pre-version[/URL] out on the ftp. It comes with a test script to run various settings and measure the different kernels.

To run this test, you should not need the computer for anything else for a while: the full test will take between one and two hours. During this time it would be best not to use the computer at all, if possible. Watching videos or running CPU or GPU hogs will definitely make the test results worthless.

To start the test, just run

perftestmfakto.cmd

After it finishes, just zip the testresults folder that it created and post it here or send it to me (email address in the README).

The test will warn that some ini files do not contain the "TestSievePrimes" parameter; this is intentional.
The test script may need adaptations if you want to use a specific GPU (then please add the appropriate -d switch).
The provided ini files may need adaptations if you want to run it on a predecessor of GCN (adapt VectorSize to your usual settings, most likely 4).
If you think other variations are worth testing, please go ahead (and let me know :smile:).
I successfully tested it on two different CPUs (-d c) - I hope it also works with the IntelHD devices.

kracker 2014-10-27 14:42

2 Attachment(s)
:smile:

NickOfTime 2014-10-27 20:51

1 Attachment(s)
290x

Bdot 2014-10-27 22:28

Very nice! All seems to work as expected. I'll need a bit more time to go through the results; the first things I noticed:
[LIST][*]290x seems to have major improvements in int32 performance: the 32-bit kernels are now 20% faster than the 15-bit ones (on previous GCN they are 15% slower). As the current code does not yet honor this, expect a 20% performance boost with the next mfakto version, and no more performance drop at the 73-bit boundary.[*]290x behaves pretty much like GCN regarding ini file settings: according to these tests, GPUSieveSize=126 and GPUSieveProcessSize=24 should be fastest on this card as well.[*]290x: measuring the CPU-sieve-based TF kernels only worked for the smallest exponent; the other results are way too low - either something overflowed, some throttling kicked in, or my test did not fully utilize the GPU.[*]The 1/8 DP rate on 290x is not sufficient to give DP calculations an advantage over SP. Therefore, only Tahiti and Malta chips will use DP in mfakto. Does anyone still have an HD5870/5850 sitting around? That one would also be a good candidate.[*]HD4600 worked well, delivering 18-19 GHz-days/day for the current LL test range.[*]The performance dependency on the exponent size is stronger than I expected, e.g. 290x: 975GHz (2M), 770GHz (39M), 739GHz (78M), 684GHz (332M), 616GHz (4200M).[*]I forgot to include an m-gs-128-32.ini file. Could you please copy from m-gs-128-16.ini and set GPUSieveProcessSize=32? Then, please run mfakto [-d ..] -i m-gs-128-32.ini --perftest > m-gs-128-32.log
I think I know the outcome for HD7770, but HD4600 and R290x would be interesting.[/LIST]

tului 2014-10-28 13:49

1 Attachment(s)
A10-5745M results.

Bdot 2014-10-28 22:31

[QUOTE=tului;386305]A10-5745M results.[/QUOTE]
Thank you for that one too. Your A10 will most likely benefit from using VectorSize=4. Would you mind changing that in the .ini files and just rerunning the m-gs-<nn>-<nn> tests (skip the long-running *fulltest.ini and *GCN*ini tests)? And please add an m-gs-128-32.ini file: copy from m-gs-128-16.ini and set GPUSieveProcessSize=32. Then, please run mfakto [-d ..] -i m-gs-128-32.ini --perftest > m-gs-128-32.log

Now that I have the "automatic" evaluation of kernel speeds, I will think about auto-adjusting VectorSize as well, but we're not there yet.

NickOfTime 2014-10-29 19:42

1 Attachment(s)
[QUOTE=Bdot;386275]Very nice! All seems to work as expected. I'll need a bit more time to go through the results, the first things I noticed:
[LIST][*]290x seems to have major improvements in int32 performance: the 32-bit kernels are now 20% faster than the 15-bit ones (on previous GCN, they are 15% slower). As the current code does not yet honor this, expect a 20% performance boost with the next mfakto version, and no more performance drop at the 73-bit-boundary.[*]290x behaves pretty much like GCN regarding ini file settings: according to these tests, GPUSieveSize=126 and GPUSieveProcessSize=24 should be fastest on this card as well.[*]290x: Measuring the CPU-sieve-based TF kernels only worked for the smallest exponent, the other results are way too low - either something overflowed, some throttling kicked in or my test did not fully utilize the GPU.[*]1/8 DP rate on 290x is not sufficient to give DP calculations an advantage over SP. Therefore, only Tahiti and Malta chips will use DP in mfakto. Has anyone still some HD5870/5850 sitting around? This one would also be a good candidate.[*]HD4600 worked well, delivering 18-19 GHz-days/day for current LL test range.[*]performance dependency to the exponent size is stronger than I expected, e.g. 290x: 975GHz (2M), 770GHz (39M), 739GHz (78M), 684GHz (332M), 616GHz (4200M).[*]I missed to include an m-gs-128-32.ini file. Could you please copy from m-gs-128-16.ini and set GPUSieveProcessSize=32. Then, please run mfakto [-d ..] -i m-gs-128-32.ini --perftest > m-gs-128-32.log
I think I know the outcome for HD7770, but HD4600 and R290x would be interesting.[/LIST][/QUOTE]

[ATTACH]11893[/ATTACH] edit: hmm, and an extra 0.5 ghz-day/day if numstreams=5 vs 3...

kracker 2014-10-30 23:51

1 Attachment(s)
[QUOTE=Bdot;386275][LIST][*]I missed to include an m-gs-128-32.ini file. Could you please copy from m-gs-128-16.ini and set GPUSieveProcessSize=32. Then, please run mfakto [-d ..] -i m-gs-128-32.ini --perftest > m-gs-128-32.log
I think I know the outcome for HD7770, but HD4600 and R290x would be interesting.[/LIST][/QUOTE]

Late result, sorry...
Intel HD4600

AK76 2014-10-31 21:45

1 Attachment(s)
[ATTACH]11917[/ATTACH]

kracker 2014-10-31 23:05

[QUOTE=AK76;386588][ATTACH]11917[/ATTACH][/QUOTE]

Something's quite wrong there... the results are quite different from NickOfTime's 290X results; the 290 should be not much slower than its X version.

NickOfTime 2014-11-01 02:35

[QUOTE=kracker;386593]Something's quite wrong there... the results are quite different from NickOfTime's 290X results; the 290 should be not much slower than its X version.[/QUOTE]

Well, mine are XFX 290x Double Dissipation cards (PCIe x16) running around 66-70C, so there is no throttling...
Hmm, it is probably the Catalyst version: since OpenCL is compiled at runtime, there can be a lot of variation in performance there. I am using 14.3.

Cruelty 2014-11-01 13:05

1 Attachment(s)
Attached are results for the [URL="http://www.sapphiretech.com/presentation/product/?cid=1&gid=3&sgid=1227&pid=2091&psn=&lid=1&leg=0"]Sapphire R9 290 Tri-X OC[/URL]. As a bonus I've also attached the GPU-Z log for the entire run :smile:

AK76 2014-11-01 16:37

[QUOTE=kracker;386593]Something's quite wrong there... the results are quite different from NickOfTime's 290X results; the 290 should be not much slower than its X version.[/QUOTE]

Hmm, I don't really know where the problem is. I ran mfakto on Catalyst 14.4 and 14.9 - the results are practically the same.

My platform is a 6-year-old Asus P5K Pro with a Xeon E5440 at 2.83 GHz, overclocked to 3.7 GHz. The default FSB is 333 MHz and it is now set to 443 MHz. Might that cause the problem?

kracker 2014-11-02 02:40

[QUOTE=AK76;386651]Hmm, I don't really know where the problem is. I ran mfakto on Catalyst 14.4 and 14.9 - the results are practically the same.

My platform is a 6-year-old Asus P5K Pro with a Xeon E5440 at 2.83 GHz, overclocked to 3.7 GHz. The default FSB is 333 MHz and it is now set to 443 MHz. Might that cause the problem?[/QUOTE]
Hmm... how are your thermals?

On another note, I've ordered a R9 285... :smile: Will have results by Tuesday/Wednesday.

kracker 2014-11-02 02:45

[QUOTE=Cruelty;386635]Attached results for [URL="http://www.sapphiretech.com/presentation/product/?cid=1&gid=3&sgid=1227&pid=2091&psn=&lid=1&leg=0"]Sapphire R9 290 Tri-X OC[/URL]. As a bonus I've put also log from GPU-Z for entire run :smile:[/QUOTE]

Nice! :smile: Well, here's a comparison of your 290 with AK76's 290... something's wrong.

[code]
5. GPU tf kernels, exponent=66362159 ... calibrating
5. GPU tf kernels, exponent=66362159, 12287M FCs each
k=2223766598517, 0.900843 GHz-days (assignment), 0.025120 GHz-days (per test)
cl_barrett32_79_gs [64-79]: 3609.79 ms ==> 3569.43M FCs/s ==> 601.23 GHz-days/day
cl_barrett32_77_gs [64-77]: 3230.19 ms ==> 3988.90M FCs/s ==> 671.89 GHz-days/day
cl_barrett32_76_gs [64-76]: 3097.96 ms ==> 4159.15M FCs/s ==> 700.57 GHz-days/day
cl_barrett32_92_gs [65-92]: 5107.85 ms ==> 2522.57M FCs/s ==> 424.90 GHz-days/day
cl_barrett32_88_gs [65-88]: 3778.44 ms ==> 3410.12M FCs/s ==> 574.40 GHz-days/day
cl_barrett32_87_gs [65-87]: 3610.30 ms ==> 3568.93M FCs/s ==> 601.15 GHz-days/day
cl_barrett15_73_gs [60-73]: 3728.22 ms ==> 3456.05M FCs/s ==> 582.14 GHz-days/day
cl_barrett15_69_gs [60-69]: 3141.44 ms ==> 4101.59M FCs/s ==> 690.87 GHz-days/day
cl_barrett15_70_gs [60-69]: 3145.50 ms ==> 4096.30M FCs/s ==> 689.98 GHz-days/day
[/code][code]
5. GPU tf kernels, exponent=66362159 ... calibrating
5. GPU tf kernels, exponent=66362159, 6143M FCs each
k=2223766598517, 0.900843 GHz-days (assignment), 0.012560 GHz-days (per test)
cl_barrett32_79_gs [64-79]: 2755.63 ms ==> 2337.92M FCs/s ==> 393.80 GHz-days/day
cl_barrett32_77_gs [64-77]: 2554.79 ms ==> 2521.71M FCs/s ==> 424.76 GHz-days/day
cl_barrett32_76_gs [64-76]: 2486.19 ms ==> 2591.30M FCs/s ==> 436.48 GHz-days/day
cl_barrett32_92_gs [65-92]: 3022.35 ms ==> 2131.60M FCs/s ==> 359.05 GHz-days/day
cl_barrett32_88_gs [65-88]: 2844.33 ms ==> 2265.02M FCs/s ==> 381.52 GHz-days/day
cl_barrett32_87_gs [65-87]: 2756.18 ms ==> 2337.46M FCs/s ==> 393.72 GHz-days/day
cl_barrett15_73_gs [60-73]: 2815.09 ms ==> 2288.54M FCs/s ==> 385.48 GHz-days/day
cl_barrett15_69_gs [60-69]: 2507.41 ms ==> 2569.36M FCs/s ==> 432.78 GHz-days/day
cl_barrett15_70_gs [60-69]: 2508.61 ms ==> 2568.14M FCs/s ==> 432.58 GHz-days/day
[/code]
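
For anyone comparing several of these perftest logs, here is a small parsing sketch. The regex and field names are inferred from the kernel-result lines visible in the excerpts above, not from mfakto's sources:

```python
import re

# Matches lines like:
# cl_barrett32_79_gs [64-79]: 3609.79 ms ==> 3569.43M FCs/s ==> 601.23 GHz-days/day
LINE = re.compile(
    r"(?P<kernel>cl_\w+)\s+\[(?P<bits>[\d-]+)\]:\s+"
    r"(?P<ms>[\d.]+) ms ==> (?P<mfcs>[\d.]+)M FCs/s ==> (?P<ghzd>[\d.]+) GHz-days/day"
)

def parse_rates(log_text):
    """Map kernel name -> GHz-days/day for every kernel-result line in a log."""
    return {m["kernel"]: float(m["ghzd"]) for m in LINE.finditer(log_text)}

sample = "cl_barrett32_79_gs [64-79]: 3609.79 ms ==> 3569.43M FCs/s ==> 601.23 GHz-days/day"
print(parse_rates(sample))  # {'cl_barrett32_79_gs': 601.23}
```

Running it over the two logs above and dividing kernel-by-kernel would show the second 290 at roughly two-thirds of the first one's throughput across the board, which is what makes the result look like throttling rather than a kernel-specific issue.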

AK76 2014-11-02 10:38

[QUOTE=kracker;386692]Hmm... how are your thermals?

On another note, I've ordered a R9 285... :smile: Will have results by Tuesday/Wednesday.[/QUOTE]

GPU-Z shows 77-80 °C and 100% load.

Cruelty 2014-11-02 15:46

[QUOTE=AK76;386707]GPU-Z shows 77-80 °C and 100% load.[/QUOTE]

It would be interesting to see the full GPU-Z readings - your GPU may throttle not because of high GPU temperature but because of high VRM temperature. I had such a problem with an HD7870XT a while ago...

AK76 2014-11-02 17:14

1 Attachment(s)
[QUOTE=Cruelty;386718]It would be interesting to see the full GPU-Z readings - your GPU may throttle not because of high GPU temperature but because of high VRM temperature. I had such a problem with an HD7870XT a while ago...[/QUOTE]

GPU log

Cruelty 2014-11-03 09:29

[QUOTE=AK76;386723]GPU log[/QUOTE]

I don't see throttling or thermal issues in this log :huh: The only issue I see there is the +12V rail, which according to the ATX spec should not fall below 11.4V. Of course you would have to confirm that with a digital multimeter before an RMA; nevertheless it shouldn't impact GPU performance, since the card is running at stock specs.

The other things that come to my mind:[LIST][*]PCI-E link speed - but I don't think it would cause you trouble even at v1.0 x1[*]something else consuming your system resources (Prime95, LLR, PFGW, other?)[*]software issues - e.g. a messy driver installation (installing a new version without prior uninstallation of the old one)[*]overclocking issues - try going back to FSB @ 333MHz and check whether that solves your problem. BTW: what is your PCI-E frequency @ FSB=443MHz? Using the default multipliers it may be running way out of spec and thus causing trouble.[/LIST]

Bdot 2014-11-03 23:58

mfakto-0.15pre5 available
 
I've uploaded another test version to the [URL="http://mersenneforum.org/mfakto/mfakto-0.15pre5/"]ftp[/URL] with improvements for GCN 1.2 devices and for the perftest.

[LIST][*]I now added a GPU type "GCN3" for Hawaii, Bonaire, Tonga and Vesuvius, which should improve "normal" TF speed by ~20%
Please confirm if you have one of these (Bonaire = HD7790 is just a guess of mine).[*]you have sent so many test results to me that I noticed I should have sorted the --perftest output ...[*] --perftest now displays mfakto version and GPU info[*]--perftest now also shows a nice performance summary for each exponent[*]using "OCLCompileOptions=+ -USE_DP" mfakto can now be switched to double precision without changing the GPU type[*]there was indeed an overflow bug in the CPU-sieved perftest (only on cards faster than mine :ouch1:)[/LIST]For cards not in the GCN3 category there should not be a lot of changes - no need to rerun any test. But if anyone wants to run another perftest, then please use this version for the nicer output.


Thank you all for your help.

kracker 2014-11-04 20:00

1 Attachment(s)
HD4600 results... Tonga and APU coming later :smile:

NickOfTime 2014-11-04 21:40

[QUOTE=Bdot;386818]I've uploaded another test version to the [URL="http://mersenneforum.org/mfakto/mfakto-0.15pre5/"]ftp[/URL] with improvements for GCN 1.2 devices and for the perftest.

[LIST][*]I now added a GPU type "GCN3" for Hawaii, Bonaire, Tonga and Vesuvius, which should improve "normal" TF speed by ~20%
Please confirm if you have one of these (Bonaire = HD7790 is just a guess of mine).[*]you have sent so many test results to me that I noticed I should have sorted the --perftest output ...[*] --perftest now displays mfakto version and GPU info[*]--perftest now also shows a nice performance summary for each exponent[*]using "OCLCompileOptions=+ -USE_DP" mfakto can now be switched to double precision without changing the GPU type[*]there was indeed an overflow bug in the CPU-sieved perftest (only on cards faster than mine :ouch1:)[/LIST]For cards not in the GCN3 category there should not be a lot of changes - no need to rerun any test. But if anyone wants to run another perftest, then please use this version for the nicer output.


Thank you all for your help.[/QUOTE]

Hmm, I believe my Radeon 260X is Bonaire; I'll test that as well as my 290x again...

NickOfTime 2014-11-05 02:28

1 Attachment(s)
[QUOTE=NickOfTime;386870]Hmm, I believe my Radeon 260X is Bonaire; I'll test that as well as my 290x again...[/QUOTE]

[ATTACH]11930[/ATTACH]

NickOfTime 2014-11-05 02:35

2 Attachment(s)
And the 290x. Since I am reconfiguring one machine, one card is running at PCIe x8 @74C and the other at x16 @64C...
[ATTACH]11932[/ATTACH]
[ATTACH]11933[/ATTACH]

kracker 2014-11-05 15:32

Well... this is depressing. I've reinstalled the driver about three times now for my new 285; the driver crashes while running mfakto's test... :sad:

EDIT: Running the "regular" 0.14 mfakto on regular work doesn't crash... Weird.

NickOfTime 2014-11-05 16:05

[QUOTE=NickOfTime;386890]And the 290x. Since I am reconfiguring one machine one card is running at 8x @74C and the other at 16x @64C for pcie...
[ATTACH]11932[/ATTACH]
[ATTACH]11933[/ATTACH][/QUOTE]

Hmm, I tried running it concurrently on my 290x's in the 70M 74-bit range, and I start seeing fluctuations in GHz-d/d, GPU core clock and GPU utilization, resulting in what looks like an overall throughput of 900 GHz-d/d instead of the potential 1450 GHz-d/d...

On 0.14 I have 100% utilization, but the overall throughput drops by 25 GHz-d/d for both cards (525 to 500), so 1000 GHz-d/d of a potential 1050...

Are we hitting a memory/bus bandwidth limit? Asus Z87-WS, i7-4770S, 8GB DDR3-1866

kracker 2014-11-05 18:48

[QUOTE=kracker;386926]Well... this is depressing. I reinstalled the driver about three times now for my new 285, the driver crashes while running mfakto's test... :sad:

EDIT: Running the "regular" 0.14 mfakto on regular work doesn't crash... Weird.[/QUOTE]

UPDATE: It seems that only some of the tests in pre5 crash, which are:
[code]
m-gs-fulltest
m-gs-96-24
m-gs-128-16
m-gs-128-32
[/code]

All the other ones work fine from what I can see now.

Bdot 2014-11-05 21:14

[QUOTE=NickOfTime;386887]260x[/QUOTE]
OK, it turns out that my guess was wrong - Bonaire still has the slow (4-cycle) int32 multiplications. I reverted the code base accordingly. Thanks for this test.

It's sad that AMD's versioning of GCN 1.0/1.1/1.2 does not seem to mean anything at all (at least nothing that I could use).

Bdot 2014-11-05 21:29

[QUOTE=NickOfTime;386929]Hmm, I tried running it concurrently on my 290x's in the 70M 74-bit range, and I start seeing fluctuations in GHz-d/d, GPU core clock and GPU utilization, resulting in what looks like an overall throughput of 900 GHz-d/d instead of the potential 1450 GHz-d/d...

On 0.14 I have 100% utilization, but the overall throughput drops by 25 GHz-d/d for both cards (525 to 500), so 1000 GHz-d/d of a potential 1050...

Are we hitting a memory/bus bandwidth limit? Asus Z87-WS, i7-4770S, 8GB DDR3-1866[/QUOTE]
Before I saw the test results of your x16 vs. x8 PCIe cards, I would have said that the bus etc. does not have any influence on mfakto. But seeing the x16 card consistently a few percent ahead of its x8 counterpart suggests it does make a difference.

Do I understand it right that each instance, when running alone, would give ~725 GHz-d/d, but when the other instance starts, the speed drops to ~450 GHz-d/d per card?

In this case I'd say this is AMD's PowerTune technology in action. Maybe you can use Catalyst Control Center to set the power target a few percent higher and watch whether the GHz-d/d output increases accordingly? But be careful if you do that over a longer period of time: the additional heat generation can be significant. I don't have a good explanation for why the speed would drop below the 0.14 level, though. Maybe the 15-bit kernels have a bigger share of simple instructions that do not generate so much heat?
Edit: which settings did you use for this parallel TF test? Your files suggest you should be using something similar to m-gs-128-32.ini or m-gs-fulltest.ini for maximum performance.

Bdot 2014-11-05 21:35

[QUOTE=kracker;386940]UPDATE: It seems that only some of the tests in pre5 crash, which are:
[code]
m-gs-fulltest
m-gs-96-24
m-gs-128-16
m-gs-128-32
[/code]All the other ones work fine from what I can see now.[/QUOTE]
These are the tests that use a bigger share of the GPU memory. Maybe downclock GPU memory a bit?

To ease the perftest, you can add a number on the mfakto command line for these tests:

mfakto -i m-gs-128-32.ini --perftest [B]2[/B] > testresults/m-gs-128-32.log

This number gives something like the number of iterations for each test; the default is 10, so 2 would be ~5 times faster.

Edit: Does the driver survive running a TF test using m-gs-128-32.ini?

kracker 2014-11-05 21:56

1 Attachment(s)
[QUOTE=Bdot;386952]
Edit: Does the driver survive running a TF test using m-gs-128-32.ini?[/QUOTE]
I'll try that. Little cosmetic bug?

NickOfTime 2014-11-05 22:11

[QUOTE=Bdot;386950]Before I saw the test results of your x16 vs. x8 PCIex cards I would have said that the bus etc. does not have any influence on mfakto. But seeing the x16 card consistently a few percent ahead of the x8 counterpart suggests it does make a difference.

Do I understand it right that each instance, when running alone would give ~725GHz, but when starting the other instance the speed drops to ~450GHz per card?

In this case I'd say this is AMD's Powertune technology in action. Maybe you can try to use Catalyst Control Center and set the power target to some percent higher and watch if the GHz-d/d output increases accordingly? But careful, if you do that over a longer period of time: the additional heat generation can be significant. I don't have a good explanation why the speed would drop below the 0.14 level, though. Maybe the 15-bit kernels have a bigger share of simple instructions that do not generate so much heat?
Edit: which settings did you use for this parallel TF test? Your files suggest you should be using something similar to m-gs-128-32.ini or m-gs-fulltest.ini for maximum performance.[/QUOTE]

Initially I started with GCN3 128/32, then 96/24 and 64/16, and then I changed GPUType to GCN and tried again... From looking at GPU-Z and seeing that GPU utilization was 100% and then sometimes zero, it would seem that the GPU kernel had finished and it was waiting for a buffer write or read from main memory...

Bdot 2014-11-05 22:35

[QUOTE=kracker;386955]Little cosmetic bug?[/QUOTE]
Oh you guys with these tiny windows :gah:
But feel free to truncate the status lines ...

