mersenneforum.org

MontyOnTheRun 2016-02-24 08:18

AMD FX-8350 issue
 
Hi, I'm kinda new here, so if I don't understand GIMPS too well, please don't judge; I'll learn about it eventually. Now to my issue. I took on a first-time LL test on my 8350 machine, and at 10 hours/day the expected finish date was February 2017. I was quite surprised, as I thought this CPU could pack quite a punch when highly threaded, and from what I understand a test takes about a month to complete. Is this normal?

VictordeHolland 2016-02-24 11:25

Maybe the estimate is off? You could try to extrapolate a completion date based on the progress you made in those 10 hours.
How many seconds or ms per iteration are you getting? You could also run the Prime95 benchmark to compare your results.
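
As a rough sketch of the extrapolation suggested here: an LL test of exponent p performs p - 2 squaring iterations, so the time remaining follows from the per-iteration time and the hours run per day. The function name and the sample inputs below are illustrative assumptions, not figures from this thread.

```python
def ll_days_remaining(exponent, ms_per_iter, hours_per_day=24.0, fraction_done=0.0):
    """Estimate calendar days left for a Lucas-Lehmer test.

    An LL test of 2**p - 1 performs p - 2 squarings; the rest is unit
    conversion from ms/iteration to days at the given daily run time.
    """
    iterations_left = (exponent - 2) * (1.0 - fraction_done)
    seconds_left = iterations_left * ms_per_iter / 1000.0
    return seconds_left / 3600.0 / hours_per_day


# Hypothetical inputs: a ~70M exponent at 20 ms/iteration, run 10 hours/day.
days = ll_days_remaining(70_000_002, 20.0, hours_per_day=10.0)
print(f"about {days:.0f} days remaining")
```

Plugging in the ms/iteration figure that Prime95 reports for the running test gives a sanity check on the program's own ETA.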

henryzz 2016-02-24 13:00

[QUOTE=MontyOnTheRun;427257]Hi, I'm kinda new here, so if I don't understand GIMPS too well, please don't judge; I'll learn about it eventually. Now to my issue. I took on a first-time LL test on my 8350 machine, and at 10 hours/day the expected finish date was February 2017. I was quite surprised, as I thought this CPU could pack quite a punch when highly threaded, and from what I understand a test takes about a month to complete. Is this normal?[/QUOTE]

They take around a month on Intel's CPUs when run full time. AMD CPUs have slow AVX instructions, which means it is better to go back to the old SSE2-based code. This carries a fair-sized performance penalty (can be 2x, I believe). Also, your CPU has two integer units and one FPU per module. Prime95 uses the FPU, so you effectively have 4 cores, not 8. Are you running 8 tests on 8 different threads? If so, that is another factor of 2 slowdown (per thread).
Generally, even single-threaded results with the same codepath will be slower on AMD compared to Intel. Hopefully AMD Zen will fix some of these problems and become a useful CPU for GIMPS.
2*2*(24/10)=9.6. After taking into account AMD CPUs being slower per clock cycle, a year rather than a month sounds reasonable.
I would suggest running one test on all your threads rather than 8 tests, as is the default, if you want faster results.
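
The back-of-the-envelope arithmetic above can be written out as a small sketch. The three factors are the rough guesses from this post (about 2x for the SSE2 codepath, about 2x for two threads sharing one FPU, and 24/10 for running 10 hours a day); the function and parameter names are hypothetical.

```python
def slowdown_factor(sse2_penalty=2.0, shared_fpu_penalty=2.0, hours_per_day=10.0):
    """Multiply the rough penalties relative to an Intel CPU running full time."""
    part_time_penalty = 24.0 / hours_per_day
    return sse2_penalty * shared_fpu_penalty * part_time_penalty


baseline_months = 1.0  # "around a month" on Intel when run full time
factor = slowdown_factor()
print(f"slowdown: {factor:.1f}x, estimate: {baseline_months * factor:.1f} months")
```

This covers only the three named factors; the per-clock disadvantage mentioned above would push the estimate further toward a year.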

kladner 2016-02-24 16:24

5 Attachment(s)
[QUOTE=henryzz;427264].....
I would suggest running one test on all your threads rather than 8 tests, as is the default, if you want faster results.[/QUOTE]
Experiments with an 8350 lead me to think that running 4 workers of 2 threads each is about optimal. With 8 workers, each pair of integer cores, (1-2,3-4,5-6,7-8) is sharing an FPU. I have a few test results which I have accumulated: some text and some screenshots.

First off is a summary. These, (I think,) are best case at stock 4 GHz, DDR3-1600 RAM. By best case I mean that the rest of the CPU was idle when these tests were run. More worker threads equal more contention for CPU cache and RAM.
[CODE]32.5M range

One helper thread, on same FPU: ~19ms
Single thread, unshared FPU: ~24.4ms
Single thread, shared FPU: ~36.5ms[/CODE]As to the screenies, the file names should be self-explanatory. Note that background processes cause considerable variation in times. Overall, the differences between the different allocations are relatively small.
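
To put the three timings side by side, here is a minimal sketch that converts them into aggregate iterations per second for a full 8350. It assumes, optimistically, that every worker sustains the best-case per-iteration time listed above even when all modules are busy; as noted, more workers mean more contention for cache and RAM, so real totals will be lower. The layout labels and function name are illustrative.

```python
# Best-case per-iteration times (ms) from the 32.5M-range summary above,
# paired with the number of workers a 4-module FX-8350 would run in that layout.
LAYOUTS = {
    "4 workers x 2 threads (helper on same FPU)": (4, 19.0),
    "4 workers x 1 thread (unshared FPU)": (4, 24.4),
    "8 workers x 1 thread (shared FPU)": (8, 36.5),
}


def iterations_per_second(workers, ms_per_iter):
    """Total iterations per second across all workers, ignoring contention."""
    return workers * 1000.0 / ms_per_iter


for label, (workers, ms) in LAYOUTS.items():
    print(f"{label}: {iterations_per_second(workers, ms):.1f} iter/s")
```

Under that assumption the 4x2 and 8x1 layouts come out within about 5% of each other in total throughput, while each test in the 4x2 layout finishes in roughly half the time.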

I know that I am exposing exponents to common view. If someone is really hard up enough to poach these, more power to them.

henryzz 2016-02-24 17:07

[QUOTE=kladner;427276]Experiments with an 8350 lead me to think that running 4 workers of 2 threads each is about optimal. With 8 workers, each pair of integer cores, (1-2,3-4,5-6,7-8) is sharing an FPU. I have a few test results which I have accumulated: some text and some screenshots.

First off is a summary. These, (I think,) are best case at stock 4 GHz, DDR3-1600 RAM. By best case I mean that the rest of the CPU was idle when these tests were run. More worker threads equal more contention for CPU cache and RAM.
[CODE]32.5M range

One helper thread, on same FPU: ~19ms
Single thread, unshared FPU: ~24.4ms
Single thread, shared FPU: ~36.5ms[/CODE]As to the screenies, the file names should be self-explanatory. Note that background processes cause considerable variation in times. Overall, the differences between the different allocations are relatively small.

I know that I am exposing exponents to common view. If someone is really hard up enough to poach these, more power to them.[/QUOTE]

Experimentation is better than theory.
I suggested one test on all threads in order to bring the test time down to a manageable length.
It looks like more performance is lost on these AMD CPUs by running one test over multiple FPUs than on current Intel CPUs. Depending on how the FFT length matches the L3 cache size, one test can be faster on a recent Intel. I suppose this makes sense, as the current AMDs behave more like older Intels.

MontyOnTheRun 2016-02-24 17:28

[QUOTE=kladner;427276]Experiments with an 8350 lead me to think that running 4 workers of 2 threads each is about optimal. With 8 workers, each pair of integer cores, (1-2,3-4,5-6,7-8) is sharing an FPU. I have a few test results which I have accumulated: some text and some screenshots.

First off is a summary. These, (I think,) are best case at stock 4 GHz, DDR3-1600 RAM. By best case I mean that the rest of the CPU was idle when these tests were run. More worker threads equal more contention for CPU cache and RAM.
[CODE]32.5M range

One helper thread, on same FPU: ~19ms
Single thread, unshared FPU: ~24.4ms
Single thread, shared FPU: ~36.5ms[/CODE]As to the screenies, the file names should be self-explanatory. Note that background processes cause considerable variation in times. Overall, the differences between the different allocations are relatively small.

I know that I am exposing exponents to common view. If someone is really hard up enough to poach these, more power to them.[/QUOTE]

Thanks for the info, guys. I followed kladner's example and the ETA dropped to September 2016, so I guess the situation is better now. However, I still think the chip should perform better. I have attached a benchmark; do the results support the ETA?

Also, henry mentioned an SSE2 build. Exactly what version is that?

MontyOnTheRun 2016-02-24 17:29

1 Attachment(s)
oops, bench is here.

Mark Rose 2016-02-24 17:54

[QUOTE=MontyOnTheRun;427288]Thanks for the info, guys. I followed kladner's example and the ETA dropped to September 2016, so I guess the situation is better now. However, I still think the chip should perform better. I have attached a benchmark; do the results support the ETA?

Also, henry mentioned an SSE2 build. Exactly what version is that?[/QUOTE]

The AMD CPUs of the last ten years excel at integer math. So things like web servers and databases do very well.

But when it comes to floating point throughput, they are woefully behind Intel.

The AMD chips must gang together the two 128-bit halves of the FPU shared by each pair of cores to do 256-bit AVX operations. There is some overhead in combining the execution units, so it's better to avoid the 256-bit AVX operations and stick to the 128-bit operations in SSE2. The Prime95/mprime executables contain the SSE2 code, and it should be picked automatically for AMD CPUs as it is faster.

What speed is your RAM? Are you running it in dual channel mode?

kladner 2016-02-24 18:46

1 Attachment(s)
Do you have the following in prime.txt (in your Prime95 folder)?

[CODE]BenchMultipleWorkers=1
BenchMultithreads=1[/CODE]These will cause Prime95 to run an exhaustive benchmark for all the possible combinations of the number of CPUs (cores) and numbers of threads. It will do this for the same range of FFTs as those in the previous test.
Warning: Running the benchmark with these settings took almost exactly 1 hour on my system.

[cross post w/Mark]

Mark asked one of the questions I have, about memory. Also have a look at memory timings. Compare your settings to the 'SPD' values for the RAM. CPUID's CPU-Z shows the values on adjacent tabs.

MontyOnTheRun 2016-02-24 19:24

I'm running 16 GB of 1600 MHz memory*. However, I remembered I hadn't turned XMP on before, so I have now. I'll run the benchmark and update my results.

*side note, not to go off track: Until now I thought that ram was only significant for p-1 stage 2 factoring. Does it make a difference in LL tests?

Mark Rose 2016-02-24 20:04

[QUOTE=MontyOnTheRun;427301]I'm running 16 GB of 1600 MHz memory*. However, I remembered I hadn't turned XMP on before, so I have now. I'll run the benchmark and update my results.

*side note, not to go off track: Until now I thought that ram was only significant for p-1 stage 2 factoring. Does it make a difference in LL tests?[/QUOTE]

Memory speed makes a huge difference for LL. Not much memory is needed, but the faster it is, the better.


All times are UTC.