mersenneforum.org > Other Stuff > Open Projects > y-cruncher
Old 2019-06-13, 08:21   #1
mackerel
 
Aiming for hwbot per-core records

Let's see some activity in here :)

I've held the hwbot 6 core records for a while (25m, 1b, 10b). Not because I'm a great overclocker, but in part because no one else ran y-cruncher properly with AVX-512.

After discussion on another forum, I thought I'd have a go at 1t. No idea which system to use yet, but it is a bit different due to the vast data size required. I'll save that discussion for another day, as it isn't something I'm doing soon.

Anyway, I thought I'd see if I can take some more hwbot per-core records, and I'm looking at 2 and 4 core in particular.

10b was going to prove a challenge, as my 2-to-4-core-capable systems only have 4 ram slots, and I only have 8GB modules. 32GB isn't enough, so I had to learn to use swap. I recently dusted off an old server I'd experimented with for mining Burstcoin, which is based on hard disk capacity. I didn't RTFM so I probably did things wrong, but a 10b run took about 8 hours with 3 mismatched HDs (two WD Greens, one 7200rpm Hitachi). The CPU was an ancient low-power dual-core AMD thing.

I can do better than 8 hours... I dusted off an i3-6100. More clock. More instructions. More threads. But what to use as swap? I have a spare 32GB Optane module; they're fast, right? No: while fast on reads, a bench revealed horrible writes, confirmed by looking at online reviews. Try it anyway? No, it wasn't as simple as ram + swap = enough; I think the software reported it needed ~50GB on disk. The only other spare device I had on hand was a Crucial 525GB SATA SSD. I don't care too much about write life for a one-off run, so I plugged it in, configured y-cruncher and set it off. About 3 hours later it was indicating 50% done, which was far too slow. It turned out that, on the system I was using, Windows had decided to enlarge its swap file on the boot drive in an attempt to match the extra ram I added, and ran out of disk space. I aborted the run.

Now having RTFM on using disks, I ran the benchmark to see how bad my configuration was. Although the SSD could sustain near SATA-interface speeds in the short term, long term it was half that. When I next have time, I intend to move over the 3 HDs mentioned earlier, keep the SSD, throw in another SATA SSD and the Optane module, bench it and see if it is worth using the lot. Yes, it is a complete mess, but one of my goals is not to spend any more cash just to run a benchmark! On that note, I have offers/bids in on eBay for a 7350K and a 7740X. Yeah, not doing well there... new pricing is silly, so I'd only get them if I can get them low enough.

I did think about doing a non-K overclock on the 6100, but then remembered a side effect of the hack required to do that. It cripples AVX performance, so that would probably hurt more than any clock gain I could get.

Maybe I should just buy a bunch of new HDs, I was thinking of building a new file server anyway so it wouldn't just be for benching.
Old 2019-06-14, 07:58   #2
mackerel
 

Finally getting there after some interesting hurdles. During a run, I saw Windows using up a full thread, which disappeared as soon as I touched the mouse. Must be some "run after idle for x time" thing. I had to install a tool to fake activity to prevent that during a run. Win10 might not be the best OS for benching... I'm now leaving the system running idle in the hope it gets whatever it wants to do out of its system before future runs.

I've managed to take the 10b 2 core record now, although there was only one other person attempting it.

The disk situation is less than ideal, but I'm running with what I have. Below are the read and write benchmarks from CrystalDiskMark. As it uses a relatively small data set, it will over-report on SSDs, since the test probably fits in the higher-speed cache. SSD2, for example, dropped to around 260MB/s on much longer sustained writes.

Device  Read (MB/s)  Write (MB/s)
SSD1    544          482
SSD2    529          514  (long-term sustained write ~260)
HD1     192          193
HD2     205          235

HD2 is actually a software RAID 0 (done by Windows) of the WD Green drives; individually they were really slow. I also ran out of SATA connectors on the mobo, so I pulled out a controller card I had bought for my current backup storage server but never used. It's only 2-port, and when I initially connected SSDs to it, they benched significantly lower than on the mobo ports, so I might as well move the slowest disks onto it. It was even on CPU-connected PCIe lanes, not chipset, so I can't blame bandwidth competition.
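Since short CrystalDiskMark runs flatter SSD caches, a rough stand-alone way to time a long sequential write instead is sketched below. The path and sizes are placeholders; on a cached SSD you'd want total_mib well beyond the cache size.

```python
import os
import time

def sustained_write_mib_s(path, total_mib=4096, block_mib=16):
    """Average sequential write speed in MiB/s over a long write.

    A total much larger than the drive's SLC/DRAM cache approximates
    the sustained rate rather than the burst rate a short test reports.
    """
    block = os.urandom(block_mib * 1024 * 1024)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        start = time.perf_counter()
        for _ in range(total_mib // block_mib):
            os.write(fd, block)
        os.fsync(fd)  # flush the OS write-back cache before stopping the clock
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return (total_mib // block_mib) * block_mib / elapsed
```

Opening the file with O_DIRECT (Linux) would bypass the page cache entirely and be closer still to raw disk behaviour, at the cost of alignment requirements.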

Anyway, the results of the y-cruncher I/O test are as follows for the 4 logical devices above.

Working Memory... 28.2 GiB (locked, 2.00 MiB pages, spread: 100%/1)
I/O Buffers... 256 MiB

Sequential Write: 736 MiB/s
Sequential Read: 727 MiB/s
Threshold Strided Write: 452 MiB/s
Threshold Strided Read: 262 MiB/s
VST Streaming:
Computation: 1.50 GiB/s
Disk I/O : 730 MiB/s
Ratio : 0.474176


For fun, I thought I'd bench the Optane 900p in another system, wondering if for future use that would be of some benefit by itself. The results I got were lower than I thought I'd see. Wasn't Optane supposed to get around the limitations of flash?

Unable to acquire the permission, "SeLockMemoryPrivilege".
Large pages and page locking may not be possible.

Expect larger performance penalties from Meltdown mitigation.

Working Memory... 11.6 GiB (locked, spread: 100%/1)
I/O Buffers... 64.0 MiB

Sequential Write: 445 MiB/s
Sequential Read: 505 MiB/s
Threshold Strided Write: 421 MiB/s
Threshold Strided Read: 454 MiB/s
VST Streaming:
Computation: 3.60 GiB/s
Disk I/O : 459 MiB/s
Ratio : 0.124577

I haven't dug around yet to see if there was some other problem, or if this is representative. CrystalDiskMark results for sequential were >2GB/s read and writes, but again it is a much smaller/shorter test.

If I understand the output of the disk test correctly, I should be aiming for a sustained transfer rate of double the computation rate... this is gonna be a lot harder than I thought. For 10b the disk access time seemed a relatively small proportion of overall computation time. I guess this will go up where the available ram is small relative to the total storage requirement?
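Cross-checking the numbers in the two logs, the Ratio line looks like it is simply Disk I/O divided by Computation; that's an inference from the printed values, not anything documented. A quick sanity check:

```python
# Rates as printed by y-cruncher above; the computed ratios come out
# within rounding error of the reported 0.474176 and 0.124577.
GIB = 1024.0  # MiB per GiB

runs = [
    ("4-disk mess", 730.0, 1.50 * GIB, 0.474176),
    ("Optane 900p", 459.0, 3.60 * GIB, 0.124577),
]

for name, disk_mib_s, comp_mib_s, reported in runs:
    ratio = disk_mib_s / comp_mib_s
    print(f"{name}: {ratio:.4f} (reported {reported})")
```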


In other news, I didn't get the 7740X: I was outbid by multiple last-minute snipers. The door is still open on the 7350K, but the seller wants more than I want to pay.
Old 2019-07-09, 22:55   #3
Mysticial
 

Quote:
Originally Posted by mackerel View Post
For fun, I thought I'd bench the Optane 900p in another system, wondering if for future use that would be of some benefit by itself. The results I got were lower than I thought I'd see. Wasn't Optane supposed to get around the limitations of flash?

I haven't dug around yet to see if there was some other problem, or if this is representative. CrystalDiskMark results for sequential were >2GB/s read and writes, but again it is a much smaller/shorter test.
That's really strange, and I don't know enough about Optane to comment. Maybe there's some sort of caching going on that's being overrun by the I/O benchmark, but not by CrystalDiskMark.

For example, CrystalDiskMark's access pattern isn't large enough to put NVMe SSDs into their steady-state write behavior. Most such SSDs now have an SLC cache, and you can't spill it unless you write more than the size of the entire drive.

The 3 GB/s write on my MP510 looks awesome under CrystalDiskMark, but it's not sustainable. I'm not using it for such purposes anyway, so it's fine for my use cases.


Quote:
If I understand the output of the disk test correctly, I should be aiming for a benchmark sustained transfer rate of double the computation rate... this is gonna be a lot harder than I thought. For 10b the amount of disk access time seemed relatively small proportion of overall computation time. I guess this will go up where the available ram is small relative to the total storage requirement?
Yes. It's going to be very difficult. In your case, you'll need > 3 GB/s which means saturating at least 2 NVMe slots.

Looking beyond the HWBOT size of only 10b: on the HEDTs there won't be enough NVMe slots, so you'd need one of the 4x NVMe PCIe cards. But those require PCIe bifurcation, which AFAIK only X399 and server boards have.
Old 2019-07-12, 21:28   #4
mackerel
 

I just accidentally took the y-cruncher 10b 8 core record on hwbot without trying... sat there in disbelief thinking it was a mistake until I realised there wasn't.

Got a 3700X today. Threw in the 4x16GB of ram I also got to play with, which wasn't stable at its rated 3200 but seemed OK at 3000. Still, no more swap for 10b runs. I thought I might as well submit it anyway, and was surprised to see it come out on top. I'd guess a stock 7820X with 64GB of ram would smash my entry, but it seems no one bothered to run one. 10b isn't very popular, possibly because of the long run time and large ram requirement.

I could better my result, but looking at the 3700X behaviour at stock, I'm not sure how much I'll be able to overclock it without significantly improving the cooling.

Running 12 threads small FFT, it ran at about 1.06V, 3.8 GHz.
Running 6 threads small FFT, it ran at about 1.16V, 3.9 GHz.
Running 1 thread large FFT in-place, it ran at about 1.45V, 4.3 GHz.
The max boost of 4.4 GHz was detected by monitoring software, but I never saw it myself. In the single-thread case the CPU was already mid-70s C, with software monitoring showing that core taking 18W by itself. If 5 more cores were to do similarly, it would overload my cooling for sure, which is only a cheap 240mm AIO.
Old 2019-07-14, 04:19   #5
Mysticial
 

Quote:
Originally Posted by mackerel View Post
I just accidentally took the y-cruncher 10b 8 core record on hwbot without trying... [...] I'm not sure how much I'll be able to overclock it without significantly improving the cooling.
Maybe if they start giving points then it will be a lot harder to "accidentally" break a record. But given the state of competitive OCing, that's not gonna happen for a long time - if ever.

But yeah. Nobody in competitive OC runs a full ram configuration since it sacrifices performance.

Last fiddled with by Mysticial on 2019-07-14 at 04:19
Old 2019-07-14, 20:04   #6
mackerel
 

@Mysticial, have you ever encountered a situation where CPU usage drops to 0% for periods of time while running 10b?

I'm trying to get a good time on an E5-2683 v3, running quad-channel 2133 4x16GB. This has run PrimeGrid projects without problems in the past. Today I did a 10b run, and the time was rather rubbish compared to the only other result on hwbot. I submitted it anyway, and did a 2nd run to see if it would be better. It was much worse. I started up monitoring and did a 3rd run. Not wanting to stay by the system, which was in a bedroom, I returned to my desk and used VNC to check after it should have finished. It hadn't finished, and was sitting there with 0% CPU usage. Nothing seemed unusual apart from it not doing anything: CPU clocks were normal, no other loads. In case some idle activity timer was kicking in (even though I have power-saving stuff disabled), I installed a keep-alive utility which simulates a button press every minute. Another run, and this time I kept watching it. Nothing unusual happened, and I got my best time, smashing the 1st run by a good margin. Maybe it worked? I turned off monitoring and started it going yet again in the hope of a slight improvement, and it was again significantly slower. OK, VNC back on, and this time something weird was happening. Temps were fine (60C) but the CPU was at 1.2 GHz, with the software using ~60% of that. I guess it is a system problem, and I've no idea where to start...


An unrelated question: I generally turn on HPET, but the hwbot submitter often stays blank under clock type. Only on the above system did I see it show HPET. If I forget to turn on HPET, it shows something else and indicates it is invalid.
Old 2019-07-14, 22:33   #7
Mysticial
 

Quote:
Originally Posted by mackerel View Post
@Mysticial, have you ever encountered a situation where CPU usage drops to 0% for periods of time while running 10b?

Trying to get a good time on E5 2683v3, running quad channel 2133 4x16GB... [...] Another run, this time I kept watching it. Nothing unusual happened, and I got my best time smashing the 1st run by a good margin. Maybe it worked?
Did you by any chance accidentally select something in the console window? That will cause the program to pause. This "feature" of the Windows console is useful, but easy to trigger accidentally.

Quote:
Turned off monitoring, and started it going yet again in the hopes of gaining a slight improvement, and it was again significantly slower. Ok, VNC back on, and this time something weird is happening. Temps are fine (60C) but the CPU is at 1.2 GHz and software using ~60% of that. Ok, I guess it is a system problem and I've no idea where to start...
I've seen something similar in Linux, but not in Windows. The p-state drivers in Linux can be screwy at times and will throttle a memory-intensive application all the way down to the lowest allowed frequency.

Quote:
Unrelated question, I generally turn on HPET but the hwbot submitter often stays blank under clock type. Only on the above system did I see it show HPET. If I forget to turn on HPET is shows something else and indicates it is invalid.
If it doesn't show HPET, it's some other clock that the program doesn't recognize.

Early versions of the submitter required HPET specifically, but it became apparent that not all "platform clocks" are HPET. Furthermore, BCDEdit can only enable/disable the platform clock; it can't force the clock to be HPET.

So instead of requiring HPET, I decided to just blacklist TSC, since that's the one that's vulnerable to clock skew.
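For reference, the BCDEdit toggle in question. These are standard Windows boot options (run from an elevated prompt, reboot to take effect), though as noted above they only force a platform clock, not HPET specifically:

```shell
rem Force Windows to use the platform clock (may be HPET, may not)
bcdedit /set useplatformclock true

rem Remove the override and return to default clock selection
bcdedit /deletevalue useplatformclock
```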

Last fiddled with by Mysticial on 2019-07-14 at 22:36
Old 2019-07-14, 23:08   #8
mackerel
 

The console selection thing: I'm aware of it, but it would be impossible to tell without going back in time. I didn't interact with the system during that period and it seemed to resolve itself, so I don't think it's likely.

For the possible p-state thing, I can only do what I usually do: reboot (didn't help) and update everything. I just took Win10 to the latest 1903, and there's also a BIOS update with microcode, presumably for some vulnerability or other. I'd like a clean run, since I'm going against other Xeons at 14 cores and need a bit more. Your run with the 7920X is in a different ball park... but if I can get "best of the rest" I'd take it.

And finally, I think I just learnt that "platform clock" does not equal HPET.
Old 2019-07-15, 00:21   #9
Mysticial
 

Quote:
Originally Posted by mackerel View Post
The console selection thing, I'm aware of it, and it would be impossible to tell without going back in time. I didn't interact with it during the time and it seemed to resolve itself, so I don't think it is likely.
If you see it again, open Task Manager, find the process under the "Processes" tab, right-click -> "Create dump file".

That will tell me if the program is deadlocked (a bug) or if it's suspended by the OS.

Quote:
The possible pstate thing, I can only do what I usually do: reboot (didn't help) and update everything. Just took Win10 to latest 1903, and there's also a bios update with microcode, presumably for some vulnerability or other. I'd like a clean run since I'm going against other Xeons at 14 cores and need a bit more. Your run with 7920X is in a different ball park... but if I can get "best of the rest" I'd take it.
Core counts being equal, it's going to be very hard to beat AVX-512 without AVX-512.

Interestingly, AMD declined to confirm the absence of AVX-512 on Zen 2, so there's a slim but non-zero chance it may show up on Zen 2 TR or Epyc. But even that still won't be enough to match Intel, since they still only have 2 x 256-bit FMA.
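The FMA-width gap in rough per-core numbers: both designs have two FMA pipes, so peak double-precision FLOPs per cycle scale with the vector width. A back-of-the-envelope sketch, ignoring clocks and AVX frequency offsets:

```python
def peak_dp_flops_per_cycle(simd_bits, fma_units=2):
    lanes = simd_bits // 64       # 64-bit doubles per vector register
    return fma_units * lanes * 2  # each FMA counts as a multiply + an add

print("Skylake-X (2 x 512-bit FMA):", peak_dp_flops_per_cycle(512))  # 32
print("Zen 2     (2 x 256-bit FMA):", peak_dp_flops_per_cycle(256))  # 16
```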

p.s. I have v0.7.8 planned for September/October. That's after the 3950X launches, after Hot Chips 2019, and hopefully after we have more information about Zen 2 TR/Epyc. Those will affect my purchasing decisions as well as whether I do a "19-ZN2" binary for Zen 2.

Anyways, more on v0.7.8 later. It's now feature frozen except for the possible Zen 2 binary. And I'll be looking for some beta testers.

Quote:
And finally, I think I just learnt that "platform clock" does not equal HPET.
The other ones I'm aware of are "ACPI" and some VM clocks under Linux. I first discovered this problem when it didn't recognize the platform clock on my laptop and blocked me from any submissions.

In the end I never did figure out what the clock was on my laptop. So I switched from whitelist of HPET and ACPI to a blacklist of just TSC.

Last fiddled with by Mysticial on 2019-07-15 at 00:26
Old 2019-08-30, 20:01   #10
mackerel
 

Today I got a shiny new 7920X, which was (relatively) cheap. After some hurdles I installed it in a system, replacing the 7800X, put back in the 64GB of ram and thought I'd see what it would do with 10b. I had observed it running at 2.9 GHz "stock" for AVX-512, but thought it would be an interesting time to see anyway, before overclocking later.

After some quick runs of 25m and 1b, I started it off on 10b and went away for dinner. Got back to find it still running... it shouldn't take that long. All CPU cores were still active, still reporting 2.9 GHz. I took a dump, left it some more, and it did finish. I didn't note it down, but the reported kernel time was silly high, and the multi-core ratio was something like 23%. Repeating a 1b run: low kernel time, good scaling as expected.

OK, back to a 10b run, this time semi-watching Task Manager's CPU tab with "show kernel times" selected. I caught it in the act! For the early part, I saw high CPU usage as expected, with no significant kernel time. At 54% in, the kernel time on all the cores went up to maximum. Some seconds later, all but 3 threads dropped to idle, and the 3 active threads were still maxed out on kernel time. Again, I took a dump. I went to a different system to write this, and looking back now, it has finished. This run was faster than the 1st, reporting 65% kernel overhead on CPU utilisation and 70% multi-core efficiency.
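To log the user-vs-kernel split over time without babysitting Task Manager, something like the following would work. The thread is on Windows, where psutil's Process.cpu_times() exposes the same user/system numbers; this stdlib-only sketch is the Linux equivalent, reading /proc, with the pid and interval as placeholders:

```python
import os
import time

def cpu_times(pid):
    """Return (user_seconds, kernel_seconds) for a process from /proc."""
    with open(f"/proc/{pid}/stat") as f:
        # Split after the ')' so a command name containing spaces
        # can't shift the field positions.
        fields = f.read().rsplit(")", 1)[1].split()
    hz = os.sysconf("SC_CLK_TCK")
    # utime and stime are fields 14 and 15 of /proc/<pid>/stat.
    return int(fields[11]) / hz, int(fields[12]) / hz

def watch(pid, interval=5.0):
    """Print each sampling interval's kernel-time share for the given pid."""
    user, kern = cpu_times(pid)
    while True:
        time.sleep(interval)
        u, k = cpu_times(pid)
        du, dk = u - user, k - kern
        total = du + dk
        share = 100.0 * dk / total if total else 0.0
        print(f"user +{du:.1f}s  kernel +{dk:.1f}s  kernel share {share:.0f}%")
        user, kern = u, k
```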

So if you still want to look at it, how do I get a ~48GB file to you? Putting it through 7zip now to see if I can get it down a bit.

I'm going to see if I can get Win7 on it, to see if it is an OS thing. I hadn't seen this until the 14-core Xeon run previously; since then I had done hwbot submissions on an i3-6100 and a 3700X without similar problems. Could core/thread count be a factor?
Old 2019-08-30, 20:14   #11
Mysticial
 

Quote:
Originally Posted by mackerel View Post
Today I got a shiny new 7920X which was (relatively) cheap... [...] This run was faster than the 1st, reporting 65% kernel overhead on CPU utilisation, and 70% multicore efficiency.
That is soooo strange; I haven't heard of it before. Shot in the dark: was Superfetch enabled? I've seen weird things with the memory compression it does, but not like this.

Quote:
So if you still want to look at it, how do I get a ~48GB file to you? Putting it through 7zip now to see if I can get it down a bit.
48 GB is probably prohibitive. Is there any way you can produce just a mini-dump instead of a full dump? This is Windows, right?

Quote:
I'm going to see if I can get Win7 on it to see if it is an OS thing. I hadn't seen this until the 14 core Xeon run previously. Since that I had done hwbot submissions on i3-6100 and a 3700X without similar problems. Could core/thread count be a factor?
Shouldn't be a factor. People have been running this thing on much larger machines without any issues. The fact that it's stuck in the kernel on a small number of threads has to imply that something is weird with the OS. A mini-dump should tell me which API call it was hanging in, or whether the OS was acting on a page fault.

The only places where the program should have high kernel usage are right at the beginning, when it's allocating memory, and right at the end, when it's releasing all that memory.

Last fiddled with by Mysticial on 2019-08-30 at 20:16
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.