#34
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
10010100000101₂ Posts
I had to recompile it: for some reason this binary doesn't produce factors, even though it is doing something on the CPU (but not on the GPU).
After recompilation it looks very nice; I am running the 0-1P range again and will compare with the earlier results.

#35
Jun 2003
5051₁₀ Posts
So, a first (re)attempt at getting Linux to work. I removed all asm code so that a single code base will work under both Windows and Linux.
I haven't compiled or tested it under Linux, so if someone can try it out...? There will be a performance regression, as the CPU code will be much slower, but that doesn't matter. If it works under Linux, I'll gradually add back the assembly routines.
EDIT: Per usual, a Win32 compile under CUDA 3.2 is included.
Last fiddled with by axn on 2015-01-11 at 15:54

#36
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
10010100000101₂ Posts
Ok, will test.

#37
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
3^6×13 Posts
Sorry that it took me so long to get to it.
It works fine under Linux. The factors are all valid and none are missed (compared to the Windows version). Could you please merge the plain source and the asm, and put in multiple "#if 0 {... block of asm code} #else {block of C code} #endif" blocks? I will gladly flip 0s to 1s for as many combinations as necessary to get to the bug.

#38
Jun 2003
5,051 Posts
Meanwhile, could you compare the performance of the C version vs the asm version under Windows? I'd like to have hard numbers on how much performance is lost (or not) on mid/high-end GPUs. If the loss is under 1-2% (especially when running multiple instances per GPU), then I'd rather not introduce assembly at all.
[Note to self: try out MPIR/GMP someday]

#39
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
3^6×13 Posts
I don't have any dual-boot machines. (And I have very different cards on one side and on the other.)
But the speed on Linux "looks" fine. Your idea to leave it in C sounds fine to me. I only tried the problematic 0-1P region once again, and then 1P-3P (and they ran in parallel with some other programs, too); they finished fast, "organoleptically" similar to what I'd expect (from having run sieving on Windows up to 350P, I know how frequently I had to restart binaries).
For the accuracy point, I've compared the new 0-1P output with the one that I already had. After sorting (because the factors are dumped in differently sized chunks and are not ordered by design), the files match precisely, byte for byte.
For validity, all were checked with reformatter_in_perl piped to gp (I prepare x=Mod(b,f), then check that x^(2n) - x^n + 1 = 0, obviously).
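
The gp validity check described above can be sketched in Python. This is only an illustration under assumptions: f is a candidate factor, b the base, and n the exponent exactly as it appears in the formula (whether that equals the sieve's command-line n is not stated in the thread). The function name is hypothetical; the actual check was done with reformatter_in_perl and gp, not with this code.

```python
def is_valid_factor(b: int, n: int, f: int) -> bool:
    """Return True if f divides b^(2n) - b^n + 1.

    Mirrors the gp idea x = Mod(b, f); x^(2n) - x^n + 1 == 0,
    using modular exponentiation so the full power is never built.
    """
    x = pow(b, n, f)                 # x = b^n mod f
    return (x * x - x + 1) % f == 0  # b^(2n) is (b^n)^2


if __name__ == "__main__":
    # Toy values only: 2^6 - 2^3 + 1 = 57 = 3 * 19.
    for candidate in (3, 19, 5):
        print(candidate, is_valid_factor(2, 3, candidate))
```

The three-argument form of pow keeps every intermediate value below f^2, so the check stays cheap even when the exponent is large.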

#40
Jun 2003
5,051 Posts
Actually, I was looking for asm (0.2)/Windows vs plain C (0.3)/Windows. Maybe one pair of results with n=18 and another pair with n=22, at, say, the 1000-1001 range.
EDIT: I'm looking for the speed as reported by the program. Also, 1 vs 2 instances running. Linux should be approximately equal to Windows, but even if it isn't, there isn't much that I can do :-(
Last fiddled with by axn on 2015-01-15 at 07:12

#41
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
3^6·13 Posts
Ah! Right, I forgot that there was a Windows binary included.
This test makes sense. The asm code does have an influence. Let's have a look case by case. The CPU is a 1055T and the GPU is a 570 OC.
n=18 1001 B12 with 0.3: 45.4P/day single; 2*37.7P/day dual; CPU is busy, GPU is cold (53C)
n=18 1001 B12 with 0.2: 59.4P/day single; 2*46P/day dual; CPU is busy, GPU is a bit warmer (55C)
n=22 1001 B12 with 0.3: 137.1P/day single; 2*83.8P/day dual; CPU load is much lower, GPU is warm (63C)
n=22 1001 B12 with 0.2: 171P/day single; 2*94.8P/day dual; CPU load is lower still, GPU is warmer (65C)
For full load, I use n=18, 350P (going up) + BOINC genefer; CPU load is 100% and T = 66-67C (I've recently replaced fans; T used to be 72C+ and started climbing to 78 with an occasional butterfly rattle).
Last fiddled with by Batalov on 2015-01-15 at 08:55
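
To put the figures above next to the 1-2% threshold mentioned earlier in the thread, here is a small sketch that computes the relative throughput loss of plain C (0.3) against asm (0.2). The numbers are copied from this post; the names RATES, loss_percent and summarize are made up for the illustration.

```python
# (n, version) -> (single-instance P/day, per-instance P/day with two instances)
RATES = {
    (18, "0.3"): (45.4, 37.7),
    (18, "0.2"): (59.4, 46.0),
    (22, "0.3"): (137.1, 83.8),
    (22, "0.2"): (171.0, 94.8),
}


def loss_percent(c_rate: float, asm_rate: float) -> float:
    """Throughput lost by the C build relative to the asm build, in percent."""
    return 100.0 * (1.0 - c_rate / asm_rate)


def summarize(rates: dict) -> dict:
    """Map each n to (single-instance loss %, dual-instance total loss %)."""
    result = {}
    for n in sorted({key[0] for key in rates}):
        c_single, c_dual = rates[(n, "0.3")]
        asm_single, asm_dual = rates[(n, "0.2")]
        result[n] = (
            loss_percent(c_single, asm_single),
            loss_percent(2 * c_dual, 2 * asm_dual),  # totals over both instances
        )
    return result


if __name__ == "__main__":
    for n, (single, dual) in summarize(RATES).items():
        print(f"n={n}: single {single:.1f}% slower, dual {dual:.1f}% slower")
```

On these numbers the C build loses roughly 24% (single) and 18% (dual) at n=18, and roughly 20% (single) and 12% (dual) at n=22, so the gap is far above 1-2% on this particular card.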

#42
Jun 2003
11673₈ Posts
Version 0.4
Highlights include:
1) Bmax = 1e8 (for PG). I had to increase the buffer size another 10x for the 0-1P range, and yet that will overflow as well if a Block parameter above 8 is used :-(
2) Performance improvements compared to NoAsm. n=16-18 should be as fast as or faster than the last asm version. It is still not as fast as the asm version at higher n's, but the gap should have narrowed significantly.
Needed:
Performance figures under Windows.
Build and regression testing under Linux.
Per usual, source and Windows build attached. If everything checks out, this will be the "production" version. I'll work on more performance improvements for higher n's, but hopefully those won't be needed for a while.

#43
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
3^6·13 Posts
Tested "18 1001 1002 B12" and "18 1002 1003 B12" on Windows and Linux.
The speed is volatile on both platforms but generally similar (and similar to earlier runs: 58P/day single-load; 2*48P/day dual-load). The output is identical (after removing Windows-style line breaks and sorting each file, because factor dumps are asynchronous).
Good stuff! Kudos!
Last fiddled with by Batalov on 2015-02-02 at 03:17
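
The comparison described above (strip Windows-style line breaks, sort, then compare) could look like the following sketch. It is an assumption about the procedure, not the commands that were actually run; the function names are hypothetical, the sample lines are placeholders rather than real sieve output, and dropping empty lines is an extra choice made here.

```python
from typing import List


def normalized_lines(raw: bytes) -> List[bytes]:
    """Convert CRLF to LF, drop empty lines, and sort the remaining lines."""
    lines = raw.replace(b"\r\n", b"\n").split(b"\n")
    return sorted(line for line in lines if line)


def same_factors(raw_a: bytes, raw_b: bytes) -> bool:
    """True if both dumps hold the same lines, ignoring order and line endings."""
    return normalized_lines(raw_a) == normalized_lines(raw_b)


if __name__ == "__main__":
    windows_dump = b"factor-B\r\nfactor-A\r\n"  # placeholder content
    linux_dump = b"factor-A\nfactor-B\n"        # placeholder content
    print(same_factors(windows_dump, linux_dump))
```

Sorting is what makes the comparison meaningful here, since the factors are written asynchronously and the order differs from run to run.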

#44
Jun 2003
5,051 Posts
Many thanks! I'll post in the PG thread and let them know that the sieve is ready for use.