[QUOTE=xilman;257997]To compensate, rate the top 1000 by the size of P-1 after division by all powers of the Mersenne exponent.[/QUOTE]
Why all powers? Why not one power? [QUOTE]Note that this proposal scales down composite P-1 factors also.[/QUOTE] Composite factors should be resolved into their constituent primes.
[QUOTE=henryzz;258001]AFAIK the non-SSE2 code supports higher exponents than SSE2 code.[/QUOTE]I haven't found the definition of MAX_PRIME yet.
[QUOTE=Mr. P-1;258004]Why all powers? Why not one power?
Composite factors should be resolved into their constituent primes.[/QUOTE]Purely to give a way of dealing with composites. If they are factored, multiple powers are not the issue. Paul
[QUOTE=xilman;258069]Purely to give a way of dealing with composites. If they are factored, multiple powers are not the issue.
Paul[/QUOTE] Slightly confused. Are you suggesting that if C = P1*P2 is a composite factor of a Mersenne number M(p), then C-1 will be divisible by p^2? Obviously (P1-1)*(P2-1) will be divisible by p^2, but, in general, C-1 won't be.
[QUOTE=axn;258072]Slightly confused. Are you suggesting that if C = P1*P2 is a composite factor of a Mersenne number M(p), then C-1 will be divisible by p^2?
Obviously (P1-1)*(P2-1) will be divisible by p^2, but, in general, C-1 won't be.[/QUOTE]You're right and I was confused (read: wrong). The correct approach, IMO, is to report only prime factors and to scale them by dividing by p. Division by p rewards both hard work (higher B1 & B2 give a higher chance of finding a factor) and good luck. Paul
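A small sketch of how such a scaled ranking might be computed (my own illustration, not code from the thread): any prime factor q of M(p) = 2^p - 1, for an odd prime p, has the form q = 2kp + 1, so q - 1 is always divisible by p and that "free" contribution of the exponent can be divided out before ranking.

```python
# Hypothetical scoring helper for ranking P-1 finds, as proposed above.
# Every prime factor q of M(p) = 2^p - 1 (p an odd prime) satisfies
# q = 2*k*p + 1, so (q - 1) is guaranteed to be divisible by p.

def p1_score(q: int, p: int) -> int:
    """Score a prime factor q of M(p) by the size of (q - 1) / p."""
    assert (q - 1) % (2 * p) == 0, "q is not of the form 2*k*p + 1"
    return (q - 1) // p

# Example: 23 divides M(11) = 2047 = 23 * 89, and 23 = 2*1*11 + 1.
print(p1_score(23, 11))  # -> 2
```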
[QUOTE=petrw1;257690]And if I put my entire (albeit meagre) team of P-1 capable PCs (3 Quads and 6 Duals and 1 P4=25 cores) I estimate I could complete only about 10 or 12 per day or 5 - 6%.
Mind you, all 4 cores of a Quad on P-1 could be quite taxing.[/QUOTE] I now have 5 cores running P-1's and over the last 30 days I'm averaging 2.6 per day in the 53M range all in the 4.6 - 4.7 GHz-Day range.
[QUOTE=drh;258387]I now have 5 cores running P-1's and over the last 30 days I'm averaging 2.6 per day in the 53M range all in the 4.6 - 4.7 GHz-Day range.[/QUOTE]
I remember being told by someone that P-1 would only take me 7 days, but I stopped it ~1 day in because it had only done about 3% (at 3% per day, 100/3 ≈ 33.3 days, not 7).
[QUOTE=science_man_88;258390]I remember being told by someone that P-1 would only take me 7 days, but I stopped it ~1 day in because it had only done about 3% (at 3% per day, 100/3 ≈ 33.3 days, not 7).[/QUOTE]
Depends completely on what CPU you are running on and how much RAM you allocate. I have an i5-750 that will complete a P-1 in the 50M range in 31 hours and a 2.4GHz P4 that takes over 100 hours.
[QUOTE=drh;258387]I now have 5 cores running P-1's and over the last 30 days I'm averaging 2.6 per day in the 53M range all in the 4.6 - 4.7 GHz-Day range.[/QUOTE]
Is that 5 cores on 1 PC? Or 1 core on each of 5 PCs, or a combination of such? If you have more than 1 core on a PC doing P-1, do you notice any conflict for CPU resources; for example, do the run times suffer noticeably if more than 1 core is doing stage 2?
[QUOTE=petrw1;258415]Is that 5 cores on 1 PC? Or 1 core on each of 5 PCs, or a combination of such?
If you have more than 1 core on a PC doing P-1, do you notice any conflict for CPU resources; for example, do the run times suffer noticeably if more than 1 core is doing stage 2?[/QUOTE] Here is what I'm doing right now:
1 Intel Core i5 M 540 @ 2.53GHz, 4GB RAM -- (1 core P-1 & 1 core mfaktc)
1 Intel Core i7 Q 720 @ 1.60GHz, 8GB RAM -- (2 cores P-1 & 2 cores LL)
1 Intel Core2 Quad Q8400 @ 2.66GHz, 8GB RAM -- (2 cores P-1 & 2 cores LL)
The memory on the systems running (2 P-1's) automatically adjusts when they are both in stage 2, but the throughput remains the same, as does the throughput of the LL's. I'm contemplating trying (3 P-1's) at a time, and if I do, I'll let you know those results as well.
For my part, I've got 10 cores running P-1:
4x i7-920 @ 3.2GHz w/12GB (10GB allocated)
4x Q6600 @ 2.7GHz w/8GB (6GB allocated)
2x 3600X2 @ 2.0GHz w/2GB (1GB allocated)
The older machines are working their way through ancient ranges of never-really-P-1'd exponents (and finding a bunch of factors, I might add). My i7 is working through more current exponents. The i7 has plenty of bandwidth available (triple-channel DDR3-1600) and is unbothered by which exponents are in stage 1/2. The Q6600 is dual-channel DDR3-1333, and is only slightly bothered. (The 3600+ is too slow for me to really pay any attention to.) On both my quads, I limit stage 2 to max 3 workers at a time, but more to give each one a decent chunk of RAM rather than for performance-bottleneck reasons.
[QUOTE=James Heinrich;258426] On both my quads, I limit Stage2 to max 3 workers at a time, but more to give each one a decent chunk of RAM rather than for performance bottleneck reasons.[/QUOTE]
And do you use the "Memory=" parms to limit memory use per core, or do you let it use whatever is available? (i.e. if only 1 core is in stage 2 it gets all the RAM; if there are 2 cores then half each, and if 3 cores then 1/3 each)
I let them fight it out for memory. The Q6600 is currently working in the 7M range, so a full batch of 480 relative primes only takes about 1500MB, so there's no real fighting for RAM on that machine. On the i7, it varies depending on workload. Sometimes I'll see workers with 6GB/2GB/1GB, other times it's balanced at 3/3/3. Of course, the win comes when only 1 or 2 workers happen to be in stage 2 at that time, in which case they get a bunch more RAM than if I hardcoded a per-worker limit. Right now I have a big backlog of stage 2 to complete, so it's rare to see fewer than the max workers in stage 2.
[QUOTE=petrw1;258412]Depends completely on what CPU you are running on and how much RAM you allocate.
I have an i5-750 that will complete a P-1 in the 50M range in 31 hours and a 2.4GHz P4 that takes over 100 hours.[/QUOTE] Yeah, I've said it before, but I'm on an HP Compaq Presario SR2050NX (apparently designed for XP) with a 2.80GHz processor, 3.19GB of detected (usable?) RAM on a 32-bit installation of Windows 7, ATI graphics, and almost all USB ports filled with plug-in stuff, plus a headset if I want to do speech recognition. I have a new printer, an Energy Star model of almost the same one I used to have, I think (3-in-one, I believe). At last check it got a 2.2 Windows Experience rating (but that's because it sucks at Aero).
[QUOTE=science_man_88;258434]Yeah, I've said it before, but I'm on an HP Compaq Presario SR2050NX (apparently designed for XP) with a 2.80GHz processor, 3.19GB of detected (usable?) RAM on a 32-bit installation of Windows 7[/QUOTE]
I think this is really all that is relevant to P-1 times. With this setup I don't see why it should take more than about 4 days for P-1 in the current range, assuming you allocate at least 400MB to it and run 24 hours a day.
[QUOTE=James Heinrich;258432]I let them fight it out for memory.[/QUOTE]
One more question... I know that it chooses B1/B2 based partly on how much RAM it has to work with. If I were to allocate 2400M it will choose larger B1/B2 than if I allocate 600M. But if I allocated 2400M total and 600M for each of the 4 cores, would it not choose smaller B1/B2 and potentially complete each one faster?
[QUOTE=petrw1;258440]With this setup I don't see why it should take more than about 4 days for P-1 in the current range assuming you allocate at least 400MB to it and run 24 hours a day.[/QUOTE]Expounding a little: The CPU is a [url=http://ark.intel.com/Product.aspx?id=27512]Pentium D 820[/url] dual-core (2.8GHz, 2x 1MB cache). A single core should push through about [url=http://mersenne-aries.sili.net/throughput.php?cpu1=Intel%28R%29+Pentium%28R%29+D+CPU+2.80GHz|1024|0&mhz1=2800]1.58GHz-days per day[/url] of P-1 work in the 53M range. If a single P-1 assignment is about 4GHz-days of work then a single core should really only take about 2.5-3 days to complete. Don't expect running two P-1's on that CPU to give you double the throughput, but you should be able to complete a P-1 at least every other day, I'd say. Assuming you have a decent amount of RAM allocated, that is (you didn't say what amount of RAM you let Prime95 use...?)
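The arithmetic behind that estimate, using the thread's own rough numbers (both figures are estimates from the post above, not measurements):

```python
# Rough completion-time estimate for one P-1 assignment on one core.
assignment_ghz_days = 4.0     # estimated size of a 53M-range P-1 assignment
core_ghz_days_per_day = 1.58  # estimated single-core Pentium D 820 throughput

days_per_assignment = assignment_ghz_days / core_ghz_days_per_day
print(round(days_per_assignment, 1))  # -> 2.5
```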
[QUOTE=James Heinrich;258447]Expounding a little: The CPU is a [url=http://ark.intel.com/Product.aspx?id=27512]Pentium D 820[/url] dual-core (2.8GHz, 2x 1MB cache). A single core should push through about [url=http://mersenne-aries.sili.net/throughput.php?cpu1=Intel%28R%29+Pentium%28R%29+D+CPU+2.80GHz|1024|0&mhz1=2800]1.58GHz-days per day[/url] of P-1 work in the 53M range. If a single P-1 assignment is about 4GHz-days of work then a single core should really only take about 2.5-3 days to complete. Don't expect that running two P-1's on that CPU to give you double the throughput, but you should be able to complete a P-1 at least every other day, I'd say. Assuming you have a decent amount of RAM allocated, that is (you didn't say what amount of RAM you let Prime95 use...?)[/QUOTE]
Yeah, well, I had it running full tilt (though maybe not without sleep mode, which may be the problem) with 2936MB (the maximum it allowed), and at one day I think it was under 4%.
[QUOTE=science_man_88;258451]Yeah, well, I had it running full tilt (though maybe not without sleep mode, which may be the problem) with 2936MB (the maximum it allowed), and at one day I think it was under 4%.[/QUOTE]
Do you happen to recall if it was 4% of stage 1 or stage 2? Typically stage 2 takes about 50% longer than stage 1, so if by any chance it was 4% of stage 2 then it was really almost half done.
[QUOTE=science_man_88;258451]Yeah, well, I had it running full tilt (though maybe not without sleep mode, which may be the problem) with 2936MB (the maximum it allowed), and at one day I think it was under 4%.[/QUOTE]
Never mind, apparently I had the power settings on my PC set to balanced performance. It's now on high performance; that may be a factor.
It sounds very much like it was at 4% of stage 2, so in reality almost exactly 50% done. If you'd left it for another day it would've completed.
[QUOTE=James Heinrich;258458]It sounds very much like it was at 4% of stage 2, so in reality almost exactly 50% done. If you'd left it for another day it would've completed.[/QUOTE]
If it was, I'd be surprised, because I tried another P-1 (though Prime95 gave me 2 to work on) and even at full memory usage (though at times I used PARI) it actually only got 50% on either one before a slow mouse and useless mouse clicks got me to restart my computer with Ctrl+Alt+Delete (because with useless mouse clicks I was hopeless to stop or even pause Prime95 or change memory settings).
[QUOTE=science_man_88;258579]If it was, I'd be surprised, because I tried another P-1 (though Prime95 gave me 2 to work on) and even at full memory usage (though at times I used PARI) it actually only got 50% on either one before a slow mouse and useless mouse clicks got me to restart my computer with Ctrl+Alt+Delete (because with useless mouse clicks I was hopeless to stop or even pause Prime95 or change memory settings).[/QUOTE]
Odd... I've had that problem with mfaktc (slow screen updates; the cheap video card is running ragged) but not Prime95. What version of Prime95 are you running? I ASSUME under Windows, but which version of Windows? Did you leave the priority settings alone, as recommended? Did you use Task Manager to kill Prime95, or did you end up doing a complete reboot? And what did Task Manager have to say about the size of your system memory? (I'm thinking you were getting thrashed by constantly swapping virtual memory in and out, but this is an intuitive leap that risks falling on its face.) Mods: :direction:
[QUOTE=Christenson;258586]Odd...I've had that problem with mfaktc (slow screen updates; the cheap video card is running ragged) but not Prime95. What version of prime95 are you running, I ASSUME under Windows, but which version of Windows? Did you leave the priority settings alone, as recommended? Did you use task manager to kill prime95, or did you end up doing a complete reboot? And what did task manager have to say about the size of your system memory?
(I'm thinking you were getting thrashed by constantly swapping virtual memory in and out, but this is an intuitive leap that risks falling on its face) Mods::direction:[/QUOTE] Prime95 25.11, I believe; I have the download of later versions, though. I think it was just from a memory overload: I can allocate up to 888MB in PARI and 2936MB in Prime95, and I have 3.19GB of usable memory, so they are pretty much fighting everything else for memory if I use them both. I thought about killing Prime95, but I wasn't sure if Task Manager would allow me to click it. It was on a 32-bit version of Windows 7 Home Premium; I used Ctrl+Alt+Delete to choose shutdown after I realized Task Manager might not work.
[QUOTE=science_man_88;258590]I think it was just from a memory overload: I can allocate up to 888MB in PARI and 2936MB in Prime95, and I have 3.19GB of usable memory, so they are pretty much fighting everything else for memory if I use them both. I thought about killing Prime95, but I wasn't sure if Task Manager would allow me to click it. It was on a 32-bit version of Windows 7 Home Premium; I used Ctrl+Alt+Delete to choose shutdown after I realized Task Manager might not work.[/QUOTE]
OUCH!!! Definitely starving Windows. From readme.txt (I snipped some to save room here):
[QUOTE]SETTING P-1/ECM STAGE 2 MEMORY
------------------------------
Stage 2 of P-1 factoring ... is slightly more effective if it is given more memory to work with. However, if you let the program use too much memory then the performance of ALL programs will suffer.
1) Be conservative. It is better to set the memory too low than too high. Setting the value too high can cause thrashing which slows down all programs. Remember, the program will only use the extra memory in stage 2 of P-1 factoring.
2) Start with how much memory is installed in your machine. Allow a reasonable amount of memory for the OS and whatever background tasks you run (say 100 or 200MB). This represents the maximum value you should use. The program won't let you enter more than 90% of installed memory.
3) Assuming you run your machine 24 hours a day, what hours of the day do you not use your computer? Make these your nighttime hours and let the program use a lot of memory during these hours. But reduce this value if you also run batch jobs at night.
4) Factor in the information below about minimum, reasonable, and desirable memory amounts for some sample exponents. If you choose a value below the minimum, that is OK. The program will simply skip stage 2 of P-1 factoring.
Exponent   Minimum   Reasonable   Desirable
--------   -------   ----------   ---------
20000000    40MB        80MB       120MB
33000000    65MB       125MB       185MB
50000000    85MB       170MB       250MB[/QUOTE]
To point 2) above: I tried allocating 90% on an XP machine and found the whole PC slowed noticeably, so I personally wouldn't go that high. I expect Windows 7 or Vista may want even more memory. Considering the table in point 4), and given you have over 3GB available, I really don't think you need to allocate 2936MB to Prime95. Even if you want to run P-1 on both cores, 1200MB-1600MB would certainly be lots.
And I am quite sure Task Manager would allow you to kill Prime95, but why wouldn't you stop it yourself from the GUI? Also consider the PauseWhileRunning parm from undoc.txt:
[QUOTE]In rare cases, users have reported the program can interfere with the performance of some programs such as disk defragmenters and some games. You can pause the program automatically when these programs are running by adding this line to prime.txt:
PauseWhileRunning=prog1[n1],prog2[n2],etc
The [n1], [n2] values are optional and indicate the number of worker threads to pause when prog1 and prog2 are running. The default value for n1 and n2 is to pause all worker threads. Note that the program will pause if the program name matches any part of the running program's file name. That is "foobar" will match "c:\foobar.exe", "C:\FOOBAR\name.exe", and even "C:\myfoobarprog.exe". Also, if prog1 is "*" the program will pause no matter what.
Examples:
PauseWhileRunning=*[1] during 6-7/2:00-3:00
PauseWhileRunning=* during 23:00-24:00 else decomp[1],mygame[2]
The first example pauses one worker thread on Saturday and Sunday between 2AM and 3AM. The second example pauses all workers between 11PM and 12AM and pauses 1 worker if decomp is running and 2 if mygame is running.[/QUOTE]
Petrw1, thank you...
SM88, Prime95 26.5 is a bit faster... you may prefer to run that instead... but pay attention to what Petrw1 is telling you; he's more expert at it than I am, and he's correct.
[QUOTE=petrw1;258602]OUCH!!! Definitely starving Windows.
From readme.txt (I snipped some to save room here) To point 2) above: I tried allocating 90% on an XP machine and found the whole PC slowed noticeably; so I personally wouldn't go that high. I expect Windows 7 or Vista may even want more memory. Considering the table in point 4) and given you have over 3GB available I really don't think you need to allocate 2936MB to Prime95. Even if you want to run P-1 on both cores 1200MB-1600MB would certainly be lots. And I am quite sure Task Manager would allow you to kill Prime95 but why wouldn't you stop it yourself from the GUI? Also consider the PauseWhileRunning parm from undoc.txt.[/QUOTE]
Q: And I am quite sure Task Manager would allow you to kill Prime95 but why wouldn't you stop it yourself from the GUI?
A: Mouse clicks didn't work, so I couldn't get into the menus (though it may partially be memory, but also distance, since my mouse is wireless).
Advice: Also consider the PauseWhileRunning parm from undoc.txt.
Problem: I leave PARI up a lot of the time, so it might as well pause indefinitely from that. PARIMAXIMUMS (a file I made to keep track of maximums, with a higher prime limit or not) has the maximum at allocatemem(931987423), which is 888.812MB. Prime95 (the other main thing I do outside of playing online crap) has a maximum of 2936MB on my machine (32-bit PC, 3.19GB of usable memory; these 2 put me over by about 0.545GB). It's definitely Prime95 I have to cut down, because that's 558MB over, and that means 888.812 - 558 - 300 (system) = 30.812MB if I do it only to PARI.
You want to investigate the LowMemWhileRunning option:[quote]LowMemWhileRunning is similar to PauseWhileRunning. This option does not allow workers to use a lot of memory. This example in prime.txt will make sure the program is using the minimum amount of memory possible while photoshop is running:
LowMemWhileRunning=Photoshop[/quote]So when you're running PARI (or other memory-intensive programs), Prime95 won't use a lot of RAM (it will do P-1 stage 1, or L-L, or TF -- tasks that don't require any large amount of RAM). Once you close PARI, Prime95 will restart the workers to continue on P-1 stage 2.
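For science_man_88's case, that might look like the following prime.txt line (a hypothetical example; `gp` is assumed here to be the name of the PARI/GP executable on his system, and the match is against any part of the running program's file name):

```
LowMemWhileRunning=gp
```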
George: has there been any recent change to the server credit for P-1?
[i]KingKurly[/i] [url=http://www.mersenneforum.org/showpost.php?p=258746]noted in this post[/url] that for P-1 on [url=http://mersenne.org/report_exponent/?exp_lo=332205149]M332205149[/url] (B1=3365000, B2=95902500) he got 181.3220 GHz-days of credit; but this doesn't match at all with [url=http://mersenne-aries.sili.net/prob.php?exponent=332205149&b1=3365000&b2=95902500&factorbits=76]my calculation[/url] of 210.272541 GHz-days. I suspect the cause is related to new FFT sizes and/or timings in use for credit calculation. Would you mind sending me a copy of the latest "t_gimps_credit_timings" lookup table (for exponent range + FFT size + timings), please? My old copy (taken from [url=http://www.mersenneforum.org/showpost.php?p=152289&postcount=207]your post here[/url]) is now out of date.
[CODE]CREATE FUNCTION [dbo].[fn_gimps_interpolate_timing] ( @fftlen float )
RETURNS float
AS
BEGIN
    DECLARE @base_fftlen float, @base_timing float, @next_fftlen float, @next_timing float

    SELECT @base_fftlen = max(fftlen), @base_timing = max(timing)
      FROM t_gimps_credit_timings WHERE fftlen < @fftlen

    SELECT @next_fftlen = min(fftlen), @next_timing = min(timing)
      FROM t_gimps_credit_timings WHERE fftlen > @fftlen

    IF (@next_fftlen > 0)
    BEGIN
        RETURN ( @base_timing + (@fftlen - @base_fftlen) / (@next_fftlen - @base_fftlen) * (@next_timing - @base_timing) )
    END

    RETURN ( @base_timing * @fftlen / @base_fftlen )
END[/CODE]
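The same interpolation logic, sketched in Python for readers without SQL Server handy (the timing values are a three-entry excerpt for illustration; the full table appears later in the thread):

```python
# Linear interpolation of credit timing between known FFT sizes,
# mirroring the T-SQL function above.  TIMINGS is only an excerpt.
TIMINGS = {
    786432:  4.1376e-2,
    917504:  4.9200e-2,
    1048576: 5.5920e-2,
}

def interpolate_timing(fftlen: float) -> float:
    below = [(f, t) for f, t in TIMINGS.items() if f < fftlen]
    above = [(f, t) for f, t in TIMINGS.items() if f > fftlen]
    base_fftlen, base_timing = max(below)   # largest known size below
    if above:
        next_fftlen, next_timing = min(above)
        frac = (fftlen - base_fftlen) / (next_fftlen - base_fftlen)
        return base_timing + frac * (next_timing - base_timing)
    # Beyond the largest known size: extrapolate proportionally.
    return base_timing * fftlen / base_fftlen

# Halfway between the 768K and 896K entries:
print(round(interpolate_timing(851968), 6))  # -> 0.045288
```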
SM88:
What is PARI? (Might sound dumb, but I'm a bit of a spring chicken here)
[QUOTE=Christenson;258783]SM88:
What is PARI? (Might sound dumb, but I'm a bit of a spring chicken here)[/QUOTE] [url]http://pari.math.u-bordeaux.fr/[/url]
[QUOTE=Prime95;258773]select @base_fftlen = max(fftlen), @base_timing = max(timing) from t_gimps_credit_timings where fftlen < @fftlen
select @next_fftlen = min(fftlen), @next_timing = min(timing) from t_gimps_credit_timings where fftlen > @fftlen[/QUOTE]Thanks... but that doesn't help me much if I don't have access to the `t_gimps_credit_timings` table. If you could give me a [COLOR="Blue"]SELECT * FROM `t_gimps_credit_timings`[/COLOR] (or an export from phpMyAdmin or mysqldump or whatever is easy) then I can reconstruct the data.
[CODE]fftlen    timing       min_exp    max_exp
32        8.784E-7           0        743
48        1.140E-6         743       1099
64        1.4088E-6       1099       1469
80        1.7592E-6       1469       1827
96        2.052E-6        1827       2179
112       2.4624E-6       2179       2539
128       2.5272E-6       2539       2905
160       3.564E-6        2905       3613
192       4.2936E-6       3613       4311
224       5.2512E-6       4311       5029
256       5.4744E-6       5029       5755
320       7.344E-6        5755       7149
384       8.784E-6        7149       8527
448       1.0536E-5       8527       9933
512       1.1328E-5       9933      11359
640       1.704E-5       11359      14119
768       2.0592E-5      14119      16839
896       2.5416E-5      16839      19639
1024      2.7648E-5      19639      22477
1280      3.912E-5       22477      27899
1536      4.7592E-5      27899      33289
1792      5.688E-5       33289      38799
2048      6.168E-5       38799      44339
2560      8.52E-5        44339      55099
3072      1.0248E-4      55099      65729
3584      1.2648E-4      65729      76559
4096      1.296E-4       76559      87549
5120      1.92E-4        87549     108800
6144      2.256E-4      108800     129900
7168      2.76E-4       129900     151300
8192      2.856E-4      151300     172700
10240     3.984E-4      172700     214400
12288     4.848E-4      214400     255300
14336     5.904E-4      255300     297300
16384     6.144E-4      297300     340400
20480     8.376E-4      340400     423300
24576     1.0176E-3     423300     504600
28672     1.26E-3       504600     587500
32768     1.3128E-3     587500     671400
40960     1.7256E-3     671400     835200
49152     2.0952E-3     835200     995500
57344     2.5608E-3     995500    1158000
65536     2.7384E-3    1158000    1325000
81920     3.8592E-3    1325000    1648000
98304     4.6872E-3    1648000    1966000
114688    5.6808E-3    1966000    2287000
131072    5.9928E-3    2287000    2614000
163840    7.3536E-3    2614000    3251000
196608    9.0744E-3    3251000    3875000
229376    1.0776E-2    3875000    4512000
262144    1.1976E-2    4512000    5158000
327680    1.5312E-2    5158000    6421000
393216    1.884E-2     6421000    7651000
458752    2.256E-2     7651000    8908000
524288    2.52E-2      8908000   10180000
655360    3.3264E-2   10180000   12650000
786432    4.1376E-2   12650000   15070000
917504    4.92E-2     15070000   17550000
1048576   5.592E-2    17550000   20050000
1310720   7.008E-2    20050000   24930000
1572864   8.592E-2    24930000   29690000
1835008   1.0224E-1   29690000   34560000
2097152   1.1376E-1   34560000   39500000
2621440   1.5E-1      39500000   49100000
3145728   1.8312E-1   49100000   58520000
3670016   2.184E-1    58520000   68130000
4194304   2.436E-1    68130000   77910000
5242880   3.1296E-1   77910000   96830000
6291456   3.792E-1    96830000  115300000
7340032   4.584E-1   115300000  134200000
8388608   5.04E-1    134200000  153400000
10485760  6.72E-1    153400000  190700000
12582912  8.112E-1   190700000  227300000
14680064  9.84E-1    227300000  264600000
16777216  1.08       264600000  302600000
20971520  1.4904     302600000  376100000
25165824  1.8216     376100000  448000000
29360128  2.1792     448000000  521500000
33554432  2.3832     521500000  596000000[/CODE]
Thanks. Unfortunately, that data is identical to what I have already, so that's not the problem. The credit formula I have for P-1 is the one [url=http://www.mersenneforum.org/showpost.php?p=160774&postcount=14]you posted 27-Jan-2009[/url]:[code]if ( $B1 >= $B2 )
    return ( $timing * ( 1.45 * $B1 ) / 86400.0 );
else
    return ( $timing * ( 1.45 * $B1 + 0.079 * ($B2 - $B1) ) / 86400.0 );[/code]Has it been modified since then?
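A quick sanity check of that formula in Python (my own sketch, not server code), using KingKurly's M332205149 run and the 1.2852 timing for the 18874368-word (18M) FFT from the table posted later in the thread:

```python
# P-1 GHz-days credit, per the formula quoted above.
def p1_credit(timing: float, b1: int, b2: int) -> float:
    if b1 >= b2:
        return timing * (1.45 * b1) / 86400.0
    return timing * (1.45 * b1 + 0.079 * (b2 - b1)) / 86400.0

# B1=3365000, B2=95902500, 18M-FFT timing 1.2852:
print(round(p1_credit(1.2852, 3365000, 95902500), 2))  # -> 181.32
```

The result matches the 181.3220 GHz-days the server actually awarded, which supports the FFT-size explanation discussed below.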
[QUOTE=James Heinrich;258828]Has it been modified since then?[/QUOTE]
Just the interpolation for FFT lengths not in the timings table. There is no way to predict which FFT size prime95 will use when you are near an FFT boundary. The xjmptable table in mult.asm lists all the available FFT sizes and the approximate maximum exponent that can be handled.
I believe mprime used an 18M FFT on the test in question.
Well, that does reveal some discrepancies. The exponent ranges (the last two columns from your data in post #519 above) are what I'm using in my lookup table, but they don't match at all what's in mult.asm (if I'm reading it right). Using M332205149 as an example, the `t_gimps_credit_timings` data seems to indicate it should use a 20480K FFT, but mult.asm seems to say it should be a 17920K FFT. Did I read that correctly? If so, that would explain the ~15% credit difference issue I raised in post #514.
Yes, the server's timing table is based on FFT lengths available in prime95 version 25. Version 26 introduced many more FFT lengths and faster timings.
[QUOTE=James Heinrich;258747][i]KingKurly[/i] [url=http://www.mersenneforum.org/showpost.php?p=258746]noted in this post[/url] that P-1 on [url=http://mersenne.org/report_exponent/?exp_lo=332205149]M332205149[/url] (B1=3365000, B2=95902500) he got 181.3220 GHz-days of credit; but this doesn't match at all with [url=http://mersenne-aries.sili.net/prob.php?exponent=332205149&b1=3365000&b2=95902500&factorbits=76]my calculation[/url] of 210.272541 GHz-days.[/QUOTE]Good news -- after interpolating my own timings table for all the missing new FFT sizes, as well as figuring out which FFT range the exponents now fall under, I'm getting "[url=http://mersenne-aries.sili.net/prob.php?exponent=332205149&b1=3365000&b2=95902500&factorbits=76]181.321973 GHz-days[/url]" as the estimate credit, which looks pretty close. :smile:
For anyone interested, this is my version of the timing table and FFT sizes, derived from George's data above, plus the contents of gwnum/mult.asm:[code]static $fft_timing = array( 33554432 => 2.3832E+0, 33030144 => 2.3577E+0, 32768000 => 2.3449E+0, 31457280 => 2.2812E+0, 29360128 => 2.1792E+0, 28311552 => 2.0898E+0, 27525120 => 2.0227E+0, 26214400 => 1.9110E+0, 25165824 => 1.8216E+0, 23592960 => 1.6974E+0, 22937600 => 1.6456E+0, 22020096 => 1.5732E+0, 20971520 => 1.4904E+0, 19660800 => 1.3621E+0, 18874368 => 1.2852E+0, 18350080 => 1.2339E+0, 16777216 => 1.0800E+0, 16515072 => 1.0680E+0, 16384000 => 1.0620E+0, 15728640 => 1.0320E+0, 14680064 => 9.8400E-1, 14155776 => 9.4080E-1, 13762560 => 9.0840E-1, 13107200 => 8.5440E-1, 12582912 => 8.1120E-1, 11796480 => 7.5900E-1, 11468800 => 7.3725E-1, 11010048 => 7.0680E-1, 10485760 => 6.7200E-1, 9830400 => 6.1950E-1, 9437184 => 5.8800E-1, 9175040 => 5.6700E-1, 8388608 => 5.0400E-1, 8257536 => 4.9830E-1, 8192000 => 4.9545E-1, 7864320 => 4.8120E-1, 7340032 => 4.5840E-1, 7077888 => 4.3860E-1, 6881280 => 4.2375E-1, 6553600 => 3.9900E-1, 6291456 => 3.7920E-1, 5898240 => 3.5436E-1, 5734400 => 3.4401E-1, 5505024 => 3.2952E-1, 5242880 => 3.1296E-1, 4915200 => 2.9128E-1, 4718592 => 2.7828E-1, 4587520 => 2.6961E-1, 4194304 => 2.4360E-1, 4128768 => 2.4045E-1, 4096000 => 2.3887E-1, 3932160 => 2.3100E-1, 3670016 => 2.1840E-1, 3538944 => 2.0958E-1, 3440640 => 2.0296E-1, 3276800 => 1.9194E-1, 3145728 => 1.8312E-1, 2949120 => 1.7070E-1, 2867200 => 1.6552E-1, 2752512 => 1.5828E-1, 2621440 => 1.5000E-1, 2457600 => 1.3867E-1, 2359296 => 1.3188E-1, 2293760 => 1.2735E-1, 2097152 => 1.1376E-1, 2064384 => 1.1232E-1, 2048000 => 1.1160E-1, 1966080 => 1.0800E-1, 1835008 => 1.0224E-1, 1769472 => 9.8160E-2, 1720320 => 9.5100E-2, 1638400 => 9.0000E-2, 1572864 => 8.5920E-2, 1474560 => 7.9980E-2, 1376256 => 7.4040E-2, 1310720 => 7.0080E-2, 1228800 => 6.5655E-2, 1179648 => 6.3000E-2, 1146880 => 6.1230E-2, 1048576 => 5.5920E-2, 1032192 => 5.5080E-2, 983040 
=> 5.2560E-2, 917504 => 4.9200E-2, 884736 => 4.7244E-2, 819200 => 4.3332E-2, 786432 => 4.1376E-2, 737280 => 3.8334E-2, 688128 => 3.5292E-2, 655360 => 3.3264E-2, 589824 => 2.9232E-2, 573440 => 2.8224E-2, 524288 => 2.5200E-2, 491520 => 2.3880E-2, 458752 => 2.2560E-2, 409600 => 1.9770E-2, 393216 => 1.8840E-2, 344064 => 1.6194E-2, 327680 => 1.5312E-2, 294912 => 1.3644E-2, 262144 => 1.1976E-2, 245760 => 1.1376E-2, 229376 => 1.0776E-2, 196608 => 9.0744E-3, 163840 => 7.3536E-3, 147456 => 6.6732E-3, 131072 => 5.9928E-3, 122880 => 5.8368E-3, 114688 => 5.6808E-3, 98304 => 4.6872E-3, 86016 => 4.0662E-3, 81920 => 3.8592E-3, 73728 => 3.2988E-3, 65536 => 2.7384E-3, 61440 => 2.6496E-3, 57344 => 2.5608E-3, 49152 => 2.0952E-3, 40960 => 1.7256E-3, 32768 => 1.3128E-3, 28672 => 1.2600E-3, 24576 => 1.0176E-3, 20480 => 8.3760E-4, 16384 => 6.1440E-4, 14336 => 5.9040E-4, 12288 => 4.8480E-4, 10240 => 3.9840E-4, 8192 => 2.8560E-4, 7168 => 2.7600E-4, 6144 => 2.2560E-4, 5120 => 1.9200E-4, 4096 => 1.2960E-4, 3584 => 1.2648E-4, 3072 => 1.0248E-4, 2560 => 8.5200E-5, 2048 => 6.1680E-5, 1792 => 5.6880E-5, 1536 => 4.7592E-5, 1280 => 3.9120E-5, 1024 => 2.7648E-5, 896 => 2.5416E-5, 768 => 2.0592E-5, 640 => 1.7040E-5, 512 => 1.1328E-5, 448 => 1.0536E-5, 384 => 8.7840E-6, 320 => 7.3440E-6, 256 => 5.4744E-6, 224 => 5.2512E-6, 192 => 4.2936E-6, 160 => 3.5640E-6, 128 => 2.5272E-6, 112 => 2.4624E-6, 96 => 2.0520E-6, 80 => 1.7592E-6, 64 => 1.4088E-6, 48 => 1.1400E-6, 32 => 8.7840E-7, ); static $fft_size_lookup_table = array( 0 => 32, 743 => 32, 1099 => 48, 1469 => 64, 1827 => 80, 2179 => 96, 2539 => 112, 2905 => 128, 3613 => 160, 4311 => 192, 5029 => 224, 5755 => 256, 7149 => 320, 8527 => 384, 9933 => 448, 11359 => 512, 14119 => 640, 16839 => 768, 19639 => 896, 22477 => 1024, 27899 => 1280, 33289 => 1536, 38799 => 1792, 44339 => 2048, 55099 => 2560, 65729 => 3072, 76559 => 3584, 87549 => 4096, 108800 => 5120, 129900 => 6144, 151300 => 7168, 172700 => 8192, 214400 => 10240, 255300 => 12288, 297300 => 14336, 
340400 => 16384, 423300 => 20480, 504600 => 24576, 587500 => 28672, 671400 => 32768, 835200 => 40960, 995500 => 49152, 1158000 => 57344, 1243000 => 61440, 1325000 => 65536, 1485000 => 73728, 1648000 => 81920, 1725000 => 86016, 1966000 => 98304, 2287000 => 114688, 2452000 => 122880, 2614000 => 131072, 2929000 => 147456, 3251000 => 163840, 3875000 => 196608, 4512000 => 229376, 4837000 => 245760, 5158000 => 262144, 5781000 => 294912, 6421000 => 327680, 6716000 => 344064, 7651000 => 393216, 7967000 => 409600, 8908000 => 458752, 9547000 => 491520, 10180000 => 524288, 11100000 => 573440, 11410000 => 589824, 12650000 => 655360, 13250000 => 688128, 14160000 => 737280, 15070000 => 786432, 15690000 => 819200, 16930000 => 884736, 17550000 => 917504, 18800000 => 983040, 19740000 => 1032192, 20050000 => 1048576, 21850000 => 1146880, 22490000 => 1179648, 23360000 => 1228800, 24930000 => 1310720, 26120000 => 1376256, 27900000 => 1474560, 29690000 => 1572864, 30900000 => 1638400, 32420000 => 1720320, 33370000 => 1769472, 34560000 => 1835008, 37030000 => 1966080, 38570000 => 2048000, 38880000 => 2064384, 39500000 => 2097152, 43060000 => 2293760, 44250000 => 2359296, 46030000 => 2457600, 49100000 => 2621440, 51450000 => 2752512, 53460000 => 2867200, 54950000 => 2949120, 58520000 => 3145728, 60940000 => 3276800, 63970000 => 3440640, 65790000 => 3538944, 68130000 => 3670016, 73060000 => 3932160, 76090000 => 4096000, 76680000 => 4128768, 77910000 => 4194304, 84920000 => 4587520, 87250000 => 4718592, 90760000 => 4915200, 96830000 => 5242880, 101200000 => 5505024, 105300000 => 5734400, 108200000 => 5898240, 115300000 => 6291456, 120000000 => 6553600, 126000000 => 6881280, 129500000 => 7077888, 134200000 => 7340032, 143800000 => 7864320, 149800000 => 8192000, 151000000 => 8257536, 153400000 => 8388608, 167200000 => 9175040, 172000000 => 9437184, 178300000 => 9830400, 190700000 => 10485760, 199500000 => 11010048, 207600000 => 11468800, 213400000 => 11796480, 227300000 => 12582912, 
236700000 => 13107200, 248400000 => 13762560, 255500000 => 14155776, 264600000 => 14680064, 283900000 => 15728640, 295500000 => 16384000, 297800000 => 16515072, 302600000 => 16777216, 329800000 => 18350080, 339300000 => 18874368, 352500000 => 19660800, 376100000 => 20971520, 393800000 => 22020096, 409300000 => 22937600, 420700000 => 23592960, 448000000 => 25165824, 467600000 => 26214400, 488400000 => 27525120, 503500000 => 28311552, 521500000 => 29360128, 559300000 => 31457280, 582200000 => 32768000, 585300000 => 33030144, 596000000 => 33554432, );[/code] |
[QUOTE=James Heinrich;258851]Good news -- after interpolating my own timings table for all the missing new FFT sizes, as well as figuring out which FFT range the exponents now fall under, I'm getting "[url=http://mersenne-aries.sili.net/prob.php?exponent=332205149&b1=3365000&b2=95902500&factorbits=76]181.321973 GHz-days[/url]" as the estimate credit, which looks pretty close. :smile:[/QUOTE]Ok, so see if I have this right.
I can take the exponent and match it to the entry in the second table to find FFT size. Then take the FFT size and look-up the number in the first table. Then take [U]that[/U] number and input it into the formula above as $timing. And that will give a GHz-days value. |
[QUOTE=Uncwilly;258858]Ok, so see if I have this right.... And that will give a GHz-days value.[/QUOTE]Yes.
Same for L-L, except the formula is simpler ($timing * $exponent / 86400). |
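Uncwilly's two-step lookup translates directly into code. A sketch in Python (not from the thread; the tables below are hand-picked excerpts of the two tables in James's post, and the function names are mine):

```python
import bisect

# Excerpts of the two tables quoted above:
# (largest exponent handled, FFT length) and (FFT length -> per-iteration seconds).
FFT_TABLE = [(743, 32), (14119, 640), (340400, 16384), (5158000, 262144),
             (17550000, 917504), (329800000, 18350080), (339300000, 18874368)]
TIMINGS = {640: 1.7040e-5, 16384: 6.1440e-4, 262144: 1.1976e-2,
           917504: 4.9200e-2}

def fft_size(exponent):
    """Step 1: find the first threshold in the lookup table that covers the exponent."""
    keys = [k for k, _ in FFT_TABLE]
    return FFT_TABLE[bisect.bisect_left(keys, exponent)][1]

def ll_ghz_days(exponent):
    """Steps 2-3: look up $timing for that FFT size, then apply the
    L-L formula quoted above: $timing * $exponent / 86400."""
    return TIMINGS[fft_size(exponent)] * exponent / 86400
```

With the full tables this reproduces the lookup in James's example: `fft_size(332205149)` lands on 18874368, the 18M FFT.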
[QUOTE=James Heinrich;258878]Yes.[/QUOTE]:tu:So I will now have a way to figure P-1 credit for the 100M digit range. Will try to get this implemented in my spreadsheet before vacation. :juggle:
|
[QUOTE=Uncwilly;258931]So I will now have a way to figure P-1 credit for the 100M digit range.[/QUOTE]Since they'll all use an 18M FFT, the Excel version of the GHz-days formula should be:[code]=1.2852 * ((1.45 * <B1>) + (0.079 * (<B2> - <B1>))) / 86400[/code](replacing <B1> and <B2> with cell references, of course).
|
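For anyone outside Excel, the formula transcribes directly to Python (function name mine); plugging in the B1/B2 from James's M332205149 link above reproduces his ~181.32 GHz-days figure:

```python
def p1_ghz_days_18m(b1, b2):
    """Direct transcription of the Excel formula above: P-1 GHz-days credit
    for an exponent running on an 18M FFT (the 1.2852 constant is specific
    to that FFT size)."""
    return 1.2852 * ((1.45 * b1) + (0.079 * (b2 - b1))) / 86400

# James's example: B1=3365000, B2=95902500 -> ~181.32 GHz-days
```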
I find it odd that just as P-1 finished the 53M range many of my recent P-1 assignments are in the high 54M to low 55M range.
Virtually every exponent in the 54M range has been factored to at least 69 bits and is currently unassigned. |
[QUOTE=petrw1;260149]I find it odd that just as P-1 finished the 53M range many of my recent P-1 assignments are in the high 54M to low 55M range.
Virtually every exponent in the 54M range has been factored to at least 69 bits and is currently unassigned.[/QUOTE]Guilty, m'Lud. A bunch of us have been using mfaktc for the last week or few. It really chews through the TF allocations given out by default. In just over a week I alone have processed well over 1500 exponents in the 53-54M range and have gone from nowhere to 92nd in the TF league table. The current batch is factoring to 70 bits; the previous ones to 69. Paul |
[QUOTE=xilman;260151]Guilty m'Lud
A bunch of us have been using mfaktc for the last week or few. It really chews through the TF allocations given out by default. In just over a week I alone have processed well over 1500 exponents in the 53-54M range and have gone from nowhere to 92nd in the TF league table. The current batch is factoring to 70 bits; the previous ones to 69. Paul[/QUOTE] Exactly, by you and yours completing the factoring to the necessary limits all these exponents should be available for P-1 |
[QUOTE=petrw1;260165]Exactly, by you and yours completing the factoring to the necessary limits all these exponents should be available for P-1[/QUOTE]
IIRC, within each 1 million range the server hands out the least TF'ed exponents for P-1. Thus, it will first hand out all the exponents TFed to 2^68, then those to 2^69, etc. |
My P-1 assignments have been 53.9M and just turned in a 54.0M this afternoon.
My CUDA card came up last night; it will start doing a little OBD TF for real in half a day after repeating one of my successful P-1 assignments; this probably means the six-core beast isn't getting much ECM done. Any chance of a CUDA P-1 program? |
[QUOTE=Prime95;260167]IIRC, within each 1 million range the server hands out the least TF'ed exponents for P-1. Thus, it will first hand out all the exponents TFed to 2^68, then those to 2^69, etc.[/QUOTE]
That makes sense |
If there is a 2^69 > factor > 2^68, what is the chance of P-1 finding it?
David |
And the race is on.
The server has just started handing out LL assignments in the 53M range (I think, unless someone manually grabbed 50 or so), but keep in mind that with ample lower exponents still expiring daily there won't be a full daily complement in that 53M range for a while.
Before George decided to add another TF bit in the 53-59M range the P-1'ers had almost 18,000 exponents ready for LL in the 53M range. We should see a similar number back in the LL Available column in a few days once the 53M TF is done that extra bit. Then we can watch if the LL available number grows or drops in the coming months as an indication of whether P-1 is keeping up or not. |
My recent P-1 efforts have found 4 factors in perhaps 90 attempts and 360 GHz-days. I'm doing better at that than at deep (extra bit or two) TF, which has only found one factor.
Question: would P-1 on CUDA offer the same kind of speedup that is offered by TF on CUDA? |
[QUOTE=Christenson;261100]My recent P-1 efforts have found 4 factors in perhaps 90 attempts and 360GHz days. I'm doing better at that than at deep(extra bit or two) TF, which has only found one factor.
Question: would P-1 on CUDA offer the same kind of speedup that is offered by TF on CUDA?[/QUOTE] Good question. I'd argue that P-1 is a rather pensioned-off algorithm: one year older than Pollard rho, yet suffering from the same randomness. There is certainly room for an adapted version of it on a GPU. Yet if you make all that effort anyway, why not go for a new algorithm? It's not so hard to design a new one. If you look at old ones anyway, I'd ask why not take a look at Pomerance's, as basically it needs to sieve quickly. What you really want to avoid is algorithms that need to do things modulo P where P is some millions of bits. Regrettably that's nearly all algorithms... the GPUs don't have much room for that type of calculation. Sure, they can do it faster than a CPU, provided the GPU has enough RAM. The question is also whether you want to throw such code onto the net once you have it working :) Regards, Vincent |
[QUOTE=Christenson;261100]Question: would P-1 on CUDA offer the same kind of speedup that is offered by TF on CUDA?[/QUOTE]
Answer: P-1 on CUDA will offer the same kind of speedup that is offered by [B]LL[/B] on CUDA |
[QUOTE=axn;261108]Answer: P-1 on CUDA will offer the same kind of speedup that is offered by [B]LL[/B] on CUDA[/QUOTE]
You can design an adapted version of the algorithm which randomly looks for factors of limited size, say 256 or 512 bits; that means you can run it on every core rather than needing all cores to calculate at the million-bit level. Then, in a similar way to TF, you can see whether it divides P. That'll give a massive boost of course and might find a few percent. Regards, Vincent |
[QUOTE=axn;261108]Answer: P-1 on CUDA will offer the same kind of speedup that is offered by [B]LL[/B] on CUDA[/QUOTE]
A literal implementation of P-1 will of course not run at the same speed as LL, because it is difficult to implement the GCD in an efficient manner on a GPU. You will have many compute units sitting idle. |
[QUOTE=diep;261120]You can design an adapted version algorithm which randomly looks for factors of limited size. Say 256 or 512 bits; that means you can run it at every core rather than needing all cores to calculate at million bitlevel. In similar way like TF then you can see whether it divides P.[/QUOTE]
That would not be a P-1 algorithm. P-1 algorithm doesn't really use any randomisation, nor does it find factors of specific sizes. [QUOTE=diep;261122]A litterary implementation of P-1 will of course not work at the same speed like LL, because it is difficult to implement the GCD at an efficient manner at GPU. You will have many compute units idle.[/QUOTE] Maybe so. However, gcd computation is only a _very_ small fraction of the total runtime. For a given FFT size, you can essentially treat gcd as a constant-time operation. |
[QUOTE=axn;261131]That would not be a P-1 algorithm. P-1 algorithm doesn't really use any randomisation, nor does it find factors of specific sizes.
[/quote] I don't have the original publication of Pollard. Wiki uses a random number to initialize. See: [url]http://en.wikipedia.org/wiki/Pollard's_p_−_1_algorithm[/url] "randomly pick a coprime to n (note: we can actually fix a, random selection here is not imperative)" Crandall & Pomerance write it differently in the book Prime Numbers, I see now. They use c=2 as a starting point and note that you can use a random number there as well. [quote] Maybe so. However, gcd computation is only a _very_ small fraction of the total runtime. For a given FFT size, you can essentially treat gcd as a constant-time operation.[/QUOTE] Well, modern GPUs allow really high IPC; it is difficult for me to see how to avoid losing a significant amount of that IPC. A single multiplication mod P costs 3n + 2 n log n. Many of the log n steps are not dependent on RAM speed; you get about a factor 3 reduction there, so the RAM-dependent part is roughly 4 + (log n - 9) / 3. The initial O(n) operation for the GCD is dependent on the RAM, as the GPU doesn't have enough cache. After this you stumble on the problem of taking non-easy moduli, i.e. not some power-of-2 modulus. That's rather slow on a GPU. The total price of the GCD could easily be more than the price of an FFT. Obviously you can see this as linear, but in fact you slow down by over a factor 2 compared to an LL. A factor 2 is a significant slowdown. |
[QUOTE=diep;261136]
The total price of the GCD could easily be more than the price of a FFT. Obviously you can see this as lineair, but in fact you slow down over factor 2 compared to a LL. Factor 2 is a significant slowdown.[/QUOTE] Can't you devote that last stage to the CPU? [QUOTE=diep;261136]There sure is room for an adapted version of it at a gpu. Yet if you do all that effort anyway, why not go for a new algorithm, it's not so hard to design a new one. If you take a look to old ones anyway i'd argue why not take a look at pomerance? As basically it needs to sieve quick.[/QUOTE] May I have some more pointers to it? :smile: Luigi |
[QUOTE=ET_;261180]Can't you devote that last stage to the CPU?
[/quote] Bandwidth internally at the GPU is, say, 2x170GB/s up to 190GB/s, depending on which propaganda channel you believe. The 140GB/s is what I measured at home, though the card is also serving as a videocard at the same time. When the 'butterfly', or rather the unit calculation of an NTT, works well on the GPU, the limitation for the FFT is bandwidth to the RAM. Say we get units of 96 bits, stored as 3 x 32 bits, from the RAM. We need that 2 times for a unit calculation and we store it 2 times, so we read and write in total 4 units * 3 integers * 4 bytes = 48 bytes. The GPU can deliver this 140GB/s / 48 = 2.92 billion times per second, while we can execute 1.35T instructions per second at the GPU. So the RAM already is quite a problem: so to speak, we've got 500 cycles available, or can execute 500 instructions per limb. Now if we communicate with the CPU: on my mainboard the link speed from CPU to GPU is 2.0 GB/s and from GPU to CPU is 2.2 GB/s. Sure, it's also serving as a videocard at the same time; let me no longer use that excuse, as it won't matter much in this case :) So the difference between the bandwidth the mainboard delivers over PCIe and the bandwidth of the GPU's RAM is too large, whereas the GPU's bandwidth is already no great luxury for the transform. [quote] May I have some more pointers at it? :smile: Luigi[/QUOTE] [url]http://en.wikipedia.org/wiki/Quadratic_sieve[/url] |
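The arithmetic in the post above can be checked in a few lines (all figures are Vincent's own estimates, not measurements of any particular card; variable names are mine):

```python
# A 96-bit unit stored as 3 x 32-bit integers
bytes_per_unit = 3 * 4
# Per butterfly: read two units, write two units = 48 bytes of traffic
bytes_per_butterfly = 4 * bytes_per_unit
mem_bandwidth = 140e9                     # measured bytes/second
butterflies_per_sec = mem_bandwidth / bytes_per_butterfly   # ~2.92e9/s
instr_per_sec = 1.35e12                   # claimed GPU instruction throughput
# Instructions the GPU could issue per butterfly before RAM becomes the
# bottleneck -- roughly the "500 cycles per limb" figure in the post.
instr_per_butterfly = instr_per_sec / butterflies_per_sec
```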
[QUOTE=diep;261183]Bandwidth internally at the GPU is say 2x170GB/s up to 190GB/s. ... So the difference in bandwidth between what the mainboard over the pci-e delivers versus the bandwidth of the GPU's RAM is too much of a difference, whereas the GPU's bandwidth already is not huge luxury for the transform.[/QUOTE] I'll have to study this better; I was confusing it with the GCD final stage of ECM :redface: Obviously, a continuous flow of data from GPU to CPU and vice versa is to be excluded. Thank you for the link! Luigi |
[QUOTE=axn;261108]Answer: P-1 on CUDA will offer the same kind of speedup that is offered by [B]LL[/B] on CUDA[/QUOTE]
Axn, roughly how much LL speedup do we get on CUDA? (10-100x is typical for TF; I assume it's not nearly as good for LL.) Also, Vincent, the high-end CUDA cards have almost as much memory on them as my CPUs, typically a few gig. Given that LL requires perhaps 100 MB of RAM, can we get multiple LLs in parallel on one CUDA card for better GPU utilization? I was hoping for a better understanding of P-1; I take it it spends the majority of its time doing FFTs and inverses? |
[QUOTE=diep;261183]
[url]http://en.wikipedia.org/wiki/Quadratic_sieve[/url][/QUOTE] What does this have to do with P-1? The two algorithms are designed to achieve very different goals. QS is useless for Mersenne candidates with exponents greater than 512 or so (and thats being fairly generous). |
[QUOTE=Christenson;261195]Axn, roughly how much LL speedup do we get on CUDA? (10-100 is typical for TF; I assume it's not nearly as good for LL).[/quote]
Definitely less. From the numbers I have seen here, on the order of 3x-10x (compared to a single core of high end CPU). Maybe higher for the really high end cards. [QUOTE=Christenson;261195]I was hoping for a better understanding of P-1; I take it it spends the majority of its time doing FFTs and inverses?[/QUOTE] Essentially, it is a large modular exponentiation operation -- lots of multiplication. So, yes. |
[QUOTE=ET_;261185]I'll have to study better. I confused the GCD final stage of ECM :redface:
[/QUOTE] Don't worry. It is not you who has to study better. Vincent is definitely conflating many different things here. FWIW, P-1 stage 1 & 2 takes _days_ to run. The gcd phase lasts a few _minutes_. No matter how inefficient the gcd is, it will never be the bottleneck. |
Vincent,
P-1 is a deterministic algorithm, not a probabilistic one like Pollard Rho. P-1 will [I]always[/I] find a factor if that factor is within the B1/B2 search space. [QUOTE=diep;261102]I'd argue that P-1 is a tad pensionned algorithm. One year older than Pollard-Rho, yet suffering from the same randomness.[/QUOTE] [QUOTE=diep;261120]You can design an adapted version algorithm which randomly looks for factors of limited size.[/QUOTE] [QUOTE=diep;261136]Wiki uses a random number to initialize. See: [URL="http://en.wikipedia.org/wiki/Pollard%27s_p_%E2%88%92_1_algorithm"]http://en.wikipedia.org/wiki/Pollard's_p_−_1_algorithm[/URL] "randomly pick a coprime to n (note: we can actually fix a, random selection here is not imperative)" Crandall&Pomerance write it different in the book primes i see now. They use c=2 as a starting point and note you can use as well a random number there.[/QUOTE]In addition to what axn has already replied: If you compare the P-1 and Pollard Rho algorithms, you will find that the Pollard Rho algorithm uses a pseudorandom function [I]within its main loop[/I]. That is not true of the P-1 algorithm, so P-1 is _not_ "suffering from the same randomness". While it is true that a random number (as long as it is coprime to the number being factored) [I]can[/I] be used for the P-1 initial value that is exponentiated, [I]randomness of its choice is not necessary![/I] As your own quote from Wikipedia says, "(note: we can actually fix a, random selection here is not imperative)." P-1 definitely does [U]not[/U] do any pseudorandom number computation inside its loops ... and it need not do [I]any[/I] pseudorandom number computation at all, since it is perfectly acceptable to use a fixed initial value (such as 3, which Prime95 uses) as long as it's properly coprime. |
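cheesehead's description boils down to the textbook stage 1. A toy sketch (illustrative only, nothing like Prime95's FFT-based implementation; function name and the tiny trial-division sieve are mine):

```python
from math import gcd

def p_minus_1_stage1(n, B1, a=3):
    """Textbook P-1 stage 1: raise a fixed base to the product of all prime
    powers <= B1, then take a gcd. Deterministic: any base coprime to n
    exposes exactly the factors q with q-1 B1-smooth."""
    # Generate primes up to B1 by naive trial division (fine at toy sizes).
    primes = []
    for p in range(2, B1 + 1):
        if all(p % q for q in primes if q * q <= p):
            primes.append(p)
    x = a
    for p in primes:
        pe = p
        while pe * p <= B1:   # largest power of p not exceeding B1
            pe *= p
        x = pow(x, pe, n)
    g = gcd(x - 1, n)
    return g if 1 < g < n else None
```

For example, with n = 253 = 11 * 23 and B1 = 5 this returns 11, because 11 - 1 = 10 = 2*5 is 5-smooth while 23 - 1 = 22 = 2*11 is not.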
axn, thanks.
So we spend 2 days getting 3 to this large power -- how is the GCD operation carried out efficiently? (I'm doing light studying here, thinking long-term about maybe doing P-1 on CUDA). And does additional TF, which has just gotten cheaper, help out P-1? |
[QUOTE=cheesehead;261208]Vincent,
P-1 is a deterministic algorithm, not a probabilistic one like Pollard Rho. P-1 will [I]always[/I] find a factor if that factor is within the B1/B2 search space. ... it is perfectly acceptable to use a fixed initial value (such as 3, which Prime95 uses) as long as it's properly coprime.[/QUOTE] I completely disagree with you here, as it all has to do with the B1/B2. This is because of the political wording you chose: "P-1 will [I]always[/I] find a factor if that factor is within the B1/B2 search space." The real problem is the selection of the B1/B2 to pick. If we compare this with a systematic search through all hills and valleys, then this is just a randomly behaving algorithm, as it depends upon your initial random selection of a starting number. If you pick a constant then you will miss the entire other search space that is out there. Each time you're just busy in a single valley trying to hill-climb a single mountain. If it isn't the Everest by accident, you'll never succeed. That's the kind of randomness I speak about. Furthermore, you somehow want to know whether you can factor the number, and you don't have unlimited RAM.
Are you gonna postpone the GCD on a GPU with its limited RAM, or try it each time just like Pollard rho? The FFT of an LL runs, for the biggest part of the calculation, not from the RAM; you can really limit the RAM dependency. With this algorithm you can't, except if you try that GCD each time (if I studied it correctly). Otherwise you're going to be completely RAM dependent. The GPU can execute 1.35T instructions per second; the RAM can deliver just 170GB/s or so (and that's a theoretical number, whereas the 1.35T is achievable). If additionally each entity stores a few bits in a 64-bit unit (so 2x32 bits), then the number of instructions available for each unit is quite huge. So you just can't use GPU RAM to store anything. Apply the GCD directly? That would be a solution: only a factor 2 slowdown. The alternative I see is not going to be fun. Vincent |
[QUOTE=Christenson;261227]So we spend 2 days getting 3 to this large power -- how is the GCD operation carried out efficiently? (I'm doing light studying here, thinking long-term about maybe doing P-1 on CUDA).[/quote]
[url]http://en.wikipedia.org/wiki/Euclidean_algorithm[/url] -- The basic algorithm is simple enough. The tricky bit is to do a "generalized" (i.e. involving numbers of no special form) mod operation. I do not know the state of the art in doing that part efficiently. [QUOTE=Christenson;261227]And does additional TF, which has just gotten cheaper, help out P-1?[/QUOTE] Quite the opposite, actually. In general, smaller factors are more likely to be "smooth" (which P-1 requires) than larger ones. So additional TF screening means that P-1 would receive candidates that are less likely to have smooth factors ==> fewer successes. But the effect won't be that pronounced, if it is just 1 or 2 bit levels of addnl TF. |
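The wiki algorithm axn links is tiny; for huge operands the whole difficulty hides inside the `%`. A sketch (Python's stdlib `math.gcd` does the same job with a faster implementation):

```python
def euclid_gcd(a, b):
    """Plain Euclidean algorithm: replace (a, b) by (b, a mod b) until b = 0.
    In P-1, a is the multi-million-bit 3^E - 1 residue and b the Mersenne
    number; the loop is trivial, the general-form mod operation is not."""
    while b:
        a, b = b, a % b
    return a
```

For instance, gcd of M29 = 536870911 = 233 * 1103 * 2089 with 233 * 5 = 1165 recovers the shared factor 233.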
[QUOTE=axn;261233][url]http://en.wikipedia.org/wiki/Euclidean_algorithm[/url] -- The basic algorithm is simple enough. The tricky bit is to do a "generalized" (i.e. involving numbers of no special form) mod operation. I do not know the state of the art in doing that part efficiently.
Quite the opposite, actually. In general, smaller factors are more likely to be "smooth" (which P-1 requires) than larger ones. So additional TF screening means that P-1 would receive candidates that are less likely to have smooth factors ==> fewer successes. But the effect won't be that pronounced, if it is just 1 or 2 bit levels of addnl TF.[/QUOTE] Our experience is quite the opposite; additional bits of TF find other factors than P-1. We compared the default 61-bit TF, which I had done, with what P-1 found at around the 8-million-bit range for Wagstaff. In general P-1 factoring is so random compared to TF that the statistical odds that it finds the small factor that another few bits of TF would find, assuming it is there, are so small that P-1 has nearly no overlap with another few bits of TF. Vincent |
[QUOTE=Christenson;261195]Axn, roughly how much LL speedup do we get on CUDA? [/QUOTE]
Using CUDALucas on my GTX275, I have 2x-4.5x speedup depending on the exponent and the FFT chosen. Luigi |
[QUOTE=diep;261235]Additional bits of TF finds other factors than P-1.[/quote]
A finding that I never disputed. [QUOTE=diep;261235]In general P-1 factoring is so randomly compared to TF that the statistical odds it finds that small factor that another few bits of TF finds, assuming it is there, is so small that P-1 has nearly no overlap with another few bits of TF.[/QUOTE] Also consistent with what I wrote, to wit: [QUOTE=axn;261233]But the effect won't be that pronounced, if it is just 1 or 2 bit levels of addnl TF.[/QUOTE] |
[QUOTE=Christenson;261227]And does additional TF, which has just gotten cheaper, help out P-1?[/QUOTE]
It helps out the project as a whole. Each additional bit of TF eliminates about another 1.5% of the exponents, per this page: [url]http://www.mersenne.org/various/math.php[/url] The question tends to center around: is a 1.5% success rate enough to justify another bit level, or is the time better spent doing the LL/DC? This seems to depend mostly on two things (IMHO): - How fast is TF relative to LL? The answer is very different depending on whether you use a CPU or a GPU. - How many people choose TF over LL? Right now we have an excess of people choosing TF. The bit levels being handed out now at the current LL wave are higher than recommended by that same math page because of this excess of people doing TF, more and more of them using GPUs. |
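The ~1.5% figure comes from the heuristic on that math page: the chance that a Mersenne number has a factor between 2^x and 2^(x+1) is roughly 1/x. A one-liner makes the point (function name mine):

```python
def chance_of_factor_in_next_bit(x):
    """Heuristic from mersenne.org/various/math.php: the probability of a
    factor between 2^x and 2^(x+1) is about 1/x."""
    return 1.0 / x

# At today's TF depths this is roughly 1/69 ~ 1.45%, i.e. "about 1.5%".
```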
[QUOTE=petrw1;261270]It helps out the projects as a whole.
Each additional bit of TF eliminates about another 1.5% of the exponents per this page: [url]http://www.mersenne.org/various/math.php[/url] ...[/QUOTE] TF is very cheap. Of course each additional bit level soon yields a lot less than 1.5%, otherwise we'd only be doing TF. Doing 1 additional bit of TF costs a factor 2 more than the current bit level. It's not dependent on caches or RAM speed (in the case of TheJudger's code it is, as he generates the factor candidates on the CPU, but that's another discussion). You work modulo the factor candidate, so doing another bit is simply the same speed per candidate. A factored candidate doesn't need a double check: finding a factor is hard, but it's very easy to verify that the factor divides the Mersenne cq Wagstaff number. This is where LL has a lot of additional problems. It slows down by more than a factor 2 as your n increases; it slows down quadratically. You grow further out of the caches and become more and more dependent on RAM speed when things grow out of hand. That really sucks. So TF really has major advantages to do real deep. Additionally, once we get to the 100M-digit level, it's something like 1 year on a GPU now to run a 100M-digit LL? That's a pretty long time for a single run.
Maybe you can double-check the first few million iterations of such a run, but I guess it'll take a really long time to double-check all of it. Compared to all those problems with LL, I'd guess TF is pretty efficient and fast to do really, really deep. However, as usual with totos, the majority want a shot at finding a huge prime and won't run factoring-type software, as that for sure won't find them a giant prime number. So I guess now is the time to TF really deep, because as soon as we get the 22 nm GPUs, which might be able to do an LL within 1 or 2 months at the 332M+ bit level, I guess most will soon abandon running TF and other factoring software. |
[QUOTE=diep;261278]So TF really has major advantages to do real deep.[/QUOTE]Using an example of M333000013, which TF's by default to 2^76, we find:
* TF to 4 levels above that (2^80) takes an additional [url=http://mersenne-aries.sili.net/credit.php?worktype=TF&exponent=333000013&f_exponent=&b1=&b2=&numcurves=&factor=&frombits=76&tobits=80]689 GHz-days[/url] and has a 5.16% chance of finding a factor * P-1, for the same 5.16% chance of factor, takes [url=http://mersenne-aries.sili.net/prob.php?exponent=333000013&prob=5.16&work=&factorbits=76]85GHz-days[/url] of work. So your TF processor has to perform at least 8.1x faster than your P-1 processor to be worth it (this goes up to roughly 10x performance-per-chance difference at 2^86, 12%). |
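James's break-even figure is just the ratio of the two costs for the same 5.16% chance of a factor (numbers from his links above):

```python
# James's figures for M333000013
tf_cost_ghz_days = 689.0   # TF from 2^76 to 2^80
p1_cost_ghz_days = 85.0    # P-1 with the same 5.16% chance of a factor
# TF hardware must be at least this many times faster per GHz-day than
# the P-1 hardware before the extra TF levels are worth it: ~8.1x
break_even = tf_cost_ghz_days / p1_cost_ghz_days
```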
[QUOTE=diep;261230]I completely disagree with you here as it all has to do with the B1/B2.[/QUOTE]That's what I mean by "within the B1/B2 search space".
Once you choose a B1 and B2, P-1 will find any factor that is of the form: (Product of any number of prime factors or prime powers no greater than B1) * (a prime factor or prime power no greater than B2) + 1 In the case of Mersennes, prime95 also throws in the Mersenne exponent (n), so that it will find any factor of the form: (Product of any number of prime factors or prime powers no greater than B1) * (a prime factor or prime power no greater than B2) * n + 1 That is deterministic, not probabilistic. [quote]This because of the political wording you choose: "P-1 will [I]always[/I] find a factor if that factor is within the B1/B2 search space."[/quote]It's simply factual. No politics involved. Perhaps my phrase "B1/B2 search space" is not clear enough for you. [quote]The real problem is the selection of the B1/B2 to pick.[/quote]That's like saying that "the real problem in TF is selection of the upper bound. Shall we TF to 2^75 or to 2^76?" If you TF to 2^75, you'll find any factor less than 2^75. Subsequently TFing from 2^75 to 2^76 will find any factor between 2^75 and 2^76. The same is true of P-1, except that the B1/B2 search space is multidimensional, not the simple one-dimensional number line that TF searches. [quote]If we compare this with a systematic search through all hills and valleys[/quote]... which is exactly what P-1 does. [quote]then this is just a randomly behaving algorithm as it depends upon your initial random selection of a starting number.[/quote]No, it does not. Because the algorithm is based on properties of a multiplicative group, [I]it does not matter what the starting number (prime95 always uses 3) is, just as long as it is coprime with the number being factored[/I]. As the Wikipedia article states, [quote]...
by working in the multiplicative group modulo a composite number [I]N[/I], we are also working in the multiplicative groups modulo all of [I]N'[/I]s factors.[/quote]With [I]any[/I] coprime initial value a, the result of the exponentiation has the property that it is congruent to (a factor satisfying the above criteria) mod p, if such factor exists. 100% guaranteed. (Fermat's Little Theorem) No guessing or probability involved. [quote]If you pick a constant then you will miss the entire other search space that is out there.[/quote]No, because the search space [I]does not depend on the choice of a (as long as it's coprime ...)[/I]. It depends only on B1/B2. [quote]Each time you're just busy in a single valley trying to hillclimb a single mountain.[/quote]No, you are misunderstanding the multiplicative group property upon which P-1 is based. It's based on multiplication mod N, which cycles through all possible congruence classes 1, 2, 3, ... N-1 if N is prime. If N is composite, there are subsequences of prime lengths which divide N, and those lengths are found by the GCD. Perhaps someone with a more practical understanding of multiplicative groups than my intuitive understanding can explain it better for you. |
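cheesehead's factor-form criterion can be written down as a naive checker (an illustrative sketch with toy trial division; function names are mine, and Prime95 does nothing remotely this slow):

```python
def _is_prime(n):
    """Naive primality test by trial division (fine at these toy sizes)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def p1_would_find(q, p, B1, B2):
    """Would P-1 with bounds B1/B2 find the prime factor q of M_p?
    Factors of M_p have the form 2kp + 1; with the exponent thrown into
    stage 1 (as Prime95 does), q is found when k is a product of prime
    powers <= B1 times at most one extra prime <= B2."""
    k, rem = divmod(q - 1, 2 * p)
    if rem:
        return False
    for d in range(2, B1 + 1):
        if not _is_prime(d):
            continue
        pe = d
        while k % d == 0 and pe <= B1:   # only prime powers <= B1 count
            k //= d
            pe *= d
    return k == 1 or (_is_prime(k) and k <= B2)

# M37 = 223 * 616318177: with B1=50, B2=200 the larger factor is missed
# (616318176/74 = 2^4 * 3 * 167 * 1039 needs both 167 and 1039), but
# raising the bounds to B1=200, B2=1100 catches 1039 in stage 2.
```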
[QUOTE=cheesehead;261297]That's what I mean by "within the B1/B2 search space".
Once you choose a B1 and B2, P-1 will find any factor that is of the form: (Product of any number of prime factors or prime powers no greater than B1) * (a prime factor or prime power no greater than B2) + 1 In the case of Mersennes, prime95 also throws in the Mersenne exponent (n), so that it will find any factor of the form: (Product of any number of prime factors or prime powers no greater than B1) * (a prime factor or prime power no greater than B2) * n + 1 That is deterministic, not probabilistic. It's simply factual. No politics involved. Perhaps my phrase "B1/B2 search space" is not clear enough for you. That's like saying that "the real problem in TF is selection of the upper bound. Shall we TF to 2^75 or to 2^76? " If you TF to 2^75, you'll find any factor less than 2^75. If you subsequently TF from 2^75 to 2^76 will find any factor between 2^75 and 2^76. The same is true of P-1, except that the B1/B2 search space is multidimensional, not the simple one-dimensional number line that TF searches. ... which is exactly what P-1 does. No, it does not. Because the algorithm is based on properties of a multiplicative group, [I]it does not matter what the starting number (prime95 always uses 3) is, just as long as it is coprime with the number being factored[/I]. As the Wikipedia article states, With [I]any[/I] coprime initial value a, the result of the exponentiation has the property that it is congruent to (a factor satisfying the above criteria) mod p, if such factor exists. 100% guaranteed. (Fermat's Little Theorem) No guessing or probability involved. No, because the search space [I]does not depend on the choice of a (as long as it's coprime ...)[/I]. It depends only on B1/B2. No, you are misunderstanding the multiplicative group property upon which P-1 is based. It's based on multiplication mod N, which cycles through all possible congruence classes 1, 2, 3, ... N-1 if N is prime. 
If N is composite, there are subsequences of prime lengths which divide N, and those lengths are found by the GCD. Perhaps someone with a more practical understanding of multiplicative groups than my intuitive understanding can explain it better for you.[/QUOTE]

I think you misunderstand my search terminology. If we perform a brute force search we ain't gonna miss anything. Now TF is a very stupid form of brute force search. You do n bits then n+1 bits and so on.

P-1 simply will sometimes find big factor for 1 prime, then miss a simple 63 bits factor for other prime. In fact odds it misses that 63 bits factor is *huge*. So the result is total random from brute force viewpoint seen. Missing small factors say < 128 bits is incredible childish for factoring algorithm considering system time put into P-1.

What i call the hill there, you call of course correctly the multiplicative group; as a search expert i only see things from search viewpoint of course and i just see it's the wrong hill nearly always. Just in a few % of cases it's the correct hill. Very primitive.

If we compare the public factorisation algorithms with progress in for example computer chess, then factorisation algorithms have stood still and are still stuck in the 1970s, so to speak, compared to how search moved in computer chess. But of course there was public money to make with it in computer chess, and lots of prestige. In factorisation publicly nothing happened that's worth even mentioning.

Regards, Vincent |
In fact, we've done P-1 at some tens of thousands of exponents now for Wagstaff just around and mostly above 8 million bits; it doesn't even break even. And that though we just trial factored to 61 bits so far. Still have to throw GPU into action to factor further.
Jeff Gilchrist also tried other forms of public factorisation. ECM and others. All been tried at his cluster. Nothing even remotely breaks even.

Well, let me rephrase that. The oldie k8's are of course not so great in SSE2. Just 1 instruction per cycle (2 flops per cycle per core), so from his viewpoint having them factor is ok, as it's slow for the VRB-Reix test anyway. Which of course is 1 to 1 similar to a LL (Wagstaff has 100% the same properties as Mersenne). Even at those k8's it's hardly break even. Just because it's nicer to have things factored than to maybe have 1 or 2 wrong residues, *maybe* we prefer the factorisation.

So 50 years of factorisation or so, what has it brought in terms of public algorithms? Especially the past 16 years. *nothing* How come?

Regards, Vincent |
[QUOTE=James Heinrich;261290]Using an example of M333000013, which TF's by default to 2^76, we find:
* TF to 4 levels above that (2^80) takes an additional [URL="http://mersenne-aries.sili.net/credit.php?worktype=TF&exponent=333000013&f_exponent=&b1=&b2=&numcurves=&factor=&frombits=76&tobits=80"]689 GHz-days[/URL] and has a 5.16% chance of finding a factor
* P-1, for the same 5.16% chance of a factor, takes [URL="http://mersenne-aries.sili.net/prob.php?exponent=333000013&prob=5.16&work=&factorbits=76"]85 GHz-days[/URL] of work.

So your TF processor has to perform at least 8.1x faster than your P-1 processor to be worth it (this goes up to roughly a 10x performance-per-chance difference at 2^86, 12%).[/QUOTE]

Without a CUDA P-1, mfaktc can get that 689 GHz-days done in approximately 24 hours on a high-end card, and 8 days or so on my medium (GTX440) card, except that it's off helping out on Operation Billion Digits this week. Assuming a CUDA P-1 at 10X speedup, we are roughly at equilibrium. Without a CUDA P-1, we are way behind and should do more TF.

So here's my next question: if, for our P-1 operation, we are spending the majority of our time raising three to a high exponent, can we improve the algorithm's running time by saving the result, and then multiplying it by 3 raised to a much smaller power, for the next, presumably nearby Mersenne exponent we wish to run P-1 on? |
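The "8.1x faster" figure above is just the ratio of the two quoted GHz-days costs; a quick sanity check (the numbers are the ones from the post):

```python
# Numbers quoted above for M333000013, both giving a 5.16% chance of a factor:
tf_ghz_days = 689.0   # TF from 2^76 to 2^80
p1_ghz_days = 85.0    # P-1 with a comparable probability of success
print(round(tf_ghz_days / p1_ghz_days, 1))   # -> 8.1, the break-even speed ratio
```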
[QUOTE=diep;261304]In fact, we've done P-1 at some tens of thousands of exponents now for Wagstaff just around and mostly above 8 million bits; it doesn't even break even. And that though we just trial factored to 61 bits so far. [/QUOTE]
Can you provide some details on this? What were the bounds used? What was the run-time relative to a V-R test? What was the hit rate? How much memory was used? What software was used? |
[QUOTE=diep;261302]I think you misunderstand my search terminology.
If we perform a brute force search we ain't gonna miss anything.[/QUOTE]... and by "brute force search" I think you mean that it searches sequentially among candidates which are arranged along the customary one-dimensional number line.

[quote]P-1 simply will sometimes find big factor for 1 prime, then miss a simple 63 bits factor for other prime. In fact odds it misses that 63 bits factor is *huge*.[/quote]But there, you're looking at the P-1 search as though it proceeded along the same one-dimensional number line as TF. When you look at it in a more appropriate coordinate system, P-1 is a well-behaved deterministic method with no strange misses. What you perceive as adjacent points in the TF search space are not necessarily adjacent points in the P-1 search space, and vice versa.

I use the terminology "B1/B2 search space" to denote the two-dimensional space that P-1 searches for factors of N. Neither of its dimensions is the one-dimensional integer number line that TF searches. Instead, in B1/B2 P-1 search space, one axis represents the largest prime or prime power factor of P-1, where P is any factor of N. The other axis represents the second largest prime or prime power factor of P-1, where P is any factor of N. Most intersections at integral coordinates of that space are empty, but each factor P of N occupies one intersection. The P-1 algorithm searches the rectangle with (0,0) at one corner and (B1,B2) at the diagonally opposite corner. If any factor of N resides in that rectangle, P-1 will definitely find it.

Examples: The integer 19 corresponds to the intersection (2,9) in B1/B2 P-1 search space. 37 corresponds to (4,9). 127 corresponds to (7,9). 631 corresponds to the same intersection (7,9). Any number which has either 127 or 631 as a factor can be factored by P-1 with limits of B1=7 and B2=9. That same search (B1=7, B2=9) will find any other factor that exists (corresponds to an intersection) in the rectangle from (0,0) to (7,9). 
Intersection (4,5) corresponds to 61 (since 60 = 2^2 * 3 * 5), which would also be found by a (7,9) search -- if it is a factor of the number being tested. (101 actually sits at (4,25), since 100 = 2^2 * 5^2, so a (7,9) search would miss it.) If more than one prime factor is in the search rectangle, P-1 will find their product, so factors found by P-1 need to be checked for compositeness.

Now, a mapping from that P-1 search space to some other space may transform the nice neat rectangle from (0,0) to (B1,B2) into a pattern of hills and valleys, but insisting that such hills and valleys are the proper way, or only way, to look at a P-1 search will lead one to misunderstandings such as considering P-1 to be a probabilistic method. In the most appropriate view of the P-1 search space, it is a deterministic rectangle on the other end of the mapping.

[quote]So the result is total random from brute force viewpoint seen.[/quote]... only if one is looking at the P-1 search space without using the most appropriate coordinate system, or looking at the wrong end of a mapping from one space to the other.

[quote]Missing small factors say < 128 bits is incredible childish for factoring algorithm considering system time put into P-1.[/quote]All you're doing there is judging P-1 by an inappropriate standard (size of factor P, rather than size of the largest factor of P-1), which leads you to a mistaken conclusion about the algorithm's childishness.

[quote]What i call the hill there, you call of course correctly the multiplicative group;[/quote]No, the hill is simply a part of a mapping from P-1 search space to TF search space. What the multiplicative group corresponds to is a rectangle in P-1 search space from (0,0) to (second largest prime or prime power factor of P-1 where P is any factor of the group order, largest prime or prime power factor of P-1 where P is any factor of the group order). That rectangle will map to many noncontiguous hills and valleys in your conventional search space.

[quote]as a search expert i only see things from search viewpoint of course[/quote]... 
but here, you're using only one type of coordinate space -- one that's inappropriate for P-1. In your space, a 128-bit prime number may be just 2 away from a number whose largest factor is, say, 11-bit. In mapping to P-1 space, those two numbers may be very far apart, or close together, depending on the factorizations of the numbers-1. [quote]and i just see it's the wrong hill nearly always. Just in a few % of cases it's the correct hill. Very primitive.[/quote]... because what you're looking at is only a part of one end of a mapping from P-1 search space -- and it's the inappropriate end for understanding P-1. |
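cheesehead's (second-largest, largest) coordinate model can be made concrete with a short script. The function names are mine, and the sketch follows the post's convention that each prime power q^a exactly dividing P-1 counts as one component:

```python
def prime_power_components(n):
    """Split n into its prime-power components, e.g. 126 -> [2, 9, 7]."""
    comps = []
    d = 2
    while d * d <= n:
        if n % d == 0:
            pp = 1
            while n % d == 0:   # collect the full power of d dividing n
                pp *= d
                n //= d
            comps.append(pp)
        d += 1
    if n > 1:
        comps.append(n)
    return comps

def p1_coordinates(p):
    """(second-largest, largest) prime-power component of p-1 -- the
    intersection p occupies in the B1/B2 search space described above."""
    comps = sorted(prime_power_components(p - 1))
    return (comps[-2] if len(comps) > 1 else 1, comps[-1])

for p in (19, 37, 127, 631, 61):
    print(p, p1_coordinates(p))
# 19 -> (2, 9), 37 -> (4, 9), 127 and 631 -> (7, 9), 61 -> (4, 5),
# matching the post's examples: a (B1, B2) run finds p exactly when
# second-largest <= B1 and largest <= B2.
```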
[QUOTE=Christenson;261307]So here's my next question: If, for our P-1 operation, we are spending the majority of our time raising three to a high exponent, can we improve the algorithm time by saving the result, and then multiplying this by 3 to a much smaller power for the next, presumably nearby mersenne exponent we wish to work our P-1 on?[/QUOTE]Good question.
The problem is that in practice, for the size of numbers we're working with, we have to handle all the exponentiation mod N. (N is often some Mersenne 2[sup]p[/sup]-1.) The same consideration applies to the squaring in the Lucas-Lehmer test in addition to the exponentiation in P-1. Without the modular reductions, we'd quickly (surprisingly quickly) be trying to store numbers which have more digits than the number of particles in the known universe.

See the response to "Why not skip the modular reduction in the LL test and reuse residues?" on this page [URL]http://www.mersennewiki.org/index.php/Frequently_suggested_improvements[/URL] and the second and third paragraphs (which include a calculation about the number of particles in the known universe) under "Simple Explanation" on this page [URL]http://www.mersennewiki.org/index.php/Lucas-Lehmer_Test[/URL]

Since we're therefore dealing with residues mod N, the ones we compute for one value of N are useless when computing for a different value of N. So it [I]is[/I] necessary to start fresh each time ... unless one is continuing a computation for the same N, but to higher limits! Therefore, saving the final residue from an unsuccessful P-1 [I]is[/I] useful when another P-1 test is simply extending the limits used by the first P-1 test -- but note that here we're not changing N, which is why the earlier mod N result is useful.

Furthermore (again, for the size of numbers we're dealing with), it's actually much faster to restart exponentiation afresh, with modular arithmetic, than it would be to "read" stored full non-modular products from wherever we managed to store them. (Consider that the size of "disk" you'd be reading from might need to be larger than the Milky Way galaxy.) |
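A small sketch makes the growth argument tangible. The sieve and the stage-1 exponent convention (each prime raised to the largest power not exceeding B1) are my assumptions about the general method, not prime95's actual code:

```python
from math import log10

def stage1_exponent(B1):
    """Product of q^floor(log_q(B1)) over all primes q <= B1 --
    a sketch of the P-1 stage-1 exponent."""
    sieve = [True] * (B1 + 1)
    E = 1
    for q in range(2, B1 + 1):
        if sieve[q]:
            for m in range(q * q, B1 + 1, q):
                sieve[m] = False
            qk = q                  # largest power of q not exceeding B1
            while qk * q <= B1:
                qk *= q
            E *= qk
    return E

E = stage1_exponent(1000)
print(E.bit_length())   # about 1400 bits: E is already a ~433-digit number
# Unreduced, 3**E would have about E * log10(3) decimal digits -- roughly
# 10**432 of them -- versus only ~10**80 particles in the known universe.
# With modular reduction, no intermediate ever exceeds N:
N = 2**1279 - 1         # M(1279), a Mersenne prime, as a small stand-in
r = pow(3, E, N)        # ~1400 modular squarings: fast and bounded
assert r < N
```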
[QUOTE=diep;261304]So 50 years of factorisation or so, what has it brought in terms of public algorithms? Especially the past 16 years.
*nothing* How come? [/QUOTE]You are very, very wrong. The last 50 years have produced a slew of sub-exponential factoring algorithms, including the quadratic sieve, the number field sieve and the elliptic curve method. They have produced a number of exponential but fast algorithms for finding small factors, including P-1, P+1, SQUFOF, and Pollard rho. All of these algorithms are public and all are in daily use. Much of the advance in the last few decades can be ascribed to three people: John Pollard, Hendrik Lenstra and Peter Montgomery. Paul |
[QUOTE=cheesehead;261318]Good question.
The problem is that in practice, for the size of numbers we're working with, we have to handle all the exponentiation mod N. [...] Since we're therefore dealing with residues mod N, the ones we compute for one value of N are useless when computing for a different value of N. [...] it's actually much faster to restart exponentiation afresh, with modular arithmetic, than it would be to "read" stored full non-modular products from wherever we managed to store them. (Consider that the size of "disk" you'd be reading from might need to be larger than the Milky Way galaxy.)[/QUOTE]

I had missed that the practical B1 on my machine is 660,000, and 660,000 primorial is of, well, nontrivial size. :blush:

In the P-1 operation, we do all the arithmetic modulo some N... I assume that N would be 2^P-1, with no other reasonable choices, right? (That is, some other choice, such as the product of the three Mersenne exponents we wanted to test in parallel, would more than triple the compute time, and a product of, say, 10, wouldn't allow testing of additional Mersenne exponents beyond the 10 listed.)

I'm thinking at this point that the only obvious optimization of P-1 on a GPU would be to run exponents in parallel, up to the size of available memory, thus maximising usage of the many compute cores in the GPU. |
[QUOTE=Christenson;261325]In the P-1 operation, we do all the arithmetic modulo some N...I assume that N would be 2^P-1, with no other reasonable choices, right? (That is, some other choice, such as the product of the three mersenne exponents we wanted to test in parallel, would more than triple the compute time, and a product of, say, 10, wouldn't allow testing of additional mersenne exponents beyond the 10 listed)[/quote]
Correct. [QUOTE=Christenson;261325]I'm thinking at this point that the only obvious optimization of P-1 on a GPU would be to run exponents in parallel, to the size of available memory, thus maximising usage of the many compute cores in the GPU.[/QUOTE] Stage 2 of a single exponent can also be parallelised. |
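To make "stage 2 can be parallelised" concrete, here is a toy sketch of the standard continuation: each worker takes a slice of the primes in (B1, B2], accumulates prod(H^q - 1) mod N independently, and the partial products are combined before a single gcd. The tiny N and bounds are chosen purely for illustration:

```python
from math import gcd

# Toy example: N = 2^29 - 1 = 233 * 1103 * 2089.
# 233 - 1 = 232 = 2^3 * 29, so stage 1 with B1 = 8 plus the
# stage-2 prime 29 should expose the factor 233.
N = 2**29 - 1
B1, B2 = 8, 30

# Stage 1: raise 3 to E = product of prime powers <= B1 (2^3 * 3 * 5 * 7 = 840).
H = pow(3, 840, N)
assert gcd(H - 1, N) == 1               # stage 1 alone finds nothing here

# Stage 2, parallelised: each "worker" handles a slice of the primes in
# (B1, B2] and accumulates its own running product of (H^q - 1) mod N.
chunks = [[11, 13, 17], [19, 23, 29]]   # two workers' shares
partials = []
for chunk in chunks:                    # each iteration could run on its own core
    acc = 1
    for q in chunk:
        acc = acc * (pow(H, q, N) - 1) % N
    partials.append(acc)

# Combine the partial products, then take one gcd at the end.
combined = 1
for a in partials:
    combined = combined * a % N
g = gcd(combined, N)
print(g % 233 == 0 and g > 1)           # True: 233 is captured by the q = 29 term
```

Real stage-2 implementations are far cleverer about the per-prime cost, but the independence of the per-prime terms is what makes the split across cores possible, as noted above.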
[QUOTE=diep;261278]TF is very cheap. It soon gets a lot less than 1.5% each additional bitlevel of course, otherwise we'd only be doing TF.
[/quote] If I understand the approimation alogorithm correctly on the Math page: at 65 bits the odds are abouut 1/65 or 1.53%; 1.43% at 70 bits. Most TF now is in that range so that is why I said "about 1.5%" in my post. [quote]So TF really has major advantages to do real deep.[/quote] Define "real deep". If you can do the 70 bit level in 1 minute, 80 bits is 17 hours; 85 bits is 22.66 days; 90 bits is about 2 years. [quote]Additional to that, once we get to the 100M digits level, it's like 1 year at a gpu now to run a LL or so of 100M digits size?[/quote] Yes, based on current hardware but hardware advances have done a pretty good job of keeping pace with the project. In the 8 or so years I have been involved leading edge LL tests have always taken 3-4 weeks on mainstream PC's. [quote]Compared to all those problems with LL, i'd guess TF is pretty efficient and fast to do real real deep.[/quote] Depends on your perspective. I think George does a good job determining the break-even point between TF and LL. Granted with the recent influx of GPUs it's now much harder to determine. [quote]So i guess now is the time to TF real real deep, because as soon as we get the 22 nm GPU's which might be able to do a LL within 1 or 2 months at the 332M+ bitlevel, then i guess most will soon abandon running TF and other factoring software.[/QUOTE] Again I suggest mainstream PCs will be doing these LL tests in a few weeks by the time the project gets there. Mind you, if the consistently linear rate of 3 1-million ranges per year continues we are looking at 93 more years. I know I won't be around to witness that. I doubt there would ever be a time that it made sense to abandon TF; there will always be a bit-level where TF is more effective than LL. That being said; if the current rate of TF continues to increase as more and more GPUs get involved I suspect all necessary TF up to 332M+ will be done in less than this same 93 years. |
[QUOTE=Christenson;261325]I had missed that the practical B1 on my machine is 660,000, and 660,000 Primorial is of, well, nontrivial, size.:blush:[/QUOTE]It's worse than that: for B1=660,000 P-1 actually computes a number whose exponent is much larger than 660,000 primorial.
The exponent to which it raises 3 is not just the primorial of B1, which is the product of all primes up to B1; it's the product of all [U]prime powers[/U] up to B1. So, for B1 = 660,000, the exponent product includes 390625 = 5^8, instead of just 5. |
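The 5^8 claim is easy to verify: for each prime q, stage 1 uses the largest power of q not exceeding B1. The helper name here is mine:

```python
def max_power_below(q, B1):
    """Largest power q**k with q**k <= B1."""
    qk = q
    while qk * q <= B1:
        qk *= q
    return qk

B1 = 660_000
print(max_power_below(5, B1))   # 390625 == 5**8, as stated; 5**9 would exceed B1
print(max_power_below(2, B1))   # 524288 == 2**19
print(max_power_below(3, B1))   # 531441 == 3**12
```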
[QUOTE=cheesehead;261367]It's worse than that: for B1=660,000 P-1 actually computes a number whose exponent is much larger than 660,000 primorial.
The exponent to which it raises 3 is not just the primorial of B1, which is the product of all primes up to B1; it's the product of all [U]prime powers[/U] up to B1. So, for B1 = 660,000, the exponent product includes 390625 = 5^8, instead of just 5.[/QUOTE] It wouldn't matter; we can directly represent the exponent, but not the result of the exponentiation, even if we built one of the devices discussed in another thread to represent the list of all the primes up to 600 digits. Remember that the environmental impact statement for those outweighs the earth! |
Was idly wondering if P-1 effort was keeping up with or falling behind the LL effort?
|
[QUOTE=Christenson;263589]Was idly wondering if P-1 effort was keeping up with or falling behind the LL effort?[/QUOTE]I'll venture that it will perpetually fall behind until P-1 gets the kind of computational breakthrough that GPUs have given TF over the last year.
|
The P-1 explanation page isn't terribly clear... is P-1 inherently sequential, like LL tests, or can it be parallelized, like TF tests? Should I start a thread on the P-1 algorithm as homework?
|
Stage 2 can be done in parallel, but it requires the save file from stage 1.
|
How big is the save file? Would it be feasible to do stage 1 on a system with limited memory, then copy the save file to another user's system with enough memory to do stage 2?
Re doing p-1 on a GPU, would a GPU have enough memory for stage 2? If not you would still gain from doing stage 1 on a GPU and stage 2 on a CPU. Chris K |
[QUOTE=chris2be8;263617]How big is the save file?[/QUOTE]
[URL="http://www.mersenneforum.org/showpost.php?p=210670&postcount=36"]Here's an example of a save from GMP-ECM.[/URL] Size varies. [QUOTE]Would it be feasible to do stage 1 on a system with limited memory, then copy the save file to another user's system with enough memory to do stage 2?[/QUOTE]Yes, that can be done. Prime95 saves to its own directory. I believe the documentation tells you the extensions of the different files. (Which files are P-1 etc.) [QUOTE]Re doing p-1 on a GPU, would a GPU have enough memory for stage 2? If not you would still gain from doing stage 1 on a GPU and stage 2 on a CPU.[/QUOTE]I'm not sure about the current status of P-1 on GPUs. |
Re: enough memory -- P95 requires (I think) 300MB before assigning P-1 work. A typical high-end GPU these days is going to have 768MB or a gig.
If stage 1 can't be sped up heavily by a GPU, but stage 2 can, then we will certainly want multiple systems feeding stage 1 save files to a single GPU. I have to go penetrate P95 source code anyway. I note that even without the GPU speedup, I'm producing factors by P-1 for significantly less effort than two CPU-based LL tests. |
[QUOTE=chris2be8;263617]How big is the save file?[/QUOTE]
5MB or more during stage 1. Double that once stage 1 is complete. [QUOTE]Would it be feasible to do stage 1 on a system with limited memory, then copy the save file to another user's system with enough memory to do stage 2?[/QUOTE] Very feasible; this is what I do:
* Set the nighttime memory on the stage 1 machine to the amount you will be using on the stage 2 machine. You can't do this via the menu system if that's more memory than the stage 1 machine has, but you can edit local.txt directly.
* Leave the daytime memory at 8MB.
* Put LowMemWhileRunning=prime95 (Windows) or LowMemWhileRunning=mprime (Linux) into your prime.txt file, to prevent the program from ever trying to use the high memory setting. |