![]() |
PARI Routine for Finding Cycles
I've been running a compiled version of the following routine on several machines in my quest for a 4-cycle sequence. I am running a compiled version, more due to the ease of calling it via a bash script, than because of any speed difference. There is only a slight difference. I have tried many methods of speeding this up further, to include handling the counter outside the main function and different types of loops, etc. As can be seen, I am skipping many numbers in my search, based on information from another member on what they searched. All of this is in a different thread. On my fastest machine, a 1M block (in 9*10^12) is covered in about 44 seconds. The routine does return 2-cycles and found the one known 4-cycle that was within its search pattern.
Finally, on to my questions: 1. Is there room for any appreciable speed improvement in the routine? 2. Is the sigma() function something I could use with a GPU? 3. If 2, then how would I go about (learning and) modifying it to make use of a GPU? [code] findcyc(a,e)= { if(a%2==1,a+=1); print("version 20140530b, running ", a, " through ", e); gettime(); forstep(b=a,e,2, if(b%4==0 || b%3==0, i=b; for(d=1,4,i=sigma(i)-i; if(i<b, break(), if(i==b, write("../../Desktop/cyclesFound", b, " ", d); break() ) ) ); if(b%1000000==0, print(b,"/",e," ",gettime()/1000+0.," s/M"); write("../../Desktop/cycleStatus", b," ",e) ) ) ) } [/code]Thanks for any/all assistance... |
Essentially all the work is done in sigma(), which spends essentially all of its time factoring the number. The best thing you could do would be to set up a sieve to factor the first instance of i more-or-less 'for free'. Next it would probably be worthwhile to write a customized factor function that could cut out early if the result would give a sigma value that's too small. This has value, since those cases would tend to be harder to factor than normal.
|
Normally I wouldn't suggest it, but since you brought up GPUs I thought I'd mention another option. Perl's ntheory module has a faster sigma than Pari for 64-bit numbers (it gets complicated for bigints). It took my computer 7.6s seconds per 1M block for this at 9 * 10^12, vs. Pari 2.8.0 at 19.2 seconds. That's only a 2x improvement, but maybe worth it if there isn't a good Pari workaround. Also note that on my computer, Pari 2.8.0 ran almost 2x faster than Pari 2.5.5, so make sure you're using a recent version.
[code] use ntheory qw/:all/; use Time::HiRes qw(gettimeofday tv_interval); sub findcyc { my($a, $e) = @_; $a++ if $a & 1; print "running $a through $e\n"; my $start = [gettimeofday]; for (my $b = $a; $b <= $e; $b += 2) { if ($b % 4 == 0 || $b % 3 == 0) { my $i = $b; for my $d (1 .. 4) { $i = divisor_sum($i) - $i; if ($i <= $b) { print "found $b $d\n" if $i == $b; last; } } } print " $b/$e ", tv_interval($start), "\n" unless $b % 1000000; } } [/code] |
[QUOTE=EdH;385628]I've been running a compiled version of the following routine on several machines in my quest for a 4-cycle sequence. I am running a compiled version, more due to the ease of calling it via a bash script, than because of any speed difference. There is only a slight difference. I have tried many methods of speeding this up further, to include handling the counter outside the main function and different types of loops, etc. As can be seen, I am skipping many numbers in my search, based on information from another member on what they searched. All of this is in a different thread. On my fastest machine, a 1M block (in 9*10^12) is covered in about 44 seconds. The routine does return 2-cycles and found the one known 4-cycle that was within its search pattern.
Finally, on to my questions: 1. Is there room for any appreciable speed improvement in the routine? 2. Is the sigma() function something I could use with a GPU? 3. If 2, then how would I go about (learning and) modifying it to make use of a GPU? [code] findcyc(a,e)= { if(a%2==1,a+=1); print("version 20140530b, running ", a, " through ", e); gettime(); forstep(b=a,e,2, if(b%4==0 || b%3==0, i=b; for(d=1,4,i=sigma(i)-i; if(i<b, break(), if(i==b, write("../../Desktop/cyclesFound", b, " ", d); break() ) ) ); if(b%1000000==0, print(b,"/",e," ",gettime()/1000+0.," s/M"); write("../../Desktop/cycleStatus", b," ",e) ) ) ) } [/code]Thanks for any/all assistance...[/QUOTE] I've thought of a few things that might help without a gpu, but I can't implement them efficiently if you know what a%12 is you could make a vector of steps to only hit those that are 0 mod 4 or 0 mod 3 with values mod 12 that are 0 mod 2 eliminating the need for the if statement, but you could keep it and try a vector of mod 12 values like [1,0,0,0,1,0,1,0,1,0,0,0][(b%12)+1] which eliminates all the 1 that aren't also 0 mod 2 ( learned this earlier in the thread). I was thinking if you added efficient checks to see if i can be the start of a 3-cycle but that may have too much overhead. Laurv/CRG might know more than me since all my gp are thread engine:single ( mostly because I use the Self-installing distributions for Windows I think) but in newer versions maybe parfor's or something could speed it up by checking multiple candidates at once. or using select/parselect maybe you could select all the ones that fit your conditions quicker. |
Thanks for all the help. I will definitely look into all of this as time permits. I did try other methods of choosing the candidates, even trying the selection outside the Pari routine, but none (so far) gained any speed. Many methods slowed the program down.
I had not considered adding a sieve on the front, but I did think about a customized function, but assumed I couldn't do better than what was available. I will have to study that some more. The Perl info is something I'll have to visit, as well. (I'll have to knock some rust off my Perl programming.) As to the version of Pari, I'll have to see, since every machine has its own install and I compiled the code on each individual computer to be sure any specifics were met, some may not be running the newest. Of course, the reason for bringing up GPUs is that I have two systems with old NVidia boards (48 cores, each) that are functional with CUDA. All my iterations so far are running as single threads, manually invoked across all available CPU cores per machine. If I could add in some of the NVidia cores, I should gain a little on those machines. But, I have a lot of study to get where I can use them and I'm growing really slow in all my computer "playings" due to other "life interests." Again, thanks everyone for all the help... |
Short Followup
Alas, all the several machines I checked (Fedora, Ubuntu, Debian) have Pari 2.5.x installed and nothing newer available via their repositories. Perhaps, rather than opt for individually, manually reinstalling Pari, I should look into the Perl option mentioned. I've also thought of trying one machine with a newer version of Pari and see if the compiled application will run across the others...
|
Last Followup Here (for now, at least)
Well, some preliminary results on one of my slower machines has proven danaj quite correct in the suggestion to take a look at Perl. Some basic tests have shown quite a difference in Perl's favor. My PARI (2.5.5) script takes 116 seconds to cover a 1M block (9*10^12 range), while the Perl (5.18.2) version completes the same range in 33 seconds.
I will leave the PARI thread, for now, in pursuit of more Perl testing. Thanks danaj! |
[QUOTE=EdH;385768]Well, some preliminary results on one of my slower machines has proven danaj quite correct in the suggestion to take a look at Perl. Some basic tests have shown quite a difference in Perl's favor. My PARI (2.5.5) script takes 116 seconds to cover a 1M block (9*10^12 range), while the Perl (5.18.2) version completes the same range in 33 seconds.
I will leave the PARI thread, for now, in pursuit of more Perl testing. Thanks danaj![/QUOTE] using my thoughts from thinking back to the asm book I have and CRG's post about my recent code outside of the writes ( which I think we could use listput and write only once at the end) I have broken your code into two codes that get back 0 ms for each individually so 2 instances of pari could in theory do this section of 1 million in 0 ms ( at least for version 2.7.2 but I realized that 2.5.3 was faster for some of my codes that I came up with: admittedly it also semi-messes up the printing [CODE]findcyc4(a,e) = { if(e%4!=0, e+=4-(e%4) ); print("version 20140530b, running ",a," through ",e); gettime(); forstep(b=e,a,-4, i=b; d=1; i=sigma(i)-i; if(cmp(b,i), break(), if(cmp(b,i)==0, break() ) ); d+=1; i=sigma(i)-i; if(cmp(b,i), break(), if(cmp(b,i)==0, break() ) ); d+=1; i=sigma(i)-i; if(cmp(b,i), break(), if(cmp(b,i)==0, break() ) ); d+=1; i=sigma(i)-i; if(cmp(b,i), break(), if(cmp(b,i)==0, break() ) ); if(b%1000000==0, print(b,"/",e," ",gettime()/1000+0.," s/M") ) ) }[/CODE] [CODE]findcyc6(a,e) = { if(e%12!=6, e+=6-(e%12) ); print("version 20140530b, running ",a," through ",e); gettime(); forstep(b=e,a,-12, i=b; d=1; i=sigma(i)-i; if(cmp(b,i), break(), if(cmp(b,i)==0, break() ) ); d+=1; i=sigma(i)-i; if(cmp(b,i), break(), if(cmp(b,i)==0, break() ) ); d+=1; i=sigma(i)-i; if(cmp(b,i), break(), if(cmp(b,i)==0, break() ) ); d+=1; i=sigma(i)-i; if(cmp(b,i), break(), if(cmp(b,i)==0, break() ) ); if(b%1000000==0, print(b,"/",e," ",gettime()/1000+0.," s/M") ) ) }[/CODE] maybe someone else can find some more improvements that can take the time down even more.edit: just realized cmp may not be a defined function in my oldest PARI ( which I should uninstall) maybe check that first. |
[QUOTE=science_man_88;385842]I have broken your code into two codes that get back 0 ms for each individually so 2 instances of pari could in theory do this section of 1 million in 0 ms[/QUOTE]
:sad: |
[QUOTE=axn;385843]:sad:[/QUOTE]
maybe I could just put the second forstep loop in the first code and check e again so maybe I can recombine them.edit: though my ultimate thought if It could ever easily be done (with a slight modification to my repeat script to write to file I think it could be ) was to unroll the forstep loops as we can predict how many times each would be done (e-a)/abs(<respective step>) and yes I forgot the my() for local variables.edit:[B][U] DOH[/U][/B] I realized why I got 0ms I did break() not next(). now I'm back up to just under 18 seconds. |
[QUOTE=science_man_88;385845]maybe I could just put the second forstep loop in the first code and check e again so maybe I can recombine them.edit: though my ultimate thought if It could ever easily be done (with a slight modification to my repeat script to write to file I think it could be ) was to unroll the forstep loops as we can predict how many times each would be done (e-a)/abs(<respective step>) and yes I forgot the my() for local variables.edit:[B][U] DOH[/U][/B] I realized why I got 0ms I did break() not next(). now I'm back up to just under 18 seconds.[/QUOTE]
I'm going to play with Perl a bit more as time permits. The basic conversion is running around 10 seconds per million on some of my machines. I am interested in what you come up with, though. I could always swap back later... |
| All times are UTC. The time now is 06:55. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.