TPSieve CUDA Testing Thread
You asked for it, and I've finally made it. Download TPSieve-CUDA [url=https://sites.google.com/site/kenscode/prime-programs/tpsieve-cuda.zip?attredirects=0]here[/url]. :smile:
I haven't done extensive testing on twin primes, so probably somebody should go over a short range (100G? Perhaps 1T?) with TPSieve-CUDA to make sure it gets the same factors. I hope it works well for everyone! |
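For anyone doing that comparison, here is a minimal sketch of one way to diff two runs. It assumes each run's output has been saved as text and that factor lines look like `p | k*2^n+1` or `p | k*2^n-1`, as in the output posted in this thread. The function names are made up for illustration and are not part of TPSieve.

```python
import re

# Matches factor lines of the form "p | k*2^n+1" or "p | k*2^n-1".
# Banner, status, and error lines do not match and are skipped.
FACTOR_RE = re.compile(r"^(\d+) \| (\d+)\*2\^(\d+)([+-])1$")

def parse_factors(text):
    """Return the set of (p, k, n, sign) tuples found in a saved run output."""
    factors = set()
    for line in text.splitlines():
        m = FACTOR_RE.match(line.strip())
        if m:
            p, k, n, sign = m.groups()
            factors.add((int(p), int(k), int(n), sign))
    return factors

def compare_runs(reference_text, candidate_text):
    """Return (missing, extra): factors only in the reference, and only in the candidate."""
    ref = parse_factors(reference_text)
    cand = parse_factors(candidate_text)
    return ref - cand, cand - ref
```

Both returned sets being empty means the two runs agree on every factor line they contain.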
Could you please give an example command to run? I see it needs cudart static linking; that file comes with the CUDA SDK. Was it compiled with CUDA toolkit 3.1?
|
OK. Supposing you downloaded the 480000-484999_30aug2010.txt sieve file, if you run:
./tpsieve-cuda-x86_64-linux -i 480000-484999_30aug2010.txt -p 710005180000000 -P 710005200000000
It should output:
710005185071411 | 5012115*2^481782+1
710005192340203 | 4018161*2^483419-1
very quickly. (I tested this on the emulator, so it runs really slow for me!)
Expand the range, and you should get more of [url=http://www.sendspace.com/file/4frdyp]Mdettweiler's results[/url].
./tpsieve-cuda-x86_64-linux -i 480000-484999_30aug2010.txt -p 710T -P 715T
would produce all of them, for instance.
Edit: Compiled with the 2.3 toolkit. One place to get the appropriate libcudart.so would be [url=http://www.primegrid.com/download/libcudart.so.2.32bit]here[/url] or [url=http://www.primegrid.com/download/libcudart.so.2.64bit]here[/url]. |
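A reported factor can also be checked independently of the sieve. The sketch below is only an illustration, not TPSieve code: `divides` tests whether p divides k*2^n+c using modular exponentiation, and `expand_bound` expands the `T` shorthand. Treating `T` as 10^12 is consistent with the bounds the program prints for `-p 710T -P 715T`; treating `G` as 10^9 is an assumption.

```python
# Decimal suffixes for sieve bounds. "T" = 10**12 is consistent with the
# program printing 710000000000000 <= p < 715000000000000 for -p 710T -P 715T.
# "G" = 10**9 is an assumption based on the "100G" shorthand used in this thread.
SUFFIXES = {"G": 10**9, "T": 10**12}

def expand_bound(text):
    """Turn a bound such as '710T' or '710005180000000' into an integer."""
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)

def divides(p, k, n, c):
    """Return True if p divides k*2^n + c, where c is +1 or -1."""
    # pow(2, n, p) is modular exponentiation, so 2^n is never built in full.
    return (k * pow(2, n, p) + c) % p == 0

if __name__ == "__main__":
    print(expand_bound("710T"))
    # The two factors from the short example range above.
    print(divides(710005185071411, 5012115, 481782, +1))
    print(divides(710005192340203, 4018161, 483419, -1))
```

If the two factors are genuine, both `divides` calls should print `True`.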
:no:
Where do I get that fancy file? I don't need that libcudart.so file, I'm on Windows. |
Ah, I got it working.
Found that fancy file from [URL="http://mersenneforum.org/showthread.php?t=12260"]this thread[/URL]
Here's the output:
[code]
tpsieve-cuda>tpsieve-cuda-x86-windows.exe -i 480000-484999_30aug2010.txt -p 710T -P 715T
tpsieve version cuda-0.1.5b (testing)
Found K's from 3 to 9999999.
Found N's from 480000 to 484999.
nstart=480000, nstep=27, gpu_nstep=27
Read 18013513 terms from NewPGen format input file `480000-484999_30aug2010.txt'
ppsieve initialized: 3 <= k <= 9999999, 480000 <= n <= 484999
Sieve started: 710000000000000 <= p < 715000000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 285
Detected compute capability: 1.3
Detected 30 multiprocessors.
710001064441429 | 1473435*2^480477+1
710001781836203 | 3090555*2^482969+1
710002017069043 | 1947711*2^484889-1
710002639870109 | 7153191*2^483771+1
710003699276149 | 5489211*2^481645-1
710004831474721 | 9156609*2^482469+1
710005185071411 | 5012115*2^481782+1
710005192340203 | 4018161*2^483419-1
710005390472317 | 3240861*2^484861+1
710005916032213 | 5469669*2^482131+1
710006212449883 | 9438471*2^480253+1
710006478541837 | 942801*2^484681-1
p=710007273971713, 121.2M p/sec, 0.34 CPU cores, 0.1% done. ETA 05 Sep 23:11
710007380861971 | 3067731*2^482247-1
710007392845019 | 7483995*2^483443-1
710007480582299 | 1724049*2^480073-1
710008202353481 | 5813421*2^481371-1
710008811001043 | 9322383*2^480292-1
710008912579171 | 6024705*2^482149-1
710009562402587 | 5037609*2^482129-1
710010162887723 | 6614673*2^481762+1
710010987465557 | 6749691*2^483663+1
710011016356171 | 1349535*2^480408-1
710011368918931 | 7281273*2^482722-1
710011521417881 | 8617299*2^483945+1
710013019046899 | 3562503*2^481238-1
710013536554247 | 2683773*2^482840-1
p=710013762297857, 108.1M p/sec, 0.46 CPU cores, 0.3% done. ETA 05 Sep 23:50
710013880633081 | 4357815*2^480333+1
710013961546411 | 6488649*2^484015+1
710014319798129 | 1676877*2^480670-1
710014611723727 | 3195591*2^483289+1
710015165703751 | 1844445*2^483863+1
710016591664817 | 2155857*2^482100+1
710017445315627 | 9930375*2^480732-1
710017473222427 | 8642289*2^480555+1
710018153777579 | 5008965*2^484938-1
710018465445529 | 9185721*2^480167+1
p=710019807338497, 100.7M p/sec, 0.50 CPU cores, 0.4% done. ETA 06 Sep 00:21
710020260919457 | 1584663*2^483746-1
[/code]
Now, will compiling x64 win binaries cause trouble? I've had an "out of memory" error, even though I had about 1 GB out of 4 GB free.
P.S. It's not using the GPU completely. Peak GPU usage is reported as 40%. But I guess you already know that? |
I tried on a GTX465 on 64-bit linux using a range I had already tested so that I could compare the results. However I didn't get very far before getting an error.
[QUOTE]./tpsieve-cuda-x86_64-linux -i 480000-484999_19jun2010.txt -p 510T -P 515T
tpsieve version cuda-0.1.5b (testing)
Compiled Sep 4 2010 with GCC 4.3.3
Found K's from 3 to 9999999.
Found N's from 480000 to 484999.
nstart=480000, nstep=26, gpu_nstep=26
Read 18977477 terms from NewPGen format input file `480000-484999_19jun2010.txt'
ppsieve initialized: 3 <= k <= 9999999, 480000 <= n <= 484999
Sieve started: 510000000000000 <= p < 515000000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 465
Detected compute capability: 2.0
Detected 11 multiprocessors.
510000064759291 | 604839*2^481707-1
510000994356869 | 2198475*2^482446+1
510001808585051 | 6049827*2^482948+1
510001965458981 | 9867039*2^480087-1
510002179900517 | 3334131*2^481253+1
510002930897567 | 8814495*2^481041+1
510003018137897 | 7665489*2^480401-1
510003129240001 | 4959981*2^480291+1
510003356427241 | 2391561*2^483615-1
510003644411923 | 7580307*2^484486-1
510003728553343 | 8313309*2^482255-1
510003886955161 | 3607413*2^482256-1
510004210312339 | 5073345*2^483515-1
Cuda error: cudaStreamCreate: out of memory
[/QUOTE] |
[B]amphoria[/B], that's exactly the same error I had on x86 Windows.
Here it pops again:
[code]
tpsieve-cuda-x86-windows.exe -i 480000-484999_30aug2010.txt -p 900T -P 901T
tpsieve version cuda-0.1.5b (testing)
Found K's from 3 to 9999999.
Found N's from 480000 to 484999.
nstart=480000, nstep=27, gpu_nstep=27
Read 18013513 terms from NewPGen format input file `480000-484999_30aug2010.txt'
ppsieve initialized: 3 <= k <= 9999999, 480000 <= n <= 484999
Sieve started: 900000000000000 <= p < 901000000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 285
Detected compute capability: 1.3
Detected 30 multiprocessors.
900000899028509 | 3182751*2^483513-1
900001860603749 | 9998469*2^481563+1
900001934059139 | 1853133*2^482022-1
900002540407273 | 8064075*2^482811+1
900002726446853 | 5749455*2^480565-1
900003355059173 | 3695019*2^484373-1
900003556591063 | 9754467*2^480376-1
900003917464219 | 7522179*2^481393-1
900004723972547 | 5306133*2^484306+1
900005287423111 | 6879159*2^482887-1
900007745466833 | 451935*2^481504+1
900009608245457 | 5425383*2^480786+1
p=900010489954305, 87.42M p/sec, 0.50 CPU cores, 1.0% done. ETA 05 Sep 15:17
900010638873601 | 378417*2^481830-1
900011291258897 | 6507645*2^482813+1
900011626245037 | 1340685*2^481238+1
900012104179271 | 645705*2^484085-1
900016125631741 | 2968161*2^483961+1
900016501038581 | 8124711*2^484951+1
900016817068751 | 75363*2^484216+1
900017662186813 | 525711*2^480789+1
900018252281867 | 6892521*2^484727-1
900020059012663 | 8598285*2^481068+1
900021150322181 | 4615461*2^482939+1
900021336561331 | 8746389*2^484435+1
900021361527311 | 3408945*2^482966-1
p=900021998075905, 95.90M p/sec, 0.54 CPU cores, 2.2% done. ETA 05 Sep 15:08
900022619382521 | 6958245*2^482800-1
900022833913493 | 6580995*2^483721+1
900022917366103 | 4560555*2^482723-1
900023322448907 | 1472211*2^480431-1
900024288211007 | 4808679*2^480371+1
900026935242913 | 3056079*2^482407-1
900028117404131 | 5600343*2^481214-1
900029413721059 | 815793*2^483818-1
900029829750299 | 4354917*2^483802-1
900030812047093 | 5639913*2^483592+1
p=900033017561089, 91.82M p/sec, 0.53 CPU cores, 3.3% done. ETA 05 Sep 15:08
900033789053611 | 7495449*2^481653-1
900034220560883 | 9094419*2^483205-1
900034570657763 | 9890505*2^480175-1
900035606160989 | 5867385*2^481157-1
900037077781057 | 8390829*2^481741+1
900037229605601 | 1863285*2^484553-1
900038990324497 | 3815157*2^482054+1
900040739108881 | 3513243*2^482350+1
900041542191221 | 6049533*2^482774-1
900042730035877 | 9304977*2^481916+1
900043309201403 | 136581*2^482397+1
p=900044056969217, 91.99M p/sec, 0.55 CPU cores, 4.4% done. ETA 05 Sep 15:08
900044321638183 | 8388129*2^484645-1
900044489973593 | 8240649*2^483659+1
900044550938063 | 7226823*2^484696-1
900045358508729 | 7763775*2^483076-1
900047216136989 | 2338305*2^482753-1
900047780897267 | 4008369*2^483695+1
900048470300299 | 2963115*2^481453-1
900048762025013 | 383355*2^480270-1
900049228276043 | 8622855*2^483971-1
900049467999349 | 660627*2^481816-1
900049796295679 | 2937537*2^483980-1
900052042582919 | 385575*2^484714+1
900052572323899 | 7711221*2^484603+1
900053267475361 | 7173609*2^483949+1
900053714040401 | 633879*2^480079+1
900053996550817 | 6894867*2^480856-1
p=900055633248257, 96.46M p/sec, 0.54 CPU cores, 5.6% done. ETA 05 Sep 15:06
900055866972487 | 6849789*2^483481-1
900056741014807 | 2245995*2^482732-1
900056770768759 | 814365*2^482000-1
900057523274303 | 642045*2^480196+1
900057941699027 | 3370071*2^480999+1
900058480102739 | 9883737*2^484374+1
900060511991023 | 8680035*2^484611-1
900060730024969 | 7366341*2^482195+1
900060738679177 | 1099155*2^483395-1
900063136569923 | 6597225*2^483763-1
900063669798383 | 5873829*2^481137-1
900064551341591 | 9219153*2^483872+1
900064734779653 | 7558803*2^483916-1
900065290605601 | 7338225*2^482126-1
900065587257671 | 7356405*2^481242-1
900065728724587 | 8091525*2^484942-1
p=900067529342977, 99.13M p/sec, 0.51 CPU cores, 6.8% done. ETA 05 Sep 15:04
900067553916287 | 8588259*2^483407+1
900068224207921 | 5907333*2^480414-1
900068309721587 | 4858185*2^483053+1
900069742623089 | 7249299*2^483067-1
900071614223911 | 974289*2^484133-1
900072154118867 | 3615069*2^480585-1
900072931824211 | 6749313*2^480668-1
900073013900513 | 1479111*2^482079-1
900073241850151 | 3667035*2^484867-1
900075811775299 | 5091681*2^482559+1
900076383783517 | 6995187*2^481406-1
p=900079152807937, 96.86M p/sec, 0.52 CPU cores, 7.9% done. ETA 05 Sep 15:03
900079180930459 | 8088465*2^482743+1
900080177837117 | 9137745*2^481706+1
900081068828399 | 116547*2^480962+1
900082664855509 | 4331577*2^481696-1
900084606014311 | 7923375*2^480228+1
900084897625079 | 7498953*2^482154+1
900085127281819 | 3059145*2^480229-1
900086877470243 | 2313279*2^483107+1
900087308304337 | 5166585*2^482543-1
p=900090529857537, 94.80M p/sec, 0.52 CPU cores, 9.1% done. ETA 05 Sep 15:03
900090831293629 | 9965295*2^481322-1
900091902233021 | 9990753*2^481446-1
900095462990077 | 9537003*2^480320+1
900096420949717 | 8525847*2^481988-1
900096832377143 | 1048245*2^483598+1
900096929852677 | 2153943*2^481358-1
900098509267721 | 7751367*2^481256+1
900099157340237 | 9244893*2^480360-1
900099669905143 | 9687819*2^484633+1
900101450465951 | 1940013*2^484300+1
p=900101851332609, 94.34M p/sec, 0.52 CPU cores, 10.2% done. ETA 05 Sep 15:03
900102028546621 | 2525739*2^482461+1
900102230642357 | 9699093*2^482344+1
900102319841591 | 8400777*2^481706-1
900102426091157 | 3881955*2^483157-1
900102488675867 | 337989*2^481711+1
900102580103633 | 9216783*2^482100+1
900102741563621 | 2272611*2^480277+1
900103553433571 | 7722345*2^483866-1
900104117029049 | 505821*2^480631-1
900105270926371 | 8850651*2^483739-1
900105302568581 | 7921695*2^482577+1
900106241542903 | 5146383*2^482750-1
900107926468921 | 5710305*2^481576-1
900110050114909 | 8376111*2^480199+1
900110665560263 | 7689909*2^483029-1
p=900112516399105, 88.77M p/sec, 0.52 CPU cores, 11.3% done. ETA 05 Sep 15:04
900113522958017 | 4037511*2^483349+1
900113818670537 | 2881989*2^483381+1
900114293440121 | 9168045*2^484941-1
900114895651987 | 1452225*2^484402+1
900116209588091 | 9696105*2^483814-1
900119156145683 | 1042815*2^481830+1
900120506278387 | 7095909*2^480787+1
900120924004501 | 8100075*2^480374-1
900121363886917 | 7647465*2^481616-1
900122553451477 | 5630019*2^482963+1
900122578082093 | 4954053*2^480218-1
900122941358243 | 6183075*2^482076+1
p=900123264303105, 89.57M p/sec, 0.55 CPU cores, 12.3% done. ETA 05 Sep 15:05
900124366202023 | 5674563*2^484186-1
900126758233421 | 757953*2^481946+1
900127178299367 | 2843865*2^484173-1
900128009292293 | 5166717*2^484434-1
900128886707417 | 4666179*2^484269+1
900129038645509 | 4965093*2^481026-1
900129732119477 | 8266305*2^480961-1
900131300857937 | 9063171*2^481141-1
900131738429219 | 7244415*2^480666+1
900131757667309 | 9087471*2^481013-1
900132376847051 | 8236677*2^480080+1
900133053419231 | 4099683*2^480928-1
p=900134169493505, 90.88M p/sec, 0.57 CPU cores, 13.4% done. ETA 05 Sep 15:05
900136707397699 | 4107933*2^480032+1
900138484885813 | 9160035*2^484905-1
900139177590013 | 8492325*2^481494+1
900141735781361 | 2119167*2^480722+1
900141821615489 | 5879169*2^480907+1
900143923881347 | 8023179*2^481613+1
900144031193809 | 1315365*2^482686+1
p=900144518938625, 86.23M p/sec, 0.60 CPU cores, 14.5% done. ETA 05 Sep 15:06
900146872538153 | 6797847*2^483152+1
900146934657221 | 31053*2^482810-1
900148138109243 | 6732345*2^484116-1
900149419577411 | 7471065*2^483931+1
900150206766011 | 4495635*2^480957-1
900152425932013 | 2111517*2^480778+1
900152581520117 | 8415135*2^481699-1
900153500000561 | 4769493*2^483890-1
900153813027347 | 4079283*2^481164+1
p=900154149060609, 80.25M p/sec, 0.58 CPU cores, 15.4% done. ETA 05 Sep 15:08
900155149794317 | 9979257*2^481892+1
900155755904123 | 2521533*2^483994+1
900158445390817 | 3743577*2^480768-1
900158816255227 | 6612759*2^481673-1
900159262356971 | 7420557*2^482234-1
900159600424079 | 7586655*2^481304+1
900159986271683 | 943605*2^481215+1
Cuda error: cudaStreamCreate: out of memory

tpsieve-cuda>pause
Press any key to continue . . .
[/code] |
I get the feeling I have a severe memory leak on the GPU that I didn't know I had. Someone helped me with the stream synchronization code, and it worked, but I'm starting to suspect that [B]each[/B] event and stream that is created also has to be destroyed. I'll fix it in the next release.
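To illustrate the suspected failure mode, here is a toy model in Python. It is not the TPSieve code and not a CUDA binding: `ToyDevice` and its handle budget are invented for the sketch. In the real CUDA runtime, the corresponding pairs are `cudaStreamCreate`/`cudaStreamDestroy` and `cudaEventCreate`/`cudaEventDestroy`.

```python
class ToyDevice:
    """Toy stand-in for a GPU with a fixed budget of stream handles (hypothetical)."""

    def __init__(self, max_streams):
        self.max_streams = max_streams
        self.live = 0  # number of streams created but not yet destroyed

    def stream_create(self):
        if self.live >= self.max_streams:
            raise MemoryError("stream_create: out of memory")
        self.live += 1
        return object()  # opaque handle

    def stream_destroy(self, stream):
        self.live -= 1


def run_leaky(dev, iterations):
    """Create a stream per iteration and never destroy it: the budget runs out."""
    for _ in range(iterations):
        dev.stream_create()


def run_paired(dev, iterations):
    """Pair every create with a destroy, so the number of live handles stays bounded."""
    for _ in range(iterations):
        stream = dev.stream_create()
        try:
            pass  # the kernel launch and synchronization would go here
        finally:
            dev.stream_destroy(stream)
```

With a budget of 4 handles, `run_leaky` raises `MemoryError` on its fifth iteration, while `run_paired` can run indefinitely. That would match a sieve that works for a while and then dies in stream creation.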
|
v0.1.6, of both PPSieve and TPSieve, is released. Many changes and fixes are included.
- Faster on the GPU than 0.1.5b (though about the same as 0.1.5c)
- Uses less CPU
- A huge memory leak on the GPU should be fixed.
- Input files are more often read correctly.
- Many other bugfixes and tweaks.
Get it at the usual URL, in the first post.
Edit: P.S. I've forgotten to post [url=http://github.com/Ken-g6/PSieve-CUDA]the source location[/url]! |
[QUOTE=Ken_g6;228689]v0.1.6, of both PPSieve and TPSieve, is released. Many changes and fixes are included.[/QUOTE]
I have completed sieving 510-515T and the factors match those I previously found. I got 138M p/sec on a GTX465 using 0.41 CPU on a single core of a Core i7@3.6GHz. As the single core was not maxed out I decided to try running 2 instances on a single core (the other 3 cores were running instances of LLR). With 2 instances I got a combined throughput of 210M p/sec with 0.68 CPU used. This would suggest that the GTX465 wasn't maxed out either with a single instance. |
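As a rough back-of-the-envelope check on those numbers, the sketch below converts the two rates into wall-clock time for the 510T to 515T range. It assumes the reported p/sec is the width of the p range covered per second and that the rate stays constant over the whole range; neither assumption is guaranteed.

```python
def hours_to_sieve(range_width, rate_p_per_sec):
    """Hours needed to cover range_width at a constant rate in p per second."""
    return range_width / rate_p_per_sec / 3600.0

width = 515 * 10**12 - 510 * 10**12   # the 510T to 515T range
single = 138e6                        # one instance, p/sec
double = 210e6                        # two instances combined, p/sec

speedup = double / single
print(round(speedup, 2))                        # 1.52
print(round(hours_to_sieve(width, single), 1))  # 10.1
print(round(hours_to_sieve(width, double), 1))  # 6.6
```

So under these assumptions the second instance buys roughly a 1.5x speedup, at the cost of 0.68 instead of 0.41 of a CPU core.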
[QUOTE=amphoria;228861]I have completed sieving 510-515T and the factors match those I previously found.[/quote]Good! :smile:
[QUOTE=amphoria;228861]With 2 instances I got a combined throughput of 210M p/sec with 0.68 CPU used. This would suggest that the GTX465 wasn't maxed out either with a single instance.[/QUOTE] Interesting! Try fiddling with the -m option (probably going up from [i]8[/i] in increments of 1), and see if you can make a single instance do any better. |
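When comparing runs at different `-m` values, averaging the rate over a whole run is less noisy than reading a single status line. This is a minimal sketch that assumes the output has been saved to text and that status lines look like the ones posted earlier in this thread (`p=..., 121.2M p/sec, 0.34 CPU cores, 0.1% done.`). The helper name is hypothetical.

```python
import re

# Status line format as posted in this thread, e.g.
# "p=710007273971713, 121.2M p/sec, 0.34 CPU cores, 0.1% done. ETA 05 Sep 23:11"
STATUS_RE = re.compile(
    r"p=(\d+), ([\d.]+)M p/sec, ([\d.]+) CPU cores, ([\d.]+)% done"
)

def mean_rate(log_text):
    """Average the 'M p/sec' figures over all status lines; None if there are none."""
    rates = [float(m.group(2)) for m in STATUS_RE.finditer(log_text)]
    if not rates:
        return None
    return sum(rates) / len(rates)
```

Save the output of one run per `-m` setting, call `mean_rate` on each, and keep the setting with the highest average.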