mersenneforum.org  

Go Back   mersenneforum.org > Prime Search Projects > Twin Prime Search

Reply
 
Thread Tools
Old 2009-08-20, 02:45   #276
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13·89 Posts
Default

tpsieve doesn't currently modify the bitmap after the sieve file has been loaded, so multiple factors for the same candidate will sometimes be found. This should become less of a problem as sieving depth increases.

It is a bit simpler to do it this way because the bitmap can't be updated atomically. If it was modifiable then it could only be accessed by one thread at a time, but making it read-only allows it to be accessed by many threads at once. I haven't tested how much difference it makes to performance however, maybe not much.

If the bitmap is replaced by another data structure like a hash table (to save memory) then the benefit of keeping it read-only might be greater.

Edit: Actually now that I think about it maybe the bitmap can be updated atomically by using assembly instead of C to do the update. But for consistency with a hashtable or other methods it might be better to keep it read-only anyway.

Last fiddled with by geoff on 2009-08-20 at 02:51
geoff is offline   Reply With Quote
Old 2009-08-20, 03:17   #277
axn
 
axn's Avatar
 
Jun 2003

125308 Posts
Default

Quote:
Originally Posted by geoff View Post
Actually now that I think about it maybe the bitmap can be updated atomically by using assembly instead of C to do the update. But for consistency with a hashtable or other methods it might be better to keep it read-only anyway.
In theory, btr should do the bit testing and clearing as an atomic operation, shouldn't it?

Can't even the hash table entry be removed if one finds an entry (just clear the LSB?)
axn is offline   Reply With Quote
Old 2009-08-20, 04:27   #278
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5·79 Posts
Default

Quote:
Originally Posted by geoff View Post
It is a bit simpler to do it this way because the bitmap can't be updated atomically.
Good point, I wasn't thinking about the threads.

It would probably be easier to write something else, maybe even in C, to clear extra factors out of the results file once it's finished, if you wanted to do that.
Ken_g6 is offline   Reply With Quote
Old 2009-08-25, 23:26   #279
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

6138 Posts
Default

Well, I have bad news...for 32-bit SSE2 users. I have a new version of tpsieve out that makes 64-bit more than twice as fast as the SSE2 code!

I'd like to thank Biwema for getting the idea stuck in my head (back on post 222) of finding an N to check based on even values of K. What took the longest was convincing myself that this approach is correct:

First, if K is even, P-K (a.k.a. -K mod P) is odd. And vice versa.
Second, it turns out Intel has a lowest-1-bit-finding instruction, "bsf". Since this has been around since the 386 or so, 32-bit non-SSE2 users get a boost too.
Third, if I divide K by 2, that's equivalent to adding P to -K and dividing by 2. So if P>2*kmax, (P+(-K))/2 >= P/2, which is >kmax.
So, my new algorithm is to begin by finding the odd one of K and -K. I check that for divisibility. Then I go to the even one (P-that previous value), shift it right until it's odd, jumping forward as many N's as there were 0-bits, and repeat at the divisibility check. The number of 0-bits averages to a little less than 2, since there's at least 1, a 50% chance of at least 2, a 25% chance of at least 3, etc.
The last problem was that tpsieve reports whether it factored K*2^N+1 or K*2^N-1. I had to add an extra bit for each K, that I flip each loop, to record which I'm working on. It might be a little faster still without that!

So, 64-bit is close to twice as fast as it was, and 32-bit non-SSE2 is more than twice as fast as it was, but I can't find a good way to implement the new algorithm for SSE2. In particular, SSE2 doesn't have a "bsf" equivalent, and doesn't allow for shifting two values by different amounts at the same time. I believe the shifting is coming with SSE5. I haven't seen a TZCNT instruction in the pipeline, though apparently a POPCNT instruction can help build an equivalent from four instructions. So for now the SSE2 code is unchanged. (Though I quietly upgraded the single-N SSE2 code earlier, this has little effect on the current TPS sieve.)

Oh, and the new algorithm causes a small (5%) performance hit for 64-bit single-N sieves, so I left the old version out there for anyone doing a private N.
Ken_g6 is offline   Reply With Quote
Old 2009-08-26, 01:05   #280
axn
 
axn's Avatar
 
Jun 2003

10101010110002 Posts
Default

Quote:
Originally Posted by Ken_g6 View Post
So, my new algorithm is to begin by finding the odd one of K and -K. I check that for divisibility. Then I go to the even one (P-that previous value), shift it right until it's odd, jumping forward as many N's as there were 0-bits, and repeat at the divisibility check. The number of 0-bits averages to a little less than 2, since there's at least 1, a 50% chance of at least 2, a 25% chance of at least 3, etc.
The way I understand it, you might miss factors. For every even bit, you need to check that p-even (which is odd) is divisible or not, don't you?
axn is offline   Reply With Quote
Old 2009-08-26, 03:19   #281
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5·79 Posts
Default

Quote:
Originally Posted by axn View Post
The way I understand it, you might miss factors. For every even bit, you need to check that p-even (which is odd) is divisible or not, don't you?
This was the part I had the hardest time convincing myself of. But...

Quote:
Originally Posted by Ken_g6 View Post
Third, if I divide K by 2, that's equivalent to adding P to -K and dividing by 2. So if P>2*kmax, (P+(-K))/2 >= P/2, which is >kmax.
So, yes, "p-even" is divisible by something, but that something is >= P/2. If P/2 > kmax, there's no problem.

And, come to think of it, I should change the initialization code to verify the requirement that P > kmax*2. But I think my logic is solid, and my testing turned up no errors.
Ken_g6 is offline   Reply With Quote
Old 2009-08-27, 03:06   #282
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5×79 Posts
Default

One more update: I managed to improve the performance of the 32-bit non-SSE2 multi-N section slightly past the performance of the SSE2 version.* So now SSE2 and non-SSE2 are almost identical in speed, the difference being the initial N calculation. This makes the 32-bit code almost half as fast as the 64-bit code. Somehow it seems like that's as it should be.

* Pentium 4's may be an exception. If anyone has one, would you mind comparing my new SSE2 version with my old SSE2 version?
Ken_g6 is offline   Reply With Quote
Old 2009-08-27, 08:10   #283
henryzz
Just call me Henry
 
henryzz's Avatar
 
"David"
Sep 2007
Liverpool (GMT/BST)

3·23·89 Posts
Default

Quote:
Originally Posted by Ken_g6 View Post
One more update: I managed to improve the performance of the 32-bit non-SSE2 multi-N section slightly past the performance of the SSE2 version.* So now SSE2 and non-SSE2 are almost identical in speed, the difference being the initial N calculation. This makes the 32-bit code almost half as fast as the 64-bit code. Somehow it seems like that's as it should be.

* Pentium 4's may be an exception. If anyone has one, would you mind comparing my new SSE2 version with my old SSE2 version?
maybe we need optional code paths and/or an automatic test to find which is best
henryzz is offline   Reply With Quote
Old 2009-08-28, 04:02   #284
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5·79 Posts
Default

Quote:
Originally Posted by henryzz View Post
maybe we need optional code paths and/or an automatic test to find which is best
Maybe so, but I don't know how bad the performance is on P4's, since I don't have one. (And I never want one! They tried to improve the performance of "common" instructions, but ignored some important less common ones. But I .)

Anyway, I have one more more update. Your post made me realize that SSE2-style MMX instructions wouldn't be much worse than the regular 64-bit math I was using, and might be faster on the P4. It took full inline assembly to get it working; the intrinsics decided to write everything to memory and swap it out through registers! And the performance was slightly slower on my Core 2. But with full inline assembly I was able to use all the MMX registers and software pipeline it like I do with the 64-bit code. So the "SSE2" code, using MMX registers, is now about 40% faster than it was.

P.S. Before someone asks, no, the SSE2 changes won't work on a Pentium MMX, since Intel didn't add a 64-bit MMX add/subtract until they added SSE2.
Ken_g6 is offline   Reply With Quote
Old 2009-08-28, 20:51   #285
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5×79 Posts
Default

A little update: I don't have a P4, but I do have an Athlon64. So I tried my new code on that, and the SSE2 code is slightly slower, even after a tiny code rearrangement; but only by 2-3%. I was surprised to learn that AMD made some of these instructions slow, too. On the other hand, my latest code probably isn't bad on older P4's. So it looks like the only processors that might not like my new code are Athlon64's in 32-bit mode, and P4 Prescotts.

I'm not going to make major changes just for Prescotts, or for just a 2-3% improvement (which might go away if I tried to add optional code paths anyway) on Athlon64s. Plus, the old code is still available for anyone with those processors.
Ken_g6 is offline   Reply With Quote
Old 2009-08-29, 08:19   #286
henryzz
Just call me Henry
 
henryzz's Avatar
 
"David"
Sep 2007
Liverpool (GMT/BST)

3·23·89 Posts
Default

Quote:
Originally Posted by Ken_g6 View Post
A little update: I don't have a P4, but I do have an Athlon64. So I tried my new code on that, and the SSE2 code is slightly slower, even after a tiny code rearrangement; but only by 2-3%. I was surprised to learn that AMD made some of these instructions slow, too. On the other hand, my latest code probably isn't bad on older P4's. So it looks like the only processors that might not like my new code are Athlon64's in 32-bit mode, and P4 Prescotts.

I'm not going to make major changes just for Prescotts, or for just a 2-3% improvement (which might go away if I tried to add optional code paths anyway) on Athlon64s. Plus, the old code is still available for anyone with those processors.
The slow SSE2 on amd cpus up to Athlon 64 and i think to some degree phenom I massively slows down GIMPS as well. When i changed from and athlon 64 to a core 2 my llr(similar to ll) iteration times halved with the same clock rate. Which meant i had a brilliant 8x speed improvement as i had 4x the number of cores.
henryzz is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
S9 and general sieving discussion Lennart Conjectures 'R Us 31 2014-09-14 15:14
Sieving discussion thread philmoore Five or Bust - The Dual Sierpinski Problem 66 2010-02-10 14:34
Combined sieving discussion ltd Prime Sierpinski Project 76 2008-07-25 11:44
Sieving Discussion ltd Prime Sierpinski Project 26 2005-11-01 07:45
Sieving Discussion R.D. Silverman Factoring 7 2005-09-30 12:57

All times are UTC. The time now is 13:33.


Fri Jul 7 13:33:13 UTC 2023 up 323 days, 11:01, 0 users, load averages: 1.18, 1.21, 1.20

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔