mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Old 2012-06-03, 21:24   #342
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

37×263 Posts

Quote:
Originally Posted by Dubslow View Post
You are right, I have absolutely no idea where to look, and I've gone over each comma of parse.c twice, and once through CUDALucas.cu.
Oh, come on Dubslow...

You have reproducible crashes in code you have the source code to.

This is easy.

Just wait until someone reports a "once a month" bug (read: not easily reproducible) in a piece of software you're responsible for....
Old 2012-06-04, 00:22   #343
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

16065₈ Posts

Quote:
Originally Posted by chalsall View Post
You have reproducible crashes in code you have the source code to.

This is easy.
I can't reproduce the crash. Whatever it does for me, it doesn't crash. This code is a modification of other code already in use; I looked at the diff between them (in the same post you quoted), and that didn't reveal anything. The debugger was rather useless, though I'll run it through with -O0 and see if anything appears there.

Last fiddled with by Dubslow on 2012-06-04 at 00:24
Old 2012-11-07, 00:46   #344
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts
Optimization

Okay, being now at least a somewhat proficient programmer in C/Python, I'm interested in learning how to write fast code, à la YAFU/Msieve/whatever Ernst does/Prime95, etc. (I realize that you guys have years and years of experience, but I've got to start somewhere.)

To that end, I wrote a sieve of Ἐρατοσθένης in Python, to avoid messy complications like memory allocation/control etc., with the goal of finding algorithmic optimizations. I have these two different versions: one runs 1-2% faster than the other, but with some extra code (and 1-2% is hardly an improvement).
Code:
def primes(depth=10000000):
     if depth <= 1: return []
     primes = [2]

     if depth % 2 == 0: depth -= 1 # make depth odd
     length = depth // 2
     list = [True for i in range(length)] # 3, 5, 7... might be prime

     i, n = 0, 3 # i=index, n=num=2*index+3
     j = i + (n>>1)*n # the index of n^2
     while j < length: # while n^2 < depth
          if list[i]: # n is prime
               primes.append(n)
               while j < length:
                    list[j] = False
                    j += n
          i, n = i+1, n+2
          j = i + (n>>1)*n
     # all necessary sieving is done, read the rest of the list
     while i < length:
          if list[i]:
               primes.append(n)
          i, n = i+1, n+2
     
     return primes

def slow(depth=10000000):
     if depth <= 1: return []
     primes = [2]

     if depth % 2 == 0: depth -= 1 # make depth odd
     length = depth // 2
     list = [True for i in range(length)] # 3, 5, 7... might be prime

     i, n = 0, 3 # i=index, n=num=2*index+3
     while i < length:
          #print("checking", n)
          if list[i]: # n is prime
               #print(n, "is prime")
               primes.append(n)
          j = i + (n>>1)*n # sieve out the odd multiples of n:
                                # n*n, (n+2)*n, (n+4)*n...
               while j < length:
                    list[j] = False
                    j += n
          i, n = i+1, n+2 # check the next odd

     return primes
Initially I just started with a list of everything, because I wasn't sure how to run through a list of just odd numbers; once I figured that out, I had a list with length = depth/2, as given above. At first I had it eliminate the (odd) multiples of every prime in the list, but after watching the animation here I realized (with a significant feeling of stupidity) that I only need to sieve with the primes < sqrt(depth); then, after reading the article further, I arrived at roughly the final versions above, where sieving starts at p^2 for a given prime.

It can sieve the primes <= 10^6 in around 144 ms on one core of my 2600K with nothing else running (which I'm sure is at least one, probably two or more orders of magnitude slower than YAFU). (The slower one takes around 154 ms.) I also ran a profiler:
Code:
Timer unit: 1e-06 s

File: sieve.py
Function: primes at line 3
Total time: 88.8523 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def primes(depth=10000000):
     5         1            4      4.0      0.0       if depth <= 1: return []
     6         1            3      3.0      0.0       primes = [2]
     7                                           
     8         1            5      5.0      0.0       if depth % 2 == 0: depth -= 1 # make depth odd
     9         1            2      2.0      0.0       length = depth // 2
    10         1      1174339 1174339.0      1.3       list = [True for i in range(length)] # 3, 5, 7... might be prime
    11                                           
    12         1            4      4.0      0.0       i, n = 0, 3 # i=index, n=num=2*index+3
    13         1            5      5.0      0.0       j = i + (n>>1)*n
    14      1581         3204      2.0      0.0       while j < length: # while n^2 < depth
    15      1580         3399      2.2      0.0            if list[i]: # n is prime
    16       445         1790      4.0      0.0                 primes.append(n)
    17   8925589     17890912      2.0     20.1                 while j < length:
    18   8925144     18544686      2.1     20.9                      list[j] = False
    19   8925144     18512476      2.1     20.8                      j += n
    20      1580         3710      2.3      0.0            i, n = i+1, n+2
    21      1580         4128      2.6      0.0            j = i + (n>>1)*n
    22                                                # all necessary sieving is done, read the rest of the list
    23   4998420     10202160      2.0     11.5       while i < length:
    24   4998419     10139585      2.0     11.4            if list[i]:
    25    664133      1509461      2.3      1.7                 primes.append(n)
    26   4998419     10862405      2.2     12.2            i, n = i+1, n+2
    27                                                
    28         1            2      2.0      0.0       return primes

File: sieve.py
Function: slow at line 30
Total time: 87.4097 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    30                                           @profile
    31                                           def slow(depth=10000000):
    32         1            4      4.0      0.0       if depth <= 1: return []
    33         1            2      2.0      0.0       primes = [2]
    34                                           
    35         1            4      4.0      0.0       if depth % 2 == 0: depth -= 1 # make depth odd
    36         1            2      2.0      0.0       length = depth // 2
    37         1      1168725 1168725.0      1.3       list = [True for i in range(length)] # 3, 5, 7... might be prime
    38                                           
    39         1            4      4.0      0.0       i, n = 0, 3 # i=index, n=num=2*index+3
    40   5000000      9554915      1.9     10.9       while i < length:
    41                                                     #print("checking", n)
    42   4999999      9654528      1.9     11.0            if list[i]: # n is prime
    43                                                          #print(n, "is prime")
    44    664578      1445235      2.2      1.7                 primes.append(n)
    45    664578      1503640      2.3      1.7                 j = i + (n>>1)*n # sieve out the odd multiples of n:
    46                                                                           # n*n, (n+2)*n, (n+4)*n...
    47   9589722     18329467      1.9     21.0                 while j < length:
    48   8925144     17693395      2.0     20.2                      list[j] = False
    49   8925144     17703299      2.0     20.3                      j += n
    50   4999999     10356447      2.1     11.8            i, n = i+1, n+2 # check the next odd
    51                                           
    52         1            2      2.0      0.0       return primes
(Note that the profiler was run with primes(10^7), not 10^6.) The biggest surprise: it takes > 1 second to initialize a list of length 5e6. I guess that's what I get for using Python. (It can't be that slow with e.g. memset(list, 1) in C?) Other than that, around 62% of the time is spent sieving, and around 35% just converting the sieved list into a list of primes. One optimization that occurs to me (for the faster version only, after the sieving) is to increment only i and calculate n directly for the hits, since there are so many fewer primes than composites.
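For concreteness, that read-out idea might look like this (a sketch; `read_out` is a made-up name):

```python
# Sketch of the read-out optimization described above: walk the flag list
# once and compute n = 2*i + 3 only for the surviving entries, instead of
# updating n in lockstep with i on every iteration.
def read_out(flags):
    result = [2]
    result.extend(2 * i + 3 for i, is_prime in enumerate(flags) if is_prime)
    return result

# flags for 3, 5, 7, 9, 11, 13 after sieving to 14:
# read_out([True, True, True, False, True, True]) -> [2, 3, 5, 7, 11, 13]
```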

Besides rewriting in C (presumably I'd use a bit array for the sieving list?), are there other ways to speed this up, or is this as far as I can get in an interpreted scripting language?
Old 2012-11-07, 01:40   #345
bsquared
 
 
"Ben"
Feb 2007

DB9₁₆ Posts

Quote:
Originally Posted by Dubslow View Post
Besides rewriting in C (presumably I'd use a bit array for the sieving list?), are there other ways to speed this up, or is this as far as I can get in an interpreted scripting language?
Hooboy, are there ways! Bit packing (just because you used booleans doesn't mean the internal storage is 1 bit), wheels, and segmentation are the big ones. Segmentation won't be as much of an improvement until you try large ranges, at least if you've already done bit packing and wheels. Pre-sieving is also a big one for small limits like 10^6. Then there is the standard stuff like loop unrolling and pre-calculation.

If googling doesn't get you far enough, let me know if you have more specific questions about any of these.
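Roughly, bit packing looks like this -- a quick Python sketch of the idea, mirroring the index scheme in Dubslow's code (`bit_packed_sieve` is an illustrative name; in CPython the bit twiddling may well eat the memory savings):

```python
# One bit per odd number >= 3: index i holds the flag for n = 2*i + 3,
# packed eight flags to a byte, so a 10^7 sieve needs ~625 KB of flags.
def bit_packed_sieve(depth):
    if depth < 3:
        return [2] if depth == 2 else []
    length = (depth - 1) // 2                 # flags for 3, 5, 7, ...
    bits = bytearray(b'\xff' * ((length + 7) // 8))
    i = 0
    while True:
        n = 2 * i + 3
        j = i + (n >> 1) * n                  # index of n^2, as before
        if j >= length:
            break
        if bits[i >> 3] & (1 << (i & 7)):     # n is still marked prime
            while j < length:
                bits[j >> 3] &= ~(1 << (j & 7))
                j += n
        i += 1
    return [2] + [2 * k + 3 for k in range(length)
                  if bits[k >> 3] & (1 << (k & 7))]

# bit_packed_sieve(30) -> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```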
Old 2012-11-07, 03:57   #346
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

1110000110101₂ Posts

Quote:
Originally Posted by bsquared View Post
Hooboy, are there ways!
Oh jeez, this'll take a while
Quote:
Originally Posted by bsquared View Post
Bit packing (just because you used boolean doesn't mean the internal storage is 1 bit)
Well duh. That's probably why the list initialization took so long. (And by bit array I think I meant bit packing.)
Quote:
Originally Posted by bsquared View Post
wheels, and segmentation are the big ones. Segmentation won't be as much of an improvement until you try large ranges, at least if you've already done bit-packing and wheels.
Hmm... I will certainly Google like you say, but my first thought is "Are they truly algorithmic, or just a different manner of doing something like bit packing?"
Quote:
Originally Posted by bsquared View Post
Pre-sieving is also a big one for small limits like 10^6.
Pre-sieving? Hmm... Google google google...
Quote:
Originally Posted by bsquared View Post
Then there is the standard stuff like loop unrolling and pre-calculations.
I've heard of and have a rough idea of loop unrolling, but that would create some quite nasty code if I were to do it manually, wouldn't it?
Quote:
Originally Posted by bsquared View Post
If googling doesn't get you far enough, let me know if you have more specific questions about any of these.
Certainly. I'm in Winbloze right now for my monthly game fix, but perhaps later tonight or tomorrow I'll be able to tackle the Google.
Old 2012-11-07, 04:09   #347
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

7×1,373 Posts

Quote:
Originally Posted by Dubslow View Post
but my first thought is "Are they truly algorithmic, or just a different manner of doing something like bit packing?"
Say "or just a cleverer manner". Assume you want to store the numbers from 1 to 30; you could use a vector with 30 components for it. You can do this "cleverer" by storing only the odd numbers, so you would only need 15 bytes. Bit packing is using only four bytes to store the numbers, one bit each, or just 2 bytes if you only store the odds, instead of 30 bytes. "Wheeling" them needs only ONE byte, instead of 4 or 2, so you would need just a thousand bytes to store all the primes from 0 to 30000. This comes from observing that any block of 30 consecutive numbers has only 8 possible prime positions, namely 1, 7, 11, 13, 17, 19, 23, 29 (mod 30), so one byte covers 30 numbers. Of course, addressing them (for clearing the bits when you sieve) becomes a bit more complicated, and you need to find a compromise. Some guys came up with really beautiful solutions for this.
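In code, that mod-30 bookkeeping comes down to something like this (a sketch; the names are made up):

```python
# Eight residues mod 30 are coprime to 30, so one byte covers a block of
# 30 numbers: bit b of byte k stands for the number 30*k + RESIDUES[b].
RESIDUES = (1, 7, 11, 13, 17, 19, 23, 29)
BIT_OF = {r: b for b, r in enumerate(RESIDUES)}

def wheel_position(n):
    """(byte index, bit index) for n, or None if 2, 3 or 5 divides n."""
    k, r = divmod(n, 30)
    if r not in BIT_OF:
        return None                 # n is a multiple of 2, 3 or 5
    return k, BIT_OF[r]

# wheel_position(29) -> (0, 7);  wheel_position(31) -> (1, 0)
# wheel_position(15) -> None (the wheel never stores multiples of 2, 3, 5)
```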
Old 2012-11-07, 04:23   #348
bsquared
 
 
"Ben"
Feb 2007

3×1,171 Posts

I'd say that none of these are truly algorithmic improvements. It is still the sieve of Eratosthenes, after all, no matter how you represent the list of things you are striking off.

That said, wheel sieves can lower the asymptotic complexity a bit (search for Pritchard's Wheel Sieve), and the segmented sieve and bit packing will drastically lower the space complexity. If you count asymptotic-complexity improvements as algorithmic improvements, then yes, there are some to be had.
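As a sketch, segmentation amounts to reusing a small list of base primes against one cache-sized window at a time (illustrative code, not YAFU's):

```python
from math import isqrt

# Count primes < limit by sieving fixed-size windows [lo, hi) with the
# base primes up to sqrt(limit); memory stays O(block) instead of O(limit).
def segmented_count(limit, block=32768):
    top = isqrt(limit) + 1
    small = [True] * top
    base = []
    for p in range(2, top):
        if small[p]:
            base.append(p)
            for m in range(p * p, top, p):
                small[m] = False
    count = 0
    for lo in range(2, limit, block):
        hi = min(lo + block, limit)
        seg = [True] * (hi - lo)
        for p in base:
            # first multiple of p in [lo, hi) that is >= p*p
            start = max(p * p, (lo + p - 1) // p * p)
            for m in range(start, hi, p):
                seg[m - lo] = False
        count += sum(seg)
    return count

# segmented_count(100) -> 25
```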

Manual loop unrolling might not be pretty, but some things you just gotta do yourself. I got significant (10%+) gains that way in both my CPU and GPU versions of the sieve. At a rough count YAFU's sieve is > 2000 lines; at that point you stop caring about aesthetics :)
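In C, manually unrolling the inner crossing-off loop looks something like this sketch (purely illustrative, not YAFU's code):

```c
/* Cross off multiples of p in a byte-per-flag array, starting at index j,
   four at a time; the second loop handles the tail. Unrolling trades code
   size for less loop overhead per cleared flag. */
static void cross_off(unsigned char *flags, long length, long p, long j)
{
    while (j + 3 * p < length) {
        flags[j]         = 0;
        flags[j + p]     = 0;
        flags[j + 2 * p] = 0;
        flags[j + 3 * p] = 0;
        j += 4 * p;
    }
    while (j < length) {   /* remainder: fewer than four multiples left */
        flags[j] = 0;
        j += p;
    }
}
```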
Old 2012-11-07, 05:14   #349
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts

Quote:
Originally Posted by LaurV View Post
Say "or just a cleverer manner". Assume you want to store the numbers from 1 to 30; you could use a vector with 30 components for it. You can do this "cleverer" by storing only the odd numbers, so you would only need 15 bytes. Bit packing is using only four bytes to store the numbers, one bit each, or just 2 bytes if you only store the odds, instead of 30 bytes. "Wheeling" them needs only ONE byte, instead of 4 or 2, so you would need just a thousand bytes to store all the primes from 0 to 30000. This comes from observing that any block of 30 consecutive numbers has only 8 possible prime positions, namely 1, 7, 11, 13, 17, 19, 23, 29 (mod 30), so one byte covers 30 numbers. Of course, addressing them (for clearing the bits when you sieve) becomes a bit more complicated, and you need to find a compromise. Some guys came up with really beautiful solutions for this.
I know how a bit array works, but thanks for the tips on wheeling. Like I said though, it'll wait a while.
Quote:
Originally Posted by bsquared View Post
I'd say that none of these are truly algorithmic improvements. It is still the sieve of Eratosthenes, after all, no matter how you represent the list of things you are striking off.
Something like starting the sieving from p^2 instead of p is what I might call an algorithmic improvement, but storing the data differently isn't.
Quote:
Originally Posted by bsquared View Post
That said, wheel sieves can lower the asymptotic complexity a bit (search for Pritchard’s Wheel Sieve), and the segmented sieve and bit packing will drastically lower the space complexity. If you count asymptotic complexity improvements as algorithmic improvements then yes we have some of the latter.
I will Googley Doogley.

Quote:
Originally Posted by bsquared View Post
At rough count YAFU's sieve is > 2000 lines; at that point you stop caring about aesthetics :)
Old 2012-11-08, 05:27   #350
only_human
 
 
"Gang aft agley"
Sep 2002

EAA₁₆ Posts

Quote:
Originally Posted by Dubslow View Post
Something like start sieving from p^2 instead of p is what I might call an algorithmic improvement, but storing the data differently isn't.
Loosely speaking, data storage often is a big deal. I know you know that, and there is no need to rephrase your words to state exactly what you meant in this specific case. Here is one that I found interesting not long ago: Adding faulted data to produce accurate results, faster
Old 2013-01-18, 20:56   #351
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3·29·83 Posts

Well, it seems I'm doing things backwards compared to Mr. Sam Kennedy.

I had decided to learn some x86 assembly, when lo and behold one of my friends in ECE is taking a Computer Systems Engineering course in which x86 assembly is being taught -- so I'm sitting in on that and learning.

I have some questions though. They seem to be using the GNU assembler, which, as far as I can tell, uses some pretty weird conventions (besides the AT&T syntax), like '#' for comments, while ';' has the same meaning as in C instead of marking a comment (not to mention that C-style /* */ comments are allowed too, and the .string syntax, which I haven't seen elsewhere). So, in the interest of learning practical things, how do you guys (meaning I guess B^2, Jasonp, Prime95, only_human, EWMayer et al.) deal with the different syntaxes and formats in your programs? I looked at the Prime95 assembly once and noticed that it uses ';' comments, yet I was able to compile it with gcc just fine -- how? I looked at YAFU and saw the defines by BGladman, while in the one SIQS file I looked at, it seemed that only GNU syntax was used ("ASM_G").

Can anybody help shed some light on the complicated situation? Thanks


(PS @B^2 I tried shifting the SoE to using a bytes array, but being in Python, it was actually slower. See the best answer to this question.)
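(The usual trick in those fast pure-Python answers is slice assignment, which pushes the inner loop down into C. A sketch grafted onto the version above; `primes_sliced` is an illustrative name:)

```python
# Same odd-only sieve as before, but each prime's multiples are cleared
# with one slice assignment instead of a per-multiple Python loop.
def primes_sliced(depth):
    if depth <= 1:
        return []
    if depth % 2 == 0:
        depth -= 1                          # make depth odd
    length = depth // 2                     # flags for 3, 5, 7, ...
    flags = [True] * length
    i, n = 0, 3
    j = i + (n >> 1) * n                    # index of n^2
    while j < length:
        if flags[i]:
            flags[j::n] = [False] * len(range(j, length, n))
        i, n = i + 1, n + 2
        j = i + (n >> 1) * n
    return [2] + [2 * k + 3 for k in range(length) if flags[k]]

# primes_sliced(30) -> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```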
Old 2013-01-18, 21:41   #352
bsquared
 
 
"Ben"
Feb 2007

3·1,171 Posts

YAFU is (and by extension, I am) perhaps strange in that only inline assembler is used. Typically the assembler blocks are smallish, so it just seemed easier that way (especially within Visual Studio).

For any given block of inline assembly, I typically write it first in gcc syntax and then port it to the other one. Actually, there is usually a third port as well, using Visual Studio intrinsic functions. Compile-time #defines control which version of any given inline assembly block gets used.

The file you reference is the only exception to that rule; the bloody thing was just too complex to contemplate porting to two other representations (non-supported compilers see a generic C implementation in that case).

Last fiddled with by bsquared on 2013-01-18 at 21:44

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.