mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve v1.49 feedback (https://www.mersenneforum.org/showthread.php?t=15678)

pinhodecarlos 2011-11-26 13:10

I start msieve as a batch file. Right now I started another RSALS task so I can only test what you say later at night.
I suspect another issue, which may be related to the .gz relations source, but I am still waiting for Lionel's reply.

Carlos

jasonp 2011-11-26 18:23

You definitely cannot mix zipped and unzipped relation files into the same msieve.dat, but that should not have been an issue if you also performed the filtering. Likewise, there is some memory allocation at the point where you crashed but if the allocation failed then there would be a message to that effect.

Batalov 2011-12-13 22:40

I am kinda curious. In long threaded runs, the very first child process is a bit lazier than the others:
[CODE]Mem: 16077M total, 15568M used, 509M free, 10M buffers
Swap: 8187M total, 2218M used, 5969M free, 1909M cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P nFLT nDRT COMMAND
3975 serge 20 0 15.4g 13g 480 R 96 86.1 5742:29 5 8743 0 msieveS
3987 serge 20 0 15.4g 13g 480 R [COLOR=darkred]91[/COLOR] 86.1 [COLOR=darkred]5226:11[/COLOR] 3 131 0 msieveS
3988 serge 20 0 15.4g 13g 480 R 94 86.1 5611:49 1 232 0 msieveS
3989 serge 20 0 15.4g 13g 480 R 95 86.1 5706:39 2 124 0 msieveS
3992 serge 20 0 15.4g 13g 480 R 92 86.1 5680:39 4 36 0 msieveS
3997 serge 20 0 15.4g 13g 480 R 92 86.1 5671:16 0 7 0 msieveS
[/CODE]
It probably has less work than the others; maybe it can be given a 10% larger slice and the efficiency will increase (even if only by a few percent)?
J., let me know if this makes sense; I can implement it and report the results.

Batalov 2011-12-14 00:12

P.S. I think that even though the child's PID is the "1st", it actually must be the one that gets the last chunk. I've patched the bit that creates a slight asymmetry and committed (SVN 693).

jasonp 2011-12-14 01:40

Does that patch actually help? Even in your case with lots of threads, each thread gets millions of columns, shifting one back or one forward shouldn't noticeably affect the time.

The thread with the lowest pid is actually the master, and its lower CPU utilization may just reflect it sleeping a little bit to wait for each thread in turn to finish, so that another partial vector can be XOR-ed into the final answer.
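A toy sketch (not msieve's actual code) of the master/worker pattern described above: each worker reduces its share of the data to a partial vector, and the master blocks on each result in turn and XORs it into the final answer. The queue-based handoff and the integers-as-bit-vectors simplification are assumptions for illustration only.

```python
import threading
import queue

def worker(tid, chunk, results):
    # Each worker reduces its chunk to a partial "vector" (an int, XOR-ed bitwise).
    partial = 0
    for col in chunk:
        partial ^= col
    results.put((tid, partial))

def xor_reduce(columns, num_threads=4):
    results = queue.Queue()
    # Deal the columns round-robin among the workers.
    chunks = [columns[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(i, chunks[i], results))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    answer = 0
    for _ in threads:
        # The master sleeps here until some worker finishes, which is why
        # its measured CPU utilization can trail the workers'.
        _, partial = results.get()
        answer ^= partial
    for t in threads:
        t.join()
    return answer
```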

Batalov 2011-12-14 02:15

I meant the 1st child (its PID is the second lowest). The master spends the most time of all the threads.

The patch seems to help on small cases. I want to restart the big matrix (it's the 16M[SUP]2[/SUP] one for 7,326+) when I get home - I want to do it from the master shell (not from ssh). I'll report in a day, once the threads have accumulated enough runtime, on whether the ETA changes.

[COLOR=green]P.S. The 694th patch is just beautification. Look at the 693 vs 639 diff to see the real change: it gets rid of a fudge factor of 1000 which, with many threads, adds up to shortchange the last worker.[/COLOR]

jasonp 2011-12-14 12:47

Ok, I didn't notice the commit before. Another possibility for why it might help is that the density of the matrix varies from one end to the other; the code sorts the columns in order of increasing weight, then alternates the lightest and heaviest columns. The end result is that threads assigned columns near the left edge of the matrix get fewer columns, and threads assigned to the right-hand columns get more of them. It's possible that the bias in the original code, meant to avoid a thread with almost no columns, caused some threads to gobble up too many columns.
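The column ordering described above can be sketched like this (a simplified illustration, not msieve's implementation): sort column indices by weight, then interleave from the light and heavy ends so that heavy and light columns are spread across the reordered matrix.

```python
def alternate_by_weight(weights):
    # Sort column indices by increasing weight.
    order = sorted(range(len(weights)), key=lambda c: weights[c])
    out = []
    lo, hi = 0, len(order) - 1
    # Alternate: lightest remaining, then heaviest remaining.
    while lo <= hi:
        out.append(order[lo]); lo += 1
        if lo <= hi:
            out.append(order[hi]); hi -= 1
    return out
```

With weights [5, 1, 4, 2, 3] this yields the column order [1, 0, 3, 2, 4], i.e. weights [1, 5, 2, 4, 3], so contiguous per-thread slices see roughly even total weight.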

It might be a better idea to do a [num_threads]-way scatter of the columns, like the MPI code does with the rows; but that means you have to have the number of threads in mind when you build the matrix, which means an efficiency loss when the number of threads changes. Perhaps a [max_threads]-way scatter would be best.

Batalov 2011-12-14 19:56

Now that the chunks are [I]fairly[/I] divided between threads (in terms of sum(weights)), [...drum roll...] the spread of running times actually increased. :-(

I am thinking about trying a spread with "[URL="http://en.wikipedia.org/wiki/Gamma_correction"]gamma correction[/URL]" that attempts to mimic and counteract the shape of running time vs. slot number (the alternation makes it almost linear, but not entirely; maybe a gamma of ~0.9 will take care of that - a cheap trick compared to your suggestions). The original code, with its overshoot of 1000, starts off above the diagonal line, similar to a gamma slightly less than 1.
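The gamma-corrected split floated above could be sketched as follows (my reading of the idea, not committed code): instead of cutting the column range into equal slices, place thread boundary i at N * (i/T)^gamma, so a gamma below 1 bends the cut points right and enlarges the first slices.

```python
def gamma_boundaries(num_cols, num_threads, gamma=0.9):
    # Boundary i sits at N * (i/T)^gamma; gamma == 1.0 gives equal slices,
    # gamma < 1.0 gives the earliest threads slightly larger slices.
    return [round(num_cols * (i / num_threads) ** gamma)
            for i in range(num_threads + 1)]
```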

Batalov 2011-12-15 00:43

No, wait, I haven't actually tested anything as it turns out. :rolleyes:
(SVN skipped the patch on the matmul0.c file, as it had other edits; so I was actually "testing" yet another build of unmodified code.)

Jeff Gilchrist 2011-12-15 21:43

[QUOTE=Batalov;282261]No, wait, I haven't actually tested anything as it turns out. :rolleyes:
(SVN skipped the patch on the matmul0.c file, as it had other edits; so I was actually "testing" yet another build of unmodified code.)[/QUOTE]

So doing nothing increased your running time? I guess you might want to look at a different way to benchmark this if re-running the same code has that much variation.

Jeff.

Batalov 2011-12-15 22:31

It always has significant variation: this is LA. I am not re-running exactly the same iterations (and even if I did, I would only have overtrained the 'model' to an effect that may be characteristic only of a specific stage of a specific LA run); instead, I am tweaking and then continuing the LA as it goes. There are still 16 days to go.

Anyway, I've run half a day with the ad hoc gamma correction (of 0.97) and half a day with simply giving the first child an 8% larger slice, and the latter seems to even out the per-thread run times better than all the other attempts (the spread is smaller than it was). The ultimate goal is to have them all run for the same time (on average); then the time five threads spend waiting for the sixth to finish would be minimized. The same goes for one thread systematically working less than the others.
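The "first child gets an 8% larger slice" fix described above might look like this sketch (the bonus factor and rounding policy are illustrative assumptions, not the committed code):

```python
def biased_slices(num_cols, num_threads, first_bonus=1.08):
    # Weight the first child's slice up by ~8%; the rest share evenly.
    weights = [first_bonus] + [1.0] * (num_threads - 1)
    total = sum(weights)
    sizes = [round(num_cols * w / total) for w in weights]
    sizes[-1] = num_cols - sum(sizes[:-1])  # absorb rounding in the last slice
    return sizes
```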


All times are UTC. The time now is 04:52.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.