mersenneforum.org - CADO-NFS - CADO NFS
(https://www.mersenneforum.org/showthread.php?t=11948)

thome 2009-06-03 22:17

[quote=fivemack;175756]I issue the command

[code]
nfsslave2@cow:/scratch/fib1039/with-cado$ /home/nfsslave2/cado/cado-nfs-20090603-r2189/build/cow/linalg/bwc/u128_bench -t --impl bucket snfs.small
[/code]
and it produces a lot of output at the 'large' level
[/quote]

normal behaviour (admittedly way too verbose).

[quote]before failing with

[code]
Lsl 56 cols 3634827..3699734 w=778884, avg dj=7.2, max dj=34365, bucket hit=1/1834.7-> too sparse
Switching to huge slices. Lsl 56 to be redone
Flushing 56 large slices
Hsl 0 cols 3634827..5582056 (30*64908) ..............................
w=16383453, avg dj=0.3, max dj=29376, bucket block hit=1/10.2
u128_bench: /home/nfsslave2/cado/cado-nfs-20090603-r2189/linalg/bwc/matmul-bucket.cpp:610: void split_huge_slice_in_vblocks(builder*, huge_slice_t*, huge_slice_raw_t*, unsigned int): Assertion `(n+np)*2 == (size_t) (spc - sp0)' failed.
Aborted
[/code][/quote]

If you could put your failing snfs.small file somewhere where I can grab it, it would be great.

[quote]The enormous filtering run got terminated by something that kills SSH sessions that have produced no output for ages, will try that again.[/quote]

You mean the cado filtering programs got killed prematurely? That would have a tendency to truncate the input to the bwc executables, but I doubt this is the cause, since the balancing program would have choked first.

Thanks for your patient investigations...

E.

fivemack 2009-06-03 22:55

[quote]
[quote]
The enormous filtering run got terminated by something that kills SSH sessions that have produced no output for ages, will try that again.
[/quote]
You mean the cado filtering programs got killed prematurely? That would have a tendency to truncate the input to the bwc executables, but I doubt this is the cause, since the balancing program would have choked first.
[/quote]

I wasn't using the script, just running

[code]
~/cado/cado-nfs-20090603-r2189/build/cow/merge/purge -poly snfs.poly -nrels "$( zcat snfs.nodup.gz | wc -l)" -out snfs.purged snfs.nodup.gz > purge.aus 2> purge.err &
[/code]

on a file with half a billion relations without using nohup, and the ssh connection from which I'd started it died.

I'm rerunning it, but the second pass is using 25G of vsize and the machine is swapping terribly, so I'm not expecting much progress.
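
For what it's worth, the mechanism that killed the first run is just SIGHUP: when the ssh session disappears the shell typically sends SIGHUP to its jobs, and a process that neither handles nor ignores that signal is terminated. nohup does little more than start the job with SIGHUP ignored; here is a minimal sketch of that idea (a toy stand-in, not CADO code):

[code]
/* toy long-running job that survives losing its terminal -- roughly what
 * "nohup <job> &" arranges before starting <job> */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    signal(SIGHUP, SIG_IGN);   /* ignore the hangup delivered when ssh dies */
    for (;;) {                 /* stand-in for the long filtering step */
        puts("still working");
        sleep(60);
    }
}
[/code]

(Running the real purge command under nohup, or inside screen, gets the same effect without touching any code.)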

fivemack 2009-06-03 23:00

[QUOTE=thome;175796]If you could put your failing snfs.small file somewhere where I can grab it, it would be great[/QUOTE]

Anonymous FTP to fivemack.dyndns.org and collect snfs.small.bz2 (710MB) and snfs.poly. My upload is quite slow so it may take a little while, and I don't know if my FTP server supports resuming partial transfers; if it gets frustrating, tell me and I'll stick the file somewhere more accessible.

fivemack 2009-06-03 23:14

Tiny command-line bug for transpose tool
 
'balance' doesn't appear to have a --transpose command-line option:

[code]
nfsslave2@cow:/scratch/fib1039/with-cado$ /home/nfsslave2/cado/cado-nfs-20090528-r2167/build/cow/linalg/balance --transpose --in snfs.small --out cabbage --nslices 1x4 --ramlimit 1G
Unknown option: snfs.small
Usage: ./bw-balance <options>
Typical options:
--in <path> input matrix filename
--out <path> output matrix filename
--nslices <n1>[x<n2>] optimize for <n1>x<n2> strips
--square pad matrix with zeroes to obtain square size
More advanced:
--remove-input remove the input file as soon as possible
--ram-limit <nnn>[kmgKMG] fix maximum memory usage
--keep-temps keep all temporary files
--subdir <d> chdir to <d> beforehand (mkdir if not found)
--legacy produce only one jumbo matrix file
[/code]

so I ran

[code]
/home/nfsslave2/cado/cado-nfs-20090528-r2167/build/cow/linalg/transpose --in snfs.small --out snfs.small.T
[/code]

I killed the job after two hours; it was stuck in the argument-parsing loop! Using only one minus sign before 'in' and 'out' made it work, though it's now too late to start the balancing and bench jobs this evening. More later.
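
The real transpose code is surely different, but a hedged guess at the failure mode: an option loop that only advances its index when it recognises an option will spin forever on the first spelling it does not know, e.g. '--in' where only '-in' is handled. A minimal reproduction (hypothetical code, not taken from CADO-NFS):

[code]
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    const char *in = NULL, *out = NULL;
    int argi = 1;
    while (argi < argc) {
        if (strcmp(argv[argi], "-in") == 0 && argi + 1 < argc) {
            in = argv[argi + 1]; argi += 2;
        } else if (strcmp(argv[argi], "-out") == 0 && argi + 1 < argc) {
            out = argv[argi + 1]; argi += 2;
        }
        /* BUG: no else branch -- an unrecognised option such as "--in"
         * leaves argi unchanged, so the loop never terminates */
    }
    printf("in=%s out=%s\n", in ? in : "(none)", out ? out : "(none)");
    return 0;
}
[/code]

Run with '-in snfs.small -out snfs.small.T' it finishes; run with '--in ... --out ...' it spins exactly as described above.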

fivemack 2009-06-03 23:57

What shape to use for decomposition?
 
I think 100 seconds is too short for statistically significant comparisons on matrices this big, but (with four threads for each shape):

- 1x4 decomposition: 19 iterations in 104s, 5.47/1, 21.07 ns/coeff
- 2x2 decomposition: 19 iterations in 102s, 5.34/1, 20.59 ns/coeff
- 4x1 decomposition: 20 iterations in 104s, 5.18/1, 19.95 ns/coeff

20ns/coeff still feels a bit too long.
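
My reading of those figures (an assumption on my part, not something taken from the CADO documentation) is that the middle number is wall time per iteration and ns/coeff is that time divided by the number of nonzero coefficients, which would put this matrix at roughly 2.6e8 coefficients:

[code]
/* back-of-envelope check of the 1x4 line above; interpreting "5.47/1" as
 * seconds per iteration is my assumption */
#include <stdio.h>

int main(void)
{
    double secs = 104.0, iters = 19.0, ns_per_coeff = 21.07;
    double per_iter = secs / iters;                  /* ~5.47 s, matches "5.47/1" */
    double coeffs = per_iter / (ns_per_coeff * 1e-9);
    printf("%.2f s/iteration -> about %.2e coefficients\n", per_iter, coeffs);
    return 0;
}
[/code]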


To my limited surprise, explicitly transposing the matrix and running

[code]
/home/nfsslave2/cado/cado-nfs-20090528-r2167/build/cow/linalg/bwc/u128_bench snfs.small.T -impl bucket
[/code]

gave exactly the same error as having u128_bench do the transposition.


However, if I run balance on the transposed matrix, then u128_bench works with a 4x1 decomposition; 2x2 fails with the same error message as before.

- 1x4, 2x2 decomposition: fails Assertion `(n+np)*2 == (size_t) (spc - sp0)'
- 4x1 decomposition: 21 iterations in 105s, 4.99/1, 19.22 ns/coeff


If I give too few file arguments to a threaded call to u128_bench, it seems to read off the end of argv and into the environment:

[code]
nfsslave2@cow:/scratch/fib1039/with-cado$ taskset 0f /home/nfsslave2/cado/cado-nfs-20090528-r2167/build/cow/linalg/bwc/u128_bench -impl bucket -nthreads 4 -- butterfly14T.h*
4 threads requested, but 1 files given on the command line.
Using implementation "bucket"
no cache file butterfly14T.h*-bucket.bin
T0 Building cache file for butterfly14T.h*
no cache file (null)-bucket.bin
T1 Building cache file for (null)
no cache file TERM=xterm-color-bucket.bin
T2 Building cache file for TERM=xterm-color
no cache file SHELL=/bin/bash-bucket.bin
fopen(butterfly14T.h*): No such file or directory
fopen((null)): Bad address
fopen(TERM=xterm-color): No such file or directory
[/code]
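
That matches the usual Unix process layout: argv[argc] is guaranteed to be NULL, and on typical Linux systems the environment pointer array starts right after that NULL, so indexing past the end of the file list walks straight into environ -- hence the "(null)", TERM=... and SHELL=... "filenames" above. A tiny layout-dependent demonstration (nothing to rely on, just an illustration):

[code]
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* peek a few slots past the end of argv; on a typical Linux layout the
     * first is NULL and the rest alias the environment strings */
    for (int i = argc; i < argc + 4; i++)
        printf("argv[%d] = %s\n", i, argv[i] ? argv[i] : "(null)");
    return 0;
}
[/code]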

joral 2009-06-04 15:40

OK. Apparently it is random. I've had it fail at iteration 100, 1000, 1900, and 19500 (out of 29300).

thome 2009-06-04 20:38

arg loop: fixed, thanks (this program is in fact unused -- it does not really belong to the set of distributed programs, yet it can be handy because it does a lot of work out of core).

running off argv -- this has been fixed in one of the updated tarballs that I had posted.

failing assert: the assert was wrong (sigh). It should have been (n+2*np)*2. A tiny example is the matrix which, once piped through 'uniq -c', gives the following output (one must also set HUGE_MPLEX_MIN to zero in matmul-bucket.cpp):

[code]
1 5000 5000
4365 0
1 1 1353
634 0
[/code]
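
For reference, undoing that 'uniq -c' summary regenerates the tiny failing input (a sketch; treating the first line as a size header and the rest as one row per line is my guess at the format, but the line contents themselves come straight from the counts above):

[code]
#include <stdio.h>

int main(void)
{
    /* write out the 5001 lines summarised by the `uniq -c' output above */
    FILE *f = fopen("tiny-test.matrix", "w");
    if (f == NULL)
        return 1;
    fprintf(f, "5000 5000\n");          /*    1 x "5000 5000" */
    for (int i = 0; i < 4365; i++)
        fprintf(f, "0\n");              /* 4365 x "0"         */
    fprintf(f, "1 1353\n");             /*    1 x "1 1353"    */
    for (int i = 0; i < 634; i++)
        fprintf(f, "0\n");              /*  634 x "0"         */
    fclose(f);
    return 0;
}
[/code]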

disappointing performance: I'm working on it.

Thanks,

E.

thome 2009-06-04 20:40

[quote=joral;175888]Ok. apparently it is random. I've had it fail at iteration 100, 1000, 1900, and 19500 (out of 29300).[/quote]

OK, perhaps you could give it a try on a different machine?

The good thing is that if you've got a DIMM at fault, you now have a handy way to pinpoint the culprit ;-).

E.

frmky 2009-06-04 21:25

[QUOTE=thome;175728]This warning no longer appears (yes, there's a new tarball).[/QUOTE]

I tried compiling the new source on Linux x86_64 using pthreads, but it ends with the error

[code]
CMake Error in linalg/bwc/CMakeLists.txt:
Cannot find source file "matmul-sub-large-fbi.S".
[/code]

Sure enough, this file is referenced in the CMakeLists.txt and a corresponding .h file is #include'd in matmul-bucket.cpp, but it's not in the directory.

joral 2009-06-05 00:30

I haven't been able to get it to build on my dual P3-700 yet, and I don't want to compare against an Athlon64 X2 running at about 2GHz.

I may pull out my Ubuntu CD later and run memtest against it to see what happens.

I do find it interesting that the examples run without incident.

joral 2009-06-05 12:37

OK... Either I take my computer back to 1GB or I go buy a new memory stick. I ran memtest overnight, and it picked up about 808 bit errors right around the 1.5GB mark over 6 passes.

Good call, though bad for me...

