#12
Tribal Bullet
Oct 2004
3,541 Posts
Did this run try to go straight to the square root after the LA? If so, maybe there would be some difference if you only ran with -ncr? If one of the nodes calls exit() because it can't find a dependency file (the final dependencies are still being written out), mpirun then kills node zero, which could cause the errors you're seeing.
Last fiddled with by jasonp on 2011-02-21 at 22:23
#13
Jun 2003
Ottawa, Canada
3×17×23 Posts
Quote:
Code:
linear algebra at 99.8%, ETA 0h 3m
5860705 dimensions (99.8%, ETA 0h 3m)
checkpointing every 180000 dimensions
5860705 dimensions (99.9%, ETA 0h 2m)
lanczos halted after 92679 iterations (dim = 5860478)   [printed once per MPI rank]
lanczos error: only trivial dependencies found          [printed by every rank but one]
BLanczosTime: 311                                       [310-313 across ranks]
elapsed time 00:05:13                                   [00:05:13-00:05:15 across ranks]
recovered 38 nontrivial dependencies
BLanczosTime: 320
elapsed time 00:05:22

Last fiddled with by Jeff Gilchrist on 2011-02-22 at 00:01
#14
Tribal Bullet
Oct 2004
3,541 Posts
Quote:
More fundamentally, it makes more sense for the LA to detect that multiple MPI processes are running and just have all the non-root MPI processes go away, instead of making up a fake error condition. You've shown that the fake error is worse than confusing: it actually sabotages the entire run. Though it is also of questionable value to make the square root continue on one node while all the others wait for it to finish, possibly for hours.

Last fiddled with by jasonp on 2011-02-22 at 01:38
#15
Jun 2003
Ottawa, Canada
3×17×23 Posts
Quote:
Besides changing the error messages, is there anything you can do to fix a run that chains -nc2 -nc3 with MPI, so that it doesn't kill rank 0 before it finishes writing the .dep file?

Jeff.
#16
|
Tribal Bullet
Oct 2004
6725₈ Posts
Honestly, the best policy might be to just refuse to run if you ask for the parallel LA and the square root consecutively. You can run the square root easily on multiple MPI processes, but there's little point in doing so unless the parallelism is pushed into the code for a single square root. You also run the risk of exhausting memory if all the MPI processes are on a single machine and you're doing several dependencies at once.
The alternative is to have all the non-root processes call MPI_Finalize when the Lanczos code finishes; this keeps them all alive until the dependencies are written, but also makes them wait while the root completes the square root. Thanks again for the bug report, it has probably saved a lot of headaches for everyone else.

Last fiddled with by jasonp on 2011-02-22 at 18:05
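[For readers following along, the idea above can be sketched roughly as follows. This is not msieve's actual code; the stage functions are hypothetical stand-ins for the real Lanczos, dependency-writing, and square-root routines.]

```c
/* Sketch only (NOT msieve's code): keep every MPI rank alive until
 * rank 0 has written the dependency file, then let the non-root ranks
 * leave cleanly instead of exiting early with a fake error, which can
 * make mpirun kill rank 0 mid-write. */
#include <mpi.h>

/* Hypothetical stand-ins for msieve's real stage functions. */
static void run_block_lanczos(int rank) { (void)rank; /* all ranks work here */ }
static void write_dep_file(void)        { /* rank 0 writes the .dep file */ }
static void run_square_root(void)       { /* rank 0 only */ }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    run_block_lanczos(rank);
    if (rank == 0)
        write_dep_file();

    /* No rank may exit before the .dep file is safely on disk. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        run_square_root();   /* non-root ranks idle until Finalize returns */

    MPI_Finalize();
    return 0;
}
```

The trade-off jasonp describes is visible here: the non-root ranks sit idle from the barrier until rank 0 finishes the square root, which may take hours on a large job.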
#17
|
Jun 2003
Ottawa, Canada
3·17·23 Posts
I guess I was just so used to chaining the 3 stages on a multi-threaded machine, that I automatically did the same thing with the MPI version. Now that I know the pitfalls, I can easily avoid it.
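[For anyone hitting the same pitfall: the safe pattern is simply to run the stages as separate commands rather than one chained invocation. A hedged sketch; the -nc2/-nc3 flags are the ones discussed in this thread, while the process count and any grid arguments are whatever your own setup already uses.]

```shell
# Run the MPI linear algebra stage on its own...
mpirun -np 25 ./msieve -nc2

# ...and only after it has finished (the .dep file is fully written),
# run the square root as an ordinary single-process job.
./msieve -nc3
```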
#18
|
Jun 2003
Ottawa, Canada
3×17×23 Posts
I think I found another bug. I completed another 30 bit factoring job using a 5x5 grid with MPI and it worked fine. I tried to launch it using a 6x6 grid with 36 CPUs, but it fails saying that the matrix expects <= 35 MPI procs. Here is the log from rank 0.
Code:
Fri Feb 25 22:15:36 2011  Msieve v. 1.48
Fri Feb 25 22:15:36 2011  random seeds: 8f6dcfec 1e657090
Fri Feb 25 22:15:36 2011  MPI process 0 of 36
Fri Feb 25 22:15:36 2011  factoring 61377934632499616387908546877397575571582016265152677612398342795764922510357526469234310265459567285560840877704465244744514347092220346785330673622832591683289857296301979438391898112628510050636796071812183520024551173853 (224 digits)
Fri Feb 25 22:15:39 2011  no P-1/P+1/ECM available, skipping
Fri Feb 25 22:15:39 2011  commencing number field sieve (224-digit input)
Fri Feb 25 22:15:39 2011  R0: -20000000000000000000000000000000000000
Fri Feb 25 22:15:39 2011  R1: 1
Fri Feb 25 22:15:39 2011  A0: 1
Fri Feb 25 22:15:39 2011  A1: 0
Fri Feb 25 22:15:39 2011  A2: 0
Fri Feb 25 22:15:39 2011  A3: 0
Fri Feb 25 22:15:39 2011  A4: 0
Fri Feb 25 22:15:39 2011  A5: 0
Fri Feb 25 22:15:39 2011  A6: 6250
Fri Feb 25 22:15:39 2011  skew 0.23, size 3.143e-11, alpha -0.950, combined = 1.477e-12 rroots = 0
Fri Feb 25 22:15:39 2011  commencing linear algebra
Fri Feb 25 22:15:39 2011  initialized process (0,0) of 6 x 6 grid
Fri Feb 25 22:15:45 2011  read 5168723 cycles
Fri Feb 25 22:16:02 2011  cycles contain 14325173 unique relations
Fri Feb 25 22:21:09 2011  read 14325173 relations
Fri Feb 25 22:21:54 2011  using 20 quadratic characters above 1073739768
Fri Feb 25 22:23:11 2011  building initial matrix
Fri Feb 25 22:28:30 2011  memory use: 1962.6 MB
Fri Feb 25 22:28:49 2011  read 5168723 cycles
Fri Feb 25 22:28:51 2011  matrix is 5168543 x 5168723 (1719.3 MB) with weight 516830953 (99.99/col)
Fri Feb 25 22:28:51 2011  sparse part has weight 393855757 (76.20/col)
Fri Feb 25 22:30:01 2011  filtering completed in 1 passes
Fri Feb 25 22:30:03 2011  matrix is 5168543 x 5168723 (1719.3 MB) with weight 516830953 (99.99/col)
Fri Feb 25 22:30:03 2011  sparse part has weight 393855757 (76.20/col)
Fri Feb 25 22:31:24 2011  error: matrix expects MPI procs <= 35

The other ranks log "commencing linear algebra" and "initialized process (0,1) of 6 x 6 grid", and 4 of the non-rank-0 processes also show the same error after their "initialized process" message: error: matrix expects MPI procs <= 35

Is there any other info you want to see or want me to try?

Jeff.
#19
|
Tribal Bullet
Oct 2004
DD5₁₆ Posts
Try the latest SVN; this was an error check that needed fixing, and was noticed by Ilya Popovyan a few weeks ago.
Last fiddled with by jasonp on 2011-02-26 at 14:01
#20
|
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
2A01₁₆ Posts
Quote:
Running into all sorts of hassles here and have had to hack the Makefile in several ways to make progress. CUDA_ROOT had to be set explicitly (to /usr/local/cuda/ in my case) and the library is presumably /usr/local/cuda/lib64/cudart.so --- there is no cuda.lib as given in the Makefile. The linker keeps failing to find the CUDA library routines:
Code:
gcc -D_FILE_OFFSET_BITS=64 -O3 -fomit-frame-pointer -march=k8 -DNDEBUG -D_LARGEFILE64_SOURCE -Wall -W -I. -Iinclude -Ignfs -Ignfs/poly -Ignfs/poly/stage1 -I"/usr/local/cuda/include" -DHAVE_CUDA demo.c -o msieve \
libmsieve.a -lz -lgmp -lm -lpthread
libmsieve.a(stage1.no): In function `poly_stage1_run':
stage1.c:(.text+0x280): undefined reference to `cuCtxCreate_v2'
stage1.c:(.text+0x2a1): undefined reference to `cuModuleLoad'
stage1.c:(.text+0x2c2): undefined reference to `cuModuleLoad'
stage1.c:(.text+0x7ba): undefined reference to `cuCtxDestroy'
libmsieve.a(stage1_sieve_gpu_nosq.no): In function `sieve_lattice_gpu_nosq':
stage1_sieve_gpu_nosq.c:(.text+0x516): undefined reference to `cuModuleGetFunction'
stage1_sieve_gpu_nosq.c:(.text+0x6af): undefined reference to `cuMemAlloc_v2'
stage1_sieve_gpu_nosq.c:(.text+0x6d0): undefined reference to `cuModuleGetGlobal_v2'
stage1_sieve_gpu_nosq.c:(.text+0x6ef): undefined reference to `cuFuncGetAttribute'
stage1_sieve_gpu_nosq.c:(.text+0x715): undefined reference to `cuFuncSetBlockShape'
stage1_sieve_gpu_nosq.c:(.text+0x75e): undefined reference to `cuMemAlloc_v2'
stage1_sieve_gpu_nosq.c:(.text+0x952): undefined reference to `cuParamSetv'
stage1_sieve_gpu_nosq.c:(.text+0x982): undefined reference to `cuParamSetv'
stage1_sieve_gpu_nosq.c:(.text+0x999): undefined reference to `cuParamSetSize'
stage1_sieve_gpu_nosq.c:(.text+0xa50): undefined reference to `cuModuleGetFunction'
stage1_sieve_gpu_nosq.c:(.text+0xaf6): undefined reference to `cuParamSeti'
stage1_sieve_gpu_nosq.c:(.text+0xc0e): undefined reference to `cuMemcpyHtoD_v2'
stage1_sieve_gpu_nosq.c:(.text+0xc29): undefined reference to `cuParamSeti'
stage1_sieve_gpu_nosq.c:(.text+0xcb3): undefined reference to `cuMemcpyHtoD_v2'
stage1_sieve_gpu_nosq.c:(.text+0xccd): undefined reference to `cuParamSeti'
stage1_sieve_gpu_nosq.c:(.text+0xd0c): undefined reference to `cuLaunchGrid'
stage1_sieve_gpu_nosq.c:(.text+0xd3a): undefined reference to `cuMemcpyDtoH_v2'
stage1_sieve_gpu_nosq.c:(.text+0xf24): undefined reference to `cuMemFree_v2'
stage1_sieve_gpu_nosq.c:(.text+0xf35): undefined reference to `cuMemFree_v2'
libmsieve.a(stage1_sieve_gpu_sq.no): In function `trans_batch_sq.clone.1':
stage1_sieve_gpu_sq.c:(.text+0xb9): undefined reference to `cuParamSetv'
stage1_sieve_gpu_sq.c:(.text+0xe6): undefined reference to `cuParamSetv'
stage1_sieve_gpu_sq.c:(.text+0x101): undefined reference to `cuParamSeti'
stage1_sieve_gpu_sq.c:(.text+0x12e): undefined reference to `cuParamSetf'
stage1_sieve_gpu_sq.c:(.text+0x145): undefined reference to `cuParamSetSize'
stage1_sieve_gpu_sq.c:(.text+0x18f): undefined reference to `cuMemcpyHtoD_v2'
stage1_sieve_gpu_sq.c:(.text+0x28e): undefined reference to `cuMemcpyHtoD_v2'
stage1_sieve_gpu_sq.c:(.text+0x2a9): undefined reference to `cuParamSeti'
stage1_sieve_gpu_sq.c:(.text+0x2d9): undefined reference to `cuLaunchGrid'
stage1_sieve_gpu_sq.c:(.text+0x2f9): undefined reference to `cuMemcpyDtoH_v2'
libmsieve.a(stage1_sieve_gpu_sq.no): In function `sieve_lattice_gpu_sq':
stage1_sieve_gpu_sq.c:(.text+0x728): undefined reference to `cuModuleGetFunction'
stage1_sieve_gpu_sq.c:(.text+0x745): undefined reference to `cuModuleGetFunction'
stage1_sieve_gpu_sq.c:(.text+0x819): undefined reference to `cuMemAlloc_v2'
stage1_sieve_gpu_sq.c:(.text+0x834): undefined reference to `cuMemAlloc_v2'
stage1_sieve_gpu_sq.c:(.text+0x853): undefined reference to `cuFuncGetAttribute'
stage1_sieve_gpu_sq.c:(.text+0x879): undefined reference to `cuFuncSetBlockShape'
stage1_sieve_gpu_sq.c:(.text+0x898): undefined reference to `cuFuncGetAttribute'
stage1_sieve_gpu_sq.c:(.text+0x8be): undefined reference to `cuFuncSetBlockShape'
stage1_sieve_gpu_sq.c:(.text+0x910): undefined reference to `cuMemAlloc_v2'
stage1_sieve_gpu_sq.c:(.text+0xbd2): undefined reference to `cuParamSetv'
stage1_sieve_gpu_sq.c:(.text+0xc07): undefined reference to `cuParamSetv'
stage1_sieve_gpu_sq.c:(.text+0xc20): undefined reference to `cuParamSeti'
stage1_sieve_gpu_sq.c:(.text+0xc55): undefined reference to `cuParamSetv'
stage1_sieve_gpu_sq.c:(.text+0xc6c): undefined reference to `cuParamSetSize'
stage1_sieve_gpu_sq.c:(.text+0xdfc): undefined reference to `cuMemcpyHtoD_v2'
stage1_sieve_gpu_sq.c:(.text+0xe1a): undefined reference to `cuParamSeti'
stage1_sieve_gpu_sq.c:(.text+0xf5a): undefined reference to `cuMemcpyHtoD_v2'
stage1_sieve_gpu_sq.c:(.text+0xf75): undefined reference to `cuParamSeti'
stage1_sieve_gpu_sq.c:(.text+0xf93): undefined reference to `cuLaunchGrid'
stage1_sieve_gpu_sq.c:(.text+0xfb9): undefined reference to `cuMemcpyDtoH_v2'
stage1_sieve_gpu_sq.c:(.text+0x11e7): undefined reference to `cuMemFree_v2'
stage1_sieve_gpu_sq.c:(.text+0x11fd): undefined reference to `cuMemFree_v2'
stage1_sieve_gpu_sq.c:(.text+0x1213): undefined reference to `cuMemFree_v2'
stage1_sieve_gpu_sq.c:(.text+0x169b): undefined reference to `cuModuleGetFunction'
libmsieve.a(cuda_xface.o): In function `gpu_init':
cuda_xface.c:(.text+0x1e9): undefined reference to `cuInit'
cuda_xface.c:(.text+0x1f9): undefined reference to `cuDeviceGetCount'
cuda_xface.c:(.text+0x229): undefined reference to `cuDeviceGet'
cuda_xface.c:(.text+0x256): undefined reference to `cuDeviceGetName'
cuda_xface.c:(.text+0x274): undefined reference to `cuDeviceComputeCapability'
cuda_xface.c:(.text+0x288): undefined reference to `cuDeviceGetProperties'
cuda_xface.c:(.text+0x2f1): undefined reference to `cuDeviceTotalMem_v2'
cuda_xface.c:(.text+0x30c): undefined reference to `cuDeviceGetAttribute'
cuda_xface.c:(.text+0x326): undefined reference to `cuDeviceGetAttribute'
cuda_xface.c:(.text+0x33d): undefined reference to `cuDeviceGetAttribute'
collect2: ld returned 1 exit status
make: *** [x86_64] Error 1
I'm stuck at the moment.

Paul
#21
|
Tribal Bullet
Oct 2004
3,541 Posts
cuda.lib is the name of the driver library under Windows. On Linux it would be libcuda.a, though I don't know where in the file tree it would be, probably right next to libcudart.a.

Msieve uses the driver API, not the runtime API, so that a compiled binary can work on a machine with only the Nvidia driver installed (and not the whole CUDA toolkit). FWIW, I had access to a Linux system and could not even install the CUDA toolkit to the point where the sample applications worked.
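[Following that explanation, the undefined cuInit/cuModuleLoad/cuMemAlloc_v2 symbols in the log above are driver-API entry points, so the link line needs the driver library (libcuda) rather than only the runtime. A sketch of a possible fix; the library directory is an assumption that varies by driver install.]

```shell
# libcuda.so is installed by the Nvidia *driver*, not the CUDA toolkit;
# /usr/lib64 is an assumed location -- check where your driver put it.
gcc -D_FILE_OFFSET_BITS=64 -O3 -DNDEBUG -D_LARGEFILE64_SOURCE -DHAVE_CUDA \
    -I. -Iinclude -I"/usr/local/cuda/include" demo.c -o msieve \
    libmsieve.a -L/usr/lib64 -lcuda -lz -lgmp -lm -lpthread
```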
#22
|
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
10,753 Posts
Quote:
If you wish I can give you ssh access to a machine with CUDA 3.2, 64-bit Fedora 14, a GT460 and the 260.19.36 Nvidia driver. The CUDA kit installed without any problems here and the SDK examples all run correctly.

Paul
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
|--------|----------------|-------|---------|-----------|
| Msieve 1.53 feedback | xilman | Msieve | 149 | 2018-11-12 06:37 |
| Msieve 1.50 feedback | firejuggler | Msieve | 99 | 2013-02-17 11:53 |
| Msieve 1.43 feedback | Jeff Gilchrist | Msieve | 47 | 2009-11-24 15:53 |
| Msieve 1.42 feedback | Andi47 | Msieve | 167 | 2009-10-18 19:37 |
| Msieve 1.41 Feedback | Batalov | Msieve | 130 | 2009-06-09 16:01 |