msieve SVN940 problem
Doing the linear algebra for a medium-sized SNFS job on my Haswell, with a freshly-compiled SVN940 and running -t4, I get a segfault just as the Lanczos starts
[code]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe9f38700 (LWP 615)]
0x000000000043e9c9 in mul_packed_small_core ()
(gdb) bt
#0  0x000000000043e9c9 in mul_packed_small_core ()
#1  0x000000000044a5b6 in worker_thr_routine ()
#2  0x00007ffff7235f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007ffff6f5fe1d in clone () from /lib/x86_64-linux-gnu/libc.so.6
[/code]
which looks as if mul_packed_small_core has been called with task->matrix == NULL. Rebuilding with -g and re-running, the whole matrix->tasks[] array contains only zeroes at this point.
Seconded.
My test was on r944 and under MPI. A typical dump from a worker is:
[CODE]
[node-4-5:24546] [ 0] /lib64/libpthread.so.0 [0x368700eca0]
[node-4-5:24546] [ 1] ./ms(mul_packed_core+0x15) [0x43d6d5]
[node-4-5:24546] [ 2] ./ms [0x43c77e]
[node-4-5:24546] [ 3] ./ms(mul_sym_NxN_Nx64+0x6c) [0x43cc7c]
[node-4-5:24546] [ 4] ./ms [0x437eb7]
[node-4-5:24546] [ 5] ./ms(block_lanczos+0x33b) [0x43acfb]
[node-4-5:24546] [ 6] ./ms(nfs_solve_linear_system+0x438) [0x42f418]
[node-4-5:24546] [ 7] ./ms(factor_gnfs+0x966) [0x4178c6]
[node-4-5:24546] [ 8] ./ms(msieve_run+0x519) [0x408b39]
[node-4-5:24546] [ 9] ./ms(factor_integer+0x140) [0x406ed0]
[node-4-5:24546] [10] ./ms(main+0x7c9) [0x4078b9]
[node-4-5:24546] [11] /lib64/libc.so.6(__libc_start_main+0xf4) [0x368641d9c4]
[node-4-5:24546] [12] ./ms [0x406c39]
[node-4-5:24546] *** End of error message ***
[/CODE]
(worker 24542 printed an identical backtrace). Note: the matrix was built repeatedly and without error (that part is fine in r940+), but the crash happens immediately when the LA iterations start. Following Greg's sage advice, I downgraded to r923, rebuilt, and the computation then ran without problems in each of the combinations I tried (-nc2 4,8; -nc2 8,4; -nc2 8,4 -t 4). So the bug was introduced somewhere between r923 and r940+.
The current best guess is that there's a buffer overflow at the beginning of the LA that only became apparent in SVN 937, when initialization that could not afford to be overwritten was moved close by. I don't have the time to deal with it.
Understood.
This is just to leave a hint for anyone in the same boat that, for the time being, r923 is fine.
[QUOTE=Batalov;352742]Understood.
This is just to leave a hint for anyone in the same boat that, for the time being, r923 is fine.[/QUOTE]
Uhm, that was
[code]
svn checkout -r 932 svn://svn.code.sf.net/p/msieve/code/trunk msieve-code
[/code]
in case there are any other svn newbies desperately seeking 932. The matrix for 6p448 C289 is
[code]
Thu Sep 19 23:02:19 2013  weight of 37122087 cycles is about 4826664707 (130.02/cycle)
[/code]
up from
[code]
Fri Mar  8 09:25:30 2013  weight of 30701948 cycles is about 3991401291 (130.00/cycle)
[/code]
for 3m613 C285.

-Bruce
OK, I think I've got it.
I found an example which crashed repeatably with r944 and which was small enough to run under valgrind. valgrind tells me that a memset to t->tmp_b was running off the end of the array. When I annotated the code that allocates t->tmp_b and the code that writes to it, I found that it was allocating size 64 and doing a write of size 3000; this is obviously not going to work. But the allocation is at least as large as p->first_block_size, and the memset is of size p->first_block_size, so what's going wrong?

Answer: t->tmp_b is allocated (in matrix_thread_init, around line 397 of lanczos_matmul0.c) as
[code]
t->tmp_b = (uint64 *)xmalloc(MAX(64, p->first_block_size) * sizeof(uint64));
[/code]
and matrix_thread_init is called from packed_matrix_init, around line 465 of common/lanczos/lanczos_matmul0.c. [B]But p->first_block_size isn't filled in until line 519 of the same function![/B]

If I move the block from '/* start the thread pool */' to 'matrix_thread_init(p, num_threads - 1)' inclusive to after the pack_matrix_core(p, A); at the end of packed_matrix_init, it stops blowing up for several test cases of mine.
Sorry, that broke the case where the matrix is too small to need packing, because an early-out path fired before the matrix_thread_init got called.
Instead of
[code]
	if (max_nrows <= MIN_NROWS_TO_PACK)
		return;
[/code]
you need to set p->first_block_size unconditionally, do the packing only if max_nrows > MIN_NROWS_TO_PACK, and then start the thread pool unconditionally.
With the second version of my patch, I've run a hundred or so small factorisations overnight without issue.
Would someone with SVN privileges mind casting another eye over it and committing? It looks a bit large, but it's three trivial changes; it's just that one of them changed the indentation of quite a long block.
[code]
===================================================================
--- common/lanczos/lanczos_matmul0.c	(revision 944)
+++ common/lanczos/lanczos_matmul0.c	(working copy)
@@ -446,18 +446,6 @@
 	p->num_threads = num_threads = MIN(num_threads, MAX_THREADS);
 
-	/* start the thread pool */
-
-	control.init = matrix_thread_init;
-	control.shutdown = matrix_thread_free;
-	control.data = p;
-
-	if (num_threads > 1) {
-		p->threadpool = threadpool_init(num_threads - 1,
-					200, &control);
-	}
-	matrix_thread_init(p, num_threads - 1);
-
 	/* pre-generate the structures to drive the thread pool;
 	   do this even for single-threaded runs */
@@ -469,63 +457,78 @@
 		p->tasks[i].task_num = i;
 	}
 
-	if (max_nrows <= MIN_NROWS_TO_PACK)
-		return;
+	p->first_block_size = first_block_size;
 
-	/* determine the block sizes. We assume that the largest
-	   cache in the system is unified and shared across all
-	   threads. When performing matrix multiplies 'A*x=b',
-	   we choose the block size small enough so that one block
-	   of b fits in L1 cache, and choose the superblock size
-	   to be small enough so that a superblock's worth of x
-	   or b takes up 3/4 of the largest cache in the system.
-
-	   Making the block size too small increases memory use
-	   and puts more pressure on the larger caches, while
-	   making the superblock too small reduces the effectiveness
-	   of L1 cache and increases the synchronization overhead
-	   in multithreaded runs */
+	if (max_nrows > MIN_NROWS_TO_PACK)
+	{
 
-	block_size = 8192;
-	superblock_size = 3 * obj->cache_size2 / (4 * sizeof(uint64));
+		/* determine the block sizes. We assume that the largest
+		   cache in the system is unified and shared across all
+		   threads. When performing matrix multiplies 'A*x=b',
+		   we choose the block size small enough so that one block
+		   of b fits in L1 cache, and choose the superblock size
+		   to be small enough so that a superblock's worth of x
+		   or b takes up 3/4 of the largest cache in the system.
+
+		   Making the block size too small increases memory use
+		   and puts more pressure on the larger caches, while
+		   making the superblock too small reduces the effectiveness
+		   of L1 cache and increases the synchronization overhead
+		   in multithreaded runs */
+
+		block_size = 8192;
+		superblock_size = 3 * obj->cache_size2 / (4 * sizeof(uint64));
 
-	/* possibly override from the command line */
+		/* possibly override from the command line */
 
-	if (obj->nfs_args != NULL) {
+		if (obj->nfs_args != NULL) {
+
+			const char *tmp;
+
+			tmp = strstr(obj->nfs_args, "la_block=");
+			if (tmp != NULL)
+				block_size = atoi(tmp + 9);
+
+			tmp = strstr(obj->nfs_args, "la_superblock=");
+			if (tmp != NULL)
+				superblock_size = atoi(tmp + 14);
+		}
 
-		const char *tmp;
+		logprintf(obj, "using block size %u and superblock size %u for "
+				"processor cache size %u kB\n",
+				block_size, superblock_size,
+				obj->cache_size2 / 1024);
+
+		p->unpacked_cols = NULL;
+
+		p->block_size = block_size;
+		p->num_block_cols = (ncols + block_size - 1) / block_size;
+		p->num_block_rows = 1 + (nrows - first_block_size +
+				block_size - 1) / block_size;
+
+		p->superblock_size = (superblock_size + block_size - 1) / block_size;
+		p->num_superblock_cols = (p->num_block_cols + p->superblock_size - 1) /
+				p->superblock_size;
+		p->num_superblock_rows = (p->num_block_rows - 1 + p->superblock_size - 1) /
+				p->superblock_size;
+
+		/* do the core work of packing the matrix */
+
+		pack_matrix_core(p, A);
+	}
 
-		tmp = strstr(obj->nfs_args, "la_block=");
-		if (tmp != NULL)
-			block_size = atoi(tmp + 9);
+	/* start the thread pool */
 
-		tmp = strstr(obj->nfs_args, "la_superblock=");
-		if (tmp != NULL)
-			superblock_size = atoi(tmp + 14);
+	control.init = matrix_thread_init;
+	control.shutdown = matrix_thread_free;
+	control.data = p;
+
+	if (num_threads > 1) {
+		p->threadpool = threadpool_init(num_threads - 1,
+				200, &control);
 	}
+	matrix_thread_init(p, num_threads - 1);
 
-	logprintf(obj, "using block size %u and superblock size %u for "
-			"processor cache size %u kB\n",
-			block_size, superblock_size,
-			obj->cache_size2 / 1024);
-
-	p->unpacked_cols = NULL;
-	p->first_block_size = first_block_size;
-
-	p->block_size = block_size;
-	p->num_block_cols = (ncols + block_size - 1) / block_size;
-	p->num_block_rows = 1 + (nrows - first_block_size +
-			block_size - 1) / block_size;
-
-	p->superblock_size = (superblock_size + block_size - 1) / block_size;
-	p->num_superblock_cols = (p->num_block_cols + p->superblock_size - 1) /
-			p->superblock_size;
-	p->num_superblock_rows = (p->num_block_rows - 1 + p->superblock_size - 1) /
-			p->superblock_size;
-
-	/* do the core work of packing the matrix */
-
-	pack_matrix_core(p, A);
 }
[/code]
AFAICS, your patch fixes the problem reported at [url]http://www.mersenneforum.org/showthread.php?t=18481[/url]; for that one I had compiled msieve with AddressSanitizer, which showed a double free.
Makes perfect sense, sorry for not paying attention.
Would it also work to move the setting of first_block_size to the unconditional initialization at the top of packed_matrix_init, and leave everything else alone? I.e. the patch would be
[code]
--- common/lanczos/lanczos_matmul0.c	(revision 944)
+++ common/lanczos/lanczos_matmul0.c	(working copy)
@@ -428,6 +428,7 @@
 	p->start_col = start_col;
 	p->num_dense_rows = num_dense_rows;
 	p->num_threads = 1;
+	p->first_block_size = first_block_size; /* needed for thread pool init */
 #ifdef HAVE_MPI
 	p->mpi_size = obj->mpi_size;
 	p->mpi_nrows = obj->mpi_nrows;
@@ -510,7 +511,6 @@
 			obj->cache_size2 / 1024);
 
 	p->unpacked_cols = NULL;
-	p->first_block_size = first_block_size;
 
 	p->block_size = block_size;
 	p->num_block_cols = (ncols + block_size - 1) / block_size;
[/code]
Yes, that one-line change looks equivalent.