mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   msieve SVN940 problem (https://www.mersenneforum.org/showthread.php?t=18460)

fivemack 2013-08-11 23:14

msieve SVN940 problem
 
Doing the linear algebra for a medium-sized SNFS job on my Haswell, with a freshly compiled SVN940 running -t4, I get a segfault just as the Lanczos iteration starts:

[code]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe9f38700 (LWP 615)]
0x000000000043e9c9 in mul_packed_small_core ()
(gdb) bt
#0 0x000000000043e9c9 in mul_packed_small_core ()
#1 0x000000000044a5b6 in worker_thr_routine ()
#2 0x00007ffff7235f8e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007ffff6f5fe1d in clone () from /lib/x86_64-linux-gnu/libc.so.6
[/code]

which looks as if mul_packed_small_core has been called with task->matrix == NULL.

After rebuilding with -g and re-running, I can see that the whole matrix->tasks[] array contains only zeroes at this point.

Batalov 2013-09-11 16:43

Seconded.

My test was on r944 and under MPI.
Typical dumps from the workers are:
[CODE][node-4-5:24546] [ 0] /lib64/libpthread.so.0 [0x368700eca0]
[node-4-5:24546] [ 1] ./ms(mul_packed_core+0x15) [0x43d6d5]
[node-4-5:24546] [ 2] ./ms [0x43c77e]
[node-4-5:24546] [ 3] ./ms(mul_sym_NxN_Nx64+0x6c) [0x43cc7c]
[node-4-5:24546] [ 4] ./ms [0x437eb7]
[node-4-5:24546] [ 5] ./ms(block_lanczos+0x33b) [0x43acfb]
[node-4-5:24546] [ 6] ./ms(nfs_solve_linear_system+0x438) [0x42f418]
[node-4-5:24546] [ 7] ./ms(factor_gnfs+0x966) [0x4178c6]
[node-4-5:24546] [ 8] ./ms(msieve_run+0x519) [0x408b39]
[node-4-5:24546] [ 9] ./ms(factor_integer+0x140) [0x406ed0]
[node-4-5:24546] [10] ./ms(main+0x7c9) [0x4078b9]
[node-4-5:24546] [11] /lib64/libc.so.6(__libc_start_main+0xf4) [0x368641d9c4]
[node-4-5:24546] [12] ./ms [0x406c39]
[node-4-5:24546] *** End of error message ***
[node-4-5:24542] [ 0] /lib64/libpthread.so.0 [0x368700eca0]
[node-4-5:24542] [ 1] ./ms(mul_packed_core+0x15) [0x43d6d5]
[node-4-5:24542] [ 2] ./ms [0x43c77e]
[node-4-5:24542] [ 3] ./ms(mul_sym_NxN_Nx64+0x6c) [0x43cc7c]
[node-4-5:24542] [ 4] ./ms [0x437eb7]
[node-4-5:24542] [ 5] ./ms(block_lanczos+0x33b) [0x43acfb]
[node-4-5:24542] [ 6] ./ms(nfs_solve_linear_system+0x438) [0x42f418]
[node-4-5:24542] [ 7] ./ms(factor_gnfs+0x966) [0x4178c6]
[node-4-5:24542] [ 8] ./ms(msieve_run+0x519) [0x408b39]
[node-4-5:24542] [ 9] ./ms(factor_integer+0x140) [0x406ed0]
[node-4-5:24542] [10] ./ms(main+0x7c9) [0x4078b9]
[node-4-5:24542] [11] /lib64/libc.so.6(__libc_start_main+0xf4) [0x368641d9c4]
[node-4-5:24542] [12] ./ms [0x406c39]
[node-4-5:24542] *** End of error message ***[/CODE]
Note: the matrix itself was built fine, repeatedly (that part works in r940+), but the crash happens as soon as the LA iterations start.

Following Greg's sage advice, I downgraded to r923, rebuilt, and the computation then went on without problems in any of the combinations I tried (-nc2 4,8; -nc2 8,4; -nc2 8,4 -t 4).

The bug was introduced somewhere between r923 and r940.

jasonp 2013-09-11 16:59

The current best guess is that there's a buffer overflow at the beginning of the LA which only became apparent in SVN 937, when initialization data that cannot afford to be overwritten was moved next to it. I don't have the time to deal with it.

Batalov 2013-09-11 17:22

Understood.

This is just to leave a hint, for anyone in the same boat, that for the time being r923 is fine.

bdodson 2013-09-20 15:07

[QUOTE=Batalov;352742]Understood.

This is just to leave a hint, for anyone in the same boat, that for the time being r923 is fine.[/QUOTE]

Uhm, that was
[code]
svn checkout -r 932 svn://svn.code.sf.net/p/msieve/code/trunk msieve-code
[/code]
in case there are any other svn newbies, desperately seeking 932.
Matrix for 6p448 C289 is
[code]
Thu Sep 19 23:02:19 2013 weight of 37122087 cycles is about 4826664707 (130.02/cycle) [/code]
up from
[code]
Fri Mar 8 09:25:30 2013 weight of 30701948 cycles is about 3991401291 (130.00/cycle)
[/code]
for 3m613 C285. -Bruce

fivemack 2013-09-21 21:16

OK, I think I've got it.

I found an example which crashed repeatably with r944, and which was small enough to run under valgrind.

valgrind tells me that a memset operation to t->tmp_b was running off the end of the array.

When I annotated the code that allocated t->tmp_b and that did the write to it, I found that it was allocating size 64 and doing a write of size 3000; this is obviously not going to work.

But the allocation is of something at least as large as p->first_block_size, and the memset is of size p->first_block_size; so what's going wrong?

Answer: t->tmp_b is allocated (in matrix_thread_init around line 397 of lanczos_matmul0.c) as

[code]
t->tmp_b = (uint64 *)xmalloc(MAX(64, p->first_block_size) *
sizeof(uint64));
[/code]

matrix_thread_init is called from packed_matrix_init, around line 465 of common/lanczos/lanczos_matmul0.c

[B]But p->first_block_size isn't filled in until line 519 of the same function[/B]!

If I move the block from '/* start the thread pool */' to 'matrix_thread_init(p, num_threads-1)' inclusive to just after the pack_matrix_core(p, A); call at the end of packed_matrix_init, it stops blowing up for several test cases of mine.

fivemack 2013-09-21 23:40

Sorry, that broke the case where the matrix is too small to need packing, because an early-out path fired before matrix_thread_init was ever called.

Instead of

[code]
if (max_nrows <= MIN_NROWS_TO_PACK)
return;
[/code]

you need to set p->first_block_size unconditionally, do the packing only if max_nrows > MIN_NROWS_TO_PACK, and then start the thread pool unconditionally.

fivemack 2013-09-22 07:46

With the second version of my patch, I've run a hundred or so small factorisations overnight without issue.

Would someone with SVN privileges mind casting another eye over it and committing it? It looks a bit large, but it's three trivial changes; it's just that one of them changed the indentation of quite a long block.

[code]
===================================================================
--- common/lanczos/lanczos_matmul0.c (revision 944)
+++ common/lanczos/lanczos_matmul0.c (working copy)
@@ -446,18 +446,6 @@

p->num_threads = num_threads = MIN(num_threads, MAX_THREADS);

- /* start the thread pool */
-
- control.init = matrix_thread_init;
- control.shutdown = matrix_thread_free;
- control.data = p;
-
- if (num_threads > 1) {
- p->threadpool = threadpool_init(num_threads - 1,
- 200, &control);
- }
- matrix_thread_init(p, num_threads - 1);
-
/* pre-generate the structures to drive the thread pool;
do this even for single-threaded runs */

@@ -469,63 +457,78 @@
p->tasks[i].task_num = i;
}

- if (max_nrows <= MIN_NROWS_TO_PACK)
- return;
+ p->first_block_size = first_block_size;

- /* determine the block sizes. We assume that the largest
- cache in the system is unified and shared across all
- threads. When performing matrix multiplies 'A*x=b',
- we choose the block size small enough so that one block
- of b fits in L1 cache, and choose the superblock size
- to be small enough so that a superblock's worth of x
- or b takes up 3/4 of the largest cache in the system.
-
- Making the block size too small increases memory use
- and puts more pressure on the larger caches, while
- making the superblock too small reduces the effectiveness
- of L1 cache and increases the synchronization overhead
- in multithreaded runs */
+ if (max_nrows > MIN_NROWS_TO_PACK)
+ {

- block_size = 8192;
- superblock_size = 3 * obj->cache_size2 / (4 * sizeof(uint64));
+ /* determine the block sizes. We assume that the largest
+ cache in the system is unified and shared across all
+ threads. When performing matrix multiplies 'A*x=b',
+ we choose the block size small enough so that one block
+ of b fits in L1 cache, and choose the superblock size
+ to be small enough so that a superblock's worth of x
+ or b takes up 3/4 of the largest cache in the system.
+
+ Making the block size too small increases memory use
+ and puts more pressure on the larger caches, while
+ making the superblock too small reduces the effectiveness
+ of L1 cache and increases the synchronization overhead
+ in multithreaded runs */
+
+ block_size = 8192;
+ superblock_size = 3 * obj->cache_size2 / (4 * sizeof(uint64));

- /* possibly override from the command line */
+ /* possibly override from the command line */

- if (obj->nfs_args != NULL) {
+ if (obj->nfs_args != NULL) {
+
+ const char *tmp;
+
+ tmp = strstr(obj->nfs_args, "la_block=");
+ if (tmp != NULL)
+ block_size = atoi(tmp + 9);
+
+ tmp = strstr(obj->nfs_args, "la_superblock=");
+ if (tmp != NULL)
+ superblock_size = atoi(tmp + 14);
+ }

- const char *tmp;
+ logprintf(obj, "using block size %u and superblock size %u for "
+ "processor cache size %u kB\n",
+ block_size, superblock_size,
+ obj->cache_size2 / 1024);
+
+ p->unpacked_cols = NULL;
+
+ p->block_size = block_size;
+ p->num_block_cols = (ncols + block_size - 1) / block_size;
+ p->num_block_rows = 1 + (nrows - first_block_size +
+ block_size - 1) / block_size;
+
+ p->superblock_size = (superblock_size + block_size - 1) / block_size;
+ p->num_superblock_cols = (p->num_block_cols + p->superblock_size - 1) /
+ p->superblock_size;
+ p->num_superblock_rows = (p->num_block_rows - 1 + p->superblock_size - 1) /
+ p->superblock_size;
+
+ /* do the core work of packing the matrix */
+
+ pack_matrix_core(p, A);
+ }

- tmp = strstr(obj->nfs_args, "la_block=");
- if (tmp != NULL)
- block_size = atoi(tmp + 9);
+ /* start the thread pool */

- tmp = strstr(obj->nfs_args, "la_superblock=");
- if (tmp != NULL)
- superblock_size = atoi(tmp + 14);
+ control.init = matrix_thread_init;
+ control.shutdown = matrix_thread_free;
+ control.data = p;
+
+ if (num_threads > 1) {
+ p->threadpool = threadpool_init(num_threads - 1,
+ 200, &control);
}
+ matrix_thread_init(p, num_threads - 1);

- logprintf(obj, "using block size %u and superblock size %u for "
- "processor cache size %u kB\n",
- block_size, superblock_size,
- obj->cache_size2 / 1024);
-
- p->unpacked_cols = NULL;
- p->first_block_size = first_block_size;
-
- p->block_size = block_size;
- p->num_block_cols = (ncols + block_size - 1) / block_size;
- p->num_block_rows = 1 + (nrows - first_block_size +
- block_size - 1) / block_size;
-
- p->superblock_size = (superblock_size + block_size - 1) / block_size;
- p->num_superblock_cols = (p->num_block_cols + p->superblock_size - 1) /
- p->superblock_size;
- p->num_superblock_rows = (p->num_block_rows - 1 + p->superblock_size - 1) /
- p->superblock_size;
-
- /* do the core work of packing the matrix */
-
- pack_matrix_core(p, A);
}
[/code]

debrouxl 2013-09-22 09:40

AFAICS, your patch fixes the problem reported at [url]http://www.mersenneforum.org/showthread.php?t=18481[/url], for which I had compiled msieve with AddressSanitizer, which showed a double free.

jasonp 2013-09-22 15:16

Makes perfect sense, sorry for not paying attention.

Would it also work to move the setting of first_block_size to the unconditional initialization at the top of packed_matrix_init and leave everything else alone?
i.e. the patch would be
[code]
--- common/lanczos/lanczos_matmul0.c (revision 944)
+++ common/lanczos/lanczos_matmul0.c (working copy)
@@ -428,6 +428,7 @@
p->start_col = start_col;
p->num_dense_rows = num_dense_rows;
p->num_threads = 1;
+ p->first_block_size = first_block_size; /* needed for thread pool init */
#ifdef HAVE_MPI
p->mpi_size = obj->mpi_size;
p->mpi_nrows = obj->mpi_nrows;
@@ -510,7 +511,6 @@
obj->cache_size2 / 1024);

p->unpacked_cols = NULL;
- p->first_block_size = first_block_size;

p->block_size = block_size;
p->num_block_cols = (ncols + block_size - 1) / block_size;
[/code]

fivemack 2013-09-22 18:51

Yes, that one-line change looks equivalent.

