#188 | |
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2²×1,553 Posts |
Quote:
|
|
|
|
|
|
|
#189 |
|
Noodles
"Mr. Tuch"
Dec 2007
Chennai, India
4E9₁₆ Posts |
This is the job submission script used to run tasks on the compute cluster.
Unfortunately, we have no way of knowing what is happening inside until the tasks have completed fully. Code:
#!/bin/bash
#PBS -o logfile.log
#PBS -e errorfile.err
#PBS -l walltime=200:00:00
#PBS -l nodes=1:ppn=8
# Numeric part of the job ID (the text before the first dot)
tpdir=`echo $PBS_JOBID | cut -f 1 -d .`
# Per-job scratch directory on the compute node
tempdir=/scratch/$PBS_O_LOGNAME/job$tpdir
mkdir -p $tempdir
# Stop here if the scratch directory is unusable, so that mv below cannot act on the wrong directory
cd $tempdir || exit 1
# Copy the input files over from the submission directory
cp -R $PBS_O_WORKDIR/* .
./msieve143 -v -t 8 -s 7_320P.dat -nc
# Move the results back and remove the scratch directory
mv * $PBS_O_WORKDIR/.
rmdir $tempdir

If I try to write everything within the file directory itself, the task does not run at all; it gets killed or terminated almost instantly. Why?

Here are the files for your reference. You can check whether or not they are valid. The relation files cannot be uploaded right now. They are 100 times bigger than the cycle files themselves. Uploading the cycle files alone took up to 23 minutes per file.
10_339P.dat.cyc
10_339P.dat.dep
2_1778L.dat.cyc
7_320P.dat.cyc
7_320P.dat.dep
Last fiddled with by Raman on 2010-01-13 at 11:42 |
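On the point about not seeing anything until the job finishes, here is a minimal sketch. It assumes that the scheduler holds back the `#PBS -o` and `#PBS -e` files until the job ends, and that `$PBS_O_WORKDIR` sits on a filesystem visible from the login node. `run_logged` and `progress.log` are hypothetical names used only for illustration; they are not part of PBS or msieve.

```shell
#!/bin/bash
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with stdout and stderr appended to LOGFILE
# while the command runs (subject to the program's own output buffering),
# and returns COMMAND's exit status.
run_logged() {
    local logfile=$1
    shift
    "$@" >>"$logfile" 2>&1
}

# Inside the job script, the msieve line might become (assumption: the
# submission directory is writable from the compute node):
#   run_logged "$PBS_O_WORKDIR/progress.log" ./msieve143 -v -t 8 -s 7_320P.dat -nc
# and the run can then be followed from the login node with:
#   tail -f progress.log
```

Writing the log outside the scratch directory also means it survives if the job is killed before the final `mv` step runs.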
|
|
|
|
|
#190 | |
|
Tribal Bullet
Oct 2004
3·1,181 Posts |
Quote:
- relations were added or deleted after the filtering or LA ran
- there was a bug in the filtering or LA
- the cycle file or dependency file is corrupt
- there was a bug reading relations, that somehow did not occur when the relations were read for the filtering or LA
- the cycle file or dependency file is from an old factorization or an old run for the current factorization
- the matrix changed after you ran the LA once

That's a lot of possibilities, and there is no way to determine which possibility is happening by the time the square root runs. The basic problem here is that if everything worked perfectly a dependency may still not find a factor, so when that happens you need to know that it was just bad luck and not a bug. If one bit is wrong somewhere, on average half the dependencies will be spoiled but the other half may still work. The checking in the square root is there because at best one can check a few conditions that must be true. That cannot diagnose what the problems, if any, actually are.

A correct dependency file will look like random garbage. A correct cycle file will look mostly like random garbage. Without the underlying relations, the most you could verify would be that both represent the same number of matrix columns, and the number of relations in all dependencies is even.

What is the output of 'od -tx1 -A d msieve.dat.cyc | head', and what is the exact size of the dependency file?

Did you run md5sum like we asked you to? Your previous post suggests that you did, but then why are you still wondering if something was corrupted? Did you delete both the old and the current copy of the dependency file?

You should be prepared to get a little frustrated when moving to a new environment and then tackling data-intensive problems, it's difficult for everyone. |
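The three items asked for above (exact size, MD5 checksum, hex dump of the start of the file) can be collected in one go. This is a minimal sketch, assuming a Linux system with GNU coreutils (`wc`, `md5sum`, `od`); `file_report` is a hypothetical helper name, and the file names in the usage comment are the ones mentioned in this thread.

```shell
#!/bin/bash
# file_report FILE
# Hypothetical helper: prints the exact size in bytes, the MD5 checksum,
# and the first lines of a hex dump (same od invocation as quoted above).
file_report() {
    local f=$1
    if [ ! -r "$f" ]; then
        echo "cannot read: $f" >&2
        return 1
    fi
    printf 'size:   %s bytes\n' "$(wc -c <"$f" | tr -d ' ')"
    printf 'md5:    %s\n' "$(md5sum "$f" | cut -d ' ' -f 1)"
    echo 'head:'
    od -tx1 -A d "$f" | head
}

# Usage, e.g.:
#   file_report 10_339P.dat.cyc
#   file_report 10_339P.dat.dep
```

As the post says, none of this can prove a file is correct; it only gives others something concrete to compare against.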
|
|
|
|
|
|
#191 |
|
Noodles
"Mr. Tuch"
Dec 2007
Chennai, India
3·419 Posts |
When I am at my college, you are asleep. When you are active, I am not in touch with those resources. The problem is the time difference; otherwise we could have had an interactive conversation every hour or so. Tomorrow is Pongal (a holiday), and the day after tomorrow I am travelling to Rameswaram to view the annular solar eclipse there. I will be back only on Saturday to monitor those resources. I will not be active on Mersenne Forum for these two days either, from Thursday evening to Saturday morning Indian Standard Time. I have already uploaded the old files to sendspace, as mentioned above. The dependency files average around 44 MB each. Meanwhile, I have scheduled the entire post-processing for 2,1778L, 7,320+ and 10,339+ on the compute cluster concurrently, with msieve 1.43 compiled under that environment, presuming that the square root crash was specific to msieve version 1.44. Nobody will disturb these jobs. Post-processing will take up to 3 days for 10,339+, 3 days for 7,320+, and 4 days for 2,1778L. After that, I will check the results to see whether or not everything completed properly. The old scheduled post-processing job for 10,351+ has still not completed, so let me see what happens with it.
Last fiddled with by Raman on 2010-01-13 at 20:38 |
|
|
|
|
|
#192 |
|
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
2⁴×593 Posts |
Raman, the .dep and .cyc without the .dat file are useless, and the .dat is too big for practical purposes (and you know that). Therefore, the road to debugging and ultimately finishing is to carefully listen to Jason and try his suggestions at your site, one by one and without excessive emotions.
Everyone has been in your shoes (everyone has had problems, everyone has at least once started from scratch), so we can relate, and yes, it is unpleasant, but just endure and then one day you will remember it with a smile. The beauty of this hobby is that it is not for passengers. An analogy would be: you are not driving a car on a highway or to a grocery store; rather, you took a truck to the desert, off-road, now a couple of tires are low, you are stuck in a narrow passage and the engine won't start, and what do you do? First thing is: don't panic (™). You have to have an off-road adventurer's attitude, not a passenger's. Yes, the tires can (and will) blow out, the carburetor or whatever may need to be taken apart and cleaned up with the rags, and you may be alone in that desert. There's no helicopter coming to take you. You have a voice on the other side of the radio line who possibly knows and is willing to tell you how to fix the truck; just don't waste time complaining and yelling "what the hell is this". Get to business and listen. Ok? It's going to be alright if you do that. |
|
|
|
|
|
#193 | ||
|
Noodles
"Mr. Tuch"
Dec 2007
Chennai, India
3×419 Posts |
Quote:
Quote:
10,339+ Code:
Sat Jan 16 07:07:55 2010 prp76 factor: 5397511769683444928966129716536741688329069042441479613115913313907895924533
Sat Jan 16 07:07:55 2010 prp111 factor: 195536788296069646441612538004667920689480607938818585757708176337130353156036650166508246335832764566608042973

On October 1, 2009, an unknown student sent me the following message, which I saw only last week. Code:
From kashyap@moderated out Thu Oct 01 18:20:43 2009
Return-path: <kashyap@moderated out>
Envelope-to: ramanv@moderated out
Delivery-date: Thu, 01 Oct 2009 18:20:43 +0530
Received: from kashyap bymoderated outwith local (Exim 4.63)
(envelope-from <kashyap@moderated out>)
id 1MtL7L-0003Du-8a
for ramanv@moderated out; Thu, 01 Oct 2009 18:20:43 +0530
To: ramanv@moderated out
Subject: Wasting the resources of the comp
Message-Id: <E1MtL7L-0003Du-8a@moderated out>
From: kashyap@moderated out
Date: Thu, 01 Oct 2009 18:20:43 +0530
Status: RO
Please stop whatever processeng used stupidly.
Core2Duo being fried like an egg! Bah!
Code:
[cs09m038@leo0 ~]$ ls -l 10__339P/10_339P.dat.dep
-rw-r--r-- 1 cs09m038 mtech 40090368 Jan 11 18:13 10__339P/10_339P.dat.dep
[cs09m038@leo0 ~]$ ls -l number3/10_339P.dat.dep
-rw-r--r-- 1 cs09m038 mtech 40090368 Jan 16 00:51 number3/10_339P.dat.dep
[cs09m038@leo0 ~]$ md5sum 10__339P/10_339P.dat.dep
eeb2f3d84845a20ec0c8796b52f6627f 10__339P/10_339P.dat.dep
[cs09m038@leo0 ~]$ md5sum number3/10_339P.dat.dep
fab9aeb429dc4480b39da4a34a64632e number3/10_339P.dat.dep
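The listing above shows two copies with the same size but different checksums. As a minimal sketch of how such a pair could be compared a little further, assuming the standard `cmp` utility is available (`compare_copies` is a hypothetical helper name):

```shell
#!/bin/bash
# compare_copies FILE_A FILE_B
# Hypothetical helper: reports whether two files are byte-for-byte identical
# and, if not, how many byte positions differ (meaningful for equal-sized files).
compare_copies() {
    local a=$1 b=$2 n
    if cmp -s "$a" "$b"; then
        echo "identical"
        return 0
    fi
    # cmp -l prints one line per differing byte position
    n=$(cmp -l "$a" "$b" 2>/dev/null | wc -l | tr -d ' ')
    echo "differ: $n byte position(s) differ"
    return 1
}

# Usage, e.g.:
#   compare_copies 10__339P/10_339P.dat.dep number3/10_339P.dat.dep
```

A difference only shows that the two runs wrote different bytes; on its own it does not say which copy, if either, is corrupt.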
Last fiddled with by wblipp on 2010-01-16 at 10:16 Reason: obscure email addresses |
||
|
|
|
|
|
#194 | |
|
Jun 2005
lehigh.edu
210 Posts |
Quote:
|
|
|
|
|
|
|
#195 |
|
Oct 2004
Austria
100110110010₂ Posts |
|
|
|
|
|
|
#196 |
|
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
22420₈ Posts |
Raman, do take the unknown student's remark at face value, because it is better than the alternative. If you are not monitoring the temperatures on the CPUs that you are loading, you should. If the administration gave you permission to run the programs, they may revoke it in a moment (or worse!) as soon as just a few of the CPUs have actually burnt out. Even though you didn't put together those computers and (if they indeed overheat) the fault is not entirely yours, you may be the scapegoat in the end. Take the complaint seriously and investigate.
Put yourself in the administration's shoes for a moment: nothing burnt out for a while and everything was fine; now, there are complaints from users and a few computers failed, and voila: "you-know-who is running you-know-what. This must be the reason". See? |
|
|
|
|
|
#197 | ||
|
Noodles
"Mr. Tuch"
Dec 2007
Chennai, India
2351₈ Posts |
Quote:
Quote:
It was a very old message, from October 1, and no one has said anything about it since. The computers are even given rest in between. Don't worry about that; I will take care. Hopefully the Core 2 Duo does not produce as much heat as the Core 2 Quad or Core i7 processors. I started making use of the computers in our department only on September 22.
Last fiddled with by Raman on 2010-01-16 at 07:19 |
||
|
|
|
|
|
#198 | ||
|
Sep 2006
Brussels, Belgium
3234₈ Posts |
Quote:
Andi was referring to this Quote:
In your place I would heed the suggestions of all those who warn you: anybody can see what you are doing by issuing the right command (ps -ef, if I remember well from my Unix SVr4 days...)
Jacob |
||
|
|
|