Bit of disaster, need advise

Message boards : Number crunching : Bit of disaster, need advise

Profile marmot
Joined: 17 Nov 16
Posts: 97
Credit: 126,410,450
RAC: 21,911
Message 5121 - Posted: 20 Apr 2019, 10:20:19 UTC

So, 24 machines, all running 2x long2 with 4 to 6 days of computation time into each WU. (BOINC decided to swap each WU at 70-90% completion for a new WU; I might have to change "switch between tasks" to 9999 minutes.)

Some drunk driver ran into a main electric line pole and dropped it on a house, setting the house on fire. It plunged 4000 homes into darkness. I'm 2 miles away and just got 2 spikes, which took down all the machines (no, I can't afford battery backups).

The long2's all report 0% completion. Due dates are 4/21.

So, how sure are you that these WUs all restarted from the save points of 80-95% completion?
If they start completing over the next day and swapping back to the other unfinished ones, then everything's fine. But if they keep going for 3 more days,
are they starting completely from scratch?

Please reassure me, as it seems like 200+ CPU days of work is kaputt (one of the few German words grandma taught me that wasn't filthy).

[SG]Felix
Joined: 25 Dec 17
Posts: 63
Credit: 21,692,929
RAC: 9
Message 5123 - Posted: 20 Apr 2019, 11:03:09 UTC

Let the first one run, then pause it manually for a short time so that the second can run.
Then right-click one of the tasks and press (I think in English it is) "Properties", and look at which slot the task is in. Open the folder of this slot and look at what progress your stderr.txt file shows at the bottom. This is the progress from which the task (should) restart.

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7228
Credit: 42,729,227
RAC: 34
Message 5125 - Posted: 20 Apr 2019, 14:05:50 UTC

Don't look at the runtime. It is always 0 after a restart. If you run it on Windows, you can check your stderr.txt. Checkpoints are written every 10 minutes. You can see it in the file "resume..."
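
For anyone with many machines to check, the manual steps described in this thread (open each slot folder, read the bottom of stderr.txt) could be scripted. The following is only a rough sketch under assumptions: the data directory path is a typical Windows default and may differ on your install, the function names `last_lines` and `report_slots` are made up for illustration, and the marker filename `llr.exe` is simply the one mentioned in this thread. It does not parse the progress value, since the exact format of the lines in stderr.txt is not shown here; it only prints the last few non-empty lines so you can read them yourself.

```python
from pathlib import Path


def last_lines(path, count=3):
    """Return the last `count` non-empty lines of a text file."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-count:]


def report_slots(slots_dir, marker="llr.exe", count=3):
    """Map each slot folder that holds `marker` and a stderr.txt to that file's tail."""
    report = {}
    for slot in sorted(Path(slots_dir).iterdir()):
        stderr_file = slot / "stderr.txt"
        if slot.is_dir() and (slot / marker).exists() and stderr_file.is_file():
            report[slot.name] = last_lines(stderr_file, count)
    return report


if __name__ == "__main__":
    # Assumed default Windows data directory; adjust for your install.
    slots = Path(r"C:\ProgramData\BOINC\slots")
    if slots.is_dir():
        for name, tail in report_slots(slots).items():
            print(f"slot {name}:")
            for line in tail:
                print("   ", line)
    else:
        print(f"No slots directory at {slots}")
```

Run it once per machine; slot folders without the marker file are skipped, so tasks from other applications do not clutter the output.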

Profile marmot
Joined: 17 Nov 16
Posts: 97
Credit: 126,410,450
RAC: 21,911
Message 5130 - Posted: 21 Apr 2019, 0:38:11 UTC

Thank you for the responses.

Wasn't hard to find the right slot director(ies); they're the ones with llr.exe in them.

The one machine ended at 80.44% and is now at 90.28%, so that WU is fine. I'll check them all.

(You all told me how to do this 3 months ago and I forgot. Every time I have a major depressive episode, there seems to be memory loss.)

These long2 are set to -t3 based on an earlier data set I collected, and that isn't enough threads for the downclocked fall/spring CPU timings.

Changing to -t6 like the long1's.




Copyright © 2014-2024 BOINC Confederation / rebirther