Checkpointing?

Message boards : Number crunching : Checkpointing?

Maxwell [MM]
Joined: 4 Dec 14
Posts: 1
Credit: 1,018,233
RAC: 0
Message 650 - Posted: 15 Jan 2015, 17:58:05 UTC

Is checkpointing at reasonable intervals going to be implemented soon? I have to occasionally restart machines, and I can't have WUs running for hours on end with no checkpointing. I just had to abort a batch of the Long WUs because they were running too long with no end in sight (24+ hours on a couple of them without a checkpoint).

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 3676
Credit: 22,879,153
RAC: 0
Message 651 - Posted: 15 Jan 2015, 18:01:45 UTC - in response to Message 650.
Last modified: 15 Jan 2015, 18:03:08 UTC

Checkpointing happens every 10 minutes, except on Mac, where there is no checkpointing yet. Please note that the runtime is not saved there, but fixed credits will compensate for the missing time.
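
As a rough illustration of what checkpointing at a fixed interval means in general (this is not the SRBase wrapper's actual code), here is a minimal Python sketch. The JSON file format, the function names `load_checkpoint`, `save_checkpoint` and `run`, and the per-iteration `work` callback are all assumptions made up for the example; only the 10-minute interval comes from the post above.

```python
import json
import os
import tempfile
import time

CHECKPOINT_INTERVAL = 600  # seconds; the 10-minute figure quoted above


def load_checkpoint(path):
    """Return the saved iteration index, or 0 if no usable checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)["next_iteration"]
    except (FileNotFoundError, ValueError, KeyError):
        return 0


def save_checkpoint(path, next_iteration):
    """Write the checkpoint to a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_iteration": next_iteration}, f)
    os.replace(tmp, path)  # atomic swap, so a crash never leaves half a file


def run(path, total_iterations, work,
        interval=CHECKPOINT_INTERVAL, clock=time.monotonic):
    """Run work(i) for the remaining iterations, saving every `interval` seconds.

    Returns the iteration the run started (or resumed) from.
    """
    start = load_checkpoint(path)
    last_save = clock()
    for i in range(start, total_iterations):
        work(i)  # hypothetical unit of work
        if clock() - last_save >= interval:
            save_checkpoint(path, i + 1)
            last_save = clock()
    save_checkpoint(path, total_iterations)
    return start
```

With this pattern an interruption costs roughly the work done since the last save, instead of everything done since the task started.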

STE\/E
Joined: 1 Dec 14
Posts: 8
Credit: 27,739,762
RAC: 32
Message 655 - Posted: 16 Jan 2015, 12:13:12 UTC

I don't see any checkpointing on my Windows machine. I forgot and shut BOINC off on one box, and when I restarted it the WUs were reset to 0 elapsed time. It's a Windows 8.1 box.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 3676
Credit: 22,879,153
RAC: 0
Message 656 - Posted: 16 Jan 2015, 12:40:03 UTC - in response to Message 655.

That's what I mean by the time reset, but it only goes back to the last checkpoint. You can check stderr.txt in the slot folder; there is a restart at the last iteration point.
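
To check several slots at once, something like the following Python sketch could be used. It assumes the usual BOINC layout of `slots/<number>/stderr.txt` under the data directory, and the keywords it searches for are a guess, since the exact wording of the restart message is not given in this thread. The function name `find_restart_lines` is made up for the example.

```python
import re
from pathlib import Path

# Assumption: restart messages contain one of these words.
# Adjust the pattern to whatever the application actually prints.
PATTERN = re.compile(r"checkpoint|restart|resum", re.IGNORECASE)


def find_restart_lines(data_dir):
    """Map each slots/<n>/stderr.txt under data_dir to its matching lines."""
    hits = {}
    for log in sorted(Path(data_dir).glob("slots/*/stderr.txt")):
        lines = log.read_text(errors="replace").splitlines()
        matched = [line for line in lines if PATTERN.search(line)]
        if matched:
            hits[str(log)] = matched
    return hits
```

On a default Windows install the data directory is typically `C:\ProgramData\BOINC`, so the call would be `find_restart_lines(r"C:\ProgramData\BOINC")`. An empty result for a task that has been restarted would suggest it did not resume from a checkpoint.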

STE\/E
Joined: 1 Dec 14
Posts: 8
Credit: 27,739,762
RAC: 32
Message 657 - Posted: 16 Jan 2015, 14:28:17 UTC

I think I see what you mean. One of the WUs only ran for about 35 minutes after restarting BOINC, then finished and got the full 1000 points. The others are still running though, so we'll see how they do.

Tarmo Ilves
Joined: 20 Jan 15
Posts: 5
Credit: 1,637,306
RAC: 0
Message 727 - Posted: 21 Jan 2015, 7:09:45 UTC - in response to Message 651.

Are there any plans for checkpointing on Mac?
Long tasks run for ~7 hours on my MacBook Pro.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 3676
Credit: 22,879,153
RAC: 0
Message 728 - Posted: 21 Jan 2015, 7:45:32 UTC - in response to Message 727.

I am still looking for someone who is able to compile the wrappers for 32/64-bit Mac.

Tarmo Ilves
Joined: 20 Jan 15
Posts: 5
Credit: 1,637,306
RAC: 0
Message 729 - Posted: 21 Jan 2015, 8:12:29 UTC - in response to Message 728.

I have some compilers installed. If it is not too complicated I can try.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 3676
Credit: 22,879,153
RAC: 0
Message 730 - Posted: 21 Jan 2015, 8:53:14 UTC - in response to Message 729.

You only need to download the BOINC source and run make. I will give you the wrapper source and some instructions later today.

Rasputin42
Joined: 4 Dec 14
Posts: 4
Credit: 216,806
RAC: 0
Message 757 - Posted: 26 Jan 2015, 10:57:35 UTC

Hi,
SR switches between its own tasks and loses a lot of time because there are no checkpoints.
I find it very bad to waste the computing time of all the volunteers only because SR cannot be asked to implement these.
This should be the highest priority.
Any comments?

Tarmo Ilves
Joined: 20 Jan 15
Posts: 5
Credit: 1,637,306
RAC: 0
Message 758 - Posted: 26 Jan 2015, 11:16:45 UTC - in response to Message 757.
Last modified: 26 Jan 2015, 11:17:48 UTC

The only way to prevent losing computing time right now is to not mix long and short tasks, I guess.

Extra Ball
Joined: 29 Nov 14
Posts: 6
Credit: 15,187,970
RAC: 0
Message 759 - Posted: 26 Jan 2015, 11:50:10 UTC - in response to Message 757.
Last modified: 26 Jan 2015, 11:51:09 UTC

Hi Rasputin42,

As far as I know:
- Checkpoints do exist at SRBase (every 10 minutes, if I remember correctly), even though the elapsed time is reset to 0 each time a task is restarted from suspended mode. We've been told that there's no way to fix this at the moment, and that it is not due to this specific project.
- Maybe you could check your BOINC Manager options (Tools/Computation Preferences) and increase the value of the 'Switch application every XXX minutes' field to keep BOINC from jumping between tasks too often.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 3676
Credit: 22,879,153
RAC: 0
Message 760 - Posted: 26 Jan 2015, 13:27:04 UTC

@Rasputin42:

You will find everything in the FAQ. The Mac wrapper issue is still in progress, so that checkpointing can work there too. If you want to know whether a task resumed from a checkpoint after a restart, check the slot folder and look at stderr.txt.

Rasputin42
Joined: 4 Dec 14
Posts: 4
Credit: 216,806
RAC: 0
Message 764 - Posted: 26 Jan 2015, 16:42:48 UTC

The task has been running for about 1.5 h (Sierpinski/Riesel Base - long0.01).
There is no checkpoint, neither in the task properties nor in the stderr file.
I set "switch between tasks every xxx" to 999 several weeks ago, so effectively never.
When BOINC downloads a new SR task, it sometimes decides to switch to that task, leaving the current task with no checkpoint. All time up to that point is lost.

I like to keep a day's worth of work in the buffer, as SR does not always have tasks available.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 3676
Credit: 22,879,153
RAC: 0
Message 765 - Posted: 26 Jan 2015, 16:46:00 UTC - in response to Message 764.

Hmm, I have no problems with the long-runners. I restarted BOINC several times and they resumed from the last checkpoint. If there is nothing in stderr.txt, then something must be wrong on this computer.

Lemon
Joined: 1 Dec 14
Posts: 5
Credit: 7,563,028
RAC: 0
Message 766 - Posted: 26 Jan 2015, 16:46:51 UTC - in response to Message 764.

To avoid losing work due to task switches, you can set the flag "Leave applications in memory while suspended".

These apps use very little memory and will continue where they left off, even with no checkpoint.
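
For anyone who prefers a file over the Manager dialog, the two settings mentioned in this thread (task switch interval and leaving applications in memory) can, as far as I know, also be placed in a `global_prefs_override.xml` file in the BOINC data directory. The Python sketch below just builds that XML. The element names `cpu_scheduling_period_minutes` and `leave_apps_in_memory` are my understanding of the override format and should be checked against the client documentation, and `build_override` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def build_override(switch_minutes=999, leave_in_memory=True):
    """Return XML text for a preferences override with the two settings."""
    root = ET.Element("global_preferences")
    # Assumed element names; verify against the BOINC client documentation.
    ET.SubElement(root, "cpu_scheduling_period_minutes").text = str(switch_minutes)
    ET.SubElement(root, "leave_apps_in_memory").text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_override())
```

The client has to re-read its preferences (or be restarted) before a changed override file takes effect.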

frankhagen
Joined: 7 Jun 14
Posts: 36
Credit: 166,888
RAC: 0
Message 768 - Posted: 26 Jan 2015, 17:58:03 UTC - in response to Message 766.

I really doubt that this will work here. You either prevent BOINC from switching tasks or stick to the short queues.


Copyright © 2014-2018 BOINC Confederation / rebirther