Developers: have you given any thought to stopping the TF application from failing when it is not allowed to run non-stop?
Developers: have you given any thought to stopping the TF application from failing when it is not allowed to run non-stop?
The application always fails if the running work unit is stopped and resumed.
It is annoying, and I would like it fixed if possible. |
|
|
rebirther (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 2 Jan 13 Posts: 7833 Credit: 44,534,674 RAC: 68
|
I have never seen this issue. Do you have an example? Which hardware, OS, BOINC version? |
|
|
|
This host, for example, was stopped for a reboot to update OS software while TF tasks were in progress. It is just the latest example and the latest host to have had BOINC interrupted. I normally avoid stopping BOINC for any reason so that tasks finish uninterrupted on all my hosts. Your app is not the only one with this flaw. Gaia@home is another project whose tasks can't be stopped; on restart they show an instant computation error.
This host also runs GPUGrid, and some of their apps have the same issue: it is a given that you will throw away a running computation if it is stopped midstream rather than allowed to complete without interruption.
https://srbase.my-firewall.org/sr5/results.php?hostid=236361&offset=0&show_names=0&state=6&appid=
Stderr output
<core_client_version>8.3.0</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2025-07-13 08:25:34 (82403): wrapper (7.24.26018): starting
2025-07-13 08:25:34 (82403): wrapper (7.24.26018): starting
2025-07-13 08:25:34 (82403): wrapper: running ./mfaktc.exe (-d 1)
2025-07-13 08:25:34 (82403): wrapper: created child process 82405
2025-07-13 08:33:13 (82403): ./mfaktc.exe exited; CPU time 1.990786
2025-07-13 08:33:13 (82403): app exit status: 0x1
2025-07-13 08:33:13 (82403): called boinc_finish(195)
</stderr_txt>
]]>
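The three numbers in the `process exited with code 195 (0xc3, -61)` line are the same value written three ways: decimal, hexadecimal, and the low byte read as a signed 8-bit integer. A minimal Python sketch of that arithmetic follows; the helper name `describe_exit_code` is made up for illustration. To my knowledge 195 corresponds to `EXIT_CHILD_FAILED` in BOINC's error list, which would fit the log: the wrapper reports that `./mfaktc.exe` exited with status 0x1 and then calls `boinc_finish(195)`.

```python
def describe_exit_code(code: int) -> str:
    """Format an exit code as decimal, hex, and signed 8-bit (hypothetical helper)."""
    low_byte = code & 0xFF
    # Values of 128 and above wrap to negative when read as a signed byte.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return f"{code} (0x{low_byte:x}, {signed})"


print(describe_exit_code(195))  # 195 (0xc3, -61)
```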
OS is Ubuntu 24.04.2 LTS
BOINC is version 8.3.0
Epyc 7713 CPU running on an ASRock Rack ROMED8-2T motherboard with 128GB of DDR4-3200 ECC RDIMMs.
I suspect it is due to a common fault with some GPU apps, which will fail upon restart when the restart causes the task to run on a different GPU than the one the task was originally started on. GPUGrid apps are the most common example. |
|
|
rebirther (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 2 Jan 13 Posts: 7833 Credit: 44,534,674 RAC: 68
|
BOINC cannot leave GPU tasks in memory while suspended. The app itself creates a checkpoint file every minute, so a task can resume as long as one was written. It is always better to stop BOINC before you update the OS (in this case Ubuntu), and Ubuntu is always asking to update. This will avoid any issues.
"Suspect it is due to a common fault with some gpu apps which will fail upon restart when the restart causes the task to run on a different gpu than the task was originally started on. GPUGrid apps being the most common example."
That makes no difference: you can restart the same task on a different GPU. |
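The checkpoint-and-resume behaviour described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not mfaktc's actual implementation: the file layout, the `next_item` field, and the function names are all assumptions, and it checkpoints after every item rather than on a one-minute timer.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved position, or 0 when no checkpoint was written yet."""
    try:
        with open(path) as f:
            return json.load(f)["next_item"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_item):
    """Write to a temp file, then rename, so a stop never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_item": next_item}, f)
    os.replace(tmp, path)


def run(path, total_items, stop_after=None):
    """Work from the checkpoint onward; stop_after simulates a suspend."""
    done = 0
    for item in range(load_checkpoint(path), total_items):
        if stop_after is not None and done >= stop_after:
            return item  # interrupted before finishing
        done += 1  # the real work for this item would happen here
        save_checkpoint(path, item + 1)
    return total_items


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print(run(ckpt, 10, stop_after=4))  # 4: stopped part-way
print(run(ckpt, 10))                # 10: resumed from item 4 and finished
```

Nothing in the saved state depends on which device did the earlier work, which is the sense in which a restart on a different GPU should make no difference in a design like this.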
|
|
|
I've always practiced sensible BOINC usage. I never install an update while BOINC is running.
I still suspect that the app expects to run only on gpu0, since I only ever see gpu0 mentioned in the stderr.txt output.
This is a common problem with many BOINC apps. The developers only develop on a host with a single GPU, never consider that a host may have more than one, and so overlook the GPU enumeration issue.
It is a known issue at GPUGrid, for example, with the acemd3 application. |
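One way to check which device a task was actually launched on is to pull the index out of the wrapper's `running` line in the stderr output. The sketch below assumes that the `-d N` argument shown in those lines is the device selector; the function name is hypothetical, and the sample line is copied from the stderr posted earlier in this thread.

```python
import re

# Matches lines such as: wrapper: running ./mfaktc.exe (-d 1)
DEVICE_RE = re.compile(r"wrapper: running \S+ \(-d (\d+)\)")


def device_indices(stderr_text):
    """Return the device index of every launch recorded in the text."""
    return [int(m.group(1)) for m in DEVICE_RE.finditer(stderr_text)]


sample = "2025-07-13 08:25:34 (82403): wrapper: running ./mfaktc.exe (-d 1)\n"
print(device_indices(sample))  # [1]
```

A task that was launched twice with different indices would show up as a list such as `[0, 1]`, which is the case the enumeration theory predicts should fail.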
|
|
DeleteNull (Volunteer developer, Volunteer tester)
Joined: 29 Nov 14 Posts: 89 Credit: 570,616,737 RAC: 39,052
|
Here are two examples, one from each of your GPUs 0 and 1:
<core_client_version>8.3.0</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2025-07-15 15:42:59 (20049): wrapper (7.24.26018): starting
2025-07-15 15:42:59 (20049): wrapper (7.24.26018): starting
2025-07-15 15:42:59 (20049): wrapper: running ./mfaktc.exe (-d 0)
2025-07-15 15:42:59 (20049): wrapper: created child process 20051
2025-07-15 15:55:05 (20049): ./mfaktc.exe exited; CPU time 2.375904
2025-07-15 15:55:05 (20049): app exit status: 0x1
2025-07-15 15:55:05 (20049): called boinc_finish(195)
</stderr_txt>
]]>
<core_client_version>8.3.0</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2025-07-15 13:22:16 (16724): wrapper (7.24.26018): starting
2025-07-15 13:22:16 (16724): wrapper (7.24.26018): starting
2025-07-15 13:22:16 (16724): wrapper: running ./mfaktc.exe (-d 1)
2025-07-15 13:22:16 (16724): wrapper: created child process 16726
2025-07-15 13:27:14 (16724): ./mfaktc.exe exited; CPU time 1.665762
2025-07-15 13:27:14 (16724): app exit status: 0x1
2025-07-15 13:27:14 (16724): called boinc_finish(195)
</stderr_txt>
]]>
As far as I can see, the GPU works on the task for more than 100 seconds and then errors out.
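The run times can be read straight off the timestamps of the `starting` and `exited` lines in the two logs above. A small Python sketch of that subtraction (the helper name `run_seconds` is made up for illustration):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def run_seconds(start, end):
    """Seconds elapsed between two wrapper log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# GPU 0 task: started 15:42:59, exited 15:55:05
print(run_seconds("2025-07-15 15:42:59", "2025-07-15 15:55:05"))  # 726
# GPU 1 task: started 13:22:16, exited 13:27:14
print(run_seconds("2025-07-15 13:22:16", "2025-07-15 13:27:14"))  # 298
```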
When I start a task (NVIDIA RTX 4080), stop it, and then resume it, the output looks like this:
<core_client_version>8.2.4</core_client_version>
<![CDATA[
<stderr_txt>
2025-07-15 10:00:27 (91374): wrapper (7.24.26018): starting
2025-07-15 10:00:27 (91374): wrapper (7.24.26018): starting
2025-07-15 10:00:27 (91374): wrapper: running ./mfaktc.exe (-d 0)
2025-07-15 10:00:27 (91374): wrapper: created child process 91376
2025-07-15 10:03:51 (91486): wrapper (7.24.26018): starting
2025-07-15 10:03:51 (91486): wrapper (7.24.26018): starting
2025-07-15 10:03:51 (91486): wrapper: running ./mfaktc.exe (-d 0)
2025-07-15 10:03:51 (91486): wrapper: created child process 91488
2025-07-15 10:07:12 (91486): ./mfaktc.exe exited; CPU time 0.805704
2025-07-15 10:07:12 (91486): called boinc_finish(0)
</stderr_txt>
]]>
So the TF app behaves as we expect when resuming. |
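A resumed task can be recognised in stderr by more than one `created child process` line, as in the log above. The Python sketch below counts launches and reads the final `boinc_finish` code; the function and field names are illustrative only, and the sample is an excerpt of that log.

```python
import re

CHILD_RE = re.compile(r"wrapper: created child process (\d+)")
FINISH_RE = re.compile(r"called boinc_finish\((\d+)\)")


def summarize(stderr_text):
    """Count child launches and extract the final boinc_finish code, if any."""
    children = CHILD_RE.findall(stderr_text)
    finish = FINISH_RE.search(stderr_text)
    return {
        "launches": len(children),
        "resumed": len(children) > 1,
        "finish_code": int(finish.group(1)) if finish else None,
    }


resumed_log = """\
2025-07-15 10:00:27 (91374): wrapper: created child process 91376
2025-07-15 10:03:51 (91486): wrapper: created child process 91488
2025-07-15 10:07:12 (91486): called boinc_finish(0)
"""
print(summarize(resumed_log))
# {'launches': 2, 'resumed': True, 'finish_code': 0}
```

Run against the failing logs quoted in this thread, the same function reports a single launch and a finish code of 195.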
|
|
|
In a couple of days I will have another chance to see whether a suspended TF task can be resumed successfully.
I have suspended several TF tasks, along with all other GPU project tasks that are not Einstein, so that I can compete in the latest BOINC Games Sprint contest. |
|
|