Developers- given any thought to fixing the TF application from failing if not allowed to run non-stop?
Message boards : Number crunching : Developers- given any thought to fixing the TF application from failing if not allowed to run non-stop?

Keith Myers
Joined: 15 Jul 24
Posts: 4
Credit: 263,841,760
RAC: 3,877,781
Message 10860 - Posted: 11 Jul 2025, 23:32:53 UTC

Developers- given any thought to fixing the TF application from failing if not allowed to run non-stop?

The application always fails if the running work unit is stopped and resumed.

Annoying, would like it fixed if possible.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7833
Credit: 44,534,674
RAC: 68
Message 10861 - Posted: 12 Jul 2025, 5:03:26 UTC

I have never seen this issue. Do you have an example? Which hardware, OS, BOINC version?

Keith Myers
Joined: 15 Jul 24
Posts: 4
Credit: 263,841,760
RAC: 3,877,781
Message 10865 - Posted: 15 Jul 2025, 1:08:53 UTC
Last modified: 15 Jul 2025, 1:17:33 UTC

For example, this host was stopped for a reboot to update OS software while TF tasks were in progress. It is just the latest example, and the latest host to have had BOINC interrupted. I normally avoid stopping BOINC for any reason so that tasks finish uninterrupted on all my hosts. Your app is not the only one with this flaw. Gaia@home is another project whose tasks can't be stopped; upon restart they show an instant computation error.

This host also runs GPUGrid, and some of their apps have the same issue: it is a given that you will throw away a running computation if it is stopped midstream rather than allowed to complete without interruption.

https://srbase.my-firewall.org/sr5/results.php?hostid=236361&offset=0&show_names=0&state=6&appid=

Stderr output
<core_client_version>8.3.0</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2025-07-13 08:25:34 (82403): wrapper (7.24.26018): starting
2025-07-13 08:25:34 (82403): wrapper (7.24.26018): starting
2025-07-13 08:25:34 (82403): wrapper: running ./mfaktc.exe (-d 1)
2025-07-13 08:25:34 (82403): wrapper: created child process 82405
2025-07-13 08:33:13 (82403): ./mfaktc.exe exited; CPU time 1.990786
2025-07-13 08:33:13 (82403): app exit status: 0x1
2025-07-13 08:33:13 (82403): called boinc_finish(195)

</stderr_txt>
]]>


OS is Ubuntu 24.04.2 LTS
BOINC is version 8.3.0

Epyc 7713 CPU running on an ASRock Rack ROMED8-2T motherboard with 128GB of DDR4-3200 ECC RDIMMs.

I suspect it is due to a common fault with some GPU apps, which fail upon restart when the restart causes the task to run on a different GPU than the one it was originally started on. GPUGrid apps are the most common example.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7833
Credit: 44,534,674
RAC: 68
Message 10866 - Posted: 15 Jul 2025, 11:36:23 UTC
Last modified: 15 Jul 2025, 11:41:02 UTC

BOINC cannot leave GPU tasks in memory while suspended. The app itself creates a checkpoint file every minute, and it can resume as long as one was written. Before you update an OS (in this case Ubuntu), it is always better to stop BOINC first. Ubuntu is always asking to update. This will avoid any issues.


"Suspect it is due to a common fault with some gpu apps which will fail upon restart when the restart causes the task to run on a different gpu than the one the task was originally started on. GPUGrid apps being the most common example."


That makes no difference; you can restart the same task on a different GPU.

Keith Myers
Joined: 15 Jul 24
Posts: 4
Credit: 263,841,760
RAC: 3,877,781
Message 10867 - Posted: 16 Jul 2025, 15:46:49 UTC - in response to Message 10866.

I've always practiced sensible BOINC usage. I never apply an update while BOINC is running.

I still suspect that the app expects to run only on gpu0, since I only ever see gpu0 mentioned in the stderr.txt output.

This is a common problem with many BOINC apps. The developers only develop on a host with a single GPU, never consider that a host may have more than one, and so never deal with the GPU enumeration issue.

It is a known issue at GPUGrid, for example, with the acemd3 application.

DeleteNull
Volunteer developer
Volunteer tester
Joined: 29 Nov 14
Posts: 89
Credit: 570,616,737
RAC: 39,052
Message 10868 - Posted: 16 Jul 2025, 17:12:28 UTC - in response to Message 10867.

These are two examples from your GPUs 0 and 1:
<core_client_version>8.3.0</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2025-07-15 15:42:59 (20049): wrapper (7.24.26018): starting
2025-07-15 15:42:59 (20049): wrapper (7.24.26018): starting
2025-07-15 15:42:59 (20049): wrapper: running ./mfaktc.exe (-d 0)
2025-07-15 15:42:59 (20049): wrapper: created child process 20051
2025-07-15 15:55:05 (20049): ./mfaktc.exe exited; CPU time 2.375904
2025-07-15 15:55:05 (20049): app exit status: 0x1
2025-07-15 15:55:05 (20049): called boinc_finish(195)

</stderr_txt>
]]>
<core_client_version>8.3.0</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2025-07-15 13:22:16 (16724): wrapper (7.24.26018): starting
2025-07-15 13:22:16 (16724): wrapper (7.24.26018): starting
2025-07-15 13:22:16 (16724): wrapper: running ./mfaktc.exe (-d 1)
2025-07-15 13:22:16 (16724): wrapper: created child process 16726
2025-07-15 13:27:14 (16724): ./mfaktc.exe exited; CPU time 1.665762
2025-07-15 13:27:14 (16724): app exit status: 0x1
2025-07-15 13:27:14 (16724): called boinc_finish(195)

</stderr_txt>
]]>

As far as I can see, the GPU works on the task for more than 100 seconds and then errors out.
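The run time can be read straight off the wrapper timestamps. The following is only a minimal Python sketch, assuming the stderr line format shown in the two outputs above; `run_seconds`, `GPU0_LOG`, and `GPU1_LOG` are illustrative names and not part of BOINC or mfaktc.

```python
import re
from datetime import datetime

# Timestamp, wrapper PID in parentheses, then the message text.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")


def run_seconds(stderr_text):
    """Seconds between the last 'created child process' line and the exit line.

    Returns None if either line is missing.
    """
    started = exited = None
    for line in stderr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if "created child process" in message:
            started = stamp  # a later start overwrites an earlier one
        elif "exited; CPU time" in message:
            exited = stamp
    if started is None or exited is None:
        return None
    return int((exited - started).total_seconds())


# Lines copied from the two stderr outputs quoted above.
GPU0_LOG = """\
2025-07-15 15:42:59 (20049): wrapper: running ./mfaktc.exe (-d 0)
2025-07-15 15:42:59 (20049): wrapper: created child process 20051
2025-07-15 15:55:05 (20049): ./mfaktc.exe exited; CPU time 2.375904
"""

GPU1_LOG = """\
2025-07-15 13:22:16 (16724): wrapper: running ./mfaktc.exe (-d 1)
2025-07-15 13:22:16 (16724): wrapper: created child process 16726
2025-07-15 13:27:14 (16724): ./mfaktc.exe exited; CPU time 1.665762
"""

print(run_seconds(GPU0_LOG))  # 726
print(run_seconds(GPU1_LOG))  # 298
```

Both tasks ran for several minutes of wall time before the child process exited with status 0x1.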

When I start work (NVIDIA RTX 4080), then stop and resume it, it looks like this:
<core_client_version>8.2.4</core_client_version>
<![CDATA[
<stderr_txt>
2025-07-15 10:00:27 (91374): wrapper (7.24.26018): starting
2025-07-15 10:00:27 (91374): wrapper (7.24.26018): starting
2025-07-15 10:00:27 (91374): wrapper: running ./mfaktc.exe (-d 0)
2025-07-15 10:00:27 (91374): wrapper: created child process 91376
2025-07-15 10:03:51 (91486): wrapper (7.24.26018): starting
2025-07-15 10:03:51 (91486): wrapper (7.24.26018): starting
2025-07-15 10:03:51 (91486): wrapper: running ./mfaktc.exe (-d 0)
2025-07-15 10:03:51 (91486): wrapper: created child process 91488
2025-07-15 10:07:12 (91486): ./mfaktc.exe exited; CPU time 0.805704
2025-07-15 10:07:12 (91486): called boinc_finish(0)

</stderr_txt>
]]>

So the app (TF) behaves as we expect when resuming.
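To check a result page for a resume, and for a change of GPU between starts, the wrapper lines can be summarized mechanically. This is a hedged Python sketch rather than a project tool: it assumes the number after `-d` in the wrapper line is the GPU device index, as the examples above suggest, and `summarize`, `RESUMED_LOG`, and `FAILED_LOG` are hypothetical names.

```python
import re

# "(PID): wrapper: running <app> (-d N)" marks one start of the app.
START_RE = re.compile(r"\((\d+)\): wrapper: running \S+ \(-d (\d+)\)")
FINISH_RE = re.compile(r"called boinc_finish\((\d+)\)")


def summarize(stderr_text):
    """Count app starts, list the device used by each, and report the finish code."""
    sessions = []  # one (wrapper_pid, device) pair per start
    finish_code = None
    for line in stderr_text.splitlines():
        start = START_RE.search(line)
        if start:
            sessions.append((int(start.group(1)), int(start.group(2))))
            continue
        finish = FINISH_RE.search(line)
        if finish:
            finish_code = int(finish.group(1))
    devices = [device for _, device in sessions]
    return {
        "starts": len(sessions),
        "devices": devices,
        "device_changed": len(set(devices)) > 1,
        "finish_code": finish_code,
    }


# Lines copied from the stop-and-resume output above.
RESUMED_LOG = """\
2025-07-15 10:00:27 (91374): wrapper: running ./mfaktc.exe (-d 0)
2025-07-15 10:00:27 (91374): wrapper: created child process 91376
2025-07-15 10:03:51 (91486): wrapper: running ./mfaktc.exe (-d 0)
2025-07-15 10:03:51 (91486): wrapper: created child process 91488
2025-07-15 10:07:12 (91486): ./mfaktc.exe exited; CPU time 0.805704
2025-07-15 10:07:12 (91486): called boinc_finish(0)
"""

# Lines copied from the failing output posted on 15 Jul 2025.
FAILED_LOG = """\
2025-07-13 08:25:34 (82403): wrapper: running ./mfaktc.exe (-d 1)
2025-07-13 08:25:34 (82403): wrapper: created child process 82405
2025-07-13 08:33:13 (82403): ./mfaktc.exe exited; CPU time 1.990786
2025-07-13 08:33:13 (82403): app exit status: 0x1
2025-07-13 08:33:13 (82403): called boinc_finish(195)
"""

print(summarize(RESUMED_LOG))
print(summarize(FAILED_LOG))
```

For the resumed task this reports two starts, both on device 0, and finish code 0. For the failing output quoted earlier in the thread it reports a single start on device 1 and finish code 195.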

Keith Myers
Joined: 15 Jul 24
Posts: 4
Credit: 263,841,760
RAC: 3,877,781
Message 10869 - Posted: 16 Jul 2025, 19:29:45 UTC

In a couple of days I will have another chance to see whether a suspended TF task can be resumed successfully.

I have suspended several TF tasks, along with all other GPU project tasks that are not Einstein, so I can compete in the latest BOINC Games Sprint contest.



Copyright © 2014-2025 BOINC Confederation / rebirther