Developers- given any thought to fixing the TF application from failing if not allowed to run non-stop?

Message boards : Number crunching : Developers- given any thought to fixing the TF application from failing if not allowed to run non-stop?

Author	Message
Keith Myers Joined: 15 Jul 24 Posts: 14 Credit: 1,569,243,760 RAC: 122,114	Message 10860 - Posted: 11 Jul 2025, 23:32:53 UTC
	Developers, have you given any thought to fixing the TF application so that it does not fail when it is not allowed to run non-stop? The application always fails if the running work unit is stopped and resumed. It is annoying, and I would like it fixed if possible.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 2 Jan 13 Posts: 8234 Credit: 113,724,563 RAC: 0	Message 10861 - Posted: 12 Jul 2025, 5:03:26 UTC
	I have never seen this issue. Do you have an example? Which hardware, OS, BOINC version?

Keith Myers Joined: 15 Jul 24 Posts: 14 Credit: 1,569,243,760 RAC: 122,114	Message 10865 - Posted: 15 Jul 2025, 1:08:53 UTC Last modified: 15 Jul 2025, 1:17:33 UTC
	For example, this host was stopped for a reboot to update OS software while TF tasks were in progress. It is just the latest example, and the latest host to have had BOINC interrupted. I normally avoid stopping BOINC for any reason so that tasks finish uninterrupted on all my hosts.
	Your app is not the only one with this flaw. Gaia@home is another project whose tasks cannot be stopped; upon restart they show an instant computation error. This host also runs GPUGrid, and some of their apps have the same issue: it is a given that you will throw away a running computation if it is stopped midstream rather than allowed to complete without interruption.
	https://srbase.my-firewall.org/sr5/results.php?hostid=236361&offset=0&show_names=0&state=6&appid=
	Stderr output
	<core_client_version>8.3.0</core_client_version>
	<![CDATA[
	<message>
	process exited with code 195 (0xc3, -61)</message>
	<stderr_txt>
	2025-07-13 08:25:34 (82403): wrapper (7.24.26018): starting
	2025-07-13 08:25:34 (82403): wrapper (7.24.26018): starting
	2025-07-13 08:25:34 (82403): wrapper: running ./mfaktc.exe (-d 1)
	2025-07-13 08:25:34 (82403): wrapper: created child process 82405
	2025-07-13 08:33:13 (82403): ./mfaktc.exe exited; CPU time 1.990786
	2025-07-13 08:33:13 (82403): app exit status: 0x1
	2025-07-13 08:33:13 (82403): called boinc_finish(195)
	</stderr_txt>
	]]>
	OS: Ubuntu 24.04.2 LTS
	BOINC: version 8.3.0
	Hardware: Epyc 7713 CPU on an ASRock Rack ROMED8-2T motherboard with 128 GB of DDR4-3200 ECC RDIMMs
	I suspect it is due to a common fault with some GPU apps, which fail upon restart when the restart causes the task to run on a different GPU than the one it was originally started on. GPUGrid apps are the most common example.
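	The three numbers in the "process exited with code 195 (0xc3, -61)" line are the same byte written as unsigned decimal, hex, and signed 8-bit, and the wrapper timestamps show how long the child ran before it failed. A minimal Python sketch, assuming the stderr format shown above; the function names are hypothetical and not part of BOINC or mfaktc:

```python
import re
from datetime import datetime

# Leading "YYYY-MM-DD HH:MM:SS (pid): " of a wrapper stderr line.
TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
CHILD_START = re.compile(TS + r"wrapper: created child process")
CHILD_EXIT = re.compile(TS + r"\S+ exited;")


def exit_code_forms(code):
    """Return the (unsigned, hex, signed 8-bit) forms of an exit code."""
    unsigned = code & 0xFF
    signed = unsigned - 256 if unsigned > 127 else unsigned
    return unsigned, hex(unsigned), signed


def child_runtime_seconds(stderr_text):
    """Seconds from the last 'created child process' line to the exit line.

    Returns None if either line is missing from the text.
    """
    starts = CHILD_START.findall(stderr_text)
    exits = CHILD_EXIT.findall(stderr_text)
    if not starts or not exits:
        return None
    fmt = "%Y-%m-%d %H:%M:%S"
    began = datetime.strptime(starts[-1], fmt)
    ended = datetime.strptime(exits[-1], fmt)
    return (ended - began).total_seconds()
```

	For the excerpt above, exit_code_forms(195) gives (195, '0xc3', -61) and the child runtime is 459 seconds (08:25:34 to 08:33:13). The value 195 appears to correspond to BOINC's EXIT_CHILD_FAILED, which would fit the preceding "app exit status: 0x1" line; treat that mapping as an assumption to verify against BOINC's error_numbers.h.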

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 2 Jan 13 Posts: 8234 Credit: 113,724,563 RAC: 0	Message 10866 - Posted: 15 Jul 2025, 11:36:23 UTC Last modified: 15 Jul 2025, 11:41:02 UTC
	BOINC cannot leave GPU tasks in memory while suspended. The app itself writes a checkpoint file every minute, and resuming works as long as one was written. Before you update an OS (in this case Ubuntu), it is always better to stop BOINC first; Ubuntu is always asking to update. This will avoid any issues.
	Quote: "Suspect it is due to a common fault with some gpu apps which will fail upon restart when the restart causes the task to run on a different gpu than the task was originally started on. GPUGrid apps being the most common example."
	That does not matter; you can restart the same task on a different GPU.

Keith Myers Joined: 15 Jul 24 Posts: 14 Credit: 1,569,243,760 RAC: 122,114	Message 10867 - Posted: 16 Jul 2025, 15:46:49 UTC - in response to Message 10866.
	I've always practiced sensible BOINC usage. I never apply an update while BOINC is running. I still suspect that the app expects to run only on gpu0, since I only ever see gpu0 mentioned in the stderr.txt output. This is a common problem with many BOINC apps: the developers develop on a host with a single GPU and never consider that a host may have more than one, so they don't account for the GPU enumeration issue. It is a known issue at GPUGrid, for example, with the acemd3 application.

DeleteNull Volunteer developer Volunteer tester Joined: 29 Nov 14 Posts: 90 Credit: 716,945,862 RAC: 11,714	Message 10868 - Posted: 16 Jul 2025, 17:12:28 UTC - in response to Message 10867.
	These are two examples from your GPUs 0 and 1:
	<core_client_version>8.3.0</core_client_version>
	<![CDATA[
	<message>
	process exited with code 195 (0xc3, -61)</message>
	<stderr_txt>
	2025-07-15 15:42:59 (20049): wrapper (7.24.26018): starting
	2025-07-15 15:42:59 (20049): wrapper (7.24.26018): starting
	2025-07-15 15:42:59 (20049): wrapper: running ./mfaktc.exe (-d 0)
	2025-07-15 15:42:59 (20049): wrapper: created child process 20051
	2025-07-15 15:55:05 (20049): ./mfaktc.exe exited; CPU time 2.375904
	2025-07-15 15:55:05 (20049): app exit status: 0x1
	2025-07-15 15:55:05 (20049): called boinc_finish(195)
	</stderr_txt>
	]]>
	<core_client_version>8.3.0</core_client_version>
	<![CDATA[
	<message>
	process exited with code 195 (0xc3, -61)</message>
	<stderr_txt>
	2025-07-15 13:22:16 (16724): wrapper (7.24.26018): starting
	2025-07-15 13:22:16 (16724): wrapper (7.24.26018): starting
	2025-07-15 13:22:16 (16724): wrapper: running ./mfaktc.exe (-d 1)
	2025-07-15 13:22:16 (16724): wrapper: created child process 16726
	2025-07-15 13:27:14 (16724): ./mfaktc.exe exited; CPU time 1.665762
	2025-07-15 13:27:14 (16724): app exit status: 0x1
	2025-07-15 13:27:14 (16724): called boinc_finish(195)
	</stderr_txt>
	]]>
	As far as I can see, the GPU works on the task for more than 100 seconds and then errors out.
	When I start work (NVIDIA RTX 4080), stop it, and then resume it, the output looks like this:
	<core_client_version>8.2.4</core_client_version>
	<![CDATA[
	<stderr_txt>
	2025-07-15 10:00:27 (91374): wrapper (7.24.26018): starting
	2025-07-15 10:00:27 (91374): wrapper (7.24.26018): starting
	2025-07-15 10:00:27 (91374): wrapper: running ./mfaktc.exe (-d 0)
	2025-07-15 10:00:27 (91374): wrapper: created child process 91376
	2025-07-15 10:03:51 (91486): wrapper (7.24.26018): starting
	2025-07-15 10:03:51 (91486): wrapper (7.24.26018): starting
	2025-07-15 10:03:51 (91486): wrapper: running ./mfaktc.exe (-d 0)
	2025-07-15 10:03:51 (91486): wrapper: created child process 91488
	2025-07-15 10:07:12 (91486): ./mfaktc.exe exited; CPU time 0.805704
	2025-07-15 10:07:12 (91486): called boinc_finish(0)
	</stderr_txt>
	]]>
	So the app (TF) behaves as we expect (resuming).
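	To check the different-GPU theory against a task's stderr, one option is to count the wrapper starts and collect the "-d" device index of each one. A minimal Python sketch, assuming the log format shown in the excerpts above; summarize_wrapper_log is a hypothetical helper, not part of BOINC or mfaktc:

```python
import re

# "(pid): wrapper: running <app> (-d <device>)" marks each (re)start.
RUN = re.compile(r"\((\d+)\): wrapper: running \S+ \(-d (\d+)\)")
# "called boinc_finish(<code>)" is the wrapper's final status.
FINISH = re.compile(r"called boinc_finish\((\d+)\)")


def summarize_wrapper_log(stderr_text):
    """Summarize starts, device indices, and finish code of one task."""
    starts = [(int(pid), int(dev)) for pid, dev in RUN.findall(stderr_text)]
    devices = [dev for _pid, dev in starts]
    finish = FINISH.search(stderr_text)
    return {
        "starts": len(starts),                       # 1 = never restarted
        "resumed": len(starts) > 1,                  # more than one wrapper start
        "devices": devices,                          # -d value of each start
        "device_changed": len(set(devices)) > 1,     # resumed on another GPU?
        "finish_code": int(finish.group(1)) if finish else None,
    }
```

	Applied to the excerpts above, each failing task shows a single start (device 0 or device 1) with finish code 195, while the resumed task shows two starts, both on device 0, with finish code 0. None of these excerpts contains a restart on a different device, so they neither confirm nor rule out that theory.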

Keith Myers Joined: 15 Jul 24 Posts: 14 Credit: 1,569,243,760 RAC: 122,114	Message 10869 - Posted: 16 Jul 2025, 19:29:45 UTC
	In a couple of days I will have another chance to see whether a suspended TF task can be resumed successfully. I suspended several TF tasks, along with all other GPU project tasks that are not Einstein, so I can compete in the latest BOINC Games Sprint contest.
