Linux GPU errors

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6819 - Posted: 18 Oct 2020, 20:48:34 UTC
	I left a machine running over the weekend while I was away and came back to find it producing errors on all GPU tasks. Every error occurs in less than a minute, all with exit code 0x100. The machine was completing units successfully before I left and hasn't done any automatic updates in that time. This is the machine in question: http://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=208774 I made sure all the CUDA drivers are properly installed and rebooted the machine, with no change. I moved the GPU to PrimeGrid (PPS Sieve) and it's running fine, which makes me suspect it's something with the work here, but I cannot confirm that. Any help or insight would be appreciated.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 332	Message 6820 - Posted: 19 Oct 2020, 15:34:06 UTC - in response to Message 6819.
	You could try to reset the project on this host if you have only TF running.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6821 - Posted: 20 Oct 2020, 0:52:03 UTC - in response to Message 6820.
	I did that. Fully reinstalled all the drivers. Rebooted. Reset the project. It still throws errors on this work but works fine on PrimeGrid. It's not opening the child process for some reason. Do the new work units require more RAM on the card than the old ones?

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 332	Message 6824 - Posted: 20 Oct 2020, 14:21:20 UTC - in response to Message 6821.
	I ran a quick check and it used around 45 MB of RAM; I don't think it's more than before.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6826 - Posted: 22 Oct 2020, 7:05:48 UTC - in response to Message 6824.
	This is a typical error:
	<core_client_version>7.14.2</core_client_version>
	<![CDATA[
	<message>
	process exited with code 195 (0xc3, -61)</message>
	<stderr_txt>
	17:46:37 (599282): wrapper (7.2.26012): starting
	17:46:37 (599282): wrapper: running ./mfaktc.exe ( --device 0)
	17:47:57 (599282): ./mfaktc.exe exited; CPU time 1.075320
	17:47:57 (599282): app exit status: 0x100
	17:47:57 (599282): called boinc_finish
	</stderr_txt>
	]]>
	I have no idea what's going on, just that the old version worked and the new one doesn't on this particular card.
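A note on reading those two numbers. The sketch below assumes the wrapper prints the raw status returned by waitpid() on Linux (an assumption; the thread does not confirm it). Under that assumption, 0x100 means the application exited normally with exit code 1 rather than being killed by a signal. The helper names are hypothetical. The second function only shows why the same byte is printed as both 195 and -61.

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw POSIX wait status into its conventional fields."""
    signal = status & 0x7F             # low 7 bits: terminating signal, 0 if none
    exit_code = (status >> 8) & 0xFF   # next byte: the value passed to exit()
    return {
        "exited_normally": signal == 0,
        "exit_code": exit_code,
        "signal": signal,
    }


def as_signed_byte(code: int) -> int:
    """Reinterpret an 8-bit exit code (0..255) as a signed value."""
    return code - 256 if code > 127 else code


print(decode_wait_status(0x100))       # exit_code is 1, signal is 0
print(hex(195), as_signed_byte(195))   # 0xc3 -61
```

So the "app exit status: 0x100" line and the "code 195 (0xc3, -61)" line are two different layers: the first is what mfaktc returned, the second is what the client recorded for the failed task.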

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 332	Message 6827 - Posted: 22 Oct 2020, 15:15:24 UTC - in response to Message 6826.
	You could try testing it standalone, with and without the wrapper.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6839 - Posted: 22 Oct 2020, 23:58:08 UTC - in response to Message 6827.
	I'm running Linux. I don't think that will like a .exe file.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6840 - Posted: 23 Oct 2020, 9:02:03 UTC
	If it helps, this is what the client logs:
	Output file TF_72-73_127-219M_wu_229446_0_0 for task TF_72-73_127-219M_wu_229446_0 absent
	Is there any way to get a Linux native version of mfaktc? I'm running 64-bit Linux but only ever get project files compiled for Windows and the wrapper. Running a dependency check on the wrapper gets me:
	$ ldd wrapper_26012-v2_x86_64-pc-linux-gnu
	linux-vdso.so.1 (0x00007ffebd394000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb842ab7000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb842a9c000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb842a79000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb842887000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb842c31000)
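A file name ending in .exe does not by itself say which platform a binary was built for. One way to check, sketched here with a hypothetical helper, is to read the first bytes of the file: ELF binaries start with the bytes 7F 45 4C 46, while Windows executables start with "MZ". This says nothing about what the project actually ships; it is only a way to look.

```python
def binary_kind(path: str) -> str:
    """Classify an executable by its leading magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(4)
    if magic == b"\x7fELF":
        return "ELF (native Linux/Unix executable)"
    if magic[:2] == b"MZ":
        return "PE/DOS (Windows executable)"
    return "unknown"
```

For example, `binary_kind("mfaktc.exe")` run in the project directory would report which of the two formats the downloaded application uses.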

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6841 - Posted: 23 Oct 2020, 9:22:08 UTC - in response to Message 6827.
	Copied the .exe out of the zip archive and ran it in a terminal. This is the output:
	mfaktc v0.21 (64bit built)
	Compiletime options
	  THREADS_PER_BLOCK 256
	  SIEVE_SIZE_LIMIT 32kiB
	  SIEVE_SIZE 193154bits
	  SIEVE_SPLIT 250
	  MORE_CLASSES enabled
	Runtime options
	  SievePrimes 25000
	  SievePrimesAdjust 1
	  SievePrimesMin 5000
	  SievePrimesMax 100000
	  NumStreams 3
	  CPUStreams 3
	  GridSize 3
	  GPU Sieving enabled
	  GPUSievePrimes 82486
	  GPUSieveSize 2047Mi bits
	  GPUSieveProcessSize 32Ki bits
	  Checkpoints enabled
	  CheckpointDelay 300s
	  WorkFileAddDelay disabled
	  Stages enabled
	  StopAfterFactor class
	  PrintMode full
	  V5UserID (none)
	  ComputerID (none)
	  AllowSleep no
	  TimeStampInResults no
	CUDA version info
	  binary compiled for CUDA 10.10
	  CUDA runtime version 10.10
	  CUDA driver version 11.0
	CUDA device info
	  name Quadro K2000
	  compute capability 3.0
	  max threads per block 1024
	  max shared memory per MP 49152 byte
	  number of multiprocessors 2
	  CUDA cores per MP 192
	  CUDA cores - total 384
	  clock rate (CUDA cores) 954MHz
	  memory clock rate: 2000MHz
	  memory bus width: 128 bit
	Automatic parameters
	  threads per grid 1048576
	  GPUSievePrimes (adjusted) 82486
	  GPUsieve minimum exponent 1055144
	running a simple selftest...
	Selftest statistics
	  number of tests 107
	  successfull tests 107
	selftest PASSED!
	Can't open workfile worktodo.txt
	ERROR: get_next_assignment(): can't open "worktodo.txt"
	Next I tried it with a work file with the relevant numbers but it had a hissy fit about that. I cut off the job numbers and the file ran. I'll let it run in the terminal and see what happens.
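For anyone repeating the standalone test: a minimal sketch of building a work file entry, assuming the plain "Factor=exponent,bit_min,bit_max" line that mfaktc is commonly documented to accept in worktodo.txt. The exact format is an assumption here, not something quoted in this thread, so check it against the README shipped with the application. The helper name is hypothetical.

```python
def worktodo_line(exponent: int, bit_min: int, bit_max: int) -> str:
    """Build one trial-factoring assignment line (assumed mfaktc format)."""
    if not 0 < bit_min < bit_max:
        raise ValueError("expected 0 < bit_min < bit_max")
    return f"Factor={exponent},{bit_min},{bit_max}"


# Write a one-line work file next to the executable.
with open("worktodo.txt", "w") as handle:
    handle.write(worktodo_line(140615327, 72, 73) + "\n")
```

The numbers used are the exponent and bit range from the failing unit in this thread, with no job identifier in front of them.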

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6842 - Posted: 23 Oct 2020, 9:25:51 UTC
	Ok, here's the follow-up with a work file. Same as the previous output but with this at the end:
	got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
	Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
	 k_min = 16791791417940
	 k_max = 33583582839939
	Using GPU kernel "barrett76_mul32_gs"
	Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
	Oct 23 20:27 | 0 0.1% | 7.842 2h05m | 78.07 82485 n.a.%
	M140615327 has a factor: 38814612911305349835664385407
	ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 332	Message 6843 - Posted: 23 Oct 2020, 9:34:49 UTC - in response to Message 6842.
	Hmm, luckily there was a factor. Can you try another run on a unit without a factor? For the wrapper test I will post you a little guide later.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6844 - Posted: 23 Oct 2020, 9:43:18 UTC - in response to Message 6843.
	Using the same work file (the unit errored out in the actual client):
	got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
	Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
	 k_min = 16791791417940
	 k_max = 33583582839939
	Using GPU kernel "barrett76_mul32_gs"
	Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
	Oct 23 20:42 | 0 0.1% | 16.568 4h24m | 36.95 82485 n.a.%
	Oct 23 20:43 | 9 0.2% | 16.571 4h24m | 36.94 82485 n.a.%
	Oct 23 20:43 | 12 0.3% | 8.616 2h17m | 71.05 82485 n.a.%
	ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated
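The numeric code in that last line can be mapped back to a name. The sketch below pulls the code out of the log line and looks it up in a small table. The table is a hand-copied subset of the CUDA runtime's error enumeration as I understand it for CUDA 10.1 and later, so treat the values as assumptions and confirm them against the headers of the installed toolkit. Code 702 corresponds to a launch timeout, which the CUDA documentation associates with a kernel exceeding the run-time limit on a device where that limit is enabled (typically a GPU that also drives a display); an allocation failure would be a different code. Whether that limit is what is being hit on this card is not established in the thread.

```python
import re

# Hand-copied subset; verify against the installed CUDA headers.
CUDA_ERROR_NAMES = {
    2: "cudaErrorMemoryAllocation",
    700: "cudaErrorIllegalAddress",
    702: "cudaErrorLaunchTimeout",
    719: "cudaErrorLaunchFailure",
}

_PATTERN = re.compile(r"cudaGetLastError\(\) returned (\d+): (.*)")


def classify(line: str):
    """Return (code, name, message) for an mfaktc CUDA error line, else None."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name = CUDA_ERROR_NAMES.get(code, "unknown")
    return code, name, match.group(2).strip()


line = ("ERROR: cudaGetLastError() returned 702: "
        "the launch timed out and was terminated")
print(classify(line))
```

Keeping the lookup separate from the parsing makes it easy to extend the table once the values have been checked locally.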

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 332	Message 6845 - Posted: 23 Oct 2020, 10:47:45 UTC - in response to Message 6844.
	Hmm, if the app is running for some minutes then the VRAM size could be an issue.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6847 - Posted: 23 Oct 2020, 20:15:59 UTC - in response to Message 6845.
	I copied the application and data file to a known good machine (1060 6GB card) and it ran fine, so it's something particular to that K2000 rig. That little Quadro only has 2GB of VRAM. If that's not enough for the new work that's fine, but it might be worth putting a test in to confirm the card has at least 3GB before giving out work. Of course it's also an older card so there won't be many of them still in service.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 332	Message 6848 - Posted: 23 Oct 2020, 20:25:15 UTC - in response to Message 6847.
	I have checked this with GPU-Z and it's taking around 85 MB of VRAM; with the amount reserved by the BIOS the total is about 900 MB, so 2 GB of VRAM should be no problem at all. The card could also be overheating.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6849 - Posted: 24 Oct 2020, 2:10:16 UTC - in response to Message 6848. Last modified: 24 Oct 2020, 2:15:18 UTC
	Cleaned the card out (got very little dust out of it) and checked it. Same error. Checked the temperature (nVidia Settings app): the unit fails before the temperature can update (lagging display, normal for this card). Tried it on a PPS Sieve unit (PrimeGrid) and it's crunching away happily at under 60C after five minutes continuous, fan still at only 35%. A Collatz unit (OpenCL) runs fine at the same temperatures, and GPUGrid has the card at 61C and 37% fan after 10 minutes continuous. I compared the GPUGrid work on my GTX 1060 rig and it runs a couple of degrees cooler than SRBase but not much. Edit: it's lagging because I'm accessing the desktop remotely as well as the load on the GPU. Just to be clear.
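When the settings GUI lags, temperature, fan speed and memory use can be read from the command line instead. The following is a minimal sketch that calls nvidia-smi with its CSV query options and parses one line of output. The function names and the returned dictionary keys are hypothetical, and the sketch assumes nvidia-smi is on the PATH and that the driver reports these fields for the card (some cards report fan speed as not available, which is handled as None).

```python
import subprocess

QUERY = "temperature.gpu,fan.speed,memory.used,memory.total"


def _to_int(field: str):
    """Convert a numeric field to int; non-numeric values become None."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_sample(csv_line: str) -> dict:
    """Parse one line of 'csv,noheader,nounits' output for QUERY."""
    temp, fan, used, total = (_to_int(f) for f in csv_line.split(","))
    return {
        "temp_c": temp,
        "fan_pct": fan,
        "mem_used_mib": used,
        "mem_total_mib": total,
    }


def read_gpu(index: int = 0) -> dict:
    """Query one GPU through nvidia-smi and return the parsed sample."""
    output = subprocess.run(
        ["nvidia-smi", f"--id={index}", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_sample(output.splitlines()[0])
```

Calling `read_gpu()` in a loop with a short sleep while the task starts would log the readings up to the moment the unit fails, independent of the remote desktop display.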

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 332	Message 6861 - Posted: 25 Oct 2020, 7:25:59 UTC
	I have asked in the forum. It could be the same issue now seen on Windows 7, but fewer people are affected; maybe it's a driver issue.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6863 - Posted: 25 Oct 2020, 7:51:10 UTC - in response to Message 6861.
	I was wondering the same thing and spent time today cleaning out any differences between that rig and the other one that works fine. I found an issue with the kernel version I was using and Virtualbox which I figure is unrelated to this and went back to a different kernel that works on both machines, then reinstalled the drivers to be sure. Just rebooted and tried it but the same issue holds. I'm running the 450 drivers and 5.4 Linux kernel, btw.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 332	Message 6864 - Posted: 25 Oct 2020, 7:52:24 UTC - in response to Message 6863.
	The Windows machine is on Windows 7 with driver 436.15. I think it's the same kind of error.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 332	Message 6865 - Posted: 25 Oct 2020, 18:58:31 UTC
	A short note from the forum: check the factor, the logs, the hardware, the driver, temperatures, GPU memory reliability, etc.
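On "check the factor": a reported factor of a Mersenne number can be verified independently of the GPU with integer arithmetic. The sketch below uses three standard checks, with hypothetical function names: f divides 2^p - 1 exactly when 2^p mod f equals 1; any factor of M_p for an odd prime p has the form 2kp + 1 and is congruent to 1 or 7 modulo 8; and a factor from a 2^72 to 2^73 search should lie in that interval. One thing can be said without running anything: the value printed earlier in the thread has 29 decimal digits, while 2^73 has 22, so it lies outside the searched range. The divisibility result is not evaluated here.

```python
def divides_mersenne(p: int, f: int) -> bool:
    """True if f divides 2**p - 1 (uses modular exponentiation)."""
    return f > 1 and pow(2, p, f) == 1


def has_factor_form(p: int, f: int) -> bool:
    """True if f = 2*k*p + 1 for some k and f is 1 or 7 modulo 8."""
    return f % (2 * p) == 1 and f % 8 in (1, 7)


def in_bit_range(f: int, bit_min: int, bit_max: int) -> bool:
    """True if 2**bit_min <= f < 2**bit_max."""
    return (1 << bit_min) <= f < (1 << bit_max)


p = 140615327
reported = 38814612911305349835664385407
print("in 72-73 bit range:", in_bit_range(reported, 72, 73))
print("has factor form:   ", has_factor_form(p, reported))
print("divides M_p:       ", divides_mersenne(p, reported))
```

A small known case makes a useful sanity check of the functions themselves: 2^11 - 1 = 2047 = 23 × 89, so both 23 and 89 should pass the divisibility and form checks for p = 11.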
