Linux GPU errors

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6866 - Posted: 25 Oct 2020, 20:43:12 UTC - in response to Message 6865.
	I'll try rolling the driver back to 440 and see what happens. Hardware and temps I doubt are an issue given the card works fine on other prime number projects but it won't hurt to try just the same.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6867 - Posted: 26 Oct 2020, 10:17:19 UTC - in response to Message 6866.
	Went to 455 drivers instead. Still no go.

Dark Angel Send message Joined: 22 Jun 18 Posts: 15 Credit: 61,726,776 RAC: 0	Message 6869 - Posted: 27 Oct 2020, 1:26:48 UTC - in response to Message 6867.
	Back to the 435 drivers, still no go. PrimeGrid tasks still run with no drama (not at the same time, of course). Running ldd mfaktc.exe confirms that all required libraries are installed correctly.
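The ldd check mentioned above can also be scripted. The following is only a minimal sketch, assuming the usual ldd output format in which unresolved entries read "name => not found"; the function names are made up for illustration and are not part of mfaktc or BOINC.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libcudart.so.10.1 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on the given binary and list its unresolved libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

Calling check_binary("./mfaktc.exe") from the task's working directory would return an empty list when every shared library resolves. An empty list only shows that the loader can find the files, not that the driver and runtime versions are compatible.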

Sixtiz Send message Joined: 27 Oct 20 Posts: 2 Credit: 0 RAC: 0	Message 6870 - Posted: 27 Oct 2020, 12:49:24 UTC
	Same issue on my machine. mfaktc was happily running (Manjaro Linux), but since I updated the nvidia drivers it keeps throwing "ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated". I noticed the nvidia drivers were updated from 450.66 to 450.80, so I rolled back, but no luck... The only way I can make it work is by disabling the "Interactive" option in the X.org config as described in https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x but that makes the computer pretty much unusable... Also worth noting: the older version of mfaktc I was using before, outside of SRBase (mfaktc-0.21.linux64.cuda65), still works fine without changing the X.org config.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 300	Message 6871 - Posted: 27 Oct 2020, 14:13:04 UTC - in response to Message 6870.
	Thanks for the info. Never touch a running system; as in Win10, driver armageddon. After the rollback, some of the new settings are probably still stored somewhere.

Sixtiz Send message Joined: 27 Oct 20 Posts: 2 Credit: 0 RAC: 0	Message 6872 - Posted: 28 Oct 2020, 4:54:12 UTC - in response to Message 6871. Last modified: 28 Oct 2020, 4:55:11 UTC
	Also worth noting: it looks like from Linux kernel 5.9 onwards, the nvidia-uvm module won't load due to licensing issues (it wants to use other bits of GPL code, I think), and this causes CUDA applications to fail... So stay with 5.8 until Nvidia fixes their drivers.
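Whether the module is currently loaded can be checked by reading /proc/modules. This is a rough sketch only: it assumes the module appears there under the name nvidia_uvm, and the helper names are hypothetical. The module is often loaded on demand, so a negative result before any CUDA program has run does not by itself prove the kernel problem described above.

```python
import platform
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def uvm_loaded(proc_modules_text):
    """True if nvidia_uvm (assumed module name) is listed as loaded."""
    return "nvidia_uvm" in loaded_modules(proc_modules_text)


def report(modules_path="/proc/modules"):
    """Print the running kernel release and the nvidia_uvm load state."""
    text = Path(modules_path).read_text()
    print("kernel release:", platform.release())
    print("nvidia_uvm loaded:", uvm_loaded(text))
```

Running report() on the affected host prints both values side by side, which makes it easier to correlate a kernel upgrade with the module going missing.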

IDEA Send message Joined: 23 Sep 20 Posts: 35 Credit: 4,510,818,409 RAC: 17,143	Message 6991 - Posted: 21 Nov 2020, 11:18:43 UTC
	Not sure if this is related, but I converted a Windows 7 host to Linux Mint and it successfully completed more than 40 GPU tasks on an NVIDIA GTX 960, but then started to fail with computation errors. The error message is: "./mfaktc.exe: error while loading shared libraries: libcudart.so.10.1: cannot open shared object file: No such file or directory". The host ID is 211122. Has anybody seen this before, and is it a simple problem to fix? Rebooting didn't solve it.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 300	Message 6992 - Posted: 21 Nov 2020, 11:24:07 UTC - in response to Message 6991.
	Yes, see the FAQ.

IDEA Send message Joined: 23 Sep 20 Posts: 35 Credit: 4,510,818,409 RAC: 17,143	Message 6995 - Posted: 21 Nov 2020, 19:39:35 UTC - in response to Message 6992.
	The FAQ isn't easy to understand :( The GPU info is: CUDA: NVIDIA GPU 0: GeForce GTX 960 (driver version 455.38, CUDA version 11.1, compute capability 5.2, 1993MB, 1952MB available, 2618 GFLOPS peak) The FAQ says I will be sent Cuda111 and Cuda100 tasks. How do I stop the Cuda100 tasks from causing computation errors? The FAQ says to copy a missing file to /usr/lib64 -- but that file is deleted when you install Cuda11 Nvidia Drivers.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 300	Message 6996 - Posted: 21 Nov 2020, 20:30:20 UTC - in response to Message 6995. Last modified: 21 Nov 2020, 20:50:10 UTC
	I will try to get a file and upload it locally. Stay tuned.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 300	Message 6998 - Posted: 22 Nov 2020, 7:43:15 UTC
	http://srbase.my-firewall.org/sr5/download/libcudart.so.10.1

IDEA Send message Joined: 23 Sep 20 Posts: 35 Credit: 4,510,818,409 RAC: 17,143	Message 6999 - Posted: 22 Nov 2020, 10:55:28 UTC - in response to Message 6998.
	I tried this last night with a libcudart.so.10.1.243 copied from another host and renamed to 10.1, but it didn't work in /usr/lib64. So I created the missing directories and copied it to /usr/local/cuda/lib64 (its location on the other host), and that didn't work either. Now I have tried again with your file copied to both locations. Still getting computation errors. Did you know that installing the latest Nvidia driver removes the entire cuda directory from /usr/local? It's not just this file that is missing. Everything has gone.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 300	Message 7000 - Posted: 22 Nov 2020, 11:01:53 UTC - in response to Message 6999. Last modified: 22 Nov 2020, 11:03:52 UTC
	You only need the latest CUDA stuff plus the copied file. If that's not working, then use only the cuda10 driver + toolkit. Do you not have the /usr/lib64 folder? I cannot change the CUDA restrictions, because then Turing cards couldn't get any work; that's why you are getting cuda10 and cuda11 apps.

IDEA Send message Joined: 23 Sep 20 Posts: 35 Credit: 4,510,818,409 RAC: 17,143	Message 7001 - Posted: 22 Nov 2020, 11:33:24 UTC - in response to Message 7000.
	The other cuda files are in /usr/lib (not lib64) and in subdirectories by host architecture. OK, this now works! The file needs to go in /usr/lib. Thank you for sending the file link.

DeleteNull Volunteer developer Volunteer tester Send message Joined: 29 Nov 14 Posts: 88 Credit: 570,219,757 RAC: 104,952	Message 7002 - Posted: 22 Nov 2020, 12:56:07 UTC - in response to Message 7001.
	Depending on the flavour of Linux you are using, you have to choose the right directory. For Ubuntu (and derivatives) it is /usr/lib, and in openSUSE it is /usr/lib64. But there are some other Linux distros to confuse the user ;)
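To see where a given host already keeps its CUDA runtime files, the directories named in this thread can be searched. This is a sketch under assumptions: the directory list only covers the paths mentioned above, and locate_library is a hypothetical helper, not a BOINC or CUDA tool.

```python
from pathlib import Path

# Directories mentioned in the thread; other distributions may use others.
CANDIDATE_DIRS = ["/usr/lib", "/usr/lib64", "/usr/local/cuda/lib64"]


def locate_library(name, dirs=CANDIDATE_DIRS):
    """Return every file called `name` at or below the given directories."""
    hits = []
    for directory in dirs:
        root = Path(directory)
        if root.is_dir():
            # rglob also descends into architecture subdirectories,
            # for example /usr/lib/x86_64-linux-gnu.
            hits.extend(sorted(root.rglob(name)))
    return hits
```

locate_library("libcudart.so.10.1") lists the copies that exist on disk, and searching for a library that is already installed (such as another libcudart version) shows which directory the distribution uses. A file being present does not guarantee that the dynamic loader searches that directory, so the ldd output for mfaktc.exe remains the decisive check.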

IDEA Send message Joined: 23 Sep 20 Posts: 35 Credit: 4,510,818,409 RAC: 17,143	Message 7013 - Posted: 23 Nov 2020, 21:53:04 UTC - in response to Message 7002.
	A little confusing, but also rewarding to use with old Macs and PCs :) My next job is to get boinc to recognise a Thunderbolt eGPU. Ubuntu has no problem utilising the eGPU, but boinc says "no GPU found" no matter if it is Nvidia or ATI :( I have a question about your 3080. Have you tried running multiple tasks at the same time? I would have thought that two, or perhaps even four, at a time would be more efficient, because the time spent ending/beginning each task is significant when tasks are processed so quickly.

DeleteNull Volunteer developer Volunteer tester Send message Joined: 29 Nov 14 Posts: 88 Credit: 570,219,757 RAC: 104,952	Message 7014 - Posted: 23 Nov 2020, 23:09:02 UTC - in response to Message 7013.
	"Have you tried running multiple tasks at the same time?" No, I haven't. Performance is outstanding, so there is no need to do so (for me).

Greger Send message Joined: 1 Nov 16 Posts: 11 Credit: 3,794,073,105 RAC: 2,728,427	Message 7651 - Posted: 12 Jun 2021, 21:06:42 UTC Last modified: 12 Jun 2021, 21:24:44 UTC
	Didn't manage to get a 3080 Ti running. Had to use driver 465 and re-added libcudart.so.10.1 to /usr/lib/ but got: process exited with code 195 (0xc3, -61). rebirther: do we need additional changes to the libcudart file, or to the application to use CUDA 11.3, in order to work with the 465 driver? smi: Driver Version: 465.27, CUDA Version: 11.3. Host: https://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=211970 boinc does not fully detect the card, but it runs great on other projects.

rebirther Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 2 Jan 13 Posts: 7808 Credit: 44,534,674 RAC: 300	Message 7652 - Posted: 12 Jun 2021, 21:22:07 UTC - in response to Message 7651.
	What host is affected and what cuda toolkit is installed?

Greger Send message Joined: 1 Nov 16 Posts: 11 Credit: 3,794,073,105 RAC: 2,728,427	Message 7653 - Posted: 12 Jun 2021, 21:26:34 UTC
	https://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=211970 I installed just the driver, then ocl-icd-libopencl1 and nvidia-opencl-dev.
