Linux GPU errors

Message boards : Number crunching : Linux GPU errors

Dark Angel
Joined: 22 Jun 18
Posts: 15
Credit: 60,126,276
RAC: 8,534
Message 6866 - Posted: 25 Oct 2020, 20:43:12 UTC - in response to Message 6865.

I'll try rolling the driver back to 440 and see what happens.
I doubt hardware or temps are the issue, given that the card works fine on other prime number projects, but it won't hurt to try just the same.

Dark Angel
Joined: 22 Jun 18
Posts: 15
Credit: 60,126,276
RAC: 8,534
Message 6867 - Posted: 26 Oct 2020, 10:17:19 UTC - in response to Message 6866.

Went to 455 drivers instead. Still no go.

Dark Angel
Joined: 22 Jun 18
Posts: 15
Credit: 60,126,276
RAC: 8,534
Message 6869 - Posted: 27 Oct 2020, 1:26:48 UTC - in response to Message 6867.

Back to 435 drivers, still no go.

Still running PrimeGrid tasks with no drama (not at the same time, of course).

ldd mfaktc.exe confirms I have all required libraries installed correctly.
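
For anyone who wants to repeat that check, here is a minimal Python sketch that scans ldd output for unresolved libraries. It assumes ldd's usual "name => path (address)" layout; the helper name missing_libs and the sample text are made up for illustration and are not output from this host.

```python
def missing_libs(ldd_output: str) -> list[str]:
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.1 => not found"
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


# Illustrative sample only; real input would come from running `ldd ./mfaktc.exe`.
sample = "\n".join([
    "\tlinux-vdso.so.1 (0x00007ffd0a5f2000)",
    "\tlibcudart.so.10.1 => not found",
    "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c2a000000)",
])
print(missing_libs(sample))
```

An empty list means every library resolved, which matches what is described above; anything listed is a library the loader cannot find.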

Sixtiz
Joined: 27 Oct 20
Posts: 2
Credit: 0
RAC: 0
Message 6870 - Posted: 27 Oct 2020, 12:49:24 UTC

Same issue on my machine. mfaktc was happily running (Manjaro Linux), but since I updated the NVIDIA drivers, it keeps throwing "ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated".

I noticed the nvidia drivers were updated from 450.66 to 450.80 so I rolled back, but no luck...
The only way I can make it work is by disabling the "Interactive" option in X.org config as described in https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x but it makes the computer pretty much unusable...

Also worth noting: the older version of mfaktc I was using before, outside of SRBase (mfaktc-0.21.linux64.cuda65), still works fine without changing the X.org config.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7255
Credit: 42,729,227
RAC: 3
Message 6871 - Posted: 27 Oct 2020, 14:13:04 UTC - in response to Message 6870.


Thanks for the info. Never touch a running system. It's driver armageddon, as in Win10: after the rollback, some of the new settings are probably still stored somewhere.

Sixtiz
Joined: 27 Oct 20
Posts: 2
Credit: 0
RAC: 0
Message 6872 - Posted: 28 Oct 2020, 4:54:12 UTC - in response to Message 6871.
Last modified: 28 Oct 2020, 4:55:11 UTC

Also worth noting: it looks like from Linux kernel 5.9 onwards, the nvidia-uvm module won't load due to licensing issues (it wants to use other bits of GPL code, I think), and this causes CUDA applications to fail. So stay with 5.8 until Nvidia fixes their drivers.
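
A quick way to see whether the module is loaded is to look for it in /proc/modules. The sketch below is only an illustration: the helper names are made up, the sample text is invented rather than taken from my machine, and it relies on the module showing up as nvidia_uvm (with an underscore) in that file.

```python
def loaded_modules(proc_modules_text: str) -> set[str]:
    """Collect module names from text in /proc/modules format (name is the first field)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def uvm_loaded(proc_modules_text: str) -> bool:
    """True if the nvidia_uvm kernel module appears in the module list."""
    return "nvidia_uvm" in loaded_modules(proc_modules_text)


# Invented sample; on a real system, read the text from /proc/modules instead.
sample = "\n".join([
    "nvidia_drm 57344 4 - Live 0x0000000000000000 (POE)",
    "nvidia 34000000 200 nvidia_drm, Live 0x0000000000000000 (POE)",
])
print(uvm_loaded(sample))
```

If this reports False on a real module list while the plain nvidia module is present, CUDA applications are likely to fail in the way described above.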

IDEA
Joined: 23 Sep 20
Posts: 34
Credit: 4,301,651,437
RAC: 640,327
Message 6991 - Posted: 21 Nov 2020, 11:18:43 UTC

Not sure if this is related, but I converted a Windows 7 host to Linux Mint and it successfully completed more than 40 GPU tasks on an NVIDIA GTX 960 but then started to fail with computation errors.

The error message is:

./mfaktc.exe: error while loading shared libraries: libcudart.so.10.1: cannot open shared object file: No such file or directory

Host ID is 211122

Anybody seen this before and know if it's a simple problem to fix?

Rebooting didn't solve it.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7255
Credit: 42,729,227
RAC: 3
Message 6992 - Posted: 21 Nov 2020, 11:24:07 UTC - in response to Message 6991.


Yes, see the FAQ.

IDEA
Joined: 23 Sep 20
Posts: 34
Credit: 4,301,651,437
RAC: 640,327
Message 6995 - Posted: 21 Nov 2020, 19:39:35 UTC - in response to Message 6992.

The FAQ isn't easy to understand :(

The GPU info is:

CUDA: NVIDIA GPU 0: GeForce GTX 960 (driver version 455.38, CUDA version 11.1, compute capability 5.2, 1993MB, 1952MB available, 2618 GFLOPS peak)


The FAQ says I will be sent Cuda111 and Cuda100 tasks.

How do I stop the Cuda100 tasks from causing computation errors?

The FAQ says to copy a missing file to /usr/lib64 -- but that file is deleted when you install Cuda11 Nvidia Drivers.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7255
Credit: 42,729,227
RAC: 3
Message 6996 - Posted: 21 Nov 2020, 20:30:20 UTC - in response to Message 6995.
Last modified: 21 Nov 2020, 20:50:10 UTC


I will try to get the file and host it locally. Stay tuned.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7255
Credit: 42,729,227
RAC: 3
Message 6998 - Posted: 22 Nov 2020, 7:43:15 UTC

http://srbase.my-firewall.org/sr5/download/libcudart.so.10.1

IDEA
Joined: 23 Sep 20
Posts: 34
Credit: 4,301,651,437
RAC: 640,327
Message 6999 - Posted: 22 Nov 2020, 10:55:28 UTC - in response to Message 6998.

I tried this last night with a libcudart.so.10.1.243 copied from another host and renamed to 10.1, but it didn't work in /usr/lib64.

So I created the missing directories and copied it to /usr/local/cuda/lib64 (the location on the other host) and that didn't work either.

Now trying again with your file copied to both locations.

Still getting computation errors.

Do you know that when you install the latest Nvidia driver, it removes the entire cuda directory from /usr/local?

It's not just this file that is missing. Everything has gone.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7255
Credit: 42,729,227
RAC: 3
Message 7000 - Posted: 22 Nov 2020, 11:01:53 UTC - in response to Message 6999.
Last modified: 22 Nov 2020, 11:03:52 UTC


You only need the latest CUDA stuff plus the copied file. If it's not working, then use only the cuda10 driver + toolkit. Do you not have the /usr/lib64 folder?

I cannot change the CUDA restrictions, because then Turing cards couldn't get any work. That's why you are getting both cuda10 and cuda11 apps.

IDEA
Joined: 23 Sep 20
Posts: 34
Credit: 4,301,651,437
RAC: 640,327
Message 7001 - Posted: 22 Nov 2020, 11:33:24 UTC - in response to Message 7000.

The other cuda files are in /usr/lib (not lib64) and in subdirectories by host architecture.

OK, this now works!

The file needs to go in /usr/lib

Thank you for sending the file link.

DeleteNull
Volunteer developer
Volunteer tester
Joined: 29 Nov 14
Posts: 80
Credit: 367,636,322
RAC: 251,267
Message 7002 - Posted: 22 Nov 2020, 12:56:07 UTC - in response to Message 7001.

Depending on the flavour of Linux you are using you have to choose the right directory.
For Ubuntu (and derivatives) it is /usr/lib, and for openSUSE it is /usr/lib64.
But there are some other Linux Distros to confuse the user ;)
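
As a rough illustration of that rule, the Python sketch below picks the directory from the contents of /etc/os-release. It covers only the two cases named above; the function names are made up, and the ID and ID_LIKE values in the samples are assumptions about what those distros typically report. Anything else returns None rather than guessing.

```python
from typing import Optional


def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines in os-release format into a dict, dropping quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def cudart_dir(os_release_text: str) -> Optional[str]:
    """Suggest where libcudart.so.10.1 goes, using only the rule from this thread."""
    fields = parse_os_release(os_release_text)
    ids = {fields.get("ID", "")} | set(fields.get("ID_LIKE", "").split())
    if "ubuntu" in ids:
        return "/usr/lib"      # Ubuntu and derivatives
    if ids & {"opensuse", "suse"}:
        return "/usr/lib64"    # openSUSE
    return None                # other distros: not covered here


# Assumed sample values, not copied from real hosts.
mint = 'ID=linuxmint\nID_LIKE=ubuntu\n'
suse = 'ID="opensuse-leap"\nID_LIKE="suse opensuse"\n'
print(cudart_dir(mint), cudart_dir(suse))
```

On a real machine you would pass in the text of /etc/os-release; for the distros not listed, checking where the other CUDA libraries already live is probably the safer guide.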

IDEA
Joined: 23 Sep 20
Posts: 34
Credit: 4,301,651,437
RAC: 640,327
Message 7013 - Posted: 23 Nov 2020, 21:53:04 UTC - in response to Message 7002.

A little confusing, but also rewarding to use with old Macs and PCs :)

My next job is to get BOINC to recognise a Thunderbolt eGPU. Ubuntu has no problem utilising the eGPU, but BOINC says "no GPU found" no matter if it is Nvidia or ATI :(


I have a question about your 3080.

Have you tried running multiple tasks at the same time?

I would have thought that two, or perhaps even four, at a time would be more efficient, because the time spent ending/beginning each task is significant when processing tasks so quickly?

DeleteNull
Volunteer developer
Volunteer tester
Joined: 29 Nov 14
Posts: 80
Credit: 367,636,322
RAC: 251,267
Message 7014 - Posted: 23 Nov 2020, 23:09:02 UTC - in response to Message 7013.


Have you tried running multiple tasks at the same time?


No, I haven't. Performance is outstanding, so there is no need to do so (for me).

Greger
Joined: 1 Nov 16
Posts: 11
Credit: 3,279,445,305
RAC: 201,599
Message 7651 - Posted: 12 Jun 2021, 21:06:42 UTC
Last modified: 12 Jun 2021, 21:24:44 UTC

Didn't manage to get the 3080 Ti running. I had to use driver 465 and re-added libcudart.so.10.1 to /usr/lib/, but got:
process exited with code 195 (0xc3, -61)

rebirther: do we need additional changes to the libcudart file, or to the application to use CUDA 11.3, in order to work with the 465 driver?
smi: Driver Version: 465.27 CUDA Version: 11.3

host: https://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=211970
BOINC does not fully detect the card, but it runs great on other projects.
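
Side note on the numbers in that exit message: they look like one value shown three ways. The small Python sketch below assumes the third number is the same byte read as a signed 8-bit integer; the helper name as_signed_byte is made up for illustration.

```python
def as_signed_byte(code: int) -> int:
    """Reinterpret the low 8 bits of an exit code as a signed 8-bit integer."""
    code &= 0xFF
    return code - 256 if code >= 128 else code


code = 195
print(hex(code), as_signed_byte(code))  # prints: 0xc3 -61
```

So 195, 0xc3 and -61 are the same exit status and do not point to three separate errors.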

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7255
Credit: 42,729,227
RAC: 3
Message 7652 - Posted: 12 Jun 2021, 21:22:07 UTC - in response to Message 7651.

Didn't manage to get 3080 Ti running. Had to use driver 465 and re-added libcudart.so.10.1 to /usr/lib/ but got:
process exited with code 195 (0xc3, -61)

rebirther: do we need additional changes to libcudart file in order to work with 465 driver?


What host is affected and what cuda toolkit is installed?

Greger
Send message
Joined: 1 Nov 16
Posts: 11
Credit: 3,279,445,305
RAC: 201,599
Message 7653 - Posted: 12 Jun 2021, 21:26:34 UTC

https://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=211970

I installed just the driver, then ocl-icd-libopencl1 and nvidia-opencl-dev.



Copyright © 2014-2024 BOINC Confederation / rebirther