Nvidia Tesla P100 Problems

Ian&Steve C.
Joined: 7 May 23
Posts: 7
Credit: 1,086,024
RAC: 2
Message 9746 - Posted: 14 Mar 2024, 15:51:33 UTC - in response to Message 9745.

CUDA 12+ still supports CC 6.0 devices. Why not just recompile the cuda120 app with CC 6.0 support?

Otherwise, the only real solution is to transition to Anonymous Platform and force the cuda100 app to run on everything with an app_info.xml file.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9747 - Posted: 14 Mar 2024, 15:57:29 UTC - in response to Message 9746.

The cuda12 app supports CC 6.1 (see the 1070 Ti). I don't know why the Tesla doesn't run cuda120 WUs. Anonymous platform is not allowed.

Ian&Steve C.
Joined: 7 May 23
Posts: 7
Credit: 1,086,024
RAC: 2
Message 9748 - Posted: 14 Mar 2024, 16:13:33 UTC - in response to Message 9747.
Last modified: 14 Mar 2024, 16:14:36 UTC

I know your app supports 6.1, but CUDA 12 from Nvidia still supports down to CC 5.0.

I'm saying you should recompile it to add support for 6.0 as well. The Tesla P100 has CC level 6.0.

6.1 is used on the mainstream Pascal GPUs (GTX 10-series, Quadro P-series).
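
To make the 6.0 versus 6.1 point concrete, here is a minimal sketch of the binary compatibility rule as I understand it: a compiled cubin runs only on devices with the same major compute capability and an equal or higher minor version. The helper name `cubin_runs_on` is made up for illustration, and the sketch ignores PTX JIT fallback, so treat it as a rough model rather than how the driver actually decides.

```shell
# Hypothetical helper (not part of any NVIDIA tool): can a cubin built for
# BUILT_CC run on a device of DEVICE_CC?
# Assumed rule: same major version, device minor >= built minor.
# PTX JIT fallback is deliberately ignored in this sketch.
cubin_runs_on() {
    built_major=${1%%.*}
    built_minor=${1#*.}
    dev_major=${2%%.*}
    dev_minor=${2#*.}
    [ "$built_major" -eq "$dev_major" ] && [ "$dev_minor" -ge "$built_minor" ]
}

# Usage: a build that only targets 6.1, checked against a Tesla P100 (CC 6.0)
if cubin_runs_on 6.1 6.0; then
    echo "a 6.1-only build can run on CC 6.0"
else
    echo "a 6.1-only build cannot run on CC 6.0"
fi
```

Under that rule a build for 6.0 would cover both the P100 and the GTX 10-series, while a 6.1-only build would leave the P100 out.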

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9749 - Posted: 14 Mar 2024, 16:32:53 UTC - in response to Message 9748.

Yes, we will try to recompile; perhaps this will solve a lot of problems.

Ian&Steve C.
Joined: 7 May 23
Posts: 7
Credit: 1,086,024
RAC: 2
Message 9750 - Posted: 14 Mar 2024, 18:00:27 UTC - in response to Message 9749.

There also appears to be a problem with the cuda100 package.

The mfaktc-linux64-v10.zip contains Windows files (mfaktc.exe and .ini files). It should contain Linux binaries and the CUDA 10 libcudart.so.10 shared object libraries, like the cuda11 package does.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9751 - Posted: 14 Mar 2024, 18:17:57 UTC - in response to Message 9750.

You can find the libcudart.so.10 file in the FAQs. If the app was dynamically compiled, you don't need the libraries in the package. The .exe in the Linux package was compiled the same way; despite the name, it runs under Linux. The .ini file is fixed.

Ian&Steve C.
Joined: 7 May 23
Posts: 7
Credit: 1,086,024
RAC: 2
Message 9752 - Posted: 14 Mar 2024, 18:33:27 UTC - in response to Message 9751.

When I run ldd mfaktc.exe on the binary, it says libcudart.so.10.1 is missing.

Your cuda111 and cuda120 packages include this file.

I see nothing in the FAQ that references this file. Can you link to it? Is your intended workaround to create a symlink from libcudart.so.10 to libcudart.so.10.1?
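
For anyone who wants to repeat the check, here is a rough sketch of filtering ldd output down to the libraries it cannot resolve. `missing_libs` is a name made up for this example, and the sample input is modelled on the report above rather than captured from a real run.

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd-style output on stdin, so it can be tried without the real binary.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Illustrative sample input (not a real capture from the cuda100 package).
printf '\tlibcudart.so.10.1 => not found\n\tlibm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0000000000)\n' | missing_libs

# Against the real binary this would be:
#   ldd ./mfaktc.exe | missing_libs
#
# The symlink workaround asked about would look like this, run in the
# directory that holds libcudart.so.10. This is untested here; whether a
# 10.0 runtime satisfies a binary linked against 10.1 is an open question.
#   ln -s libcudart.so.10 libcudart.so.10.1
```

The sample run prints only the unresolved name, which is the part worth posting when reporting a packaging problem.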

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9753 - Posted: 14 Mar 2024, 18:46:43 UTC - in response to Message 9752.

The file must be somewhere in another thread. Linux is not my thing, so I will ask if we can do something. The package needs a recompile too; it's old stuff.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9754 - Posted: 16 Mar 2024, 21:55:38 UTC

new cuda120 TF mfaktc app is up

changelog:
- recompiled from cc5-latest

mmonnin
Joined: 1 Feb 17
Posts: 27
Credit: 311,202,073
RAC: 14,155
Message 9757 - Posted: 17 Mar 2024, 1:33:31 UTC
Last modified: 17 Mar 2024, 1:35:00 UTC

All errors with new version.
https://srbase.my-firewall.org/sr5/result.php?resultid=145853091

Please undo

Fine with 0.28, errors on 0.29.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9758 - Posted: 17 Mar 2024, 2:03:45 UTC - in response to Message 9757.

OK, this did not go as planned. Reverted back to 0.28; need to check...

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9759 - Posted: 17 Mar 2024, 10:38:17 UTC - in response to Message 9758.

v30 fixes the issue, which was a wrong, older .so.12 file.

mmonnin
Joined: 1 Feb 17
Posts: 27
Credit: 311,202,073
RAC: 14,155
Message 9760 - Posted: 17 Mar 2024, 11:28:56 UTC

That fixed one of my PCs.

Another still errors out on every task
https://srbase.my-firewall.org/sr5/result.php?resultid=145885980

DeleteNull
Volunteer developer
Volunteer tester
Joined: 29 Nov 14
Posts: 83
Credit: 367,636,322
RAC: 69,094
Message 9761 - Posted: 17 Mar 2024, 11:55:51 UTC - in response to Message 9760.

Can you update your driver on this host?

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9762 - Posted: 17 Mar 2024, 13:41:03 UTC - in response to Message 9760.

v31 is up; it should reduce the driver requirement to 525.
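
A quick way to check a host against that 525 figure, sketched under the assumption that only the major driver version matters. `driver_meets_min` is a hypothetical helper, and the version string in the usage line is just an example value.

```shell
# Hypothetical check: does a driver version string meet a minimum major version?
# Assumes the requirement is expressed as a major version only (e.g. 525).
driver_meets_min() {
    major=${1%%.*}
    [ "$major" -ge "$2" ]
}

# On a real host the version string could be read with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
# The value below is an example, not read from any machine.
if driver_meets_min "525.147.05" 525; then
    echo "driver meets the 525 requirement"
else
    echo "driver is older than 525"
fi
```

If the check fails, updating the driver is the first thing to try before reporting task errors.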

mmonnin
Joined: 1 Feb 17
Posts: 27
Credit: 311,202,073
RAC: 14,155
Message 9764 - Posted: 17 Mar 2024, 14:12:52 UTC - in response to Message 9762.

Awesome, thank you. This version works.

crashtech
Joined: 10 Apr 19
Posts: 28
Credit: 466,952,134
RAC: 130,404
Message 9766 - Posted: 18 Mar 2024, 16:59:23 UTC

I'm still getting errors on a dual P100 PC trying to run TF 0.31 (cuda120) tasks:

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2024-03-18 09:49:38 (2334): wrapper (7.24.26018): starting
2024-03-18 09:49:38 (2334): wrapper (7.24.26018): starting
2024-03-18 09:49:38 (2334): wrapper: running ./mfaktc (-d 1)
2024-03-18 09:49:38 (2334): wrapper: created child process 2336
2024-03-18 09:49:39 (2334): ./mfaktc exited; CPU time 0.254903
2024-03-18 09:49:39 (2334): app exit status: 0x1
2024-03-18 09:49:39 (2334): called boinc_finish(195)

</stderr_txt>
]]>

$ nvidia-smi
Mon Mar 18 11:02:56 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   31C    P0    25W / 250W |    197MiB / 16384MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:09:00.0 Off |                    0 |
| N/A   30C    P0    27W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1165      G   /usr/lib/xorg/Xorg                112MiB |
|    0   N/A  N/A      1517      G   cinnamon                           28MiB |
|    0   N/A  N/A      2181      G   ...2gtk-4.0/WebKitWebProcess       55MiB |
|    1   N/A  N/A      1165      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9767 - Posted: 18 Mar 2024, 17:46:21 UTC - in response to Message 9766.

Interesting, try to run ldd ./mfaktc

crashtech
Joined: 10 Apr 19
Posts: 28
Credit: 466,952,134
RAC: 130,404
Message 9768 - Posted: 18 Mar 2024, 18:57:47 UTC - in response to Message 9767.

$ ldd ./mfaktc
linux-vdso.so.1 (0x00007ffd1f50d000)
libcudart.so.12 => ./libcudart.so.12 (0x00007ff29f6b8000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff29f552000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff29f370000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff29f355000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff29f163000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff29f960000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff29f15d000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff29f138000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ff29f12e000)

Ian&Steve C.
Joined: 7 May 23
Posts: 7
Credit: 1,086,024
RAC: 2
Message 9769 - Posted: 18 Mar 2024, 19:03:16 UTC
Last modified: 18 Mar 2024, 19:06:42 UTC

Was v31 compiled with explicit support for CC 6.0 in the NVCCFLAGS section of your makefile or build script?


The argument should be something similar to this:

--generate-code arch=compute_60,code=sm_60


This is what I was suggesting when I mentioned recompiling it to add 6.0 support.
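
As a rough sketch of how that could look in a build script, the snippet below assembles one --generate-code argument per target. The `gencode_flags` helper and the choice of targets (60 and 61) are assumptions for illustration; I have not seen the project's actual makefile.

```shell
# Build nvcc --generate-code arguments for a list of compute capabilities,
# written without the dot (60 for CC 6.0, 61 for CC 6.1).
# gencode_flags is a hypothetical helper, not taken from the mfaktc makefile.
gencode_flags() {
    flags=""
    for cc in "$@"; do
        flags="$flags --generate-code arch=compute_$cc,code=sm_$cc"
    done
    # Drop the leading space left by the first iteration
    printf '%s\n' "${flags# }"
}

# Example target list: Tesla P100 (6.0) plus mainstream Pascal (6.1)
NVCCFLAGS="$(gencode_flags 60 61)"
printf '%s\n' "$NVCCFLAGS"
```

Adding more targets is then a matter of extending the list passed to the helper, at the cost of a larger binary.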


Copyright © 2014-2024 BOINC Confederation / rebirther