Nvidia Tesla P100 Problems
Message boards : Number crunching : Nvidia Tesla P100 Problems

crashtech
Joined: 10 Apr 19
Posts: 28
Credit: 466,952,134
RAC: 214,244
Message 9726 - Posted: 13 Mar 2024, 2:06:56 UTC

Hi,

I have many Linux computers that are used for BOINC. They are all Mint Linux 20 or 20.3, have glibc 2.31, Nvidia driver 525 or 535, and have nvidia-cuda-toolkit installed. All of the GPUs run TF just fine except for the P100s. It seems that the P100s can run cuda100 tasks but not cuda120 tasks. Does anyone know why this might be? If I can't fix them, I might have to find a way to only process cuda100 tasks, but this would be a last resort.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7257
Credit: 42,729,227
RAC: 1
Message 9727 - Posted: 13 Mar 2024, 2:15:57 UTC - in response to Message 9726.
Last modified: 13 Mar 2024, 2:20:42 UTC

Which cuda version is installed?

To get only cuda100 tasks and avoid any issues, try this app_config:

<app_config>
  <app_version>
    <app_name>TF</app_name>
    <plan_class>cuda100</plan_class>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
</app_config>

crashtech
Message 9728 - Posted: 13 Mar 2024, 12:24:52 UTC - in response to Message 9727.
Last modified: 13 Mar 2024, 12:27:29 UTC

Hi,

Thanks for your reply. Here is the cuda version of one of the P100 hosts:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Also nvidia-smi:

$ nvidia-smi
Wed Mar 13 06:31:10 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   52C    P0   120W / 250W |   1843MiB / 16384MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1599      G   /usr/lib/xorg/Xorg                187MiB |
|    0   N/A  N/A      2046      G   xfwm4                               3MiB |
|    0   N/A  N/A     16247      G   ...2gtk-4.0/WebKitWebProcess        8MiB |
|    0   N/A  N/A     17960      G   ...2gtk-4.0/WebKitWebProcess        8MiB |
|    0   N/A  N/A   1103548      C   ...64-pc-linux-gnu__cuda1222      816MiB |
|    0   N/A  N/A   1109316      C   ...64-pc-linux-gnu__cuda1222      816MiB |
+-----------------------------------------------------------------------------+


Thank you for the app_config. I hope an alternate solution can be found that will allow me to run all the TF apps!

rebirther
Message 9729 - Posted: 13 Mar 2024, 14:31:48 UTC - in response to Message 9728.

The Tesla P100 should have an older compute capability, which was not compiled into the cuda120 app. The plan_class is a bit tricky, and overriding the server-side plan_class with an app_config is the best solution to avoid any issues. Perhaps we need a recompiled cuda120 app to support older cards.

crashtech
Message 9730 - Posted: 13 Mar 2024, 15:26:07 UTC - in response to Message 9729.

Thanks for your reply, this would seem to confirm that there isn't anything different I can do with my computer configurations other than the cuda120 exclusion. Interestingly, my other Pascal GPUs (1080ti and 1070ti) do not have any problems with cuda120, so it seems there may be something unique about the datacenter Pascal GPU version that is causing this problem.
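
A possible explanation for the difference between the P100 and the other Pascal cards, not confirmed anywhere in this thread: the Tesla P100 is a compute capability 6.0 device, while the GTX 1070 Ti and 1080 Ti are 6.1. Under CUDA's compatibility rules, device binary code built for X.y runs only on devices X.z with z >= y, and embedded PTX can be JIT-compiled only for devices of equal or higher capability. The Python sketch below illustrates that rule. The function name and the build lists are hypothetical, since the thread does not say which architectures the cuda120 app was actually compiled for.

```python
def can_run(device_cc, cubin_archs, ptx_archs):
    """Return True if a device with compute capability device_cc,
    given as (major, minor), can run a binary that embeds device code
    (cubin) for cubin_archs and PTX for ptx_archs."""
    dev_major, dev_minor = device_cc

    # cubin rule: same major version, device minor >= build minor
    for major, minor in cubin_archs:
        if major == dev_major and dev_minor >= minor:
            return True

    # PTX rule: can be JIT-compiled for any device of equal or higher capability
    for arch in ptx_archs:
        if device_cc >= arch:
            return True

    return False


# Compute capabilities of the cards mentioned in the thread
TESLA_P100 = (6, 0)
GTX_1070_TI = (6, 1)
RTX_2080_TI = (7, 5)

# HYPOTHETICAL build targets, chosen only to illustrate the rule
HYPOTHETICAL_CUBIN = [(6, 1), (7, 5)]
HYPOTHETICAL_PTX = [(7, 5)]

for name, cc in [("Tesla P100", TESLA_P100),
                 ("GTX 1070 Ti", GTX_1070_TI),
                 ("RTX 2080 Ti", RTX_2080_TI)]:
    print(name, can_run(cc, HYPOTHETICAL_CUBIN, HYPOTHETICAL_PTX))
```

With a build like this hypothetical one, only the P100 would be left out, which would match the symptoms described above. Whether the real cuda120 app is built this way would have to be checked by the project.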

crashtech
Message 9731 - Posted: 13 Mar 2024, 19:26:48 UTC - in response to Message 9727.

For some reason this app_config does not work in dual-GPU systems where the GPUs don't match, e.g. a 2080ti and a P100, or a 1070ti and a P100. Single P100s work now, and dual P100s also.

rebirther
Message 9732 - Posted: 13 Mar 2024, 19:50:16 UTC - in response to Message 9731.
Last modified: 13 Mar 2024, 20:08:02 UTC

Tricky. Do you get work for the 1070ti too? In this case all of your GPUs should only be getting cuda100, as intended. So we need a solution. I can try to add a plan_class only for the Tesla P100. Let's see how we can fix this so we don't need an app_config, though that may cause other issues with different CUDA versions and drivers.

<plan_class>
  <name>cuda120</name>
  <gpu_type>nvidia</gpu_type>
  <cuda/>
  <min_cuda_version>12000</min_cuda_version>
  <max_cuda_version>13000</max_cuda_version>
  <min_nvidia_compcap>800</min_nvidia_compcap>
  <min_driver_version>52560</min_driver_version>
  <min_gpu_ram_mb>384</min_gpu_ram_mb>
  <gpu_ram_used_mb>384</gpu_ram_used_mb>
  <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
  <cpu_frac>0.01</cpu_frac>
</plan_class>


Try adding this to the app_config too, so that cuda100 goes to the Tesla and the rest goes to your other cards when running in dual.
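
To make the thresholds in the plan_class above easier to read, here is a minimal Python sketch of how such limits could be checked against a GPU. It rests on assumptions not stated in the thread: that compute capability is encoded as major * 100 + minor (so 6.0 becomes 600) and the CUDA version as major * 1000 + minor * 10 (so 12.0 becomes 12000), and that both version bounds are inclusive. The helper names are made up for illustration, only three of the fields are checked, and this is not the scheduler's actual code.

```python
import xml.etree.ElementTree as ET

# Subset of the plan_class posted above (only the fields checked here)
PLAN_CLASS_XML = """
<plan_class>
  <name>cuda120</name>
  <min_cuda_version>12000</min_cuda_version>
  <max_cuda_version>13000</max_cuda_version>
  <min_nvidia_compcap>800</min_nvidia_compcap>
</plan_class>
"""


def encode_compcap(major, minor):
    # ASSUMED encoding: 6.0 -> 600, 8.6 -> 806
    return major * 100 + minor


def encode_cuda_version(major, minor):
    # ASSUMED encoding: 12.0 -> 12000, 10.1 -> 10010
    return major * 1000 + minor * 10


def gpu_matches(plan_class_xml, compcap, cuda_version):
    """Hypothetical check of a GPU against the numeric limits of a plan_class."""
    root = ET.fromstring(plan_class_xml.strip())

    def limit(tag):
        # Return the integer value of a limit, or None if the tag is absent
        node = root.find(tag)
        return int(node.text) if node is not None else None

    min_cc = limit("min_nvidia_compcap")
    min_cuda = limit("min_cuda_version")
    max_cuda = limit("max_cuda_version")

    if min_cc is not None and compcap < min_cc:
        return False
    if min_cuda is not None and cuda_version < min_cuda:
        return False
    if max_cuda is not None and cuda_version > max_cuda:
        return False
    return True


cuda_12_0 = encode_cuda_version(12, 0)
print("P100 (6.0):", gpu_matches(PLAN_CLASS_XML, encode_compcap(6, 0), cuda_12_0))
print("cc 8.6 GPU:", gpu_matches(PLAN_CLASS_XML, encode_compcap(8, 6), cuda_12_0))
```

Read literally under these assumptions, a min_nvidia_compcap of 800 would exclude every GPU below compute capability 8.0, not only the P100, so the value may need adjusting if the other Pascal and Turing cards are meant to keep receiving cuda120 tasks.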

crashtech
Message 9733 - Posted: 13 Mar 2024, 20:07:55 UTC - in response to Message 9732.
Last modified: 13 Mar 2024, 20:10:24 UTC

Here is the host with the 1070ti:
https://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=225572
All of the valid tasks were completed by the 1070ti, so it is apparently able to crunch cuda120.

rebirther
Message 9734 - Posted: 13 Mar 2024, 20:14:36 UTC - in response to Message 9733.
Last modified: 13 Mar 2024, 20:24:39 UTC

Then it's only working with 2080 cards; the 1070ti also has cc 6.1. That's new for me.

<app_config>
  <app_version>
    <plan_class>cuda100</plan_class>
    <host_summary_regex>Tesla P100</host_summary_regex>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
  <app_version>
    <plan_class>cuda120</plan_class>
    <host_summary_regex>GeForce GTX 1070 Ti</host_summary_regex>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
<app_config>

rebirther
Message 9735 - Posted: 13 Mar 2024, 20:32:09 UTC - in response to Message 9734.

If it's not running, I will add a special plan_class to the server so we don't need an app_config. With the current sample, dual GPUs should not be a problem. The wrapper manages that.

rebirther
Message 9736 - Posted: 13 Mar 2024, 20:45:36 UTC

I have added the cuda100 plan_class for the Tesla P100 to the server, which is better for both sides. You can get rid of the app_config. Let's test it.

crashtech
Message 9737 - Posted: 13 Mar 2024, 20:59:51 UTC - in response to Message 9735.
Last modified: 13 Mar 2024, 21:09:24 UTC

I'm sorry, after getting rid of the app_config and resetting the project, the client is still getting cuda120 tasks which error out immediately on the P100.
Perhaps it's not worth the trouble, not many people have these old compute GPUs.

rebirther
Message 9738 - Posted: 13 Mar 2024, 21:15:09 UTC - in response to Message 9737.
Last modified: 13 Mar 2024, 21:17:06 UTC

Yeah, I see. We have two plan_classes and both are working, so we need all of it in the app_config manually to override the settings.

<app_config>
  <app_version>
    <plan_class>cuda100</plan_class>
    <host_summary_regex>Tesla P100</host_summary_regex>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
  <app_version>
    <plan_class>cuda120</plan_class>
    <host_summary_regex>GeForce GTX 1070 Ti</host_summary_regex>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
<app_config>

crashtech
Message 9739 - Posted: 13 Mar 2024, 21:19:15 UTC

It could be that since the P100 is the second GPU, its presence isn't being properly detected on the server side. That's just a wild guess based on the fact that the list of hosts always claims the computer has x number of whatever the first GPU is, instead of what's really there.
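
If that guess is right, it would also explain why the host_summary_regex entries suggested earlier in the thread do not pick up the P100 in a mixed machine. The Python sketch below only illustrates the idea: the summary strings are hypothetical stand-ins, since the thread does not show what the server actually stores for these hosts, and the helper name is made up.

```python
import re


def summary_mentions(pattern, host_summary):
    """Return True if the regex pattern matches anywhere in the host summary."""
    return re.search(pattern, host_summary) is not None


# HYPOTHETICAL summaries: assume the server records only the first GPU's name
# together with the GPU count, as the host list appears to do.
MIXED_HOST_SUMMARY = "2 x GeForce GTX 1070 Ti"  # really 1070 Ti + P100
P100_HOST_SUMMARY = "2 x Tesla P100"            # two P100s

for label, summary in [("mixed host", MIXED_HOST_SUMMARY),
                       ("P100-only host", P100_HOST_SUMMARY)]:
    print(label,
          "| matches 'Tesla P100':",
          summary_mentions("Tesla P100", summary),
          "| matches 'GeForce GTX 1070 Ti':",
          summary_mentions("GeForce GTX 1070 Ti", summary))
```

Under that assumption, a rule keyed on "Tesla P100" would never fire on the mixed host, because the P100 never appears in its summary.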

rebirther
Message 9740 - Posted: 13 Mar 2024, 21:23:39 UTC - in response to Message 9739.

Yes, but with the app_config we can narrow down the possibilities.

crashtech
Message 9741 - Posted: 14 Mar 2024, 2:22:20 UTC - in response to Message 9738.

I added a missing slash in the closing tag, but even then it did not work for me, only cuda120 tasks were downloaded.

crashtech
Message 9742 - Posted: 14 Mar 2024, 2:53:35 UTC - in response to Message 9741.

Also my P100-only machines are having problems again despite having got them working with the initial app_config you posted.

rebirther
Message 9743 - Posted: 14 Mar 2024, 3:15:24 UTC - in response to Message 9742.

Yes, you need the app_config, plus a second entry for the other card if you have two.

crashtech
Message 9744 - Posted: 14 Mar 2024, 13:58:45 UTC

I'm trying to say that the situation is worse now for some reason. With your help, I was able to process tasks for a time (on P100 only computers) using the first app_config you posted, but now that one no longer helps, nor does the second app_config with plan_class help either. I can't process tasks on P100s at all now, whether they are alone in the computer, in pairs, or mixed with another GPU. Nothing works now.

My humble request would be for you to revert whatever changes were made to the server side, so I can see if at least the P100-only computers can process tasks again. That would be a large improvement.

rebirther
Message 9745 - Posted: 14 Mar 2024, 14:23:31 UTC - in response to Message 9744.

The server plan_class is back to the original. You need the app_config from the first post.



Copyright © 2014-2024 BOINC Confederation / rebirther