Nvidia Tesla P100 Problems

Author	Message
crashtech · Joined: 10 Apr 19 · Posts: 29 · Credit: 1,447,795,484 · RAC: 6,433,165 · Message 9726 · Posted: 13 Mar 2024, 2:06:56 UTC
Hi, I have many Linux computers that are used for BOINC. They are all Mint Linux 20 or 20.3, have glibc 2.31, Nvidia driver 525 or 535, and have nvidia-cuda-toolkit installed. All of the GPUs run TF just fine except for the P100s. It seems that the P100s can run cuda100 tasks but not cuda120 tasks. Does anyone know why this might be? If I can't fix them, I might have to find a way to only process cuda100 tasks, but this would be a last resort.
ID: 9726

rebirther · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 2 Jan 13 · Posts: 7825 · Credit: 44,534,674 · RAC: 136 · Message 9727 · Posted: 13 Mar 2024, 2:15:57 UTC · in response to Message 9726 · Last modified: 13 Mar 2024, 2:20:42 UTC
Which CUDA version is installed? To get only cuda100 tasks and avoid any issues, try this app_config:

<app_config>
  <app_version>
    <app_name>TF</app_name>
    <plan_class>cuda100</plan_class>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
</app_config>
ID: 9727

crashtech · Message 9728 · Posted: 13 Mar 2024, 12:24:52 UTC · in response to Message 9727 · Last modified: 13 Mar 2024, 12:27:29 UTC
Hi,

Thanks for your reply. Here is the cuda version of one of the P100 hosts:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Also nvidia-smi:

$ nvidia-smi
Wed Mar 13 06:31:10 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   52C    P0   120W / 250W |   1843MiB / 16384MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1599      G   /usr/lib/xorg/Xorg                187MiB |
|    0   N/A  N/A      2046      G   xfwm4                               3MiB |
|    0   N/A  N/A     16247      G   ...2gtk-4.0/WebKitWebProcess        8MiB |
|    0   N/A  N/A     17960      G   ...2gtk-4.0/WebKitWebProcess        8MiB |
|    0   N/A  N/A   1103548      C   ...64-pc-linux-gnu__cuda1222      816MiB |
|    0   N/A  N/A   1109316      C   ...64-pc-linux-gnu__cuda1222      816MiB |
+-----------------------------------------------------------------------------+

Thank you for the app_config. I hope an alternate solution can be found that will allow me to run all the TF apps!
ID: 9728
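
The two version numbers in the output above measure different things: `nvcc --version` reports the locally installed toolkit (10.1 here), while the nvidia-smi banner reports the highest CUDA version the driver supports (12.0 here). As a minimal sketch, assuming the banner keeps the layout shown in this post, the driver-side values could be extracted with a small Python helper. The name `parse_smi_banner` is made up for illustration and is not part of any NVIDIA or BOINC tool.

```python
import re


def parse_smi_banner(banner: str) -> dict:
    """Extract the driver version and driver-supported CUDA version
    from the banner line of nvidia-smi output."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", banner
    )
    if match is None:
        raise ValueError("banner does not look like nvidia-smi output")
    return {"driver": match.group(1), "cuda": match.group(2)}


# Banner line as posted in this thread.
banner = "| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |"
print(parse_smi_banner(banner))
```

Since the driver reports 12.0, the old 10.1 toolkit on the host is unlikely to be what stops the cuda120 tasks, although the thread itself does not confirm this.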

rebirther · Message 9729 · Posted: 13 Mar 2024, 14:31:48 UTC · in response to Message 9728
The Tesla P100 should have an older compute capability, one that was not compiled into the cuda120 app. The plan_class is a bit tricky; overriding the server-side plan_class with an app_config is the best way to avoid any issues. Perhaps we need a recompiled cuda120 app to support older cards.
ID: 9729

crashtech · Message 9730 · Posted: 13 Mar 2024, 15:26:07 UTC · in response to Message 9729
Thanks for your reply. This would seem to confirm that there isn't anything different I can do with my computer configurations other than the cuda120 exclusion. Interestingly, my other Pascal GPUs (1080ti and 1070ti) do not have any problems with cuda120, so it seems there may be something unique about the datacenter Pascal GPU that is causing this problem.
ID: 9730
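
One possible explanation for the difference between the P100 and the GeForce Pascal cards: the Tesla P100 (GP100) has compute capability 6.0, while the GTX 1070 Ti and 1080 Ti are 6.1 and the RTX 2080 Ti is 7.5. CUDA machine code built for X.y runs only on devices X.z with z >= y, and embedded PTX can be JIT-compiled only for devices at or above its own version. The thread does not say which targets the cuda120 app was built for, so the target lists below are purely hypothetical. The Python sketch only shows how a build whose lowest target is 6.1 would run on the GeForce cards and fail on the P100.

```python
# Compute capabilities (major, minor) of the cards mentioned in this thread.
COMPUTE_CAPABILITY = {
    "Tesla P100": (6, 0),
    "GTX 1070 Ti": (6, 1),
    "GTX 1080 Ti": (6, 1),
    "RTX 2080 Ti": (7, 5),
}


def can_run(device_cc, cubin_targets, ptx_targets):
    """Return True if a binary with these embedded targets can run on device_cc."""
    # Machine code (cubin): same major version, device minor >= target minor.
    for target in cubin_targets:
        if target[0] == device_cc[0] and device_cc[1] >= target[1]:
            return True
    # PTX: can be JIT-compiled for any device at or above the PTX version.
    return any(device_cc >= target for target in ptx_targets)


# HYPOTHETICAL build targets, chosen only to illustrate the failure mode.
cubin_targets = [(6, 1), (7, 5)]
ptx_targets = [(7, 5)]

for name, cc in COMPUTE_CAPABILITY.items():
    print(name, can_run(cc, cubin_targets, ptx_targets))
```

Under these assumed targets only the P100 is rejected, which matches the symptoms reported here, but only the app's actual build settings can confirm it.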

crashtech · Message 9731 · Posted: 13 Mar 2024, 19:26:48 UTC · in response to Message 9727
For some reason this app_config does not work in dual-GPU systems where the GPUs don't match, e.g. a 2080ti and a P100, or a 1070ti and a P100. Single P100s work now, and dual P100s also.
ID: 9731

rebirther · Message 9732 · Posted: 13 Mar 2024, 19:50:16 UTC · in response to Message 9731 · Last modified: 13 Mar 2024, 20:08:02 UTC
Tricky. Do you get work for the 1070ti too? In that case all of your GPUs should be getting only cuda100, as intended. So we need a solution. I can try to add a plan_class only for the Tesla P100. Let's see how we can fix this so we don't need an app_config, although that brings other issues with different CUDA versions and drivers.

<plan_class>
  <name>cuda120</name>
  <gpu_type>nvidia</gpu_type>
  <cuda/>
  <min_cuda_version>12000</min_cuda_version>
  <max_cuda_version>13000</max_cuda_version>
  <min_nvidia_compcap>800</min_nvidia_compcap>
  <min_driver_version>52560</min_driver_version>
  <min_gpu_ram_mb>384</min_gpu_ram_mb>
  <gpu_ram_used_mb>384</gpu_ram_used_mb>
  <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
  <cpu_frac>0.01</cpu_frac>
</plan_class>

Try adding this to the app_config too, so that cuda100 goes to the Tesla and the rest goes to your other cards when running dual GPUs.
ID: 9732
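
The numeric fields in the plan_class above are packed integers. Assuming BOINC's usual encodings (compute capability and driver version as major*100 + minor, CUDA version as major*1000 + minor*10; please check the BOINC plan class documentation before relying on this), the values can be decoded and compared as in the rough Python sketch below. All helper names are made up for illustration.

```python
def encode_compcap(major: int, minor: int) -> int:
    # Assumed encoding: 6.0 -> 600, 8.0 -> 800
    return major * 100 + minor


def encode_cuda(major: int, minor: int) -> int:
    # Assumed encoding: 12.0 -> 12000
    return major * 1000 + minor * 10


def encode_driver(major: int, minor: int) -> int:
    # Assumed encoding: 525.60 -> 52560
    return major * 100 + minor


# Minimum compute capability copied from the plan_class posted above.
MIN_NVIDIA_COMPCAP = 800


def meets_compcap(major: int, minor: int) -> bool:
    """Compare a card's compute capability against the posted minimum."""
    return encode_compcap(major, minor) >= MIN_NVIDIA_COMPCAP


print(encode_cuda(12, 0))      # matches min_cuda_version in the post
print(encode_driver(525, 60))  # matches min_driver_version in the post
for name, cc in [("Tesla P100", (6, 0)), ("GTX 1070 Ti", (6, 1)), ("RTX 2080 Ti", (7, 5))]:
    print(name, meets_compcap(*cc))
```

If the encoding assumption holds, a minimum of 800 corresponds to compute capability 8.0, which is above every card mentioned in this thread, so the value may be worth double-checking before the plan_class is used as posted.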

crashtech · Message 9733 · Posted: 13 Mar 2024, 20:07:55 UTC · in response to Message 9732 · Last modified: 13 Mar 2024, 20:10:24 UTC
Here is the host with the 1070ti: https://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=225572

All of the valid tasks were completed by the 1070ti; it is apparently able to crunch cuda120.
ID: 9733

rebirther · Message 9734 · Posted: 13 Mar 2024, 20:14:36 UTC · in response to Message 9733 · Last modified: 13 Mar 2024, 20:24:39 UTC
Then it's only working with 2080 cards; the 1070ti also has cc6.1. That's new for me.

<app_config>
  <app_version>
    <plan_class>cuda100</plan_class>
    <host_summary_regex>Tesla P100</host_summary_regex>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
  <app_version>
    <plan_class>cuda120</plan_class>
    <host_summary_regex>GeForce GTX 1070 Ti</host_summary_regex>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
<app_config>
ID: 9734

rebirther · Message 9735 · Posted: 13 Mar 2024, 20:32:09 UTC · in response to Message 9734
If it's not running, I will add a special plan_class to the server so we don't need an app_config. With the current sample, dual GPUs should not be a problem; the wrapper manages that.
ID: 9735

rebirther · Message 9736 · Posted: 13 Mar 2024, 20:45:36 UTC
I have added the cuda100 plan_class for Tesla P100 to the server, which is better for both sides. You can get rid of the app_config. Let's test it.
ID: 9736

crashtech · Message 9737 · Posted: 13 Mar 2024, 20:59:51 UTC · in response to Message 9735 · Last modified: 13 Mar 2024, 21:09:24 UTC
I'm sorry, after getting rid of the app_config and resetting the project, the client is still getting cuda120 tasks which error out immediately on the P100. Perhaps it's not worth the trouble, not many people have these old compute GPUs.
ID: 9737

rebirther · Message 9738 · Posted: 13 Mar 2024, 21:15:09 UTC · in response to Message 9737 · Last modified: 13 Mar 2024, 21:17:06 UTC
Yeah, I see. We have two plan_classes and both are working, so we need all of it in the app_config manually to override the settings.

<app_config>
  <app_version>
    <plan_class>cuda100</plan_class>
    <host_summary_regex>Tesla P100</host_summary_regex>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
  <app_version>
    <plan_class>cuda120</plan_class>
    <host_summary_regex>GeForce GTX 1070 Ti</host_summary_regex>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
<app_config>
ID: 9738

crashtech · Message 9739 · Posted: 13 Mar 2024, 21:19:15 UTC
It could be that since the P100 is the second GPU, its presence isn't being properly detected on the server side. That's just a wild guess based on the fact that the list of hosts always claims the computer has x number of whatever the first GPU is, instead of what's really there.
ID: 9739

rebirther · Message 9740 · Posted: 13 Mar 2024, 21:23:39 UTC · in response to Message 9739
Yes, but with the app_config we can reduce the possibilities.
ID: 9740

crashtech · Message 9741 · Posted: 14 Mar 2024, 2:22:20 UTC · in response to Message 9738
I added a missing slash in the closing tag, but even then it did not work for me; only cuda120 tasks were downloaded.
ID: 9741
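
The missing slash mentioned above is the kind of error that is easy to catch before restarting the client. Here is a minimal sketch using Python's standard `xml.etree.ElementTree` parser. It checks only that the XML is well-formed, not that the tags are ones BOINC accepts, and the shortened snippets are placeholders rather than a complete app_config.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Shortened version of the posted config: the last tag lacks its slash,
# so the parser sees a second <app_config> that is never closed.
broken = (
    "<app_config><app_version><plan_class>cuda100</plan_class>"
    "</app_version><app_config>"
)
# Same snippet with the closing tag corrected.
fixed = (
    "<app_config><app_version><plan_class>cuda100</plan_class>"
    "</app_version></app_config>"
)

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

To check a real file, the same function could be given the contents of the app_config.xml in the project directory.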

crashtech · Message 9742 · Posted: 14 Mar 2024, 2:53:35 UTC · in response to Message 9741
Also my P100-only machines are having problems again despite having got them working with the initial app_config you posted.
ID: 9742

rebirther · Message 9743 · Posted: 14 Mar 2024, 3:15:24 UTC · in response to Message 9742
Yes, you need the app_config, plus a second entry for the other card if you have two.
ID: 9743

crashtech · Message 9744 · Posted: 14 Mar 2024, 13:58:45 UTC
I'm trying to say that the situation is worse now for some reason. With your help, I was able to process tasks for a time (on P100-only computers) using the first app_config you posted, but now that one no longer helps, nor does the second app_config with plan_class. I can't process tasks on P100s at all now, whether they are alone in the computer, in pairs, or mixed with another GPU. Nothing works now. My humble request would be for you to revert whatever changes were made to the server side, so I can see if at least the P100-only computers can process tasks again. That would be a large improvement.
ID: 9744

rebirther · Message 9745 · Posted: 14 Mar 2024, 14:23:31 UTC · in response to Message 9744
The server plan_class has the original settings. You need the app_config from the first post.
ID: 9745
