Message boards : Number crunching : Nvidia Tesla P100 Problems
Hi,

ID: 9726
Hi, which CUDA version is installed? To get only cuda100 tasks and avoid any issues, try this app_config:

<app_config>
    <app_version>
        <app_name>TF</app_name>
        <plan_class>cuda100</plan_class>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
</app_config>

ID: 9727
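As an aside on where this file goes: the BOINC client looks for app_config.xml inside the project's directory under the data dir, and the directory name is derived from the project's master URL. A small sketch of that naming convention (my recollection of BOINC's rule, not taken from this thread; verify against your own projects/ folder):

```python
# BOINC names each project directory by "escaping" the master URL:
# the scheme is dropped, the trailing slash removed, and remaining
# slashes become underscores. (Sketch of the convention from memory;
# check your client's projects/ directory to confirm.)
def project_dir_name(master_url: str) -> str:
    name = master_url.split("://", 1)[-1]  # drop "http://" / "https://"
    name = name.rstrip("/")                # drop any trailing slash
    return name.replace("/", "_")          # path separators -> "_"

print(project_dir_name("https://srbase.my-firewall.org/sr5/"))
# -> srbase.my-firewall.org_sr5
```

app_config.xml then goes in `<BOINC data dir>/projects/<that name>/`, and `boinccmd --read_cc_config` makes a running client pick it up without a restart.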
Hi,

$ nvidia-smi
Wed Mar 13 06:31:10 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:02:00.0 Off | 0 |
| N/A 52C P0 120W / 250W | 1843MiB / 16384MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1599 G /usr/lib/xorg/Xorg 187MiB |
| 0 N/A N/A 2046 G xfwm4 3MiB |
| 0 N/A N/A 16247 G ...2gtk-4.0/WebKitWebProcess 8MiB |
| 0 N/A N/A 17960 G ...2gtk-4.0/WebKitWebProcess 8MiB |
| 0 N/A N/A 1103548 C ...64-pc-linux-gnu__cuda1222 816MiB |
| 0 N/A N/A 1109316 C ...64-pc-linux-gnu__cuda1222 816MiB |
+-----------------------------------------------------------------------------+
Thank you for the app_config. I hope an alternate solution can be found that will allow me to run all the TF apps!

ID: 9728
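As a side note, drivers of this series can report compute capability directly via `nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader`. A small sketch that parses such output; the sample line below is an assumption of what a P100 would print, not captured from this host:

```python
# Parse output of:
#   nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# Each line is "name, compute capability"; a P100 should report 6.0.
# The sample stands in for live output (run the command yourself).
sample = "Tesla P100-PCIE-16GB, 6.0"

def parse_gpus(csv_text: str) -> list[tuple[str, float]]:
    gpus = []
    for line in csv_text.strip().splitlines():
        name, cap = (field.strip() for field in line.split(","))
        gpus.append((name, float(cap)))
    return gpus

print(parse_gpus(sample))
# -> [('Tesla P100-PCIE-16GB', 6.0)]
```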
The Tesla P100 has an older compute capability, which the cuda120 app was not compiled for. The plan_class handling is a bit tricky; overriding the server-side plan_class with an app_config is the best way to avoid any issues. Perhaps we need a recompiled cuda120 app to support older cards.

ID: 9729
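(Editor's note on the recompilation point: which GPU generations a CUDA binary can run on is fixed at build time by the `-gencode` flags passed to nvcc. A hypothetical build line, with placeholder file and output names rather than the project's actual build, that would embed Pascal code alongside newer architectures:)

```shell
# Hypothetical nvcc invocation; "tf.cu"/"tf_app" are placeholder names.
# sm_60 covers the P100, sm_61 the GTX 10xx cards, and the final
# compute_80 entry embeds PTX for forward compatibility on newer GPUs.
nvcc -O2 tf.cu -o tf_app \
     -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_80,code=compute_80
```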
> The Tesla P100 has an older compute capability, which the cuda120 app was not compiled for. The plan_class handling is a bit tricky; overriding the server-side plan_class with an app_config is the best way to avoid any issues. Perhaps we need a recompiled cuda120 app to support older cards.

Thanks for your reply; this would seem to confirm that there isn't anything different I can do with my computer configurations other than the cuda120 exclusion. Interestingly, my other Pascal GPUs (1080 Ti and 1070 Ti) have no problems with cuda120, so there may be something unique about the datacenter Pascal variant that is causing this.

ID: 9730
> ...To get only cuda100 tasks and avoid any issues, try this app_config

For some reason this app_config does not work in dual-GPU systems where the GPUs don't match, e.g. a 2080 Ti and a P100, or a 1070 Ti and a P100. Single P100s work now, and dual P100s also.

ID: 9731
> ...To get only cuda100 tasks and avoid any issues, try this app_config

Tricky. Do you get work for the 1070 Ti too? In that case all of your GPUs should be getting only cuda100, as intended, so we need a solution. I can try to add a plan_class just for the Tesla P100. Let's see how we can fix this so we don't need an app_config, without causing other issues with different CUDA versions and drivers.

<plan_class>
    <name>cuda120</name>
    <gpu_type>nvidia</gpu_type>
    <cuda/>
    <min_cuda_version>12000</min_cuda_version>
    <max_cuda_version>13000</max_cuda_version>
    <min_nvidia_compcap>800</min_nvidia_compcap>
    <min_driver_version>52560</min_driver_version>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
</plan_class>

Try adding this to your app_config too, so cuda100 goes to the Tesla and the rest to your other cards, even when running in dual.

ID: 9732
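(Worth noting for readers: in BOINC plan class specs, `min_nvidia_compcap` is the compute capability encoded as major*100 + minor, and a GPU matches only if its own value is at least that. A simplified sketch of the gating, my illustration rather than the actual scheduler code, using values from this thread:)

```python
# Simplified sketch of plan_class compute-capability gating; this is
# an illustration, not BOINC's actual scheduler code. Compute
# capability is encoded as major*100 + minor (so cc 6.0 -> 600).
def matches(gpu_compcap: int, min_nvidia_compcap: int) -> bool:
    return gpu_compcap >= min_nvidia_compcap

cards = {"Tesla P100": 600, "GTX 1070 Ti": 610}
for name, cc in cards.items():
    ok = matches(cc, 800)  # min_nvidia_compcap from the class above
    print(f"{name} (cc {cc}): eligible for cuda120? {ok}")
```

With `min_nvidia_compcap` set to 800 as in the class above, both Pascal cards fall below the threshold, which would keep them off cuda120.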
> ...To get only cuda100 tasks and avoid any issues, try this app_config

Here is the host with the 1070 Ti: https://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=225572. All of the valid tasks were completed by the 1070 Ti, so it is apparently able to crunch cuda120.

ID: 9733
> ...To get only cuda100 tasks and avoid any issues, try this app_config

So it was supposed to only work with 2080 cards, yet the 1070 Ti has cc 6.1. That's new to me.

<app_config>
    <app_version>
        <plan_class>cuda100</plan_class>
        <host_summary_regex>Tesla P100</host_summary_regex>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
    <app_version>
        <plan_class>cuda120</plan_class>
        <host_summary_regex>GeForce GTX 1070 Ti</host_summary_regex>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
<app_config>

ID: 9734
> ...To get only cuda100 tasks and avoid any issues, try this app_config

If it's not running, I will add a special plan_class on the server so we don't need an app_config. In the current sample, dual GPU should not be a problem; the wrapper manages that.

ID: 9735
I have added the cuda100 plan_class for the Tesla P100 to the server; that is better for both sides. You can get rid of the app_config. Let's test it.

ID: 9736
I'm sorry, after getting rid of the app_config and resetting the project, the client is still getting cuda120 tasks which error out immediately on the P100.

ID: 9737
> I'm sorry, after getting rid of the app_config and resetting the project, the client is still getting cuda120 tasks which error out immediately on the P100.

Yeah, I see. We have two plan_classes and both are working, so we need to override the settings manually with everything in the app_config.

<app_config>
    <app_version>
        <plan_class>cuda100</plan_class>
        <host_summary_regex>Tesla P100</host_summary_regex>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
    <app_version>
        <plan_class>cuda120</plan_class>
        <host_summary_regex>GeForce GTX 1070 Ti</host_summary_regex>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
<app_config>

ID: 9738
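(A quick well-formedness check before deploying an app_config can catch typos such as the closing tag above, which is missing its slash. A sketch using Python's standard library; the snippet strings are minimal stand-ins, not the full config:)

```python
# Check an app_config for XML well-formedness before giving it to the
# BOINC client; a parse error usually means a typo like an unclosed tag.
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Minimal stand-ins for the configs in this thread:
broken = "<app_config><plan_class>cuda100</plan_class><app_config>"
fixed  = "<app_config><plan_class>cuda100</plan_class></app_config>"
print(is_well_formed(broken), is_well_formed(fixed))
# -> False True
```

In practice, `ET.parse("app_config.xml")` runs the same check directly on the file.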
It could be that since the P100 is the second GPU, its presence isn't being properly detected on the server side. That's just a wild guess, based on the fact that the list of hosts always claims the computer has x number of whatever the first GPU is, instead of what's really there.

ID: 9739
> It could be that since the P100 is the second GPU, its presence isn't being properly detected on the server side. That's just a wild guess, based on the fact that the list of hosts always claims the computer has x number of whatever the first GPU is, instead of what's really there.

Yes, but with the app_config we can reduce the possibilities.

ID: 9740
> Yeah, I see. We have two plan_classes and both are working, so we need to override the settings manually with everything in the app_config.

I added the missing slash in the closing tag, but even then it did not work for me; only cuda120 tasks were downloaded.

ID: 9741
Also, my P100-only machines are having problems again, despite having gotten them working with the initial app_config you posted.

ID: 9742
> Also, my P100-only machines are having problems again, despite having gotten them working with the initial app_config you posted.

Yes, but you also need a second entry in the app_config for the other card if you have two.

ID: 9743
I'm trying to say that the situation is worse now for some reason. With your help I was able to process tasks for a time (on P100-only computers) using the first app_config you posted, but now that one no longer helps, and neither does the second app_config with the plan_class. I can't process tasks on P100s at all now, whether they are alone in the computer, in pairs, or mixed with another GPU. Nothing works now.

ID: 9744
> I'm trying to say that the situation is worse now for some reason. With your help I was able to process tasks for a time (on P100-only computers) using the first app_config you posted, but now that one no longer helps, and neither does the second app_config with the plan_class. I can't process tasks on P100s at all now, whether they are alone in the computer, in pairs, or mixed with another GPU. Nothing works now.

The server plan_class is back to the original. You need the app_config from the first post.

ID: 9745