Message boards : Number crunching : Nvidia Tesla P100 Problems
Author | Message |
---|---|
"Was v31 compiled with explicit support for CC_6.0 in the NVCCFLAGS section of your makefile or build script?" Yes, it was, but with cc5. | |
ID: 9770 | |
You need to add support for every cc level explicitly: --generate-code arch=compute_50,code=sm_50 --generate-code arch=compute_52,code=sm_52 --generate-code arch=compute_60,code=sm_60 --generate-code arch=compute_61,code=sm_61 --generate-code arch=compute_70,code=sm_70 --generate-code arch=compute_75,code=sm_75 --generate-code arch=compute_72,code=sm_72 --generate-code arch=compute_80,code=sm_80 --generate-code arch=compute_86,code=sm_86 --generate-code arch=compute_89,code=sm_89 plus whatever other flags you have there. | |
ID: 9771 | |
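The flag list in the post above follows one fixed pattern per compute capability, so it can be generated instead of typed by hand. The following is a minimal sketch under assumptions: the helper names `gencode_flags` and `nvcc_flags` and the `-O3` extra flag are hypothetical and not taken from the project's build script; only the `--generate-code arch=compute_XX,code=sm_XX` form comes from the thread.

```python
# Hypothetical helper (not from the project's build script): build one
# --generate-code flag per compute capability, in the form quoted in the thread.

# Compute capabilities named in the post above.
CC_LEVELS = ["5.0", "5.2", "6.0", "6.1", "7.0", "7.2", "7.5", "8.0", "8.6", "8.9"]


def gencode_flags(cc_levels):
    """Return a list with one nvcc --generate-code flag per compute capability."""
    flags = []
    for cc in cc_levels:
        digits = cc.replace(".", "")  # "6.0" -> "60"
        flags.append(f"--generate-code arch=compute_{digits},code=sm_{digits}")
    return flags


def nvcc_flags(cc_levels, extra=()):
    """Join the generated flags with any other flags the build already uses."""
    return " ".join(gencode_flags(cc_levels) + list(extra))


if __name__ == "__main__":
    # "-O3" stands in for "whatever other flags you have there".
    print(nvcc_flags(CC_LEVELS, extra=["-O3"]))
```

Keeping the list of levels in one place makes it harder to forget a level when a new GPU generation needs to be added later.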
"you need to add support for every cc level explicitly." OK, noted, thanks. We need to change that again. | |
ID: 9772 | |
v32 recompiled with the fix, I hope... | |
ID: 9773 | |
I wanted to try v32, but even after resetting the project in the client, the server only sent me v28 cuda100 tasks. Oddly enough, the cuda100 tasks used to work on the P100 but now they don't. So I put this as an app_config.xml:
<app_config>
<app_version>
<app_name>TF</app_name>
<plan_class>cuda120</plan_class>
<min_gpu_ram_mb>384</min_gpu_ram_mb>
<gpu_ram_used_mb>384</gpu_ram_used_mb>
<gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
<cpu_frac>0.01</cpu_frac>
</app_version>
</app_config>
Now I can run the new v32 and it works fine, but the app_config has to be used at least on this particular computer. I have not tried the rest yet. EDIT: Actually, I spoke too soon. This app_config does NOT prevent the download of v28 cuda100 tasks. So I'm still not able to run all my GPUs. | |
ID: 9775 | |
I saw the new version download. Works on both my PCs. | |
ID: 9776 | |
"I saw the new version download. Works on both my PCs." And you have P100s? | |
ID: 9777 | |
You most likely know I don't. | |
ID: 9778 | |
I actually had no idea, your computers are hidden. | |
ID: 9779 | |
"I wanted to try v32, but even after resetting the project in the client, the server only sent me v28 cuda100 tasks. Oddly enough, the cuda100 tasks used to work on the P100 but now they don't. So I put this as an app_config.xml:" It should work without an app_config file, shouldn't it? | |
ID: 9780 | |
Your cuda100 app P100s are now failing with this: ./mfaktc.exe: error while loading shared libraries: libcudart.so.10.1: cannot open shared object file: No such file or directory. Maybe you removed the CUDA toolkit? | |
ID: 9781 | |
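The loader error quoted above means the dynamic linker could not find `libcudart.so.10.1` on the host. A quick way to check for that condition is to try loading the library directly. This is only a sketch: the function name `can_load` is hypothetical, and the result depends on the library search path of the machine where it runs, which may differ from the environment the BOINC client gives the app.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the shared object cannot be found or opened,
        # the same condition the mfaktc error message reports.
        return False
    return True


if __name__ == "__main__":
    name = "libcudart.so.10.1"  # library named in the error message above
    print(f"{name}: {'found' if can_load(name) else 'not found'}")
```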
"your cuda100 app P100s are now failing with this:" https://srbase.my-firewall.org/sr5/download/libcudart.so.10.1 I will update the app to include this file. Update: The missing library file is now in the zip file of the v29 cuda100 app. | |
ID: 9782 | |
Thank you for all the work you agreed to do just to get one old model of GPU running, I appreciate it greatly. | |
ID: 9784 | |
All seems to be working well on P100s. I believe that in honor of this fix, I will put all my GPUs on TF for a while. | |
ID: 9856 | |
All cuda120 tasks error out on my V100s, in the same way that crash's P100s were erroring. | |
ID: 10028 | |
"all cuda120 tasks error out on my V100s. in the same way that crash's P100s were erroring." We can try to use cuda100 WUs with an app_config. That's the oldest app we are running.
<app_config>
<app_version>
<app_name>TF</app_name>
<plan_class>cuda100</plan_class>
<min_gpu_ram_mb>384</min_gpu_ram_mb>
<gpu_ram_used_mb>384</gpu_ram_used_mb>
<gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
<cpu_frac>0.01</cpu_frac>
</app_version>
</app_config> | |
ID: 10029 | |
"all cuda120 tasks error out on my V100s. in the same way that crash's P100s were erroring." Tried it; it didn't work. This app_config does not limit the project to sending only the cuda100 app. Same results on a Titan V (same CC 7.0): instant error. I'm guessing the problem is that the app was compiled without CC 7.0 support. | |
ID: 10030 | |
"all cuda120 tasks error out on my V100s. in the same way that crash's P100s were erroring." We can only try this:
<app_config>
<app_version>
<app_name>TF</app_name>
<plan_class>cuda111</plan_class>
<min_gpu_ram_mb>384</min_gpu_ram_mb>
<gpu_ram_used_mb>384</gpu_ram_used_mb>
<gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
<cpu_frac>0.01</cpu_frac>
</app_version>
</app_config>
Or the app was not compiled with cc7. The P100 was working, right? | |
ID: 10031 | |
"all cuda120 tasks error out on my V100s. in the same way that crash's P100s were erroring."
This method of using an app_config to try to force which app gets sent does not work. I've never had it work reliably, and it didn't work for crashtech when you suggested it last time. An app_config is designed to apply configuration to apps that are already downloaded. It has no influence on what the project server actually sends me.
I think your app was not compiled with CC 7.0 support, and if you didn't do it on the latest app, I would guess you didn't do it on the older apps either. P100 works as far as I know, but that's because you fixed it by recompiling the app to include CC 6.0 support.
When compiling CUDA apps you cannot just put a minimum value. You need to put ALL values that you want to support. To support everything from CC 5.0+ you need to explicitly state every version (5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.6, 8.9):
5.0 - GM107/GM108 Maxwell
5.2 - GM200 Maxwell
6.0 - GP100 Pascal
6.1 - GP102+ Pascal
7.0 - GV100 Volta (Titan V and V100)
7.5 - TU102+ Turing
8.0 - GA100 Ampere
8.6 - GA102+ Ampere
8.9 - AD102+ Ada Lovelace
9.0 - GH100 Hopper
And you're going to have to recompile again in the future to add 10.x something if you want to support the upcoming Blackwell GPUs. | |
ID: 10032 | |
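The advice above (list every compute capability you want to support) can be turned into a small pre-release check. This is a sketch under stated assumptions: it follows the thread's rule of thumb and treats a card as covered only if its exact compute capability was listed at build time, and the build lists in the usage lines are hypothetical examples, not the project's actual flags. The names `ARCH_BY_CC`, `is_covered`, and `missing_levels` are made up for illustration.

```python
# Compute capability -> architecture, as listed in the post above.
ARCH_BY_CC = {
    "5.0": "Maxwell",
    "5.2": "Maxwell",
    "6.0": "Pascal",
    "6.1": "Pascal",
    "7.0": "Volta",
    "7.5": "Turing",
    "8.0": "Ampere",
    "8.6": "Ampere",
    "8.9": "Ada Lovelace",
    "9.0": "Hopper",
}


def is_covered(device_cc, built_for):
    """Simplifying assumption from the thread: exact match only."""
    return device_cc in set(built_for)


def missing_levels(built_for, wanted=ARCH_BY_CC):
    """Return the wanted compute capabilities the build does not list."""
    built = set(built_for)
    return [cc for cc in wanted if cc not in built]


if __name__ == "__main__":
    # Hypothetical build list: Maxwell and Pascal only.
    build = ["5.0", "5.2", "6.0", "6.1"]
    print("P100 (6.0) covered:", is_covered("6.0", build))
    print("V100 (7.0) covered:", is_covered("7.0", build))
    for cc in missing_levels(build):
        print(f"missing {cc} ({ARCH_BY_CC[cc]})")
```

With the hypothetical list above, the P100 is covered and the V100 is not, which matches the pattern of failures reported in this thread.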
"all cuda120 tasks error out on my V100s. in the same way that crash's P100s were erroring." Yes, this was only a test of the current plan_class with cuda111, since its config differs in a few places. It looks like the app was compiled with cc6 but not cc7. For cc10 we will have a separate app, of course. | |
ID: 10033 | |