Nvidia Tesla P100 Problems

Message boards : Number crunching : Nvidia Tesla P100 Problems

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7346
Credit: 42,730,867
RAC: 5
Message 9770 - Posted: 18 Mar 2024, 19:27:09 UTC - in response to Message 9769.

was v31 compiled with explicit support for CC_6.0 in the NVCCFLAGS section of your makefile or build script?


the argument should be something similar to this:

--generate-code arch=compute_60,code=sm_60


this was what I was suggesting when I mentioned recompiling it to add 6.0 support.


yes, it was, but only with cc5

Ian&Steve C.
Joined: 7 May 23
Posts: 15
Credit: 455,077,624
RAC: 247,302
Message 9771 - Posted: 18 Mar 2024, 19:33:59 UTC - in response to Message 9770.
Last modified: 18 Mar 2024, 19:51:55 UTC

you need to add support for every cc level explicitly.

the whole section would be something like this:

--generate-code arch=compute_50,code=sm_50 \
--generate-code arch=compute_52,code=sm_52 \
--generate-code arch=compute_60,code=sm_60 \
--generate-code arch=compute_61,code=sm_61 \
--generate-code arch=compute_70,code=sm_70 \
--generate-code arch=compute_75,code=sm_75 \
--generate-code arch=compute_72,code=sm_72 \
--generate-code arch=compute_80,code=sm_80 \
--generate-code arch=compute_86,code=sm_86 \
--generate-code arch=compute_89,code=sm_89


plus whatever other flags you have there
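
Writing that list by hand is error-prone, so here is a minimal sketch of generating it instead. The function name `gencode_flags` and the variable `SUPPORTED_CCS` are hypothetical and not part of any build script in this thread; the capability list simply mirrors the flags above.

```python
def gencode_flags(compute_capabilities):
    """Return one --generate-code flag per compute capability, e.g. "6.0" -> sm_60."""
    flags = []
    for cc in compute_capabilities:
        major, minor = cc.split(".")
        arch = f"{major}{minor}"
        flags.append(f"--generate-code arch=compute_{arch},code=sm_{arch}")
    return flags


# Capability list mirroring the flags above (hypothetical variable name).
SUPPORTED_CCS = ["5.0", "5.2", "6.0", "6.1", "7.0", "7.2", "7.5", "8.0", "8.6", "8.9"]

if __name__ == "__main__":
    # Join with spaces to paste into NVCCFLAGS next to any other flags.
    print(" ".join(gencode_flags(SUPPORTED_CCS)))
```

Adding a new architecture later then means appending one entry to the list rather than editing a long flag string.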

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7346
Credit: 42,730,867
RAC: 5
Message 9772 - Posted: 18 Mar 2024, 19:51:54 UTC - in response to Message 9771.

ok, noticed, thx, we need to change that again

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7346
Credit: 42,730,867
RAC: 5
Message 9773 - Posted: 18 Mar 2024, 22:20:16 UTC

v32 recompiled with the fix, I hope...

crashtech
Joined: 10 Apr 19
Posts: 29
Credit: 613,234,834
RAC: 65,046
Message 9775 - Posted: 19 Mar 2024, 0:39:35 UTC
Last modified: 19 Mar 2024, 1:16:05 UTC

I wanted to try v32, but even after resetting the project in the client, the server only sent me v28 cuda100 tasks. Oddly enough, the cuda100 tasks used to work on the P100 but now they don't. So I put this as an app_config.xml:

<app_config>
  <app_version>
    <app_name>TF</app_name>
    <plan_class>cuda120</plan_class>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
</app_config>


Now I can run the new v32 and it works fine, but the app_config has to be used at least on this particular computer. I have not tried the rest yet.

EDIT: Actually, I spoke too soon. This app_config does NOT prevent the download of v28 cuda100 tasks. So I'm still not able to run all my GPUs.
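
For anyone who wants to produce the same file for a different plan class, here is a minimal sketch using Python's standard library. The element names and values are copied from the post above rather than from BOINC documentation, so whether the client honors each of them is an assumption; the helper names `build_app_config` and `to_pretty_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_app_config(plan_class):
    """Build the app_config tree from the post for the given plan class."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    # Element names and values are copied from the post, not from BOINC docs.
    settings = [
        ("app_name", "TF"),
        ("plan_class", plan_class),
        ("min_gpu_ram_mb", "384"),
        ("gpu_ram_used_mb", "384"),
        ("gpu_peak_flops_scale", "0.22"),
        ("cpu_frac", "0.01"),
    ]
    for tag, value in settings:
        ET.SubElement(version, tag).text = value
    return root


def to_pretty_xml(root):
    """Serialize with two-space indentation (ET.indent needs Python 3.9+)."""
    ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_pretty_xml(build_app_config("cuda120")))
```

As the EDIT above notes, a file like this did not stop the server from sending v28 cuda100 tasks, so the sketch only covers writing the file, not controlling which app is delivered.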

mmonnin
Joined: 1 Feb 17
Posts: 27
Credit: 407,072,973
RAC: 569,068
Message 9776 - Posted: 19 Mar 2024, 0:41:10 UTC

I saw the new version download. Works on both my PCs.

crashtech
Joined: 10 Apr 19
Posts: 29
Credit: 613,234,834
RAC: 65,046
Message 9777 - Posted: 19 Mar 2024, 0:47:26 UTC - in response to Message 9776.

And you have P100s?

mmonnin
Joined: 1 Feb 17
Posts: 27
Credit: 407,072,973
RAC: 569,068
Message 9778 - Posted: 19 Mar 2024, 0:53:42 UTC

You most likely know I don't.

It was a confirmation post that an update for P100 systems didn't break other systems. Like it did earlier.

crashtech
Joined: 10 Apr 19
Posts: 29
Credit: 613,234,834
RAC: 65,046
Message 9779 - Posted: 19 Mar 2024, 1:12:55 UTC - in response to Message 9778.
Last modified: 19 Mar 2024, 1:16:23 UTC

I actually had no idea; your computers are hidden. So everything works now, but the P100s need an app_config, for now anyway.

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7346
Credit: 42,730,867
RAC: 5
Message 9780 - Posted: 19 Mar 2024, 1:54:18 UTC - in response to Message 9775.

It should work without an app_config file, shouldn't it?

Ian&Steve C.
Joined: 7 May 23
Posts: 15
Credit: 455,077,624
RAC: 247,302
Message 9781 - Posted: 19 Mar 2024, 11:26:51 UTC - in response to Message 9775.
Last modified: 19 Mar 2024, 11:27:05 UTC

your P100s running the cuda100 app are now failing with this:

./mfaktc.exe: error while loading shared libraries: libcudart.so.10.1: cannot open shared object file: No such file or directory


maybe you removed the cuda toolkit?
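
A quick way to confirm whether the dynamic loader can find that library on a given host is sketched below. It is only a check under the assumption that Python is available on the machine; the helper name `can_load_shared_library` is hypothetical, and the library name is taken from the error message above.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False
    return True


if __name__ == "__main__":
    # Library name taken from the error message above.
    lib = "libcudart.so.10.1"
    status = "found" if can_load_shared_library(lib) else "missing"
    print(f"{lib}: {status}")
```

If this reports the library as missing, the app will fail at startup regardless of which compute capabilities it was compiled for.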

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7346
Credit: 42,730,867
RAC: 5
Message 9782 - Posted: 19 Mar 2024, 12:05:32 UTC - in response to Message 9781.
Last modified: 19 Mar 2024, 12:43:13 UTC

https://srbase.my-firewall.org/sr5/download/libcudart.so.10.1

I will update the app to include this file.

Update:
The missing library file is now included in the zip file of the v29 cuda100 app.

crashtech
Joined: 10 Apr 19
Posts: 29
Credit: 613,234,834
RAC: 65,046
Message 9784 - Posted: 19 Mar 2024, 17:51:33 UTC - in response to Message 9782.

Thank you for all the work you agreed to do just to get one old model of GPU running, I appreciate it greatly.

I will return to testing TF after the upcoming PrimeGrid challenge is concluded!

crashtech
Joined: 10 Apr 19
Posts: 29
Credit: 613,234,834
RAC: 65,046
Message 9856 - Posted: 25 Mar 2024, 2:03:27 UTC

All seems to be working well on P100s. I believe that in honor of this fix, I will put all my GPUs on TF for a while.

Thank you!

Ian&Steve C.
Joined: 7 May 23
Posts: 15
Credit: 455,077,624
RAC: 247,302
Message 10028 - Posted: 15 Jul 2024, 16:28:32 UTC

all cuda120 tasks error out on my V100s, in the same way that crashtech's P100s were erroring.

did you omit support for CC 7.0 (the Titan V and V100 are at this CC level)? as i pointed out before, you need to explicitly add support for every CC level individually.

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7346
Credit: 42,730,867
RAC: 5
Message 10029 - Posted: 15 Jul 2024, 16:42:17 UTC - in response to Message 10028.
Last modified: 15 Jul 2024, 16:44:24 UTC

We can try to use cuda100 WUs with an app_config. That's the oldest app we are running.

<app_config>
  <app_version>
    <app_name>TF</app_name>
    <plan_class>cuda100</plan_class>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
</app_config>

Ian&Steve C.
Joined: 7 May 23
Posts: 15
Credit: 455,077,624
RAC: 247,302
Message 10030 - Posted: 15 Jul 2024, 16:55:21 UTC - in response to Message 10029.
Last modified: 15 Jul 2024, 16:55:53 UTC

tried it. didn't work. this app_config does not limit the project to sending only the cuda100 app.

same results on a Titan V (same cc 7.0), instant error.

I'm guessing the problem is that the app was compiled without cc 7.0 support.

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7346
Credit: 42,730,867
RAC: 5
Message 10031 - Posted: 15 Jul 2024, 17:21:15 UTC - in response to Message 10030.

The only other thing we can try is this:

<app_config>
  <app_version>
    <app_name>TF</app_name>
    <plan_class>cuda111</plan_class>
    <min_gpu_ram_mb>384</min_gpu_ram_mb>
    <gpu_ram_used_mb>384</gpu_ram_used_mb>
    <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
    <cpu_frac>0.01</cpu_frac>
  </app_version>
</app_config>


Or the app was not compiled with cc7. The P100 was working, right?

Ian&Steve C.
Joined: 7 May 23
Posts: 15
Credit: 455,077,624
RAC: 247,302
Message 10032 - Posted: 15 Jul 2024, 17:31:43 UTC - in response to Message 10031.

this method of using an app_config to try to force which app gets sent does not work. i've never had this work reliably, and it didn't work for crashtech when you suggested it last time. an app_config is designed to apply configuration to apps that are already downloaded. it has no influence on what the project server actually sends me.

i think your app was not compiled with CC 7.0 support. and if you didn't do it on the latest app, I would guess you didn't do it on the older apps either.

P100 works as far as I know. but that's because you fixed it by recompiling the app to include the CC 6.0 support.

when compiling cuda apps you cannot just put a minimum value. you need to put ALL values that you want to support. to support everything from CC 5.0+ you need to explicitly state every version (5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9)

5.0 - GM107/GM108 Maxwell
5.2 - GM200+ Maxwell
6.0 - GP100 Pascal
6.1 - GP102+ Pascal
7.0 - GV100 Volta (Titan V and V100)
7.5 - TU102+ Turing
8.0 - GA100 Ampere
8.6 - GA102+ Ampere
8.9 - AD102+ Ada Lovelace
9.0 - GH100 Hopper

and you're going to have to recompile again in the future to add 10.x something if you want to support the upcoming Blackwell GPUs.
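
To make the failure pattern in this thread concrete, here is a minimal sketch of the binary-compatibility rule as I understand it from NVIDIA's CUDA documentation: a cubin built for compute capability X.y runs only on devices of capability X.z with z >= y. It assumes the build embeds only machine code (`code=sm_XX`) and no PTX that the driver could JIT-compile for newer GPUs. The function name `cubin_runs_on` is hypothetical.

```python
def cubin_runs_on(device_cc, compiled_ccs):
    """Return True if any embedded cubin can run on a device of capability device_cc.

    Both arguments use (major, minor) tuples, e.g. (6, 0) for the P100.
    Assumes no PTX is embedded, so only same-major cubins are candidates.
    """
    dev_major, dev_minor = device_cc
    return any(
        major == dev_major and minor <= dev_minor
        for major, minor in compiled_ccs
    )


if __name__ == "__main__":
    # P100 (6.0) against a build that only targets Maxwell.
    print(cubin_runs_on((6, 0), [(5, 0), (5, 2)]))          # False
    # V100 (7.0) against a build that stops at Pascal.
    print(cubin_runs_on((7, 0), [(5, 0), (5, 2), (6, 0)]))  # False
    # V100 once 7.0 is added to the build.
    print(cubin_runs_on((7, 0), [(5, 0), (6, 0), (7, 0)]))  # True
```

Under this rule, a build that skips a whole major version leaves GPUs of that generation with nothing to execute, which matches the instant errors reported on the P100 and later on the V100 and Titan V.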

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7346
Credit: 42,730,867
RAC: 5
Message 10033 - Posted: 15 Jul 2024, 17:38:04 UTC - in response to Message 10032.
Last modified: 15 Jul 2024, 17:38:37 UTC

Yes, this was only a test of the current plan_class with cuda111, since some settings in its config are different. It looks like the app was compiled with cc6 but not cc7. For cc10 we will have a separate app, of course.

Copyright © 2014-2024 BOINC Confederation / rebirther