Posts by crashtech
1) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9856)
Posted 25 Mar 2024 by crashtech
All seems to be working well on P100s. I believe that in honor of this fix, I will put all my GPUs on TF for a while.

Thank you!
2) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9784)
Posted 19 Mar 2024 by crashtech
Thank you for all the work you agreed to do just to get one old model of GPU running; I appreciate it greatly.

I will return to testing TF after the upcoming PrimeGrid challenge is concluded!
3) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9779)
Posted 19 Mar 2024 by crashtech
I actually had no idea; your computers are hidden. So everything works now, but the P100s need an app_config, for now anyway.
4) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9777)
Posted 19 Mar 2024 by crashtech
I saw the new version download. Works on both my PCs.

And you have P100s?
5) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9775)
Posted 19 Mar 2024 by crashtech
I wanted to try v32, but even after resetting the project in the client, the server only sent me v28 cuda100 tasks. Oddly enough, the cuda100 tasks used to work on the P100, but now they don't. So I put this in app_config.xml:

<app_config>
    <app_version>
        <app_name>TF</app_name>
        <plan_class>cuda120</plan_class>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
</app_config>


Now I can run the new v32 and it works fine, but the app_config has to be used at least on this particular computer. I have not tried the rest yet.

EDIT: Actually, I spoke too soon. This app_config does NOT prevent the download of v28 cuda100 tasks. So I'm still not able to run all my GPUs.
6) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9768)
Posted 18 Mar 2024 by crashtech
Interesting, try to run ldd ./mfaktc


$ ldd ./mfaktc
        linux-vdso.so.1 (0x00007ffd1f50d000)
        libcudart.so.12 => ./libcudart.so.12 (0x00007ff29f6b8000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff29f552000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ff29f370000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ff29f355000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff29f163000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ff29f960000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff29f15d000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff29f138000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ff29f12e000)
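Since the CUDA toolkit is installed here, cuobjdump might also help narrow this down by listing which GPU architectures are actually embedded in the binary; without an sm_60 cubin or a PTX fallback, the P100 (compute capability 6.0) can't run it. A sketch only:

$ cuobjdump --list-elf ./mfaktc     # embedded cubins, e.g. sm_61, sm_75, ...
$ cuobjdump --list-ptx ./mfaktc     # embedded PTX the driver could JIT-compile for sm_60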
7) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9766)
Posted 18 Mar 2024 by crashtech
I'm still getting errors on a dual P100 PC trying to run TF 0.31 (cuda120) tasks:

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
2024-03-18 09:49:38 (2334): wrapper (7.24.26018): starting
2024-03-18 09:49:38 (2334): wrapper (7.24.26018): starting
2024-03-18 09:49:38 (2334): wrapper: running ./mfaktc (-d 1)
2024-03-18 09:49:38 (2334): wrapper: created child process 2336
2024-03-18 09:49:39 (2334): ./mfaktc exited; CPU time 0.254903
2024-03-18 09:49:39 (2334): app exit status: 0x1
2024-03-18 09:49:39 (2334): called boinc_finish(195)
</stderr_txt>
]]>

$ nvidia-smi
Mon Mar 18 11:02:56 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   31C    P0    25W / 250W |    197MiB / 16384MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:09:00.0 Off |                    0 |
| N/A   30C    P0    27W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1165      G   /usr/lib/xorg/Xorg                112MiB |
|    0   N/A  N/A      1517      G   cinnamon                           28MiB |
|    0   N/A  N/A      2181      G   ...2gtk-4.0/WebKitWebProcess       55MiB |
|    1   N/A  N/A      1165      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
8) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9744)
Posted 14 Mar 2024 by crashtech
I'm trying to say that the situation is worse now for some reason. With your help, I was able to process tasks for a time (on P100-only computers) using the first app_config you posted, but now that one no longer helps, and neither does the second app_config with the plan_class. I can't process tasks on P100s at all now, whether they are alone in the computer, in pairs, or mixed with another GPU. Nothing works now.

My humble request would be for you to revert whatever changes were made to the server side, so I can see if at least the P100-only computers can process tasks again. That would be a large improvement.
9) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9742)
Posted 14 Mar 2024 by crashtech
Also my P100-only machines are having problems again despite having got them working with the initial app_config you posted.
10) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9741)
Posted 14 Mar 2024 by crashtech
Yeah, I see. We have 2 plan_classes and both are working, so we need to override all the settings manually in app_config.

<app_config>
    <app_version>
        <plan_class>cuda100</plan_class>
        <host_summary_regex>Tesla P100</host_summary_regex>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
    <app_version>
        <plan_class>cuda120</plan_class>
        <host_summary_regex>GeForce GTX 1070 Ti</host_summary_regex>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
<app_config>


I added the missing slash in the closing tag, but even then it did not work for me; only cuda120 tasks were downloaded.
11) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9739)
Posted 13 Mar 2024 by crashtech
It could be that since the P100 is the second GPU, its presence isn't being properly detected on the server side. That's just a wild guess, based on the fact that the list of hosts always claims the computer has x number of whatever the first GPU is, instead of what's really there.
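For reference, this would be consistent with how the stock client advertises GPUs: the host record lists the most capable GPU model with a count, and dissimilar GPUs aren't even used locally unless cc_config.xml says so. A minimal sketch of the relevant client option:

<cc_config>
    <options>
        <!-- use every GPU, not only the most capable one BOINC picks by default -->
        <use_all_gpus>1</use_all_gpus>
    </options>
</cc_config>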
12) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9737)
Posted 13 Mar 2024 by crashtech
I'm sorry, after getting rid of the app_config and resetting the project, the client is still getting cuda120 tasks which error out immediately on the P100.
Perhaps it's not worth the trouble, not many people have these old compute GPUs.
13) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9733)
Posted 13 Mar 2024 by crashtech
...To get only cuda100 tasks and avoid any issues, try this app_config:

<app_config>
    <app_version>
        <app_name>TF</app_name>
        <plan_class>cuda100</plan_class>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
</app_config>


For some reason this app_config does not work in dual-GPU systems where the GPUs don't match, e.g. a 2080ti and a P100, or a 1070ti and a P100. Single P100s work now, and dual P100s also.


Tricky, do you get work for the 1070ti too? In that case all of your GPUs should be getting only cuda100 tasks, as intended. So we need a solution. I can try to add a plan_class only for the Tesla P100. Let's see how we can fix this so we don't need an app_config, without creating other issues with different CUDA versions and drivers.


Here is the host with the 1070ti:
https://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=225572
All of the valid tasks were completed by the 1070ti; it is apparently able to crunch cuda120.
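Purely as an illustration of the plan_class idea above (a sketch using BOINC's plan_class_spec.xml, not the project's actual configuration; values are assumptions, with the P100 being compute capability 6.0):

<plan_classes>
    <plan_class>
        <name>cuda100</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <!-- keep compute capability 6.0 cards such as the Tesla P100 on the cuda100 build -->
        <max_nvidia_compute_capability>600</max_nvidia_compute_capability>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </plan_class>
</plan_classes>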
14) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9731)
Posted 13 Mar 2024 by crashtech
...To get only cuda100 tasks and avoid any issues, try this app_config:

<app_config>
    <app_version>
        <app_name>TF</app_name>
        <plan_class>cuda100</plan_class>
        <min_gpu_ram_mb>384</min_gpu_ram_mb>
        <gpu_ram_used_mb>384</gpu_ram_used_mb>
        <gpu_peak_flops_scale>0.22</gpu_peak_flops_scale>
        <cpu_frac>0.01</cpu_frac>
    </app_version>
</app_config>


For some reason this app_config does not work in dual-GPU systems where the GPUs don't match, e.g. a 2080ti and a P100, or a 1070ti and a P100. Single P100s work now, and dual P100s also.
15) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9730)
Posted 13 Mar 2024 by crashtech
The Tesla P100 has an older compute capability that was not compiled into the cuda120 app. The plan_class is a bit tricky, and overriding the server-side plan_class with app_config is the best solution to avoid any issues. Perhaps we need a recompiled cuda120 app to support older cards.


Thanks for your reply, this would seem to confirm that there isn't anything different I can do with my computer configurations other than the cuda120 exclusion. Interestingly, my other Pascal GPUs (1080ti and 1070ti) do not have any problems with cuda120, so it seems there may be something unique about the datacenter Pascal GPU version that is causing this problem.
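The 1080ti and 1070ti are compute capability 6.1 parts, while the P100 (GP100) is 6.0, so this pattern would be consistent with a cuda120 build that includes sm_61 code but no sm_60 cubin or PTX fallback. Purely as an illustration (a hypothetical build line, not the project's actual makefile), nvcc flags covering both generations might look like:

# hypothetical build line; the key point is including compute_60/sm_60 for the P100
nvcc -O2 -o mfaktc mfaktc.cu \
     -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_75,code=compute_75   # PTX fallback for newer GPUs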
16) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9728)
Posted 13 Mar 2024 by crashtech
Hi,

Thanks for your reply. Here is the cuda version of one of the P100 hosts:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Also nvidia-smi:

$ nvidia-smi
Wed Mar 13 06:31:10 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   52C    P0   120W / 250W |   1843MiB / 16384MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1599      G   /usr/lib/xorg/Xorg                187MiB |
|    0   N/A  N/A      2046      G   xfwm4                               3MiB |
|    0   N/A  N/A     16247      G   ...2gtk-4.0/WebKitWebProcess        8MiB |
|    0   N/A  N/A     17960      G   ...2gtk-4.0/WebKitWebProcess        8MiB |
|    0   N/A  N/A   1103548      C   ...64-pc-linux-gnu__cuda1222      816MiB |
|    0   N/A  N/A   1109316      C   ...64-pc-linux-gnu__cuda1222      816MiB |
+-----------------------------------------------------------------------------+


Thank you for the app_config. I hope an alternate solution can be found that will allow me to run all the TF apps!
17) Message boards : Number crunching : Nvidia Tesla P100 Problems (Message 9726)
Posted 13 Mar 2024 by crashtech
Hi,

I have many Linux computers that are used for BOINC. They are all Mint Linux 20 or 20.3, have glibc 2.31, Nvidia driver 525 or 535, and have nvidia-cuda-toolkit installed. All of the GPUs run TF just fine except for the P100s. It seems that the P100s can run cuda100 tasks but not cuda120 tasks. Does anyone know why this might be? If I can't fix them, I might have to find a way to only process cuda100 tasks, but this would be a last resort.
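One thing that might reveal the actual CUDA error (instead of just BOINC's exit status) is running the TF binary standalone with mfaktc's built-in selftest. The directory below is an assumption about where the BOINC client keeps this project's files; adjust to your data path:

# assumed project directory name under the BOINC data directory
$ cd /var/lib/boinc-client/projects/srbase.my-firewall.org_sr5
$ ./mfaktc -d 0 -st     # -d selects the GPU index, -st runs the selftest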
18) Message boards : Number crunching : 2 GPUs, 2 tasks on one card, not utilizing gpu 1, (Message 8637)
Posted 7 Jan 2023 by crashtech
I have already posted this everywhere. The problem is the app itself. If one of the devs can change the single instance to multiple, then we are good. I am still open to any hints or solutions.

Unfortunately gpuowl, which does have multi-GPU support, doesn't do TF.

OK, I was seeing "BOINC app" mentioned and thinking that meant that the boinc-client app was being blamed for the problem.
19) Message boards : Number crunching : 2 GPUs, 2 tasks on one card, not utilizing gpu 1, (Message 8634)
Posted 7 Jan 2023 by crashtech
There must be some kind of translation problem here.
I'm going to try to say this clearly:
SRBase is the only project I know of with this problem.
The other projects have solved the problem.
It shouldn't be the user's problem to make SRBase work properly.
This project might benefit from the experience of people at PrimeGrid.
PrimeGrid does not have problems with multiple GPUs, nor is any special user configuration required.
20) Message boards : Number crunching : 2 GPUs, 2 tasks on one card, not utilizing gpu 1, (Message 8557)
Posted 25 Dec 2022 by crashtech
Why can't this app just make use of all GPUs?


In standalone mode you can run 2 instances, but it is not designed for multi-GPU. It's more a BOINC issue, since it only runs on device 0.

If it's a BOINC issue, virtually every other project seems to have solved it. But perhaps you are getting the desired amount of results, so there is no incentive to gain more compute power by fixing this problem.
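For what it's worth, projects that run one wrapper-based task per GPU typically just forward the device number the client assigns. A minimal job.xml sketch, assuming mfaktc's -d option and the standard wrapper macro (the stderr quoted earlier in this thread shows the wrapper passing -d 1 in exactly this way):

<job_desc>
    <task>
        <application>mfaktc</application>
        <!-- the wrapper substitutes the GPU index assigned by the BOINC client -->
        <command_line>-d $GPU_DEVICE_NUM</command_line>
    </task>
</job_desc>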

