Linux GPU errors

Message boards : Number crunching : Linux GPU errors

Dark Angel
Joined: 22 Jun 18
Posts: 14
Credit: 12,503,427
RAC: 0
Message 6819 - Posted: 18 Oct 2020, 20:48:34 UTC

I left a machine running over the weekend while I went away, and I came back to find it producing errors on all GPU tasks.
Every error occurs in less than a minute, all with exit code 0x100.
The machine in question was completing units successfully before I went away and hasn't done any automatic updates in that time.
This is the machine in question: http://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=208774
I made sure all the CUDA drivers are properly installed and rebooted the machine, neither of which changed anything.
I moved the GPU to PrimeGrid (PPS Sieve) and it's running fine, which makes me suspect it's something with the work here, but I cannot confirm that.
Any help/insight would be appreciated.

rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 6033
Credit: 26,762,929
RAC: 37
Message 6820 - Posted: 19 Oct 2020, 15:34:06 UTC - in response to Message 6819.

You could try to reset the project on this host if you have only TF running.

Dark Angel
Message 6821 - Posted: 20 Oct 2020, 0:52:03 UTC - in response to Message 6820.

I did that.
Fully reinstalled all the drivers. Rebooted. Reset the project. It still throws errors on this work but works fine on PrimeGrid. It's not opening the child process for some reason. Do the new work units require more RAM on the card than the old ones?

rebirther
Message 6824 - Posted: 20 Oct 2020, 14:21:20 UTC - in response to Message 6821.

I ran a short check and it used around 45 MB of RAM; I don't think it's more than before.

Dark Angel
Message 6826 - Posted: 22 Oct 2020, 7:05:48 UTC - in response to Message 6824.

This is a typical error:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
17:46:37 (599282): wrapper (7.2.26012): starting
17:46:37 (599282): wrapper: running ./mfaktc.exe ( --device 0)
17:47:57 (599282): ./mfaktc.exe exited; CPU time 1.075320
17:47:57 (599282): app exit status: 0x100
17:47:57 (599282): called boinc_finish

</stderr_txt>
]]>

I have no idea what's going on, just that the old version worked and the new one doesn't on this particular card.
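
A note on reading the two codes in that log. Assuming the wrapper prints the raw POSIX wait status of the child process (an assumption; nothing in this thread confirms it), 0x100 would mean that mfaktc exited normally with exit code 1. The three numbers in the `<message>` line are one and the same byte written in decimal, in hex, and as a signed 8-bit value. A minimal Python sketch of that arithmetic, with hypothetical helper names:

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw POSIX wait status into exit code and terminating signal.

    Assumes the conventional Linux layout: the low 7 bits hold the signal
    number and bits 8-15 hold the exit code.
    """
    signal_number = status & 0x7F
    exit_code = (status >> 8) & 0xFF
    return {
        "exited_normally": signal_number == 0,
        "exit_code": exit_code,
        "signal": signal_number,
    }


def as_signed_byte(code: int) -> int:
    """Reinterpret an unsigned 8-bit exit code as a signed 8-bit value."""
    return code - 256 if code > 127 else code


# The wrapper's "app exit status" from the log above.
print(decode_wait_status(0x100))
# The client's "process exited with code 195 (0xc3, -61)".
print(hex(195), as_signed_byte(195))
```

Under that reading, the failure is mfaktc itself returning a non-zero exit code, and the wrapper then reports its own error to the client.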

rebirther
Message 6827 - Posted: 22 Oct 2020, 15:15:24 UTC - in response to Message 6826.

You could try testing it standalone, with and without the wrapper.

Dark Angel
Message 6839 - Posted: 22 Oct 2020, 23:58:08 UTC - in response to Message 6827.

I'm running Linux. I don't think that will like a .exe file.

Dark Angel
Message 6840 - Posted: 23 Oct 2020, 9:02:03 UTC

If it helps, this is what the client logs:
Output file TF_72-73_127-219M_wu_229446_0_0 for task TF_72-73_127-219M_wu_229446_0 absent

Is there any way to get a Linux native version of mfaktc? I'm running 64-bit Linux but only ever get project files compiled for Windows, plus the wrapper.

Running a dependency check on the wrapper gets me:
$ ldd wrapper_26012-v2_x86_64-pc-linux-gnu
linux-vdso.so.1 (0x00007ffebd394000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb842ab7000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb842a9c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb842a79000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb842887000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb842c31000)

Dark Angel
Message 6841 - Posted: 23 Oct 2020, 9:22:08 UTC - in response to Message 6827.

Copied the .exe out of the zip archive and ran it in a terminal. This is the output:
mfaktc v0.21 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 193154bits
SIEVE_SPLIT 250
MORE_CLASSES enabled

Runtime options
SievePrimes 25000
SievePrimesAdjust 1
SievePrimesMin 5000
SievePrimesMax 100000
NumStreams 3
CPUStreams 3
GridSize 3
GPU Sieving enabled
GPUSievePrimes 82486
GPUSieveSize 2047Mi bits
GPUSieveProcessSize 32Ki bits
Checkpoints enabled
CheckpointDelay 300s
WorkFileAddDelay disabled
Stages enabled
StopAfterFactor class
PrintMode full
V5UserID (none)
ComputerID (none)
AllowSleep no
TimeStampInResults no

CUDA version info
binary compiled for CUDA 10.10
CUDA runtime version 10.10
CUDA driver version 11.0

CUDA device info
name Quadro K2000
compute capability 3.0
max threads per block 1024
max shared memory per MP 49152 byte
number of multiprocessors 2
CUDA cores per MP 192
CUDA cores - total 384
clock rate (CUDA cores) 954MHz
memory clock rate: 2000MHz
memory bus width: 128 bit

Automatic parameters
threads per grid 1048576
GPUSievePrimes (adjusted) 82486
GPUsieve minimum exponent 1055144

running a simple selftest...
Selftest statistics
number of tests 107
successfull tests 107

selftest PASSED!

Can't open workfile worktodo.txt
ERROR: get_next_assignment(): can't open "worktodo.txt"

Next I tried it with a work file with the relevant numbers but it had a hissy fit about that. I cut off the job numbers and the file ran. I'll let it run in the terminal and see what happens.
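
For anyone repeating the standalone test, here is a small Python sketch that writes such a work file. It assumes mfaktc accepts entries of the form `Factor=<exponent>,<bit_min>,<bit_max>` with no assignment key. That matches "cutting off the job numbers" as described above, but the exact format is not spelled out in this thread, so check it against the mfaktc README. The function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Tuple


def worktodo_line(exponent: int, bit_min: int, bit_max: int) -> str:
    """Build one trial-factoring entry (assumed key-less mfaktc format)."""
    if bit_min >= bit_max:
        raise ValueError("bit_min must be smaller than bit_max")
    return f"Factor={exponent},{bit_min},{bit_max}"


def write_worktodo(path: Path, entries: Iterable[Tuple[int, int, int]]) -> None:
    """Write one entry per line to the work file at `path`."""
    lines = [worktodo_line(*entry) for entry in entries]
    path.write_text("\n".join(lines) + "\n")


# Exponent and bit range as shown in the follow-up post in this thread.
print(worktodo_line(140615327, 72, 73))
```

Calling `write_worktodo(Path("worktodo.txt"), [(140615327, 72, 73)])` in the directory that holds the mfaktc binary would create the file it was complaining about.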

Dark Angel
Message 6842 - Posted: 23 Oct 2020, 9:25:51 UTC

Ok, here's the follow up with a work file

Same as the previous output but with this at the end:

got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
k_min = 16791791417940
k_max = 33583582839939
Using GPU kernel "barrett76_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 23 20:27 | 0 0.1% | 7.842 2h05m | 78.07 82485 n.a.%
M140615327 has a factor: 38814612911305349835664385407
ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated

rebirther
Message 6843 - Posted: 23 Oct 2020, 9:34:49 UTC - in response to Message 6842.

Hmm, luckily there was a factor. Can you try another run that does not find a factor?

For the wrapper test I will post you a little guide later.

Dark Angel
Message 6844 - Posted: 23 Oct 2020, 9:43:18 UTC - in response to Message 6843.

Using the same work file (the unit errored out in the actual client):

got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
k_min = 16791791417940
k_max = 33583582839939
Using GPU kernel "barrett76_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 23 20:42 | 0 0.1% | 16.568 4h24m | 36.95 82485 n.a.%
Oct 23 20:43 | 9 0.2% | 16.571 4h24m | 36.94 82485 n.a.%
Oct 23 20:43 | 12 0.3% | 8.616 2h17m | 71.05 82485 n.a.%
ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated
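
Error 702 corresponds to `cudaErrorLaunchTimeout` in recent CUDA runtimes. As far as I know, it is raised when a kernel runs longer than the driver's execution time limit, which normally applies only when that limit is enabled for the device (typically a GPU that also drives a display). Whether that is the cause here is not established in this thread. To spot the error automatically in saved mfaktc output, a minimal Python sketch could look like the following. The mapping table is deliberately limited to the one code seen here, and the function name is hypothetical.

```python
import re
from typing import Optional

# Only the code that appears in this thread; CUDA defines many more.
KNOWN_CUDA_ERRORS = {702: "cudaErrorLaunchTimeout"}

ERROR_PATTERN = re.compile(r"cudaGetLastError\(\) returned (\d+): (.+)")


def find_cuda_error(output: str) -> Optional[dict]:
    """Return the first CUDA error reported in mfaktc output, or None."""
    match = ERROR_PATTERN.search(output)
    if match is None:
        return None
    code = int(match.group(1))
    return {
        "code": code,
        "name": KNOWN_CUDA_ERRORS.get(code, "unknown"),
        "text": match.group(2).strip(),
    }


# Last two lines of the run above.
sample = (
    "Oct 23 20:43 | 12 0.3% | 8.616 2h17m | 71.05 82485 n.a.%\n"
    "ERROR: cudaGetLastError() returned 702: "
    "the launch timed out and was terminated\n"
)
print(find_cuda_error(sample))
```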

rebirther
Message 6845 - Posted: 23 Oct 2020, 10:47:45 UTC - in response to Message 6844.

Hmm, if the app is running for some minutes then the VRAM size could be an issue.

Dark Angel
Message 6847 - Posted: 23 Oct 2020, 20:15:59 UTC - in response to Message 6845.

I copied the application and data file to a known good machine (1060 6GB card) and it ran fine, so it's something particular to that K2000 rig. That little Quadro only has 2GB of VRAM. If that's not enough for the new work that's fine, but it might be worth putting a test in to confirm the card has at least 3GB before giving out work.
Of course it's also an older card so there won't be many of them still in service.
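
The kind of pre-check suggested here can be sketched on the volunteer's side in Python by asking `nvidia-smi` for total and used memory. This is only an illustration: the helper names and the threshold are hypothetical, and it says nothing about how the project's scheduler decides which hosts get work.

```python
import subprocess

# Standard nvidia-smi query; with "nounits" the memory values are in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_report(csv_text: str) -> list:
    """Parse nvidia-smi CSV rows into dicts with memory sizes in MiB."""
    gpus = []
    for row in csv_text.strip().splitlines():
        # Split from the right so a comma in the GPU name cannot break parsing.
        name, total, used = [field.strip() for field in row.rsplit(",", 2)]
        gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return gpus


def has_enough_vram(gpu: dict, required_mib: int) -> bool:
    """True if the currently free memory covers the requested amount."""
    return gpu["total_mib"] - gpu["used_mib"] >= required_mib


def query_gpus() -> list:
    """Run nvidia-smi; requires the NVIDIA driver utilities to be installed."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_report(result.stdout)


# Made-up sample row for illustration, not a measurement from this thread.
example = parse_memory_report("Quadro K2000, 2000, 900\n")
print(example, has_enough_vram(example[0], 3072))
```

On a machine with the driver installed, `query_gpus()` returns the same structure for the real cards.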

rebirther
Message 6848 - Posted: 23 Oct 2020, 20:25:15 UTC - in response to Message 6847.

I have checked this with GPU-Z and it's taking around 85 MB of VRAM. Together with the amount reserved by the BIOS, that is about 900 MB in total, so 2 GB of VRAM should be no problem at all.

The card could also be overheating.

Dark Angel
Message 6849 - Posted: 24 Oct 2020, 2:10:16 UTC - in response to Message 6848.
Last modified: 24 Oct 2020, 2:15:18 UTC

Cleaned the card out (got very little dust out of it) and checked it. Same error. I checked the temperature in the nVidia Settings app, but the unit fails before the temperature can update (lagging display, normal for this card). I tried it on a PPS Sieve unit (PrimeGrid) and it's crunching away happily at under 60C after five minutes continuous, with the fan still at only 35%. A Collatz unit (OpenCL) runs fine at the same temperatures, and GPUGrid has the card at 61C and 37% fan after 10 minutes continuous. I compared the GPUGrid work on my GTX 1060 rig and it runs a couple of degrees cooler than SRBase, but not by much.
Edit: to be clear, the display is lagging because I'm accessing the desktop remotely on top of the load on the GPU.

rebirther
Message 6861 - Posted: 25 Oct 2020, 7:25:59 UTC

I have asked in the forum. It could be the same issue as on Windows 7, though fewer people are affected; maybe a driver issue.

Dark Angel
Message 6863 - Posted: 25 Oct 2020, 7:51:10 UTC - in response to Message 6861.

I was wondering the same thing and spent time today cleaning out any differences between that rig and the other one that works fine. I found an issue between the kernel version I was using and VirtualBox, which I figure is unrelated to this, so I went back to a different kernel that works on both machines, then reinstalled the drivers to be sure. Just rebooted and tried it, but the same issue holds.
I'm running the 450 drivers and 5.4 Linux kernel, btw.

rebirther
Message 6864 - Posted: 25 Oct 2020, 7:52:24 UTC - in response to Message 6863.

The Windows machine is on Win7 with driver 436.15. I think it's the same kind of error.

rebirther
Message 6865 - Posted: 25 Oct 2020, 18:58:31 UTC

A short piece of advice from the forum:


check the factor, the logs, the hardware, driver, temperatures, gpu memory reliability, etc.
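
The first item, checking the factor, needs no GPU at all. A candidate f divides the Mersenne number 2^p - 1 exactly when 2^p is congruent to 1 modulo f, and any factor of a Mersenne number with odd prime exponent p has the form 2kp + 1 and is congruent to 1 or 7 modulo 8. A minimal Python sketch with hypothetical function names follows. The results for the value reported earlier in this thread are deliberately not stated here, except for one thing that can be read off directly: that value has 29 digits while 2^73 has 22, so it lies outside the assigned 2^72 to 2^73 range, which by itself suggests the result deserves a second look.

```python
def divides_mersenne(p: int, f: int) -> bool:
    """True if f divides 2**p - 1, i.e. 2**p == 1 (mod f)."""
    return f > 1 and pow(2, p, f) == 1


def has_mersenne_factor_form(p: int, f: int) -> bool:
    """Factors of M_p (p an odd prime) are 2*k*p + 1 and 1 or 7 mod 8."""
    return f % (2 * p) == 1 and f % 8 in (1, 7)


def in_bit_range(f: int, bit_min: int, bit_max: int) -> bool:
    """True if 2**bit_min <= f < 2**bit_max."""
    return (1 << bit_min) <= f < (1 << bit_max)


# Known small case: M11 = 2047 = 23 * 89.
print(divides_mersenne(11, 23), has_mersenne_factor_form(11, 23))

# Exponent and reported factor exactly as printed earlier in the thread.
p, f = 140615327, 38814612911305349835664385407
print(divides_mersenne(p, f), has_mersenne_factor_form(p, f), in_bit_range(f, 72, 73))
```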

Copyright © 2014-2021 BOINC Confederation / rebirther