I left a machine running over the weekend while I was away, and I came back to find it producing errors on all GPU tasks.
Every error occurs in less than a minute, all with exit code 0x100.
The machine in question was completing units successfully before I went away and hasn't done any automatic updates in that time.
This is the machine in question: http://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=208774
I made sure all the CUDA drivers are properly installed and rebooted the machine, with no change.
I moved the GPU to PrimeGrid (PPS Sieve) and it's running fine, which makes me suspect it's something with the work here, but I cannot confirm that.
Any help/insight would be appreciated. |
|
|
rebirther (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 2 Jan 13 Posts: 7496 Credit: 44,029,375 RAC: 36,320 |
You could try resetting the project on this host if you only have TF running. |
|
|
|
I did that.
Fully reinstalled all the drivers. Rebooted. Reset the project. It still throws errors on this work but works fine on PrimeGrid. It's not opening the child process for some reason. Do the new work units require more RAM on the card than the old ones? |
|
|
rebirther |
I checked briefly and it used around 45 MB of RAM. I don't think it's more than before. |
|
|
|
This is a typical error:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
17:46:37 (599282): wrapper (7.2.26012): starting
17:46:37 (599282): wrapper: running ./mfaktc.exe ( --device 0)
17:47:57 (599282): ./mfaktc.exe exited; CPU time 1.075320
17:47:57 (599282): app exit status: 0x100
17:47:57 (599282): called boinc_finish
</stderr_txt>
]]>
I have no idea what's going on, just that the old version worked and the new one doesn't on this particular card. |
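A note on reading the two codes in that log. The sketch below assumes the wrapper prints the raw POSIX wait status of the child, in which case 0x100 would simply mean mfaktc exited with code 1. The 195 reported by the client is, as far as I know, BOINC's EXIT_CHILD_FAILED, and the "(0xc3, -61)" next to it is the same number in hex and as a signed byte. The function names are made up for illustration.

```python
def decode_wait_status(status: int) -> str:
    """Interpret a raw POSIX wait status (simplified: ignores stopped/continued)."""
    signal_number = status & 0x7F      # low 7 bits: terminating signal, 0 if none
    exit_code = (status >> 8) & 0xFF   # next byte: the child's own exit code
    if signal_number == 0:
        return f"exited normally with code {exit_code}"
    return f"killed by signal {signal_number}"


def as_signed_byte(code: int) -> int:
    """Show an exit code the way the client prints it as a signed 8-bit value."""
    return code - 256 if code > 127 else code


print(decode_wait_status(0x100))        # the wrapper's "app exit status"
print(hex(195), as_signed_byte(195))    # the client's "code 195 (0xc3, -61)"
```

If that reading is right, the wrapper itself is working and the failure is inside mfaktc, so the application's own output is the place to look.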
|
|
rebirther |
You could try testing it standalone, with and without the wrapper. |
|
|
|
I'm running Linux. I don't think that will like a .exe file. |
|
|
|
If it helps, this is what the client logs:
Output file TF_72-73_127-219M_wu_229446_0_0 for task TF_72-73_127-219M_wu_229446_0 absent
Is there any way to get a Linux native version of mfaktc? I'm running 64-bit Linux but only ever get project files compiled for Windows and the wrapper.
Running a dependency check on the wrapper gets me:
$ ldd wrapper_26012-v2_x86_64-pc-linux-gnu
linux-vdso.so.1 (0x00007ffebd394000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb842ab7000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb842a9c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb842a79000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb842887000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb842c31000) |
|
|
|
Copied the .exe out of the zip archive and ran it in a terminal. This is the output:
mfaktc v0.21 (64bit built)
Compiletime options
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 193154bits
SIEVE_SPLIT 250
MORE_CLASSES enabled
Runtime options
SievePrimes 25000
SievePrimesAdjust 1
SievePrimesMin 5000
SievePrimesMax 100000
NumStreams 3
CPUStreams 3
GridSize 3
GPU Sieving enabled
GPUSievePrimes 82486
GPUSieveSize 2047Mi bits
GPUSieveProcessSize 32Ki bits
Checkpoints enabled
CheckpointDelay 300s
WorkFileAddDelay disabled
Stages enabled
StopAfterFactor class
PrintMode full
V5UserID (none)
ComputerID (none)
AllowSleep no
TimeStampInResults no
CUDA version info
binary compiled for CUDA 10.10
CUDA runtime version 10.10
CUDA driver version 11.0
CUDA device info
name Quadro K2000
compute capability 3.0
max threads per block 1024
max shared memory per MP 49152 byte
number of multiprocessors 2
CUDA cores per MP 192
CUDA cores - total 384
clock rate (CUDA cores) 954MHz
memory clock rate: 2000MHz
memory bus width: 128 bit
Automatic parameters
threads per grid 1048576
GPUSievePrimes (adjusted) 82486
GPUsieve minimum exponent 1055144
running a simple selftest...
Selftest statistics
number of tests 107
successfull tests 107
selftest PASSED!
Can't open workfile worktodo.txt
ERROR: get_next_assignment(): can't open "worktodo.txt"
Next I tried it with a work file with the relevant numbers but it had a hissy fit about that. I cut off the job numbers and the file ran. I'll let it run in the terminal and see what happens. |
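For anyone repeating the standalone test: to my knowledge mfaktc reads lines of the form Factor=exponent,bit_min,bit_max from worktodo.txt, with an optional assignment key in front. Here is a minimal sketch that builds such a line for the assignment reported later in this thread (exponent 140615327, bits 72 to 73). The helper names are hypothetical, and the exact fields the project puts in its own work files are not shown here.

```python
from typing import Optional


def worktodo_line(exponent: int, bit_min: int, bit_max: int,
                  key: Optional[str] = None) -> str:
    """Build one trial-factoring assignment line for worktodo.txt."""
    fields = [str(exponent), str(bit_min), str(bit_max)]
    if key is not None:
        fields.insert(0, key)   # optional assignment key goes first
    return "Factor=" + ",".join(fields)


def write_worktodo(path: str, line: str) -> None:
    """Write a single assignment line to the given file."""
    with open(path, "w") as handle:
        handle.write(line + "\n")


print(worktodo_line(140615327, 72, 73))
```

Calling write_worktodo("worktodo.txt", worktodo_line(140615327, 72, 73)) in the directory holding the mfaktc binary should give it something to pick up.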
|
|
|
Ok, here's the follow up with a work file
Same as the previous output but with this at the end:
got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
k_min = 16791791417940
k_max = 33583582839939
Using GPU kernel "barrett76_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 23 20:27 | 0 0.1% | 7.842 2h05m | 78.07 82485 n.a.%
M140615327 has a factor: 38814612911305349835664385407
ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated |
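About that last line: as far as I know, 702 is cudaErrorLaunchTimeout in recent CUDA runtimes, raised when a single kernel runs longer than the driver's time limit. That limit normally applies to GPUs that are also driving a display. Here is a small sketch for pulling the error out of a saved mfaktc log. The lookup table holds only the one code seen in this thread, and the names are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Only the code seen in this thread; extend from the CUDA runtime docs as needed.
KNOWN_CUDA_ERRORS = {702: "cudaErrorLaunchTimeout"}

ERROR_PATTERN = re.compile(r"cudaGetLastError\(\) returned (\d+): (.*)")


def find_cuda_error(log_text: str) -> Optional[Tuple[int, str, str]]:
    """Return (code, symbolic name, message) for the first CUDA error, if any."""
    match = ERROR_PATTERN.search(log_text)
    if match is None:
        return None
    code = int(match.group(1))
    name = KNOWN_CUDA_ERRORS.get(code, "unknown")
    return code, name, match.group(2).strip()


sample = ("ERROR: cudaGetLastError() returned 702: "
          "the launch timed out and was terminated")
print(find_cuda_error(sample))
```

If the time limit really is the cause, shorter kernels should be less likely to hit it. One guess, untested here, is that a smaller GPUSieveSize than the 2047Mi bits shown in the runtime options would help on a card this size.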
|
|
rebirther |
Hmm, luckily there was a factor. Can you try another run that doesn't find a factor?
I will post a short guide for the wrapper test later. |
|
|
|
Using the same work file (the unit errored out in the actual client):
got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
k_min = 16791791417940
k_max = 33583582839939
Using GPU kernel "barrett76_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 23 20:42 | 0 0.1% | 16.568 4h24m | 36.95 82485 n.a.%
Oct 23 20:43 | 9 0.2% | 16.571 4h24m | 36.94 82485 n.a.%
Oct 23 20:43 | 12 0.3% | 8.616 2h17m | 71.05 82485 n.a.%
ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated |
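To put the timings above in perspective, the ETA column can be reproduced from the per-class time. This sketch assumes 960 classes are tested per assignment, which is my understanding of what MORE_CLASSES implies and matches the 0.1% step per class in the log, and that the displayed ETA is truncated to whole minutes. Both are assumptions rather than something confirmed from the mfaktc source.

```python
def eta_seconds(seconds_per_class: float, classes_done: int,
                total_classes: int = 960) -> float:
    """Remaining time if every remaining class takes seconds_per_class."""
    return (total_classes - classes_done) * seconds_per_class


def format_eta(seconds: float) -> str:
    """Format as XhYYm, truncating to whole minutes."""
    minutes = int(seconds // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


# (seconds per class, classes finished) taken from the three progress lines above
for per_class, done in [(16.568, 1), (16.571, 2), (8.616, 3)]:
    print(per_class, format_eta(eta_seconds(per_class, done)))
```

So a full unit would need somewhere between two and four and a half hours on this card, and the timeout hits within the first few classes rather than near the end.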
|
|
rebirther |
Hmm, if the app is running for some minutes then the VRAM size could be an issue. |
|
|
|
I copied the application and data file to a known good machine (1060 6GB card) and it ran fine, so it's something particular to that K2000 rig. That little Quadro only has 2GB of VRAM. If that's not enough for the new work that's fine, but it might be worth putting a test in to confirm the card has at least 3GB before giving out work.
Of course it's also an older card so there won't be many of them still in service. |
|
|
rebirther |
I checked this with GPU-Z and it's taking around 85 MB of VRAM; with the memory reserved by the BIOS that's about 900 MB in total, so 2 GB of VRAM should be no problem at all.
The card could also be overheating. |
|
|
|
Cleaned the card out, got very little dust out of it, and checked it. Same error. Checked the temperature (nVidia Settings app): the unit fails before the temperature can update (lagging display, normal for this card). Tried it on a PPS Sieve unit (PrimeGrid) and it's crunching away happily at under 60C after five minutes continuous, fan still at only 35%. A Collatz unit (OpenCL) runs fine at the same temperatures, and GPUGrid has the card at 61C and 37% fan after 10 minutes continuous. I compared the GPUGrid work on my GTX 1060 rig and it runs a couple of degrees cooler than SRBase but not much.
Edit: to be clear, it's lagging because I'm accessing the desktop remotely, on top of the load on the GPU. |
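Since the settings app lags behind the failure, one way to get readings right up to the crash is to sample nvidia-smi from a terminal instead. This is a minimal sketch using nvidia-smi's CSV query output. The parsing helper is hypothetical, and the sample line is made up to resemble the figures mentioned in this thread, not a real reading.

```python
import subprocess
from typing import Dict, Optional

QUERY_FIELDS = ["temperature.gpu", "fan.speed", "memory.used", "memory.total"]


def parse_sample(csv_line: str) -> Dict[str, Optional[int]]:
    """Parse one CSV line; fields the card does not report become None."""
    values = [value.strip() for value in csv_line.split(",")]
    return {
        field: int(value) if value.isdigit() else None
        for field, value in zip(QUERY_FIELDS, values)
    }


def read_gpu(index: int = 0) -> Dict[str, Optional[int]]:
    """Query one GPU through nvidia-smi (requires the NVIDIA driver tools)."""
    output = subprocess.check_output(
        ["nvidia-smi", f"--id={index}",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_sample(output.strip().splitlines()[0])


# Made-up sample line: temperature in C, fan in %, memory in MiB.
print(parse_sample("58, 35, 85, 1999"))
```

Calling read_gpu() once a second in a loop while the unit runs, and appending each result to a file, would show whether temperature or memory use moves at all before the error appears.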
|
|
rebirther |
I have asked in the forum. It could be the same issue now seen on Windows 7, though fewer people are affected; maybe it's a driver issue. |
|
|
|
I was wondering the same thing and spent time today eliminating any differences between that rig and the other one that works fine. I found an issue between the kernel version I was using and VirtualBox, which I figure is unrelated to this, and went back to a different kernel that works on both machines, then reinstalled the drivers to be sure. Just rebooted and tried it, but the same issue holds.
I'm running the 450 drivers and 5.4 Linux kernel, btw. |
|
|
rebirther |
The Windows machine is on Win7 with driver 436.15. I think it's the same kind of error. |
|
|
rebirther |
A short note from the forum:
Check the factor, the logs, the hardware, driver, temperatures, GPU memory reliability, etc.
|
|
|