Posts by Dark Angel

1) Message boards : News : Linux wrapper suspending task issue (Message 8930)
Posted 10 Jun 2023 by Dark Angel

I've noticed this repeatedly. Since I often suspend BOINC work to use my system for other things that need both a large amount of RAM and full GPU access, this comes up a lot.
I had previously considered it to be a known but unfortunate issue and something that just had to be accepted to run the project. I'm glad it's being looked into.

2) Message boards : Number crunching : Linux GPU errors (Message 6869)
Posted 27 Oct 2020 by Dark Angel

Back to 435 drivers, still no go.

Still running PrimeGrid tasks with no drama (not at the same time, of course).

ldd mfaktc.exe confirms I have all required libraries installed correctly.

3) Message boards : Number crunching : Linux GPU errors (Message 6867)
Posted 26 Oct 2020 by Dark Angel

Went to 455 drivers instead. Still no go.

4) Message boards : Number crunching : Linux GPU errors (Message 6866)
Posted 25 Oct 2020 by Dark Angel

I'll try rolling the driver back to 440 and see what happens.
I doubt hardware or temps are the issue, given the card works fine on other prime number projects, but it won't hurt to try just the same.

5) Message boards : Number crunching : Linux GPU errors (Message 6863)
Posted 25 Oct 2020 by Dark Angel

I was wondering the same thing and spent time today cleaning out any differences between that rig and the other one that works fine. I found an issue between the kernel version I was using and VirtualBox, which I figure is unrelated to this, so I went back to a different kernel that works on both machines, then reinstalled the drivers to be sure. I just rebooted and tried it, but the same issue holds.
I'm running the 450 drivers and 5.4 Linux kernel, btw.

6) Message boards : Number crunching : Linux GPU errors (Message 6849)
Posted 24 Oct 2020 by Dark Angel

Cleaned the card out (got very little dust out of it) and checked it. Same error. Checked the temperature (nVidia Settings app), but the unit fails before the temperature can update (lagging display, normal for this card). Tried it on a PPS Sieve unit (PrimeGrid) and it's crunching away happily at under 60C after five minutes continuous, fan still at only 35%. A Collatz unit (OpenCL) runs fine at the same temperatures, and GPUGrid has the card at 61C and 37% fan after 10 minutes continuous. I compared the GPUGrid work on my GTX 1060 rig and it runs a couple of degrees cooler than SRBase but not much.
Edit: just to be clear, the display is lagging because I'm accessing the desktop remotely, on top of the load on the GPU.

7) Message boards : Number crunching : Linux GPU errors (Message 6847)
Posted 23 Oct 2020 by Dark Angel

I copied the application and data file to a known good machine (1060 6GB card) and it ran fine, so it's something particular to that K2000 rig. That little Quadro only has 2GB of VRAM. If that's not enough for the new work that's fine, but it might be worth putting a test in to confirm the card has at least 3GB before giving out work.
Of course it's also an older card so there won't be many of them still in service.
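
A minimal sketch of the kind of pre-flight check suggested above, assuming it runs on the host and that nvidia-smi is on the PATH. The 3 GB threshold comes from the post; the function names and the sample CSV line are hypothetical, and this is not how SRBase or BOINC actually decide which hosts get work.

```python
import subprocess

MIN_VRAM_MIB = 3 * 1024  # 3 GB threshold suggested in the post (assumption, not a project setting)


def parse_gpu_memory(csv_text):
    """Parse 'name, memory_in_MiB' lines into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a GPU name containing commas stays intact.
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


def has_enough_vram(total_mib, minimum=MIN_VRAM_MIB):
    """True if the card meets the minimum memory requirement."""
    return total_mib >= minimum


if __name__ == "__main__":
    for name, mib in query_gpu_memory():
        verdict = "ok" if has_enough_vram(mib) else "too little VRAM"
        print(f"{name}: {mib} MiB ({verdict})")
```

A 2 GB card like the K2000 would fail this check and a 6 GB 1060 would pass. Whether VRAM is actually the cause of the errors is not established here.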

8) Message boards : Number crunching : Linux GPU errors (Message 6844)
Posted 23 Oct 2020 by Dark Angel

Using the same work file (the unit errored out in the actual client):

got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
k_min = 16791791417940
k_max = 33583582839939
Using GPU kernel "barrett76_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 23 20:42 | 0 0.1% | 16.568 4h24m | 36.95 82485 n.a.%
Oct 23 20:43 | 9 0.2% | 16.571 4h24m | 36.94 82485 n.a.%
Oct 23 20:43 | 12 0.3% | 8.616 2h17m | 71.05 82485 n.a.%
ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated
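
The k_min line in this log can be reproduced by hand. Any factor of M_p = 2^p - 1 has the form 2kp + 1, so a bit level maps to a value of k. The sketch below assumes the lower bound is rounded down to a multiple of the class count (4620 when MORE_CLASSES is enabled, as the startup banner posted in this thread reports). The function names are illustrative and are not mfaktc's own.

```python
NUM_CLASSES = 4620  # class count with MORE_CLASSES enabled (assumed to apply here)


def k_for_bit_level(exponent, bits):
    """Largest k such that 2 * k * exponent <= 2**bits."""
    return (1 << bits) // (2 * exponent)


def round_down_to_class(k, num_classes=NUM_CLASSES):
    """Round k down to the nearest multiple of the class count."""
    return k - k % num_classes


if __name__ == "__main__":
    p = 140615327  # exponent from the assignment line
    print(round_down_to_class(k_for_bit_level(p, 72)))
```

Under these assumptions the printed value matches the logged k_min of 16791791417940. The logged k_max is not reproduced here because its exact rounding rule is not shown in the output.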

9) Message boards : Number crunching : Linux GPU errors (Message 6842)
Posted 23 Oct 2020 by Dark Angel

OK, here's the follow-up with a work file.

Same as the previous output but with this at the end:

got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
k_min = 16791791417940
k_max = 33583582839939
Using GPU kernel "barrett76_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Oct 23 20:27 | 0 0.1% | 7.842 2h05m | 78.07 82485 n.a.%
M140615327 has a factor: 38814612911305349835664385407
ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated
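
The factor line above is printed just before the 702 error, so it is worth a sanity check. A real factor from this assignment should lie between 2^72 and 2^73 and should divide 2^p - 1. The reported number is 95 bits long, well outside that range. The sketch below shows both checks; the helper names are hypothetical, and it makes no claim about what the divisibility test returns for the logged value.

```python
def in_bit_range(candidate, bit_min, bit_max):
    """True if 2**bit_min <= candidate < 2**bit_max."""
    return (1 << bit_min) <= candidate < (1 << bit_max)


def divides_mersenne(candidate, exponent):
    """True if candidate divides 2**exponent - 1."""
    # Modular exponentiation avoids building the huge Mersenne number.
    return pow(2, exponent, candidate) == 1


if __name__ == "__main__":
    p = 140615327
    reported = 38814612911305349835664385407  # value from the log line
    print("bits:", reported.bit_length())
    print("in 2^72..2^73:", in_bit_range(reported, 72, 73))
    print("divides M_p:", divides_mersenne(reported, p))
```

A result that fails either check would suggest the factor line is a side effect of the kernel being killed rather than a genuine find.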

10) Message boards : Number crunching : Linux GPU errors (Message 6841)
Posted 23 Oct 2020 by Dark Angel

Copied the .exe out of the zip archive and ran it in a terminal. This is the output:
mfaktc v0.21 (64bit built)

Compiletime options
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 193154bits
SIEVE_SPLIT 250
MORE_CLASSES enabled

Runtime options
SievePrimes 25000
SievePrimesAdjust 1
SievePrimesMin 5000
SievePrimesMax 100000
NumStreams 3
CPUStreams 3
GridSize 3
GPU Sieving enabled
GPUSievePrimes 82486
GPUSieveSize 2047Mi bits
GPUSieveProcessSize 32Ki bits
Checkpoints enabled
CheckpointDelay 300s
WorkFileAddDelay disabled
Stages enabled
StopAfterFactor class
PrintMode full
V5UserID (none)
ComputerID (none)
AllowSleep no
TimeStampInResults no

CUDA version info
binary compiled for CUDA 10.10
CUDA runtime version 10.10
CUDA driver version 11.0

CUDA device info
name Quadro K2000
compute capability 3.0
max threads per block 1024
max shared memory per MP 49152 byte
number of multiprocessors 2
CUDA cores per MP 192
CUDA cores - total 384
clock rate (CUDA cores) 954MHz
memory clock rate: 2000MHz
memory bus width: 128 bit

Automatic parameters
threads per grid 1048576
GPUSievePrimes (adjusted) 82486
GPUsieve minimum exponent 1055144

running a simple selftest...
Selftest statistics
number of tests 107
successfull tests 107

selftest PASSED!

Can't open workfile worktodo.txt
ERROR: get_next_assignment(): can't open "worktodo.txt"

Next I tried it with a work file with the relevant numbers but it had a hissy fit about that. I cut off the job numbers and the file ran. I'll let it run in the terminal and see what happens.

11) Message boards : Number crunching : Linux GPU errors (Message 6840)
Posted 23 Oct 2020 by Dark Angel

If it helps, this is what the client logs:
Output file TF_72-73_127-219M_wu_229446_0_0 for task TF_72-73_127-219M_wu_229446_0 absent

Is there any way to get a Linux native version of mfaktc? I'm running 64-bit Linux but only ever get project files compiled for Windows, plus the wrapper.

Running a dependency check on the wrapper gets me:
$ldd wrapper_26012-v2_x86_64-pc-linux-gnu
linux-vdso.so.1 (0x00007ffebd394000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb842ab7000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb842a9c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb842a79000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb842887000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb842c31000)

12) Message boards : Number crunching : Linux GPU errors (Message 6839)
Posted 22 Oct 2020 by Dark Angel

This is a typical error:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
17:46:37 (599282): wrapper (7.2.26012): starting
17:46:37 (599282): wrapper: running ./mfaktc.exe ( --device 0)
17:47:57 (599282): ./mfaktc.exe exited; CPU time 1.075320
17:47:57 (599282): app exit status: 0x100
17:47:57 (599282): called boinc_finish

</stderr_txt>
]]>

I have no idea what's going on, just that the old version worked and the new one doesn't on this particular card.

You could try to test it in standalone with and without the wrapper.

I'm running Linux. I don't think that will like a .exe file.

13) Message boards : Number crunching : Linux GPU errors (Message 6826)
Posted 22 Oct 2020 by Dark Angel

This is a typical error:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
17:46:37 (599282): wrapper (7.2.26012): starting
17:46:37 (599282): wrapper: running ./mfaktc.exe ( --device 0)
17:47:57 (599282): ./mfaktc.exe exited; CPU time 1.075320
17:47:57 (599282): app exit status: 0x100
17:47:57 (599282): called boinc_finish

</stderr_txt>
]]>

I have no idea what's going on, just that the old version worked and the new one doesn't on this particular card.
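
The numbers in this stderr output can be decoded a little further. This is a minimal sketch, assuming the wrapper's "app exit status" is the raw POSIX wait status of the child process, in which case 0x100 means mfaktc.exe exited with code 1. The 195 on the first line is, as far as I know, BOINC's EXIT_CHILD_FAILED, and -61 is the same byte read as a signed value. The function names are illustrative.

```python
def decode_wait_status(status):
    """Split a POSIX wait status into exit code and terminating signal."""
    return {
        "exit_code": (status >> 8) & 0xFF,  # high byte: value passed to exit()
        "signal": status & 0x7F,            # low 7 bits: signal number, 0 if none
    }


def as_signed_byte(code):
    """Interpret an 8-bit exit code as a signed value."""
    return code - 256 if code > 127 else code


if __name__ == "__main__":
    print(decode_wait_status(0x100))
    print(hex(195), as_signed_byte(195))
```

Read this way, the log says only that the child returned a generic failure code, so the useful detail has to come from running mfaktc.exe directly and reading its own output.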

14) Message boards : Number crunching : Linux GPU errors (Message 6821)
Posted 20 Oct 2020 by Dark Angel

I did that.
Fully reinstalled all the drivers. Rebooted. Reset the project. Still throws errors on this work but works fine on PrimeGrid. It's not opening the child process for some reason. Do the new work units require more RAM on the card than the old ones?

15) Message boards : Number crunching : Linux GPU errors (Message 6819)
Posted 18 Oct 2020 by Dark Angel

I left a machine running over the weekend while I went away, and I came back to see it's producing errors on all GPU tasks.
Every error occurs in less than a minute, all with exit code 0x100.
The machine in question was completing units successfully before I went away and hasn't done any automatic updates in that time.
This is the machine in question: http://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=208774
I made sure all the CUDA drivers are properly installed and rebooted the machine, with no change.
I moved the GPU to PrimeGrid (PPS Sieve) and it's running fine, which makes me suspect it's something with the work here, but I cannot confirm that.
Any help/insight would be appreciated.