1)
Message boards :
News :
Linux wrapper suspending task issue
(Message 8930)
Posted 10 Jun 2023 by Dark Angel I've noticed this repeatedly. Since I often suspend BOINC work to use my system for other things that require both a large amount of RAM and full GPU access, this comes up a lot. I had previously considered it to be a known but unfortunate issue and something that just had to be accepted to run the project. I'm glad it's being looked into. |
2)
Message boards :
Number crunching :
Linux GPU errors
(Message 6869)
Posted 27 Oct 2020 by Dark Angel Back to 435 drivers, still no go. Still running PrimeGrid tasks with no drama (not at the same time, of course). Running ldd on mfaktc.exe confirms I have all required libraries installed correctly. |
3)
Message boards :
Number crunching :
Linux GPU errors
(Message 6867)
Posted 26 Oct 2020 by Dark Angel Went to 455 drivers instead. Still no go. |
4)
Message boards :
Number crunching :
Linux GPU errors
(Message 6866)
Posted 25 Oct 2020 by Dark Angel I'll try rolling the driver back to 440 and see what happens. Hardware and temps I doubt are an issue given the card works fine on other prime number projects but it won't hurt to try just the same. |
5)
Message boards :
Number crunching :
Linux GPU errors
(Message 6863)
Posted 25 Oct 2020 by Dark Angel I was wondering the same thing and spent time today cleaning out any differences between that rig and the other one that works fine. I found an issue with the kernel version I was using and VirtualBox, which I figure is unrelated to this, and went back to a different kernel that works on both machines, then reinstalled the drivers to be sure. Just rebooted and tried it but the same issue holds. I'm running the 450 drivers and 5.4 Linux kernel, btw. |
6)
Message boards :
Number crunching :
Linux GPU errors
(Message 6849)
Posted 24 Oct 2020 by Dark Angel Cleaned the card out, got very little dust out of it, and checked it. Same error. Checked the temperature (nVidia Settings app): the unit fails before the temperature can update (lagging display, normal for this card). Tried it on a PPS Sieve unit (PrimeGrid) and it's crunching away happily at under 60C after five minutes continuous, fan still at only 35%. A Collatz unit (OpenCL) runs fine at the same temperatures, and GPUGrid has the card at 61C and 37% fan after 10 minutes continuous. I compared the GPUGrid work on my GTX 1060 rig and it runs a couple of degrees cooler than SRBase but not much. Edit: it's lagging because I'm accessing the desktop remotely as well as the load on the GPU. Just to be clear. |
7)
Message boards :
Number crunching :
Linux GPU errors
(Message 6847)
Posted 23 Oct 2020 by Dark Angel I copied the application and data file to a known good machine (1060 6GB card) and it ran fine, so it's something particular to that K2000 rig. That little Quadro only has 2GB of VRAM. If that's not enough for the new work that's fine, but it might be worth putting a test in to confirm the card has at least 3GB before giving out work. Of course it's also an older card so there won't be many of them still in service. |
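A check like the one suggested here could be sketched as follows. This is only an illustration: it assumes `nvidia-smi` is on the PATH, the 3 GB threshold is the figure floated in the post rather than a measured requirement, and the helper names are made up for this example.

```python
import subprocess

MIN_VRAM_MIB = 3 * 1024  # assumed threshold: the "at least 3GB" floated in the post


def parse_total_vram_mib(smi_output):
    """Parse the output of
    `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`:
    one integer (MiB) per line, one line per GPU."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def gpus_with_enough_vram(smi_output, min_mib=MIN_VRAM_MIB):
    """Return the indices of GPUs whose total memory meets the threshold."""
    totals = parse_total_vram_mib(smi_output)
    return [i for i, mib in enumerate(totals) if mib >= min_mib]


def query_nvidia_smi():
    """Run nvidia-smi and return its raw output (needs the NVIDIA driver installed)."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout


if __name__ == "__main__":
    sample = "1999\n6078\n"  # illustrative values, not read from the machines above
    print(gpus_with_enough_vram(sample))
```

On a real host, `gpus_with_enough_vram(query_nvidia_smi())` would give the list of cards that pass. A project-side check would more likely live in the scheduler than in a client script, so treat this as a local diagnostic only.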
8)
Message boards :
Number crunching :
Linux GPU errors
(Message 6844)
Posted 23 Oct 2020 by Dark Angel Using the same work file (the unit error'd out in the actual client):
got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
 k_min = 16791791417940
 k_max = 33583582839939
Using GPU kernel "barrett76_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Oct 23 20:42 |     0  0.1% | 16.568   4h24m |     36.95    82485    n.a.%
Oct 23 20:43 |     9  0.2% | 16.571   4h24m |     36.94    82485    n.a.%
Oct 23 20:43 |    12  0.3% |  8.616   2h17m |     71.05    82485    n.a.%
ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated |
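The k_min and k_max values in that log can be reproduced by hand. Any factor of a Mersenne number M_p = 2^p - 1 (p prime) has the form q = 2kp + 1, so searching q between 2^72 and 2^73 means searching k between roughly 2^72/(2p) and 2^73/(2p). The sketch below assumes the lower bound is rounded down to a multiple of 4620, the class count mfaktc is understood to use when MORE_CLASSES is enabled. That rounding rule is inferred from the numbers in the log, not taken from the mfaktc source, and `k_range` is an illustrative name.

```python
NUM_CLASSES = 4620  # assumption: class count with MORE_CLASSES enabled


def k_range(exponent, bit_min, bit_max, num_classes=NUM_CLASSES):
    """Return (k_min, k_max) for candidate factors q = 2*k*exponent + 1
    lying between 2**bit_min and 2**bit_max."""
    k_min = (1 << bit_min) // (2 * exponent)
    k_min -= k_min % num_classes  # round down to a class boundary (assumed rule)
    k_max = (1 << bit_max) // (2 * exponent)
    return k_min, k_max


if __name__ == "__main__":
    print(k_range(140615327, 72, 73))
```

For the assignment above this yields the same pair the log prints, 16791791417940 and 33583582839939.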
9)
Message boards :
Number crunching :
Linux GPU errors
(Message 6842)
Posted 23 Oct 2020 by Dark Angel Ok, here's the follow-up with a work file. Same as the previous output but with this at the end:
got assignment: exp=140615327 bit_min=72 bit_max=73 (6.80 GHz-days)
Starting trial factoring M140615327 from 2^72 to 2^73 (6.80 GHz-days)
 k_min = 16791791417940
 k_max = 33583582839939
Using GPU kernel "barrett76_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Oct 23 20:27 |     0  0.1% |  7.842   2h05m |     78.07    82485    n.a.%
M140615327 has a factor: 38814612911305349835664385407
ERROR: cudaGetLastError() returned 702: the launch timed out and was terminated |
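The factor reported in that output can be sanity-checked independently. A genuine factor q of M_p satisfies 2^p ≡ 1 (mod q), and for this assignment it should lie between 2^72 and 2^73. The sketch below tests both conditions with Python's built-in integers. The function names are illustrative, and the divisibility result for the logged value is left to be run rather than stated here.

```python
def divides_mersenne(p, q):
    """True if q divides M_p = 2**p - 1, i.e. 2**p is congruent to 1 mod q."""
    return pow(2, p, q) == 1


def in_bit_range(q, bit_min, bit_max):
    """True if 2**bit_min <= q < 2**bit_max."""
    return (1 << bit_min) <= q < (1 << bit_max)


if __name__ == "__main__":
    p = 140615327
    q = 38814612911305349835664385407  # factor as printed in the log above
    print("bits:", q.bit_length())
    print("in 2^72..2^73:", in_bit_range(q, 72, 73))
    print("divides M_p:", divides_mersenne(p, q))
```

The logged value is 95 bits long, so it falls well outside the 72 to 73 bit range being searched. That may mean the number is an artifact of the kernel being cut off by the launch timeout rather than a real result, though that is only a guess from the size.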
10)
Message boards :
Number crunching :
Linux GPU errors
(Message 6841)
Posted 23 Oct 2020 by Dark Angel Copied the .exe out of the zip archive and ran it in a terminal. This is the output:
mfaktc v0.21 (64bit built)

Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  SievePrimesMin            5000
  SievePrimesMax            100000
  NumStreams                3
  CPUStreams                3
  GridSize                  3
  GPU Sieving               enabled
  GPUSievePrimes            82486
  GPUSieveSize              2047Mi bits
  GPUSieveProcessSize       32Ki bits
  Checkpoints               enabled
  CheckpointDelay           300s
  WorkFileAddDelay          disabled
  Stages                    enabled
  StopAfterFactor           class
  PrintMode                 full
  V5UserID                  (none)
  ComputerID                (none)
  AllowSleep                no
  TimeStampInResults        no

CUDA version info
  binary compiled for CUDA  10.10
  CUDA runtime version      10.10
  CUDA driver version       11.0

CUDA device info
  name                      Quadro K2000
  compute capability        3.0
  max threads per block     1024
  max shared memory per MP  49152 byte
  number of multiprocessors 2
  CUDA cores per MP         192
  CUDA cores - total        384
  clock rate (CUDA cores)   954MHz
  memory clock rate:        2000MHz
  memory bus width:         128 bit

Automatic parameters
  threads per grid          1048576
  GPUSievePrimes (adjusted) 82486
  GPUsieve minimum exponent 1055144

running a simple selftest...
Selftest statistics
  number of tests           107
  successfull tests         107

selftest PASSED!

Can't open workfile worktodo.txt
ERROR: get_next_assignment(): can't open "worktodo.txt"

Next I tried it with a work file with the relevant numbers but it had a hissy fit about that. I cut off the job numbers and the file ran. I'll let it run in the terminal and see what happens. |
11)
Message boards :
Number crunching :
Linux GPU errors
(Message 6840)
Posted 23 Oct 2020 by Dark Angel If it helps, this is what the client logs:
Output file TF_72-73_127-219M_wu_229446_0_0 for task TF_72-73_127-219M_wu_229446_0 absent
Is there any way to get a Linux native version of mfaktc? I'm running 64bit Linux but only ever get project files compiled for Windows and the wrapper. Running a dependency check on the wrapper gets me:
$ ldd wrapper_26012-v2_x86_64-pc-linux-gnu
  linux-vdso.so.1 (0x00007ffebd394000)
  libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb842ab7000)
  libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb842a9c000)
  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb842a79000)
  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb842887000)
  /lib64/ld-linux-x86-64.so.2 (0x00007fb842c31000) |
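When ldd output gets long, the lines that matter are the ones marked "not found". A small sketch that filters for them, using part of the wrapper output above as input. `missing_libraries` and the `libexample.so.1` line are made up for illustration.

```python
def missing_libraries(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Resolved entries look like "name => /path (addr)";
        # unresolved ones look like "name => not found".
        name, sep, target = line.strip().partition(" => ")
        if sep and target.startswith("not found"):
            missing.append(name)
    return missing


SAMPLE = """\
linux-vdso.so.1 (0x00007ffebd394000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb842ab7000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb842887000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb842c31000)
"""

if __name__ == "__main__":
    print(missing_libraries(SAMPLE))
    # Hypothetical line showing what an unresolved dependency looks like.
    print(missing_libraries("libexample.so.1 => not found\n"))
```

For the excerpt above the list comes back empty, which agrees with the post: every dependency of the wrapper resolves.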
12)
Message boards :
Number crunching :
Linux GPU errors
(Message 6839)
Posted 22 Oct 2020 by Dark Angel This is a typical error: I'm running Linux. I don't think that will like a .exe file. |
13)
Message boards :
Number crunching :
Linux GPU errors
(Message 6826)
Posted 22 Oct 2020 by Dark Angel This is a typical error:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
17:46:37 (599282): wrapper (7.2.26012): starting
17:46:37 (599282): wrapper: running ./mfaktc.exe ( --device 0)
17:47:57 (599282): ./mfaktc.exe exited; CPU time 1.075320
17:47:57 (599282): app exit status: 0x100
17:47:57 (599282): called boinc_finish
</stderr_txt>
]]>
I have no idea what's going on, just that the old version worked and the new one doesn't on this particular card. |
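The numbers in that log are easier to read once decoded. On Linux a wait status carries the child's exit code in bits 8 to 15, so `0x100` would correspond to mfaktc.exe exiting with code 1, and the `-61` printed next to 195 is the same byte read as a signed 8-bit value. A minimal sketch of the arithmetic, assuming the wrapper prints a raw wait status with no signal bits set (helper names are illustrative):

```python
def exit_code_from_wait_status(status):
    """Extract the exit code from a Linux wait status (bits 8-15).
    Only meaningful when the low 7 bits (terminating signal) are zero."""
    if status & 0x7F:
        raise ValueError("process was terminated by a signal")
    return (status >> 8) & 0xFF


def as_signed_byte(value):
    """Reinterpret an unsigned 8-bit value as a signed one."""
    return value - 256 if value >= 128 else value


if __name__ == "__main__":
    print(exit_code_from_wait_status(0x100))
    print(hex(195), as_signed_byte(195))
```

Read this way, the wrapper is reporting that the child process itself returned a non-zero exit code, which fits the standalone runs where mfaktc stops with a CUDA error.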
14)
Message boards :
Number crunching :
Linux GPU errors
(Message 6821)
Posted 20 Oct 2020 by Dark Angel I did that. Fully reinstalled all the drivers. Rebooted. Reset the project. Still throws errors on this work but works fine on PrimeGrid. It's not opening the child process for some reason. Do the new work units require more RAM on the card than the old ones? |
15)
Message boards :
Number crunching :
Linux GPU errors
(Message 6819)
Posted 18 Oct 2020 by Dark Angel I left a machine running over the weekend while I went away and came back to find it producing errors on all GPU tasks. Every error occurs in less than a minute, all with exit code 0x100. The machine in question was completing units successfully before I went away and hasn't done any automatic updates in that time. This is the machine in question: http://srbase.my-firewall.org/sr5/show_host_detail.php?hostid=208774 I made sure all the CUDA drivers are properly installed and rebooted the machine, all to no effect. I moved the GPU to PrimeGrid (PPS Sieve) and it's running fine, which makes me suspect it's something with the work here, but I cannot confirm that. Any help/insight would be appreciated. |