Odd issue(s) with srbase6 WU's

Profile 3w9S5axoMXm13kFkYfnpnhFC5C3U
Joined: 17 Nov 16
Posts: 97
Credit: 188,684,281
RAC: 40,031
Message 4966 - Posted: 19 Feb 2019, 13:01:14 UTC
Last modified: 19 Feb 2019, 13:11:08 UTC

On my 1090t, 8 average (srbase6) WU's came down and they refuse to multi-thread. They start, hit 33% CPU (although set to 6 of 6 cores), then drop to 0 CPU, 0 I/O, 0 disk reads and 0 network for as long as I watched.
One was at 99.6% complete after 10 hours and was still doing nothing as I watched with Process Hacker. (I caught this because the machine was producing less heat.)
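For anyone who wants to catch this kind of stall without watching the machine by hand, here is a rough sketch of the check, not a tested monitoring tool. It assumes you already have periodic samples of the process's cumulative CPU time (read off Process Hacker, or collected by some script of your own); the function name, window and threshold are hypothetical choices for illustration.

```python
def is_stalled(samples, window_seconds=600.0, min_cpu_seconds=1.0):
    """Decide whether a process looks stuck at 0% CPU.

    samples: list of (wall_clock_seconds, cumulative_cpu_seconds), oldest first.
    Returns True when the process gained less than `min_cpu_seconds` of CPU
    time across a span of at least `window_seconds` ending at the last sample.
    """
    if len(samples) < 2:
        return False
    latest_time, latest_cpu = samples[-1]
    # Walk backwards to the newest sample that is old enough to span the window.
    for when, cpu in reversed(samples[:-1]):
        if latest_time - when >= window_seconds:
            return (latest_cpu - cpu) < min_cpu_seconds
    return False  # not enough history yet to judge


if __name__ == "__main__":
    # Hypothetical samples: CPU time barely moves over 10 minutes.
    stuck = [(0, 3000.0), (300, 3000.2), (600, 3000.3)]
    print(is_stalled(stuck))
```

The window should be long enough that a task briefly preempted by a high-priority project is not flagged by mistake.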

They seem to work fine at 1 thread, but then come back invalid; rounding errors > 0.4 are the only evident error (2 still in progress):
WU
369102975 Validate error
369102997 Validate error
369102952 Validate error
369102978 Validate error
369102947 Validate error
369102968 Validate error

I checked the srbase6 app running on my Xeons: they are set to 6 threads but typically pull only about 2.5 to 3 threads' worth of active CPU cycles, so I changed the app_config back to 3 threads for them (they have short deadlines, so I was giving them ample thread counts).

However, the Xeon srbase6 results are valid, so these are two separate issues.

1) Should I refuse srbase6 on the 1090t because it's not capable of doing srbase6 WU's accurately? (All the other work completed by that machine is valid so far, but it hasn't tried every app type, srbase through srbase12.)
Can the WU's be invalid because their thread count was changed midway through? This doesn't make sense to me because 5 or 6 of the WU's hadn't even started yet and ran at 1 thread for their whole run.

2) Why is srbase6 refusing to perform at 6 threads?
(6 threads show up in the properties of the PID when looked at with Process Hacker, 1 idle and the other 5 normal priority).
Is it because the data set runs out on some of the threads before others so that average thread usage is under 3 cores?

BTW, the long (srbase5) WU's, when set to -t8, will use 8 threads with 99% core usage.

Profile 3w9S5axoMXm13kFkYfnpnhFC5C3U
Joined: 17 Nov 16
Posts: 97
Credit: 188,684,281
RAC: 40,031
Message 4967 - Posted: 19 Feb 2019, 19:00:58 UTC - in response to Message 4966.

7 invalid and the last completed valid.

Not sure if that is from shifting thread count midway but seems coincidental. Don't understand how that happened.

The failure to run more than 3 threads continuously on srbase5 is the more important mystery to me.

Profile 3w9S5axoMXm13kFkYfnpnhFC5C3U
Joined: 17 Nov 16
Posts: 97
Credit: 188,684,281
RAC: 40,031
Message 4970 - Posted: 19 Feb 2019, 20:41:14 UTC
Last modified: 19 Feb 2019, 20:42:20 UTC

Learned some things.

Just clearing the app_config.xml down to a simple <project_max_concurrent>4</project_max_concurrent> doesn't change the thread count, as the project stores that in cc_config.xml permanently. You actually have to set the thread counts and average CPUs (avg_ncpus) to 1 on every app, save app_config.xml, then reread the config files.
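As a rough illustration of the reset described above, the sketch below writes out an app_config.xml that pins every listed app to one thread. The element names (app_config, app_version, app_name, avg_ncpus, cmdline, project_max_concurrent) are standard BOINC app_config.xml elements, but the app names "srbase5" and "srbase6" are simply taken from this thread and may not match the project's real app_name strings, and any plan_class is left out. Treat it as a starting point, not a verified config.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_names, threads=1, max_concurrent=4):
    """Return app_config.xml text pinning each listed app to `threads` threads."""
    root = ET.Element("app_config")
    for name in app_names:
        version = ET.SubElement(root, "app_version")
        # Assumption: `name` matches the project's actual app_name.
        ET.SubElement(version, "app_name").text = name
        # avg_ncpus is the number of CPUs BOINC budgets for each task.
        ET.SubElement(version, "avg_ncpus").text = str(threads)
        # -tN is the thread switch visible in the llr.exe command line
        # shown in the wrapper log elsewhere in this thread.
        ET.SubElement(version, "cmdline").text = f"-t{threads}"
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config(["srbase5", "srbase6"]))
```

After saving the output as app_config.xml in the project folder, the config files still have to be reread from the BOINC Manager for the change to take effect.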

The issue of llr.exe eventually using 0% CPU when multi-threaded only happens on the 1090t, and only if another project comes into RAM. If llr.exe is the only running application then things run smoothly (but there are high-priority WU's from another project now as well).

I've set everything back to single thread and am not accepting srbase5 or srbase6 apps on the 1090t; I won't be able to figure out this mystery until I change the OS on that machine in a while.

Profile Conan
Joined: 7 Dec 14
Posts: 38
Credit: 15,425,638
RAC: 17,274
Message 4971 - Posted: 19 Feb 2019, 22:05:16 UTC

I have a 4 core Phenom that is showing similar behavior.

I have aborted 6 work units so far that got as high as 99.993% but were sitting there spinning their wheels, doing nothing at all.
Zero CPU is being used but all 4 cores are held up.
The longest one I caught was over 10 hours.

They are fighting with a high-priority project at the moment, so they are using 3 cores, not 4, but still, if I don't come by and check, a WU could be stuck for a very long time.

Any WU I get may or may not show that multi-threading worked; it seems a bit random. The run time and CPU time are often the same.

Conan
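The run time versus CPU time comparison above can be turned into a quick number: CPU time divided by run time gives the average count of busy cores, so a ratio near 1 means the task effectively ran single threaded. This is only a sketch; the function names and the 1.5 threshold are hypothetical, and the sample figures are made up, not taken from any result page.

```python
def effective_threads(cpu_seconds, run_seconds):
    """Average number of busy cores over a run: CPU time / wall-clock time."""
    if run_seconds <= 0:
        raise ValueError("run_seconds must be positive")
    return cpu_seconds / run_seconds


def looks_multithreaded(cpu_seconds, run_seconds, threshold=1.5):
    """True when the task kept more than `threshold` cores busy on average."""
    return effective_threads(cpu_seconds, run_seconds) > threshold


if __name__ == "__main__":
    # Hypothetical task where run time equals CPU time.
    print(effective_threads(36000.0, 36000.0))
    print(looks_multithreaded(36000.0, 36000.0))
    # Hypothetical task that used three times more CPU time than run time.
    print(effective_threads(30000.0, 10000.0))
```

A task that spent hours stalled at 0% CPU will also show a low ratio, so a low number alone does not say whether threading failed or the process hung.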

Profile 3w9S5axoMXm13kFkYfnpnhFC5C3U
Joined: 17 Nov 16
Posts: 97
Credit: 188,684,281
RAC: 40,031
Message 4972 - Posted: 20 Feb 2019, 18:08:17 UTC - in response to Message 4971.
Last modified: 20 Feb 2019, 18:09:39 UTC



Any WU I get may or may not show that multi-threading worked; it seems a bit random. The run time and CPU time are often the same.

Conan


Thanks for speaking up. So it seems to be a quirk of the Phenom; maybe a threading instruction that doesn't behave the way it does on Intel architecture? If so, it would be OS independent. What OS are you on? Mine is Windows 7 Pro.

As long as llr.exe is the ONLY BOINC application (or llr is single threaded), I get hundreds of valid, multi-threaded results (confirmed in the WU log) with 99.7% CPU usage (this Phenom is my testing machine and I've been watching it like a hawk for weeks collecting data on projects I hadn't tried before and Collatz for the video cards I bought).

Profile 3w9S5axoMXm13kFkYfnpnhFC5C3U
Joined: 17 Nov 16
Posts: 97
Credit: 188,684,281
RAC: 40,031
Message 4973 - Posted: 20 Feb 2019, 18:16:41 UTC

BTW, the invalid results aren't related to the threading issue; they come from changing the thread count in the middle of the run:


------------------------------------------------------
Starting N+1 prime test of 506*143^427250-1
Using AMD K10 FFT length 320K, Pass1=320, Pass2=1K, 2 threads, a = 3

14:17:18 (1684): wrapper (7.5.26012): starting
14:17:18 (1684): wrapper: running llr.exe ( -d -oPgenInputFile=input.prp -oPgenOutputFile=primes.txt -oDiskWriteTime=10 -oOutputIterations=50000 -oResultsFileIterations=99999999 -t1)
Base factorized as : 11*13
Base prime factor(s) taken : 13
Starting N+1 prime test of 506*143^427250-1
Using AMD K10 FFT length 320K, Pass1=320, Pass2=1K, a = 3


Iter: 1/3059065, ERROR: ROUND OFF (0.4999000276) > 0.4
Continuing from last save file.
Starting N+1 prime test of 506*143^427250-1
Using AMD K10 FFT length 320K, Pass1=320, Pass2=1K, a = 3
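To spot this pattern in other results without reading each log by eye, something like the sketch below could scan the task output for the thread count on each "Using ... FFT length" line and for round-off errors. It assumes the line formats are exactly as in the log above, and that a line with no "N threads" field means a single thread (which fits the -t1 restart shown); both are assumptions, and the function names are made up for illustration.

```python
import re

FFT_RE = re.compile(r"^Using .* FFT length ")
THREADS_RE = re.compile(r"(\d+) threads")
ROUNDOFF_RE = re.compile(
    r"Iter: (\d+)/(\d+), ERROR: ROUND OFF \(([\d.]+)\) > ([\d.]+)"
)


def scan_llr_log(text):
    """Return (thread_counts, roundoff_errors) found in LLR task output.

    thread_counts: one entry per 'Using ... FFT length' line, in order.
    roundoff_errors: list of (iteration, error_value) tuples.
    """
    thread_counts = []
    roundoff_errors = []
    for raw in text.splitlines():
        line = raw.strip()
        if FFT_RE.match(line):
            found = THREADS_RE.search(line)
            # Assumption: no "N threads" field means the run is single threaded.
            thread_counts.append(int(found.group(1)) if found else 1)
            continue
        error = ROUNDOFF_RE.search(line)
        if error:
            roundoff_errors.append((int(error.group(1)), float(error.group(3))))
    return thread_counts, roundoff_errors


def thread_count_changed(thread_counts):
    """True when the same task was (re)started with different thread counts."""
    return len(set(thread_counts)) > 1


# Lines copied from the log above.
SAMPLE = """\
Using AMD K10 FFT length 320K, Pass1=320, Pass2=1K, 2 threads, a = 3
Using AMD K10 FFT length 320K, Pass1=320, Pass2=1K, a = 3
Iter: 1/3059065, ERROR: ROUND OFF (0.4999000276) > 0.4
Using AMD K10 FFT length 320K, Pass1=320, Pass2=1K, a = 3
"""

if __name__ == "__main__":
    counts, errors = scan_llr_log(SAMPLE)
    print(counts, errors, thread_count_changed(counts))
```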

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7998
Credit: 44,538,964
RAC: 1
Message 4974 - Posted: 21 Feb 2019, 17:01:49 UTC

Thanks for collecting data about this issue. I don't know if PrimeGrid has the same problems. I will forward this to the developer of the app.

Profile 3w9S5axoMXm13kFkYfnpnhFC5C3U
Joined: 17 Nov 16
Posts: 97
Credit: 188,684,281
RAC: 40,031
Message 4976 - Posted: 21 Feb 2019, 20:03:46 UTC - in response to Message 4974.

I don't know if PrimeGrid has the same problems.


Oh good idea.

I'll have to try some llr.exe-based WU's from PrimeGrid on that machine.

Also need to record the version details of the process in RAM.

New video card shows up tomorrow so testing it out will take priority for a week.

Profile rebirther
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 2 Jan 13
Posts: 7998
Credit: 44,538,964
RAC: 1
Message 4978 - Posted: 23 Feb 2019, 14:18:09 UTC

Got an answer from the app dev:

The only thing I did to implement the multithreading in LLR was to add this line before each setup of the Gwnum code:

gwset_num_threads (gwdata, );

The Gwnum version used in LLR 3.8.21 is 28.14

There are no other changes in LLR code related to multithreading.
And it has worked fine for more than two years now...


So it must have something to do with the gwnum lib.

Profile 3w9S5axoMXm13kFkYfnpnhFC5C3U
Joined: 17 Nov 16
Posts: 97
Credit: 188,684,281
RAC: 40,031
Message 4997 - Posted: 28 Feb 2019, 18:36:24 UTC

Here's a possible clue to the invalid results from rounding errors (about a 5% failure rate):

NFS tasks have a specific WU written with some exception for the Phenom II X6.

The WU running on my Xeons is named "14e Lattice Sieve v1.08 (notphenomiix6) windows_x86_64".

The WU running on the Phenom II X6 is "14e Lattice Sieve v1.08 windows_intelx86".

Those are single-threaded WU's, so I assume it's a fix for the rounding errors?


It's 8 years old, but it's the home to the GPU's and still puts out 1/3 the work of my Xeons for equivalent wattage per thread (when undervolted by 0.05 V).


Copyright © 2014-2025 BOINC Confederation / rebirther