R 290X processing more than 1 task at a time invalids

chase1902

Joined: 13 Aug 11

Posts: 37

Credit: 1264094642

RAC: 0

Gary this was a new install,

12 Aug 2015 21:05:59 UTC

Message 133104


Gary, this was a new install and I haven't added a config file. I just downloaded BOINC and set it to run. I don't use a config file; I just adjust the settings in Einstein.
Perhaps I installed BOINC wrong, but to be honest I didn't read any of the boxes that come up when installing it.

Yes, I couldn't see any of these GPUs running multiple tasks, so I was wondering if it's a driver problem that hasn't been resolved.

chase1902

Joined: 13 Aug 11

Posts: 37

Credit: 1264094642

RAC: 0

Yes don't mind anything under

12 Aug 2015 21:20:44 UTC

Message 133105 in response to message 133100


Yes, I don't mind anything under 80C, but over that gets me a bit worried. I blew a graphics card up last year on this machine when the temperature got too hot.
I stripped the computer down and gave it a good clean. It's amazing how much dust gets inside, and that's with filters on the intakes.
The temperature is back down now, so it should be good, although this computer runs very hot compared to my others. That must be down to its age, as it's got plenty of cooling, the same as my other computers.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117678182618

RAC: 35186198

RE: Gary this was a new

13 Aug 2015 3:54:33 UTC

Message 133106 in response to message 133104


Quote:

Gary this was a new install, I haven't added a config file ....

OK, this is even more weird then :-)

First of all, it won't be anything to do with the way you have installed BOINC. In fact, what I'm showing below leads me to believe that it's nothing to do with your host at all!!

I've had a look through the server log links for a number of your machines and there was no evidence of 'clennts' on the 290X machine but I found the following on the machine with the HD6900 series GPU. I've only snipped the first line to show the time stamp at the start and later on a group of lines that include the problem.

2015-08-13 00:23:14.2730 [PID=12527]   Request: [USER#xxxxx] [HOST#12001109] [IP xxx.xxx.xxx.94] client 7.4.42
....

2015-08-13 00:23:14.4850 [PID=12527] [version] Checking plan class 'SSE2'
2015-08-13 00:23:14.4880 [PID=12527] [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2015-08-13 00:23:14.4880 [PID=12527] [version] plan class ok
2015-08-13 00:23:14.4880 [PID=12527] [version] Best version of app einstein_S6BucketFU2UB is 1.01 ID 713 SSE2 (6.16 GFLOPS)
2015-08-13 00:23:14.4880 [PID=12527] [send] [HOST#12001109] [WU#225006221 h1_0505.75_S6GC1__S6BucketFU2UBb_35790206] using delay bound 1209600 (opt: 1209600 pess: 1209600)
2015-08-13 00:23:14.4887 [PID=12527] [debug] Sorted list of URLs follows [host timezone: UTC+3600]
2015-08-13 00:23:14.4887 [PID=12527] [debug] zone=+03600 url=http://einstein2.aei.uni-hannover.de
2015-08-13 00:23:14.4887 [PID=12527] [debug] zone=-18900 url=http://einstein-dl.syr.edu
2015-08-13 00:23:14.4888 [PID=12527] [debug] zone=-21600 url=http://einstein-dl2.phys.uwm.edu
2015-08-13 00:23:14.4888 [PID=12527] [debug] zone=-2882015-08-13 00:23:13.5132 [PID=12538] SCHEDULER_REQUEST::parse(): unrecognized: 0
2015-08-13 00:23:100 url=http://einstein.ligo.caltech.edu
2015-08-13 00:23:14.4890 [PID=12527] [send] [HOST#12001109] Sending app_version 713 einstein_S6BucketFU2UB 2 101 SSE2; 6.16 GFLOPS
2015-08-13 00:23:14.4898 [PID=12527] [send] est. duration for WU 225006221: unscaled 44771.88 scaled 58365.22

At first glance you would think the scheduler is complaining about your host including a rubbish tag in its request. However, if you look closely at the time stamps, you can see that the last [debug] line is stamped 00:23:14.4888 and included in the middle of the "zone=-288..." part of the line is the start of what appears to be a stray line with a time stamp of 2015-08-13 00:23:13.5132. I say a 'stray line' because of where it is sitting in the middle and because the time stamp is well before the starting time stamp of your host's scheduler log entry, i.e. earlier than 2015-08-13 00:23:14.2730. The other thing to notice is that there is a change in process ID. The scheduler process talking to your host is 12527. The process complaining about the tag is 12538. It seems like responses to two separate hosts are being mixed up.

Now that this has been noticed, take a look at what AgentB originally posted, i.e.,

2015-08-01 12:35:14.8597 [PID=18794] Request: [USER#xxxxx] [HOST#11995469] [IP xxx.xxx.xxx.168] client 7.4.42
2015-08-01 12:35:14.8603 [PID=18794] [send] effective_ncpus 6 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2015-08-01 12:35:14.8603 [PID=18794] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2015-08-01 12:35:14.8603 [PID=18794] [send] Not using matchmaker scheduling; Not using EDF sim
2015-08-01 12:35:14.8603 [PID=18794] [send] CPU: req 259200.00 sec, 6.00 instances; est delay 0.00
2015-08-01 12:35:14.8603 [PID=18794] [send] ATI: req 0.00 sec, 0.00 instances; est delay 0.00
2015-08-01 12:35:14.8603 2015-08-01 12:35:14.4255 [PID=18787] SCHEDULER_REQUEST::parse(): unrecognized: 0

Once again, two different PIDs and the complaint line has been randomly inserted in the middle of some other line. It has an incompatible time stamp for the log entry within which it has been dumped.
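For anyone who wants to check their own scheduler logs for the same pattern, here is a minimal sketch in Python. It assumes the "timestamp [PID=n]" prefix format seen in the excerpts above; the function name `find_stray_fragments` and the sample lines are purely illustrative and not part of any BOINC tool. It flags any line in which a second time stamp with a different PID shows up after the start of the line. It will not catch a stray line that happens to begin at the very start of a line.

```python
import re

# Assumed prefix format, taken from the log excerpts in this thread:
# "YYYY-MM-DD HH:MM:SS.ffff [PID=n]"
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{4}) \[PID=(\d+)\]")

def find_stray_fragments(lines):
    """Return (line_no, owner_pid, stray_pid, stray_time) for every
    'timestamp [PID=n]' prefix that appears mid-line with a PID that
    differs from the most recent cleanly started line."""
    hits = []
    owner_pid = None  # PID of the last line that began with a full prefix
    for line_no, line in enumerate(lines, start=1):
        for match in STAMP.finditer(line):
            pid = match.group(2)
            if match.start() == 0:
                owner_pid = pid          # normal start of a log line
            elif pid != owner_pid:
                hits.append((line_no, owner_pid, pid, match.group(1)))
    return hits

# Two lines copied from the excerpt AgentB posted.
sample = [
    "2015-08-01 12:35:14.8603 [PID=18794] [send] ATI: req 0.00 sec, 0.00 instances; est delay 0.00",
    "2015-08-01 12:35:14.8603 2015-08-01 12:35:14.4255 [PID=18787] SCHEDULER_REQUEST::parse(): unrecognized: 0",
]

for hit in find_stray_fragments(sample):
    print(hit)
```

On the sample, the second line is reported because PID 18787 appears in the middle of an entry that belongs to PID 18794.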

I'll give Bernd a heads-up about this.

Cheers,
Gary.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7225384931

RAC: 1041960

Gary, I've seen the

13 Aug 2015 4:23:59 UTC

Message 133107 in response to message 133106


Gary,

I've seen the "clennts" complaint in my logs, and have failed to check up on it. I did not notice the timing and mixed hosts issues you've outlined. But just now, one of my hosts has a log showing the same thing, I think, like this:

[pre]2015-08-13 03:48:11.0122 [PID=25123] Request: [USER#xxxxx] [HOST#10706295] [IP xxx.xxx.xxx.139] client 7.3.11
2015-08-13 03:48:11.0128 [PID=25123] [send] effective_ncpus 1 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2015-08-13 03:48:11.0128 [PID=25123] [send] effective_ngpus 2 max_jobs_on_host_gpu 999999
2015-08-13 03:48:11.0128 [PID=25123] [send] Not using matchmaker scheduling; Not using EDF sim
2015-08-13 03:48:11.0128 [PID=25123] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2015-08-13 03:48:11.0128 [PID=25123] [send] CUDA: req 4340.02 sec, 0.00 instances; est delay 0.00
2015-08-13 03:48:11.0128 [PID=25123] [send] work_req_seconds: 0.00 secs
2015-08-13 03:48:11.0128 [PID=25123] [send] available disk 59.01 GB, work_buf_min 343872
2015-08-13 03:48:11.0128 [PID=25123] [send] active_frac 0.999954 on_frac 0.995809 DCF 1.074428
2015-08-13 03:48:11.0168 [PID=25123] 2015-08-13 03:48:10.8334 [PID=25136] SCHEDULER_REQUEST::parse(): unrecognized: 0[/pre]

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117678182618

RAC: 35186198

Yep, exactly the same

13 Aug 2015 8:31:44 UTC

Message 133108 in response to message 133107


Yep, exactly the same 'symptoms'!

It's hard to imagine a whole swag of hosts out there with the same typo so I guess this has to be some sort of weird scheduler bug with so many examples of this showing up when you start looking. I haven't been looking at scheduler logs for any of my hosts lately. I hardly ever do these days because my control script is reporting all OK and, in the middle of winter, problems are few. So I picked a host at random just now and, sure enough, it has exactly the same type of 'complaint' about "clennts" :-).

I have reported this thread to Bernd earlier on.

Cheers,
Gary.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250565806

RAC: 34385

RE: 2015-08-13

15 Aug 2015 16:49:41 UTC

Message 133109


Quote:

2015-08-13 03:48:11.0168 [PID=25123] 2015-08-13 03:48:10.8334 [PID=25136] SCHEDULER_REQUEST::parse(): unrecognized: 0

This is not really a scheduler bug, but the effect of limited buffer sizes. There are two scheduler instances running, both writing to the same file (the scheduler log) at that time. The "unrecognized" warning is from the parser of the scheduler instance that is not dealing with your host. However, at the time of that message the host ID is not yet known and has not been written to the logs, so the "scheduler log publisher" doesn't know where to put it and assigns it to your host.

This is merely an effect of our way of publishing the scheduler logs, not a bug in the scheduler itself.
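To illustrate the general effect (two writers sharing one log file, each flushing in limited-size chunks), here is a toy model in Python. It is only a sketch under stated assumptions: the function `interleave`, the 16-character buffer and the shortened log lines are all made up for the example and do not reflect the actual server code or its real buffer sizes.

```python
def interleave(writers, buffer_size):
    """Toy model: each writer flushes at most buffer_size characters per
    turn into a shared log, taking turns in round-robin order."""
    pending = list(writers)
    out = []
    while any(pending):
        for i, text in enumerate(pending):
            if text:
                out.append(text[:buffer_size])    # flush one chunk
                pending[i] = text[buffer_size:]   # keep the remainder
    return "".join(out)

# Hypothetical, shortened log lines from two scheduler instances.
line_a = "[PID=1] zone=-28800 url=x\n"   # longer than one buffer
line_b = "[PID=2] bad tag\n"             # fits in a single buffer

print(interleave([line_a, line_b], buffer_size=16))
```

In this model the short line from the second writer lands in the middle of the first writer's longer line, which is the same shape as the "zone=-288..." entry quoted earlier in the thread.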

AgentB

Joined: 17 Mar 12

Posts: 915

Credit: 513211304

RAC: 0

RE: This is merely an

15 Aug 2015 17:54:19 UTC

Message 133110 in response to message 133109


Quote:

This is merely an effect of our way of publishing the scheduler logs, not a bug in the scheduler itself.

Thanks BM

I just noticed one of my hosts, gluon, picking up exactly the same "error"

[pre]
2015-08-15 16:15:23.2202 [PID=27664] [version] No CUDA devices found
2015-08-15 16:15:23.2202 [PID=27664] [version] Checking plan class 'BRP6-opencl-ati'
2015-08-15 16:2015-08-15 16:15:22.8218 [PID=27665] SCHEDULER_REQUEST::parse(): unrecognized: 0
2015-08-15 16:15:23.2203 [PID=27664] [version] Peak flops supplied: 5e+10
[/pre]

Am I right in reading this as a valid error, but one caused by some other host(s) at that time (not mine)?

Floyd1

Joined: 29 Jun 14

Posts: 14

Credit: 590463278

RAC: 0

Thanks for the explanation,

16 Aug 2015 22:06:46 UTC

Message 133111


Thanks for the explanation, Bernd, although it does not actually identify the cause of the error, or its likely impact.

However, that was drifting away from the original reason for the thread, namely that multiple concurrent WUs on an AMD 290X are consistently failing validation.

I have been able to replicate the original poster's observation and have tried a few things including running these ATI GPU tasks exclusively and still got the same results.

I don't recall seeing any errors reported during running, just either "Completed, marked as invalid" or "Validate error".

I believe BRP4 tasks run happily - this seems to be restricted to BRP6 tasks.

Is there a simple way to stop my machine from receiving BRP6 tasks but still get BRP4 ones until there is some progress on this?

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7225384931

RAC: 1041960

RE: Is there a simple way

16 Aug 2015 22:34:36 UTC

Message 133112 in response to message 133111


Quote:

Is there a simple way to stop my machine from receiving BRP6 tasks but still get BRP4 ones until there is some progress on this?

1. identify which location (aka venue) you have this machine set to (default|home|work|school).
2. open the Einstein preferences from your Einstein account page
3. select the location for your machine, and click on "edit"
4. in the "Run only the selected applications" section, deselect "Binary Radio Pulsar Search (Parkes PMPS XT)"
5. enable the things you are willing to run

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

RE: Is there a simple way

16 Aug 2015 22:36:27 UTC

Message 133113 in response to message 133111


Quote:

Is there a simple way to stop my machine from receiving BRP6 tasks but still get BRP4 ones until there is some progress on this?

Yes, just go to your Einstein@Home prefs and under the heading "Run only the selected applications" opt out of "Binary Radio Pulsar Search (Parkes PMPS XT)".

But be aware that BRP4G, aka "Binary Radio Pulsar Search (Arecibo, GPU)", does not have work available all the time. You will most certainly run out of work for your GPU if you only allow BRP4G tasks on it. Right now the Server status page reports 0 (zero) tasks to send for BRP4G.
