R 290X processing more than 1 task at a time invalids

chase1902
Joined: 13 Aug 11
Posts: 37
Credit: 1264094642
RAC: 0

Gary, this was a new install and I haven't added a config file. I just downloaded BOINC and set it to run. I don't use a config file; I just adjust the settings in Einstein.
Perhaps I installed BOINC wrong, but to be honest I didn't read any of the boxes that come up when installing it.

Yes, I couldn't see any of these GPUs running multiple tasks, so I was wondering if it's a driver problem that hasn't been resolved.

chase1902
Joined: 13 Aug 11
Posts: 37
Credit: 1264094642
RAC: 0

Yes, I don't mind anything under 80C, but over that gets me a bit worried. I blew up a graphics card last year on this machine when the temperature got too hot.
I stripped the computer down and gave it a good clean. It's amazing how much dust gets inside, and that's with filters on the intakes.
The temperature is back down now so it should be good, although this computer runs very hot compared to my others. That must be down to its age, as it has plenty of cooling, the same as my other computers.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5849
Credit: 110009866534
RAC: 23982007

Quote:
Gary this was a new install, I haven't added a config file ....


OK, this is even more weird then :-)

First of all, it won't be anything to do with the way you have installed BOINC. In fact, what I'm showing below leads me to believe that it's nothing to do with your host at all!!

I've had a look through the server log links for a number of your machines and there was no evidence of 'clennts' on the 290X machine but I found the following on the machine with the HD6900 series GPU. I've only snipped the first line to show the time stamp at the start and later on a group of lines that include the problem.

2015-08-13 00:23:14.2730 [PID=12527]   Request: [USER#xxxxx] [HOST#12001109] [IP xxx.xxx.xxx.94] client 7.4.42
....

2015-08-13 00:23:14.4850 [PID=12527] [version] Checking plan class 'SSE2'
2015-08-13 00:23:14.4880 [PID=12527] [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2015-08-13 00:23:14.4880 [PID=12527] [version] plan class ok
2015-08-13 00:23:14.4880 [PID=12527] [version] Best version of app einstein_S6BucketFU2UB is 1.01 ID 713 SSE2 (6.16 GFLOPS)
2015-08-13 00:23:14.4880 [PID=12527] [send] [HOST#12001109] [WU#225006221 h1_0505.75_S6GC1__S6BucketFU2UBb_35790206] using delay bound 1209600 (opt: 1209600 pess: 1209600)
2015-08-13 00:23:14.4887 [PID=12527] [debug] Sorted list of URLs follows [host timezone: UTC+3600]
2015-08-13 00:23:14.4887 [PID=12527] [debug] zone=+03600 url=http://einstein2.aei.uni-hannover.de
2015-08-13 00:23:14.4887 [PID=12527] [debug] zone=-18900 url=http://einstein-dl.syr.edu
2015-08-13 00:23:14.4888 [PID=12527] [debug] zone=-21600 url=http://einstein-dl2.phys.uwm.edu
2015-08-13 00:23:14.4888 [PID=12527] [debug] zone=-2882015-08-13 00:23:13.5132 [PID=12538] SCHEDULER_REQUEST::parse(): unrecognized: 0
2015-08-13 00:23:100 url=http://einstein.ligo.caltech.edu
2015-08-13 00:23:14.4890 [PID=12527] [send] [HOST#12001109] Sending app_version 713 einstein_S6BucketFU2UB 2 101 SSE2; 6.16 GFLOPS
2015-08-13 00:23:14.4898 [PID=12527] [send] est. duration for WU 225006221: unscaled 44771.88 scaled 58365.22

At first glance you would think the scheduler is complaining about your host including a rubbish tag in its request. However if you look closely at the time stamps, you can see that the last [debug] line is stamped 00:23:14.4888 and included in the middle of the "zone=-288..." part of the line is the start of what appears to be a stray line with a time stamp of 2015-08-13 00:23:13.5132. I say a 'stray line' because of where it is sitting in the middle and because the time stamp is well before the starting time stamp of your host's scheduler log entry, ie. earlier than 2015-08-13 00:23:14.2730. The other thing to notice is that there is a change in process ID. The scheduler process talking to your host is 12527. The process complaining about the tag is 12538. It seems like responses to two separate hosts are being mixed up.

Now that this has been noticed, take a look at what AgentB originally posted, ie.,

2015-08-01 12:35:14.8597 [PID=18794] Request: [USER#xxxxx] [HOST#11995469] [IP xxx.xxx.xxx.168] client 7.4.42
2015-08-01 12:35:14.8603 [PID=18794] [send] effective_ncpus 6 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2015-08-01 12:35:14.8603 [PID=18794] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2015-08-01 12:35:14.8603 [PID=18794] [send] Not using matchmaker scheduling; Not using EDF sim
2015-08-01 12:35:14.8603 [PID=18794] [send] CPU: req 259200.00 sec, 6.00 instances; est delay 0.00
2015-08-01 12:35:14.8603 [PID=18794] [send] ATI: req 0.00 sec, 0.00 instances; est delay 0.00
2015-08-01 12:35:14.8603 2015-08-01 12:35:14.4255 [PID=18787] SCHEDULER_REQUEST::parse(): unrecognized: 0

Once again, two different PIDs and the complaint line has been randomly inserted in the middle of some other line. It has an incompatible time stamp for the log entry within which it has been dumped.
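
The check described above (a second time stamp and PID turning up inside another process's log entry) can be sketched in a few lines of Python. This is only a rough illustration, not anything the project runs: the function name `find_spliced_lines` is made up, the sample lines are copied from AgentB's excerpt, and it assumes the first time stamp and PID in the excerpt belong to the process handling the request.

```python
import re

# Prefix seen in the scheduler log excerpts: "YYYY-MM-DD HH:MM:SS.ffff [PID=n]".
STAMP_PID = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{4}) \[PID=(\d+)\]"
)


def find_spliced_lines(lines):
    """Report lines carrying a stamp/PID pair from a foreign process.

    Assumption: the first stamp/PID pair seen belongs to the process
    handling the request, and marks the start of the request.
    """
    owner_pid = None
    request_start = None
    hits = []
    for number, line in enumerate(lines, start=1):
        for stamp, pid in STAMP_PID.findall(line):
            if owner_pid is None:
                owner_pid, request_start = pid, stamp
            elif pid != owner_pid:
                # Fixed-width stamps compare correctly as plain strings.
                hits.append({
                    "line": number,
                    "stray_pid": pid,
                    "stray_stamp": stamp,
                    "predates_request": stamp < request_start,
                })
    return hits


# Three lines taken from the excerpt AgentB posted.
SAMPLE = [
    "2015-08-01 12:35:14.8597 [PID=18794] Request: [USER#xxxxx] [HOST#11995469] [IP xxx.xxx.xxx.168] client 7.4.42",
    "2015-08-01 12:35:14.8603 [PID=18794] [send] ATI: req 0.00 sec, 0.00 instances; est delay 0.00",
    "2015-08-01 12:35:14.8603 2015-08-01 12:35:14.4255 [PID=18787] SCHEDULER_REQUEST::parse(): unrecognized: 0",
]

for hit in find_spliced_lines(SAMPLE):
    print(hit)
```

Any line it reports with `predates_request` set is a candidate for having been written by a different scheduler process than the one serving the host.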

I'll give Bernd a heads-up about this.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7059434931
RAC: 1255003

Gary,

I've seen the "clennts" complaint in my logs, and have failed to check up on it. I did not notice the timing and mixed hosts issues you've outlined. But just now, one of my hosts has a log showing the same thing, I think, like this:

[pre]2015-08-13 03:48:11.0122 [PID=25123] Request: [USER#xxxxx] [HOST#10706295] [IP xxx.xxx.xxx.139] client 7.3.11
2015-08-13 03:48:11.0128 [PID=25123] [send] effective_ncpus 1 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2015-08-13 03:48:11.0128 [PID=25123] [send] effective_ngpus 2 max_jobs_on_host_gpu 999999
2015-08-13 03:48:11.0128 [PID=25123] [send] Not using matchmaker scheduling; Not using EDF sim
2015-08-13 03:48:11.0128 [PID=25123] [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2015-08-13 03:48:11.0128 [PID=25123] [send] CUDA: req 4340.02 sec, 0.00 instances; est delay 0.00
2015-08-13 03:48:11.0128 [PID=25123] [send] work_req_seconds: 0.00 secs
2015-08-13 03:48:11.0128 [PID=25123] [send] available disk 59.01 GB, work_buf_min 343872
2015-08-13 03:48:11.0128 [PID=25123] [send] active_frac 0.999954 on_frac 0.995809 DCF 1.074428
2015-08-13 03:48:11.0168 [PID=25123] 2015-08-13 03:48:10.8334 [PID=25136] SCHEDULER_REQUEST::parse(): unrecognized: 0[/pre]

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5849
Credit: 110009866534
RAC: 23982007

Yep, exactly the same 'symptoms'!

It's hard to imagine a whole swag of hosts out there with the same typo so I guess this has to be some sort of weird scheduler bug with so many examples of this showing up when you start looking. I haven't been looking at scheduler logs for any of my hosts lately. I hardly ever do these days because my control script is reporting all OK and, in the middle of winter, problems are few. So I picked a host at random just now and, sure enough, it has exactly the same type of 'complaint' about "clennts" :-).

I reported this thread to Bernd earlier on.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245261446
RAC: 12614

Quote:
2015-08-13 03:48:11.0168 [PID=25123] 2015-08-13 03:48:10.8334 [PID=25136] SCHEDULER_REQUEST::parse(): unrecognized: 0

This is not really a scheduler bug, but the effect of limited buffer sizes. There are two scheduler instances running, both writing to the same file (the scheduler log) at that time. The "unrecognized" warning comes from the parser of the scheduler instance that is not dealing with your host. However, at the time of that message the host ID is not yet known and has not been written to the logs, so the "scheduler log publisher" doesn't know where to put it and assigns it to your host.

This is merely an effect of our way of publishing the scheduler logs, not a bug in the scheduler itself.
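
As a rough illustration only, the toy Python simulation below shows how two writers with small buffers can splice one line into the middle of another when they append to the same file. It is a sketch under stated assumptions, not the scheduler's actual code: the names `chunked` and `shared_log`, the strict turn-taking between the two writers, the buffer sizes and both sample lines are all made up for the example.

```python
def chunked(text, size):
    """Split text into the pieces a writer with a `size`-character buffer flushes."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def shared_log(line_a, line_b, buffer_size):
    """Simulate two writers appending to one file, flushing alternately.

    Assumption: the writers take strict turns, one buffer flush each.
    Real processes interleave less predictably.
    """
    chunks_a = chunked(line_a, buffer_size)
    chunks_b = chunked(line_b, buffer_size)
    out = []
    for i in range(max(len(chunks_a), len(chunks_b))):
        out.extend(chunks_a[i:i + 1])  # empty slice once a writer is done
        out.extend(chunks_b[i:i + 1])
    return "".join(out)


# Hypothetical log lines from two scheduler instances.
LINE_A = "[PID=1] [debug] zone=-28800 url=http://example.invalid\n"
LINE_B = "[PID=2] SCHEDULER_REQUEST::parse(): unrecognized: 0\n"

# A buffer larger than either line keeps both lines intact.
print(shared_log(LINE_A, LINE_B, 4096))

# A small buffer lets the second writer's text land inside the first line.
print(shared_log(LINE_A, LINE_B, 24))
```

With the large buffer each line reaches the file in one piece; with the small one, the second writer's warning lands partway through the first writer's `[debug]` line, which resembles the pattern in the published logs.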

BM


AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Quote:


This is merely an effect of our way of publishing the scheduler logs, not a bug in the scheduler itself.

Thanks BM

I just noticed one of my hosts, gluon, picking up exactly the same "error":

[pre]
2015-08-15 16:15:23.2202 [PID=27664] [version] No CUDA devices found
2015-08-15 16:15:23.2202 [PID=27664] [version] Checking plan class 'BRP6-opencl-ati'
2015-08-15 16:2015-08-15 16:15:22.8218 [PID=27665] SCHEDULER_REQUEST::parse(): unrecognized: 0
2015-08-15 16:15:23.2203 [PID=27664] [version] Peak flops supplied: 5e+10
[/pre]

Am I reading this as a valid error, but being caused by some other host(s) at that time (not mine)?

Floyd1
Joined: 29 Jun 14
Posts: 14
Credit: 590463278
RAC: 0

Thanks for the explanation, Bernd, although it does not actually identify the cause of the error, or its likely impact.

However, that was drifting away from the original reason for the thread, namely that multiple concurrent WUs on an AMD 290X are consistently failing validation.

I have been able to replicate the original poster's observation and have tried a few things including running these ATI GPU tasks exclusively and still got the same results.

I don't recall seeing any errors reported during running, just either "Completed, marked as invalid" or "Validate error".

I believe BRP4 tasks run happily - this seems to be restricted to BRP6 tasks.

Is there a simple way to stop my machine from receiving BRP6 tasks but still get BRP4 ones until there is some progress on this?

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7059434931
RAC: 1255003

Quote:
Is there a simple way to stop my machine from receiving BRP6 tasks but still get BRP4 ones until there is some progress on this?


1. Identify which location (aka venue) you have this machine set to (default|home|work|school).
2. Open the Einstein preferences from your Einstein account page.
3. Select the location for your machine, and click on "edit".
4. In the "Run only the selected applications" section, deselect "Binary Radio Pulsar Search (Parkes PMPS XT)".
5. Enable the things you are willing to run.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Quote:
Is there a simple way to stop my machine from receiving BRP6 tasks but still get BRP4 ones until there is some progress on this?


Yes, just go to your Einstein@Home prefs and, under the heading "Run only the selected applications", opt out of "Binary Radio Pulsar Search (Parkes PMPS XT)".

But be aware that BRP4G, aka "Binary Radio Pulsar Search (Arecibo, GPU)", does not have work available all the time. You will most certainly run out of work for your GPU if you only allow BRP4G tasks on it. Right now the Server status page reports 0 (zero) tasks to send for BRP4G.
