Some invalids with OpenCL and none with CUDA

TJ
TJ
Joined: 11 Feb 05
Posts: 178
Credit: 21041858
RAC: 0
Topic 196775

I saw that with my AMD 5870 about 1 in 40 WUs is marked as invalid, while a PC with an nVidia card running CUDA 24/7 for Einstein@Home has zero invalids.

Another PC, with a GTX 285, normally does GPUGRID, but as there are no tasks there at the moment it is running Einstein as well, also with zero invalids.

Since I also see at Milkyway@home that the ATI card with OpenCL produces more invalids while nVidia produces none, it must be something in the OpenCL software.

I am wondering whether other crunchers are seeing these invalids too?

Greetings from
TJ

Nobody316
Nobody316
Joined: 14 Jan 13
Posts: 141
Credit: 2008126
RAC: 0

Some invalids with OpenCL and none with CUDA

I am running nVidia and have some invalids on the GPU, but I'm only running E@H. I had some issues which I think are now taken care of, but only time will tell.

PC setup: MSI 970A-G46, AMD FX-8350 8-core OC'd to 4.45 GHz, 16 GB RAM PC3-10700, GeForce GTX 650 Ti, Windows 7 x64, Einstein@Home

Alex
Alex
Joined: 1 Mar 05
Posts: 451
Credit: 511040227
RAC: 137830

Yes, in the last two weeks

Yes, in the last two weeks the invalids have increased.
From November until mid-January I had nearly none; since then, ~12.
They come mostly from AMD WUs, but I've also seen SSE WUs fail to validate, and one was an nVidia WU.
Some of them were due to bad data; those WUs have been cancelled.

Edit: sorry, I forgot to say that the invalids come from different PCs. None is OC'd.

DGG
DGG
Joined: 6 Feb 09
Posts: 10
Credit: 65736696
RAC: 0

I too have a large number of

I too have a large number of Einstein@Home-only invalids. I haven't had this trouble in the past. I'm using 2 PCs, with a single AMD 6950 in one and an AMD 7950 in the other. (I also run an older PC with an even older nVidia card which isn't affected by this problem.)

The pattern with the two AMDs is always the same: the work unit is also assigned to two other hosts, both with CUDA, of which one shows as completed OK and the other as waiting for conclusion. Every workunit that fails has three users matching this description, and they are not the same users each time. I also noticed that when a WU that is about to become invalid is running, the GPU usage is only about 0-4%, yet it takes the full amount of time just as if it had completed normally.

I am getting valid results on most of my WUs, but this high level of invalids is a bit alarming. It seems to be happening to a lot of AMD users and even a few CPU and nVidia users.

Can an admin person please check on why this is happening? It's wasting a lot of processing time! Until this is fixed, I think I'll switch back to the slower CPU-only WUs, as I don't like my GPU processing time going down the drain like this.

Thanks

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5876
Credit: 118564104296
RAC: 25517672

RE: I too have a large

Quote:
I too have a large number of Einstein@Home-only invalids.


A few days ago (on your 7950 host) you had a lot of invalid results, i.e. results that were deemed not to agree sufficiently closely with the other two results. Most of your recent results are a bit different: they are failing with a different message, 'Validate error'. This means that the validator is not even doing a comparison with another result, because the information in your result is failing the basic sanity checks that the validator performs. This is more fully explained in the 'Validate errors' sticky thread.
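In other words, the validator works in two stages. Very roughly, the decision looks like the sketch below (an illustration only, not the actual Einstein@Home validator code; the two check functions are made-up stand-ins for the project's real, much more detailed checks):

    # Simplified sketch of the two validation stages (illustrative only).
    def passes_sanity_checks(result):
        # Stand-in check: e.g. output file present, values within sensible bounds.
        return result.get("sane", False)

    def results_agree(a, b, tolerance=1e-5):
        # Stand-in check: do the two results match within a tolerance?
        return abs(a["value"] - b["value"]) <= tolerance

    def check_result(result, quorum_partners):
        # Stage 1: sanity checks on the result itself.  Failing here
        # gives a 'Validate error' - no comparison is even attempted.
        if not passes_sanity_checks(result):
            return "Validate error"
        # Stage 2: compare against another sane result in the quorum.
        for other in quorum_partners:
            if passes_sanity_checks(other) and results_agree(result, other):
                return "Valid"
        return "Invalid"  # sane on its own, but agrees with nobody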
The most likely cause of these problems is a 'hardware' issue with your machine. Unfortunately, this is something that only you can deal with.

Quote:
I haven't had this trouble in the past. I'm using 2 PCs, with a single AMD 6950 in one and an AMD 7950 in the other. (I also run an older PC with an even older nVidia card which isn't affected by this problem.)


I looked at your 6950 host as well and there are no invalid results currently visible there. It seems like you are not currently crunching with that host.

Quote:
The pattern with the two AMDs is always the same: the work unit is also assigned to two other hosts, both with CUDA, of which one shows as completed OK and the other as waiting for conclusion. Every workunit that fails has three users matching this description, and they are not the same users each time.


I'm not quite sure what you mean by this, but I imagine you are just seeing the normal sequence of events that always occurs when validation cannot be achieved with the initial two results that are returned. So let me explain exactly what happens.

A normal quorum consists of two copies only and if the two results that come back agree, that's the end of the matter. If they don't agree, or if one of the results is deemed to be rubbish anyway, then and only then will a third copy be sent out. Over time, this process of sending out an extra copy will be repeated until there are two sane results that do agree, or a limit of 20 copies is reached. So, if your tasks are failing, you are bound to see at least 3 copies in each quorum you look at. If you pick one of your tasks that didn't fail, you are likely to see only two copies in the quorum - unless somebody else has returned a failed task.
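If it helps to see that logic spelled out, here is a minimal sketch of the general BOINC replication scheme (not the actual server code; the names and the agreement test are made up for illustration):

    # Rough sketch of BOINC-style quorum handling (illustrative only).
    MIN_QUORUM = 2    # copies sent out initially
    MAX_COPIES = 20   # give up once this many copies have been created

    def find_agreeing_pair(results):
        # Look for two sane results that agree with each other.
        sane = [r for r in results if r["sane"]]
        for i, a in enumerate(sane):
            for b in sane[i + 1:]:
                if abs(a["value"] - b["value"]) <= 1e-5:
                    return (a, b)
        return None

    def process_workunit(returned_results, copies_created):
        if find_agreeing_pair(returned_results):
            return "validated"        # quorum reached, we're done
        if copies_created >= MAX_COPIES:
            return "workunit error"   # too many results, give up
        return "send another copy"    # and wait for it to come back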

Quote:
I also noticed that when a WU that is about to become invalid is running, the GPU usage is only about 0-4%, yet it takes the full amount of time just as if it had completed normally.


I'm not sure what you're talking about here either. Perhaps you are referring to tasks like this one, which finished as a compute error after 23 secs of CPU time and a run time of 61 ksecs. The actual error is "Maximum elapsed time exceeded" and this seems to have been due to the GPU refusing to compute, for whatever reason.
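For background, that particular abort comes from the BOINC client itself, not from the project: every task carries an upper bound on its floating point operations, and the client kills the task once the elapsed time implies that bound must have been exceeded. Roughly (a simplified sketch of the idea, not the client's actual code):

    # Sketch of the client-side elapsed-time limit (illustrative only).
    def max_elapsed_time(rsc_fpops_bound, flops_estimate):
        # rsc_fpops_bound: project-set ceiling on the task's FP operations.
        # flops_estimate: how fast the client thinks this host/app is.
        return rsc_fpops_bound / flops_estimate

    def should_abort(elapsed_seconds, rsc_fpops_bound, flops_estimate):
        # A GPU that is barely computing racks up elapsed time without
        # making progress, so it eventually trips this limit.
        return elapsed_seconds > max_elapsed_time(rsc_fpops_bound, flops_estimate)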

Quote:
I am getting valid results on most of my WUs, but this high level of invalids is a bit alarming. It seems to be happening to a lot of AMD users and even a few CPU and nVidia users.


Actually, many of your results seem to be ending up as invalid so you certainly should be concerned. You seem to be implying that there is some problem with the tasks themselves. Unfortunately, the problem is most likely with your hardware and only you will be able to resolve that.

You could start by investigating temperature while your GPU is crunching. If it seems high, can it be lowered by removing the side of your case and perhaps using a desk fan to improve cooling? Do the errors stop if you improve cooling? Does everything look clean and fluff-free inside your case?

If cooling is OK, check (by feeling) how hot your PSU is. Is the cooling fan running freely and at full speed? Does it really have enough 12V capacity to run both your system and the GPU? Your problems could easily be caused by inadequate power. How old is your PSU? Older PSUs sometimes develop swollen capacitors which leads to unstable power. What is its current rating for 12V?

Quote:
Can an admin person please check on why this is happening?


How do you think they might be able to do that?

Quote:
It's wasting a lot of processing time!


It certainly is!

Quote:
Until this is fixed, I think I'll switch back to the slower CPU-only WUs, as I don't like my GPU processing time going down the drain like this.


It will only get fixed if you do some testing to work out what is causing this. Most likely (if everything was OK previously and you haven't made software changes) it will be something to do with temperature and/or power. If you check those things out and the problem remains, please report back with details about what you have tried and someone will give you more suggestions about things to try.

Cheers,
Gary.

TJ
TJ
Joined: 11 Feb 05
Posts: 178
Credit: 21041858
RAC: 0

I still didn't get any

I still didn't get any invalids from my nVidias, but a few from ATI.
But I noticed something else, a few weeks ago and again today: a GPU WU for Einstein takes about 37 minutes on my system, yet I saw one that has run for 8 hours, and had 15 hours to go!
I suspended this WU and another one started running. When that was finished I resumed the "long" WU again. It restarted, dropped back a few percent in progress, and the elapsed and remaining times were "normal" again. I have seen this twice now, but of course I am not watching the systems all the time.
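(Side note in case it is useful: the same suspend and resume can also be done from the command line with boinccmd, assuming a standard BOINC install. The exact task name can be read from the state output:

    boinccmd --get_state
    boinccmd --task <project_url> <task_name> suspend
    boinccmd --task <project_url> <task_name> resume

Here <project_url> and <task_name> are placeholders to fill in yourself.)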

Greetings from
TJ

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5876
Credit: 118564104296
RAC: 25517672

RE: ... I saw one that has

Quote:
... I saw one that has run for 8 hours, and had 15 hours to go!


I'm only guessing, but I assume GPUs have internal protection mechanisms that can throttle them right down if certain limits are exceeded. I saw something similar on one occasion about a month ago and stopping, waiting briefly, then restarting BOINC seemed to fix it. I was prepared to reboot the machine but I don't remember having to do that.

The key to trouble-free running seems to be to make sure there is extremely good ventilation. I deliberately use horizontal desktop cases with the inverted U-shaped top cover removed. The GPU card stands vertically and the fan has direct access to cooler, 'outside the case' air.

If you don't notice these 'slow running' tasks, they will most likely end up as 'maximum elapsed time exceeded' errors. I linked to one of these in my previous reply to DGG.

Cheers,
Gary.

ggesmundo
ggesmundo
Joined: 3 Jun 12
Posts: 31
Credit: 18699116
RAC: 0

RE: I still didn't get any

Quote:
I still didn't get any invalids from my nVidias, but a few from ATI.
But I noticed something else, a few weeks ago and again today: a GPU WU for Einstein takes about 37 minutes on my system, yet I saw one that has run for 8 hours, and had 15 hours to go!
I suspended this WU and another one started running. When that was finished I resumed the "long" WU again. It restarted, dropped back a few percent in progress, and the elapsed and remaining times were "normal" again. I have seen this twice now, but of course I am not watching the systems all the time.

I have an AMD 7950 and experienced the same behaviour. I used CCC to turn down the GPU clock speed by 10 MHz and the problem stopped.

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5876
Credit: 118564104296
RAC: 25517672

RE: I have an AMD 7950 and

Quote:
I have an AMD 7950 and experienced the same behaviour. I used CCC to turn down the GPU clock speed by 10 MHz and the problem stopped.


Thanks for contributing that information. Was your card running at factory default settings or had you overclocked it a bit as well?

I had a look at your current crunch times - around 1900 to 2000 secs. Are you running 3x or 4x? I'm guessing 3x as that was what you were using when you reported in the 'Benchmarks' thread but I wondered if you had upgraded to the new driver that gives a 10-20% performance boost and maybe you are now running 4x? :-).

Do you know what your power draw is? How close do you think you lie to the 'Best efficiency line' as shown in the graph posted by Robert here?

Cheers,
Gary.

ggesmundo
ggesmundo
Joined: 3 Jun 12
Posts: 31
Credit: 18699116
RAC: 0

RE: RE: I have an AMD

Quote:
Quote:
I have an AMD 7950 and experienced the same behaviour. I used CCC to turn down the GPU clock speed by 10 MHz and the problem stopped.

Thanks for contributing that information. Was your card running at factory default settings or had you overclocked it a bit as well?

I had a look at your current crunch times - around 1900 to 2000 secs. Are you running 3x or 4x? I'm guessing 3x as that was what you were using when you reported in the 'Benchmarks' thread but I wondered if you had upgraded to the new driver that gives a 10-20% performance boost and maybe you are now running 4x? :-).

Do you know what your power draw is? How close do you think you lie to the 'Best efficiency line' as shown in the graph posted by Robert here?

I am currently running 4x, with completion times of 33 to 34 minutes depending on the CPU tasks running at the time. I run CCC 12.10 right now; I tried the later beta but saw a decrease in performance and reverted to 12.10. I am not sure about the power draw; I have never measured it.

I had overclocked the GPU from 880 MHz to 1000 MHz based on info at Tom's Hardware. That worked fine with Seti and Milkyway, but I had 1 to 2 BRP4 tasks a day that would not validate. Turning the clock back to 990 MHz resolved it.

I run with 2 cores free at all times, based on looking at the number of threads created by both the CPU and GPU tasks. I also have my BRP4 tasks set to .5 CPU usage (see the config sketch at the end of this post), which frees 2 more cores during BRP4 processing. This may be a bit of overkill, but the run times and CPU times are very close and it keeps both the CPUs and GPU running at 70C or less.

Edit: If my memory is correct, I was actually running at 1010 MHz when I saw the problem mentioned by TJ in the earlier post. Cutting back to 1000 MHz stopped that problem, but still left the 1 to 2 invalids a day. Cutting back another 10 MHz resolved the invalids.
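For anyone wanting to copy the 4x with 0.5 CPU setup: newer BOINC clients (7.0.40 and later) read an app_config.xml from the project directory. Something along these lines should do it; this is only a sketch, and you should check the app name against what your own client reports (I believe the BRP4 app is called einsteinbinary_BRP4):

    <app_config>
       <app>
          <name>einsteinbinary_BRP4</name>
          <gpu_versions>
             <gpu_usage>0.25</gpu_usage>  <!-- 0.25 GPU per task = 4 tasks per GPU -->
             <cpu_usage>0.5</cpu_usage>   <!-- reserve half a CPU core per task -->
          </gpu_versions>
       </app>
    </app_config>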

ggesmundo
ggesmundo
Joined: 3 Jun 12
Posts: 31
Credit: 18699116
RAC: 0

RE: I wondered if you had

Quote:
I wondered if you had upgraded to the new driver that gives a 10-20% performance boost

I upgraded to CCC 13.1 this evening and ran 12 tasks, 3 sets at 4x, to compare the performance against the 12.10 I had been running. Unlike other reports I have read, my runtimes increased, from a 33.75-minute average to 36.5 minutes. Not a large increase, but over time it becomes significant. I have reverted to 12.10 and am a happy cruncher. My motherboard only has PCIe 2.0, which may account for some of the difference compared with other users' results.

Gary
