Trouble with BRP5-opencl-xxx (xxx = ati only?)

astro-marwil
Joined: 28 May 05
Posts: 452
Credit: 173,402,218
RAC: 40,788
Topic 197509

Hello!
As one can see on the Server Status Page, BRP5 isn't the most reliable application. That's well known here.

But for me it is not only a higher risk of getting no validation: it sometimes blocks my PC, so that I get a black screen and have to switch the PC off completely for some seconds before restarting it. While the screen is black, I sometimes notice background processes such as the virus scan still running, as one can hear from the disk activity. As far as I have observed, this happens with tasks that run much longer than normal, by a factor of 3 to 4. Some of these stop crunching completely on CPU and GPU, as one can see in the Task Manager and MSI Afterburner, before they block the display. Updating to the newest GPU driver and the newest motherboard BIOS didn't help at all.

Today I had a task that stopped crunching without blocking the display (checked as described); it resumed crunching after I restarted the PC and was validated. But most of these tasks are lost, along with crunching time of 10 h or even more. Angry! About two weeks ago I observed a task that had been running for more than 9 h; the normal running time is 2.5 h. I let it run to see what would happen. When I came back some hours later, the PC was blocked. After restarting, this task had disappeared completely. I could find it neither in my BOINC Manager nor in any of the lists of my tasks. The BOINC Manager had started a fresh task.
How often does this happen? I have no concrete numbers, but as I have been observing this for several months, I think about 5% of the tasks on average, sometimes 2 or even 3 a day. But the loss in crunching time is much higher. This application requires unusually close attention.
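A minimal sketch of how such overlong tasks could be flagged before they block the machine, assuming the elapsed times have already been read off by hand or by some other tool. The task names, the helper `find_stalled` and the threshold are hypothetical; only the 2.5 h normal run time and the factor of 3 to 4 come from the post.

```python
NORMAL_RUNTIME_H = 2.5   # typical BRP5 run time quoted in the post
STALL_FACTOR = 3.0       # lower end of the observed "factor of 3 to 4"


def find_stalled(elapsed_hours, normal=NORMAL_RUNTIME_H, factor=STALL_FACTOR):
    """Return the names of tasks whose elapsed time exceeds factor * normal."""
    limit = normal * factor
    return [name for name, hours in elapsed_hours.items() if hours > limit]


# Hypothetical task names and elapsed times, for illustration only.
tasks = {"task_a": 2.4, "task_b": 9.0, "task_c": 7.4}
print(find_stalled(tasks))
```

With these made-up numbers the limit is 7.5 h, so only the 9 h task is reported.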

My question here is: have others observed this too, or is this behavior unique to me?

Kind regards and happy crunching
Martin

Jim1348
Joined: 19 Jan 06
Posts: 445
Credit: 206,617,104
RAC: 1,047

I don't recall ever having had a problem with BRP5 (or BRP4G either) on my GTX 650 Ti running under either Win7 64-bit, or currently XP.
http://einsteinathome.org/host/10009676/tasks&offset=0&show_names=1&state=3&appid=0

It sounds like your antivirus may be blocking something, or some other software incompatibility. Be sure to add BOINC (both program and data folders) to your AV exclusion list for a start.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 769
Credit: 301,461,386
RAC: 378,479

Which driver are you using? Your GPU is reported as "CAL Bonaire", which seems quite strange (Bonaire is the chip and CAL the old "close to metal" framework to access AMD GPUs, which this GPU shouldn't support any more).

MrS

Scanning for our furry friends since Jan 2002

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,491
Credit: 63,225,191,676
RAC: 54,369,067

Quote:
As one can see on the Server Status Page, BRP5 isn't the most reliable application. That's well known here.


I'm afraid I have to disagree with you. I have a large number of hosts crunching BRP5 and I just don't see failed tasks. Whenever I (rarely) do, it has always been a hardware issue and not the app.

Quote:
But for me it is not only a higher risk of getting no validation, but it blocks sometimes my PC, so I get a black screen and have to switch the PC completely off for some seconds before restarting it.
....

What you are describing could easily be due to excess heat. When was the last time you opened up the case and checked for blockages of the heat sinks and fans for both the CPU and GPU? Do you monitor the fan speeds for both? If all of these seem OK, try running with the case open and a room fan blowing cool air at the GPU. If the problems abate, you have your answer.

Even if the heat sinks seem perfectly clean, you can still have heat problems if the thermal interface material (thermal grease) has dried out. I've found that replacing the grease sometimes cures these sorts of problems, particularly if the CPU/GPU has done a couple of years of service.

Another possible cause of problems could be faulty power. What is the rating of your PSU and how old is it? Over the years I've seen quite a few PSUs develop problems with swollen capacitors as they age. This tends to cause problems similar to what you describe.

Cheers,
Gary.

astro-marwil
Joined: 28 May 05
Posts: 452
Credit: 173,402,218
RAC: 40,788

Hello!
Thank you for your answers. I will reply tomorrow, as I'm very busy today.

Kind regards and happy crunching
Martin

astro-marwil
Joined: 28 May 05
Posts: 452
Credit: 173,402,218
RAC: 40,788

Hello!
Thank you for your answers. Here is my response.

Hello Jim1348!
It's interesting that NVIDIA GPUs don't give trouble.

Quote:
It sounds like your antivirus may be blocking something.


I don't believe that, as I have been crunching for nearly 9 years now with the same AV program without problems of this kind, regardless of project and application. But let's see, I'll try it.

Hello ExtraTerrestrial Apes!

Quote:
Which driver are you using? Your GPU is reported as "CAL Bonaire".

It's an HD 7790 with driver 13.251.0.0 (from Device Manager), dated 06.12.2013, or 13-12_win7_win8_64_dd_ccc_whql.exe (from AMD), downloaded 23.03.2014. It's a current and highly efficient card. But the current driver didn't bring any improvement for me.

Hello Gary!
As you have been running a big farm of PCs successfully for such a long time, your experience is of special interest to me.
My PC here is just half a year old. Soon after installation it gave sporadic trouble, and I returned it to the seller for warranty repair. They checked the RAM with memtest86+ for more than 24 h without any failure and updated the BIOS. After that it seemed to work perfectly for some weeks, but the failure rate has been slowly increasing over the last several weeks. I check the temperatures of CPU and GPU almost periodically, and they are about 40°C below maximum. Within the last week I crunched FGRP3 on the GPU, which most of the time gives a much lower GPU load, but these tasks gave similar trouble too. Sometimes I observe a sudden black screen for some seconds, with the notice that the graphics driver has failed and has been restarted. Sometimes the application running on the GPU has failed after that, but not necessarily. No other application running on the PC suffers from such an event. As these failures happen only to applications running on the GPU, I now assume there might be sporadic interruptions in the 12 V supply to the graphics card, or the graphics card itself fails sporadically. Or do you have any other idea?
In May I may crunch MilkyWay as a test, as this application makes heavy use of the GPU. If similar failures happen there too, it is somehow the hardware.

Quote:
Quote:
As one can see on the Server Status Page, BRP5 isn't the most reliable application. That's well known here.
I'm afraid I have to disagree with you. I have a large number of hosts crunching BRP5 and I just don't see failed tasks. ...


From the Server Status Page, last updated 4 Apr 2014 12:05:02 UTC, I took the following data:

Work               |   FGRP3 |  S6CasA |    BRP4 |    BRP5 |   BRP4G |   in DB
==============================================================================
Tasks valid        | 402,161 |  67,252 |  83,619 |  54,859 | 110,502 | 718,393
Tasks invalid      |  25,847 |  15,374 |   9,924 |  15,370 |  15,437 |  81,952
Tasks inconclusive |     618 |     157 |     564 |     967 |     921 |   3,227
Tasks failed       |  24,770 |  15,289 |   9,071 |  14,102 |  13,507 |  76,739
Tasks too late     |   1,324 |     158 |      98 |      98 |     238 |   1,916

And these figures remain similar over a prolonged time. If I relate these figures to all crunched tasks of each application, i.e. to (valid + invalid + inconclusive + failed), I get the following relative figures:

Work               |   FGRP3 |  S6CasA |    BRP4 |    BRP5 |   BRP4G |   in DB
==============================================================================
Tasks valid        |  88.70% |  68.57% |  81.04% |  64.31% |  78.72% |  81.61%
Tasks invalid      |   5.70% |  15.68% |   9.62% |  18.02% |  11.00% |   9.31%
Tasks inconclusive |   0.14% |   0.16% |   0.55% |   1.13% |   0.66% |   0.37%
Tasks failed       |   5.46% |  15.59% |   8.79% |  16.53% |   9.62% |   8.72%
Tasks too late     |   0.29% |   0.16% |   0.09% |   0.11% |   0.17% |   0.22%

From this table I learn that FGRP3 is the most reliable application, whereas with BRP5 there is a chance of about 1/3 that a task fails or is otherwise not validated. Of course this is the average over all OSes and hardware. The individual situation may look quite different.
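For anyone who wants to recheck the relative figures, here is a minimal sketch that recomputes them from the raw counts quoted from the Server Status Page. It assumes the same denominator as described in the post (valid + invalid + inconclusive + failed, leaving out "too late"); the helper name `shares` is just an illustration.

```python
# Task counts copied from the Server Status Page snapshot of 4 Apr 2014.
counts = {
    "FGRP3":  {"valid": 402161, "invalid": 25847, "inconclusive": 618, "failed": 24770},
    "S6CasA": {"valid": 67252,  "invalid": 15374, "inconclusive": 157, "failed": 15289},
    "BRP4":   {"valid": 83619,  "invalid": 9924,  "inconclusive": 564, "failed": 9071},
    "BRP5":   {"valid": 54859,  "invalid": 15370, "inconclusive": 967, "failed": 14102},
    "BRP4G":  {"valid": 110502, "invalid": 15437, "inconclusive": 921, "failed": 13507},
}


def shares(app_counts):
    """Return each state's share of all crunched tasks, in percent."""
    total = sum(app_counts.values())
    return {state: round(100.0 * n / total, 2) for state, n in app_counts.items()}


for app, app_counts in counts.items():
    s = shares(app_counts)
    not_valid = round(100.0 - s["valid"], 2)
    print(f"{app:7s} valid {s['valid']:6.2f}%   not valid {not_valid:6.2f}%")
```

For BRP5 this gives roughly 64% valid, which is where the "about 1/3" chance of a task not validating comes from.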

I will be pleased to get your responses and remain
with kind regards and all time happy crunching
Martin

Holmis
Joined: 4 Jan 05
Posts: 1,118
Credit: 1,005,206,623
RAC: 990,083

Quote:
Quote:
Quote:
As one can see on the Server Status Page, BRP5 isn't the most reliable application. That's well known here.
I'm afraid I have to disagree with you. I have a large number of hosts crunching BRP5 and I just don't see failed tasks. ...

From the Server Status Page, last updated 4 Apr 2014 12:05:02 UTC, I took the following data:

Work               |   FGRP3 |  S6CasA |    BRP4 |    BRP5 |   BRP4G |   in DB
==============================================================================
Tasks valid        | 402,161 |  67,252 |  83,619 |  54,859 | 110,502 | 718,393
Tasks invalid      |  25,847 |  15,374 |   9,924 |  15,370 |  15,437 |  81,952
Tasks inconclusive |     618 |     157 |     564 |     967 |     921 |   3,227
Tasks failed       |  24,770 |  15,289 |   9,071 |  14,102 |  13,507 |  76,739
Tasks too late     |   1,324 |     158 |      98 |      98 |     238 |   1,916

And these figures remain similar over a prolonged time. If I relate these figures to all crunched tasks of each application, i.e. to (valid + invalid + inconclusive + failed), I get the following relative figures:

Work               |   FGRP3 |  S6CasA |    BRP4 |    BRP5 |   BRP4G |   in DB
==============================================================================
Tasks valid        |  88.70% |  68.57% |  81.04% |  64.31% |  78.72% |  81.61%
Tasks invalid      |   5.70% |  15.68% |   9.62% |  18.02% |  11.00% |   9.31%
Tasks inconclusive |   0.14% |   0.16% |   0.55% |   1.13% |   0.66% |   0.37%
Tasks failed       |   5.46% |  15.59% |   8.79% |  16.53% |   9.62% |   8.72%
Tasks too late     |   0.29% |   0.16% |   0.09% |   0.11% |   0.17% |   0.22%

From this table I learn that FGRP3 is the most reliable application, whereas with BRP5 there is a chance of about 1/3 that a task fails or is otherwise not validated. Of course this is the average over all OSes and hardware. The individual situation may look quite different.

I will be pleased to get your responses and remain
with kind regards and all time happy crunching
Martin


Don't forget to check the detailed status page for the BRP5 run found here:
http://einstein6.aei.uni-hannover.de/EinsteinAtHome/download/BRP5-progress/
You can also get to this page from the server status page if you scroll down to the BRP5 search progress box and click on the details link.

From that page I draw the conclusion that the total error rate (invalid, client error and validate error) is about 5-6% of the number of valid tasks. I might be wrong here as it's not every day I read these types of diagrams.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,491
Credit: 63,225,191,676
RAC: 54,369,067

Quote:
...
Hello Gary!
As you have been running a big farm of PCs successfully for such a long time, your experience is of special interest to me.


I have well over 30 hosts equipped with GPUs that are crunching BRP5. There are more than 80 hosts in the 'farm'. Obviously, I don't have time to monitor every single one of these - in fact, I only monitor anything directly if there is a failure or if I want to tweak particular machines for whatever reason. I have a fairly sophisticated script which does a number of things to do with work cache control and sharing of common data files between the hosts. The script contacts every host in the fleet several times per day and the very first thing it does is check that the host is running and that BOINC is running. It produces an extensive log which allows me to see problems quite quickly without physically going to each machine. The log contains an extensive record of how many and what type of task each machine downloads, amongst other things.
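The actual script isn't shown in this thread, so the following is only a rough sketch of the first step described (checking that a host is up and that something is listening where BOINC should be). It assumes the BOINC client's GUI RPC port is the default 31416 and that remote GUI RPC access has been enabled on each host; the function names and host names are hypothetical.

```python
import socket

BOINC_RPC_PORT = 31416  # default BOINC GUI RPC port (assumed unchanged on the hosts)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_fleet(hosts, port=BOINC_RPC_PORT, timeout=3.0):
    """Map each host name to 'ok' or 'unreachable' for a simple log line."""
    return {
        host: ("ok" if port_open(host, port, timeout) else "unreachable")
        for host in hosts
    }
```

Calling `check_fleet(["host01", "host02"])` with your own host names would give a dictionary to write to a log. A successful connection only shows that something is listening on that port, not that tasks are progressing, so it is a first filter rather than a full health check.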

Sure, that doesn't pick up if returned results are not validating, but I do regularly peruse the complete list of hosts on the website and check the RACs. It's surprisingly easy to spot hosts returning rubbish by noting a change in RAC. So on a regular basis, I will pick up any suspicious changes and then will check the full tasks list on the website. If there still seems to be an issue, I will check the physical host. I've done this for a long time over many hosts and I hardly ever see an invalid result that can't be directly attributed to a hardware issue. To me, problems are not due to the app but rather to hardware that has some sort of issue. Many of my hosts have been running 24/7 for over 4 years in a non-airconditioned environment (30-35C ambient) so there will be hardware failures from time to time. I'm continually surprised at how few failures there really are.
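The RAC check described above is done by eye on the website, but the idea can be sketched in a few lines, assuming two snapshots of per-host RAC have been noted down some days apart. The host names, the numbers and the 10% threshold are made up for illustration.

```python
def rac_drops(previous, current, threshold=0.10):
    """Return hosts whose RAC fell by more than `threshold` (as a fraction)."""
    flagged = {}
    for host, old in previous.items():
        new = current.get(host)
        if new is None or old <= 0:
            continue  # host missing from the new snapshot, or no baseline
        drop = (old - new) / old
        if drop > threshold:
            flagged[host] = round(drop, 3)
    return flagged


# Hypothetical snapshots, for illustration only.
previous = {"host01": 40000.0, "host02": 55000.0, "host03": 12000.0}
current = {"host01": 39500.0, "host02": 44000.0, "host03": 12100.0}
print(rac_drops(previous, current))
```

Here only host02 is flagged, since its RAC fell by 20%. RAC also drifts for harmless reasons (downtime, a change of application mix), so a flagged host is a prompt to look at its task list, not proof of a fault.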

Quote:
....
From this table I learn that FGRP3 is the most reliable application, whereas with BRP5 there is a chance of about 1/3 that a task fails or is otherwise not validated. Of course this is the average over all OSes and hardware. The individual situation may look quite different.


You can't infer the 'quality' of the app from that table. You may be able to infer that the BRP5 app puts more stress on a system, perhaps. If a bunch of hosts in my farm over quite a period of time and under adverse conditions (hot environment) are not showing invalid results, how can you say there is a problem with the app??

I think that perhaps some people may have added/upgraded to powerful GPUs in systems that are not quite 'up to the task', either power wise or cooling wise, or both. I think that this may be skewing the table and making the invalid count higher than it otherwise would be. The simple fact remains. A real issue with the app should show up for everybody. And it doesn't!

Cheers,
Gary.
