Hello!
As one can see on the Server Status Page, BRP5 isn't the most reliable application. That's well known here.
For me, though, it is not only a higher risk of getting no validation: it sometimes locks up my PC, so that I get a black screen and have to switch the PC off completely for a few seconds before restarting it. While the screen is black, I sometimes notice background processes such as the virus scan still working, as one can hear from the disk activity. As far as I have observed, this happens with tasks that run much longer than normal, by a factor of 3 to 4. Some of them stop crunching on the CPU and GPU completely, as one can see in the Task Manager and MSI Afterburner, before they block the display. Updating to the newest GPU driver and the newest motherboard BIOS didn't help at all.
Today I had a task that stopped crunching without blocking the display (checked as described above); it resumed crunching after I restarted the PC and was validated. But most of these tasks are lost, along with crunching time of 10 h or even more. Angry!
About two weeks ago, I observed a task that had been running for more than 9 h; the normal running time is 2.5 h. I let it run to see what would happen. When I came back some hours later, the PC was locked up. After restarting it, the task had disappeared completely. I could find it neither in my BOINC Manager nor in any of the lists of my tasks. The BOINC Manager had started a fresh task.
How often does this happen? I have no concrete numbers available, but as I have been observing this for several months, I think it affects about 5% of the tasks on average, sometimes 2 or even 3 a day. But the loss in crunching time is much higher. This application requires unusually close attention.
My question here is: do others observe this too, or is this behavior unique to me?
Kind regards and happy crunching
Martin
Copyright © 2024 Einstein@Home. All rights reserved.
Trouble with BRP5-opencL-xxx ( xxx=ati only?)
I don't recall ever having had a problem with BRP5 (or BRP4G either) on my GTX 650 Ti running under either Win7 64-bit, or currently XP.
http://einsteinathome.org/host/10009676/tasks&offset=0&show_names=1&state=3&appid=0
It sounds like your antivirus may be blocking something, or some other software incompatibility. Be sure to add BOINC (both program and data folders) to your AV exclusion list for a start.
Which driver are you using? Your GPU is reported as "CAL Bonaire", which seems quite strange (Bonaire is the chip and CAL the old "close to metal" framework to access AMD GPUs, which this GPU shouldn't support any more).
MrS
Scanning for our furry friends since Jan 2002
I'm afraid I have to disagree with you. I have a large number of hosts crunching BRP5 and I just don't see failed tasks. Whenever I (rarely) do, it has always been a hardware issue and not the app.
What you are describing could easily be due to excess heat. When was the last time you opened up the case and checked for blockages of the heat sinks and fans for both the CPU and GPU? Do you monitor the fan speeds for both? If all of these seem OK, try running with the case open and a room fan blowing cool air at the GPU. If the problems abate, you have your answer.
Even if the heat sinks seem perfectly clean, you can still have heat problems if the thermal interface material (thermal grease) has dried out. I've found that replacing the grease sometimes cures these sorts of problems, particularly if the CPU/GPU has done a couple of years of service.
Another possible cause of problems could be faulty power. What is the rating of your PSU and how old is it? Over the years I've seen quite a few PSUs develop problems with swollen capacitors as they age. This tends to cause problems similar to what you describe.
Cheers,
Gary.
Hello!
Thank you for your answers. I will reply tomorrow, as I'm very busy today.
Kind regards and happy crunching
Martin
Hello!
Thank you for your answers. Here is my response.
Hello Jim1318!
It's interesting that NVIDIA GPUs don't give trouble.
I don't believe the antivirus is the cause, as I have been crunching for nearly 9 years now with the same AV program without problems of this kind, regardless of project and application. But let's see, I'll try it.
Hello ExtraTerrestrial Apes!
It's an HD 7790 with driver 13.251.0.0 (from Device Manager), dated 06.12.2013, or 13-12_win7_win8_64_dd_ccc_whql.exe (from AMD), downloaded 23.03.2014. It's a current and highly efficient card. But the current driver didn't bring any improvement for me.
Hello Gary!
As you have been running a big farm of PCs successfully for such a long time, your experience is of special interest to me.
My PC here is just half a year old. Soon after installation it gave sporadic trouble, and I returned it to the seller for warranty repair. They checked the RAM with memtest86+ for more than 24 h without any failure and updated the BIOS. After that it seemed to work perfectly for some weeks, but the failure rate has been slowly increasing over the last several weeks. The temperatures of the CPU and GPU are checked regularly and are about 40°C below the maximum. During the last week I crunched FGRP3 on the GPU, which most of the time gives a much lower GPU load, but these tasks gave similar trouble too. Sometimes I observe a sudden black screen for a few seconds, followed by the notice that the graphics driver has failed and been restarted. Sometimes the application running on the GPU has failed after that, but not necessarily. No other application running on the PC suffers from such events. As these failures happen only to applications running on the GPU, I now assume there might be sporadic interruptions in the 12 V supply for the graphics card, or the graphics card itself fails sporadically. Or do you have any other idea?
In May I may crunch MilkyWay as a test, as this application makes heavy use of the GPU. If similar failures happen there too, it is somehow the hardware.
From the Server Status Page, last updated 4 Apr 2014 12:05:02 UTC, I took the following data:
                   |   FGRP3 |  S6CasA |    BRP4 |    BRP5 |   BRP4G |   in DB
==============================================================================
Tasks valid        | 402,161 |  67,252 |  83,619 |  54,859 | 110,502 | 718,393
Tasks invalid      |  25,847 |  15,374 |   9,924 |  15,370 |  15,437 |  81,952
Tasks inconclusive |     618 |     157 |     564 |     967 |     921 |   3,227
Tasks failed       |  24,770 |  15,289 |   9,071 |  14,102 |  13,507 |  76,739
Tasks too late     |   1,324 |     158 |      98 |      98 |     238 |   1,916
And these figures have remained similar for a prolonged time. If I relate these figures to all crunched tasks of each application, that is, to the sum (valid + invalid + inconclusive + failed), I get the following relative figures:
                   |   FGRP3 |  S6CasA |    BRP4 |    BRP5 |   BRP4G |   in DB
==============================================================================
Tasks valid        |  88.70% |  68.57% |  81.04% |  64.31% |  78.72% |  81.61%
Tasks invalid      |   5.70% |  15.68% |   9.62% |  18.02% |  11.00% |   9.31%
Tasks inconclusive |   0.14% |   0.16% |   0.55% |   1.13% |   0.66% |   0.37%
Tasks failed       |   5.46% |  15.59% |   8.79% |  16.53% |   9.62% |   8.72%
Tasks too late     |   0.29% |   0.16% |   0.09% |   0.11% |   0.17% |   0.22%
From this table I learn that FGRP3 is the most reliable application, whereas with BRP5 there is a chance of about 1/3 that a task fails or ends up in one of the other non-valid states. Of course this is the average over all OSes and hardware. The individual situation may look quite different.
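For anyone who wants to recompute the relative figures, here is a minimal Python sketch. The counts are copied from the server status snapshot quoted above; the names COUNTS, LABELS, shares and report are made up for this illustration, and the denominator is the sum (valid + invalid + inconclusive + failed) as described above.

```python
# Task counts from the server status snapshot quoted above
# (4 Apr 2014 12:05:02 UTC). Order: valid, invalid, inconclusive, failed.
COUNTS = {
    "FGRP3":  (402_161, 25_847, 618, 24_770),
    "S6CasA": (67_252, 15_374, 157, 15_289),
    "BRP4":   (83_619, 9_924, 564, 9_071),
    "BRP5":   (54_859, 15_370, 967, 14_102),
    "BRP4G":  (110_502, 15_437, 921, 13_507),
}
LABELS = ("valid", "invalid", "inconclusive", "failed")


def shares(counts):
    """Return each count as a percentage of the sum of all counts (2 decimals)."""
    total = sum(counts)
    return tuple(round(100.0 * c / total, 2) for c in counts)


def report(table):
    """Build one text line per application with its percentage shares."""
    lines = []
    for app, counts in table.items():
        cells = ", ".join(
            f"{label} {pct:.2f}%" for label, pct in zip(LABELS, shares(counts))
        )
        lines.append(f"{app}: {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report(COUNTS))
```

Running it prints one line per application, which makes it easy to re-check the table whenever the server status page changes.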
I will be pleased to get your responses and remain
with kind regards and all time happy crunching
Martin
Don't forget to check the detailed status page for the BRP5 run found here:
http://einstein6.aei.uni-hannover.de/EinsteinAtHome/download/BRP5-progress/
You can also get to this page from the server status page if you scroll down to the BRP5 search progress box and click on the details link.
From that page I draw the conclusion that the total error rate (invalid, client error and validate error) is about 5-6% of the number of valid tasks. I might be wrong here as it's not every day I read these types of diagrams.
I have well over 30 hosts equipped with GPUs that are crunching BRP5. There are more than 80 hosts in the 'farm'. Obviously, I don't have time to monitor every single one of these - in fact, I only monitor anything directly if there is a failure or if I want to tweak particular machines for whatever reason. I have a fairly sophisticated script which does a number of things to do with work cache control and sharing of common data files between the hosts. The script contacts every host in the fleet several times per day and the very first thing it does is check that the host is running and that BOINC is running. It produces an extensive log which allows me to see problems quite quickly without physically going to each machine. The log contains an extensive record of how many and what type of task that each machine downloads, amongst other things.
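As an illustration only (this is not the script described above, whose details are not given here), the first step of such a check could be sketched in Python roughly as follows. It assumes that each host's BOINC client accepts GUI RPC connections from the monitoring machine on the default port 31416; the function names port_open and check_fleet are hypothetical.

```python
import socket

BOINC_GUI_RPC_PORT = 31416  # default TCP port of the BOINC client's GUI RPC


def port_open(host, port=BOINC_GUI_RPC_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def check_fleet(hosts, port=BOINC_GUI_RPC_PORT):
    """Map each host name to True/False depending on whether the port answers."""
    return {host: port_open(host, port) for host in hosts}
```

Calling check_fleet with a list of host names returns a dictionary that can be written to a log. An answering port only shows that the host is up and something is listening there; a stricter check would query the client itself.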
Sure, that doesn't pick up if returned results are not validating, but I do regularly peruse the complete list of hosts on the website and check the RACs. It's surprisingly easy to spot hosts returning rubbish by noting a change in RAC. So on a regular basis, I will pick up any suspicious changes and then will check the full tasks list on the website. If there still seems to be an issue, I will check the physical host. I've done this for a long time over many hosts and I hardly ever see an invalid result that can't be directly attributed to a hardware issue. To me, problems are not due to the app but rather to hardware that has some sort of issue. Many of my hosts have been running 24/7 for over 4 years in a non-airconditioned environment (30-35C ambient) so there will be hardware failures from time to time. I'm continually surprised at how few failures there really are.
You can't infer the 'quality' of the app from that table. You may be able to infer that the BRP5 app puts more stress on a system, perhaps. If a bunch of hosts in my farm over quite a period of time and under adverse conditions (hot environment) are not showing invalid results, how can you say there is a problem with the app??
I think that perhaps some people may have added/upgraded to powerful GPUs in systems that are not quite 'up to the task', either power wise or cooling wise, or both. I think that this may be skewing the table and making the invalid count higher than it otherwise would be. The simple fact remains. A real issue with the app should show up for everybody. And it doesn't!
Cheers,
Gary.