Validation Error - Help with GPU (Caicos)

Suj78
Joined: 12 Apr 15
Posts: 3
Credit: 3843779
RAC: 0
Topic 198678

I have been informed that my work units have been invalidated for some time on one of my machines. It's an older Dell computer; here is the GPU information:

https://www.techpowerup.com/gpuz/details/dxvuw

I am not sure what exactly is causing the invalidation. Can someone help me out?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117922291273
RAC: 34539932

You actually have two machines that are having problems. The problems are quite different so I'll comment on each one individually.

Firstly, the machine with the AMD A10-6700 APU. I really know nothing about these processors but if you look at the task ID link for the most recently failed task, you will see the stderr.txt messages that were sent back to the project. The snippet below shows the actual error message.

7.6.22

exceeded elapsed time limit 113685.60 (5600000.00G/49.26G)


In other words, the crunching of this task was terminated by BOINC, not because of a computation error, but because BOINC saw that the task was taking far too long to crunch and had exceeded a built-in time limit. If you look at your full list of tasks, you will see that the very next one actually succeeded in a time (103,784.15 secs) just a bit less than the limit (113,685.60 secs).
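The numbers in that message can be checked with a few lines of Python. This is only a sketch: it assumes that the two values in parentheses are the task's operation bound (in GFLOP) and the speed BOINC estimated for the device (in GFLOPS), so that the limit is simply their ratio. The function name `parse_limit` is made up for illustration.

```python
import re

# Message copied from the stderr snippet quoted above.
MESSAGE = "exceeded elapsed time limit 113685.60 (5600000.00G/49.26G)"

def parse_limit(message):
    """Return (limit_secs, bound_gflop, speed_gflops) from a time-limit message."""
    match = re.search(r"limit ([\d.]+) \(([\d.]+)G/([\d.]+)G\)", message)
    if match is None:
        raise ValueError("not an elapsed-time-limit message")
    limit, bound, speed = (float(group) for group in match.groups())
    return limit, bound, speed

limit_secs, bound_gflop, speed_gflops = parse_limit(MESSAGE)

# Assumption: limit = operation bound / estimated speed.
# The printed speed is rounded, so expect a difference of a few seconds.
recomputed_secs = bound_gflop / speed_gflops

# The next task on the list finished in 103,784.15 secs.
fraction_of_limit = 103784.15 / limit_secs

print(f"limit in message : {limit_secs:.2f} s")
print(f"bound / speed    : {recomputed_secs:.2f} s")
print(f"successful task used {fraction_of_limit:.1%} of the limit")
```

The successful task used a little over 91% of the allowed time, which is why it is worth speeding things up rather than relying on luck.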

To make sure you crunch these GPU tasks faster in future and don't risk testing the limit, the best thing for you to do initially would be to change your BOINC preferences to NOT allow all 4 CPU cores to crunch CPU tasks. You may have to experiment to get the best results. You should open BOINC Manager (Advanced view) and find the menu item that will allow you to reduce the number of CPU cores that BOINC is allowed to use. I don't use v7.6.22 (I use 7.2.42) so it may be a different menu item to what I'm used to. On mine, it's under Tools->Computing Prefs->Processor usage. I would change the 100% default value to 75% or even 50% of the processors to see what gives the best improvement in reduction of GPU crunch time. I think you may get a quite useful improvement.

Please try both 75% and 50% of processors and allow enough time to do a couple of tasks completely at each setting so that you can get a reasonable idea of the improvement. Once we have that information, there are other things that can be tried to get perhaps even further improvement in efficiency.
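As a rough guide to what those percentages mean on a 4-core machine, here is a small sketch. It assumes BOINC truncates the result to a whole number of cores and never goes below one; the helper name `usable_cores` is hypothetical and not part of BOINC.

```python
def usable_cores(total_cores, percent):
    """Cores BOINC may use for crunching.

    Assumption: the product is truncated to a whole number,
    with a minimum of one core.
    """
    return max(1, int(total_cores * percent / 100))

TOTAL_CORES = 4  # the A10-6700 discussed above

for percent in (100, 75, 50):
    cores = usable_cores(TOTAL_CORES, percent)
    free = TOTAL_CORES - cores
    print(f"{percent:3d}% -> {cores} core(s) crunching, {free} left free")
```

At 75% one core is left free to feed the GPU, and at 50% two cores are left free.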

Now the machine with the Caicos series GPU. This is not an integrated GPU and there is no "time limit exceeded" problem here. If you pick any failed task in your list and click on its task ID link, there are no error messages. The crunching was completed successfully in every case. However, none of your results seem to agree with what other hosts are producing. If you look at your complete list of tasks, none have been marked as valid.

There is no easy way to predict what might be causing this. It's highly unlikely to be a problem with the tasks themselves since other computers doing the same work are producing valid answers. It's most likely some issue with your hardware. From the link you provided, I assume there is no overclocking. The information in that link doesn't show GPU temperature. Have you measured that and have you inspected the heat sink and fan for cleanliness and proper running? If these things are OK, you need to think about things like RAM and PSU, as well as the GPU itself. There will likely be something operating out of spec, perhaps not enough to crash the machine but just enough to make the answers wrong enough to be declared invalid. To protect the science, the answers have to agree pretty closely with what another host produces to be declared valid.

Without additional information from you, it's very hard to guess what the problem might be. What else do you run on that machine? Do you ever get crashes? Do other projects all run correctly (no invalid results)? You don't seem to be doing CPU tasks at Einstein, so I presume you are using your CPU cores at other projects. If you can supply more details, it may be easier for someone to guess what might be causing the problem. If you draw a blank with checking out hardware and running conditions, I have a suggestion for you to try. If you take some load (temporarily) off the machine by setting the local preferences on it to use just 50% of processor cores, it would be interesting to see what happens to GPU tasks. They should crunch faster. They may even validate, but I don't really expect that. However, if they did, it would at least indicate that the problem may be load (or heat) related.

Cheers,
Gary.

Jasper
Joined: 14 Feb 12
Posts: 63
Credit: 4032891
RAC: 0

Quote:
I don't use v7.6.22 (I use 7.2.42) so it may be a different menu item to what I'm used to. On mine, it's under Tools->Computing Prefs->Processor usage. I would change the 100% default value to 75% or even 50% of the processors to see what gives the best improvement in reduction of GPU crunch time. I think you may get a quite useful improvement.

It's almost the same under 7.6.22, or at least it is on my version for OS X. In the Manager, it is found under 'Options | Computing preferences...'. In the dialog that opens, select the '⚙ Computing' tab, where the topmost two items read 'Use at most ... % of the CPUs' and 'Use at most ... % of CPU time'.

Alternatively, one can of course use the Web preferences, found under 'Your account | When and how BOINC uses your computer', where both parameters are also available. It's either way but not both; I mention that just for completeness' sake.

For me, without a suitable GPU, both settings are at 100% by default. I don't know whether that's always the case or whether it differs for other situations / hardware.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117922291273
RAC: 34539932

Quote:
... top most two items read 'Use at most ... % of the CPUs' ...


Yes, thank you, this is the option the OP needs to change.

Quote:
Alternatively, one can of course use the Web preferences, to be found under 'Your account | When and how BOINC uses your computer', where both parameters are also available. It's either way but not both, mentioning that just for completeness sake.


It can be both - it's just that local settings will take precedence over website settings :-).

For this situation, where the OP has more than one computer, a change on the website will affect all computers unless the OP knows in advance to put each host in a different 'location' (or 'venue'). Setting up locations is a complication best avoided until it is actually needed. For the purposes of testing to find the best settings, local changes are quick and easy to apply on each individual machine. The change is immediate and doesn't require an 'update' to force communication with the project.

Quote:
For me without suitable GPU, both settings are by default at 100%. I don't know whether that's always or maybe different for other situations / hardware.


Yes, both settings by default always start out at 100%. This is appropriate for CPU only crunching. For crunching with AMD GPUs in particular, using 100% of CPU cores seems always to lead to a loss of GPU crunching efficiency, even when only 1 GPU task is running concurrently. With low end AMD GPUs, it's often best to run 2 GPU tasks concurrently just to cause BOINC to stop using one CPU core by default without actually reducing the 'cores' setting to 75%. This is particularly so where the computer is supporting multiple projects. If there were no Einstein GPU tasks crunching, 100% of cores would be available at other projects, rather than just 75% if the setting were in place.

In my experience, even for some relatively low end AMD GPUs, running two GPU tasks concurrently actually still provides a small improvement compared to running tasks singly with 1 CPU core not crunching by using the 'cores' setting. For mid-range and higher GPUs, there are significant benefits from running two (or more, sometimes) concurrent GPU tasks.
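The trade-off between the two approaches can be sketched with a very simplified model. It is not BOINC's actual scheduler. It assumes that each GPU task reserves 0.5 of a CPU core (an illustrative figure, not one stated in this thread) and that BOINC only drops a CPU task once the reservations add up to a whole core. The function name and parameters are hypothetical.

```python
import math

def cpu_tasks_running(total_cores, gpu_tasks, cores_percent=100,
                      cpu_per_gpu_task=0.5):
    """Estimate how many CPU tasks run alongside the GPU tasks.

    Simplified model under two assumptions:
      * each GPU task reserves `cpu_per_gpu_task` of a core;
      * only whole reserved cores are taken away from CPU crunching.
    """
    allowed = max(1, int(total_cores * cores_percent / 100))
    reserved = math.floor(gpu_tasks * cpu_per_gpu_task)
    return max(0, allowed - reserved)

CORES = 4
scenarios = [
    ("1 GPU task,  100% cores", 1, 100),
    ("2 GPU tasks, 100% cores", 2, 100),
    ("1 GPU task,   75% cores", 1, 75),
    ("no GPU tasks, 100% cores", 0, 100),
    ("no GPU tasks,  75% cores", 0, 75),
]
for label, gpu_tasks, percent in scenarios:
    running = cpu_tasks_running(CORES, gpu_tasks, percent)
    print(f"{label}: {running} CPU task(s)")
```

Under these assumptions, running two GPU tasks frees a core only while GPU work is present, whereas the 75% setting keeps a core idle even when no Einstein GPU tasks are running.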

I know that archae86 has commented before about what should perhaps be the 'default' GPU task concurrency for hosts with a GPU capable of crunching here. There must be literally thousands of hosts out there crunching tasks singly that could be a lot more productive if the default was '2x'.

Cheers,
Gary.

Suj78
Joined: 12 Apr 15
Posts: 3
Credit: 3843779
RAC: 0

Hi all,

Thank you for the helpful and well reasoned explanations.

The more important computer is the AMD A10-6700 APU. I reduced CPU core usage to 75%. I will wait to see if the results can compute in a faster time frame and not be terminated.

For the Caicos GPU computer, I will have to wait until Monday to see if I can do the same procedure as on the A10 and see if there is improvement. That computer is actually used by someone else, and they run World Community Grid and SkyNet POGS on it. Those projects' submissions have all been validated with no errors.

For both computers, I seem to have these issues only with E@Home. The other projects seem fine. The one that would probably be closest to having the same issues is MilkyWay@home, but BOINC hasn't been pulling tasks for that project for the past few days. I will wait to see if there are similar issues. That's also a very GPU-intensive project.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117922291273
RAC: 34539932

Quote:
... The more important computer is the AMD A10-6700 APU. I reduced CPU core usage to 75%. I will wait to see if the results can compute in a faster time frame and not be terminated.


You already have an answer about that. From the tasks list for that computer, there is now a further validated result. This one is for a BRP6 task. The previous task of this type exceeded the time limit at 239,562.66 secs. The latest task completed and validated in 206,055.63 secs. You should note that even though this is a good reduction in crunch time, it's only a fraction of what you will see for the next task. This is because most of the crunching of the 'just completed' task would have been done under the previous settings before you made the change.

The crunching of this task would have started at 8 Jul 2016, 14:07:09 UTC, when the previous task finished. Assuming you made the change in settings not too long before you posted about it (10 Jul 2016, 17:09:36 UTC), it's likely that the task had been running for around 2 full days before it got the benefit from the change in settings. The new task currently crunching will have the complete benefit so I expect to see a further substantial reduction in crunch time.
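For anyone who wants to check the "around 2 full days" estimate, here is a short sketch using the two timestamps above. It assumes the settings change happened at the moment of posting, so the share it prints is an upper bound on how much of the task ran under the old settings.

```python
from datetime import datetime, timezone

# Timestamps quoted above (both UTC).
task_started = datetime(2016, 7, 8, 14, 7, 9, tzinfo=timezone.utc)
post_made = datetime(2016, 7, 10, 17, 9, 36, tzinfo=timezone.utc)

# Elapsed time of the task that validated after the change.
task_elapsed_secs = 206055.63

# Assumption: the change was made at the time of posting.
old_settings_secs = (post_made - task_started).total_seconds()
old_settings_share = old_settings_secs / task_elapsed_secs

print(f"time before the post : {old_settings_secs:,.0f} s "
      f"({old_settings_secs / 86400:.2f} days)")
print(f"share of the task    : {old_settings_share:.0%} at most")
```

That works out to a little over 2 days, so nearly all of that task ran before the change could help it.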

Please note that you are crunching both BRP4G (Arecibo Radio Telescope data) and BRP6 (Parkes Radio Telescope in Australia) and that these different runs have quite different crunch times and credit awards. There is no problem in doing this - you can select whichever science run appeals to you or select them all, which is the default. I just wanted to be sure you were aware of the difference so there is no confusion about the different crunch times. The only 'disadvantage' with BRP4G tasks is that data is not always available so there will be periods where you can't get tasks for that run.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117922291273
RAC: 34539932

Quote:
... From the tasks list for that computer .... I expect to see a further substantial reduction in crunch time.


Well, you sure got that!! At the time of posting :-

Latest BRP6 task just 35,634.33 secs.

Latest BRP4G task just 14,350.37 secs.

Rather dramatic improvements.
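To put a number on "dramatic" for the BRP6 tasks, here is a quick sketch using only the times quoted in this thread. The BRP4G figure is left out because no earlier completed time was explicitly given for that task type.

```python
# BRP6 times quoted in this thread, in seconds.
brp6_time_limit = 239562.66    # point at which an earlier task was terminated
brp6_partly_new = 206055.63    # task that ran mostly under the old settings
brp6_fully_new = 35634.33      # task that ran entirely under the new settings

speedup = brp6_partly_new / brp6_fully_new
share_of_limit = brp6_fully_new / brp6_time_limit

print(f"BRP6 speedup          : {speedup:.1f}x")
print(f"share of time limit   : {share_of_limit:.0%}")
```

The latest BRP6 task now finishes in a small fraction of the time limit, so there is no longer any real risk of BOINC terminating it.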

Cheers,
Gary.

Suj78
Joined: 12 Apr 15
Posts: 3
Credit: 3843779
RAC: 0

Thank you! Looks like your idea worked. I will try to adjust the Dell computer tomorrow and see if that works.
