Intel GPU driver restarts on LATeah task

h
Joined: 10 Dec 08
Posts: 4
Credit: 298227
RAC: 0
Topic 224687

As the subject title says.

2/1/2021 5:10:16 PM | Einstein@Home | task LATeah3001L00_676.0_0_0.0_4374648_0 suspended by user

Reproducible three times.

At 89.007% the GPU driver always stops working and is restarted.


Application: Gamma-ray pulsar binary search #1 on GPUs 1.22 (FGRPopencl-intel_gpu)
Name: LATeah3001L00_676.0_0_0.0_4374648
State: Task suspended by user
Received: 1/31/2021 11:12:58 AM
Report deadline: 2/14/2021 11:13:00 AM
Resources: 1 CPU + 1 Intel GPU
Estimated computation size: 525,000 GFLOPs
CPU time: 00:03:03
CPU time since checkpoint: 00:00:11
Elapsed time: 02:53:15
Estimated time remaining: 00:21:09
Fraction done: 89.998%
Virtual memory size: 266.73 MB
Working set size: 278.16 MB
Directory: slots/2
Process ID: 2624
Progress rate: 11.160% per hour
Executable: hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl-intel_gpu.exe
 

There is no corresponding line in the event log. The task just sits there as if it were working fine.

After restarting the OS, the same thing happens.

I will abort this task. There was no problem until now.

P.S.

driver version 21.20.16.4997 from 3/16/2018

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023894931
RAC: 1804540

It is a general property of the Einstein@Home Gamma-ray Pulsar GPU tasks that they report steady progress until somewhere just below 90%, then enter a different phase of execution, which proceeds without further progress reported until completion.

Several users have reported that the 3001-series tasks spent much longer in this "final phase" than did previous work.  Something like ten or twenty times longer has been reported.

One live possibility is that you just did not wait long enough.  Another is that these tasks behave badly on your Intel GPU system.
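To illustrate why the reported figures can mislead during this final phase, here is a minimal sketch that extrapolates the remaining time linearly from the elapsed time and fraction done shown in the first post. Linear extrapolation is an assumption made purely for illustration; it is not claimed to be how BOINC computes its own estimate, and the function names are hypothetical.

```python
def parse_hms(text: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def naive_remaining_seconds(elapsed: str, fraction_done: float) -> float:
    """Linear extrapolation (assumption): progress continues at the average rate so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    elapsed_s = parse_hms(elapsed)
    return elapsed_s * (1.0 - fraction_done) / fraction_done


# Values taken from the task properties in the first post.
remaining = naive_remaining_seconds("02:53:15", 0.89998)
print(f"naive estimate: {remaining / 60:.1f} minutes")
```

Any estimate of this kind treats the last 10% like the first 90%. Because the final phase reports no progress at all, neither this figure nor the displayed "Estimated time remaining" says how long that phase will really take on a given GPU.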

h
Joined: 10 Dec 08
Posts: 4
Credit: 298227
RAC: 0

Thanks for the reply, but unfortunately the task broke the driver, and after that it no longer uses the GPU or the CPU. There is a watchdog timer that usually signals problems like this, and the task ends up finished but not completed normally.

I think this project, or maybe MilkyWay@home, has GPU tasks that can't survive being suspended: after resuming, the task just sits where it left off, making no further progress while still consuming real power, and is eventually killed by the watchdog timer without recovering.

So this seems like a real bug to me, which is why I reported it here.

Tasks that need extra time to decide what to do show other behavior as well. One approach is to reset the progress to zero and start again; another project cheats on the elapsed time (which is hardly independent, since it is part of the BOINC software) and adjusts the time to completion.

From what you say, that sounds like a bug too, so the people who write this code should respond.

Useful information

2/1/2021 5:59:01 PM |  | Starting BOINC client version 7.16.11 for windows_x86_64
2/1/2021 5:59:03 PM |  | OpenCL CPU: Intel(R) Pentium(R) CPU G4400T @ 2.90GHz (OpenCL driver vendor: Intel(R) Corporation, driver version 6.8.0.2, device version OpenCL 1.2 (Build 2))
 

Good hunting

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109399026663
RAC: 35683236

h wrote:
So this seems like a real bug to me, which is why I reported it here.

I very much doubt there is anything that you could describe as a "real bug".

My understanding is that the GRP GPU tasks are designed to run on discrete GPUs that have some double precision (DP) capability for the 'followup stage' - the bit from ~90% to 100% where there is no ongoing progress indication.  They were not really designed for Intel GPUs, although some people have successfully run the previous types of these tasks on such hardware.  Those previous types had extremely short followup stages, so little if any DP use.

The latest version of the tasks has a much longer followup stage which (after very brief observation) seems to be related to DP capability.  For example on an AMD RX 570 GPU (which has reasonably good DP capability) the followup stage takes around 40-50 seconds.  On a similar series (but lower end and less DP capable) RX 460, the followup stage seems to take around 2 minutes or so.  The 570 has more than double the DP power of a 460.  These have just been a couple of casual observations rather than an actual attempt to measure it precisely.  It probably varies quite a bit from task to task.

I have no idea how much longer it might be for an Intel GPU.  The DP calculations would probably be done on a CPU core rather than the GPU and it might take an order of magnitude longer.  If that were happening, it wouldn't be surprising to see GPU utilization drop right away during that long process.

If you wanted to help this community, you could allow a task to run to completion, however long it takes, and without any interference or disturbance, and then report what you observe along with the time it took.  I'm guessing that it would eventually complete successfully.

That would be some useful information for the benefit of anyone else trying to run these newest tasks on an Intel GPU.

Cheers,
Gary.

h
Joined: 10 Dec 08
Posts: 4
Credit: 298227
RAC: 0

This happened again at the same place.

Now for this task:


Application: Gamma-ray pulsar binary search #1 on GPUs 1.22 (FGRPopencl-intel_gpu)
Name: LATeah3001L01_660.0_0_0.0_16897596
State: Running
Received: 2/2/2021 9:01:53 PM
Report deadline: 2/16/2021 9:01:53 PM
Resources: 1 CPU + 1 Intel GPU
Estimated computation size: 525,000 GFLOPs
CPU time: 00:02:50
CPU time since checkpoint: 00:00:04
Elapsed time: 02:52:43
Estimated time remaining: 00:21:09
Fraction done: 89.998%
Virtual memory size: 265.39 MB
Working set size: 282.67 MB
Directory: slots/1
Process ID: 5024
Progress rate: 31.320% per hour
Executable: hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl-intel_gpu.exe
 

I will give the task some more time.

In answer to the previous post: this is not normal behaviour. The driver is digitally signed. The OS is Windows 7.

2/2/2021 4:50:31 PM |  | Starting BOINC client version 7.16.11 for windows_x86_64

2/2/2021 4:50:32 PM |  | OpenCL: Intel GPU 0: Intel(R) HD Graphics 510 (driver version 21.20.16.4997, device version OpenCL 1.2, 1298MB, 1298MB available, 91 GFLOPS peak)
2/2/2021 4:50:32 PM |  | OpenCL CPU: Intel(R) Pentium(R) CPU G4400T @ 2.90GHz (OpenCL driver vendor: Intel(R) Corporation, driver version 6.8.0.2, device version OpenCL 1.2 (Build 2))
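For anyone comparing driver versions across machines, the startup lines above can be picked apart with a short script. This is a minimal sketch that assumes the `timestamp | project | message` layout and the OpenCL GPU message wording seen in the lines quoted in this thread; it is inferred from those lines only, not from any documented BOINC format, and the names `LogEntry`, `parse_event_line` and `GPU_PATTERN` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    project: str  # empty for messages from the client itself
    message: str


def parse_event_line(line: str) -> Optional[LogEntry]:
    """Split one event-log line on its first two '|' separators."""
    parts = line.split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = (part.strip() for part in parts)
    return LogEntry(timestamp, project, message)


# Pattern inferred from the single GPU line quoted in this thread (assumption).
GPU_PATTERN = re.compile(
    r"OpenCL: (?P<kind>.+? GPU) (?P<index>\d+): (?P<name>.+?) "
    r"\(driver version (?P<driver>[^,]+), device version (?P<device>[^,]+),"
    r".*?(?P<gflops>\d+) GFLOPS peak\)"
)

line = (
    "2/2/2021 4:50:32 PM |  | OpenCL: Intel GPU 0: Intel(R) HD Graphics 510 "
    "(driver version 21.20.16.4997, device version OpenCL 1.2, 1298MB, "
    "1298MB available, 91 GFLOPS peak)"
)
entry = parse_event_line(line)
match = GPU_PATTERN.search(entry.message)
print(match.group("name"), match.group("driver"), match.group("gflops"))
```

Note that the peak GFLOPS figure in this line is a single number and does not say anything about double-precision throughput.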

The processor has microcode E2, if that helps. The HWiNFO sensor monitoring utility reports:

GT fuses limit

Ring VR max voltage ICCMax PL4

but the temperature is 60 degrees Celsius, so there is no overheating.
 

OpenCL 1.2 has single and double precision; in fact, there are three OpenCL devices.

I am not sure whether this is a bug or a feature of the software; I am just reporting it.

Half an hour has passed and the task has made no progress.

Maybe it will run a self-diagnostic and return some more valuable information.

That is all for now.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109399026663
RAC: 35683236

h wrote:
This happened again at the same place.

Of course it does.  It's perfectly normal.  Everyone sees a much longer delay for the followup stage with the new "LATeah3001" type tasks.  It's different from what happened with tasks prior to "LATeah3001".

h wrote:
OpenCL 1.2 has single and double precision; in fact, there are three OpenCL devices.

The thing that is important is the DP capability of the hardware.  That is probably what is slowing things right down.  I don't know anything about Intel GPUs but I suspect they don't support DP in the hardware and the followup stage is being offloaded to a CPU core.  This would be the action that the app is designed to take when there is no DP in the hardware.

Since there is no progress reporting during the entire followup stage, the ONLY way to know if Intel GPUs can handle LATeah3001 tasks is to see if it will complete (without interference) BEFORE the built in time limit kicks in and terminates the task.  That could be several hours.  If that happens, there will be an error and you will be able to see a "TIME_LIMIT_EXCEEDED" message in the stderr output returned to the project.  To see the full log of information returned, click the "TaskID" link for the failed task on the website and scroll down below the "Stderr" heading.

If you get that error message, you will know for sure that there is no point trying to run the LATeah3001 tasks on your Intel GPU.  It could be different with a different driver but you would have to establish that.
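If you save the stderr text from the task page, the check for that message can be automated. This is a minimal sketch: the only thing taken from the discussion above is the "TIME_LIMIT_EXCEEDED" string; the function name is hypothetical and the sample text is a placeholder, not real output from the application.

```python
from typing import List


def time_limit_lines(stderr_text: str) -> List[str]:
    """Return every line of the stderr text that mentions TIME_LIMIT_EXCEEDED."""
    return [
        line.strip()
        for line in stderr_text.splitlines()
        if "TIME_LIMIT_EXCEEDED" in line
    ]


# Hypothetical placeholder text, NOT real output from the application.
sample_stderr = (
    "placeholder line one\n"
    "placeholder: TIME_LIMIT_EXCEEDED\n"
    "placeholder line three"
)
hits = time_limit_lines(sample_stderr)
print("time limit hit" if hits else "no time limit message found")
```

An empty result only means the message was not found in the text you pasted; it does not by itself prove the task completed successfully.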

Cheers,
Gary.

h
Joined: 10 Dec 08
Posts: 4
Credit: 298227
RAC: 0

I tried to put the computer to sleep, as it has no hibernate option, but I accidentally pressed a key on the keyboard. When it resumed, the task restarted from a point close to the moment of the problem. The elapsed time went back from a little over 5 hours to the last saved position.

The verdict is that the GPU can't save its computation state, even though VirtualBox can save the state of a virtual machine, if BOINC uses VirtualBox for that computation.

I have decided to uncheck this application, even though my other machine, which is CUDA (and OpenCL) capable, has no problem with it.

I aborted this task and another task of this type that was waiting.

The Intel HD 510 GPU has DP at half of its SP performance. I doubt they use the CPU for double precision, since the driver already provides a route that lets a CPU with SSE4.1 act as an OpenCL device.

I wonder why others with Intel GPUs have not confirmed or denied this problem, and on what hardware. It would mean that some tasks will return results affected by it.

So thanks for the replies.

Good crunching
