Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

tullio

Joined: 22 Jan 05

Posts: 2118

Credit: 61407735

RAC: 0


20 May 2020 18:51:40 UTC

Message 177945


My only wingman uses a 2 GB board. So what?

Tullio

Mr P Hucker

Joined: 12 Aug 06

Posts: 838

Credit: 519369318

RAC: 15392


20 May 2020 21:00:45 UTC

Message 177947 in response to message 177924


tullio wrote:

Watching GW GPU tasks on my BOINC manager I see a curious thing. Progress rises very rapidly to about 14% in about 3 minutes, then falls back to 0.470 and rises more slowly.

Tullio

That happens on many projects, I wouldn't worry about it. It's just the task going through two different stages and having a rubbish progress meter.

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

Mr P Hucker

Joined: 12 Aug 06

Posts: 838

Credit: 519369318

RAC: 15392


20 May 2020 21:01:53 UTC

Message 177948 in response to message 177925


Ian&Steve C. wrote:

weird stuff happens when you run out of video memory and produce computation errors.

https://einsteinathome.org/host/12735373/tasks/6/0

you should remove this GPU from running GW tasks. these things will continue to happen until you do so. run Gamma Ray on the GPU if you still want to use it for Einstein. If you want to run GW tasks, run it on your CPU, or upgrade to a GPU with at least 4GB of memory.

Since most of his tasks are completing ok, I don't see a problem. If they go wrong, they go wrong quickly, not wasting his GPU's time, and the server just hands it to someone else.


Mr P Hucker

Joined: 12 Aug 06

Posts: 838

Credit: 519369318

RAC: 15392


20 May 2020 21:03:15 UTC

Message 177949 in response to message 177930


Ian&Steve C. wrote:

I don't know how you don't understand this.

it has nothing to do with your wingmen or youtube or windows updates or whatever other nonsense you are trying to distract with. the GPU is insufficient for certain GW tasks, and there is nothing you can do to prevent yourself from getting them except stopping GPU GW processing.

some tasks will run OK. some tasks will not. some tasks require less than 2GB of GPU memory, these will succeed. some tasks require more than 3GB of GPU memory. these are the ones that will fail. they come in random and unpredictable times. the reason you haven't seen failures today yet is simply because you haven't been sent the large tasks in a little while. that doesn't mean the issue is fixed. you WILL receive the large tasks again and you will produce errors again.

please do everyone a favor and just stop GW processing on that 3GB GPU. you are just making the already bad situation worse.

What you should be doing is hitting the developers with a clue stick so that large tasks are not sent to people with small cards.
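As a rough illustration of what "not sending large tasks to small cards" could look like, here is a minimal sketch of a memory-fit filter. It is not the Einstein@Home or BOINC scheduler code; the `Task` and `Host` classes, the per-task memory estimate, and the `headroom_mb` margin are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    est_gpu_mem_mb: int  # hypothetical: estimated peak GPU memory for this task


@dataclass
class Host:
    gpu_mem_mb: int  # total GPU memory reported by the host


def eligible_tasks(host, tasks, headroom_mb=256):
    """Return only the tasks whose estimated memory fits on the host's GPU.

    headroom_mb is an assumed safety margin for the driver and desktop.
    """
    usable = host.gpu_mem_mb - headroom_mb
    return [t for t in tasks if t.est_gpu_mem_mb <= usable]


tasks = [Task("small", 1800), Task("large", 3200)]
print([t.name for t in eligible_tasks(Host(3072), tasks)])
print([t.name for t in eligible_tasks(Host(4096), tasks)])
```

With these made-up numbers, the 3 GB host is only offered the small task, while the 4 GB host is offered both.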


Ian&Steve C.

Joined: 19 Jan 20

Posts: 3965

Credit: 47220532642

RAC: 65383364


20 May 2020 21:03:27 UTC

Message 177950 in response to message 177945


tullio wrote:

My only wingman uses a 2 GB board. So what?

Tullio

So it will probably fail and then be resent to someone else, and keep being resent until it lands on a host whose GPU has enough memory.

He's your only wingman "right now". but the nature of the validation process creates new tasks to be sent to additional hosts when the first two don't agree or one of them returns an error.
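A toy model of that resend behaviour, for illustration only. It is a simplified sketch, not BOINC's actual validator: it assumes a quorum of 2, treats "not enough GPU memory" as the only cause of failure, and ignores results that complete but disagree. The function name and parameters are made up for this example.

```python
def run_workunit(hosts_gpu_mem_mb, required_mem_mb, quorum=2, max_attempts=8):
    """Send copies of one task to hosts in order until `quorum` of them succeed.

    Returns (copies_sent, validated). A host "succeeds" here only if its GPU
    memory is at least required_mem_mb, which is an assumption of this sketch.
    """
    successes = 0
    attempts = 0
    for mem in hosts_gpu_mem_mb:
        if attempts == max_attempts:
            break
        attempts += 1
        if mem >= required_mem_mb:
            successes += 1
        if successes == quorum:
            return attempts, True
    return attempts, False


# Two small cards error out first, so two extra copies have to be sent.
print(run_workunit([3072, 2048, 8192, 6144], required_mem_mb=3200))
```

In this example the task is sent four times instead of two before it validates, which is the extra overhead being described.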

_________________________________________________________________________

Mr P Hucker

Joined: 12 Aug 06

Posts: 838

Credit: 519369318

RAC: 15392


20 May 2020 21:05:32 UTC

Message 177951 in response to message 177935


Tom M wrote:

Preliminary results with an AMD Radeon 5700 on GW GPU indicate almost no change in processing speeds from 1 task to 3 tasks.

It runs from 18+ minutes to under 21 minutes. So basically an R5700 can out produce high-end Nvidia cards if you run 3 gpu tasks.

There may be memory issues with 3 GPU tasks, just like there are with a GTX 1060 3GB video card.

I am switching my Pulsar Search#1 box to run both P and GW gpus since I expect the R5700 to end up on it possibly by the weekend.

Tom M

AMD cards don't care if they run out of memory (they just use system memory). Nvidias do. I avoid Nvidias.


Ian&Steve C.

Joined: 19 Jan 20

Posts: 3965

Credit: 47220532642

RAC: 65383364


20 May 2020 21:25:41 UTC

Message 177952 in response to message 177948


Peter Hucker wrote:

Ian&Steve C. wrote:

weird stuff happens when you run out of video memory and produce computation errors.

https://einsteinathome.org/host/12735373/tasks/6/0

you should remove this GPU from running GW tasks. these things will continue to happen until you do so. run Gamma Ray on the GPU if you still want to use it for Einstein. If you want to run GW tasks, run it on your CPU, or upgrade to a GPU with at least 4GB of memory.

Since most of his tasks are completing ok, I don't see a problem. If they go wrong, they go wrong quickly, not wasting his GPU's time, and the server just hands it to someone else.

hard to say. when I was testing out my 1060 3GB to see the failure mode, what happened was the card started loading data into GPU memory, and filled up in the first 10-15 seconds or so. at that point, the task progress jumped from 0 to 100% and moved to "complete", but it kept trying to reload the data into the GPU memory over and over and wouldn't start a new task. I watched the GPU mem climb to full then crash to 0, over and over and over. it sat there for quite a while with this behavior (several minutes) until I manually aborted it. but the reported run time only logged 15 seconds or so, when in reality it had wasted minutes and minutes (and only that because I intervened) of the PC's time. It's probably similar for others. so you can't use the runtime to accurately gauge how much time it wasted on the afflicted hosts.

I think if you are aware that your system is producing errors and bad results for a project, and you know the reason, and you know how to fix it, you have the obligation to do so for the benefit of the project. not just let it keep pumping out errors because "some" are still succeeding. this same mindset plagued several projects where bad AMD drivers caused consistently incorrect computations on RX5700 (navi) cards, which validated with each other, but invalidated against everyone else. or when Nvidia changed something in their ~436 Windows drivers that only affected one type of WU at SETI. several people thought they would just leave their system generating tons of bad results choosing to ignore the bad results and only look at the ones that validated without understanding what was happening.

_________________________________________________________________________

Ian&Steve C.

Joined: 19 Jan 20

Posts: 3965

Credit: 47220532642

RAC: 65383364


20 May 2020 21:17:32 UTC

Message 177953 in response to message 177949


Peter Hucker wrote:

What you should be doing is hitting the developers with a clue stick so that large tasks are not sent to people with small cards.

I've posted the relevant information in the tech forum, but I can't make the required people read it, and/or they may have other priorities.

there's a better chance of it getting more attention if more people bring it up, not just one man.

the squeaky wheel gets the grease. or something like that.

_________________________________________________________________________

Betreger

Joined: 25 Feb 05

Posts: 992

Credit: 1594102326

RAC: 767228


20 May 2020 21:26:20 UTC

Message 177954 in response to message 177953


I am starting to think they are willing to accept the overhead. In my case I was still producing good results 2/3 of the time.

As an aside, my 3GB cards are now only doing pulsars; the 6GB card gets GWs.

Mr P Hucker

Joined: 12 Aug 06

Posts: 838

Credit: 519369318

RAC: 15392


20 May 2020 22:34:09 UTC

Message 177957 in response to message 177952


Ian&Steve C. wrote:

hard to say. when I was testing out my 1060 3GB to see the failure mode, what happened was the card started loading data into GPU memory, and filled up in the first 10-15 seconds or so. at that point, the task progress jumped from 0 to 100% and moved to "complete", but it kept trying to reload the data into the GPU memory over and over and wouldn't start a new task. I watched the GPU mem climb to full then crash to 0, over and over and over. it sat there for quite a while with this behavior (several minutes) until I manually aborted it. but the reported run time only logged 15 seconds or so, when in reality it had wasted minutes and minutes (and only that because I intervened) of the PC's time. It's probably similar for others. so you can't use the runtime to accurately gauge how much time it wasted on the afflicted hosts.

Doesn't happen with my AMDs. :-P
They just use system memory and run slower. No invalid results.

Ian&Steve C. wrote:

I think if you are aware that your system is producing errors and bad results for a project, and you know the reason, and you know how to fix it, you have the obligation to do so for the benefit of the project. not just let it keep pumping out errors because "some" are still succeeding. this same mindset plagued several projects where bad AMD drivers caused consistently incorrect computations on RX5700 (navi) cards, which validated with each other, but invalidated against everyone else. or when Nvidia changed something in their ~436 Windows drivers that only affected one type of WU at SETI. several people thought they would just leave their system generating tons of bad results choosing to ignore the bad results and only look at the ones that validated without understanding what was happening.

If I had a machine that produced 8 good results, then failed on 2, but quickly, I'd leave it as is. It's doing good work. If I got 1 good result and 9 failures, I'd consider it was wasting server bandwidth. But no matter which of the above occurs, surely the programmers at Einstein can see the high failure rate, find the problem, and fix it at their end?

