Latest data file for FGRPB1G GPU tasks

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2960592689

RAC: 706136

25 Jan 2019 10:41:46 UTC

Message 169069 in response to message 169068

Oliver has PM'd me to say that he wasn't aware of this part of the problem. Argh.

Leaving our collective frustration to one side, he does know now.

@Gary and @Peter - could you please gather your collective summary records from months ago, and ram a list of which data file types do, or don't, run on Turing down Oliver's throat?

Edit - sorry about that, it came over a bit harsh. But I do think the staff need to be aware of the depth of feeling that this issue has caused.

Oliver says that it will be Bernd who does the actual coding, so Gary and Peter might send their data direct to him while Oliver is working on other things...

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 984

Credit: 25171438

RAC: 23

25 Jan 2019 10:50:00 UTC

Message 169071 in response to message 169068

Gary Roberts wrote:

Then how come Turing GPUs are quite able to crunch tasks based on certain series of data files that have particular characteristics but not on others that have different characteristics?

Mea culpa, I wasn't aware of that as I'm not the one working on the problem, and it seemed very unlikely from what I've seen so far. So thanks for the pointer, we'll take that into account.

Gary Roberts wrote:

To characterize this as just 'noise' is rather demeaning to those who have spent a lot of cash buying the latest hardware with the intention of supporting this project.

Gary, the "this" in your sentence is a misunderstanding. I didn't mean the RTX 2080 issue. In my understanding this thread here started about the flawed FGRP dataset that we issued by mistake, which already got understood and fixed in the meantime. That dataset problem was totally independent of the RTX 2080 compatibility problem. Every Turing-post was an off-topic discussion to me that now (that we said we work on it) returned to the seemingly original topic of discussing FGRP dataset features or releases. Thus by "noise" I only meant further independent FGRP dataset discussions. Does that make sense?

So to reiterate: we are working actively on the RTX 2080 problem and there already have been a number of very helpful exchanges via PMs that you guys aren't always aware of. In addition to that I'm not always aware of PMs only Bernd received. What I'm trying to do here is to engage with our volunteers to get across that the problem is dealt with, as Bernd might not always be able to do that himself as often as he'd like to.

Richard wrote:

@Gary and @Peter - could you please gather your collective summary records from months ago, and ram a list of which data file types do, or don't, run on Turing down Oliver's throat?

This is such an example of redundant communication that might add to people's frustration (here: at least Peter). Peter already provided us with his awesome summary so there's no need to bug him further.

If there is a better canonical thread to discuss the Turing problem, please let us know, as we'd love to avoid any further confusion and misunderstandings; we'd much rather put that time into fixing the actual problem.

Best,
Oliver

Einstein@Home Project

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2960592689

RAC: 706136

25 Jan 2019 11:23:03 UTC

Message 169072

I think the nearest thing we have to a canonical thread is 'Pascal again available, Turing may be coming soon', starting at comment 167059 (30 September 2018). But that thread has also become far too unwieldy for this purpose.

I thought Gary had started a dedicated warning thread for new RTX users in the 'Problems and Bug Reports' area, but I can't find it now. Perhaps it's time to start a clean thread for the debug and testing phase?

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117759295401

RAC: 34781539

26 Jan 2019 11:01:20 UTC

Message 169096 in response to message 169071

Oliver Behnke wrote:

... Does that make sense?

This thread was never about the inadvertent re-issuing of the LATeah2003L.dat data file - an event that didn't happen until 18th December.

I started this thread on 5th December with three purposes in mind. One was to give people a heads-up about the expected speed of crunching - fast or slow. Another was to help people avoid tasks that could be predicted to fail on Turing series GPUs, based on previous experience. I went to the trouble of providing a link to Peter's initial report so people could understand the issue better, if they so desired. The third was to draw the further attention of Project staff to the fact that there was an ongoing issue with the failure of certain types of tasks.

The mistake I made was to assume that the number of existing reports/comments of Turing GPUs failing, and all of the discussion about how to set up a standalone environment for testing different task types with new driver or firmware versions, could not possibly have been missed by Project staff. Following Peter's initial reports, and the discussion that followed, there was no clear indicator of the cause of the problem - hardware, firmware, driver or application, or perhaps a mix. I assumed that Project staff were not entering these discussions because they were not prepared to waste time speculating about causes which seemed likely to be out of their control.

I know you guys are extremely busy but perhaps you need to rope in someone like a postgrad student or two and give them the job of spending 5 mins each day to check for stuff of importance outside the normal chatter that goes on a lot of the time. If you like, I could edit the thread title of a standalone report that really needs Project staff attention by adding something like "**Staff attention needed**" to the title. If the report is buried in some other thread, I could start a new thread just to point to the buried message.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117759295401

RAC: 34781539

26 Jan 2019 11:26:56 UTC

Message 169097 in response to message 169072

Richard Haselgrove wrote:

I thought Gary had started a dedicated warning thread for new RTX users in the 'Problems and Bug Reports' area, but I can't find it now. Perhaps it's time to start a clean thread for the debug and testing phase?

I did create a separate thread, but it was this one, in Cruncher's Corner :-). I tried to post on whether tasks for a new data file were likely to fail or not as soon as possible after the new file was issued - hopefully in time for people to do something about it :-).

Cheers,
Gary.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2960592689

RAC: 706136

26 Jan 2019 12:33:25 UTC

Message 169098 in response to message 169096

Gary Roberts wrote:

I know you guys are extremely busy but perhaps you need to rope in someone like a postgrad student or two and give them the job of spending 5 mins each day to check for stuff of importance outside the normal chatter that goes on a lot of the time. If you like, I could edit the thread title of a standalone report that really needs Project staff attention by adding something like "**Staff attention needed**" to the title. If the report is buried in some other thread, I could start a new thread just to point to the buried message.

A technique which worked quite well for me was a new thread in the 'Problems and Bug Reports' area: Urgent bug: certificate error [cleared]. I backed that up with an email to a member of staff - I won't name them here, because it turned out to be the wrong person - but the problem was fixed within a couple of hours. Oliver fixed it, Shawn reported the fix, and I edited the thread title. Not necessarily in that order.

That one was a particularly clear-cut and simple fix, but it suggests that we could negotiate a process with the staff, for clearly identified problems only:

Open a specific new 'problem' thread
Notify staff by email
Staff member who fixes it responds to notifier
Reporter can clean up the thread, reply to secondary reports in other threads or fora, etc.

The remaining question is: who does the email go to? There are two options. Either we ask for an 'areas of operation' staff list, so we can pick the right person, or we (perhaps some of us) are given permission to pollute the internal staff email list - which again I won't identify in public.

And it would be greatly appreciated if whoever picks up on a problem report could report back to the notifier, even if there's only time for a one-word email.

lunkerlander

Joined: 25 Jul 18

Posts: 46

Credit: 31464094

RAC: 0

26 Jan 2019 13:00:53 UTC

Message 169102

Both of your suggestions, Gary and Richard, seem like they'd work.

I also had an idea: when a problem like this comes up, someone like Gary, who is a moderator and I'm sure reads all of the posts here, could send a quick email or message to a staff member. That way they don't have to keep reading the message boards all the time, but can still be notified in a timely manner when there is something that needs their attention, like a server issue or this Turing GPU error issue.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2960592689

RAC: 706136

26 Jan 2019 13:10:51 UTC

Message 169105 in response to message 169102

We can't heap the whole responsibility on Gary's shoulders (broad though they are), because of the timezone problem. Gary was in Australia last time I heard, I'm in the UK, and Peter is fairly far west in the USA. The project staff are divided between central Europe and central USA. We don't often get a chance to meet or discuss in real time, but there's probably enough overlap to operate a common procedure, if we choose to follow that course.

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

26 Jan 2019 15:02:16 UTC

Message 169110 in response to message 169105

Richard Haselgrove wrote:

We can't heap the whole responsibility on Gary's shoulders (broad though they are), because of the timezone problem. Gary was in Australia last time I heard, I'm in the UK, and Peter is fairly far west in the USA. The project staff are divided between central Europe and central USA. We don't often get a chance to meet or discuss in real time, but there's probably enough overlap to operate a common procedure, if we choose to follow that course.

Outside of infrastructure problems or mass failures of all tasks, there really isn't anything that needs real-time reporting, and both of those should be sending alerts to the relevant staff via monitoring tools. So I think having a single moderator on point to send "have you seen this" messages to project staff is probably sufficient. Having several means that either project staff get multiple messages about the same issue, or some issues fall through the cracks when everyone assumes someone else sent the message.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7228961563

RAC: 1134407

27 Jan 2019 14:15:32 UTC

Message 169141 in response to message 169056

Gary Roberts wrote:

A new data file LATeah0104Y.dat came into play more than 12 hours ago. It has the same size as, and would appear to be a continuation of a previous series that ended with LATeah0104X.dat (first mentioned in the opening post of this thread). The tasks based on the new file will most likely crunch faster than the previous 2103L tasks.

Based on the fact that 0104X tasks did fail on Turing GPUs, I imagine the new ones would also fail, unfortunately.

As was typical for that series of data files in the past, 0104Y was only in new issue for a few days. Now we are getting new work issued from 1041L. If past behavior grouped by file name holds, these tasks would be predicted to have long elapsed times and to work correctly on Turing cards with the current applications and drivers. The data file size is 819,029 bytes, which exactly matches the size Gary Roberts reported for previous groups of tasks in that series.

To re-state our observations on similar-behaving sets of Einstein Gamma-Ray Pulsar tasks:

Filename  Size (bytes)  Elapsed time  Turing
10nnL          819,029  longest       works
0104?        2,720,502  shortest      fails
20nnL        1,935,482  intermediate  fails

These behaviors have held true since Turing cards appeared on Einstein late in September, 2018. The 2103L file run very recently is for this purpose classed in the 20nnL group, because of renaming mentioned earlier in this thread.
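For anyone who wants to apply the grouping above automatically, here is a minimal Python sketch. It only encodes the observations in the table; the regular expressions, the `classify` function and the `OVERRIDES` mapping are my own illustrative assumptions, not anything taken from the project's applications. In particular, I am assuming the "?" in 0104? stands for a single uppercase letter, and that file names always look like LATeahNNNNX.dat. The result is a prediction from past behavior, not a guarantee.

```python
import re
from typing import NamedTuple, Optional


class Group(NamedTuple):
    name: str          # group label used in the table above
    size_bytes: int    # data file size observed for the group
    elapsed: str       # relative elapsed time observed for the group
    turing_ok: bool    # True if tasks were observed to work on Turing


# Patterns are assumptions derived from the group labels in the table.
GROUPS = [
    (re.compile(r"10\d\dL"), Group("10nnL", 819_029, "longest", True)),
    (re.compile(r"0104[A-Z]"), Group("0104?", 2_720_502, "shortest", False)),
    (re.compile(r"20\d\dL"), Group("20nnL", 1_935_482, "intermediate", False)),
]

# 2103L is classed with the 20nnL group because of the renaming noted above.
OVERRIDES = {"2103L": "20nnL"}


def classify(filename: str) -> Optional[Group]:
    """Return the predicted group for a data file name, or None if unknown."""
    m = re.fullmatch(r"LATeah(\w{5})\.dat", filename)
    if m is None:
        return None
    tag = m.group(1)
    override = OVERRIDES.get(tag)
    for pattern, group in GROUPS:
        if group.name == override or pattern.fullmatch(tag):
            return group
    return None


if __name__ == "__main__":
    for f in ("LATeah1041L.dat", "LATeah0104Y.dat", "LATeah2103L.dat"):
        g = classify(f)
        verdict = "unknown" if g is None else ("works" if g.turing_ok else "fails")
        print(f, "->", verdict, "on Turing (predicted)")
```

A file name that matches none of the groups returns None, so a new series would show up as "unknown" rather than being silently misclassified.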
