Moments ago, I finally submitted a trouble report to Nvidia, using the Feedback site pointed to by the Nvidia driver feedback thread on the Reddit Nvidia forum.
I was able to say that four of four Einstein users reporting their Turing card experience have seen the same rapid failure syndrome on the "high-pay" WUs, and that one person (Vyper from SETI) had successfully reproduced the problem using only the ZIP file test case I provided. (This was the portable test I developed with extensive guidance from Juha, and additional help from Gary Roberts and Richard Haselgrove.)
I see a series of obstacles:
- We are not their dominant user base, and they are probably knee-deep in new release issues
- They are probably not used to being pointed to ZIP files containing test environments
- If they do see my test case fail, they may lack tools to investigate what is going on
- They may be inclined to blame the application
- They may request application instrumentation
- If they think they understand the problem, it still may not make it onto the fix priority list
But I've done what I can, and what I've done has been made vastly better than it would have been by input here. Thank you.
Maybe you could explain to me your use of "high-pay" and "low-pay" task terminology in your posts. As far as I have been able to figure out, Einstein uses a fixed credit mechanism that allots 3465 credits for a GPU task, no matter how long it takes to compute. So how can there be a higher or lower paying task?
Keith Myers wrote:
So how can there be a higher or lower paying task?
We are doing piecework. Constant credit per piece, but the high-pay units take far less time to finish. So the pay rate (per unit time) is much higher on the high-pay units.
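The piecework arithmetic can be sketched in a few lines. The 3465-credit figure comes from this thread; the runtimes below are made-up illustrative values, not measurements from any actual host.

```python
# Constant credit per task, so the pay *rate* depends entirely on runtime.
# CREDIT_PER_TASK is from the thread; the runtimes are hypothetical examples.
CREDIT_PER_TASK = 3465

def pay_rate(elapsed_seconds):
    """Credits earned per hour for a task of the given runtime."""
    return CREDIT_PER_TASK * 3600 / elapsed_seconds

low_pay_runtime = 1200   # hypothetical "low-pay" task: 20 minutes
high_pay_runtime = 400   # hypothetical "high-pay" task: under 7 minutes

print(pay_rate(low_pay_runtime))   # 10395.0 credits/hour
print(pay_rate(high_pay_runtime))  # 31185.0 credits/hour
```

Same credit per piece, three times the hourly rate when the piece finishes in a third of the time.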
Gary Roberts spoke against my terminology also, but I half thought he was joking. Maybe I should adopt another, but I don't like the one he proposed, either.
By the way, I don't even know whether the lethal difference between the two work types distributed in the last month actually lies in the data files, the template files, or the (very long) string of input parameters. Personally, I suspect the input parameters. Vyper tried crudely hacking off more than half of the parameter string on my test case, and the application then got the GPU going. But of course it seems unlikely the result would have met requirements, so it is a stretch to say that made it "work".
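Vyper's crude chop could be made systematic with a bisection over the parameter list, delta-debugging style. This is only a sketch: `task_fails` is a hypothetical stand-in for launching the app with a given parameter subset and checking whether the GPU run errors out, and it assumes a single parameter (or contiguous leading group) triggers the failure and that the app tolerates missing trailing parameters, as the crude chop suggested.

```python
# Hedged sketch: narrow down which parameter triggers the failure by bisecting
# over prefixes of the parameter list. `task_fails` is illustrative only, not
# the project's real tooling.
def find_culprit(params, task_fails):
    """Return the shortest prefix of `params` that still fails.

    Invariant: params[:hi] is known to fail, params[:lo] is known (or
    assumed) to pass; the culprit parameter is params[hi - 1] at the end.
    """
    lo, hi = 0, len(params)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if task_fails(params[:mid]):
            hi = mid          # culprit is within the first `mid` parameters
        else:
            lo = mid          # culprit lies beyond `mid`
    return params[:hi]

# Toy demo: pretend the sixth flag is the poison one.
flags = [f"--flag{i}" for i in range(10)]
fails = lambda subset: "--flag5" in subset
print(find_culprit(flags, fails)[-1])  # --flag5
```

With roughly log2(N) application launches instead of N, that would give Nvidia (or the E@H developers) a specific parameter to stare at rather than "half the string".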
No I didn't! I called it 'sexy and fashionable' and then said I wasn't complaining :-).
archae86 wrote:
... but I half thought he was joking.
As I certainly was. I just knew someone was bound to come along sooner or later and ask the obvious question that was just begging to be asked :-). So I tried to make a joke about it so you wouldn't have to spend time explaining that there weren't actually any tasks that 'paid' more than the standard amount :-). Looks like that didn't work too well either :-).
archae86 wrote:
Maybe I should adopt another ...
Don't you dare do that!! Your chosen terminology is part of the folklore now so changing it would be a disaster :-). It's just like the old days when somebody came up with the term 'wingman' instead of 'quorum partner' (or some other more official equivalent - if there ever was one). Everyone quickly got to know what 'wingman' meant and if this chopping and changing of tasks with distinctly different duration continues, we'll certainly need a popular term for it. High-pay and Low-pay are as good as any.
Ahh, OK, got it. At Seti, we call the fast-computing Arecibo tasks "shorties". You are correct: in little time the vernacular shorthand becomes common and accepted in the forums. OK, low-pay and high-pay it is.
archae86 wrote:
Vyper tried hacking off crudely more than half the parameter string on my test case, and the application then got the GPU going. But of course it seems unlikely the result would have met requirements, so it is a stretch to say that made it "work".
But if he could file a proper bug report stating which parameters were 'hacked off', that might point a programmer to the area of code which is either incompatible with, or needs re-compiling for, the new hardware. RTX cards aren't going away - and going by previous experience, people will just throw them into a working machine and break it. Einstein will have to get the debugger and the compiler out sooner or later, or suffer the error rate.
Until the new GW tasks are fully sorted out and transitioned from beta to production, I doubt E@H is going to have any developer resources available to look into other problems.
Not sure what NV will do, considering a different dataset runs OK. On the other hand, only the RTX cards are having issues. I'd think a joint investigation with E@H would be needed to really resolve it. Otherwise, as mentioned, we're just a minority.
As of somewhat over half a day ago, Einstein's current issue of Gamma-ray Pulsar GPU work has switched from the recent string of O104* files, which have "high-pay" characteristics and fail fast on Turing cards, to 1025L file work, which, on the established naming pattern, I expect to be low-pay work that will function entirely properly on Turing cards.
I plan to work down my stock of existing work before putting the 2080 card back in the box, but if anyone has an interest in trying out their Turing now would be a good time.
And now there are five Einstein users showing the same fast-failure syndrome on Turing cards running Einstein high-pay GRP WUs.
User CElliott has a 2070 host (the first of that variant for which we have any report here).
The system has processed 22 high-pay WUs in the 104V file, all failing, with typical elapsed times around 22 seconds. Error 36 is returned each time.
The user reports seeing a brief dark-screen interval, and has observed the error message "Display driver nvlddmkm stopped responding and has successfully recovered" (this specific observation matches one by Vyper when he tried out my portable trial directory).