Ian&Steve C. wrote:
... it's definitely a Windows vs Linux thing. And due to the relative spread of Windows vs Linux hosts (many more Windows), that puts the Linux hosts at a disadvantage.
Thanks for the confirmation. My whole setup is Linux based and I have no desire (or even the ability) to change that. The problem really needs to be fixed.
Do you have any thoughts about how this could be done, or at least improved?
I think realistically... it will require a change in the scheduler to segregate hosts into OS camps.
That's if the project even wants to do that, because from their POV the overall/cumulative invalids are pretty low: the vast majority of results (from Windows hosts) are valid, with the higher invalid rates coming only from the few(er) Linux hosts.
I don't think full segregation is feasible, as you might run into a situation where the project doesn't send you any work, even though work is available, just because they don't have anyone to match you with within your "group", like what happens sometimes with the locality scheduling scheme. I'm not sure the admins have the ability to fully separate host pairing by OS type, even if they wanted to. I think before they even consider that, they should look into whether the tasks marked invalid are close enough that they would be "valid" if they had come from a pair of Linux hosts.
The weird thing is that it's not like I'm not validating with cuda55 and win-ati hosts; I still have a majority of valids from them too. It's just that when a task is deemed invalid, it was a Windows pairing. So why do most tasks still validate fine while others don't? Maybe we just need to wait a little longer and let things reach some kind of equilibrium after the project's change in the apps, to see if things settle down.
Personally, I'll accept whatever the project decides to do. If they want to just leave it as-is, I'll deal with it.
Thanks for the thoughtful response. It's always good to have a range of opinions and ideas to consider.
Ian&Steve C. wrote:
I don't think full segregation is feasible, as you might run into a situation where the project doesn't send you any work, even though work is available, just because they don't have anyone to match you with within your "group", like what happens sometimes with the locality scheduling scheme.
Before GPUs became a thing, I had a number of years' experience with locality scheduling (LS) on CPU-only hosts. The biggest thing people complained about was completed work where no quorum partner had been assigned. The main reason seemed to be that there were large numbers of 'frequency bins' and relatively few hosts per bin. People fixated on credits used to get quite vocal if they had completed lots of tasks (multiple days' or even weeks' worth) with no quorum partner in sight. This didn't worry me at all, since extra hosts would eventually be assigned when they ran out of work for their previous frequency bin.
The current situation is nowhere near as severe, since there would be only two 'bins' with more than enough hosts to ensure regular filling of quorums for both. Task turnaround times are much faster as well. If a Linux host asks for work and there are no quorums where a Linux _1 task is needed, the host should immediately get new _0 tasks. The time for such a quorum to be filled would be very short. I don't think the project not sending work could occur, certainly not with enough regularity to be troublesome.
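To make that concrete, here is a rough sketch of the two-bin idea. It is purely illustrative: the class and method names are invented and this is not the actual BOINC/Einstein@Home scheduler code. The point is simply that a host either fills a waiting _1 slot from its own OS bin or is immediately given a fresh _0 task, so it is never sent away empty-handed while work exists.

# Illustrative two-bin scheduler sketch; not the real BOINC/Einstein@Home code.
# 'Quorum', the OS bins and the method names are all invented for this example.
# A real scheduler would also avoid sending both tasks of a quorum to the same host.
from collections import deque

class Quorum:
    def __init__(self, workunit, os_group):
        self.workunit = workunit
        self.os_group = os_group      # "windows" or "linux"
        self.tasks_sent = 1           # the _0 task that created the quorum

class TwoBinScheduler:
    def __init__(self):
        self.open_quorums = {"windows": deque(), "linux": deque()}
        self.next_wu = 0

    def request_work(self, host_os):
        """Give the host a _1 resend from its own OS bin if one is waiting,
        otherwise start a brand-new quorum with a _0 task."""
        bin_ = self.open_quorums[host_os]
        if bin_:
            quorum = bin_.popleft()   # fill an existing quorum of the same OS
            quorum.tasks_sent += 1
            return f"{quorum.workunit}_1"
        # No partner needed right now: hand out a fresh _0 task instead.
        self.next_wu += 1
        wu = f"WU{self.next_wu}"
        self.open_quorums[host_os].append(Quorum(wu, host_os))
        return f"{wu}_0"

sched = TwoBinScheduler()
print(sched.request_work("linux"))    # WU1_0  (new quorum, nothing to fill)
print(sched.request_work("linux"))    # WU1_1  (fills the waiting Linux quorum)
print(sched.request_work("windows"))  # WU2_0  (the Windows bin was empty)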
Ian&Steve C. wrote:
... they should look into whether the tasks marked invalid are close enough that they would be "valid" if they had come from a pair of Linux hosts.
That was my hope when I presented information about a ~10% invalid rate for the 0.15 app. The 'fix' wasn't a tweaking of the validator but rather the 0.17 app. That app seems to be giving much worse results for me. The fact that the validator hasn't been mentioned may indicate that they are unwilling to touch that.
Ian&Steve C. wrote:
The weird thing is that it's not like I'm not validating with cuda55 and win-ati hosts; I still have a majority of valids from them too. It's just that when a task is deemed invalid, it was a Windows pairing. So why do most tasks still validate fine while others don't?
Exactly!
When I started a test machine for the 0.17 app, I downloaded sufficient tasks to last for 10 days. I didn't want to have ongoing work requests interfering with the counts for the various categories. For example, the 'All' category started with 560 and that remains unchanged. I know nothing has yet been removed from the on-line database. When that does happen, I will immediately know about it.
Here are the current stats for all categories:-
All = 560
In progress = 392
Pending = 24
Valid = 107
Invalid = 15
Error = 0
Inconclusive = 22
On the assumption that inconclusives are a Windows vs Linux thing, I expect that ~80% of the current 22 (i.e. ~18 of them) will be deemed to be invalid when 'resend' tasks are returned. With that in mind, the real ratio of valid to invalid will turn out to be something like 111 valid to 33 invalid. That's very close to 23% invalid.
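For anyone who wants to check that arithmetic, the projection is simply the following. The 80% inconclusive-to-invalid figure is an assumption, not a measurement.

# Projection of the eventual invalid rate from the snapshot above.
# The 80% "inconclusive -> invalid" figure is an assumption, not a measurement.
valid, invalid, inconclusive = 107, 15, 22

extra_invalid = round(0.8 * inconclusive)          # ~18
extra_valid = inconclusive - extra_invalid         # ~4

projected_valid = valid + extra_valid              # ~111
projected_invalid = invalid + extra_invalid        # ~33
rate = projected_invalid / (projected_valid + projected_invalid)
print(projected_valid, projected_invalid, f"{rate:.1%}")   # 111 33 22.9%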
I'm confident the machine and the GPU are in good working order. It performed very well on FGRPB1G with a very low rate of invalids.
Few things:
1. There was a flaw in the validator that has been fixed, but ultimately that didn't have much of an impact on validation, so I didn't mention it.
2. The problem with validation is not about tolerances of parameters of results. In all "invalid" cases I have looked at (between Windows and Linux) there are a few lines in the result table that are really different, and are not just at the "bottom" of the toplist (cut off by hard criteria because of tiny numerical differences). I really don't know where these different "candidates" come from.
3. There is a mechanism in BOINC called "homogeneous redundancy", which could do what Gary suggests. It's a bit clumsy to set up, but once done, it should work well. I'm discussing this with technicians and scientists; so far we haven't reached an agreement on whether to use it or not. I, personally, am all for that. (See the sketch after this list.)
4. With Tuesday being a public holiday in Germany, this is a long weekend with a short week afterwards, and quite a few people took this whole week off. Please bear with us.
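For anyone unfamiliar with homogeneous redundancy: the general idea is that each host is mapped to an equivalence class (at its coarsest, just the OS family) and a quorum only accepts results from hosts in the same class. The sketch below merely illustrates that grouping rule with made-up function names; it is not BOINC's actual implementation, and the real feature is configured on the project's server rather than by volunteers.

# Illustrative sketch of the "homogeneous redundancy" idea: hosts are mapped to
# an equivalence class, and a quorum only accepts results from hosts in the same
# class. The class definition here (OS family only) is made up for this example;
# it is not BOINC's actual HR code.
from typing import Optional

def hr_class(os_name: str) -> str:
    """Collapse a host's OS string into a coarse class used for pairing."""
    os_name = os_name.lower()
    if "windows" in os_name:
        return "windows"
    if "darwin" in os_name or "mac" in os_name:
        return "mac"
    return "linux"

def can_join_quorum(quorum_class: Optional[str], host_os: str) -> bool:
    """A host may fill a quorum only if the quorum is still unclaimed
    (no class assigned yet) or the host falls into the same class."""
    return quorum_class is None or quorum_class == hr_class(host_os)

print(can_join_quorum("linux", "Microsoft Windows 10"))   # False: rejected
print(can_join_quorum("linux", "Linux Ubuntu 20.04"))     # True: same class
print(can_join_quorum(None, "Darwin 19.6"))               # True: first task sets the class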
Thanks for looking into this and for the extra information. It's good to know that it's a significant and unexplained difference and not just an overly zealous validator. I could imagine the reason for these differences in candidate lists might be hard to find and take some time. Is the validity of the science done thus far in any way affected?
The worrying point is this: if two Windows hosts give quite a different results list from what would be returned for the same data by two Linux hosts, which set of results should you believe to be 'correct'? I imagine the cause really needs to be identified before proceeding too far with the analysis.
I'm sure we would all "bear with you" for however long it might take to sort this out.
In a previous message I wrote:-
... the 'All' category started with 560 and that remains unchanged. I know nothing has yet been removed from the on-line database.
That has now changed and the 'All' figure is 559.
I've been recording the stats values regularly and the final values before any were removed are listed below:-
All = 560
In progress = 283
Pending = 9
Valid = 207
Invalid = 35
Error = 0
Inconclusive = 26
Assuming that ~80% of inconclusives will eventually become invalid, that means the 26 inconclusives will eventually provide ~21 invalid and ~5 valid. Of the 268 tasks completed and subjected to validation, ~212 are likely to be valid and ~56 are likely to be invalid. 56 out of 268 is 20.9% invalid.
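Applying the same ~80% assumption to this later snapshot gives the figures quoted above.

# Same projection as before, applied to the later snapshot.
# The 80% figure is still an assumption, not a measurement.
completed = 560 - 283 - 9                      # All - In progress - Pending = 268
extra_invalid = round(0.8 * 26)                # ~21 of the inconclusives
projected_invalid = 35 + extra_invalid         # ~56
projected_valid = 207 + (26 - extra_invalid)   # ~212
print(projected_invalid / completed)           # ~0.209, i.e. about 20.9% invalid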
In his last message, Bernd commented:-
Quote:
... a few lines in the result table that are really different, and are not just at the "bottom" of the toplist ...
That "really different" part concerns me.
My understanding is that the 'toplist' is a list of potential candidates in decreasing order of 'likelihood' of something being detected. To have higher-value candidates disagreeing, rather than those at the "bottom" of the list, suggests something quite bad. Others can speak for themselves, but I'd like some assurance that what is currently being produced is not just a waste of volunteered resources.
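To illustrate why that matters, here is a small sketch (not the project's actual validator; the candidate values, the tolerance and the list length are made up) of the difference between a mismatch at the cutoff, where two near-equal candidates can swap places through tiny numerical differences, and a mismatch high in the toplist, where the hosts have genuinely found different candidates.

# Illustrative sketch only; not the project's real validator. The tolerance,
# toplist length and candidate values below are invented for this example.
TOL = 1e-4        # assumed relative tolerance for matching candidates

def close(a, b, tol=TOL):
    """True if two values agree within the relative tolerance."""
    return abs(a - b) <= tol * max(abs(a), abs(b))

def compare_toplists(result_a, result_b):
    """Return the ranks at which the two hosts report different candidates."""
    mismatches = []
    for rank, ((fa, sa), (fb, sb)) in enumerate(zip(result_a, result_b), 1):
        if not (close(fa, fb) and close(sa, sb)):
            mismatches.append(rank)
    return mismatches

# Each entry is (frequency, detection statistic), best candidate first.
windows_host = [(101.23, 9.81), (101.25, 9.64), (103.07, 9.02), (99.88, 8.95), (100.41, 8.90)]
linux_host   = [(101.23, 9.81), (101.25, 9.64), (103.07, 9.02), (99.88, 8.95), (102.66, 8.89)]

# A mismatch only at rank 5, where the statistics are nearly equal, is the benign
# 'bottom of the toplist' case; a mismatch at rank 1 or 2 would mean the two hosts
# genuinely found different candidates.
print(compare_toplists(windows_host, linux_host))   # -> [5]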
I have a few questions:-
Is there a 'reference' implementation of the app and a 'reference' hardware/OS combination that could be used to separately validate inconclusive results to see if there is some sort of 'pointer' to exactly what is causing this?
Could the scientists/researchers who will use the returned results provide further commentary as to whether or not this interferes with their work? I'm not a scientist but sayings like "garbage in == garbage out" tend to spring to mind.
Is there any evidence that the use of 'optimised apps' is a contributing factor in any way?
I ran across this host that is running an optimised app under anonymous platform and the current stats are:-
All = 2052
In progress = 24
Pending = 414
Valid = 1191
Invalid = 315
Error = 3
Inconclusive = 105
It's running Linux, so the majority of the inconclusives will likely become invalid. I had an inconclusive task that was paired with one on that machine. The 'decider' was a Windows machine. My task got validated and the other lost out. The above results suggest an invalid rate of around 30%.
I'm also seeing improvements on invalids going from 1200 to 1222.
Keith Myers wrote:
No that is ...
Thanks for that. I don't have any nvidia hardware and haven't been paying attention.