For me it was obvious in only 4 cases out of approximately 90. If I hadn't been using a script to monitor unusual behaviour, the backoff would have counted down to zero, a further communication would have returned the accumulated results, both good and bad, new work would have been fetched, and a casual observer might not have noticed anything particularly unusual. In fact, more than 80% of my hosts have had no issue at all.
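For anyone curious, the kind of check involved can be sketched in a few lines of Python. This is not my actual script; it assumes the standard boinccmd tool is available and that its --get_project_status output contains a 'min RPC time' line for each project, so treat the parsing as illustrative only:

```python
#!/usr/bin/env python3
# Rough sketch only: flag any project whose next scheduler contact has been
# deferred for a long time (e.g. the 24hr backoffs described above).
# Assumes `boinccmd --get_project_status` prints, per project, a "name:" line
# and a "min RPC time:" line holding a unix timestamp - check your own
# client's output and adjust the labels if they differ.
import subprocess
import time

ALERT_SECS = 6 * 3600    # complain about anything deferred more than 6 hours

def long_backoffs(host="localhost"):
    out = subprocess.run(
        ["boinccmd", "--host", host, "--get_project_status"],
        capture_output=True, text=True, check=True).stdout
    name, flagged = None, []
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("min RPC time:"):
            remaining = float(line.split(":", 1)[1]) - time.time()
            if remaining > ALERT_SECS:
                flagged.append((name, remaining))
    return flagged

for project, secs in long_backoffs():
    print(f"{project}: next contact deferred for another {secs/3600:.1f} hours")
```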
When I checked this morning, there were no further backoffs in communication but there were further errors, with the machine recovering on its own quite quickly. As it happens, the particular machine was one of those that had a previous bout of errors followed by a quick recovery and no backoff - as it did again this time. I've trawled through the complete tasks list starting at this page and I think there might be a couple of things of interest. Please note that particular pages are dynamic. As older tasks get removed from the database, remaining tasks will move up the list and, over time, to progressively lower page numbers.
I chose the linked page as a starting point because (currently at the bottom) there are the final tasks for the previous 0061 data file. These show a crunch time of around 2100-2200 secs. On the very next page are the first tasks for 0101. These quickly settle down to a time of around 1650-1700 secs. The first of these tasks were returned around 1:00 AM UTC on May 2. There are no signs of any problems with the transition. Nothing changes until several pages later, where good tasks were still being returned at 21:16:00 UTC on May 2. The last batch of good tasks (I counted 5) included a 0061 resend. The next communication with the server occurred at 22:07:48 - less than an hour later. At that time I counted a total of 33 tasks being returned: 30 errors and 3 good. One of the good tasks was a 0061 resend.
After that brief episode, there were reasonably regular returns of good tasks until we get to this page where the last return of a good task is at 17:06:57 UTC on May 5. It's currently at the top of the page so will probably be near the bottom of the previous page soon. If you look around, the tasks (all 0101 data file) are being returned regularly in very small batches. The next return after that is at 17:59:32, less than an hour later and includes 24 compute errors and 3 good tasks, one of which is a 0061 resend. I haven't seen any resend tasks failing - just the 0101 tasks. I guess it could easily just be coincidence but it is a bit strange that failures seem to occur in 0101 tasks just at the time resends are being processed.
I just hooked up peripherals and opened BOINC Manager on this machine. There is a further 0061 resend task that will get processed sometime tonight, probably well after I've left. I also noticed that the 0101 data file changes to 0102 at about the same time. The last frequency listed in 0101 tasks is 1204.0 and the initial frequency for the new file (which normally starts at quite a low value) is listed as 1212.0. It seems that the 0102 tasks are just the higher-frequency continuation of the 0101 tasks. Previously, the frequency listed for a particular data file has gone right through to something like 1560 or so. I haven't previously seen one data file transition to a different one at an intermediate frequency like this.
I'm glad I hooked up the monitor and went through the tasks list on the machine because there was a very interesting thing to observe. It's a quad core host with a HD7850 GPU. It crunches one CPU task and 2 GPU tasks concurrently. When I start a machine, I make an effort to ensure that the two GPU tasks are offset to the maximum degree possible - around 45-47% difference between the 'stage' that each task in the pair is at. Of course, over time, this 'separation' wanders around but it is quite unusual to see it almost completely non-existent, as it was here. Both tasks were about 56% complete with only 15 seconds between their two crunch times. I think this is a fairly strong indication of what may have happened at the times tasks were failing.
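To be clear about what I mean by 'separation': it's the gap between the two tasks' progress figures, measured the short way around the crunch cycle, so the best possible offset with two concurrent tasks is about 50%. A trivial illustration (my own numbers, nothing taken from the app itself):

```python
def separation(pct_a, pct_b):
    """Offset between two concurrent tasks, as a percentage of one run.

    Measured 'the short way around': tasks at 10% and 90% are effectively
    20% apart, because the leading task will soon finish and be replaced
    by a fresh one at 0%.  The maximum possible offset is therefore 50%.
    """
    gap = abs(pct_a - pct_b) % 100.0
    return min(gap, 100.0 - gap)

print(separation(10.0, 56.0))   # 46.0 - the sort of offset I aim for
print(separation(56.0, 56.9))   # ~0.9 - effectively none, as found on this host
```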
My thinking is this. Two reasonably well 'separated' 0101 tasks were crunching. One finished while the other was partly finished. The 0061 resend task started without issue. The second 0101 task finished OK sometime later but when the replacement tried to start, it had the memory allocation error. This continued for roughly 24 x 21 = 504 secs (about 8.5 minutes, since each failed task consumes about 21 secs of run time) until the 0061 task finished successfully. At that point two 0101 tasks could start in quick succession now that the 'troublemaker' had finished. A few hours later I come along and find the evidence :-).
Of course, it could all just be coincidence but this does seem to 'fit' very nicely. Another thing that 'fits' is the fact that the maximum number of trashed tasks was around 40 or so and it occurred a few times. The bulk of my GPUs are doing these tasks in around 28-30 minutes. Half of that (15 mins) corresponds to 43 failed tasks if each failed task was clocking up 21 secs of elapsed time before failure. Many of my hosts would have their two crunching tasks separated by close to 15 mins. It was no surprise, then, to see several examples of 12-15 minutes' worth of task failures before sanity returned.
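For anyone who wants to check the arithmetic behind those two estimates, here it is spelled out (the 21 secs and 28-30 mins figures are simply the ones quoted above):

```python
FAIL_SECS = 21            # elapsed time each failed task clocks up
NORMAL_RUN = 30 * 60      # a typical 0101 task on these GPUs, ~28-30 min

# The episode documented earlier: 24 errors in a row
print(24 * FAIL_SECS / 60)              # 8.4 -> the 'approximately 8.5 mins'

# Worst case: the two GPU tasks separated by half a run (~15 min), so the
# 'troublemaker' blocks new task starts for that whole window
print((NORMAL_RUN / 2) / FAIL_SECS)     # ~42.9 -> roughly 43 trashed tasks
```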
If this scenario is anywhere near correct then, firstly, only GPUs crunching concurrent tasks should see the problem - lots of people would be eliminated because the default is 'one at a time'. Secondly, it could well be that this is specific to Linux. If so, that would make a large difference in the number who might see the problem. Thirdly, the problem might be specific to AMD GPUs. Fourthly, the problem might depend on a particular driver. With that in mind, I just had a quick look at the hosts that have so far exhibited the problem. They are all using the older fglrx driver because Southern Islands and Sea Islands GPUs (which they all have) are not yet supported by the new Linux amdgpu open source driver. I also have a lot of Polaris GPUs using the amdgpu driver. None of them have exhibited the problem.
So perhaps that might be why there aren't other reports about this. Maybe I've got a monopoly on older AMD GPUs crunching two concurrent tasks with the fglrx driver under Linux :-).
Yes, Really!! It did not occur to me that the website was the issue.
Perhaps that's because I don't browse other project websites so have nothing to compare with.
Zalster wrote:
You are going to play innocent in the very long and old discussion of why people don't like the new format of the webpages.
Boy, you seem to have a pretty poor opinion of me and seem to love jumping to ridiculous conclusions :-(.
Believe it or not, I'm not playing anything. Being autistic, I quite often don't get the subtle nuances of cryptic comments where the reason isn't explained and doesn't seem to be related to the current conversation. I don't do hidden agendas so find it hard to understand the reason behind such comments. Sometimes I try to find out. More fool me!! I've been bitten so many times before that you would think I should have woken up by now :-(.
I vaguely remember the website before Drupal and I contributed to some of the initial complaints - particularly things like the loss of the ability to categorise tasks, both per application and per task status (valid, invalid, error, pending, etc). There are probably other things I've forgotten about. The important things have gradually been addressed so that what exists today seems largely OK. Maybe I'm just too stupid to see problems that still exist.
Zalster wrote:
I have things to do....
Then why bother wasting time posting in this particular thread? It's a rhetorical question that doesn't need an answer. I've probably got enough clues to work that one out.
Gary Roberts wrote:
If this scenario is anywhere near correct, only GPUs crunching concurrent tasks should see the problem. ...
It pretty much lets me off the hook. I am on Linux, but with only one work unit at a time and with an Nvidia card. I think I will not look through my results with the thoroughness that you did, but thanks for the explanation.
Jim1348 wrote:
... I think I will not look through my results with the thoroughness that you did, but thanks for the explanation.
You're most welcome.
I find it hard to move on from a situation where the reason for unexpected behaviour is not only unknown but seemingly rather crazy. After all, why should a machine suddenly trash a bunch of tasks and then, just as suddenly, fix itself and go back to normal behaviour? If I can find some sort of half-way logical explanation, I feel much less troubled by it. It's good to have an answer of sorts, even if there's no way of knowing whether it's correct or not. I guess if the problem goes away when the 0061 resends finish, that will be some sort of confirmation.
There were no further examples of a machine going into a 24hr backoff this morning. However, one machine with dual Pitcairn series (HD7850) GPUs, which had trashed a group of 5 tasks a couple of days ago before recovering, did the same thing again last night - 5 tasks trashed followed by a quick recovery. The machine does 4 concurrent tasks on the 2 GPUs so there are tasks finishing at quite short intervals. There was a 0061 resend successfully crunched and returned just before the error tasks showed up but that doesn't seem to fit the pattern. There were two more good tasks returned after the resend and then there were 5 errors and 4 good tasks in the next batch returned. That all happened about 18 hours ago now and nothing further has gone wrong since. I guess we'll see what tomorrow brings :-).
These shorter run times have pushed my tasks pending to a new record high level.
I get my full list of tasks for all hosts in order to select the list of just errors. Hopefully this is a pretty small list most of the time but it's over 200 at the moment as a result of what I've been documenting over the last couple of days.
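In principle the selection step is nothing fancy. Something like the following would do it, assuming the per-host task lists had already been saved locally as CSV files (the directory, column names and status string are purely illustrative - not the actual layout my scripts work with):

```python
import csv
from pathlib import Path

ERROR_STATUS = "Error while computing"      # illustrative status text

def error_tasks(dump_dir="task_dumps"):
    """Yield (host, task_name) for every errored task in the saved lists."""
    for csv_file in sorted(Path(dump_dir).glob("*.csv")):
        with open(csv_file, newline="") as f:
            for row in csv.DictReader(f):
                if row["status"] == ERROR_STATUS:
                    yield row["host"], row["task"]

errors = list(error_tasks())
print(f"{len(errors)} errored tasks across all hosts")
```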
I don't normally pay any attention to pendings so I've no real idea of what the trend is - up or down. Since you made a comment about it, I had a look at mine just now - over 11,600. I guess that's quite a few :-). No idea if it's a 'record high' or not, though :-).
The faster running data continues and RAC continues to rise. My goal with this project was to maintain a 500K RAC and give the rest of my resources to SETI. I shall slowly reduce my resources to E@H as things haven't even leveled out yet. I see this as a bonus for DC.
Well, it looks like the party may be over - I'm now getting data that takes ~38 min, running 2 at a time on a GTX1060.
I wondered how long it might take you to notice :-).
A little over 24 hours ago, one of my various scripts advised me that LATeah0105L.dat had been replaced by LATeah1001L.dat. The function of the script isn't so much to advise of a change as to store a copy of the new file and to deploy it to all the other hosts in the fleet. Its other main function is to keep deploying previous data files in the series so that when resends come along, previous files don't get downloaded repeatedly just to service a single resend task. As all this activity is logged by the script, I've been rather surprised at how much download activity is associated with the processing of resends. The aim of all this is to develop automatic procedures for limiting my impact on the project servers.
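To give a rough idea of the mechanism, here is a minimal sketch (my real script is shell based; the project directory, cache location, host list and rsync-over-ssh deployment below are all assumptions for illustration, not a copy of what I actually run):

```python
#!/usr/bin/env python3
# Sketch of the 'archive and deploy' idea described above.
import glob
import os
import shutil
import subprocess

PROJECT_DIR = "/var/lib/boinc-client/projects/einstein.phys.uwm.edu"
CACHE_DIR = os.path.expanduser("~/lat_data_cache")    # local archive
HOSTS = ["host01", "host02", "host03"]                # rest of the fleet

os.makedirs(CACHE_DIR, exist_ok=True)

# 1. Archive any LATeah data file we haven't seen before.
for path in glob.glob(os.path.join(PROJECT_DIR, "LATeah*.dat")):
    cached = os.path.join(CACHE_DIR, os.path.basename(path))
    if not os.path.exists(cached):
        print("new data file:", os.path.basename(path))
        shutil.copy2(path, cached)

# 2. Push the whole cache (the current file plus the older ones that resends
#    still need) to every other host, so a single resend task doesn't trigger
#    yet another download of an old file from the project.
for host in HOSTS:
    subprocess.run(
        ["rsync", "-a", "--ignore-existing",
         CACHE_DIR + "/", f"{host}:{PROJECT_DIR}/"],
        check=True)
```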
This particular script is relatively new so what it is revealing about download activity has been very useful for understanding why the project is often slow to respond. I'm really glad I spent the effort to document and better understand what goes on and to work out ways to avoid downloading whenever possible.
In earlier posts in this thread, I commented on a nasty aspect of an earlier transition from LATeah0061L.dat to LATeah0101L.dat. That was the sudden occurrence of potentially large numbers of compute errors - a situation that always seemed to right itself but often had the side effect of causing a 24hr backoff in project communication. My conclusion was that it seemed to be associated with LATeah0061L.dat resend tasks when the host was processing mainly the newer 0101L tasks. I surmised that it might be associated with the use of Linux and the deprecated fglrx driver that was needed for Pitcairn series GPUs. I wondered if the problem would stop when the 0061L resends stopped.
After I wrote that, there were more examples of the problem, all giving pretty much identical symptoms to what I had already described. In recent times, I haven't seen any further examples and, since the supply of 0061L resends has virtually finished, I guess that's the reason why. I'm now wondering what sort of transition might be in store for 0105L going to 1001L.
When I was alerted to the arrival of 1001L tasks, I promoted one to see what the crunch time would be like. It seemed to have a very similar run time to what had been the norm for the 0061L series. So yes, the party is over, as you put it. We are back to 'normal' behaviour. It was very nice while it lasted :-).
The recent data file series has been 0059L - 0060L - 0061L - 0101L - 0102L - 0103L - 0104L - 0105L - 1001L. Whilst I wasn't paying particular attention, the early part of that series all had a similar performance that was around 20-25% 'slower' than that of the 010nL series. Since the single task of the latest series seemed to be about 20-25% 'slower' once again, that gives me the impression that the 010nL series was a brief detour into a 'different' type of data and now we have simply returned to the 'main game'. This is obviously just a complete guess based on a single result - so pretty useless really :-).
Here is a screenshot which shows the effect that this 'party' had for all my hosts as a single group. You can see that it was almost a month long party. It will be interesting to see what happens over the next 30 days.
Gary, on my 1060 hosts it's ~30 min per task running 2 at a time; the party gave me ~25 min, and 38 min now seems to be the new normal.