... and the contents of the results web page will be saved to the file results.txt as a space-delimited, 4-column table:
[Result_id] [Frequency] [seq.No] [cpu_time]
For people saving data over a period of time, the RID field is rather useless, seeing as tasks disappear from the database quite quickly and that number then points to nothing. I've been piping things through "cut -c10-" to leave just the last three fields, which works fine for me.
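For anyone wanting to replicate that, a minimal sketch of the pipeline (the nine-character id width is an assumption; the field-based form avoids depending on it):

cut -c10- results.txt          # drop the first 9 characters (the Result_id column)
cut -d' ' -f2- results.txt     # alternative: drop the first space-delimited field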
Is the RID field actually useful for anything do you reckon?
Bikeman and Gary Roberts, may I port and use your scripts?
:-)
Feel free to use it for whatever you want.
Nice shell script, Gary! However, if your workunit cache is so big that it covers the whole (or most) of the first results page, this won't work. I guess you also have to consider following pages (using the offset URL parameter) to be on the safe side.
Well, I'm kind of "relational" so I like to keep the IDs for everything :-). They might be useful for checking individual results later on.
... if your workunit cache is so big that it covers the whole (or most) of the first results page, this won't work. I guess you also have to consider following pages (using the offset URL parameter) to be on the safe side.
In my case, very few of my machines go onto a second page as my cache is quite small and the machines are relatively slow. The average is around 10 - 16 results only with an occasional machine having about 20 or so. Because I've been collecting data manually, I'm in the process of storing what I already have in "result.nnnnn" files and using the script to update them on a daily basis. For that job, the new stuff I really need will always be on the first results page.
However, you are quite correct that the script shouldn't be so limited. The plan is to grab the host_detail page (which actually records the total number of tasks) and parse it to get that value. Let's say there were 267 tasks. That would mean there would be 14 results pages to get so it should be easy to loop 14 times, incrementing the offset, to grab each page of 20 and append the extracted info to a growing results.tmp file. I've just got to work out how to best extract the number of tasks from the host detail page. Perhaps again awk would be best for the job.
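A rough sketch of that plan in shell (the page names follow the standard BOINC ones, but the task-count pattern is a placeholder and would need checking against the real host detail HTML; extract.awk stands for Bikeman's script posted further down the thread):

#!/bin/sh
HOSTID=946535
BASE="http://einstein.phys.uwm.edu"

# Placeholder pattern: pull the first number off whatever line reports the task count.
NTASKS=`wget -q -O - "$BASE/show_host_detail.php?hostid=$HOSTID" \
    | awk '/tasks/ { if (match($0, "[0-9]+")) { print substr($0, RSTART, RLENGTH); exit } }'`

# 20 results per page: loop over the offsets and append the extracted fields.
OFFSET=0
while [ "$OFFSET" -lt "$NTASKS" ]; do
    wget -q -O - "$BASE/results.php?hostid=$HOSTID&offset=$OFFSET" \
        | awk -f extract.awk >> results.tmp
    OFFSET=`expr $OFFSET + 20`
done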
Bikeman and Gary Roberts, may I port and use your scripts?
:-)
Of course you can. You're quite free to do whatever you like with them.
It indeed would be nice to have a single app that could harvest the data and interface directly with Mike Hewson's RR_V3 to allow seamless analysis of results and the ability to predict future runtimes. Please be aware that Bernd has said a couple of times that they know the causes of the cyclic variation and that they intend (at some stage) to have the WU generator take into account the variation when calculating the appropriate credit for each task. It's also possible that the cyclic variation might be modified or even eliminated at some point. In other words you may need to throw away much of your investment if things change.
Thanks for permission.
And that is why it will be a simple app. If no runtime errors pop up, it could be done in a few days.
If there are no changes to the variation, I can start as early as 12.2. That day I should have my last test. :-)
(Mathematical analysis and linear algebra await me for the next two weeks...)
However, you are quite correct that the script shouldn't be so limited. The plan is to grab the host_detail page (which actually records the total number of tasks) and parse it to get that value.
Or just keep fetching the next page until a grep fails to find the "Next" hyperlink in the fetched result page, so you don't have to care about two different HTML pages.
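That could look something like this (reusing the BASE and HOSTID variables from the sketch above; the assumption here is that the last page simply carries no "Next" link text):

OFFSET=0
while :; do
    wget -q -O - "$BASE/results.php?hostid=$HOSTID&offset=$OFFSET" > page.html
    awk -f extract.awk page.html >> results.tmp
    grep -q "Next" page.html || break   # no "Next" link means this was the last page
    OFFSET=`expr $OFFSET + 20`
done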
CU
H-B
It would be more elegant to be able to fetch results based on the userID and parse out the hostIDs on the fly. To do that though, I believe you'd need to read in the data from the cookie. We will be doing something similar soon in our Java Servlet / JSP class... I still haven't been motivated enough to worry about writing something in Java for doing something like this. I probably won't until sometime in May/June...
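For the unhidden-hosts case, though, no cookie should be needed: the standard BOINC hosts page can be scraped for host ids with something like this (the URL and the link pattern are assumptions, not tested against the live site):

wget -q -O - "$BASE/hosts_user.php?userid=YOUR_USERID" \
    | grep -o "hostid=[0-9]*" | sort -u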
Well, I want to do this mainly for my own machines for which I know the ids, so I guess I won't be doing anything fancier in the near future either.
Here's an update of the awk script, with missing/wrong escapes fixed and a filter that will only extract the runtime of finished AND successful results.
BEGIN { start = 0; delim = " " }

# A result link starts a new record: pull out the result id.
/result\.php\?resultid/ {
    ind = 0; start = 1;
    match($0, ">[0-9]+");
    if (RLENGTH > 1) {
        wuid = substr($0, RSTART + 1, RLENGTH - 5);
    } else {
        wuid = "??";
    }
    line = wuid delim;
}

# Count the lines of the current record with ind.
// {
    if (start != 0) {
        ind = ind + 1;
        # First counted line carries the task name; after splitting on "_",
        # fields 2 and 5 are the frequency and the sequence number.
        if (ind == 1) {
            match($0, "h1_[0-9]+\\.[0-9]+_S5R2__[0-9]+_S5R3[a-z]_[0-9]+");
            wuname = substr($0, RSTART, RLENGTH);
            split(wuname, ar, "_");
            line = line ar[2] delim ar[5] delim;
        }
        # Line 6 carries the outcome; skip anything that isn't a success.
        if (ind == 6) {
            if (!match($0, "Success")) {
                start = 0;
            }
        }
        # Line 8 carries the CPU time; strip the thousands separator.
        if (ind == 8) {
            if (match($0, "[0-9,.]+")) {
                tm = substr($0, RSTART, RLENGTH);
                gsub(",", "", tm);
                line = line tm;
                printf("%s\n", line);
            }
        }
    }
}

END {}
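For reference, a typical invocation might look like this, assuming the script is saved as extract.awk and the standard BOINC results URL applies (hostid 946535 is just the example from elsewhere in the thread):

wget -q -O - "http://einstein.phys.uwm.edu/results.php?hostid=946535" \
    | awk -f extract.awk >> result.946535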
However, you are quite correct that the script shouldn't be so limited. The plan is to grab the host_detail page (which actually records the total number of tasks) and parse it to get that value.
Another strategy would be to look for the next-20 link on the results page and if present, follow it.
I just started learning Python a couple of weeks ago, and for my first "real" project I'm writing a script for the same purpose. So if anyone's interested I'll post it when it reaches a functional state. Assuming I don't get stuck somewhere, that is. I seem to have the page-parsing/data-collection part of it working already, but I have yet to start on the storage & analysis parts.
Well folks, here's RR_V4!
- rounding of time estimates to the nearest second (prior significance was excessive).
- summary button in Step 5 will create a new browser window and populate with data as given, plus estimates generated from them. You can print or whatever from there.
- layout & prissy polishing stuff.
Possibilities for V5:
- addition of a step 5 text area type input box for dumping pasted data - say in CSV format - which can then be sucked into the analysis pipeline to emit the same metrics (peak, average, trough .....)
- addition of 'analysis' button to activate the processing of text area data as above.
- CSV format as : FREQ,SEQUENCE#,RUNTIME say. Suggestions anyone? I can do any old way, as long as it's fixed. Although .... a group of radio buttons ( ie. select exactly one alternative only ) could switch between file format 'modes'.
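Just to illustrate the idea, here's a rough awk sketch of reducing CSV in that FREQ,SEQUENCE#,RUNTIME order to the same metrics; RR itself is a web page and would presumably do the equivalent in JavaScript, so this is only a model of the intended processing:

awk -F, '{
    t = $3 + 0
    if (NR == 1 || t > peak)   peak = t
    if (NR == 1 || t < trough) trough = t
    sum += t; n++
} END {
    if (n > 0) {
        printf("Peak   Runtime = %d\n", peak)
        printf("Average Runtime = %d\n", sum / n)
        printf("Trough Runtime = %d\n", trough)
    }
}' data.csv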
As usual, any problems/suggestions let me know .... :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
I've just run RR_V4 with a new set of data and it looks very good. In particular, the summary page of inputs and outputs is just what I need. I can print the new window to a single page and so preserve the calculations for later comparison with runs on other datasets. Since I have many similar boxes to compare, I'd also like to input a few (optional) details about the particular host that could be output as part of the header on the summary page, things like
* hostname / hostid
* hardware string - eg Dual PIII 1400/1024/36.4
* OS details
* BOINC version
* Science App version
I thought that it might be easy for the Inputs and Outputs Summary button to lead to an intermediate data input screen with a "continue" button at the bottom. The input values could be wholly or partially left blank, so users who didn't need the info wouldn't be inconvenienced. Any data that was supplied would simply appear immediately below the header on the summary page. Also, for consistency in terminology with the latest versions of BOINC (ie "results" are now "tasks"), perhaps the header should say "S5R3 TASK CRUNCH TIME ..." but this isn't important and I'm NOT nitpicking :).
Also, I've noticed that I'm mainly interested in step 5, so what I do is enter a 9th data point at step 4 on the way through. This is usually the next sequence number after the 8 I'm planning on using in step 5. That way, once step 5 has been completed and I've transferred the estimates to step 6, the estimated runtime for the next sequence number will be immediately showing as an output in step 6. Quite often (in fact, always, if I gather more new data) I'd like to add a 10th, 11th, ... data point, so rather than scrolling back to step 4 to change the sequence number, would it be possible to put an input box for the sequence number immediately after the A & B inputs in step 6? Thanks.
Quote:
Possibilities for V5:
- addition of step 5 text area type input box for dumping pasted data ....
YES, YES, PLEASE, PLEASE!!
Quote:
CSV format as : FREQ,SEQUENCE#,RUNTIME say ...
This is probably the best order since data grabbers (eg Bikeman's awk script) will give that order naturally when parsing each line of collected data.
Here is another thought for consideration. As mentioned recently by Peter, it looks like (for a given platform anyway) crunch times may be relatively frequency independent. It might be interesting therefore to augment frequency and sequence with a phase value. People doing plots could then simply plot crunch time against phase and use data from the same host even though that host might be (like now) receiving a variety of different frequency work. They could use different coloured points for different frequencies and when all were superimposed, the vertical scatter (or lack of it) would give an indication of the effect of frequency.
So, the thought is really that your step 5 screen could contain three inputs per line, frequency, sequence and runtime and the summary page could show 4 columns of output, frequency, sequence, phase and runtime.
As always, these are thoughts and not commands :).
Now for some more data for you. I tested RR_V4 on a dual PIII 1Gig server running Linux with the 4.27 app. I transitioned this server quite early and so have plenty of 4.27 data at a constant frequency, 764.50. In the list below I've included all the available data (including 4.14 and "mixed" data) and marked the entries used for the 8 data points with a relevant DP# (ie Data Point Number). If a data point wasn't used, I've left this last column blank. I've also included the next tasks in the cache that haven't completed yet. The hostid of this machine is 946535.
0764.50 198
0764.50 186
Peak__ Runtime = 142594
Averge Runtime = 112232
Trough Runtime = 094901
Runtime Varnce = 0.3344
Estimated Error = 3.33%
Hope this is of some use to you. I'll send the last two values for the list above once they have crunched.
Cheers,
Gary.