OK! I've gone rather ballistic on RR_V6A ( 120K ) having scored a freely distributable JavaScript graphics library. Basically the same functionality as V5A but I've re-done/savaged the interface to suit - a picture is indeed worth a thousand words!
Before you ask - I am working on a method to print the plots. :-)
As usual, please tell me about the least little thing ..... :-)
Enjoy!
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Wow!! Nice work!!
This is a current data file that I used to test V6A with. It has come from a dual PIII coppermine 1Gig HP Netserver running Linux. It's the same machine that I gave you data for back on Feb 2 with a promise of a couple more points when they finished. Well, better late than never and at least you get a few more than just two extra points. The HostID is 946535
I pasted exactly this into the LH 'Input' pane and got output in the RH pane and a very nice full width plot below. I particularly like the 'next' and 'prev' ability. The above data has only one workable frequency so all I could cycle through were the sequence numbers. In doing that, points that are perhaps a little suspect seem to show up very clearly. As an example, seq# 170 seems to have taken a bit longer than it should have compared to the immediate neighbours.
On clicking the Inputs and Outputs Summary, the information produced is pretty much as in the previous version. Here is a small snippet
Frequency : 764.5
Period of task cycle = 120.4
Task sequence number = 49
runtime = 96360
phase = 0.407
principal value = 0.958
Task sequence number = 54
runtime = 93559
phase = 0.449
principal value = 0.987
Task sequence number = 109
runtime = 125899
phase = 0.905
principal value = 0.293
.....
.....
Task sequence number = 288
runtime = 97247
phase = 0.392
principal value = 0.943
Number of point pairs used = 199
Minimum runtime in data = 93351
Maximum runtime in data = 140024
Estimated peak runtime = 142087
Estimated average runtime = 111924
Estimated trough runtime = 94708
Estimated runtime variance = 0.333
Estimated error = 3.2 %
I don't really know how to preserve the indentation but I wanted to ask a question or two about the output. First of all, you said previously
Quote:
The algorithm performs determinations of A and B over all possible combinations of pairs of points ( two equations with two unknowns each time ). For instance, 8 points yields 28 pairwise estimates [ generally N*(N-1)/2 ].
and
Quote:
Actually the two-point solution can only be sensibly obtained if the points are within the same sine excursion. That's because of the absolute value in your equation, which then ruins analytic continuity ( ~ differentiability ) across any peak point. So the sequence numbers of all the given points are mapped into the first cycle [ zero to one period ] prior to pairwise analysis. So it's that image of the points ( 'principal value' ) which I'm really discussing. This is legitimate as we expect no difference in execution times between points with sequence numbers exactly one period apart.
I had not remembered these details until I went back searching for information about 'point pairs' and 'principal value'. I was confused to see in the final block of output that 199 point pairs were used when I had entered only 23 data points. I initially thought that a 'point pair' was a 'runtime, seq#' combination and that you were showing how many of these data pairs were left after any 'irregular' ones were discarded. Of course, going back and finding the first quote quickly sorted that out. By that formula, 23 points give 23*22/2 = 253 possible pairs, so presumably the other 54 were among the pairs exempted from analysis.
As for 'principal value', I fully understand the need to map higher seq#s to the first cycle. So, from the comments about the term given in the second quote, I imagined that it meant that 'higher order' seq#s would be mapped to seq#s between 0 - 120.4 for the 764.5 frequency. So the 'principal value' would simply be the equivalent 'base' seq#, i.e. a seq# of 288 would have a principal value of 288 - 2*120.4 = 47.2, i.e. roughly 47. Obviously 'principal value' isn't the equivalent 'base' seq# but rather some function of it. What exactly is a 'principal value'?
I now intend to play with the tool using different and more extensive datasets so hopefully there will be more feedback to come.
If there are any people out there using either Bikeman's awk script or my shell script which relies on the awk script, there is a small change you will need to make to the awk script as a result of the new >800 frequency tasks. This line from the script
match($0,"h1_[0-9]+\\\\.[0-9]+_S5R2__[0-9]+_S5R3[a-z]_[0-9]+");
needs to change to
match($0,"h1_[0-9]+\\\\.[0-9]+_S5R[23]__[0-9]+_S5R3[a-z]_[0-9]+");
This is needed because old data was referred to as S5R2 data and the new >800 data is referred to as S5R3 data.
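If you want to sanity-check the change before trusting it, a quick throwaway test like the one below will do. Note that the two task names are invented just to fit the documented formats - they are not real tasks:
# Test the widened S5R[23] class against one old-style and one new-style name.
# Both names below are made up for illustration only.
for name in "h1_0764.50_S5R2__123_S5R3a_0" "h1_0805.15_S5R3__456_S5R3b_1"
do
    echo "$name" | awk '{
        if (match($0, "h1_[0-9]+\\.[0-9]+_S5R[23]__[0-9]+_S5R3[a-z]_[0-9]+"))
            print "matched :", substr($0, RSTART, RLENGTH)
        else
            print "no match:", $0
    }'
done
Both names should print as matched; with the old S5R2-only pattern the second one would not.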
Having got that out of the way, I'd like to report on some enhancements that I've made to my data gathering shell script. I wrote it this way because I don't really have any experience with other scripting languages and the unix shell seems to be able to handle what I wanted to do without too much of a re-learning curve. Obviously you need a unix (linux) machine to run this script.
My goals were to automate the process of data gathering for an unlimited number of machines and to create result files that could be dropped straight into Mike Hewson's RR. The data gathering process should be smart enough to handle things like frequency changes, app version changes, recognition of crunching of a single task by multiple app versions, etc, without requiring user intervention.
I've been testing my latest version of the script for a few days now and it seems to be working OK. It's gathering data for around 100 boxes and a random check on a few data files seems to show that everything is in order. For each successful task that a host completes, four CSV data values are recorded per line in the results file - Frequency, Sequence#, Runtime, App_Version.
Here are some of the features of the script
* Separate results files for each host, which are appended to at a user-settable interval, with no limit on the number of hosts.
* Automatic recording of both HostID and Hostname in a file called hostids every time you give a new host to the script. Future runs can simply reuse some or all stored hosts, or can add new ones.
* Manual operation on just a couple of selected hosts if needed.
* Each time the script runs, any data that has previously been gathered will not be duplicated in the ongoing results file.
* The correct app version used for crunching will always be recorded even if the task is "branded" differently in your Boinc Manager list of tasks. If more than one app version was used for a single task the output will show 'mixed'.
* All website task data for each host will be examined irrespective of how many pages this might involve (ie one or many).
* The script will attempt to minimise the number of website pages consulted in updating the results file for each host.
* A rudimentary progress indicator - very useful when auto collecting for 100 hosts :).
Here is the current version (unfortunately indenting is lost) of the script. It is extensively commented if anyone is actually trying to make sense of it, so that partly accounts for the size. Undoubtedly, better ways to do things may emerge so there's a good chance it might shrink in future. I'll find a place to host this so you will get the full indentation if you wait a bit. I'll also post a separate message with proper instructions for use. The only thing you will need is access to a unix/linux machine that understands bash, awk, sed, grep, find, sort, paste, and a few other standard unix utilities.
#!/bin/sh
#
# grabdata.sh - Version 3.0
#
# Script to retrieve successful task data from the Einstein@Home online database for
# one or many hostIDs (either entered singly or read from a file) and to parse it to
# produce a tabulated set of information (frequency, seq#, cpu_time, app_version) as
# a CSV list for each host of interest.
#
# Hosts entered manually may be added automatically to the HostIDs file for future
# reuse. Extracted results data are added to individual host results files to provide
# an ongoing, updating log of host results statistics.
#
# This script relies on an awk script written by Bikeman.
#
echo
echo "Program to grab stats for successful EAH tasks and write them to a results file."
echo "Data is collected from the EAH website for hosts of interest, using the HostID."
echo "The IDs can be entered manually or read from an existing file named hostids."
echo "HostIDs manually entered will be added to the hostids file if not already there."
echo "The lines in this file contain the HostID and a HostName separated by a space."
hidfile=hostids
mindays=+2
echo
echo "Integer days of enforced wait from previous update before a new one is allowed."
echo -n "(Signed ints like +0 +1 +2 - see -mtime flag on 'find' manpage) (default=+2) : "
read ans
echo
if [ "X$ans" != "X" ]
then
mindays=$ans
fi
#
# Find all results files older than $mindays & store in $hids - ignore newer files
#
hids=`find results.* -mtime $mindays 2>/dev/null | sed 's/^results\.//'`
#
# If there is no file of HostIDs (ie 1st time run) create a new one
#
if [ ! -f $hidfile ]
then
touch $hidfile
fi
#
# See if we are doing an auto update of all out-of-date hosts or perhaps a manual run
#
echo
echo -n "Using hosts from hostids file? (y or n - use if entering manually) : "
read ans
echo
if [ "$ans" == "y" ]
then
if [ "`wc -l $hidfile`" == "0" ]
#
# Can't autorun if there are no hosts filed. Suggest manual host entry
#
then
echo "HostIDs file $hidfile does not contain hosts - use manual entry ..."
echo
exit
else
echo "Any outdated results for hosts in $hidfile file will be retrieved."
echo
fi
else
#
# Manual run. Collect HostIDs. Check if HostID is on the list of out-of-date results
# Only accept it if it is. If it's not, check if we have a results file for it.
# Offer to update HostIDs file if we don't already have this host filed.
#
mhids=""
allhids=`cat $hidfile | sed 's/ .*//'`
while true
do
echo -n "Next hostID to use for results ( only when finished ) : "
read hid
if [ "X$hid" != "X" ]
then
inhids=n
for i in $hids
do
if [ "$hid" == "$i" ]
then
inhids=y
mhids="$mhids$hid "
break
fi
done
if [ "$inhids" == "n" ]
then
if [ -f results.$hid ]
then
inhids=r
echo "HostID $hid was recently updated -- ignoring this HostID ..."
else
mhids="$mhids$hid "
fi
inallhids=n
for j in $allhids
do
if [ "$j" == "$hid" ]
then
inallhids=y
break
fi
done
if [ "$inallhids" == "n" ]
then
echo -n "Hostname for updating HostIDs file $hidfile ( = Dont update) : "
read host
if [ "X$host" != "X" ]
then
echo "$hid $host" >> $hidfile
fi
else
echo "HostID $hid already in $hidfile - no need to update that file ..."
fi
else
echo "HostID $hid already in $hidfile - no need to update that file ..."
fi
else
break
fi
done
echo "Manually entered HostIDs requiring an update are :-"
echo "$mhids"
echo
echo -n "If these are OK hit to continue or q to quit : "
read tmp
if [ "X$tmp" != "X" ]
then
exit
else
hids=$mhids
fi
fi
#
# Variable $hids now contains just those hosts (either auto-determined or manually
# entered) that require updating - if there are any.
#
if [ "X$hids" == "X" ]
then
echo "There are no out-of-date results files that need updating -- exiting ... "
echo
exit
fi
#
# Grab the data from the website for each valid host
#
for hid in $hids
do
#
# Get the first page of (up to 20) results. Test if there is a "Next" page
#
curl -s "http://einstein.phys.uwm.edu/results.php?hostid=$hid" > results.curl
while true
do
tmp=`grep Next results.curl | tail -1 | sed -e 's/^.*offset.//' -e 's/.Next.*//'`
if [ "X$tmp" != "X" ]
then
#
# There is a next page - keep grabbing until no more
#
while true
do
curl -s "http://einstein.phys.uwm.edu/results.php?hostid=$hid&offset=$tmp" > results.ext
tmp1=`grep Next results.ext | tail -1 | sed -e 's/^.*offset.//' -e 's/.Next.*//'`
cat results.ext >> results.curl
tmp=$tmp1
if [ "X$tmp1" == "X" ]
then
break 2
fi
done
else
break
fi
done
#
# All grabbed pages for a host have been concatenated. Pass through Bikeman's awk script
# Massage to create a CSV list of Freq,Seq#,Runtime and a list of task IDs. Use the TIDs
# to grab the page with stderr.out so that the app version can be obtained.
#
awk -f parser.awk results.curl > results.raw
cut -c-9 results.raw > results.tid
tids=`cat results.tid`
cut -c10- results.raw > results.cut
sed -e s/\ /,/g -e /,[0-9][0-9],/s/,/,0/ -e /,[0-9],/s/,/,00/ -e s/...$// results.cut > results.tmp
paste -d, results.tid results.tmp | sed s/\ // > results.csv
#
# Progress indicator. Each dot means another set of raw host results has been parsed
#
echo -n .
for tid in $tids
do
#
# For each TID obtained from the website, check to see if those details are already recorded
# in the results file. Regard a match of seq# and runtime as sufficient to prove a match.
# Do not grab the page with stderr.out if we already have the data recorded
#
runtime=`grep $tid results.csv | sed 's/^.*,//'`
seqno=`grep $tid results.csv | cut -c18- | sed s/,.*//`
tmp=""
if [ -f results.$hid ]
then
tmp=`grep $runtime results.$hid | grep $seqno`
fi
if [ "X$tmp" == "X" ]
then
#
# No match for seq# and runtime for this TID so we need to grab the page.
#
freq=`grep $tid results.csv | sed -e 's/,[0-9][0-9][0-9],.*//' -e 's/^.*,//'`
ver=`curl -s "http://einstein.phys.uwm.edu/result.php?resultid=$tid" | grep einstein_S5R3 | sed -e 's/^.*S5R3_//' -e 's/_[iwp][6io].*//'`
numv=`echo $ver | wc -w`
#
# Check if more than one app version was used to crunch the data - if so record as 'mixed'
#
if [ $numv != 1 ]
then
flag=1
ver1=`echo $ver | sed 's/\ .*//'`
for i in $ver
do
if [ $i != $ver1 ]
then
flag=2
break
fi
done
if [ $flag == 1 ]
then
ver=$ver1
else
ver=mixed
fi
fi
#
# Assemble a line of new data and store it in a temporary results file
#
echo $freq,$seqno,$runtime,$ver >> results.new
fi
done
#
# Progress indicator. Each plus means another host's new results have been assembled.
#
echo -n +
if [ ! -f results.new ]
then
#
# There are actually no new results at this time so create an empty file as a placeholder
#
touch results.new
fi
if [ -f results.$hid ]
then
#
# If there are existing results for this host add in the new ones. For no existing results
# any new ones will become the future existing ones. Clean up any temp files
#
mv results.$hid results.sav
cat results.sav results.new | sort > results.$hid
rm -f results.sav results.new
else
mv results.new results.$hid
fi
rm -f results.raw results.c* results.t* results.ext
done
echo
This script has been tested a few times but I wouldn't call the testing "extensive" by any means. During testing, I would often think of new features that would be nice so some of the more recent additions haven't been tested much at all. I'll be very surprised if there aren't at least a few logic bugs and probably more logic deficiencies. The most recent test involving about 100 hosts took less than 10 minutes to update the results files of all hosts. Most hosts had an average of about 10-15 results listed on the website, with just a few going past 20.
I'll be interested to see if anyone is kind (?foolish?) enough to give it a trial :). There is one known deficiency which I've just remembered and will fix before I host it somewhere. The script only understands results crunched on Windows or Linux. I've got to look at what other platforms say about the app version and then make a few minor adjustments to suit.
EDIT:
A very small change has been made to the above script to allow it to handle results for MacOS PPC. Windows, Linux and MacOS X Intel were already correctly handled.
To actually trial the script, all you need to do is create a directory and in it place this script and Bikeman's awk script with the latest mod mentioned at the start of this post. Make the script executable and from a console window (I use an xterm) 'cd' to the directory and just run the script. It will ask you for whatever it needs.
I don't anticipate having much time to add new features but I'm certainly interested in bug reports and particularly in suggestions of smarter ways to do things. I'm quite limited in my knowledge of tools outside of what was available in unix 15-25 years ago :).
Thanks! Actually I meant to openly acknowledge this guy who wrote the library! :-)
( although it's in the page source code too, as per GNU Lesser GPL )
Quote:
In doing that, points that are perhaps a little suspect seem to show up very clearly.
Indeed they do stand out nicely, as the algorithm is 'half-dumb' with respect to such outliers. :-)
Quote:
As for 'principal value' .... What exactly is a 'principal value'?
It's the value that a 'standard' sine function takes on - height along the vertical axis - for a given phase, the argument along the horizontal axis. I normalise it to the range [0,1].
[ this is math terminology : while each angle has a given sine(angle), a given sine corresponds to infinitely many angles. 'Principal value' implies that you have chosen a limited domain ( x-axis interval ). Various purposes ..... ]
A single point is defined by sequence number & runtime - a pair of numbers - in the usual x & y co-ordinate sense. Point pairs are two points grouped together ( 2 x 2 = 4 numbers now ) for examination in order to attempt deduction of what sine curve they may belong to.
The algorithm extracts all possible pairs of points for a given frequency, works out the parameters of the particular sine curve for each ( with some pairs exempted from later analysis - complex ). Those curve parameters - pairs of specific peak and variance values each representing a candidate sinusoid for a given point pair - are a collection which is then subject to measurements of center ( average ) and spread ( standard deviation ). I form a 'Mr Average Sine' from that and the rest follows. I've been kludging/guessing a bit with the choice of some of the algorithm's parameters - but hey, this is applied maths! :-)
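To give the flavour of that pairwise step, here's a rough sketch - NOT the RR code itself. For illustration only I'm assuming a model linear in the principal value ( runtime = A + B * pv ), which fits the 'two equations with two unknowns' description, plus an arbitrary 0.05 closeness cutoff and a made-up input file pv_runtime.txt of 'pv runtime' lines :
# Sketch only : pairwise 2x2 solves for an assumed model runtime = A + B * pv,
# where pv is the principal value. pv_runtime.txt is a hypothetical input file.
# The 0.05 cutoff culls pairs too close in pv ( the denominator problem below ).
awk '
{ pv[NR] = $1; rt[NR] = $2 }
END {
    n = 0
    for (i = 1; i < NR; i++)
        for (j = i + 1; j <= NR; j++) {
            d = pv[i] - pv[j]
            if (d < 0.05 && d > -0.05) continue
            B = (rt[i] - rt[j]) / d        # two equations, two unknowns ...
            A = rt[i] - B * pv[i]          # ... solved for each pair
            sumA += A; sumB += B; n++
        }
    if (n > 0)
        printf "pairs used = %d, average A = %.0f, average B = %.0f\n", n, sumA/n, sumB/n
}' pv_runtime.txt
The averages over all surviving pairs are the 'Mr Average Sine' idea in miniature.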
The algorithm's weak aspects are evident with :
(1) too few points
(2) points too close in principal value ( denominator stuff )
(3) the given point set is whacko and doesn't actually reflect sinusoidal behaviour
(4) outliers
I cull and/or refuse to estimate those scenarios ....
So for a given particular sequence value you map it back to the equivalent sequence number for the first cycle ( if it's not already there ), work out how far along the cycle it is as a fraction - phase equals sequence divided by period ( PI multiplies in as well - 'because' ). Finally take the sine of that value which is labelled as the principal value. So the peaks will have principal value of zero, the troughs have principal value of one, and the average has principal value ~ 0.63. The problem lends itself to this approach, and your earlier comment about phase gave me the epiphany of normalising! :-)
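In throwaway script form the recipe is ( not the RR source, but it reproduces the seq# 49 entry quoted earlier - phase 0.407, principal value 0.958 ) :
# Sketch : phase and principal value for one sequence number.
seq=49
period=120.4
awk -v s="$seq" -v p="$period" 'BEGIN {
    s = s % p                             # map into the first cycle
    phase = s / p                         # fraction of a cycle, in [0,1)
    pv = sin(3.14159265358979 * phase)    # sine of ( PI * phase ), already in [0,1]
    printf "phase = %.3f\nprincipal value = %.3f\n", phase, pv
}'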
Quote:
I now intend to play with the tool using different and more extensive datasets so hopefully there will be more feedback to come.
Looking forward to it! We can visualise/compare the detail now just by posting/passing CSV blocks and plugging them into RR ..... :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Quote:
Here is the current version (unfortunately indenting is lost) of the script.
For those who want to see your indenting as you wrote it, this dodge still works:
Click the "reply" button for the message containing the code of interest.
Scroll through the quoted material in the message formatting box to find the part you want, then just select and copy that.
Then don't click the "post reply" button.
I stumbled on this many months ago when some of us were trying to help each other with ap_info file issues. Those are really harder to read with the indenting gone.
Quote:
Quote:
Here is the current version (unfortunately indenting is lost) of the script.
For those who want to see your indenting as you wrote it, this dodge still works:
Thanks very much for the tip. I've just tested it out and it works fine. The tabs are all back where they should be. I had the tab interval set to 4 instead of the default 8 since in some places there were about 6 or 7 levels of nesting :).
Now that you mention the trick again, I think I did see you post it previously somewhere a while ago ...
Quote:
Quote:
As for 'principal value' .... What exactly is a 'principal value'?
It's the value that a 'standard' sine function takes on ...
Thanks for all the details as I'm sure others will now appreciate more fully, just how your tool works. For my purposes "Principal Value = |sin(phase)|" would have done nicely :). It's obvious now but I couldn't see it at the time I asked.
Also, whilst it's nice to see all the cycles in the data (which gives a wide graph), it might also be useful to see just one period. Would it be possible to have a tick box or similar that would "compress" the multiple periods of a normal plot into just one period? Using the "Next"/"Prev" buttons for seq# would then move to the appropriate data point and show the true seq# in the numeric display but the plotted point would be at its converted position in the single period. This would effectively allow the relationship of the actual data points to the model line to be seen more accurately, I think.
Quote:
Quote:
I now intend to play with the tool using different and more extensive datasets so hopefully there will be more feedback to come.
Looking forward to it! We can visualise/compare the detail now just by posting/passing CSV blocks and plugging them into RR ..... :-)
Yes, indeed. I have around 100 data files accumulating so as I start looking at them and find anything interesting, I'll just pass you the CSV data. Too easy!!
Quote:
Also, whilst it's nice to see all the cycles in the data (which gives a wide graph), it might also be useful to see just one period. Would it be possible to have a tick box or similar that would "compress" the multiple periods of a normal plot into just one period? Using the "Next"/"Prev" buttons for seq# would then move to the appropriate data point and show the true seq# in the numeric display but the plotted point would be at its converted position in the single period. This would effectively allow the relationship of the actual data points to the model line to be seen more accurately, I think.
Quite right! :-)
I'll bung in a button ( now talking RR_V7A already! ) to swap between the two view types. It's a quick-ish redraw using the first-cycle mapping ....
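The mapping itself is just a modulus. As an illustrative sketch only ( not the RR source ), with data.csv standing in for a pasted 'freq,seq#,runtime' block and the period hard-coded :
# Sketch : fold each point into the first cycle by rewriting its seq# as seq# mod period.
period=120.4
awk -F, -v p="$period" '{ printf "%s,%.1f,%s\n", $1, $2 % p, $3 }' data.csv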
Other possibles:
- I'm also pondering/reviewing the estimates aspect - I'd mentioned earlier somewhere of going to a median measure to reduce outlier sensitivity. Needs thought ...
- prissy changes to colors etc, more 'web safe' for want of a better phrase.
- put in the brief/verbose reporting selection that I promised.
- 'simple' editing of headings for the plots.
- print the plots by themselves. Alas this is actually quite non-trivial if you want to remain cross-platform!
- a look at inter-sequence variability to examine that cliff/ledges/step/0.45 business noted earlier elsewhere.
Cheers, Mike.
( edit ) Here's an example ( thanks to archae86 ) of where the analysis wobbles off, by the points straying from ~ sinusoidal pattern :
368.35,75,31944
368.35,74,31804
368.35,73,31783
368.35,72,31816
368.35,71,31658
368.35,70,31532
368.35,69,31304
368.35,68,32480
368.35,67,31637
368.35,66,32473
368.35,65,33310
368.35,64,34886
368.35,63,33528
368.35,62,34145
368.35,61,35229
368.35,60,34687
368.35,59,34699
368.35,58,35342
368.35,57,35567
368.35,56,37068
368.35,55,36548
368.35,54,36520
368.35,53,35622
368.35,52,34807
368.35,51,34080
368.35,50,33324
368.35,49,33094
368.35,48,32907
368.35,47,32655
368.35,46,32210
368.35,45,31440
368.35,44,31344
368.35,43,31493
368.35,42,31302
368.35,41,31090
368.35,40,30518
368.35,39,29759
368.35,38,29805
368.35,37,29231
What happened there? You may well ask .... :-)
Quote:
I'll bung in a button ( now talking RR_V7A already! ) to swap between the two view types. It's a quick-ish redraw using the first-cycle mapping ....
Thanks very much!
Quote:
( edit ) Here's an example ( thanks to archae86 ) of where the analysis wobbles off, by the points straying from ~ sinusoidal pattern :
....
What happened there? You may well ask .... :-)
I haven't plotted the points but I can see exactly what you are referring to :).
Something happened, probably during or around seq# 40, which gave a step improvement of close to 10% in crunching performance. Could it have been a change to a faster app, or perhaps a hardware improvement like upping the overclock a bit? :) Maybe Peter might have some thoughts on this.
As promised, here are the two files required to get automatic data collection happening on Linux for any hosts you may have an interest in.
Firstly the shell script and secondly the Bikeman awk script that the shell script uses.
INSTRUCTIONS
============
Create a new working directory of your choice (mine is /home/gary/EAH_Results) and place the two files there. They should be owned by you. I gave mine 755 permissions, but 644 would be OK if you feed the shell script to sh. Change to the working directory and execute the shell script. Here are the questions it will ask:-
* How many days of enforced wait to set? The idea here is to be kind to the servers and only allow results files older than a certain number of days to be updated. The default is +2, which means that an existing results file needs to be more than 2 days old before it will be updated ( see the one-liner after this list ). This question allows you to override the default. On your first run there are no existing results files so just accept the default.
* Do you want to use stored HostIDs? On your first run you will not have any stored HostIDs so selecting 'y' will draw a complaint. The question tells you to just hit <ENTER> for manual hosts entry. This is what you need for the first run.
* You will then be prompted for a HostID (write them down from the website before you start) and then you will be asked for a HostName. Both will be stored in a file called 'hostids' and the idea of the HostName is to make the file human readable if you are monitoring many hosts. This is fine if they are your own hosts but you will need to 'invent' a suitable name if you are monitoring an otherwise unidentified host. If you don't give a HostName the script will assume that this is a 'throw away' or 'once-off' data collection and it won't pollute your hostids file, which is really intended for your own hosts. As you enter HostIDs, the script will check to see if any already have recently collected data ( more recent than your enforced wait ) and will drop any that it finds. On a first run none will be dropped. If any do get dropped, you will notice this at the next stage.
* After you finish HostID entry with a null entry, the script will present you with a list of those for which data will be collected and will ask for permission to start that process. You can bail out at this point if you change your mind.
* Data collection can take a little while and a rudimentary progress meter will be constructed using decimal points and plus signs. A pair of these represents a completely finished collection for a single host. When all host data has been collected the program will simply exit.
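Regarding the enforced-wait question above, under the hood it is just the script's 'find' age test, so the default of +2 amounts to:
# list any results files last modified more than 2 days ago
find results.* -mtime +2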
After a first run you will find a complete hostids file ready for subsequent use and a series of results files - results.nnnnnn - where nnnnnn is the HostID you entered. The results files are in CSV (comma separated values) format and there should be four items per line - Frequency, Sequence No, Runtime, App_version. If more than one app version was used to crunch a task the value will be recorded as 'mixed'. If you rerun the program every so often, new tasks that your host has crunched in the meantime will be added to your saved results. The program should be smart enough to avoid any duplication.
The results files are designed so that they can be simply pasted into the data entry area of Mike Hewson's RR_V6A which was announced in this thread quite recently.
At this point there are few (if any) sanity checks in the program. Please be careful when you enter a HostID and please think about the servers when deciding which hosts you need to monitor. Select your hosts carefully and don't force unnecessary repetitive updates just to see if any new task may have completed.
If you have any questions or comments, please fire away.