# [S5R3/R4] How to check Performance when Testing new Apps

peanut
Joined: 4 May 07
Posts: 162
Credit: 9,644,812
RAC: 0

### The RR is a neat little page.

The RR is a neat little page. Good work Mike.

I do like the big fonts on the RR. I'm old enough to need big fonts.

Klimax
Joined: 27 Apr 07
Posts: 87
Credit: 1,370,205
RAC: 0

### OK,app will be ready soon.It

OK, the app will be ready soon. For now it will take one URL and write a single file named WUURL.txt (for now) listing every single WU found at that URL. So far only Windows is supported (but a port should not be too difficult, and will be done once I get Linux). I suppose I will release it tomorrow with source code.

For now it only gathers WUs, with no analysis, because that is a bit more difficult than anticipated. :-(

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,578
Credit: 304,193,462
RAC: 223,251

### RE: Here's how I came up with

Message 77816 in response to message 77813

Quote:

Here's how I came up with this:

The major working hypothesis behind all this is that the declination ("sky-coordinate latitude") of a skypoint is the major factor for the runtime needed to process it. Not quite true but close enough, hopefully.

So I went along and did some standalone, offline experiments, running the E@H client with single-point skygrids of my own, plotted runtime against declination and the result looked very close to

runtime(delta) = T0+ c*(sin(delta))^2

When you then go further and calculate the declination of a skypoint for a workunit at phase ph, then some equations later the final formula for runtime is

runtime(ph) = T0 + B*(2*ph-1)^2

So here the quadratic function is not meant as an approximation to the sine function, it's supposed to be the "real thing", suggested by the experimental evidence from single-skypoint timing tests.

Now, a few simplifications and assumptions were made here, so the sine-based runtime model might still be a superior fit, even if theory suggests a quadratic function.
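The "some equations later" step can be filled in as follows. This is only a sketch under assumptions the post does not spell out: sky points spread uniformly in area over the sphere, points processed in order from one pole to the other, and the phase ph read as the fraction of the sky area already covered.

```latex
% Fraction of the sphere between the pole (\delta = \pi/2) and declination \delta:
\mathrm{ph}(\delta) = \frac{1}{4\pi}\int_{\delta}^{\pi/2} 2\pi\cos\delta'\,d\delta'
                    = \frac{1 - \sin\delta}{2}

% Invert, then substitute into runtime(\delta) = T_0 + c\,\sin^2\delta:
\sin\delta = 1 - 2\,\mathrm{ph}
\quad\Longrightarrow\quad
\mathrm{runtime}(\mathrm{ph}) = T_0 + c\,(2\,\mathrm{ph} - 1)^2
```

Under these assumptions B would simply equal c; the original derivation may differ in detail.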

OK ... that's very interesting. My apologies, I have misunderstood your approach there ....

Actually about two months ago I was playing with a cycloid model for this but discarded it because I couldn't establish any reasoning to support it. But, as always, the lamp of measurement guides the way ..... :-)

Least squares ( & spinoffs ) doesn't find the best function to fit any data, it gives the best fit parameters of some function type which has already been assumed to apply approximately to the data. For instance, it is entirely possible to have distinct functional forms, measured by some 'match quality' metric, equally satisfy the same data set. [ See Frank Anscombe's analogous rebuttal of bland correlation ]

Sooo .... what about I mark that down for RR8? Bung in another button ( or make it a drop-down list, as this logic may expand! ) this time to swap between different assumed models ( QUADRATIC, SINUSOID, CYCLOID .... ).

Quote:
ASIDE : The RR algorithm/script logic is factorised to cope with that - most of the computational load in least squares goes into producing various sums ( over all points ) of several different term types .... x, y, sin(x), y*sin(x), x^2 or whatever depending on the chosen function ... so one can accumulate these while the points are being taken in. The final step is an easy matrix inversion. The good news for our/we/my current understanding of the problem is that the assumed errors/variations/residuals are in the runtime, and not also in the sequence number. If you have data which is assumed to vary in both axes, horizontal & vertical, then that generally is rather more work.
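The accumulate-then-solve idea in this aside can be sketched in a few lines. This is not the RR script itself: the class name `RunningFit`, the helper `quadratic_in_phase`, and the restriction to models of the form y = T0 + B*g(x) (linear in T0 and B, errors in y only) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


def quadratic_in_phase(ph: float) -> float:
    """Basis term g(ph) for the quadratic model runtime = T0 + B*(2*ph - 1)^2."""
    return (2.0 * ph - 1.0) ** 2


@dataclass
class RunningFit:
    """Least-squares fit of y = T0 + B*g(x), built up one point at a time."""
    basis: Callable[[float], float]  # the assumed model shape g(x)
    n: int = 0
    sum_g: float = 0.0
    sum_gg: float = 0.0
    sum_y: float = 0.0
    sum_gy: float = 0.0

    def add(self, x: float, y: float) -> None:
        """Fold one observation into the running sums; no point list is kept."""
        g = self.basis(x)
        self.n += 1
        self.sum_g += g
        self.sum_gg += g * g
        self.sum_y += y
        self.sum_gy += g * y

    def solve(self) -> Tuple[float, float]:
        """Solve the 2x2 normal equations for (T0, B)."""
        det = self.n * self.sum_gg - self.sum_g ** 2
        if self.n < 2 or det == 0.0:
            raise ValueError("need at least two points with distinct basis values")
        t0 = (self.sum_y * self.sum_gg - self.sum_g * self.sum_gy) / det
        b = (self.n * self.sum_gy - self.sum_g * self.sum_y) / det
        return t0, b


# Made-up (phase, seconds) pairs generated from T0 = 1000, B = 400.
fit = RunningFit(basis=quadratic_in_phase)
for ph, secs in [(0.0, 1400.0), (0.25, 1100.0), (0.5, 1000.0), (1.0, 1400.0)]:
    fit.add(ph, secs)
t0, b = fit.solve()
```

Swapping in a different assumed model only means passing a different `basis` function, provided the model stays linear in its two parameters; a model with more parameters would need more sums and a larger matrix.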

I now see that is what you were getting at some time earlier, sorry 'bout that .....

This is like a group crossword solving session, with much the same dynamics too ... :-)

Peanut: same here! The big font idea came from those cheap calculators you can buy at the supermarket - you could land a helicopter on the 'equals' button. :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

archae86
Joined: 6 Dec 05
Posts: 3,153
Credit: 7,163,794,931
RAC: 620,766

### RE: Slight change with

Message 77817 in response to message 77812

Quote:

Slight change with RR_V7C one can enter/vary a sky grid density value.

Cheers, Mike.

I've got lots of good experience with this version (I had not kept up since version 3).

Possibly because I've fallen behind, I don't see how one would use this version for prediction (i.e., given parameters calculated for a set of observations, what is the model-predicted CPU time for a new frequency and sequence).

This type of prediction was a major use for the early versions, as it let folks checking up on a new application or host quickly see if even a single point showed evidence of improvement. Of course one can use V7c to get parameter estimates for a set of observations, then v3 to make a prediction or two.
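As a rough illustration of that prediction step, here is a minimal sketch using the quadratic model quoted earlier in the thread, runtime(ph) = T0 + B*(2*ph-1)^2. The function names are hypothetical, and the phase has to be supplied by the caller, since the mapping from frequency and sequence number to phase is not covered here.

```python
def predicted_runtime(t0: float, b: float, phase: float) -> float:
    """Model-predicted CPU time at a given phase (0 <= phase <= 1)."""
    if not 0.0 <= phase <= 1.0:
        raise ValueError("phase must lie between 0 and 1")
    return t0 + b * (2.0 * phase - 1.0) ** 2


def relative_speed(observed: float, t0: float, b: float, phase: float) -> float:
    """Ratio of predicted to observed time; above 1 means faster than the model."""
    if observed <= 0.0:
        raise ValueError("observed runtime must be positive")
    return predicted_runtime(t0, b, phase) / observed
```

With parameters fitted on the old application, a single result from a new application can then be compared against `predicted_runtime` at the same phase, which is the kind of one-point check described above.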

When I used this version to process dozens of sets of observations, a few of which reached over 200 entries, I stumbled on an aspect of its behavior I don't recall seeing mentioned.

As I said in the Haselgrove hostid 1001562 thread:

one comment--RR v7c appears not to have some form of garbage collection for internal state accumulated when one handles successive batches of inputs. It can process a first set of many dozens, even a couple of hundred, inputs in reasonable time, but slows way down if one continues by using the clear input button and just pasting in new sets. I found it faster to kill the tab it was running in and reload the html file once every few sets.

To amplify the comment--it is not just large new sets which take a long time once internal state has piled up, even six-observation sets can take seconds to compute on my Q6600, and just exiting the tab takes multiple seconds as well.

So, to users, consider quitting and starting again if things start to slow. To Mike, maybe there is an improvement opportunity here--once you are past your exam.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,578
Credit: 304,193,462
RAC: 223,251

### RE: Possibly because I've

Message 77818 in response to message 77817

Quote:

Possibly because I've fallen behind, I don't see how one would use this version for prediction (i.e., given parameters calculated for a set of observations, what is the model-predicted CPU time for a new frequency and sequence).

This type of prediction was a major use for the early versions, as it let folks checking up on a new application or host quickly see if even a single point showed evidence of improvement. Of course one can use V7c to get parameter estimates for a set of observations, then v3 to make a prediction or two.

Ok ... I'll re-include that for RR8! :-)

Cheers, Mike.


Klimax
Joined: 27 Apr 07
Posts: 87
Credit: 1,370,205
RAC: 0

### While Javascript page is

While the JavaScript page is running smoothly, I ran into a terminal problem. At least two of the STL string functions are broken in MinGW GCC 4.1.2 and crash on almost anything.

The task page downloads correctly, but substr fails with any input. And most probably find is failing as well.

So much for STL strings...

If anyone wants to see the source code, tell me and I will post a link to my webpage.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 115,156,248,726
RAC: 32,336,359

### Anyone using the shell script

Anyone using the shell script that I published earlier in this thread might like to know that Bikeman has identified a small problem when he used it for a host that was putting an extra debug line in the normal stderr.out output that the script looks at on the website. This was confusing the bit of code that was working out which app versions had been used for crunching.

I've made a small fix but I believe that most hosts wouldn't have been affected by the problem Bikeman found. The symptoms were that results that were crunched by a single app only, were erroneously being reported as "mixed". If you wish to fix the problem without downloading again, you can make the following change to line 214 of the V3.0 script. Simply change

`'s/^.*S5R3_//'` to `'s/^.*ein_S5R3_//'` and you will have the more robust search string.

Here is the fixed version which incorporates this more robust search string. It should work fine irrespective of the presence or absence of extra debug lines in stderr.out.
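Why the longer search string is more robust can be shown with a small stand-in. The sketch below mimics the two sed substitutions with Python's `re` module; the two sample lines are invented purely for illustration and are not real stderr.out content, and how the script uses the stripped text afterwards is not shown here.

```python
import re

# Python equivalents of the sed expressions s/^.*S5R3_// and s/^.*ein_S5R3_//
OLD_PATTERN = re.compile(r"^.*S5R3_")
NEW_PATTERN = re.compile(r"^.*ein_S5R3_")


def strip_prefix(pattern: "re.Pattern[str]", line: str) -> str:
    """Delete everything up to and including the last match of the marker."""
    return pattern.sub("", line, count=1)


# Invented placeholder lines (assumption, not real output):
app_line = "xx ein_S5R3_AAA"    # stands in for a line naming the app
debug_line = "debug S5R3_BBB"   # stands in for an extra debug line

old_results = [strip_prefix(OLD_PATTERN, s) for s in (app_line, debug_line)]
new_results = [strip_prefix(NEW_PATTERN, s) for s in (app_line, debug_line)]
```

The shorter pattern rewrites both lines, so the debug line looks like a second app version; the longer pattern leaves the debug line untouched.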

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 115,156,248,726
RAC: 32,336,359

### I've uploaded a new version

I've uploaded a new version of my shell script which addresses a problem mentioned by Richard Haselgrove and explained here.

You can download the new version (V3.2) which fixes the problem here.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Joined: 15 Oct 04
Posts: 4,301
Credit: 248,201,745
RAC: 32,352

### I just found this in another

I just found this in another thread and think it might be suitable to be answered here:

Quote:
Quote:
Alternatively, we could just go ask Bernd instead of expending all that energy. I'm rather curious as to why he hasn't put us out of our misery earlier :). Of course that would completely take away the thrill of the chase :).

Quote:
I bet Bernd is reading these threads, and just chortling quietly at us as we bumble around re-inventing his algorithm......

There's probably an E@H Head Office Sweepstake riding on this, and it might have attracted serious money! Keep up the top work guys .... lookin' good. :-)

Cheers, Mike.

No, there isn't.

Wow, what excitement and geniality I missed!

To add my two cents here's some very, very technical background (which you probably already largely figured out), and a proposal at the end:

The "Hierarchical Search" we have been using since S5R2 consists of two steps that are performed alternately for each "template": an "F-Statistic" (or "Fstat" for short) step that is similar (but not equal) to a DFT, and a "Hough-transform based coincidence analysis". Previous runs consisted only of the Fstat step, and a coincidence analysis was part of the post-processing. In the Hierarchical Search we analyze data from two detectors with an Fstat method, then basically look for similarities in the results, as a gravitational wave should show up in the signals of both instruments, while local noise and disturbances shouldn't. This "coincidence analysis" is currently done in the Hough part.

When you point a profiler such as Intel's VTune or Apple's Shark at current Apps, you should see the pattern of alternating calls to the functions ComputeFStatFreqBand and ComputeFstatHoughMap (and subfunctions; in recent Apps replaced by optimized functions with the prefix "Local") that implement these two steps. Currently about 2/3 of a task's run-time is spent in the Fstat analysis, and 1/3 in the Hough part. The whole machinery around these two functions doesn't matter much for the computation time.

The time needed for the Fstat part is constant for every template, i.e. independent of the actual sky position and frequency. This isn't true for the Hough step. The size of a fundamental data structure in the Hough part depends on the declination of the current sky position (actually on the sine of the angle between the sky position and the ecliptic plane, with abs() or sqr() applied; I don't remember which off the top of my head), and so does the time of every computation that involves it. (This is very much simplified, but if you neglect everything that averages out over a task even in S5R3, it boils down to this.)

This effect averages out in an "all-sky" task, which we had in all previous analysis runs up to and including S5R2. This is the reason why we didn't see this run-time variation in S5R2 although we already used the Hierarchical Search there, and also why the credits are right on average in S5R3.
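The averaging can be made concrete with a short calculation. It assumes sky positions spread uniformly in area over the sphere and covers both candidates mentioned above (absolute value or square of the sine); β stands for the latitude-like angle to the reference plane.

```latex
% Uniform-in-area weight for \beta \in [-\pi/2, \pi/2]:  dP = \tfrac{1}{2}\cos\beta\,d\beta
\langle \sin^2\beta \rangle
  = \frac{1}{2}\int_{-\pi/2}^{\pi/2} \sin^2\beta\,\cos\beta\,d\beta = \frac{1}{3}

\langle |\sin\beta| \rangle
  = \frac{1}{2}\int_{-\pi/2}^{\pi/2} |\sin\beta|\,\cos\beta\,d\beta = \frac{1}{2}
```

Either way an all-sky task sees a fixed average, whereas a task confined to a narrow band sees a value near 1 close to the poles of that angle and near 0 close to β = 0.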

In order to get reasonable and equal run-times for each task (though the amount of data to be analyzed is constantly growing from run to run) we found it necessary to split up not only the frequency range, but also the sky positions between the tasks in S5R3. This splitting is actually done in the application:

Having had quite some trouble in early runs with calculating the grid of sky positions to look at in the Application (from numerical differences between architectures and compilers) we are distributing files with lists of pre-calculated sky positions, the "sky-grid" files. The granularity of this grid depends on the frequency; there is a new sky-grid file for every 10Hz. The sky positions in such a file start at one pole, followed by all sky positions next to it in order of right ascension, then followed by the next-nearest circle of sky positions and so forth, until the other pole is reached.

If you look at the command-line of the App (e.g. in client_state.xml) you'll see that it gets passed options named "numSkyPartitions" and "partitionIndex", originally set by the workunit generator for this particular workunit. The App splits the skygrid file in "numSkyPartitions" portions and calculates the sky positions of the portion number "partitionIndex". (In hindsight it looks more clever to me to take every "numSkyPartitions"th skypoint of the file starting with the "partitionIndex"th, which effectively would still cover the whole sky with every task, looking more interesting on the graphics and averaging run times.)
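The two ways of splitting the sky-grid file described here can be sketched as follows. This is an illustration, not the App's code: the function names are hypothetical, `partition_index` is assumed to be zero-based, and the handling of a grid that does not divide evenly is a guess.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def _check(num_partitions: int, partition_index: int) -> None:
    if num_partitions < 1 or not 0 <= partition_index < num_partitions:
        raise ValueError("need 0 <= partition_index < num_partitions")


def contiguous_partition(points: Sequence[T], num_partitions: int,
                         partition_index: int) -> List[T]:
    """Current scheme: one consecutive chunk of the pole-to-pole list."""
    _check(num_partitions, partition_index)
    n = len(points)
    start = partition_index * n // num_partitions
    stop = (partition_index + 1) * n // num_partitions
    return list(points[start:stop])


def interleaved_partition(points: Sequence[T], num_partitions: int,
                          partition_index: int) -> List[T]:
    """Hindsight scheme: every num_partitions-th point, spanning the whole sky."""
    _check(num_partitions, partition_index)
    return list(points[partition_index::num_partitions])


# Ten placeholder sky points, ordered from one pole to the other.
grid = list(range(10))
band = contiguous_partition(grid, 3, 1)
spread = interleaved_partition(grid, 3, 1)
```

With the contiguous split each task gets a band at roughly one declination, which is what makes run-times vary between tasks; the interleaved split gives every task points from pole to pole, so the declination effect would average out within each task.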

So the current run-time is the sum of a part that is determined only by the number of templates in each task (and can be derived from the currently assigned credit), and a part that varies with the (average) declination of the sky positions in the sky partition assigned to the task.

The trouble is that the ratio between these two is a little bit different of course for every App (version), but also for every machine, and finally also depends on the frequency (the higher the frequency, the larger the Fstat part). The Fstat part is FLOPs bound, it doesn't read or write much data, and depends on the speed of the FPU (or whatever SIMD unit the current App is using). The Hough part is largely limited by memory bandwidth, which is somewhat orthogonal to FP speed. The first App optimizations to the "Kernel loop" and "sin/cos approximation" addressed the Fstat part, while the recent "prefetching" affects the memory access of the Hough part.

Honestly I find it very hard to "tune" the credits of a workunit (i.e. predict the run-time for it) in a way that would do justice to everyone and every machine, especially if that should also serve as a reference for comparing speed of Apps.

For the latter I think it would be best to construct a "reference workunit" that covers the whole sky and thus averages out these variations, e.g. by taking a current workunit and picking some points from the skygrid file in the way described above. This would, however, mean that people interested in speed comparison would need to run it on their machine with every new App they get. Remember that you'd do this without getting BOINC credit for it, but just for comparison. How much (run-) time would you be willing to sacrifice for this?

BM


archae86
Joined: 6 Dec 05
Posts: 3,153
Credit: 7,163,794,931
RAC: 620,766

### RE: I just found this in

Message 77823 in response to message 77822

Quote:
I just found this in another thread and think it might be suitable to be answered here:

Thanks for the information and details.

Quote:
For the latter I think it would be best to construct a "reference workunit" that covers the whole sky and thus averages out these variations, e.g. by taking a current workunit and picking some points from the skygrid file in the way described above. This would, however, mean that people interested in speed comparison would need to run it on their machine with every new App they get. Remember that you'd do this without getting BOINC credit for it, but just for comparison. How much (run-) time would you be willing to sacrifice for this?

I do see some practical points:

1. For this to be most useful, I assume you'd want results from a range of hosts, with variation in processor architecture, relative speed of memory to CPU, and mix of tasks sharing the resources with Einstein. That, in turn, probably means someone (you, or one of the handy helpers here) would need to prescribe an easy to follow installation, run, and report procedure, preferably one which would not lead to mass discarding of queued work.

2. For hosts with more than one processor, whether virtual (hyperthreading), multi-core, or multi-socket, memory interactions mean that the most useful performance measure needs the test application running on the same number of processors, in the same environment of non-Einstein activity, as that host's normal production standard.

3. For the multiple-processor case, it may not be best to start multiple instances of the identical test result at the same moment to this end, as the memory conflict issues for that case are not likely to mimic the average case well.

While this has been great good fun, and possibly even has been slightly useful in helping guide those few users who view this forum to better application choices, a procedure which you would find more useful to your purposes would certainly win my participation.

However, given the relatively low number of people reporting, and the modest fraction of them who have adopted our cycle learning or the Ready Reckoner tool in their reporting, I'm a bit skeptical that the current user base will generate much more feedback to you using the reference workunit approach than presently.