[S5R3/R4] How to check Performance when Testing new Apps

Gary Roberts
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118588527569
RAC: 17853525

In the opening post of this

In the opening post of this thread, I wrote

Quote:

The key formula is that the cycle period (P) for the variation in crunch time for data of frequency (F) is:-

P = 0.000206 * F^2

The frequency to 2 decimal places is contained within the task filename, eg the frequency for the task h1_0737.50_S5R2__70_S5R3a_1 is 737.50Hz. If you just happened to be crunching this task then you would expect to see cyclic crunch times whose calculated period would be:-

P = 0.000206 * 737.5^2 = 112.04 Each task has a sequence number, eg 70 in the above example. Tasks which have a sequence number that is an integer multiple of the period (including zero) have the maximum crunch time, eg 0, 112, 224, 336, ... in the above example. Let's call these the peak tasks. Tasks whose sequence number falls half way between any two adjacent peaks will have the minimum possible crunch time, ie the fastest speed of crunching. In the above example such tasks would be numbered 56, 168, 280, 392, ... and we could call them trough tasks.

So if we wanted to reliably compare "new app" tasks with "old app" tasks, we just need to ensure two things:-

  • * The tasks need to have a close enough frequency so that the calculated period does not vary significantly.
    * The tasks chosen for comparison should have as close as possible to the same offset from any peak or trough in the cycle.

Because of the relatively steep slope of the cycle near to the peaks, it is best if possible to compare tasks which are close to a trough where the slope is much smaller.

I want to use the above principles to illustrate a real world example. The box in question is an AMD Athlon XP 2000+ running Linux and I switched it from version 4.14 to 4.27 about 4 days ago. I've just collected all the details from its current tasks list and I'll reproduce them here as they will be disappearing from the online database shortly. Here is the relevant data:-

Taskname__ Seq# AppV Crunchtime
h1_0703.85 _175 4.14 71,815
h1_0703.85 _159 4.14 66,907
h1_0703.85 _150 4.14 66,659
h1_0703.85 _147 4.14 67,617
h1_0703.85 _143 4.14 68,264
h1_0703.85 _132 4.14 72,250
h1_0703.85 _123 4.14 79,886
h1_0703.85 _119 4.14 80,509
h1_0703.85 _110 4.14 86,098
h1_0703.85 _102 ---- 90,421 ==> Transition task - 60% with 4.14 and 40% with 4.27
h1_0703.85 _094 4.27 82,115
h1_0703.85 _030 4.27 65,388
h1_0704.00 _071 4.27 65,542
h1_0704.05 _008 4.27 80,578

As you can see, there are 4 tasks crunched with the new app and the crunch times are still highly variable and (at first glance) seemingly no better on average than the old app tasks. In fact there were more old tasks in the ~65k range so if anything the new stuff seems a bit disappointing. Here are the steps to do a proper analysis:-

1. Calculate the cycle period P = 703.85^2 * 0.000206 = 102
2. Identify the peaks and troughs - peaks are 0, 102, 204, ... and troughs are 51, 153, 255, ...
3. Match off equivalent sequence numbers for comparison. To make this easy I constructed the following table of sequence numbers:-

 0    102    102    204    204
05     97    107    199    209
10     92    112    194    214
15     87    117    189    219
20     82    122    184    224
25     77    127    179    229
30     72    132    174    234
35     67    137    169    239
40     62    142    164    244
45     57    147    159    249
50     52    152    154    254
51     51    153    153    255

The columns in this table represent sequence numbers in steps of 5, going from the peak of the cycle (slowest) to the trough (fastest). The odd columns should be read downwards from peak to trough and the even colums should be read upwards from the trough back to the next peak. Any horizontal row therefore represents those sequence numbers which should be at equivalent points on the cycle and which should give very similar performance for the same version science app. Because the numbers are in steps of 5 it is pretty easy to interpolate the equivalent sequence numbers. For example sequence numbers 45, 57, 147, 159, 249, ... are all at an equivalent point on the cycle. So would be 31, 71, 133, 173, 235, ... - just a different (higher - ie slower) equivalent point.

So a quick scan of the available crunch data shows that 30/132 are a match and so are 8/94/110 (a triple match) and so are 147/159. Also 71/175 are pretty close since a precise match for 71 would be 173.

Another point to notice is that 159/150/147/143 are all in what we might call a trough area and therefore would be expected to give quite similar performance - which indeed they do.

4. Armed with the above potential matchups, a number of conclusions can be drawn:-

  • * The 30/132 (new_app/old_app) matchup shows a speed increase of ~10%
    * The 94/110 (new_app/old_app) matchup shows a speed increase of ~5%
    * The 8/110 (new_app/old_app) matchup shows a speed increase of ~7%
    * The 71/175 (new_app/old_app) close match shows a speed increase of ~9%
    * Of the above 4 estimates, the 9% and 10% ones are more likely to be correct since the other two matchups were done close to a peak in the cycle and the steep slope there makes the outcome much less reliable.
    * Matchups done near a trough in the cycle are much preferred and do not need sequence numbers to be precisely at the same point - see 143/147/150/159
    * There is always going to be "scatter" in crunch times apart from the cyclic variation - eg 123 seems to have a larger crunch time than its neighbours would otherwise suggest.
    * This overall technique allows a reasonable ball park figure to be quickly deduced from what is really quite a minimal data set. I would guess the true figure for performance increase to be probably around 8-9%
    * All of this was possible ONLY after archae86 posted his

cyclic model in which he presented his data derived formula for the cycle period. Those (like me) who want a quick ball park estimate of performance change for the various betas that Bernd is dishing out now have no excuse for getting it wrong :).

Cheers,
Gary.

archae86
archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7302635023
RAC: 2261882

RE: * The 30/132

Message 77735 in response to message 77734

Quote:
* The 30/132 (new_app/old_app) matchup shows a speed increase of ~10%
* The 94/110 (new_app/old_app) matchup shows a speed increase of ~5%
* The 8/110 (new_app/old_app) matchup shows a speed increase of ~7%
* The 71/175 (new_app/old_app) close match shows a speed increase of ~9%
* Of the above 4 estimates, the 9% and 10% ones are more likely to be correct since the other two matchups were done close to a peak in the cycle and the steep slope there makes the outcome much less reliable.
* Matchups done near a trough in the cycle are much preferred and do not need sequence numbers to be precisely at the same point - see 143/147/150/159


It can happen that two aps being compared may not differ in speed by a uniform percentage across the cycle. I am pretty sure that actually did happen on the Windows Einstein ap going from 4.15 to 4.26. The parameter b in my current model
t = a(1 - b*|sin(PI*seq/period)|)
which this the proportion of full-scale by which the valley is below the peak currently appears to fit about 0.2 or 0.21 on my two hosts running 4.26, while best fit for 4.15 seemed to be about .16 to .17 on the same hosts.

A consequence is that overall performance estimates taken at the valley will slightly mis-state the cumulative change across the full cycle range. Specifically for Windows 4.26 running on my Conroe hosts this method slightly over-states the improvement over 4.15.

I don't think that is likely often to be very serious, and don't disagree with your advocacy of this method. While I've not reviewed your data carefully enough to be sure, I suspect an increase in peak height relative to valley may also be present for the aps you are comparing, and may account for more of the reduced advantage seen on the samples estimating 5 and 7 percent than errors in peak position estimation.

Quote:
There is always going to be "scatter" in crunch times apart from the cyclic variation - eg 123 seems to have a larger crunch time than its neighbours would otherwise suggest.


The scatter problem is appreciable in generating improvement estimates from single samples, even if one finds valley points. I've noticed that the amount of scatter can vary quite a bit. I operate two Conroe 3.006 GHz hosts, one a quad and one a Duo. The Quad is a bit slower, as memory conflict might suggest, but it also has appreciably more scatter--perhaps again a memory resource issue, or perhaps interaction with more an different non-BOINC aps than on the other.

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 776868654
RAC: 1233799

Interesting analysis! From

Interesting analysis!

From what I got in profiling and disassembling the Windows app, I agree that a uniform speed increase across the cycle is not likely to happen. Optimizations in the code like the linear sin/cos stuff should affects only the non cyclic part of the runtime function, and most of the speed-up of the superior compiler optimization by VS 2003 should also mostly go to that part of the code.

CU
Bikeman

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 325474396
RAC: 84397

This is an updated version of

This is an updated version of the Ready Reckoner for all to enjoy! :-)

Again any/all feedback welcome .....

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

archae86
archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7302635023
RAC: 2261882

RE: This is an updated

Message 77738 in response to message 77737

Quote:

This is an updated version of the Ready Reckoner for all to enjoy! :-)

Again any/all feedback welcome .....


I have some 4.26 returns for my Q6600 host for which I've spent effort combining both graphical comparison of model prediction and actual data, and also summary mean and stdev of the estimate errors.

By good fortune, my set included one point pretty near cycle peak, and several points in a fairly low portion, which helps estimation.

My estimates for a and b were:

30500 0.21

I entered just two points into your ready reckoner2, the near-high peak and one of the pretty low ones. At that point ready reckoner2 using just two points was very close to my personal estimates using about fifty points.
30810 0.21

Then I entered more points. At first nothing happened. The problem was that I had directly copied and pasted the seconds of run time from the "tasks for computer" web page, and it included a comma separating thousands from units. Apparently the reckoner thinks this not a number, and ignores the entry.

Deleting the comma meant reckoner did something with the additional points, but it does not seem to have been good. Possibly as a warning the error box, which of course showed zero when just the two required points were in, rose as more points were added, reaching a peak of 44% when seven points were in, dropping back to 28.1% with all eight points were in.

More disturbingly, the A and B estimates, so close to mine on the first two points, diverged a great deal, reaching:

37210 0.37

As seven of the points are near the valley, it stands to reason that implied estimate of the minimum is pretty close. I fear the numeric technique used is placing rather too much credence in the small differences of the values near the minimum in estimating curvature, for which the single point near the peak in fact supplies far more information, after the effects of noise are considered.

Anyway, as a comment to other users of this admirable tool, you'll find you get much better results if you supply the tool with sample points spread across a cycle a bit. It is wonderful if you are so lucky as to have a point pretty near a peak (right on it may not be quite so good, as any error in the real cycle period estimate will have greatly increased harm for the first couple of percent from peak). It is good to have a point fairly near the valley. To help, it is probably good if additional points are scattered across the cycle, more than if they are adjacent. (guessing a bit here).

archae86
archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7302635023
RAC: 2261882

RE: This is an updated

Message 77739 in response to message 77737

Quote:

This is an updated version of the Ready Reckoner for all to enjoy! :-)

Again any/all feedback welcome .....


Mike, second feedback post.

As one of the biggest uses here is estimating performance improvement, a useful additional output you might provide for step 4 would be, given the current estimates for a and b, an estimate of the probable average CPU time for a full cycle of results (and thus the expected value for a randomly chosen scatter of results).

Otherwise people may over-focus on a, which is a peak value, and moves up and down quite drastically as the sample suggests a greater or lesser peak to valley variation.

I don't think an over-complex calculation of the estimate which takes into account the granularity of actual sequence increments, or the displacement of the nearly actual sample to the equation-predicted peak is necessary, or even desirable. The real function is discontinuous anyway. So I think this estimate would test be calculated independent of frequency.

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 325474396
RAC: 84397

Thank you for your kind

Message 77740 in response to message 77738

Thank you for your kind feedback. I will indeed improve the interface and modelling. All your points are absolutely correct, and I think I will highlight the instructions better on the page ..... :-)

The error box is indeed a warning that the estimates are diverging. The algorithm performs determinations of A and B over all possible combinations of pairs of points ( two equations with two unknowns each time ). For instance, 8 points yields 28 pairwise estimates [ generally N*(N-1)/2 ].

Alas it is quite unlikely that any three particular points would lie on a sinusoidal curve of a given period - with only amplitude and offset to work with for fit - hence the pairwise approach. I would dearly love inbuilt stats abilities from everyday Javascript. I haven't as yet nutted out a least squares implementation from it's pretty basal Math set, which is my preference. But I want the page to be one-shot downloadable to any reasonable browser, self contained, no setup.

Algorithmically the difficulty is that the denominator in the solution for A has the difference of two sines. Both are less than unity, and if the sequence numbers are close then that sine difference will be rather smaller again. In a divisor this is not good. So a rising error indicator reflects that the solutions are starting to spray. That suggests an improvement by tossing estimates ( not the points ) from sequence number pairs that are, say, within five or so of each other.

Actually the two-point solution can only be sensibly obtained if the points are within the same sine excursion. That's because of the absolute value in your equation, which then ruins analytic continuity ( ~ differentiability ) across any peak point. So the sequence numbers of all the given points are mapped into the first cycle [ zero to one period ] prior to pairwise analysis. So it's that image of the points ( 'principle value' ) which I'm really discussing. This is legitimate as we expect no difference in execution times between points with sequence numbers exactly one period apart.

The output boxes give means, but I might go for medians which are less sensitive to outliers. The code currently has a logic error in the median part, but I don't use it ( yet ). The quoted error refers to the time estimate, and is the ratio of the standard deviation to the mean of all time estimates, expressed in percent.

I do hope you have at least as much fun fiddling with it as I did writing it! :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 325474396
RAC: 84397

RE: Mike, second feedback

Message 77741 in response to message 77739

Quote:
Mike, second feedback post.


and my second feedback, feedback :-)

Quote:
As one of the biggest uses here is estimating performance improvement, a useful additional output you might provide for step 4 would be, given the current estimates for a and b, an estimate of the probable average CPU time for a full cycle of results (and thus the expected value for a randomly chosen scatter of results).

Nice one! -> In the pipe for V3! A pretty straight forward integration there. Actually the RMS value of a sine is ~ 0.7071 ( reciprocal of square root of two ), so that'll be easy.

Quote:
Otherwise people may over-focus on a, which is a peak value, and moves up and down quite drastically as the sample suggests a greater or lesser peak to valley variation.

Precisely ...

Quote:
I don't think an over-complex calculation of the estimate which takes into account the granularity of actual sequence increments, or the displacement of the nearly actual sample to the equation-predicted peak is necessary, or even desirable. The real function is discontinuous anyway. So I think this estimate would test be calculated independent of frequency.

Good points all .... :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 325474396
RAC: 84397

Here's V3 of the Ready

Here's V3 of the Ready Reckoner! :-)

- instructions in one place
- can enter numbers with commas
- better tossing of variant estimate cases ( if difference in sines < 0.1 )
- no estimates given if data is poorly
- prediction of average runtime over ALL units at a given frequency [ using average of a sine wave = 2/PI ~ 0.6366. Not RMS! ]
- some prissy polishing :-)

Again any/all feedback welcome .....

Cheers, Mike.

( edit ) I've noticed that if you get a 'good' A and B for a given computer at some frequency, you can leave those in at Step 6 then go back to step 3 & 4 entering the different frequency ( if below 800 ) and sequence number for another known unit. The actual and predicted times match to ~ 1%. Perhaps hard to believe but that was the original question in my mind to be answered - are A and B truly machine dependent only ( all else held constant - including the app - bar the stepping of work unit parameters ). Like Ewan McGregor & Charley Boorman, the long way round!

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Svenie25
Svenie25
Joined: 21 Mar 05
Posts: 139
Credit: 2436862
RAC: 0

Looks quiet good. But I have

Looks quiet good. But I have a problem at step 5. If I intput my sequence number and runtimes, the fields "Average runtime", "Error ~" and "A" and "B" will stay at "0". Don´t know what happens there.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.