The truncation issue should be fixed now. I now see longer feature strings in the logfile where hosts are missing AVX. You should get work now if your CPU does support AVX.
There is evidence in the O1AS20 task list for one of AgentB's Linux machines that the truncation fix worked to enable AVX WU download by that machine.
The last non-AVX task on view there shows a 19 Feb 2016, 8:24:51 UTC send date, while several tasks sent from 19 Feb 2016, 12:01:23 UTC onwards are AVX.
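For anyone curious why truncation matters here, a toy illustration is below (the feature list and cut-off point are made up for the example, not the client's real values): if the reported CPU feature string gets cut off before the 'avx' flag, the scheduler has no way of knowing the host can run the AVX app.

```python
# Hypothetical CPU feature string as a host might report it (abbreviated, invented).
reported = "fpu sse sse2 ssse3 sse4_1 sse4_2 popcnt aes xsave osxsave avx f16c rdrand"
truncated = reported[:40]             # truncation bug: string cut off before "avx"

print("avx" in truncated.split())     # False -> host appears to lack AVX, gets SSE work
print("avx" in reported.split())      # True  -> full string lets the AVX app be sent
```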
Here is an updated version of the table showing GW tasks distributed per ~24 hour period.
Date and Time (UTC)    | Status page item | O1AS20-100T | Change | Comment
16 Feb 2016, 20:48:31  | Tasks total      |     746,736 |        |
16 Feb 2016, 20:48:31  | Tasks to send    |     741,997 |      0 |
17 Feb 2016, 21:05:01  | Tasks to send    |     738,475 |  -3552 | Initial cache fillup on hosts
18 Feb 2016, 21:35:02  | Tasks to send    |     736,549 |  -1926 | Hmmm.. these tasks chew sloooow
19 Feb 2016, 21:45:01  | Tasks to send    |     733,902 |  -2647 | Picked up a bit... New V1.04 app
20 Feb 2016, 21:00:01  | Tasks to send    |     731,502 |  -2400 | Looks 'steady state' now
21 Feb 2016, 21:50:02  | Tasks to send    |     728,666 |  -2836 | Up a bit more...
A new version (1.04) of the app appeared just a few hours ago - around closing time on a Friday!! I hope they weren't doing that remotely from a bar somewhere :-).
I aborted a number of the v1.03 X64 (SSE) tasks to get the AVX tasks started.
There was a panic requiring a restart during, or close to the completion of, two of these tasks (AVX1 and AVX2), and the details logged for them appear truncated. It could be as simple as: the tasks completed, the panic occurred and truncated the stderr_txt log files, and the results were then reported after the restart. I expect I may see some invalids on those two or related tasks.
It will be a few hours before more AVX will finish.
A couple of early observations as these AVX complete on this host. The AVX times for me are ~10% longer, and CPU temps ~5C higher.
Comparing 32 bit (v1.02) and 64 bit (v1.03 and v1.04) - 64 bit is much (at least 25%) faster.
Aside: I notice most tasks have unsent "wingmen", and there is a growing number of tasks to validate - which have no quorum tasks sent as yet. I guess this is to be expected in the beta stage where tasks are many and wingmen, few.
This is what happens with GW type runs. It's due to the way locality scheduling (LS) works. The large data files are arranged in steps of 0.05Hz. To crunch a single task you need a number of these data files covering a frequency range. Double this again for the two detectors (h1 and l1). If you got tasks 'at random' your data download would be enormous, potentially a full set of data for each task. LS is vital for people who have low monthly bandwidth caps.
Once you have a set of data, you can get lots (up to thousands) of tasks that will reuse the same data. When that particular frequency set is finished, you will get just the extra 2 files to extend the range by a further 0.05Hz so that a whole slew of new tasks becomes available that will still be using some files you already have. This all works quite well when there are not too many frequency bins and lots of computers to provide quorum partners.
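To make the reuse idea concrete, here is a toy sketch of that preference (purely illustrative: the file naming, the 4-file range per task and the pick_task helper are invented for the example and are not the real scheduler code):

```python
# Toy model: data files come in 0.05 Hz steps for two detectors (h1, l1) and a task
# needs a small contiguous range of them.  Names and values here are invented.
def files_for_task(base_freq, n_steps=4, step=0.05):
    """Set of data files a hypothetical task starting at base_freq would need."""
    return {f"{det}_{base_freq + i * step:.2f}Hz"
            for det in ("h1", "l1")
            for i in range(n_steps)}

def pick_task(candidate_freqs, host_files):
    """Locality-style choice: the candidate needing the fewest new downloads."""
    return min(candidate_freqs, key=lambda f: len(files_for_task(f) - host_files))

host_files = files_for_task(100.00)                    # data already on the host
best = pick_task([100.05, 137.40, 52.15], host_files)
print(best, "needs", len(files_for_task(best) - host_files), "new files")   # 100.05 needs 2
```

Extending the range by one 0.05 Hz step only costs the two new files (one per detector), which is exactly the behaviour described above.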
The current situation is that there aren't all that many computers that have 'test apps allowed' and all of the available frequency bins are in play (all tasks in the tuning run are in the database). The scheduler has all frequencies to choose from so hosts will be 'scattered thinly' :-). This should change fairly soon when the app (hopefully V1.04) is deemed ready for prime time.
As a side note, the transition from V1.00 -> 1.01 -> 1.02 -> 1.03 -> 1.04 has been quite rapid. The last transition was for a problem in the science code whereas the earlier ones were to fix operational problems. If anyone has earlier version tasks that have not yet started, the best action would be to abort them so that they could be reissued as the current version. I don't think it's useful to waste the time on a version with an already known problem.
I've just found an example of a direct comparison of the X64 and AVX Linux apps on a single host of mine. The tasks of interest are at the bottom of the page. The link should work, as is, for the next 24 hours at least, since the host won't get new work for the moment. Of the (currently) 4 completed tasks there are two examples for each app type.
The CPU is an i3-3240. As you say, AVX is taking 10% longer. That's pretty disappointing. Hopefully something can be done about that.
EDIT: I managed to get two machines mixed up. The one above is an i3-4130 (Haswell and not Ivy Bridge) and it's continuing to draw new tasks so the 4 comparison tasks are partly on the second page now. There are also further AVX results continuing to show the longer times :-(.
The i3-3240 that I thought I was looking at started with the AVX app from the beginning and the crunch times are just slightly slower than those of the Haswell. I thought there would be more of a difference between Ivy Bridge and Haswell. It's got some FGRPB1 to finish before it gets back to GW tasks.
I have this host which got SSE2 (32bit) tasks for a while until the 64bit detection worked. The speedup is not 25% on this old clunker but nevertheless it's still quite impressive (102K -> 88K).
The CPU is a single core AMD Sempron and, through a BIOS trick, it's been turned into a dual core. I had retired it quite a while ago but I thought it would be good to run it again for testing purposes on the new GW run.
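Taking the two quoted figures as run times in the same units, the implied speedup on that host works out to roughly 14% (versus the 25%+ seen on newer CPUs):

```python
# Speedup implied by the 102K -> 88K run times quoted above (the units cancel out).
old, new = 102_000, 88_000
print(f"{(old - new) / old:.0%} faster")   # ~14%
```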
The AVX slowdown is even worse on my FX-8320E. The 1.03 X64 tasks took between 18 and 19 hours, which is already much longer than expected. The 1.04 AVX tasks running right now seem to be heading for the 23 hour mark.
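For comparison with the ~10% figure above, a rough calculation from those times (taking 18.5 hours as the midpoint of the X64 range) puts the FX-8320E slowdown at around 24%:

```python
# Rough AVX-vs-X64 slowdown on the FX-8320E, from the run times quoted above.
x64_hours = (18 + 19) / 2      # midpoint of the 18-19 hour range
avx_hours = 23
print(f"{(avx_hours - x64_hours) / x64_hours:.0%} slower")   # ~24%
```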
Looks like the O1AS20-100T apps may now be out of test status? Either that or something really, really big has been added to the beta test crunchers. Note the sudden dramatic drop in the remaining 'tasks to send' that has occurred in the last 24 hours.
Date and Time (UTC)    | Status page item | O1AS20-100T | Change | Comment
16 Feb 2016, 20:48:31  | Tasks total      |     746,736 |        |
16 Feb 2016, 20:48:31  | Tasks to send    |     741,997 |      0 |
17 Feb 2016, 21:05:01  | Tasks to send    |     738,475 |  -3552 | Initial cache fillup on hosts
18 Feb 2016, 21:35:02  | Tasks to send    |     736,549 |  -1926 | Hmmm.. these tasks chew sloooow
19 Feb 2016, 21:45:01  | Tasks to send    |     733,902 |  -2647 | Picked up a bit... New V1.04 app
20 Feb 2016, 21:00:01  | Tasks to send    |     731,502 |  -2400 | Looks 'steady state' now
21 Feb 2016, 21:50:02  | Tasks to send    |     728,666 |  -2836 | Up a bit more...
22 Feb 2016, 21:05:02  | Tasks to send    |     707,949 | -20717 | Open the floodgates...?
22 Feb 2016, 21:30:01  | Tasks to send    |     707,300 |   -649 | In 25 mins ie. ~38,000/day
The 'drop' seems to be accelerating so get in quick before they're all gone :-).
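For anyone who wants to check the "~38,000/day" figure, it follows from scaling the 649-task drop over 25 minutes up to a full day (assuming the rate in that window is representative):

```python
# Extrapolate the 25-minute drop from the table above to a daily rate.
tasks_dropped, minutes = 649, 25
per_day = tasks_dropped * (24 * 60) / minutes
print(f"{per_day:,.0f} tasks/day")   # ~37,400, i.e. roughly 38,000/day
```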
You cite pretty good evidence at the macro level. At a micro level, I just looked at my resumption on my Westmere, and realized my quorum partner (the same for all six tasks, naturally) never got a single O1AS20-100T task until 22 Feb 2016, 10:00:07 UTC, and has subsequently gotten 134.
I predict an epidemic of deadline misses starting 5 days from now.
The first pulse will be from 5-day deadline O1AS20 tasks issued today, the next from 7-day deadline work, then a more diffuse cloud as longer-deadline work - on Gamma-Ray Pulsar Binary 1 or from other projects - pre-empted by the priority given to the O1AS20 work, also fails to meet deadline.
In other words, I think the work content for these is still underestimated, so over-fetch will happen on hosts set for big caches which have stabilized on the previous work. That will interact with the short deadlines (yes, longer than they were at the start) to give a bit more of this sort of commotion than usual. The good news is that if the host goes into deadline protection priority mode quickly, the DCF will boom up after the first completion, which will snub the excess fetching.
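As a rough sketch of that over-fetch mechanism (all the numbers below are made up for illustration, and real BOINC clients are more involved than this):

```python
# A host fetches work using (server estimate * DCF); if tasks really run much longer,
# the first completion bumps DCF and the queue suddenly looks far bigger than the cache.
cache_days     = 4.0     # hypothetical work-buffer setting
estimate_hours = 6.0     # hypothetical per-task estimate before any completions (DCF = 1.0)
actual_hours   = 20.0    # what the tasks really take on this host

fetched = int(cache_days * 24 / estimate_hours)     # 16 tasks queued on the old estimate
dcf = actual_hours / estimate_hours                 # DCF jumps to ~3.3 after one finishes
queue_days = fetched * estimate_hours * dcf / 24    # ~13 days of work now in the queue
print(fetched, round(dcf, 2), round(queue_days, 1))
# With 5-7 day deadlines the host has to run in high-priority mode and stop fetching,
# and later-issued short-deadline tasks are the ones most likely to miss.
```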
Regarding the apps being out of test status - that sounds correct to me: I've not enabled testing but received my first O1 WU today.