Pascal again available, Turing may be coming soon

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117619389620

RAC: 35214771

archae86 wrote:Anyone

6 Oct 2018 1:53:06 UTC

Message 167147 in response to message 167144

archae86 wrote:

Anyone remember the remarkable speedups obtained by code revisions by the person posting as akosf?

Sure do :-). Akos Fekete was the star of the moment with those spectacular improvements :-).

archae86 wrote:

I still have 22 of the "high-pay" units in cache, all received on September 30. So for the moment I can keep them in stock by a somewhat tedious procedure of keeping them suspended save for brief periods when I download a gulp of fresh work. I don't think their shelf life will expire for about another week. But I'd like to free myself from my procedure, and suspect I may need a much longer shelf life. So I need a refrigeration method.

Unfortunately, I doubt there is anything you can do to extend the expiry date of what you already have. It would be possible to 'shelve' the current BOINC tree and replace it with a new tree (new host ID) which would get all new tasks. If you were lucky enough to get resends of 'high-pay' tasks they would have a full 2-week deadline but not so for those you currently have.

In the past, I have used the 'resend lost tasks' feature to delete tasks that failed for unrelated reasons so that the scheduler would notice them as 'lost' and send fresh copies. When that happened, the resends were actually given a new full 2-week deadline. However, that bug was corrected some time ago, so if a lost task is created now, it will come back as a resend of that lost task with the previous deadline and not a new one. If you'd had this problem a year or two ago, you could have got a nice extension to their shelf life :-).

Bearing in mind that I know nothing about Windows and am describing what I'd probably do if it were my problem using Linux, I would rename BOINC to BOINC_save and create a new BOINC tree. You would need a new host ID so that the scheduler wouldn't notice lost tasks. When the new ID first contacts the scheduler, it would get everything new, including a bunch of tasks. With luck, you might get some of the correct 'high-pay' resends. The likelihood of getting some is probably at its highest about 2 weeks after the data file in question was first released, since there is always a proportion of hosts that receive tasks but never return them. There will be some between 2 and 4 weeks, with perhaps a few more at around the 4-week mark.

According to my records (I cache LAT data files and deploy them rather than letting individual hosts get them through downloading) here are the dates when the relevant files first appeared on any of my hosts.

-rw-r--r-- 5 gary gary 2270502 Sep 6 05:48 LATeah0104M.dat
-rw-r--r-- 5 gary gary 2270502 Sep 8 19:26 LATeah0104N.dat
-rw-r--r-- 5 gary gary 2270502 Sep 11 15:58 LATeah0104O.dat
-rw-r--r-- 5 gary gary 2270502 Sep 14 05:45 LATeah0104P.dat
-rw-r--r-- 5 gary gary 2270502 Sep 16 13:08 LATeah0104Q.dat

As you can see, the 'Q' tasks would have had a peak at around 30th Sep, so I'm guessing your current ones are largely of the Q flavour. We are at the point where you might get some of the 'M' and 'N' but it's probably a bit of a forlorn hope.
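The deadline arithmetic behind that estimate can be sketched in a few lines of Python. The dates come from the listing above; the year 2018, the 14-day deadline as the spacing between peaks, and the helper name `resend_windows` are assumptions made for illustration only.

```python
from datetime import date, timedelta

DEADLINE = timedelta(days=14)  # the 2-week task deadline discussed above

# First-seen dates from the directory listing (year assumed to be 2018).
first_seen = {
    "LATeah0104M": date(2018, 9, 6),
    "LATeah0104N": date(2018, 9, 8),
    "LATeah0104O": date(2018, 9, 11),
    "LATeah0104P": date(2018, 9, 14),
    "LATeah0104Q": date(2018, 9, 16),
}

def resend_windows(first):
    """Return the dates one and two deadlines after a file was first seen."""
    return first + DEADLINE, first + 2 * DEADLINE

for name, first in first_seen.items():
    peak1, peak2 = resend_windows(first)
    print(f"{name}: first resend peak {peak1:%d %b}, second around {peak2:%d %b}")
```

For the 'Q' file this gives 30 Sep for the first peak, which matches the estimate above.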

If you did try and did get a few, you could save this new BOINC tree as something like BOINC_new, and revert to the previous host ID. You could abort the tasks in the old tree that would fail anyway (or 'manage' them as you have been until closer to the due date) but eventually the old tree would have to go back to 'auto' mode of running. If a new driver did come along too late for your 'saved' tasks, you might still be able to test with any you got in the new tree.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117619389620

RAC: 35214771

Keith's message arrived

6 Oct 2018 4:39:01 UTC

Message 167148

Keith's message arrived almost simultaneously with mine. I don't know of any 'benchmarking tools'. No volunteer here has developed any optimised versions of the current apps that I know of. The last time something like that happened, it was for the BRP5 apps when staff member Bikeman (Heinz-Bernd Eggenstein) made optimisations that reduced the PCIe bandwidth consumption problem - or something along those lines.

However, Keith's comments reminded me that there is the ability to run the science app in 'standalone mode'. I know nothing about it but I believe you can set up the equivalent of a slot directory, place copies of all the required components (app and data, etc.) in it, and launch the app manually with all the command line arguments as would be found in the state file. As the app is running standalone (not managed by BOINC) there would be no communication with the project, so there would be no concept of a deadline. You could set up a special directory somewhere, create some subdir 'slots' and populate each one with a copy of what was needed to run the app for a particular 'high-pay' task. If you wanted to test a new driver version, you could do it anytime with an otherwise long overdue task that you have kept for just this purpose.

As I said, I've never looked into this so I don't know the details of how to do it. It might need to be different from what is done at Seti. When this project tests apps before release, perhaps they use standalone mode so maybe one of the staff might be able to point you in the right direction.

Cheers,
Gary.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2957762972

RAC: 715433

The SETI benchmark tool does

6 Oct 2018 10:24:03 UTC

Message 167149 in response to message 167148

The SETI benchmark tool does exactly that - runs the app in standalone mode against a datafile supplied by the user.

It's a fiendishly complicated Windows batch file, because it also does a lot more besides: it contains a standalone version of the SETI validator to measure the quality of the generated output, and a lot of time logging/reporting. It allows multiple tests to be conducted in a single run, specifically allowing the output quality test to be automated to compare 'reference' and 'test' applications.

Anybody with (advanced) batch/scripting experience could adapt it for use here: only the validator app would require true programming effort.

I wrote the fanout generator tool that Keith mentions: it's a simple Excel/Open Office spreadsheet which allows copy'n'paste operation of the procedure described in https://setiathome.berkeley.edu/forum_thread.php?id=56536&postid=953939#953939. The only tricky part is identifying the fanout folder from the file name by MD5: that's a standard BOINC procedure, so it could be adapted to work here, if there's a demand. But I doubt it's needed: because of locality scheduling, most of you will have sticky copies of most of the data files already.
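As a rough illustration of what "identifying the fanout folder from the file name by MD5" could look like, here is a minimal Python sketch. It is not taken from the spreadsheet mentioned above. The details (which hex digits of the digest are used, and a fanout of 1024) are assumptions recalled from BOINC's directory-hierarchy code and should be checked against a file whose folder is already known before relying on them.

```python
import hashlib

def fanout_dir(filename, fanout=1024, digits=slice(1, 8)):
    """Return the hex name of the fanout subdirectory for a file name.

    ASSUMPTIONS (not stated in the thread): the hash is the MD5 hex digest
    of the bare file name, a slice of its hex digits is read as an integer,
    and the result is taken modulo the fanout. Verify before use.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return format(int(digest[digits], 16) % fanout, "x")

# Example: the subdirectory this sketch would pick for one of the data files.
print(fanout_dir("LATeah0104Q.dat"))
```

The slice and the fanout are parameters precisely because they are the uncertain parts; adjust them if a known file does not land where the server actually stores it.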

Validity testing (accuracy of results) would require capturing and presenting the differing parameter command-line strings which define each new task. But Peter's problems relate to immediate crashes only: simple offline/standalone testing, without the advanced validation baggage, would be sufficient.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7223004931

RAC: 975542

My "walled garden" thought

6 Oct 2018 17:16:08 UTC

Message 167153 in response to message 167149

My "walled garden" thought has evolved, thanks to insights gleaned from Richard Haselgrove and Gary Roberts and a little of my own poking around.

New insight 1: At the moment the Einstein executable is processing a WU, it appears to rely solely on the (recently prepared just for this WU) slot directory contents, without external file dependencies.

New insight 2: Quite a bit of control configuration is passed to the executable in a long string of switches. At least one switch value is specific to the individual WU, so getting them all right is essential for even the sort of "is it alive" test I have in mind.

Not new insight to me, which Richard caught: I'm not seeking to get credit, or even to test whether a trial run gets "the right answer". I just want to see it run to completion or fail quickly.

So here is my current general concept, which I post before trial in case discussion here further improves my insight.

1. While BOINC is running in normal order on my machine:

2. Unsuspend a high-pay WU so it runs next.

3. Immediately after the processing begins, capture the full command line (including switches) using Process Explorer.

4. Copy the slot directory contents to a holding location.

5. Suspend the WU and resume normal processing of other work

6. make a myrun.bat file containing the captured command line and store it in the holding location slot copy.

Then, to actually use this saved copy to test a new driver:

1. Terminate BOINCMgr, and, if necessary, any executing boinc-related tasks which somehow survive.

2. copy the holding location slot to yet another "active trial" location

3. open a command line window, and change context to the active trial location

4. run myrun.bat

5. observe execution impact in the form of GPU activity (e.g. GPU-Z). If it runs more than a couple of minutes it is a pass, and if about 25 seconds a fail.
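The pass/fail rule in the last step could be automated along the following lines. This is only a sketch: the function name `is_alive_test` and the 120-second threshold are assumptions standing in for "more than a couple of minutes", and the command to run is whatever ends up in myrun.bat.

```python
import subprocess

def is_alive_test(argv, cwd, pass_after_seconds=120):
    """Run argv in cwd; return True if it is still running at the threshold.

    A run that survives past the threshold counts as a pass. A run that
    exits earlier (the failures described take about 25 seconds) counts
    as a fail. The threshold value is an assumption.
    """
    try:
        subprocess.run(argv, cwd=cwd, timeout=pass_after_seconds)
    except subprocess.TimeoutExpired:
        return True   # still running at the threshold; the process is killed
    return False      # exited before the threshold
```

Note that a task finishing successfully in under the threshold would also be reported as a fail, which seems acceptable here because a healthy GPU task runs far longer than two minutes.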

My current thought is to make holding location copies of one each low-pay and high-pay WUs, and to attempt the proposed test in my current environment.

Just for your curiosity, here is the full command line as captured from Process Explorer for a low-pay WU running today on the machine in question. For readability I have added a carriage return once every few switches.


hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile LATeah1026L.dat
--alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30
--f0start 148.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13
--f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir JPLEPH.405 --Tcoh 2097152.0
--toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1
--cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1
--demodbinary 1 --BinaryPointFile templates_LATeah1026L_0156_8324624.dat
--debug 1 --debugCommandLineMangling --device 0
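To see which switch values are specific to an individual WU, two captured command lines can be compared switch by switch. The following Python sketch assumes only the `--name value` layout visible above; the function names are made up for illustration.

```python
def parse_switches(command_line):
    """Split a '--name value' style command line into a dict.

    Switches without a value (e.g. --debugCommandLineMangling) map to None.
    Values such as -0.444 start with a single dash, so only tokens that
    start with a double dash are treated as switch names.
    """
    switches = {}
    tokens = command_line.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            switches[tok[2:]] = tokens[i + 1] if has_value else None
            i += 2 if has_value else 1
        else:
            i += 1  # e.g. the executable name; skipped
    return switches

def differing_switches(cmd_a, cmd_b):
    """Return the switch names whose values differ between two command lines."""
    a, b = parse_switches(cmd_a), parse_switches(cmd_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

Feeding it the command lines of two tasks from the same data file should list only the per-task switches, such as the start frequency and the template file.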

Juha

Joined: 27 Nov 14

Posts: 49

Credit: 4964434

RAC: 0

You can easily make a test

6 Oct 2018 19:10:37 UTC

Message 167155

You can easily make a test setup yourself. First open client_state.xml and find the <app_version> you are interested in. You'll need to match <app_name>, <version_num>, <platform> and <plan_class>. I'm using a CPU app here because I don't run the Einstein GPU app.

<app_version>
  <app_name>hsgamma_FGRP5</app_name>
  <version_num>108</version_num>
  <platform>windows_intelx86</platform>
  <avg_ncpus>1.000000</avg_ncpus>
  <flops>3397004991.680533</flops>
  <plan_class>FGRPSSE</plan_class>
  <api_version>7.3.0</api_version>
  <file_ref>
    <file_name>hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe</file_name>
    <main_program/>
  </file_ref>
  <file_ref>
    <file_name>einstein_S5R6_3.01_graphics_windows_intelx86.exe</file_name>
    <open_name>graphics_app</open_name>
  </file_ref>
  <file_ref>
    <file_name>fftwf-wisdom_FGRP5_1.08_windows_intelx86.exe</file_name>
    <open_name>fftwf-wisdom_FGRP5_1.08_windows_intelx86.exe</open_name>
  </file_ref>
</app_version>

The <file_ref> parts tell which files the application is made of. Copy the files from the Einstein project directory to some other directory. The graphics app and the fftw app here are probably not needed but it doesn't hurt to copy them too. If any file has an <open_name> that's different from its <file_name>, you'll need to rename the file to match.
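For anyone who would rather not read the XML by eye, the file list can be pulled out with a short Python sketch like the one below. It assumes the <app_version> block has been pasted out of client_state.xml as a string and parses as ordinary XML; the function name `app_files` is hypothetical.

```python
import xml.etree.ElementTree as ET

def app_files(app_version_xml):
    """Return (file_name, open_name, is_main) for each <file_ref>.

    open_name falls back to file_name when no <open_name> is given,
    so a pair with differing names marks a file that must be renamed.
    """
    root = ET.fromstring(app_version_xml)
    files = []
    for ref in root.iter("file_ref"):
        name = ref.findtext("file_name")
        open_name = ref.findtext("open_name") or name
        is_main = ref.find("main_program") is not None
        files.append((name, open_name, is_main))
    return files
```

With the example above, the graphics app would come back with `graphics_app` as its open name, which is the rename case just described.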

Next pick a task you want to use for testing, cut the last _0, _1, etc from the name and find <workunit> with that name.

<workunit>
  <name>LATeah0046F_72.0_920_-2e-12</name>
  <app_name>hsgamma_FGRP5</app_name>
  <version_num>108</version_num>
  <rsc_fpops_est>105000000000000.000000</rsc_fpops_est>
  <rsc_fpops_bound>2100000000000000.000000</rsc_fpops_bound>
  <rsc_memory_bound>600000000.000000</rsc_memory_bound>
  <rsc_disk_bound>20000000.000000</rsc_disk_bound>
  <command_line>
--inputfile LATeah0046F.dat --alpha 4.8185021298 --delta -0.9503033260 --skyRadius 0.001031447752 --ldiBins 15 --f0start 56 --f0Band 16 --firstSkyPoint 920 --numSkyPoints 8 --f1dot -3e-12 --f1dotBand 1e-12 --df1dot 1.855338438e-15 --ephemdir JPLEPH.405 --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 56265.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 1 --debugCommandLineMangling
  </command_line>
  <file_ref>
    <file_name>LATeah0046F.dat</file_name>
    <open_name>LATeah0046F.dat</open_name>
  </file_ref>
  <file_ref>
    <file_name>JPLEPH.405</file_name>
    <open_name>JPLEPH.405</open_name>
  </file_ref>
</workunit>

That gives two more files; copy them too. Also notice the long <command_line>; you'll need it later.
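The lookup just described (strip the trailing _0, _1, etc. from the task name, then find the matching <workunit>) might be scripted roughly as follows. This is a sketch that assumes a saved copy of client_state.xml parses as ordinary XML; the function names are made up here.

```python
import re
import xml.etree.ElementTree as ET

def workunit_name(task_name):
    """Strip the trailing _0, _1, ... from a task (result) name."""
    return re.sub(r"_\d+$", "", task_name)

def find_workunit(state_xml, task_name):
    """Return (command_line, [file names]) for the task's <workunit>."""
    wanted = workunit_name(task_name)
    root = ET.fromstring(state_xml)
    for wu in root.iter("workunit"):
        if wu.findtext("name") == wanted:
            # Collapse the surrounding newlines and indentation to single spaces.
            cmd = " ".join((wu.findtext("command_line") or "").split())
            files = [ref.findtext("file_name") for ref in wu.iter("file_ref")]
            return cmd, files
    raise KeyError(wanted)
```

Pass the task name, not the workunit name: some workunit names themselves end in an underscore followed by digits, and those digits would be stripped by mistake.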

Next, pick a running task that's using the same app version and copy init_data.xml from the task's slot directory to the test directory. Open the init_data.xml file in the test directory and change <wu_name> and <result_name> to match the name of the task you copied and remove <project_dir>, <boinc_dir>, <comm_obj_name>, <slot>, <client_pid> and <computation_deadline>.
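The init_data.xml edits can also be done with a few lines of Python instead of a text editor. The sketch below assumes the listed elements are direct children of the file's root element and that <wu_name> and <result_name> are both present; the function name `adapt_init_data` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements to drop, as listed in the paragraph above.
REMOVE = ("project_dir", "boinc_dir", "comm_obj_name",
          "slot", "client_pid", "computation_deadline")

def adapt_init_data(init_xml, wu_name, result_name):
    """Return init_data.xml text retargeted at the chosen test task."""
    root = ET.fromstring(init_xml)
    root.find("wu_name").text = wu_name
    root.find("result_name").text = result_name
    for tag in REMOVE:
        for elem in root.findall(tag):  # direct children of the root only
            root.remove(elem)
    return ET.tostring(root, encoding="unicode")
```

Work on the copy in the test directory, never on the file in a live slot directory.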

Finally, you need to figure out how to tell the app which GPU to run on. Check the command line a running task uses with Windows Task Manager or Process Explorer or some similar program. If a running task has "--device 0" on the command line then the test task needs it too. The 0 is the number of the GPU that is used. If you need to test some other GPU, just replace 0 with the number of that GPU. For apps that use a newer BOINC API, the GPU number and type are specified in the <gpu_type>, <gpu_device_num> and <gpu_opencl_dev_index> elements in init_data.xml.

The test directory should now be ready. Make a copy of it any time you want to run the test task so that you don't need to clean up the directory afterwards.

To run the test task, open a command prompt, change directory to the copy of the test directory, copy-paste the name of the science app executable (the file that is accompanied by the <main_program> tag in <app_version>), add a space and append the <command_line> from <workunit> and, if the app needs it, add a space and append "--device 0". Putting all that together, for this example task:

hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe --inputfile LATeah0046F.dat --alpha 4.8185021298 --delta -0.9503033260 --skyRadius 0.001031447752 --ldiBins 15 --f0start 56 --f0Band 16 --firstSkyPoint 920 --numSkyPoints 8 --f1dot -3e-12 --f1dotBand 1e-12 --df1dot 1.855338438e-15 --ephemdir JPLEPH.405 --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 56265.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 1 --debugCommandLineMangling --device 0

(This particular task is a CPU task so it doesn't use "--device 0". It's there just to give an example.)
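The same steps (fresh copy of the test directory, executable name, workunit switches, optional device number) can be wrapped in a small Python script. This is a sketch under the assumptions of the instructions above; the function names and directory arguments are invented for illustration.

```python
import shlex
import shutil
import subprocess
from pathlib import Path

def build_argv(app_exe, command_line, device=None):
    """Executable name + workunit switches (+ optional --device N) as a list."""
    argv = [app_exe] + shlex.split(command_line)
    if device is not None:
        argv += ["--device", str(device)]
    return argv

def run_in_fresh_copy(test_dir, run_dir, argv):
    """Copy the prepared test directory and run the app inside the copy."""
    shutil.copytree(test_dir, run_dir)  # run_dir must not exist yet
    # Use the full path of the executable inside the copy, so the lookup
    # does not depend on how the platform resolves relative names.
    exe = str(Path(run_dir, argv[0]).resolve())
    return subprocess.run([exe] + argv[1:], cwd=run_dir).returncode
```

Leave `device` as None for a CPU app or a single-GPU host that does not use the switch, and pass 0 (or another number) only when a running task shows "--device" on its command line.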

Juha

Joined: 27 Nov 14

Posts: 49

Credit: 4964434

RAC: 0

I was composing the

6 Oct 2018 18:09:38 UTC

Message 167156 in response to message 167153

I was composing the instructions during your poking around. Your concept won't work as is because the files in the slot directory are BOINC soft links.

You certainly should grab both kinds of workunits. If for nothing else then just to test the test set up.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117619389620

RAC: 35214771

Many thanks to Richard and

6 Oct 2018 23:03:35 UTC

Message 167158

Many thanks to Richard and Juha for adding all the details. As I've never tried to do this before, I would not have felt comfortable trying to come up with the detailed instructions without a whole bunch of preliminary experiments to make sure I had things 'all correct and accounted for' :-).

What I can do is supply the equivalent examples for GPU tasks to augment the ones Juha used for CPU tasks. Some things will still be different (I run Linux - not Windows) so you will still need to look at examples from your own state file.

Also, just to emphasise the point already made by Juha, there is no need to 'capture' the command line string of options and their values since they are already recorded in the state file (as part of the workunit specification) at the time a particular task is downloaded. When you save stuff for later reuse, don't copy a soft link - make sure you have a hard copy of the real file :-).

<app_version>
    <app_name>hsgamma_FGRPB1G</app_name>
    <version_num>118</version_num>
    <platform>x86_64-pc-linux-gnu</platform>
    <avg_ncpus>0.300000</avg_ncpus>
    <max_ncpus>1.000000</max_ncpus>
    <flops>137158360128.617355</flops>
    <plan_class>FGRPopencl1K-ati</plan_class>
    <api_version>7.3.0</api_version>
    <file_ref>
        <file_name>hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati</file_name>
        <main_program/>
    </file_ref>
    <coproc>
        <type>ATI</type>
        <count>0.500000</count>
    </coproc>
    <gpu_ram>629145600.000000</gpu_ram>
    <dont_throttle/>
</app_version>

<workunit>
    <name>LATeah1026L_196.0_0_0.0_35870583</name>
    <app_name>hsgamma_FGRPB1G</app_name>
    <version_num>118</version_num>
    <rsc_fpops_est>525000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>10500000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>450000000.000000</rsc_memory_bound>
    <rsc_disk_bound>20000000.000000</rsc_disk_bound>
    <command_line>
--inputfile LATeah1026L.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.090540e-07 --ldiBins 30 --f0start 188.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir JPLEPH.405 --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile templates_LATeah1026L_0196_35870583.dat --debug 1 --debugCommandLineMangling
    </command_line>
    <file_ref>
        <file_name>LATeah1026L.dat</file_name>
        <open_name>LATeah1026L.dat</open_name>
    </file_ref>
    <file_ref>
        <file_name>JPLEPH.405</file_name>
        <open_name>JPLEPH.405</open_name>
    </file_ref>
    <file_ref>
        <file_name>templates_LATeah1026L_0196_35870583.dat</file_name>
        <open_name>templates_LATeah1026L_0196_35870583.dat</open_name>
    </file_ref>
</workunit>

In particular, I thought it important to include a <workunit> example because of the need to include the correct template file (see above). As CPU tasks don't use templates and the template is specific to the task, you'll need to be careful with that.

The above example comes from a machine with a single RX 460 GPU so no device number. As I believe you are testing a single GPU, you won't need that option either.

Cheers,
Gary.

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7223004931

RAC: 975542

Juha wrote:Your concept won't

7 Oct 2018 2:19:42 UTC

Message 167159 in response to message 167156

Juha wrote:

Your concept won't work as is because the files in the slot directory are BOINC soft links.

Gary Roberts wrote:

When you save stuff for later reuse, don't copy a soft link - make sure you have a hard copy of the real file

UGH. Thank you very much. I've long ago seen things specifically carrying an lnk extension that were pointers to the real thing, but did not know that Windows File Explorer would mis-represent things this particular way.

My method, as I stated it, indeed could not work. It nevertheless may allow me a helpful cross-check to still look at a slot directory as it looked when active.

Off to the state file for me. I'm not used to delving around in there, and hoped to avoid it. Sigh.

And once again, thanks again to you both for stopping me in my tracks before I wasted more time going down my mis-conceived rat hole.

It took me so long to respond because I was off singing. Well, I only sang for about ten minutes, but the event was a fund-raiser for the group, so for a couple of hours I stood outside the door trying to guess who was a fellow chorus member who had forgotten to wear their badge, and who might be an "honored guest" (fund-raising target person) to whom I should give a friendly request for their name so I could try to find whether there was a prepared badge for them and supply it.

I'll get back to working on this not later than Monday. My units of interest will not expire by then. They are all Q units first issued on September 17, then re-issued October 1 when a first recipient failed to meet the reply deadline.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117619389620

RAC: 35214771

archae86 wrote:I've long ago

7 Oct 2018 9:04:08 UTC

Message 167163 in response to message 167159

archae86 wrote:

I've long ago seen things specifically carrying an lnk extension that were pointers to the real thing, but did not know that Windows File Explorer would mis-represent things this particular way.

It's not a matter of mis-representation - you could use soft links for what you need to do. It's just that there is no gain in doing so. If you were intending to process hundreds of tasks depending on certain static files, soft links would definitely save a lot of disk space. For your situation with a very small number of test tasks, you can afford to populate each work area with the real files. The problem you want to avoid is that if you create a new soft link pointing at a particular real file and BOINC comes along at a later stage and deletes the real file because BOINC has finished with it, you could end up with a link pointing at nothing. So, for safety, populate your test area with hard copies of all the real files needed - just in case.
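Populating a test area with hard copies could be scripted as in the sketch below. It rests on an assumption that is not stated in this thread: that a BOINC soft link in a slot directory is a small text file of the form `<soft_link>relative/path</soft_link>`, with the path relative to the slot directory. Check one of your own slot files in a text editor before trusting it. The function name is hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumed format of a BOINC link file: nothing but one <soft_link> element.
LINK_RE = re.compile(r"\s*<soft_link>(.*?)</soft_link>\s*", re.S)

def copy_real_files(slot_dir, dest_dir):
    """Copy slot_dir into dest_dir, replacing link files by the real files."""
    slot, dest = Path(slot_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for entry in slot.iterdir():
        if not entry.is_file():
            continue
        with entry.open("rb") as fh:
            head = fh.read(1024).decode("utf-8", errors="ignore")
        match = LINK_RE.fullmatch(head)
        # The link path is taken to be relative to the slot directory.
        source = slot / match.group(1).strip() if match else entry
        shutil.copy2(source, dest / entry.name)  # also follows OS symlinks
```

Because the copies are independent of the project directory, they survive BOINC deleting the originals once it has finished with them.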

archae86 wrote:

Off to the state file for me. I'm not used to delving around in there, and hoped to avoid it. Sigh.

As the stuff you are after isn't going to change at all while the tasks remain unprocessed, just take a full copy as it currently is and put that copy in a safe place. You can then use a plain text editor to copy and paste the bits you need from that copy once you have decided which particular tasks for both high-pay and low-pay are going to be your test tasks.

You can easily take your copy of the state file while BOINC is running. Once you have that copy, if no solution comes along before deadline, you can abort (at a time of your choosing) the tasks that would fail if allowed to run. Just make sure you have hard copies of all ancillary files well before that.

Good luck with everything and I sincerely hope a future driver update suddenly fixes the problem for you.

Cheers,
Gary.

ExtraTerrestria...

Joined: 10 Nov 04

Posts: 770

Credit: 578226874

RAC: 199674

Just a short comment: some

7 Oct 2018 21:03:58 UTC

Message 167172

Just a short comment: some time ago I did some tests on Intel GPUs for Einstein. It was rather simple in the end and basically what the others said: copying the files from a slot and having a cmd command to execute in stand-alone mode. A result file was generated if successful. I renamed this and could immediately run again and binary-compare the new result file. I'm not going to look for that folder, as it was BRP4 work, but just wanted to encourage you - once you get the command line, you're almost done :)
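The binary comparison of two result files needs nothing more than the Python standard library. A minimal sketch, with a made-up function name and whatever file names the renamed and fresh results happen to have:

```python
import filecmp

def same_result(reference_file, new_file):
    """True if the two result files are byte-for-byte identical."""
    # shallow=False forces a comparison of the contents, not just
    # the size and modification time.
    return filecmp.cmp(reference_file, new_file, shallow=False)
```

A byte-for-byte match is a stricter test than the project's validator would apply, so a mismatch is a reason to look closer rather than proof of a wrong result.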

MrS

Scanning for our furry friends since Jan 2002
