Gravitational Wave Engineering run on LIGO O1 Open Data

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5846

Credit: 109979507502

RAC: 29061074

22 Apr 2019 4:40:01 UTC

Message 170865 in response to message 170862

Sorry about the extra ]. I copied the example from the docs and edited out the optional bits that weren't needed. I missed one closing bracket. I've corrected it (in both messages) in case someone tries to use it as a template.

Thanks for the image. The 2nd line in the event log snip shows that the expected conditions from cc_config.xml seem to be detected properly. There are 4 things that occur to me that might be interfering with how this is supposed to work. These are just random thoughts, not necessarily 'smoking guns'.

1) There is a mention of ignoring an AMD GPU. Since you hadn't previously documented which computer and what type of GPU, I looked at your full list of 5 hosts. All those with crunching GPUs were listed as having a single nvidia GPU. That is why I said you shouldn't need a <type> entry. Does one of your machines have a 2nd AMD GPU that is not being listed on the computers list? If not, that comment seems a bit strange.
2) At the bottom, there is mention of an app_config.xml file. Do you have entries in that file that might be conflicting in some way with what you are now trying through cc_config.xml? Previously I warned about stuff remaining in the state file from a previous app_config.xml file. It would be worth browsing the state file and identifying what has been left from that file to see if there was anything that could possibly be conflicting with cc_config.xml. It should be easy to spot and pretty easy to edit out, if found.
3) Is it possible that having the same short name for both a CPU app and a GPU app is causing some sort of issue? I was just wondering if the code that implements this option might not like the same name being used for both. The project has chosen to do this so it's not something you can avoid. Someone familiar with the code would need to comment.
4) The scrollbar for the event log seems to show (I'm not familiar with Windows appearance, so I could be wrong) that this is a very small section at the bottom of a much larger log. In other words, this isn't the 'startup' messages but just what is produced after clicking 'reread config files' with a longer existing log already in place, perhaps? Is that what happened? In the background, the 'missing GPU' status for all the FGRPB1G tasks seems to imply that BOINC didn't detect the GPU during startup. Is it true to say that the GPU only became 'missing' as a result of rereading the config files? The docs mention that some options need a full restart of BOINC. If that wasn't done, maybe that might explain what is causing the strange behaviour.

If there isn't anything obvious from a user perspective that's preventing the cc_config option from working, you'll probably need to report what's happening as some sort of bug, over on the BOINC website. Maybe I'm misunderstanding the use of that option. If so, they should be able to clarify that over there.

Cheers,
Gary.

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 162

22 Apr 2019 5:04:10 UTC

Message 170866

1) I'm testing on the computer with a GTX 1080. None of my systems currently have an AMD GPU, and I don't think this system in particular has ever had one.

2) App_config has entries that are preventing the GW GPU work from running; they are left over from my initial attempts to keep it from loading. Since I have nothing in cc_config other than the exclude_gpu entry, I don't think there should be anything that could conflict.

<app>
    <name>einstein_O1OD1E</name>
    <max_concurrent>5</max_concurrent>
    <gpu_versions>
        <gpu_usage>99</gpu_usage>
        <cpu_usage>99</cpu_usage>
    </gpu_versions>
</app>
<app_version>
    <app_name>einstein_O1OD1E</app_name>
    <plan_class>GW-opencl-nvidia-V1</plan_class>
    <max_concurrent>0</max_concurrent>
    <avg_ncpus>99</avg_ncpus>
    <ngpus>99</ngpus>
</app_version>

4) Yes, that screenshot was taken at the tail end of a session which involved multiple attempts to load the config file. The GPU was detected at startup, permanently lost during my cc_config experimenting (not sure if due to your stray ], or to something related to experimenting with Zalster's version), and not seen again until after I restarted BOINC. The failure to request any new Fermi GPU tasks, and the client deciding that it needed to grab a task from a backup GPU project, both occurred after the client restart. (I did all my cc_config testing with the network disabled so that I could recover if I screwed something up badly enough that the client aborted my work queue.)

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5846

Credit: 109979507502

RAC: 29061074

22 Apr 2019 8:34:58 UTC

Message 170870 in response to message 170866

DanNeely wrote:

2) App_config has entries that are preventing the GW GPU work from running; they are left over from my initial attempts to keep it from loading. Since I have nothing in cc_config other than the exclude_gpu entry, I don't think there should be anything that could conflict.

I'm sorry, but I think you may well be seeing a conflict. When I said that I thought Zalster's suggestion should work, I meant in place of, rather than in cooperation with, an app_config.xml solution. I guess I didn't really make that very clear. There was no precedent to say that app_config should work - you had to resort to kludgy values to get something that seemed close to working.

Since both are messing with how a single GPU interacts with different GPU applications, I think that some sort of conflict is inevitable. I made a point of mentioning that stuff from app_config.xml lingers on in the state file and that you needed to get rid of that when transitioning to the cc_config solution. My intention was always to suggest trying just the cc_config solution on its own since the documentation allowed you to focus on a specific app name and exclude just that app from using the GPU. The more I think about what you are trying to do (I have not previously tried to exclude particular apps this way so have no experience to guide me) the more I think that the <exclude_gpu>, with its ability to focus on a single app name, is the way to go. If it doesn't work exactly as documented, we need to complain long and loud to the BOINC Devs.

Cheers,
Gary.

Mad_Max

Joined: 2 Jan 10

Posts: 153

Credit: 2140086148

RAC: 218721

23 Apr 2019 1:46:09 UTC

Message 170881 in response to message 170825

Your attempt to limit the exclusion to GW GPU tasks didn't work; it also showed GPU missing on my Fermi tasks, failed over to a backup project, and at some point in the process began aborting the Fermi GPU tasks (I managed to stop BOINC and revert the change before it took out more than 50 or 60 of them).

Strange. The <exclude_gpu> option works fine for me. The only noticeable difference from Zalster's variant is that I do not use a plan class. It is just this simple:

<cc_config>
    <options>
        <exclude_gpu>
            <url>einstein.phys.uwm.edu</url>
            <type>ATI</type>
            <app>einstein_O1OD1E</app>
        </exclude_gpu>
    </options>
</cc_config>

The <type> tag can also be omitted if there is only one type of GPU in the computer.

Note: the BOINC documentation states that a client restart is needed to apply these settings from cc_config. So maybe "read config files" from the menu is not enough.
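
Since a stray ] in an earlier example was one of the suspects in this thread, it may be worth sanity-checking cc_config.xml before restarting the client. Below is a minimal Python sketch of such a check, not an official BOINC tool. The function name check_cc_config and the set of allowed child tags are assumptions made for illustration (the tags are the ones used in this thread plus <device_num>); verify them against the BOINC client configuration docs.

```python
import xml.etree.ElementTree as ET

# Child tags of <exclude_gpu> as used in this thread, plus <device_num>.
# Treat this set as an assumption and check it against the BOINC docs.
ALLOWED_CHILDREN = {"url", "device_num", "type", "app"}


def check_cc_config(xml_text):
    """Return a list of problems found in the <exclude_gpu> entries."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    for entry in root.iter("exclude_gpu"):
        if entry.find("url") is None:
            problems.append("exclude_gpu has no <url>")
        # Text sitting directly inside <exclude_gpu> (e.g. a stray ']')
        # is still well-formed XML, so it has to be looked for explicitly.
        pieces = [entry.text] + [child.tail for child in entry]
        stray = "".join(p for p in pieces if p).strip()
        if stray:
            problems.append(f"stray text inside exclude_gpu: {stray!r}")
        for child in entry:
            if child.tag not in ALLOWED_CHILDREN:
                problems.append(f"unexpected tag <{child.tag}>")
    return problems


GOOD = """<cc_config>
  <options>
    <exclude_gpu>
      <url>einstein.phys.uwm.edu</url>
      <app>einstein_O1OD1E</app>
    </exclude_gpu>
  </options>
</cc_config>"""

# Same file with a leftover bracket after the <app> entry.
BAD = GOOD.replace("</app>", "</app>]")

print(check_cc_config(GOOD))
print(check_cc_config(BAD))
```

The check only looks at the file itself. It cannot tell whether the client has actually picked up the settings, so a full client restart is still needed afterwards.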

Zalster

Joined: 26 Nov 13

Posts: 3117

Credit: 4050672230

RAC: 0

23 Apr 2019 1:50:50 UTC

Message 170882 in response to message 170881

I would agree with Max on this. I was talking with someone else on a different project and they stated that a restart of the BOINC manager is required. Just telling BOINC manager to reread the config files will not work.

Mad_Max

Joined: 2 Jan 10

Posts: 153

Credit: 2140086148

RAC: 218721

25 Apr 2019 6:55:53 UTC

Message 170930 in response to message 170511

archae86 wrote:

Continuing observations on my Radeon VII running 0.11.

....................

Naturally running this task smashed my DCF up by over a factor of ten. As I've unsuspended the GRP work in my queue, that work is getting burned off in panic mode. It may be some time before I can try 2X or 3X on the GW work.

tolafoph wrote:

archae86 wrote:
Meanwhile, my primary host is indicated as having 29 days of work on board, as I unintentionally allowed more of the new GW work to download when a spate of running GRP had driven the completion estimates way back down. I'm afraid a mass abort is in my future but I currently plan to run pure GW GPU for another half day.

Yeah, the extremely different runtimes of 15 min vs 2 h are messing with the work I got. I almost ran out of tasks. I changed the work buffer from 0.25 to 0.5 days. But if I run only the 15 min tasks it might download way too many of the 2 h ones. So far, though, I haven't gotten any new GW units.

Well, now I have also run into huge swings in DCF and VERY incorrect run time projections. The DCF on one of my machines was pushed to x14, on others to about x6-10. CPU WUs are projected to run a few days each (while the actual run time is about 6-8 hours). This happens because the current FGRP and GW apps run at very different speeds.

There is a way to fix it, but it is vexing because it's not permanent.
- shut down BOINC
- open client_state.xml
- find the DCF in the E@H project section and reset it to the default (1): <duration_correction_factor>1</duration_correction_factor>
- find the sections for the apps you use, by name and version (for example <app_name>einstein_O1OD1E</app_name> and <version_num>13</version_num>) or by executable name (<file_name>einstein_O1OD1E_0.13_windows_x86_64__GW-opencl-ati-V1.exe</file_name>)
- correct the <flops>xxxxx</flops> value to a realistic number for your system and setup (taking into account the X factor, i.e. how many tasks run on the GPU at once).
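
For anyone who would rather script these steps than edit by hand, here is a rough Python sketch. It runs on a tiny made-up stand-in for client_state.xml, not on the real file. The function name patch_state, the layout of the sample, and the use of <master_url> to pick out the E@H project are all assumptions for illustration; only the tag names quoted in the steps above come from this post. On a real machine, shut BOINC down first and keep a backup copy of the state file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml. The real file
# holds far more; the layout shown here is an assumption.
SAMPLE_STATE = """<client_state>
  <project>
    <master_url>einstein.phys.uwm.edu</master_url>
    <duration_correction_factor>14.2</duration_correction_factor>
  </project>
  <app_version>
    <app_name>einstein_O1OD1E</app_name>
    <version_num>13</version_num>
    <flops>99000000000</flops>
  </app_version>
</client_state>"""


def patch_state(xml_text, project_hint, app_name, version_num, flops):
    """Reset the DCF to 1 for the matching project and set <flops> for one app version."""
    root = ET.fromstring(xml_text)

    # Reset the duration correction factor of the matching project only.
    for project in root.iter("project"):
        if project_hint in (project.findtext("master_url") or ""):
            dcf = project.find("duration_correction_factor")
            if dcf is not None:
                dcf.text = "1"

    # Set <flops> on the app version identified by name and version number.
    for app_version in root.iter("app_version"):
        if (app_version.findtext("app_name") == app_name
                and app_version.findtext("version_num") == str(version_num)):
            node = app_version.find("flops")
            if node is not None:
                node.text = str(int(flops))

    return ET.tostring(root, encoding="unicode")


patched = patch_state(SAMPLE_STATE, "einstein", "einstein_O1OD1E", 13, 13e9)
print(patched)
```

Be aware that round-tripping a file through ElementTree can change whitespace and drop comments, and the real client may be picky about the file format, so treat this purely as an outline of the edit.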

How to calculate flops.
Take the "size" of the WU (144,000 GFLOPs for current GW tasks, the same for CPU and GPU, and 525,000 GFLOPs for FGRPB WUs) and divide it by the actual average runtime, in seconds, that you see in practice for this type of work.
For example, with my runtimes: FGRP completes in ~20 min (1200 s), so 525000/1200 = 437.5 GFLOPS
<flops>437000000000</flops>

A GW task runs for ~3 hours (10800 s) on the GPU (when running 4 tasks per GPU), so 144000/10800 ≈ 13 GFLOPS
<flops>13000000000</flops>

A GW task runs for ~7 hours (25200 s) on the CPU (1 task per CPU core), so 144000/25200 ≈ 5.7 GFLOPS
<flops>5700000000</flops>
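
The same arithmetic as a small Python sketch, in case you want to plug in your own runtimes. The function name flops_value is made up for illustration; the task sizes and runtimes are the ones quoted above.

```python
def flops_value(task_size_gflop, runtime_seconds):
    """Rate in FLOPS: task size (in units of 1e9 floating-point operations) / runtime."""
    return task_size_gflop * 1e9 / runtime_seconds


GW_SIZE = 144_000     # "size" of a current GW task, as quoted above
FGRP_SIZE = 525_000   # "size" of an FGRPB task, as quoted above

examples = [
    ("FGRP on GPU, ~20 min", FGRP_SIZE, 20 * 60),
    ("GW on GPU, ~3 h at 4 tasks per GPU", GW_SIZE, 3 * 3600),
    ("GW on CPU, ~7 h", GW_SIZE, 7 * 3600),
]

for label, size, seconds in examples:
    # The <flops> entry expects a plain number of operations per second.
    print(f"{label}: <flops>{flops_value(size, seconds):.0f}</flops>")
```

The printed numbers are not rounded the way the hand-written values above are. That doesn't matter, since the runtimes are only rough averages anyway.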

After a restart you will get DCF=1, PERFECT run time projections for all types of tasks simultaneously, and a correct estimate of the cache size.

Unfortunately it is not a permanent fix. The <flops> setting can reset after some time (I'm not sure how often it happens) because it is intended to be set by the project staff. Eventually BOINC will pick up the incorrect values from the server again and the DCF swings will begin again - this is how the BOINC client tries to adapt to inadequate flops counts for apps.

E@H admins can alleviate this problem by adjusting the "speedup" factors of the apps (how fast a particular app is compared to the base CPU app). The current GPU GW app needs its speedup reduced by a factor of 2-3, while the CPU GW app (v.06, v.07) needs an increase by a factor of ~1.5.
But this will not fix it completely, as this server setting cannot take into account the differences between very different user systems or how many tasks are run in parallel. It can only set very rough estimates averaged over all users.

The real solution for this very old BOINC problem would be to modify the BOINC client to calculate and use the DCF on a per-app basis rather than per whole project. But for some reason the BOINC programmers aren't willing to implement such a useful feature...
