Help with Configuring multi core & multi GPU

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Zalster wrote:
Kind of weird if it turns out they are using a capital O instead of a 0 (zero) to designate a work unit.

Not if O1 stands for "Observation 1" or "Observation run 1". I seem to remember something along those lines from when they started the gravitational wave work again some months ago.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410447845
RAC: 35046630

Holmis wrote:
Zalster wrote:
Kind of weird if it turns out they are using a capital O instead of a 0 (zero) to designate a work unit.

Not if O1 stands for "Observation 1" or "Observation run 1"....

That's exactly what it does stand for -- the first observation run with the upgraded LIGO detectors.  The reason for O1 was explained in the first bullet point of the official announcement of the previous search.  Unfortunately as the current run is a test of what is coming, there hasn't been an official announcement of it yet.  Eventually, O1 will transition to O2 when the 2nd observation run finishes and the data gets prepared for processing.

If we go back to the OP's original question about how to configure Einstein for peak performance, I don't believe that using an app_config.xml file is the simplest answer or even necessary for that matter.  It's just an added complexity for him to worry about when things change in the future - as they certainly will.  The best advice for him is still what I first suggested:

Quote:
Your best bet (for starters) would be to completely delete your app_config.xml.

If he were then to (temporarily) set his preferences to NOT do CPU work at all, he could spend a day gathering data on GPU performance with (hopefully) just one GPU task running on each of the two GPUs that BOINC sees.  When he had enough results to get a good idea of the average, he could simply change the GPU utilization factor from 1 to 0.5 to see if he could run two tasks concurrently on each GPU and if that would improve the output.  Further decisions could then be made, depending on the results of that.
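
To make the comparison between running 1 and 2 tasks per GPU concrete, here is a minimal sketch of how the collected run times could be turned into a throughput figure. The function name and the sample run times are hypothetical, purely for illustration; real numbers would come from the host's task list on the website.

```python
from statistics import mean

def tasks_per_day(runtimes_s, concurrent_per_gpu, n_gpus=2):
    """Estimate completed GPU tasks per day from observed run times.

    runtimes_s: elapsed seconds of finished tasks in one configuration.
    concurrent_per_gpu: tasks sharing each GPU (1 for a GPU utilization
    factor of 1, 2 for a factor of 0.5).
    """
    seconds_per_day = 86400
    return n_gpus * concurrent_per_gpu * seconds_per_day / mean(runtimes_s)

if __name__ == "__main__":
    # Hypothetical run times in seconds, NOT measurements from this thread.
    one_at_a_time = [1500, 1520, 1480]
    two_at_a_time = [2600, 2650, 2550]
    print("1 task per GPU :", round(tasks_per_day(one_at_a_time, 1), 1))
    print("2 tasks per GPU:", round(tasks_per_day(two_at_a_time, 2), 1))
```

Running two tasks per GPU only pays off if each task takes less than twice as long as it does when running alone.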

There is one further (potential) complication resulting from the previous scatter-gun approach to using various parameters in multiple previous app_config.xml files.  The documentation says:-

Quote:
If you remove app_config.xml, or one of its entries, you must reset the project in order to restore the proper values.

I've never needed to do this even though I'm using this file extensively on multiple hosts.  I use it because of lack of locations (venues) in BOINC.  If I want to stop using it for any reason, I just put the parameters back to default values first and get those recognised before deleting the file.  I suspect (but have never tested) that removing a parameter without replacing it might leave the old value in the state file rather than removing it there.  If you replace a former value with the default value, at least that ensures the state file gets corrected and a full reset is unnecessary.

Cheers,
Gary.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

I understand your reluctance to having people use app_config but it really is an easier way for him to get what he wants.

Remember he has a 4 core CPU.  He wants to run on the dual GPU but doesn't want his CPU maxed out. He also wants to run CPU work units.

Sticking with your restriction of only using web based changes, that's never going to happen. He is either going to get 2 GPU work units running (with 50% CPU usage) or 4 GPU work units (with 100% CPU usage, which he doesn't want).

Or he gets 2 GPU work units and 2 CPU work units (also 100% CPU usage, which again he doesn't want)

So really, the app_config is the only way to get 2 GPU work units and 1 CPU work unit. Thereby using only 75% of his CPU.  
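
For readers who have not seen one, here is a minimal sketch of the kind of app_config.xml being discussed, generated with Python's standard library. The element names (`app`, `gpu_versions`, `gpu_usage`, `cpu_usage`, `max_concurrent`) are the ones described in the BOINC client configuration documentation. The app names passed in at the bottom are assumptions: `hsgamma_FGRPB1G` is believed to be the short name of the FGRPB1G GPU app, and `example_cpu_app` is a placeholder for whatever CPU app the host actually runs. Check the event log or client_state.xml for the real names before using anything like this.

```python
import xml.etree.ElementTree as ET

def build_app_config(gpu_app, cpu_app, gpu_usage=1.0, cpu_usage=1.0,
                     max_cpu_tasks=1):
    """Build an app_config tree: one GPU app entry, one CPU app entry."""
    root = ET.Element("app_config")

    # GPU app: gpu_usage 1.0 means one task per GPU, 0.5 means two per GPU.
    gpu = ET.SubElement(root, "app")
    ET.SubElement(gpu, "name").text = gpu_app
    versions = ET.SubElement(gpu, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)

    # CPU app: cap the number of tasks of this app running at once.
    cpu = ET.SubElement(root, "app")
    ET.SubElement(cpu, "name").text = cpu_app
    ET.SubElement(cpu, "max_concurrent").text = str(max_cpu_tasks)
    return root

if __name__ == "__main__":
    # Both app names are assumptions / placeholders, see the note above.
    tree = build_app_config("hsgamma_FGRPB1G", "example_cpu_app")
    ET.indent(tree)  # pretty-printing helper, needs Python 3.9 or later
    print(ET.tostring(tree, encoding="unicode"))
```

The output would be saved as app_config.xml in the Einstein project directory. With two GPUs, a `gpu_usage` of 1.0 and a `max_concurrent` of 1 for the CPU app, the intent is 2 GPU tasks plus 1 CPU task, but that is untested here.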

Since it's either 100% or 0% with CPU units currently, that doesn't leave any options for those willing to try and help out the GW runs.  Personally, I don't care for running my CPU at 100% all the time either (there was an example of what happens when you do that on Seti).

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410447845
RAC: 35046630

Zalster wrote:
Sticking with your restriction of only using web based changes, that's never going to happen. He is either going to get 2 GPU work units running (with 50% CPU usage) or 4 GPU work units (with 100% CPU usage, which he doesn't want).

I said "(temporarily)" in an attempt to suggest that there were more courses of action to explore after he had worked out for himself whether he wanted to run 2 GPU tasks or 4 GPU tasks.  It's good to see (GPU-wise) what output can be achieved and how this might affect normal usage before making a final decision.  Once he has that information, if the decision is to stay with 2 GPU tasks and have 1 CPU task with 1 core kept 'free' for other purposes, he can easily do that without needing an app_config.xml.

All he needs to do is set the GPU utilization factor back to default (1), the BOINC preferences to use 25% of CPU cores, and then restore the setting to allow BOINC to do CPU work.  There are multiple benefits of doing it that way rather than using app_config.xml.  An often overlooked benefit is that BOINC will not overfetch CPU work seeing as it knows it is being restricted to just one CPU core for CPU tasks.

Zalster wrote:
So really, the app_config is the only way to get 2 GPU work units and 1 CPU work unit. Thereby using only 75% of his CPU.

Unfortunately, the "only way" statement is simply not correct.

 

Cheers,
Gary.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Gary Roberts wrote:
Zalster wrote:
Sticking with your restriction of only using web based changes, that's never going to happen. He is either going to get 2 GPU work units running (with 50% CPU usage) or 4 GPU work units (with 100% CPU usage, which he doesn't want).

 the BOINC preferences to use 25% of CPU cores

Zalster wrote:
So really, the app_config is the only way to get 2 GPU work units and 1 CPU work unit. Thereby using only 75% of his CPU.

Unfortunately, the "only way" statement is simply not correct.

 

And I said we are never going to agree to your statement about computer based preferences on percentage usage since you think it means one thing and I say it means another.

So in that regard my statement is true.

Dougga
Joined: 27 Nov 06
Posts: 27
Credit: 24844941
RAC: 0

I believe I tried without an app_config file and I believe I was frequently only using 1 GPU.  That should be a long way from optimal given my over 3000 cores between the 2 GPUs.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410447845
RAC: 35046630

Zalster wrote:

And I said we are never going to agree to your statement about computer based preferences on percentage usage since you think it means one thing and I say it means another.

So in that regard my statement is true.

I have quite a number of quad core machines (Q6600, Q8400) that run 2 GPU tasks plus 1 CPU task.  They use the %cores setting in order to have that one CPU task instead of more.  However, I made a mistake in specifying 25% for the OP's situation.  It should have been 75%.  If the OP uses a GPU utilization factor of 1.0 and sets the BOINC CPU pref setting to 75% he will get 2 GPU tasks and 1 CPU task running concurrently - no app_config.xml needed.  However, as I've now realised, there is a 'gotcha'.  What I've said will work correctly, provided that an app_config.xml hasn't been used previously, or if it has, the project has been reset.

I decided to try to resolve through experiment why we see different behaviours.  I set up a test machine to explore all the permutations and hopefully throw some light on what was happening.  It was a machine that had previously been crunching (last crunching was 5th Dec 2016) and had been using app_config.xml.  I started it up again yesterday but the fairly lengthy outage of GPU tasks later that day meant that I couldn't do all the experiments until now.

You are quite correct to say that you can't get GPU utilization factor (GUF) to work properly on a machine that has been using app_config.xml.  I was wrong to suggest that restoring the parameters to default values first before deleting app_config.xml might fix the problem.  When I tried the experiments on the test machine, I could control the number of cores crunching CPU tasks with the %cores preference but I couldn't get 2 GPU tasks crunching by setting the GUF to 0.5, even after several new work downloads (just before the tasks to send went to zero, fortunately).  It would stubbornly run just 1 GPU task accompanied by 1 less CPU task than the %cores setting was set for.  If I set 50% I would get 1 CPU task, etc.  When I gave up last night, I had exhausted the supply of GPU tasks I had started the day with and there were no more to be had.

Today, there are plenty of tasks.  So I've used the 'reset the project' option which pretty much (except for the basic <project> block) cleans out everything to do with Einstein in the state file.  Interestingly, it didn't clean out 'old' data files and executables in the project directory - just the current stuff.  By hand, I removed all the old executables, going back to S5R6 days and all the old data (eg O1AS data) seeing as there were absolutely no <file> or <file_info> blocks hanging around to cause any of this to be downloaded again.  I then restored all the current stuff so BOINC would find it all and not have to go through a fresh lot of downloads.

The machine is now happily running again with a GUF of 0.5 and 75% cores and is crunching 2 GPU tasks and 1 CPU task as it should be.  If I reduce the setting to 50% cores, the CPU task stops crunching.  If I reduce it to 25%, both GPU tasks continue to crunch.  I had wondered if this BOINC setting might cause one of the GPU tasks to stop crunching but it didn't.
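
The task counts reported in this thread can be reproduced with a rough back-of-the-envelope model, sketched below. This is NOT the BOINC scheduler, just arithmetic under stated assumptions: each GPU task reserves one full CPU core (the `cpu_per_gpu_task` default), GPU tasks are never held back by the %cores preference (as observed above at 25%), and CPU tasks get whatever allowed cores are left over. The function and parameter names are made up for illustration.

```python
import math

def expected_tasks(n_cores, pct_cores, n_gpus, guf, cpu_per_gpu_task=1.0):
    """Return (gpu_tasks, cpu_tasks) under a simplified model.

    pct_cores is the "use at most N% of the CPUs" preference and guf is
    the GPU utilization factor (1.0 = one task per GPU, 0.5 = two).
    """
    usable_cores = math.floor(n_cores * pct_cores / 100)
    gpu_tasks = n_gpus * math.floor(1 / guf)
    # Cores set aside to support the GPU tasks (an assumption of the model).
    reserved = math.ceil(gpu_tasks * cpu_per_gpu_task)
    cpu_tasks = max(0, usable_cores - reserved)
    return gpu_tasks, cpu_tasks

if __name__ == "__main__":
    # OP's host: 4 cores, 2 GPUs as seen by BOINC.
    print(expected_tasks(4, 75, 2, 1.0))   # (2, 1)
    print(expected_tasks(4, 50, 2, 1.0))   # (2, 0)
    print(expected_tasks(4, 100, 2, 0.5))  # (4, 0)
```

The first line is the 2 GPU + 1 CPU arrangement the OP wanted, and the last is the "4 GPU work units with 100% CPU usage" case mentioned earlier in the thread.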

So you can indeed do what the OP was wanting to achieve without using app_config.xml.  It's also important to realise that if you have been using app_config.xml and you want to do things differently, you will need to reset the project and not just simply delete the app_config.xml file.  This may well become an issue in the future when different searches and new apps come into play.  Having done the reset, the GUF seems to work as expected without further issue.

 

Cheers,
Gary.

Dougga
Joined: 27 Nov 06
Posts: 27
Credit: 24844941
RAC: 0

I've been fiddling with the app_config file a bit more and discovered something interesting.

 

Recall I have an Nvidia GTX 690 which is actually two GTX 680s on one card.

One of you pointed out that if I had 4GB of memory on the card, I could in theory run 4 GPU work units.

I tried this while running Ubuntu 17.04 and it corrupted work units, usually in a matter of 10-15 seconds of work.  This would churn through my work units and start downloading more.  Quite the mess.

I booted to Windows 7/x64 and found that it would in fact work.  I suspect my average work units in this configuration will go sky high.  So next I returned to Linux.  I dropped the GUI (LightDM) and found that it would work without the graphical front end, so it seems there's something in LightDM that is interfering with the operation of BOINC for Linux.  Returning to the GUI with this configuration, I find that boinc and boincmgr don't seem to work as advertised.  The CPUs show that they are crunching on the GPU work units but the fan on my GPU suggests that it's only the CPUs that are hard at work.  Rebooting the system returns things to the situation where all work units are corrupted within 15 seconds.

This appears to be a bug, but it's not clear to me if this is an Einstein or Boinc bug.  Thoughts?

 

Cheers,

Doug

 

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410447845
RAC: 35046630

doug_20 wrote:
One of you pointed out that if I had 4GB of memory on the card, I could in theory run 4 GPU work units.

I certainly mentioned that option.  The downside is that all 4 CPU cores would be needed for CPU support so you wouldn't be running any CPU tasks.  Also you would need to assess whether that impacted on other things you might want to run on that machine.

doug_20 wrote:
I booted to Windows 7/x64 and found that it would in fact work.  I suspect my average work units in this configuration will go sky high.

I would be quite surprised if "sky high" is the correct description :-).  I would imagine a more modest improvement in GPU throughput (maybe 10-30% if lucky) with a somewhat higher power consumption.  I don't have any NVIDIA GPUs capable of running tasks 2x so I cannot comment.  If you run a number of tasks under Windows, you should get a pretty good idea of what's possible.

doug_20 wrote:
So next I returned to Linux.  I dropped the GUI (LightDM) and found that it would work without the graphical front end, so it seems there's something in LightDM that is interfering with the operation of BOINC for Linux.  Returning to the GUI with this configuration, I find that boinc and boincmgr don't seem to work as advertised.  The CPUs show that they are crunching on the GPU work units but the fan on my GPU suggests that it's only the CPUs that are hard at work.  So do one of you know how to file a bug with boinc?

It's not a BOINC bug - it's the way you have your system configured.  X needs to be running and you need the NVIDIA drivers and OpenCL libs correctly installed.  If you think it's the DE (desktop environment) then you perhaps need to try alternate DEs.  I know nothing about Ubuntu or its various DEs so I can't offer any advice.  Maybe someone using Ubuntu can help with that.

If 4 tasks run correctly under Windows, you should be able to do the same under Linux.  The first thing you should do is run a decent test under Windows to make sure there is a worthwhile improvement and that all the tasks are correctly validating.  While running the test you should use your machine as you would normally to make sure you are happy with its continuing usability.  You should also check for any instability/crashes which might point to potential power/heat issues.

 

Cheers,
Gary.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

There is not a lot of log diagnostic information in the thread so I'm guessing.

Could @dougga confirm these are the hosts we are talking about?

 https://einsteinathome.org/host/12511586 and https://einsteinathome.org/host/12505828

X does not need to be running for GPU crunching with nVidia - that's an AMD fglrx dependency, although I recall there is some workaround (hack) for that.  That said, there is no good reason for us mortals not to run a GUI these days!

Disable SLI (see @dvdl above) - this will be part of the problem causing error tasks.  You'll need to do some research on this for the 690.  I can't recall exactly why this causes a problem with GPU crunching, but it has a history of application errors and card identification problems (BOINC sees two cards, probably incorrectly - google opencl nvidia sli).  In any case, SLI, if working, would be slower than one task per GPU if memory had to be accessed across an SLI bridge.  Try Chapter 28 near here: http://uk.download.nvidia.com/XFree86/Linux-x86_64/375.66/README/index.html

After disabling SLI you should have two working cards - but the FGRPB1G GPU applications are large, needing ~0.9GB per app.  If you are running large monitors with X, that will consume a lot of VRAM and you will only be able to run one app.

I would also remove the app_config and see how it behaves in the no-SLI state.

Please be precise about such things as "dropping the GUI" - how exactly did you do this?

Each of the cards has 2GB, so 4 tasks might run, but expect video issues (tearing and freezing) when running 4.
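
The memory arithmetic behind that estimate can be sketched as below. The 2GB per GPU and ~0.9GB per task figures are the ones quoted above; the 0.5GB display figure in the second example is purely hypothetical, since the real VRAM used by X and the monitors depends on the setup.

```python
import math

def tasks_per_gpu(vram_gb, per_task_gb=0.9, display_gb=0.0):
    """How many tasks fit in one GPU's memory, by simple division.

    display_gb is VRAM already taken by the desktop (an assumed value,
    measure it on the real system).
    """
    free_gb = vram_gb - display_gb
    if free_gb <= 0:
        return 0
    return math.floor(free_gb / per_task_gb)

if __name__ == "__main__":
    print(tasks_per_gpu(2.0))                  # 2 per GPU, so 4 on a 690
    print(tasks_per_gpu(2.0, display_gb=0.5))  # 1 on the GPU driving a display
```

This ignores any driver overhead, so treat the result as an upper bound rather than a guarantee.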

Please post

clinfo output (for each of N tasks running) , and

event log output with the coproc_debug flag set (the scheduling flags might show some info as well), to get better insight.
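
In case it helps, here is a minimal sketch of producing a cc_config.xml that turns those log flags on. The `cc_config` / `log_flags` layout and the flag names `coproc_debug`, `cpu_sched_debug` and `sched_op_debug` are taken from the BOINC client configuration documentation; the choice of which scheduling flags to include is my assumption about what was meant above. If a cc_config.xml already exists on the host, merge the flags into it by hand rather than overwriting it.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Build a cc_config tree with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for flag in flags:
        ET.SubElement(log_flags, flag).text = "1"
    return root

if __name__ == "__main__":
    tree = build_cc_config(["coproc_debug", "cpu_sched_debug", "sched_op_debug"])
    ET.indent(tree)  # pretty-printing helper, needs Python 3.9 or later
    print(ET.tostring(tree, encoding="unicode"))
```

The file goes in the BOINC data directory, and the client should pick it up after a restart or after re-reading the config files from the manager.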

good luck.
