Multi-Directed Gravitational Wave Search

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 581914817
RAC: 138962


Zalster wrote:
Since I run 20 and 16 core machines, Einstein decides it wants to run as many as possible on those machines and I end up with almost 100% of all cores running. By using an app_config.xml I can turn down how many are running and the computer becomes usable again.

It's not Einstein that decides how many WUs to run; that's your BOINC setting. If you don't want to use all cores, tell BOINC so by setting "use at most xx % of CPUs" to something less than 100.
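If you'd rather pin this locally than via the website preferences, the same cap can go into a global_prefs_override.xml in the BOINC data directory. A minimal sketch (the 75 is just an example value):

   <global_preferences>
      <max_ncpus_pct>75.0</max_ncpus_pct>   <!-- use at most 75% of the CPUs -->
   </global_preferences>

Then use "Read local prefs file" in the BOINC Manager to apply it without restarting.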

Zalster wrote:

... but with the 50% cap in place my 16 core was running only 11 of 16 with 2 waiting to run (I allowed all cores to be used). I overcame this restriction by increasing the percentage of usable RAM to 90%.

But it does explain why I had numerous Parkes CUDA 55 tasks error out once the GW tasks started to crunch. They didn't have any free RAM to use for their calculations.

I had 1 computer reboot due to an unspecified error. Given that it was using a large amount of RAM plus another 7 GB for the system, I think it ran out of available RAM and forced a reboot.

Your observation of "11 of 16 cores occupied, 2 WUs waiting to run" tells you how BOINC handles this: as long as the Einstein devs tag the WUs with the correct memory requirement, BOINC is smart enough not to overcommit its allowed share of main memory. Instead, WUs are halted and wait to run.
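For reference, that allowed share is the "use at most X% of memory" preference. It can also be pinned locally in global_prefs_override.xml; a minimal sketch (the 90s are just example values matching what you set):

   <global_preferences>
      <ram_max_used_busy_pct>90.0</ram_max_used_busy_pct>   <!-- while the computer is in use -->
      <ram_max_used_idle_pct>90.0</ram_max_used_idle_pct>   <!-- while it is idle -->
   </global_preferences>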

If the system runs out of physical memory it begins to swap to disk. That's painfully slow, but it doesn't cause errors. The same goes for GPU tasks needing more memory. Granted, there could be some special cases (e.g. slow disk swapping causing the video driver to reset the GPU because it thinks it became unresponsive), but I suspect the most probable cause of errors upon a lack of RAM is people following the common advice to disable the page file completely if they have an SSD, "because the SSD is fast enough... and Windows' memory management is stupid anyway". Personally I think that is stupid advice, as it leads to crashes in exactly the situations you're describing, instead of simple swapping.

MarkJ wrote:
All my 4 core/8 thread i7s only have 8 GB. Sounds like it's time to upgrade their memory. In the meantime I suppose I can limit the number of concurrent tasks.

I don't think you'll need to limit the number of concurrent tasks (as long as the memory estimate of the tasks is correct, see the text above). When you've got other BOINC work in the queue I'd expect BOINC to start that instead of waiting for the memory. This hope would need validation, though.

MrS

Scanning for our furry friends since Jan 2002

rebirthman
Joined: 6 Jul 16
Posts: 2
Credit: 19589928
RAC: 0


Hello,

just to add another example, one which took quite long: 20 hours

https://einsteinathome.org/workunit/258366374

My wingman timed out without providing a response in time.

Also, there are only three days to crunch such a WU.

br Michael

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0


ExtraTerrestrial Apes wrote:
Zalster wrote:
Since I run 20 and 16 core machines, Einstein decides it wants to run as many as possible on those machines and I end up with almost 100% of all cores running. By using an app_config.xml I can turn down how many are running and the computer becomes usable again.

It's not Einstein that decides how many WUs to run; that's your BOINC setting. If you don't want to use all cores, tell BOINC so by setting "use at most xx % of CPUs" to something less than 100.

Zalster wrote:

... but with the 50% cap in place my 16 core was running only 11 of 16 with 2 waiting to run (I allowed all cores to be used). I overcame this restriction by increasing the percentage of usable RAM to 90%.

But it does explain why I had numerous Parkes CUDA 55 tasks error out once the GW tasks started to crunch. They didn't have any free RAM to use for their calculations.

I had 1 computer reboot due to an unspecified error. Given that it was using a large amount of RAM plus another 7 GB for the system, I think it ran out of available RAM and forced a reboot.

Your observation of "11 of 16 cores occupied, 2 WUs waiting to run" tells you how BOINC handles this: as long as the Einstein devs tag the WUs with the correct memory requirement, BOINC is smart enough not to overcommit its allowed share of main memory. Instead, WUs are halted and wait to run.

If the system runs out of physical memory it begins to swap to disk. That's painfully slow, but it doesn't cause errors. The same goes for GPU tasks needing more memory. Granted, there could be some special cases (e.g. slow disk swapping causing the video driver to reset the GPU because it thinks it became unresponsive), but I suspect the most probable cause of errors upon a lack of RAM is people following the common advice to disable the page file completely if they have an SSD, "because the SSD is fast enough... and Windows' memory management is stupid anyway". Personally I think that is stupid advice, as it leads to crashes in exactly the situations you're describing, instead of simple swapping.

MarkJ wrote:
All my 4 core/8 thread i7s only have 8 GB. Sounds like it's time to upgrade their memory. In the meantime I suppose I can limit the number of concurrent tasks.

I don't think you'll need to limit the number of concurrent tasks (as long as the memory estimate of the tasks is correct, see the text above). When you've got other BOINC work in the queue I'd expect BOINC to start that instead of waiting for the memory. This hope would need validation, though.

MrS

Hello MrS

I think you were misinterpreting my statement. That machine is a 16 core with 4 GPUs running 3 work units each (12 Parkes in total). The gravitational wave tasks don't start at 1.5 GB but work their way up to it. Once the system ran out of RAM, it starved the GPUs and their work units fell into error. After I placed a limit on how many CPU work units could run at once, I was able to free up more resources for the GPUs. Yes, I'm sure the system halted the GW tasks on the CPU, but that didn't help the GPUs.

Since trying to explain without graphics is difficult, let me show you what I mean. Here is another computer with 16 cores but only 2 GPUs. I've limited the number of CPU tasks to 10 of 16 and RAM is now 85% of 32 GB, but here is what you can see:

Using only 10 cores and supporting 2 GPUs with 3 work units apiece, it is still using 23.28 GB of RAM with only 3.12 GB left over. The 6 GW tasks waiting to run were suspended once I put the new restriction on the number of concurrent GW tasks into place.

In my other computers with 4 GPUs I have to decrease the number of CPU work units to about 8 in order to make sure there is enough RAM left over for anything else.

As you are aware, the GW tasks have varying RAM requirements, so it is prudent to configure for the maximum allotted.

The use of the "use at most xx % of CPUs" setting is an old argument that I've had too many times to count.

Telling the computer to use 75% of 16 cores would end up with 12 GW and 12 Parkes tasks all sharing 12 cores. I would prefer to have 10 GW and 12 Parkes tasks sharing all 16 cores. That way there is no starving of the GPUs.

Somewhere along the line, the idea arose that telling the computer to use only 75% of all cores would let the GPUs somehow utilize the remaining 25% of unused cores. I never understood that. In my observation, the CPU and GPU tasks both utilize only 75% of all cores and the other 25% are not touched. This approach leads to GPUs waiting for free cycles on cores that are busy with gravitational wave units running at near 100% (i.e. GPU starvation), which is why I prefer an app_config.xml with <max_concurrent> in place to allow all cores to be utilized while limiting how many instances of work are actually running (see the sketch below).
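For anyone who wants to copy this approach, here is a stripped-down sketch of such a file. The app name below is only a placeholder; take the real short name from client_state.xml or the task properties:

   <app_config>
      <app>
         <name>einstein_O1MD1</name>           <!-- placeholder: use the actual GW app name -->
         <max_concurrent>10</max_concurrent>   <!-- run at most 10 GW tasks at once -->
      </app>
   </app_config>

Save it as app_config.xml in the Einstein@Home project folder inside the BOINC data directory, then use "Read config files" in the BOINC Manager.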

As for that 1 computer that rebooted for an unspecified reason: I was able to trace it down to the Windows OS trying to force an update, and I've since corrected that problem. The other 3 computers experienced no reboots but had Parkes tasks that errored shortly after the GW tasks began their runs, before I made the change to the max % of RAM.

Either way, I think it presents a good discussion for people to review and decide how they wish to proceed.

Happy Crunching...

Zalster

Gaurav Khanna
Joined: 8 Nov 04
Posts: 42
Credit: 30729942165
RAC: 12035700


Christian Beer wrote:
We just started to distribute the first tasks for the next search for continuous gravitational waves on Einstein@Home. It works similarly to the recent all-sky search but also has some new features. Here is an overview:

Christian -- 

Is this a totally new app or the same as the previous einstein_O1AS20 apps? I'm wondering whether I would need to compile it or could use the previous O1 versions (for my unsupported platforms).

Thanks!

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7262741817
RAC: 1569117


Zalster wrote:

The use of the "use at most xx % of CPUs" setting is an old argument that I've had too many times to count.

Telling the computer to use 75% of 16 cores would end up with 12 GW and 12 Parkes tasks all sharing 12 cores. I would prefer to have 10 GW and 12 Parkes tasks sharing all 16 cores. That way there is no starving of the GPUs.

Somewhere along the line, the idea arose that telling the computer to use only 75% of all cores would let the GPUs somehow utilize the remaining 25% of unused cores. I never understood that. In my observation, the CPU and GPU tasks both utilize only 75% of all cores and the other 25% are not touched.

Your assertions do not match the behavior I have observed, nor my understanding of how these things work. My observations are limited to Windows systems, most recently Windows 7 and Windows 10.

The BOINC "use at most" restriction affects how many tasks it starts. It has no effect on core affinity, which in normal circumstances (absent some user intervention with Process Lasso, for example) means that the Windows scheduler is free to route all tasks to all cores as it wishes. The Windows 10 scheduler is noticeably "stickier" than the Windows 7 one, less inclined to move an application from one core to another many times per second for no reason, but neither generally leaves lots of cores idle for extended periods when there are more tasks than cores about (a common situation for users running multiple GPUs and multiple tasks per GPU).

I'm a happy user of the "Use at most" nn% "of the processors" setting.  I hope others will not be discouraged from using it where appropriate, which is my reason for adding this note.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0


Yes, I understand many believe as you do.  And many more will post here that their observations are similar to yours.

However, those are not my observations, nor those of many of my teammates, after testing both methods.

It is as I have stated, which is why we chose to use the app_config.xml to maximize our output.

Every person is going to have to make a decision on which method they believe is best for themselves.

mmonnin
Joined: 29 May 16
Posts: 292
Credit: 3444726540
RAC: 799183


So many here want to take the easy way out: use the website preferences rather than set up config files.

You could possibly increase the declared CPU usage of the GPU tasks so that BOINC limits CPU WUs to fewer than 16. For example, 0.33 CPUs per GPU task: with 12 GPU WUs running, that reserves about 4 cores and reduces the CPU WUs by 4 (see the sketch below).
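Something like this in an app_config.xml should do it; the app name is my guess at the Parkes app's short name, so check client_state.xml for the real one:

   <app_config>
      <app>
         <name>einsteinbinary_BRP6</name>   <!-- assumed short name of the Parkes PMPS XT app -->
         <gpu_versions>
            <gpu_usage>0.33</gpu_usage>     <!-- 3 tasks per GPU -->
            <cpu_usage>0.33</cpu_usage>     <!-- each GPU task budgets a third of a CPU -->
         </gpu_versions>
      </app>
   </app_config>

With 12 GPU tasks running, BOINC then budgets 12 x 0.33, roughly 4 CPUs, for GPU support and launches correspondingly fewer CPU tasks.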

I assumed the "limit # of CPUs" setting was only for CPU tasks, but I've only ever used it to keep 1 core free for FAH. Having multiple GPUs in one system running multiple WUs would make it easier to test, though.

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7262741817
RAC: 1569117


mmonnin wrote:
I assumed the "limit # of CPUs" setting was only for CPU tasks, but I've only ever used it to keep 1 core free for FAH. Having multiple GPUs in one system running multiple WUs would make it easier to test, though.

Nope. Assuming you are speaking of the "use at most nn% of the processors" setting, it actually limits the combined launch of CPU and GPU tasks, so that the estimated CPU consumption stays within your limit. It assumes the party line regarding the CPU used to support a GPU task, plus one full CPU for each CPU task. It limits nothing save the number of tasks launched. For extra fiddling credit you can change the party line regarding the CPU used to support a GPU task.
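To make the arithmetic concrete, with made-up numbers: on a 16-core machine with "use at most 50%" set, the budget is 16 x 0.5 = 8 CPUs. Six GPU tasks at a nominal 0.2 CPUs each (whatever the app actually declares) account for 1.2 of that, leaving 6.8, so BOINC launches at most 6 CPU tasks, since each one counts as a full CPU.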

It appears to me that the prioritization is to support GPU tasks. In other words, if you have six GPU tasks running and the total CPU use would be estimated to exceed your limit, the "solution" is to hold back a CPU task (or two, or three) before holding back GPU tasks.

But all the running tasks, both the pure BOINC CPU tasks and the CPU support task required by each BOINC GPU task, are unrestricted as to which core they run on.

At least that is the way it works on Windows. I don't speak Linux.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 192700679
RAC: 304130


Gaurav Khanna wrote:
Christian Beer wrote:
We just started to distribute the first tasks for the next search for continuous gravitational waves on Einstein@Home. It works similarly to the recent all-sky search but also has some new features. Here is an overview:

Christian -- 

Is this a totally new app or the same as the previous einstein_O1AS20 apps? I'm wondering whether I would need to compile it or could use the previous O1 versions (for my unsupported platforms).

Thanks!

Yes, it is a new app with some important changes compared to the O1AS20-100 app. You need to recompile. The SHA1 we used is 57193443c6c43373c6e0b72fff81e764ca3763dc.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 581914817
RAC: 138962

Zalster wrote:

Yes, I understand many believe as you do. And many more will post here that their observations are similar to yours.

However, those are not my observations, nor those of many of my teammates, after testing both methods.

It is as I have stated, which is why we chose to use the app_config.xml to maximize our output.

I don't understand how you reach that conclusion. Do you use an alternative BOINC client? Do you change core affinities? Do you have differing numbers of concurrent tasks in your comparison?

If all three answers are "no", I fail to see how the two methods in question differ. Both are just different ways to launch the same number of CPU & GPU tasks, if configured properly. Then the Windows / OS scheduler takes over and distributes those tasks over the cores according to their priorities.

How did you measure that productivity difference? When you speak about CPU tasks taking full cores and starving the GPUs, remember that the CPU portion of the BRP 1.57 GPU app hardly causes any CPU load, even if cores are available. Note that I'm not disputing that your method of limiting the maximum number of concurrent tasks works well; I just think you're fiddling with things more than necessary (and you seemed rather frustrated by that).

Regarding your other answers: OK!

MrS

Scanning for our furry friends since Jan 2002
