Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117574819903
RAC: 35234652

Hi Sam,
Welcome to the Einstein forums!

sam6861 wrote:
Ever since the memory estimation fix a week ago, my computer with multiple GPUs have stopped receiving bigger then 2GB O2MDF tasks for AMD Vega 11 (12GB ?) and AMD RX 5500 XT 8GB.

Since it was announced in the Technical News thread that the run has finished, there are no further primary tasks to be had.  There will continue to be 'resends' - copies of tasks that have failed on other machines but the availability of those will probably decline rather quickly.  I suspect they may be sending the dregs to the Atlas supercomputer to get a quick windup.

It's wrong to think of there being 2GB or 4GB tasks.  There is a fairly continuous range of memory requirements, not just two particular values.  A given series of tasks is identified by the series frequency at the start of the task name, and there are at least 400 to 500 tasks belonging to each series.  Each of those has an "issue number", which is the 2nd last field in the task name.  The very last field is the copy number - _0 and _1 are the two original copies and anything higher denotes a 'resend' - a copy of an original task that has failed on another computer.
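
As a rough illustration of that naming scheme, the last two underscore-separated fields can be pulled off a task name like this.  The sample name is made up for the example, and parse_task_name is a hypothetical helper, not part of any BOINC or Einstein@Home tool.

```python
from typing import NamedTuple


class TaskFields(NamedTuple):
    series: str        # everything before the last two fields
    issue_number: int  # 2nd last field of the task name
    copy_number: int   # very last field of the task name
    is_resend: bool    # True when the copy number is above 1


def parse_task_name(name: str) -> TaskFields:
    """Split a task name into its series part, issue number and copy number."""
    series, issue, copy = name.rsplit("_", 2)
    copy_number = int(copy)
    # _0 and _1 are the two original copies; anything higher is a resend.
    return TaskFields(series, int(issue), copy_number, copy_number > 1)


# Hypothetical task name, for illustration only.
fields = parse_task_name("h1_0500.00_EXAMPLE_500.25Hz_412_2")
print(fields.issue_number, fields.copy_number, fields.is_resend)
```

Sorting your own task list by issue_number this way should make it easy to see the scheduler working its way down towards zero.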

The thing to understand is that the scheduler always starts issuing tasks for any given series at the very highest issue number, which also happens to have the highest memory requirement.  If tasks fail, the scheduler tries to resend them to hosts that already have the required data files.  There's no particular sequence for issuing resends.  If a host comes along with the correct data, it will get the resend.

As the whole run was finishing for the time frame you mention, it's hardly surprising that all the high issue numbers had already been issued and the scheduler was working its way down to zero.  If you look at your list of tasks, you can see this happening with the last tasks that you did receive.

So, nothing at all suspicious with getting low memory, low issue number tasks near the end of the run.

sam6861 wrote:

.... it appears the project server refused to send any 4GB O2MDF tasks to my AMD GPU. All because of NVidia GT 1030 2GB in the computer, but I have Nvidia disabled in project settings.

Maybe fix it to only look at AMD GPU RAM for AMD tasks, and look at NVidia RAM for NVidia tasks ...

As well as memory not being fixed at two specific values, there are no such things as AMD tasks and nvidia tasks.  All tasks can be crunched on either brand of GPU.  The scheduler didn't refuse to send you AMD tasks because of your GT 1030.  It can't send what it doesn't have.  For the data files you already had, it would send tasks for that data.  If it only has low issue numbers then that's what you'll get - until they run out too.

If you want to stop BOINC from 'seeing' your GT 1030, all you need to do is put an instruction in the client configuration file, cc_config.xml.  As an example, the following line will ignore the first nvidia GPU in the system.

<ignore_nvidia_dev>0</ignore_nvidia_dev>

There is full documentation about this here.
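
For context, that line goes inside the options section of cc_config.xml.  The sketch below builds a minimal file with that one setting so the nesting is clear.  build_cc_config is a hypothetical helper written for this post, and a real cc_config.xml will usually carry other options too, so treat it as an illustration rather than something to paste over an existing file.

```python
import xml.etree.ElementTree as ET


def build_cc_config(ignored_nvidia_devices: list[int]) -> str:
    """Return the text of a minimal cc_config.xml ignoring the given nvidia GPUs."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for device in ignored_nvidia_devices:
        # One <ignore_nvidia_dev> element per device number, counting from 0.
        ET.SubElement(options, "ignore_nvidia_dev").text = str(device)
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


print(build_cc_config([0]))
```

As far as I know the client only detects GPUs at startup, so expect to restart it before a change like this takes effect.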

Cheers,
Gary.

Tom M
Joined: 2 Feb 06
Posts: 6441
Credit: 9571507114
RAC: 8344726

I think this guy's AMD 3900X is running very badly.  Any ideas?

https://einsteinathome.org/host/12836182

I expect he is running the Ubuntu distro version of the Boinc Manager.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)  I want some more patience. RIGHT NOW!

GWGeorge007
Joined: 8 Jan 18
Posts: 3062
Credit: 4966307686
RAC: 1412829

Personally I think his/her system is way under utilized with the graphics card in use combined with an AMD 3900X.

NVIDIA GeForce GTX 1050 (1997MB) driver: 440.64

Regardless of the distro version they're running, this is an older card with only 2 GB of memory, so it can't actually do much in the way of GPU calculations in the time that we are accustomed to.  Plus, if they have their system set up to run only a few cores, and/or only at certain times of the day (or night), or set to download only a few tasks at a time, it won't show much credit.

George

Proud member of the Old Farts Association

hoarfrost
Joined: 9 Feb 05
Posts: 207
Credit: 103053349
RAC: 0

Hi folks! :) I use the CPU only. Comparing with other Ryzens is problematic because the hosts in the TOP50 are involved in the GPU rush, and I cannot see the results of the wingmen for my tasks.

Tom M
Joined: 2 Feb 06
Posts: 6441
Credit: 9571507114
RAC: 8344726

@Hoarfrost, it looks like your tasks are not being sent out again.  So you are not getting any comparisons.

I am running an AMD 3900X under Linux, and its CPU tasks for Gravitational Wave are here: https://einsteinathome.org/host/12845190/tasks/0/53

I will admit I am not running those cpu tasks right now.  But the results I have are significantly faster than your results.  So I am pretty confident "something is not right".

Notes

I use Boinc Manager but you can also use the website for these settings:

1) Use 87.5% of the 24 available CPU threads (that is 21 threads).

Experience shows higher production if you don't use all the available cpu threads leaving a few available for non-BOINC work.  This is true for most systems most of the time.  There have been a few systems where it has not been true.

2) To see how fast your CPU is running, you can use this in a terminal session (it should work on a standard Ubuntu install, I think):  watch -n1 "grep '^cpu MHz' /proc/cpuinfo"

3) If your cpu is running "too hot" it can be temperature throttling.  Depending on your motherboard this might help get at the temperature.  From a terminal session:

On X570 Mb:
sudo modprobe nct6775
sensors
psensor
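
If you would rather log the clocks than watch them scroll past, here is a minimal Python sketch of the same check as in 2).  It assumes an x86 Linux system where /proc/cpuinfo has "cpu MHz" lines, and the function names are made up for this example.

```python
import re
from pathlib import Path


def core_clocks_mhz(cpuinfo_text: str) -> list[float]:
    """Return the 'cpu MHz' value of every logical core found in the text."""
    matches = re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, flags=re.MULTILINE)
    return [float(value) for value in matches]


def summarize(clocks: list[float]) -> tuple[float, float, float]:
    """Return (lowest, highest, average) clock in MHz."""
    return min(clocks), max(clocks), sum(clocks) / len(clocks)


cpuinfo = Path("/proc/cpuinfo")
clocks = core_clocks_mhz(cpuinfo.read_text()) if cpuinfo.exists() else []
if clocks:
    low, high, mean = summarize(clocks)
    print(f"{len(clocks)} cores: min {low:.0f} MHz, max {high:.0f} MHz, mean {mean:.0f} MHz")
else:
    print("No 'cpu MHz' lines found (not an x86 Linux system?)")
```

A large gap between min and max, or a mean well below the rated clock while BOINC is loaded, would point towards throttling.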

--------------------------------------------------------

These are all the diagnosis tools I can think of right now.

Hope this helps.

Tom M

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18720722538
RAC: 6410928

First thing is to look at your temps, since it looks like some serious thermal throttling is taking place.  Your runtimes for the CPU GW tasks are way out of whack.

Or the memory access or speed is way out of whack like running at JEDEC 2133 speed or something.

The GW CPU tasks do a lot of memory transfers so that is affected by memory speed.

While running your BOINC loads, look at the k10temp sensor output with the sensors command.  Are you running hot, at the 95 °C thermal limit?  Use the core clock readout command mentioned to see what clocks your cores are running at.  Are they varying wildly or are they abnormally low?

 

hoarfrost
Joined: 9 Feb 05
Posts: 207
Credit: 103053349
RAC: 0

Currently I run WCG and Einstein@Home on all 24 threads, as usual. "sensors" reports:

Tdie:         +65.8°C  (high = +70.0°C)
Tctl:         +65.8°C  

lscpu, cpuinfo and cpufreq-info show the current frequency as 4.00 GHz:

CPU MHz:             4000.196

...

cpu MHz        : 4000.207

...
cpufreq stats: 4.00 GHz:100,00%, 2.80 GHz:0,00%, 2.20 GHz:0,00%

 

But I clearly remember that several weeks ago my CPU completed Einstein@Home GW workunits in ~10 - 12 hours, and WCG Microbiome Immunity Project tasks in ~1.5 hours. Now tasks of both projects take ~3 times longer! I set the cpufreq utils governor to "performance", but this did not affect the crunch time.

I have begun investigating the UEFI power management settings...

Thank you!

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18720722538
RAC: 6410928

It is generally considered bad form to utilize 100% of a processor for external computing.  You need to leave at least one or two threads free for the OS to manage itself and for desktop environment housekeeping.

You are overcommitted and that would explain why you have such a large differential between run_times and cpu_times.

Not enough timeslices to support the science app before they get pulled away to do other things.
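
To put a number on that differential, a rough sketch like the one below compares the two values for a single task.  It assumes a single-threaded CPU task, the figures are made up for illustration, and cpu_efficiency is a hypothetical helper, not anything BOINC reports itself.

```python
def cpu_efficiency(run_time_s: float, cpu_time_s: float) -> float:
    """Fraction of the wall-clock run time the task actually spent on a core."""
    if run_time_s <= 0:
        raise ValueError("run_time_s must be positive")
    return cpu_time_s / run_time_s


# Made-up numbers for illustration: 10 h of run time, 4 h of CPU time.
ratio = cpu_efficiency(run_time_s=36_000, cpu_time_s=14_400)
print(f"task was on a core {ratio:.0%} of the time")
```

On a host that is not overcommitted the two times should be close to each other, so the ratio should sit near 100%.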

 

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250470891
RAC: 35329

There is now a new Beta test app version 2.09 which should correctly free GPU resources when it is suspended and not kept in memory (it was reported somewhere that the current one doesn't, though atm I can't find the thread).

BM

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Bernd Machenschalk wrote:
There is now a new Beta test app version 2.09

I'm receiving that only for AMD cards (for about 10 hours now). My Nvidia cards still get only 2.07. 

At https://einsteinathome.org/apps.php it looks like there is a 2.09 for Nvidia as well. Both my AMD and Nvidia hosts share the same venue and settings. Am I missing something?
