Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433667954
RAC: 585389

I don't think that's the way it works. 

My hypothesis is that you download a data set, then download a series of templates which apply a sequential set of algorithms: a series of low-frequency, mid-frequency and high-frequency ones. Each series has increasing memory demands. It has been documented that the high-frequency tasks are the problem children. When completed, the process repeats on the next bit of data.

Methinks you would never get a mix of task sizes. The template determines the task size, not the data per se.

Only 1 template runs at a time; 1X, 2X etc. are the number of data points running at a time.

This being a hypothesis, I could be totally wrong, but it is consistent with my observations.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109973219655
RAC: 29806268

Peter Hucker wrote:
The chances of having two large WUs at once are low ...

Unfortunately, that's a particularly bad recommendation which is at odds with what really happens.

By using locality scheduling, the scheduler tries very hard to send tasks belonging to the same 'frequency bin' in order to economise on the massive data file downloads that are otherwise required.  The 2nd last field in a task name is a 'sequence number' for otherwise similar tasks belonging to a particular frequency bin and therefore using already downloaded data.

The scheduler sends tasks with consecutively decreasing sequence numbers.  Without considering resends, the most likely event is that a subsequent task in the current series will have a lower (but quite close) sequence number.  This does depend on how many hosts are drawing from the same frequency bin.  At the moment that seems to be rather few hosts per bin, so groups of close sequence numbers are quite common.

Since there are many adjacent sequence numbers having the same memory requirements, the highest chance is that your next task will have the same memory requirement as the previous one.  So, if the current task failed, you're likely to get more of the same.  By the same token, if the current task succeeded, so will all the subsequent tasks for the same frequency bin that have a lower sequence number - all the way down to zero, which will be the last in that decreasing sequence.
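The sequence-number idea can be illustrated with a minimal sketch. It assumes only what is stated above: task names are made of fields, and the 2nd last field is the sequence number. The underscore separator, the helper names and the task names themselves are made up for illustration and are not real work unit names.

```python
def sequence_number(task_name: str) -> int:
    """Return the 2nd last underscore-separated field as an integer.

    Assumption: fields are separated by underscores.
    """
    fields = task_name.split("_")
    return int(fields[-2])


def same_bin(task_a: str, task_b: str) -> bool:
    """Hypothetical check: two tasks share a frequency bin if everything
    before the last two fields is identical."""
    return task_a.split("_")[:-2] == task_b.split("_")[:-2]


# Made-up task names, for illustration only.
names = [
    "h1_0500.00_EXAMPLE_500.10Hz_40_0",
    "h1_0500.00_EXAMPLE_500.10Hz_42_0",
    "h1_0500.00_EXAMPLE_500.10Hz_41_1",
]

# Decreasing sequence numbers, matching the order described above.
send_order = sorted(names, key=sequence_number, reverse=True)
for name in send_order:
    print(sequence_number(name), name)
```

Tasks that pass `same_bin` and have nearby sequence numbers are the ones expected to share a memory requirement.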

When considering resends (there are lots at the moment) the most likely scenario is that they will represent failed tasks from other hosts where a likely failure mode will have been insufficient memory.  So, good luck with those.

I'm currently preparing a guide to help understand the behaviour of the current VelaJr tasks.  I hope to have it ready to post shortly.

Cheers,
Gary.

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433667954
RAC: 585389

Gary thanx for the input.  Now I better understand what is going on and I shall continue to run 2X until something changes, which it most certainly will. 

Mr P Hucker
Joined: 12 Aug 06
Posts: 819
Credit: 481409887
RAC: 1677

Gary Roberts wrote:

Peter Hucker wrote:
The chances of having two large WUs at once are low ...

Unfortunately, that's a particularly bad recommendation which is at odds with what really happens.

By using locality scheduling, the scheduler tries very hard to send tasks belonging to the same 'frequency bin' in order to economise on the massive data file downloads that are otherwise required.  The 2nd last field in a task name is a 'sequence number' for otherwise similar tasks belonging to a particular frequency bin and therefore using already downloaded data.

The scheduler sends tasks with consecutively decreasing sequence numbers.  Without considering resends, the most likely event is that a subsequent task in the current series will have a lower (but quite close) sequence number.  This does depend on how many hosts are drawing from the same frequency bin.  At the moment that seems to be rather few hosts per bin, so groups of close sequence numbers are quite common.

Since there are many adjacent sequence numbers having the same memory requirements, the highest chance is that your next task will have the same memory requirement as the previous one.  So, if the current task failed, you're likely to get more of the same.  By the same token, if the current task succeeded, so will all the subsequent tasks for the same frequency bin that have a lower sequence number - all the way down to zero, which will be the last in that decreasing sequence.

When considering resends (there are lots at the moment) the most likely scenario is that they will represent failed tasks from other hosts where a likely failure mode will have been insufficient memory.  So, good luck with those.

I'm currently preparing a guide to help understand the behaviour of the current VelaJr tasks.  I hope to have it ready to post shortly.

This doesn't explain how Betreger is getting nothing but successes running two at once on his 6GB Nvidia card.  I think it must be because newer Nvidias (and all AMDs) are quite happy to use system RAM if needed.  I don't know if that's a failing of earlier Nvidia cards or bad programming at Einstein's end.

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

cecht
Joined: 7 Mar 18
Posts: 1432
Credit: 2468175260
RAC: 752545

Peter Hucker wrote:
...This doesn't explain how Betreger is getting nothing but successes running two at once on his 6GB Nvidia card.  I think it must be because newer Nvidias (and all AMDs) are quite happy to use system RAM if needed...

This would be at odds with a recent post by T-800 on the Navi 10 forum (https://einsteinathome.org/goto/comment/178293) saying that it is only when the GW app needs to reach into system memory that it bogs down.

Here's a (probably not original) idea: when a card's available VRAM is exceeded, long-run GW VelaJr tasks are not crunched efficiently.  Here are some numbers...

I can run my 4 GB RX 570 card at 1x with none of the extra-large task time increases that are seen for some VelaJr frequency bands, but at 2x those long-run tasks take much longer to complete. My 6 GB RX 5600 XT, on the other hand, can run at 2x with no large task time increases, but at 3x the long-run tasks become problematic.

These differences between the cards make sense when viewed in terms of how much of the card's VRAM is used, assuming that a "long-run" task takes much longer when the card's VRAM capacity is exceeded. For example, the RX 5600 XT uses ~80%, or ~4.8 GB, of available VRAM when running 2x (https://einsteinathome.org/goto/comment/178290). Let's call this a requirement of 2.4 GB VRAM per task. But at 3x, it pegs out at ~100% VRAM; that is, it needs ~7.2 GB, but only has 6, so the app dips into system memory and doesn't do a very good job of it.

Now consider the RX 570 with 4 GB; assuming 2.4 GB VRAM is needed per task, it crunches fine at 1x, but at 2x it needs 4.8 GB VRAM and so must dip into system memory and, again, doesn't do a very good job of it.  This all assumes that GPU or memory load isn't limiting at any of those multiplicities.

By this reasoning, the 16 GB Radeon VII should be able to run all GW tasks equally well at 6x (14.4 GB VRAM needed), that is, without having the long-run tasks take an extraordinarily longer time. That is not to say it will crunch efficiently at 6x because GPU or memory loads will likely be exceeded and average individual task times will increase past a certain multiplicity (4x? 5x?). I'd predict, however, that the Radeon VII at 6x should see only about a 2-fold run time difference between short-run and long-run tasks, the same time differential seen with the RX570@1x and the RX5600xt@2x. Similarly, a 2 GB card would always have VRAM be limiting for long-run tasks.
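The arithmetic above can be written out as a small sketch. It assumes the ~2.4 GB per task estimate holds for every long-run task and that VRAM is the only limit, which the post itself flags as uncertain. The function names are made up for illustration.

```python
import math

# Estimate from the post: ~4.8 GB used at 2x on the RX 5600 XT.
VRAM_PER_TASK_GB = 2.4


def vram_needed_gb(multiplicity: int, per_task_gb: float = VRAM_PER_TASK_GB) -> float:
    """VRAM needed to run `multiplicity` long-run tasks at once."""
    return multiplicity * per_task_gb


def max_multiplicity(card_vram_gb: float, per_task_gb: float = VRAM_PER_TASK_GB) -> int:
    """Largest number of concurrent tasks that still fits in VRAM."""
    return math.floor(card_vram_gb / per_task_gb)


cards = [("RX 570", 4), ("RX 5600 XT", 6), ("Radeon VII", 16), ("2 GB card", 2)]
for name, vram in cards:
    n = max_multiplicity(vram)
    print(f"{name}: {vram} GB -> up to {n}x ({vram_needed_gb(n):.1f} GB needed)")
```

Under these assumptions the result is 1x for the RX 570, 2x for the RX 5600 XT, 6x for the Radeon VII, and 0x for a 2 GB card, matching the figures in the post.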

As pointed out by T-800, and no doubt by others, this is not an AMD thing, and these VRAM crunching requirements would be expected to hold for NVIDIA cards.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Mr P Hucker
Joined: 12 Aug 06
Posts: 819
Credit: 481409887
RAC: 1677

cecht wrote:
This would be at odds with a recent post by T-800 on the Navi 10 forum (https://einsteinathome.org/goto/comment/178293) saying that it is only when the GW app needs to reach into system memory that it bogs down.

[Very useful and interesting information snipped]

I wasn't suggesting it was a good idea to use system RAM.  I always try to run WUs that will fit into VRAM.  But there are two different problems here: slowing down, and crashing out.  For some reason older Nvidias will cause computation errors if they run out of VRAM.  Newer Nvidias and all AMDs will just slow down.  However, neither is desirable.  It's best I think for every individual to experiment with their own combination of GPU, VRAM, CPU, to find out what produces the most valid WUs in a given time period, or even give up and go to Gamma on older cards.  Also, Einstein should be trying to send out WUs based on available VRAM.  If I connect with a 3GB card, I shouldn't receive a 3.5GB WU.  At the moment, I have a lot of 3GB cards which I've had to relegate to Gamma work, since with Gravity, they get more than they can handle efficiently.  Yet they could be doing the smaller Gravity WUs.
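As a rough sketch of the VRAM-based matching suggested above: the filter below keeps only work units whose memory need fits the card. This is purely illustrative and is not how the Einstein@Home or BOINC scheduler is implemented. The task records, field names and memory figures are all made up.

```python
def eligible_tasks(tasks, card_vram_gb):
    """Keep only tasks whose (hypothetical) VRAM need fits on the card."""
    return [t for t in tasks if t["vram_gb"] <= card_vram_gb]


# Made-up work units with made-up memory requirements.
tasks = [
    {"name": "small_gw_task", "vram_gb": 1.8},
    {"name": "medium_gw_task", "vram_gb": 2.4},
    {"name": "large_gw_task", "vram_gb": 3.5},
]

# A 3 GB card would be offered the two smaller tasks but not the 3.5 GB one.
for task in eligible_tasks(tasks, card_vram_gb=3.0):
    print(task["name"])
```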

 


Tom M
Joined: 2 Feb 06
Posts: 5644
Credit: 7728730054
RAC: 2448617

I am using this system to exercise riser hardware (PCIe-to-USB3 board, cable, external PCIe adaptor for the card).

As a by-product, I am getting the following results under Windows 10 with an Nvidia GTX 1660 Super.

~1,075 to 1,383 seconds (GW GPU).

I don't know if it is useful or not, but I thought I would offer the data.

I am testing the riser hardware, which I bought used, before I resell it, so that I can vouch for it.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1433667954
RAC: 585389

That's running 1X?

T-800
Joined: 21 Jun 16
Posts: 2
Credit: 4691520
RAC: 0

I **think** I have an explanation for both the observed slowdown when GPUs have to access main system RAM (for AMD and newer nVidia cards) and the complete crashes (older nVidia cards).

Taking the RX 5600 XT as an example, for the reference AMD card at least, the maximum memory bandwidth is rated at 336 GB/s.
(From the specifications here: https://www.amd.com/en/products/graphics/amd-radeon-rx-5600-xt)

However, the bandwidth of typical DDR4 main system memory is much lower, for example 38.4 GB/s for DDR4 2400 RAM running in dual channel.
(Source showing how this is calculated: https://www.pcsteps.com/7932-real-ram-speed-mhz-cas-latency/)

I would assume that the sort of workload that Einstein is running involves a constant flow of data into and out of memory.
When this is VRAM on the graphics card, this can occur roughly 9x faster (336 / 38.4 ≈ 8.75) than when the card has to use main system memory.

Assuming I've understood this correctly, this would explain the tasks bogging down when the tasks require more memory than is available in VRAM.

Interestingly, this would mean using higher frequency main system memory (e.g. DDR4 3600 rather than the DDR4 2400 in my example above) would reduce this bottleneck, although maybe not by enough to make it worthwhile.
I wonder if memory speed is also a limiting factor for the rate at which CPU tasks are crunched?
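The bandwidth comparison can be checked with a short sketch. It assumes the usual theoretical-peak formula for DDR memory (transfers per second × 8 bytes per 64-bit channel × number of channels) and ignores any other link between the GPU and system RAM, so it is an upper bound rather than a measurement. The function name is made up for illustration.

```python
VRAM_BANDWIDTH_GB_S = 336.0  # RX 5600 XT rated figure quoted above


def ddr_bandwidth_gb_s(mega_transfers_per_s: float, channels: int = 2,
                       bus_width_bytes: int = 8) -> float:
    """Theoretical peak DDR bandwidth in GB/s (assumes a 64-bit channel)."""
    return mega_transfers_per_s * 1e6 * bus_width_bytes * channels / 1e9


for speed in (2400, 3600):
    bw = ddr_bandwidth_gb_s(speed)
    ratio = VRAM_BANDWIDTH_GB_S / bw
    print(f"DDR4 {speed} dual channel: {bw:.1f} GB/s, VRAM is ~{ratio:.2f}x faster")
```

On these theoretical figures, DDR4 3600 narrows the gap from about 8.75x to about 5.8x, which is still a large penalty.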

As for older nVidia cards crashing completely when their VRAM is exceeded, I wonder if this has something to do with OpenCL / CUDA versions?
I know Einstein's applications are written in the OpenCL programming language, which AMD cards support natively, whereas (as I understand it) nVidia cards only support nVidia's own CUDA language natively.
When an OpenCL application is run on an nVidia card, it does some kind of 'emulation' that converts the OpenCL code into CUDA code before it runs.

I know different generations of nVidia cards use different CUDA versions, so maybe older versions could not properly handle emulating an OpenCL application that was trying to access memory outside of the card's VRAM. Perhaps this was subsequently added to the later versions of CUDA found on newer nVidia cards, which is why they no longer crash?

Not hugely useful, but interesting (at least to me!) to explain the observed behaviour.

As Peter said above, experimenting with your system is probably the best way to find out what works best for you:

It's best I think for every individual to experiment with their own combination of GPU, VRAM, CPU, to find out what produces the most valid WUs in a given time period

 

Cherokee150
Joined: 13 May 11
Posts: 24
Credit: 810984138
RAC: 377701

About 50% of the work I was receiving was O2, but I haven't received any O2 work for almost three weeks.  I was tweaking some of my preferences around that time.  Are we still getting O2 work, or might I have made an error in my settings?
