einstein@home memory leak

Cat22
Joined: 13 May 21
Posts: 28
Credit: 917683798
RAC: 1451344
Topic 225419

I am seeing E@H consistently using up all of swap and eventually hitting the OOM killer under Linux. This happens on 3 different systems running openSUSE Tumbleweed and also on openSUSE 15.2.

Here is a sample of its swap usage (values in kB):

swap (kB)   pid     state   command

2,081,400   16516   R       einstein_O2MD1_
1,508,328   14303   R       einstein_O2MD1_
1,179,072   17305   R       einstein_O2MD1_
  356,688   17376   R       einstein_O2MD1_
  298,232   16985   R       einstein_O2MD1_
  295,176   16863   R       einstein_O2MD1_
  290,680   16911   R       einstein_O2MD1_
  286,112   17129   R       einstein_O2MD1_
  285,152   4318    R       einstein_O2MD1_
  282,648   1074    R       einstein_O2MD1_
  238,568   989     R       einstein_O2MD1_
  236,344   1023    R       einstein_O2MD1_
  230,936   1057    R       einstein_O2MD1_
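For anyone who wants to collect a similar per-process swap listing, here is a minimal Python sketch. It assumes a Linux system where `/proc/<pid>/status` exposes the `Name:` and `VmSwap:` fields; the function names are made up for illustration, and this is not necessarily how the table above was produced.

```python
import re
from pathlib import Path


def parse_vmswap_kb(status_text):
    """Return the VmSwap value (kB) from /proc/<pid>/status text, or None."""
    m = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def parse_name(status_text):
    """Return the (kernel-truncated) command name from the status text."""
    m = re.search(r"^Name:\s+(.*)$", status_text, re.MULTILINE)
    return m.group(1).strip() if m else "?"


def swap_by_process(proc_root="/proc"):
    """List (swap_kb, pid, name) for processes with swap in use, largest first."""
    root = Path(proc_root)
    if not root.is_dir():
        return []
    rows = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # not a PID directory
        try:
            text = (entry / "status").read_text()
        except OSError:
            continue  # process exited or is not readable
        swap = parse_vmswap_kb(text)
        if swap:  # skips kernel threads (no VmSwap line) and zero usage
            rows.append((swap, int(entry.name), parse_name(text)))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    for swap_kb, pid, name in swap_by_process():
        print(f"{swap_kb:>12,} {pid:>7} {name}")
```

Running it a few times some minutes apart shows whether the swap held by each einstein_O2MD1_ process keeps growing or levels off.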

And in dmesg I see this:

[3755909.825425] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=einstein_O2MD1_,pid=28321,uid=1000
[3755909.825436] Out of memory: Killed process 28321 (einstein_O2MD1_) total-vm:2405336kB, anon-rss:572372kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4748kB oom_score_adj:0
[3755910.055260] oom_reaper: reaped process 28321 (einstein_O2MD1_), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117529863527
RAC: 35389812

Cat22 wrote:
I am seeing e@h consistently using up all of swap and eventually hitting the oom under linux.

Are you sure the problem is caused by the E@H app?

Your computers are hidden and you don't provide a link to any of your hosts, so there's no way for any volunteers to examine host details, task lists, etc.  I use Linux but I have no knowledge of openSUSE.  What actual kernel are you running?

The O2MD1 app (a CPU app) has been around for a long time.  If there were a serious memory leak issue, I would imagine there would have been lots of other complaints about it by now.  I don't recall seeing any other reports like this.

I only run GPU tasks and I have seen something similar with very recent kernels.  I've been testing 5.10.x and 5.11.x kernels and all the ones in these series that I've tested do end up consuming all of mem & swap if allowed to continue running for around a day or two.  My preferred kernel is currently the LTS 5.4 series and if I use any in this series (I'm up to 5.4.115) there is no consumption of memory so it's definitely the kernel series and not the E@H app that is causing the problem in my case.

Can you try an LTS kernel to see if the problem goes away?

Cheers,
Gary.

Cat22
Joined: 13 May 21
Posts: 28
Credit: 917683798
RAC: 1451344

I downloaded 5.4.119 from kernel.org, compiled it, and am testing it now. It will take a couple of days before I can be sure of the results if the problem doesn't show up, less if it does.

Do you know specifically what was wrong in the kernel that could have caused this issue? Can you point me to a kernel file or function?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117529863527
RAC: 35389812

Cat22 wrote:
Do you know specifically what was wrong in the kernel that could have caused this issue?

No I don't.  I don't compile my own kernels.  My distro keeps up to date with both current and LTS kernels.  At the moment, there's 5.4.119 LTS as well as up-to-date 5.12, 5.11 and 5.10 series.  I haven't started looking at the 5.12 series yet.  I think they're up to about 5.12.3 or 5.12.4.

I haven't tried to investigate the myriad of configuration options.  I don't have anything fancy or cutting edge so the standard options that my distro uses work for everything I own and suit me fine.  One less skill to have to learn :-).

There's been no discussion on my distro's forums about what I've observed regarding memory consumption.  I haven't posted a report yet.  I always install an LTS kernel along with a current series one (or two) and tend to test that they all work before finally booting to the LTS.  My mainly older hardware is better suited to that.

By accident, one machine ended up doing BOINC under a 5.10.23 kernel when it was supposed to be running 5.4.105.  A day or so later when it ran out of mem, I started investigating why.  Over a period whilst experimenting with more than 20 different machines, I confirmed to my own satisfaction that the problem didn't occur with any 5.4 or 5.9 series kernel but always occurred with any 5.10 or 5.11 kernel.  That involved more than 2 weeks of testing which I've only completed quite recently.

The main reason I responded to your message was that this was all so fresh in my mind and seemed like a plausible explanation.  Were you running a 5.10 or later kernel?  It would be nice to know if you're seeing the same issue.  I'll try a 5.12 kernel at some point but I'm expecting it might have the same issue.  I'm happy to stay with LTS kernels.  Sooner or later it'll get fixed in the current series.

As an example of what I saw, I'll mention a host with an FX-6300 CPU, an HD 7850 GPU and 16GB of RAM.  I'm using a KDE Plasma desktop (same for all machines) and the kernel was 5.10.23.  With BOINC up and running GRP GPU tasks x2, the free mem was just under 15GB.  I measured that value around every 4 hours or so and it steadily kept declining.  At the end of the second day it was down to 0.5GB.  I rebooted it to 5.4.115 and it immediately had nearly 15GB free again.  Today, with an uptime that is now 11 days, the free memory shows as 14,781,984 kB.

I saw the exact same behaviour on many different hardware combinations.  As a further example, an old (2009) machine with 3GB, a Q8400 CPU and an RX 570 4GB GPU has now been up with the 5.4.115 kernel for 14 days.  It started with about 1.7GB free.  It was running out of mem with 5.10.23 in under 1.5 days.  Right now with an uptime of 14 days it shows 1,448,116 kB free.  It was much the same value within an hour or so of launch and it has stayed around that value ever since.  None of my machines currently show any mem leak symptoms.
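The kind of periodic free-memory check described above can be automated. The following is only a sketch, assuming a Linux host where `/proc/meminfo` reports `MemFree` and `MemAvailable` in kB; the helper names are hypothetical, and it is not the method actually used for the measurements in this post.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # kB for most fields
    return fields


def sample(path="/proc/meminfo"):
    """Return (MemFree, MemAvailable) in kB from the given meminfo file."""
    with open(path) as fh:
        info = parse_meminfo(fh.read())
    return info.get("MemFree"), info.get("MemAvailable")


if __name__ == "__main__":
    # Print one timestamped sample; schedule this every few hours
    # (for example from cron) to build up a trend.
    if os.path.exists("/proc/meminfo"):
        free_kb, avail_kb = sample()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} MemFree={free_kb} kB MemAvailable={avail_kb} kB")
```

Appending each line of output to a log file gives a simple record of whether free memory declines steadily over a day or two, or stays flat.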

Cheers,
Gary.

Cat22
Joined: 13 May 21
Posts: 28
Credit: 917683798
RAC: 1451344

That's a great description! You should file a kernel bug. If you do, let me know and I'll chime in there with my experience too. Just this morning I got the 5.4.119 kernel compiled and working on my laptop (I had issues with 5.4.119 and my wireless that drove me crazy, but finally got that resolved) and started running E@H again. I'll also get it up and running on another desktop, so time will tell if the memory leak goes away.

Yes, I was running 5.12.2 and also KDE Plasma on X11.

Cat22
Joined: 13 May 21
Posts: 28
Credit: 917683798
RAC: 1451344

OK, so running 5.4.119 with basically nothing else running, E@H will repeatedly use more and more memory until the OOM killer kills it and the whole process starts over. I also tried it with swap disabled and got the same result. You can just watch dmesg and see it happen.

I don't think this is a kernel issue. I'm pretty sure it's a software issue with E@H, specifically with einstein_O2MD1_2.08_x86_64-pc-linux-gnu__GWnew (is the source available?).

I never had memory leak issues until I started running E@H, and then they very quickly showed up on 3 desktops and two laptops, all running Linux.

Too bad we couldn't run it under valgrind; that would detect the memory leak very quickly.

Cat22
Joined: 13 May 21
Posts: 28
Credit: 917683798
RAC: 1451344

Here is 40 minutes' worth of OOM kills from dmesg (extracts, not the full dmesg). Is it possible to attach a file here? I can post the full dmesg.

[ 7402.404125] avahi-daemon invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 7402.404324] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=einstein_O2MD1_,pid=4290,uid=1000
[ 8611.764427] mysqld invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 8611.764760] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=einstein_O2MD1_,pid=23663,uid=1000
[ 8733.711363] einstein_O2MD1_ invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 8733.711598] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=einstein_O2MD1_,pid=18245,uid=1000
[ 8876.109456] X invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 8876.109778] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=einstein_O2MD1_,pid=3369,uid=1000
[ 8993.093393] krunner invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 8993.093684] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=einstein_O2MD1_,pid=5533,uid=1000
[ 9078.606048] gkrellm invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 9078.606273] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=einstein_O2MD1_,pid=4267,uid=1000
[ 9793.991427] DOM Worker invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 9793.991783] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=einstein_O2MD1_,pid=7789,uid=1000
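Extracts like these can be tallied automatically. Below is a small Python sketch that counts which process invoked the OOM killer and which task was killed. It assumes log lines in the format shown above; the regular expressions and function name are illustrative and have only been checked against these samples.

```python
import re
from collections import Counter

# "oom-kill:...,task=<name>,pid=<pid>,..." names the process chosen as victim.
KILL_RE = re.compile(r"oom-kill:.*?\btask=(?P<task>[^,]+),pid=(?P<pid>\d+)")
# "<name> invoked oom-killer" names the process whose allocation failed.
INVOKE_RE = re.compile(r"\]\s+(?P<invoker>.+?) invoked oom-killer")


def summarize_oom(lines):
    """Return (killed, invoked) Counters keyed by process name."""
    killed, invoked = Counter(), Counter()
    for line in lines:
        for m in KILL_RE.finditer(line):
            killed[m.group("task")] += 1
        for m in INVOKE_RE.finditer(line):
            invoked[m.group("invoker")] += 1
    return killed, invoked


if __name__ == "__main__":
    # Two lines copied from the extract above, used as a demonstration.
    demo = [
        "[ 8611.764427] mysqld invoked oom-killer: "
        "gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0",
        "[ 8611.764760] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),"
        "cpuset=/,mems_allowed=0,global_oom,task_memcg=/,"
        "task=einstein_O2MD1_,pid=23663,uid=1000",
    ]
    killed, invoked = summarize_oom(demo)
    print("killed:", dict(killed))
    print("invoked by:", dict(invoked))
```

To run it against a real log, pass any iterable of lines, for example an open file containing saved `dmesg` output. In the extract above the invoker differs each time while the victim is always einstein_O2MD1_. That fits the kernel picking the process with the highest OOM score, but on its own it does not show whether the memory growth comes from the app or from elsewhere.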

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117529863527
RAC: 35389812

Cat22 wrote:
I don't think this is a kernel issue, I'm pretty sure its a sw issue with e@h, specifically with einstein_O2MD1_2.08_x86_64-pc-linux-gnu__GWnew   (is the source available?)

The source isn't available and you now probably need to talk directly with one of the Devs here.  If your assessment that the app is the problem is correct, I find it strange that there aren't lots of others seeing exactly the same thing.

Anything I know about this stuff is self taught and the next steps are beyond my level of expertise.  I have no formal background in programming or debugging techniques.  Computers weren't around when I went through the education system.  I did an engineering degree but my computing device at the time was a slide rule.  They were pretty easy to debug - just get better glasses so you could read the scales more correctly  :-).

I'll send a PM to Bernd Machenschalk and ask him to look at this thread and advise you where to go from here.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250425232
RAC: 35131

Hm. I think Gary is right in saying:

The O2MD1 app (a CPU app) has been around for a long time.  If there were a serious memory leak issue, I would imagine there would have been lots of other complaints about it by now. 

The source code is available in principle (see the license page linked at the bottom; some late additions may be missing in the LSC repo), but it's really huge. It's a "LALSuite" application built with quite a few libraries, and LALSuite itself has some ten thousand lines of code (including, by the way, some methods to detect and eliminate possible memory leaks during development and testing).

I have to admit that I don't know what's going on on your machine specifically. It might be that running 13 einstein_O* processes at once is just too much. The actual memory footprint of an app process still varies quite a bit between actual machines, depending on things like OS version, kernel settings etc. We do our best to adjust the memory bounds communicated to the client for each workunit, but it might be that our estimate is too far off for your machine, and the client might schedule too many tasks at once.

If you are familiar with manually configuring the client, you could tell it not to run more than n instances of an application at once. Or, if nothing else works, restrict your machine to running FGRP (in the project preferences); this app should be way less demanding.
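For the manual client configuration, the BOINC client reads an `app_config.xml` file from the project's directory, where `<max_concurrent>` caps how many tasks of one application run at once. The Python sketch below generates such a file. The app name `einstein_O2MD1` and the limit of 4 are assumptions for illustration; the real short name should be checked in the `<app>` entries of the client's `client_state.xml`.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Return app_config.xml text limiting one app to max_concurrent tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed app name and limit; adjust both to your own host.
    print(build_app_config("einstein_O2MD1", 4))
```

The output would be saved as `app_config.xml` in the Einstein@Home project folder inside the BOINC data directory (the location varies by distribution), after which the client needs to re-read its config files or be restarted.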

hth

BM

Cthulhu
Joined: 28 Jul 20
Posts: 7
Credit: 167384303
RAC: 84243

I can substantiate this, as my computer was nearly dead until I removed E@H this morning. The hard drive was running at full speed and swapping into oblivion! No more E@H until this is fixed! Using Linux Mint 20.1.
