I am seeing E@H consistently using up all of swap and eventually hitting the OOM killer under Linux. This happens on 3 different systems running openSUSE Tumbleweed and also on openSUSE 15.2.
Here is a sample of its swap usage (values in kB):
swap (kB)    PID  state  task
2,081,400 16516 R einstein_O2MD1_
1,508,328 14303 R einstein_O2MD1_
1,179,072 17305 R einstein_O2MD1_
356,688 17376 R einstein_O2MD1_
298,232 16985 R einstein_O2MD1_
295,176 16863 R einstein_O2MD1_
290,680 16911 R einstein_O2MD1_
286,112 17129 R einstein_O2MD1_
285,152 4318 R einstein_O2MD1_
282,648 1074 R einstein_O2MD1_
238,568 989 R einstein_O2MD1_
236,344 1023 R einstein_O2MD1_
230,936 1057 R einstein_O2MD1_
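A per-process swap table like the one above can be produced on Linux by reading the VmSwap field from /proc/&lt;pid&gt;/status. This is a minimal sketch; the "einstein" name filter is an assumption based on the task names above (drop the grep to see every process):

```shell
# List swap usage (VmSwap, in kB) per process, largest first.
# The "einstein" filter is illustrative; remove it to see all processes.
for status in /proc/[0-9]*/status; do
    awk '/^Name:/   { name = $2 }
         /^Pid:/    { pid  = $2 }
         /^VmSwap:/ { printf "%10d kB  %6d  %s\n", $2, pid, name }' "$status" 2>/dev/null
done | grep einstein | sort -rn
```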
and in dmesg I see this:
[3755909.825425] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=einstein_O2MD1_,pid=28321,uid=1000
[3755909.825436] Out of memory: Killed process 28321 (einstein_O2MD1_) total-vm:2405336kB, anon-rss:572372kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4748kB oom_score_adj:0
[3755910.055260] oom_reaper: reaped process 28321 (einstein_O2MD1_), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
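OOM-killer events like these can be pulled out of the kernel log with a simple filter; a minimal sketch (reading the kernel log may require root on some systems):

```shell
# Scan the kernel log for OOM-killer activity; add "dmesg --follow"
# (or use "journalctl -k -f") to watch live.  "|| true" keeps the
# exit status clean when no OOM events are present.
dmesg 2>/dev/null | grep -E 'oom-kill|Out of memory|oom_reaper' || true
```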
Cat22 wrote: I am seeing e@h…
Are you sure the problem is caused by the E@H app?
Your computers are hidden and you don't provide a link to any of your hosts so there's no way for any volunteers to examine host details and tasks lists, etc. I use Linux but I have no knowledge of openSUSE. What actual kernel are you running?
The O2MD1 app (a CPU app) has been around for a long time. If there were a serious memory leak issue, I would imagine there would have been lots of other complaints about it by now. I don't recall seeing any other reports like this.
I only run GPU tasks and I have seen something similar with very recent kernels. I've been testing 5.10.x and 5.11.x kernels and all the ones in these series that I've tested do end up consuming all of mem & swap if allowed to continue running for around a day or two. My preferred kernel is currently the LTS 5.4 series and if I use any in this series (I'm up to 5.4.115) there is no consumption of memory so it's definitely the kernel series and not the E@H app that is causing the problem in my case.
Can you try an LTS kernel to see if the problem goes away?
Cheers,
Gary.
I downloaded 5.4.119 from kernel.org, compiled it, and I am testing it now. It will be a couple of days before I can be sure of the results if the problem doesn't show up, less if it does.
Do you know specifically what was wrong in the kernel that could have caused this issue? Can you point me to a kernel file/function?
Cat22 wrote: Do you know…
No I don't. I don't compile my own kernels. My distro keeps up to date with both current and LTS kernels. At the moment, there's 5.4.119 LTS as well as up-to-date 5.12, 5.11 and 5.10 series. I haven't started looking at the 5.12 series yet. I think they're up to about 5.12.3 or 5.12.4.
I haven't tried to investigate the myriad of configuration options. I don't have anything fancy or cutting edge so the standard options that my distro uses work for everything I own and suit me fine. One less skill to have to learn :-).
There's been no discussion on my distro's forums about what I've observed regarding memory consumption. I haven't posted a report yet. I always install an LTS kernel along with a current series one (or two) and tend to test that they all work before finally booting to the LTS. My mainly older hardware is better suited to that.
By accident, one machine ended up doing BOINC under a 5.10.23 kernel when it was supposed to be running 5.4.105. A day or so later when it ran out of mem, I started investigating why. Over a period whilst experimenting with more than 20 different machines, I confirmed to my own satisfaction that the problem didn't occur with any 5.4 or 5.9 series kernel but always occurred with any 5.10 or 5.11 kernel. That involved more than 2 weeks of testing which I've only completed quite recently.
The main reason why I responded to your message was because this was all so fresh in my mind and seemed like a plausible explanation. Were you running a 5.10 or above kernel?? It would be nice to know if you're seeing the same issue. I'll try a 5.12 kernel at some point but I'm expecting it might have the same issue. I'm happy to stay with LTS kernels. Sooner or later it'll get fixed in current series.
As an example of what I saw, I'll mention a host with a FX-6300 CPU, a HD 7850 GPU and 16GB of RAM. I'm using a KDE Plasma desktop (same for all machines) and the kernel was 5.10.23. With BOINC up and running GRP GPU tasks x2, the free mem was just under 15GB. I measured that value around every 4 hours or so and it steadily kept declining. At the end of the second day it was down to 0.5GB. I rebooted it to 5.4.115 and it immediately had nearly 15GB free again. Today, with an uptime that is now 11 days, the free bytes show as 14,781,984.
I saw the exact same behaviour on many different hardware combinations. As a further example, an old (2009) machine with 3GB, a Q8400 CPU and an RX 570 4GB GPU has now been up with the 5.4.115 kernel for 14 days. It started with about 1.7GB free. It was running out of mem with 5.10.23 in under 1.5 days. Right now with an uptime of 14 days it shows 1,448,116 bytes free. It was much the same value within an hour or so of launch and it has stayed around that value ever since. None of my machines currently show any mem leak symptoms.
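The periodic free-memory readings described above are easy to automate. A minimal sketch that prints a timestamped MemAvailable figure from /proc/meminfo; run it from cron or a sleep loop at whatever interval suits, e.g. the roughly four-hour cadence mentioned above:

```shell
# Timestamped MemAvailable reading, suitable for appending to a log file:
#   while true; do ./memlog.sh >> mem.log; sleep 14400; done
cat /proc/meminfo 2>/dev/null |
awk -v ts="$(date '+%F %T')" \
    '/^MemAvailable:/ { printf "%s  MemAvailable: %d kB (%.1f GiB)\n", ts, $2, $2 / 1048576 }'
```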
Cheers,
Gary.
That's a great description! You should file a kernel bug. If you do, let me know and I'll chime in there with my experience too. Just this morning I got the 5.4.119 kernel compiled and working on my laptop (I had issues with 5.4.119 and my wireless that drove me crazy, but I finally got that resolved) and started running E@H again. I'll also get it up and running on another desktop, so time will tell if the memory leak goes away.
Yeah, I was running 5.12.2, also with the KDE Plasma desktop on X11.
OK, so running 5.4.119 with basically nothing else running, E@H will repeatedly use more and more memory until the OOM killer kills it, and then the whole process starts over. I also tried it with swap disabled and got the same result. You can just watch dmesg and see it happen.
I don't think this is a kernel issue; I'm pretty sure it's a software issue with E@H, specifically with einstein_O2MD1_2.08_x86_64-pc-linux-gnu__GWnew (is the source available?).
I never had memory leak issues until I started running E@H, and then they very quickly showed up on 3 desktops and two laptops, all running Linux.
Too bad we can't run it under valgrind; that would detect the memory leak very quickly.
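For what it's worth, if the binary could be run standalone, a leak check might look like the sketch below. This is hypothetical: the binary path is taken from the task names above, the app almost certainly expects BOINC-supplied input files, and valgrind typically slows the target down 10-50x:

```shell
# Hypothetical valgrind leak check of the E@H science app.
# The binary path is illustrative and assumes a standalone-runnable app.
APP=./einstein_O2MD1_2.08_x86_64-pc-linux-gnu__GWnew
if command -v valgrind >/dev/null 2>&1 && [ -x "$APP" ]; then
    valgrind --leak-check=full --show-leak-kinds=definite --log-file=eah-leak.log "$APP"
else
    echo "valgrind or app binary not available; skipping"
fi
```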
Here is 40 minutes worth of…
Cat22 wrote: I don't think…
The source isn't available and you now probably need to talk directly with one of the Devs here. If your assessment that the app is the problem is correct, I find it strange that there aren't lots of others seeing exactly the same thing.
Anything I know about this stuff is self taught and the next steps are beyond my level of expertise. I have no formal background in programming or debugging techniques. Computers weren't around when I went through the education system. I did an engineering degree but my computing device at the time was a slide rule. They were pretty easy to debug - just get better glasses so you could read the scales more correctly :-).
I'll send a PM to Bernd Machenschalk and ask him to look at this thread and advise you where to go from here.
Cheers,
Gary.
Hm. I think Gary is right saying…
The source code is available in principle (see the license page linked at the bottom; some late additions may be missing in the LSC repo), but it's really huge. It's a "LALSuite" application built with quite a few libraries, and LALSuite itself has some ten thousand lines of code (including, by the way, some methods to detect and eliminate possible memory leaks during development and testing).
I have to admit that I don't know what's going on on your machine specifically. It might be that running 13 einstein_O* processes at once is just too much. The actual memory footprint of an app process still varies quite a bit between machines, depending on things like OS version, kernel settings, etc. We do our best to adjust the memory bounds communicated to the client for each workunit, but it might be that our estimate is too far off for your machine, and the client might schedule too many tasks at once.
If you are familiar with manually configuring the client, you could tell it not to run more than n instances of an application at once. Or, if nothing else works, restrict your machine to running FGRP (in the project preferences); this app should be way less demanding.
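Limiting concurrent instances is done with an app_config.xml file in the project directory. A minimal sketch, assuming the project directory is the einstein.phys.uwm.edu folder under your BOINC data directory; the app name below is an assumption, so verify it against the app names in your client_state.xml before using it:

```xml
<!-- app_config.xml: place in <BOINC data dir>/projects/einstein.phys.uwm.edu/
     then use "Options -> Read config files" (or restart the client).
     The app name is illustrative; check client_state.xml for the real one. -->
<app_config>
  <app>
    <name>einstein_O2MD1</name>
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>
```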
hth
BM
I can substantiate this, as my computer was nearly dead until I removed E@H this morning. The hard drive was running at full speed and the machine was thrashing RAM and swap into oblivion! No more E@H until this is fixed! Using Linux Mint 20.1.