My most recently built host is a Windows 10 (born that way, not upgraded from Windows 7) PC.
Very recently I've noticed it running short on RAM after well under a week of uptime since the last reboot. My debugging of the problem is just starting, but one possible suspect seems to be the Einstein Gamma-ray pulsar binary search #1 v1.00 windows_intelx86 application. I only switched this PC to this application when it could no longer get a steady supply of GW work, which at least roughly corresponds to when I started noticing the problem.
The PC has 16 GB of RAM, and after rebooting this morning as a start on assessing this problem, both Process Explorer and Process Lasso showed about 22% of RAM in use once all my standard background applications (including 6 Einstein GPU and 3 Einstein CPU tasks) were up and running. After half an hour of browsing with both Firefox and Chrome active, it is now up to just 34%. Yet just before I shut it down, after just two days of uptime, memory in use with the same background tasks running exceeded 90%.
It seems likely there is somehow a memory leak on the system.
The suspicion of GRPBS involvement is circumstantial, and I have no experience in investigating memory leak trouble, nor awareness of tools which might help identify such problems.
I'm posting here to ask if anyone has noticed such an issue possibly associated with Einstein applications, or has investigation technique suggestions.
My short-term intent is to monitor the RAM usage buildup over the coming hours, and as a trial to suspend GRPBS tasks when the usage hits 50% to see if the rise seems to change. Another investigation possibility would be to enable GRPBS work on my other two serious hosts, which are both recent Windows 10 conversions but do not currently run BOINC CPU tasks.
You should be able to monitor RAM usage per process in the Resource Monitor. The easiest way to access it is via the Task Manager. At some point you may need to activate "Show all processes" or something similar. At least that was the case on Windows 7, when I last used that OS.
Could also be anything else. I had memory leaks on Windows 10 caused by the built-in Windows Defender. Getting rid of it fixed a lot of problems ;-)
Win10 has a much improved Task Manager; you can easily track the memory-hogging application with it - just sort the list by memory usage.
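If you'd rather capture that sorted list outside the GUI, a minimal Python sketch (assuming the third-party psutil package is installed - just an illustration, not the only way) would look roughly like this:
[pre]
# Minimal sketch, assuming "pip install psutil".
# Prints the ten largest processes by working set, like a Task Manager list
# sorted by memory usage.
import psutil

procs = []
for p in psutil.process_iter(["pid", "name", "memory_info"]):
    mi = p.info["memory_info"]
    if mi is not None:                      # None when access was denied
        procs.append((mi.rss, p.info["pid"], p.info["name"]))

for rss, pid, name in sorted(procs, reverse=True)[:10]:
    print(f"{rss / 2**20:8.1f} MB  {pid:6d}  {name}")
[/pre]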
-----
If my problem was that an application consumed ever-increasing memory while running, then I'd expect to have seen it already when I reviewed all listed applications in Process Explorer.
But neither that nor Task Manager shows anything amiss. (Well, I've only just started looking at Task Manager, but usage has already grown enough that I should be able to see something if the memory reported against a specific task showed the problem.)
I think one form of system memory leak occurs when a task fails to release allocated memory on termination. Given the specifics at hand, I suspect that is what I face. Hence the difficulty of diagnosis. The only diagnostic method that currently comes to mind is to terminate suspect applications and observe whether the growth stops.
I once also investigated a potential memory leak where memory was not released upon termination of the app. The problem was that the app was restarted every second, but the Windows kernel didn't free the memory. The user with this specific problem tracked it down to a driver issue: he had installed a driver for his ASUS motherboard that is only needed when spinning disks are attached. With only solid-state disks installed, that driver should not be installed.
You should see a lot of "grayed out" processes in Resource Monitor (Task Manager does not show them) that are no longer running but still consume memory. I think it's the Windows equivalent of Linux zombie processes.
If you open the Task Manager and look at Performance and Memory, is it the "Paged pool" that keeps growing uncontrollably (until it's at something like 10 GB)?
Finally found my way to Resource Monitor.
No greyed-out processes there, yet.
I've logged the summary numbers currently shown by RM, so should be able to see which grows after a while.
At this moment Task Manager|Performance|Memory shows the paged pool at 1.3 GB and the non-paged pool at 245 MB.
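Rather than copying those summary numbers down by hand each time, I may let a small script append them to a file; a minimal sketch, assuming Python with the psutil package (not something I normally have installed), would be:
[pre]
# Minimal sketch, assuming Python with the psutil package is available.
# Appends the headline memory numbers to a log file every ten minutes.
import time
import psutil

with open("mem_log.txt", "a") as log:
    while True:
        vm = psutil.virtual_memory()
        log.write(f"{time.strftime('%Y-%m-%d %H:%M')}  "
                  f"in_use={vm.used // 2**20} MB  "
                  f"available={vm.available // 2**20} MB  "
                  f"used_pct={vm.percent}\n")
        log.flush()
        time.sleep(600)
[/pre]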
Changes over slightly more than one hour when I was away from the machine:
[pre]
Process Explorer reported physical usage rose from 32.66% to 34.67%.

Resource Monitor reported (MB):
  In Use   rose from 5093 to 5408
  Modified rose from  228 to  259
  Standby  unchanged at 7647
  Cached   rose from 7879 to 7906

Task Manager|Performance|Memory reported:
  In Use     rose from 5.0 to 5.3 GB
  Paged Pool rose from 1.3 to 1.6 GB
  Committed  rose from 7.2 to 7.5 GB
[/pre]
Reviewing the processes shown in Resource Monitor in descending order of working set size, the first one that looks at all suspicious is one of the several copies of svchost.exe, which now has a working set of 139,272 KB. An eyeball sum of all the copies of svchost comes to something like 420 MB. But I don't know that to be a change, though I intend to watch it.
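If eyeballing proves too crude, a small script could total working sets per image name; a rough sketch, assuming Python with the psutil package:
[pre]
# Sketch assuming the psutil package: total working set per image name,
# so all svchost.exe instances show up as one combined number.
from collections import defaultdict
import psutil

totals = defaultdict(int)
for p in psutil.process_iter(["name", "memory_info"]):
    mi = p.info["memory_info"]
    if mi is not None:
        totals[p.info["name"]] += mi.rss

for name, rss in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:15]:
    print(f"{name:<30} {rss / 2**20:8.1f} MB")
[/pre]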
I'm thinking about something. I see that host has Nvidia cards and they might be running Binary Radio Pulsar Search (Parkes PMPS XT) v1.57 (BRP6-Beta-cuda55) tasks.
If Paged pool keeps rising, try suspending those BRP6-tasks. Does Paged pool stop rising?
My suspicion of svchost.exe does not fit the continuing observations.
With some more time on the clock, the rising items are:
physical Usage (39.28% and steadily climbing)
In Use
Paged Pool
Committed
Sadly, Resource Monitor does not seem to support copying the process list so that I could easily do a sum in Excel, but my subjective impression is that the usage increases listed above do NOT correspond to increases visible in that list.
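As a possible workaround, a script could dump an equivalent per-process snapshot to CSV for summing in Excel; again a rough sketch, assuming the psutil package:
[pre]
# Minimal sketch, assuming the psutil package: dump one per-process snapshot
# to a CSV file that Excel can open and sum.
import csv, time
import psutil

fname = time.strftime("procs_%Y%m%d_%H%M%S.csv")
with open(fname, "w", newline="") as fh:
    w = csv.writer(fh)
    w.writerow(["pid", "name", "working_set_KB", "pagefile_KB"])
    for p in psutil.process_iter(["pid", "name", "memory_info"]):
        mi = p.info["memory_info"]
        if mi is not None:
            # On Windows psutil reports rss as the working set and vms as
            # pagefile usage for the process.
            w.writerow([p.info["pid"], p.info["name"],
                        mi.rss // 1024, mi.vms // 1024])
[/pre]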
I'll wait some more hours, perhaps to about 50% usage, then try shutting down "optional" processes (for monitoring and control I run Process Lasso, Task Manager, GPU-Z, TThrottle and MSI Afterburner), as a group. If no joy I'll suspend the CPU jobs (GRPBS), and only then the BRP6/CUDA55 work. The last seems unlikely to me as prime suspect, as the only recent change plausibly affecting it is that I added a 750Ti to a machine which had been happily running a 1070.
Following [url=https://msdn.microsoft.com/en-us/library/windows/hardware/ff541886(v=vs.85).aspx]advice given by Microsoft[/url] on establishing whether a memory leak exists, I've started a copy of Performance Monitor and am accumulating a multi-hour trace of:
. pool non-paged bytes
. pool paged bytes
. paging file % usage
This seems likely to show a rather steady growth in pool paged bytes, based on a quick look at a higher update rate.
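Should I later want to log these outside Performance Monitor, the same kernel pool numbers are available programmatically via the Win32 GetPerformanceInfo call; a rough Python/ctypes sketch (an untested assumption on my part, not something I have actually run):
[pre]
# Rough sketch: read kernel paged/non-paged pool sizes via GetPerformanceInfo
# (psapi.dll), using only the Python standard library.
import ctypes, time
from ctypes import wintypes

class PERFORMANCE_INFORMATION(ctypes.Structure):
    _fields_ = [("cb", wintypes.DWORD),
                ("CommitTotal", ctypes.c_size_t),
                ("CommitLimit", ctypes.c_size_t),
                ("CommitPeak", ctypes.c_size_t),
                ("PhysicalTotal", ctypes.c_size_t),
                ("PhysicalAvailable", ctypes.c_size_t),
                ("SystemCache", ctypes.c_size_t),
                ("KernelTotal", ctypes.c_size_t),
                ("KernelPaged", ctypes.c_size_t),
                ("KernelNonpaged", ctypes.c_size_t),
                ("PageSize", ctypes.c_size_t),
                ("HandleCount", wintypes.DWORD),
                ("ProcessCount", wintypes.DWORD),
                ("ThreadCount", wintypes.DWORD)]

psapi = ctypes.WinDLL("psapi")
while True:
    pi = PERFORMANCE_INFORMATION()
    pi.cb = ctypes.sizeof(pi)
    psapi.GetPerformanceInfo(ctypes.byref(pi), pi.cb)
    print(time.strftime("%H:%M"),
          "paged pool MB:", pi.KernelPaged * pi.PageSize // 2**20,
          "non-paged pool MB:", pi.KernelNonpaged * pi.PageSize // 2**20)
    time.sleep(600)
[/pre]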
The several hours graph of pool paged bytes shows an extremely uniform uptrend.
Several sources advised that I should download and install the Windows Driver Kit, and see what Poolmon could tell me.
Poolmon shows that the leading tag for the Paged type is "Vi12". This line updates every few seconds (that seems to be a Poolmon rate), and each update appears to report about 1500 allocations, for a rate of about 300 per second; with each update the Bytes count heads never-endingly up. The "Allocs" column monumentally outweighs the "Frees" column (something like 11 million to 28 thousand at the moment). The "Per Alloc" column has stayed constant at 239 in the couple of minutes I've watched it. Taken at face value, that comes to a loss rate of about a quarter of a gigabyte per hour, which is in the ballpark of what I'm seeing at the aggregate level.
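For the arithmetic behind that estimate (the rates are my rough observations, so treat them as approximate):
[pre]
allocs_per_second = 300      # ~1500 allocations per few-second Poolmon update
bytes_per_alloc   = 239      # the "Per Alloc" column
leak_per_hour = allocs_per_second * bytes_per_alloc * 3600
print(leak_per_hour / 2**30)   # ~0.24 GiB/hour, ignoring the (tiny) Frees
[/pre]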
One source suggested that I search the driver files (*.sys in the Windows drivers directory) for the string shown in Poolmon as the suspect tag. Not a guaranteed-to-work method, but it often hits in the driver of interest and seldom in others.
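For anyone wanting to repeat that search, any literal string search over the .sys files will do; a rough Python sketch of the idea (just an illustration, not exactly what I ran):
[pre]
# Sketch: scan driver binaries for a Poolmon tag (here "Vi12") as a raw byte string.
import pathlib

tag = b"Vi12"
for f in sorted(pathlib.Path(r"C:\Windows\System32\drivers").glob("*.sys")):
    try:
        if tag in f.read_bytes():
            print(f.name)
    except OSError:
        pass        # skip files we cannot read
[/pre]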
On my system this got multiple hits within the file dxgmms.sys, and none in any other file in c:\Windows\System32\Drivers. Sources say this is the DirectX Graphics MMS. Most of the Internet complaints I find regarding this driver are from game players complaining of BSODs and such after Windows 10 upgrades. This system was born Windows 10, and does not BSOD. Some claim to have fixed that problem by using Guru3D's DDU driver uninstall and cleanup utility before doing a fresh Nvidia driver install. I should note I have never used that utility, though some stalwarts advise using it every time.
As the tag of concern updates so frequently, I decided I could terminate my "optional" programs one by one and see if the Vi12 increase suddenly stopped:
BoincTasks
Tthrottle
TextPad
Process Lasso
GPU-Z
iRotate
With no joy from killing all of these, I moved on to suspending types of BOINC work. I first suspended all non-started GRPBS tasks, then suspended the running ones. No joy.
I suspended all non-running BRP6/CUDA55 tasks, then the three that were running on my 750Ti. No joy.
So I suspended the three BRP6 tasks running on my GTX 1070.
Immediately updates on Poolmon stopped showing activity with the Vi12 tag.
I resumed the three 1070 tasks first, which from experience I knew would restart them on the 1070, then restarted the formerly 750Ti tasks, then re-suspended the 1070 tasks, so I now had just three GPU tasks, all running on the older, slower card. The pattern of activity was visibly different from that during previous GPU running, and after some transients (including some slight REDUCTIONS in the total) the Vi12 line stopped getting updates.
So whatever is going on seems to be enabled in my current configuration by running Einstein BRP6/CUDA55 work on my installed GTX 1070, but not on the GTX 750Ti (which of course is using the same Nvidia driver--which is not the driver to which this reporting chain appears to refer).
When, early in this thread, I described changing from GW CPU work to GRPBS CPU work as very roughly time-correlated with the onset of this problem, I failed to mention that something else plausibly time-correlated was my change from running the GTX 1070 commissioning work with the 1070 alone to adding the 750Ti to the case and, later, swapping which card was in which PCIe slot.
I've never been a DDU true believer, and in fact have not used it at all, although I have dutifully re-run the most recent Nvidia driver installer after each change in installed cards or positions.
On the off chance it might help, I'm inclined to switch my card positions back to where they were (the thermal advantage I sought by the current position was mixed at best), and use DDU on the way to installing the most current Nvidia driver.
I'm not deeply interested in a thorough diagnosis at this point, just in getting my system back to a condition in which it can run for well over a day at a time. However, I'm interested in any comments or insight, and will consider reasonable test requests.