GNU/Linux S5R3 "power users" App 4.35 available

Keck_Komputers
Joined: 18 Jan 05
Posts: 376
Credit: 5744955
RAC: 0

RE: Hi! So why is this

Message 79547 in response to message 79546

Quote:

Hi!

So why is this good news? Synchronous DNS lookup causes lots of "no heartbeat for 30 seconds, exiting" incidents (at least it does under Linux), where all apps running under BOINC exit whenever there's a network problem. Looks more like a choice between two evils.

CU
Bikeman


It is currently the lesser of two evils. The async bug takes a host out of service until the participant manually intervenes; the sync bug restarts automatically in a minute or less. Both carry a rare possibility of a system freeze.
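For anyone wondering what the "heartbeat" actually is: the client periodically signals each science app that it is still alive, and an app shuts down if the signal stops. Below is a minimal sketch of that watchdog idea, not the actual BOINC source; heartbeat_received() and do_some_science() are made-up stand-ins (the real client uses shared-memory messages). It shows why a client blocked in a synchronous DNS lookup takes every running app down at once.

Code:

#include <ctime>
#include <cstdio>
#include <cstdlib>

const int HEARTBEAT_TIMEOUT = 30;  // seconds, matching the error message

// Stand-in for however the app learns the client is alive;
// stubbed here so the sketch compiles.
bool heartbeat_received() { return true; }

void do_some_science() { /* one slice of computation */ }

int main() {
    std::time_t last_heartbeat = std::time(nullptr);
    for (;;) {
        if (heartbeat_received()) {
            last_heartbeat = std::time(nullptr);
        } else if (std::time(nullptr) - last_heartbeat > HEARTBEAT_TIMEOUT) {
            // A client stuck in a blocking DNS call stops sending
            // heartbeats, so every running app hits this path together.
            std::fprintf(stderr, "no heartbeat for 30 seconds, exiting\n");
            std::exit(0);
        }
        do_some_science();
    }
}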

BOINC WIKI

BOINCing since 2002/12/8

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

I see. Thanks for explaining,

I see. Thanks for explaining; looks like there really isn't much I can do about it except hope to get my internet fixed ASAP.

Wedge009
Joined: 5 Mar 05
Posts: 128
Credit: 17544558702
RAC: 6886887

I have noticed that sometimes

I have noticed that sometimes when switching tasks from another project to Einstein 4.35, the work-unit gets 'stuck'. That is, the CPU remains idle while BOINC keeps trying to get Einstein to run. Manually switching to another project and then switching back does not change anything. Fortunately, stopping and restarting the BOINC service/daemon returns the Einstein WU to working status, and the WU then finishes successfully and validates.

So far, I have only noticed this on one host, but it has happened more than once and on more than one work-unit, and I am quite sure it started when I began using Einstein 4.35. It's not a disastrous fault, in the sense that no already-completed work is lost; however, a stuck WU leaving a host idle overnight isn't very useful. I am just wondering whether anyone else has noticed this issue, and if there is any way to resolve it. Given that Einstein 4.38-1 is the same as this application, this issue could be relevant to the 4.38 beta application as well.

Oh yes, and thanks to everyone for making such a big improvement to the overall performance.

Soli Deo Gloria

rroonnaalldd
Joined: 12 Dec 05
Posts: 116
Credit: 537221
RAC: 0

RE: I have noticed that

Message 79550 in response to message 79549

Quote:

I have noticed that sometimes when switching tasks from another project to Einstein 4.35, the work-unit gets 'stuck'. That is, the CPU remains idle while BOINC keeps trying to get Einstein to run. Manually switching to another project and then switching back does not change anything. Fortunately, stopping and restarting the BOINC service/daemon returns the Einstein WU to working status, and the WU then finishes successfully and validates.

So far, I have only noticed this on one host, but it has happened more than once and on more than one work-unit, and I am quite sure it started when I began using Einstein 4.35. It's not a disastrous fault, in the sense that no already-completed work is lost; however, a stuck WU leaving a host idle overnight isn't very useful. I am just wondering whether anyone else has noticed this issue, and if there is any way to resolve it. Given that Einstein 4.38-1 is the same as this application, this issue could be relevant to the 4.38 beta application as well.

Oh yes, and thanks to everyone for making such a big improvement to the overall performance.

So far I have not seen anything like this on any of my 4 test hosts, but it may be that this fault only shows up when you run BOINC as a daemon or service.
What I have noticed is that whenever an Einstein app starts, there is an extended burst of disk access and a noticeable change in memory usage.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 753242758
RAC: 1193033

Now I might be wrong but

Now I might be wrong, but isn't there a common theme that BOINC sometimes has problems suspending/resuming tasks in general? This seems to be most common when you use the CPU throttling feature (a high frequency of suspend requests), but it also affects resuming apps after a benchmark, and other situations. And it's not limited to E@H either. Just a feeling, though.
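For context, CPU throttling works by rapidly suspending and resuming the apps, so any fragility in the suspend/resume path gets exercised many times per minute. Here is a minimal sketch of that duty-cycle pattern on a POSIX system, assuming the child is paused with SIGSTOP/SIGCONT (one mechanism a client can use; this is an illustration, not the actual BOINC throttling code):

Code:

#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <cstdlib>

// Hold `pid` to roughly `percent` of one CPU by alternating
// stop/continue signals over a one-second cycle.
void throttle(pid_t pid, int percent) {
    const useconds_t run_us  = percent * 10000;          // running slice
    const useconds_t stop_us = (100 - percent) * 10000;  // paused slice
    for (;;) {
        kill(pid, SIGCONT);   // resume the app
        usleep(run_us);
        kill(pid, SIGSTOP);   // suspend it again
        usleep(stop_us);
        // Every cycle is a full suspend/resume round trip, so a bug
        // that occasionally loses a resume will bite quickly here.
    }
}

int main(int argc, char** argv) {
    if (argc == 2) throttle(std::atoi(argv[1]), 50);  // 50% duty cycle
    return 0;
}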

CU
Bikeman

archae86
Joined: 6 Dec 05
Posts: 3160
Credit: 7256108949
RAC: 1465747

RE: What i have noticed is

Message 79552 in response to message 79550

Quote:
What I have noticed is that whenever an Einstein app starts, there is an extended burst of disk access and a noticeable change in memory usage.


Yes indeed. I never logged the numbers before the Windows 4.36 Einstein power-user app, but on that one, simultaneous 4-result Einstein starts on my Q6600 spend over two minutes in an intense I/O Read phase, with Process Explorer showing the total I/O Read bytes to be a very reproducible number near 892 million (though I just looked, and for three of my four running now it is instead 785 million).

If four results are starting together, they all generally finish this phase within a second or so of each other. If only one is starting, it gets through faster (roughly a minute compared to over two on my Q6600), suggesting that competition for some resource is partly at hand. Also suggesting contention: when four Einstein results are in this phase together, significant System Idle time is logged, along with some System time, and the four tasks each get only something like 16 to 18% credited directly to the task, rather than the 24.nn% they generally get for the remaining execution time (four tasks at ~17% account for only about 68% of the CPU, versus nearly 98% during normal crunching). CPU temperature during this phase is also clearly depressed, along with power consumption.

Just a curiosity, but I'd be interested in an explanation.

rroonnaalldd
Joined: 12 Dec 05
Posts: 116
Credit: 537221
RAC: 0

RE: Yes indeed. I never

Message 79553 in response to message 79552

Quote:


Yes indeed. I never logged the numbers before the Windows 4.36 Einstein power-user app, but on that one, simultaneous 4-result Einstein starts on my Q6600 spend over two minutes in an intense I/O Read phase, with Process Explorer showing the total I/O Read bytes to be a very reproducible number near 892 million (though I just looked, and for three of my four running now it is instead 785 million).

If four results are starting together, they all generally finish this phase within a second or so of each other. If only one is starting, it gets through faster (roughly a minute compared to over two on my Q6600), suggesting that competition for some resource is partly at hand. Also suggesting contention: when four Einstein results are in this phase together, significant System Idle time is logged, along with some System time, and the four tasks each get only something like 16 to 18% credited directly to the task, rather than the 24.nn% they generally get for the remaining execution time (four tasks at ~17% account for only about 68% of the CPU, versus nearly 98% during normal crunching). CPU temperature during this phase is also clearly depressed, along with power consumption.

Just a curiosity, but I'd be interested in an explanation.

892 MB for 4 tasks and 785 MB for 3? Hmmm, 110 MB at startup is quite a bit more than the 40-45 MB each Einstein task uses while running. 892 MB is nearly a gigabyte of RAM. If each Einstein app needs ~110 MB to begin its work, that could result in a lot of swapping. I think today's PC systems and/or operating systems are not at their best under heavy multi-core usage.
Are you a dedicated cruncher?
1 GB of RAM suggests to me that other applications are also in memory. My IE7 with 30 open tabs likewise sits between 15 MB in the task list and 60 MB when refreshing in fullscreen, while its virtual RAM usage is always around 215 MB. When I start or shut down my Suse 10.2 64-bit dual-core VM with 640 MB of memory, there is also a lot of swapping in both directions...

Added:
892 MB is more than twice the amount of memory that newer Einstein apps "normally" use at startup.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 753242758
RAC: 1193033

RE: Just a curiosity, but

Message 79554 in response to message 79552

Quote:

Just a curiosity, but I'd be interested in an explanation.

Here's part of the debugging output of one of your tasks, which demonstrates this:

Quote:

2008-03-04 00:24:07.1718 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R3_4.36_windows_intelx86.exe'.
2008-03-04 00:24:07.3125 [debug]: Set up communication with graphics process.
2008-03-04 00:24:07.6093 [debug]: Reading SFTs and setting up stacks ... done
2008-03-04 00:25:16.2031 [normal]: INFO: Couldn't open checkpoint h1_0905.80_S5R3__339_S5R3b_1_0.cpt
2008-03-04 00:25:16.2031 [debug]: Total skypoints = 1202. Progress: 0,

When E@H starts up, it has to read in the different raw data input files (I think 6 or 8, each 3 to 4 MB in size). If 4 tasks start simultaneously, sometimes they will all use the same data files (probably good, because I'd expect them to enter the disk cache after the first task reads one), but another possibility is that the 4 tasks are working on different frequency ranges and try to read separate data files from disk. Heavy disk I/O from separate threads on different files, which may reside in different "corners" of the disk because of fragmentation, will result in poor throughput, because most of the time the processes are just waiting for the HD to position the heads over the right data. Maybe this can be reduced a bit by increasing the read-ahead buffer?
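On Linux, one way to hint at more aggressive read-ahead from inside the app is posix_fadvise(). A minimal sketch, assuming the data files are read front to back (the file name is only an example):

Code:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Example file name only; a real task would open its own SFT files.
    int fd = open("h1_0905.80_S5R3", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    // Declare a sequential access pattern; Linux responds by enlarging
    // the read-ahead window for this file descriptor.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    // Optionally ask the kernel to start pulling the whole file into
    // the page cache right away.
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

    // ... read and parse the file as usual ...
    close(fd);
    return 0;
}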

Still, the total amount of disk I/O seems a bit high. E@H is certainly not consuming this much memory, so I don't see why the disk I/O should be so high (where do all the bytes go after they are read in???)

CU
Bikeman

archae86
Joined: 6 Dec 05
Posts: 3160
Credit: 7256108949
RAC: 1465747

RE: Still the total amount

Message 79555 in response to message 79554

Quote:


Still, the total amount of disk I/O seems a bit high. E@H is certainly not consuming this much memory, so I don't see why the disk I/O should be so high (where do all the bytes go after they are read in???)

CU
Bikeman


Indeed, I know not where they go. Since Process Explorer shows "peak working set" sizes of about 72 megabytes per process, and stabilized working sets during execution of around 40 to 45 MB, the answer is not that it is occupying an obscene amount of RAM.

Since the total I/O Reads in the first rush are over twenty times higher than the maximum suggested by your comment:

Quote:
have to read in the different raw data input files (I think 6 or 8, each 3 to 4 MB in size)


It may be that the style in which they are read involves a great deal of re-reading for some reason. Possibly reading in a different order, or in an otherwise different style, would be more efficient.
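If the startup phase really does make many passes over the same files, one conventional remedy is to read each file into memory once and re-scan the buffer from then on. A generic sketch of that idea (nothing here is taken from the Einstein source; the file name is hypothetical):

Code:

#include <fstream>
#include <vector>
#include <string>
#include <cstdio>

// Read a whole file into memory in one pass, so that any later
// "re-read" is a memory access instead of disk I/O.
std::vector<char> slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    std::vector<char> buf(static_cast<size_t>(in.tellg()));
    in.seekg(0);
    in.read(buf.data(), buf.size());
    return buf;
}

int main() {
    // Hypothetical file name; any of the task's data files would do.
    std::vector<char> sft = slurp("h1_0905.80_S5R3");
    std::printf("read %zu bytes once; later passes cost no disk I/O\n",
                sft.size());
    return 0;
}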

Again, while this interests me, let me emphasize that I'm not suggesting this is a major improvement opportunity worthy of significant programming investment. Even on 4.36, these results take about 4 CPU hours to compute on my Q6600, and this "thrash phase" lasts less than 3 wall-clock minutes.

I should also mention, since I've posted these comments in a Linux app thread, that these observations are on a Windows XP system. I have no Linux systems on which to make observations.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 753242758
RAC: 1193033

RE: It may be that the

Message 79556 in response to message 79555

Quote:

It may be that the style in which they are read involves a great deal of re-reading for some reason. Possibly reading in a different order, or in an otherwise different style, would be more efficient.

Just a guess, but since the data files have to be downloaded, they are surely compressed. I'm not sure whether decompression happens in memory or via a temporary disk file, which would explain some of the overhead. But as you said, it's only a small part of the overall runtime and really not that significant, I guess.
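If the files are gzip-compressed, zlib can decompress them straight into memory with gzopen()/gzread(), avoiding any intermediate decompressed copy on disk. A minimal sketch, assuming zlib is available and using a made-up file name (build with -lz):

Code:

#include <zlib.h>
#include <vector>
#include <cstdio>

int main() {
    // Hypothetical compressed data file name.
    gzFile gz = gzopen("h1_0905.80_S5R3.gz", "rb");
    if (!gz) { std::fprintf(stderr, "open failed\n"); return 1; }

    // Inflate in 64 KB chunks, accumulating entirely in memory;
    // no temporary decompressed file ever touches the disk.
    std::vector<char> data;
    char chunk[65536];
    int n;
    while ((n = gzread(gz, chunk, sizeof chunk)) > 0) {
        data.insert(data.end(), chunk, chunk + n);
    }
    gzclose(gz);
    std::printf("decompressed %zu bytes in memory\n", data.size());
    return 0;
}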

Bikeman
