GNU/Linux S5R3 "power users" App 4.35 available

Keck_Komputers
Joined: 18 Jan 05
Posts: 376
Credit: 5744955
RAC: 0

RE: Hi! So why is this

Message 79547 in response to message 79546

Quote:

Hi!

So why is this good news? Synchronous DNS lookup causes lots of "no heartbeat for 30 seconds, exiting" incidents (at least it does under Linux), where all apps running under BOINC exit whenever there's a network problem. Looks more like a choice between two evils.

CU
Bikeman


It is currently the lesser of two evils. The async bug takes a host out of service until the participant intervenes manually; the sync bug recovers automatically in a minute or less. Both carry a rare possibility of a system freeze.
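For anyone wondering why a synchronous lookup stalls everything, here's a minimal sketch (not BOINC's actual code, just an illustration of the blocking call involved). A plain getaddrinfo() blocks its calling thread until the resolver answers or times out, so if that same thread is also responsible for the periodic heartbeats to the science apps, a dead nameserver silently starves them until they hit the 30-second limit and exit.

#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cstdio>

int main() {
    addrinfo hints{}, *res = nullptr;
    hints.ai_family   = AF_UNSPEC;    // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;

    // Blocks right here, potentially for many seconds if the
    // resolver is unreachable. A single-threaded client can send
    // no heartbeats until this call returns.
    int rc = getaddrinfo("einstein.phys.uwm.edu", "80", &hints, &res);
    if (rc != 0) {
        std::fprintf(stderr, "lookup failed: %s\n", gai_strerror(rc));
        return 1;
    }
    freeaddrinfo(res);
    return 0;
}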

BOINC WIKI

BOINCing since 2002/12/8

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

I see. Thanks for explaining,

I see. Thanks for explaining. Looks like there really isn't much I can do about it except hope to get my Internet connection fixed ASAP.

Wedge009
Joined: 5 Mar 05
Posts: 117
Credit: 15844593844
RAC: 6970384

I have noticed that sometimes

I have noticed that sometimes, when switching tasks from another project to Einstein 4.35, the work-unit gets 'stuck'. That is, the CPU remains idle and BOINC remains stuck trying to get Einstein to run. Manually switching to another project and then switching back does not change anything. Fortunately, stopping and restarting the BOINC service/daemon returns the Einstein WU to working status, and the WU then finishes successfully and validates.

So far I have only noticed this on one host, but it has happened more than once and on more than one work-unit, and I am quite sure it started when I began using Einstein 4.35. It's not a disastrous fault in the sense that no work already done is lost; however, a stuck WU leaving a host idle overnight isn't very useful. I am just wondering if anyone else has noticed this issue, and whether there is any way to resolve it. Given that Einstein 4.38-1 is the same as this application, the issue could be relevant to the 4.38 beta application as well.

Oh yes, and thanks to everyone for making such a big improvement to the overall performance.

Soli Deo Gloria

rroonnaalldd
Joined: 12 Dec 05
Posts: 116
Credit: 537221
RAC: 0

RE: I have noticed that

Message 79550 in response to message 79549

Quote:

I have noticed that sometimes, when switching tasks from another project to Einstein 4.35, the work-unit gets 'stuck'. That is, the CPU remains idle and BOINC remains stuck trying to get Einstein to run. Manually switching to another project and then switching back does not change anything. Fortunately, stopping and restarting the BOINC service/daemon returns the Einstein WU to working status, and the WU then finishes successfully and validates.

So far I have only noticed this on one host, but it has happened more than once and on more than one work-unit, and I am quite sure it started when I began using Einstein 4.35. It's not a disastrous fault in the sense that no work already done is lost; however, a stuck WU leaving a host idle overnight isn't very useful. I am just wondering if anyone else has noticed this issue, and whether there is any way to resolve it. Given that Einstein 4.38-1 is the same as this application, the issue could be relevant to the 4.38 beta application as well.

Oh yes, and thanks to everyone for making such a big improvement to the overall performance.

So far I have not seen anything like this on any of my four test hosts. But it could be that this fault only shows up if you run BOINC as a daemon or service.
What I have noticed is that whenever an Einstein app starts, there is also a longer burst of disk access and a noticeable shift in memory usage.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 691050235
RAC: 272239

Now I might be wrong but

Now I might be wrong, but isn't there a common theme here, namely that BOINC sometimes has problems suspending/resuming tasks in general? This seems to be most common when you use the CPU throttling feature (a high frequency of suspend requests), but it also affects resuming apps after a benchmark, and other situations. And it's not limited to E@H either. Just a feeling, though.
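To see why throttling multiplies the risk, consider how duty-cycle throttling works in general. What follows is only a sketch under assumptions (BOINC actually suspends apps through its own client-app messaging; the SIGSTOP/SIGCONT approach and the throttle() name below are stand-ins for illustration):

#include <signal.h>
#include <sys/types.h>
#include <chrono>
#include <thread>

// Run the app at roughly cpu_fraction of one core by suspending and
// resuming it once a second. throttle() is a made-up name.
void throttle(pid_t app, double cpu_fraction) {
    using namespace std::chrono;
    const milliseconds period(1000);
    const auto run  = duration_cast<milliseconds>(period * cpu_fraction);
    const auto stop = period - run;
    for (;;) {
        kill(app, SIGCONT);               // resume the science app
        std::this_thread::sleep_for(run);
        kill(app, SIGSTOP);               // suspend it again
        std::this_thread::sleep_for(stop);
        // Every iteration is another suspend/resume pair the app has
        // to survive, far more chances to hit a race than the odd
        // benchmark run gives.
    }
}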

CU
Bikeman

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059804931
RAC: 1135798

RE: What I have noticed is

Message 79552 in response to message 79550

Quote:
What I have noticed is that whenever an Einstein app starts, there is also a longer burst of disk access and a noticeable shift in memory usage.


Yes indeed. I never logged the numbers before the Windows 4.36 Einstein power-user app, but on that one, simultaneous 4-result Einstein starts on my Q6600 spend over two minutes in an intense I/O read phase, with Process Explorer showing the total I/O read bytes to be a very reproducible number near 892 million (though I just looked, and for three of my four running now it is instead 785 million).

If four results are starting together, they generally all finish this phase within a second or so of each other. If only one is starting, it gets through the phase faster (roughly a minute, compared to over two on my Q6600), suggesting that competition for some resource is partly at hand. Also suggesting contention: if four Einstein results are in this phase together, significant System Idle time is logged, along with some System time, and the four tasks each get only something like 16 to 18% credited directly to the task, rather than the 24.nn% they generally get for the remaining execution time. CPU temperature during this phase is also clearly depressed, along with power consumption.

Just a curiosity, but I'd be interested in an explanation.

rroonnaalldd
Joined: 12 Dec 05
Posts: 116
Credit: 537221
RAC: 0

RE: Yes indeed. I never

Message 79553 in response to message 79552

Quote:


Yes indeed. I never logged the numbers before the Windows 4.36 Einstein power-user app, but on that one, simultaneous 4-result Einstein starts on my Q6600 spend over two minutes in an intense I/O read phase, with Process Explorer showing the total I/O read bytes to be a very reproducible number near 892 million (though I just looked, and for three of my four running now it is instead 785 million).

If four results are starting together, they generally all finish this phase within a second or so of each other. If only one is starting, it gets through the phase faster (roughly a minute, compared to over two on my Q6600), suggesting that competition for some resource is partly at hand. Also suggesting contention: if four Einstein results are in this phase together, significant System Idle time is logged, along with some System time, and the four tasks each get only something like 16 to 18% credited directly to the task, rather than the 24.nn% they generally get for the remaining execution time. CPU temperature during this phase is also clearly depressed, along with power consumption.

Just a curiosity, but I'd be interested in an explanation.

892 MB for 4 tasks and 785 MB for 3? Hmmm, ~110 MB at startup is quite a bit more than the 40-45 MB each running Einstein task uses. 892 MB is nearly a gigabyte of RAM. If each Einstein app needs ~110 MB to begin its work, that could result in a large amount of swapping. I think today's PC systems and/or OSes are not at their best under the heaviest multi-core usage.
Are you a dedicated cruncher?
1 GB of RAM sounds to me like other applications are also in memory. My IE7 with 30 open tabs also sits between 15 MB in the task list and 60 MB for a full-screen refresh, and its virtual RAM usage is always around 215 MB. When I start or shut down my SUSE 10.2 64-bit dual-core VM with 640 MB of memory, there is likewise a large amount of swapping in both directions...

add:
892 MB is more than twice the amount of memory that newer Einstein apps "normally" use at startup.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 691050235
RAC: 272239

RE: Just a curiosity, but

Message 79554 in response to message 79552

Quote:

Just a curiosity, but I'd be interested in an explanation.

Here's some part of the debugging output of one of your tasks which demonstrates this:

Quote:

2008-03-04 00:24:07.1718 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R3_4.36_windows_intelx86.exe'.
2008-03-04 00:24:07.3125 [debug]: Set up communication with graphics process.
2008-03-04 00:24:07.6093 [debug]: Reading SFTs and setting up stacks ... done
2008-03-04 00:25:16.2031 [normal]: INFO: Couldn't open checkpoint h1_0905.80_S5R3__339_S5R3b_1_0.cpt
2008-03-04 00:25:16.2031 [debug]: Total skypoints = 1202. Progress: 0,

When E@H starts up, it has to read in the various raw data input files (I think 6 or 8, each 3 to 4 MB in size). If 4 tasks start simultaneously, sometimes they will all use the same data files (probably good, because I'd expect those to land in the disk cache after the first task reads them), but another possibility is that the 4 tasks are working on different frequency ranges and try to read separate data files from disk. Heavy disk I/O from separate threads on different files, which may reside in different "corners" of the disk because of fragmentation, will result in poor throughput, because most of the time the processes are just waiting for the HD to position its heads over the right data. Maybe this can be reduced a bit by increasing the read-ahead buffer?
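To make the read-ahead idea concrete, here's a rough sketch of how a reader could hint the kernel that a file will be streamed sequentially, so it prefetches larger chunks and the task spends less time waiting on head seeks. This is assumed illustration code, not from the E@H source; read_data_file is a hypothetical name, and posix_fadvise() is advisory only.

#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: stream one data file with large reads.
bool read_data_file(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    // Advise sequential access over the whole file (len 0 = to EOF),
    // letting the kernel read ahead more aggressively.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    std::vector<char> buf(4 * 1024 * 1024);   // 4 MB per read, fewer seeks
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0) {
        // ... hand the chunk to the SFT parser here ...
    }
    close(fd);
    return n == 0;   // 0 = clean EOF, -1 = read error
}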

Still, the total amount of disk I/O seems a bit high. E@H is certainly not consuming this much memory, so I don't see why the disk I/O should be so high (where do all the bytes go after they are read in???)

CU
Bikeman

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7059804931
RAC: 1135798

RE: Still the total amount

Message 79555 in response to message 79554

Quote:


Still, the total amount of disk I/O seems a bit high. E@H is certainly not consuming this much memory, so I don't see why the disk I/O should be so high (where do all the bytes go after they are read in???)

CU
Bikeman


Indeed, I know not where they go. Since Process Explorer shows "peak working set" sizes of about 72 MB per process, and stabilized working sets during execution of around 40 to 45 MB, the answer is not that it is occupying an obscene amount of RAM.

Since the total I/O reads in the first rush are over twenty times higher than the maximum suggested by your comment:

Quote:
has to read in the various raw data input files (I think 6 or 8, each 3 to 4 MB in size)


It may be that the style in which they are read involves a great deal of re-reading for some reason. Possibly reading in a different order, or in an otherwise different style, would be more efficient.
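Purely to illustrate the re-reading point (this is assumed code, not from the E@H source; load_once is a made-up name): if each file were slurped into a memory cache on first access and parsed from there, every later pass over the same data would cost no disk I/O at all.

#include <fstream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Cache of file contents, keyed by path. Grows to the total size of
// the distinct files read, fine here since that is only ~20-30 MB.
static std::map<std::string, std::vector<char>> file_cache;

const std::vector<char>& load_once(const std::string& path) {
    auto it = file_cache.find(path);
    if (it != file_cache.end()) return it->second;   // already in RAM

    std::ifstream in(path, std::ios::binary);
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return file_cache.emplace(path, std::move(bytes)).first->second;
}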

Again, while this interests me, let me emphasize that I'm not suggesting this is a major improvement opportunity worthy of significant programming investment. Even on 4.36, these results take about 4 CPU hours to compute on my Q6600, and this "thrash phase" lasts less than 3 wall-clock minutes.

I should also mention, since I've posted these comments in a Linux app thread, that these observations are from a Windows XP system. I have no Linux systems on which to make observations.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 691050235
RAC: 272239

RE: It may be that the

Message 79556 in response to message 79555

Quote:

It may be that the style in which they are read involves a great deal of re-reading for some reason. Possibly reading in a different order, or in an otherwise different style, would be more efficient.

Just a guess, but since the data files have to be downloaded, they are surely compressed. I'm not sure whether decompression happens in memory or via a temporary disk file, which would explain some of the overhead. But as you said, it's only a small part of the overall runtime and really not that significant, I guess.
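If it does happen in memory, it could look roughly like this zlib sketch. This is an assumption for illustration only; I'm not claiming the app does it this way, and gunzip here is a made-up helper name.

#include <zlib.h>
#include <cstring>
#include <vector>

// Decompress a whole gzip'd blob in RAM; returns empty on error.
std::vector<unsigned char> gunzip(const std::vector<unsigned char>& in) {
    z_stream zs;
    std::memset(&zs, 0, sizeof zs);
    // 16 + MAX_WBITS tells zlib to expect a gzip header.
    if (inflateInit2(&zs, 16 + MAX_WBITS) != Z_OK) return {};

    zs.next_in  = const_cast<unsigned char*>(in.data());
    zs.avail_in = static_cast<uInt>(in.size());

    std::vector<unsigned char> out;
    unsigned char chunk[16384];
    int rc;
    do {
        zs.next_out  = chunk;
        zs.avail_out = sizeof chunk;
        rc = inflate(&zs, Z_NO_FLUSH);
        if (rc != Z_OK && rc != Z_STREAM_END) { inflateEnd(&zs); return {}; }
        out.insert(out.end(), chunk, chunk + (sizeof chunk - zs.avail_out));
    } while (rc != Z_STREAM_END);

    inflateEnd(&zs);
    return out;   // no temporary file was ever touched
}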

Bikeman
