Multi-directed GW search

Alan Barnes

Joined: 3 Oct 13

Posts: 10

Credit: 7384740

RAC: 0

15 Oct 2016 17:41:08 UTC

Topic 201975

Multi-Directed Continuous Gravitational Wave search Tuning run G v1.01 (SSE2)
i686-pc-linux-gnu

seems consistently (7 times out of 7) to terminate with an error after approximately 6 seconds run-time on my

GenuineIntel Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz [Family 6 Model 15 Stepping 13]

running Linux 4.4.0-42-generic (32 bit Ubuntu 14.04).

No problems to date on 64 bit Ubuntu 16.04 (GenuineIntel Pentium(R) Dual-Core CPU T4500 @ 2.30GHz [Family 6 Model 23 Stepping 10])

also running Linux 4.4.0-42-generic, although the two WUs have yet to complete so far (~2 hours run-time).

Alan Barnes

mountkidd

Joined: 14 Jun 12

Posts: 176

Credit: 12555682555

RAC: 8019272

15 Oct 2016 20:13:26 UTC

Message 150751

I'm having the same problem on an i5-3570k host, Kubuntu 14.04 64bit. Tasks run for 7 sec then error out - 26 out of 26 so far... Here's a link to the host: https://einsteinathome.org/host/5501745/tasks/error

Marcin Pietrzak

Joined: 10 Jun 09

Posts: 1

Credit: 428324

RAC: 0

15 Oct 2016 22:35:59 UTC

Message 150755

Same here on Linux Mint 17.3 64 bit, 3.19.0-32-generic, i7-4702MQ.
46 of 46 terminated after a few seconds.
https://einsteinathome.org/host/11467994/tasks/error

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 188192100

RAC: 329501

15 Oct 2016 22:45:06 UTC

Message 150757

I noticed the problem too and I'm still monitoring it. So far I couldn't find a pattern that explains why this is happening on some hosts. It seems to be happening on Linux only but is not related to a specific CPU or OS version. It also does not look like an application problem, as there are some successful results using the same app on the same CPU type.

AgentB

Joined: 17 Mar 12

Posts: 915

Credit: 513211304

RAC: 0

16 Oct 2016 9:55:14 UTC

Message 150779 in response to message 150757

I could not see any Linux hosts with AVX enabled CPUs generating results, i probably didn't search hard enough - my sample size is drawn from those reporting success and failures in these threads.

Edit: after searching some more - i did find one returning good results, so apologies, just a herring coloured red.

See my post https://einsteinathome.org/goto/comment/150777

(There are other non AVX enabled hosts generating errors with the other application.)

I have one host generating a detailed stack trace -

Stack trace of LAL functions in worker thread:
SetUpSFTs at /home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX64/TARGET/linux-x86_64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:2065
LALExtrapolatePulsarSpinRange at /home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX64/TARGET/linux-x86_64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ExtrapolatePulsarSpins.c:293
At lowest level status code = 0: NO LAL ERROR REGISTERED
20:43:41 (12351): called boinc_finish

AgentB

Joined: 17 Mar 12

Posts: 915

Credit: 513211304

RAC: 0

16 Oct 2016 16:12:30 UTC

Message 150781

Just to add to the strangeness.

I have a host https://einsteinathome.org/host/11905468 which is generating the errors.

Using VirtualBox i created a VM with the same OS, installed BOINC and attached it to E@H. This is the (virtual) host: https://einsteinathome.org/host/12268233

It does not error out tasks, and has been crunching away for several hours without error.

Not sure if that helps but at least i have a method around the error.

Edit: Real host

14:30:38 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts

Edit: Virtual host (Virtualbox 5.0.18)

16-Oct-2016 11:54:39 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx rdrand lahf_lm abm

Edit2: I should have mentioned the real host has produced some good results
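For anyone who wants to compare the two "Processor features" lines without reading them by eye, here is a minimal Python sketch that diffs two flag lists. The function name `diff_flags` is made up for this example, and the two strings are short excerpts taken from the lists above, not the full lines. Whether any of the flags missing in the VM has anything to do with the errors is not established here.

```python
def diff_flags(real, virtual):
    """Return (only_real, only_virtual) as sorted lists of flag names.

    Each argument is the space-separated flag list from a BOINC
    "Processor features:" log line (without the timestamp prefix).
    """
    real_set = set(real.split())
    virtual_set = set(virtual.split())
    return sorted(real_set - virtual_set), sorted(virtual_set - real_set)


# Excerpts only: a handful of flags copied from the two lists above.
REAL_EXCERPT = "sse2 ssse3 fma sse4_1 sse4_2 avx f16c avx2 bmi1 bmi2"
VIRTUAL_EXCERPT = "sse2 ssse3 sse4_1 sse4_2 avx"

only_real, only_virtual = diff_flags(REAL_EXCERPT, VIRTUAL_EXCERPT)
print("only on real host:   ", " ".join(only_real))
print("only on virtual host:", " ".join(only_virtual))
```

Pasting the complete flag lists into the two strings gives the full difference between the real host and the VirtualBox guest.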

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7219624931

RAC: 975722

16 Oct 2016 15:04:25 UTC

Message 150782

Some folks may see some DCF-related craziness in running this work. My most capable host has been running just BRP4G-cuda55 work on a GTX1070 plus a GTX 1060 and no CPU tasks. The DCF at this moment is reported as .40 which gives elapsed time estimates for the BRP4G work in queue of 0:36:53, near the midpoint of the actual 29 minutes for (3x) 1070 tasks and 42 minutes for 1060 tasks.

The rub is that the estimated elapsed time for the 1.01 Multi-Directed CV tasks is showing as 1:57:00, while the single task in progress has reached 94.6% completion at 13:53:00 elapsed time. So presumably on completion of that task the DCF will bump straight up to something quite near 3.0, raising the estimated amount of GPU work in queue by over a factor of 7.

I don't know whether my CV task is an unusually difficult one, or even whether there is something mis-configured on my machine that is greatly slowing it. But if this is typical I think the estimated work contained in the CV tasks may need a substantial revision upward to behave well in scheduling.
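The arithmetic behind the "near 3.0" and "factor of 7" figures can be checked with a short Python sketch. It rests on two assumptions that are not confirmed in this thread: that the displayed estimate is the server's raw estimate multiplied by the DCF, and that on an underestimate the DCF jumps straight to actual time divided by raw estimate. The variable names are only for illustration.

```python
def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


current_dcf = 0.40                          # DCF reported by the client
shown_estimate = hms_to_seconds("1:57:00")  # estimate shown for a CV task
elapsed = hms_to_seconds("13:53:00")        # elapsed time of the running task
fraction_done = 0.946                       # 94.6% complete

# Project the total run time from progress so far (assumes linear progress).
projected_total = elapsed / fraction_done

# Assumption: shown estimate = raw estimate * DCF, so undo the DCF scaling.
raw_estimate = shown_estimate / current_dcf

# Assumption: DCF moves directly to actual / raw estimate on an underestimate.
new_dcf = projected_total / raw_estimate
growth = new_dcf / current_dcf

print(f"projected total run time: {projected_total:.0f} s")
print(f"new DCF: {new_dcf:.2f}")
print(f"queue estimates grow by a factor of {growth:.1f}")
```

Under those assumptions every estimate in the queue, including the BRP4G GPU tasks, would be scaled by the same growth factor until the DCF drifts back down.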

AgentB

Joined: 17 Mar 12

Posts: 915

Credit: 513211304

RAC: 0

16 Oct 2016 16:37:05 UTC

Message 150787 in response to message 150782

archae86 wrote:

I don't know whether my CV task is an unusually difficult one, or even whether there is something mis-configured on my machine that is greatly slowing it. But if this is typical I think the estimated work contained in the CV tasks may need a substantial revision upward to behave well in scheduling.

By coincidence the host i mentioned below is running the same CPU, task completion times are quite varied ranging from 829s to 25,884s (along with many 8s errors). My other i7-860 host has been quite stable around 16000s. The virtual host has not completed any yet but is looking to be around 30,000s

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7219624931

RAC: 975722

16 Oct 2016 18:52:21 UTC

Message 150794 in response to message 150787

AgentB wrote:

By coincidence the host i mentioned below is running the same CPU, task completion times are quite varied ranging from 829s to 25,884s (along with many 8s errors).

Do the tasks arrive with varying work required estimates? Or does the server send them all marked the same (as indicated by the "Remaining (estimated)" column in the boincmgr tasks list, or the "Time Left" column in BoincTasks)?

AgentB

Joined: 17 Mar 12

Posts: 915

Credit: 513211304

RAC: 0

16 Oct 2016 20:05:46 UTC

Message 150799 in response to message 150794

archae86 wrote:

Do the tasks arrive with varying work required estimates? Or does the server send them all marked the same (as indicated by the "Remaining (estimated)" column in the boincmgr tasks list, or the "Time Left" column in BoincTasks)?

I hadn't noticed that at the time, but yes: looking over the job_log files for these tasks, you see a large difference in the "fe" (estimated flops) figure (smallest*25 = largest), which explains the difference.

1476397563 ue 536.781145 ct 975.752000 fe 5760000000000 nm h1_0034.75_O1C02Cl1In1C__O1MD1TCV_CasA_34.85Hz_1_0 et 987.293506 es 0
1476397572 ue 536.781145 ct 985.924000 fe 5760000000000 nm h1_0034.70_O1C02Cl1In1C__O1MD1TCV_CasA_34.80Hz_1_0 et 997.928398 es 0
1476397575 ue 536.781145 ct 982.964000 fe 5760000000000 nm h1_0034.80_O1C02Cl1In1C__O1MD1TCV_CasA_34.90Hz_1_0 et 994.826568 es 0
1476398393 ue 536.781145 ct 822.328000 fe 5760000000000 nm h1_0034.85_O1C02Cl1In1C__O1MD1TCV_CasA_34.95Hz_1_0 et 829.568351 es 0
1476429311 ue 939.367003 ct 1965.548000 fe 10080000000000 nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_VelaJr_149.15Hz_1_1 et 1990.800820 es 0
1476439039 ue 5904.592593 ct 11605.230000 fe 63360000000000 nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_CasA_149.20Hz_2_1 et 11718.838962 es 0
1476453205 ue 13419.528620 ct 25663.980000 fe 144000000000000 nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_VelaJr_149.15Hz_0_1 et 25884.797168 es 0
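The job_log entries above can be pulled apart with a few lines of Python to check the factor of 25. This is a rough sketch: it assumes each entry is a timestamp followed by key/value pairs, and it reads "fe" as the estimated flops, "et" as elapsed seconds and "nm" as the task name, which matches the numbers quoted in this thread but is not taken from BOINC documentation. The helper name `parse_job_log_line` is made up for this example.

```python
JOB_LOG = """\
1476397563 ue 536.781145 ct 975.752000 fe 5760000000000 nm h1_0034.75_O1C02Cl1In1C__O1MD1TCV_CasA_34.85Hz_1_0 et 987.293506 es 0
1476397572 ue 536.781145 ct 985.924000 fe 5760000000000 nm h1_0034.70_O1C02Cl1In1C__O1MD1TCV_CasA_34.80Hz_1_0 et 997.928398 es 0
1476397575 ue 536.781145 ct 982.964000 fe 5760000000000 nm h1_0034.80_O1C02Cl1In1C__O1MD1TCV_CasA_34.90Hz_1_0 et 994.826568 es 0
1476398393 ue 536.781145 ct 822.328000 fe 5760000000000 nm h1_0034.85_O1C02Cl1In1C__O1MD1TCV_CasA_34.95Hz_1_0 et 829.568351 es 0
1476429311 ue 939.367003 ct 1965.548000 fe 10080000000000 nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_VelaJr_149.15Hz_1_1 et 1990.800820 es 0
1476439039 ue 5904.592593 ct 11605.230000 fe 63360000000000 nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_CasA_149.20Hz_2_1 et 11718.838962 es 0
1476453205 ue 13419.528620 ct 25663.980000 fe 144000000000000 nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_VelaJr_149.15Hz_0_1 et 25884.797168 es 0
"""


def parse_job_log_line(line):
    """Split one entry into a dict: 'time' plus the key/value pairs."""
    tokens = line.split()
    entry = {"time": int(tokens[0])}
    # The remaining tokens alternate between a short key and its value.
    for key, value in zip(tokens[1::2], tokens[2::2]):
        entry[key] = value if key == "nm" else float(value)
    return entry


entries = [parse_job_log_line(line) for line in JOB_LOG.splitlines() if line.strip()]

fe_values = [entry["fe"] for entry in entries]
ratio = max(fe_values) / min(fe_values)
print(f"largest fe / smallest fe = {ratio:.0f}")

for entry in entries:
    seconds_per_tflop = entry["et"] / entry["fe"] * 1e12
    print(f'{entry["nm"]}: et={entry["et"]:.0f}s, {seconds_per_tflop:.0f} s per 1e12 fe')
```

The per-task "s per 1e12 fe" column stays in the same ballpark across all seven tasks, so the wide spread in run times follows the spread in the "fe" figure.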

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7219624931

RAC: 975722

16 Oct 2016 21:34:06 UTC

Message 150801

While CV work distribution has been continuing at a rapid pace, G work dried up quite a few hours ago, and the O1MD1TG work generator line on the Einstein server status page has shown red "not running" during much of that time. The legend says that a "not running" status means "Program failed or ran out of work (or the project is down)".
