Multi-directed GW search

Alan Barnes
Joined: 3 Oct 13
Posts: 10
Credit: 7384740
RAC: 0
Topic 201975

Multi-Directed Continuous Gravitational Wave search Tuning run G v1.01 (SSE2)
i686-pc-linux-gnu

seems consistently (7 times out of 7) to terminate with an error after approximately 6 seconds run-time on my

GenuineIntel Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz [Family 6 Model 15 Stepping 13]

running Linux 4.4.0-42-generic  (32 bit Ubuntu 14.04).

 

No problems to date on 64 bit Ubuntu 16.04  (GenuineIntel Pentium(R) Dual-Core CPU T4500 @ 2.30GHz [Family 6 Model 23 Stepping 10])

also running Linux 4.4.0-42-generic, although the two WUs have yet to complete so far (~2 hours run-time).

 

Alan Barnes

mountkidd
Joined: 14 Jun 12
Posts: 176
Credit: 12555682555
RAC: 8019272

I'm having the same problem on an i5-3570k host running 64-bit Kubuntu 14.04. Tasks run for 7 seconds and then error out, 26 out of 26 so far. Here's a link to the host: https://einsteinathome.org/host/5501745/tasks/error

 

Marcin Pietrzak
Joined: 10 Jun 09
Posts: 1
Credit: 428324
RAC: 0

Same here on Linux Mint 17.3 64 bit, 3.19.0-32-generic, i7-4702MQ.
46 of 46 terminated after a few seconds.
https://einsteinathome.org/host/11467994/tasks/error

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 188192100
RAC: 329501

I noticed the problem too and I'm still monitoring it. So far I couldn't find a pattern that explains why this is happening on some hosts. It seems to happen on Linux only, but it is not related to a specific CPU or OS version. It also does not look like an application problem, as there are some successful results using the same app on the same CPU type.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

I could not see any Linux hosts with AVX-enabled CPUs generating results, but I probably didn't search hard enough. My sample is drawn from those reporting successes and failures in these threads.

Edit: after searching some more I did find one returning good results, so apologies, this was just a red herring.

See my post https://einsteinathome.org/goto/comment/150777

(There are other hosts without AVX generating errors with the other application.)

I have one host generating a detailed stack trace -

Stack trace of LAL functions in worker thread:
SetUpSFTs at /home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX64/TARGET/linux-x86_64/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/HierarchSearchGCT.c:2065
LALExtrapolatePulsarSpinRange at /home/jenkins/workspace/workspace/EAH-GW-Master/SLAVE/LINUX64/TARGET/linux-x86_64/EinsteinAtHome/source/lalsuite/lalpulsar/src/ExtrapolatePulsarSpins.c:293
At lowest level status code = 0: NO LAL ERROR REGISTERED
20:43:41 (12351): called boinc_finish

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Just to add to the strangeness.

I have a host https://einsteinathome.org/host/11905468 which is generating the errors.

Using VirtualBox I created a VM with the same OS, installed BOINC and attached it to E@H. This is the (virtual) host: https://einsteinathome.org/host/12268233

It does not error out tasks, and has been crunching away for several hours without error.

Not sure if that helps, but at least I have a way around the error.

Edit: Real host

14:30:38 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts

Edit: Virtual host  (Virtualbox 5.0.18)

16-Oct-2016 11:54:39 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx rdrand lahf_lm abm

 Edit2: I should have mentioned the real host has produced some good results
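The two feature lists are easier to compare as sets. Here is a minimal Python sketch that does so, assuming the flags are exactly as logged in this post (with the split "pclm ulqdq" token in the virtual host line read as pclmulqdq). The variable names are illustrative only. It shows what the VM hides from the guest; it does not show which flag, if any, is tied to the errors.

```python
# Flags copied from the two "Processor features" log lines in this post.
real_host = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3
sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm tpr_shadow vnmi
flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms
invpcid rtm xsaveopt dtherm ida arat pln pts
""".split()

# Assumption: "pclm ulqdq" in the virtual host log is a broken "pclmulqdq".
virtual_host = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl
xtopology nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 movbe popcnt
aes xsave avx rdrand lahf_lm abm
""".split()

# Set differences: what each host reports that the other does not.
only_on_real = set(real_host) - set(virtual_host)
only_on_virtual = set(virtual_host) - set(real_host)

print("Only on real host:   ", " ".join(sorted(only_on_real)))
print("Only on virtual host:", " ".join(sorted(only_on_virtual)) or "(none)")
```

Both hosts report avx, while avx2, fma, f16c, bmi1 and bmi2 appear only on the real host.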

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7219624931
RAC: 975722

Some folks may see some DCF-related craziness in running this work.  My most capable host has been running just BRP4G-cuda55 work on a GTX1070 plus a GTX 1060 and no CPU tasks.  The DCF at this moment is reported as  .40 which gives elapsed time estimates for the BRP4G work in queue of 0:36:53, near the midpoint of the actual 29 minutes for (3x) 1070 tasks and 42 minutes for 1060 tasks.  

The rub is that the estimated elapsed time for the 1.01 Multi-Directed CV tasks is showing as 1:57:00, while the single task in progress has reached 94.6% completion at 13:53:00 elapsed time.  So presumably on completion of that task the DCF will bump straight up to something quite near 3.0, raising the estimated amount of GPU work in queue by a factor of over 7.

I don't know whether my CV task is an unusually difficult one, or even whether there is something mis-configured on my machine that is greatly slowing it.  But if this is typical I think the estimated work contained in the CV tasks may need a substantial revision upward to behave well in scheduling.
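The arithmetic behind the "near 3.0" and "over a factor of 7" figures can be sketched as follows. This assumes the displayed estimate scales linearly with DCF and that the client raises DCF in one step to match a task that ran longer than estimated. Both are simplifications, not a description of the actual BOINC client code, and the helper name is made up for illustration.

```python
def hms_to_seconds(text):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

current_dcf = 0.40                              # DCF reported in the post
estimated_s = hms_to_seconds("1:57:00")         # estimate shown for the CV task
elapsed_s = hms_to_seconds("13:53:00")          # elapsed time so far
fraction_done = 0.946                           # 94.6% complete

# Project the final run time from progress so far.
projected_total_s = elapsed_s / fraction_done

# How far off the current estimate is.
ratio = projected_total_s / estimated_s

# Assumption: DCF jumps so that the estimate would have matched the run time.
implied_dcf = current_dcf * ratio

print(f"projected run time: {projected_total_s / 3600:.2f} h")
print(f"actual / estimated: {ratio:.2f}")
print(f"implied DCF:        {implied_dcf:.2f}")
```

Since the same DCF applies to every task from the project, the BRP4G estimates in the queue would grow by the same ratio.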

 

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

archae86 wrote:
I don't know whether my CV task is an unusually difficult one, or even whether there is something mis-configured on my machine that is greatly slowing it.  But if this is typical I think the estimated work contained in the CV tasks may need a substantial revision upward to behave well in scheduling.

By coincidence the host i mentioned below is running the same CPU, task completion times are quite varied ranging from 829s to 25,884s  (along with many 8s errors).   My other i7-860 host has been quite stable around 16000s.  The virtual host has not completed any yet but is looking to be around 30,000s

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7219624931
RAC: 975722

AgentB wrote:
By coincidence the host i mentioned below is running the same CPU, task completion times are quite varied ranging from 829s to 25,884s  (along with many 8s errors).  

Do the tasks arrive with varying work required estimates?  Or does the server send them all marked the same (as indicated by the "Remaining (estimated)" column in the boincmgr tasks list, or the "Time Left" column in BoincTasks)?

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

archae86 wrote:
Do the tasks arrive with varying work required estimates?  Or does the server send them all marked the same (as indicated by the "Remaining (estimated)" column in the boincmgr tasks list, or the "Time Left" column in BoincTasks)?

I hadn't noticed that at the time, but yes. Looking over the job_log files for these tasks, you can see a large difference in the "flops" figure (the fe field; the largest is 25 times the smallest), which explains the difference in run times.

1476397563 ue 536.781145 ct 975.752000 fe 5760000000000 nm h1_0034.75_O1C02Cl1In1C__O1MD1TCV_CasA_34.85Hz_1_0 et 987.293506 es 0
1476397572 ue 536.781145 ct 985.924000 fe 5760000000000 nm h1_0034.70_O1C02Cl1In1C__O1MD1TCV_CasA_34.80Hz_1_0 et 997.928398 es 0
1476397575 ue 536.781145 ct 982.964000 fe 5760000000000 nm h1_0034.80_O1C02Cl1In1C__O1MD1TCV_CasA_34.90Hz_1_0 et 994.826568 es 0
1476398393 ue 536.781145 ct 822.328000 fe 5760000000000 nm h1_0034.85_O1C02Cl1In1C__O1MD1TCV_CasA_34.95Hz_1_0 et 829.568351 es 0
1476429311 ue 939.367003 ct 1965.548000 fe 10080000000000 nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_VelaJr_149.15Hz_1_1 et 1990.800820 es 0
1476439039 ue 5904.592593 ct 11605.230000 fe 63360000000000 nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_CasA_149.20Hz_2_1 et 11718.838962 es 0
1476453205 ue 13419.528620 ct 25663.980000 fe 144000000000000 nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_VelaJr_149.15Hz_0_1 et 25884.797168 es 0
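For anyone who wants to check their own job_log, here is a minimal Python sketch that splits such lines into fields and compares the fe values. The field meanings in the comments (fe as the server's flops estimate, et as elapsed seconds, and so on) are my reading of the log and should be treated as assumptions. The function name is made up, and only the fastest and slowest entries from above are included.

```python
def parse_job_log_line(line):
    """Split one job_log line into a dict.

    Layout assumed here: a Unix timestamp followed by key/value pairs.
    Assumed meanings: ue = uncorrected runtime estimate, ct = CPU time,
    fe = estimated floating-point operations, nm = task name,
    et = elapsed time, es = exit status.
    """
    tokens = line.split()
    record = {"timestamp": int(tokens[0])}
    for key, value in zip(tokens[1::2], tokens[2::2]):
        if key == "nm":
            record[key] = value
        elif key in ("fe", "es"):
            record[key] = int(value)
        else:
            record[key] = float(value)
    return record

# Two entries copied from the log above (the fastest and the slowest task).
sample_lines = [
    "1476398393 ue 536.781145 ct 822.328000 fe 5760000000000 "
    "nm h1_0034.85_O1C02Cl1In1C__O1MD1TCV_CasA_34.95Hz_1_0 et 829.568351 es 0",
    "1476453205 ue 13419.528620 ct 25663.980000 fe 144000000000000 "
    "nm h1_0149.05_O1C02Cl1In1C__O1MD1TCV_VelaJr_149.15Hz_0_1 et 25884.797168 es 0",
]

records = [parse_job_log_line(line) for line in sample_lines]
smallest = min(records, key=lambda r: r["fe"])
largest = max(records, key=lambda r: r["fe"])

fe_ratio = largest["fe"] / smallest["fe"]
et_ratio = largest["et"] / smallest["et"]

print(f"fe ratio (largest / smallest): {fe_ratio:.1f}")
print(f"et ratio (largest / smallest): {et_ratio:.1f}")
```

The elapsed-time ratio comes out in the same ballpark as the fe ratio, which is what you would expect if the server's estimate tracks the real amount of work.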

 

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7219624931
RAC: 975722

While CV work distribution has been continuing at a rapid pace, G work dried up quite a few hours ago, and the O1MD1TG work generator line on the Einstein server status page has shown red "not running" during much of that time.  The legend asserts that not running status means "Program failed or ran out of work (or the project is down)". 
