Many, many errors while computing

Pushkin

Joined: 12 Mar 07

Posts: 15

Credit: 33187685

RAC: 0

17 Jun 2019 7:19:08 UTC

Topic 219048

Hi all,
for the last few weeks I have been getting many computation errors on my PC. It is a Ryzen 7 1800X running Debian Stretch; a list of my tasks is here. All of the failures are in Continuous Gravitational Wave search O2 All-Sky v1.01 x86_64-pc-linux-gnu tasks.

The PC is not overclocked and shows no problems while running other applications.

Do you have any recommendations on what to do about this problem?

Thank you,
pushkin

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5882

Credit: 118967648596

RAC: 24076850

17 Jun 2019 9:15:00 UTC

Message 171741

I picked one of the failed tasks at random and looked at what was returned to the project through the stderr.txt output. The following information was given right near the end.
2019-06-12 11:13:21.51264 -- signal handler called: signal 11
I presume (but didn't check) that others are the same sort of error.

A signal 11 - a segmentation violation - means that the process tried to access an 'out of bounds' memory location - a location that wasn't part of its assigned address space. Sometimes this might be caused by a bug in the program and if this was a new program under test that might be the most likely reason. However this program has been around for a while and if a bug was causing all your tasks to fail, it should be the same for everybody else.
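For anyone who wants to check a batch of stderr outputs rather than reading them one by one, here is a minimal Python sketch. It assumes the log format matches the single line quoted above; the helper names `find_signals` and `signal_name` are made up for this illustration and are not part of BOINC or the Einstein@Home application.

```python
import re
import signal

# Pattern for the stderr line quoted above. The log format is assumed
# from that one example and may differ for other applications.
SIGNAL_LINE = re.compile(r"signal handler called: signal (\d+)")


def find_signals(stderr_text):
    """Return every signal number reported in a task's stderr text."""
    return [int(match.group(1)) for match in SIGNAL_LINE.finditer(stderr_text)]


def signal_name(number):
    """Translate a signal number into its symbolic name on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "unknown signal %d" % number


sample = "2019-06-12 11:13:21.51264 -- signal handler called: signal 11"
for number in find_signals(sample):
    print(number, signal_name(number))
```

On Linux, signal 11 maps to SIGSEGV, which is the segmentation violation described above.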

I'm not a programmer so I don't know the ins and outs of this sort of stuff but my basic understanding is that perhaps you have a faulty memory location which is part of your program's address space. This location contains bad data which is directing the executing program to try to access a different memory location that is not part of its own address space and the system is detecting this illegal access attempt. I'm sure someone will jump on me if that's not possible or not a proper explanation :-).

Until someone comes along with a better explanation, you should run a memory checking program like memtest, or perhaps try different RAM sticks if you can to see if the problem goes away.

EDIT: If your RAM checks out OK, you should scan your hard drive for bad sectors - just in case. Perhaps something read incorrectly from disk might be at fault.

Cheers,
Gary.

Pushkin

Joined: 12 Mar 07

Posts: 15

Credit: 33187685

RAC: 0

17 Jun 2019 10:08:26 UTC

Message 171743 in response to message 171741

Gary,
you're right, all of them are segfaults. I'll run some hardware checks.

Thank you,
pushkin

EDIT: I checked my Rosetta account and there are also a few segfaults. Something really might be wrong.

mmonnin

Joined: 29 May 16

Posts: 292

Credit: 3444726540

RAC: 33590

17 Jun 2019 11:48:36 UTC

Message 171745

Early Zen CPUs have segfault issues. Is yours an early 1800x?

Pushkin

Joined: 12 Mar 07

Posts: 15

Credit: 33187685

RAC: 0

17 Jun 2019 12:06:03 UTC

Message 171746 in response to message 171745

mmonnin wrote:

Early Zen CPUs have segfault issues. Is yours an early 1800x?

It is possible; I have been using it for nearly a year. The output of cat /proc/cpuinfo is:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen 7 1800X Eight-Core Processor
stepping        : 1
microcode       : 0x8001137
cpu MHz         : 3600.000
cache size      : 512 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf ibpb arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic overflow_recov succor smca
bugs            : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 7185.99
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]
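
Only a handful of these fields identify the chip itself. As a rough sketch, the Python below pulls the family, model, stepping and microcode values out of cpuinfo-style text. The function name `cpu_identity` and the shortened `SAMPLE` string are invented for this example, and the sketch makes no claim about whether these values can tell an early chip from a later one.

```python
def cpu_identity(cpuinfo_text):
    """Collect the identifying fields of the first processor entry."""
    wanted = ("cpu family", "model", "model name", "stepping", "microcode")
    found = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if found:
                break  # a blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in wanted and key not in found:
            found[key] = value.strip()
    return found


# Shortened copy of the output above, used only as test input.
SAMPLE = """processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen 7 1800X Eight-Core Processor
stepping        : 1
microcode       : 0x8001137
"""

for field, value in cpu_identity(SAMPLE).items():
    print(field, "=", value)
```

On a Linux machine the same function can be fed the real file with `cpu_identity(open("/proc/cpuinfo").read())`.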

JKnerdy

Joined: 3 Sep 19

Posts: 3

Credit: 8313

RAC: 0

7 Sep 2019 9:58:18 UTC

Message 173168

I recently installed the BOINC app on my Galaxy S9 and I've been running it for a few days. I've tweaked most settings to allow max run time and space, as far as I understand them. However, the first 4 files, which were showing nearly 60% complete, have now reset after showing "computation error". It took nearly 14 hours of running to reach its peak, so I'm pretty disappointed. Any help? Thx.

Keith Myers

Joined: 11 Feb 11

Posts: 5052

Credit: 19103969954

RAC: 6097891

8 Sep 2019 0:44:51 UTC

Message 173189 in response to message 171746

You have to physically look at the top of the CPU's integrated heat spreader (IHS) and read the week code printed on it. All Gen. 1 Ryzens manufactured before week 25 of 2017 had a segfault issue under heavy load, such as when compiling. AMD offered to take faulty CPUs back and ship replacements without the issue.
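
Once the year and week have been read off the chip, the comparison itself is simple. The sketch below assumes you have already turned the printed code into two numbers; how the code is laid out on the heat spreader is not covered in this thread, so no parsing is attempted. The name `made_before_cutoff` is hypothetical, and the cutoff is the week 25 of 2017 figure given above.

```python
# Cutoff (year, week) taken from the post above.
CUTOFF = (2017, 25)


def made_before_cutoff(year, week):
    """Return True if the (year, week) pair falls before the cutoff week."""
    # Tuples compare element by element, so the year is checked first
    # and the week only breaks ties within the same year.
    return (year, week) < CUTOFF


for year, week in [(2017, 9), (2017, 25), (2018, 3)]:
    print(year, week, made_before_cutoff(year, week))
```

A chip dated exactly week 25 of 2017 is treated as not affected here, because the post says "before week 25".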

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5882

Credit: 118967648596

RAC: 24076850

8 Sep 2019 4:12:23 UTC

Message 173191 in response to message 173168

JKnerdy wrote:

I recently installed the BOINC app on my Galaxy S9 and I've been running it for a few days. I've tweaked most settings to allow max run time and space, as far as I understand them. However, the first 4 files, which were showing nearly 60% complete, have now reset after showing "computation error". It took nearly 14 hours of running to reach its peak, so I'm pretty disappointed. Any help? Thx.

Hi JKnerdy,
Welcome to the Einstein@Home project!

There are lots of volunteers prepared to help but we all need information in order to do so. Your computers are 'hidden' (which is the default setting) so unless you change that, or at least give a link to your host, or specify its host ID, it's just about impossible to speculate about what might be wrong.

By posting in an existing thread, people will tend to think your problem is in some way related to that of the thread starter. Even if you think it might be related, it's always best to start a new thread if you're not completely sure. Because of your totally different hardware and the type of task you may be running (you don't specify) it would seem unlikely it is in any way related to the segfault issue being discussed in this particular thread.

There are many possible reasons why tasks might fail part way through the calculations. If you could please start your own thread and give more information (allow your host details to be viewed by others or give a host ID which can be used to find your host and the needed information) I'm sure someone will be able to give you some advice as to what is causing the problem.

You can do some of this yourself. If you go to your account page and click the link to the computer in question, you will be able to see all the current tasks assigned to it, including the failed ones. If you click the Task ID link for one of the failed tasks, you will be able to see the details about what was logged during the calculations as they progressed. You should be able to see the error message that was created when the task failed. This should give a good indication of what caused the problem and if you can't interpret it, there's a good chance that one of the regulars will be able to.

Cheers,
Gary.
