Many, many errors while computing

Pushkin
Joined: 12 Mar 07
Posts: 15
Credit: 33187685
RAC: 0
Topic 219048

Hi all,
In the last few weeks I have been getting many errors while computing on my PC. It is a Ryzen 7 1800X running Debian Stretch; a list of my tasks is here. All of them are related to Continuous Gravitational Wave search O2 All-Sky v1.01 x86_64-pc-linux-gnu.

The PC is not overclocked and shows no problems while running other applications...

Do you have any recommendations for what to do about this problem?

Thank you,
pushkin

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 117974248311
RAC: 22108598

I picked one of the failed tasks at random and looked at what was returned to the project through the stderr.txt output.  The following information was given right near the end.
2019-06-12 11:13:21.51264 -- signal handler called: signal 11
I presume (but didn't check) that others are the same sort of error.

A signal 11 - a segmentation violation - means that the process tried to access an 'out of bounds' memory location - a location that wasn't part of its assigned address space.  Sometimes this might be caused by a bug in the program and if this was a new program under test that might be the most likely reason.  However this program has been around for a while and if a bug was causing all your tasks to fail, it should be the same for everybody else.
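For anyone who wants to check many tasks at once, here is a minimal sketch of how the signal line quoted above could be picked out of saved stderr text and translated into a signal name. The function name `find_signals` and the regular expression are illustrative assumptions based only on the single log line shown in this thread; the real stderr output may contain other formats.

```python
import re
import signal

# Pattern modelled on the one log line quoted in this thread (an assumption).
SIGNAL_LINE = re.compile(r"signal handler called: signal (\d+)")


def find_signals(stderr_text):
    """Return the names of all signals reported in a task's stderr text."""
    names = []
    for match in SIGNAL_LINE.finditer(stderr_text):
        number = int(match.group(1))
        try:
            # Map the number to its symbolic name on this platform.
            names.append(signal.Signals(number).name)
        except ValueError:
            names.append(f"unknown signal {number}")
    return names


sample = "2019-06-12 11:13:21.51264 -- signal handler called: signal 11"
print(find_signals(sample))  # on Linux: ['SIGSEGV']
```

If every failed task reports SIGSEGV, that supports treating them as the same kind of failure rather than unrelated errors.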

I'm not a programmer so I don't know the ins and outs of this sort of stuff but my basic understanding is that perhaps you have a faulty memory location which is part of your program's address space.  This location contains bad data which is directing the executing program to try to access a different memory location that is not part of its own address space and the system is detecting this illegal access attempt.  I'm sure someone will jump on me if that's not possible or not a proper explanation :-).

Until someone comes along with a better explanation, you should run a memory checking program like memtest, or perhaps try different RAM sticks if you can to see if the problem goes away.

EDIT:  If your RAM checks out OK, you should scan your hard drive for bad sectors - just in case.  Perhaps something read incorrectly from disk might be at fault.

Cheers,
Gary.

Pushkin

Gary,
you're right, all of them are segfaults. I'll try HW checks.

Thank you,
pushkin

EDIT: I checked my Rosetta account and there are also a few segfaults. Something really might be wrong.

mmonnin
Joined: 29 May 16
Posts: 292
Credit: 3444636540
RAC: 2214915

Early Zen CPUs have segfault issues. Is yours an early 1800x?

Pushkin

mmonnin wrote:
Early Zen CPUs have segfault issues. Is yours an early 1800x?

It is possible; I have been using it for nearly a year. The output of cat /proc/cpuinfo is:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen 7 1800X Eight-Core Processor
stepping        : 1
microcode       : 0x8001137
cpu MHz         : 3600.000
cache size      : 512 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf ibpb arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic overflow_recov succor smca
bugs            : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 7185.99
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]
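As a rough sketch, the identifying fields in output like the above can be pulled out with a few lines of Python. The helper name `parse_first_cpu` is hypothetical, and the sample text is just the first lines of the listing above. Note that this listing has no field for a manufacturing date, so the family, model, stepping and microcode values only narrow down the CPU generation.

```python
def parse_first_cpu(cpuinfo_text):
    """Parse the first processor block of /proc/cpuinfo-style text into a dict."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:  # skip any line that has no "key : value" shape
            fields[key.strip()] = value.strip()
    return fields


sample = """processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen 7 1800X Eight-Core Processor
stepping        : 1
microcode       : 0x8001137
"""

cpu = parse_first_cpu(sample)
print(cpu["cpu family"], cpu["model"], cpu["stepping"], cpu["microcode"])
# prints: 23 1 1 0x8001137
```

On a real Linux system the text could be read with `open("/proc/cpuinfo").read()` instead of the embedded sample.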

JKnerdy
Joined: 3 Sep 19
Posts: 3
Credit: 8313
RAC: 0

I recently installed the BOINC app on my Galaxy S9 and I've been running it for a few days. I've tweaked most settings to allow max run time and space, as far as I understand them. However, the first 4 files showing nearly 60% complete have now reset after showing "computation error". It took nearly 14 hours of running to reach its peak, so I'm pretty disappointed. Any help? Thx.

Keith Myers
Joined: 11 Feb 11
Posts: 4991
Credit: 18827072095
RAC: 5672387

You have to physically look at the top of the integrated heat spreader (IHS) and read the week code printed on it.  All Gen. 1 Ryzens manufactured before week 25 of 2017 had a segfault issue under heavy load, such as when compiling.  AMD offered to take back the faulty CPUs and ship newer ones without the issue.
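Once the week code has been read off the chip, the comparison against the week 25 of 2017 cutoff can be sketched as below. This assumes the code is written as four digits in YYWW form (two-digit year, then two-digit week), which is an assumption about the marking and not something confirmed in this thread; the function names are hypothetical.

```python
import re

CUTOFF = (2017, 25)  # week 25 of 2017, the cutoff mentioned in the post above


def parse_yyww(code):
    """Split an assumed four-digit YYWW code into (year, week)."""
    match = re.fullmatch(r"(\d{2})(\d{2})", code)
    if not match:
        raise ValueError(f"expected four digits, got {code!r}")
    year = 2000 + int(match.group(1))
    week = int(match.group(2))
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in {code!r}")
    return year, week


def is_before_cutoff(code):
    """True if the chip was made before the cutoff week."""
    return parse_yyww(code) < CUTOFF


print(is_before_cutoff("1709"))  # True: week 9 of 2017 is before the cutoff
print(is_before_cutoff("1725"))  # False: week 25 itself is not before it
```

Tuple comparison handles the year boundary, so a chip from any week of 2018 or later is reported as not affected by this check.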

Gary Roberts

JKnerdy wrote:
I recently installed the BOINC app on my Galaxy S9 and I've been running it for a few days. I've tweaked most settings to allow max run time and space, as far as I understand them. However, the first 4 files showing nearly 60% complete have now reset after showing "computation error". It took nearly 14 hours of running to reach its peak, so I'm pretty disappointed. Any help? Thx.

Hi JKnerdy,
Welcome to the Einstein@Home project!

There are lots of volunteers prepared to help but we all need information in order to do so.  Your computers are 'hidden' (which is the default setting) so unless you change that, or at least give a link to your host, or specify its host ID, it's just about impossible to speculate about what might be wrong.

By posting in an existing thread, people will tend to think your problem is in some way related to that of the thread starter.  Even if you think it might be related, it's always best to start a new thread if you're not completely sure.  Because of your totally different hardware and the type of task you may be running (you don't specify) it would seem unlikely it is in any way related to the segfault issue being discussed in this particular thread.

There are many possible reasons why tasks might fail part way through the calculations.  If you could please start your own thread and give more information (allow your host details to be viewed by others or give a host ID which can be used to find your host and the needed information) I'm sure someone will be able to give you some advice as to what is causing the problem.

You can do some of this yourself.  If you go to your account page and click the link to the computer in question, you will be able to see all the current tasks assigned to it, including the failed ones.  If you click the Task ID link for one of the failed tasks, you will be able to see the details about what was logged during the calculations as they progressed.  You should be able to see the error message that was created when the task failed.  This should give a good indication of what caused the problem and if you can't interpret it, there's a good chance that one of the regulars will be able to.

Cheers,
Gary.
