Hi all,
In the last few weeks I have been getting many errors while computing on my PC. It is a Ryzen 7 1800X running Debian Stretch; a list of my tasks is here. All of the errors are related to Continuous Gravitational Wave search O2 All-Sky v1.01 x86_64-pc-linux-gnu.
The PC is not overclocked and shows no problems while running other applications.
Do you have any recommendations for dealing with this problem?
Thank you,
pushkin
Copyright © 2024 Einstein@Home. All rights reserved.
I picked one of the failed tasks at random and looked at what was returned to the project through the stderr.txt output. The following information was given right near the end.
2019-06-12 11:13:21.51264 -- signal handler called: signal 11
I presume (but didn't check) that others are the same sort of error.
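As a rough illustration of that check, the sketch below scans stderr text for lines of the form quoted above and maps each signal number to its name. The function name `find_signals` and the regular expression are assumptions made for illustration; they are not part of BOINC or the Einstein@Home application.

```python
import re
import signal

# Pattern for lines like the one quoted above (assumed format).
SIGNAL_LINE = re.compile(r"signal handler called: signal (\d+)")


def find_signals(stderr_text):
    """Return the signal numbers reported in the stderr text, in order."""
    return [int(m.group(1)) for m in SIGNAL_LINE.finditer(stderr_text)]


sample = "2019-06-12 11:13:21.51264 -- signal handler called: signal 11\n"
for number in find_signals(sample):
    # signal.Signals maps a number to its symbolic name on this platform.
    print(number, signal.Signals(number).name)
```

Pasting the stderr.txt output of several failed tasks into `sample` would show quickly whether they all report the same signal.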
A signal 11 - a segmentation violation - means that the process tried to access an 'out of bounds' memory location - a location that wasn't part of its assigned address space. Sometimes this might be caused by a bug in the program, and if this were a new program under test, that might be the most likely reason. However, this program has been around for a while, and if a bug were causing all your tasks to fail, it should be doing the same for everybody else.
I'm not a programmer so I don't know the ins and outs of this sort of stuff but my basic understanding is that perhaps you have a faulty memory location which is part of your program's address space. This location contains bad data which is directing the executing program to try to access a different memory location that is not part of its own address space and the system is detecting this illegal access attempt. I'm sure someone will jump on me if that's not possible or not a proper explanation :-).
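For anyone curious what such an illegal access looks like from the outside, here is a minimal sketch, assuming a Linux (POSIX) system with Python 3. It deliberately makes a child process read from address 0, which is never part of its address space, and then inspects how the child ended. It only shows how the system reports the fault; it does not reproduce whatever is happening inside the science application.

```python
import signal
import subprocess
import sys

# Child program: ask ctypes to read a C string at address 0 (a NULL pointer).
CHILD_CODE = "import ctypes; ctypes.string_at(0)"


def run_faulting_child():
    """Run the deliberately faulting child and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


returncode = run_faulting_child()
# On POSIX, a negative return code -N means the child was killed by signal N.
if returncode < 0:
    print("child killed by", signal.Signals(-returncode).name)
else:
    print("child exited with status", returncode)
```

The parent process survives because the kernel only terminates the process that made the bad access, which is why a failing task does not take the rest of the machine down with it.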
Until someone comes along with a better explanation, you should run a memory checking program like memtest, or perhaps try different RAM sticks if you can to see if the problem goes away.
EDIT: If your RAM checks out OK, you should scan your hard drive for bad sectors - just in case. Perhaps something read incorrectly from disk might be at fault.
Cheers,
Gary.
Gary,
you're right, all of them are segfaults. I'll try hardware checks.
Thank you,
pushkin
EDIT: I checked my Rosetta account and there are also a few segfaults. Something really might be wrong.
Early Zen CPUs have segfault issues. Is yours an early 1800x?
mmonnin wrote: Early Zen CPUs have segfault issues. Is yours an early 1800x?
It is possible; I have been using it for nearly a year. My cat /proc/cpuinfo output says:
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1800X Eight-Core Processor
stepping : 1
microcode : 0x8001137
cpu MHz : 3600.000
cache size : 512 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf ibpb arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic overflow_recov succor smca
bugs : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 7185.99
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro
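If it helps anyone comparing hosts, the identifying fields can be pulled out of that output with a few lines of Python. This is only a sketch: `parse_cpuinfo` is a made-up helper name, and the sample text is a shortened copy of the output above.

```python
def parse_cpuinfo(text):
    """Parse 'key : value' lines of the first processor block into a dict."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor block
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info


sample = """\
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1800X Eight-Core Processor
stepping : 1
microcode : 0x8001137
"""

cpu = parse_cpuinfo(sample)
print(cpu["model name"])
print("family", cpu["cpu family"], "model", cpu["model"], "stepping", cpu["stepping"])
```

On a live Linux system the same function can be fed the contents of /proc/cpuinfo read with `open("/proc/cpuinfo").read()`.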
I recently installed the BOINC app on my Galaxy S9 and I've been running it for a few days. I've tweaked most settings to allow maximum run time and space, as far as I understand them. However, the first 4 files, which were showing nearly 60% complete, have now reset after showing "computation error". It took nearly 14 hours of running to reach that point, so I'm pretty disappointed. Any help? Thx.
You have to physically look at the top of the integrated heat spreader (IHS) and read the week code printed on it. All first-generation Ryzens manufactured before week 25 of 2017 had a segfault issue under heavy loads such as compiling. AMD offered to take the faulty CPU back and ship a newer CPU without the issue.
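Once the code has been read off the chip, the comparison against that cutoff can be sketched as below. This assumes the marking contains a four-digit year-and-week group (YYWW, for example "1709" for week 9 of 2017); the sample markings and the helper names are made up for illustration, so check what is actually printed on your own CPU.

```python
import re

CUTOFF = (17, 25)  # (year, week) from the post: week 25 of 2017


def parse_date_code(marking):
    """Extract (year, week) from a standalone four-digit YYWW group, or None."""
    match = re.search(r"(?<!\d)(\d{2})(\d{2})(?!\d)", marking)
    if match is None:
        return None
    year, week = int(match.group(1)), int(match.group(2))
    if not 1 <= week <= 53:
        return None
    return year, week


def made_before_cutoff(marking, cutoff=CUTOFF):
    """Return True if the marking's date code is earlier than the cutoff."""
    code = parse_date_code(marking)
    if code is None:
        raise ValueError("no YYWW date code found in %r" % marking)
    return code < cutoff


# Hypothetical markings, not taken from any real chip.
for marking in ("UA 1709PGT", "UA 1730SUS"):
    print(marking, "-> before week 25 of 2017:", made_before_cutoff(marking))
```

A result of True only says the chip falls in the date range mentioned above, not that it is definitely faulty.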
Hi JKnerdy,
Welcome to the Einstein@Home project!
There are lots of volunteers prepared to help but we all need information in order to do so. Your computers are 'hidden' (which is the default setting) so unless you change that, or at least give a link to your host, or specify its host ID, it's just about impossible to speculate about what might be wrong.
By posting in an existing thread, people will tend to think your problem is in some way related to that of the thread starter. Even if you think it might be related, it's always best to start a new thread if you're not completely sure. Because of your totally different hardware and the type of task you may be running (you don't specify) it would seem unlikely it is in any way related to the segfault issue being discussed in this particular thread.
There are many possible reasons why tasks might fail part way through the calculations. If you could please start your own thread and give more information (allow your host details to be viewed by others or give a host ID which can be used to find your host and the needed information) I'm sure someone will be able to give you some advice as to what is causing the problem.
You can do some of this yourself. If you go to your account page and click the link to the computer in question, you will be able to see all the current tasks assigned to it, including the failed ones. If you click the Task ID link for one of the failed tasks, you will be able to see the details about what was logged during the calculations as they progressed. You should be able to see the error message that was created when the task failed. This should give a good indication of what caused the problem and if you can't interpret it, there's a good chance that one of the regulars will be able to.
Cheers,
Gary.