Since I'm never short of ideas, and following a suggestion on the l'AF forum, I decided to run a new test to measure the GPU temperature and the GPU power consumption while running O3 tasks on the ATI GPU of my iMac.
And it's interesting: I switched from a PrimeGrid task (I suspended the project) to one O3 GW-opencl-ati-2 task, and a few minutes later I set up an app_config to run 3 in parallel (because each one uses at most 1/3 of a CPU thread and I don't want to use more than one).
The GPU usage, as before, seems to drop to nothing.
But the temperature first falls and then rises again, to a lower level than with the PrimeGrid task, but "not that low",
and the same happens with the watts: a sudden drop, then a "somewhat lower level", but not "nothing".
I decided to let it run overnight, and maybe tomorrow too, and I'll see what happens to these tasks on my project account page.
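For anyone wanting to reproduce the "3 in parallel" setup, here is a minimal sketch of how such an app_config.xml could be generated. The app name `einstein_O3AS` is an assumption inferred from the executable name in this thread, and the helper `build_app_config` is purely illustrative; check the exact app name in your own client before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, tasks_per_gpu: int, cpu_per_task: float) -> str:
    """Return app_config.xml text that lets `tasks_per_gpu` tasks share one GPU."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name  # assumed app name, verify locally
    gpu = ET.SubElement(app, "gpu_versions")
    # gpu_usage is the fraction of a GPU each task claims: 0.33 allows 3 at once
    ET.SubElement(gpu, "gpu_usage").text = f"{1 / tasks_per_gpu:.2f}"
    # cpu_usage is the fraction of a CPU thread budgeted per task
    ET.SubElement(gpu, "cpu_usage").text = f"{cpu_per_task:.2f}"
    ET.indent(root)  # pretty-print; needs Python 3.9+
    return ET.tostring(root, encoding="unicode")


xml_text = build_app_config("einstein_O3AS", tasks_per_gpu=3, cpu_per_task=0.33)
print(xml_text)
```

The resulting file goes in the Einstein@Home project directory, and the client has to re-read its config files (or be restarted) to pick it up.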
Correct me if I'm wrong, but these 3 tasks (1, 2, 3) ran for 15 hours and ended with a success status, yet they actually crashed at some point (about 1 hour after I started them):
Crashed executable name: einstein_O3AS_1.07_x86_64-apple-darwin__GW-opencl-ati-2
Machine type Intel x86-64h Haswell (64-bit executable)
System version: Macintosh OS 14.1.2 build 23B92
Mon Jan 29 22:36:30 2024
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
atos cannot load symbols for the file einstein_O3AS_1.07_x86_64-apple-darwin__GW-opencl-ati-2 for architecture x86_64.
0 einstein_O3AS_1.07_x86_64-apple-dar 0x000000010e75532e
and they continued to "crunch" all that time. A few hours after that "event", it looks like the task restarted in debug mode or something?
Exiting...
putenv 'LAL_DEBUG_LEVEL=3'
2024-01-30 00:20:10.5628 (4701) [normal]: This program is published under the GNU General Public License, version 2
2024-01-30 00:20:10.5636 (4701) [normal]: For details see http://einstein.phys.uwm.edu/license.php
2024-01-30 00:20:10.5636 (4701) [normal]: This Einstein@home App was built at: Nov 9 2023 13:28:22
2024-01-30 00:20:10.5637 (4701) [normal]: Start of BOINC application 'einstein_O3AS_1.07_x86_64-apple-darwin__GW-opencl-ati-2'.
and then at 6 am it finishes:
2024-01-30 06:22:47.7088 (4701) [normal]: Finished main analysis.
2024-01-30 06:22:47.7089 (4701) [normal]: Recalculating statistics for the final toplist(s)...
2024-01-30 06:28:16.0005 (4701) [normal]: Finished recalculating toplist statistics.
2024-01-30 06:28:16.0016 (4701) [normal]: Finished in 0.00 s with peak RAM usage: 1564.0 MB on CPU 'Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz', peak VRAM usage: 1665.3 MB on GPU Device: 'AMD Radeon Pro 5700 XT Compute Engine ( Platform: Apple )' with backend: 'OpenCL'.
2024-01-30 06:28:16.0018 (4701) [debug]: Writing output ... Closing temp output file '../../projects/einstein.phys.uwm.edu/h1_1205.80_O3aC01Cl1In0__O3ASHF1b_1206.00Hz_22274_1_0.tmp' ... renaming temp output file '../../projects/einstein.phys.uwm.edu/h1_1205.80_O3aC01Cl1In0__O3ASHF1b_1206.00Hz_22274_1_0.tmp' to '../../projects/einstein.phys.uwm.edu/h1_1205.80_O3aC01Cl1In0__O3ASHF1b_1206.00Hz_22274_1_0' ... done.
2024-01-30 06:28:16.3887 (4701) [normal]: Restarted from checkpoint 455
but not really, because several hours later it finishes the same "main analysis" again, this time "for real":
2024-01-30 13:50:53.0023 (4701) [normal]: Finished main analysis.
2024-01-30 13:50:53.0038 (4701) [normal]: Recalculating statistics for the final toplist(s)...
2024-01-30 13:57:02.0763 (4701) [normal]: Finished recalculating toplist statistics.
2024-01-30 13:57:02.0772 (4701) [normal]: Finished in 26925.55 s with peak RAM usage: 1564.0 MB on CPU 'Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz', peak VRAM usage: 1665.4 MB on GPU Device: 'AMD Radeon Pro 5700 XT Compute Engine ( Platform: Apple )' with backend: 'OpenCL'.
All 3 tasks show exactly the same weird behaviour... are they really doing anything useful?
I have 3 more tasks that have been running for 6 hours now; I'll let them finish too, to see if they look similar.
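To make sense of the two "Finished in ..." lines, it helps to compute the wall-clock time between log timestamps. The sketch below uses only timestamps quoted above; the helper name `elapsed_seconds` is hypothetical, and the reading that the reported time is counted from the checkpoint restart is an inference from these numbers, not something the logs state.

```python
from datetime import datetime

# Timestamp layout used by the Einstein@Home stderr lines quoted above
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Wall-clock seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


restart = "2024-01-30 06:28:16.3887"   # "Restarted from checkpoint 455"
finished = "2024-01-30 13:57:02.0772"  # second "Finished in ..." line
wall = elapsed_seconds(restart, finished)
print(f"{wall:.2f} s of wall-clock time")
```

This comes out at roughly 26925.7 s, within a fraction of a second of the 26925.55 s the application reports at the end.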
Quote: "...they end in success status..."
Your All-Sky tasks are not ending successfully, you are getting "0" credit for completion.
Of the 9 projects you have selected, you are receiving credits from at least 3, none of which is Einstein. Your project preferences should be set up according to priority. Also, take into consideration how much GPU utilization each project's tasks need. Though your AMD Radeon Pro 5700 XT has 16GB of VRAM, you could be stretching your GPU's limits past the point where it can also complete Einstein's All-Sky tasks successfully and with credits.
Suggestion: Suspend all projects but Einstein in your BOINC Manager and then see what happens to your All-Sky tasks. If they do complete with credits, and in much less time, then you have some work to do.
If they still do not complete successfully, then you may have a problem running them on your GPU.
George, you should look more carefully. The status of each task says "waiting for validation" and not "invalid".
The tasks were successfully completed, albeit at an excruciatingly slow pace, but a second result is needed before validation can proceed.
I have much older, legacy AMD GPUs (RX 570 4GB) completing these tasks in 25 - 30 mins per task running at x2 (2 tasks finish in just over 50 mins on average). My best guess is that there is something wrong with his OpenCL installation. I know nothing about Apple hardware, but hasn't Apple ditched OpenCL in favor of something called Metal?
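To put the two setups side by side, here is a rough back-of-the-envelope comparison using only figures quoted in this thread. "Just over 50 mins" is rounded to 50, the 15 hours comes from the earlier post, and the function name is hypothetical; treat the result as an order-of-magnitude estimate.

```python
def minutes_per_task(batch_minutes: float, tasks_in_batch: int) -> float:
    """Effective wall-clock minutes per task when a batch runs concurrently."""
    return batch_minutes / tasks_in_batch


rx570 = minutes_per_task(50, 2)      # RX 570 at x2: about 25 min per task
imac = minutes_per_task(15 * 60, 3)  # iMac at x3: 15 hours for 3 tasks
ratio = imac / rx570

print(f"RX 570: {rx570:.0f} min/task, iMac: {imac:.0f} min/task, ratio: {ratio:.0f}x")
```

On these numbers the iMac takes about 12 times longer per task than the older card.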
Exiting...
putenv 'LAL_DEBUG_LEVEL=3'
2024-01-30 00:20:10.5628 (4701) [normal]: This program is published ...
I don't know anything about what might have been going on to cause the crash but somehow things seem to have been reset because the "putenv" line is exactly what you see for a normal task at the very beginning.
_AF>Le_Pommier_ Jerome_C2005 wrote:
...
and then at 6am it finishes
2024-01-30 06:22:47.7088 (4701) [normal]: Finished main analysis.
Because the "upgrade" to the O3AS app split the full task into two separate 'half-tasks' you should always see two of these messages, one for each 'half-task' as it completes. Unfortunately, this is normal and doesn't give any clue as to why the processing is so abysmally slow.
Does your OpenCL installation come with the clinfo utility? Can you run that to see what it says?
You should really stop trying to run at x3 until you can see single tasks completing successfully in a reasonable time. Probably a very small part of your current problem stems from the fact that you stated earlier that there would be no problem running 3 tasks using the support of just one CPU thread. Since each task has substantial portions of 'CPU only' running (the end part of each half-task), things will slow down quite a lot if there is only 1 CPU available for the job each time those sections are encountered.
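As a quick check of the "two half-tasks" point, one could count the completion messages in a task's stderr output. This is only a sketch: the function name is hypothetical, and the sample text is made of lines quoted earlier in this thread.

```python
def count_half_task_completions(stderr_text: str) -> int:
    """Count 'Finished main analysis.' messages, expected once per half-task."""
    return sum(
        1
        for line in stderr_text.splitlines()
        if line.rstrip().endswith("Finished main analysis.")
    )


log = """\
2024-01-30 06:22:47.7088 (4701) [normal]: Finished main analysis.
2024-01-30 06:28:16.3887 (4701) [normal]: Restarted from checkpoint 455
2024-01-30 13:50:53.0023 (4701) [normal]: Finished main analysis.
"""

n = count_half_task_completions(log)
print(f"{n} half-task completion message(s) found")
```

A count of 2 is what a normally completed O3AS task should show, so on its own it says nothing about why the run was slow.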
Thanks for your comments! I'm actually just diving into the PrimeGrid February Tour de Primes, so I'm afraid new tests will have to wait... I'll just let the last 3 O3 tasks finish, as they are almost complete; after such an abysmally and excruciatingly long processing time it would be a shame to cancel them, right? :D
But I will try "one single task with nothing else running" to see how it goes... in the future.
My personal conclusion is that it cannot be doing anything useful from a BOINC/project/science perspective, so what is the point?
Or could it actually still "do something useful"?
Crashed executable name: einstein_O3AS_1.07_x86_64-apple-darwin__GW-opencl-ati-2
Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x0100002f rbx: 0x7ff7b1a648b0 rcx: 0x7ff7b1a64828 rdx: 0x2800001513
rdi: 0x7ff7b1a648b0 rsi: 0x200000003 rbp: 0x7ff7b1a64890 rsp: 0x7ff7b1a64828
r8: 0xe1300000000 r9: 0x80300000000 r10: 0x80300000103 r11: 0x00000206
r12: 0x00000000 r13: 0x200000003 r14: 0x00001470 r15: 0x80300000000
rip: 0x7ff816b96a6e rfl: 0x00000206