No jobs for newest AMD EPYCs as they are diagnosed to lack sse and sse2?

Armin Burkhardt speaking for MPI/FKF
Joined: 21 Feb 05
Posts: 17
Credit: 16306449082
RAC: 19761482
Topic 230140

Hey there!

This week I received a batch of new compute nodes with these CPUs:

AuthenticAMD AMD EPYC 9554 64-Core Processor [Family 25 Model 17 Stepping 1]

and they don't get work, whereas last year's batch with

AuthenticAMD AMD EPYC 7763 64-Core Processor [Family 25 Model 1 Stepping 1]

does.

It seems they are diagnosed as lacking sse and sse2 support (see the log below).

I modified my profiles quite a bit, but still ... no jobs. Do you have any ideas, maybe some BIOS settings?

Machines are LENOVO SR645v3 if that helps.

Thanks for your time and your help, feel free to look into my account and settings.

Armin

2023-09-29 11:59:39.9995 [PID=3352912]    [version] Checking plan class 'GW-SSE2'
2023-09-29 11:59:40.0012 [PID=3352912]    [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2023-09-29 11:59:40.0012 [PID=3352912]    [version] CPU [' family 25 model 17 stepping 1 '] lacks feature ' sse2 '
2023-09-29 11:59:40.0013 [PID=3352912]    [version] no app version available: APP#59 (einstein_O3MD1) PLATFORM#7 (x86_64-pc-linux-gnu) min_version 0
2023-09-29 11:59:40.0015 [PID=3352912]    [mixed] sending non-locality work second
2023-09-29 11:59:40.0198 [PID=3352912]    [send] [HOST#13158202] will accept beta work.  Scanning for beta work.
2023-09-29 11:59:40.0238 [PID=3352912]    [version] Checking plan class 'opencl-intel_gpu'
2023-09-29 11:59:40.0238 [PID=3352912]    [version] parsed project prefs setting 'gpu_util_brp': 0.500000
2023-09-29 11:59:40.0238 [PID=3352912]    [version] No Intel GPU devices found
2023-09-29 11:59:40.0238 [PID=3352912]    [version] no app version available: APP#19 (einsteinbinary_BRP4) PLATFORM#7 (x86_64-pc-linux-gnu) min_version 0
2023-09-29 11:59:40.0238 [PID=3352912]    [version] Checking plan class 'FGRPSSE'
2023-09-29 11:59:40.0238 [PID=3352912]    [version] CPU [' family 25 model 17 stepping 1 '] lacks feature ' sse '
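
The log above says the CPU lacks sse and sse2. To confirm that the hardware itself reports them, independent of what the scheduler sees, one can read the CPU flags on the node. Below is a minimal Python sketch, assuming a Linux host where `/proc/cpuinfo` contains a `flags` line; the function names `parse_cpu_flags` and `report` are made up for this illustration and are not part of BOINC.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flags of the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found


def report(cpuinfo_text: str, required=("sse", "sse2")) -> dict[str, bool]:
    """Map each required feature name to whether the CPU lists it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in required}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only (assumption)
    if path.exists():
        print(report(path.read_text()))
```

If this reports both features as present on the node, the problem is in how the feature list reaches the scheduler rather than in a BIOS setting.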

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4045
Credit: 48093476663
RAC: 33544490

It's because the list of extensions is too long and gets truncated, making it look like the CPU lacks some required features. This has been reported as a problem with all of the Zen 4 systems.
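
To illustrate the effect, here is a minimal Python sketch of how a feature check can fail once a long feature string is cut to a fixed length. It is not BOINC's actual code: the flag names, their order, and the `LIMIT` value are all hypothetical and chosen only to make the truncation visible.

```python
def truncate(feature_string: str, limit: int) -> str:
    """Mimic copying into a fixed-size buffer: keep the first `limit` characters."""
    return feature_string[:limit]


def has_feature(feature_string: str, feature: str) -> bool:
    """Whole-word match of one feature name in a space-separated list."""
    return feature in feature_string.split()


# Hypothetical feature lists; names and order are illustrative, not real cpuid output.
short_list = "fpu mmx sse sse2"
long_list = "fpu mmx " + " ".join(f"avx512_ext{i}" for i in range(40)) + " sse sse2"

LIMIT = 64  # hypothetical buffer size

for label, features in [("short", short_list), ("long", long_list)]:
    seen = truncate(features, LIMIT)
    missing = [f for f in ("sse", "sse2") if not has_feature(seen, f)]
    print(label, "missing:", missing)
```

The short list fits and passes the check. The long list still contains sse and sse2, but they fall past the cut, so the check reports them as missing.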

you can get around this by copying the application binary file from one of your working systems and then forming an app_info.xml file to tell the system to use that file.
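
As a starting point, here is a hedged Python sketch that writes a bare-bones app_info.xml skeleton. The app name `einstein_O3MD1` is taken from the scheduler log in this thread; the binary file name and the version number are placeholders and must be replaced with the real values from a working host (for example from its client_state.xml). Real Einstein@Home apps may need more entries than shown here, so treat this as a sketch, not a finished file.

```python
import xml.etree.ElementTree as ET


def build_app_info(app_name: str, binary_name: str, version_num: int) -> str:
    """Build a minimal anonymous-platform app_info.xml as a string."""
    root = ET.Element("app_info")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name

    # Declare the copied binary as an executable file.
    file_info = ET.SubElement(root, "file_info")
    ET.SubElement(file_info, "name").text = binary_name
    ET.SubElement(file_info, "executable")

    # Tie the app to that binary and mark it as the main program.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "version_num").text = str(version_num)
    file_ref = ET.SubElement(version, "file_ref")
    ET.SubElement(file_ref, "file_name").text = binary_name
    ET.SubElement(file_ref, "main_program")

    ET.indent(root)  # pretty-print; needs Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Placeholder file name and version number (assumptions, not real values).
xml_text = build_app_info("einstein_O3MD1", "PLACEHOLDER_BINARY_NAME", 100)
print(xml_text)
```

The resulting file goes into the project directory next to the copied binary, and the client needs a restart to pick it up.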

_________________________________________________________________________

Armin Burkhardt speaking for MPI/FKF
Joined: 21 Feb 05
Posts: 17
Credit: 16306449082
RAC: 19761482

Thank you very much, I will try to work around this bug.

Keith Myers
Joined: 11 Feb 11
Posts: 5023
Credit: 18934793306
RAC: 6463466

You need to update your client.  This bug was fixed and merged back on March 4 and incorporated into the 7.22.1 tag release.

Merge fix for truncated cpuid info in Issue #5123

The cpuid info buffer was increased to 8192 characters, so it is now plenty big enough to hold the cpuid info pulled from Zen 4 processors with AVX512 capability. Previously that info was being truncated, which cut off the information about supported sse features.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4045
Credit: 48093476663
RAC: 33544490

Keith Myers wrote:

You need to update your client.  This bug was fixed and merged back on March 4 and incorporated into the 7.22.1 tag release.

Merge fix for truncated cpuid info in Issue #5123

The cpuid info buffer was increased to 8192 characters, so it is now plenty big enough to hold the cpuid info pulled from Zen 4 processors with AVX512 capability. Previously that info was being truncated, which cut off the information about supported sse features.

That was actually a two-layer fix, requiring both a client update and a server update. It wouldn't have any impact for Einstein unless they update the servers to the new code, which probably won't happen.

https://github.com/BOINC/boinc/issues/5122#issuecomment-1452884259
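
Why a client update alone is not enough can be sketched in a few lines of Python. This is only a model, not BOINC code: the feature string passes through a client-side limit and then a server-side limit, so the smaller of the two decides what the scheduler sees. The 8192 figure is the new buffer size quoted above; `OLD_LIMIT`, the flag names, and their order are hypothetical.

```python
def features_seen_by_scheduler(features: str, client_limit: int, server_limit: int) -> set[str]:
    """Pass the feature string through two fixed-size stages and return the surviving names."""
    sent = features[:client_limit]      # what the client can hold and send
    received = sent[:server_limit]      # what the server can hold and parse
    words = received.split()
    # If anything was cut, the last word may be a fragment, so drop it to be safe.
    if len(received) < len(features) and words:
        words.pop()
    return set(words)


OLD_LIMIT = 1024   # hypothetical pre-fix size
NEW_LIMIT = 8192   # size mentioned in the fix description

# Hypothetical long feature list with sse/sse2 near the end.
features = " ".join(f"flag{i}" for i in range(300)) + " sse sse2"

for label, client, server in [
    ("old client, old server", OLD_LIMIT, OLD_LIMIT),
    ("new client, old server", NEW_LIMIT, OLD_LIMIT),
    ("new client, new server", NEW_LIMIT, NEW_LIMIT),
]:
    seen = features_seen_by_scheduler(features, client, server)
    print(label, "-> sse2 visible:", "sse2" in seen)
```

In this model only the last combination lets sse2 through, which matches the point that both sides have to carry the larger buffer.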

_________________________________________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 5023
Credit: 18934793306
RAC: 6463466

Yes, you are correct.  I keep forgetting about the server side.  I just remember the issue was raised, analyzed, fixed and merged into the customer client.

So, only option is the anonymous platform that you mentioned.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118387718743
RAC: 25608464

Keith Myers wrote:
So, only option is the anonymous platform that you mentioned.

Hopefully not the only option :-).

Since there is a patch to fix this in the BOINC server code, shouldn't it be relatively easy for E@H Devs to apply a similar patch to their own code?  That would be better than having this same problem crop up for others in the future.

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 5023
Credit: 18934793306
RAC: 6463466

Maybe someone can chime in with info on this question . . . .  when was the last time the Einstein server software was updated??

My guess is it hasn't been updated in years.  Sure, apps are being updated, but not the core server software.

I don't know where on the website that information is exposed.

So what version of the BOINC server software is the project running?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4045
Credit: 48093476663
RAC: 33544490

According to the event log messages during the scheduler request, Einstein is running server version 611.

_________________________________________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 5023
Credit: 18934793306
RAC: 6463466

Ahh, I was overthinking it.  I was looking in the scheduler reply on the website for each host.

That is what I thought I remembered: they used a BOINC version 6 branch to base their own server code on.

I knew they had to use the basic building blocks of BOINC to communicate with all the various consumer clients.

It still had to have the remnants of BOINC from when it had the DCF and locality scheduling mechanisms.

So being that far back down the BOINC development chain, I doubt very seriously whether they have applied any of the updated code from BOINC releases since version 6.11.

I don't know how that would even be possible, since even client releases before, say, 7.02 aren't compatible anymore.

Armin Burkhardt speaking for MPI/FKF
Joined: 21 Feb 05
Posts: 17
Credit: 16306449082
RAC: 19761482

I upgraded the client to 7.22. As predicted above, this alone does not solve the problem. I informed my friend Bruce Allen directly; let's see what the project comes up with. I also copied the project binaries from a working Zen 3 system and am in the process of generating a working app_info.xml.  Hints on this are appreciated.

Thanks for the help and discussion so far

Armin
