Problem with info posted by Event Log

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 633415417
RAC: 1242331

Gary,

Thanks for the confidence and the upbeat attitude!  For a while there I thought perhaps I had ticked you off, but that was never my intention.

Even though you said you thought I should deal with one system at a time, I have been trying to update my other machines with lessons learned, so I sometimes confuse myself about when I did this or that.  Now everything is fairly stable and running, except for the 560's time problem.

I do remember someone stating (maybe you) that it is possible that I need to reinstate receiving tasks before the 560's get back on the right time.  This doesn't seem to equate though, since the same changes on my other machines show up right away.

It is good though that the timing seems to be getting better.

Thanks again.

Allen

 

EDIT:

Just thought I would throw this in..... it might be of help somehow.....

 

                     
1 | AMD FX(tm)-8350 Eight-Core Processor | 8 | Linux | 36,977,936 | 783,090 | 5,249,475 | 18,912,663 | 666,748 | -
2 | AMD FX(tm)-8300 Eight-Core Processor | 8 | Windows 7 Professional x64 Edition, Service Pack 1, (06.01.7601.00) | 22,926,409 | 353,430 | 1,635,480 | 8,645,868 | 274,564 | -
3 | AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G | 4 | Windows 7 Professional x64 Edition, Service Pack 1, (06.01.7601.00) | 18,418,264 | 311,850 | 2,079,000 | 8,899,506 | 295,335 | 245
4 | Intel(R) Core(tm)2 Quad CPU Q6600 @ 2.40GHz | 1(4) | Windows 7 Professional x64 Edition, Service Pack 1, (06.01.7601.00) | 20,157,250 | 304,920 | 2,120,580 | 9,073,463 | 302,855 | -
5 | AMD Athlon(tm) X4 845 Quad Core Processor | 1(4) | Windows 7 Ultimate x64 Edition, Service Pack 1, (06.01.7601.00) | 21,319,693 | 291,060 | 2,092,860 | 8,899,506 | 297,149 | 179
6 | AMD Ryzen 7 4700G with Radeon Graphics | 16 | Windows 10 Core x64 Edition, (10.00.19045.00) | 19,419,679 | 277,200 | 2,298,681 | 10,363,122 | 330,681 | 60
7 | Intel(R) Core(tm) i3-8130U CPU @ 2.20GHz | 1(4) | Windows 10 Core x64 Edition, (10.00.19045.00) | 3,733,656 | 45,045 | 311,850 | 1,285,515 | 44,134 | -
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 116888785791
RAC: 36364896

Allen wrote:
Thanks for the confidence and the upbeat attitude!  For a while there I thought perhaps I had ticked you off, but that was never my intention.

No, I'm not ticked off - everything's good.

Allen wrote:
Even though you said you thought I should deal with one system at a time ...

I don't have time to do a deep dive on all your machines.  What I was trying to say was that we will concentrate on the one machine and you can worry about the others as you see what works on this one.  I haven't looked at the others but if you had 10+10 on those as well, the problems (if any) will be of a similar nature and you can deal with those at your own pace.

Allen wrote:
I do remember someone stating (maybe you) that it is possible that I need to reinstate receiving tasks before the 560's get back on the right time.

I made the comment (go back and read it again) that if you had made the concurrent tasks changes through the GPU utilization factor on the website, then that change could ONLY be transmitted to your host by the receipt of new work.  That couldn't be the problem because you responded that you were using app_config.xml and not the website GPU utilization factor.

Allen wrote:
This doesn't seem to equate though, since the same changes on my other machines show up right away.

They always show up if you change app_config.xml AND THEN click "read config files" in BOINC Manager.  I had asked specifically at one point if you were sure that you clicked that option but you didn't give a direct reply.  However, once you showed the task properties printout, it became obvious that you must have done the clicking so I didn't pursue the matter.  You are obviously using app_config.xml on all machines, which is why they don't need new work to see a change.

Allen wrote:
It is good though that the timing seems to be getting better.

I just had a look now and the latest tasks are finishing just under 50 mins.  There is no further evidence of 'faster' running.  It's still a mystery to be solved.

The 'in progress' number is reducing further (below 470) whilst the gap between the task finishing time and the deadline is continuing to grow - approx 2 days 7 hrs now.  The 470 'in progress' tasks represent about 4 days at the current rate so in 2 days time the host might start asking for new work.  There shouldn't be any need to have any tasks suspended very soon now, if not already.

I gotta go so I haven't had time to check the above properly.  I'll do so tomorrow ...

Cheers,
Gary.

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 633415417
RAC: 1242331

Gary, just the latest on one of the tasks that finished recently.

Computer:    Alpha-8
Project    Einstein@Home
    
Name    LATeah3012L09_860.0_0_0.0_21731112_1
    
Application    Gamma-ray pulsar binary search #1 on GPUs 1.22 (FGRPopencl1K-ati)
Workunit name    LATeah3012L09_860.0_0_0.0_21731112
State    Ready to report
Received    8/30/2023 1:45:44 PM
Report deadline    9/13/2023 1:45:41 PM
Estimated app speed    101.51 GFLOPs/sec
Estimated task size    525,000 GFLOPs
Resources    0.25 CPUs + 0.5 AMD/ATI GPUs
CPU time at last checkpoint    00:00:00
CPU time    00:03:40
Elapsed time    00:50:27
Estimated time remaining    00:00:00
Fraction done    100%
Virtual memory size    0.00 MB
Working set size    0.00 MB
    

Thanks again!!

PS  Do you watch Opal Hunters?  I find it very interesting.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 116888785791
RAC: 36364896

Allen wrote:
just the latest on one of the tasks that finished recently.

There's no need to post the full properties of a task as shown in the Manager.  I can already see most of what I need from the tasks list on the website.

There was that one key bit that I didn't realise was also in the stderr output on the website and that was the CPU/GPU resources being assigned to each task.  I thought that was only in a properties listing.  I've now done a closer inspection of one of your validated tasks on the website and I've learned something new.  The resource allocation is shown in the stderr output as this example from one of your returned results shows. (The highlight colour is my mod to make it stand out more :-). )

I tend to be looking for error messages further down the output - usually around or below all the checkpoint records so I just skip over the header stuff.  Thanks for prompting me to pay more attention to the details.  You get to the stderr output by clicking on the TaskID link for a task of interest and then scrolling below the stderr heading.  It's the place to go if you're trying to find the cause of any errors that occur.

 

Stderr output



<core_client_version>7.22.2</core_client_version>
<![CDATA[
<stderr_txt>
03:27:29 (6516): [normal]: This Einstein@home App was built at: May  8 2019 13:29:27

03:27:29 (6516): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl1K-ati.exe'.
03:27:29 (6516): [debug]: 1e+016 fp, 2.9e+009 fp/s, 3620412 s, 1005h40m12s46
03:27:29 (6516): [normal]: % CPU usage: 0.250000, GPU usage: 0.500000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl1K-ati.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3012L09.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 852.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.69860773e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3012L09_0860_21102336.dat --debug 0 --device 1 -o LATeah3012L09_860.0_0_0.0_21102336_1_0.out
....
....
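
As a rough illustration of picking the resource allocation out of a stderr listing like the one above, here is a minimal Python sketch. The function name `parse_resource_line` and the regular expression are assumptions made for illustration; they are matched only against the single line format quoted in this thread, and other application versions may log it differently.

```python
import re

# Pattern for the allocation line as it appears in the stderr output quoted above.
# Assumption: the wording "% CPU usage: X, GPU usage: Y" holds for this app version.
ALLOC_RE = re.compile(r"% CPU usage: ([0-9.]+), GPU usage: ([0-9.]+)")


def parse_resource_line(stderr_text):
    """Return (cpu_usage, gpu_usage) as floats, or None if no allocation line is found."""
    match = ALLOC_RE.search(stderr_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


sample = "03:27:29 (6516): [normal]: % CPU usage: 0.250000, GPU usage: 0.500000"
print(parse_resource_line(sample))  # (0.25, 0.5)
```

The stderr text itself would have to be copied from the task page by hand; nothing here fetches it from the website.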

 

Allen wrote:
PS  Do you watch Opal Hunters?  I find it very interesting.

Australia has a lot of opal scattered around the outback but I wasn't aware there was a TV series about it  - I don't have time to watch much TV.  I had to google it to even be aware that such a series existed.

Good to see you're interested in a bit of 'DownUnda cultcha' :-).

I had a quick look at your tasks list.  There are still 412 in progress which represents around 3.6 days' worth.  Because you only have one type of task (FGRPB1G) with a fairly uniform completion time (50 mins), the time to finish them all is easy to calculate.  4 tasks (x2 on 2 GPUs) in 50 mins is 12.5 mins per task on average.  412 @ 12.5 mins each works out to 3.576 days of continuous running.
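
The arithmetic above can be sketched in a few lines of Python. The function name `days_to_clear` is hypothetical, and the estimate assumes what the post assumes: a single task type, a uniform 50 minute run time, and continuous running.

```python
def days_to_clear(tasks_in_progress, concurrent_tasks, minutes_per_batch):
    """Estimate the days of continuous running needed to finish the queued tasks."""
    # 4 tasks finishing together every 50 minutes is 12.5 minutes per task on average.
    minutes_per_task = minutes_per_batch / concurrent_tasks
    total_minutes = tasks_in_progress * minutes_per_task
    return total_minutes / (60 * 24)


# 412 tasks, 4 running at once (x2 on each of 2 GPUs), 50 minutes per batch.
print(round(days_to_clear(412, 4, 50), 3))  # 3.576
```

Re-running it with a new 'in progress' count gives a quick check of how long the remaining queue should last.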

If other machines are in trouble, you should be able to use what you've been doing with this one to help get them back running properly.  Good luck!  I now need to spend some time on my lot :-).

Cheers,
Gary.

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 633415417
RAC: 1242331

Kevin,

Were you suggesting there was something wrong with the CPU .25. GPU .5 setting?  I have played with the CPU part a bunch in the past and it seemed to not affect anything.  I can see where it might if you told it .001 CPU, but never really used it, unless I wanted to limit CPU tasks running, which of course, doesn't apply here.

Yes, I do enjoy watching the opal hunters.

 

Allen

mikey
Joined: 22 Jan 05
Posts: 12636
Credit: 1839019849
RAC: 5902

Allen wrote:

Kevin,

Were you suggesting there was something wrong with the CPU .25. GPU .5 setting?  I have played with the CPU part a bunch in the past and it seemed to not affect anything.  I can see where it might if you told it .001 CPU, but never really used it, unless I wanted to limit CPU tasks running, which of course, doesn't apply here.

Yes, I do enjoy watching the opal hunters.

 

Allen 

The cpu part is hardcoded by the Developer of the tasks and is NOT changeable by us crunchers, you can change the gpu part obviously as you have done and continue to do. You can change the gpu part either on the website OR thru an app_config file but not both.

Einstein also reserves a full cpu core for gpu tasks if you run them, NOT one for each gpu task though just one cpu core if you run gpu tasks.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 116888785791
RAC: 36364896

I haven't got a clue who Kevin is so I'll just assume it's addressed to me and answer the question.

Allen wrote:
Were you suggesting there was something wrong with the CPU .25. GPU .5 setting?

Not at all.  Can you please point to where you got that 'suggestion' from?

All I was interested in is seeing evidence of what those settings were and if that agreed with the number of tasks actually running and the times they were taking.  Those times are still slower than expected and there should be an explanation for that.  For the moment, things are rapidly improving so the reason for the slow times can wait.

Both these numbers are used for 'budgeting' purposes. The GPU number controls the number of concurrent GPU tasks.  The CPU number controls how many CPU threads will be prevented from running CPU tasks, if you happened to be running both types at once.  If you don't budget enough CPU support, a GPU task can slow down if CPU tasks are competing for those same resources.  The GPU task won't fail - it will just take longer to run since it has to fight for resources.

As you yourself have concluded, the CPU number is basically irrelevant when you aren't running CPU tasks.  However, it should be set to a suitable value, just in case you ever allowed CPU tasks to start running.  AMD GPUs running FGRPB1G don't use very many CPU cycles so 0.25 CPUs should be fine (unless things change in the future - who knows).  If you have no intention to run CPU tasks you could set it to any value you like.  With no competition from CPU tasks, the GPU just uses what it needs, whenever. The budget doesn't restrict the GPU from using more than the budgeted amount, if it needs to.
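
For anyone following along, here is a minimal Python sketch that renders an app_config.xml carrying the two budget numbers discussed above (0.25 CPUs + 0.5 GPUs per task). The element names follow the BOINC app_config.xml layout. The app name `hsgamma_FGRPB1G` is an assumption inferred from the executable name in the stderr output earlier in the thread, so check it against your own client before relying on it. The helper `render_app_config` is hypothetical.

```python
import xml.etree.ElementTree as ET

# One tag per line, since the file is meant to be read by the BOINC client.
TEMPLATE = """<app_config>
  <app>
    <name>{app_name}</name>
    <gpu_versions>
      <gpu_usage>{gpu_usage}</gpu_usage>
      <cpu_usage>{cpu_usage}</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""


def render_app_config(app_name, gpu_usage, cpu_usage):
    """Return app_config.xml text for one GPU application."""
    text = TEMPLATE.format(app_name=app_name, gpu_usage=gpu_usage, cpu_usage=cpu_usage)
    ET.fromstring(text)  # raises ParseError if the result is not well-formed XML
    return text


# Assumed app name; a gpu_usage of 0.5 budgets half a GPU per task (two tasks per GPU).
print(render_app_config("hsgamma_FGRPB1G", 0.5, 0.25))
```

The rendered text would be saved as app_config.xml in the project's folder, after which "read config files" in BOINC Manager picks up the change, as described earlier in the thread.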

Allen wrote:
I have played with the CPU part a bunch in the past and it seemed to not affect anything.  I can see where it might if you told it .001 CPU, but never really used it, unless I wanted to limit CPU tasks running, which of course, doesn't apply here.

It's not intended that you use the CPU number for anything other than reserving enough cores for GPU support.  The proper place to restrict the number of cores allowed to run CPU tasks (so leaving free cores to support activities outside BOINC) is the setting for % of cores BOINC is allowed to use.  If you set that to 50% for example, half your total threads would be reserved for non-BOINC use.  The other half would be budgeted by BOINC to support both CPU and GPU tasks according to the rules in app_config.xml.  If those budgeting rules don't tie up a full thread for GPU support, BOINC would also allow a CPU task to share that partial thread as well, so you do need to think about the 'budget' if you are running both types of tasks.

All this sort of stuff is covered in the documentation and you should read the sections on both client configuration (cc_config.xml) and project level configuration (app_config.xml) to make sure you properly understand how things are supposed to work.

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 116888785791
RAC: 36364896

mikey wrote:
The cpu part is hardcoded by the Developer of the tasks and is NOT changeable by us crunchers ...

Sorry, totally wrong!  app_config.xml (being used here) allows cpu_usage to be changed.

mikey wrote:
... NOT one for each gpu task though just one cpu core if you run gpu tasks.

Again, wrong!  Whatever the cpu_usage is, either the set value when using GPU Utilization factor, or a variable value in an app_config.xml file, the value is additive and the final number of threads to be reserved will depend on the number of concurrent GPU tasks running.
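
A small Python sketch of that additive budget, using the numbers from this thread (0.25 CPUs and 0.5 GPUs per task on 2 GPUs). Both function names are hypothetical, and the sketch only sums the per-task values; how the client turns a fractional total into whole reserved threads is not modelled here.

```python
def concurrent_gpu_tasks(gpu_usage, gpu_count):
    """Number of GPU tasks that fit when each task is budgeted gpu_usage of one GPU."""
    tasks_per_gpu = int(1 / gpu_usage + 1e-9)  # 0.5 -> 2 tasks per GPU
    return tasks_per_gpu * gpu_count


def reserved_cpu_budget(cpu_usage, running_gpu_tasks):
    """Total CPU budget claimed by GPU tasks: the per-task value added up over all of them."""
    return cpu_usage * running_gpu_tasks


tasks = concurrent_gpu_tasks(0.5, 2)
print(tasks)                             # 4
print(reserved_cpu_budget(0.25, tasks))  # 1.0
```

So with these settings the four concurrent GPU tasks together budget one full thread, rather than one thread regardless of task count.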

It's great that you want to help but incorrect statements like these aren't helpful.

Cheers,
Gary.

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 633415417
RAC: 1242331

GARY WROTE:

I haven't got a clue who Kevin is so I'll just assume it's addressed to me and answer the question.

 

Best gut buster I've had in a long time.  I don't know what I was thinking at the time.  Yes, it was you.

I was writing you on my phone.  Weird.

Thanks again!!!

Allen

Allen
Joined: 23 Jan 06
Posts: 75
Credit: 633415417
RAC: 1242331

Gary,

You're a prophet.  I've realized that all of my machines were in panic mode.  All numbers are increasing steadily.

Still wonder (like you) what is causing the oddity on the 560's, but that should pan out eventually, I hope!

Thanks,

Allen
