Problem with info posted by Event Log

Allen

Joined: 23 Jan 06

Posts: 75

Credit: 633285422

RAC: 1243024

11 Sep 2023 3:39:10 UTC

Message 216843

Gary,

Thanks for the confidence and the upbeat attitude! For a while there I thought perhaps I had ticked you off, but that was never my intention.

Even though you said you thought I should deal with one system at a time, I have been trying to update my other machines with lessons learned and so sometimes I confuse myself with when I did this or that. Now everything is fairly stable and running, except for the 560's time problem.

I do remember someone stating (maybe you) that it is possible that I need to reinstate receiving tasks before the 560's get back on the right time. This doesn't seem to equate though, since the same changes on my other machines show up right away.

It is good though that the timing seems to be getting better.

Thanks again.

Allen

EDIT:

Just thought I would throw this in..... it might be of help somehow.....


1	AMD FX(tm)-8350 Eight-Core Processor	8	Linux	36,977,936	783,090	5,249,475	18,912,663	666,748	-
2	AMD FX(tm)-8300 Eight-Core Processor	8	Windows 7 Professional x64 Edition, Service Pack 1, (06.01.7601.00)	22,926,409	353,430	1,635,480	8,645,868	274,564	-
3	AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G	4	Windows 7 Professional x64 Edition, Service Pack 1, (06.01.7601.00)	18,418,264	311,850	2,079,000	8,899,506	295,335	245
4	Intel(R) Core(tm)2 Quad CPU Q6600 @ 2.40GHz	1(4)	Windows 7 Professional x64 Edition, Service Pack 1, (06.01.7601.00)	20,157,250	304,920	2,120,580	9,073,463	302,855	-
5	AMD Athlon(tm) X4 845 Quad Core Processor	1(4)	Windows 7 Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)	21,319,693	291,060	2,092,860	8,899,506	297,149	179
6	AMD Ryzen 7 4700G with Radeon Graphics	16	Windows 10 Core x64 Edition, (10.00.19045.00)	19,419,679	277,200	2,298,681	10,363,122	330,681	60
7	Intel(R) Core(tm) i3-8130U CPU @ 2.20GHz	1(4)	Windows 10 Core x64 Edition, (10.00.19045.00)	3,733,656	45,045	311,850	1,285,515	44,134	-

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5870

Credit: 116884615840

RAC: 36349582

11 Sep 2023 12:02:02 UTC

Message 216852 in response to message 216843

Allen wrote:

Thanks for the confidence and the upbeat attitude! For a while there I thought perhaps I had ticked you off, but that was never my intention.

No, I'm not ticked off - everything's good.

Allen wrote:

Even though you said you thought I should deal with one system at a time ...

I don't have time to do a deep dive on all your machines. What I was trying to say was that we will concentrate on the one machine and you can worry about the others as you see what works on this one. I haven't looked at the others, but if you had 10+10 on those as well, the problems (if any) will be of a similar nature and you can deal with those at your own pace.

Allen wrote:

I do remember someone stating (maybe you) that it is possible that I need to reinstate receiving tasks before the 560's get back on the right time.

I made the comment (go back and read it again) that if you had made the concurrent tasks changes through the GPU utilization factor on the website, then that change could ONLY be transmitted to your host by the receipt of new work. That couldn't be the problem because you responded that you were using app_config.xml and not the website GPU utilization factor.

Allen wrote:

This doesn't seem to equate though, since the same changes on my other machines show up right away.

They always show up if you change app_config.xml AND THEN click "Read config files" in BOINC Manager. I had asked specifically at one point if you were sure that you had clicked that option, but you didn't give a direct reply. However, once you showed the task properties printout, it became obvious that you must have done the clicking, so I didn't pursue the matter. You are obviously using app_config.xml on all machines, which is why they don't need new work to see a change.

Allen wrote:

It is good though that the timing seems to be getting better.

I just had a look now and the latest tasks are finishing just under 50 mins. There is no further evidence of 'faster' running. It's still a mystery to be solved.

The 'in progress' number is reducing further (below 470) whilst the gap between the task finishing time and the deadline is continuing to grow - approx 2 days 7 hrs now. The 470 'in progress' tasks represent about 4 days at the current rate, so in 2 days' time the host might start asking for new work. There shouldn't be any need to have any tasks suspended very soon now, if not already.

I gotta go so I haven't had time to check the above properly. I'll do so tomorrow ...

Cheers,
Gary.

Allen

Joined: 23 Jan 06

Posts: 75

Credit: 633285422

RAC: 1243024

11 Sep 2023 16:35:12 UTC

Message 216859

Gary, just the latest on one of the tasks finished recently.

Computer:   Alpha-8
Project   Einstein@Home

Name   LATeah3012L09_860.0_0_0.0_21731112_1

Application   Gamma-ray pulsar binary search #1 on GPUs 1.22 (FGRPopencl1K-ati)
Workunit name   LATeah3012L09_860.0_0_0.0_21731112
State   Ready to report
Received   8/30/2023 1:45:44 PM
Report deadline   9/13/2023 1:45:41 PM
Estimated app speed   101.51 GFLOPs/sec
Estimated task size   525,000 GFLOPs
Resources   0.25 CPUs + 0.5 AMD/ATI GPUs
CPU time at last checkpoint   00:00:00
CPU time   00:03:40
Elapsed time   00:50:27
Estimated time remaining   00:00:00
Fraction done   100%
Virtual memory size   0.00 MB
Working set size   0.00 MB

Thanks again!!

PS Do you watch Opal Hunters? I find it very interesting.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5870

Credit: 116884615840

RAC: 36349582

11 Sep 2023 23:20:55 UTC

Message 216868 in response to message 216859

Allen wrote:

just the latest on one of the tasks finished recently.

There's no need to post the full properties of a task as shown in the Manager. I can already see most of what I need from the tasks list on the website.

There was that one key bit that I didn't realise was also in the stderr output on the website and that was the CPU/GPU resources being assigned to each task. I thought that was only in a properties listing. I've now done a closer inspection of one of your validated tasks on the website and I've learned something new. The resource allocation is shown in the stderr output as this example from one of your returned results shows. (The highlight colour is my mod to make it stand out more :-). )

I tend to be looking for error messages further down the output - usually around or below all the checkpoint records so I just skip over the header stuff. Thanks for prompting me to pay more attention to the details. You get to the stderr output by clicking on the TaskID link for a task of interest and then scrolling below the stderr heading. It's the place to go if you're trying to find the cause of any errors that occur.

Stderr output

<core_client_version>7.22.2</core_client_version>
<![CDATA[
<stderr_txt>
03:27:29 (6516): [normal]: This Einstein@home App was built at: May  8 2019 13:29:27

03:27:29 (6516): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl1K-ati.exe'.
03:27:29 (6516): [debug]: 1e+016 fp, 2.9e+009 fp/s, 3620412 s, 1005h40m12s46
03:27:29 (6516): [normal]: % CPU usage: 0.250000, GPU usage: 0.500000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl1K-ati.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3012L09.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 852.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.69860773e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3012L09_0860_21102336.dat --debug 0 --device 1 -o LATeah3012L09_860.0_0_0.0_21102336_1_0.out
....
....
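For anyone who wants to pull those two numbers out of a saved copy of the stderr text rather than reading them by eye, here is a minimal sketch. It assumes the resource line always has the shape shown in the example above; `parse_usage` is a hypothetical helper name, not something provided by BOINC or Einstein@Home.

```python
import re

# Line copied from the stderr output shown above.
STDERR_LINE = "03:27:29 (6516): [normal]: % CPU usage: 0.250000, GPU usage: 0.500000"

# Assumed pattern: matches only the line shape seen in this one example.
USAGE_RE = re.compile(r"% CPU usage: ([0-9.]+), GPU usage: ([0-9.]+)")


def parse_usage(line):
    """Return (cpu_usage, gpu_usage) as floats, or None if the line does not match."""
    match = USAGE_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


cpu_usage, gpu_usage = parse_usage(STDERR_LINE)
print(cpu_usage, gpu_usage)  # 0.25 0.5
```

Running it over every line of a stderr dump and keeping the non-None results would give the resource allocation for each task checked.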

Allen wrote:

PS Do you watch Opal Hunters? I find it very interesting.

Australia has a lot of opal scattered around the outback but I wasn't aware there was a TV series about it - I don't have time to watch much TV. I had to google it to even be aware that such a series existed.

Good to see you're interested in a bit of 'DownUnda cultcha' :-).

I had a quick look at your tasks list. There are still 412 in progress, which represents around 3.6 days' worth. Because you only have one type of task (FGRPB1G) with a fairly uniform completion time (50 mins), the time to finish them all is easy to calculate. 4 tasks (2 on each of 2 GPUs) in 50 mins is 12.5 mins per task on average. 412 tasks @ 12.5 mins each works out to 5,150 mins, or 3.576 days of continuous running.
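The same arithmetic as a small sketch, in case it is useful for the other machines. It assumes every task takes the same time and that the host runs without interruption; the function name `days_to_clear` is made up for illustration.

```python
def days_to_clear(tasks_in_progress, concurrent_tasks, minutes_per_batch):
    """Estimate days of continuous running needed to finish the queued tasks.

    Assumes uniform run times: `concurrent_tasks` tasks finish together
    every `minutes_per_batch` minutes.
    """
    minutes_per_task = minutes_per_batch / concurrent_tasks  # 50 / 4 = 12.5
    total_minutes = tasks_in_progress * minutes_per_task
    return total_minutes / (60 * 24)


# Figures from the post: 412 tasks, 4 running at once, about 50 minutes each.
print(round(days_to_clear(412, 4, 50), 3))  # 3.576
```

Plugging in the earlier count of 470 tasks gives a little over 4 days, which matches the rough figure quoted the day before.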

If other machines are in trouble, you should be able to use what you've been doing with this one to help get them back running properly. Good luck! I now need to spend some time on my lot :-).

Cheers,
Gary.

Allen

Joined: 23 Jan 06

Posts: 75

Credit: 633285422

RAC: 1243024

12 Sep 2023 0:16:46 UTC

Message 216871

Kevin,

Were you suggesting there was something wrong with the CPU .25, GPU .5 setting? I have played with the CPU part a bunch in the past and it seemed to not affect anything. I can see where it might if you told it .001 CPU, but never really used it, unless I wanted to limit CPU tasks running, which of course, doesn't apply here.

Yes, I do enjoy watching the opal hunters.

Allen

mikey

Joined: 22 Jan 05

Posts: 12636

Credit: 1839019411

RAC: 5929

12 Sep 2023 1:04:57 UTC

Message 216874 in response to message 216871

Allen wrote:

Kevin,

Were you suggesting there was something wrong with the CPU .25, GPU .5 setting? I have played with the CPU part a bunch in the past and it seemed to not affect anything. I can see where it might if you told it .001 CPU, but never really used it, unless I wanted to limit CPU tasks running, which of course, doesn't apply here.

Yes, I do enjoy watching the opal hunters.

Allen

The cpu part is hardcoded by the Developer of the tasks and is NOT changeable by us crunchers. You can change the gpu part, obviously, as you have done and continue to do. You can change the gpu part either on the website OR through an app_config file, but not both.

Einstein also reserves a full cpu core for gpu tasks if you run them, NOT one for each gpu task though just one cpu core if you run gpu tasks.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5870

Credit: 116884615840

RAC: 36349582

12 Sep 2023 1:32:39 UTC

Message 216876 in response to message 216871

I haven't got a clue who Kevin is so I'll just assume it's addressed to me and answer the question.

Allen wrote:

Were you suggesting there was something wrong with the CPU .25, GPU .5 setting?

Not at all. Can you please point to where you got that 'suggestion' from?

All I was interested in is seeing evidence of what those settings were and if that agreed with the number of tasks actually running and the times they were taking. Those times are still slower than expected and there should be an explanation for that. For the moment, things are rapidly improving so the reason for the slow times can wait.

Both these numbers are used for 'budgeting' purposes. The GPU number controls the number of concurrent GPU tasks. The CPU number controls how many CPU threads will be prevented from running CPU tasks, if you happened to be running both types at once. If you don't budget enough CPU support, a GPU task can slow down if CPU tasks are competing for those same resources. The GPU task won't fail - it will just take longer to run since it has to fight for resources.

As you yourself have concluded, the CPU number is basically irrelevant when you aren't running CPU tasks. However, it should be set to a suitable value, just in case you ever allowed CPU tasks to start running. AMD GPUs running FGRPB1G don't use very many CPU cycles so 0.25 CPUs should be fine (unless things change in the future - who knows). If you have no intention to run CPU tasks you could set it to any value you like. With no competition from CPU tasks, the GPU just uses what it needs, whenever. The budget doesn't restrict the GPU from using more than the budgeted amount, if it needs to.

Allen wrote:

I have played with the CPU part a bunch in the past and it seemed to not affect anything. I can see where it might if you told it .001 CPU, but never really used it, unless I wanted to limit CPU tasks running, which of course, doesn't apply here.

It's not intended that you use the CPU number for anything other than reserving enough cores for GPU support. The proper place to restrict the number of cores allowed to run CPU tasks (so leaving free cores to support activities outside BOINC) is the setting for % of cores BOINC is allowed to use. If you set that to 50% for example, half your total threads would be reserved for non-BOINC use. The other half would be budgeted by BOINC to support both CPU and GPU tasks according to the rules in app_config.xml. If those budgeting rules don't tie up a full thread for GPU support, BOINC would also allow a CPU task to share that partial thread as well, so you do need to think about the 'budget' if you are running both types of tasks.
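A rough sketch of that budgeting, for illustration only. This is a simplified model of the description above, not BOINC's actual scheduling code, and the 8-thread host in the example is a hypothetical machine. The function names are made up.

```python
import math


def concurrent_gpu_tasks(num_gpus, gpu_usage):
    """Tasks that fit on the GPUs when each task is budgeted `gpu_usage` of one GPU."""
    return num_gpus * int(1 / gpu_usage)


def cpu_task_slots(total_threads, percent_allowed, gpu_tasks, cpu_usage):
    """Simplified model: threads left for CPU tasks after GPU support is budgeted.

    Only whole threads tied up by the GPU budget are withheld; a partly
    budgeted thread is treated as still shareable with a CPU task.
    """
    usable = int(total_threads * percent_allowed / 100)
    reserved = gpu_tasks * cpu_usage  # additive across concurrent GPU tasks
    return usable - math.floor(reserved)


gpu_tasks = concurrent_gpu_tasks(num_gpus=2, gpu_usage=0.5)  # 4, as on this host
# Hypothetical 8-thread host limited to 50% of its threads:
print(cpu_task_slots(8, 50, gpu_tasks, 0.25))  # 3 (4 x 0.25 ties up one full thread)
print(cpu_task_slots(8, 50, gpu_tasks, 0.2))   # 4 (0.8 of a thread is not a full one)
```

The second call shows why the budget matters when both task types run: a total below one full thread leaves every usable thread open to CPU tasks, so the GPU tasks have to compete.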

All this sort of stuff is covered in the documentation and you should read the sections on both client configuration (cc_config.xml) and project level configuration (app_config.xml) to make sure you properly understand how things are supposed to work.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5870

Credit: 116884615840

RAC: 36349582

12 Sep 2023 2:18:24 UTC

Message 216879 in response to message 216874

mikey wrote:

The cpu part is hardcoded by the Developer of the tasks and is NOT changeable by us crunchers ...

Sorry, totally wrong! app_config.xml (being used here) allows cpu_usage to be changed.
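As an illustration of where cpu_usage lives, here is a minimal sketch that builds an app_config.xml with the values used on this host. The app name `hsgamma_FGRPB1G` is an assumption taken from the executable name in the stderr output earlier in the thread; check the name your own client reports before using it. `build_app_config` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, gpu_usage, cpu_usage):
    """Build an <app_config> tree with one app and its GPU/CPU budget values."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return root


# Assumed app name; 0.5 GPU and 0.25 CPU per task as shown in the task properties.
root = build_app_config("hsgamma_FGRPB1G", 0.5, 0.25)
if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
    ET.indent(root)
print(ET.tostring(root, encoding="unicode"))
```

The printed text would go in app_config.xml in the project's directory, followed by "Read config files" in BOINC Manager so the client picks it up.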

mikey wrote:

... NOT one for each gpu task though just one cpu core if you run gpu tasks.

Again, wrong! Whatever the cpu_usage is, either the set value when using GPU Utilization factor, or a variable value in an app_config.xml file, the value is additive and the final number of threads to be reserved will depend on the number of concurrent GPU tasks running.

It's great that you want to help but incorrect statements like these aren't helpful.

Cheers,
Gary.

Allen

Joined: 23 Jan 06

Posts: 75

Credit: 633285422

RAC: 1243024

12 Sep 2023 4:08:09 UTC

Message 216884

GARY WROTE:

I haven't got a clue who Kevin is so I'll just assume it's addressed to me and answer the question.

Best gut buster I've had in a long time. I don't know what I was thinking at the time. Yes, it was you.

I was writing you on my phone. Weird.

Thanks again!!!

Allen

Joined: 23 Jan 06

Posts: 75

Credit: 633285422

RAC: 1243024

12 Sep 2023 17:36:30 UTC

Message 216907

Gary,

You're a prophet. I've realized that all of my machines were in panic mode. All numbers are increasing steadily.

Still wonder (like you) what is causing the oddity on the 560's, but that should pan out eventually, I hope!

Thanks,

Allen
