With O3MDF now fixed (hopefully), is it possible to get O3MD1 CPU work units going again? There are none in the queue and haven't been for a while.
I would like to run some more of them please.
Conan
Ian&Steve C. wrote: how many ...
Just one ;)
Unrelated to the topic then, since it's not an app error; I thought I saw the same error code.
I haven't had any WUs for this app in a few days, but I'll keep an eye on it.
Conan wrote: With O3MDF now ...
Thanks for getting more work out for O3MD1 and the CPUs.
I have some questions though concerning these new work units.
The previous ones that I ran took somewhere from 33,000 to 43,000 seconds (I think) and gave 1,000 credits, so a bit longer than my RYZEN 9 5900X takes to run a Binary type WU (around twice as long as a Gamma WU under Linux).
These new ones have so far run 18 hours and are just 33% done, with 1 day 12 hours still to go.
They were sent out saying they would take 4 hours 54 minutes, so I ended up with 55 of them.
My questions are:
"How long do these things run?"
Are they still only getting 1,000 points?
Thanks
Conan
Einstein does not allot credit based on runtimes. Credit is allotted only in fixed amounts for each species of task.
So the 1,000 credits will not change for O3MD1/F.
Thanks Keith for the response.
I was sure the amounts were not going to change, but I could hope.
The work units now have almost 21 hours done and are 39.909% complete, with still 1 day 7 hours to go.
With 55 downloaded, I will contemplate whether I will complete all of them; we will see how it goes.
I suppose I was the one asking for them so I will have to try and finish them.
Thanks
Conan
Conan wrote: The work units ...
How many concurrent CPU tasks are you running? Your host is listed as 12 cores / 24 threads. I hope you're not trying to run tasks on all 24 threads.
If you are, you may get better performance by experimenting with a smaller number. You could try halving the number of threads in use and checking the new rate of progress. You may find it more than doubles. There will probably be some 'sweet spot' for the optimum number to use.
Cheers,
Gary.
You may finish them in time. Since they are brand new tasks that your host has never seen before, BOINC doesn't have a good estimate of how long they take to crunch.
With normal BOINC server software, the APR for a species of task needs 10 validated results before the estimate becomes accurate.
But Einstein uses custom server software in which that mechanism is not used; instead it uses a variation of the older DCF method of runtime estimation. I don't fully understand how it works, other than knowing that it is continually updated and often gets confused when the task species changes from what was running before, producing wildly inaccurate estimates of remaining runtimes. Eventually, after enough work of the same type is returned, the estimates become accurate.
You might reduce your cache levels until your DCF stabilizes; once you know how fast work is completed, you can increase your cache levels again.
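For anyone curious what a DCF-style estimator looks like in general, here is a simplified sketch. This is illustrative only, not the actual BOINC or Einstein@Home code; the asymmetric update rule and the 0.1 decay rate are assumptions for the example:

```python
# Simplified sketch of a duration-correction-factor (DCF) runtime estimator.
# Illustrative only - not the real BOINC/Einstein@Home implementation.
# Estimated runtime = base estimate * DCF; after each completed task the
# DCF is pulled toward the observed actual/base ratio.

def update_dcf(dcf, base_estimate_s, actual_s):
    ratio = actual_s / base_estimate_s
    if ratio > dcf:
        return ratio                      # raise quickly on under-estimates
    return dcf + 0.1 * (ratio - dcf)      # decay slowly on over-estimates

base = 4.9 * 3600     # the advertised 4 h 54 min estimate
actual = 30 * 3600    # a task that really takes ~30 h
dcf = 1.0
for _ in range(3):
    print(f"estimated runtime: {base * dcf / 3600:.1f} h")
    dcf = update_dcf(dcf, base, actual)
```

In this toy model a single completed long task snaps the estimate from 4.9 h up to 30 h, and because one DCF covers the whole project, a confusing new task species can also throw off the estimates for every other species on the host.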
Thanks Gary and Keith,
I run a lot of projects, so when that computer finished up with World Community Grid, PRIVATE GFN SERVER, RamanujanMachine and ClimatePrediction it was left with the 55 tasks downloaded by BOINC.
So yes it started running 24 tasks at once to get through the tasks.
As I said, the BOINC estimated run time of 4 hours 54 minutes was a mile off; even after running some of them before Christmas, BOINC did not learn the run times.
With run times now passing 1 day and not yet at 47%, I have suspended all bar 12 work units.
However, this also paused the 12 left running, as BOINC downloaded LHC and LODA work units and started work on them instead, probably due to the hours spent on Einstein and now doing some catch-up.
I will let it run and see how it goes; the work units don't run out till the 26th, so I should be good, enabling more Einstein work units as the others complete.
Conan
Our CPU tasks are designed to run of order 10-12 h (varying a lot with CPU type etc.). In the lower frequency range (mostly at the beginning of a GW search), there are often tasks for which we don't have enough data for this runtime, so these tasks may run significantly shorter.
That's, however, our design. In reality, once the search is launched, things may turn out to deviate quite a bit from that. Every search is a little bit different not only in data, but also in parameters. For the GPU search we first found that the memory requirement was much higher than what we estimated. Then there was this bug that occurred only with a combination of certain parameters (frequency and coherence time) that we possibly never had before (and it occurred in a part of the code that wasn't used before).
The CPU (part of the) search seems to suffer from even more technical issues this time. We stopped it before the holiday season as our (attention) capacity for the project was very limited during that time, and the problem on the GPU side was already becoming large enough.
We started the CPU part again a few days ago, and are still seeing problems occur (you may notice on the Server Status page that the number of "failed" tasks is about 50% of the "valid" tasks, which is alarming).
Please bear with us as we fix these issues one by one. Credit, I have to admit, is of lower priority for us right now. We'll fix that, too, but in due time.
BM
Thanks Bernd. I am continuing to process; they haven't failed yet, just running a long time. With fewer running at once, they should run shorter times from now on.
Conan
Conan wrote: ... I have ...
That's not what I suggested you might try.
I suggested that - "You could try halving the number of threads in use" - which means run 12 CPU threads, not 24. You do that by going into the manager and changing the setting for the number of 'cores' the client is allowed to use from 100% to 50%. You are trying to find the 'sweet spot' - the number of active threads which gives you the best 'rate of progress'. If 50% of the cores gives more than double the rate of progress you had previously, then you are using your resources more efficiently and getting more 'output' as a result. You then try other settings (say 75%, etc.) until you find what works best (or run out of patience while doing so) :-).
By suspending "all but 12" you created a shortfall in the work on hand. The client can't download replacements from Einstein (tasks are suspended) so of course it will fill the shortage from whatever other projects are available. Then the client can run 24 threads once again and resource share considerations will determine the project mix for the 24 running tasks. Obviously the client decided that other projects were 'owed' more share. If you want to test just a single project, temporarily suspend the others while testing that project.
You are no closer to working out whether there is an efficiency benefit from running fewer than the full 24 threads. Please be aware that different projects may be more or less 'compute intensive', using CPU resources differently and consequently having different 'sweet spots'.
It's quite likely there will be a gain in output by using somewhat less than 24 threads. It's worthwhile doing the experiments to find out.
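To show why such a sweet spot can exist at all, here is a toy model. All the numbers are invented, and the quadratic contention term is just an assumption standing in for SMT and shared-cache effects:

```python
# Toy model of thread contention (invented numbers, for illustration only).
# Each task's runtime grows as more active threads compete for shared
# resources, so total throughput can peak below the full 24 threads.

def tasks_per_day(n_threads, base_hours=10.0, contention=0.004):
    # Assumed: per-task runtime grows with the square of extra threads.
    hours_per_task = base_hours * (1 + contention * (n_threads - 1) ** 2)
    return n_threads * 24.0 / hours_per_task

for n in (6, 12, 16, 24):
    print(f"{n:2d} threads -> {tasks_per_day(n):.1f} tasks/day")
```

With these made-up parameters the daily output peaks around 16 threads and actually falls again at 24, which is exactly the kind of thing the suggested 50%/75% experiments would reveal on real hardware.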
Cheers,
Gary.