The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7220564931
RAC: 970958

Keith Myers

When I look at that task, I see the notation "checked but no consensus yet". When you look at your own account's task list, does it by any chance show as "inconclusive"? If so, it will very likely be declared invalid when a tie-breaking task is returned by an additional quorum partner.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

I've been gone from this project for quite a while. I was over at GPUGrid for a while with their Quantum Chemistry units, then the WOW at Seti. Thought I'd check in and see what was going on. It took a bit to read through all the posts in this thread to get a general idea of how the GW on GPU tasks were doing. Interesting to see Bernd's comments and some of the others' ideas on the matter. From what I've gathered, there is a failure when trying to run more than 1 work unit concurrently on a GPU, independent of the manufacturer. Bernd's statement about running only 1 at a time makes a lot of sense.

Starting last night, I returned a single machine (Linux with an i7 CPU and quad 1080 Tis) to Einstein. There is a slight overclock on the CPU, but the GPUs are at stock without any OC. Since then I've processed 141 units, and only now have 11 successful GPU and 5 successful CPU tasks. All the rest are pending, but there are no invalids.

One of the things that has jumped out at me while reading the thread is the number of people attempting to restrict the CPU usage in processing these work units. I've seen Windows users using Process Lasso to restrict work units to a single physical core. Other users are using app_configs to restrict thread usage even further, down to 0.5 per work unit, in an effort to run more work units concurrently on the same GPU.

Having spent time with GPUGrid's QC, I've gotten used to seeing near 400% CPU usage for a single work unit. What I've noticed here is 136% at the beginning of a work unit, followed by a slow decrease to about 124% at the end. CPU time is longer than run time only because of how BOINC records CPU time. At GPUGrid, if you wanted to know how long it took to crunch a work unit, you took the CPU time and divided by the CPU usage. Example: 5472 seconds at 380% CPU gives 5472 / 3.8 = 1440 seconds, which is very close to the actual run time. Here at Einstein I've been consistently seeing a run time of 1670 seconds while CPU time is 1900 seconds. 1900 / 1670 ≈ 1.14, which I'm guessing is the average CPU usage: 1.14 x 100 = 114% of a CPU core.
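For anyone who wants to repeat that arithmetic on their own tasks, here is a minimal Python sketch. The function names are made up for illustration; the only inputs are the CPU time and run time figures quoted above.

```python
def average_cpu_usage(cpu_seconds, run_seconds):
    """Average number of CPU cores kept busy: CPU time divided by run time."""
    if run_seconds <= 0:
        raise ValueError("run_seconds must be positive")
    return cpu_seconds / run_seconds


def estimated_run_seconds(cpu_seconds, cpu_usage):
    """Run time estimate: CPU time divided by CPU usage (3.8 means 380%)."""
    if cpu_usage <= 0:
        raise ValueError("cpu_usage must be positive")
    return cpu_seconds / cpu_usage


# GPUGrid example from the post: 5472 s of CPU time at 380% usage.
print(round(estimated_run_seconds(5472, 3.8)))       # 1440

# Einstein example from the post: 1900 s CPU time over a 1670 s run.
print(round(average_cpu_usage(1900, 1670), 2))       # 1.14
```

This only gives an average over the whole task, so it will not show whether usage rises or falls during the run.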

Keith has been kind enough to check this on his own machine and has verified that CPU usage starts at 99% on his machine and climbs upward throughout the processing of his work unit until it finishes with 110% usage.

Bernd would be the authority here to say if the behavior is correct for the work units.

My recommendation to those using the app_config would be to increase your value to 1.5 to make sure that there are plenty of threads free.
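As a rough illustration of what that change looks like, the Python sketch below writes out an app_config.xml fragment with a CPU usage of 1.5 per GPU task. The `app_config`, `app`, `name`, `gpu_versions`, `gpu_usage` and `cpu_usage` elements are the standard BOINC ones; the application name `YOUR_GW_GPU_APP_NAME` is a placeholder, not the real Einstein app name, so check your own client for the correct one.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def build_app_config(app_name, gpu_usage, cpu_usage):
    """Return app_config.xml text reserving cpu_usage threads per GPU task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # gpu_usage 1.0 means one task per GPU; cpu_usage is the number of
    # CPU threads budgeted for each of those tasks.
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    raw = ET.tostring(root, encoding="unicode")
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Placeholder app name (an assumption): replace it with the real one.
print(build_app_config("YOUR_GW_GPU_APP_NAME", 1.0, 1.5))
```

Keep in mind that `cpu_usage` only changes how many threads the client budgets for the task when scheduling; it does not by itself limit or raise what the application actually uses.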

On my system, I have 16 threads, with only a max concurrent setting in the app_config. I limit the machine to 6 work units: 3 on the GPUs and 3 GW on the CPU. I've set my preferences for no new CPU work units, as I'm not sure how running CPU tasks along with GPU tasks is affecting the times of the GPU tasks. Once they finish, I will see if the times change dramatically with no CPU work units. My mistake was allowing CPU tasks at the beginning, as they are taking 7.5 hours to complete.

I believe Gary is researching different CPU architectures, early vs. modern, as a cause of issues? I seem to remember that early AMD chips shared a floating-point unit, as Keith reminded me. I also have to wonder how hard the motherboard bus is getting hit as the work units are loaded onto the GPU at the beginning of a task. We know Einstein tends to be sensitive to PCIe lanes (the higher the PCIe speed, the faster the work gets done, i.e. x16 > x8 > x4). System RAM and CPU also influence the time to complete.

My i7 Haswell at 4.0 GHz with RAM at 3200 and no overclock on the GPU takes 28 minutes (for now; I plan to OC once I finish the CPU tasks).

Keith's Threadripper at 4.1 GHz with RAM at 3600 and an OC'd GPU is taking 24 minutes.

So in short, the GW on GPU tasks require more than 1 CPU thread. Bernd?

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18712565940
RAC: 6369513

I don't know how you can come to that conclusion when I am the first host to process the task. I need to wait for a wingman's result. Every first task returned gets "no consensus" status until somebody else checks in.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7220564931
RAC: 970958

Keith Myers wrote:
I don't know how you can come to that conclusion when I am the first host to process the task.

The only conclusion I gave was to quote the text shown on the web page.  How I came to it was simply by reading.  Then I asked you what status it is that you see on your task list.  I'd like to see your answer.  Then I might propose a conclusion.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Zalster wrote:
My recommendation to those using the app_config would be to increase your value to 1.5 to make sure that there are plenty of threads free.

I changed that from 1 to 2 now. I'll see if it gives an observable boost...

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18712565940
RAC: 6369513

Yes, it says inconclusive. So does inconclusive automatically mean it is invalid at Einstein? I get plenty of inconclusives at other projects and mostly get a validated task once the wingman checks in.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7220564931
RAC: 970958

Keith Myers wrote:
Yes, it says inconclusive. So does inconclusive automatically mean it is invalid at Einstein? I get plenty of inconclusives at other projects and mostly get a validated task once the wingman checks in.

I believe here the meaning is that an initial quorum has had both tasks returned, and usually that both passed sanity checks (though not necessarily) and that when the actual results were compared by the validator they were not close enough to each other to declare both valid.  In this case a new task is dispatched to a third system.  When that comes in, the three are cross-compared.  As we are in a beta test, under rules which preclude offering the third task for a quorum which had a beta-test GPU task as one of the first two, we are sure two of the three members of that next quorum will be CPU tasks.
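To make that sequence concrete, here is a toy Python model of the flow as I have described it. It is only a sketch under loud assumptions: each result is boiled down to a single made-up number, "close enough" is a made-up tolerance, and the function name is hypothetical. The real validator compares actual result files by the project's own criteria, which I am not attempting to reproduce.

```python
from itertools import combinations


def validate(results, tolerance):
    """Toy quorum check. results maps a task label to one summary number."""
    labels = list(results)
    if len(labels) < 2:
        # No quorum of returned work yet.
        return {label: "Completed, waiting for validation" for label in labels}

    # Collect every task that agrees with at least one other task.
    agreeing = set()
    for a, b in combinations(labels, 2):
        if abs(results[a] - results[b]) <= tolerance:
            agreeing.update((a, b))

    if not agreeing:
        # Nothing matches: a further task would be sent to another host.
        return {label: "inconclusive" for label in labels}
    return {label: "valid" if label in agreeing else "invalid"
            for label in labels}


# Initial quorum: one GPU task and one CPU task that disagree.
print(validate({"gpu": 1.30, "cpu_a": 1.00}, tolerance=0.05))

# Tie-breaker returned by a second CPU host; all three are cross-compared.
print(validate({"gpu": 1.30, "cpu_a": 1.00, "cpu_b": 1.01}, tolerance=0.05))
```

In the second call the two CPU results agree with each other and the GPU result agrees with neither, which is the pattern behind my forecast for tasks currently showing "inconclusive".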

While it is not any kind of bureaucratic rule, the result I think we have been seeing at this stage of the GW GPU beta-test is that two version 1.01 CPU tasks are quite a lot more likely to agree with each other well enough to validate than either is to a GPU task already found not to agree well enough with the first CPU task.

On the other hand, maybe the nature of your system's disagreement differs, or maybe my assertions as to the current project reporting behavior are inaccurate.  

But my forecast until I get new information is that Einstein GW GPU v1.07 tasks which you see showing status "inconclusive" are likely to resolve to "invalid" within a few hours to a few days, in most cases.

I've been watching this on my own systems especially closely recently, and one of them has passed from nearly 100% validation rate to zero validation rate within the past few days.  We have had some other puzzling reports of individual systems having very bad validation rates despite seeming to be built from similar components and running the same OS as other much more successful systems. 

Vigilance is called for.  I hope my dismal forecast proves wrong in your case.  Good luck.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18712565940
RAC: 6369513

OK, I can see where my confusion starts. I couldn't see how I could get an Inconclusive when I am the only host to run the task per the task assignments. Reading about 4 pages back in the thread, I think I understand that all the 1.07 GW GPU tasks we are crunching have already been issued and returned by CPUs previously. Is my understanding correct? So I am getting an Inconclusive because my task does not agree with a CPU that ran it before?

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7220564931
RAC: 970958

Keith Myers wrote:
I think I understand that all the 1.07 GW GPU tasks we are crunching have already been issued and returned by CPUs previously. Is my understanding correct?

No.

By the time a reported-in quorum is formed, your quorum partners will definitely be CPU jobs under the current operating rules (temporary for GW GPU beta testing). But there is no rule that the tasks sent to you have already been sent to and returned by a CPU host. Tasks for which no quorum of returned work yet exists will show in your task list as "Completed, waiting for validation" once you have returned them.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18712565940
RAC: 6369513

Well then, I'm still completely confused. I have no quorum partners for the task per the task assignments. So how can I be compared to anybody yet?
