Almost all BRPS (Parkes PMPS XT) invalid?

Michael H.W. Weber
Joined: 22 Jan 05
Posts: 10
Credit: 399,175,195
RAC: 0
Topic 198015

I just looked at the results section for my i5-2500K box and was rather shocked to find that my AMD R9 290X has been producing almost exclusively invalid results since I started working on the BRPS (Parkes PMPS XT) WUs. Before that (Perseus and others) everything was just fine. I usually run 4 GPU WUs in parallel, which so far caused no issues. Now, for many days, all of the work appears to have gone to waste.

http://einsteinathome.org/account/tasks&offset=0&show_names=1&state=4&appid=29

or

http://einsteinathome.org/host/11761322/tasks&offset=0&show_names=1&state=4&appid=29

Does anybody have an idea what is going wrong here? I burn a lot of energy with this GPU, so I really need to find a solution quickly.

I just reduced the number of WUs processed in parallel from 4 to only 2 to see whether this is necessary for this WU type. I also reduced the write-to-disk interval from 1 hr (at most) to every 10 minutes (at most); I do not know whether this has any influence...
In parallel, PrimeGrid has been occupying 3 of the 4 CPU cores since the 12th of March, but errors occurred with this type of WU even before that date.

Michael.

RNA World - A Distributed Supercomputer to Advance RNA Research

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1,714,373,961
RAC: 0

What strikes me is that your 290X is reported as: "CAL Hawaii (4096MB)". I'd expect something more verbose like my 7970: "CAL AMD Radeon HD 7870/7950/7970/R9 280X series (Tahiti) (3072MB) driver: 1.4.1848"

What driver version are you using?

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3,500,680,303
RAC: 2,011,783

What is your GPU memory clock? You might try reducing it (there have been cases where this was required on such GPUs).

-----

Michael H.W. Weber
Joined: 22 Jan 05
Posts: 10
Credit: 399,175,195
RAC: 0

I use the latest AMD driver set, no overclocking, no errors before or with other projects. The BOINC manager reports this:

16.03.2015 07:19:31 |  | OpenCL: AMD/ATI GPU 0: Hawaii (driver version 1642.5 (VM), device version OpenCL 2.0 AMD-APP (1642.5), 4096MB, 4096MB available, 3802 GFLOPS peak)

Michael

RNA World - A Distributed Supercomputer to Advance RNA Research

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1,714,373,961
RAC: 0

You could try to run the beta versions of the GPU apps. They are faster and might get you back to producing valid results. I have no reason to believe this should work but it's something to try.
To enable beta, check the "Run beta/test application versions?" box on your E@H preferences page.

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3,500,680,303
RAC: 2,011,783

You might even try downclocking the GPU memory from its stock setting, since there were some known issues with such GPUs running at a high memory clock. I did this for my 280X, which was originally at 1500 MHz.
Also, as suggested, trying the beta application might be worth it as well, since that application should need less memory bandwidth.

-----

archae86
Joined: 6 Dec 05
Posts: 3,157
Credit: 7,208,324,931
RAC: 936,483

On one of my five GPUs the beta application actually seems to require a slightly lower clock rate than the stock application. This might not be the case for you, but I'll be a little surprised if it fixes your problem.

I do think you would be wise to try downclocking your GPU, possibly substantially, as a diagnostic technique.

Michael H.W. Weber
Joined: 22 Jan 05
Posts: 10
Credit: 399,175,195
RAC: 0

Sorry, but so far none of these "explanations" and suggestions is really helpful. The board runs rock-stable. There is no overclocking issue and why should I try a beta when the stable release doesn't do the job while it does the job for others?

I now have the suspicion that it could have something to do with running multiple GPU WUs in parallel.
I also figured that there were two non-validated WUs from the Perseus batch. Unfortunately, older WUs have been deleted from the system already, so I can't inquire further...
The point is that I got credits with the GPU for quite a while - even when running 4 WUs in parallel - but then suddenly all are invalid. It started some time around 5th of March.

Michael.

RNA World - A Distributed Supercomputer to Advance RNA Research

mikey
Joined: 22 Jan 05
Posts: 12,648
Credit: 1,839,039,974
RAC: 4,861

Quote:

Sorry, but so far none of these "explanations" and suggestions is really helpful. The board runs rock-stable. There is no overclocking issue and why should I try a beta when the stable release doesn't do the job while it does the job for others?

I now have the suspicion that it could have something to do with running multiple GPU WUs in parallel.
I also figured that there were two non-validated WUs from the Perseus batch. Unfortunately, older WUs have been deleted from the system already, so I can't inquire further...
The point is that I got credits with the GPU for quite a while - even when running 4 WUs in parallel - but then suddenly all are invalid. It started some time around 5th of March.

Michael.

Have you tried running only 3 units at a time then? How about only 2 units at a time? Your RAC may actually go up if you run fewer units but more of them get validated; the project may like it better too.
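
To put rough numbers on that idea, here is a minimal sketch. All figures in it (run times and valid fractions) are made-up placeholders, not measurements from this host, and the function name is hypothetical; it only shows how valid output depends on both throughput and validation rate.

```python
def valid_tasks_per_day(concurrent: int, hours_per_task: float, valid_fraction: float) -> float:
    """Estimate validated tasks per day for one GPU.

    concurrent      -- tasks running side by side on the GPU
    hours_per_task  -- wall-clock hours each task needs at that concurrency
    valid_fraction  -- share of finished tasks that pass validation (0..1)
    """
    completed_per_day = concurrent * 24.0 / hours_per_task
    return completed_per_day * valid_fraction


# Hypothetical scenario A: 4 at a time, slow per task, mostly invalid.
four_up = valid_tasks_per_day(concurrent=4, hours_per_task=8.0, valid_fraction=0.1)

# Hypothetical scenario B: 2 at a time, faster per task, mostly valid.
two_up = valid_tasks_per_day(concurrent=2, hours_per_task=4.5, valid_fraction=0.9)

print(f"4 at a time: {four_up:.1f} valid tasks/day")
print(f"2 at a time: {two_up:.1f} valid tasks/day")
```

Plug in your own run times and valid fractions from the task list to compare the settings you actually tried.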

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3,500,680,303
RAC: 2,011,783

"The board runs rock-stable" - if you mean stable when gaming, that can be very different from considering it stable for GPU computing.

-----

Pollux_P3D
Joined: 8 Feb 11
Posts: 30
Credit: 212,418,648
RAC: 0

Across the 2 systems there is a total of one validated Perseus task and, so far, three correct Parkes tasks. The run times are correspondingly low. The miscalculations and validate errors (since 20 February, Perseus included) are more likely due to a misconfigured environment for the graphics card.
http://einsteinathome.org/host/11454520/tasks&offset=0&show_names=0&state=0&appid=0
http://einsteinathome.org/host/11761322/tasks&offset=0&show_names=0&state=3&appid=0

Perhaps you should first determine the necessary (optimal) CPU core support for the GPU with a single Parkes task (using Windows Task Manager). OpenCL absolutely needs a free core here.

You also need to take countermeasures against high-priority mode for CPU work: either keep small work buffers or set core usage on multiprocessor systems to 99 % (in BOINC Manager). Otherwise BOINC also loads the core reserved for the GPU with additional CPU work, and run times increase tenfold or tasks are even aborted by a time limit.
With several GPU WUs, the CPU shares should always add up to one or more full cores.

Quote:
Create the XML file with, e.g., Notepad and copy it into the respective project folder in the BOINC data directory (on Win7 this is usually "C:\ProgramData\BOINC\projects\..."). Note that the file extension must be 'xml' and not 'xml.txt' or similar!


An app_config.xml is easier to handle for this than the account settings. With the account settings, also note that a new setting only applies to new WUs.
After changing app_config.xml, BOINC needs to be restarted. Re-reading the config files also works and the change is applied, but the display in the BOINC task list (x CPU + x GPU) is then not correct.



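<app_config>
  <app>
    <name>einsteinbinary_BRP5</name>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>einsteinbinary_BRP6</name>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>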

gpu_usage 1 --> 1 WU
gpu_usage 0.5 --> 2 WUs
gpu_usage 0.33 --> 3 WUs, etc.
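
The gpu_usage mapping and the "full cores" rule can be checked with a few lines of Python. This is only a sketch under a simplifying assumption: that the number of tasks sharing one GPU is the largest whole number whose gpu_usage shares still fit into 1.0. The function names are made up for illustration and are not part of BOINC.

```python
import math


def concurrent_tasks(gpu_usage: float) -> int:
    """Tasks that fit on one GPU if each claims `gpu_usage` of it (assumed model)."""
    if not 0 < gpu_usage <= 1:
        raise ValueError("gpu_usage must be in (0, 1]")
    # Small epsilon guards against floating-point results such as 2.9999999.
    return math.floor(1.0 / gpu_usage + 1e-9)


def reserved_cpu(gpu_usage: float, cpu_usage: float) -> float:
    """Total CPU cores budgeted for the GPU tasks running side by side."""
    return concurrent_tasks(gpu_usage) * cpu_usage


for usage in (1, 0.5, 0.33):
    n = concurrent_tasks(usage)
    print(f"gpu_usage {usage} -> {n} WU(s), {reserved_cpu(usage, 1.0)} CPU core(s) at cpu_usage 1")
```

With cpu_usage 0.5 and gpu_usage 0.5, for example, the two WUs together budget exactly one full core, which is the kind of sum recommended above.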
