Hello,
I recently built a new machine for gaming/crunching and have run into a higher than normal number of invalids. Almost 24% of all my work is invalid.
The stats of the new machine are here https://einsteinathome.org/host/12215285
I have almost exclusively run SETI/Einstein on multiple machines, and this is the first time I have had such a high invalid rate. I don't understand what I could be doing wrong.
Before I start tinkering with settings, I figured I would ask here.
Thank you for your assistance in advance.
Jim
Copyright © 2024 Einstein@Home. All rights reserved.
Can you point out what I am doing wrong?
Gaming you say? Overclocked the GPU? If so, try to lower the clocks/voltages to stock for the GPU and see if that helps.
Try to set 2XWU per GPU and temporarily disable CPU WU crunching. Test these settings and let us know here how it goes.
Yup, gaming but I had no reason to OC as it runs all my games just fine. Everything is stock.
I will disable CPU and 2XWU the GPU and see if that helps.
Jim
Uninstall your display driver with DDU (Display Driver Uninstaller) and install another driver (older or newer).
http://www.guru3d.com/files-details/display-driver-uninstaller-download.html
Could this somehow be a problem with Windows 10 and Einstein?
I just noticed a work machine with an ATI GPU that is having the same amount of invalids. Both machines have Windows 10 running.
I am still making other small changes, like changing the number of cores and GPUs, but have no smoking gun yet.
I would rather not just abandon Einstein, but I don't really have a choice when 50-100% of my WUs are invalid.
Thanks
Jim
Are you running more than one task at a time on the GPU?
When I upgraded my Win7 machine to Win10 the new graphics driver caused a lot of invalids. Had to change from running 3 tasks at a time to just running 1. That fixed the invalids.
I was running 3, switched to just 2 but that didn't seem to work out any better. I will try that next....
I have always run my machines at 100% of CPUs with 95% CPU time with no problems. I will keep it running 2 WUs on the GPU and reduce it to 90% of CPUs and see if that works. If not, I will change it to just 1 WU per GPU.
Thanks
Jim
Can someone decipher this for me? This is from the BRP6 v1.52 tasks, on which I have a 100% invalid rate. Should there be so many "Checkpoint committed!" lines?
Thanks!
core_client_version>7.6.22
Activated exception handling...
[08:25:08][135692][INFO ] Starting data processing...
[08:25:08][135692][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 7 MB (4011 MB free / 4018 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[08:25:08][135692][INFO ] Using CUDA device #0 "GeForce GTX 980 Ti" (2816 CUDA cores / -1319.02 GFLOPS)
[08:25:08][135692][INFO ] Version of installed CUDA driver: 8000
[08:25:08][135692][INFO ] Version of CUDA driver API used: 3020
[08:25:09][135692][INFO ] Checkpoint file unavailable: PM0108_02051_308.cpt (No such file or directory).
------> Starting from scratch...
[08:25:09][135692][INFO ] Header contents:
------> Original WAPP file: ./PM0108_02051_DM1104.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 51558.567234862596
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 112022.2642
------> DEC (J2000): -545348.657001
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4700528
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 1104 cm^-3 pc
------> Scale factor: 1.36364
[08:25:09][135692][INFO ] Seed for random number generator is 0.
[08:25:10][135692][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-008
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[08:25:10][135692][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 139 MB (3879 MB free / 4018 MB total) -> Used by this application (assuming a single GPU task): 132 MB
[08:26:08][135692][INFO ] Checkpoint committed!
[08:27:09][135692][INFO ] Checkpoint committed!
[08:28:09][135692][INFO ] Checkpoint committed!
[08:29:09][135692][INFO ] Checkpoint committed!
[08:30:09][135692][INFO ] Checkpoint committed!
[08:31:09][135692][INFO ] Checkpoint committed!
[08:32:10][135692][INFO ] Checkpoint committed!
[08:33:10][135692][INFO ] Checkpoint committed!
[08:34:10][135692][INFO ] Checkpoint committed!
[08:35:10][135692][INFO ] Checkpoint committed!
[08:36:10][135692][INFO ] Checkpoint committed!
[08:37:10][135692][INFO ] Checkpoint committed!
[08:38:11][135692][INFO ] Checkpoint committed!
[08:39:11][135692][INFO ] Checkpoint committed!
[08:40:11][135692][INFO ] Checkpoint committed!
[08:41:11][135692][INFO ] Checkpoint committed!
[08:42:11][135692][INFO ] Checkpoint committed!
[08:43:12][135692][INFO ] Checkpoint committed!
[08:44:12][135692][INFO ] Checkpoint committed!
[08:45:12][135692][INFO ] Checkpoint committed!
[08:46:12][135692][INFO ] Checkpoint committed!
[08:47:12][135692][INFO ] Checkpoint committed!
[08:48:12][135692][INFO ] Checkpoint committed!
[08:49:13][135692][INFO ] Checkpoint committed!
[08:50:13][135692][INFO ] Checkpoint committed!
[08:51:13][135692][INFO ] Checkpoint committed!
[08:52:13][135692][INFO ] Checkpoint committed!
[08:53:13][135692][INFO ] Checkpoint committed!
[08:54:14][135692][INFO ] Checkpoint committed!
[08:55:14][135692][INFO ] Checkpoint committed!
[08:56:14][135692][INFO ] Checkpoint committed!
[08:57:14][135692][INFO ] Checkpoint committed!
[08:58:14][135692][INFO ] Checkpoint committed!
[08:59:15][135692][INFO ] Checkpoint committed!
[09:00:15][135692][INFO ] Checkpoint committed!
[09:01:15][135692][INFO ] Checkpoint committed!
[09:02:15][135692][INFO ] Checkpoint committed!
[09:03:15][135692][INFO ] Checkpoint committed!
[09:04:15][135692][INFO ] Checkpoint committed!
[09:05:16][135692][INFO ] Checkpoint committed!
[09:06:16][135692][INFO ] Checkpoint committed!
[09:07:16][135692][INFO ] Checkpoint committed!
[09:08:16][135692][INFO ] Checkpoint committed!
[09:09:16][135692][INFO ] Checkpoint committed!
[09:10:17][135692][INFO ] Checkpoint committed!
[09:11:17][135692][INFO ] Checkpoint committed!
[09:12:17][135692][INFO ] Checkpoint committed!
[09:13:17][135692][INFO ] Checkpoint committed!
[09:14:17][135692][INFO ] Checkpoint committed!
[09:15:17][135692][INFO ] Checkpoint committed!
[09:16:18][135692][INFO ] Checkpoint committed!
[09:17:03][135692][INFO ] Statistics: count dirty SumSpec pages 7306 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1886937
[09:17:03][135692][INFO ] Data processing finished successfully!
[09:17:03][135692][INFO ] Starting data processing...
[09:17:03][135692][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 7 MB (4011 MB free / 4018 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[09:17:03][135692][INFO ] Using CUDA device #0 "GeForce GTX 980 Ti" (2816 CUDA cores / -1319.02 GFLOPS)
[09:17:03][135692][INFO ] Version of installed CUDA driver: 8000
[09:17:03][135692][INFO ] Version of CUDA driver API used: 3020
[09:17:03][135692][INFO ] Checkpoint file unavailable: PM0108_02051_308.cpt (No such file or directory).
------> Starting from scratch...
[09:17:03][135692][INFO ] Header contents:
------> Original WAPP file: ./PM0108_02051_DM1114.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 51558.567234653747
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 112022.2642
------> DEC (J2000): -545348.657001
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4700528
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 1114 cm^-3 pc
------> Scale factor: 1.36364
[09:17:03][135692][INFO ] Seed for random number generator is 1086045116.
[09:17:04][135692][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-008
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[09:17:04][135692][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 139 MB (3879 MB free / 4018 MB total) -> Used by this application (assuming a single GPU task): 132 MB
[09:17:18][135692][INFO ] Checkpoint committed!
[09:18:18][135692][INFO ] Checkpoint committed!
[09:19:18][135692][INFO ] Checkpoint committed!
[09:20:18][135692][INFO ] Checkpoint committed!
[09:21:19][135692][INFO ] Checkpoint committed!
[09:22:19][135692][INFO ] Checkpoint committed!
[09:23:19][135692][INFO ] Checkpoint committed!
[09:24:19][135692][INFO ] Checkpoint committed!
[09:25:19][135692][INFO ] Checkpoint committed!
[09:26:20][135692][INFO ] Checkpoint committed!
[09:27:20][135692][INFO ] Checkpoint committed!
[09:28:20][135692][INFO ] Checkpoint committed!
[09:29:20][135692][INFO ] Checkpoint committed!
[09:30:20][135692][INFO ] Checkpoint committed!
[09:31:20][135692][INFO ] Checkpoint committed!
[09:32:21][135692][INFO ] Checkpoint committed!
[09:33:21][135692][INFO ] Checkpoint committed!
[09:34:21][135692][INFO ] Checkpoint committed!
[09:35:21][135692][INFO ] Checkpoint committed!
[09:36:21][135692][INFO ] Checkpoint committed!
[09:37:22][135692][INFO ] Checkpoint committed!
[09:38:22][135692][INFO ] Checkpoint committed!
[09:39:22][135692][INFO ] Checkpoint committed!
[09:40:22][135692][INFO ] Checkpoint committed!
[09:41:22][135692][INFO ] Checkpoint committed!
[09:42:22][135692][INFO ] Checkpoint committed!
[09:43:23][135692][INFO ] Checkpoint committed!
[09:44:23][135692][INFO ] Checkpoint committed!
[09:45:23][135692][INFO ] Checkpoint committed!
[09:46:23][135692][INFO ] Checkpoint committed!
[09:47:23][135692][INFO ] Checkpoint committed!
[09:48:24][135692][INFO ] Checkpoint committed!
[09:49:24][135692][INFO ] Checkpoint committed!
[09:50:24][135692][INFO ] Checkpoint committed!
[09:51:24][135692][INFO ] Checkpoint committed!
[09:52:24][135692][INFO ] Checkpoint committed!
[09:53:25][135692][INFO ] Checkpoint committed!
[09:54:25][135692][INFO ] Checkpoint committed!
[09:55:25][135692][INFO ] Checkpoint committed!
[09:56:25][135692][INFO ] Checkpoint committed!
[09:57:25][135692][INFO ] Checkpoint committed!
[09:58:25][135692][INFO ] Checkpoint committed!
[09:59:26][135692][INFO ] Checkpoint committed!
[10:00:26][135692][INFO ] Checkpoint committed!
[10:01:26][135692][INFO ] Checkpoint committed!
[10:02:26][135692][INFO ] Checkpoint committed!
[10:03:26][135692][INFO ] Checkpoint committed!
[10:03:39][135692][INFO ] Statistics: count dirty SumSpec pages 6873 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1886937
[10:03:39][135692][INFO ] Data processing finished successfully!
10:03:39 (135692): called boinc_finish(0)
]]>
Hi James,
There's not much to decipher from a volunteer's point of view. It looks pretty much like the standard stderr.txt output that is returned for these tasks. There are no lines flagged as [ERROR], just the normal [INFO] stuff.
So that crunching can be restarted at any time with minimal losses, checkpoints (the saved state of the crunching) are written approximately every minute. The stderr text just records the information about this. BRP6 tasks contain two bundled smaller tasks so it's normal to see two separate blocks of 'checkpoint saves' - more than two if crunching is stopped and restarted at any point.
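For anyone who wants to check the checkpoint pattern in their own stderr output, here is a minimal Python sketch. It counts the "Checkpoint committed!" lines in each bundled sub-task and reports the average gap between them. The function name `summarize_checkpoints` and the `SAMPLE` text are just for illustration (the sample is an abridged excerpt of the log pasted above), nothing here is part of BOINC or the Einstein@Home app, and it assumes the timestamp format shown in that log and a run that does not cross midnight.

```python
import re
from datetime import datetime

# Matches lines such as: [08:26:08][135692][INFO ] Checkpoint committed!
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[\d+\]\[\w+\s*\]\s*(.*)$")

# Abridged excerpt of the stderr output pasted earlier in this thread.
SAMPLE = """
[08:25:08][135692][INFO ] Starting data processing...
[08:26:08][135692][INFO ] Checkpoint committed!
[08:27:09][135692][INFO ] Checkpoint committed!
[08:28:09][135692][INFO ] Checkpoint committed!
[09:17:03][135692][INFO ] Data processing finished successfully!
[09:17:03][135692][INFO ] Starting data processing...
[09:17:18][135692][INFO ] Checkpoint committed!
[09:18:18][135692][INFO ] Checkpoint committed!
[10:03:39][135692][INFO ] Data processing finished successfully!
"""


def summarize_checkpoints(stderr_text):
    """Return one summary dict per 'Starting data processing' block."""
    blocks = []
    current = None
    for line in stderr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # header details, blank lines, anything without a timestamp
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        message = match.group(2)
        if message.startswith("Starting data processing"):
            current = {"start": stamp, "end": None, "stamps": []}
            blocks.append(current)
        elif current is not None and message.startswith("Checkpoint committed"):
            current["stamps"].append(stamp)
        elif current is not None and message.startswith("Data processing finished"):
            current["end"] = stamp

    summary = []
    for block in blocks:
        stamps = block["stamps"]
        # Seconds between each checkpoint and the next one.
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        summary.append({
            "checkpoints": len(stamps),
            "mean_gap_s": sum(gaps) / len(gaps) if gaps else None,
            "runtime_s": (block["end"] - block["start"]).total_seconds()
            if block["end"] is not None else None,
        })
    return summary


if __name__ == "__main__":
    for number, info in enumerate(summarize_checkpoints(SAMPLE), start=1):
        print(number, info)
```

A mean gap close to 60 seconds in each block is consistent with the once-a-minute checkpointing described above. A restarted task would show up as an extra "Starting data processing" block.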
When you want people to look at a website file like this, just provide a link (here it is) rather than pasting the entire file into a message.
In situations where the task is completing successfully but the actual results (not this stderr text) contain rubbish, there is a fairly short list of likely culprits.
* The app/driver/OS combination has some sort of incompatibility - possible, you need feedback from other Win10 users about this.
* Some other hardware issue on your machine - PSU, RAM, etc - try replacing components if possible.
* The GPU itself is somehow causing the bad results - don't really know what it might be. Ideally, you should try a different 980Ti. I don't suppose you have a spare one lying around :-).
If it were my problem, I would set up to crunch just BRP6 tasks by 'turning off' in your project preferences, all the other different science runs. You are doing both BRP6 and BRP4G at the moment and a significant number of BRP4G are failing with 'validate error' as well. If you can get BRP6 to work properly, I would think BRP4G would also be fixed. Make sure you set the GPU utilization factor to 1 (only a single GPU task at a time).
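To make the utilization factor a bit more concrete, here is a small Python sketch of how I understand it to work: the factor is the fraction of one GPU that each task claims, so tasks keep being started as long as the claimed fractions add up to no more than 1. The function name `tasks_per_gpu` is made up for this example, and the rule itself is my assumption about the scheduling behaviour rather than something taken from the project's documentation.

```python
def tasks_per_gpu(utilization_factor):
    """Estimate how many tasks run at once on one GPU for a given factor.

    Assumption: each task claims `utilization_factor` of the GPU and tasks
    are started while the total claimed share stays at or below 1.
    """
    if not 0 < utilization_factor <= 1:
        raise ValueError("expected a factor greater than 0 and at most 1")
    # The small epsilon guards against float rounding, e.g. 1 / 0.2.
    return int(1.0 / utilization_factor + 1e-9)


if __name__ == "__main__":
    for factor in (1.0, 0.5, 0.33):
        print(factor, "->", tasks_per_gpu(factor), "task(s) at a time")
```

Under that assumption a factor of 1 gives a single task at a time, 0.5 gives two, and 0.33 gives three.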
You are using a rather old app that was built with the CUDA 3.2 libraries. It would be useful to see if the CUDA 5.5 version works better (it's certainly faster). It still has a test status so you have to 'allow' test applications in your project preferences. If you do that you will automatically be sent the new app for any new tasks that you receive.
The other thing to do is look at what driver versions others are using. Go to the top hosts listing which can be found under the 'statistics' link on the home page of the website. If you browse through the pages you are bound to find examples of Win10 machines with 980Ti GPUs. There is one (#10) on the very first page. You can see the driver version and if you browse the tasks list for such hosts you can see the app being used - the beta-cuda-55 test app in a couple of examples I looked at. Your driver is listed as 364.51. The driver for #10 is 361.45. Maybe you could try that older driver to see if that makes any difference.
Please realise that I know nothing about recent versions of Windows and the drivers used. I use Linux exclusively. The last Windows I used was XP more than 8 years ago, well before GPU crunching came along :-).
Good luck with trying to find the cause of the problem.
Cheers,
Gary.
Thanks Gary, I wasn't sure how to post the link, I know it has something to do with the buttons up top but was too tired last night to play with it. I will attempt to learn it before posting that type of data again.
I checked a bunch of WUs against others who successfully processed them and even found a few machines nearly identical to the one I am running. I think my next step is to change the driver of the video card.
Since my last post, I turned off S@H and all CPU tasks and only ran the GPU BRP4 and BRP6 tasks. I have a failure rate between 50% and 100% on both, and I feel that I am wasting not only my time but also that of anyone else who may need the WUs.... very frustrating.
I don't have a spare 980, which is funny because I was going to buy a second one shortly, but not until I figure this stuff out. I could possibly install one of my spare R9 280X cards in its place to rule out everything else (PSU, RAM, bad mojo).
After reading a bunch of posts on how some people fixed their issues with driver updates, I am going to update to the latest driver and see if that works. If not, I will start going backwards until I find something that works.
Again, thanks for the pointers, I am still determined to figure this out and continue E@H.
Jim