Multiple GPU machine back up and running!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407221245
RAC: 35339386

Quote:
... Second, in my attempts to re-engineer the chaos the rig created hundreds of error work units.


If you want to make a major change, why didn't you set NNT (No New Tasks), along with low work cache settings, and allow all on-board work to be completed before even starting the change? If the answer is "impatience", why didn't you at least pull the network cable when testing out the restart? If all goes pear-shaped, you can edit client_state.xml to remove all the trashed tasks so they can't be reported when you reattach the network cable. Obviously you need to plan ahead so you know how to edit client_state.xml in an emergency like this. Because Einstein will always 'resend lost tasks', simply removing those tasks from the state file (even though they are completely trashed) will allow the project to send them all back to you in pristine condition, in batches of 12, each time you contact the project after getting started again.

Another possible approach is to make a full backup copy of the entire BOINC tree beforehand. I wouldn't have a clue how difficult this might be in Windows but it's dead easy in Linux. If disaster strikes, you can delete the whole mess and go back to exactly where you were before the disaster. Of course, you should have also removed the network cable to prevent any of the disaster being reported before you could intervene.
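For anyone who would rather script the backup than copy by hand, here is a minimal sketch in Python. It assumes BOINC has been stopped first and that you pass in the location of your own BOINC data directory; the function names and paths are hypothetical, not anything from BOINC itself.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_tree(data_dir, backup_root):
    """Copy the whole BOINC data tree to a timestamped folder and return its path.

    Stop the BOINC client first so files such as client_state.xml are not
    changing while they are copied.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such BOINC data directory: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-backup-{stamp}"
    shutil.copytree(src, dest, symlinks=True)
    return dest


def restore_boinc_tree(backup_dir, data_dir):
    """Delete the current (broken) tree and put the backup copy back in place."""
    dst = Path(data_dir)
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(backup_dir, dst, symlinks=True)
```

As with the manual approach, the restore only helps if the network cable was out, so that nothing from the disaster was reported before you put the old tree back.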

Quote:
Third, this seems to have resulted in my machine being denied ...


Not true. If you delete your current host ID and allow your machine to acquire a new one, you can immediately start with a clean slate. Of course, if you keep repeating this cycle, the wrath of the Gods might descend upon you :-).

Quote:
... I petitioned Einstein@home for permission to receive work units with no success.


You're not using the 'right' petition :-).

Quote:
... we would have one simple thing. Specifically, dummy units to use to test our rigs.


You really have this already. Maintaining a special pool of dummy units would be much harder, and more expensive in Staff time, than a bit of forethought by the user. So you want to try a test with 'test' units? Here is what you do:

    * Set NNT and allow all on-board work to complete.
    * Set work cache size to bare minimum and leave NNT still set.
    * Shut down and make all hardware/software changes.
    * Restart machine and see if all hardware is detected and if BOINC can be restarted with all the correct startup messages.
    * If so, unset NNT and as soon as the contact that produces a download of tasks happens, set NNT again. This protects against further downloads if the initial set of tasks (one per crunching unit, max) fails.
    * These first tasks are your 'test' tasks. If they fail, then back to the drawing board. If they succeed, whooppeeeeee!!! The project would have no issue if they happened to fail, as long as you learn why and then fix the failure.
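The NNT toggling in the steps above can also be done from a script rather than BOINC Manager. This is only a rough sketch, assuming the `boinccmd` tool that ships with BOINC is on your path and using its `nomorework` and `allowmorework` project operations. The project URL is a placeholder and the helper names are made up for illustration.

```python
import subprocess

# Hypothetical placeholder: use the project URL shown in your own BOINC Manager.
PROJECT_URL = "http://project.example/"


def nnt_command(project_url, allow_new_tasks):
    """Build the boinccmd call that clears (True) or sets (False) NNT for a project."""
    operation = "allowmorework" if allow_new_tasks else "nomorework"
    return ["boinccmd", "--project", project_url, operation]


def run(command, dry_run=True):
    """Show the command and only execute it when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Before the change: set NNT, then wait for on-board work to finish.
    run(nnt_command(PROJECT_URL, allow_new_tasks=False))
    # After the restart looks clean: allow work just long enough for the
    # first download of 'test' tasks ...
    run(nnt_command(PROJECT_URL, allow_new_tasks=True))
    # ... then set NNT again so a failure cannot pull in more tasks.
    run(nnt_command(PROJECT_URL, allow_new_tasks=False))
```

The dry-run default just prints the commands, so you can check them before letting anything touch a running client.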

Quote:
It is apparently not even worth that effort, or the effort simply to reject the error units and permit a reset. I think the jig is up guys.


Maybe you'd like to rethink this a bit.

Cheers,
Gary.

David Rapalyea
Joined: 3 Jan 13
Posts: 79
Credit: 63886821
RAC: 0

I am just a duffer who built a good machine that went belly up the first time I replaced a gtx 660 with a gtx 750ti. And I acknowledge I did not expect total collapse and did not prepare for it.

However, the machine has now been crunching Milkyway for a couple of weeks without much problem, and I am using its last configuration, which was producing way many Einstein units on BETA. Three GPUs = 1 x gtx 660 + 2 x gtx 650. Something like 140,000 stones at 300+ watts or some such for Einstein.

Whatever the power draw, it was pushing below 35 watts per 10k stones, my goal at the time two or three years ago. But now we have Maxwell! And if I remember correctly, before BETA, I was running 4 GPUs (1 x gtx 660 + 3 x cordless 650). That was unstable with BETA so I plucked one cordless gtx 650. Was really screaming along. Then fugedaboudit.

As stated above, I am a duffer who managed a nice rig for a long time. Then NVIDIA sandbagged me with Maxwell. Then BOINC sandbagged me for excessive error units. I say sandbagged because error units should easily be recycled just as if they were dummy units. Perhaps BOINC gets a string of error units and only sends a dozen new ones till it sorts out. What a novel idea.

The serious problem is BOINC seems unaware my rig has been working Milkyway just fine and as of thirty minutes ago would not send Einstein units. And that is the case even though I contacted the moderator and explained the entire thing. Right now I have two idle GTX 750ti units I will probably donate to the local thrift shop.

If the project is just for I.T. types, either prohibit multiple GPUs without express permission or do something to accommodate those of us who simply had fun crunching numbers. And if I am not mistaken, my total crunching is still either equal to or in excess of 98% or 99% of all BOINC participants.

Arecibo 19 Oct 2012
Just Because The Space Alien Is Green
Does Not Mean You Should Go

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 118626898
RAC: 109398

Quote:
As stated above, I am a duffer who managed a nice rig for a long time. Then NVIDIA sandbagged me with Maxwell. Then BOINC sandbagged me for excessive error units. I say sandbagged because error units should easily be recycled just as if they were dummy units. Perhaps BOINC gets a string of error units and only sends a dozen new ones till it sorts out. What a novel idea.


If a task is reported with an error we must assume something is wrong with a host. Therefore the amount of work that the host can get is reduced with every error. Otherwise, one faulty host would crunch through all the available work in no time with no real benefit.

Quote:
The serious problem is BOINC seems unaware my rig has been working Milkyway just fine and as of thirty minutes ago would not send Einstein units. And that is the case even though I contacted the moderator and explained the entire thing. Right now I have two idle GTX 750ti units I will probably donate to the local thrift shop.

If the project is just for I.T. types, either prohibit multiple GPUs without express permission or do something to accommodate those of us who simply had fun crunching numbers. And if I am not mistaken, my total crunching is still either equal to or in excess of 98% or 99% of all BOINC participants.

Just because Milkyway is running fine doesn't mean that other projects are fine too. What I see in the latest scheduler request from your 3-GPU host is that it requests work for the GPUs, but it seems you only have the Gamma-ray pulsar search application enabled. Can you please check in your project settings that work from all applications is permitted? And also check whether you have an app_config.xml in the project directory?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407221245
RAC: 35339386

Quote:
... plucked one cordless gtx 650. Was really screaming along. Then fugedaboudit.


Removing a GPU would not cause all tasks to be trashed. Even if it was the last GPU, the worst that would happen is that all GPU tasks would still be there but would be labelled "GPU missing ...". If the GPU were re-inserted, the tasks would resume crunching. On a multi-GPU (same brand) setup, tasks should not be trashed. What happened to you is obviously unfortunate and distressing, but my gripe is that you immediately blame the project/BOINC for this when the much more likely scenario is that something else on your host/under your control got changed/deleted/whatever. I don't have the faintest clue what that might have been.

Quote:
... BOINC sand bagged me for excessive error units. I say sandbagged because error units should easily be recycled just as if they were dummy units.


There is a longstanding arrangement called the "daily limit", which is 32 per CPU core. Each trashed task reduces this by one. If you trash a lot of tasks, your limit will reduce to 1/core/day. Even this is not much of a restriction: at worst you lose the ability to download more than 1 task per core for 24 hours. The first successful task returned doubles this limit to 2. The next one doubles it again to 4, and so on. However, this has nothing to do with your current problem.
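To see how quickly the limit recovers, here is a toy model of the rule as described above. It is not the actual scheduler code. The floor of 1 and the starting value of 32 come from the description; capping the doubling at 32 is my assumption.

```python
DAILY_LIMIT_PER_CORE = 32  # figure quoted above


def update_limit(limit, task_ok, ceiling=DAILY_LIMIT_PER_CORE):
    """Apply one returned task to the per-core daily limit."""
    if task_ok:
        # A good result doubles the limit (assumed to stop at the ceiling).
        return min(limit * 2, ceiling)
    # A trashed result costs one, but the limit never drops below 1.
    return max(limit - 1, 1)


def replay(outcomes, start=DAILY_LIMIT_PER_CORE):
    """Return the limit after each task in a sequence of True/False outcomes."""
    limit = start
    history = []
    for task_ok in outcomes:
        limit = update_limit(limit, task_ok)
        history.append(limit)
    return history


if __name__ == "__main__":
    # 40 trashed tasks followed by 5 good ones.
    print(replay([False] * 40 + [True] * 5)[-5:])  # [2, 4, 8, 16, 32]
```

In this model, five good results in a row are enough to get from the floor back to the full allowance.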

Once the period in the penalty box has expired (you can see the time ticking down in BOINC Manager), if you're still not getting new tasks, the first thing to do is go through every single preference setting for the venue the host belongs to and make sure there is no setting that is blocking things. Check both local prefs and website prefs and make sure you know where the settings are coming from. If nothing is wrong with preference settings, go to your computer list on the website and click the last contact link in the far right hand column. Here is the current one for your computer.

2015-10-16 02:46:47.3124 [PID=31374]   Request: [USER#xxxxx] [HOST#11455796] [IP xxx.xxx.xxx.129] client 7.4.42
2015-10-16 02:46:47.3130 [PID=31374]    [send] effective_ncpus 4 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2015-10-16 02:46:47.3130 [PID=31374]    [send] effective_ngpus 3 max_jobs_on_host_gpu 999999
2015-10-16 02:46:47.3130 [PID=31374]    [send] Not using matchmaker scheduling; Not using EDF sim
2015-10-16 02:46:47.3130 [PID=31374]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2015-10-16 02:46:47.3130 [PID=31374]    [send] CUDA: req 466232.79 sec, 1.50 instances; est delay 0.00
2015-10-16 02:46:47.3131 [PID=31374]    [send] work_req_seconds: 0.00 secs
2015-10-16 02:46:47.3131 [PID=31374]    [send] available disk 0.98 GB, work_buf_min 86400
2015-10-16 02:46:47.3131 [PID=31374]    [send] active_frac 0.999978 on_frac 0.925382 DCF 1.000000
2015-10-16 02:46:47.3137 [PID=31374]    [send] [HOST#11455796] not reliable; max_result_day 1
2015-10-16 02:46:47.3139 [PID=31374]    [send] set_trust: random choice for error rate 0.000010: yes
2015-10-16 02:46:47.3139 [PID=31374]    [mixed] sending non-locality work first (0.2306)
2015-10-16 02:46:47.3337 [PID=31374]    [version] Checking plan class 'FGRP4-SSE2'
2015-10-16 02:46:47.3369 [PID=31374]    [version] reading plan classes from file '/BOINC/projects/EinsteinAtHome/plan_class_spec.xml'
2015-10-16 02:46:47.3369 [PID=31374]    [version] plan class ok
2015-10-16 02:46:47.3369 [PID=31374]    [version] Don't need CPU jobs, skipping version 115 for hsgamma_FGRP4 (FGRP4-SSE2)
2015-10-16 02:46:47.3369 [PID=31374]    [version] no app version available: APP#27 (hsgamma_FGRP4) PLATFORM#9 (windows_x86_64) min_version 0
2015-10-16 02:46:47.3369 [PID=31374]    [version] no app version available: APP#27 (hsgamma_FGRP4) PLATFORM#2 (windows_intelx86) min_version 0
2015-10-16 02:46:47.3478 [PID=31374]    [mixed] sending locality work second
2015-10-16 02:46:47.3506 [PID=31374] [debug]   [HOST#11455796] MSG(high) No work sent
2015-10-16 02:46:47.3506 [PID=31374] [debug]   [HOST#11455796] MSG(high) see scheduler log messages on http://einstein5.aei.uni-hannover.de/EinsteinAtHome/host_sched_logs/11455/11455796
2015-10-16 02:46:47.3507 [PID=31374]    Sending reply to [HOST#11455796]: 0 results, delay req 60.00
2015-10-16 02:46:47.3517 [PID=31374]    Scheduler ran 0.043 seconds

This is nice and short. Here are some things to note:-

    * You have 4 CPU cores and 3 GPUs (ncpus 4 and ngpus 3).
    * You're not asking for CPU work (CPU: req 0.00 sec) but you are asking for GPU work (CUDA: req 466232.79 sec).
    * You don't have much disk space allocated (0.98GB) but this isn't the reason.
    * The scheduler doesn't regard your host as reliable and shows your current limit of 1 per day.
    * The scheduler checks the CPU plan class FGRP4-SSE2, which is not relevant to you (CPU: req 0.00).
    * I have no idea why no GPU plan classes are being checked. Normally, you would expect to see a lot more lines about this.
    * There is no specific error message to say why no GPU work is being checked or allocated.
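If you want to pull the same facts out of a scheduler log yourself, something like the following Python sketch would do it. The patterns are written against the log lines shown above and are an assumption about their layout, not a documented format. The field names in the summary are my own.

```python
import re

# Regular expressions matched against the log excerpt above (assumed layout).
PATTERNS = {
    "ncpus": (r"effective_ncpus (\d+)", int),
    "ngpus": (r"effective_ngpus (\d+)", int),
    "cpu_req_sec": (r"CPU: req ([\d.]+) sec", float),
    "gpu_req_sec": (r"CUDA: req ([\d.]+) sec", float),
    "disk_gb": (r"available disk ([\d.]+) GB", float),
    "max_result_day": (r"max_result_day (\d+)", int),
    "results_sent": (r"Sending reply to \[HOST#\d+\]: (\d+) results", int),
}


def summarize(log_text):
    """Return the key request values, plus every plan class that was checked."""
    summary = {}
    for name, (pattern, convert) in PATTERNS.items():
        match = re.search(pattern, log_text)
        summary[name] = convert(match.group(1)) if match else None
    summary["plan_classes"] = re.findall(r"Checking plan class '([^']+)'", log_text)
    return summary


SAMPLE_LOG = """\
[send] effective_ncpus 4 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
[send] effective_ngpus 3 max_jobs_on_host_gpu 999999
[send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
[send] CUDA: req 466232.79 sec, 1.50 instances; est delay 0.00
[send] available disk 0.98 GB, work_buf_min 86400
[send] [HOST#11455796] not reliable; max_result_day 1
[version] Checking plan class 'FGRP4-SSE2'
Sending reply to [HOST#11455796]: 0 results, delay req 60.00
"""

if __name__ == "__main__":
    for key, value in summarize(SAMPLE_LOG).items():
        print(f"{key}: {value}")
```

A host that asks for GPU seconds but shows only CPU plan classes in `plan_classes`, as here, is the odd pattern noted in the list above.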

You have options on what to do about this. The best one would be to start a thread in "Problems ..." listing the above scheduler contact message and pointing out that you have triple-checked all preference settings. Ask the Devs nicely if they can investigate and give you a more meaningful reason as to why the scheduler seems to be ignoring you. Other volunteers like myself can try all sorts of guesses but we would really just be making guesses in the dark. The Devs are the only ones who can really work out why you can't get work if it's not operator error.

A second option would be to try 'resetting' the project in BOINC Manager. I never use this so I'm entirely unsuitable for predicting what might happen. A third option would be to 'remove' the project (BOINC Manager) and then add it back again. Once again, I don't do this so I don't know if that will end up getting you a new ID and hence a full daily work allowance and the ability to get work.

A last resort option (which is probably what I would do if it were me) would be to manually give your computer a 'different' ID. If you look through your computers on the website and select the 'All computers' link instead of just those active in the last 30 days, you will see all the previous incarnations of that machine. I particularly like the one with the HostID of 7205442 because it has a total credit of 20M+ and was last active on 20 Feb 2014. That's more credit than your current ID 11455796 has :-).

For that old ID (7205442), you need to click on the 'details' link and find the value of the 'Number of times the client has contacted server' field. Make a note of this number plus 1. If it was '12345', the value you note down would be '12346'. Now stop BOINC, browse to the BOINC Data folder and find the state file (client_state.xml). You are going to edit this file with a plain text editor like notepad. You are going to change just two things and add one new line.

If you have several projects attached, you need to make sure you are editing stuff for the Einstein project. The quickest way to get to the right place is to search for '11455796', which is in the correct Einstein project section. This should bring you to the line '<hostid>11455796</hostid>' and you need to change it to '<hostid>7205442</hostid>'. A couple of lines above this you will see '<rpc_seqno>nnnnn</rpc_seqno>', where 'nnnnn' will probably be quite a large number. Whatever it is, just change it to read the '12346' value calculated (+1) from the value you looked up in the 'details' link for host 7205442. Double check that both <hostid> and <rpc_seqno> are exactly as they should be.

One final thing (for safety) is to browse down some more lines (maybe 10-20) until you find the line that says ''. Immediately below this line, insert a new one which says '<dont_request_more_work/>'. This extra line will ensure that when BOINC is restarted, it won't ask Einstein for new work until you are ready to allow it to do so. When your edits are finished and checked, save the file. You did use a plain text editor, didn't you? :-).
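If hand-editing makes you nervous, the two changes can be sketched in Python, working on the text of the state file. Treat this as an illustration only: the tag names `<hostid>` and `<rpc_seqno>` are my assumption about how those two lines appear in client_state.xml, the function name is made up, and the extra safety line is not handled. Stop BOINC and keep a copy of the original file before trying anything like this.

```python
import re


def swap_host_id(state_text, current_id, old_id, old_contact_count):
    """Return state_text with the host ID swapped and the contact counter
    set to the old host's count plus 1.

    Assumes the project section holds lines of the form
    <rpc_seqno>N</rpc_seqno> and <hostid>N</hostid> (assumed tag names).
    """
    host_line = f"<hostid>{current_id}</hostid>"
    position = state_text.find(host_line)
    if position == -1:
        raise ValueError("current host ID not found in the state file")

    # The counter sits a little above the host ID, so take the last
    # counter line that appears before it (this stays inside the right
    # project section when several projects are attached).
    counters = list(re.finditer(r"<rpc_seqno>\d+</rpc_seqno>", state_text[:position]))
    if not counters:
        raise ValueError("no contact counter found above the host ID")
    counter = counters[-1]

    new_counter = f"<rpc_seqno>{old_contact_count + 1}</rpc_seqno>"
    edited = state_text[:counter.start()] + new_counter + state_text[counter.end():]
    return edited.replace(host_line, f"<hostid>{old_id}</hostid>", 1)
```

You would read client_state.xml into a string, call `swap_host_id(text, 11455796, 7205442, count)` with the count from the 'details' page, check the result by eye and only then write it back.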

When you restart BOINC, your machine will have its ID of 20 Feb 2014, will have a full daily limit of 32/core, and NNT will be set. The preferences it has will depend on whether it was using local prefs or website prefs when it was last active. You can check that these are all suitable before you click on the 'allow new tasks' button on the projects tab. When you do 'allow', it will be interesting to see if the scheduler will send you GPU tasks.

Quote:
The serious problem is BOINC seems unaware my rig has been working Milkyway just fine and as of thirty minutes ago would not send Einstein units.


The 'serious problem' is nothing to do with BOINC or its 'awareness' of any other project. BOINC is asking for Einstein work as the above scheduler log shows. The scheduler doesn't seem to be interested in even checking that there is a suitable science run for which it could send tasks. I don't know why that is but I can think of two possibilities. Either there is some misconfiguration at your end or there is some sort of bug in how the scheduler is dealing with your machine. In either case, no volunteer, moderator or not, will be able to fix it for you. You haven't "explained the entire thing". You've fairly aggressively delivered a rant. It's understandable that you're upset but delivering a broadside against anybody/anything in range is not going to turn this around or make people happy to help.

I have told you the 'best' option above. I've also told you what I would try. In the end, you have to decide what's best for you. The Devs and quite a number of active volunteers try to help with issues like this, irrespective of the quantum of an individual's 'contribution'. It's actually quite insulting to suggest that a 'high' contribution somehow deserves a better level of service. Every volunteer's contribution is valued no matter how big or small it might be.

Cheers,
Gary.
