Former GPU mining rig now running BOINC

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

Gary Roberts wrote:

steffen_moeller wrote:
... I would just be as happy to perpetually run SETI on one GPU and have it all stable. Sounds weird, but it works. I issued a BOINC issue about it: https://github.com/BOINC/boinc/issues/3229 to support the specification of such a minimal number of tasks of a project to be run.

 Couldn't you install an app_config.xml in the Seti project dir with a <max_concurrent> of just 1 and a different file in the EAH project dir with a <max_concurrent> of whatever maximum number you wanted there?  You might have to juggle resource shares to make sure BOINC didn't stop getting EAH work because it thought Seti was behind in its share.  If Seti was out of work, the <max_concurrent> for Einstein should leave a GPU idle.  Or am I not properly understanding what you want to achieve?

 As I only run a single project, I have no real experience with <max_concurrent>.  However I believe it does what the label says :-).

How come I was happy enough with what boinc-manager did for me for all these years? Thank you so much - again! I limited E@H to 5 instead of SETI to 1 - and just uploaded the first five E@H WUs.  As usual I'll grant it all a bit of an opportunity to fail now and if this continues to work as expected then ... maybe further investigations have to wait until winter is coming and the heating season with it.
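For anyone wanting to try the same limit, here is a minimal sketch that writes out the kind of app_config.xml Gary describes. The app name hsgamma_FGRPB1G is an assumption taken from the process name shown later in this thread, and the helper name build_app_config is made up for illustration; check client_state.xml on your own host for the real app name.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return app_config.xml text that caps how many tasks of one app run at once."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# Assumed app name (hypothetical for your host); 5 is the E@H limit mentioned above.
einstein_cfg = build_app_config("hsgamma_FGRPB1G", 5)
print(einstein_cfg)
```

The resulting text would be saved as app_config.xml in the project's directory under the BOINC data directory, after which the client has to be told to re-read its config files (or be restarted) before the limit takes effect.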

 

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

Gary Roberts wrote:
steffen_moeller wrote:
... part of the amdgpu-utils that Cecht suggested earlier in this thread ...

I'd figured that was the case and it's certainly on my list of things to investigate when I get some time :-).

That young cecht is a real go-getter in terms of mastering new stuff and finding the hidden gems!  We need to chain him down and make sure he never leaves :-).

I just saw we joined E@H on the same day. Maybe we should indeed have a special interest group for us "young on(c)es". I would then also update the image of that former kitten you see on the left.

Gary Roberts wrote:

At the moment I'm trying to complete a major re-write of a script I use to do post-installation configuration of the entire crunching fleet.  I've had certain parts of that automated previously and other parts that I do manually.  As I get older, it becomes increasingly hard to remember every bit of manual tweaking.  Not only that, I'm increasingly making mistakes in root commands so I'm a danger to myself :-).

So I have this grand plan to make this script that is smart enough to do all the major parts of the job on its own and to warn me if I'm trying to do something out of order - or otherwise insane :-).  So far it's going quite well.  A module to handle the automatic installation and recording of packages required by BOINC and to save that information on external media on a per host basis as it changes over time is being tested right now and seems to be working well.  This means that recovering from a disk failure, or deciding to transfer a BOINC installation from one host to another, or even just the mundane stuff after a new host is built, can all be done pretty much automatically by answering some simple questions when asked.  I reckon then I might be able to compensate for forgetfulness by just following instructions :-)

Linux? Windows? This may be a thread in itself. Wrt software dependencies the Debian/Ubuntu packages should be darn close to what you need. As a self-educational project I once automated the installation of BOINC and its attachment to E@H for the Amazon cloud (https://aws.amazon.com) with Chef (https://en.wikipedia.org/wiki/Chef_%28software%29) if I recall correctly - may have been Ansible or Puppet or whatever. All I remember is that it worked. I think I would look into all the various "infrastructure as code" tools first.

For me the bottleneck is something else. For instance I just gathered my 4 ROCK64/PI3s to donate them to N30dG since I cannot be bothered with all the monitoring but he has written something for it. So, to me, it is less the optimisation than the reduction of downtime where the work goes - and more optimisation like overclocking typically goes at the expense of more monitoring being required.

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

This thread produced a Debian/Ubuntu package of Rick's AMDGPU Utilities that lives at https://salsa.debian.org/science-team/ricks-amdgpu-utils - the upload to the distribution shall happen when the next upstream version is released. I'll drop a note when it arrives in either distribution or a PPA.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,869
Credit: 115,068,482,856
RAC: 31,643,190

steffen_moeller wrote:
... I limited E@H to 5 instead of SETI to 1 - and just uploaded the first five E@H WUs.  As usual I'll grant it all a bit of an opportunity to fail ...

I'm very glad it seems to be working for you and I hope it continues that way :-).

One thing that might happen is that BOINC might decide to run more than 1 Seti if there is an imbalance in the resource shares.  It will be interesting to see if anything changes over the next few days or so.  Thank you so much for continuing to document what is happening.

I can appreciate your comment about waiting for winter :-).  It's midwinter here in Brisbane, Australia - daily range around 6C to 22C.  In summer, daily range is around 23C to 35C at the peak with just a couple of days beyond that usually.  I have 108 machines currently that run 24/7/365.  I don't use aircon just industrial strength forced ventilation using fresh outside air.  When I arrived this morning the outside was about 7C and my office was a very nice 24C :-).

I have a script that runs in summer to suspend crunching on all machines if the room temperature exceeds 37C. The ventilation system is so effective that for the entire last summer, crunching was paused for around 8 hours each day on just a couple of extreme days.  Even when the external temperature peaks around 38-40C, the computer room temperature seems to peak at about 36C only.
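Gary's actual script isn't shown, so the following is only a rough sketch of the decision it describes: suspend computing when the room is hotter than 37C, otherwise let BOINC run according to its preferences. The function names are invented for illustration, and the commands are only built, not executed. A remote boinccmd call would also need the host's GUI RPC password (--passwd), which is left out here.

```python
from typing import List

SUSPEND_ABOVE_C = 37.0  # room temperature threshold quoted in the post


def run_mode_for(room_temp_c: float, threshold_c: float = SUSPEND_ABOVE_C) -> str:
    """'never' suspends computing; 'auto' hands control back to the preferences."""
    return "never" if room_temp_c > threshold_c else "auto"


def boinccmd_args(host: str, room_temp_c: float) -> List[str]:
    """Build (but do not run) the boinccmd call for one host."""
    return ["boinccmd", "--host", host, "--set_run_mode", run_mode_for(room_temp_c)]


print(boinccmd_args("192.168.0.81", 38.2))
```

A real version would loop over the fleet and hand each argument list to subprocess.run; how the room temperature is measured is not stated in the post, so that part is left open.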

The computers are housed in open frames on pallet racking so the forced ventilation does a reasonable job of limiting the board temperature of each one - probably around 45-50C.  Of course, internal CPU and PSU temps are hotter than that.  I'm actually quite surprised at how well the hardware copes with the continuous high temperature running.

I have machines that I built in 2008/2009 that still run. The most common failure mode is bulging capacitors (both motherboards and PSUs) - usually after about 5-7 years of running.  I routinely replace those as I find them, with pretty close to 100% success rate.  The older machines previously did CPU crunching only.  They've all been upgraded with a decent GPU.  I don't run CPU tasks on these now and that seems to have improved the reliability of the older hardware.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,869
Credit: 115,068,482,856
RAC: 31,643,190

steffen_moeller wrote:
I just saw we joined E@H on the same day.

Well how about that!  I hadn't even noticed.  I think there's someone else that I saw posting some time ago that also has that same join date.  There are a few regulars (eg Magic Quantum Mechanic, Mikey, Holmis, etc) that are even 'older' than that.  Some (eg Magic) usually pop up around anniversary time (December, January, February) to remind us all about how many years we've all clocked up and how old we're all getting :-).

steffen_moeller wrote:
Maybe we should indeed have a special interest group for us "young on(c)es". I would then also update the image of that former kitten you see on the left.

That kitten is (was) quite cute!  No need to update that :-).

steffen_moeller wrote:
Linux? Windows? This may be a thread in itself. Wrt software dependencies the Debian/Ubuntu packages should be darn close to what you need. As a self-educational project I once automated the installation of BOINC and its attachment to E@H for the Amazon cloud (https://aws.amazon.com) with Chef (https://en.wikipedia.org/wiki/Chef_%28software%29) if I recall correctly - may have been Ansible or Puppet or whatever. All I remember is that it worked. I think I would look into all the various "infrastructure as code" tools first.

Linux, of course :-).  Actually PCLinuxOS (PCLOS).  And it's not a lack of software dependencies that I have an issue with.  I use a distro supplied minimalist .iso that's very fast to install but needs some standard things added immediately post-install.  For example, I routinely add stuff like rsync, sshd, lm_sensors, nfs-utils, wxwidgets, etc to the basic install.  It's more the tedium of having to open a package manager and then select maybe 15 or 20 packages to add to the basic install.  Adding the extras and configuring stuff is now all automated so I don't have to remember the steps and do it manually.

I prefer a rolling release distribution and I fully understand that breakages can occur when living at or near the bleeding edge.  I have a strategy that seems to cope with that, since I've never been forced to experience the 'joys' of doing software updates and finding a trashed system because of some incompatibility introduced through a 'newer version' package that hasn't been properly tested.  PCLOS seems very good in what it decides to package and for testing it before releasing to the masses.  Such problems don't seem to happen all that often.

I have an external USB drive on which the full PCLOS repo lives.  I started using that in around 2014.  It gets updated every 2 weeks or so from a PCLOS mirror in Aus.  I clone my repo copy about every 4 months and put a date on each clone.  I choose that date carefully, when the repo seems to be in a quite stable condition.  I use the previous clone rather than the very latest for updating hosts.  The current clone in use is 19-04-22 which I've now used many times with no issues at all.  I will be looking to clone 'latest' in about a month or two and then test that new clone before using it to replace 19-04-22 as the 'stable' set of updates.

I'm not a programmer.  I've never formally studied/trained for anything IT related.  I was lucky enough to be exposed to Unix in a University environment back in the late '70s - early '80s.  I was intrigued by what a skilled person could do with sh/csh so I started experimenting with some simple sh scripts.  I was working in an Engineering department at the time running an industry training course.  I wrote some scripts to handle all the paperwork (letters, faxes, bookings, timetables, course descriptions, etc) which used troff/tbl/eqn/pic to do the document formatting and text processing.  The Dept eventually acquired one of the earliest Apple laserwriters and a colleague (who did know something about programming) wrote a Unix driver for it so I got to produce some pretty professional looking documents before the first versions of Microsoft Word ever came along.  I got to love the troff suite so much that I never did use any of the Microsoft stuff.

steffen_moeller wrote:
For me the bottleneck is something else. For instance I just gathered my 4 ROCK64/PI3s to donate them to N30dG since I cannot be bothered with all the monitoring but he has written something for it. So, to me, it is less the optimisation than the reduction of downtime where the work goes - and more optimisation like overclocking typically goes at the expense of more monitoring being required.

I retired in 2007.  At the time, I was running my own business (the same industry training course) and I had staff who could only cope with Windows.  So we had a couple of Windows machines that had started running Seti in 1999.  I decided to switch to Einstein in 2005 mainly because of some 'revelations' at the end of the Seti Classic period.  When I retired, I switched everything to Linux and started expanding the fleet.

The most important thing for me is to contribute to research based on physics/cosmology/astronomy and, in so doing, keep my brain stimulated and working for as long as possible.  I really hope there is some truth in the old adage of "Use it or lose it", so I enjoy the challenge of "using it" to plan the logic behind scripts to monitor/automate everything to do with keeping a large fleet of hosts working efficiently.  If something goes wrong I want to know about it as soon as possible and as accurately as possible - ie., don't miss anything and don't have false alarms - two things that tend to be diametrically opposed to each other :-).

As an example of something that really is very satisfying to me, here is the screen log for a script called 'gpu_chk' that I've been using since late last year.  I refine/enhance it from time to time.  The log below represents what has happened to the fleet over the last 4 days - this particular run was launched last Wednesday.


[gary@eros ~]$ sleep 780 ; gpu_chk -hc"8 9 10"
gpu_chk: New run started at Wed Jul 17 09:00:00 EST 2019.
Loop Item Time      Hostname     Octet   Uptime KDE    RPC  Status   TPI  Status
==== ==== ========  ========     =====   ======  ===    ===  ======   ===  ======================
 11.   1. 14:36:09  phenom-02    ( .62)   20.1d  v4   728s     OK     18   Info: Hi ticks - 1st=142 2nd=160 3rd=18 - Final OK.
 64.   2. 07:33:40  g3260-02     ( .96)   21.8d  v5  1905s     OK      3   Info: Low ticks but then OK 1st=1 2nd=3.
 66.   3. 09:33:41  g3260-02     ( .96)   21.9d  v5  4478s     OK      0   Err: Low ticks 1st=0 2nd=0 - reboot now! ...
 72.   4. 13:31:49  i5_2310-01   (.108)   74.3d  v5  3078s     OK     14   Info: Low ticks but then OK 1st=1 2nd=14.
 75.   5. 15:06:11  g645-01      ( .81)  Err: Attempt to get uptime -> ssh: connect to host 192.168.0.81 port 22: No route to host
 78.   6. 17:00:02  g4560-04     ( .5)    22.2d  v5  3220s     OK      7   Info: Low ticks but then OK 1st=0 2nd=7.
107.   7. 15:37:31  athlon-02    (.104)   63.8d  v5 11090s   HIGH      3   Ticks OK
111.   8. 18:36:42  i3_2120-01   ( .85)   22.3d  v4  1244s     OK      20   Info: Low ticks but then OK 1st=0 2nd=20.
115.   9. 22:30:16  leto         (  .8)  Err: Attempt to get uptime -> /usr/bin/xauth: error in locking authority file /home/gary/.Xauthority
116.  10. 23:30:15  leto         (  .8)  Err: Attempt to get uptime -> /usr/bin/xauth: error in locking authority file /home/gary/.Xauthority
117.  11. 00:30:15  leto         (  .8)  Err: Attempt to get uptime -> /usr/bin/xauth: error in locking authority file /home/gary/.Xauthority
118.  12. 01:30:16  leto         (  .8)  Err: Attempt to get uptime -> /usr/bin/xauth: error in locking authority file /home/gary/.Xauthority
119.  13. 02:30:16  leto         (  .8)  Err: Attempt to get uptime -> /usr/bin/xauth: error in locking authority file /home/gary/.Xauthority
120.  14. 03:30:16  leto         (  .8)  Err: Attempt to get uptime -> /usr/bin/xauth: error in locking authority file /home/gary/.Xauthority
121.  15. 04:30:16  leto         (  .8)  Err: Attempt to get uptime -> /usr/bin/xauth: error in locking authority file /home/gary/.Xauthority
122.  16. 05:30:18  leto         (  .8)  Err: Attempt to get uptime -> /usr/bin/xauth: error in locking authority file /home/gary/.Xauthority
123.  17. 06:30:16  leto         (  .8)  Err: Attempt to get uptime -> /usr/bin/xauth: error in locking authority file /home/gary/.Xauthority
06:37:29: Run finished after 123 loops. 17 items - 11 to check and 6 for info only.

[gary@eros ~]$ sleep 180 ; gpu_chk -hc"8 9 10" -x8
gpu_chk: New run started at Sun Jul 21 08:00:00 EST 2019.
Loop Item Time      Hostname      Octet  Uptime  KDE    RPC  Status   TPI  Status
==== ==== ========  ========      =====   ======  ===    ===  ======   ===  ======================
09:07:23: Run finished after 3 loops. 0 items - 0 to check and 0 for info only.

[gary@eros ~]$ sleep 1260 ; gpu_chk -hc"8 9 10"
gpu_chk: New run started at Sun Jul 21 09:30:00 EST 2019.
Loop Item Time      Hostname     Octet  Uptime  KDE    RPC  Status   TPI  Status
==== ==== ========  ========     =====   ======  ===    ===  ======    ===  ======================

As you can see, the script was started with a delay to sync with the hour mark of 9:00am last Wednesday.  The -hc option allows a selection of 'host codes' to be monitored.  There is also a -x option (see the restart at 8:00am today) which allows individual hosts (selected using the last octet) that belong to a particular code (group of hosts) to be excluded.  All hosts have a static IP address and the 'Octet' column refers to the last octet of that static address.  All hosts have a KDE desktop (for the odd occasions when I do hookup peripherals) and hosts with "SI" GPUs still use KDE4 with the old fglrx driver.  Hosts with Polaris GPUs use KDE5 and amdgpu. 
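To illustrate how the -hc and -x selections could combine, here is a small sketch. The hostnames and octets are taken from the log above, but the host codes assigned to them are pure guesses, and select_hosts is a made-up name; the real gpu_chk logic is not shown in this thread.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical inventory: last octet -> (hostname, host code). Codes are invented.
FLEET: Dict[int, Tuple[str, int]] = {
    8: ("leto", 8),
    81: ("g645-01", 9),
    96: ("g3260-02", 10),
    62: ("phenom-02", 7),
}


def select_hosts(fleet: Dict[int, Tuple[str, int]],
                 codes: Iterable[int],
                 exclude_octets: Iterable[int] = ()) -> List[int]:
    """Return the last octets of hosts in the wanted codes, minus exclusions."""
    wanted, excluded = set(codes), set(exclude_octets)
    return sorted(octet for octet, (_name, code) in fleet.items()
                  if code in wanted and octet not in excluded)


# Rough equivalent of: gpu_chk -hc"8 9 10" -x8
print(select_hosts(FLEET, [8, 9, 10], exclude_octets=[8]))
```

Keying everything on the last octet works here because every host has a static address on the same subnet, as described above.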

RPC is how long it's been since the host made an RPC connection to the EAH servers.  TPI is Ticks per Increment. Both of these have status columns attached to draw attention to anything of interest.  The TPI 'Increment' is 2secs and the 'clock' that's ticking is doing so at 100Hz.  The kernel's /proc virtual filesystem is interrogated for a 2s interval to get the clock ticks consumed by the CPU process that supports the GPU crunching. I have found that this is an extremely reliable indicator of when a GPU starts 'spinning its wheels' so to speak.  You can see an example of low clock ticks on loop 66 (item 3) with instructions for me to follow - reboot now! :-).  There is just one occasion in the last 8 months where the script has made a wrong call on this :-).
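The TPI measurement can be sketched roughly like this, assuming the ticks come from the utime and stime fields (fields 14 and 15) of /proc/PID/stat. That choice of file and fields is my assumption; the post only says that /proc is interrogated. The function names are invented for illustration.

```python
import time


def cpu_ticks_from_stat(stat_line: str) -> int:
    """Sum utime + stime (fields 14 and 15) from one /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces or parentheses,
    # so split on the LAST ')' and count fields from there.
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])
    return utime + stime


def ticks_per_increment(pid: int, increment_s: float = 2.0) -> int:
    """Clock ticks consumed by a process over one increment (Linux only)."""
    def read() -> int:
        with open(f"/proc/{pid}/stat") as fh:
            return cpu_ticks_from_stat(fh.read())

    before = read()
    time.sleep(increment_s)
    return read() - before


sample = "12480 (hsgamma_FGRPB1G) R 1 12480 12480 0 -1 4194304 100 0 0 0 142 18 0 0 30 10 1 0"
print(cpu_ticks_from_stat(sample))
```

At 100 ticks per second, a process that keeps one core fully busy for the 2s increment would show about 200 ticks, so the values of around 14-20 in the log correspond to a support process using roughly 10% of a core.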

When high or low ticks are seen, the script double (or even triple) checks before making a call on the problem.  You can see an example of when a reboot was needed (and it really was) and you can see examples where 'unusual' numbers of ticks were diagnosed not to be a problem.  There is a gap of 5s between retries when the script is deciding whether or not there is a problem.  Unusual ticks seem to be associated with either a task starting up or in the process of finishing.
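The re-check idea could look something like the sketch below. The thresholds (fewer than 2 ticks is 'low', more than 100 is 'high') and the fixed maximum of three reads are guesses based on the numbers in the log, not Gary's actual values, and check_ticks is a made-up name.

```python
import time
from typing import Callable, List, Tuple


def check_ticks(read_ticks: Callable[[], int],
                low: int = 2, high: int = 100,
                max_reads: int = 3, gap_s: float = 5.0,
                sleep: Callable[[float], None] = time.sleep) -> Tuple[str, List[int]]:
    """Re-read ticks up to max_reads times; any in-range reading means OK."""
    readings: List[int] = []
    for attempt in range(max_reads):
        if attempt:
            sleep(gap_s)  # gap between retries, 5s in the post
        ticks = read_ticks()
        readings.append(ticks)
        if low <= ticks <= high:
            return "OK", readings
    # Every reading was out of range: report by the last one.
    return ("LOW" if readings[-1] < low else "HIGH"), readings


fake = iter([142, 160, 18])  # same pattern as loop 11 in the log
print(check_ticks(lambda: next(fake), sleep=lambda s: None))
```

Passing the reader and the sleep function in as parameters keeps the decision logic testable without touching a real host.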

The 108 hosts being monitored get scanned in about 7min 30sec so I usually use loop times of 15m, 30m, or 60m.  Overnight it's always every hour since I won't be around to do anything about it :-).  There were just 2 real problems over the 4 day period - Item 3 and Items 9, 10, 11 ... which started at 10:30pm last night.  The 3rd (Item 5) turned out to be quite transient - some sort of network glitch that corrected itself.  I saw both Item 3 and Item 5 as they happened.  Items are colour coded on the screen I'm working on and when a new one flashes up, it tends to grab my attention.  I use red for network type problems and magenta for a GPU not clocking up time.  With item 5, by the time I'd hooked up a monitor, etc, I found the event log showing that BOINC couldn't connect to the internet to upload finished work and was in a backoff of several further minutes.  I 'retried' the transfer and it went through so it really was some sort of transient event.

I saw the problem starting at item 9 when I arrived around 6:30 this morning.  This exact thing had happened a couple of times over the last week or two and a bit of research had pulled up things like no space left, etc.  My thought was that there was something wrong with the disk because there was plenty of space and /home was being seen as a read-only filesystem.  I tried a quick reboot without powering off and got "no boot device".  After power cycling it rebooted to the desktop but I'd already decided to change disks.  The CPU was an E3200 Celeron from 2009 and the GPU was an RX 570 that I wanted to get back to work.

The disk was a 20GB IDE dated 2003 and I've got lots of spares of that exact model all tested and ready to go.  I hooked the replacement up to 'master' and put the bad one on 'slave' and booted from a live USB.  I used 'dd' with the conv=noerror option to make sure that dd didn't quit on any read errors.  There was actually one but the cloning proceeded to completion.  After that I ran fsck on both the newly copied root and home partitions and there were 'fixable' errors in the new /home.  After shutting down and removing the old disk, the replacement booted up with no complaints and crunching resumed from where it had left off.  It's been running all day now with no further issues.

In the above screen log you can see that I restarted gpu_chk with the -x8 option to exclude that machine while I was cloning the old disk.  You can see 3 half hour loops before I stopped and restarted it again without the -x8.  That machine was fully tested and back crunching again.  It's now 6:00pm and there are no further entries on the screen so I've had a nice peaceful day doing other things :-).

I put a lot of time and effort into the logic behind that script.  It's very satisfying to see something 'home-made' like this doing its job quite well.

Cheers,
Gary.

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

Thank you so much. My retirement is soon enough to have had all the formal training long before there was anything like PCIe around. And that training was formal, indeed. So, with a daytime focus on data analyses, BOINC and Debian mean a lot to me for some vertical training and as an anchor to the real world ... did I just say real... ? Practical sides? 'Tangible' maybe, like black holes? Hm. It is still quite a bubble we live in.

Concerning your very impressive setup, I am tempted to suggest looking into network booting. I found an overview here: https://wiki.debian.org/PXEBootInstall . You may decide to have different images for different sets of machines (like separating NVidia and AMD machines, but maybe that is not even needed). This should reduce your configuration work quite a bit. Also, these 108 drives produce quite some heat (and noise and monitoring worries) by themselves that you can eliminate. Thereby you may even raise the maximal operating temperature. And it should be fun and save operating costs. Maybe start with a few machines and see how it goes.

The rig works fine. I am truly surprised about how robust it is despite those wobbly USB risers. So, while this is my AMD-only SETI+E@H machine, the other projects I am interested in are GPUGRID (NVidia only) and Rosetta+WCG (no GPUs). NVidia made quite some inroads with the deep learning community, and with a bit of luck we see these very powerful rigs do some work on the side and/or these appear on eBay in the not too distant future. The few evenings I have had this running have already gained 1/3rd of the credits that my desktop at home has accumulated over many years. I am impressed.

Admittedly, I am also looking at what I might want to build myself. The crypto efforts have also motivated the development of non-ATX motherboards with 8 16x PCIe slots (all but the first of which have only 1 PCIe lane, though) (http://www.biostar-usa.com/app/en-us/mb/introduction.php?S_ID=902 and others from Li Bai or the Octominer https://octominer.com/shop/octominer-b8plus-8-pcie-slot-mining-motherboard-intel-3855u-cpu-copy/). These are easier to put all into one box than with the extra level introduced with these risers, I presume. So, I'll wait a bit for what the next months bring that is more promising on the CPU side. I mean - this deep learning thingy not only supermicro has heard of, I am sure, so there might be something coming.

So, TL;DR: GPU mining rigs work with BOINC.

[AF>FAH-Addict.net]toTOW
Joined: 9 Oct 10
Posts: 6
Credit: 10,596,151
RAC: 0

steffen_moeller wrote:
So, TL;DR: GPU mining rigs work with BOINC.

Why wouldn't they work on BOINC or other distributed computing projects ? GPGPU computing has been around way before mining fashion ...

cecht
Joined: 7 Mar 18
Posts: 1,492
Credit: 2,754,797,750
RAC: 2,057,486

Congrats Steffen_Moeller on turning that mining rig into a productive BOINC crunching rig! That was quite an adventure. Between your information and Gary's, this is a Forum discussion that I'm bookmarking for posterity.

Gary Roberts wrote:
That young cecht is a real go-getter in terms of mastering new stuff and finding the hidden gems!  We need to chain him down and make sure he never leaves :-).

LOL Yes, no doubt I'm still in my E@H sophomore year (in the 'wise fool' sense); these discussion postings are guaranteed to keep me chained down while I sort through it all and think of ways to incorporate them into my E@H system.

 

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Aurum
Joined: 12 Jul 17
Posts: 77
Credit: 3,408,757,040
RAC: 0

steffen_moeller wrote:

Works! I happily see five running processes:

12480 boinc     30  10 1589916 155428  90992 R  17,9  0,5   0:10.41 hsgamma_FGRPB1G
12482 boinc     30  10 1589892 154684  90304 S  17,9  0,5   0:10.40 hsgamma_FGRPB1G
12506 boinc     30  10 1590600 221096  90556 S  17,5  0,7   0:10.48 hsgamma_FGRPB1G
12548 boinc     30  10 1589760 220088  90224 R  17,5  0,7   0:10.49 hsgamma_FGRPB1G
12455 boinc     30  10 1589684 154748  90596 R  17,2  0,5   0:10.45 hsgamma_FGRPB1G

Funnily enough I had app_config.xml on my radar only for self-compiled executables. Thank you so much!

I decided to leave it with 1 task per GPU for now since the card is already close to maximum power (this is what lm-sensors states, need to invest a bit more into finding the right monitoring tools for Ubuntu) and will allow the settings to evolve over time.

These are some I use with Linux Mint:

watch -n 1 sensors

sudo inxi -v 3

lspci -vv

steffen_moeller
Joined: 9 Feb 05
Posts: 78
Credit: 1,773,655,132
RAC: 0

[AF>FAH-Addict.net]toTOW wrote:
steffen_moeller wrote:
So, TL;DR: GPU mining rigs work with BOINC.

 Why wouldn't they work on BOINC or other distributed computing projects ? GPGPU computing has been around way before mining fashion ...

Thank you for this question. "What could possibly go wrong?" is exactly what I had in mind.  But I am also much aware of the https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect (please read if not known, yet). So it was like there were two options:

a) it works and I get some tantalizing compute power for the costs of an entry level gaming PC
b) it fails and I have learned something, hopefully with some eventual migration to a).

Having a) a work, it was b) why I went for it. Also, I thought that when the word gets out a bit more that a mining rig can nicely retire with the sciences, then this may help others to decide to buy/sell their mining rig as a whole, instead of buy-/selling each graphics card separately. So, hurdles:

  • imaginary - did not happen
    • hardware suffering from wear and tear
    • failure flashing back the original BIOS
    • CPU too slow
    • Not enough memory - the 4GB the original system shipped with would have worked better than I would have thought. I had anyway upgraded to 32GB, which is far more than needed for E@H.
    • PCIe 1x fails to cope with what the apps demand
    • USB risers have too much latency
    • Cooling - I still don't like to run this unsupervised - need to straighten the cables a bit more
  • experienced - did happen
    • too much communication on PCIe bus - well, somewhat
      • with the 4th card onwards some 2-5% extra total compute, not CPU compute, were added. Tested only E@H
      • cannot run 6 E@H tasks in parallel
      • have not tested behaviour when allowing two tasks per GPU .. should do that
    • mining BIOS was completely unstable with both Windows and Linux
    • failure to install upstream AMDGPU drivers with the very latest Debian  - worked like a charm with Ubuntu LTS
    • was not aware of app_config.xml to extend what boinc-manager/boinccmd can do
      • specify CPU demand for GPU task
      • do not allow more than 5 E@H tasks to run at the same time
    • limitation of mainboard/chipset
      • only the single 16x slot was PCIe 3.0, the 4 1x slots were only 2.0
      • 16x slot was not functional when all other PCIe slots were also having a VGA card and the onboard graphics was activated
    • shipped only with USB stick - that was slow, but I don't really know why
    • Ubuntu and Knoppix get confused over how to start X with too many graphics cards together with the console on the onboard graphics - worked better with onboard graphics off
    • system somewhat unpractical/wobbly to carry around / push out of the way. Still need to find the right place for it.
    • Einstein (ok, and SETI because they gave us BOINC in the first place) is (are) the only project(s) I want to run (and that I am aware of) that support(s) AMD cards.
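On the app_config.xml point in the list above ("specify CPU demand for GPU task"), here is a minimal sketch of the gpu_versions part. The app name is again assumed from the process name seen earlier in the thread, the 0.5 CPU figure is only an illustrative value, and gpu_app_entry is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def gpu_app_entry(app_name: str, gpu_usage: float, cpu_usage: float) -> ET.Element:
    """Build one <app> element declaring GPU and CPU share per task."""
    app = ET.Element("app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)  # 1.0 = one task per GPU
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)  # CPU fraction reserved per task
    return app


root = ET.Element("app_config")
root.append(gpu_app_entry("hsgamma_FGRPB1G", 1.0, 0.5))  # assumed name and values
config_text = ET.tostring(root, encoding="unicode")
print(config_text)
```

Setting gpu_usage to 0.5 instead would ask BOINC to run two tasks per GPU, which is the untested case mentioned in the list.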

At work, we just bought https://www.gigabyte.com/High-Performance-Computing-System/G291-Z20-rev-100#ov .  I could buy two or three of the 6GPU-rigs on eBay for what that empty thingy costs. Still, much of the issues I had would likely be complete non-issues in that system. So, I very much hope that this technology in some form gets soon closer to the consumer/gamer/mining market.
