Hello,
I have access to 1000 (yes, one thousand) Nvidia Tesla T4 GPUs. That is 8141 TFLOPS of FP32 computing power, which at the time of this writing equals 66.5% of the total computing capacity of this project (currently 12236 TFLOPS), and I would like to deploy it exclusively to Einstein@Home. However, I'm facing a problem that prevents me from doing so. I hope there is somebody here knowledgeable enough to solve it.
It's essential to know that the GPUs in question are located in a data center. Each GPU is provisioned in software as a virtual machine, and it is paired with a single physical CPU core (two threads), along with the requisite system RAM, disk space and networking. I'm running BOINC from the command line on Ubuntu 20.04.5 LTS.
Here's the problem. When I start multiple virtual machines, they all get assigned the same tasks, which leads to conflicts. So my question is: how can I tell BOINC to treat these virtual machines as separate computers?
In conclusion I would like to ask everyone who does NOT have anything to contribute in the form of a solution to my problem, to pretty please with a cherry on top, abstain from entering this discussion.
Copyright © 2024 Einstein@Home. All rights reserved.
Do you know the exact hardware configuration of each server, such as how many GPUs are in each box? What are the limits of your access to them? Do you have access to reconfigure the virtual machines to include more GPUs or CPU cores for each VM? This information could be useful in forming a solution.
I assume you’re using some kind of script or deployable VM image to set up BOINC on each VM? If you’re using the same VM image on all of them, that could be why some or many of them have conflicts; I bet you have a ton of duplicate host IDs. Each host needs a unique host ID to be treated as a separate system.
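One way to check for duplicate host IDs is to collect the client_state.xml text from each VM and group the VMs by the ID found in it. The following is only a minimal Python sketch: the `<hostid>` tag name is an assumption about the file layout, the VM labels and the `find_duplicate_hostids` helper are hypothetical, and only the first `<hostid>` in each file is read (enough if the client is attached to a single project).

```python
import re
from collections import defaultdict

# Assumption: client_state.xml stores the project-assigned ID in a <hostid> tag.
HOSTID_RE = re.compile(r"<hostid>\s*(\d+)\s*</hostid>")


def find_duplicate_hostids(states):
    """Return {host_id: [vm names]} for every host ID seen on more than one VM.

    `states` maps a VM label (hypothetical names) to the text of that VM's
    client_state.xml. Only the first <hostid> in each text is considered.
    """
    owners = defaultdict(list)
    for vm, text in states.items():
        match = HOSTID_RE.search(text)
        if match:
            owners[int(match.group(1))].append(vm)
    return {hid: sorted(vms) for hid, vms in owners.items() if len(vms) > 1}


# Made-up sample data, for illustration only.
sample = {
    "vm-00": "<project><hostid>111</hostid></project>",
    "vm-01": "<project><hostid>111</hostid></project>",
    "vm-02": "<project><hostid>222</hostid></project>",
}
print(find_duplicate_hostids(sample))  # {111: ['vm-00', 'vm-01']}
```

An empty result would suggest the conflicts come from somewhere other than shared host IDs.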
_________________________________________________________________________
Host IDs that are unique and carry over for each instance would be a fix.
You can cause that effect by giving each instance a unique operating system name.
Another possible workaround is using an empty BOINC cache and a "0" (zero) resource setting in the profile for all instances. As long as no instance has a copy of any of the data files when it starts up, it should be possible to avoid confusing the server into sending more than one copy to the same host ID.
HTH,
Tom M
Ps. Gary the moderator or anyone running Pi clusters might have an idea.
A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor)
In my previous experience, the BOINC server will not accept duplicate result uploads.
So if you spin up two identical instances and let them run, does the server complain about duplicate uploads?
It will complain.
Tom M
The hardware configuration is as follows:
2 x Intel Xeon E5-2650 v3 @ 2.30GHz
10 x PCI-E 4.0 x16 slots for GPGPU populated with 10 Nvidia Tesla T4
Dual 10GbE RJ45 LAN and Dedicated IPMI
3+1 1600W 80+ Platinum CRPS
System RAM 512GB DDR4 ECC RDIMM
8TB NVMe M.2 SSD
4TB NVMe M.2 SSD
4U Rackmount
The entire cluster of 1000 Nvidia Tesla T4 GPUs consists of 100 identical servers with 10 GPUs in each.
I have full physical access to the entire data center where the servers are located, as well as full software management rights.
I can reconfigure the VMs however I want.
I am using a single VM image based on Ubuntu 20.04.5 LTS for BOINC deployment across the entire server cluster. I am limited to the command line only (no GUI), so I use boinccmd.
I don't believe that using the same image is the reason why all VM instances are being seen as one by BOINC. I've tried a deployment without the VM image and I still get the same task conflict.
I need to tell BOINC to somehow generate a new host ID upon VM initialization and, most importantly, keep it after the VM disconnects.
According to https://boinc.berkeley.edu/wiki/Cross-project_identification, host identification is based on a domain name, IP address, free disk space, and a timestamp.
I've done some experimentation and here is what I've learned:
- When a single VM is connected to the project for the first time ever, a new host ID is generated, as it would be on a physical computer.
- When a single VM that has previously been associated with a host ID is deployed on a different physical server with the same hardware as described above but different identifiers (IP address, MAC address, Nvidia GPU serial number), the host ID is kept unchanged (the new VM instance assumes the host ID of the previous one) and any tasks left in the queue are reclaimed and executed.
- When multiple VM instances run concurrently, whether on the same physical server or each on a separate server (it makes no difference), all of them begin executing the same tasks left in the queue from the previous run of the last VM to make contact with the project. Only one of those instances (the first to complete a task and report it to the project) continues its normal execution flow. All other VMs get stuck running the same tasks. Sometimes one or more of these VMs gets a new cross-project ID after reporting a task already completed by the first VM, in which case there is no more conflict between tasks. Most often, though, a new cross-project ID is never generated, and all the "unlucky" VMs compute the same tasks, which the project's server then refuses.
I might suggest the solution is to use the Lunatics AIO BOINC installation and install a single copy to each of the 100 VM hosts. That way you end up with BOINC folders numbered from 00-99, for example.
Each will have a distinct internal CPID, but you can always edit the client_state.xml file to converge them all to the same external CPID, which would be desirable if you want to get to the top of the leaderboards.
You're going to have to manually edit the host ID of each host to force them all to be different. Once you change it in the client_state file, it won't change back. You can probably write some kind of script to do that across all hosts.
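A script along those lines might look like the following Python sketch. It only rewrites text: the `<hostid>` tag name is an assumption about the client_state.xml layout, `set_hostid` is a hypothetical helper, and choosing an ID the project will accept is left to the caller. Every `<hostid>` tag in the text is rewritten, and the edit should presumably be made while the BOINC client is stopped so that the file is not overwritten.

```python
import re

# Assumption: the host ID is stored as <hostid>NNN</hostid> in client_state.xml.
HOSTID_RE = re.compile(r"(<hostid>)\s*\d+\s*(</hostid>)")


def set_hostid(state_text, new_id):
    """Return state_text with every <hostid> value replaced by new_id.

    Raises ValueError if no <hostid> tag is present, so that a file with an
    unexpected layout is not silently left untouched.
    """
    new_text, count = HOSTID_RE.subn(
        lambda m: f"{m.group(1)}{int(new_id)}{m.group(2)}", state_text
    )
    if count == 0:
        raise ValueError("no <hostid> tag found")
    return new_text


# Made-up fragment and ID, for illustration only.
before = "<project>\n<hostid>111</hostid>\n</project>"
print(set_hostid(before, 424242))
```

To apply it across hosts, read each VM's client_state.xml, pass the text through `set_hostid` with that VM's intended ID, and write the result back.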
_________________________________________________________________________
Can he just designate a unique IP address for each VM? They would then be like me adding another PC to the bunch, all going through the same router and getting their IP addresses from it.
The problem is that he’s using the same VM image for all of them; they aren’t unique OS or BOINC installs. So the projects think it’s all the same host, and they auto-merge the host ID.
_________________________________________________________________________
Depends on how you set up the router, for NAT or not. I have the same external IP address for all my hosts.
That doesn't prevent them from getting distinct hostIDs.
You can generate a new hostID for any host by simply manipulating the <rpc_seqno> number in the client_state.xml file for each host. Change the number to a value less than the current value and the client-scheduler connection will force a new hostID for that host.
The <rpc_seqno> mechanism is also how you can make a host keep the same hostID after a hardware or OS change, by editing the hostID in the client_state file.
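The <rpc_seqno> edit can be sketched in Python as below. This is an illustration under assumptions, not a tested procedure: `lower_rpc_seqno` and its `step` parameter are hypothetical, the function only rewrites text, it lowers every <rpc_seqno> tag it finds, and it assumes the value must stay non-negative.

```python
import re

SEQNO_RE = re.compile(r"(<rpc_seqno>)\s*(\d+)\s*(</rpc_seqno>)")


def lower_rpc_seqno(state_text, step=1):
    """Return state_text with each <rpc_seqno> value reduced by `step`.

    Raises ValueError if there is no <rpc_seqno> tag, or if a value cannot be
    lowered without going negative (assumed here to be invalid).
    """
    def repl(match):
        lowered = int(match.group(2)) - step
        if lowered < 0:
            raise ValueError("rpc_seqno is too small to lower")
        return f"{match.group(1)}{lowered}{match.group(3)}"

    new_text, count = SEQNO_RE.subn(repl, state_text)
    if count == 0:
        raise ValueError("no <rpc_seqno> tag found")
    return new_text


# Made-up fragment, for illustration only.
print(lower_rpc_seqno("<rpc_seqno>57</rpc_seqno>"))  # <rpc_seqno>56</rpc_seqno>
```

As with any edit to client_state.xml, this would presumably be done with the client stopped, once per VM, before its next scheduler contact.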
For your kind of data center, you might try to find something like:
Azure
VMware
Google Cloud