Seeing Linux GPU temperatures and getting alerts on problems

Joseph Stateson
Joined: 7 May 07
Posts: 174
Credit: 3068344373
RAC: 420674
Topic 220252

I use BoincTasks to monitor my remote systems, but it does not report temperatures from Linux hosts the way it does for Windows. For some time I have been using BT with several "rules" configured to send a text message to my phone if it detects a stuck task or a temperature that is too high. That was not possible on Linux until now.
 
I have a Python script at https://github.com/JStateson/BoincTasks that runs as a service under systemd and reports temperatures to BoincTasks. In addition, if the NVidia driver recommends a reboot to recover a "lost" GPU, the script sends a text message alerting me and turns off GPU usage in BOINC. Anyone is welcome to use this tool, and suggestions for improvement are welcome. You may already be using an excellent temperature checking and reporting program. This script lets the temps show up on the BoincTasks display, which is convenient for me.
 
It has been tested only under Ubuntu 18.04 and requires lm-sensors, plus nvidia-smi if you are running NVidia boards. It only works with BoincTasks.
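For anyone curious how GPU temperatures can be read from nvidia-smi in Python, here is a minimal sketch. It is not taken from the script in the repository; the function names parse_gpu_temps and read_gpu_temps are made up for illustration, and it assumes nvidia-smi is on the PATH and supports the --query-gpu option.

```python
import shutil
import subprocess

# Ask nvidia-smi for one "index, temperature" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"]

def parse_gpu_temps(text):
    """Turn lines such as '0, 59' into {0: 59}; skip lines that do not parse."""
    temps = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        try:
            temps[int(parts[0])] = int(parts[1])
        except ValueError:
            continue  # e.g. a GPU that reports no numeric temperature
    return temps

def read_gpu_temps():
    """Return {gpu_index: temp_C}, or {} if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return {}
    result = subprocess.run(QUERY, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True, timeout=10)
    if result.returncode != 0:
        return {}
    return parse_gpu_temps(result.stdout)
```

Calling read_gpu_temps() on a machine without the NVidia driver simply returns an empty dict, so the same code can run on CPU-only hosts.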

Sid
Joined: 17 Oct 10
Posts: 164
Credit: 970313353
RAC: 421403

Thank you for sharing the tool.

I tried to use it, but BoincTasks does not show a temperature.

+ sleep 10
+ '[' 1 -eq 1 ']'
++ ./FetchTemps.py 0 0
./run_cycle.sh: line 31: warning: command substitution: ignored null byte in input
+ rtncod='Connection Established: ('\''192.168.0.100'\'', 56075)
from boinctasks:  of length 5
sent:
from boinctasks:  of length 0'
+ '[' 0 -eq 1 ']'

As far as I can see, BoincTasks connected to the Linux computer successfully and the tool sent the temperature.

However, no value was displayed.

Perhaps some more settings in BoincTasks need to be adjusted?

BoincTasks shows the correct temperature for Windows computers.

Joseph Stateson

I could not see what systems you are running it on. I have three 18.04 systems, each with python3.

Make sure you have Python 3. On Ubuntu the firewall is disabled by default. If you have turned it on, allow access to port 31417. My version of lm-sensors is 3.4.0.

Did you run the files from the Testing directory? The following is one complete temperature reading: 8 NVidia, no ATI, four CPU temperatures, the largest NVidia temperature (67.0) followed by the individual NVidia temps.

I am using BoincTasks 1.80. What version are you using? I am spoofing TThrottle version 7.72 back to BoincTasks 1.80, which should show up in the "computers" tab.

jstateson@h110btc:~/Projects/BoincTasks/Testing$ bash -x ./run_cycle.sh

+ echo Press CTRL-C to exit. may need to do it twice
Press CTRL-C to exit. may need to do it twice
+ nAnyNVidia=0
+ '[' '!' -e /bin/nvidia-smi ']'
++ nvidia-smi -L
++ grep -c GPU
+ nAnyNVidia=8
++ sensors
++ grep -c amd
+ nAnyATI=0
+ '[' 0 -gt 0 ']'
+ '[' : ']'
+ ./FetchTemps.py 8 0
Connection Established: ('192.168.1.241', 65130)
Expecting NV: 8
Expecting ATI: 0
max CPU temp  56.0
CPU temps  <CT0 56><CT1 55><CT2 55><CT3 51>
max NVidia temp  <TG 67.0>
NV temps  <GT0 59><GT1 53><GT2 53><GT3 53><GT4 36><GT5 67><GT6 52><GT7 49>
from boinctasks: <BT> of length 5
sent:  <TThrottle><HN:h110btc><PV 7.72><AC 0><TC 56.0><TG 67.0><NV 8><NA 0><DC 100><DG 100><CT0 56><CT1 55><CT2 55><CT3 51><GT0 59><GT1 53><GT2 53><GT3 53><GT4 36><GT5 67><GT6 52><GT7 49><RSFtOMXmk><AA0><SC77><SG80><XC100><MC2><TThrottle>
from boinctasks:  of length 0
+ sleep 10

If you are running the NonService version, the output should look like the following for 3 cycles:

jstateson@h110btc:~/Projects/BoincTasks/NonService$ bash -x ./run_cycle.sh
+ echo Press CTRL-C to exit. may need to do it twice
Press CTRL-C to exit. may need to do it twice
+ nAnyNVidia=0
+ '[' '!' -e /bin/nvidia-smi ']'
++ grep -c GPU
++ nvidia-smi -L
+ nAnyNVidia=8
++ sensors
++ grep -c amd
+ nAnyATI=0
+ '[' 0 -gt 0 ']'
+ bRun=1
+ '[' 1 -eq 1 ']'
++ ./FetchTemps.py 8 0
+ rtncod='Connection Established: ('\''192.168.1.241'\'', 49408)'
+ '[' 0 -eq 1 ']'
+ sleep 10
+ '[' 1 -eq 1 ']'
++ ./FetchTemps.py 8 0
+ rtncod='Connection Established: ('\''192.168.1.241'\'', 49432)'
+ '[' 0 -eq 1 ']'
+ sleep 10
+ '[' 1 -eq 1 ']'
++ ./FetchTemps.py 8 0
+ rtncod='Connection Established: ('\''192.168.1.241'\'', 49438)'
+ '[' 0 -eq 1 ']'
+ sleep 10
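To make the "sent:" line in the Testing trace easier to read, here is a rough Python sketch that splits such a string into fields. It is only an illustration and not part of the repository; parse_reading and temps are hypothetical names. Going by the trace, TC is the max CPU temp, TG the max NVidia temp, CTn the CPU temps and GTn the NVidia temps. Tags without a space or colon separator (RS..., AA0, SC77 and so on) are skipped because their meaning is not stated in this thread.

```python
import re

# One complete reading, copied from the trace above.
SAMPLE = ("<TThrottle><HN:h110btc><PV 7.72><AC 0><TC 56.0><TG 67.0><NV 8>"
          "<NA 0><DC 100><DG 100><CT0 56><CT1 55><CT2 55><CT3 51><GT0 59>"
          "<GT1 53><GT2 53><GT3 53><GT4 36><GT5 67><GT6 52><GT7 49>"
          "<RSFtOMXmk><AA0><SC77><SG80><XC100><MC2><TThrottle>")

TAG = re.compile(r"<([^<>]+)>")

def parse_reading(message):
    """Return {key: value} for tags written as <KEY value> or <KEY:value>."""
    fields = {}
    for body in TAG.findall(message):
        m = re.match(r"^([A-Za-z]+\d*)[ :](.+)$", body)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields

def temps(fields, prefix):
    """Collect numbered temps (prefix 'CT' or 'GT') as floats in index order."""
    found = []
    for key, value in fields.items():
        m = re.match(r"^%s(\d+)$" % re.escape(prefix), key)
        if m:
            found.append((int(m.group(1)), float(value)))
    return [v for _, v in sorted(found)]
```

With the sample above, temps(parse_reading(SAMPLE), "GT") gives the eight NVidia temps in order, and their maximum matches the TG field.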

 

Sid

I'm using Linux Mint.

lm_sensors is 3.4.0-4

BoincTasks is 1.80

python 3

I'm running run_cycle.sh and can clearly see that the connection is established and the script sends the data:

+ rtncod='Connection Established: ('\''192.168.0.100'\'', 60673)
from boinctasks: <BT> of length 5
sent: <TThrottle><HN:sandy><PV 7.72><AC 0><TC 35.0><NV 0><NA 0><DC 100><DG 100><CT0 32><CT1 35><RSdwCAmjP><AA0><SC77><SG80><XC100><MC2><TThrottle>
from boinctasks: of length 0'

So this is not a firewall issue.

The NVidia driver is not installed, so that is probably the issue.

 

Joseph Stateson

"send" just means I tried to "send" it. There is no telling whether it got there or not. From Googling, I see that Mint does not have its firewall enabled by default. NVidia does not matter. Try the following and look for results like those in the picture.

 

On your Windows system, in a command shell, run

telnet sandy 31417

It should give an error message: no connection

Then on your Mint system run

./FetchTemps.py 0 0

Nothing should happen. Then go back to your Windows system and try telnet again. When it makes a connection, press the Return key. You should see the temps "sent" and the Python program should exit. If you don't see that, check the firewall. If Windows shows the temps, check BoincTasks and make sure that sandy is listed.
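If telnet is not installed on the Windows side, the same check can be done with a few lines of Python. This is a sketch only: probe is a made-up name, and it assumes the Linux script answers a bare Return the way it does in the telnet test above.

```python
import socket

def probe(host, port=31417, timeout=5.0):
    """Connect, send one line, and return whatever the other side sends back."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"\r\n")      # same effect as pressing Return in telnet
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:           # the other side closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")
```

Running print(probe("sandy")) while FetchTemps.py is waiting should print the temperature string. A ConnectionRefusedError means nothing is listening on the port, and a timeout points more toward a firewall.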

I don't have Mint. Is that the same as Ubuntu 19? Maybe I should upgrade.

[Picture: telnet over port 31417]

Sid

Very interesting.

The telnet window shows:
<TThrottle><HN:sandy><PV 7.72><AC 0><TC 28.0><NV 0><NA 0><DC 100><DG 100><CT0 26><CT1 28><RSAdEmyVS><AA0><SC77><SG80><XC100><MC2><TThrottle>

The BoincTasks log shows:

26 December 2019 - 21:24:20 TThrottle CreateClientSocket ---- Created client socket with handle = 5656
26 December 2019 - 21:24:21 TThrottle CreateClientSocket ---- CreateClientSocket: Connected successfully
26 December 2019 - 21:24:21 TThrottle DoSendOnce ---- Sent 5 bytes so far

26 December 2019 - 21:24:21 TThrottle DoSendUntilDone ---- Send completed
26 December 2019 - 21:24:21 TThrottle DoRecvOnce ---- Recd 140 bytes so far

26 December 2019 - 21:24:21 TThrottle DoRecvOnce ---- Recd 140 bytes so far

26 December 2019 - 21:24:21 TThrottle DoRecvUntilDone ---- Recv returned 0. Remote socket must have been gracefully closed.
26 December 2019 - 21:24:21 SendReceive ---- Received from Server: <TThrottle><HN:sandy><PV 7.72><AC 0><TC 30.0><NV 0><NA 0><DC 100><DG 100><CT0 28><CT1 30><RS9a3mcJj><AA0
26 December 2019 - 21:24:21 Disconnect ---- Closed socket 5656. Total Bytes Recd = 140, Total Bytes Sent = 5
26 December 2019 - 21:24:21 Receive TThrottle: 192.168.0.111, 31417 ---- <TThrottle><HN:sandy><PV 7.72><AC 0><TC 30.0><NV 0><NA 0><DC 100><DG 100><CT0 28><CT1 30><RS9a3mcJj><AA0><SC77><SG80><XC100><MC2><TThrottle>
26 December 2019 - 21:24:23 TThrottle CreateClientSocket ---- Processing Address 1 returned by getaddrinfo :
26 December 2019 - 21:24:23 TThrottle CreateClientSocket ---- Created client socket with handle = 5656
26 December 2019 - 21:24:24 TThrottle CreateClientSocket ---- CreateClientSocket: connect failed.
Error = No connection could be made because the target machine actively refused it
26 December 2019 - 21:24:24 Connect ---- Invalid Socket
26 December 2019 - 21:24:26 TThrottle CreateClientSocket ---- Processing Address 1 returned by getaddrinfo :
26 December 2019 - 21:24:26 TThrottle CreateClientSocket ---- Created client socket with handle = 5656
26 December 2019 - 21:24:27 TThrottle CreateClientSocket ---- CreateClientSocket: connect failed.
Error = No connection could be made because the target machine actively refused it

I guess the connection should be permanent. The Python script establishes the connection for one exchange only.

Joseph Stateson

That is how it is supposed to work.

BoincTasks periodically tries to make a connection. If it connects, it sends a 5-byte packet, looks at what it gets back, and displays the temps. The run_cycle script runs the Python script in a loop every 10 seconds. The Python script listens on port 31417 but sends the reply back on whatever port BoincTasks uses to receive it. Both the service and the non-service versions should work. When run manually, the Python script runs once and quits. If BoincTasks sees that the project and tasks are running, then within 20 seconds the temps will show up in the temp column (if you are using the run script or the service).
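The exchange described above can be sketched in Python roughly as follows. This is a simplified illustration, not the code in the repository: open_listener and answer_one are hypothetical names, and the reading is passed in as a ready-made string (for example one of the "sent:" lines earlier in this thread).

```python
import socket

def open_listener(port=31417, host="0.0.0.0"):
    """Open a TCP socket that waits for BoincTasks to connect."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv

def answer_one(srv, reading):
    """Accept one connection, read the short request, reply, then hang up."""
    conn, peer = srv.accept()      # peer is (client_ip, client_port)
    with conn:
        request = conn.recv(64)    # the small packet sent by the client
        conn.sendall(reading.encode("ascii"))
    return peer, request
```

The reply goes back over the accepted connection, so it arrives on the client's own port (the second number in the "Connection Established" lines), not on 31417. Once the listener is closed after a single exchange, the next poll is refused until the script is started again, which would explain the "connect failed" entries between successful reads in the log.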

 [edit]

The temperature from the Intel video chip was being reported as an ATI temperature. This was fixed by not reporting any Intel PCI temps.
