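Use a Python crawler to fetch the data returned by the Yahei PHP probe (雅黑PHP探针) in order to monitor a server, pulling the remote machine's load, CPU, memory, NIC traffic, real-time network speed and similar information as it changes. Keywords: PHP probe, Python, crawler, server monitoring.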
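0. Environment
This experiment was done on my Raspberry Pi 3B with Python 3.
My approach is admittedly a roundabout one: going through a probe just to grab server information.
But for a web server that already runs PHP, hosting one more probe costs next to nothing.
One day, while lazing in bed, it occurred to me that people keep impulse-buying VPSes and "making friends through probes" (以针会友). If I had the Raspberry Pi fetch the server status returned by the probe and show it on a 1602 LCD, I could lie on the sofa day and night watching the numbers tick. Wouldn't that be lovely? This post is my notes on fetching the data.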
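1. The Yahei PHP probe
1.1 About PHP probes
A few words on PHP probes for readers who have not met them.
[Yahei PHP probe]
The biggest advantage of the Yahei PHP probe: it updates every second without reloading the page. It has a responsible maintainer who provides long-term support and updates.
It is meant for Linux systems (not recommended on Windows).
It shows the server's disk usage, memory usage, NIC traffic, system load, server time and more in real time, refreshing once per second.
It also reports the server's IP address, the web server environment, PHP information and so on.
Anyone who often buys VPSes to tinker with will know PHP probes. Put simply, a probe is a PHP program that collects system information and displays it on a web page. The Yahei PHP probe's interface looks like this:
A demo probe running on one of my DigitalOcean servers: http://sfo01.misaka.cc:888/tz.php
As a result, people often buy all kinds of cheap, low-memory VPSes that can do little more than host a probe, get a great kick out of it anyway, and swap notes on BBS forums. This is what "making friends through probes" means.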
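1.2 Analysis
Opening the probe page shows server information that keeps refreshing. The idea is simple: crawl this page with a small Python crawler.
First, open the probe page and take a look.
Demo: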
http://sfo01.misaka.cc:888/tz.php
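The page shows tables of live server data that refresh dynamically. Fetching the page's HTML once therefore will not give a continuous stream of server information. Since the page refreshes itself, data must be travelling between server and client. In Chrome, press F12 to open the developer tools and switch to the Network tab.
The URL of the periodic refresh request shows up immediately. It is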
http://sfo01.misaka.cc:888/tz.php?act=rt&callback=jQuery1705809678890101435_1487402170358&_=1487402269387
Visiting that URL directly returns the following data.
jQuery1705809678890101435_1487402170358({"useSpace":"3.985","freeSpace":"15.577","hdPercent":"20.37","barhdPercent":"20.37%","TotalMemory":"490.23 M","UsedMemory":"414.12 M","FreeMemory":"76.11 M","CachedMemory":"84.05 M","Buffers":"104.9 M","TotalSwap":"0 M","swapUsed":"0 M","swapFree":"0 M","loadAvg":"0.00 0.00 0.00 1\/117","uptime":"3\u59293\u5c0f\u65f626\u5206\u949f","freetime":"","bjtime":"","stime":"2017-02-18 15:17:56","memRealPercent":"45.93","memRealUsed":"225.17 M","memRealFree":"265.06 M","memPercent":"84.47%","memCachedPercent":"17.15","barmemCachedPercent":"17.15%","swapPercent":"0","barmemRealPercent":"45.93%","barswapPercent":"0%","NetOut2":"44 K 505 B ","NetOut3":"2 G 825 M 602 K 970 B ","NetOut4":"","NetOut5":"","NetOut6":"","NetOut7":"","NetOut8":"","NetOut9":"","NetOut10":"","NetInput2":"44 K 505 B ","NetInput3":"3 G 145 M 314 K 713 B ","NetInput4":"","NetInput5":"","NetInput6":"","NetInput7":"","NetInput8":"","NetInput9":"","NetInput10":"","NetOutSpeed2":"45561","NetOutSpeed3":"3013176266","NetOutSpeed4":"0","NetOutSpeed5":"","NetInputSpeed2":"45561","NetInputSpeed3":"3373591241","NetInputSpeed4":"0","NetInputSpeed5":""})
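This confirms that the response is the data carrying the server's real-time information.
Notice how the data after the callback name follows a very regular notation? Indeed: what sits between the parentheses is JSON.
JSON is a lightweight data-interchange format that is often used in place of XML. Compared with XML it is more compact yet hardly less expressive, and the smaller payload means less traffic and faster transfers.
Think of it as data with its own standard that virtually every language can read and write, used for exchanging information between programs.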
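As a small illustration of what "every language can read it" means in Python, the sketch below decodes a fragment of the response shown above with the standard json module. The fragment is cut down from the sample by hand; the variable names raw and info are only illustrative.

```python
import json

# A hand-trimmed fragment of the probe's sample response (raw string, so the
# JSON escapes \u5929 and \/ reach the parser unchanged).
raw = r'{"loadAvg":"0.00 0.00 0.00 1\/117","uptime":"3\u59293\u5c0f\u65f626\u5206\u949f","hdPercent":"20.37"}'

info = json.loads(raw)            # JSON object -> Python dict
print(info["uptime"])             # \uXXXX escapes are decoded to Chinese text
print(info["loadAvg"])            # the escaped slash becomes a plain "/"
print(type(info["hdPercent"]))    # the probe sends numbers as strings
```

Note that the probe quotes every value, so even numeric fields arrive as strings and need an explicit conversion before doing arithmetic on them.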
The URL carries the parameter act=rt, so search (Ctrl+F) for it in the Yahei PHP probe's source code.
It turns up right away, at line 964 of tz.php:
// AJAX call for real-time refresh
if ($_GET['act'] == "rt")
{
$arr=array('useSpace'=>"$du",'freeSpace'=>"$df",'hdPercent'=>"$hdPercent",'barhdPercent'=>"$hdPercent%",
'TotalMemory'=>"$mt",'UsedMemory'=>"$mu",'FreeMemory'=>"$mf",'CachedMemory'=>"$mc",'Buffers'=>"$mb",
'TotalSwap'=>"$st",'swapUsed'=>"$su",'swapFree'=>"$sf",
'loadAvg'=>"$load",'uptime'=>"$uptime",'freetime'=>"$freetime",'bjtime'=>"$bjtime",'stime'=>"$stime",
'memRealPercent'=>"$memRealPercent",'memRealUsed'=>"$memRealUsed",'memRealFree'=>"$memRealFree",'memPercent'=>"$memPercent%",'memCachedPercent'=>"$memCachedPercent",'barmemCachedPercent'=>"$memCachedPercent%",
'swapPercent'=>"$swapPercent",'barmemRealPercent'=>"$memRealPercent%",'barswapPercent'=>"$swapPercent%",
'NetOut2'=>"$NetOut[2]",'NetOut3'=>"$NetOut[3]",'NetOut4'=>"$NetOut[4]",'NetOut5'=>"$NetOut[5]",'NetOut6'=>"$NetOut[6]",'NetOut7'=>"$NetOut[7]",'NetOut8'=>"$NetOut[8]",'NetOut9'=>"$NetOut[9]",'NetOut10'=>"$NetOut[10]",
'NetInput2'=>"$NetInput[2]",'NetInput3'=>"$NetInput[3]",'NetInput4'=>"$NetInput[4]",'NetInput5'=>"$NetInput[5]",'NetInput6'=>"$NetInput[6]",'NetInput7'=>"$NetInput[7]",'NetInput8'=>"$NetInput[8]",'NetInput9'=>"$NetInput[9]",'NetInput10'=>"$NetInput[10]",
'NetOutSpeed2'=>"$NetOutSpeed[2]",'NetOutSpeed3'=>"$NetOutSpeed[3]",'NetOutSpeed4'=>"$NetOutSpeed[4]",'NetOutSpeed5'=>"$NetOutSpeed[5]",
'NetInputSpeed2'=>"$NetInputSpeed[2]",'NetInputSpeed3'=>"$NetInputSpeed[3]",'NetInputSpeed4'=>"$NetInputSpeed[4]",'NetInputSpeed5'=>"$NetInputSpeed[5]");
$jarr=json_encode($arr);
$_GET['callback'] = htmlspecialchars($_GET['callback']);
echo $_GET['callback'],'(',$jarr,')';
exit;
}
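Even without knowing PHP you can see the rule: the script echoes the callback parameter, followed by the JSON-encoded array wrapped in parentheses. In our URL the callback parameter is "jQuery1705809678890101435_1487402170358"; the trailing _=1487402269387 is a separate parameter, which jQuery adds to defeat caching. Try requesting http://sfo01.misaka.cc:888/tz.php?act=rt directly, with no callback at all.
The result is: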
({"useSpace":"3.986","freeSpace":"15.576","hdPercent":"20.38","barhdPercent":"20.38%","TotalMemory":"490.23 M","UsedMemory":"414.94 M","FreeMemory":"75.29 M","CachedMemory":"84.82 M","Buffers":"105.35 M","TotalSwap":"0 M","swapUsed":"0 M","swapFree":"0 M","loadAvg":"0.05 0.01 0.00 1\/117","uptime":"3\u59293\u5c0f\u65f644\u5206\u949f","freetime":"","bjtime":"","stime":"2017-02-18 15:35:36","memRealPercent":"45.85","memRealUsed":"224.77 M","memRealFree":"265.46 M","memPercent":"84.64%","memCachedPercent":"17.3","barmemCachedPercent":"17.3%","swapPercent":"0","barmemRealPercent":"45.85%","barswapPercent":"0%","NetOut2":"44 K 505 B ","NetOut3":"2 G 826 M 560 K 68 B ","NetOut4":"","NetOut5":"","NetOut6":"","NetOut7":"","NetOut8":"","NetOut9":"","NetOut10":"","NetInput2":"44 K 505 B ","NetInput3":"3 G 146 M 334 K 784 B ","NetInput4":"","NetInput5":"","NetInput6":"","NetInput7":"","NetInput8":"","NetInput9":"","NetInput10":"","NetOutSpeed2":"45561","NetOutSpeed3":"3014180932","NetOutSpeed4":"0","NetOutSpeed5":"","NetInputSpeed2":"45561","NetInputSpeed3":"3374660368","NetInputSpeed4":"0","NetInputSpeed5":""})
With that, the plan for the crawler is clear.
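The two responses above differ only in the callback name in front of the parentheses, which is the JSONP convention. A minimal sketch of unwrapping either form is shown below; parse_jsonp and the regular expression are my own illustration, not part of the probe or of any library.

```python
import json
import re

# Optional callback name, then a parenthesised payload, then an optional ";".
_JSONP = re.compile(r'^\s*[\w$.]*\s*\((.*)\)\s*;?\s*$', re.DOTALL)

def parse_jsonp(text):
    """Return the JSON payload of a JSONP response as a Python object."""
    match = _JSONP.match(text)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))

# Works with a callback name and with the empty callback alike.
print(parse_jsonp('jQuery1705809678890101435_1487402170358({"hdPercent":"20.37"})'))
print(parse_jsonp('({"hdPercent":"20.38"})'))
```

Requesting the URL without a callback, as done here, keeps the wrapper down to a bare pair of parentheses, so simple string stripping is enough for the rest of this post.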
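2. A simple Python crawler
For a quick introduction to Python crawlers I referred to a short tutorial.
The article is concise and to the point. It is not long, and after skimming it I had a grasp of the basic ideas behind crawlers.
Getting the JSON data with the server information is fairly easy. First, verify it in the interactive shell: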
pi@raspberrypi:~ $ python3
Python 3.4.2 (default, Oct 19 2014, 13:31:11)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import request
>>> f = request.urlopen("http://138.197.193.89:888/tz.php?act=rt")
>>> print(f)
<http.client.HTTPResponse object at 0x7656ef70>
>>> data = f.read()
>>> print(data.decode('utf-8'))
Here I use my demo site's IP address directly to avoid waiting for DNS resolution. After decoding as UTF-8, the result is:
({"useSpace":"3.983","freeSpace":"15.579","hdPercent":"20.36","barhdPercent":"20.36%","T 0.00 0.00 1\/119","uptime":"3\u59290\u5c0f\u65f616\u5206\u949f","freetime":"","bjtime":.49%","swapPercent":"0","barmemRealPercent":"51.14%","barswapPercent":"0%","NetOut2":"4437 M 409 K 258 B ","NetInput4":"","NetInput5":"","NetInput6":"","NetInput7":"","NetInputetInputSpeed4":"0","NetInputSpeed5":""})
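The printed string is not standard JSON: it is wrapped in an extra pair of parentheses. Python's string handling makes them easy to remove, but at this point data is not a str, so trying to strip the parentheses from it directly raises an error.
Here data is of type bytes. Decode it to str with decode('utf-8') first (the str() call wrapped around it is redundant but harmless):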
>>> type(data)
<class 'bytes'>
>>> data2 = str(data.decode('utf-8')).strip('(').strip(')')
>>> print(data2)
{"useSpace":"3.983","freeSpace":"15.579","hdPercent":"20.36","barhdPercent":"20.36%","TotalMemory":"490.23 M","UsedMemory":"427.5 M","FreeMemory":"62.73 M","CachedMemory":"76.5 M","Buffers":"98.95 M","TotalSwap":"0 M","swapUsed":"0 M","swapFree":"0 M","loadAvg":"0.00 0.00 0.00 1\/121","uptime":"3\u59290\u5c0f\u65f624\u5206\u949f","freetime":"","bjtime":"","stime":"2017-02-18 12:15:45","memRealPercent":"51.41","memRealUsed":"252.05 M","memRealFree":"238.18 M","memPercent":"87.2%","memCachedPercent":"15.6","barmemCachedPercent":"15.6%","swapPercent":"0","barmemRealPercent":"51.41%","barswapPercent":"0%","NetOut2":"44 K 505 B ","NetOut3":"2 G 817 M 1005 K 862 B ","NetOut4":"","NetOut5":"","NetOut6":"","NetOut7":"","NetOut8":"","NetOut9":"","NetOut10":"","NetInput2":"44 K 505 B ","NetInput3":"3 G 137 M 942 K 336 B ","NetInput4":"","NetInput5":"","NetInput6":"","NetInput7":"","NetInput8":"","NetInput9":"","NetInput10":"","NetOutSpeed2":"45561","NetOutSpeed3":"3005200222","NetOutSpeed4":"0","NetOutSpeed5":"","NetInputSpeed2":"45561","NetInputSpeed3":"3365845328","NetInputSpeed4":"0","NetInputSpeed5":""}
Now data2 can be loaded straight into a dictionary with the json module:
>>> import json
>>> json.loads(data2)
{'hdPercent': '20.36', 'NetInput9': '', 'swapFree': '0 M', 'NetOutSpeed5': '', 'UsedMemory': '427.5 M', 'NetOut10': '', 'NetInput4': '', 'NetInputSpeed5': '', 'FreeMemory': '62.73 M', 'NetInputSpeed4': '0', 'NetOut7': '', 'TotalSwap': '0 M', 'NetOut2': '44 K 505 B ', 'NetOut5': '', 'NetOut8': '', 'NetInput5': '', 'NetOut4': '', 'NetInputSpeed2': '45561', 'memCachedPercent': '15.6', 'NetInputSpeed3': '3365845328', 'loadAvg': '0.00 0.00 0.00 1/121', 'TotalMemory': '490.23 M', 'barmemRealPercent': '51.41%', 'NetOut6': '', 'NetInput7': '', 'barswapPercent': '0%', 'NetOutSpeed2': '45561', 'barhdPercent': '20.36%', 'stime': '2017-02-18 12:15:45', 'useSpace': '3.983', 'bjtime': '', 'barmemCachedPercent': '15.6%', 'memRealFree': '238.18 M', 'NetInput3': '3 G 137 M 942 K 336 B ', 'NetInput6': '', 'uptime': '3天0小时24分钟', 'NetOutSpeed4': '0', 'NetInput2': '44 K 505 B ', 'freetime': '', 'NetOut3': '2 G 817 M 1005 K 862 B ', 'NetInput10': '', 'memRealUsed': '252.05 M', 'Buffers': '98.95 M', 'freeSpace': '15.579', 'memPercent': '87.2%', 'NetOutSpeed3': '3005200222', 'swapUsed': '0 M', 'CachedMemory': '76.5 M', 'NetOut9': '', 'swapPercent': '0', 'memRealPercent': '51.41', 'NetInput8': ''}
>>> data3=json.loads(data2)
>>> type(data3)
<class 'dict'>
>>> type(data3['CachedMemory'])
<class 'str'>
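Since every value in the dictionary is a str, as the last type() check shows, anything numeric has to be converted before it can be compared or plotted. The helpers below are a rough sketch of that step. parse_load and parse_megabytes are hypothetical names, and they assume that loadAvg starts with the 1-, 5- and 15-minute load averages (the /proc/loadavg layout) and that memory values always end in " M" as in the samples here.

```python
def parse_load(load_avg):
    """Split 'loadAvg' into the 1-, 5- and 15-minute load averages."""
    one, five, fifteen = load_avg.split()[:3]
    return float(one), float(five), float(fifteen)

def parse_megabytes(value):
    """Turn a string such as '490.23 M' into a float number of megabytes."""
    number, unit = value.split()
    if unit != "M":
        # Assumption: only "M" appears in the samples; refuse anything else.
        raise ValueError("unexpected unit: " + unit)
    return float(number)

print(parse_load("0.05 0.01 0.00 1/117"))
print(parse_megabytes("490.23 M"))
```

Raising on an unknown unit is deliberate: a probe on a larger machine might report a different unit, and a silent wrong number on the display would be worse than an error.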
Done. All that remains is to wrap this up in an object-oriented way and make the code a little more robust.
3. Encapsulation
# -*- coding:utf-8 -*-
from urllib import request
import json

# Probe crawler class
class PHPTZ:
    # Initializer: define the probe URL
    def __init__(self):
        self.url = 'http://138.197.193.89:888/tz.php?act=rt'

    # Fetch the probe data and return it as a dict (None on failure)
    def getData(self):
        try:
            f = request.urlopen(self.url)
            data = f.read()
            # Decode the bytes, then strip the surrounding parentheses
            data2 = data.decode('utf-8').strip('(').strip(')')
            dataj = json.loads(data2)
            print(dataj)
            print(type(dataj))
            return dataj
        except Exception:
            print('Error')
            return None

myserver = PHPTZ()
myserver.getData()
Run it:
pi@raspberrypi:~ $ sudo python3 tz.py
{'NetInput7': '', 'NetInput5': '', 'NetOut2': '44 K 505 B ', 'uptime': '3天4小时48分钟', 'loadAvg': '0.00 0.00 0.00 1/115', 'NetInput10': '', 'stime': '2017-02-18 16:39:49', 'NetInput4': '', 'NetOutSpeed2': '45561', 'NetInputSpeed3': '3379146879', 'freetime': '', 'NetOut9': '', 'UsedMemory': '418.66 M', 'hdPercent': '20.39', 'swapFree': '0 M', 'NetOut7': '', 'CachedMemory': '87.81 M', 'NetInput3': '3 G 150 M 620 K 127 B ', 'NetOut3': '2 G 830 M 296 K 887 B ', 'NetInputSpeed4': '0', 'NetOut6': '', 'NetInput2': '44 K 505 B ', 'memRealPercent': '45.61', 'FreeMemory': '71.57 M', 'NetInput8': '', 'NetOut8': '', 'memRealFree': '266.66 M', 'freeSpace': '15.573', 'swapPercent': '0', 'barmemRealPercent': '45.61%', 'memCachedPercent': '17.91', 'TotalMemory': '490.23 M', 'NetInputSpeed2': '45561', 'barmemCachedPercent': '17.91%', 'NetInputSpeed5': '', 'TotalSwap': '0 M', 'NetOut4': '', 'barhdPercent': '20.39%', 'Buffers': '107.28 M', 'useSpace': '3.989', 'memPercent': '85.4%', 'bjtime': '', 'NetOutSpeed4': '0', 'NetInput6': '', 'memRealUsed': '223.57 M', 'barswapPercent': '0%', 'swapUsed': '0 M', 'NetOut5': '', 'NetInput9': '', 'NetOutSpeed5': '', 'NetOutSpeed3': '3018105719', 'NetOut10': ''}
<class 'dict'>
I have only skimmed the ideas behind error handling and am far from proficient, so take the error handling here as a first attempt.
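One possible next step for the error handling, offered as a sketch rather than a finished design: set a timeout so a dead server cannot hang the Pi, and catch only the failures that are expected. fetch_status and its opener parameter are hypothetical; opener exists only so the function can be exercised without a network.

```python
import json
from urllib import request

def fetch_status(url, timeout=5, opener=request.urlopen):
    """Return the probe data as a dict, or None if fetching or parsing fails."""
    try:
        with opener(url, timeout=timeout) as response:
            text = response.read().decode('utf-8')
        # Drop the empty-callback JSONP wrapper, i.e. the surrounding "(" and ")".
        return json.loads(text.strip().lstrip('(').rstrip(')'))
    except (OSError, ValueError) as err:
        # OSError covers urllib's URLError and socket timeouts;
        # ValueError covers malformed JSON and bad UTF-8.
        print('Error:', err)
        return None
```

A caller on the Pi would then do something like status = fetch_status('http://138.197.193.89:888/tz.php?act=rt') and simply skip a display update whenever the result is None.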
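4. Applications
With the data in hand, processing it any way you like is a piece of cake.
Especially with something like the Raspberry Pi, the possibilities are endless. I am about to try building something new.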
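As one example of such processing, here is a sketch of deriving network speed from two polls. It assumes that NetOutSpeed3 is a cumulative byte counter, which the samples suggest (in each sample it equals NetOut3 converted to bytes), so the speed is the counter difference divided by the elapsed time. bytes_per_second is a hypothetical helper; the figures are the two tz.php?act=rt samples from section 1.2.

```python
from datetime import datetime

def bytes_per_second(previous, current, seconds):
    """Average throughput between two cumulative byte counters."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (int(current) - int(previous)) / seconds

# 'stime' and 'NetOutSpeed3' from the two samples in section 1.2.
fmt = "%Y-%m-%d %H:%M:%S"
elapsed = (datetime.strptime("2017-02-18 15:35:36", fmt)
           - datetime.strptime("2017-02-18 15:17:56", fmt)).total_seconds()
rate = bytes_per_second("3013176266", "3014180932", elapsed)
print(round(rate), "B/s")
```

Polled once a second, the same subtraction gives the kind of live speed figure the probe page itself displays.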
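The Yahei PHP probe is good, but sadly it has not been updated in four years; the last version, 0.4.7, does not even support PHP 7.
@Clarke :sad: That is a pity. Still, one generally does not put something this sensitive on a main site. I use it on my public proxy servers, whose software versions and architecture are somewhat older.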