Commit

add load
leeyoshinari committed Jul 28, 2024
1 parent 3353e73 commit 9249a1c
Showing 8 changed files with 848 additions and 114 deletions.
695 changes: 674 additions & 21 deletions LICENSE

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion README.md
@@ -5,7 +5,7 @@

## Introduction
#### Completed functions<br>
1. Monitoring the CPU usage, IO wait, Memory, Disk IO, Network, and TCP connections of the server.<br>
1. Monitoring the CPU usage, CPU Load, IO wait, Memory, Disk IO, Network, and TCP connections of the server.<br>
2. Monitoring the CPU usage, Context switching, Memory, Disk read and write, and TCP connections of the specified port.<br>
3. For Java applications, monitoring the JVM size and garbage collection; when the frequency of Full GC is too high, an email alert will be sent.<br>
4. When the server CPU usage is too high, or free memory is too low, an email alert will be sent; the cache can also be cleared automatically.<br>
2 changes: 1 addition & 1 deletion agent/config.ini
@@ -12,7 +12,7 @@ port = 12121
threadPool = 0
# Bandwidth supported by server ethernet, unit: Mb/s.
# Some servers (such as cloud servers) cannot obtain network card information; in that case this default value is used.
nicSpeed = 1000
nicSpeed = 10000

[server]
# The address of the server.
40 changes: 34 additions & 6 deletions agent/performance_monitor.py
@@ -62,6 +62,9 @@ def __init__(self):
self.all_disk = [] # disk number
self.total_disk = 1 # total disk size, unit: M
self.total_disk_h = 0 # total disk size, unit:T or G
self.rece = 0.0
self.trans = 0.0
self.network = 0.0
self.network_speed = cfg.getAgent('nicSpeed') # bandwidth
self.Retrans_num = self.get_RetransSegs() # TCP retrans number

@@ -266,6 +269,9 @@ def write_cpu_mem(self, index):
.field('rKbs', pid_info['kB_rd'])
.field('wKbs', pid_info['kB_wr'])
.field('iodelay', pid_info['iodelay'])
.field('net', self.network)
.field('rec', self.rece)
.field('trans', self.trans)
.field('tcp', tcp_num.get('tcp', 0))
.field('close_wait', tcp_num.get('close_wait', 0))
.field('time_wait', tcp_num.get('time_wait', 0))
@@ -306,7 +312,10 @@ def write_system_cpu_mem(self, is_system):
'trans': 0.0,
'net': 0.0,
'tcp': 0,
'retrans': 0
'retrans': 0,
'load1': 0.0,
'load5': 0.0,
'load15': 0.0
}}
for disk in self.all_disk:
# The system disks exists in the format of 'sda-1'. Since influxdb cannot recognize the '-', need to replace it.
@@ -345,6 +354,9 @@ def write_system_cpu_mem(self, is_system):
line['fields']['net'] = res['network']
line['fields']['tcp'] = res['tcp']
line['fields']['retrans'] = res['retrans']
line['fields']['load1'] = res['load1']
line['fields']['load5'] = res['load5']
line['fields']['load15'] = res['load15']
pointer = Point(line['measurement'])
pointer = pointer.tag('type', 'system')
for key, value in line["fields"].items():
@@ -550,19 +562,20 @@ def get_system_cpu_io_speed(self):
if bps1 and bps2:
data1 = bps1[0].split()
data2 = bps2[0].split()
rece = (int(data2[1]) - int(data1[1])) / 1048576
trans = (int(data2[9]) - int(data1[9])) / 1048576
self.rece = (int(data2[1]) - int(data1[1])) / 1048576
self.trans = (int(data2[9]) - int(data1[9])) / 1048576
# 400 = 8 * 100 / 2
# Why multiply by 8, because 1MB/s = 8Mb/s.
# Why divided by 2, because the network card is in full duplex mode.
network = 400 * (rece + trans) / self.network_speed
self.network = 400 * (self.rece + self.trans) / self.network_speed
logger.debug(f'The bandwidth of ethernet is: Receive {self.rece}MB/s, Transmit {self.trans}MB/s, Ratio {self.network}%')

tcp, Retrans = self.get_tcp()
load1, load5, load15 = self.get_cpu_load()

return {'disk': disk, 'disk_r': disk_r, 'disk_w': disk_w, 'disk_d': disk_d, 'cpu': cpu, 'iowait': iowait,
'usr_cpu': usr_cpu, 'mem': mem, 'mem_available': mem_available, 'rece': rece, 'trans': trans,
'network': network, 'tcp': tcp, 'retrans': Retrans}
'usr_cpu': usr_cpu, 'mem': mem, 'mem_available': mem_available, 'rece': self.rece, 'trans': self.trans,
'network': self.network, 'tcp': tcp, 'retrans': Retrans, 'load1': load1, 'load5': load5, 'load15': load15}

@staticmethod
def get_free_memory():
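
Note on the bandwidth ratio in the hunk above: the constant 400 folds together a unit conversion (MB/s to Mb/s), a percentage, and the full-duplex assumption. The sketch below spells the same arithmetic out with the factors separated. The helper name `nic_utilization` and the sample figures are illustrative assumptions and are not part of the commit.

```python
def nic_utilization(rece_mb_s: float, trans_mb_s: float, nic_speed_mbit: float) -> float:
    """Return NIC utilization in percent for a full-duplex link.

    rece_mb_s and trans_mb_s are in MB/s (megabytes per second);
    nic_speed_mbit is the card speed in Mb/s (megabits per second).
    """
    # 8 converts MB/s to Mb/s, 100 turns the ratio into a percentage,
    # and the factor 2 reflects full duplex: receive and transmit each
    # have the full card bandwidth available.
    return 8 * 100 * (rece_mb_s + trans_mb_s) / (2 * nic_speed_mbit)


if __name__ == "__main__":
    # 50 MB/s received plus 25 MB/s transmitted (hypothetical traffic).
    print(nic_utilization(50.0, 25.0, 1000.0))   # 30.0 with nicSpeed = 1000
    print(nic_utilization(50.0, 25.0, 10000.0))  # 3.0 with nicSpeed = 10000
```

With the default `nicSpeed` raised from 1000 to 10000 in `agent/config.ini`, the same traffic reports a ratio ten times lower on hosts where the default is used.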
@@ -585,6 +598,21 @@ def get_free_memory():

return mem, mem_available

@staticmethod
def get_cpu_load():
load1 = 0.0
load5 = 0.0
load15 = 0.0
try:
res = exec_cmd('uptime')[0]
loads = res.split(',')
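load15 = float(loads[-1].strip())
load5 = float(loads[-2].strip())
# loads[-3] still carries the "load average:" label, so keep only the number after the colon.
load1 = float(loads[-3].split(':')[-1].strip())
except Exception: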
logger.error(traceback.format_exc())
return load1, load5, load15

'''def get_handle(pid):
"""
Get the number of handles occupied by the process
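
Note on `get_cpu_load` in the hunk above: splitting the `uptime` line on commas leaves the `load average:` label attached to the 1-minute value, so that value needs separate handling. A minimal sketch of a regex-based alternative follows, together with a wrapper around the standard library call `os.getloadavg()`. The function names `parse_uptime_loads` and `system_loads` and the sample line are assumptions for illustration; the commit itself shells out through the project's `exec_cmd`.

```python
import os
import re
from typing import Tuple


def parse_uptime_loads(line: str) -> Tuple[float, float, float]:
    """Extract the 1, 5 and 15 minute load averages from one line of `uptime` output."""
    match = re.search(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line)
    if not match:
        raise ValueError(f"no load averages found in: {line!r}")
    load1, load5, load15 = (float(group) for group in match.groups())
    return load1, load5, load15


def system_loads() -> Tuple[float, float, float]:
    """Read the load averages without spawning a process (Unix only)."""
    return os.getloadavg()


if __name__ == "__main__":
    sample = " 10:15:02 up 3 days,  2:41,  1 user,  load average: 0.52, 0.40, 0.31"
    print(parse_uptime_loads(sample))  # (0.52, 0.4, 0.31)
```

`os.getloadavg()` avoids parsing text altogether, at the cost of not going through the same command-based path the rest of the agent uses.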
33 changes: 11 additions & 22 deletions server/draw_performance.py
@@ -44,6 +44,9 @@ def draw_data_from_db(host, port=None, pid=None, startTime=None, endTime=None, s
'close_wait': [],
'time_wait': [],
'retrans': [],
'load1': [],
'load5': [],
'load15': [],
'disk': disk}

res = {'code': 1, 'flag': 1, 'message': 'Successful!'}
@@ -69,7 +72,7 @@ def draw_data_from_db(host, port=None, pid=None, startTime=None, endTime=None, s
|> filter(fn: (r) => r._measurement == "{host}" and r.type == "{port}")
|> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")
|> map(fn: (r) => ({{ r with _time: uint(v: r._time) }}))
|> keep(columns: ["_time", "cpu", "wait_cpu", "mem", "tcp", "jvm", "iodelay", "rKbs", "wKbs", "close_wait", "time_wait"])
|> keep(columns: ["_time", "cpu", "wait_cpu", "mem", "tcp", "jvm", "iodelay", "rKbs", "wKbs", "net", "rec", "trans", "close_wait", "time_wait"])
'''
datas = query_api.query(org=cfg.getInflux('org'), query=sql)
if datas:
@@ -85,32 +88,15 @@ def draw_data_from_db(host, port=None, pid=None, startTime=None, endTime=None, s
post_data['io'].append(record.values['iodelay'])
post_data['disk_r'].append(record.values['rKbs'])
post_data['disk_w'].append(record.values['wKbs'])
post_data['nic'].append(record.values['net'])
post_data['rec'].append(record.values['rec'])
post_data['trans'].append(record.values['trans'])
post_data['close_wait'].append(record.values['close_wait'])
post_data['time_wait'].append(record.values['time_wait'])
else:
res['message'] = f'No monitoring data of the port {port} is found, ' \
f'please check the port or time setting.'
res['code'] = 0

if disk:
sql = f'''
from(bucket: "{cfg.getInflux('bucket')}")
|> range(start: {startTime}, stop: {endTime})
|> filter(fn: (r) => r._measurement == "{host}" and r["type"] == "system")
|> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")
|> keep(columns: ["_time", "rec", "trans", "net"])
'''
datas = query_api.query(org=cfg.getInflux('org'), query=sql)
if datas:
for data in datas:
for record in data.records:
post_data['nic'].append(record.values['net'])
post_data['rec'].append(record.values['rec'])
post_data['trans'].append(record.values['trans'])
else:
res['message'] = 'No monitoring data is found, please check the disk number or time setting.'
res['code'] = 0

if pid:
pass

@@ -125,7 +111,7 @@ def draw_data_from_db(host, port=None, pid=None, startTime=None, endTime=None, s
|> filter(fn: (r) => r._measurement == "{host}" and r["type"] == "system")
|> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")
|> map(fn: (r) => ({{ r with _time: uint(v: r._time) }}))
|> keep(columns: ["_time", "cpu", "iowait", "usr_cpu", "mem", "mem_available", "{disk_n}", "{disk_r}", "{disk_w}", "{disk_d}", "rec", "trans", "net", "tcp", "retrans"])
|> keep(columns: ["_time", "cpu", "iowait", "usr_cpu", "mem", "mem_available", "{disk_n}", "{disk_r}", "{disk_w}", "{disk_d}", "rec", "trans", "net", "tcp", "retrans", "load1", "load5", "load15"])
'''
datas = query_api.query(org=cfg.getInflux('org'), query=sql)
if datas:
@@ -147,6 +133,9 @@ def draw_data_from_db(host, port=None, pid=None, startTime=None, endTime=None, s
post_data['disk_d'].append(record.values[disk_d])
post_data['tcp'].append(record.values['tcp'])
post_data['retrans'].append(record.values['retrans'])
post_data['load1'].append(record.values['load1'])
post_data['load5'].append(record.values['load5'])
post_data['load15'].append(record.values['load15'])

else:
res['message'] = 'No monitoring data is found, please check the disk number or time setting.'
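
Note on reading the new `load1`, `load5` and `load15` columns: points written before this commit carry no load fields, so for older time ranges a direct `record.values['load1']` lookup may find the key missing or set to `None`. That is an assumption about the stored data, not something the diff shows. The sketch below shows one tolerant way to collect the series. Plain dicts stand in for `record.values`, and the helper name `collect_load_series` is hypothetical.

```python
from typing import Any, Dict, Iterable, List

LOAD_FIELDS = ("load1", "load5", "load15")


def collect_load_series(rows: Iterable[Dict[str, Any]], default: float = 0.0) -> Dict[str, List[float]]:
    """Build one list per load field, substituting `default` where a row has no value."""
    series: Dict[str, List[float]] = {name: [] for name in LOAD_FIELDS}
    for row in rows:
        for name in LOAD_FIELDS:
            value = row.get(name)  # None when the key is absent
            series[name].append(default if value is None else value)
    return series


if __name__ == "__main__":
    rows = [
        {"_time": 1, "cpu": 12.5},  # written before load fields existed
        {"_time": 2, "cpu": 10.0, "load1": 0.52, "load5": 0.40, "load15": 0.31},
    ]
    print(collect_load_series(rows))
    # {'load1': [0.0, 0.52], 'load5': [0.0, 0.4], 'load15': [0.0, 0.31]}
```

Filling gaps with a default keeps every series the same length as the time axis, which matters when the lists are plotted against each other.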
23 changes: 3 additions & 20 deletions server/static/README_zh_CN.md
@@ -6,7 +6,7 @@

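## Introduction
#### Completed functions<br>
1. Monitoring the CPU usage, IO wait, memory usage, disk IO, network bandwidth, and TCP connections of the whole server.<br>
1. Monitoring the CPU usage, CPU Load, IO wait, memory usage, disk IO, network bandwidth, and TCP connections of the whole server.<br>
2. Monitoring the CPU usage, context switching, memory footprint, disk reads and writes, and TCP connections of the specified port.<br>
3. For Java applications, monitoring the JVM size and garbage collection; when the frequency of Full GC is too high, an email alert can be sent.<br>
4. When the system CPU usage is too high, or free memory is too low, an email alert can be sent; automatic cache clearing can be configured.<br>
@@ -21,25 +21,8 @@
13. Multiple servers can be managed and monitored at the same time.<br>
14. Stopping the server does not affect monitoring on the clients.<br>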

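#### Implementation
1. Uses the coroutine-based HTTP framework `aiohttp`.<br>
2. The server front end is rendered with `jinja2` templates.<br>
3. Data visualization uses `echarts`, which reduces the load on the server.<br>
4. A thread pool plus a queue is used to monitor multiple ports at the same time.<br>
5. The client registers its IP and port with the server every 8 seconds.<br>
6. The server queries the status of all registered clients every 12 seconds.<br>
7. Monitoring data is stored in InfluxDB; an automatic expiration time can be set for the data.<br>
8. To keep the monitoring results accurate, data is obtained directly from Linux system commands, and no curve fitting is applied during visualization.<br>
##### To see the overall approach and ideas, [click here](https://blog.ihuster.top/p/840241882.html).
<br>
##### To see the overall approach and ideas, [click here](https://mp.weixin.qq.com/s?__biz=Mzg5OTA3NDk2MQ==&mid=2247483810&idx=1&sn=d352577de1676c9c4997cbe305f243ca&chksm=c0599f5cf72e164a23c67e8782270f06561a659c2dd40c820e9b07d1d83e35d3fe7ca796e952&token=59716230&lang=zh_CN#rd).
<br>

#### Project branches
| Branch | Framework | Features |
| :----: | :----:| :---- |
| [master branch](https://github.com/leeyoshinari/performance_monitor) | aiohttp | 1. Multiple servers can be monitored at the same time<br> 2. Charts are drawn with echarts, which reduces the load on the server<br> 3. The charts are interactive |
| [matplotlib branch](https://github.com/leeyoshinari/performance_monitor/tree/matplotlib) | aiohttp | 1. Multiple servers can be monitored at the same time<br> 2. Charts are drawn with matplotlib, which reduces the load on the client (browser) |
| [single branch](https://github.com/leeyoshinari/performance_monitor/tree/single) | flask | 1. Single-machine monitoring<br> 2. Monitoring data is stored in logs<br> 3. Charts are drawn with matplotlib |

## Usage
1. Clone performance_monitor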
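@@ -114,7 +97,7 @@ Look up the pyinstaller installation steps yourself; packaging starts directly below:<br>
## Note
1. The server must support the following commands: `jstat`, `iostat`, `pidstat`, and `netstat`; install any that are missing.

2. The sysstat version must be 12+. Only version 12 has been tested so far; other versions are untested, and older versions may produce abnormal data. The latest version can be downloaded [here](http://sebastien.godard.pagesperso-orange.fr/download.html).
2. The sysstat version must be 12+. Only version 12.7.5 has been tested so far; other versions are untested, and older versions may produce abnormal data. The latest version can be downloaded [here](http://sebastien.godard.pagesperso-orange.fr/download.html).

3. The server network card must be in full-duplex mode; otherwise the bandwidth utilization will be calculated incorrectly.
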