This is my take on what best practice for monitoring looks like.
I have a standard for collecting data. It was invented during my time at TV2 Denmark, where I was the first employee on the team starting up their Operations Center.
The method was either 14/14/14 or 30/14/14, depending on what data we collected.
So you might wonder: what are those numbers?
Method of 14/14/14
This method is pretty simple:
- 14 days of collecting data
- 14 days for looking at the data
- 14 days for getting an alarm ready for production
Method of 30/14/14
You have probably guessed what this method is from reading the previous one.
It is the same, except with 30 days of collecting data.
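To make the two schedules concrete, here is a minimal Python sketch that lays out the three phases on a calendar. It assumes the phases run back to back, which is my reading of the method; the function name `phase_schedule` and its parameters are made up for this illustration.

```python
from datetime import date, timedelta


def phase_schedule(start, collect_days=14, review_days=14, alarm_days=14):
    """Return (phase, first day, last day) for each phase.

    Assumption: the phases follow each other with no gap or overlap.
    """
    phases = [
        ("collect", collect_days),
        ("review", review_days),
        ("alarm", alarm_days),
    ]
    schedule = []
    first = start
    for name, days in phases:
        last = first + timedelta(days=days - 1)
        schedule.append((name, first, last))
        first = last + timedelta(days=1)
    return schedule


# 14/14/14 uses the defaults; 30/14/14 only changes the collection phase.
standard = phase_schedule(date(2024, 1, 1))
slow = phase_schedule(date(2024, 1, 1), collect_days=30)
```

With a start date of 1 January, the 14/14/14 schedule ends after 42 days and the 30/14/14 schedule after 58.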
Use Cases
The two methods suit different cases.
If your data has a continuous flow, 14/14/14 should do the trick, because a constant data flow makes it easier to find the oddities.
Examples:
- Web server with constant flow of traffic
- Databases that have continuous usage
- Servers that have a constant workload
- How many viewers you have on videos
The 30/14/14 is more for slow, irregular flows.
Examples:
- Servers with a specific workload, such as a backup that runs once a week
- Systems that perform tasks on a schedule
- Periodically running systems
The typical things you can monitor with both methods are:
- CPU usage
- Memory usage
- Disk usage
- Network traffic
In the 14/14/14 case, what we often monitored was how the current CPU usage compared to the usage one week earlier, or whether the CPU load was higher than X.
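As a rough Python sketch of that kind of check: the function below raises an alarm if the CPU usage is above a fixed limit, or clearly higher than it was a week ago. The name `cpu_alarm` and the values 1.5 and 90.0 are placeholders I picked for the example, not the numbers we used.

```python
def cpu_alarm(current, week_ago, max_ratio=1.5, hard_limit=90.0):
    """Return True if the CPU usage (in percent) should raise an alarm.

    current:    latest CPU usage
    week_ago:   CPU usage at the same time one week earlier
    max_ratio:  how much higher than last week is still acceptable (assumed)
    hard_limit: the fixed "higher than X" limit (assumed)
    """
    # Fixed threshold: the load is higher than X.
    if current > hard_limit:
        return True
    # Week-over-week comparison; skipped when there is no usable baseline.
    return week_ago > 0 and current > week_ago * max_ratio
```

For example, `cpu_alarm(60, 30)` fires because the usage has doubled since last week, while `cpu_alarm(40, 35)` stays quiet.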
But on some systems those comparisons would cause what we called “trigger flicker”: a trigger simply created an event over and over again.
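To show how that happens, here is a small Python sketch that counts how many events a plain "value above threshold" trigger creates. The function `count_events` and the sample values are invented for the illustration.

```python
def count_events(samples, threshold):
    """Count how often a 'value > threshold' trigger goes from OK to PROBLEM."""
    events = 0
    in_problem = False
    for value in samples:
        now_problem = value > threshold
        # A new event is created every time the trigger fires again.
        if now_problem and not in_problem:
            events += 1
        in_problem = now_problem
    return events


hovering = [79, 81, 79, 82, 78, 81]  # load sitting right around the limit
steady = [85, 86, 87, 88]            # load that is clearly too high
```

With a threshold of 80, the hovering series creates three events from six samples, while the steady series creates a single one. The first case is the flicker.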
In some cases the simple solution was to use a forecast function (extremely useful on disks): looking back over a given period, we calculated what the load would be a given amount of time into the future.
This method often gave a good prediction of how the system would behave. It is also built into Zabbix and makes a good monitoring item.
An example would be:
forecast(item,14d,1h)
The simple explanation is: we look 14 days back, then calculate what the value will be one hour ahead.
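To give a feel for what such a forecast does, here is a minimal Python sketch that fits a straight line through the collected samples and extends it into the future. It is a simple least-squares fit written for illustration, not Zabbix's own implementation, and the name `linear_forecast` is made up.

```python
def linear_forecast(samples, horizon):
    """Predict the value `horizon` seconds after the newest sample.

    samples: list of (timestamp_in_seconds, value) pairs from the
             look-back window, for example the last 14 days
    horizon: how far ahead to predict, in seconds (3600 for one hour)
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples have the same timestamp")
    # Least-squares slope and intercept of the line through the samples.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    newest = max(t for t, _ in samples)
    return intercept + slope * (newest + horizon)


# A value that grows by 2 per hour is predicted to keep doing so.
history = [(0, 10.0), (3600, 12.0), (7200, 14.0)]
predicted = linear_forecast(history, 3600)  # about 16.0
```

An alarm can then compare the predicted value with a limit instead of the latest raw value, so a single noisy sample has less influence on the result.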
This can often help reveal whether the latest value is way off, for example when the number of IPs blocked by Fail2Ban starts to accelerate.
As I mentioned briefly, this can also be used on disks, more specifically on the space left, but there the prediction is often set to 12 or 48 hours into the future.