Infrastructure Monitoring Best Practices
At small companies with few servers and workstations, system administrators can usually quickly identify any issues that occur without any special tools. As a company grows, so does the number of servers and other network devices. And if something goes wrong, a system administrator must still be able to identify the problem quickly to prevent serious issues.
Looking for an issue manually in a medium or large infrastructure can be complicated and time-consuming. Fortunately, automated IT infrastructure monitoring is widely available today to help administrators identify the type and source of issues as fast as possible. These tools also help administrators proactively prevent issues and bottlenecks by monitoring resource allocation and real-time consumption.
This blog post explains what IT infrastructure monitoring is, why you should use monitoring tools for servers and other network devices, and what best practices to follow.
What Is IT Infrastructure Monitoring?
Infrastructure monitoring is the process of tracking hardware and software metrics in a physical or virtual environment to improve efficiency and optimize processes. This is done by collecting and analyzing the data about the availability, performance, and resource usage of critical hardware and applications.
An IT infrastructure is the underlying framework that allows businesses to deliver services, carry out transactions, provide information, interact with customers, etc. This infrastructure is composed of data centers, applications and software, networks, and hardware like servers, routers, etc.
IT Monitoring Types and Methods
Let’s look at the two main approaches to IT infrastructure monitoring.
- Agent-based monitoring can be done using client-server software by installing agents on each monitored machine. This type of IT monitoring tool requires installing the server component of the system monitoring software on a server or virtual machine. The server software records collected data in a database and provides a web interface for administrators and users to configure the system monitoring software and monitor the IT infrastructure.
An agent is the component of the IT monitoring software that is installed on the target machine from which data must be collected. The agent interacts with the server via the network and sends the collected data to the monitoring server. The agent should support multiple operating systems to cover the IT infrastructure better.
- Agentless monitoring can be done using server-side software and supported network protocols without installing monitoring software agents on each monitored machine. It can be used for different platforms, which is especially useful if you cannot install the monitoring agent (for example, on a switch or router).
IT monitoring software can check the availability of services on a remote host using ICMP, SSH, FTP, HTTP, and DNS protocols without a monitoring agent installed on the remote host. The server monitoring software tries to access the destination host via the defined protocol, and depending on the server response, determines the status of the needed service.
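A minimal sketch of such an agentless check, assuming that a successful TCP connection is an acceptable stand-in for "the service is available" (real monitoring tools usually speak the actual protocol, for example by sending an HTTP request). The function name tcp_service_is_up is hypothetical and is not taken from any particular product.

```python
import socket


def tcp_service_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, tcp_service_is_up("192.168.17.2", 22) would report whether something accepts connections on the SSH port of that host. It does not confirm that the SSH service itself is healthy.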
Two of the protocols used are:
- Simple Network Management Protocol (SNMP) was developed specifically for monitoring tasks and does not require installing monitoring agents on remote hosts. The remote host must run the appropriate SNMP service to support data collection via SNMP. SNMP works on the application layer of the OSI model, and the latest version is SNMPv3.
SNMP is usually supported in switches, routers, access points, firewalls, network printers, and other devices that are connected to the network. Each monitored parameter, such as received bytes, transmitted bytes, CPU temperature, or the level of toner in a printer cartridge, is associated with an object identifier (OID). Object identifiers are numbered using a hierarchical (tree-like) structure and are written as integers separated by dots. For example, vendor-specific identifiers, such as the one for the temperature sensor of Intel hardware, are located under the enterprise subtree 1.3.6.1.4.1.
Note that an SNMP agent is not the same as a monitoring agent of system monitoring software.
- Windows Management Instrumentation (WMI) is Microsoft’s proprietary management technology (its implementation of the WBEM standard) that can be queried over the network to monitor Windows-based systems without installing agents. The monitoring tool sends a WMI query to a monitored host and then reads the returned data.
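The hierarchical numbering of SNMP object identifiers described above can be illustrated with a short sketch. It only parses and compares OIDs and does not talk to any device. The helper names are hypothetical, and the enterprise number 99999 in the usage note is a made-up placeholder rather than a real vendor assignment.

```python
from typing import Optional, Tuple

Oid = Tuple[int, ...]

# iso.org.dod.internet.private.enterprises: the root of vendor-specific OIDs.
ENTERPRISE_PREFIX: Oid = (1, 3, 6, 1, 4, 1)


def parse_oid(text: str) -> Oid:
    """Turn a dotted string such as '1.3.6.1.2.1.1.3.0' into a tuple of ints."""
    parts = text.strip().lstrip(".").split(".")
    if not all(p.isdigit() for p in parts):
        raise ValueError(f"not a valid OID: {text!r}")
    return tuple(int(p) for p in parts)


def is_under(oid: Oid, subtree: Oid) -> bool:
    """Return True if oid sits in the tree below (or equals) subtree."""
    return oid[: len(subtree)] == subtree


def enterprise_number(oid: Oid) -> Optional[int]:
    """Return the vendor number for an enterprise OID, or None otherwise."""
    if is_under(oid, ENTERPRISE_PREFIX) and len(oid) > len(ENTERPRISE_PREFIX):
        return oid[len(ENTERPRISE_PREFIX)]
    return None
```

With these helpers, enterprise_number(parse_oid("1.3.6.1.4.1.99999.1.2")) yields the placeholder vendor number, while a standard identifier such as 1.3.6.1.2.1.1.3.0 (sysUpTime) is not under the enterprise subtree at all.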
IT Monitoring for Virtualized Systems
Monitoring VMs and containers has its own features that should be taken into account to achieve the desired results.
VM monitoring. For virtual machines, use agentless monitoring solutions that rely on VMware APIs to track the performance and efficiency of ESXi hosts, vCenter servers, and virtual machines. Monitoring metrics include CPU, memory, storage, and network usage. This approach avoids the overhead of installing monitoring agents on each VM.
Container monitoring is trickier than monitoring traditional servers and virtual machines. Containers are provisioned and destroyed quickly, and they share resources, which makes it difficult to measure the resources consumed on a host. Deploying N agents in N containers is impractical. Just like VMs, containers can be monitored via special APIs.
The Docker stats API is a native mechanism provided with Docker to monitor containers. The main goal of container monitoring is to track the containerized applications, typically microservices, that run inside the containers.
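As an illustration, here is a sketch of how a CPU usage percentage could be derived from one sample of container stats. It assumes the sample is a dictionary shaped like the JSON that the Docker Engine stats endpoint returns (cpu_stats, precpu_stats, system_cpu_usage, online_cpus); verify these field names against the API version you run. The function name cpu_percent is hypothetical.

```python
def cpu_percent(stats: dict) -> float:
    """Estimate container CPU usage in percent from one stats sample.

    The sample carries both the current and the previous reading, so the
    usage is the container's CPU time delta relative to the host's delta.
    """
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if system_delta <= 0 or cpu_delta < 0:
        # First sample or counter reset: no meaningful delta yet.
        return 0.0
    # Fall back to the per-CPU list, then to 1, if online_cpus is missing.
    cpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage") or []) or 1
    return cpu_delta / system_delta * cpus * 100.0
```

Because the calculation is a pure function over a dictionary, it can be checked with recorded samples before it is connected to a live Docker host.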
IT Infrastructure Monitoring: Components
Let’s explore the different components that can be tracked with IT infrastructure monitoring. This classification of monitored components is approximate because the categories can overlap.
- Hardware monitoring for CPU temperature, HDD temperature, HDD S.M.A.R.T. status, battery life data, voltage, etc., as well as resource metrics such as free memory, disk space, disk activity, and swap file usage.
- Network monitoring for data transfer rates on different network interfaces, the number of connected users (useful for VPN connections), network connections, firewalls, TCP and UDP connections (to detect malware), etc. It can help you detect network overload, low data transfer speed, and unauthorized attempts to access the network.
- Application monitoring to check application logs, including operating system logs, detect error codes, and display aggregated information in the web interface or send notifications to administrators. Application monitoring can include CPU and memory consumption by an application.
- Security monitoring to detect security issues and address software vulnerabilities, open ports, and unwanted permissions, which can be used to initiate attacks in your environment.
- Critical activity monitoring to detect unauthorized login attempts to a system, file modifications, etc. Monitoring files and folders helps you detect unusual activities caused by ransomware and respond quickly to avoid data loss.
- Uptime monitoring to detect whether a host was powered off even if nobody noticed (for example, a server was rebooted at night during non-working hours after installing automatic updates or after a power outage). The longer the host operates properly without a reboot, the more reliable and stable the system is.
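The uptime check from the last item can be sketched as follows. The sketch assumes that the monitoring system periodically records each host's uptime counter in seconds; a reading lower than the previous one means the host restarted between the two polls. The function name detect_reboots is hypothetical.

```python
from typing import List, Sequence


def detect_reboots(uptime_samples: Sequence[float]) -> List[int]:
    """Return the indices of samples taken right after a reboot.

    Uptime only grows while a host keeps running, so any drop between two
    consecutive readings indicates that the host was restarted.
    """
    reboots = []
    for i in range(1, len(uptime_samples)):
        if uptime_samples[i] < uptime_samples[i - 1]:
            reboots.append(i)
    return reboots
```

Note that a reboot is only detected after the host is polled again, so the polling interval bounds how quickly an unnoticed restart shows up.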
Best Practices for IT Infrastructure Monitoring
To achieve maximum monitoring efficiency, follow these infrastructure monitoring best practices. With a clear understanding of how to implement IT monitoring, you can mitigate downtime risks and react to issues more effectively before users feel the negative impact of failed services and applications.
Choose the right monitoring solution
To choose the right monitoring solution for your organization’s needs, determine which components require monitoring in your IT infrastructure. To do that, categorize hardware, systems, and applications based on how critical they are for business operations.
Then you can go on to define your monitoring strategy and select the optimal IT infrastructure monitoring software. Your strategy will include the hardware and software to monitor, which metrics to monitor, the monitoring depth, and how to respond when issues occur. Depending on these parameters, select the monitoring software that meets your requirements.
If you need to monitor VMware VMs on ESXi hosts, select a solution that accesses VMs at the hypervisor level rather than installing agents on the guest operating system. A universal enterprise monitoring solution combines agents to monitor physical machines with virtualization APIs to monitor hypervisor hosts and VMs. Such monitoring software can use protocols like SNMP to monitor network devices and other equipment and use special APIs to monitor items in the AWS and Azure clouds.
Gather relevant metrics
To make sure the information you collect stays relevant, follow these IT monitoring best practices:
- Define which metrics you need to monitor for physical machines, virtual machines, applications, networks, and different devices.
- Check your performance metrics and monitored logs regularly.
- Periodically review your monitored metrics and adjust the IT infrastructure monitoring configuration if necessary.
Configure access to the right dashboards
IT monitoring software usually collects data and displays it in an optimized view in a web interface. The web interface usually contains dashboards that visualize the gathered information. A system administrator and authorized users can open the web interface and check summary information, graphs, statistics, and other data for the entire infrastructure and for particular servers, devices, and applications.
Define who needs to view the monitoring data. Grant access for users to monitor only what they need to perform their responsibilities, following the principle of least privilege. Configure custom dashboards for different groups of users, for example:
- Programmers can monitor database servers, application servers, web servers, and the Kubernetes clusters they use.
- Testers can monitor servers and VMs used for testing.
- System administrators can monitor all items.
- Sales managers may need to view information about the CRM system.
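The role-based access described above could be expressed as a simple mapping. This is only a sketch of the least-privilege idea: the role names, dashboard names, and the can_view helper are hypothetical, and real monitoring products have their own permission models.

```python
# Which dashboards each role may open; anything not listed is denied.
DASHBOARD_ACCESS = {
    "programmer": {"database-servers", "application-servers", "web-servers", "kubernetes"},
    "tester": {"test-servers", "test-vms"},
    "sales-manager": {"crm"},
}

# System administrators can monitor all items.
ADMIN_ROLE = "sysadmin"


def can_view(role: str, dashboard: str) -> bool:
    """Return True if the given role is allowed to open the dashboard."""
    if role == ADMIN_ROLE:
        return True
    # Unknown roles get an empty set, so access is denied by default.
    return dashboard in DASHBOARD_ACCESS.get(role, set())
```

Denying by default means that a newly added dashboard stays invisible to everyone except administrators until it is explicitly granted.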
Configure automated alerts/notifications
Administrators and users can check the monitoring data on demand in the provided dashboards. This is a useful option, but how can you be informed about an issue immediately? Administrators cannot spend the whole day watching statistics. For this reason, most IT monitoring tools allow administrators to configure automatic notifications that are sent via email, Skype, SMS, etc. Administrators can configure triggers based on specific events to send notifications to the chosen destination.
Alerts can be prioritized: the most critical alerts should have the minimum delay, while other alerts can be sent with a delay of a few minutes. For example, if a host goes offline, a notification message is sent within two minutes to an email group or to a Skype group whose members are administrators, advanced users, and team leads. When the server comes back online, the appropriate notification message is sent to the same group. You can also set alerts for low disk space, CPU overload, and insufficient memory on servers. If a network device has the appropriate functionality, you can even configure notifications about a low toner level in a network printer cartridge. This can be useful if users regularly print important documents and you do not want to forget to check whether spare cartridges are in stock.
The infrastructure monitoring best practices recommend that you configure sending automatic notifications only for the needed parameters. If you configure notifications to be sent about all issues, it will be difficult to handle the received information.
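The two ideas above, notifying only about selected parameters and delaying less urgent alerts, can be sketched together. All names and values here (the event kinds, the priority labels, the delays) are hypothetical examples and would need to be adapted to your monitoring tool and requirements.

```python
from dataclasses import dataclass
from typing import Optional

# Only these event kinds produce notifications; everything else stays in
# the dashboards.
NOTIFY_ON = {"host_unreachable", "low_disk_space", "cpu_overload", "low_memory"}

# Delay before sending, in seconds, per priority (example values only).
SEND_DELAY_SECONDS = {"critical": 0, "moderate": 120, "minor": 300}


@dataclass
class Event:
    kind: str
    priority: str
    host: str


def notification_delay(event: Event) -> Optional[int]:
    """Return the delay before notifying, or None if no notification is sent."""
    if event.kind not in NOTIFY_ON:
        return None
    # Unknown priorities are treated as the least urgent ones.
    return SEND_DELAY_SECONDS.get(event.priority, max(SEND_DELAY_SECONDS.values()))
```

Keeping the allowlist separate from the delay table makes it easy to review which parameters generate notifications at all.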
Set the threshold for notifications
Configure thresholds for displaying and sending notifications. If you configure notifications to be sent immediately, you may receive many alert messages for short CPU performance spikes, short periods of “unreachable” networks caused by server overload, etc. Configure an adequate threshold to react in time while minimizing the flood of notifications. A properly configured threshold reduces the probability of false-positive alerts.
When you configure system monitoring software, set adequate intervals to collect data and generate reports. If the interval for generating reports is too short, the processes that generate reports and graphs in dashboards can interfere with core processes, and CPU load increases significantly. That can cause overload and failure of the monitoring server.
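One common way to avoid alerts on short spikes is to require that a value stays above the limit for several consecutive samples. The sketch below illustrates that idea; the class name SustainedThreshold and the example numbers are hypothetical, and this is not the mechanism of any specific monitoring product.

```python
class SustainedThreshold:
    """Fire once when a metric stays above a limit for N consecutive samples."""

    def __init__(self, limit: float, required_samples: int) -> None:
        if required_samples < 1:
            raise ValueError("required_samples must be at least 1")
        self.limit = limit
        self.required_samples = required_samples
        self._streak = 0  # number of consecutive samples above the limit

    def update(self, value: float) -> bool:
        """Record a sample; return True only when the alert should fire."""
        if value > self.limit:
            self._streak += 1
        else:
            self._streak = 0  # any normal reading resets the streak
        # Fire exactly once, at the moment the streak reaches the requirement.
        return self._streak == self.required_samples
```

With a limit of 90 percent CPU and three required samples, a single reading of 95 percent is ignored, while three such readings in a row trigger one alert rather than a new alert on every subsequent sample.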
Mark notification priorities
Without prioritization, notifications arrive as an undifferentiated flood of data. Parsing this data to find what is important is time-consuming, inconvenient, and inefficient. Configuring the IT infrastructure monitoring solution to display only what you need, with priorities set, makes life easier.
Different issues can occur in the IT infrastructure. Some of them may be critical, others not.
- Examples of critical issues. Failure of an Active Directory domain controller server, production database server, ESXi server running mission-critical VMs, bad S.M.A.R.T. status of a disk drive, low disk space, high CPU temperature, insufficient free memory, etc.
- Examples of moderate (middle-priority) issues. Failure of a test server, test VM, bug tracker, etc.
- Examples of light (minor) issues. Low level of toner in a printer, etc.
Priorities can be different for each company, and you should adjust them according to your requirements. If your monitoring software supports it, set a priority for each issue type so that it is shown in monitoring dashboards and in automatic notifications, for example:
- [Critical] Host 192.168.17.2 (DC01) is unreachable for 5 minutes.
- [Critical] CPU temperature is too high (82 °C) on host 192.168.17.89 (Ora12-prod).
- [Critical] Low disk space on C: on host 10.10.10.6 (FS-06).
- [Moderate] VM 10.10.10.35 (Oracle-test) on host 192.168.17.22 (ESXi-22) is unreachable for 5 minutes.
- [Minor] Toner level is low for 192.168.17.8 (HP-printer).
The critical issues are urgent and administrators should fix them as soon as possible. The minor issues can wait for a response.
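A short sketch of how prioritized messages in the format shown above could be produced and ordered. The three priority labels come from the examples; the function names are hypothetical.

```python
from typing import List, Tuple

# Lower number means more urgent.
PRIORITY_ORDER = {"Critical": 0, "Moderate": 1, "Minor": 2}


def format_notification(priority: str, text: str) -> str:
    """Prefix a message with its priority tag, for example '[Critical] ...'."""
    if priority not in PRIORITY_ORDER:
        raise ValueError(f"unknown priority: {priority!r}")
    return f"[{priority}] {text}"


def sort_by_priority(items: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Order (priority, text) pairs so the most urgent come first.

    sorted() is stable, so items with equal priority keep their order.
    """
    return sorted(items, key=lambda item: PRIORITY_ORDER[item[0]])
```

Sorting before display puts critical items at the top of a dashboard or digest, so minor issues such as a low toner level never push an unreachable domain controller out of view.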
Test how monitoring is working
After configuring an IT infrastructure monitoring system, you need to test how this system works and whether notifications are sent out properly. Don’t wait for a real emergency: schedule a test run as soon as you finish the configuration. After the test run, you may need to fine-tune your IT monitoring system. Testing allows you to ensure that monitoring works as expected and to determine its efficiency.
Create a response action plan
Define what to do after receiving notifications when issues occur. You should have a fast, predefined response for critical issues. Maintain a disaster recovery plan and follow it in case of failures or data loss to ensure operational continuity and meet your organization’s RTOs and RPOs. Always have backups ready for the recovery of machines or specific application data.
Some monitoring software comes with comprehensive data protection and disaster recovery functionality, like NAKIVO’s IT Monitoring solution. Server failure and data loss can occur in all types of environments. Data backup allows you to protect your data, recover data in case of failure, and restore workloads to normal operation in a short time. NAKIVO Backup & Replication is a universal data protection solution that supports backup of physical Linux and Windows machines, VMware vSphere VMs, Microsoft Hyper-V VMs, Amazon EC2, Nutanix AHV, and Microsoft 365. You can download the free edition of the product on the official website.