In the early 1990s, when monitoring tools began to take form, the system elements you could collect and report on were limited. Today you can capture metrics on just about any element imaginable, and if a metric is not available out of the box, you can probably write a custom script to collect it. You can also set notifications that fire when the collected metrics trigger an event or alarm. That’s all there is to monitoring, right?
Not so fast. The ability to monitor is only half the battle. Today the challenge is too much data and too many options, which can be just as harmful as too little data or too few choices because it quickly becomes overwhelming, especially in larger environments where thousands of systems send hundreds of events and alarms. How do you filter out the noise and find what really matters? There is no cure-all, but you can start by developing a monitoring strategy (yes, there is such a thing, and it can be very effective).
When setting your monitoring strategy, there are four objectives to keep in mind:
- Make sure your business requirements are fulfilled. Without the business, there is no purpose in having the technology assets that support it, so business requirements are the natural starting point for building a framework. Generally, you want to determine how services are being delivered so you know what needs monitoring. You also want to know the service level agreements in place so you can set appropriate thresholds. Moreover, you want to know how the business is expected to grow. This is crucial because more business means demand for more capacity, and more capacity means higher licensing and labor costs. Many more factors come into play, but the key is to determine what you need to monitor and what your failure tolerance levels are.
- Implement your monitoring with your services and applications as the focal point. System-only monitoring is the way of the past. Technology advances have brought increased complexity to today’s data center and cloud environments, so it is no longer meaningful to know only that a system has failed. Take, for example, a pair of clustered servers. If one of the servers in the cluster goes down, is the outage really impactful? Possibly. The server admin may report that all services failed over properly at the OS level, while application support reports that the application is processing requests very slowly and end users are complaining. How could this be? One likely explanation is that the application was designed to use the full capacity of both servers. For this reason, it is preferable to monitor at the application and business service level. That way, when a server fails, you are made aware that the application is in a degraded state.
- Select the proper tools. With so many tools available today, selecting the proper tool(s) can be a tedious task. You have to consider the features, the cost, functional overlap with other tools, the level of difficulty to implement and support, and many other factors. This can be time consuming and even political, but it is necessary because these tools can be costly. To aid this process, keep two things in mind. First, think integration. Tool integration allows for flexibility and collaboration across multi-vendor tools. This way you can benefit from the best features of each tool and aggregate the data collected from all of them for analytics. Second, and above all, make sure the tool can perform as expected. If this means going through an extensive proof of concept life cycle, it may be worth your while.
- Involve your operations teams. Monitoring tools are usually purchased with a specific purpose in mind, but the processes necessary to make the tools effective are overlooked or, at best, implemented as an afterthought. It should be the complete opposite. In a sense, if the monitoring solution were a product, the operations teams would be the customers. In other words, the monitoring solution should be tailored to the operations teams so that it enables them to respond to outages and performance problems efficiently and effectively. This entails careful planning and involving them during the tool selection process. It is also important to provide proper training on the use of the tools, to create run books, and to maintain a continuous improvement effort that keeps the tools sharp.
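
To make the second objective more concrete, here is a minimal Python sketch of a service-level health check for the clustered-server example. It is only an illustration under stated assumptions: the names `ServiceSnapshot`, `service_health`, and `sla_latency_ms`, and the numbers used, are hypothetical and do not come from any particular monitoring tool. The idea is that the state is derived from what the application delivers (response time against an SLA-derived threshold) rather than from the status of a single server.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    DOWN = "down"


@dataclass
class ServiceSnapshot:
    """Hypothetical point-in-time view of one application service."""
    nodes_up: int          # cluster members currently serving requests
    nodes_total: int       # cluster members expected to serve requests
    p95_latency_ms: float  # response time measured at the application level


def service_health(snapshot: ServiceSnapshot, sla_latency_ms: float) -> Health:
    """Classify the service from the end user's point of view.

    sla_latency_ms is an assumed threshold taken from the service level
    agreement; it is not a value any tool provides by default.
    """
    if snapshot.nodes_up == 0:
        # Nothing is serving requests: the service itself is down.
        return Health.DOWN
    if snapshot.p95_latency_ms > sla_latency_ms:
        # Requests are served, but slower than the SLA allows.
        return Health.DEGRADED
    return Health.OK


if __name__ == "__main__":
    # One of two servers failed over cleanly, yet the application is slow.
    after_failover = ServiceSnapshot(nodes_up=1, nodes_total=2, p95_latency_ms=900.0)
    print(service_health(after_failover, sla_latency_ms=250.0))
```

In this sketch, a server-level check would report a successful failover, while the service-level check reports a degraded state because the measured latency exceeds the assumed SLA threshold. A real implementation would likely also track reduced redundancy (`nodes_up` below `nodes_total`) as a separate warning.
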
Overall, implementing a successful monitoring strategy should reduce downtime and provide operational efficiencies. As a result, your business should see minimal impact from incidents and deliver an optimal customer experience.