Manager of Monitoring Operations

Chennai, India JR013415

Job Description: Manager – Monitoring Operations

Role Summary

The Manager – Monitoring Operations will lead and manage the enterprise monitoring operations team responsible for the availability, performance, and reliability of IT infrastructure and applications. This role will oversee the day-to-day operations of the BMC Helix On-Premises Monitoring tool deployed on RedHat OCP (OpenShift Container Platform), network and device monitoring using ParkPlace Entuity, and OS monitoring using Prometheus-Grafana, ensuring high service quality, operational excellence, and continuous improvement.

The role requires strong people management skills, deep technical expertise in systems monitoring platforms, and experience operating monitoring solutions in containerized environments.

Key Responsibilities

· Lead, mentor, and manage a team of monitoring engineers/analysts, defining goals, KPIs, shift coverage, and on-call rotations.

· Drive skill development through performance reviews, training initiatives, and continuous learning plans.

· Act as the escalation point for major monitoring incidents and outages, guiding quick workarounds to prevent monitoring gaps and loss of metrics.

· Ensure operational excellence aligned with ITIL practices (Incident, Problem, Change) and adherence to security, compliance, and operational standards.

· Manage upgrades, patches, capacity planning, and health checks across the monitoring estate to maintain high availability and performance.

· Oversee server (Windows/Linux/AIX), network, database, and synthetic URL monitoring for the enterprise and for global clients’ private cloud.

· Collaborate with Container Platform, Core Infrastructure, and Network teams on platform stability, scaling, resilience, and resource allocation.

· Optimize alert quality, reduce alert fatigue, standardize dashboards/alerting frameworks, and deliver actionable insights.

· Maintain SOPs, runbooks, and operational documentation; provide regular reports on platform health, incidents, and SLA compliance.

· Serve as the primary stakeholder contact for all monitoring services.

· Conduct annual disaster-recovery (DR) tests for the monitoring estate to validate resilience, recovery procedures, and business continuity readiness.

Required Experience & Qualifications

Experience

· 10+ years of overall IT industry experience, including 5+ years in monitoring operations in medium-to-large organizations.

· Hands-on operational expertise with at least two of the following monitoring platforms/tools:

o BMC Helix Monitoring (SaaS or On-Prem)

o RedHat OpenShift Container Platform (OCP) or Kubernetes Cluster Management

o Prometheus, Exporters, OTEL Collectors, and Grafana

o ParkPlace Entuity Network and Hardware Monitoring

· Proven experience in monitoring architecture design, capacity planning, performance tuning, and integration with ITSM tools for automated ticketing workflows.

· Strong knowledge of ITIL processes and operational best practices.

Leadership & Soft Skills

· Strong people-management and leadership capabilities

· Excellent communication and stakeholder-management skills

· Ability to handle high-pressure situations and lead incident response

· Strategic mindset with a focus on operational maturity and optimization

Education & Certifications

· Bachelor’s degree in Computer Science, Information Technology, or an equivalent field

· Relevant certifications (preferred, not mandatory):

o RedHat OpenShift / Kubernetes

o BMC Helix

o Foundation certifications in ITIL and/or AI

Nice-to-Have

· Exposure to hybrid or multi-cloud environments

· Experience with automation, scripting, APIs, and AI-driven service improvements

· Application Performance Monitoring (APM) experience
