Job Description
Lead the Capacity Planning and End-to-End (E2E) Performance Management function, responsible for ensuring the scalability, reliability, and responsiveness of production services and systems. Drive the strategy, implementation, and continuous improvement of proactive monitoring and automation frameworks to maintain optimal service performance. Analyze business forecasts to anticipate system needs, enhance visibility through modern observability platforms, and automate operational recovery procedures. Promote a data-driven culture of operational excellence and service resilience.
Job Responsibilities
- Define and maintain capacity planning models and performance baselines across all production services and infrastructure layers.
- Develop, maintain, and execute short- and long-term capacity and performance plans aligned with service growth and demand forecasts.
- Analyze business forecasts and usage trends to proactively identify required infrastructure or service expansions.
- Deliver weekly and monthly capacity and performance reports with actionable insights and forecast trends.
- Own the Configuration Management and Capacity Management processes; ensure alignment with ITSM best practices.
- Maintain and ensure the accuracy of the Configuration Management Database (CMDB) through continuous audits and governance practices.
- Own the enterprise service catalog; integrate new services with defined SLAs and delivery models.
- Design and implement E2E monitoring strategies for services, systems, and infrastructure using advanced observability platforms.
- Deploy and maintain Dynatrace APM to enable full-stack service performance visibility, transaction tracing, and AI-powered root cause detection.
- Leverage Splunk for real-time log analysis, performance dashboards, anomaly detection, and operational insights.
- Integrate performance and capacity dashboards to support proactive service management and executive reporting.
- Provide advanced technical and functional support for Dynatrace APM, Splunk, and Capacity Management platforms.
- Implement proactive mechanisms to detect performance degradations using AI/ML-powered alerting and historical trend analysis.
- Automate repetitive operational tasks, including capacity scaling and disaster recovery (DR) procedures, using Ansible Automation Platform.
- Collaborate with application and infrastructure teams to resolve performance issues and deliver ongoing improvements.
- Own, maintain, and automate disaster recovery playbooks to shorten recovery time objectives (RTO) and improve operational readiness across services.