IT Capacity Planning and Performance Team Leader (NOC Excellence)

Job Description
Lead the Capacity Planning and End-to-End (E2E) Performance Management function, responsible for ensuring the scalability, reliability, and responsiveness of production services and systems. Drive the strategy, implementation, and continuous improvement of proactive monitoring and automation frameworks to maintain optimal service performance. Analyze business forecasts to anticipate system needs, enhance visibility through modern observability platforms, and automate operational recovery procedures. Promote a data-driven culture of operational excellence and service resilience.

Job Responsibilities

  • Define and maintain capacity planning models and performance baselines across all production services and infrastructure layers.
  • Develop, maintain, and execute short- and long-term capacity and performance plans aligned with service growth and demand forecasts.
  • Analyze business forecasts and usage trends to proactively identify required infrastructure or service expansions.
  • Deliver weekly/monthly capacity and performance reports with actionable insights and forecasting trends.
  • Own the Configuration Management and Capacity Management processes; ensure alignment with ITSM best practices.
  • Maintain and ensure the accuracy of the Configuration Management Database (CMDB) through continuous audits and governance practices.
  • Own the enterprise service catalog; integrate new services with defined SLAs and delivery models.
  • Design and implement E2E monitoring strategies for services, systems, and infrastructure with advanced observability platforms.
  • Deploy and maintain Dynatrace APM to enable full-stack service performance visibility, transaction tracing, and AI-powered root cause detection.
  • Leverage Splunk for real-time log analysis, performance dashboards, anomaly detection, and operational insights.
  • Integrate performance and capacity dashboards to support proactive service management and executive reporting.
  • Provide advanced technical and functional support for Dynatrace APM, Splunk, and Capacity Management platforms.
  • Implement proactive mechanisms to detect performance degradations using AI/ML-powered alerting and historical trend analysis.
  • Automate repetitive operational tasks, including capacity scaling and disaster recovery (DR) procedures, using Ansible Automation Platform.
  • Collaborate with application and infrastructure teams to resolve performance issues and deliver ongoing improvements.
  • Own, maintain, and automate disaster recovery playbooks to improve recovery time objectives (RTO) and operational readiness across services.

Job Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 6+ years of relevant experience in IT Operations or Service Management.
  • Strong hands-on experience with Dynatrace APM (full-stack monitoring, service flow analysis, synthetic and real user monitoring), Splunk (data ingestion, correlation, visualization, and anomaly detection), and Ansible Automation Platform (operational task automation and configuration management).
  • Proven expertise in capacity modeling, workload analysis, infrastructure performance tuning, and horizontal/vertical scaling strategies.
  • Experience in managing and governing CMDB, service catalogs, and service performance KPIs.
  • Scripting and automation experience (e.g., Python, PowerShell, Bash, YAML for Ansible).
  • Strong knowledge of ITSM practices and frameworks (e.g., ITIL v4), particularly in Incident, Problem, Configuration, and Capacity Management.
  • Demonstrated leadership in managing and coaching technical teams and third-party service providers.
  • Experience working in highly available, high-transaction production environments, preferably in financial or critical services.
  • Strong analytical and data visualization skills, with a focus on turning operational data into actionable insights.
  • Excellent communication, stakeholder management, and presentation skills.
  • Strong organizational, coordination, and decision-making abilities.
  • A self-driven, innovative mindset with a focus on continuous improvement and service reliability.