Incident Manager

Job Description

The Incident Manager leads the end-to-end coordination, communication, and resolution of high-severity incidents to minimize business disruption and restore service quickly. During incidents, this role acts as the central point of contact: driving collaboration across technical teams and stakeholders, managing escalations, and ensuring post-incident reviews are completed. Between incidents, the Incident Manager continuously improves incident management processes to strengthen operational resilience and readiness.

Responsibilities:

  • Incident Coordination
    • Lead and coordinate the resolution of high-severity incidents.
    • Assemble the right technical teams (e.g., engineers, sysadmins, vendors).
    • Establish a clear command structure and defined roles during incidents.
  • Communication
    • Keep stakeholders informed (internal teams, leadership, customers).
    • Act as a central point of contact during incidents.
    • Provide timely updates and post-incident reports.
  • Escalation Management
    • Ensure incidents are escalated to appropriate teams promptly.
    • Involve senior engineers or third-party vendors if needed.
  • Root Cause Analysis & Follow-Up
    • Ensure detailed post-incident reviews (postmortems or PIRs) are conducted.
    • Track corrective actions and ensure they’re implemented.
  • Process Improvement
    • Improve incident response processes and runbooks.
    • Recommend changes to reduce recurrence and response times.
  • Monitoring & Readiness
    • Monitor incident trends to identify recurring issues.
    • Ensure on-call coverage and incident playbooks are up to date.

Benefits

  1. The opportunity to work for a dynamic international company with a flat hierarchy, where your voice matters and your impact is seen.
  2. A company contribution of up to EUR 25 per month towards staff perks.
  3. Eligibility for the company bonus scheme, subject to the scheme rules.
  4. A EUR-equivalent salary paid in EGP.

Requirements

  • Strong knowledge of ITIL principles, especially Incident, Problem, and Change Management.
  • Good grasp of IT infrastructure, cloud platforms, networking, and application architecture to engage effectively with technical teams.
  • Experience with tools like Datadog, Prometheus, Splunk, New Relic, or equivalent for real-time monitoring and incident detection.
  • Proficiency in using platforms like PagerDuty, Opsgenie, ServiceNow, Jira, or similar for incident tracking and on-call management.
  • Ability to lead post-incident reviews and document technical findings in clear, actionable reports.
  • Experience creating and maintaining response playbooks and operational runbooks.
  • Strong verbal and written communication skills for coordinating cross-functional teams and delivering updates to stakeholders.
  • Ability to make sound, time-critical decisions under pressure during high-impact incidents.
  • Competence in identifying incident patterns, performing trend analysis, and implementing process improvements.
  • Knowledge of escalation protocols and experience coordinating with external service providers or vendors.
  • Excellent written and spoken English communication skills to engage effectively with customers and colleagues.

Education

  • Bachelor’s degree in IT.

Experience

  • 5+ years of relevant industry experience.