EBCOM, a proud subsidiary of MCI (Hamrah-e Aval), is seeking a Senior Observability & Reliability Engineer to enhance the reliability and observability of our services. In this critical role, you'll design and implement advanced monitoring solutions, improve system visibility, and ensure the seamless performance of our distributed systems. Your expertise in reliability engineering will be key in minimizing incidents and contributing to the continuous improvement of our operational resilience.
Job Description
Observability Engineering:
- Design and enhance observability architecture (Metrics, Logs, Traces).
- Implement centralized monitoring and telemetry collection strategies.
- Build advanced dashboards and visualization layers for service health.
- Define effective alerting models with proper thresholds and noise reduction.
- Improve system visibility across distributed services and APIs.
Reliability Engineering:
- Apply reliability principles to reduce incidents and service degradation.
- Define and track SLOs, SLAs, SLIs, and error budgets.
- Conduct reliability and availability analysis across services.
- Identify systemic weaknesses and recurring failure patterns.
- Contribute to resilience improvements and operational hardening.
Incident Intelligence & Analysis:
- Lead advanced alert triage and event correlation.
- Support incident response with deep telemetry insights.
- Participate in root cause analysis (RCA) and post-incident reviews.
- Develop monitoring improvements based on incident learnings.
Performance & Capacity Insight:
- Analyze performance trends and detect anomalies.
- Support capacity planning using monitoring data.
- Deliver service reliability and performance reports.
Required Technical Skills:
- Strong hands-on experience with monitoring & visualization tools: Prometheus, Zabbix, Grafana, Splunk, Kibana.
- Experience with log aggregation and analysis platforms: ELK Stack or Splunk.
- Solid understanding of observability concepts (Metrics, Logs, Traces, Telemetry).
- Familiarity with distributed systems monitoring.
- Linux system knowledge and performance troubleshooting.
- Understanding of network and infrastructure monitoring fundamentals.
- Experience with API/service monitoring.
- Experience with event management and ticketing systems.
Reliability & Engineering Knowledge:
- Deep understanding of MTTR, MTBF, availability, and resilience engineering.
- Experience defining SLOs and SLIs and measuring services against them.
- Knowledge of alert optimization and signal-to-noise improvement.
- Strong analytical skills for failure pattern recognition.
- Experience contributing to operational readiness and service reliability.
Nice to Have:
- Experience with APM tools.
- Knowledge of automation or scripting (Python, Bash).
- Exposure to cloud-native monitoring.
- Familiarity with ITIL processes.
Soft Skills:
- Strong analytical and systems-thinking mindset.
- Calm and effective under incident pressure.
- Cross-team collaboration skills.
- Clear reporting and documentation.
Work Conditions:
- Participation in incident scenarios and reliability reviews.
- Close collaboration with Service Operations teams.