Ad code: KP543613151

Hiring: Senior Observability & Reliability Engineer

همراه کسب و کارهای هوشمند | E Business Companion

Location: Tehran

Posted on the Jobinja website (1 day ago)

Job information:

Employment type: Full-time

Required education: Bachelor's degree

Military service status: Completion-of-service card or exemption

Required skills:

ELK

Splunk

Zabbix

Kibana

Salary: Negotiable

Full text of the ad:

EBCOM, a proud subsidiary of MCI (Hamrah-e Aval), is seeking a Senior Observability & Reliability Engineer to enhance the reliability and observability of our services. In this critical role, you'll design and implement advanced monitoring solutions, improve system visibility, and ensure the seamless performance of our distributed systems. Your expertise in reliability engineering will be key in minimizing incidents and contributing to the continuous improvement of our operational resilience.

Job Description

Observability Engineering:

Design and enhance observability architecture (Metrics, Logs, Traces).
Implement centralized monitoring and telemetry collection strategies.
Build advanced dashboards and visualization layers for service health.
Define effective alerting models with proper thresholds and noise reduction.
Improve system visibility across distributed services and APIs.

Reliability Engineering:

Apply reliability principles to reduce incidents and service degradation.
Define and track SLOs, SLAs, SLIs, and error budgets.
Conduct reliability and availability analysis across services.
Identify systemic weaknesses and recurring failure patterns.
Contribute to resilience improvements and operational hardening.

Incident Intelligence & Analysis:

Lead advanced alert triage and event correlation.
Support incident response with deep telemetry insights.
Participate in root cause analysis (RCA) and post-incident reviews.
Develop monitoring improvements based on incident learnings.

Performance & Capacity Insight:

Analyze performance trends and detect anomalies.
Support capacity planning using monitoring data.
Deliver service reliability and performance reports.

Required Technical Skills:

Strong hands-on experience with monitoring & visualization tools: Prometheus, Zabbix, Grafana, Splunk, Kibana.
Experience with log aggregation and analysis platforms: ELK Stack or Splunk.
Solid understanding of Observability concepts (Metrics, Logs, Tracing, Telemetry).
Familiarity with distributed systems monitoring.
Linux system knowledge and performance troubleshooting.
Understanding of network and infrastructure monitoring fundamentals.
Experience with API/service monitoring.
Experience with event management and ticketing systems.

Reliability & Engineering Knowledge:

Deep understanding of MTTR, MTBF, availability, and resilience engineering.
Experience defining and measuring SLO/SLI frameworks.
Knowledge of alert optimization and signal-to-noise improvement.
Strong analytical skills for failure pattern recognition.
Experience contributing to operational readiness and service reliability.

Nice to Have:

Experience with APM tools.
Knowledge of automation or scripting (Python, Bash).
Exposure to cloud-native monitoring.
Familiarity with ITIL processes.

Soft Skills:

Strong analytical and systems-thinking mindset.
Calm and effective under incident pressure.
Cross-team collaboration skills.
Clear reporting and documentation.

Work Conditions:

Participation in incident scenarios and reliability reviews.
Close collaboration with Service Operations teams.


This ad was sourced from the Jobinja website. To apply for this position, go to Jobinja and submit your application there.
