Department: Technology
Location: Tehran
Work Type: Full - time
About the Role:
We are seeking a dedicated and proactive Monitoring Specialist to join our 24x7 operations team.The candidate will be responsible for ensuring the seamless performance, availability, and reliability ofapplications and systems through advanced monitoring tools like Grafana, ELK (Elasticsearch,Logstash, Kibana), Prometheus, and other APM (Application Performance Monitoring) tools. This roledemands real-time monitoring, swift troubleshooting, and effective communication to maintain andenhance operational performance.
Responsibilities:
Monitoring and Incident Management:
Continuously monitor application and system performance using Grafana, ELK, Prometheus, and othertools.
Identify, analyze, and resolve performance bottlenecks, latency issues, and system alerts.
Manage and escalate incidents based on defined SLAs and protocol.
Proactive Issue Resolution:
Conduct root cause analysis (RCA) for recurring issues and provide recommendations for resolution.
Develop and implement automated alerting and escalation mechanisms to enhanceoperationalefficien.
System Health and Optimization:
Analyze metrics and logs to ensure optimal performance and system reliability.
Collaborate with development, infrastructure, and DevOps teams for performance tuning and
capacityplanning.
Documentation and Reporting:
Maintain detailed logs of incidents, resolutions, and RCA outcomes.
Generate periodic reports on system performance, availability, and incident trends for stakeholders.
Continuous Improvement:
Recommend and implement enhancements to monitoring dashboards and tools.
Stay updated on the latest monitoring technologies and integrate them into existing workflows.
Requirements
Proficient in monitoring tools like Grafana, ELK (Elasticsearch, Logstash, Kibana), Prometheus, asimilar platforms.
Hands-on experience with Application Performance Monitoring (APM) tools.
Good understanding of Linux/Unix operating systems.
Familiarity with cloud platforms and containerized environments (Docker, Kubernetes).
Analytical and Problem-Solving Skills:
Ability to interpret logs, metrics, and system alerts effectively.
Strong troubleshooting skills for applications, infrastructure, and network layers.
Communication and Collaboration:
Excellent communication skills to report and escalate issues promptly.
Experience collaborating with cross-functional teams, including DevOps, developers, and infrastructure specialists.
این آگهی از وبسایت ایران استخدام پیدا شده، با زدن دکمهی تماس با کارفرما، به وبسایت ایران استخدام برین و از اونجا برای این شغل اقدام کنین.