EBCOM, a subsidiary of MCI (Hamrah-e Aval), is seeking a Senior Infrastructure & Service Support Engineer to oversee the health, performance, and availability of our critical infrastructure. In this role, you will monitor, optimize, and troubleshoot infrastructure across the compute, storage, and network layers using tools such as Zabbix, Prometheus, and Grafana. Your expertise will directly affect the efficiency and scalability of our infrastructure services, supporting seamless operations and proactive problem resolution.
Job Description
Infrastructure Health Management & Monitoring:
- Oversee the health management and continuous monitoring of infrastructure, including compute, storage, and network layers.
- Utilize Zabbix, Prometheus, and Grafana to monitor network availability, stability, and performance.
- Proactively address any abnormal behavior patterns or disruptions in infrastructure.
- Maintain high availability of infrastructure services so that performance meets organizational standards.
Infrastructure Visibility & Network Monitoring:
- Maintain end-to-end visibility into network performance across servers, storage, and networking components using Visibility Cloud Monitoring.
- Ensure seamless monitoring of network traffic, bandwidth usage, and server connectivity.
- Proactively monitor network congestion, latency, and potential disruptions to ensure smooth operations.
Capacity & Utilization Analysis:
- Analyze capacity utilization across compute, storage, and network resources using Prometheus and Grafana for metrics collection.
- Identify underutilized or overutilized components and provide optimization recommendations.
- Support capacity planning to ensure scalability and performance of infrastructure services.
Performance & Availability Reporting:
- Continuously monitor the performance and availability of infrastructure systems and ensure that reporting is accurate.
- Deliver regular service health reports using data collected from Zabbix, Prometheus, Grafana, and Visibility Cloud Monitoring.
- Translate infrastructure performance data into actionable insights for operational improvements.
Problem-Solving & Analytical Support:
- Resolve complex network and infrastructure issues (hardware, network performance, storage).
- Perform root cause analysis (RCA) and provide corrective actions for infrastructure failures and performance degradation.
- Collaborate with engineering and network teams to implement solutions for recurring issues and potential bottlenecks.
Service Requests & Coordination:
- Manage and coordinate infrastructure-related service requests using SDM for iService.
- Ensure accurate tracking of service requests, incidents, and resolutions.
- Work with internal teams and vendors to ensure timely response and resolution within SLAs.
Required Technical Skills
- Hands-on experience with infrastructure monitoring tools such as Zabbix, Prometheus, and Grafana.
- Familiarity with Kibana for log analysis and with Visibility Cloud Monitoring tools.
- Strong understanding of networking protocols (TCP/IP, DNS, HTTP, etc.) and network performance management.
- Experience with server infrastructure (Linux/Unix and Windows environments).
- Proficiency in capacity analysis and resource optimization.
- Knowledge of network performance troubleshooting, including latency and congestion issues.
- Understanding of ITIL processes and familiarity with incident, problem, and change management.
- Experience with ticketing systems (iService).
Problem-Solving & Analytical Skills
- Strong analytical skills for identifying and resolving network and infrastructure issues.
- Ability to correlate performance data with operational incidents.
- Proficiency in root cause analysis (RCA) and identifying optimization opportunities.
- Ability to troubleshoot network connectivity, performance degradation, and infrastructure failures.
Soft Skills
- Strong communication and interpersonal skills for cross-team collaboration.
- Ability to remain calm and effective under pressure, especially during critical incidents.
- Excellent organizational and time management skills.
- Strong reporting and documentation capabilities.
Work Conditions
- Participation in on-call rotation for high-priority incidents.
- Close collaboration with Monitoring Hub, SysAdmin, DevOps, and Data Center Teams.