Job Description:
We are seeking an experienced and proactive Senior Site Reliability Engineer (SRE) to join our team. As a Senior SRE, you will play a key role in enhancing the reliability, availability, and performance of our critical services and applications. You will partner with development teams to create solutions that scale efficiently and support a robust production environment.
Key Responsibilities:
• Design and implement reliable, scalable, and automated infrastructure solutions to support our applications and services.
• Collaborate with cross-functional teams to improve service reliability and performance; establish best practices in monitoring, incident response, and capacity planning.
• Develop and maintain CI/CD (Continuous Integration/Continuous Deployment) pipelines and automation to improve deployment processes and reduce downtime.
• Monitor system performance, identify bottlenecks or issues, and formulate strategies for remediation and scalability.
• Lead incident response efforts and perform root cause analysis following outages and incidents, implementing changes to prevent future occurrences.
• Mentor and guide junior SREs and other engineering team members in best practices and continuous improvement initiatives.
• Document processes, incident reports, and operational runbooks to enhance knowledge sharing and training within the team.
Requirements:
• Strong experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
• Proficiency in scripting and programming languages (e.g., Python, Go, Bash).
• Extensive experience with monitoring, visualization, and logging tools (e.g., Prometheus, Zabbix, Grafana, ELK Stack).
• Solid understanding of networking concepts, distributed systems, and microservices architecture.
• Proven track record in incident management, root cause analysis, and performance tuning.
• Excellent communication and collaboration skills to work effectively across teams.
Preferred Qualifications:
• Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation).
• Familiarity with service mesh implementations (e.g., Istio, Linkerd).
• Knowledge of security best practices in cloud environments and applications.
• Contributions to open-source projects or active participation in the SRE/DevOps community.
Benefits:
• Competitive salary and performance-based bonuses.
• Flexible working hours and remote work opportunities.
• Comprehensive health, dental, and vision insurance.
• [List any additional benefits such as retirement plans, paid time off, professional development opportunities, etc.]
• A collaborative and innovative work environment that values growth and development.