We are looking for a Site Reliability Engineer to join the core team responsible for our high-transaction platform. You will be instrumental in ensuring the reliability, performance, and scalability of a distributed system built on Kubernetes, PostgreSQL, Redis, Go, and PHP.
Key Responsibilities:
- Architect & Operate: Design, build, and maintain our core infrastructure on Kubernetes, ensuring high availability and scalability for our high-volume transaction processing services.
- Performance Engineering: Drive performance tuning and optimization initiatives across the stack, from PostgreSQL query analysis and Redis caching strategies to application-level performance profiling in Go and PHP.
- Comprehensive Observability: Develop and manage a robust observability strategy using tools like Prometheus and Grafana. Your goal is to create actionable alerting and insightful dashboards that provide deep visibility into every service.
- Automation: Build and refine CI/CD pipelines, automate infrastructure provisioning (IaC), and script routine operational tasks to minimize manual intervention and improve engineering velocity.
- Security Hardening: Champion and implement application and infrastructure security best practices. This includes vulnerability management, secure container builds, secrets management, and proactive threat modeling.
- Incident Response: Participate in an on-call rotation to troubleshoot and resolve production incidents, and conduct blameless root cause analysis to prevent recurrence.
- Collaboration: Work closely with development teams to establish Service Level Objectives (SLOs), consult on building reliable services, and embed SRE principles into the software development lifecycle.
Required Qualifications:
- 4+ years of experience in a Site Reliability, DevOps, or similar infrastructure-focused engineering role.
- Deep, hands-on expertise in managing production workloads on Kubernetes.
- Strong experience managing and optimizing high-transaction-volume PostgreSQL databases and Redis caches.
- Proficiency in scripting and automation, with practical experience in either Go or PHP (ability to read both is a major plus).
- Proven experience building and managing observability stacks (e.g., Prometheus, Grafana, ELK/Loki, Jaeger).
- Experience with Infrastructure as Code (IaC) tools such as Terraform.
Preferred Qualifications:
- Strong understanding of application security principles (OWASP Top 10) and infrastructure hardening techniques.
- Experience with CI/CD tools (e.g., GitLab CI, ArgoCD, Jenkins).
- Knowledge of networking concepts within a cloud-native environment (e.g., CNI, service mesh).
- Experience in a high-transaction-volume industry such as financial services or e-commerce.