Senior Site Reliability Engineer with deep expertise in optimizing system reliability, performance, and scalability across cloud environments (Azure, Kubernetes, Service Mesh). Responsible for defining, measuring, and improving Service Level Objectives (SLOs), managing error budgets, and automating toil to drive operational excellence in a blameless culture.
Requirements
- 10+ years of experience in a Site Reliability Engineering, Production Engineering, or equivalent role.
- 5+ years of experience working with Kubernetes or a similar container orchestration platform in a microservice architecture.
- 5+ years of experience working in an Azure environment.
- Proven experience defining and implementing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) and managing error budgets.
- Experience working in an agile environment and knowledge of agile practices.
- Experience using Jira for project management and story creation is a plus.
- Experience with CI/CD systems, preferably Azure DevOps or GitHub Actions.
- Strong understanding of networking and routing protocols, especially those involved in Service Mesh architectures.
- Experience incorporating AI tools such as ChatGPT, Cursor, Codex, or GitHub Copilot into your day-to-day work.
- Must be able to work in an on-call rotation, with a focus on sustainable incident response and blameless post-mortem analysis.
Benefits
- Flexible working culture
- Incentive programs
- 20 days PTO every year
- Generous paid parental leave
- Leading family support policies
- Company-sponsored 401(k) match
- Learning and wellness subscription stipend
- Beautiful Union Square office with a casual dress code
- Industry-leading, employer-sponsored insurance for you and your dependents, with several 100% Zip-covered choices available