Senior Site Reliability Engineer with deep expertise in optimizing system reliability, performance, and scalability across cloud environments (Azure, Kubernetes, Service Mesh). Responsible for defining, measuring, and improving Service Level Objectives (SLOs), managing error budgets, and automating toil to drive operational excellence in a blameless culture.

Requirements

10+ years of experience in a Site Reliability Engineering, Production Engineering, or equivalent role.
5+ years of experience working with Kubernetes or similar microservice architecture.
5+ years of experience working in an Azure environment.
Proven experience defining and implementing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) and managing error budgets.
Experience working in an agile environment and knowledge of agile practices.
Experience with Jira for project management and story creation is a plus.
Experience with CI/CD systems, preferably Azure DevOps or GitHub Actions.
Strong understanding of networking and routing protocols, especially those involved in Service Mesh architectures.
Experience incorporating AI tools such as ChatGPT, Cursor, Codex, or GitHub Copilot into your day-to-day work.
Must be able to work in an on-call rotation with a focus on sustainable incident response and post-mortem analysis (blameless culture).

Benefits

Flexible working culture
Incentive programs
20 days PTO every year
Generous paid parental leave
Leading family support policies
Company-sponsored 401(k) match
Learning and wellness subscription stipend
Beautiful Union Square office with a casual dress code
Industry-leading, employer-sponsored insurance for you and your dependents, with several 100% Zip-covered choices available

Job Details

About Zip Co Limited