The Staff Site Reliability Engineer will play a critical role in building and scaling the infrastructure behind ServiceTitan’s new AI platform. The role requires both technical depth and strategic thinking — someone who can architect solutions, mentor teams, and enable true operational excellence across engineering.

Requirements

Lead the design, implementation, and optimization of scalable, resilient infrastructure for cloud-native AI services on Azure.
Establish true continuous delivery (CD) pipelines supporting blue-green deployments, automatic rollbacks, and progressive delivery patterns.
Champion observability excellence: define best practices for metrics, tracing, and logging, and help product teams design meaningful SLIs, SLOs, and error budgets.
Drive automation across the entire lifecycle: infrastructure provisioning, testing, deployment, and recovery.
Partner with the engineering team to design reliable, fault-tolerant services and perform resilience and capacity reviews.
Extend observability beyond service health to track the end-to-end success or failure of complex, automated agent workflows and their business impact, expressed as SLIs and SLOs.
Leverage Infrastructure as Code (IaC) using Terraform, Kubernetes, and Docker to standardize environments and reduce manual intervention.
Contribute to and maintain CI/CD pipelines using GitHub Actions, Azure DevOps, or TeamCity.
Implement and improve service health dashboards with Mimir, Grafana, Prometheus, or the ELK stack to ensure system visibility and reliability.
Mentor engineers and foster a reliability culture across teams — enabling others to build self-healing, observable systems.

Benefits

Flexible time off
Comprehensive onboarding program
Leadership training
Bonusly
Peer-nominated awards
Company-paid medical, dental, and vision
Parent and siblings’ insurance
Wellness benefit
Office massage
Parental leave and support
Financial planning tools
Employee Assistance Program services

Job Details

About ServiceTitan