The Staff Site Reliability Engineer will play a critical role in building and scaling the infrastructure behind ServiceTitan’s new AI platform. The role requires both technical depth and strategic thinking: someone who can architect solutions, mentor teams, and drive operational excellence across engineering.
Responsibilities
- Lead the design, implementation, and optimization of scalable, resilient infrastructure for cloud-native AI services on Azure.
- Establish continuous delivery (CD) pipelines that support blue-green deployments, automatic rollbacks, and progressive delivery patterns.
- Champion observability excellence: define best practices for metrics, tracing, and logging, and help product teams design meaningful SLIs, SLOs, and error budgets.
- Drive automation across the entire lifecycle: infrastructure provisioning, testing, deployment, and recovery.
- Partner with the engineering team to design reliable, fault-tolerant services and perform resilience and capacity reviews.
- Extend observability practices beyond service health to track the end-to-end success and failure of complex, automated agent workflows and their business impact, expressed as SLIs and SLOs.
- Use Infrastructure as Code (IaC) with Terraform, together with Kubernetes and Docker, to standardize environments and reduce manual intervention.
- Contribute to and maintain CI/CD pipelines using GitHub Actions, Azure DevOps, or TeamCity.
- Implement and improve service health dashboards with Mimir, Grafana, Prometheus, or the ELK stack to ensure system visibility and reliability.
- Mentor engineers and foster a reliability culture across teams — enabling others to build self-healing, observable systems.
Benefits
- Flexible time off
- Comprehensive onboarding program
- Leadership training
- Bonusly
- Peer-nominated awards
- Company-paid medical, dental, and vision
- Insurance for parents and siblings
- Wellness benefit
- Office massage
- Parental leave and support
- Financial planning tools
- Employee Assistance Program services