Groq delivers fast, reliable AI inference. Our LPU-based system powers GroqCloud™, giving businesses and developers the speed and scale they need. Headquartered in Silicon Valley, we’re on a mission to make high-performance AI compute more accessible and affordable. When real-time AI is within reach, anything is possible.
Responsibilities
- Design, implement, and optimize large-scale multi-cluster Kubernetes deployments supporting mission-critical workloads.
- Build Kubernetes controllers and operators in Go to support continuous deployment strategies for model instances and production workloads (a minimal controller sketch follows this list).
- Implement advanced deployment patterns (blue/green, canary, progressive delivery) to ensure safe and reliable production rollouts; see the canary-gate sketch after this list.
- Drive GitOps practices using, and building on top of, Flux (preferred) or ArgoCD, ensuring reproducible, declarative, and auditable deployments.
- Build observability into every deployment, leveraging Prometheus, VictoriaMetrics, Grafana, and OpenTelemetry for metrics, logging, and tracing; a small metrics-export sketch follows this list.
- Architect automated rollback, health checks, and failover mechanisms to maximize uptime and deployment confidence.
- Operate and optimize deployments across multiple regions, clusters, and heterogeneous workloads.
- Partner with application, platform, and infrastructure engineers to align deployment best practices across the organization.
- Drive standards for deployment reliability, mentor peers on Kubernetes and GitOps practices, and raise the bar for automation across engineering.
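For illustration, here is a minimal sketch of the kind of Kubernetes controller this role involves, built on controller-runtime. The `ModelInstanceReconciler` name and the choice to reconcile `Deployment` objects are assumptions for the example, not Groq's actual design.

```go
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// ModelInstanceReconciler is a hypothetical reconciler that would keep
// model-serving Deployments in their desired state.
type ModelInstanceReconciler struct {
	client.Client
}

func (r *ModelInstanceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	var deploy appsv1.Deployment
	if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
		// Object may have been deleted; nothing to reconcile.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... compare observed state to desired state and converge here ...
	logger.Info("reconciled", "deployment", req.NamespacedName)
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	// Watch Deployments and route events to the reconciler above.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}).
		Complete(&ModelInstanceReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```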
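Progressive delivery of the kind described above typically gates each promotion step on health signals, with automated rollback as the failure path. The sketch below uses only the Go standard library; the `/healthz` URL and the probe count and interval are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkCanary is a hypothetical promotion gate for a canary rollout: it
// polls the canary's health endpoint for a window and only signals
// "promote" if every probe succeeds; otherwise the pipeline rolls back.
func checkCanary(url string, probes int, interval time.Duration) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	for i := 0; i < probes; i++ {
		resp, err := client.Get(url)
		if err != nil {
			return false // probe failed to connect
		}
		ok := resp.StatusCode == http.StatusOK
		resp.Body.Close()
		if !ok {
			return false // any failed probe vetoes promotion
		}
		time.Sleep(interval)
	}
	return true
}

func main() {
	// Endpoint and thresholds are examples, not Groq specifics.
	if checkCanary("http://canary.internal/healthz", 10, 3*time.Second) {
		fmt.Println("canary healthy: promote")
	} else {
		fmt.Println("canary unhealthy: roll back")
	}
}
```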
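Deployment tooling feeds dashboards by exporting its own metrics. This sketch uses the Prometheus Go client; the `deployments_total` counter and its `outcome` label are hypothetical names for the example.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// deploysTotal is an illustrative counter a deployment controller might
// export so rollout activity shows up in Prometheus/Grafana dashboards.
var deploysTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "deployments_total",
		Help: "Deployments processed, labeled by outcome.",
	},
	[]string{"outcome"},
)

func main() {
	deploysTotal.WithLabelValues("success").Inc()

	// Expose the /metrics endpoint for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	if err := http.ListenAndServe(":9090", nil); err != nil {
		panic(err)
	}
}
```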
Benefits
- Comprehensive compensation package
- Equity
- Benefits