Pathway, an AI startup, is hiring a Senior ML Infrastructure / DevOps Engineer to scale its GPU clusters, automate its ML platform, and work with its R&D team to productionize ML workloads. The role is remote and open to candidates based in the EU, US, or Canada.
Requirements
- 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high-performance or ML workloads.
- Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments.
- Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch.
- Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI).
- Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations.
- Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents).
- Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management.
- Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full-time model developer.
Benefits
- Inclusive workplace culture
- Real responsibility and the ability to make a significant contribution to the company's success
- Intellectually stimulating work environment
- Exciting career prospects