Pathway is an AI startup hiring a Senior ML Infrastructure / DevOps Engineer to scale its GPU clusters, automate its ML platform, and work with its R&D team to productionize ML workloads. The role is remote and can be based in the EU, US, or Canada.

Requirements

5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high-performance or ML workloads.
Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments.
Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch.
Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI).
Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations.
Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents).
Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management.
Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full-time model developer.

Benefits

Inclusive workplace culture
Real responsibility and the ability to make a significant contribution to the company's success
Intellectually stimulating work environment
Exciting career prospects
