Pluralis Research is pioneering Protocol Learning, a fully decentralized way to train and deploy AI models. We're looking for an ML Training Platform Engineer to architect, build, and scale the foundational infrastructure powering our decentralized ML training platform.
Requirements
- Multi-Cloud Infrastructure: Design resource management systems provisioning and orchestrating compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform).
- Distributed Training Systems: Architect fault-tolerant infrastructure for distributed ML training.
- Real-World Networking: Build systems that simulate and handle real-world network conditions — bandwidth shaping, latency injection, packet loss — while managing dynamic node churn and ensuring efficient data flow across workers with heterogeneous connectivity.
- Deep experience in Infrastructure & Platform Engineering, Distributed Systems & ML Infrastructure, and Systems Programming & Reliability.
- Experience in a startup environment with an emphasis on microservices orchestration, or a big tech background.
- A deep understanding of multi-cloud infrastructure and distributed training systems.