At Krea, we are building next-generation AI creative tools. We are dedicated to making AI intuitive and controllable for creatives. Our mission is to build tools that empower human creativity, not replace it.
Requirements
- Design, build, and maintain large-scale distributed infrastructure to reliably support AI research and real-time model serving.
- Own and scale our multi-thousand-node Kubernetes GPU clusters, ensuring efficient and fault-tolerant operations.
- Collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment.
- Improve network architecture, optimize load balancing, and streamline operational practices across multi-zone cloud deployments.
- Experience with Kubernetes at scale (thousands of nodes), cloud infrastructure management (AWS/GCP/Azure), high-performance and fault-tolerant networking, low-level Linux interfaces and administration, and debugging complex distributed systems in production.
- Knowledge of Python, Golang, Ruby, Rust, and similar systems languages.