LinkedIn is seeking a Senior Staff Software Engineer - HPC Network Engineering to design, deploy, and operate high-performance, low-latency Ethernet fabrics for large-scale GPU clusters. The role focuses on RoCE v2–based GPU interconnect networks supporting AI/ML training, inference, and HPC workloads.
Requirements
- Network architecture and design for large-scale LLM training and inference workloads.
- Design RoCE v2–based GPU interconnection fabrics for multi-rack and multi-pod GPU clusters.
- Define lossless Ethernet architectures (Clos / fat-tree / leaf-spine) optimized for RDMA.
- Select and validate 400G / 800G Ethernet switching platforms and NICs (ConnectX, BlueField, etc.).
- Deep expertise in host-level and Kubernetes pod networking architectures, including enablement of high-performance features such as RDMA and GPUDirect.
- Experience in host network performance tuning for large-scale collective communications, balancing latency, throughput, and congestion control.
- Analyze system performance and diagnose complex cross-layer issues.
Basic Qualifications:
- BA/BS degree in Computer Science or a related technical discipline, or equivalent practical experience.
- 10+ years of experience building and operating large-scale distributed systems or data-intensive backend platforms.
- Experience in one or more programming languages such as Go, Python, C++, or similar.
- Experience in Linux system engineering and host networking.
- Demonstrated knowledge of network protocols, fabric design, and performance optimization.
- Proven ability to lead complex technical initiatives end-to-end in a multi-team environment.
- Strong system design skills with a focus on scalability, reliability, and performance.
- Experience with container platforms (Kubernetes) and microservices.
Preferred Qualifications:
- Experience supporting large-scale AI or HPC workloads.
- Familiarity with LLM training frameworks and communication libraries (e.g., NCCL, MPI).
- Experience with streaming systems (Kafka, Flink, Spark Streaming, or similar) and high-throughput data pipeline architectures.
- Experience with performance benchmarking and profiling tools.
- Experience with infrastructure automation or configuration management tools.
- Demonstrated influence across organizations (tech lead, architect, principal/IC leadership roles).
Suggested Skills: Distributed Systems, HPC Networking, Performance Optimization, Technical Leadership
Benefits
- Generous health and wellness programs
- Time away for employees of all levels
- Annual performance bonus
- Stock
- Incentive compensation plans