NVIDIA Corporation
Redmond, WA
We are now looking for a Senior Software Engineer for AI Resiliency!

At NVIDIA, we are pushing the boundaries of what's possible in AI. We are currently seeking a Senior Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times.

What You'll Be Doing:

Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.

Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality,...