Nvidia Corporation
Santa Clara, CA
NVIDIA's DGX Cloud (DGXC) powers AI for strategic research and product workloads. The company seeks an expert Technical Program Manager (IC5) to lead strategic programs emphasizing resilience, reliability, and goodput. This role requires collaboration across multiple teams. It involves driving improvements in resilience, service stability, and operational scale. The TPM also guides architectural decisions related to resilience reference architecture. The TPM leads programs spanning DGXC infrastructure, Resilience Tools, and core platform services to deliver fault-tolerant, high-availability training and inference environments at scale. We are looking for a TPM who is analytical, technically skilled, and comfortable working with cloud infrastructure, software, operations, and environments driven by data and research. You will work closely with engineering, SRE, operations, and researchers to develop scalable resilience strategies, improve operational performance, and assist in...