Senior Software Engineer - NVLink Rack Scale Stability and Reliability
Company: Nvidia
Location: US, CA, Santa Clara (Remote)
Salary: $152k - $241.5k per year
Type: Full-time
Remote: Yes
Posted: 2026-06-19
About this role
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.
We are looking for highly motivated Senior Software Engineers to join our Fabric Networking team with a targeted focus on NVLink Rack-Scale Systems Stability & Reliability. In this role, you will partner closely with architects and developers building our next-generation NVLink and NVSwitch systems, helping transform first-of-their-kind platforms into stable, reliable, and volume production-ready systems. You will work on complex system-level challenges spanning resiliency, diagnostics, recovery, and large-scale AI infrastructure, contributing directly to the software foundation powering next-generation datacenter deployments.
What you will be doing:
- Drive platform bring-up, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems.
- Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.
- Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.
- Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.
- Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability.
- Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readines...