Senior Storage Production Engineer - DGX Cloud

Company: NVIDIA

Location: US, CA, Santa Clara (Remote)

Salary: $176k - $276k per year

Type: Full-time

Remote: Yes

Posted: 2026-06-14

About this role

Production engineering is the practice of designing, building, and maintaining large-scale production systems with high efficiency and availability. It spans software and systems engineering practices, storage, data management, and services. Production Engineers bring specialized expertise in storage architecture, high-performance distributed storage, data management, systems, networking, coding, database management, prioritization, and continuous delivery and deployment, along with open-source cloud-enabling technologies such as Kubernetes, containers, and virtualization. They ensure that storage architectures are reliable, scalable, and efficient; optimize data placement and access patterns; manage large-scale distributed storage systems; and ensure low-latency data access for HPC and AI/ML workloads.

Storage Production Engineers at NVIDIA ensure that our internal- and external-facing GPU cloud services meet the reliability and uptime goals promised to users, while enabling developers to change the existing system through careful preparation and planning, with an eye on capacity, latency, and performance. The role also requires a mindset focused on automating storage operations, improving data access efficiency, and optimizing storage performance. Much of our software development focuses on optimizing operations through automation, enhancing system responsiveness, and improving the efficiency of storage and production systems. Because Production Engineers are responsible for the big picture of how our systems interface with each other, we use a breadth of tools and approaches to tackle a broad spectrum of challenges. Practices such as proactive storage performance monitoring, automated fault detection and remediation, scalable data redundancy methods, and integration of intelligent caching mechanisms factor into...
