Senior Cloud Operations Engineer
Company: NVIDIA
Location: California (Remote)
Type: Contract
Remote: Yes
Posted: 2026-06-23
About this role
At NVIDIA, we are seeking a highly skilled Senior Operations Engineer to join our world-class NGC Cloud team. In this role, you will help drive the efficiency, reliability, and scalability of the systems that power our global business operations. This is an exceptional opportunity to shape how we automate, streamline, and support critical operational workflows across the organization. You will define how we implement innovative automation and support solutions, enabling teams to operate seamlessly and deliver impact at global scale, all within an encouraging and inclusive environment.
What You'll Be Doing
- Driving day-to-day interactions with NVIDIA-wide IT subsystems, ensuring smooth operational workflows across infrastructure and applications.
- Crafting and maintaining GitLab CI/CD pipelines to automate build, test, and deployment workflows.
- Monitoring system health, building/maintaining dashboards, creating alerts, and producing operational reports.
- Performing user offboarding, access reviews, and compliance-related tasks across multiple systems.
- Driving interactions with various IT subsystems, ensuring API performance and integration stability meet defined SLAs and SLOs.
- Coordinating changes and releases between engineering, operations, and security teams.
- Enforcing security guidelines, managing vulnerability remediation, and collaborating with security teams on audits and assessments.
- Maintaining documentation, SOPs, and process improvements to enhance operational maturity.
What We Need To See
- 8+ years of hands-on experience building/supporting complex services and BS/MS in Computer Science (or equivalent experience).
- Knowledge of Python for automation, data handling, and tool development.
- Experience with monitoring tools (such as Prometheus, Grafana, Datadog, CloudWatch, Splunk) and reporting.
- Familiarity with ITSM practices, including incident, problem, and change management processes.
- Ability to perform secure and compliant o...