Software Engineer, Frontier Clusters Infrastructure
Company: OpenAI
Location: San Francisco, CA
Salary: $230k - $490k per year
Type: Full-time
Level: Junior
Posted: 2026-02-23
About this role
About The Team
The Frontier Systems team at OpenAI builds, launches, and supports the largest supercomputers in the world, which OpenAI uses for its most cutting-edge model training.
We take data center designs, turn them into real, working systems, and build any software needed to run large-scale frontier model training.
Our mission is to bring up and stabilize these hyperscale supercomputers and keep them reliable and efficient during the training of frontier models.
About The Role
We are looking for engineers to operate the next generation of compute clusters that power OpenAI’s frontier research.
This role blends distributed systems engineering with hands-on infrastructure work in our largest data centers. You will grow Kubernetes clusters to massive scale, automate bare-metal bring-up, and build the software layer that hides the complexity of a multitude of nodes across multiple data centers.
You will work at the intersection of hardware and software, where speed and reliability are critical. Expect to manage fast-moving operations, quickly diagnose and fix issues when things are on fire, and continuously raise the bar for automation and uptime.
In This Role, You Will
- Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
- Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
- Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
- Improve operational metrics, for example by reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
- Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
- Develop monitoring and observability systems to detect issues early and keep clusters stable under extrem...