Senior Software Engineer — AI Evaluation & Benchmarks (Python)
Company: Jobgether
Location: Location not specified (Remote)
Type: Full-time
Remote: Yes
Posted: 2026-05-15
About this role
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Software Engineer — AI Evaluation & Benchmarks (Python) in the United States.
In this highly specialized engineering role, you will help define how frontier AI systems are evaluated on real-world software engineering tasks. You will design and build the benchmarks, datasets, and evaluation pipelines used to measure coding ability across debugging, reasoning, and production-grade development scenarios. This position sits at the intersection of software engineering and AI research, where your work directly influences how next-generation models are trained and improved.

You will develop scalable systems to run evaluations across large and complex codebases, analyze model outputs for correctness and edge-case failures, and translate findings into structured improvements in benchmark design. The role requires deep technical rigor, strong Python expertise, and a product-minded approach to experimentation and iteration. You will operate in a fast-moving, remote-first environment focused on innovation, precision, and impact on the future of AI systems.
Accountabilities
- Design and build coding benchmarks that evaluate frontier AI models on real-world software engineering tasks, including debugging, reasoning, and production-level coding challenges.
- Develop and maintain scalable evaluation pipelines and data infrastructure to support large-scale model testing workflows.
- Analyze AI-generated code for correctness, robustness, performance issues, and edge-case failures across diverse programming scenarios.
- Construct structured evaluation environments across large repositories and multi-language codebases to ensure rigorous model assessment.
- Provide detailed technical feedback on model behavior, failure modes, and performance patterns to improve benchmarking frameworks.
- Contribute to the design and evolution of evaluation methodologies that define stan...