Senior Software Engineer — AI Evaluation & Benchmarks (Python)
Company: Jobgether
Location: Location not specified (Remote)
Type: Full-time
Remote: Yes
Posted: 2026-05-15
About this role
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Software Engineer — AI Evaluation & Benchmarks (Python) in the United States.
In this highly specialized engineering role, you will help define how frontier AI systems are evaluated on real-world software engineering tasks. You will design and build the benchmarks, datasets, and evaluation pipelines used to measure coding ability across debugging, reasoning, and production-grade development scenarios. This position sits at the intersection of software engineering and AI research, where your work directly influences how next-generation models are trained and improved.

You will develop scalable systems to run evaluations across large and complex codebases, analyze model outputs for correctness and edge-case failures, and translate findings into structured improvements in benchmark design. The role requires deep technical rigor, strong Python expertise, and a product-minded approach to experimentation and iteration. You will operate in a fast-moving, remote-first environment focused on innovation, precision, and impact on the future of AI systems.
Accountabilities
- Design and build coding benchmarks that evaluate frontier AI models on real-world software engineering tasks, including debugging, reasoning, and production-level coding challenges.
- Develop and maintain scalable evaluation pipelines and data infrastructure to support large-scale model testing workflows.
- Analyze AI-generated code for correctness, robustness, performance issues, and edge-case failures across diverse programming scenarios.
- Construct structured evaluation environments across large repositories and multi-language codebases to ensure rigorous model assessment.
- Provide detailed technical feedback on model behavior, failure modes, and performance patterns to improve benchmarking frameworks.
- Contribute to the design and evolution of evaluation methodologies that define stan...