Senior Software Engineer — AI Evaluation & Benchmarks (Python)

Company: Jobgether

Location: Not specified (Remote)

Type: Full-time

Remote: Yes

Posted: 2026-05-15

About this role

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Software Engineer — AI Evaluation & Benchmarks (Python) in the United States.
In this highly specialized engineering role, you will help define how frontier AI systems are evaluated on real-world software engineering tasks. You will design and build the benchmarks, datasets, and evaluation pipelines used to measure coding ability across debugging, reasoning, and production-grade development scenarios. This position sits at the intersection of software engineering and AI research, where your work directly influences how next-generation models are trained and improved. You will develop scalable systems to run evaluations across large and complex codebases, analyze model outputs for correctness and edge-case failures, and translate findings into structured improvements in benchmark design. The role requires deep technical rigor, strong Python expertise, and a product-minded approach to experimentation and iteration. You will operate in a fast-moving, remote-first environment focused on innovation, precision, and impact on the future of AI systems.


Accountabilities

  • Design and build coding benchmarks that evaluate frontier AI models on real-world software engineering tasks, including debugging, reasoning, and production-level coding challenges.
  • Develop and maintain scalable evaluation pipelines and data infrastructure to support large-scale model testing workflows.
  • Analyze AI-generated code for correctness, robustness, performance issues, and edge-case failures across diverse programming scenarios.
  • Construct structured evaluation environments across large repositories and multi-language codebases to ensure rigorous model assessment.
  • Provide detailed technical feedback on model behavior, failure modes, and performance patterns to improve benchmarking frameworks.
  • Contribute to the design and evolution of evaluation methodologies that define stan...
