AI Evaluation Engineer (Software Engineering / Code)
Gramian Consulting Group • CO, EG, KE, GH, NG, BR
<p><strong>About Us</strong></p><p>Gramian Consulting Group is a boutique consultancy specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.</p><p><strong>Role Overview</strong></p><p>We are looking for an <strong>AI Evaluation Engineer specializing in software engineering</strong> to design benchmark tasks based on real-world coding workflows.</p><p>You will create scenarios in which AI systems must analyze large codebases, apply precise changes (bug fixes, refactors, migrations), and produce correct, testable outputs.</p><p><strong>Commitment required: 8 hours per day, with a 4-hour overlap with PST.</strong></p><p><strong>Employment type: Contractor assignment (no medical/paid leave)</strong></p><p><strong>Duration of contract: 4+ weeks</strong></p><p><strong>Location: Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam</strong></p><p><strong>Interview: take-home assessment</strong></p><h3><strong>Responsibilities</strong></h3><ul><li>Design and build <strong>multi-agent benchmark tasks</strong> based on real-world code changes (bug fixes, migrations, refactors)</li><li>Work with the <strong>Harbor evaluation framework</strong> to run and validate tasks in containerized environments</li><li>Write <strong>clear, precise task instructions</strong> (file paths, function signatures, expected behavior, constraints)</li><li>Develop <strong>Python-based verification scripts</strong> to validate the correctness of code changes</li><li>Define <strong>task decomposition strategies</strong> across multiple specialized agents</li><li>Analyze and navigate <strong>large open-source codebases</strong> to extract realistic task scenarios</li><li>Run, debug, and refine tasks in <strong>Docker environments</strong> to ensure reproducibility</li><li>Improve task quality, clarity, and difficulty based on evaluation results</li></ul><h3><strong>Requirements</strong></h3><ul><li>5+ years of experience in <strong>software development (Python and JavaScript)</strong></li><li>Strong experience working with <strong>large codebases</strong> (e.g., Django, Flask, FastAPI, Node.js, or similar)</li><li>Familiarity with <strong>Git workflows</strong> (pull requests, diffs, commits, cherry-picking)</li><li>Experience writing <strong>tests or validation scripts</strong> (pytest, unittest, or similar)</li><li>Ability to write <strong>clear, precise technical specifications</strong></li><li>Familiarity with <strong>AI coding benchmarks or evaluation frameworks</strong> (e.g., SWE-bench or similar)</li><li>Hands-on experience with <strong>Docker</strong> (Dockerfiles, image builds, debugging)</li></ul><h3><strong>Nice to Have</strong></h3><ul><li>Experience contributing to or maintaining <strong>open-source projects</strong></li><li>Experience with <strong>code migrations or large-scale refactoring</strong></li><li>Familiarity with <strong>CI/CD pipelines and automated testing workflows</strong></li><li>Exposure to <strong>LLM-based coding tools or evaluation frameworks</strong></li></ul>