AI Evaluation Engineer (Software Engineering / Code)
Gramian Consulting Group • CO, EG, KE, GH, NG, BR
<p><strong>About Us</strong></p><p>Gramian Consulting Group is a boutique consultancy specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.</p><p><strong>Role Overview</strong></p><p>We are looking for an <strong>AI Evaluation Engineer specializing in software engineering</strong> to design benchmark tasks based on real-world coding workflows.</p><p>You will create scenarios in which AI systems must analyze large codebases, apply precise changes (bug fixes, refactors, migrations), and produce correct, testable outputs.</p><p><strong>Commitment required: 8 hours per day, with a 4-hour overlap with PST.</strong></p><p><strong>Employment type: Contractor assignment (no medical/paid leave)</strong></p><p><strong>Duration of contract: 4+ weeks</strong></p><p><strong>Location: Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam</strong></p><p><strong>Interview: take-home assessment</strong></p><h3><strong>Responsibilities</strong></h3><ul><li>Design and build <strong>multi-agent benchmark tasks</strong> based on real-world code changes (bug fixes, migrations, refactors)</li><li>Work with the <strong>Harbor evaluation framework</strong> to run and validate tasks in containerized environments</li><li>Write <strong>clear, precise task instructions</strong> (file paths, function signatures, expected behavior, constraints)</li><li>Develop <strong>Python-based verification scripts</strong> to validate the correctness of code changes</li><li>Define <strong>task decomposition strategies</strong> across multiple specialized agents</li><li>Analyze and navigate <strong>large open-source codebases</strong> to extract realistic task scenarios</li><li>Run, debug, and refine tasks in <strong>Docker environments</strong> to ensure reproducibility</li><li>Improve task quality, clarity, and difficulty based on evaluation results</li></ul><h3><strong>Requirements</strong></h3><ul><li>5+ years of experience in <strong>software development (Python and JavaScript)</strong></li><li>Strong experience working with <strong>large codebases</strong> (e.g., Django, Flask, FastAPI, Node.js, or similar)</li><li>Familiarity with <strong>Git workflows</strong> (pull requests, diffs, commits, cherry-picking)</li><li>Experience writing <strong>tests or validation scripts</strong> (pytest, unittest, or similar)</li><li>Ability to write <strong>clear, precise technical specifications</strong></li><li>Familiarity with <strong>AI coding benchmarks or evaluation frameworks</strong> (e.g., SWE-bench or similar)</li><li>Hands-on experience with <strong>Docker</strong> (Dockerfiles, image builds, debugging)</li></ul><h3><strong>Nice to Have</strong></h3><ul><li>Experience contributing to or maintaining <strong>open-source projects</strong></li><li>Experience with <strong>code migrations or large-scale refactoring</strong></li><li>Familiarity with <strong>CI/CD pipelines and automated testing workflows</strong></li><li>Exposure to <strong>LLM-based coding tools or evaluation frameworks</strong></li></ul>