AI Evaluation Engineer (Agentic Coding / Software Engineering)
Gramian Consulting Group • CO, EG, KE, GH, NG, BR
<p><strong>About Us</strong></p><p>Gramian Consultancy is a boutique firm specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.</p><p><strong>Role overview</strong></p><p>We are looking for an <strong>AI Evaluation Engineer specializing in software engineering workflows</strong> to evaluate and improve datasets used for agentic coding models.</p><p>In this role, you will work on realistic coding tasks: reviewing model trajectories, validating outputs, and producing high-quality evaluations. This is a <strong>hands-on engineering role</strong> requiring strong debugging skills, attention to detail, and the ability to assess correctness in real code scenarios.</p><p><strong>Commitment required: 8 hours per day, with a 4-hour overlap with PST.</strong></p><p><strong>Employment type: contractor assignment (no medical or paid leave)</strong></p><p><strong>Contract duration: 5+ weeks</strong></p><p><strong>Location: Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam</strong></p><p><strong>Interview: take-home assessment (60 min)</strong></p><h3><strong>Responsibilities</strong></h3><ul><li>Execute coding tasks within <strong>agentic coding environments</strong>, maintaining strict evaluation protocols</li><li>Review and evaluate <strong>model-generated code trajectories</strong> for correctness and completeness</li><li>Validate outputs by reading code, running tests, analyzing logs, and inspecting artifacts</li><li>Perform targeted validation using <strong>scripts, tests, and manual checks</strong></li><li>Write <strong>clear, evidence-based rationales</strong> for evaluations and rankings</li><li>Design realistic, multi-step <strong>coding tasks and workflows</strong> (offline work)
</li><li>Create and refine <strong>evaluation rubrics and scoring criteria</strong></li><li>Ensure consistency, quality, and compliance across evaluations</li><li>Identify issues in environments, instructions, or workflows and report them with clear evidence</li></ul><h3><strong>Requirements</strong></h3><ul><li>5+ years of experience in <strong>software engineering, QA, developer tooling, or similar code-heavy roles</strong></li><li>Strong proficiency in at least one programming ecosystem (e.g., Python, JavaScript/TypeScript, Java, C/C++, Rust, SQL)</li><li>Ability to <strong>read and understand unfamiliar codebases</strong> and implement and debug changes</li><li>Experience running and interpreting <strong>tests, scripts, and CLI tools</strong></li><li>Strong debugging and problem-solving skills, including handling edge cases</li><li>Comfortable working in <strong>Linux/terminal environments</strong></li><li>Familiarity with <strong>Git workflows</strong> and standard development tooling</li><li>Experience with <strong>AI coding tools or agentic coding environments</strong> (e.g., Cursor, Claude Code, or similar)</li><li>Strong attention to detail and the ability to produce <strong>consistent, high-quality evaluations</strong></li></ul>