In this study, we explore the use of generative AI – specifically large language models (LLMs) – to automate the generation and evaluation of technical exercises for hiring processes. The motivation stems from a common bottleneck in recruitment: creating high-quality, role-specific assessments that reflect different skill levels, while also enabling fair and consistent candidate evaluation.
This work aims to replace traditionally manual, time-consuming workflows with a fully automated, intelligent system. Our core objectives include exercise generation based on structured parameters, automatic validation of question quality, and robust assessment of candidate submissions, all achieved through LLM-driven pipelines. The study also investigates the feasibility of controlled execution environments for code testing and examines the limits and trade-offs of various automation techniques.
This study builds on advances in transformer-based models, which have redefined natural language understanding since their introduction. Architectures such as GPT-4 exemplify the progression toward models capable of both natural language reasoning and task-specific problem-solving.
Prompt engineering emerges as a critical field in this context. Techniques such as few-shot prompting, chain-of-thought, and self-critique mechanisms enable dynamic tailoring of LLM behavior. These approaches allow models to generate not just free-form text but purposeful, structured artifacts, such as coding exercises or scored evaluations, that align with specified parameters.
In terms of assessment, the use of LLMs as virtual evaluators is also growing. Recent literature supports the generation of unit tests and evaluation criteria directly from problem statements. However, despite the promise of dynamic execution and testing within containers, industry and research consensus highlights the overhead and complexity of maintaining runtime environments for diverse codebases. Our findings corroborate these trade-offs, leading us to prioritize alternative validation techniques, such as multi-pass scoring using varied temperature settings and model ensembles.
The study leverages multiple layers of generative AI infrastructure. At its core, the system interacts with LLMs through structured prompt templates, allowing for high customization based on variables like question category, difficulty, and target seniority level. All prompts are assembled dynamically, taking into account historical context and role-specific requirements.
The architecture consists of several modular services, illustrated in the sketch after the list:
- A generation engine that initiates and manages the full life cycle of exercise creation.
- A prompt construction service that adapts the language and structure of prompts to the intended audience.
- An evaluation loop that implements self-refinement by having the model review and score its own outputs.
- A scalable orchestration layer capable of batch-generating thousands of exercises through controlled parallelism.
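The following is a minimal interface sketch of how these services could fit together. All names (`ExerciseSpec`, `PromptBuilder`, `EvaluationLoop`, `GenerationEngine`) are illustrative assumptions for exposition, not identifiers from the study's actual codebase.

```python
# Illustrative service boundaries; names and fields are assumptions.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ExerciseSpec:
    category: str        # e.g. "algorithms"
    difficulty: str      # e.g. "medium"
    seniority: str       # e.g. "senior"
    exercise_type: str   # e.g. "programming"


class PromptBuilder(Protocol):
    def build(self, spec: ExerciseSpec) -> list[dict]:
        """Return role-separated messages tailored to the spec."""


class EvaluationLoop(Protocol):
    def score(self, artifact: str, spec: ExerciseSpec) -> float:
        """Score a generated question or solution on a 0-100 scale."""


class GenerationEngine(Protocol):
    def generate(self, spec: ExerciseSpec) -> str:
        """Run the full create -> review -> regenerate life cycle."""
```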
Prompt templates follow structured JSON specifications, and each generated item (question or solution) goes through iterative validation: scoring loops run until a quality threshold is reached or a maximum number of attempts is exhausted.
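A template specification might look roughly like the following dict; the field names and values are illustrative assumptions, not the study's actual schema.

```python
# Hypothetical shape of a structured prompt template (illustrative only).
EXERCISE_TEMPLATE = {
    "template_id": "programming-question-v1",
    "role_sections": {
        "system": "You are an expert technical interviewer.",
        "user": (
            "Create one {difficulty} {category} exercise suitable for a "
            "{seniority} candidate. Return JSON with the fields "
            "'statement', 'constraints', and 'expected_topics'."
        ),
    },
    "variables": ["difficulty", "category", "seniority"],
    "quality_threshold": 80,   # minimum average review score (0-100)
    "max_attempts": 3,         # cap on regeneration loops
}
```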
On the submission evaluation side, candidate answers are processed through similar multi-pass LLM scoring routines, accounting for different question types such as multiple-choice, open-ended, and programming. Evaluation criteria include factual correctness, clarity, code performance, and adherence to best practices.
Although we evaluate the potential use of container-based code execution, including Docker-based environments, our cost-benefit analysis discourages its adoption at scale in the current phase. Instead, the study relies on logic-based validation methods, LLM-inferred test generation, and synthetic ground truths for correctness checking.
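One way to realize container-free correctness checking is sketched below: the model first infers test cases from the problem statement (the synthetic ground truth), then judges the candidate's code against each case. The `llm` callable is an assumed wrapper around a model call; this is a sketch under those assumptions, not the study's exact routine.

```python
import json
from typing import Callable

# Sketch: LLM-inferred test generation plus LLM judging, with no code execution.
# `llm` is an assumed callable that sends a prompt and returns the model's text.
def infer_test_cases(llm: Callable[[str], str], statement: str, n: int = 5) -> list[dict]:
    prompt = (
        f"Derive {n} input/expected-output test cases for this problem. "
        f"Answer as a JSON list of objects with 'input' and 'expected'.\n\n{statement}"
    )
    return json.loads(llm(prompt))


def judge_against_cases(llm: Callable[[str], str], code: str, cases: list[dict]) -> float:
    passed = 0
    for case in cases:
        verdict = llm(
            "Does the following code produce the expected output for the given "
            f"input? Answer PASS or FAIL only.\n\nCode:\n{code}\n\n"
            f"Input: {case['input']}\nExpected: {case['expected']}"
        )
        passed += verdict.strip().upper().startswith("PASS")
    return passed / len(cases)  # fraction of inferred cases judged as passing
```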
Study Details
The primary goal of this study is to autonomously generate, validate, and assess technical exercises tailored to various candidate profiles. This work responds to the need for scalable, high-fidelity assessments in recruitment processes that traditionally rely on manually curated content and human-led evaluation.
We begin by designing a modular service responsible for managing the full lifecycle of an exercise. This service receives structured input parameters, such as difficulty level, exercise type, category, and intended seniority, and performs a validity check to ensure semantic consistency. For instance, mismatches like “senior-level” questions with “beginner” concepts are filtered out before generation begins.
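A validity check of this kind can be as simple as a lookup of allowed parameter combinations. The rule table below is a made-up example to illustrate the idea, not the study's actual filter.

```python
# Illustrative semantic-consistency check on input parameters (rules are examples).
SENIORITY_ALLOWED_DIFFICULTY = {
    "junior": {"beginner", "easy"},
    "mid": {"easy", "medium"},
    "senior": {"medium", "hard"},
}


def is_consistent(seniority: str, difficulty: str) -> bool:
    """Reject mismatches such as a senior-level exercise on beginner concepts."""
    return difficulty in SENIORITY_ALLOWED_DIFFICULTY.get(seniority, set())


# Example: is_consistent("senior", "beginner") -> False, so generation is skipped.
```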
The generation process is iterative by design. A first version of the exercise is created using the LLM, followed by a self-evaluation step in which the same or another LLM instance acts as a reviewer. It scores the question across multiple dimensions, including clarity, technical depth, and relevance. If the average score falls below a predefined threshold, the system triggers a regeneration loop. The model uses feedback from the previous round to produce an improved version of the exercise. This loop is capped at a maximum number of attempts to prevent unbounded resource usage.
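A minimal sketch of this generate, self-review, regenerate loop follows. The `generate` and `review` callables stand in for LLM calls and are assumptions for illustration; the threshold and attempt cap are configurable, as described above.

```python
from typing import Callable

# Sketch of the capped regeneration loop: generate, score, feed back, retry.
def generate_with_refinement(
    generate: Callable[[str, str], str],         # (spec_prompt, feedback) -> exercise text
    review: Callable[[str], tuple[float, str]],  # exercise text -> (score 0-100, feedback)
    spec_prompt: str,
    threshold: float = 80.0,
    max_attempts: int = 3,
) -> tuple[str, float]:
    feedback = ""
    best, best_score = "", float("-inf")
    for _ in range(max_attempts):
        exercise = generate(spec_prompt, feedback)
        score, feedback = review(exercise)
        if score > best_score:
            best, best_score = exercise, score
        if score >= threshold:
            break  # quality threshold reached; stop regenerating
    return best, best_score
```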
A similar approach governs solution generation. After the exercise is finalized, the system prompts the LLM to generate an ideal solution. This solution is again subjected to validation loops, where it is assessed for correctness, code quality, clarity, and algorithmic efficiency. The system uses multiple LLM passes, often with varied temperature values, to simulate reviewer variability. Scores are averaged to stabilize final judgments.
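The multi-pass, temperature-varied scoring can be expressed as a small averaging routine. `score_once` is an assumed wrapper around an LLM reviewer call that accepts a temperature parameter; the temperature values shown are illustrative.

```python
from statistics import mean
from typing import Callable

# Sketch: score the same artifact several times at different temperatures
# to simulate reviewer variability, then average the results.
def multi_pass_score(
    score_once: Callable[[str, float], float],  # (artifact, temperature) -> 0-100 score
    artifact: str,
    temperatures: tuple[float, ...] = (0.2, 0.7, 1.0),
) -> float:
    return mean(score_once(artifact, t) for t in temperatures)
```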
For scale, we introduce a batch generation service capable of producing exercises across the combinatorial space of input parameters. The batch engine calculates the total number of unique configurations and distributes the generation tasks across parallel threads, bounded by a configurable concurrency limit. This design allows the system to scale efficiently according to the available compute resources.
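In outline, the batch engine enumerates the combinatorial space and fans the work out under a bounded worker pool, roughly as sketched below. `generate_exercise` stands in for the full per-configuration pipeline and is an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Sketch of batch generation over the parameter space with a concurrency bound.
def generate_batch(generate_exercise, categories, difficulties, seniorities, max_workers=8):
    configs = list(product(categories, difficulties, seniorities))
    print(f"{len(configs)} unique configurations to generate")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cfg: generate_exercise(*cfg), configs))
```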
Prompt engineering is abstracted into its own service. It uses predefined templates with dynamic sections based on the exercise metadata. Prompts are constructed with explicit roles (system vs. user) and structured contexts to ensure consistent interpretation by the LLM. This level of prompt control is essential for aligning output format and style with downstream parsing and evaluation processes.
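Role-separated prompt assembly from exercise metadata could look like the sketch below; the message format follows the common chat-completion convention and the wording is illustrative rather than the study's exact templates.

```python
# Sketch of system/user prompt construction from exercise metadata.
def build_messages(category: str, difficulty: str, seniority: str) -> list[dict]:
    system = (
        "You are a technical interviewer. Always answer with a single JSON "
        "object so the output can be parsed downstream."
    )
    user = (
        f"Write one {difficulty} exercise in the '{category}' category for a "
        f"{seniority} candidate. Include fields 'statement' and 'constraints'."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```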
Candidate submissions are processed through a submission evaluation service. This component selects the appropriate evaluation method based on the question type. For code-based questions, the system checks correctness using a mix of model-based evaluation and synthetic test generation. For open-ended and multiple-choice formats, evaluations consider clarity, terminology, structural logic, and overall alignment with expected answers. Scoring is repeated across multiple runs with elevated temperatures, and the final result is averaged to reduce the impact of outlier responses or stochastic variation.
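The dispatch-by-question-type logic, with repeated runs averaged, reduces to something like the following. The evaluator callables and dictionary keys are assumptions for illustration.

```python
from statistics import mean

# Sketch: pick an evaluator by question type, run it several times, average.
def evaluate_submission(evaluators: dict, question_type: str, submission: str, runs: int = 3) -> float:
    evaluate = evaluators[question_type]  # e.g. keys: "code", "open_ended", "multiple_choice"
    return mean(evaluate(submission) for _ in range(runs))
```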
We conduct a focused validation campaign consisting of 30 exercises, each followed by the generation and evaluation of an ideal solution. Exercises are evaluated on clarity, relevance, technical depth, and practicality. The average score across these dimensions is 88.1%, with all 30 exercises surpassing the 80% approval threshold. In contrast, solution generation yields an average score of 82.3%, with 25 out of 30 solutions approved. This discrepancy highlights the greater complexity in synthesizing high-quality solutions compared to question formulation.
These results validate the robustness of our generation loops and scoring strategies. However, they also reveal that solution synthesis remains a more sensitive task, requiring further refinement. Potential areas for improvement include augmenting prompt clarity, integrating reference implementations, or applying external verification mechanisms.
We also evaluate the feasibility of dynamic code execution in containerized environments. While technically viable, the cost and complexity of supporting multi-language runtimes, dynamic dependency resolution, and security isolation lead us to deprioritize this approach in favor of logic-based validation. Nonetheless, we acknowledge the long-term value of integrating containerized execution for high-stakes assessments or edge cases where LLM-based scoring may be insufficient.
This study demonstrates the practical viability of leveraging LLMs to automate technical assessment workflows with a high degree of reliability and adaptability. From a business perspective, the reduction in manual effort, increased consistency in evaluation, and improved scalability offer substantial operational benefits. Technically, the study confirms the utility of prompt-driven generation loops, structured evaluation pipelines, and controlled parallelization as viable patterns for applying generative AI in enterprise assessment systems.