In this study, we explore the use of generative AI – specifically large language models (LLMs) – to automate the generation and evaluation of technical exercises for hiring processes. The motivation stems from a common bottleneck in recruitment: creating high-quality, role-specific assessments that reflect different skill levels, while also enabling fair and consistent candidate evaluation.
This work aims to replace traditionally manual, time-consuming workflows with a fully automated, intelligent system. Our core objectives include exercise generation based on structured parameters, automatic validation of question quality, and robust assessment of candidate submissions, all achieved through LLM-driven pipelines. The study also investigates the feasibility of controlled execution environments for code testing and examines the limits and trade-offs of various automation techniques.
This study builds on advances in transformer-based models, which have redefined natural language understanding since their introduction. Architectures such as GPT-4 exemplify the progression toward models capable of both natural language reasoning and task-specific problem-solving.
Prompt engineering emerges as a critical field in this context. Techniques such as few-shot prompting, chain-of-thought, and self-critique mechanisms enable dynamic tailoring of LLM behavior. These approaches allow models to generate not just free-form text but purposeful, structured artifacts, such as coding exercises or scored evaluations, that align with specified parameters.
In terms of assessment, the use of LLMs as virtual evaluators is also growing. Recent literature supports the generation of unit tests and evaluation criteria directly from problem statements. However, despite the promise of dynamic execution and testing within containers, industry and research consensus highlights the overhead and complexity of maintaining runtime environments for diverse codebases. Our findings corroborate these trade-offs, leading us to prioritize alternative validation techniques, such as multi-pass scoring using varied temperature settings and model ensembles.
The study leverages multiple layers of generative AI infrastructure. At its core, the system interacts with LLMs through structured prompt templates, allowing for high customization based on variables like question category, difficulty, and target seniority level. All prompts are assembled dynamically, taking into account historical context and role-specific requirements.
The architecture consists of several modular services, illustrated in the sketch after the list:
- A generation engine that initiates and manages the full life cycle of exercise creation.
- A prompt construction service that adapts the language and structure of prompts to the intended audience.
- An evaluation loop that implements self-refinement by having the model review and score its own outputs.
- A scalable orchestration layer capable of batch-generating thousands of exercises through controlled parallelism.
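The following is a minimal interface sketch of how these services could fit together. All names (`ExerciseSpec`, `PromptBuilder`, `EvaluationLoop`, `GenerationEngine`) are illustrative assumptions for exposition, not identifiers from the study's actual codebase.

```python
# Illustrative service boundaries; names and fields are assumptions.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ExerciseSpec:
    category: str        # e.g. "algorithms"
    difficulty: str      # e.g. "medium"
    seniority: str       # e.g. "senior"
    exercise_type: str   # e.g. "programming"


class PromptBuilder(Protocol):
    def build(self, spec: ExerciseSpec) -> list[dict]:
        """Return role-separated messages tailored to the spec."""


class EvaluationLoop(Protocol):
    def score(self, artifact: str, spec: ExerciseSpec) -> float:
        """Score a generated question or solution on a 0-100 scale."""


class GenerationEngine(Protocol):
    def generate(self, spec: ExerciseSpec) -> str:
        """Run the full create -> review -> regenerate life cycle."""
```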
Prompt templates follow structured JSON specifications, and each generated item (question or solution) goes through iterative validation: scoring loops run until a quality threshold is reached or a maximum number of attempts is exhausted.
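A template specification might look roughly like the following dict; the field names and values are illustrative assumptions, not the study's actual schema.

```python
# Hypothetical shape of a structured prompt template (illustrative only).
EXERCISE_TEMPLATE = {
    "template_id": "programming-question-v1",
    "role_sections": {
        "system": "You are an expert technical interviewer.",
        "user": (
            "Create one {difficulty} {category} exercise suitable for a "
            "{seniority} candidate. Return JSON with the fields "
            "'statement', 'constraints', and 'expected_topics'."
        ),
    },
    "variables": ["difficulty", "category", "seniority"],
    "quality_threshold": 80,   # minimum average review score (0-100)
    "max_attempts": 3,         # cap on regeneration loops
}
```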
On the submission evaluation side, candidate answers are processed through similar multi-pass LLM scoring routines, accounting for different question types such as multiple-choice, open-ended, and programming. Evaluation criteria include factual correctness, clarity, code performance, and adherence to best practices.
Although we evaluate the potential use of container-based code execution, including Docker-based environments, our cost-benefit analysis discourages its adoption at scale in the current phase. Instead, the study relies on logic-based validation methods, LLM-inferred test generation, and synthetic ground truths for correctness checking.
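One way to realize container-free correctness checking is sketched below: the model first infers test cases from the problem statement (the synthetic ground truth), then judges the candidate's code against each case. The `llm` callable is an assumed wrapper around a model call; this is a sketch under those assumptions, not the study's exact routine.

```python
import json
from typing import Callable

# Sketch: LLM-inferred test generation plus LLM judging, with no code execution.
# `llm` is an assumed callable that sends a prompt and returns the model's text.
def infer_test_cases(llm: Callable[[str], str], statement: str, n: int = 5) -> list[dict]:
    prompt = (
        f"Derive {n} input/expected-output test cases for this problem. "
        f"Answer as a JSON list of objects with 'input' and 'expected'.\n\n{statement}"
    )
    return json.loads(llm(prompt))


def judge_against_cases(llm: Callable[[str], str], code: str, cases: list[dict]) -> float:
    passed = 0
    for case in cases:
        verdict = llm(
            "Does the following code produce the expected output for the given "
            f"input? Answer PASS or FAIL only.\n\nCode:\n{code}\n\n"
            f"Input: {case['input']}\nExpected: {case['expected']}"
        )
        passed += verdict.strip().upper().startswith("PASS")
    return passed / len(cases)  # fraction of inferred cases judged as passing
```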
Study Details
The primary goal of this study is to autonomously generate, validate, and assess technical exercises tailored to various candidate profiles. This work responds to the need for scalable, high-fidelity assessments in recruitment processes that traditionally rely on manually curated content and human-led evaluation.
We begin by designing a modular service responsible for managing the full lifecycle of an exercise. This service receives structured input parameters, such as difficulty level, exercise type, category, and intended seniority, and performs a validity check to ensure semantic consistency. For instance, mismatches like “senior-level” questions with “beginner” concepts are filtered out before generation begins.
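A validity check of this kind can be as simple as a lookup of allowed parameter combinations. The rule table below is a made-up example to illustrate the idea, not the study's actual filter.

```python
# Illustrative semantic-consistency check on input parameters (rules are examples).
SENIORITY_ALLOWED_DIFFICULTY = {
    "junior": {"beginner", "easy"},
    "mid": {"easy", "medium"},
    "senior": {"medium", "hard"},
}


def is_consistent(seniority: str, difficulty: str) -> bool:
    """Reject mismatches such as a senior-level exercise on beginner concepts."""
    return difficulty in SENIORITY_ALLOWED_DIFFICULTY.get(seniority, set())


# Example: is_consistent("senior", "beginner") -> False, so generation is skipped.
```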
The generation process is iterative by design. A first version of the exercise is created using the LLM, followed by a self-evaluation step in which the same or another LLM instance acts as a reviewer. It scores the question across multiple dimensions, including clarity, technical depth, and relevance. If the average score falls below a predefined threshold, the system triggers a regeneration loop. The model uses feedback from the previous round to produce an improved version of the exercise. This loop is capped at a maximum number of attempts to prevent unbounded resource usage.
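A minimal sketch of this generate, self-review, regenerate loop follows. The `generate` and `review` callables stand in for LLM calls and are assumptions for illustration; the threshold and attempt cap are configurable, as described above.

```python
from typing import Callable

# Sketch of the capped regeneration loop: generate, score, feed back, retry.
def generate_with_refinement(
    generate: Callable[[str, str], str],         # (spec_prompt, feedback) -> exercise text
    review: Callable[[str], tuple[float, str]],  # exercise text -> (score 0-100, feedback)
    spec_prompt: str,
    threshold: float = 80.0,
    max_attempts: int = 3,
) -> tuple[str, float]:
    feedback = ""
    best, best_score = "", float("-inf")
    for _ in range(max_attempts):
        exercise = generate(spec_prompt, feedback)
        score, feedback = review(exercise)
        if score > best_score:
            best, best_score = exercise, score
        if score >= threshold:
            break  # quality threshold reached; stop regenerating
    return best, best_score
```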
A similar approach governs solution generation. After the exercise is finalized, the system prompts the LLM to generate an ideal solution. This solution is again subjected to validation loops, where it is assessed for correctness, code quality, clarity, and algorithmic efficiency. The system uses multiple LLM passes, often with varied temperature values, to simulate reviewer variability. Scores are averaged to stabilize final judgments.
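The multi-pass, temperature-varied scoring can be expressed as a small averaging routine. `score_once` is an assumed wrapper around an LLM reviewer call that accepts a temperature parameter; the temperature values shown are illustrative.

```python
from statistics import mean
from typing import Callable

# Sketch: score the same artifact several times at different temperatures
# to simulate reviewer variability, then average the results.
def multi_pass_score(
    score_once: Callable[[str, float], float],  # (artifact, temperature) -> 0-100 score
    artifact: str,
    temperatures: tuple[float, ...] = (0.2, 0.7, 1.0),
) -> float:
    return mean(score_once(artifact, t) for t in temperatures)
```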
For scale, we introduce a batch generation service capable of producing exercises across the combinatorial space of input parameters. The batch engine calculates the total number of unique configurations and distributes the generation tasks across parallel threads, bounded by a configurable concurrency limit. This design allows the system to scale efficiently according to the available compute resources.
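In outline, the batch engine enumerates the combinatorial space and fans the work out under a bounded worker pool, roughly as sketched below. `generate_exercise` stands in for the full per-configuration pipeline and is an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Sketch of batch generation over the parameter space with a concurrency bound.
def generate_batch(generate_exercise, categories, difficulties, seniorities, max_workers=8):
    configs = list(product(categories, difficulties, seniorities))
    print(f"{len(configs)} unique configurations to generate")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cfg: generate_exercise(*cfg), configs))
```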
Prompt engineering is abstracted into its own service. It uses predefined templates with dynamic sections based on the exercise metadata. Prompts are constructed with explicit roles (system vs. user) and structured contexts to ensure consistent interpretation by the LLM. This level of prompt control is essential for aligning output format and style with downstream parsing and evaluation processes.
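Role-separated prompt assembly from exercise metadata could look like the sketch below; the message format follows the common chat-completion convention and the wording is illustrative rather than the study's exact templates.

```python
# Sketch of system/user prompt construction from exercise metadata.
def build_messages(category: str, difficulty: str, seniority: str) -> list[dict]:
    system = (
        "You are a technical interviewer. Always answer with a single JSON "
        "object so the output can be parsed downstream."
    )
    user = (
        f"Write one {difficulty} exercise in the '{category}' category for a "
        f"{seniority} candidate. Include fields 'statement' and 'constraints'."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```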
Candidate submissions are processed through a submission evaluation service. This component selects the appropriate evaluation method based on the question type. For code-based questions, the system checks correctness using a mix of model-based evaluation and synthetic test generation. For open-ended and multiple-choice formats, evaluations consider clarity, terminology, structural logic, and overall alignment with expected answers. Scoring is repeated across multiple runs with elevated temperatures, and the final result is averaged to reduce the impact of outlier responses or stochastic variation.
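The dispatch-by-question-type logic, with repeated runs averaged, reduces to something like the following. The evaluator callables and dictionary keys are assumptions for illustration.

```python
from statistics import mean

# Sketch: pick an evaluator by question type, run it several times, average.
def evaluate_submission(evaluators: dict, question_type: str, submission: str, runs: int = 3) -> float:
    evaluate = evaluators[question_type]  # e.g. keys: "code", "open_ended", "multiple_choice"
    return mean(evaluate(submission) for _ in range(runs))
```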
We conduct a focused validation campaign consisting of 30 exercises, each followed by the generation and evaluation of an ideal solution. Exercises are evaluated on clarity, relevance, technical depth, and practicality. The average score across these dimensions is 88.1%, with all 30 exercises surpassing the 80% approval threshold. In contrast, solution generation yields an average score of 82.3%, with 25 out of 30 solutions approved. This discrepancy highlights the greater complexity in synthesizing high-quality solutions compared to question formulation.
These results validate the robustness of our generation loops and scoring strategies. However, they also reveal that solution synthesis remains a more sensitive task, requiring further refinement. Potential areas for improvement include augmenting prompt clarity, integrating reference implementations, or applying external verification mechanisms.
We also evaluate the feasibility of dynamic code execution in containerized environments. While technically viable, the cost and complexity of supporting multi-language runtimes, dynamic dependency resolution, and security isolation lead us to deprioritize this approach in favor of logic-based validation. Nonetheless, we acknowledge the long-term value of integrating containerized execution for high-stakes assessments or edge cases where LLM-based scoring may be insufficient.
This study demonstrates the practical viability of leveraging LLMs to automate technical assessment workflows with a high degree of reliability and adaptability. From a business perspective, the reduction in manual effort, increased consistency in evaluation, and improved scalability offer substantial operational benefits. Technically, the study confirms the utility of prompt-driven generation loops, structured evaluation pipelines, and controlled parallelization as viable patterns for applying generative AI in enterprise assessment systems.