Semantic Parsing and Full-Text Indexing for Intelligent Candidate Search

This study presents a semantically structured query system that improves candidate discovery across unstructured content such as CVs and interview notes. We explore the integration of custom grammar rules, full-text indexing, and logical parsing to enable expressive, performant search in recruitment platforms.

In modern recruitment workflows, the initial candidate search phase is often limited by the rigidity of keyword-based filters and structured fields. Recruiters must translate flexible, often ambiguous hiring requirements into exact queries that systems can execute – typically through checkboxes or Boolean logic applied to structured attributes. This process fails to account for the vast amount of relevant information buried in unstructured candidate sources, such as free-form CVs or manually written interview notes.

This study addresses that gap by introducing a semantic layer over the search experience. The work enables recruiters to express intent using controlled but flexible grammar that differentiates between mandatory and optional criteria. These expressions are automatically parsed into structured logic and executed against a search backend optimized for unstructured data. The system bridges the gap between natural recruiter language and formal database logic, ensuring that meaningful candidate profiles are surfaced – even when the information is expressed in non-standard ways.

We focus specifically on how this grammar-driven approach interacts with full-text indexing strategies and how semantic parsing can extend the coverage of search beyond structured data. The result is a more accurate and performant candidate discovery engine that reduces the manual effort of query refinement and increases the likelihood of surfacing relevant candidates.

Recruitment systems today typically rely on structured data filtering or full-text keyword search. Structured filters – such as dropdowns for technologies or years of experience – are precise but brittle. They fail when candidates describe skills in informal terms or when recruiters are uncertain about the exact terminology to use. Full-text keyword searches offer more flexibility but tend to produce noisy results, often returning irrelevant candidates due to lack of context or semantic understanding.

Recent developments in NLP have introduced entity extraction and skill tagging from CVs, but these enhancements remain largely disconnected from the recruiter’s search logic. Even when candidate profiles are enriched with extracted tags, the query systems used to access them remain primitive, focused on exact matching or manually defined synonyms.

There is also a performance tradeoff. More expressive query systems tend to require heavyweight semantic models that do not scale well in live search scenarios. As a result, many platforms either compromise on precision or restrict search to predefined fields.

This study proposes a middle path: a structured grammar that captures semantic intent, integrated with optimized full-text indexing over unstructured content. The aim is to offer both precision and flexibility without relying on opaque, non-deterministic models.

At the core of the system is a domain-specific grammar framework that enables recruiters to compose expressive and precise queries over unstructured candidate data. This grammar supports a set of logical operators – MUST, CAN, NOT, AND, OR – plus parentheses for grouping, which together allow the definition of mandatory, optional, and negated conditions, as well as logical combinations and precedence control.

Beyond logical structure, the system incorporates specialized grammars tailored to different semantic domains. These include WORKING and WORKED to express current or past employment relationships; BEFORE, AFTER, and various relative temporal filters to constrain time-based conditions; Remote, Hybrid, and OnSite to capture work format preferences; salary-related tokens such as GROSS, NET, and RATE; and grammars for spoken languages and geographic zones. For example, a query such as “must have worked with Spring after 2020 and can be remote” is parsed into a structured semantic representation that combines technology experience, temporal constraints, and location flexibility, providing context-aware filtering over narrative candidate data that would otherwise be opaque to traditional search mechanisms.
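The logical core of such a grammar can be illustrated with a short recursive-descent parser. The sketch below is not the study’s implementation: token names are simplified, domain tokens such as WORKED and AFTER are omitted, and the output tree shape is a hypothetical choice.

```python
import re

# Illustrative sketch: a recursive-descent parser for the logical core of the
# grammar (MUST, CAN, NOT, AND, OR, parentheses). Bare words are skill terms.
TOKEN_RE = re.compile(r"\(|\)|\w+")

def tokenize(query):
    return TOKEN_RE.findall(query)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_expr(self):
        # OR binds loosest, so it sits at the top of the descent.
        node = self.parse_and()
        while self.peek() == "OR":
            self.eat()
            node = {"op": "OR", "args": [node, self.parse_and()]}
        return node

    def parse_and(self):
        node = self.parse_unary()
        while self.peek() == "AND":
            self.eat()
            node = {"op": "AND", "args": [node, self.parse_unary()]}
        return node

    def parse_unary(self):
        tok = self.peek()
        if tok in ("MUST", "CAN", "NOT"):
            self.eat()
            return {"op": tok, "arg": self.parse_unary()}
        if tok == "(":
            self.eat()
            node = self.parse_expr()
            assert self.eat() == ")", "unbalanced parentheses"
            return node
        return {"term": self.eat()}

def parse(query):
    return Parser(tokenize(query)).parse_expr()

print(parse("MUST Spring AND (CAN Maven OR CAN Hibernate)"))
```

Parentheses override the default precedence exactly as in the recruiter-facing grammar, so the resulting tree is unambiguous before it ever reaches the backend.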

The query is executed against a search backend that includes full-text indexes over key unstructured fields: CVs, interview summaries, and candidate notes. These fields are preprocessed and indexed using SQL-based full-text search capabilities, with custom indexes created specifically for high-frequency columns.
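As an illustration of this indexing layer – not the study’s actual schema – SQLite’s FTS5 module can stand in for the SQL full-text capabilities described above. Table and column names here are hypothetical.

```python
import sqlite3

# Illustrative only: a full-text index over unstructured candidate fields,
# sketched with SQLite FTS5. Column and table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE candidate_text USING fts5(
        candidate_id UNINDEXED,  -- join key back to structured attributes
        cv_text,                 -- free-form CV sections
        interview_notes          -- manually written notes
    )
""")
conn.executemany(
    "INSERT INTO candidate_text VALUES (?, ?, ?)",
    [
        (1, "Backend developer, Spring Boot microservices since 2021", ""),
        (2, "Frontend engineer, React and TypeScript", "prefers on-site work"),
    ],
)
# A column-scoped full-text match: only CV text is searched here.
rows = conn.execute(
    "SELECT candidate_id FROM candidate_text WHERE candidate_text MATCH ?",
    ("cv_text: Spring",),
).fetchall()
print(rows)  # → [(1,)]
```

The UNINDEXED join key keeps the full-text table narrow while letting matches be joined back to the structured attribute store.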

To increase recall, the system also performs a post-processing step where it identifies candidate-technology relationships that are not explicitly labeled but appear in narrative content. These inferred mappings are fed back into the search pipeline, allowing the system to match queries against implicit knowledge extracted from text.
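A minimal sketch of this inference step follows, assuming a hypothetical canonical technology vocabulary and simple token matching; the study does not specify its extraction pipeline at this level of detail.

```python
import re

# Hypothetical sketch of the inference step: scan narrative text for mentions
# of canonical technologies that are absent from the structured profile.
CANONICAL_TECH = {"spring", "hibernate", "maven", "kafka"}

def infer_technologies(narrative, structured_tags):
    words = set(re.findall(r"[a-z0-9+#.]+", narrative.lower()))
    mentioned = CANONICAL_TECH & words
    # Only relationships missing from the structured profile are new knowledge.
    return mentioned - {t.lower() for t in structured_tags}

notes = "Candidate discussed a Kafka pipeline and Spring services in depth."
print(infer_technologies(notes, structured_tags=["Spring"]))  # → {'kafka'}
```

The inferred pairs would then be written back into the index so that a later query for the technology retrieves the candidate directly.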

Study Details

The study on semantic candidate search is designed to validate whether a structured grammar and optimized indexing strategy can deliver higher precision and recall in recruitment searches without sacrificing performance. We treat recruiter intent as a formalizable object – parsed, structured, and executed against a mixed data environment of structured attributes and unstructured text.

Our primary goal is to enable recruiters to express complex queries naturally, while ensuring the system executes them accurately and transparently. This includes differentiating between mandatory and optional skills, handling varied terminology, and retrieving candidates even when relevant data appears only in narrative content such as CVs or interview notes.

A secondary goal is to maintain sub-second response times on large datasets, validating that the system is operationally viable in real-time recruitment scenarios. We also aim to improve “hidden” candidate discoverability by automatically extracting and indexing relationships between skills and candidates that are not explicitly structured.

The system begins by transforming recruiter inputs into a structured JSON representation that preserves the complete semantics of the query. In addition to MUST and CAN, the parser supports NOT, AND, OR, and parentheses for precedence, allowing complex compositions to be expressed unambiguously. On top of these logical operators, the parser applies domain grammars that model recruiter intent across multiple dimensions: employment state with WORKING and WORKED; temporal constraints with BEFORE, AFTER, and relative windows such as LastWeek, Last2Weeks, LastMonth, Last2Months, Last6Months, and LastYear; work format with OnSite, Remote, and Hybrid; compensation with GROSS, NET, and RATE; plus dedicated grammars for languages, keywords, companies, and geographic zones. For example, the input “must have WORKED with Spring AFTER 2020 AND (CAN know Maven OR CAN know Hibernate) AND Work=Remote” is parsed into a nested semantic tree that distinguishes mandatory technology use, time bounds, optional framework familiarity, and work format, all encoded deterministically in JSON for downstream execution.
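The study does not publish the exact JSON schema; one plausible shape for the example query above, with hypothetical field names, might be:

```json
{
  "op": "AND",
  "args": [
    { "op": "MUST",
      "technology": "Spring",
      "relation": "WORKED",
      "temporal": { "AFTER": 2020 } },
    { "op": "OR",
      "args": [
        { "op": "CAN", "technology": "Maven" },
        { "op": "CAN", "technology": "Hibernate" }
      ] },
    { "work_format": "Remote" }
  ]
}
```

Whatever the concrete schema, the key property is determinism: the same input always yields the same tree, so the representation can be inspected and audited.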

We then index candidate data across two layers. Structured attributes such as titles, locations, known technologies, compensation ranges, and language tags are normalized and stored for direct matching. Unstructured sources – free-text CV sections, interview notes, and recruiter annotations – are preprocessed and indexed using SQL full-text capabilities. Tokenization and query expansion align with the grammar so that grouped expressions and negations are honored during scoring, rather than degraded into flat keyword matches.

Once the corpus is indexed, parsed queries are executed against the backend with full fidelity to the logical and domain constraints. MUST clauses act as hard filters; CAN clauses contribute weighted preferences; NOT clauses are compiled into explicit exclusions so that accidental term co-occurrence cannot produce a match; and grouped expressions preserve intended precedence. Results are ranked by a composite signal that blends semantic fit, field-level boosts, and term frequency in the most relevant sources, ensuring that, for example, a WORKED AFTER LastYear constraint on a technology weighs recent, hands-on evidence higher than historical mentions.
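These execution rules can be sketched as a small evaluator over a parsed query tree. Everything below is illustrative: the tree shape, the flat skill set standing in for full-text matches, and the single CAN weight are assumptions, not the study’s scoring model.

```python
# Illustrative evaluator: MUST clauses filter, CAN clauses add weight,
# NOT clauses exclude. Candidate evidence is a flat lowercase skill set here;
# the real backend matches against full-text indexes instead.
CAN_WEIGHT = 1.0

def evaluate(node, skills):
    """Return (matches, score); matches=False means the candidate is excluded."""
    if "term" in node:
        return node["term"].lower() in skills, 0.0
    op = node["op"]
    if op == "MUST":
        ok, _ = evaluate(node["arg"], skills)
        return ok, 0.0                             # hard filter, no boost
    if op == "CAN":
        ok, _ = evaluate(node["arg"], skills)
        return True, CAN_WEIGHT if ok else 0.0     # preference: never filters
    if op == "NOT":
        ok, _ = evaluate(node["arg"], skills)
        return not ok, 0.0                         # explicit exclusion
    results = [evaluate(a, skills) for a in node["args"]]
    if op == "AND":
        return all(m for m, _ in results), sum(s for _, s in results)
    if op == "OR":
        return any(m for m, _ in results), max(s for _, s in results)
    raise ValueError(f"unknown operator {op}")

query = {"op": "AND", "args": [
    {"op": "MUST", "arg": {"term": "Spring"}},
    {"op": "OR", "args": [
        {"op": "CAN", "arg": {"term": "Maven"}},
        {"op": "CAN", "arg": {"term": "Hibernate"}}]}]}

print(evaluate(query, {"spring", "maven"}))  # → (True, 1.0)
print(evaluate(query, {"react"}))            # → (False, 0.0)
```

Separating the match flag from the score is the design point: optional CAN clauses can reorder results without ever shrinking the candidate pool, while MUST and NOT decide membership alone.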

The study demonstrates that a grammar-driven approach improves both recall and precision in candidate search by turning recruiter intent into an explicit, inspectable structure. When key skills appear only in narrative sources such as CV free text or interview notes, the system still retrieves the right profiles because the parser carries the full semantics of the request into execution, rather than collapsing it into brittle keyword lists.

The inference layer proves essential for surfacing “hidden” skills recorded only in unstructured text. By linking narrative mentions back to canonical technology tags, the system increases the qualified candidate pool without inflating noise. Multiple cases where interview notes contain technologies that are absent from the structured profile are captured by our pipeline and become addressable by future queries, improving recall without relaxing precision thresholds.

Because every user input is translated into a deterministic JSON representation before execution, recruiters can inspect, audit, and iteratively refine intent without guessing how the engine interprets their words. This clarity reduces the iteration cycles typically spent on trial-and-error filtering and lowers the cognitive load of operating complex searches: recruiters spend less time massaging filters and scanning irrelevant profiles, which shortens time-to-shortlist and reduces the cost of candidate acquisition while preserving traceability.

Technically, the study establishes that carefully designed grammar, plus selective indexing, can deliver semantic retrieval without resorting to opaque, computationally heavy models. SQL full-text remains viable when the query planner receives a structured, semantics-aware plan rather than raw keywords.