Legal NLP & Query Expansion

NLP & Text Representation | 2021 | Applied Prototype

Overview

This project developed a seed-based NLP prototype for legal document retrieval and terminology expansion.

The work started from a small set of manually selected legal terms and used them to retrieve candidate documents. It then extracted additional relevant terms from those documents and explored whether lexical and semantic similarity could support document-concept assignment.

The project was not designed as a production legal search engine. It was an applied methodological prototype for working with specialized legal text.

Problem

Legal concepts are often expressed through multiple related terms. A narrow query can miss relevant documents if the corpus uses alternative terminology.

The practical question was whether a small initial vocabulary could be expanded using information contained in the corpus itself.

Example 1 — Seed-Based Retrieval

One initial conceptual area was represented by the seed terms:

murder kill gun

The prototype used these terms to identify legal documents related to the same conceptual area. The query and the documents were represented as vectors of TF-IDF-style lexical weights, and documents were ranked by their cosine similarity to the query.
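The retrieval step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the prototype's actual code: the tokenizer, the smoothed IDF formula, the function names (tokenize, tfidf_vectors, cosine, rank) and the three example sentences are all invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase alphabetic tokens only; legal text would need richer preprocessing.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()

def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF, chosen here for illustration.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query_terms, docs):
    # Seed terms missing from the corpus vocabulary are silently dropped.
    vectors, idf = tfidf_vectors(docs)
    query = {t: idf[t] for t in query_terms if t in idf}
    scores = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Invented example sentences, not taken from the project corpus.
docs = [
    "The defendant was charged with murder after he fired a gun.",
    "The bank foreclosed on the mortgage when the debt went unpaid.",
    "The court reviewed the lease agreement between the parties.",
]
ranking = rank(["murder", "kill", "gun"], docs)
```

On this toy corpus only the first sentence shares terms with the seed query, so it ranks first and the other two receive a score of zero.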

Example 2 — Query Expansion

A second conceptual area was represented by:

debt mortgage money

After retrieving candidate documents, the prototype extracted high-scoring terms from the selected subset. These terms were treated as possible additions to the original query vocabulary.

The logic was iterative: start from a narrow concept, retrieve documents, extract candidate terms and use those terms to enrich the conceptual vocabulary.
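One pass of this loop could look like the sketch below. It is an illustration only: the stopword list, the summed seed-weight document score (a simple stand-in for the cosine ranking), the parameter names top_docs and top_terms, and the example sentences are all assumptions rather than details of the original prototype.

```python
import math
from collections import Counter

# Assumption: a tiny illustrative stopword list.
STOPWORDS = {"the", "a", "on", "of", "was", "to", "and", "for", "when", "by", "with", "under", "in"}

def tokenize(text):
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return [t for t in cleaned.split() if t not in STOPWORDS]

def expand_query(seed_terms, docs, top_docs=3, top_terms=2):
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    seeds = set(seed_terms)

    def doc_score(toks):
        # Summed TF-IDF weight of the seed terms found in the document.
        tf = Counter(toks)
        return sum(tf[t] * idf.get(t, 0.0) for t in seeds)

    # Step 1: retrieve the best-scoring documents that match at least one seed.
    ranked = sorted(range(n), key=lambda i: doc_score(tokenized[i]), reverse=True)
    selected = [i for i in ranked[:top_docs] if doc_score(tokenized[i]) > 0.0]

    # Step 2: accumulate TF-IDF weight of non-seed terms over the selected subset.
    totals = Counter()
    for i in selected:
        for term, count in Counter(tokenized[i]).items():
            if term not in seeds:
                totals[term] += count * idf[term]

    # Step 3: the highest-weighted terms become candidate additions.
    return [term for term, _ in totals.most_common(top_terms)]

# Invented example sentences, not taken from the project corpus.
docs = [
    "The debtor defaulted on the mortgage and the lender began foreclosure.",
    "Foreclosure followed when the debt secured by the mortgage went unpaid and the lender sued.",
    "The defendant was charged with murder.",
    "The tenant paid money to the landlord under the lease.",
]
candidates = expand_query(["debt", "mortgage", "money"], docs)
```

Here the candidates are the terms that recur across the retrieved subset. The fourth sentence is retrieved only because it contains "money", which shows how off-topic documents can enter the subset and why candidate terms need review before being added to the query.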

Technologies and Methods Used

Implemented Elements

Internal Scores and Evaluation Limits

The prototype included internal relevance scores: TF-IDF weights, cosine similarity values, ranked document scores and embedding-based concept scores. These measures were useful for exploration, ranking and candidate term selection.

These scores are internal to the system and should not be interpreted as external validation. A full evaluation would require annotated relevance judgments and standard retrieval metrics such as precision, recall and average precision.
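As a rough sketch of what such an evaluation involves, the functions below compute precision at k, recall at k and average precision for one query, given a ranked list of document identifiers and a set of annotated relevant documents. The identifiers and the relevance set are invented for illustration; no such judgments existed for the prototype.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    # Share of the top-k results that are annotated as relevant.
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    # Share of all relevant documents that appear in the top-k results.
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def average_precision(ranked_ids, relevant_ids):
    # Mean of the precision values at each rank where a relevant document occurs,
    # normalized by the total number of relevant documents.
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / position
    return total / len(relevant_ids)

# Hypothetical system output and hypothetical expert judgments.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d3", "d2", "d9"}

p_at_3 = precision_at_k(ranked, relevant, 3)
r_at_5 = recall_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
```

Averaging the last value over a set of queries gives mean average precision, which allows different configurations to be compared on the same judgments.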

Methodological Note

The iterative structure of the prototype is useful but also risky. If the system repeatedly expands a query using terms extracted from previously retrieved documents, the process can become circular and may drift away from the original legal concept, a problem often called query drift.

For this reason, query expansion should be controlled through external validation, expert review or comparison against a labeled benchmark.
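One way to apply such a control is to accept a candidate term only if it performs well on a small labeled set. The sketch below is an assumption-laden illustration: the function names, the thresholds min_precision and min_support, and the labeled examples are hypothetical and were not part of the prototype.

```python
def term_precision(term, labeled_docs):
    # labeled_docs: list of (set_of_tokens, is_relevant) pairs from a labeled benchmark.
    labels = [is_relevant for tokens, is_relevant in labeled_docs if term in tokens]
    if not labels:
        return 0.0
    return sum(labels) / len(labels)

def filter_candidates(candidates, labeled_docs, min_precision=0.6, min_support=2):
    # Keep a candidate only if it occurs in enough labeled documents
    # and most of those documents are relevant to the concept.
    accepted = []
    for term in candidates:
        support = sum(1 for tokens, _ in labeled_docs if term in tokens)
        if support >= min_support and term_precision(term, labeled_docs) >= min_precision:
            accepted.append(term)
    return accepted

# Hypothetical labeled benchmark for the debt-related concept.
labeled_docs = [
    ({"mortgage", "foreclosure", "lender"}, True),
    ({"debt", "foreclosure", "creditor"}, True),
    ({"lease", "tenant", "money"}, False),
    ({"tenant", "landlord", "lease"}, False),
    ({"lender", "loan", "debt"}, True),
]
accepted = filter_candidates(["foreclosure", "tenant", "creditor"], labeled_docs)
```

In this example "foreclosure" is accepted, "tenant" is rejected because it only occurs in non-relevant documents, and "creditor" is rejected because a single occurrence is too little evidence. Rejected terms could still be passed to an expert for review.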

Modern Extension

The original prototype used sparse lexical scoring, cosine similarity and static word embeddings. A modern version would evaluate the same idea against current retrieval methods, for example BM25 baselines and dense retrieval with contextual embeddings.
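For reference, the static-embedding concept scoring that a modern method would be compared against can be sketched as follows. The three-dimensional vectors, the concept names and the function names are invented for illustration; real static embeddings such as word2vec or GloVe vectors have hundreds of dimensions and are learned from a corpus.

```python
import math

# Toy vectors made up for this example only.
TOY_EMBEDDINGS = {
    "murder": (0.9, 0.1, 0.0),
    "kill": (0.8, 0.2, 0.0),
    "gun": (0.7, 0.0, 0.1),
    "homicide": (0.85, 0.15, 0.05),
    "debt": (0.0, 0.9, 0.1),
    "mortgage": (0.1, 0.8, 0.2),
    "money": (0.0, 0.7, 0.3),
    "loan": (0.05, 0.85, 0.15),
}

def centroid(terms):
    # Average the vectors of all terms that have an embedding; None if there are none.
    vectors = [TOY_EMBEDDINGS[t] for t in terms if t in TOY_EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_concept(doc_tokens, concepts):
    # Score the document centroid against each concept centroid and pick the best match.
    doc_vec = centroid(doc_tokens)
    if doc_vec is None:
        return None, {}
    scores = {}
    for name, seeds in concepts.items():
        concept_vec = centroid(seeds)
        scores[name] = cosine(doc_vec, concept_vec) if concept_vec else 0.0
    return max(scores, key=scores.get), scores

# Concept names are hypothetical labels for the two seed sets used in the examples.
CONCEPTS = {
    "violent_crime": ["murder", "kill", "gun"],
    "debt": ["debt", "mortgage", "money"],
}
label_a, scores_a = assign_concept(["homicide", "gun", "court"], CONCEPTS)
label_b, scores_b = assign_concept(["loan", "mortgage"], CONCEPTS)
```

Tokens without an embedding, such as "court" here, are ignored. A contextual model would instead encode the whole passage, which is one of the differences a modern comparison would need to measure.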

Resources

Technical report in preparation.

Code available upon request.

Technical Context