Legal NLP & Query Expansion

NLP & Text Representation | 2021 | Applied Prototype

Overview

This project developed a seed-based NLP prototype for legal document retrieval and terminology expansion.

The work started from a small set of manually selected legal terms, used them to retrieve candidate documents, extracted additional relevant terms from those documents, and explored whether lexical and semantic similarity could support assigning documents to concepts.

The project was not designed as a production legal search engine. It was an applied methodological prototype for working with specialized legal text.

Problem

Legal concepts are often expressed through multiple related terms. A narrow query can miss relevant documents if the corpus uses alternative terminology.

The practical question was whether a small initial vocabulary could be expanded using information contained in the corpus itself.

Example 1 — Seed-Based Retrieval

One initial conceptual area was represented by the seed terms:

murder, kill, gun

The prototype used these terms to identify legal documents related to the same conceptual area. Each document and the seed query were represented as vectors of TF-IDF-style term weights, and documents were ranked by their cosine similarity to the query.
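
The ranking step can be illustrated with a minimal sketch. It is not the prototype's code: the tokenizer is a simple stand-in for the NLTK preprocessing, the smoothed IDF formula is one common variant chosen here as an assumption, and the three sample sentences and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (assumed stand-in for NLTK preprocessing)."""
    tokens = (raw.strip(".,;:()").lower() for raw in text.split())
    return [tok for tok in tokens if tok]

def build_idf(tokenized_docs):
    """Smoothed IDF: log((1 + N) / (1 + df)) + 1 (one common variant, assumed)."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms absent from the corpus are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(seed_terms, docs):
    """Return (doc_index, score) pairs sorted by descending cosine similarity."""
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    query_vec = tfidf_vector(seed_terms, idf)
    scored = [(i, cosine(query_vec, tfidf_vector(toks, idf)))
              for i, toks in enumerate(tokenized)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical mini-corpus, for illustration only.
docs = [
    "The defendant was convicted of murder after the gun was recovered.",
    "The mortgage secured a debt owed to the bank.",
    "The court reviewed the contract for the sale of land.",
]
ranking = rank_documents(["murder", "kill", "gun"], docs)
for doc_index, score in ranking:
    print(doc_index, round(score, 3))
```

In this toy corpus only the first sentence shares terms with the seed query, so it is the only document with a non-zero score. The seed term "kill" never occurs in the corpus and is dropped from the query vector, which is exactly the vocabulary gap that query expansion is meant to address.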

Example 2 — Query Expansion

A second conceptual area was represented by:

debt, mortgage, money

After retrieving candidate documents, the prototype extracted high-scoring terms from the selected subset. These terms were treated as possible additions to the original query vocabulary.

The logic was iterative: start from a narrow concept, retrieve documents, extract candidate terms, and use those terms to enrich the conceptual vocabulary.
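
One expansion cycle might look like the sketch below. Several choices are assumptions made for illustration rather than details of the original prototype: candidate documents are selected by simple seed-term presence instead of a similarity threshold, candidate terms are scored by summed TF-IDF weight over the selected subset, and the stopword list, sample sentences and function names are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (assumption; NLTK ships a fuller one).
STOPWORDS = {"the", "a", "of", "to", "was", "on", "and", "after", "by", "in"}

def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords."""
    tokens = (raw.strip(".,;:()").lower() for raw in text.split())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

def expand_query(seed_terms, docs, top_k=3):
    """Return the top_k highest-scoring non-seed terms from seed-matching documents."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    seeds = set(seed_terms)
    # Candidate subset: documents that contain at least one seed term.
    selected = [doc for doc in tokenized if seeds & set(doc)]

    scores = Counter()
    for doc in selected:
        for term, count in Counter(doc).items():
            if term not in seeds:
                scores[term] += count * idf[term]

    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:top_k]]

# Hypothetical mini-corpus, for illustration only.
docs = [
    "The bank foreclosed on the mortgage after the borrower defaulted on the debt.",
    "The borrower owed money to the lender and the lender sued the borrower.",
    "The defendant was convicted of murder.",
]
expanded = expand_query(["debt", "mortgage", "money"], docs, top_k=3)
print(expanded)
```

The third sentence contains no seed term, so none of its vocabulary can enter the candidate list. Candidates returned this way are suggestions only; they still need review before being added to the concept vocabulary.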

Technologies and Methods Used

Python for prototype implementation.
NLTK for tokenization and basic text preprocessing.
NumPy / pandas for intermediate document-term structures and data handling.
TF-IDF-style scoring for sparse lexical representation of terms and documents.
Cosine similarity for query-document ranking and term/document comparison.
GloVe embeddings for exploratory word-level semantic similarity.
Seed-based query expansion to enrich manually defined concept vocabularies.

Implemented Elements

Filtering of legal documents using initial seed terms.
Tokenization and basic preprocessing of legal text.
TF-IDF-style term scoring.
Cosine-similarity ranking between queries and documents.
Similarity-threshold selection of candidate documents.
Extraction of high-scoring terms for query expansion.
Exploratory semantic similarity using static word embeddings.
Score-based assignment of documents to conceptual classes.
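
The last two elements, embedding-based similarity and score-based concept assignment, can be sketched as follows. The three-dimensional vectors are invented placeholders, not real GloVe values, and averaging word vectors into a document vector and a concept centroid is an assumed aggregation scheme; the concept labels and function names are hypothetical.

```python
import math

# Made-up 3-d vectors standing in for real GloVe embeddings (assumption).
TOY_VECTORS = {
    "murder": (0.9, 0.1, 0.0),
    "kill": (0.8, 0.2, 0.0),
    "gun": (0.7, 0.0, 0.1),
    "weapon": (0.75, 0.05, 0.1),
    "debt": (0.0, 0.9, 0.1),
    "mortgage": (0.1, 0.8, 0.2),
    "money": (0.0, 0.7, 0.3),
    "loan": (0.05, 0.85, 0.2),
}

CONCEPTS = {
    "violent_crime": ["murder", "kill", "gun"],
    "debt_finance": ["debt", "mortgage", "money"],
}

def mean_vector(tokens):
    """Average the vectors of known tokens; None if no token has a vector."""
    vecs = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_concept(doc_tokens, concepts):
    """Return (best_concept, scores); best_concept is None if nothing can be scored."""
    doc_vec = mean_vector(doc_tokens)
    if doc_vec is None:
        return None, {}
    scores = {}
    for name, seeds in concepts.items():
        concept_vec = mean_vector(seeds)
        if concept_vec is not None:
            scores[name] = cosine(doc_vec, concept_vec)
    if not scores:
        return None, {}
    return max(scores, key=scores.get), scores

label_a, scores_a = assign_concept(["the", "loan", "was", "secured"], CONCEPTS)
label_b, scores_b = assign_concept(["a", "weapon", "was", "found"], CONCEPTS)
print(label_a, label_b)
```

With real embeddings the vectors would be loaded from a pretrained GloVe file, and out-of-vocabulary legal terms would need explicit handling, since static embeddings trained on general text may not cover specialized terminology.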

Internal Scores and Evaluation Limits

The prototype included internal relevance scores: TF-IDF weights, cosine similarity values, ranked document scores and embedding-based concept scores. These measures were useful for exploration, ranking and candidate term selection.

They should not be interpreted as external validation: they describe how the system ranks documents, not whether those rankings are correct. A full evaluation would require annotated relevance judgments and standard retrieval metrics.

Already present: internal scoring, ranking and threshold-based document selection.
To be added: Precision@k, Recall@k, Mean Reciprocal Rank and nDCG for retrieval quality.
To be added: expert validation of candidate expansion terms.
To be added: checks for semantic drift across expansion cycles.
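
The retrieval metrics listed above have standard definitions, sketched here for a single query with binary or graded relevance. The document identifiers and relevance labels are hypothetical; in a real evaluation they would come from annotated relevance judgments.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked, gains, k):
    """nDCG@k with graded gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical system output and relevance judgments for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
gains = {doc_id: 1 for doc_id in relevant}

p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
rr = reciprocal_rank(ranked, relevant)
mrr = mean_reciprocal_rank([(ranked, relevant), (["d5"], relevant)])
ndcg4 = ndcg_at_k(ranked, gains, 4)
print(p3, r3, rr, mrr, ndcg4)
```

Here one of the top three results is relevant out of three relevant documents overall, so Precision@3 and Recall@3 both equal 1/3, and the first relevant hit at rank 2 gives a reciprocal rank of 0.5.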

Methodological Note

The iterative structure of the prototype is useful but also risky. If the system repeatedly expands a query using terms extracted from previously retrieved documents, the process can become circular and may drift away from the original legal concept.

For this reason, query expansion should be controlled through external validation, expert review or comparison against a labeled benchmark.
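
As a lightweight complement to those controls, drift can be monitored automatically between cycles. The sketch below uses one possible heuristic, which is an assumption and not part of the original prototype: it compares the documents retrieved by the original seeds with those retrieved by the expanded vocabulary, using Jaccard overlap, and flags any cycle that falls below a chosen threshold. The retriever, corpus, threshold and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def retrieve(terms, docs):
    """Stand-in retriever (assumption): ids of documents containing any term."""
    terms = set(terms)
    return {doc_id for doc_id, tokens in docs.items() if terms & set(tokens)}

def check_drift(seed_terms, expansion_cycles, docs, min_overlap=0.6):
    """Return (cycle, overlap, within_threshold) for each cumulative expansion."""
    baseline = retrieve(seed_terms, docs)
    vocabulary = list(seed_terms)
    report = []
    for cycle, new_terms in enumerate(expansion_cycles, start=1):
        vocabulary = vocabulary + list(new_terms)
        overlap = jaccard(baseline, retrieve(vocabulary, docs))
        report.append((cycle, overlap, overlap >= min_overlap))
    return report

# Hypothetical tokenized corpus: "bank" is ambiguous and pulls in river documents.
docs = {
    "d1": ["debt", "mortgage", "bank"],
    "d2": ["money", "loan", "bank"],
    "d3": ["bank", "river", "flood"],
    "d4": ["river", "flood", "damage"],
}
report = check_drift(
    ["debt", "mortgage", "money"],
    [["loan"], ["bank"], ["river"]],
    docs,
)
for cycle, overlap, ok in report:
    print(cycle, round(overlap, 3), "ok" if ok else "possible drift")
```

In this example the ambiguous term "bank" lowers the overlap in the second cycle, and adding "river" in the third pushes it below the threshold. A flag of this kind only signals that the retrieved set has changed substantially; deciding whether the change is legitimate still requires expert judgment or a labeled benchmark.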

Modern Extension

The original prototype used sparse lexical scoring, cosine similarity and static word embeddings. A modern version would evaluate the same idea against current retrieval methods.

Use BM25 as a stronger sparse-retrieval baseline.
Use sentence embeddings for document-level semantic search.
Compare sparse, dense and hybrid retrieval strategies.
Add reranking for the top retrieved documents.
Evaluate retrieval quality with annotated relevance judgments.
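
For the first item, a BM25 baseline can be written in a few lines. The sketch below implements the Okapi BM25 scoring function with the commonly used defaults k1 = 1.5 and b = 0.75 and a non-negative IDF variant; these parameter values, the sample documents and the function name are assumptions for illustration, and a maintained library implementation would be preferable in practice.

```python
import math
from collections import Counter

def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    df = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len) if avg_len else k1
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical pre-tokenized documents, for illustration only.
tokenized_docs = [
    ["mortgage", "debt", "bank", "debt"],
    ["money", "lender", "court"],
    ["murder", "gun", "trial"],
]
scores = bm25_scores(["debt", "mortgage", "money"], tokenized_docs)
print([round(s, 3) for s in scores])
```

Unlike plain TF-IDF, BM25 saturates the contribution of repeated terms and normalizes for document length, which matters for legal corpora where document lengths vary widely. Running the same seed queries through both scorers, against the same relevance judgments, would give the sparse baseline for the dense and hybrid comparisons.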

Resources

Technical report in preparation.

Code available upon request.

Technical Context

Pennington, Socher & Manning (2014), GloVe — relevant to the original prototype because static word embeddings were used for exploratory semantic similarity.
Reimers & Gurevych (2019), Sentence-BERT — relevant as a modern document/sentence embedding alternative to word-level static embeddings.
Thakur et al. (2021), BEIR — relevant as a modern framework for evaluating retrieval systems across heterogeneous datasets.