A Beginner’s Guide to Data Extraction with Langextract and LLMs

by SkillAiNest

Photo by author

# Introduction

Did you know that a huge amount of valuable information is still locked away in unstructured text, such as research papers, clinical notes, and financial reports? Extracting reliable, structured information from these texts has always been a challenge. LangExtract is an open-source Python library (released by Google) that solves this problem using large language models (LLMs). You specify what to extract with a simple prompt and a few examples, and an LLM (such as Google’s Gemini, OpenAI models, or local models) extracts that information from documents of any length. It is also notable for its support for very long documents (via chunking and multi-pass processing) and its interactive visualization of results. Let’s explore this library in more detail.

# 1. Install and configure

To install LangExtract locally, first make sure you have Python 3.10+ installed. The library is available on PyPI. For an isolated environment, create and activate a virtual environment, then install the package. In a terminal, run:

python -m venv langextract_env
source langextract_env/bin/activate  # On Windows: .\langextract_env\Scripts\activate
pip install langextract

Other options, including installing from source and running via Docker, are described in the project repository.

# 2. Configuring API Keys (for Cloud Model)

LangExtract itself is free and open source, but if you use a cloud-hosted LLM (such as Google Gemini or an OpenAI GPT model), you must provide an API key. You can set the LANGEXTRACT_API_KEY environment variable or store it in a .env file in your working directory. For example:

export LANGEXTRACT_API_KEY="YOUR_API_KEY_HERE"

Or in a .env file:

cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF
echo '.env' >> .gitignore

On-device LLMs via Ollama or other local backends do not require an API key. To enable OpenAI support, run pip install langextract[openai], set your OPENAI_API_KEY, and use an OpenAI model_id. For Vertex AI (enterprise users), service-account authentication is supported.
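As a minimal sketch (assuming the key is exported as shown above), you can verify from Python that the variable is visible before calling the library:

```python
import os

# Fall back to a placeholder only if nothing is configured; in real use the
# key should come from your shell or a .env file, never from source code.
os.environ.setdefault("LANGEXTRACT_API_KEY", "your-api-key-here")

api_key = os.environ["LANGEXTRACT_API_KEY"]
print("API key configured:", bool(api_key))
```

This is only a sanity check; LangExtract reads the variable from the environment on its own.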

# 3. Defining the extraction task

LangExtract works by having you tell it what information to extract. You do this by writing a clear prompt description and supplying one or more ExampleData annotations that show what a correct extraction looks like on sample text. For example, to extract characters, emotions, and relationships from a line of literature, you might write:

import langextract as lx

prompt = """
  Extract characters, emotions, and relationships in order of appearance.
  Use exact text for extractions. Do not paraphrase or overlap entities.
  Provide meaningful attributes for each entity to add context."""
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? ...",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            )
        ]
    )
]

These examples (adapted from LangExtract’s README) show the model exactly what structured output to produce. You can create similar examples for your own domain.

# 4. Run the extraction

Once your prompt and examples are defined, you simply call lx.extract(). The key arguments are:

  • text_or_documents: your input text, a list of texts, or even a URL string (LangExtract can fetch and process text from Project Gutenberg or another URL).
  • prompt_description: extraction instructions (a string)
  • examples: a list of ExampleData objects that specify the desired output.
  • model_id: the LLM identifier to use (e.g. "gemini-2.5-flash" for Google Gemini Flash, an Ollama model such as "gemma2:2b", or an OpenAI model such as "gpt-4o").
  • Other optional parameters: extraction_passes (multiple passes for higher recall on long texts), max_workers (parallel processing of chunks), fence_output, use_schema_constraints, etc.

For example:

input_text = '''JULIET. O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
ROMEO. Shall I hear more, or shall I speak at this?
JULIET. 'Tis but thy name that is my enemy;
Thou art thyself, though not a Montague.
What’s in a name? That which we call a rose
By any other name would smell as sweet.'''


result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"
)

This sends the prompt, examples, and input text to the selected LLM and returns a result object. LangExtract automatically handles splitting long texts into chunks, batching calls in parallel, and merging the output.

# 5. Saving and visualizing output

The output of lx.extract() is a Python object (often named result) that contains the extracted entities and their attributes. You can inspect it programmatically or save it for later. LangExtract also provides helper functions for saving results: for example, you can write the results to a JSONL (JSON Lines) file (one document per line) and generate an interactive HTML visualization. For example:

lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html = lx.visualize("extraction_results.jsonl")
with open("viz.html", "w") as f:
    f.write(html if isinstance(html, str) else html.data)

This writes an extraction_results.jsonl file and an interactive viz.html file. The JSONL format is convenient for large datasets and further processing, and the HTML file highlights each extracted span in context (color-coded) for easy human inspection, such as:
Output and visualization: LangExtract
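As a sketch of downstream processing, the saved file can be read back with the standard library. The field names here ("extractions", "extraction_class", "extraction_text") are an assumption based on the extraction attributes shown earlier; check one line of your own output file to confirm the schema.

```python
import json

def summarize_extractions(jsonl_path: str) -> list[tuple[str, str]]:
    """Collect (class, text) pairs from a LangExtract-style JSONL file,
    where each line holds one annotated document."""
    pairs = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            for ex in doc.get("extractions", []):
                pairs.append((ex["extraction_class"], ex["extraction_text"]))
    return pairs
```

Parsing the JSONL this way lets you feed extractions into pandas, a database, or any later pipeline stage without re-running the LLM.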

# 6. Supporting input formats

Lange Extract is flexible about input. You can supply:

  • Plain text strings: Any text you load into Python (such as from a file or database) can be processed.
  • URLs: As shown above, you can pass a URL (e.g. a Project Gutenberg link) as text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt". LangExtract will download the document and extract from it.
  • Text list: Pass a Python list of strings to process multiple documents in one call.
  • Rich text or Markdown: Since LangExtract works at the text level, you can also feed it Markdown or HTML if you preprocess it to raw text. (LangExtract does not parse PDFs or images by itself; you need to extract the text first.)
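To illustrate the list form, here is a hypothetical helper (load_documents is not part of LangExtract) that gathers plain-text files into the list of strings that text_or_documents accepts:

```python
from pathlib import Path

def load_documents(folder: str, pattern: str = "*.txt") -> list[str]:
    """Read every matching file in a folder into one string per document."""
    return [
        path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob(pattern))
    ]
```

The resulting list can then be passed directly as text_or_documents to process all files in a single call.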

# 7. Conclusion

LangExtract makes it easy to convert unstructured text into structured data. With high accuracy, clear source grounding, and simple customization, it works well where rule-based methods fall short. It is particularly useful for complex or domain-specific extractions. While there is room for improvement, LangExtract is already a robust tool for grounded information extraction in 2025.

Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the eBook “Maximizing Productivity with ChatGPT”. As a 2022 Google Generation Scholar for APAC, she champions diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a MITACS GlobalLink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM fields.
