How to crawl an entire document site with Olostep

by SkillAiNest

# Introduction

Web crawling is the process of automatically visiting web pages, following links, and systematically gathering content from a website. It is commonly used to gather large amounts of information from documentation sites, articles, knowledge bases, and other web resources.

Crawling an entire website and then converting that content into a format that an AI agent can actually use isn't as easy as it sounds. Documentation sites often have nested pages, repetitive navigation links, boilerplate content, and inconsistent page structures. On top of that, the extracted content needs to be clean, organized, and stored in a way that is useful for downstream AI workflows such as retrieval, question answering, or agent-based systems.

In this guide, we'll learn why to use Olostep instead of Scrapy or Selenium, configure everything needed for a web crawling project, write a simple crawling script to scrape a documentation website, and finally build a frontend with Gradio so that anyone can provide a link and other arguments to crawl a website's pages.

# Choosing Olostep over Scrapy or Selenium

Scrapy is powerful, but it’s built as a complete scraping framework. This is useful when you want deeper control, but it also means more setup and more engineering work.

Selenium is more popular for browser automation. It’s useful for interacting with JavaScript-heavy pages, but it’s not really designed as a workflow crawler on its own.

With Olostep, the pitch is pretty straightforward: search, crawl, scrape, and structure web data through an application programming interface (API), with support for LLM-friendly outputs such as Markdown, plain text, HTML, and structured JSON. This means you don't need to manually stitch together discovery, extraction, formatting, and downstream AI use.

For document sites, this can give you a much faster path from URLs to usable content because you’re spending less time building your own crawling stack and more time working with the content you need.

# Installing packages and setting up the API

First, install the Python packages used in this project. The official Olostep Software Development Kit (SDK) requires Python 3.11 or later.

pip install olostep python-dotenv tqdm

These packages handle the main parts of the workflow:

  • olostep connects your script to the Olostep API.
  • python-dotenv loads your API key from a .env file.
  • tqdm adds a progress bar so you can track saved pages.

Next, create a free Olostep account, open the dashboard, and create an API key from the API Keys page. Olostep’s official documentation and integrations point users to a dashboard for API key setup.

Olostep Dashboard API Key Setup

Then create a .env file in your project folder.

OLOSTEP_API_KEY=your_real_api_key_here

This keeps your API key separate from your Python code, which is a clean and secure way to manage credentials.

# Creating a crawler script

In this part of the project, we'll create a Python script that crawls a documentation website, extracts each page as Markdown, cleans the content, and saves it locally as individual files. We'll create a project folder, add a Python file, and then write the code step by step to make it easy to follow and test.

First, create a project folder for your crawler. Inside this folder, create a new Python file named crawl_docs_with_olostep.py.

Now we will add the code to this file one section at a time. This makes it easier to understand what each part of the script does and how the entire crawler works together.

## Specifying crawl settings

Start by importing the required libraries. Then specify the main crawl settings, such as the starting URL, crawl depth, page limits, inclusion and exclusion rules, and the output folder where the Markdown files will be saved. These values control how much of the documentation site is crawled and where the results are stored.

import os
import re
from pathlib import Path
from urllib.parse import urlparse

from dotenv import load_dotenv
from tqdm import tqdm
from olostep import Olostep

START_URL = ""  # starting URL of the documentation site to crawl
MAX_PAGES = 10
MAX_DEPTH = 1

INCLUDE_URLS = ["/**"]

EXCLUDE_URLS = []

OUTPUT_DIR = Path("olostep_docs_output")

## Creating a helper function for filesystem-safe filenames

Each crawled page needs to be saved as its own Markdown file. To do this, we need a helper function that converts the URL into a clean and filesystem-safe filename. This avoids problems with slashes, symbols, and other characters that don’t work well in file names.

def slugify_url(url: str) -> str:
    parsed = urlparse(url)
    path = parsed.path.strip("/")

    if not path:
        path = "index"

    # replace anything that is not a safe character, then turn "/" into "__"
    filename = re.sub(r"[^a-zA-Z0-9/_-]+", "-", path)
    filename = filename.replace("/", "__").strip("-_")

    return f"{filename or 'page'}.md"

## Creating helper functions to save Markdown files

Next, add helper functions to process the extracted content before saving it.

The first function cleans up the Markdown by removing extra interface text, repeated blank lines, and unwanted page elements such as feedback prompts. This helps keep the saved files focused on the original document content.

def clean_markdown(markdown: str) -> str:
    text = markdown.replace("\r\n", "\n").strip()
    # drop empty anchor links such as [​](#section) left over from the page
    text = re.sub(r"\[\s*\u200b?\s*\]\(#.*?\)", "", text, flags=re.DOTALL)

    lines = [line.rstrip() for line in text.splitlines()]

    # find where the real content starts: a setext title underlined with "="
    start_index = 0
    for index in range(len(lines) - 1):
        title = lines[index].strip()
        underline = lines[index + 1].strip()
        if title and underline and set(underline) == {"="}:
            start_index = index
            break
    else:
        # otherwise fall back to the first ATX heading ("# ...")
        for index, line in enumerate(lines):
            if line.lstrip().startswith("# "):
                start_index = index
                break

    lines = lines[start_index:]

    # cut everything from the feedback prompt onwards
    for index, line in enumerate(lines):
        if line.strip() == "Was this page helpful?":
            lines = lines[:index]
            break

    cleaned_lines: list[str] = []
    for line in lines:
        stripped = line.strip()
        if stripped in {"Copy page", "YesNo", "⌘I"}:
            continue
        # collapse runs of blank lines down to a single blank line
        if not stripped and cleaned_lines and not cleaned_lines[-1]:
            continue
        cleaned_lines.append(line)

    return "\n".join(cleaned_lines).strip()

The second function saves the cleaned Markdown to the output folder, recording the source URL as front matter at the top of the file.

def save_markdown(output_dir: Path, url: str, markdown: str) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    filepath = output_dir / slugify_url(url)

    content = f"""---
source_url: {url}
---

{markdown}
"""
    filepath.write_text(content, encoding="utf-8")

There is also a small helper function to clean up old Markdown files before saving the new crawl result.

def clear_output_dir(output_dir: Path) -> None:
    if not output_dir.exists():
        return

    for filepath in output_dir.glob("*.md"):
        filepath.unlink()

## Creating the main crawler logic

This is the main part of the script. It loads the API key from the .env file, creates the Olostep client, starts the crawl, waits for it to finish, retrieves each crawled page as Markdown, cleans up the content, and saves it locally.

This section ties everything together and turns individual helper functions into a working documentation crawler.

def main() -> None:
    load_dotenv()
    api_key = os.getenv("OLOSTEP_API_KEY")

    if not api_key:
        raise RuntimeError("Missing OLOSTEP_API_KEY in your .env file.")

    client = Olostep(api_key=api_key)

    crawl = client.crawls.create(
        start_url=START_URL,
        max_pages=MAX_PAGES,
        max_depth=MAX_DEPTH,
        include_urls=INCLUDE_URLS,
        exclude_urls=EXCLUDE_URLS,
        include_external=False,
        include_subdomain=False,
        follow_robots_txt=True,
    )

    print(f"Started crawl: {crawl.id}")
    crawl.wait_till_done(check_every_n_secs=5)

    pages = list(crawl.pages())
    clear_output_dir(OUTPUT_DIR)

    for page in tqdm(pages, desc="Saving pages"):
        try:
            content = page.retrieve(["markdown"])
            markdown = getattr(content, "markdown_content", None)

            if markdown:
                save_markdown(OUTPUT_DIR, page.url, clean_markdown(markdown))
        except Exception as exc:
            print(f"Failed to retrieve {page.url}: {exc}")

    print(f"Done. Files saved in: {OUTPUT_DIR.resolve()}")


if __name__ == "__main__":
    main()

Note: The full script is available here: kingabzpro/web-crawl-olostep, a web crawler and starter web app built with Olostep.

## Testing the web crawling script

After the script is complete, run it from your terminal:

python crawl_docs_with_olostep.py

As the script runs, you’ll see the crawler process the pages and save them one by one as Markdown files in its output folder.

Olostep Crawler Terminal Progress

After the crawl is finished, open the saved files to check the extracted content. You should see clean, readable Markdown versions of the documentation pages.

Example of clean Markdown output

At this point, your document content is ready to be used in AI workflows such as search, retrieval, or agent-based systems.

# Creating an Olostep Web Crawling Web Application

In this part of the project, we will build a simple web application on top of a crawler script. Instead of editing a Python file each time, this application gives you an easy way to enter document URLs, select crawl settings, run crawls, and preview saved Markdown files all in one place.

The front-end code of this application is available as app.py in the repo: web-crawl-olostep/app.py.

This application does a few useful things:

  • Lets you enter a starting URL for the crawl.
  • Lets you set the maximum number of pages to crawl.
  • Lets you control the crawl depth.
  • Lets you include and exclude URL patterns.
  • Runs the backend crawler directly from the interface.
  • Saves crawled pages to folders based on URL.
  • Displays all saved Markdown files in a dropdown.
  • Previews each Markdown file directly within the application.
  • Lets you clear previous crawl results with one button.

To start the application, run:

python app.py

After that, Gradio will start a local web server and provide a link like this:

* Running on local URL: 
* To create a public link, set `share=True` in `launch()`.

Once the application is running, open the local URL in your browser. In our example, we entered the URL of the Claude Code documentation and asked the application to crawl 50 pages with a depth of 5.

Gradio interface for crawling documents

When you click Run the crawl, the application sends your settings to the backend crawler and starts the crawl. In the terminal, you can see the progress as the pages are crawled and saved one by one.

Crawler terminal output

After the crawl is finished, the output folder will contain the saved Markdown files. In this example, you will see that 50 files were added.

Markdown files saved in the output folder

The dropdown is then automatically updated in the application, so you can open any saved file and view it as correctly formatted Markdown directly in the web interface.

Markdown preview in Gradio application

This makes the crawler very easy to use. Instead of changing values in code each time, you can test different document sites and crawl settings through a simple interface. It also makes it easy to share the project with other people who might not want to work directly in Python.

# The final takeaway

Web crawling is not just about collecting pages from a website. The real challenge is turning this content into clean, structured files that an AI system can actually use. In this project, we used a simple Python script and a Gradio application to make this process very simple.

Just as importantly, the workflow is fast enough for real use. In our example, crawling 50 pages with a depth of 5 took only 50 seconds, which shows that you can generate document data quickly without creating a heavy pipeline.

This setup can go beyond a one-off crawl. You can schedule it to run every day with cron or a task scheduler, and even update only the pages that have changed. That keeps your documents up to date using only a small number of credits.
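For example, a daily run could be scheduled with a crontab entry like this (the paths are hypothetical; adjust them to your project):

```shell
# Run the crawler every day at 02:00 and append output to a log file.
# Assumes the project lives in /path/to/project (hypothetical path).
0 2 * * * cd /path/to/project && python3 crawl_docs_with_olostep.py >> crawl.log 2>&1
```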

For teams that need this kind of workflow to make business sense, Olostep is built with that in mind. It is significantly more affordable than building or maintaining an internal crawling solution, and is at least 50% cheaper than comparable alternatives on the market.

As your usage grows, the cost per request continues to decrease, making it a practical choice for large document pipelines. This combination of reliability, scalability, and strong unit economics is why some fast-growing AI-native startups rely on Olostep to power their data infrastructure.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in Technology Management and a Bachelor’s degree in Telecommunication Engineering. His vision is to create an AI product using graph neural networks for students struggling with mental illness.
