PDF files are everywhere. They are used for reports, invoices, bank statements, research papers and legal documents. While PDFs are easy to read for humans, they are not easy to work with in code. Extracting text, splitting pages, or merging files often feels more difficult than it should.
This is where pypdf helps. pypdf is a popular Python library that allows you to read, edit, and write PDF files. It’s lightweight, easy to learn, and generally works well for common PDF tasks. If you’ve ever needed to extract text from a PDF, merge multiple PDFs, or protect a file with a password, pypdf is a good place to start.
In this article, you will learn what pypdf is, how it works, and how to use it through simple and practical examples. You will also learn how tools like PDF BOM handle PDF operations. This tutorial requires a basic understanding of Python.
What is pypdf?
pypdf is a pure-Python library for working with PDF files. It allows you to open existing PDFs, read their structure, extract content, and create new PDF files. Because it’s written in Python, it doesn’t require external tools or system-level dependencies.
The library understands the internal structure of PDFs, such as pages, text streams, metadata, and encryption. You don’t need to know how PDFs work internally to use pypdf, but it helps to understand that a PDF isn’t just text. It is a structured document made up of objects and references between them.
pypdf is often used in automation scripts, data pipelines, compliance systems, and document processing tools. It is a common choice when you need a reliable and simple solution without heavy dependencies.
Installing pypdf
Installing pypdf is straightforward. You can install it using pip, the standard package manager for Python.
pip install pypdf
Once installed, you can import it into your Python code. The main classes you will use are PdfReader, which reads existing files, and PdfWriter, which creates or modifies them.
from pypdf import PdfReader, PdfWriter
With this setup, you’re ready to start working with PDFs.
Reading a PDF file
The first step in most tasks is opening a PDF file. pypdf makes this easy with the PdfReader class.
from pypdf import PdfReader
reader = PdfReader("sample.pdf")
print(len(reader.pages))
This code opens a PDF file and prints the number of pages. Each page in a document is represented as an object that you can access and work with.
You can also inspect basic metadata, such as the title or author, when it is available.
metadata = reader.metadata
print(metadata)
Metadata in a PDF is optional, so not all files will have meaningful values, and some files have none at all.
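Because metadata is optional, code that reads it should cope with missing values. The following is a minimal sketch, not part of pypdf: `summarize_metadata` is a hypothetical helper, and it assumes the object returned by `reader.metadata` is either `None` or behaves like a dictionary keyed by PDF names such as "/Title".

```python
def summarize_metadata(metadata):
    """Return title and author from a metadata mapping, with fallbacks.

    `metadata` is expected to be what `reader.metadata` returns: a
    dict-like object keyed by names such as "/Title", or None when the
    file carries no document information at all.
    """
    if metadata is None:
        return {"title": "(none)", "author": "(none)"}
    return {
        # `or` also replaces empty strings with the fallback
        "title": metadata.get("/Title") or "(none)",
        "author": metadata.get("/Author") or "(none)",
    }
```

With the reader from the example above, `summarize_metadata(reader.metadata)` would give a small dictionary that is safe to print or log even when the file has no metadata.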
Extracting text from a PDF
One of the most common use cases is text extraction. pypdf allows you to extract text page by page.
from pypdf import PdfReader
reader = PdfReader("sample.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)
This code extracts the text from the first page of the PDF. For many documents such as reports or articles, this works well.
It is important to understand that text extraction is not perfect. PDFs store text by its position on the page, not in reading order. This means that extracted text may appear out of order or with missing spaces in some cases. Still, for most structured documents, pypdf provides usable results.
If you want to extract text from all pages, you can loop through them.
full_text = ""
for page in reader.pages:
    full_text += page.extract_text() + "\n"
print(full_text)
This approach is common when building search indexes or document analysis pipelines.
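For a search index, it is often more useful to keep the text of each page separate than to join everything into one string. The sketch below assumes a reader object like the one above; `extract_page_texts` and `find_pages` are hypothetical helper names used only for illustration.

```python
def extract_page_texts(reader):
    """Return one string per page, in page order."""
    # `or ""` guards against pages that yield no text (e.g. scanned images).
    return [page.extract_text() or "" for page in reader.pages]


def find_pages(page_texts, term):
    """Return 1-based page numbers whose text contains `term`, ignoring case."""
    needle = term.lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.lower()
    ]
```

Calling `find_pages(extract_page_texts(reader), "revenue")` would list the pages that mention the term. A real index would likely add tokenization and ranking on top of this.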
Splitting a PDF into multiple files
Another practical task is splitting a PDF into smaller files. This is useful when dealing with large reports or scanned documents.
from pypdf import PdfReader, PdfWriter
reader = PdfReader("sample.pdf")
for i, page in enumerate(reader.pages):
    # Use a fresh writer per page so each output file holds exactly one page
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i + 1}.pdf", "wb") as f:
        writer.write(f)
This code creates one PDF file per page. Each new file contains exactly one page from the original document.
You can also split a PDF into chunks, such as every five pages, by controlling how many pages you add before writing each file.
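The chunked variant can be sketched as follows. The function names and the `chunk_N.pdf` output naming are assumptions made for this example. The page-range arithmetic is kept in its own small function so it can be checked without touching any file.

```python
def chunk_ranges(num_pages, chunk_size):
    """Return (start, stop) index pairs covering pages 0..num_pages-1."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def split_into_chunks(input_path, chunk_size=5):
    """Write chunk_1.pdf, chunk_2.pdf, ... with up to chunk_size pages each."""
    # Imported here so chunk_ranges can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for number, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        with open(f"chunk_{number}.pdf", "wb") as f:
            writer.write(f)
```

For a 12-page document and a chunk size of five, this would produce three files holding five, five, and two pages.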
Merging multiple PDFs
Merging PDFs is another common requirement. pypdf allows you to combine several PDF files into one.
from pypdf import PdfReader, PdfWriter
writer = PdfWriter()
files = ["file1.pdf", "file2.pdf", "file3.pdf"]
for file in files:
    # Copy every page of the current input into the shared writer
    reader = PdfReader(file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as f:
    writer.write(f)
This script reads each input file and merges all pages into a single output PDF. The order of the files in the list defines the order in the merged document.
This approach is often used in reporting systems where multiple results are combined into one final document.
Rotating and editing pages
Sometimes you need to rotate pages, especially when working with scanned documents. pypdf makes this easy.
from pypdf import PdfReader, PdfWriter
reader = PdfReader("sample.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)
writer.add_page(page)
with open("rotated.pdf", "wb") as f:
    writer.write(f)
This code rotates the first page 90 degrees clockwise. You can apply similar logic to all pages or selected pages.
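Rotating only selected pages while keeping the rest of the document can be sketched like this. `rotate_selected` is a hypothetical helper, not a pypdf function. It expects a reader and a writer like the ones created above and relies only on `page.rotate()` and `writer.add_page()`.

```python
def rotate_selected(reader, writer, angle, page_indexes):
    """Copy every page from reader to writer, rotating only the chosen ones.

    `angle` must be a multiple of 90; positive values rotate clockwise.
    `page_indexes` holds 0-based page positions.
    """
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    selected = set(page_indexes)
    for index, page in enumerate(reader.pages):
        if index in selected:
            page.rotate(angle)
        writer.add_page(page)
```

For example, `rotate_selected(reader, writer, 90, [0, 2])` would rotate the first and third pages and copy the others unchanged. The writer is then saved with `writer.write(f)` as before.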
You can also crop pages by adjusting their media box, which controls the dimensions of the page. This is useful when removing margins or focusing on a particular area.
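As a rough illustration of cropping, the sketch below trims the same margin from every side of a page by moving the corners of its media box. `crop_margins` is a hypothetical helper, and the margin value is an assumption you would tune per document. It uses the `left`, `bottom`, `right`, and `top` values and the `lower_left` and `upper_right` corners of `page.mediabox`.

```python
def crop_margins(page, margin):
    """Shrink the page's media box by `margin` points on every side.

    PDF units are points (72 per inch), so margin=36 trims half an inch.
    """
    box = page.mediabox
    left, bottom = float(box.left), float(box.bottom)
    right, top = float(box.right), float(box.top)
    # Refuse margins that would leave no visible area
    if 2 * margin >= min(right - left, top - bottom):
        raise ValueError("margin is too large for this page")
    box.lower_left = (left + margin, bottom + margin)
    box.upper_right = (right - margin, top - margin)
    return page
```

A typical use would be `writer.add_page(crop_margins(reader.pages[0], 36))`. Some viewers also honor a separate crop box, so results may need checking in the target viewer.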
Encrypting and decrypting PDFs
Security is important when handling sensitive documents. pypdf supports PDF encryption and password protection.
from pypdf import PdfReader, PdfWriter
reader = PdfReader("sample.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
# Encrypt the output before writing it to disk
writer.encrypt("strongpassword")
with open("protected.pdf", "wb") as f:
    writer.write(f)
The resulting PDF requires a password to open. This is commonly used when sharing confidential reports or financial documents.
If you need to read an encrypted PDF, you can provide the password when opening it.
reader = PdfReader("protected.pdf")
reader.decrypt("strongpassword")
After decryption, you can work with the file like any other PDF.
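In a script that handles both protected and unprotected files, it helps to check whether decryption is needed and whether it succeeded. The helper below is a sketch, not part of pypdf: `unlock` is a hypothetical name. It assumes the reader exposes `is_encrypted` and that `decrypt()` returns a falsy value when the password is wrong, which is how recent pypdf releases behave as far as I know.

```python
def unlock(reader, password):
    """Decrypt `reader` in place if needed; raise if the password is wrong."""
    if not reader.is_encrypted:
        return reader  # nothing to do for unprotected files
    # Assumption: decrypt() reports failure with a falsy result.
    if not reader.decrypt(password):
        raise ValueError("wrong password, PDF could not be decrypted")
    return reader
```

With this in place, `reader = unlock(PdfReader("protected.pdf"), "strongpassword")` either returns a usable reader or fails early with a clear error.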
Working with metadata
Metadata helps describe a document. You can add or update metadata using pypdf.
from pypdf import PdfWriter
writer = PdfWriter()
# In a real script, add pages to the writer before writing the file
writer.add_metadata({
    "/Title": "Monthly Report",
    "/Author": "Finance Team",
    "/Subject": "Revenue Analysis"
})
with open("metadata.pdf", "wb") as f:
    writer.write(f)
This metadata becomes part of the PDF file and can be viewed in most PDF readers. It is useful for document management and search systems.
General limitations of pypdf
Although pypdf is powerful, it has limitations. It does not perform optical character recognition (OCR), so it cannot extract text from scanned images. For scanned PDFs, you need an OCR tool such as Tesseract.
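One simple way to spot pages that probably need OCR is to look for pages where extraction returns little or no text. This is only a heuristic sketch: `pages_without_text` and the `min_chars` threshold are assumptions for illustration, and a page with very little real text would also be flagged.

```python
def pages_without_text(reader, min_chars=1):
    """Return 1-based numbers of pages whose extracted text is (nearly) empty."""
    flagged = []
    for number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").strip()
        if len(text) < min_chars:
            flagged.append(number)
    return flagged
```

The returned page numbers could then be routed to an OCR step, while the remaining pages are handled with the text extraction shown earlier.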
pypdf also can’t cleanly extract complex layouts such as multi-column text or tables. In such cases, post-processing or layout-aware tools are required.
Despite these limitations, pypdf is reliable for a wide range of everyday PDF tasks.
When to use pypdf
pypdf is best suited for automation, backend services, and scripts where you need simple and fast PDF processing. It works well in data pipelines, compliance systems, and internal tools.
If you need advanced layout understanding or visual analysis, you may need heavier tools. But for most developers, pypdf covers the basic needs with minimal effort.
Conclusion
pypdf is a practical and easy-to-use library for working with PDF files in Python. It allows you to read documents, extract text, merge and split files, rotate pages, and add security with just a few lines of code.
Its simple API and lightweight design make it a strong choice for developers who want to automate PDF workflows without added complexity. While it doesn’t solve every PDF problem, it handles the most common ones very well.
If you work with PDFs regularly, learning pypdf is a valuable skill that can save time and reduce manual work in many projects.