Python API

Preface

Thread incompatibility

PDFium is not thread-safe. Calling pdfium functions simultaneously from different threads is not allowed, not even with different documents. [1] However, pdfium may still be used in a threaded context if it is ensured that only a single pdfium call can be made at a time (e.g. via a mutex). It is fine to do pdfium work in one thread and other work in other threads.

The same applies to pypdfium2’s helpers, or any wrapper calling pdfium, whether directly or indirectly, unless protected by a mutex.

To parallelize expensive pdfium tasks such as rendering, consider processes instead of threads.
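The mutex approach mentioned above can be sketched with the standard library alone. This is only an illustration: PDFIUM_LOCK and call_serialized are hypothetical names, and fake_pdfium_call is a stand-in for any function that reaches pdfium. None of them are part of pypdfium2.

```python
import threading

# Hypothetical process-wide lock guarding every call that reaches pdfium.
PDFIUM_LOCK = threading.Lock()


def call_serialized(func, *args, **kwargs):
    """Run func while holding the lock, so only one guarded call runs at a time."""
    with PDFIUM_LOCK:
        return func(*args, **kwargs)


# Stand-in for a pdfium call; it records how many calls overlap.
active = 0
max_active = 0


def fake_pdfium_call(value):
    global active, max_active
    active += 1
    max_active = max(max_active, active)
    result = value * 2
    active -= 1
    return result


results = []
threads = [
    threading.Thread(
        target=lambda v=v: results.append(call_serialized(fake_pdfium_call, v))
    )
    for v in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock only helps if every code path that reaches pdfium goes through it, including helpers that call pdfium indirectly.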

API layers

pypdfium2 provides multiple API layers:

  • The raw PDFium API, to be used with ctypes (pypdfium2.raw or pypdfium2_raw [2]).

  • The support model API, which is a set of Python helper classes around the raw API (pypdfium2).

  • Additionally, there is the internal API, which contains various utilities that aid with using the raw API and are accessed by helpers, but do not fit in the support model namespace itself (pypdfium2.internal).

Wrapper objects provide a raw attribute to access the underlying ctypes object. In addition, helpers automatically resolve to raw if used as a C function parameter. [3] This makes it convenient to use helpers where available, while the raw API can still be accessed as needed.

The raw API is quite stable and provides a high level of backwards compatibility (seeing as PDFium is well-tested and relied on by popular projects), but it can be difficult to use, and special care needs to be taken with memory management.

The support model API is still in beta stage. It only covers a subset of pdfium features. Backwards incompatible changes may be applied occasionally, although we try to contain them within major releases. On the other hand, it is supposed to be safer and easier to use (“pythonic”), abstracting the finicky interaction with C functions.

Memory management

Note

This section covers the support model. It is not applicable to the raw API alone!

PDFium objects commonly need to be closed by the caller to release allocated memory. [4] Where necessary, pypdfium2’s helper classes implement automatic closing on garbage collection using weakref.finalize. Additionally, they provide close() methods that can be used to release memory explicitly.

It may be advantageous to close objects explicitly instead of relying on Python garbage collection behaviour, to release allocated memory and acquired file handles immediately. [5]

Closed objects must not be accessed anymore. Closing an object sets its raw attribute to None, which should prevent illegal use of closed raw handles. Attempts to re-close an already closed object are silently ignored. Closing a parent object automatically closes any open children (e.g. pages derived from a pdf).

Raw objects must not be detached from their wrappers. Accessing a raw object after it was closed, whether explicitly or on garbage collection of the wrapper, is illegal (use after free). Due to limitations in weakref, finalizers can only be attached to wrapper objects, although they logically belong to the raw object.
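The closing behaviour described in this section can be illustrated with a small, self-contained sketch built on weakref.finalize. It is not pypdfium2's actual implementation: FakeHandle, _release and Wrapper are hypothetical names that only mimic the documented semantics (raw set to None on close, re-closing ignored, release on garbage collection).

```python
import gc
import weakref


class FakeHandle:
    """Stand-in for a raw handle; records whether it was released."""

    def __init__(self):
        self.released = False


def _release(handle):
    # Stand-in for the C-level close function of a real handle.
    handle.released = True


class Wrapper:
    """Illustrative wrapper that closes on garbage collection or on close()."""

    def __init__(self, raw):
        self.raw = raw
        # The finalizer is attached to the wrapper and must not reference it,
        # otherwise the wrapper could never be collected.
        self._finalizer = weakref.finalize(self, _release, raw)

    def close(self):
        if self.raw is None:
            return  # already closed: silently ignored
        self._finalizer()  # runs _release(raw) at most once
        self.raw = None


# Explicit closing releases the handle immediately.
handle = FakeHandle()
wrapper = Wrapper(handle)
wrapper.close()
wrapper.close()  # second call is a no-op

# Without close(), the handle is released when the wrapper is collected.
other_handle = FakeHandle()
temporary = Wrapper(other_handle)
del temporary
gc.collect()
```

Because the finalizer holds the raw object rather than the wrapper, a raw object that outlives its wrapper would be released underneath the caller, which is why raw objects must not be detached.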

Version

Note

Version info can be fooled. Treat it as a guide rather than inherently reliable data.

PYPDFIUM_INFO = 4.29.0

pypdfium2 helpers version.

It is suggested to compare against api_tag and possibly also beta (see below).

Parameters:
  • version (str) – Joined tag and desc, forming the full version.

  • tag (str) – Version ciphers joined as str, including possible beta. Corresponds to the latest release tag at install time.

  • desc (str) – Non-cipher descriptors represented as str.

  • api_tag (tuple[int]) – Version ciphers joined as tuple, excluding possible beta.

  • major (int) – Major cipher.

  • minor (int) – Minor cipher.

  • patch (int) – Patch cipher.

  • beta (int | None) – Beta cipher, or None if not a beta version.

  • n_commits (int) – Number of commits after tag at install time. 0 for release.

  • hash (str | None) – Hash of head commit if n_commits > 0, None otherwise.

  • dirty (bool) – True if there were uncommitted changes at install time, False otherwise.

  • data_source (str) –

    Source of this version info. Possible values:

    • git: Parsed from git describe. Always used if available. Highest accuracy.

    • given: Pre-supplied version file (e.g. packaged with sdist, or else created by caller).

    • record: Parsed from autorelease record. Implies that possible changes after tag are unknown.

  • is_editable (bool | None) –

    True for editable install, False otherwise. None if unknown.

    If True, the version info is the one captured at install time. An arbitrary number of forward or reverse changes may have happened since. The actual current state is unknown.
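A comparison against api_tag and beta might look like the sketch below. The helper names is_at_least and is_stable_at_least are hypothetical, and the SimpleNamespace object is a stand-in carrying only the two fields used here; real code would pass PYPDFIUM_INFO instead.

```python
from types import SimpleNamespace


def is_at_least(info, minimum):
    """Return True if info.api_tag is at least the given (major, minor, patch)."""
    return tuple(info.api_tag) >= tuple(minimum)


def is_stable_at_least(info, minimum):
    """Like is_at_least(), but reject beta versions (beta is None otherwise)."""
    return info.beta is None and is_at_least(info, minimum)


# Stand-in with the documented fields; not the real PYPDFIUM_INFO object.
info = SimpleNamespace(api_tag=(4, 29, 0), beta=None)
beta_info = SimpleNamespace(api_tag=(5, 0, 0), beta=1)
```

Tuples compare element by element, so (4, 29, 0) sorts after (4, 9, 0), which a plain string comparison of the tag would get wrong.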

PDFIUM_INFO = 125.0.6406.0

PDFium version.

It is suggested to compare against build (see below).

Parameters:
  • version (str) – Joined tag and desc, forming the full version.

  • tag (str) – Version ciphers joined as str.

  • desc (str) – Descriptors (origin, flags) represented as str.

  • api_tag (tuple[int]) – Version ciphers joined as tuple.

  • major (int) – Chromium major cipher.

  • minor (int) – Chromium minor cipher.

  • build (int) – Chromium/pdfium build cipher. This value uniquely identifies the pdfium sources the binary was built from.

  • patch (int) – Chromium patch cipher.

  • n_commits (int) – Number of commits after tag at install time. 0 for tagged build commit.

  • hash (str | None) – Hash of head commit if n_commits > 0, None otherwise.

  • origin (str) –

    The pdfium binary’s origin. Possible values:

    • pdfium-binaries: Compiled by bblanchon/pdfium-binaries, and bundled into pypdfium2.

    • sourcebuild: Provided by the caller (commonly compiled using pypdfium2’s integrated build script), and bundled into pypdfium2.

    • system: Loaded from a standard system location using ctypes.util.find_library(), or an explicit directory provided at setup time.

  • flags (tuple[str]) – Tuple of pdfium feature flags. Empty for default build. (V8, XFA) for pdfium-binaries V8 build.
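As a rough sketch, a feature check based on build and flags could be written as follows. The function name binary_supports is hypothetical, and the SimpleNamespace objects are stand-ins with only the fields used; real code would pass PDFIUM_INFO.

```python
from types import SimpleNamespace


def binary_supports(info, min_build, required_flags=()):
    """Check the build number and that every required feature flag is present."""
    return info.build >= min_build and all(
        flag in info.flags for flag in required_flags
    )


# Stand-ins for a default build and a V8 build; values are illustrative only.
default_build = SimpleNamespace(build=6406, flags=())
v8_build = SimpleNamespace(build=6406, flags=("V8", "XFA"))
```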

Deprecated since version 4.22: The legacy members V_PYPDFIUM2, V_LIBPDFIUM, V_BUILDNAME, V_PDFIUM_IS_V8, V_LIBPDFIUM_FULL will be removed in version 5.

Document

class PdfDocument(input, password=None, autoclose=False)[source]

Bases: AutoCloseable

Document helper class.

Parameters:
  • input (str | pathlib.Path | bytes | ctypes.Array | BinaryIO | FPDF_DOCUMENT) – The input PDF given as file path, bytes, ctypes array, byte buffer, or raw PDFium document handle. A byte buffer is defined as an object that implements seek(), tell(), read() and readinto().

  • password (str | None) – A password to unlock the PDF, if encrypted. Otherwise, None or an empty string may be passed. If a password is given but the PDF is not encrypted, it will be ignored (as of PDFium 5418).

  • autoclose (bool) – Whether byte buffer input should be automatically closed on finalization.

Raises:
  • PdfiumError – Raised if the document failed to load. The exception message is annotated with the reason reported by PDFium.

  • FileNotFoundError – Raised if an invalid or non-existent file path was given.

Hint

  • len() may be called to get a document’s number of pages.

  • Looping over a document will yield its pages from beginning to end.

  • Pages may be loaded using list index access.

  • The del keyword and list index access may be used to delete pages.

raw

The underlying PDFium document handle.

Type:

FPDF_DOCUMENT

formenv

Form env, if the document has forms and init_forms() was called.

Type:

PdfFormEnv | None

property parent
classmethod new()[source]
Returns:

A new, empty document.

Return type:

PdfDocument

init_forms(config=None)[source]

Initialize a form env, if the document has forms. If already initialized, nothing will be done. See the formenv attribute.

Note

If form rendering is desired, this method should be called directly after constructing the document, before getting any page handles (due to PDFium’s API).

Parameters:

config (FPDF_FORMFILLINFO | None) – Custom form config interface to use (optional).

get_formtype()[source]
Returns:

PDFium form type that applies to the document (FORMTYPE_*). FORMTYPE_NONE if the document has no forms.

Return type:

int

get_pagemode()[source]
Returns:

Page displaying mode (PAGEMODE_*).

Return type:

int

is_tagged()[source]
Returns:

Whether the document is tagged (cf. PDF 1.7, 10.7 “Tagged PDF”).

Return type:

bool

save(dest, version=None, flags=pdfium_c.FPDF_NO_INCREMENTAL)[source]

Save the document at its current state.

Parameters:
  • dest (str | pathlib.Path | io.BytesIO) – File path or byte buffer the document shall be written to.

  • version (int | None) – The PDF version to use, given as an integer (14 for 1.4, 15 for 1.5, …). If None (the default), PDFium will set a version automatically.

  • flags (int) – PDFium saving flags (defaults to FPDF_NO_INCREMENTAL).
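The integer version notation (14 for 1.4, 15 for 1.5, …) can be converted to and from the usual dotted form with two small helpers. These functions are hypothetical and not part of pypdfium2; the sketch assumes a single-digit major and minor number.

```python
def pdf_version_to_int(version):
    """Convert a dotted version string such as "1.7" to the integer form (17)."""
    major, minor = version.split(".")
    return int(major) * 10 + int(minor)


def pdf_version_to_str(number):
    """Convert the integer form (e.g. 14) back to a dotted string ("1.4")."""
    major, minor = divmod(number, 10)
    return f"{major}.{minor}"
```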

get_identifier(type=pdfium_c.FILEIDTYPE_PERMANENT)[source]
Parameters:

type (int) – The identifier type to retrieve (FILEIDTYPE_*), either permanent or changing. If the file was updated incrementally, the permanent identifier stays the same, while the changing identifier is re-calculated.

Returns:

Unique file identifier from the PDF’s trailer dictionary. See PDF 1.7, Section 14.4 “File Identifiers”.

Return type:

bytes

get_version()[source]
Returns:

The PDF version of the document (14 for 1.4, 15 for 1.5, …), or None if the document is new or its version could not be determined.

Return type:

int | None

get_metadata_value(key)[source]
Returns:

Value of the given key in the PDF’s metadata dictionary. If the key is not contained, an empty string will be returned.

Return type:

str

METADATA_KEYS = ('Title', 'Author', 'Subject', 'Keywords', 'Creator', 'Producer', 'CreationDate', 'ModDate')
get_metadata_dict(skip_empty=False)[source]

Get the document’s metadata as dictionary.

Parameters:

skip_empty (bool) – If True, skip items whose value is an empty string.

Returns:

PDF metadata.

Return type:

dict
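The documented effect of skip_empty can be pictured with the following sketch, which builds the dictionary from get_metadata_value() and METADATA_KEYS. It is an assumption about the behaviour, not the library's actual code; metadata_dict and FakeDocument are hypothetical names.

```python
METADATA_KEYS = (
    "Title", "Author", "Subject", "Keywords",
    "Creator", "Producer", "CreationDate", "ModDate",
)


def metadata_dict(pdf, skip_empty=False):
    """Collect metadata via get_metadata_value(), optionally dropping empty values."""
    result = {}
    for key in METADATA_KEYS:
        value = pdf.get_metadata_value(key)
        if skip_empty and value == "":
            continue
        result[key] = value
    return result


class FakeDocument:
    """Stand-in that returns an empty string for missing keys, as documented."""

    def __init__(self, meta):
        self._meta = meta

    def get_metadata_value(self, key):
        return self._meta.get(key, "")


doc = FakeDocument({"Title": "Report", "Author": "A. Writer"})
full = metadata_dict(doc)
compact = metadata_dict(doc, skip_empty=True)
```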

count_attachments()[source]
Returns:

The number of embedded files in the document.

Return type:

int

get_attachment(index)[source]
Returns:

The attachment at index (zero-based).

Return type:

PdfAttachment

new_attachment(name)[source]

Add a new attachment to the document. It may appear at an arbitrary index (as of PDFium 5418).

Parameters:

name (str) – The name the attachment shall have. Usually a file name with extension.

Returns:

Handle to the new, empty attachment.

Return type:

PdfAttachment

del_attachment(index)[source]

Unlink the attachment at index (zero-based). It will be hidden from the viewer, but is still present in the file (as of PDFium 5418). Following attachments shift one slot to the left in the array representation used by PDFium’s API.

Handles to the attachment in question received from get_attachment() must not be accessed anymore after this method has been called.

get_page(index)[source]
Returns:

The page at index (zero-based).

Return type:

PdfPage

Note

This calls FORM_OnAfterLoadPage() if the document has an active form env. The form env must not be closed before the page is closed!

new_page(width, height, index=None)[source]

Insert a new, empty page into the document.

Parameters:
  • width (float) – Target page width (horizontal size).

  • height (float) – Target page height (vertical size).

  • index (int | None) – Suggested zero-based index at which the page shall be inserted. If None or larger than the document’s current last index, the page will be appended to the end.

Returns:

The newly created page.

Return type:

PdfPage

del_page(index)[source]

Remove the page at index (zero-based).

import_pages(pdf, pages=None, index=None)[source]

Import pages from a foreign document.

Parameters:
  • pdf (PdfDocument) – The document from which to import pages.

  • pages (list[int] | str | None) – The pages to include. It may either be a list of zero-based page indices, or a string of one-based page numbers and ranges. If None, all pages will be included.

  • index (int | None) – Zero-based index at which to insert the given pages. If None, they are appended to the end of the document.

get_page_size(index)[source]
Returns:

Width and height in PDF canvas units of the page at index (zero-based).

Return type:

(float, float)
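Combined with len(), get_page_size() allows collecting the dimensions of every page by index. A minimal sketch, in which collect_page_sizes is a hypothetical helper and FakeDocument is a stand-in exposing only the two members used:

```python
def collect_page_sizes(pdf):
    """Return the (width, height) of every page, using len() and get_page_size()."""
    return [pdf.get_page_size(i) for i in range(len(pdf))]


class FakeDocument:
    """Stand-in for a document; real code would pass a PdfDocument."""

    def __init__(self, sizes):
        self._sizes = sizes

    def __len__(self):
        return len(self._sizes)

    def get_page_size(self, index):
        return self._sizes[index]


sizes = collect_page_sizes(FakeDocument([(612.0, 792.0), (595.0, 842.0)]))
```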

get_page_label(index)[source]
Returns:

Label of the page at index (zero-based). (A page label is essentially an alias that may be displayed instead of the page number.)

Return type:

str

page_as_xobject(index, dest_pdf)[source]

Capture a page as XObject and attach it to a document’s resources.

Parameters:
  • index (int) – Zero-based index of the page.

  • dest_pdf (PdfDocument) – Target document to which the XObject shall be added.

Returns:

The page as XObject.

Return type:

PdfXObject

get_toc(max_depth=15, parent=None, level=0, seen=None)[source]

Iterate through the bookmarks in the document’s table of contents.

Parameters:

max_depth (int) – Maximum recursion depth to consider.

Yields:

PdfOutlineItem – Bookmark information.

render(converter, renderer=PdfPage.render, page_indices=None, pass_info=False, n_processes=None, mk_formconfig=None, **kwargs)[source]

Deprecated since version 4.19: This method will be removed with the next major release due to serious issues rooted in the original API design. Use PdfPage.render() instead. Note that the CLI provides parallel rendering using a proper caller-side process pool with inline saving in rendering jobs.

Changed in version 4.25: Removed the original process pool implementation and turned this into a wrapper for linear rendering, due to the serious conceptual issues and possible memory load escalation, especially with expensive receiving code (e.g. PNG encoding) or long documents. See the changelog for more info.

class PdfFormEnv(raw, config, pdf)[source]

Bases: AutoCloseable

Form environment helper class.

raw

The underlying PDFium form env handle.

Type:

FPDF_FORMHANDLE

config

Accompanying form configuration interface, to be kept alive.

Type:

FPDF_FORMFILLINFO

pdf

Parent document this form env belongs to.

Type:

PdfDocument

property parent
class PdfXObject(raw, pdf)[source]

Bases: AutoCloseable

XObject helper class.

raw

The underlying PDFium XObject handle.

Type:

FPDF_XOBJECT

pdf

Reference to the document this XObject belongs to.

Type:

PdfDocument

property parent
as_pageobject()[source]
Returns:

An independent page object representation of the XObject. If multiple page objects are created from one XObject, they share resources. Page objects created from an XObject remain valid after the XObject is closed.

Return type:

PdfObject

class PdfOutlineItem(level, title, is_closed, n_kids, page_index, view_mode, view_pos)

Bases: tuple

Bookmark information.

Parameters:
  • level (int) – Number of parent items.

  • title (str) – Title string of the bookmark.

  • is_closed (bool | None) – True if child items shall be collapsed, False if they shall be expanded. None if the item has no descendants (i. e. n_kids == 0).

  • n_kids (int) – Absolute number of child items, according to the PDF.

  • page_index (int | None) – Zero-based index of the page the bookmark points to. May be None if the bookmark has no target page (or it could not be determined).

  • view_mode (int) – A view mode constant (PDFDEST_VIEW_*) defining how the coordinates of view_pos shall be interpreted.

  • view_pos (list[float]) – Target position on the page the viewport should jump to when the bookmark is clicked. It is a sequence of float values in PDF canvas units. Depending on view_mode, it may contain between 0 and 4 coordinates.
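The level, title and page_index fields are enough to print an indented outline. In the sketch below, format_toc is a hypothetical helper and Item is a stand-in carrying a subset of the fields above; real code would feed it the items yielded by get_toc().

```python
from collections import namedtuple

# Stand-in with a subset of the documented fields.
Item = namedtuple("Item", "level title page_index")


def format_toc(items, indent="    "):
    """Return one text line per bookmark, indented by nesting level."""
    lines = []
    for item in items:
        # page_index is zero-based and may be None; show a one-based number or "-".
        target = "-" if item.page_index is None else str(item.page_index + 1)
        lines.append(f"{indent * item.level}{item.title} -> {target}")
    return lines


lines = format_toc(
    [Item(0, "Intro", 0), Item(1, "Scope", 1), Item(0, "Appendix", None)]
)
```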

Page

class PdfPage(raw, pdf, formenv)[source]

Bases: AutoCloseable

Page helper class.

raw

The underlying PDFium page handle.

Type:

FPDF_PAGE

pdf

Reference to the document this page belongs to.

Type:

PdfDocument

property parent
get_width()[source]
Returns:

Page width (horizontal size), in PDF canvas units.

Return type:

float

get_height()[source]
Returns:

Page height (vertical size), in PDF canvas units.

Return type:

float

get_size()[source]
Returns:

Page width and height, in PDF canvas units.

Return type:

(float, float)

get_rotation()[source]
Returns:

Clockwise page rotation in degrees.

Return type:

int

set_rotation(rotation)[source]

Define the absolute, clockwise page rotation (0, 90, 180, or 270 degrees).

get_mediabox(fallback_ok=True)[source]
Returns:

The page MediaBox in PDF canvas units, consisting of four coordinates (usually x0, y0, x1, y1). If MediaBox is not defined, returns ANSI A (0, 0, 612, 792) if fallback_ok=True, None otherwise.

Return type:

(float, float, float, float) | None

Note

Due to quirks in PDFium’s public API, all get_*box() functions except get_bbox() do not inherit from parent nodes in the page tree (as of PDFium 5418).

set_mediabox(l, b, r, t)[source]

Set the page’s MediaBox by passing four float coordinates (usually x0, y0, x1, y1).

get_cropbox(fallback_ok=True)[source]
Returns:

The page’s CropBox. If not defined, falls back to MediaBox.

set_cropbox(l, b, r, t)[source]

Set the page’s CropBox.

get_bleedbox(fallback_ok=True)[source]
Returns:

The page’s BleedBox. If not defined, falls back to CropBox.

set_bleedbox(l, b, r, t)[source]

Set the page’s BleedBox.

get_trimbox(fallback_ok=True)[source]
Returns:

The page’s TrimBox. If not defined, falls back to CropBox.

set_trimbox(l, b, r, t)[source]

Set the page’s TrimBox.

get_artbox(fallback_ok=True)[source]
Returns:

The page’s ArtBox. If not defined, falls back to CropBox.

set_artbox(l, b, r, t)[source]

Set the page’s ArtBox.

get_bbox()[source]
Returns:

The bounding box of the page (the intersection between its media box and crop box).

get_textpage()[source]
Returns:

A new text page handle for this page.

Return type:

PdfTextPage

insert_obj(pageobj)[source]

Insert a page object into the page.

The page object must not belong to a page yet. If it belongs to a PDF, this page must be part of the PDF.

Position and form are defined by the object’s matrix. If it is the identity matrix, the object will appear as-is on the bottom left corner of the page.

Parameters:

pageobj (PdfObject) – The page object to insert.

remove_obj(pageobj)[source]

Remove a page object from the page. As of PDFium 5692, detached page objects may only be re-inserted into existing pages of the same document. If the page object is not re-inserted into a page, its close() method may be called.

Parameters:

pageobj (PdfObject) – The page object to remove.

gen_content()[source]

Generate page content to apply additions, removals or modifications of page objects.

If page content was changed, this function should be called once before saving the document or re-loading the page.

get_objects(filter=None, max_depth=2, form=None, level=0)[source]

Iterate through the page objects on this page.

Parameters:
  • filter (list[int] | None) – An optional list of page object types to filter (FPDF_PAGEOBJ_*). Any objects whose type is not contained will be skipped. If None or empty, all objects will be provided, regardless of their type.

  • max_depth (int) – Maximum recursion depth to consider when descending into Form XObjects.

Yields:

PdfObject – A page object.

render(scale=1, rotation=0, crop=(0, 0, 0, 0), may_draw_forms=True, bitmap_maker=PdfBitmap.new_native, color_scheme=None, fill_to_stroke=False, **kwargs)[source]

Rasterize the page to a PdfBitmap.

Parameters:
  • scale (float) – A factor scaling the number of pixels per PDF canvas unit. This defines the resolution of the image. To convert a DPI value to a scale factor, multiply it by the size of 1 canvas unit in inches (usually 1/72in). [6]

  • rotation (int) – Additional rotation in degrees (0, 90, 180, or 270).

  • crop (tuple[float, float, float, float]) – Amount in PDF canvas units to cut off from page borders (left, bottom, right, top). Crop is applied after rotation.

  • may_draw_forms (bool) – If True, render form fields (provided the document has forms and init_forms() was called).

  • bitmap_maker (Callable) – Callback function used to create the PdfBitmap.

  • color_scheme (PdfColorScheme | None) – An optional, custom rendering color scheme.

  • fill_to_stroke (bool) – If True and rendering with custom color scheme, fill paths will be stroked.

  • fill_color (tuple[int, int, int, int]) – Color the bitmap will be filled with before rendering (RGBA values from 0 to 255).

  • grayscale (bool) – If True, render in grayscale mode.

  • optimize_mode (None | str) – Page rendering optimization mode (None, “lcd”, “print”).

  • draw_annots (bool) – If True, render page annotations.

  • no_smoothtext (bool) – If True, disable text anti-aliasing. Overrides optimize_mode="lcd".

  • no_smoothimage (bool) – If True, disable image anti-aliasing.

  • no_smoothpath (bool) – If True, disable path anti-aliasing.

  • force_halftone (bool) – If True, always use halftone for image stretching.

  • limit_image_cache (bool) – If True, limit image cache size.

  • rev_byteorder (bool) – If True, render with reverse byte order, leading to RGB(A/X) output instead of BGR(A/X). Other pixel formats are not affected.

  • prefer_bgrx (bool) – If True, prefer four-channel over three-channel pixel formats, even if the alpha byte is unused. Other pixel formats are not affected.

  • force_bitmap_format (int | None) – If given, override automatic pixel format selection and enforce use of the given format (one of the FPDFBitmap_* constants).

  • extra_flags (int) – Additional PDFium rendering flags. May be combined with bitwise OR (| operator).

Returns:

Bitmap of the rendered page.

Return type:

PdfBitmap
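The DPI conversion described for the scale parameter amounts to a single multiplication. The helpers dpi_to_scale and render_at_dpi are hypothetical, and FakePage is a stand-in that merely echoes its arguments; the sketch assumes the usual canvas unit of 1/72 inch.

```python
def dpi_to_scale(dpi, unit_inches=1 / 72):
    """Convert a DPI value to a scale factor (pixels per PDF canvas unit)."""
    return dpi * unit_inches


def render_at_dpi(page, dpi, **kwargs):
    """Call page.render() with the scale derived from the requested DPI."""
    return page.render(scale=dpi_to_scale(dpi), **kwargs)


class FakePage:
    """Stand-in whose render() returns the received arguments."""

    def render(self, scale=1, **kwargs):
        return {"scale": scale, **kwargs}


call = render_at_dpi(FakePage(), 144, grayscale=True)
```

With these assumptions, 72 DPI corresponds to a scale of 1 and 300 DPI to a scale of roughly 4.17.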

class PdfColorScheme(path_fill, path_stroke, text_fill, text_stroke)[source]

Bases: object

Rendering color scheme. Each color shall be provided as a list of values for red, green, blue and alpha, ranging from 0 to 255.

convert(rev_byteorder)[source]
Returns:

The color scheme as FPDF_COLORSCHEME object.

Page Objects

class PdfObject(raw, *args, **kwargs)[source]

Bases: AutoCloseable

Page object helper class.

When constructing a PdfObject, an instance of a more specific subclass may be returned instead, depending on the object’s type (e. g. PdfImage).

raw

The underlying PDFium pageobject handle.

Type:

FPDF_PAGEOBJECT

type

The object’s type (FPDF_PAGEOBJ_*).

Type:

int

page

Reference to the page this pageobject belongs to. May be None if it does not belong to a page yet.

Type:

PdfPage

pdf

Reference to the document this pageobject belongs to. May be None if the object does not belong to a document yet. This attribute is always set if page is set.

Type:

PdfDocument

level

Nesting level signifying the number of parent Form XObjects, at the time of construction. Zero if the object is not nested in a Form XObject.

Type:

int

property parent
get_pos()[source]

Get the position of the object on the page.

Returns:

A tuple of four float coordinates for left, bottom, right, and top.

get_matrix()[source]
Returns:

The pageobject’s current transform matrix.

Return type:

PdfMatrix

set_matrix(matrix)[source]
Parameters:

matrix (PdfMatrix) – Set this matrix as the pageobject’s transform matrix.

transform(matrix)[source]
Parameters:

matrix (PdfMatrix) – Multiply the page object’s current transform matrix by this matrix.

class PdfImage(raw, *args, **kwargs)[source]

Bases: PdfObject

Image object helper class (specific kind of page object).

SIMPLE_FILTERS = ('ASCIIHexDecode', 'ASCII85Decode', 'RunLengthDecode', 'FlateDecode', 'LZWDecode')

Filters applied by FPDFImageObj_GetImageDataDecoded(). Hereafter referred to as “simple filters”, while non-simple filters will be called “complex filters”.

classmethod new(pdf)[source]
Parameters:

pdf (PdfDocument) – The document to which the new image object shall be added.

Returns:

Handle to a new, empty image. Note that position and size of the image are defined by its matrix, which defaults to the identity matrix. This means that new images will appear as a tiny square of 1x1 units on the bottom left corner of the page. Use PdfMatrix and set_matrix() to adjust size and position.

Return type:

PdfImage

get_metadata()[source]

Retrieve image metadata including DPI, bits per pixel, color space, and size. If the image does not belong to a page yet, bits per pixel and color space will be unset (0).

Note

  • The DPI values signify the resolution of the image on the PDF page, not the DPI metadata embedded in the image file.

  • Due to issues in PDFium, this function can be slow. If you only need image size, prefer the faster get_size() instead.

Returns:

Image metadata structure

Return type:

FPDF_IMAGEOBJ_METADATA

get_size()[source]

New in version 4.8/5731.

Returns:

Image dimensions as a tuple of (width, height).

Return type:

(int, int)

load_jpeg(source, pages=None, inline=False, autoclose=True)[source]

Set a JPEG as the image object’s content.

Parameters:
  • source (str | pathlib.Path | BinaryIO) – Input JPEG, given as file path or readable byte buffer.

  • pages (list[PdfPage] | None) – If replacing an image, pass in a list of loaded pages that might contain it, to update their cache. (The same image may be shown multiple times in different transforms across a PDF.) May be None or an empty sequence if the image is not shared.

  • inline (bool) – Whether to load the image content into memory. If True, the buffer may be closed after this function call. Otherwise, the buffer needs to remain open until the PDF is closed.

  • autoclose (bool) – If the input is a buffer, whether it should be automatically closed once not needed by the PDF anymore.

set_bitmap(bitmap, pages=None)[source]

Set a bitmap as the image object’s content. The pixel data will be flate compressed (as of PDFium 5418).

Parameters:
  • bitmap (PdfBitmap) – The bitmap to inject into the image object.

  • pages (list[PdfPage] | None) – A list of loaded pages that might contain the image object. See load_jpeg().

get_bitmap(render=False)[source]

Get a bitmap rasterization of the image.

Parameters:

render (bool) – Whether the image should be rendered, thereby applying possible transform matrices and alpha masks.

Returns:

Image bitmap (with a buffer allocated by PDFium).

Return type:

PdfBitmap

get_data(decode_simple=False)[source]
Parameters:

decode_simple (bool) – If True, apply simple filters, resulting in semi-decoded data (see SIMPLE_FILTERS). Otherwise, the raw data will be returned.

Returns:

The data of the image stream (as c_ubyte array).

Return type:

ctypes.Array

get_filters(skip_simple=False)[source]
Parameters:

skip_simple (bool) – If True, exclude simple filters.

Returns:

A list of image filters, to be applied in order (from lowest to highest index).

Return type:

list[str]

extract(dest, *args, **kwargs)[source]

Extract the image into an independently usable file or byte buffer. Where PDFium’s limited public API permits, the image data is transferred directly, avoiding an unnecessary layer of decoding and re-encoding. Otherwise, the fully decoded data is retrieved and (re-)encoded using PIL.

As PDFium does not expose all required information, only DCTDecode (JPEG) and JPXDecode (JPEG 2000) images can be extracted directly. For images with complex filters, the bitmap data is used. Otherwise, get_data(decode_simple=True) is used, which avoids lossy conversion for images whose bit depth or colour format is not supported by PDFium’s bitmap implementation.

Parameters:
  • dest (str | io.BytesIO) – File prefix or byte buffer to which the image shall be written.

  • fb_format (str) – The image format to use in case it is necessary to (re-)encode the data.

  • fb_render (bool) – Whether the image should be rendered if falling back to bitmap-based extraction.

Text Page

class PdfTextPage(raw, page)[source]

Bases: AutoCloseable

Text page helper class.

raw

The underlying PDFium textpage handle.

Type:

FPDF_TEXTPAGE

page

Reference to the page this textpage belongs to.

Type:

PdfPage

property parent
get_text_range(index=0, count=-1, errors='ignore', force_this=False)[source]

Warning

Changed in version 4.28: Unexpected upstream changes have caused allocation size concerns with this API. Using it is now discouraged unless you specifically need to extract a character range. Prefer get_text_bounded() where possible. Calling this method with default params now implicitly translates to get_text_bounded() (pass force_this=True to circumvent).

Extract text from a given range.

Parameters:
  • index (int) – Index of the first char to include.

  • count (int) – Number of chars to cover, relative to the internal char list. Defaults to -1 for all remaining chars after index.

  • errors (str) – Error handling when decoding the data (see bytes.decode()).

Returns:

The text in the range in question, or an empty string if no text was found.

Return type:

str

Note

  • The returned text’s length does not have to match count, even if it will for most PDFs. This is because the underlying API may exclude or insert chars compared to the internal list, although this is rare in practice. Consequently, if the char at i is excluded, get_text_range(i, 2)[1] will raise an index error. PDFium provides the raw APIs FPDFText_GetTextIndexFromCharIndex() / FPDFText_GetCharIndexFromTextIndex() to translate between the two views and identify excluded/inserted chars.

  • In case of leading/trailing excluded characters, pypdfium2 modifies index and count accordingly to prevent pdfium from unexpectedly reading beyond range(index, index+count).

get_text_bounded(left=None, bottom=None, right=None, top=None, errors='ignore')[source]

Extract text from given boundaries in PDF coordinates. If a boundary value is None, it defaults to the corresponding value of PdfPage.get_bbox().

Parameters:

errors (str) – Error treatment when decoding the data (see bytes.decode()).

Returns:

The text on the page area in question, or an empty string if no text was found.

Return type:

str

count_chars()[source]
Returns:

The number of characters on the text page.

Return type:

int

count_rects(index=0, count=-1)[source]
Parameters:
  • index (int) – Start character index.

  • count (int) – Character count to consider (defaults to -1 for all remaining).

Returns:

The number of text rectangles in the given character range.

Return type:

int

get_index(x, y, x_tol, y_tol)[source]

Get the index of a character by position.

Parameters:
  • x (float) – Horizontal position (in PDF canvas units).

  • y (float) – Vertical position.

  • x_tol (float) – Horizontal tolerance.

  • y_tol (float) – Vertical tolerance.

Returns:

The index of the character at or nearby the point (x, y). May be None if there is no character or an error occurred.

Return type:

int | None

get_charbox(index, loose=False)[source]

Get the bounding box of a single character.

Parameters:
  • index (int) – Index of the character to work with, in the page’s character array.

  • loose (bool) – Get a more comprehensive box covering the entire font bounds, as opposed to the default tight box specific to the one character.

Returns:

Float values for left, bottom, right and top in PDF canvas units.

get_rect(index)[source]

Get the bounding box of a text rectangle at the given index. Note that count_rects() must be called once with default parameters before subsequent get_rect() calls for this function to work (due to PDFium’s API).

Returns:

Float values for left, bottom, right and top in PDF canvas units.
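The required call order (count_rects() first, then get_rect()) can be wrapped in one helper. collect_text_rects is a hypothetical name, and FakeTextPage is a stand-in that mimics the ordering requirement; real code would pass a PdfTextPage.

```python
def collect_text_rects(textpage):
    """Return all text rectangles, calling count_rects() before any get_rect()."""
    n_rects = textpage.count_rects()  # default parameters, as required
    return [textpage.get_rect(i) for i in range(n_rects)]


class FakeTextPage:
    """Stand-in that refuses get_rect() until count_rects() was called."""

    def __init__(self, rects):
        self._rects = rects
        self._counted = False

    def count_rects(self, index=0, count=-1):
        self._counted = True
        return len(self._rects)

    def get_rect(self, index):
        if not self._counted:
            raise RuntimeError("count_rects() must be called first")
        return self._rects[index]


rects = collect_text_rects(FakeTextPage([(10.0, 20.0, 110.0, 32.0)]))
```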

search(text, index=0, match_case=False, match_whole_word=False, consecutive=False)[source]

Locate text on the page.

Parameters:
  • text (str) – The string to search for.

  • index (int) – Character index at which to start searching.

  • match_case (bool) – If True, the search will be case-specific (upper and lower letters treated as different characters).

  • match_whole_word (bool) – If True, substring occurrences will be ignored (e. g. cat would not match category).

  • consecutive (bool) – If False (the default), search() will skip past the current match to look for the next match. If True, parts of the previous match may be caught again (e. g. searching for aa in aaaa would match 3 rather than 2 times).

Returns:

A helper object to search text.

Return type:

PdfTextSearcher
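
The effect of the consecutive parameter can be illustrated on plain Python strings. This sketch does not call PDFium and is not how the searcher is implemented. The function find_occurrences is a hypothetical stand-in that only mirrors the described skip-past versus overlap behaviour.

```python
def find_occurrences(haystack, needle, consecutive=False):
    """Return (index, count) pairs for each match of needle in haystack."""
    if not needle:
        raise ValueError("needle must not be empty")
    matches = []
    start = 0
    while True:
        index = haystack.find(needle, start)
        if index == -1:
            break
        matches.append((index, len(needle)))
        # Advance by one character to allow overlaps, or skip past the whole match.
        start = index + (1 if consecutive else len(needle))
    return matches


skipping = find_occurrences("aaaa", "aa")                       # 2 matches
overlapping = find_occurrences("aaaa", "aa", consecutive=True)  # 3 matches
```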

class PdfTextSearcher(raw, textpage)[source]

Bases: AutoCloseable

Text searcher helper class.

raw

The underlying PDFium searcher handle.

Type:

FPDF_SCHHANDLE

textpage

Reference to the textpage this searcher belongs to.

Type:

PdfTextPage

property parent
get_next()[source]
Returns:

Start character index and count of the next occurrence, or None if the last occurrence was passed.

Return type:

(int, int) | None

get_prev()[source]
Returns:

Start character index and count of the previous occurrence (i. e. the one before the last valid occurrence), or None if the last occurrence was passed.

Return type:

(int, int) | None
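
Because get_next() signals the end with None, occurrences can be collected with a simple loop. The sketch below uses a hypothetical StandInSearcher class so that it runs without a PDF. With pypdfium2, the object returned by PdfTextPage.search() would presumably take its place.

```python
class StandInSearcher:
    """Hypothetical stand-in mimicking get_next(): returns (index, count) or None."""

    def __init__(self, occurrences):
        self._occurrences = iter(occurrences)

    def get_next(self):
        return next(self._occurrences, None)


def iter_occurrences(searcher):
    """Yield (index, count) pairs until get_next() returns None."""
    while True:
        occurrence = searcher.get_next()
        if occurrence is None:
            return
        yield occurrence


found = list(iter_occurrences(StandInSearcher([(3, 5), (40, 5)])))
```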

Bitmap

class PdfBitmap(raw, buffer, width, height, stride, format, rev_byteorder, needs_free)[source]

Bases: AutoCloseable

Bitmap helper class.

Hint

This class provides built-in converters (e. g. to_pil(), to_numpy()) that may be used to create a different representation of the bitmap. Converters can be applied on PdfBitmap objects either as a bound method (bitmap.to_*()), or as a function (PdfBitmap.to_*(bitmap)). The second pattern is useful for API methods that need to apply a caller-provided converter (e. g. PdfDocument.render()).

Note

All attributes of PdfBitmapInfo are available in this class as well.

Warning

bitmap.close(), which frees the buffer of foreign bitmaps, is not validated for safety. A bitmap must not be closed when other objects still depend on its buffer!

raw

The underlying PDFium bitmap handle.

Type:

FPDF_BITMAP

buffer

A ctypes array representation of the pixel data (each item is an unsigned byte, i. e. a number ranging from 0 to 255).

Type:

c_ubyte

property parent
get_info()[source]
Returns:

A namedtuple describing the bitmap.

Return type:

PdfBitmapInfo

classmethod from_raw(raw, rev_byteorder=False, ex_buffer=None)[source]

Construct a PdfBitmap wrapper around a raw PDFium bitmap handle.

Parameters:
  • raw (FPDF_BITMAP) – PDFium bitmap handle.

  • rev_byteorder (bool) – Whether the bitmap uses reverse byte order.

  • ex_buffer (c_ubyte | None) – If the bitmap was created from a buffer allocated by Python/ctypes, pass in the ctypes array to keep it referenced.

classmethod new_native(width, height, format, rev_byteorder=False, buffer=None)[source]

Create a new bitmap using FPDFBitmap_CreateEx(), with a buffer allocated by Python/ctypes. Bitmaps created by this function are always packed (no unused bytes at line end).

classmethod new_foreign(width, height, format, rev_byteorder=False, force_packed=False)[source]

Create a new bitmap using FPDFBitmap_CreateEx(), with a buffer allocated by PDFium.

Using this method is discouraged. Prefer new_native() instead.

classmethod new_foreign_simple(width, height, use_alpha, rev_byteorder=False)[source]

Create a new bitmap using FPDFBitmap_Create(). The buffer is allocated by PDFium. The resulting bitmap is supposed to be packed (i. e. no gap of unused bytes between lines).

Using this method is discouraged. Prefer new_native() instead.

fill_rect(left, top, width, height, color)[source]

Fill a rectangle on the bitmap with the given color. The coordinate system starts at the top left corner of the image.

Note

This function replaces the color values in the given rectangle. It does not perform alpha compositing.

Parameters:

color (tuple[int, int, int, int]) – RGBA fill color (a tuple of 4 integers ranging from 0 to 255).
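
To make the "replace, not composite" behaviour concrete, here is a minimal pure-Python sketch on a packed 4-channel buffer. It is not PDFium's implementation. The function fill_rect_packed is hypothetical, performs no bounds checking, and ignores byte order.

```python
def fill_rect_packed(buffer, width, left, top, rect_width, rect_height, color):
    """Overwrite a rectangle in a packed 4-channel buffer (no blending)."""
    n_channels = 4
    stride = width * n_channels  # packed: no padding at line end
    pixel = bytes(color)
    for row in range(top, top + rect_height):
        start = row * stride + left * n_channels
        # Existing values are replaced outright, including the alpha byte.
        buffer[start:start + rect_width * n_channels] = pixel * rect_width


canvas = bytearray(3 * 2 * 4)  # 3 pixels wide, 2 pixels high, all zero
fill_rect_packed(canvas, width=3, left=1, top=0, rect_width=2, rect_height=1,
                 color=(255, 0, 0, 128))
```

After the call, the two rightmost pixels of the first line hold exactly the given values, regardless of what was stored there before.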

to_numpy()[source]

Convert the bitmap to a numpy array.

The array contains as many rows as the bitmap is high. Each row contains as many pixels as the bitmap is wide. The length of each pixel corresponds to the number of channels.

The resulting array is supposed to share memory with the original bitmap buffer, so changes to the buffer should be reflected in the array, and vice versa.

Returns:

NumPy array (representation of the bitmap buffer).

Return type:

numpy.ndarray
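
What "sharing memory" means can be shown with the standard library alone. This sketch deliberately does not use NumPy or pypdfium2. It creates two views over one block of memory and shows that a write through either view is visible through the other.

```python
import ctypes

backing = bytearray(6)  # memory owned by Python
# A ctypes array of unsigned bytes over the same memory (no copy is made).
view = (ctypes.c_ubyte * len(backing)).from_buffer(backing)

view[0] = 255   # write through the ctypes array ...
backing[5] = 7  # ... and through the bytearray

first_byte = backing[0]
last_byte = view[5]
```

A copy, by contrast, would stay unchanged after such writes, which is why the distinction matters when modifying converted bitmaps.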

to_pil()[source]

Convert the bitmap to a PIL image, using PIL.Image.frombuffer().

For RGBA, RGBX and L buffers, PIL is supposed to share memory with the original bitmap buffer, so changes to the buffer should be reflected in the image, and vice versa. Otherwise, PIL will make a copy of the data.

Returns:

PIL image (representation or copy of the bitmap buffer).

Return type:

PIL.Image.Image

Changed in version 4.16: Set image.readonly = False so that changes to the image are also reflected in the buffer.

classmethod from_pil(pil_image, recopy=False)[source]

Convert a PIL image to a PDFium bitmap. Due to the restricted number of color formats and bit depths supported by PDFium’s bitmap implementation, this may be a lossy operation.

Bitmaps returned by this function should be treated as immutable (i.e. don’t call fill_rect()).

Parameters:

pil_image (PIL.Image.Image) – The image.

Returns:

PDFium bitmap (with a copy of the PIL image’s data).

Return type:

PdfBitmap

Deprecated since version 4.25: The recopy parameter has been deprecated.

class PdfBitmapInfo(width, height, stride, format, rev_byteorder, n_channels, mode)

Bases: tuple

width

Width of the bitmap (horizontal size).

Type:

int

height

Height of the bitmap (vertical size).

Type:

int

stride

Number of bytes per line in the bitmap buffer. Depending on how the bitmap was created, there may be a padding of unused bytes at the end of each line, so this value can be greater than width * n_channels.

Type:

int
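
When stride exceeds width * n_channels, each line carries trailing padding that has to be skipped when reading pixels. A minimal sketch follows. The function packed_rows is a hypothetical helper and not part of pypdfium2.

```python
def packed_rows(buffer, width, height, stride, n_channels):
    """Return each line's pixel bytes with any trailing padding removed."""
    line_bytes = width * n_channels
    if stride < line_bytes:
        raise ValueError("stride cannot be smaller than width * n_channels")
    return [bytes(buffer[y * stride : y * stride + line_bytes]) for y in range(height)]


# 2x2 single-channel bitmap with stride 4: two pixel bytes plus two padding bytes per line.
padded = bytes([10, 20, 0, 0, 30, 40, 0, 0])
rows = packed_rows(padded, width=2, height=2, stride=4, n_channels=1)
```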

format

PDFium bitmap format constant (FPDFBitmap_*).

Type:

int

rev_byteorder

Whether the bitmap is using reverse byte order.

Type:

bool

n_channels

Number of channels per pixel.

Type:

int

mode

The bitmap format as string (see PIL Modes).

Type:

str

Matrix

class PdfMatrix(a=1, b=0, c=0, d=1, e=0, f=0)[source]

Bases: object

PDF transformation matrix helper class.

See the PDF 1.7 specification, Section 8.3.3 (“Common Transformations”).

Note

  • The PDF format uses row vectors.

  • Transformations operate from the origin of the coordinate system (PDF coordinates: bottom left corner, Device coordinates: top left corner).

  • Matrix calculations are implemented independently in Python.

  • Matrix objects are immutable, so transforming methods return a new matrix.

  • Matrix objects implement ctypes auto-conversion to FS_MATRIX for easy use as C function parameter.

a

Matrix value [0][0].

Type:

float

b

Matrix value [0][1].

Type:

float

c

Matrix value [1][0].

Type:

float

d

Matrix value [1][1].

Type:

float

e

Matrix value [2][0] (X translation).

Type:

float

f

Matrix value [2][1] (Y translation).

Type:

float
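
Under the row-vector convention, a point is transformed as [x' y' 1] = [x y 1] × M, where M has the rows (a, b, 0), (c, d, 0) and (e, f, 1). The following sketch spells this out on a plain tuple. The function apply_to_point is hypothetical and independent of PdfMatrix.

```python
def apply_to_point(matrix, x, y):
    """Transform (x, y) with a matrix given as (a, b, c, d, e, f), row-vector style."""
    a, b, c, d, e, f = matrix
    # [x' y' 1] = [x y 1] * [[a, b, 0], [c, d, 0], [e, f, 1]]
    return (a * x + c * y + e, b * x + d * y + f)


unchanged = apply_to_point((1, 0, 0, 1, 0, 0), 5, 5)  # identity matrix
shifted = apply_to_point((1, 0, 0, 1, 10, 20), 5, 5)  # translation by (10, 20)
scaled = apply_to_point((2, 0, 0, 3, 0, 0), 5, 5)     # scale x by 2, y by 3
```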

get()[source]

Get the matrix as tuple of the form (a, b, c, d, e, f).

classmethod from_raw(raw)[source]

Load a PdfMatrix from a raw FS_MATRIX object.

to_raw()[source]

Convert the matrix to a raw FS_MATRIX object.

multiply(other)[source]

Multiply this matrix by another PdfMatrix, to concatenate transformations.
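
With row vectors, "apply M1, then M2" corresponds to the product M1 × M2, so the order of concatenation matters. The sketch below computes that product on plain 6-tuples. The function concat is hypothetical, and the sketch makes no claim about which operand order PdfMatrix.multiply() uses internally.

```python
def concat(m1, m2):
    """Return the 6-value matrix for 'apply m1, then m2' (row-vector convention)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


translate = (1, 0, 0, 1, 10, 0)
scale = (2, 0, 0, 2, 0, 0)
translate_then_scale = concat(translate, scale)  # the shift is scaled as well
scale_then_translate = concat(scale, translate)  # the shift stays at 10
```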

translate(x, y)[source]
Parameters:
  • x (float) – Horizontal shift (<0: left, >0: right).

  • y (float) – Vertical shift.

scale(x, y)[source]
Parameters:
  • x (float) – A factor to scale the X axis (<1: compress, >1: stretch).

  • y (float) – A factor to scale the Y axis.

rotate(angle, ccw=False, rad=False)[source]
Parameters:
  • angle (float) – Angle by which to rotate the matrix.

  • ccw (bool) – If True, rotate counter-clockwise.

  • rad (bool) – If True, interpret the angle as radians.
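
For orientation, the PDF specification gives the counter-clockwise rotation matrix as [cos θ, sin θ, −sin θ, cos θ, 0, 0]. The sketch below builds it on a plain tuple and treats clockwise rotation as the negated angle. The function rotation_matrix is hypothetical, and whether PdfMatrix.rotate() computes its result in exactly this way is an assumption.

```python
import math


def rotation_matrix(angle, ccw=False, rad=False):
    """Build (a, b, c, d, e, f) for a rotation about the origin."""
    theta = angle if rad else math.radians(angle)
    if not ccw:
        theta = -theta  # clockwise equals counter-clockwise by the negated angle
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Counter-clockwise form from the PDF specification: [cos sin -sin cos 0 0]
    return (cos_t, sin_t, -sin_t, cos_t, 0.0, 0.0)


def rotate_point(matrix, x, y):
    """Apply the matrix to (x, y) using the row-vector convention."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


ccw_image = rotate_point(rotation_matrix(90, ccw=True), 1, 0)  # close to (0, 1)
cw_image = rotate_point(rotation_matrix(90), 1, 0)             # close to (0, -1)
```

The results are only approximately equal to the exact values because of floating-point rounding in the trigonometric functions.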

mirror(v, h)[source]
Parameters:
  • v (bool) – Whether to mirror vertically (at the Y axis).

  • h (bool) – Whether to mirror horizontally (at the X axis).

skew(x_angle, y_angle, rad=False)[source]
Parameters:
  • x_angle (float) – Inner angle to skew the X axis.

  • y_angle (float) – Inner angle to skew the Y axis.

  • rad (bool) – If True, interpret the angles as radians.

on_point(x, y)[source]
Returns:

Transformed point.

Return type:

(float, float)

on_rect(left, bottom, right, top)[source]
Returns:

Transformed rectangle.

Return type:

(float, float, float, float)

Attachment

class PdfAttachment(raw, pdf)[source]

Bases: AutoCastable

Attachment helper class. See PDF 1.7, Section 7.11 “File Specifications”.

raw

The underlying PDFium attachment handle.

Type:

FPDF_ATTACHMENT

pdf

Reference to the document this attachment belongs to. Must remain valid as long as the attachment is used.

Type:

PdfDocument

get_name()[source]
Returns:

Name of the attachment.

Return type:

str

get_data()[source]
Returns:

The attachment’s file data (as c_char array).

Return type:

ctypes.Array
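
A c_char array can be turned into an ordinary bytes object using standard ctypes behaviour. The sketch below uses a stand-in array created with from_buffer_copy() rather than a real attachment, so the sample data is purely illustrative.

```python
import ctypes

# Stand-in for the array that get_data() returns (assumption: any c_char array behaves alike).
data = (ctypes.c_char * 8).from_buffer_copy(b"%PDF-1.7")

as_bytes = bytes(data)  # copies the full array content
via_raw = data.raw      # equivalent; unlike .value, it does not stop at a NUL byte
```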

set_data(data)[source]

Set the attachment’s file data. If this function is called on an existing attachment, it will be changed to point at the new data, but the previous data will not be removed from the file (as of PDFium 5418).

Parameters:

data (bytes | ctypes.Array) – New file data for the attachment. May be any data type that can be implicitly converted to c_void_p.

has_key(key)[source]
Parameters:

key (str) – A key to look for in the attachment’s params dictionary.

Returns:

True if key is contained in the params dictionary, False otherwise.

Return type:

bool

get_value_type(key)[source]
Returns:

Type of the value of key in the params dictionary (FPDF_OBJECT_*).

Return type:

int

get_str_value(key)[source]
Returns:

The value of key in the params dictionary, if it is a string or name. Otherwise, an empty string will be returned. On other failures, an exception will be raised.

Return type:

str

set_str_value(key, value)[source]

Set the attribute specified by key to the string value.

Parameters:

value (str) – New string value for the attribute.

Miscellaneous

exception PdfiumError[source]

Bases: RuntimeError

An exception from the PDFium library, detected by function return code.

class PdfUnspHandler[source]

Bases: object

Unsupported feature handler helper class.

handlers

A dictionary of named handler functions to be called with an unsupported code (FPDF_UNSP_*) when PDFium detects an unsupported feature.

Type:

dict[str, Callable]

setup(add_default=True)[source]

Attach the handler to PDFium, and register an exit function to keep the object alive for the rest of the session.

Parameters:

add_default (bool) – If True, add a default callback that will log unsupported features as warning.
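
The idea of a dictionary of named handlers can be sketched without PDFium. All names below are hypothetical. The dispatch function merely stands in for the point at which PDFium would report an unsupported feature code.

```python
seen_codes = []


def record_code(code):
    """Example handler: remember each reported unsupported-feature code."""
    seen_codes.append(code)


# Named handler functions, each called with the unsupported code.
handlers = {"record": record_code}


def dispatch(code):
    """Call every registered handler with the given code."""
    for handler in handlers.values():
        handler(code)


dispatch(1)   # 1 is listed as 'XFA form' in UnsupportedInfoToStr
dispatch(11)  # 11 is listed as '3D annotation'
```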

Internal

Warning

The following helpers are considered internal, so their API may change at any time. They are isolated in their own namespace (pypdfium2.internal).

RotationToConst = {0: 0, 90: 1, 180: 2, 270: 3}

Convert a rotation value in degrees to a PDFium constant.

RotationToDegrees = {0: 0, 1: 90, 2: 180, 3: 270}

Convert a PDFium rotation constant to a value in degrees. Inverse of RotationToConst.
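
The two rotation tables are plain dictionaries, so one can be derived from the other. The sketch below copies the values shown above. The helper rotation_const is hypothetical and adds normalisation of arbitrary multiples of 90 degrees, which the tables themselves do not provide.

```python
RotationToConst = {0: 0, 90: 1, 180: 2, 270: 3}
# Swapping keys and values yields the reverse lookup.
RotationToDegrees = {const: degrees for degrees, const in RotationToConst.items()}


def rotation_const(degrees):
    """Map any multiple of 90 degrees (including negative values) to a rotation constant."""
    normalized = degrees % 360
    if normalized not in RotationToConst:
        raise ValueError(f"Rotation must be a multiple of 90 degrees, got {degrees}")
    return RotationToConst[normalized]
```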

BitmapTypeToNChannels = {1: 1, 2: 3, 3: 4, 4: 4}

Get the number of channels for a PDFium bitmap format. (FPDFBitmap_Unknown is deliberately not handled.)

BitmapTypeToStr = {1: 'L', 2: 'BGR', 3: 'BGRX', 4: 'BGRA'}

Convert a PDFium bitmap format to string, assuming BGR byte order. (FPDFBitmap_Unknown is deliberately not handled.)

BitmapTypeToStrReverse = {1: 'L', 2: 'RGB', 3: 'RGBX', 4: 'RGBA'}

Convert a PDFium bitmap format to string, assuming RGB byte order. (FPDFBitmap_Unknown is deliberately not handled.)
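
Choosing between the two tables depends on the byte order flag. A minimal sketch with the values copied from above follows. The function mode_string is hypothetical and not necessarily how pypdfium2 selects the mode internally.

```python
BitmapTypeToStr = {1: "L", 2: "BGR", 3: "BGRX", 4: "BGRA"}
BitmapTypeToStrReverse = {1: "L", 2: "RGB", 3: "RGBX", 4: "RGBA"}


def mode_string(bitmap_format, rev_byteorder):
    """Pick the mode string for a bitmap format, honouring the byte order flag."""
    table = BitmapTypeToStrReverse if rev_byteorder else BitmapTypeToStr
    return table[bitmap_format]  # raises KeyError for unknown formats


default_mode = mode_string(4, rev_byteorder=False)
reversed_mode = mode_string(4, rev_byteorder=True)
```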

BitmapStrToConst = {'BGR': 2, 'BGRA': 4, 'BGRX': 3, 'L': 1}

Convert a string to PDFium bitmap format, assuming BGR byte order. Inverse of BitmapTypeToStr.

BitmapStrReverseToConst = {'L': 1, 'RGB': 2, 'RGBA': 4, 'RGBX': 3}

Convert a string to PDFium bitmap format, assuming RGB byte order. Inverse of BitmapTypeToStrReverse.

FormTypeToStr = {0: 'None', 1: 'AcroForm', 2: 'XFA', 3: 'XFAF'}

Convert a PDFium form type (FORMTYPE_*) to string.

ColorspaceToStr = {0: '?', 1: 'DeviceGray', 2: 'DeviceRGB', 3: 'DeviceCMYK', 4: 'CalGray', 5: 'CalRGB', 6: 'Lab', 7: 'ICCBased', 8: 'Separation', 9: 'DeviceN', 10: 'Indexed', 11: 'Pattern'}

Convert a PDFium color space constant (FPDF_COLORSPACE_*) to string.

ViewmodeToStr = {0: '?', 1: 'XYZ', 2: 'Fit', 3: 'FitH', 4: 'FitV', 5: 'FitR', 6: 'FitB', 7: 'FitBH', 8: 'FitBV'}

Convert a PDFium view mode constant (PDFDEST_VIEW_*) to string.

ObjectTypeToStr = {0: '?', 1: 'text', 2: 'path', 3: 'image', 4: 'shading', 5: 'form'}

Convert a PDFium object type constant (FPDF_PAGEOBJ_*) to string.

ObjectTypeToConst = {'?': 0, 'form': 5, 'image': 3, 'path': 2, 'shading': 4, 'text': 1}

Convert an object type string to a PDFium constant. Inverse of ObjectTypeToStr.

PageModeToStr = {-1: '?', 0: 'None', 1: 'Outline', 2: 'Thumbnails', 3: 'Full-screen', 4: 'Layers', 5: 'Attachments'}

Convert a PDFium page mode constant (PAGEMODE_*) to string.

ErrorToStr = {0: 'Success', 1: 'Unknown error', 2: 'File access error', 3: 'Data format error', 4: 'Incorrect password error', 5: 'Unsupported security scheme error', 6: 'Page not found or content error'}

Convert a PDFium error constant (FPDF_ERR_*) to string.
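
A table like this is convenient for building readable error messages. The sketch below copies the values shown above and adds a fallback for codes that are not listed. The function describe_error is hypothetical.

```python
ErrorToStr = {
    0: "Success",
    1: "Unknown error",
    2: "File access error",
    3: "Data format error",
    4: "Incorrect password error",
    5: "Unsupported security scheme error",
    6: "Page not found or content error",
}


def describe_error(code):
    """Return a readable description, falling back to the bare code if unlisted."""
    return ErrorToStr.get(code, f"Error code {code}")


password_message = describe_error(4)
unlisted_message = describe_error(99)
```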

UnsupportedInfoToStr = {1: 'XFA form', 2: 'Portable collection', 3: 'Attachment (incomplete support)', 4: 'Security', 5: 'Shared review', 6: 'Shared form (acrobat)', 7: 'Shared form (filesystem)', 8: 'Shared form (email)', 11: '3D annotation', 12: 'Movie annotation', 13: 'Sound annotation', 14: 'Screen media annotation', 15: 'Screen rich media annotation', 16: 'Attachment annotation', 17: 'Signature annotation'}

Convert a PDFium unsupported constant (FPDF_UNSP_*) to string.

class AutoCastable[source]

Bases: object

class AutoCloseable(close_func, *args, obj=None, needs_free=True, **kwargs)[source]

Bases: AutoCastable

close(_by_parent=False)[source]
color_tohex(color, rev_byteorder)[source]
set_callback(struct, fname, callback)[source]
is_buffer(buf, spec='r')[source]
get_bufreader(buffer)[source]
get_bufwriter(buffer)[source]
pages_c_array(pages)[source]