
pdfannots2json Matthew Meyers

Use this command to install pdfannots2json:
winget install --id=MatthewMeyers.pdfannots2json -e

Extracts annotations from PDF and converts them to a JSON list.

pdfannots2json is a command-line tool designed to extract annotations from PDF files and convert them into a structured JSON format. This utility simplifies the process of working with annotated documents by providing a consistent and machine-readable output.

Key Features:

  • Extracts various annotation types, including highlights, strikes, underlines, text notes, rectangles, and more.
  • Converts rectangle annotations into images for visual reference.
  • Supports OCR (Optical Character Recognition) for extracting text from image-based annotations when Tesseract is installed.
  • Offers flexible command-line options to customize output, such as specifying image formats, DPI, quality, and paths.
  • Works across multiple platforms, including macOS, Linux, and Windows.

Audience & Benefit:

Ideal for developers, researchers, and professionals who need to programmatically analyze or process annotated PDFs. This tool streamlines workflows by converting annotations into a JSON format, enabling easy integration with other software systems or further processing.

Installation via winget is supported for convenience, allowing users to quickly set up the tool on their systems.

README

pdfannots2json

Extracts annotations from PDF and converts them to a JSON list.

Supported annotations:

  • highlight
  • strike
  • underline
  • text (also called notes)
  • rectangle
    • Note: rectangle annotations are exported as images

pdfannots2json uses UniPDF to extract annotations and MuPDF (Fitz) to extract images from PDFs.

Usage: pdfannots2json <input-pdf>

Arguments:
  <input-pdf>    Path to input PDF

Flags:
  -h, --help                          Show context-sensitive help.
  -v, --version                       Display the current version of pdfannots2json
  -b, --ignore-before=TIME            Ignore annotations added before this date. Must be ISO 8601 formatted
  -w, --no-write                      Do not save images to disk
  -o, --image-output-path=STRING      Output path of image annotations
  -n, --image-base-name="annot"       Base name of saved images
  -f, --image-format="jpg"            Image format. Supports png and jpg
  -d, --image-dpi=120                 Image DPI
  -q, --image-quality=90              Image quality. Only applies to jpg images
  -e, --attempt-ocr                   Attempt to extract text from images. tesseract-ocr must be installed on your system
  -l, --ocr-lang="eng"                Set the OCR language. Supports multiple languages, eg. 'eng+deu'. The desired languages must be installed
      --tesseract-path="tesseract"    Absolute path to the tesseract executable
      --tess-data-dir=STRING          Absolute path to the tesseract data folder
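
The flags above can also be assembled from a script. The following Python sketch builds an argument list from a few of them and runs the tool. The helper names (build_command, run) are hypothetical. It also assumes that the pdfannots2json executable is on the PATH, that the JSON list is printed to standard output, and that a UTC timestamp in the same form as the "date" values in the sample output is accepted by --ignore-before.

```python
import subprocess
from datetime import datetime, timezone


def build_command(pdf_path, image_dir=None, image_format="jpg", dpi=120,
                  ignore_before=None, write_images=True):
    """Assemble an argument list using flags from the help text above."""
    cmd = ["pdfannots2json",
           f"--image-format={image_format}",
           f"--image-dpi={dpi}"]
    if image_dir is not None:
        cmd.append(f"--image-output-path={image_dir}")
    if ignore_before is not None:
        # --ignore-before must be ISO 8601 formatted; normalize to UTC first.
        stamp = ignore_before.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        cmd.append(f"--ignore-before={stamp}")
    if not write_images:
        cmd.append("--no-write")
    cmd.append(pdf_path)  # the single positional argument: path to the input PDF
    return cmd


def run(cmd):
    """Run the tool and return its output as text.

    Assumption: the JSON list is written to standard output.
    """
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    cutoff = datetime(2022, 3, 14, 19, 57, 0, tzinfo=timezone.utc)
    print(build_command("paper.pdf", image_dir="out", ignore_before=cutoff))
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems when the PDF path contains spaces.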

Supported platforms (see releases)

  • macOS (Intel, M1)
  • Linux (x64)
  • Windows (x64)

OCR

The --attempt-ocr flag instructs pdfannots2json to extract text from the images created from rectangle annotations. This requires tesseract to be installed on your system, along with the appropriate language data (by default, tesseract supports only English). Tesseract can be installed from Homebrew on macOS, from various Linux package managers, and on Windows from the installers linked here: https://github.com/UB-Mannheim/tesseract/wiki

Additional language files can be downloaded here: https://github.com/tesseract-ocr/tessdata (for a description of the language codes, see https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)
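
Because OCR depends on an external tesseract install, a wrapper script may want to check for it before passing --attempt-ocr. The sketch below is one way to do that in Python. The function name ocr_flags is hypothetical, and the decision to skip OCR silently when tesseract is missing is an assumption of this example, not behavior of pdfannots2json itself.

```python
import shutil


def ocr_flags(languages, tesseract_path=None, tess_data_dir=None):
    """Return OCR-related flags, or an empty list when tesseract cannot be found."""
    exe = tesseract_path or shutil.which("tesseract")
    if exe is None:
        return []  # no tesseract available: leave OCR off
    # Multiple languages are joined with '+', e.g. 'eng+deu'.
    flags = ["--attempt-ocr", "--ocr-lang=" + "+".join(languages)]
    if tesseract_path:
        flags.append(f"--tesseract-path={tesseract_path}")
    if tess_data_dir:
        flags.append(f"--tess-data-dir={tess_data_dir}")
    return flags


if __name__ == "__main__":
    print(ocr_flags(["eng", "deu"]))
```

The returned list can be appended to whatever argument list is used to invoke the tool. Note that this only checks for the executable. It does not verify that the requested language data is installed.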

Sample output

[
  {
    "color": "#7fff7f",
    "colorCategory": "Green",
    "date": "2022-03-14T19:57:56Z",
    "id": "rectangle-p1x26y636",
    "imagePath": "/some/path/annot-1-x26-y636.jpg",
    "ocrText": "Fabio D’Antoni !-23-*®, Alessio Matiz 23, Franco Fabbro 2“ and Cristiano Crescentini 24© ",
    "page": 1,
    "type": "image",
    "x": 26.43,
    "y": 636.51
  },
  {
    "annotatedText": "Objectives: We explored the effects of a single 40-min session ",
    "color": "#ff7f7f",
    "colorCategory": "Red",
    "date": "2022-03-14T19:56:25Z",
    "id": "highlight-p1x205y514",
    "page": 1,
    "type": "highlight",
    "x": 205.42,
    "y": 514.86
  },
  {
    "annotatedText": "processing of distressing memories reported by a non-clinical sample of adult participants. Design: A within-subject design was used. Methods: Participants (n = 40 Psychologists/MDs) reported four distressing memories",
    "color": "#ffff7f",
    "colorCategory": "Yellow",
    "comment": "This is a highlight",
    "date": "2022-03-14T19:57:17Z",
    "id": "highlight-p1x166y462",
    "page": 1,
    "type": "highlight",
    "x": 166.02,
    "y": 462.99
  },
  {
    "annotatedText": "Post-Intervention, Follow-up. Results: SUD scores associated with EMDR, BSP, and BSM signifcantly decreased from Pre- to Post-Intervention (p \u003c 0.001). At Post-Intervention and Follow-up, EMDR and BSP SUD scores were signifcantly lower than BSM and BR scores (p \u003c 0.02). At both Post-Intervention a",
    "color": "#ffff7f",
    "colorCategory": "Yellow",
    "comment": "This is an underline",
    "date": "2022-03-14T19:57:11Z",
    "id": "underline-p1x166y385",
    "page": 1,
    "type": "underline",
    "x": 166.02,
    "y": 385.17
  },
  {
    "annotatedText": "Keywords: psychotherapy; distressing memories; EMDR; Brainspotting; body scan meditation; mindfulness; bottom-up therapy; body-oriented intervention; trauma; stress",
    "color": "#ff7fff",
    "colorCategory": "Magenta",
    "comment": "This is a strike",
    "date": "2022-03-14T19:57:07Z",
    "id": "strike-p1x166y295",
    "page": 1,
    "type": "strike",
    "x": 166.39,
    "y": 295.38
  },
  {
    "color": "#ff7fff",
    "colorCategory": "Magenta",
    "comment": "Hello",
    "date": "2022-03-14T19:57:00Z",
    "id": "text-p1x351y197",
    "page": 1,
    "type": "text",
    "x": 351.45,
    "y": 197.36
  },
  {
    "annotatedText": "In clinical contexts, however, distressing ",
    "color": "#7fff7f",
    "colorCategory": "Green",
    "date": "2022-03-14T19:57:23Z",
    "id": "highlight-p1x187y166",
    "page": 1,
    "type": "highlight",
    "x": 187.65,
    "y": 166.72
  },
  {
    "annotatedText": "and may originate from both “big trauma” (“T”), such as life-threatening experiences and sexual violence, o",
    "color": "#7fff7f",
    "colorCategory": "Green",
    "comment": "A green highlight",
    "date": "2022-03-14T19:57:32Z",
    "id": "highlight-p1x166y129",
    "page": 1,
    "type": "highlight",
    "x": 166.07,
    "y": 129.02
  }
]
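
As an illustration of consuming this output, the following Python sketch groups annotations by their "colorCategory" and renders each one as a single line of notes. It relies only on field names that appear in the sample above. The helper names (group_by_color, to_note_line) and the output line format are hypothetical choices for this example.

```python
import json
from collections import defaultdict


def group_by_color(annotations):
    """Group annotation dicts by their colorCategory field."""
    groups = defaultdict(list)
    for annot in annotations:
        groups[annot.get("colorCategory", "Unknown")].append(annot)
    return dict(groups)


def to_note_line(annot):
    """Render one annotation as 'p<page> [<type>] <text> (<comment>)'."""
    # Text annotations carry only a comment; image annotations may carry
    # ocrText or just an imagePath, so fall back in that order.
    body = (annot.get("annotatedText") or annot.get("ocrText")
            or annot.get("imagePath") or "").strip()
    parts = [f"p{annot['page']}", f"[{annot['type']}]"]
    if body:
        parts.append(body)
    if annot.get("comment"):
        parts.append(f"({annot['comment']})")
    return " ".join(parts)


# Two entries trimmed from the sample output above.
sample = json.loads("""
[
  {"annotatedText": "In clinical contexts, however, distressing ",
   "colorCategory": "Green", "page": 1, "type": "highlight"},
  {"colorCategory": "Magenta", "comment": "Hello", "page": 1, "type": "text"}
]
""")

if __name__ == "__main__":
    for color, annots in group_by_color(sample).items():
        print(color)
        for annot in annots:
            print("  " + to_note_line(annot))
```

Using .get() for the optional fields matters here: as the sample shows, "annotatedText", "comment", "ocrText", and "imagePath" are each present only on some annotation types.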
Versions
1.0.16