This document describes a scalable optical character recognition (OCR) pipeline that uses Apache NiFi and Tesseract to extract text from PDF documents. The pipeline converts each PDF to images, preprocesses the images with ImageMagick to make the text easier to recognize, and then runs Tesseract on the result. NiFi operationalizes these steps and manages the flow at large scale. Example use cases include analyzing medical records and extracting text from large datasets to support journalism.
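
The three stages named above (PDF to images, ImageMagick preprocessing, Tesseract OCR) can be sketched outside NiFi as plain command-line calls. This is a minimal illustration, not the document's actual flow: the function names, the 300 DPI setting, the grayscale-plus-normalize preprocessing, and the file naming scheme are all assumptions chosen for the example, and it presumes the `convert` (ImageMagick, with Ghostscript for PDF input) and `tesseract` binaries are on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def pdf_to_images_cmd(pdf: Path, out_dir: Path, dpi: int = 300) -> List[str]:
    """Build an ImageMagick command that rasterizes a PDF into numbered PNGs.

    Assumption: the legacy `convert` entry point is available
    (ImageMagick 7 installs typically expose `magick` instead).
    """
    pattern = out_dir / f"{pdf.stem}-%03d.png"  # %03d is filled in per page
    return ["convert", "-density", str(dpi), str(pdf), str(pattern)]


def preprocess_cmd(image: Path, cleaned: Path) -> List[str]:
    """Build an ImageMagick command for a simple, assumed cleanup step:
    convert to grayscale and stretch the contrast."""
    return ["convert", str(image), "-colorspace", "Gray", "-normalize", str(cleaned)]


def ocr_cmd(image: Path, out_base: Path, lang: str = "eng") -> List[str]:
    """Build a Tesseract command; Tesseract appends `.txt` to `out_base`."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def run_pipeline(pdf: Path, work_dir: Path, lang: str = "eng") -> List[Path]:
    """Run all three stages for one PDF and return the expected text file paths."""
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_to_images_cmd(pdf, work_dir), check=True)

    text_files = []
    for page in sorted(work_dir.glob(f"{pdf.stem}-*.png")):
        cleaned = page.with_name(f"clean-{page.name}")
        subprocess.run(preprocess_cmd(page, cleaned), check=True)

        out_base = page.with_suffix("")  # drop ".png" to get the output base name
        subprocess.run(ocr_cmd(cleaned, out_base, lang), check=True)
        text_files.append(Path(str(out_base) + ".txt"))
    return text_files
```

Calling `run_pipeline(Path("report.pdf"), Path("work"))` would process one file sequentially. In the pipeline described here, NiFi would take over the orchestration, for example by invoking equivalent commands from a processor such as `ExecuteStreamCommand`, though which processors the original flow uses is not stated.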