A common problem in data-intensive businesses is extracting data from a PDF that was produced by scanning a document. In this article, we’ll see how to extract text from a scanned PDF using the ByteScout PDF Extractor SDK. ByteScout is an established player known for providing reliable PDF solutions to developers.
We’ll walk through how to convert a scanned PDF to text using the ByteScout PDF Extractor library. For this demonstration, I have taken a PDF that consists of scanned images. We’ll process it with the program shown later in this article and then go through the code step by step.
Let’s jump in. Here is the input PDF used.
The following C# program demonstrates how to turn a scanned PDF into text using the ByteScout library.
using System.Diagnostics;
using Bytescout.PDFExtractor;

// To make OCR work you should add to your project references to
// Bytescout.PDFExtractor.dll and Bytescout.PDFExtractor.OCRExtension.dll

namespace MakeSearchablePDF
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.SearchablePDFMaker instance
            SearchablePDFMaker searchablePDFMaker = new SearchablePDFMaker();
            searchablePDFMaker.RegistrationName = "demo";
            searchablePDFMaker.RegistrationKey = "demo";

            // Load sample PDF document
            searchablePDFMaker.LoadDocumentFromFile("sample_ocr.pdf");

            // Set the location of "tessdata" folder containing language data files
            searchablePDFMaker.OCRLanguageDataFolder = @"c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\";

            // Set OCR language:
            // "eng" for English, "deu" for German, "fra" for French, "spa" for Spanish etc.,
            // according to files in /tessdata
            searchablePDFMaker.OCRLanguage = "eng";

            // Set PDF document rendering resolution
            searchablePDFMaker.OCRResolution = 300;

            // Run OCR and save the searchable PDF to a file
            searchablePDFMaker.MakePDFSearchable("output.pdf");

            // Cleanup
            searchablePDFMaker.Dispose();

            // Open output file in default associated application
            ProcessStartInfo processStartInfo = new ProcessStartInfo("output.pdf");
            processStartInfo.UseShellExecute = true;
            Process.Start(processStartInfo);
        }
    }
}
The output PDF is as follows.
Comparing the input and the output shows that the output retains the original structure, including fonts, colors, and styling.
ByteScout libraries are available for many programming languages such as VB, C#, Java, Classic ASP, and Delphi. ByteScout also provides the PDF.co Web API, which can be used from most programming languages without any installation.
To run this program, you need the ByteScout PDF Extractor library. One easy way to get this library, along with the other ByteScout libraries, is to install the ByteScout SDK on your machine from this link: https://bytescout.com/download/web-installer
Though the code is simple and largely self-explanatory, let’s walk through it to understand how it converts a scanned PDF to a text PDF.
Initialize the SearchablePDFMaker instance and set its registration name and key. These need to be replaced with your actual registration details. For this demo, I’m using the demo keys.
// Create Bytescout.PDFExtractor.SearchablePDFMaker instance
SearchablePDFMaker searchablePDFMaker = new SearchablePDFMaker();
searchablePDFMaker.RegistrationName = "demo";
searchablePDFMaker.RegistrationKey = "demo";
Assign the input file to be processed. In this example, we’re using a physical file as the input. We can also use a stream object as the input; in that case, we use the LoadDocumentFromStream method instead.
// Load sample PDF document
searchablePDFMaker.LoadDocumentFromFile("sample_ocr.pdf");
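The stream variant mentioned above could look roughly like the sketch below. It is not taken from the ByteScout documentation: it assumes that LoadDocumentFromStream accepts a standard System.IO.Stream and that the stream should stay open until the conversion has finished. The class name StreamInputExample is only illustrative, and the file names and tessdata path are the same ones used elsewhere in this article.

```csharp
using System.IO;
using Bytescout.PDFExtractor;

class StreamInputExample
{
    static void Main(string[] args)
    {
        SearchablePDFMaker searchablePDFMaker = new SearchablePDFMaker();
        searchablePDFMaker.RegistrationName = "demo";
        searchablePDFMaker.RegistrationKey = "demo";

        // Open the scanned PDF as a read-only stream. A MemoryStream or a
        // network stream would presumably be passed in the same way.
        using (FileStream input = File.OpenRead("sample_ocr.pdf"))
        {
            // Assumption: LoadDocumentFromStream takes a System.IO.Stream.
            searchablePDFMaker.LoadDocumentFromStream(input);

            // Same OCR options as in the file-based example
            searchablePDFMaker.OCRLanguageDataFolder = @"c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\";
            searchablePDFMaker.OCRLanguage = "eng";
            searchablePDFMaker.OCRResolution = 300;

            // Convert while the input stream is still open
            searchablePDFMaker.MakePDFSearchable("output.pdf");
        }

        // Cleanup
        searchablePDFMaker.Dispose();
    }
}
```

Keeping the conversion inside the using block is a cautious choice: it avoids relying on whether the library reads the whole stream at load time or lazily during OCR.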
Next, we need to set the OCR options, such as the location of the language data folder, the language, and the resolution.
// Set the location of "tessdata" folder containing language data files
searchablePDFMaker.OCRLanguageDataFolder = @"c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\";

// Set OCR language:
// "eng" for English, "deu" for German, "fra" for French, "spa" for Spanish etc.,
// according to files in /tessdata
searchablePDFMaker.OCRLanguage = "eng";

// Set PDF document rendering resolution
searchablePDFMaker.OCRResolution = 300;
OCR works with most human languages. The library also offers features such as converting scanned PDFs that contain rotated images and choosing specific regions for OCR conversion, so that OCR is performed only on the selected regions.
Let’s say we need a language other than the one used in this example. The only things we need to do are provide the appropriate language code, add that language’s data files, and set the correct path to them.
For example, to add support for Hindi, follow these directions.
Copy the files listed below from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 into the “tessdata” folder. The language code to use is “hin”.
- hin.traineddata
- hin.cube.bigrams
- hin.cube.lm
- hin.cube.nn
- hin.cube.params
- hin.cube.word-freq
- hin.tesseract_cube.nn
The code will look like this:
// Set OCR language
searchablePDFMaker.OCRLanguage = "hin";

// Need to set a font which supports Hindi characters
searchablePDFMaker.LabelingFont = "Arial Unicode MS";
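Because a missing language file is an easy mistake to make, it may help to check the folder before starting OCR. The sketch below is not part of the ByteScout SDK; LanguageDataCheck, FindMissingFiles, and HindiFiles are hypothetical names introduced only for illustration, and the code uses nothing beyond System.IO. The file list is the Hindi list given above.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the ByteScout SDK.
static class LanguageDataCheck
{
    // The Hindi language data files listed in this article.
    public static readonly string[] HindiFiles =
    {
        "hin.traineddata",
        "hin.cube.bigrams",
        "hin.cube.lm",
        "hin.cube.nn",
        "hin.cube.params",
        "hin.cube.word-freq",
        "hin.tesseract_cube.nn"
    };

    // Returns the names from requiredFiles that do not exist in tessdataFolder.
    // An empty list means every required file was found.
    public static List<string> FindMissingFiles(string tessdataFolder, IEnumerable<string> requiredFiles)
    {
        List<string> missing = new List<string>();
        foreach (string name in requiredFiles)
        {
            if (!File.Exists(Path.Combine(tessdataFolder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Calling LanguageDataCheck.FindMissingFiles with the folder assigned to OCRLanguageDataFolder and LanguageDataCheck.HindiFiles, before setting OCRLanguage to "hin", gives a readable list of what still needs to be copied instead of a failure in the middle of the conversion.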
With everything set, we just need to start the conversion and specify the output location.
// Run OCR and save the searchable PDF to a file
searchablePDFMaker.MakePDFSearchable("output.pdf");
Here, we have several options, such as saving the output as a physical document or writing it to a stream. We can also control which specific pages are output.
That’s all. It’s that easy to turn a scanned PDF into text using the ByteScout SDK and the PDF.co Web API.
Happy Coding!