Recently I worked on a challenging project: building an interface for a mechanical engineer who needed to chart and visualize data from PDF spec sheets in an Excel spreadsheet. Fortunately, I found Bytescout’s SDK tools, which took care of most of the technical challenges and made the project far easier than I expected. In this multi-part tutorial, we will explore the variety of tools available in Bytescout’s PDF Extractor SDK and learn how to put them to use on real-world problems. Let’s start with a bread-and-butter task: getting a PDF table into CSV format for use in Excel or Google Sheets.
In the project mentioned above, a mechanical engineer needed to pull tables of physical wear specs from PDF files into spreadsheets and chart the wear over time. Bytescout’s SDK makes it easy to export tables from PDF, in this case using VBScript to convert PDF tables to CSV for Excel. This is convenient because VBScript is closely related to VBA, the native macro language of Excel. Notice in this code sample that the tables detected on each page are numbered, so you can select exactly the ones you need. Later on, we’ll see how to use regex (regular expressions) to find precise points in a PDF to extract. Have a look:
' Create Bytescout.PDFExtractor.TableDetector object
Set tableDetector = CreateObject("Bytescout.PDFExtractor.TableDetector")
tableDetector.RegistrationName = "demo"
tableDetector.RegistrationKey = "demo"
' Create Bytescout.PDFExtractor.CSVExtractor object
Set csvExtractor = CreateObject("Bytescout.PDFExtractor.CSVExtractor")
csvExtractor.RegistrationName = "demo"
csvExtractor.RegistrationKey = "demo"
' We define what kind of tables to detect.
' So we set min required number of columns to 3 ...
tableDetector.DetectionMinNumberOfColumns = 3
' ... and we set min required number of rows to 3
tableDetector.DetectionMinNumberOfRows = 3
' Load sample PDF document
tableDetector.LoadDocumentFromFile("..\..\sample3.pdf")
csvExtractor.LoadDocumentFromFile "..\..\sample3.pdf"
' Get page count
pageCount = tableDetector.GetPageCount()
' Iterate through pages
For i = 0 to pageCount - 1
t = 0
' Find first table and continue if found
If (tableDetector.FindTable(i)) Then
Do
' Set extraction area for CSV extractor to rectangle received from the table detector
csvExtractor.SetExtractionArea _
tableDetector.GetFoundTableRectangle_Left(), _
tableDetector.GetFoundTableRectangle_Top(), _
tableDetector.GetFoundTableRectangle_Width(), _
tableDetector.GetFoundTableRectangle_Height()
' Export the table to CSV file
csvExtractor.SavePageCSVToFile i, "page-" & CStr(i) & "-table-" & CStr(t) & ".csv"
t = t + 1
Loop While tableDetector.FindNextTable()
End If
Next
Set csvExtractor = Nothing
Set tableDetector = Nothing
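Once the loop above has written its files, the next step in a project like the wear-charting one is to get the tables back into memory. The following is a minimal Python sketch of that step, not part of the Bytescout SDK. It relies only on the file-name pattern used in the sample (page-<i>-table-<t>.csv) and assumes the files are comma-delimited UTF-8, which you should confirm against your own output. The function name load_exported_tables is made up for this illustration.

```python
import csv
import re
from pathlib import Path

# File-name pattern written by the VBScript sample: page-<i>-table-<t>.csv
NAME_RE = re.compile(r"^page-(\d+)-table-(\d+)\.csv$")


def load_exported_tables(folder):
    """Return {(page, table): rows} for every exported CSV found in folder."""
    tables = {}
    for path in sorted(Path(folder).iterdir()):
        match = NAME_RE.match(path.name)
        if not match:
            continue  # skip files that did not come from the export loop
        key = (int(match.group(1)), int(match.group(2)))
        # Assumption: comma-delimited UTF-8; "utf-8-sig" also strips a BOM if present
        with path.open(newline="", encoding="utf-8-sig") as handle:
            tables[key] = list(csv.reader(handle))
    return tables
```

Calling load_exported_tables(".") in the output folder returns a dictionary keyed by (page index, table index), so tables[(0, 0)] would be the rows of the first table on the first page.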
Continuing the theme of table extraction, suppose the original PDFs contain scanned tables. No problem! This is where the SDK’s OCR support comes in. The next code sample demonstrates OCRAnalyzer, a class that analyzes scanned documents in PDF format to find the Optical Character Recognition (OCR) parameters that give the best recognition quality, and then applies those parameters to a TextExtractor. To follow along with this VB.NET code example, first add the Imports statements for the namespaces used, and reference the SDK DLLs named in the comments in your project:
Imports System.Drawing
Imports Bytescout.PDFExtractor
' This example demonstrates the use of OCR Analyser - a tooling class for analysis of scanned documents
' in PDF or raster image formats to find best parameters for Optical Character Recognition (OCR) that
' provide highest recognition quality.
' To make OCR work you should add the following references to your project:
' 'Bytescout.PDFExtractor.dll', 'Bytescout.PDFExtractor.OCRExtension.dll'.
Class Program
Friend Shared Sub Main(args As String())
' Input document
Dim inputDocument As String = ".\sample_ocr.pdf"
' Document page index
Dim pageIndex As Integer = 0
' Area of the document page to perform the analysis (optional).
' RectangleF.Empty means the full page.
Dim rectangle As RectangleF = RectangleF.Empty ' New RectangleF(100, 50, 350, 250)
' Location of "tessdata" folder containing language data files
Dim ocrLanguageDataFolder As String = "c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\"
' OCR language
Dim ocrLanguage As String = "eng" ' "eng" for english, "deu" for German, "fra" for French, "spa" for Spanish etc - according to files in /tessdata
' Find more language files at https://github.com/tesseract-ocr/tessdata/tree/3.04.00
' Create OCRAnalyzer instance and activate it with your registration information
Using ocrAnalyzer As New OCRAnalyzer("demo", "demo")
' Display analysis progress
AddHandler ocrAnalyzer.ProgressChanged, Sub(sender, message, progress, ByRef cancel)
Console.WriteLine(message)
End Sub
' Load document to OCRAnalyzer
ocrAnalyzer.LoadDocumentFromFile(inputDocument)
' Setup OCRAnalyzer
ocrAnalyzer.OCRLanguage = ocrLanguage
ocrAnalyzer.OCRLanguageDataFolder = ocrLanguageDataFolder
' Set page area for analysis (optional)
ocrAnalyzer.SetExtractionArea(rectangle)
' Perform analysis and get results
Dim analysisResults As OCRAnalysisResults = ocrAnalyzer.AnalyzeByOCRConfidence(pageIndex)
' Now extract page text using detected OCR parameters
Dim outputDocument As String = ".\result.txt"
' Create TextExtractor instance
Using textExtractor As TextExtractor = New TextExtractor("demo", "demo")
' Load document to TextExtractor
textExtractor.LoadDocumentFromFile(inputDocument)
' Setup TextExtractor
textExtractor.OCRMode = OCRMode.Auto
textExtractor.OCRLanguageDataFolder = ocrLanguageDataFolder
textExtractor.OCRLanguage = ocrLanguage
' Apply analysis results to TextExtractor instance
ocrAnalyzer.ApplyResults(analysisResults, textExtractor)
' Set extraction area (optional)
textExtractor.SetExtractionArea(rectangle)
' Save extracted text to file
textExtractor.SaveTextToFile(outputDocument)
' Open output file in default associated application (for demonstration purposes)
System.Diagnostics.Process.Start(outputDocument)
End Using
End Using
End Sub
End Class
One great advantage of the .NET Framework is that all of its languages share a common class library. This makes it easier for teams whose members have varied backgrounds to move between languages within a project. To illustrate, here is the same method for extracting text from a scanned PDF with OCR, translated into C#. As you can see in the code sample below, only minor differences distinguish this code from the previous example, such as the familiar C# syntax, for example “using” instead of “Imports”. Have a look:
using System;
using System.Drawing;
using Bytescout.PDFExtractor;
// This example demonstrates the use of OCR Analyser - a tooling class for analysis of scanned documents
// in PDF or raster image formats to find best parameters for Optical Character Recognition (OCR) that
// provide highest recognition quality.
// To make OCR work you should add the following references to your project:
// 'Bytescout.PDFExtractor.dll', 'Bytescout.PDFExtractor.OCRExtension.dll'.
namespace OCRAnalyser
{
class Program
{
static void Main(string[] args)
{
// Input document
string inputDocument = @".\sample_ocr.pdf";
// Document page index
int pageIndex = 0;
// Area of the document page to perform the analysis (optional).
// RectangleF.Empty means the full page.
RectangleF rectangle = RectangleF.Empty; // new RectangleF(100, 50, 350, 250);
// Location of "tessdata" folder containing language data files
string ocrLanguageDataFolder = @"c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\";
// OCR language
string ocrLanguage = "eng"; // "eng" for english, "deu" for German, "fra" for French, "spa" for Spanish etc - according to files in /tessdata
// Find more language files at https://github.com/tesseract-ocr/tessdata/tree/3.04.00
// Create OCRAnalyzer instance and activate it with your registration information
using (OCRAnalyzer ocrAnalyzer = new OCRAnalyzer("demo", "demo"))
{
// Display analysis progress
ocrAnalyzer.ProgressChanged += (object sender, string message, double progress, ref bool cancel) =>
{
Console.WriteLine(message);
};
// Load document to OCRAnalyzer
ocrAnalyzer.LoadDocumentFromFile(inputDocument);
// Setup OCRAnalyzer
ocrAnalyzer.OCRLanguage = ocrLanguage;
ocrAnalyzer.OCRLanguageDataFolder = ocrLanguageDataFolder;
// Set page area for analysis (optional)
ocrAnalyzer.SetExtractionArea(rectangle);
// Perform analysis and get results
OCRAnalysisResults analysisResults = ocrAnalyzer.AnalyzeByOCRConfidence(pageIndex);
// Now extract the text using detected OCR parameters
string outputDocument = @".\result.txt";
// Create TextExtractor instance
using (TextExtractor textExtractor = new TextExtractor("demo", "demo"))
{
// Load document to TextExtractor
textExtractor.LoadDocumentFromFile(inputDocument);
// Setup TextExtractor
textExtractor.OCRMode = OCRMode.Auto;
textExtractor.OCRLanguageDataFolder = ocrLanguageDataFolder;
textExtractor.OCRLanguage = ocrLanguage;
// Apply analysis results to TextExtractor instance
ocrAnalyzer.ApplyResults(analysisResults, textExtractor);
// Set extraction area (optional)
textExtractor.SetExtractionArea(rectangle);
// Save extracted text to file
textExtractor.SaveTextToFile(outputDocument);
// Open output file in default associated application (for demonstration purposes)
System.Diagnostics.Process.Start(outputDocument);
}
}
}
}
}
In the examples above, we assumed that we knew in advance the location or index of the table we needed. Now suppose we need to locate content in a PDF by matching a distinctive pattern of characters in the text, such as a date format. Here, Bytescout adds intelligence to the toolbox by supporting standard regular expression pattern matching to find and extract text in a PDF document. This makes it possible to extract text without knowing its location or page number. In the following code, the date format is searched for with the line pattern = "[0-9]{2}/[0-9]{2}/[0-9]{4}", which is a regular expression:
' Create Bytescout.PDFExtractor.TextExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\Invoice.pdf")
extractor.RegexSearch = True ' Turn on the regex search
pattern = "[0-9]{2}/[0-9]{2}/[0-9]{4}" ' Search dates in format 'mm/dd/yyyy'
' Get page count
pageCount = extractor.GetPageCount()
For i = 0 To pageCount - 1
If extractor.Find(i, pattern, false) Then ' Parameters are: page index, string to find, case sensitivity
Do
extractedString = extractor.FoundText.Text
MsgBox "Found match on page #" & CStr(i) & ": " & extractedString
extractor.ResetExtractionArea()
Loop While extractor.FindNext
End If
Next
MsgBox "Done"
Set extractor = Nothing
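It helps to see exactly what the date pattern in the sample above accepts. The sketch below uses Python's standard re module rather than the SDK, so regex engine details may differ, although this pattern uses only basic constructs. The sample text is invented for illustration. The pattern checks only the shape of the text (two digits, two digits, four digits), so an impossible date such as 99/99/9999 still matches. A second pass with datetime.strptime is one way to filter those out.

```python
import re
from datetime import datetime

# Same pattern as in the VBScript sample
DATE_PATTERN = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4}")

# Hypothetical invoice text, made up for this illustration
sample_text = "Invoice date: 03/15/2021, due 04/14/2021. Ref 99/99/9999. Short date 3/5/21."

matches = DATE_PATTERN.findall(sample_text)
print(matches)  # ['03/15/2021', '04/14/2021', '99/99/9999']


def is_real_date(text):
    """True if text parses as a valid mm/dd/yyyy calendar date."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False


valid = [m for m in matches if is_real_date(m)]
print(valid)  # ['03/15/2021', '04/14/2021']
```

Note that the short form 3/5/21 is not matched at all, because the pattern requires exactly two, two, and four digits.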
Tabular data often needs to be rendered in XML (Extensible Markup Language), a format that is both human-readable and machine-readable. This continues our discussion of how to extract text from PDF, and more particularly how to extract table data from PDF documents. Bytescout’s Extractor SDK contains methods to render PDF content in this widely used format. The following code combines the table detector with the XMLExtractor to export each detected table to an XML file. In my own experience, this is valuable for converting PDFs to web content:
' Create Bytescout.PDFExtractor.TableDetector object
Set tableDetector = CreateObject("Bytescout.PDFExtractor.TableDetector")
tableDetector.RegistrationName = "demo"
tableDetector.RegistrationKey = "demo"
' Create Bytescout.PDFExtractor.XMLExtractor object
Set xmlExtractor = CreateObject("Bytescout.PDFExtractor.XMLExtractor")
xmlExtractor.RegistrationName = "demo"
xmlExtractor.RegistrationKey = "demo"
' We should define what kind of tables we should detect.
' So we set min required number of columns to 3 ...
tableDetector.DetectionMinNumberOfColumns = 3
' ... and we set min required number of rows to 3
tableDetector.DetectionMinNumberOfRows = 3
' Load sample PDF document
tableDetector.LoadDocumentFromFile("..\..\sample3.pdf")
xmlExtractor.LoadDocumentFromFile "..\..\sample3.pdf"
' Get page count
pageCount = tableDetector.GetPageCount()
' Iterate through pages
For i = 0 to pageCount - 1
t = 0
' Find first table and continue if found
If (tableDetector.FindTable(i)) Then
Do
' Set extraction area for XML extractor to rectangle received from the table detector
xmlExtractor.SetExtractionArea _
tableDetector.GetFoundTableRectangle_Left(), _
tableDetector.GetFoundTableRectangle_Top(), _
tableDetector.GetFoundTableRectangle_Width(), _
tableDetector.GetFoundTableRectangle_Height()
' Export the table to XML file
xmlExtractor.SavePageXMLToFile i, "page-" & CStr(i) & "-table-" & CStr(t) & ".xml"
t = t + 1
Loop While tableDetector.FindNextTable()
End If
Next
Set xmlExtractor = Nothing
Set tableDetector = Nothing
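Before writing code that consumes the exported XML, it is worth finding out which elements the files actually contain, since the element names are not shown in this tutorial. The Python sketch below is schema-agnostic: it simply counts how often each tag occurs. The sample XML string and its element names are invented for illustration and are not the SDK's real output format.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_inventory(xml_text):
    """Count how often each element tag occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


# Hypothetical XML, invented for this illustration; real element names may differ
sample = (
    "<document><page><row>"
    "<column>Wear</column><column>0.12</column>"
    "</row></page></document>"
)
print(tag_inventory(sample))
```

To inspect a real export, read the file's contents (for example page-0-table-0.xml) and pass the text to tag_inventory. The tags with the highest counts are usually the row and cell elements.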
When I extract data from PDF tables, the table structure itself is often of interest, for example to make sure the table renders accurately in the target document. For developers especially, it is often useful to replicate an existing table structure for consistency across documents. The next code sample illustrates how to capture the table structure, row by row and cell by cell, when extracting a table from a PDF file:
' Create Bytescout.PDFExtractor.StructuredExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.StructuredExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
' Load sample PDF document
extractor.LoadDocumentFromFile "../../sample3.pdf"
For ipage = 0 To extractor.GetPageCount() - 1
' Prepare the table structure for this page
extractor.PrepareStructure ipage
rowCount = extractor.GetRowCount(ipage)
For row = 0 To rowCount - 1
columnCount = extractor.GetColumnCount(ipage, row)
For col = 0 To columnCount-1
WScript.Echo "Cell at page #" & CStr(ipage) & ", row=" & CStr(row) & ", column=" & _
CStr(col) & vbCRLF & extractor.GetCellValue(ipage, row, col)
Next
Next
Next
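The sample above asks for the column count separately for each row, which suggests that rows can have different numbers of cells. To replicate the structure in a target document, it is often easier to work with a rectangular grid. The Python sketch below shows one way to build one from (row, column, value) triples like those echoed above. The function name build_grid and the sample cells are made up for illustration, and the code does not call the SDK.

```python
def build_grid(cells, fill=""):
    """Arrange (row, column, value) triples into a rectangular list of rows."""
    if not cells:
        return []
    row_count = max(row for row, _, _ in cells) + 1
    col_count = max(col for _, col, _ in cells) + 1
    # Start with a grid of placeholders so short rows are padded automatically
    grid = [[fill] * col_count for _ in range(row_count)]
    for row, col, value in cells:
        grid[row][col] = value
    return grid


# Hypothetical cells for one page; the last row has only one cell
cells = [
    (0, 0, "Part"), (0, 1, "Wear (mm)"),
    (1, 0, "Bearing A"), (1, 1, "0.12"),
    (2, 0, "Bearing B"),
]
grid = build_grid(cells)
for line in grid:
    print(line)
```

The missing cell in the last row is filled with an empty string, so every row ends up with the same length.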
I am especially impressed with the Bytescout SDK methods for extracting images from PDF files. Here again is a great set of tools for preparing web content from media previously locked up in PDF documents. Notice the GetFirstImage method, which starts the enumeration. Inside the loop, SaveCurrentImageToFile saves the current image and GetNextImage advances to the next one. The result is that every image in the PDF is extracted and saved as an individual PNG file in the working folder. Have a look at the code here:
' Create Bytescout.PDFExtractor.ImageExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.ImageExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")
i = 0
' Initialize image enumeration
If extractor.GetFirstImage() Then
Do
outputFileName = "image" & i & ".png"
' Save image to file
extractor.SaveCurrentImageToFile outputFileName
i = i + 1
Loop While extractor.GetNextImage() ' Advance image enumeration
End If
' Open first output image in default associated application
Set shell = CreateObject("WScript.Shell")
shell.Run "image0.png", 1, false
Set shell = Nothing
Set extractor = Nothing
The next natural extension of the code above is to work page by page and have the Extractor SDK fetch and save all images on each page. The ImageExtractor is used again for this purpose. First, in the code below, the pages are enumerated. Next, as in the previous example, the images on each page are enumerated and saved as PNG files. Finally, WScript.Shell opens the first output image. These examples use VBScript; the OCR examples earlier showed the SDK being called from VB.NET and C#. Have a look at this example:
' Create Bytescout.PDFExtractor.ImageExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.ImageExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")
' Get page count
pageCount = extractor.GetPageCount()
' Extract images from each page
For i = 0 To pageCount - 1
j = 0
' Initialize page images enumeration
If extractor.GetFirstPageImage(i) Then
Do
outputFileName = "page" & i & "image" & j & ".png"
' Save image to file
extractor.SaveCurrentImageToFile outputFileName
j = j + 1
Loop While extractor.GetNextImage() ' Advance image enumeration
End If
Next
' Open first output file in default associated application
Set shell = CreateObject("WScript.Shell")
shell.Run "page0image0.png", 1, false
Set shell = Nothing
Set extractor = Nothing
Among the advanced methods in the Extractor SDK toolkit is the capability to enumerate the coordinates of all images within a PDF document. When I need to edit scanned PDF files that may contain both images and text, it is especially valuable to break out both, first by using OCR as in the earlier examples and then by identifying image content as in the code below. The GetCurrentImageRectangle methods return the left, top, width, and height of each image in the PDF, as illustrated in this example:
' Create Bytescout.PDFExtractor.ImageExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.ImageExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")
i = 0
' Initialize image enumeration
If extractor.GetFirstImage() Then
Do
' display coordinates of the image
MsgBox "Image #" & CStr(i) & vbCRLF & "Coordinates: " & CStr( extractor.GetCurrentImageRectangle_Left()) & ", " & _
CStr( extractor.GetCurrentImageRectangle_Top()) & ", " & CStr( extractor.GetCurrentImageRectangle_Width()) & ", " & _
CStr( extractor.GetCurrentImageRectangle_Height())
i = i + 1
Loop While extractor.GetNextImage() ' Advance image enumeration
End If
Set extractor = Nothing
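One use for these rectangles is to decide whether a region you plan to extract text from overlaps an image. The Python sketch below shows a standard rectangle overlap test. It assumes both rectangles use the same coordinate system and that each one covers the range from top to top + height, which should be checked against the SDK's conventions. The Rect type and the sample numbers are invented for illustration.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    width: float
    height: float


def overlaps(a, b):
    """True if two rectangles share any area (touching edges do not count)."""
    return (
        a.left < b.left + b.width
        and b.left < a.left + a.width
        and a.top < b.top + b.height
        and b.top < a.top + a.height
    )


# Hypothetical rectangles: one image and two candidate text regions
image = Rect(100, 50, 200, 100)
caption_inside = Rect(150, 80, 50, 20)
paragraph_below = Rect(100, 150, 200, 30)

print(overlaps(image, caption_inside))   # True
print(overlaps(image, paragraph_below))  # False
```

The second region starts exactly where the image ends, so the two only touch and are not reported as overlapping.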
The methods of the Bytescout Extractor illustrated here can naturally be combined. For example, I have used several at once in this way: I can locate a PDF table using regex, even if it is scanned, and use Bytescout’s OCR and XML methods to export that table to XML format. This makes for a sophisticated tool for turning PDF files into web content.
Following the theme of extracting PDF data for my mechanical engineering example above, we can also extract PDF directly to Excel with this simple code:
REM Run the script from the command line
cscript.exe PdfToXls-CommandLine.vbs "../../sample3.pdf" "output.xlsx"
pause
This method shows how to use PDF Extractor tools via the command-line interface (CLI) as well, which is often convenient for prototyping code before commits.
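Because the conversion is a single command, it is straightforward to script it over a folder of spec sheets. The Python sketch below only builds the argument lists. It assumes the script takes an input path followed by an output path, as in the command above, and that each output should be named after its input. The function name build_commands is made up for this illustration.

```python
from pathlib import Path

SCRIPT = "PdfToXls-CommandLine.vbs"  # script name taken from the command above


def build_commands(pdf_paths, output_dir):
    """Return one cscript.exe argument list per input PDF."""
    commands = []
    for pdf in pdf_paths:
        # Assumption: name each workbook after its source PDF
        output = Path(output_dir) / (Path(pdf).stem + ".xlsx")
        commands.append(["cscript.exe", SCRIPT, str(pdf), str(output)])
    return commands


for command in build_commands(["specs/bearing.pdf", "specs/shaft.pdf"], "out"):
    print(command)
```

On a Windows machine with the SDK installed, each list could then be passed to subprocess.run(command, check=True) to perform the conversion.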
That’s the END OF PART 1. Check PART 2 for more examples.