Recently I had a challenging project: building an interface for a mechanical engineer who needed to chart and visualize data from PDF spec sheets in an Excel spreadsheet. Fortunately, I found the SDK tools from Bytescout, which took care of most of the technical challenges and made the whole project fun and easy. In this multi-part tutorial, we will explore the variety of tools available in Bytescout’s PDF Extractor SDK and learn how to put them to use on real-world problems. Let’s start with a bread-and-butter tool that gets a PDF table into CSV format for use in Excel or Google Sheets.
In the project mentioned above, the engineer needed to pull tables of physical wear specs from PDF files into spreadsheets and chart the wear over time. Bytescout’s SDK makes it easy to export tables from PDF pages; in this case we use VBScript to convert PDF to CSV for Excel. This is convenient because VBScript is closely related to VBA, the native macro language of Excel. Notice in this code sample that every detected table gets a page index and a table index in its output file name, so you can select exactly the tables you need. Later on, we’ll see how to use regex (regular expressions) to find precise points in a PDF to extract. Have a look:
' Create Bytescout.PDFExtractor.TableDetector object
Set tableDetector = CreateObject("Bytescout.PDFExtractor.TableDetector")
tableDetector.RegistrationName = "demo"
tableDetector.RegistrationKey = "demo"

' Create Bytescout.PDFExtractor.CSVExtractor object
Set csvExtractor = CreateObject("Bytescout.PDFExtractor.CSVExtractor")
csvExtractor.RegistrationName = "demo"
csvExtractor.RegistrationKey = "demo"

' We define what kind of tables to detect.
' So we set min required number of columns to 3 ...
tableDetector.DetectionMinNumberOfColumns = 3
' ... and we set min required number of rows to 3
tableDetector.DetectionMinNumberOfRows = 3

' Load sample PDF document
tableDetector.LoadDocumentFromFile("..\..\sample3.pdf")
csvExtractor.LoadDocumentFromFile "..\..\sample3.pdf"

' Get page count
pageCount = tableDetector.GetPageCount()

' Iterate through pages
For i = 0 to pageCount - 1

    t = 0

    ' Find first table and continue if found
    If (tableDetector.FindTable(i)) Then

        Do
            ' Set extraction area for CSV extractor to rectangle received from the table detector
            csvExtractor.SetExtractionArea _
                tableDetector.GetFoundTableRectangle_Left(), _
                tableDetector.GetFoundTableRectangle_Top(), _
                tableDetector.GetFoundTableRectangle_Width(), _
                tableDetector.GetFoundTableRectangle_Height()

            ' Export the table to CSV file
            csvExtractor.SavePageCSVToFile i, "page-" & CStr(i) & "-table-" & CStr(t) & ".csv"

            t = t + 1

        Loop While tableDetector.FindNextTable()

    End If

Next

Set csvExtractor = Nothing
Set tableDetector = Nothing
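Once the CSV files exist, it can be handy to check a row from the same script before opening it in Excel. Here is a minimal sketch in plain VBScript with no SDK calls. SplitCsvLine is a hypothetical helper of my own, and it assumes the file uses commas as separators and double quotes around fields that contain commas; check your extractor settings if your output looks different.

```vbscript
' Hypothetical helper (not part of the SDK): split one CSV line into fields.
' Assumes comma separators and double-quoted fields, with "" as an escaped quote.
Function SplitCsvLine(line)
    Dim fields(), count, i, ch, inQuotes, current
    ReDim fields(0)
    count = 0
    current = ""
    inQuotes = False
    i = 1
    Do While i <= Len(line)
        ch = Mid(line, i, 1)
        If inQuotes Then
            If ch = """" Then
                If Mid(line, i + 1, 1) = """" Then
                    current = current & """"   ' escaped quote inside a quoted field
                    i = i + 1
                Else
                    inQuotes = False           ' closing quote
                End If
            Else
                current = current & ch
            End If
        ElseIf ch = """" Then
            inQuotes = True                    ' opening quote
        ElseIf ch = "," Then
            ReDim Preserve fields(count)       ' field finished, store it
            fields(count) = current
            count = count + 1
            current = ""
        Else
            current = current & ch
        End If
        i = i + 1
    Loop
    ReDim Preserve fields(count)               ' store the last field
    fields(count) = current
    SplitCsvLine = fields
End Function

' Usage with a made-up row of wear data
row = SplitCsvLine("Bearing A,""0.12, measured"",2021")
WScript.Echo UBound(row) + 1   ' 3
WScript.Echo row(1)            ' 0.12, measured
```

To process a whole file, you could read it line by line with Scripting.FileSystemObject and pass each line to the helper.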
Continuing the above theme of table extraction, suppose the original PDFs contain scanned tables. No problem! This is where the Bytescout SDK really meets the challenge of extracting content from PDF files. In this next code sample, we demonstrate the OCRAnalyzer class, which analyzes scanned documents in PDF format to find the Optical Character Recognition (OCR) parameters that give the best recognition quality for that document. Those parameters are then applied to a TextExtractor, which performs the actual extraction. To follow along with this VB.NET code example, first add references to Bytescout.PDFExtractor.dll and Bytescout.PDFExtractor.OCRExtension.dll to your project, then add the Imports statements at the top of the file:
Imports System.Drawing
Imports Bytescout.PDFExtractor

' This example demonstrates the use of OCR Analyzer - a tooling class for analysis of scanned documents
' in PDF or raster image formats to find best parameters for Optical Character Recognition (OCR) that
' provide highest recognition quality.

' To make OCR work you should add the following references to your project:
' 'Bytescout.PDFExtractor.dll', 'Bytescout.PDFExtractor.OCRExtension.dll'.

Class Program

    Friend Shared Sub Main(args As String())

        ' Input document
        Dim inputDocument As String = ".\sample_ocr.pdf"

        ' Document page index
        Dim pageIndex As Integer = 0

        ' Area of the document page to perform the analysis (optional).
        ' RectangleF.Empty means the full page.
        Dim rectangle As RectangleF = RectangleF.Empty ' New RectangleF(100, 50, 350, 250)

        ' Location of "tessdata" folder containing language data files
        Dim ocrLanguageDataFolder As String = "c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\"

        ' OCR language
        Dim ocrLanguage As String = "eng" ' "eng" for English, "deu" for German, "fra" for French, "spa" for Spanish etc - according to files in /tessdata
        ' Find more language files at https://github.com/tesseract-ocr/tessdata/tree/3.04.00

        ' Create OCRAnalyzer instance and activate it with your registration information
        Using ocrAnalyzer As New OCRAnalyzer("demo", "demo")

            ' Display analysis progress
            AddHandler ocrAnalyzer.ProgressChanged, Sub(sender, message, progress, ByRef cancel)
                                                        Console.WriteLine(message)
                                                    End Sub

            ' Load document to OCRAnalyzer
            ocrAnalyzer.LoadDocumentFromFile(inputDocument)

            ' Setup OCRAnalyzer
            ocrAnalyzer.OCRLanguage = ocrLanguage
            ocrAnalyzer.OCRLanguageDataFolder = ocrLanguageDataFolder

            ' Set page area for analysis (optional)
            ocrAnalyzer.SetExtractionArea(rectangle)

            ' Perform analysis and get results
            Dim analysisResults As OCRAnalysisResults = ocrAnalyzer.AnalyzeByOCRConfidence(pageIndex)

            ' Now extract page text using detected OCR parameters
            Dim outputDocument As String = ".\result.txt"

            ' Create TextExtractor instance
            Using textExtractor As TextExtractor = New TextExtractor("demo", "demo")

                ' Load document to TextExtractor
                textExtractor.LoadDocumentFromFile(inputDocument)

                ' Setup TextExtractor
                textExtractor.OCRMode = OCRMode.Auto
                textExtractor.OCRLanguageDataFolder = ocrLanguageDataFolder
                textExtractor.OCRLanguage = ocrLanguage

                ' Apply analysis results to TextExtractor instance
                ocrAnalyzer.ApplyResults(analysisResults, textExtractor)

                ' Set extraction area (optional)
                textExtractor.SetExtractionArea(rectangle)

                ' Save extracted text to file
                textExtractor.SaveTextToFile(outputDocument)

                ' Open output file in default associated application (for demonstration purposes)
                System.Diagnostics.Process.Start(outputDocument)

            End Using

        End Using

    End Sub

End Class
One great advantage of the .NET Framework is that all of its languages share a vast common library of classes and methods. This makes it easy to move between languages in projects where team members have varied backgrounds. Here I am going to illustrate this convenience by translating the method for extracting text from a scanned PDF with OCR into C#. As you can see in the code sample below, only minor syntax differences distinguish this code from the previous example, for example “using” instead of “Imports.” Have a look:
using System;
using System.Drawing;
using Bytescout.PDFExtractor;

// This example demonstrates the use of OCR Analyzer - a tooling class for analysis of scanned documents
// in PDF or raster image formats to find best parameters for Optical Character Recognition (OCR) that
// provide highest recognition quality.

// To make OCR work you should add the following references to your project:
// 'Bytescout.PDFExtractor.dll', 'Bytescout.PDFExtractor.OCRExtension.dll'.

namespace OCRAnalyser
{
    class Program
    {
        static void Main(string[] args)
        {
            // Input document
            string inputDocument = @".\sample_ocr.pdf";

            // Document page index
            int pageIndex = 0;

            // Area of the document page to perform the analysis (optional).
            // RectangleF.Empty means the full page.
            RectangleF rectangle = RectangleF.Empty; // new RectangleF(100, 50, 350, 250);

            // Location of "tessdata" folder containing language data files
            string ocrLanguageDataFolder = @"c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\";

            // OCR language
            string ocrLanguage = "eng"; // "eng" for English, "deu" for German, "fra" for French, "spa" for Spanish etc - according to files in /tessdata
            // Find more language files at https://github.com/tesseract-ocr/tessdata/tree/3.04.00

            // Create OCRAnalyzer instance and activate it with your registration information
            using (OCRAnalyzer ocrAnalyzer = new OCRAnalyzer("demo", "demo"))
            {
                // Display analysis progress
                ocrAnalyzer.ProgressChanged += (object sender, string message, double progress, ref bool cancel) =>
                {
                    Console.WriteLine(message);
                };

                // Load document to OCRAnalyzer
                ocrAnalyzer.LoadDocumentFromFile(inputDocument);

                // Setup OCRAnalyzer
                ocrAnalyzer.OCRLanguage = ocrLanguage;
                ocrAnalyzer.OCRLanguageDataFolder = ocrLanguageDataFolder;

                // Set page area for analysis (optional)
                ocrAnalyzer.SetExtractionArea(rectangle);

                // Perform analysis and get results
                OCRAnalysisResults analysisResults = ocrAnalyzer.AnalyzeByOCRConfidence(pageIndex);

                // Now extract the text using detected OCR parameters
                string outputDocument = @".\result.txt";

                // Create TextExtractor instance
                using (TextExtractor textExtractor = new TextExtractor("demo", "demo"))
                {
                    // Load document to TextExtractor
                    textExtractor.LoadDocumentFromFile(inputDocument);

                    // Setup TextExtractor
                    textExtractor.OCRMode = OCRMode.Auto;
                    textExtractor.OCRLanguageDataFolder = ocrLanguageDataFolder;
                    textExtractor.OCRLanguage = ocrLanguage;

                    // Apply analysis results to TextExtractor instance
                    ocrAnalyzer.ApplyResults(analysisResults, textExtractor);

                    // Set extraction area (optional)
                    textExtractor.SetExtractionArea(rectangle);

                    // Save extracted text to file
                    textExtractor.SaveTextToFile(outputDocument);

                    // Open output file in default associated application (for demonstration purposes)
                    System.Diagnostics.Process.Start(outputDocument);
                }
            }
        }
    }
}
In the examples above, we assumed that we knew in advance the location or index of the table needed. Now suppose we need to locate text in PDF files by matching a distinctive pattern of characters, such as a date format. Here, Bytescout adds intelligence to the toolbox by supporting standard regular expression pattern matching to find and extract text in a PDF document. This makes it easy to find content without knowing its location or page number. In the following code, the date pattern is set with the line pattern = "[0-9]{2}/[0-9]{2}/[0-9]{4}", which matches two digits, a slash, two digits, a slash, and four digits:
' Create Bytescout.PDFExtractor.TextExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\Invoice.pdf")

extractor.RegexSearch = True ' Turn on the regex search
pattern = "[0-9]{2}/[0-9]{2}/[0-9]{4}" ' Search dates in format 'mm/dd/yyyy'

' Get page count
pageCount = extractor.GetPageCount()

For i = 0 to pageCount - 1
    If extractor.Find(i, pattern, false) Then ' Parameters are: page index, string to find, case sensitivity
        Do
            extractedString = extractor.FoundText.Text
            MsgBox "Found match on page #" & CStr(i) & ": " & extractedString
            extractor.ResetExtractionArea()
        Loop While extractor.FindNext
    End If
Next

MsgBox "Done"

Set extractor = Nothing
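Before running a pattern against a large PDF, I find it useful to try it on a plain string first. The sketch below uses VBScript's built-in RegExp object rather than the SDK, so it only shows what the pattern itself matches; I am assuming here that the SDK's regex engine treats this simple pattern the same way. FindDates is a hypothetical helper name and the sample text is made up.

```vbscript
' Hypothetical helper (not part of the SDK): list every date-like match in a string
' using VBScript's built-in RegExp object and the same pattern as the SDK sample.
Function FindDates(text)
    Dim re, matches, m, found
    Set re = New RegExp
    re.Pattern = "[0-9]{2}/[0-9]{2}/[0-9]{4}"
    re.Global = True                 ' return all matches, not just the first
    Set matches = re.Execute(text)
    found = ""
    For Each m In matches
        If found <> "" Then found = found & "; "
        found = found & m.Value
    Next
    FindDates = found
End Function

' Usage with made-up invoice text
sample = "Invoice date 02/15/2021, due 03/17/2021, ref 2021-77"
WScript.Echo FindDates(sample)   ' 02/15/2021; 03/17/2021
```

Note that the pattern only checks the shape of the text: a string such as 99/99/9999 also matches, so validate the values separately if you need real dates.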
There is often a need to render tabular data in XML (Extensible Markup Language), which encodes documents in a format readable by both humans and machines. This continues our discussion of how to extract text from PDF, and in particular how to extract table data from PDF documents. Bytescout’s Extractor SDK contains methods to render PDFs into the widely used XML format. The following code combines the table detector with a method that saves the tabular data as XML. In my own experience, this is valuable for converting PDFs to web content:
' Create Bytescout.PDFExtractor.TableDetector object
Set tableDetector = CreateObject("Bytescout.PDFExtractor.TableDetector")
tableDetector.RegistrationName = "demo"
tableDetector.RegistrationKey = "demo"

' Create Bytescout.PDFExtractor.XMLExtractor object
Set xmlExtractor = CreateObject("Bytescout.PDFExtractor.XMLExtractor")
xmlExtractor.RegistrationName = "demo"
xmlExtractor.RegistrationKey = "demo"

' We should define what kind of tables we should detect.
' So we set min required number of columns to 3 ...
tableDetector.DetectionMinNumberOfColumns = 3
' ... and we set min required number of rows to 3
tableDetector.DetectionMinNumberOfRows = 3

' Load sample PDF document
tableDetector.LoadDocumentFromFile("..\..\sample3.pdf")
xmlExtractor.LoadDocumentFromFile "..\..\sample3.pdf"

' Get page count
pageCount = tableDetector.GetPageCount()

' Iterate through pages
For i = 0 to pageCount - 1

    t = 0

    ' Find first table and continue if found
    If (tableDetector.FindTable(i)) Then

        Do
            ' Set extraction area for XML extractor to rectangle received from the table detector
            xmlExtractor.SetExtractionArea _
                tableDetector.GetFoundTableRectangle_Left(), _
                tableDetector.GetFoundTableRectangle_Top(), _
                tableDetector.GetFoundTableRectangle_Width(), _
                tableDetector.GetFoundTableRectangle_Height()

            ' Export the table to XML file
            xmlExtractor.SavePageXMLToFile i, "page-" & CStr(i) & "-table-" & CStr(t) & ".xml"

            t = t + 1

        Loop While tableDetector.FindNextTable()

    End If

Next

Set xmlExtractor = Nothing
Set tableDetector = Nothing
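When the XML is headed for a web page, a quick well-formedness check can catch a truncated or damaged file early. This is a minimal sketch that uses the MSXML parser that ships with Windows, not the SDK. XmlParseError is a hypothetical helper, and the element names in the sample strings are made up, since I am not assuming anything about the schema the extractor writes.

```vbscript
' Hypothetical helper (not part of the SDK): return "" if the text is well-formed XML,
' otherwise return the parser's error description.
Function XmlParseError(xmlText)
    Dim doc
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False                        ' parse synchronously
    If doc.loadXML(xmlText) Then
        XmlParseError = ""
    Else
        XmlParseError = doc.parseError.reason
    End If
End Function

' Usage with made-up markup: the second string is missing a closing tag
If XmlParseError("<table><row>1</row></table>") = "" Then WScript.Echo "first sample is well-formed"
If XmlParseError("<table><row>1</table>") <> "" Then WScript.Echo "second sample is not well-formed"
```

To check a saved file instead of a string, the same DOMDocument object offers a load method that takes a file path.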
When I extract data from PDF tables, the actual table structure may be of interest, for example to make sure the table renders accurately in the target document. For developers especially, it is often useful to replicate an existing table structure for consistency across documents. The next code sample illustrates how to capture the table structure, row by row and cell by cell, when extracting a table from a PDF file:
' Create Bytescout.PDFExtractor.StructuredExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.StructuredExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile "../../sample3.pdf"

For ipage = 0 To extractor.GetPageCount() - 1

    ' Start extraction from the current page
    extractor.PrepareStructure ipage

    rowCount = extractor.GetRowCount(ipage)

    For row = 0 To rowCount - 1
        columnCount = extractor.GetColumnCount(ipage, row)

        For col = 0 To columnCount - 1
            WScript.Echo "Cell at page #" & CStr(ipage) & ", row=" & CStr(row) & ", column=" & _
                CStr(col) & vbCRLF & extractor.GetCellValue(ipage, row, col)
        Next
    Next
Next
I am particularly impressed with the Bytescout SDK methods for extracting images from PDF files. Here again is a great set of tools for preparing web content from media previously locked up in PDF documents. Notice the easy use of the GetFirstImage method, which starts the enumeration. Inside the loop, each image is saved to a file and GetNextImage advances to the next one. The outcome is that all images in the PDF are extracted and saved individually in the current folder. Have a look at the code here:
' Create Bytescout.PDFExtractor.ImageExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.ImageExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")

i = 0

' Initialize image enumeration
If extractor.GetFirstImage() Then
    Do
        outputFileName = "image" & i & ".png"

        ' Save image to file
        extractor.SaveCurrentImageToFile outputFileName

        i = i + 1

    Loop While extractor.GetNextImage() ' Advance image enumeration
End If

' Open first output image in default associated application
Set shell = CreateObject("WScript.Shell")
shell.Run "image0.png", 1, false
Set shell = Nothing

Set extractor = Nothing
The next natural extension of the code above is to go page by page and have the Extractor SDK fetch and save all images on each page. The ImageExtractor is used for this purpose as well. First, in the code below, the pages are enumerated. Next, as in the previous example, the images on each page are enumerated and saved as PNG files. Finally, a WScript.Shell object opens the first saved image. These examples focus on VBScript; later we will explore more methods in other languages such as C#. Have a look at this example:
' Create Bytescout.PDFExtractor.ImageExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.ImageExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")

' Get page count
pageCount = extractor.GetPageCount()

' Extract images from each page
For i = 0 To pageCount - 1

    j = 0

    ' Initialize page images enumeration
    If extractor.GetFirstPageImage(i) Then
        Do
            outputFileName = "page" & i & "image" & j & ".png"

            ' Save image to file
            extractor.SaveCurrentImageToFile outputFileName

            j = j + 1

        Loop While extractor.GetNextImage() ' Advance image enumeration
    End If
Next

' Open first output file in default associated application
Set shell = CreateObject("WScript.Shell")
shell.Run "page0image0.png", 1, false
Set shell = Nothing

Set extractor = Nothing
Among the advanced methods in the Extractor SDK toolkit is the capability to enumerate the coordinates of all images within a PDF document. When I need to edit scanned PDF files that may contain both images and text, it is especially valuable to break out both, first by using OCR as in the earlier examples and then by identifying image content as in the code below. The GetCurrentImageRectangle_Left, _Top, _Width, and _Height methods return the position and size of the current image, as illustrated in this example:
' Create Bytescout.PDFExtractor.ImageExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.ImageExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")

i = 0

' Initialize image enumeration
If extractor.GetFirstImage() Then
    Do
        ' Display coordinates of the image
        MsgBox "Image #" & CStr(i) & vbCRLF & "Coordinates: " & _
            CStr(extractor.GetCurrentImageRectangle_Left()) & ", " & _
            CStr(extractor.GetCurrentImageRectangle_Top()) & ", " & _
            CStr(extractor.GetCurrentImageRectangle_Width()) & ", " & _
            CStr(extractor.GetCurrentImageRectangle_Height())

        i = i + 1

    Loop While extractor.GetNextImage() ' Advance image enumeration
End If

Set extractor = Nothing
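With image rectangles in hand, one simple way to separate image regions from text regions is to test whether an image rectangle overlaps an extraction area, such as a table rectangle reported by the table detector. The sketch below is plain arithmetic with no SDK calls. RectanglesOverlap is a hypothetical helper, the numbers are made up, and it assumes both rectangles are given as left, top, width, height in the same units and coordinate system.

```vbscript
' Hypothetical helper (not part of the SDK): True if two rectangles share any area.
' Each rectangle is given as left, top, width, height in the same coordinate system.
Function RectanglesOverlap(l1, t1, w1, h1, l2, t2, w2, h2)
    RectanglesOverlap = (l1 < l2 + w2) And (l2 < l1 + w1) And _
                        (t1 < t2 + h2) And (t2 < t1 + h1)
End Function

' Usage with made-up coordinates: an image rectangle and a table rectangle
If RectanglesOverlap(100, 50, 200, 100, 250, 120, 300, 200) Then
    WScript.Echo "image overlaps the table area"
Else
    WScript.Echo "image is outside the table area"
End If
```

Rectangles that only touch along an edge are treated as not overlapping, which is usually what you want when deciding whether an image intrudes into a text area.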
The methods of the Bytescout Extractor illustrated here can naturally be combined. For example, I have used several of them together in this way: locate a table in a PDF using regex, run OCR if the document is scanned, and then use the XML methods to export the table to XML format. This makes for a sophisticated tool for turning PDF files into web content.
Following the theme of extracting PDF data for my mechanical engineering example above, we can also convert a PDF directly to Excel by running the PdfToXls-CommandLine.vbs script from a batch file:
REM Run the script from the command line
cscript.exe PdfToXls-CommandLine.vbs "../../sample3.pdf" "output.xlsx"
pause
This method shows how to use PDF Extractor tools via the command-line interface (CLI) as well, which is often convenient for prototyping code before commits.
That’s the END OF PART 1. Check PART 2 for more examples.