The Awesome ByteScout PDF Extractor Tools (Part 1)

Recently I had a challenging project: building an interface for a mechanical engineer who needed to chart and visualize data from PDF spec sheets in an Excel spreadsheet. Fortunately, I found the SDK tools from Bytescout, which took care of the hardest technical parts and made the whole project fun. In this multi-part tutorial, we will explore the tools available in Bytescout’s PDF Extractor SDK and learn how to use them to solve real-world problems. Let’s start with a bread-and-butter task: getting a PDF table into CSV format for use in Excel or Google Sheets.

Getting a PDF table into CSV

In the project mentioned above, a mechanical engineer needed to pull tables of physical wear specs from PDF files into spreadsheets and chart the wear over time. Bytescout’s SDK makes it easy to export tables from PDF pages; in this case we use VBScript to convert PDF to CSV for Excel. This is ideal because VBScript is nearly identical to VBA, the native macro language of Excel. Notice in this code sample that the script loops over every table detected on each page and numbers it, so that you can select exactly those needed. Later on, we’ll see how to use regex (regular expressions) to find precise points in a PDF to extract. Have a look:

' Create Bytescout.PDFExtractor.TableDetector object
Set tableDetector= CreateObject("Bytescout.PDFExtractor.TableDetector")
tableDetector.RegistrationName = "demo"
tableDetector.RegistrationKey = "demo"
 
' Create Bytescout.PDFExtractor.CSVExtractor object
Set csvExtractor = CreateObject("Bytescout.PDFExtractor.CSVExtractor")
csvExtractor.RegistrationName = "demo"
csvExtractor.RegistrationKey = "demo"
 
' We define what kind of tables to detect.
' So we set min required number of columns to 3 ...
tableDetector.DetectionMinNumberOfColumns = 3
' ... and we set min required number of rows to 3
tableDetector.DetectionMinNumberOfRows = 3
 
' Load sample PDF document
tableDetector.LoadDocumentFromFile("..\..\sample3.pdf")
csvExtractor.LoadDocumentFromFile "..\..\sample3.pdf"
 
' Get page count
pageCount = tableDetector.GetPageCount()
 
' Iterate through pages
For i = 0 to pageCount - 1
  
    t = 0
    ' Find first table and continue if found
    If (tableDetector.FindTable(i)) Then
 
        Do
            ' Set extraction area for CSV extractor to rectangle received from the table detector
            csvExtractor.SetExtractionArea _
                tableDetector.GetFoundTableRectangle_Left(), _
                tableDetector.GetFoundTableRectangle_Top(), _
                tableDetector.GetFoundTableRectangle_Width(), _
                tableDetector.GetFoundTableRectangle_Height()
            ' Export the table to CSV file
            csvExtractor.SavePageCSVToFile i, "page-" & CStr(i) & "-table-" & CStr(t) & ".csv"
            t = t + 1
        Loop While tableDetector.FindNextTable()
         
    End If
 
Next
 
Set csvExtractor = Nothing
Set tableDetector = Nothing
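
Before opening the result in Excel, it can help to check what was exported. The following is a minimal sketch, not part of the SDK samples, that prints each cell of one output file using only the standard Scripting.FileSystemObject. It assumes the file name page-0-table-0.csv (the naming pattern used above), a comma as the field separator, and no quoted fields containing commas; adjust these if your output differs.

```vbscript
' Minimal sketch: print each cell of a CSV file produced by the script above.
' Assumptions: the file name follows the "page-<i>-table-<t>.csv" pattern,
' fields are comma-separated, and no field contains a quoted comma.
Const ForReading = 1
csvPath = "page-0-table-0.csv"

Set fso = CreateObject("Scripting.FileSystemObject")
If Not fso.FileExists(csvPath) Then
    WScript.Echo "File not found: " & csvPath
    WScript.Quit 1
End If

Set csvFile = fso.OpenTextFile(csvPath, ForReading)
rowIndex = 0
Do Until csvFile.AtEndOfStream
    ' Naive split on commas; good enough for a quick look at simple tables
    cells = Split(csvFile.ReadLine, ",")
    For colIndex = 0 To UBound(cells)
        WScript.Echo "row=" & rowIndex & ", column=" & colIndex & ": " & cells(colIndex)
    Next
    rowIndex = rowIndex + 1
Loop
csvFile.Close

Set csvFile = Nothing
Set fso = Nothing
```

Run it with cscript.exe from the folder that contains the exported CSV files.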

Using OCR with Scanned PDFs

Continuing the theme of table extraction, suppose the original PDFs contain scanned tables. No problem! This is where the Bytescout SDK really rises to the challenge of extracting content from PDF files. The next code sample demonstrates the OCRAnalyzer, a class that analyzes scanned documents in PDF format to find the OCR (Optical Character Recognition) parameters that give the best recognition quality. Those parameters are then applied to a TextExtractor, which extracts the text. To follow along with this VB.NET example, first add references to the two DLLs named in the code comments to your project, then add the Imports statements:

Imports System.Drawing
Imports Bytescout.PDFExtractor
 
' This example demonstrates the use of OCR Analyser - a tooling class for analysis of scanned documents
' in PDF or raster image formats to find best parameters for Optical Character Recognition (OCR) that
' provide highest recognition quality.
 
' To make OCR work you should add the following references to your project:
' 'Bytescout.PDFExtractor.dll', 'Bytescout.PDFExtractor.OCRExtension.dll'.
 
Class Program
 
    Friend Shared Sub Main(args As String())
 
        ' Input document
        Dim inputDocument As String = ".\sample_ocr.pdf"
 
        ' Document page index
        Dim pageIndex As Integer = 0
 
        ' Area of the document page to perform the analysis (optional).
        ' RectangleF.Empty means the full page.
        Dim rectangle As RectangleF = RectangleF.Empty ' New RectangleF(100, 50, 350, 250)
 
        ' Location of "tessdata" folder containing language data files
        Dim ocrLanguageDataFolder As String = "c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\"
 
        ' OCR language
        Dim ocrLanguage As String = "eng" ' "eng" for English, "deu" for German, "fra" for French, "spa" for Spanish, etc., according to the files in /tessdata
        ' Find more language files at https://github.com/tesseract-ocr/tessdata/tree/3.04.00
 
 
        ' Create OCRAnalyzer instance and activate it with your registration information
        Using ocrAnalyzer As New OCRAnalyzer("demo", "demo")
 
            ' Display analysis progress
            AddHandler ocrAnalyzer.ProgressChanged, Sub(sender, message, progress, ByRef cancel)
                                                        Console.WriteLine(message)
                                                    End Sub
 
            ' Load document to OCRAnalyzer
            ocrAnalyzer.LoadDocumentFromFile(inputDocument)
 
            ' Setup OCRAnalyzer
            ocrAnalyzer.OCRLanguage = ocrLanguage
            ocrAnalyzer.OCRLanguageDataFolder = ocrLanguageDataFolder
 
            ' Set page area for analysis (optional)
            ocrAnalyzer.SetExtractionArea(rectangle)
 
            ' Perform analysis and get results
            Dim analysisResults As OCRAnalysisResults = ocrAnalyzer.AnalyzeByOCRConfidence(pageIndex)
 
 
            ' Now extract page text using detected OCR parameters
 
            Dim outputDocument As String = ".\result.txt"
 
            ' Create TextExtractor instance
            Using textExtractor As TextExtractor = New TextExtractor("demo", "demo")
 
                ' Load document to TextExtractor
                textExtractor.LoadDocumentFromFile(inputDocument)
 
                ' Setup TextExtractor
                textExtractor.OCRMode = OCRMode.Auto
                textExtractor.OCRLanguageDataFolder = ocrLanguageDataFolder
                textExtractor.OCRLanguage = ocrLanguage
 
                ' Apply analysis results to TextExtractor instance
                ocrAnalyzer.ApplyResults(analysisResults, textExtractor)
 
                ' Set extraction area (optional)
                textExtractor.SetExtractionArea(rectangle)
 
                ' Save extracted text to file
                textExtractor.SaveTextToFile(outputDocument)
 
                ' Open output file in default associated application (for demonstration purposes)
                System.Diagnostics.Process.Start(outputDocument)
 
            End Using
 
        End Using
     End Sub
   
End Class

Revisiting OCR Methods with C#

One great advantage of the .NET Framework is that all of its languages share the same vast class library. This makes it easier to move between languages on projects where team members have varied backgrounds. To illustrate, here is the same method for extracting text from a scanned PDF with OCR, translated to C#. As you can see in the code sample below, only minor differences distinguish it from the previous example, mostly familiar C# syntax such as “using” instead of “Imports.” Have a look:

using System;
using System.Drawing;
using Bytescout.PDFExtractor;
 
// This example demonstrates the use of OCR Analyser - a tooling class for analysis of scanned documents
// in PDF or raster image formats to find best parameters for Optical Character Recognition (OCR) that
// provide highest recognition quality.
 
// To make OCR work you should add the following references to your project:
// 'Bytescout.PDFExtractor.dll', 'Bytescout.PDFExtractor.OCRExtension.dll'.
 
namespace OCRAnalyser
{
    class Program
    {
        static void Main(string[] args)
        {
            // Input document
            string inputDocument = @".\sample_ocr.pdf";
             
            // Document page index
            int pageIndex = 0;
             
            // Area of the document page to perform the analysis (optional).
            // RectangleF.Empty means the full page.
            RectangleF rectangle = RectangleF.Empty; // new RectangleF(100, 50, 350, 250);
 
            // Location of "tessdata" folder containing language data files
            string ocrLanguageDataFolder = @"c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\";
 
            // OCR language
            string ocrLanguage = "eng"; // "eng" for English, "deu" for German, "fra" for French, "spa" for Spanish, etc., according to the files in /tessdata
            // Find more language files at https://github.com/tesseract-ocr/tessdata/tree/3.04.00
 
 
            // Create OCRAnalyzer instance and activate it with your registration information
            using (OCRAnalyzer ocrAnalyzer = new OCRAnalyzer("demo", "demo"))
            {
                // Display analysis progress
                ocrAnalyzer.ProgressChanged += (object sender, string message, double progress, ref bool cancel) =>
                {
                    Console.WriteLine(message);
                };
 
                // Load document to OCRAnalyzer
                ocrAnalyzer.LoadDocumentFromFile(inputDocument);
 
                // Setup OCRAnalyzer
                ocrAnalyzer.OCRLanguage = ocrLanguage;
                ocrAnalyzer.OCRLanguageDataFolder = ocrLanguageDataFolder;
                 
                // Set page area for analysis (optional)
                ocrAnalyzer.SetExtractionArea(rectangle);
                 
                // Perform analysis and get results
                OCRAnalysisResults analysisResults = ocrAnalyzer.AnalyzeByOCRConfidence(pageIndex);
 
 
                // Now extract the text using detected OCR parameters
 
                string outputDocument = @".\result.txt";
                 
                // Create TextExtractor instance
                using (TextExtractor textExtractor = new TextExtractor("demo", "demo"))
                {
                    // Load document to TextExtractor
                    textExtractor.LoadDocumentFromFile(inputDocument);
 
                    // Setup TextExtractor
                    textExtractor.OCRMode = OCRMode.Auto;
                    textExtractor.OCRLanguageDataFolder = ocrLanguageDataFolder;
                    textExtractor.OCRLanguage = ocrLanguage;
 
                    // Apply analysis results to TextExtractor instance
                    ocrAnalyzer.ApplyResults(analysisResults, textExtractor);
 
                    // Set extraction area (optional)
                    textExtractor.SetExtractionArea(rectangle);
 
                    // Save extracted text to file
                    textExtractor.SaveTextToFile(outputDocument);
 
                    // Open output file in default associated application (for demonstration purposes)
                    System.Diagnostics.Process.Start(outputDocument);
                }
            }
        }
    }
}

Find Text to Extract Using Regular Expressions

In the examples above, we assumed that we knew in advance the location or index of the table needed. Now suppose we need to locate text in PDF files by matching a distinctive pattern of characters, such as a date format. Here, Bytescout adds intelligence to the toolbox by supporting standard regular expression pattern matching to find and extract text in a PDF document. This makes it easy to find text without knowing its location or page number in advance. In the following code, dates are searched for with the regular expression assigned in the line pattern = "[0-9]{2}/[0-9]{2}/[0-9]{4}":

' Create Bytescout.PDFExtractor.TextExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
 
' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\Invoice.pdf")
 
extractor.RegexSearch = True ' Turn on the regex search
pattern = "[0-9]{2}/[0-9]{2}/[0-9]{4}" ' Search dates in format 'mm/dd/yyyy'
 
' Get page count
pageCount = extractor.GetPageCount()
 
For i = 0 to pageCount - 1
    If extractor.Find(i, pattern, false) Then ' Parameters are: page index, string to find, case sensitivity
        Do
            extractedString = extractor.FoundText.Text
            MsgBox "Found match on page #" & CStr(i) & ": " & extractedString
            extractor.ResetExtractionArea()
        Loop While extractor.FindNext
    End If
Next
MsgBox "Done"
Set extractor = Nothing
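
To see what the date pattern does and does not match, it can be tried outside the SDK first. The sketch below uses VBScript's built-in RegExp object on a made-up sample string (the text and variable names are illustrative only). The SDK's own regex engine may differ in details, so treat this as a quick check of the pattern rather than a statement about the SDK's behavior.

```vbscript
' Minimal sketch: test the date pattern with VBScript's built-in RegExp object.
' The sample text is invented for illustration.
Set re = New RegExp
re.Pattern = "[0-9]{2}/[0-9]{2}/[0-9]{4}"
re.Global = True ' Find every match, not just the first

sampleText = "Invoice date 03/15/2019, due 04/14/2019, ref 1/2/19"

Set matches = re.Execute(sampleText)
WScript.Echo "Matches found: " & matches.Count
For Each m In matches
    ' FirstIndex is the zero-based position of the match in the text
    WScript.Echo m.Value & " at position " & m.FirstIndex
Next

Set matches = Nothing
Set re = Nothing
```

Run with cscript.exe, this reports two matches, 03/15/2019 and 04/14/2019. The short form 1/2/19 is skipped because the pattern requires exactly two, two, and four digits. Note that the pattern checks only the shape of the text: it would also accept 99/99/9999, so validate the values separately if that matters.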

Extract a PDF Table to XML

Tabular data often needs to be rendered in XML (Extensible Markup Language), which encodes documents in a format readable by both humans and machines. This continues our discussion of how to extract text, and in particular table data, from PDF documents. Bytescout’s Extractor SDK contains methods to render PDF content into the widely used XML format. The following code combines the table detector with the XML extractor to export tabular data to XML. In my own experience, this is valuable for converting PDFs to web content:

' Create Bytescout.PDFExtractor.TableDetector object
Set tableDetector= CreateObject("Bytescout.PDFExtractor.TableDetector")
tableDetector.RegistrationName = "demo"
tableDetector.RegistrationKey = "demo"
 
' Create Bytescout.PDFExtractor.XMLExtractor object
Set xmlExtractor = CreateObject("Bytescout.PDFExtractor.XMLExtractor")
xmlExtractor.RegistrationName = "demo"
xmlExtractor.RegistrationKey = "demo"
 
' We should define what kind of tables we should detect.
' So we set min required number of columns to 3 ...
tableDetector.DetectionMinNumberOfColumns = 3
' ... and we set min required number of rows to 3
tableDetector.DetectionMinNumberOfRows = 3
 
' Load sample PDF document
tableDetector.LoadDocumentFromFile("..\..\sample3.pdf")
xmlExtractor.LoadDocumentFromFile "..\..\sample3.pdf"
 
' Get page count
pageCount = tableDetector.GetPageCount()
 
' Iterate through pages
For i = 0 to pageCount - 1
  
    t = 0
    ' Find first table and continue if found
    If (tableDetector.FindTable(i)) Then
 
        Do
            ' Set extraction area for XML extractor to rectangle received from the table detector
            xmlExtractor.SetExtractionArea _
                tableDetector.GetFoundTableRectangle_Left(), _
                tableDetector.GetFoundTableRectangle_Top(), _
                tableDetector.GetFoundTableRectangle_Width(), _
                tableDetector.GetFoundTableRectangle_Height()
            ' Export the table to XML file
            xmlExtractor.SavePageXMLToFile i, "page-" & CStr(i) & "-table-" & CStr(t) & ".xml"
            t = t + 1
        Loop While tableDetector.FindNextTable()
         
    End If
 
Next
 
Set xmlExtractor = Nothing
Set tableDetector = Nothing
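
A quick way to confirm that an exported file is well-formed XML is to load it with the MSXML parser that ships with Windows. The following is a minimal sketch, not taken from the SDK samples. It assumes the output file page-0-table-0.xml (the naming pattern used above) and deliberately makes no assumption about the element names the SDK writes; it simply lists whatever it finds at the top level.

```vbscript
' Minimal sketch: load an exported XML file and list its top-level elements.
' Assumption: "page-0-table-0.xml" was produced by the script above.
' No SDK-specific element names are assumed.
xmlPath = "page-0-table-0.xml"

Set doc = CreateObject("MSXML2.DOMDocument.6.0")
doc.async = False ' Load synchronously so the result is ready immediately

If Not doc.load(xmlPath) Then
    WScript.Echo "Could not parse " & xmlPath & ": " & doc.parseError.reason
    WScript.Quit 1
End If

Set root = doc.documentElement
WScript.Echo "Root element: " & root.nodeName

For Each node In root.childNodes
    If node.nodeType = 1 Then ' 1 = element node
        WScript.Echo "  child element: " & node.nodeName
    End If
Next

Set root = Nothing
Set doc = Nothing
```

Once you know the element names in your own output, the same DOM object can be queried with selectNodes and an XPath expression.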

Capturing the PDF Table Structure

When I extract data from PDF tables, the table structure itself may be of interest, for example to ensure accurate rendering in target documents. For developers especially, it is often useful to replicate an existing table structure for consistency across documents. The next code sample illustrates how to capture the table structure (rows, columns, and cell values) when extracting a table from a PDF file:

' Create Bytescout.PDFExtractor.StructuredExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.StructuredExtractor")
 
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
 
' Load sample PDF document
extractor.LoadDocumentFromFile "../../sample3.pdf"
            
For ipage = 0 To extractor.GetPageCount() - 1
 
    ' Prepare the table structure for the current page
    extractor.PrepareStructure ipage
 
    rowCount = extractor.GetRowCount(ipage)
     
    For row = 0 To rowCount - 1
        columnCount = extractor.GetColumnCount(ipage, row)
 
        For col = 0 To columnCount-1
            WScript.Echo "Cell at page #" & CStr(ipage) & ", row=" & CStr(row) & ", column=" & _
                CStr(col) & vbCRLF & extractor.GetCellValue(ipage, row, col)
        Next
    Next
Next

How to Extract Images from PDFs

I am particularly impressed with the Bytescout SDK methods to extract images from PDF files. Here again is a great set of tools for preparing web content from media previously locked up in PDF documents. Notice how the enumeration works: GetFirstImage starts it, SaveCurrentImageToFile saves the current image inside the loop, and GetNextImage advances to the next one. The result is that every image in the PDF is extracted and saved as a separate PNG file in the current folder. Have a look at the code here:

' Create Bytescout.PDFExtractor.ImageExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.ImageExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
 
' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")
 
i = 0
 
' Initialize image enumeration
If extractor.GetFirstImage() Then
    Do
        outputFileName = "image" & i & ".png"
        ' Save image to file
        extractor.SaveCurrentImageToFile outputFileName
        i = i + 1
    Loop While extractor.GetNextImage() ' Advance image enumeration
End If
 
' Open first output image in default associated application
Set shell = CreateObject("WScript.Shell")
shell.Run "image0.png", 1, false
Set shell = Nothing
 
Set extractor = Nothing

Extracting Images Specifically By Page

The next natural extension of the code above is to work page by page, having the Extractor SDK fetch and save all images on each page. The ImageExtractor is used again for this purpose. First, in the code below, the pages are enumerated. Next, as in the previous example, the images on each page are enumerated and saved as PNG files. Finally, WScript.Shell opens the first saved image. Most of the examples here use VBScript; the OCR example above shows the same SDK in VB.NET and C#. Have a look at this example:

' Create Bytescout.PDFExtractor.ImageExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.ImageExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
 
' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")
 
' Get page count
pageCount = extractor.GetPageCount()
         
' Extract images from each page
For i = 0 To pageCount - 1
    j = 0
    ' Initialize page images enumeration
    If extractor.GetFirstPageImage(i) Then
        Do
            outputFileName = "page" & i & "image" & j & ".png"
            ' Save image to file
            extractor.SaveCurrentImageToFile outputFileName
            j = j + 1
        Loop While extractor.GetNextImage() ' Advance image enumeration
    End If
Next
 
' Open first output file in default associated application
Set shell = CreateObject("WScript.Shell")
shell.Run "page0image0.png", 1, false
Set shell = Nothing
 
Set extractor = Nothing

Enumerating Image Coordinates

Among the advanced methods in the Extractor SDK toolkit is the ability to enumerate the coordinates of all images within a PDF document. When I need to edit scanned PDF files that may contain both images and text, it is especially valuable to separate the two: first recover the text with OCR as in the earlier examples, then locate the image content as in the code below. The GetCurrentImageRectangle_Left, _Top, _Width, and _Height methods return the position and size of the current image, as illustrated in this example:

' Create Bytescout.PDFExtractor.ImageExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.ImageExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
 
' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")
 
i = 0
 
' Initialize image enumeration
If extractor.GetFirstImage() Then
    Do
        ' display coordinates of the image
        MsgBox "Image #" & CStr(i) & vbCRLF & "Coordinates: " & CStr( extractor.GetCurrentImageRectangle_Left()) & ", " & _
            CStr( extractor.GetCurrentImageRectangle_Top()) & ", " & CStr( extractor.GetCurrentImageRectangle_Width()) & ", " & _
            CStr( extractor.GetCurrentImageRectangle_Height())
        i = i + 1
    Loop While extractor.GetNextImage() ' Advance image enumeration
End If
 
Set extractor = Nothing

A Powerful Combination of Tools

The methods of the Bytescout Extractor illustrated here can naturally be combined. For example, I have used several at once in this way: locate a PDF table using regex, even if it is scanned, then use Bytescout’s OCR and XML methods to export the table to XML. This makes for a sophisticated tool for turning PDF files into web content.

Following the theme of extracting PDF data for my mechanical engineering example above, we can also extract PDF directly to Excel with this simple batch file, which runs a VBScript from the command line:

REM Run the script from the command line
cscript.exe PdfToXls-CommandLine.vbs "../../sample3.pdf" "output.xlsx"
pause

This shows that the PDF Extractor tools can also be used from the command-line interface (CLI), which is often convenient for prototyping before code is committed.
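
The contents of PdfToXls-CommandLine.vbs are not shown here. As a hypothetical sketch of how such a script could read and check its two command-line arguments, the following uses only the standard WScript.Arguments collection; the usage text, exit codes, and variable names are my own assumptions, the real script may differ, and the SDK conversion call itself is left out.

```vbscript
' Hypothetical sketch of argument handling for a command-line VBScript.
' The real PdfToXls-CommandLine.vbs may differ; the SDK call is omitted.
If WScript.Arguments.Count < 2 Then
    WScript.Echo "Usage: cscript.exe PdfToXls-CommandLine.vbs <input.pdf> <output.xlsx>"
    WScript.Quit 1
End If

inputFile = WScript.Arguments(0)  ' First argument: source PDF
outputFile = WScript.Arguments(1) ' Second argument: target spreadsheet

Set fso = CreateObject("Scripting.FileSystemObject")
If Not fso.FileExists(inputFile) Then
    WScript.Echo "Input file not found: " & inputFile
    WScript.Quit 2
End If

WScript.Echo "Would convert " & inputFile & " to " & outputFile
' The SDK conversion call would go here.

Set fso = Nothing
```

Checking arguments up front and returning a non-zero exit code on failure lets a batch file or scheduled task detect problems without opening any dialog boxes.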

That’s the end of Part 1. Check Part 2 for more examples.