Extract PDF data from tables

1. Extract data from pdf tables with Adobe Acrobat Pro DC
2. Extract data manually with Adobe Reader
3. Extracting data from PDF tables using C#

  • Prerequisites
  • .NET Compatibility
  • PDF Extractor SDK, how does it work?
  • Filters

4. Export PDF table to CSV format with C#
5. Extract PDF table column with C#
6. PDF table To JSON using C#
7. Extract PDF table to XML using C#

Extract data from scanned documents / OCR

1. Extract data with Adobe Acrobat DC
2. Extract data from scanned document with poor quality of printing and handwriting note
3. Extract data with OCR from scanned documents using C# and PDF Extractor SDK
4. Extract rich media contents with Adobe Acrobat DC
5. Extract rich media contents from PDF with PDF Extractor SDK and C#
6. Extract audio file mp3 from PDF document with PDF extractor SDK and C#
7. Extract video file from PDF document with PDF extractor SDK and C#
8. Extract images from PDF file using Adobe Acrobat DC
9. Extract images from PDF file using C#
10. Extract embedded documents in PDF file

This article shows how to extract data from PDF files, including text, images, audio, and video, using C#. PDF has become the standard format for document exchange, and PDF documents are well suited to reliable viewing and printing of business documents. Almost all office suites, such as Microsoft Office, LibreOffice, and OpenOffice.org, support the format and offer the very useful “Export to PDF” feature. Exporting to a PDF file is now very easy, but what about the inverse process?
Suppose you have received a document in PDF format and want to extract some information from it. At first glance the task seems easy: copy from the source document and paste somewhere else. But things become complicated when you are dealing with a lot of data, and this tedious process quickly becomes painful. In that case it is better to use dedicated tools or specialized frameworks to automate the whole job. They improve your productivity and save you time. This article has four main sections:
– Extract PDF data from tables
– Extract data from scanned docs
– Extract data with OCR
– Extract rich media contents

Extract PDF data from tables

Extract data from pdf tables with Adobe Acrobat Pro DC

Adobe Acrobat is a commercial app made by Adobe, and it is the first and official software for working with PDF files. You can download the 7-day trial version at https://acrobat.adobe.com/us/en/free-trial-download.html. At the time of writing, the released version is Adobe Acrobat Pro DC 2015 Release.

You also need to download our case study file here (sample1). Its content is shown below.

The tables contain daily historical Microsoft and Facebook stock prices and volumes from the Nasdaq public website.

We need to manually extract a table’s content and export it to different formats such as CSV or TXT.

Step 1: Open the PDF file

In Adobe Acrobat Pro DC > File > Open

Step 2: Locate the table from which you want to extract data and drag a selection over the table as shown below

Step 3: Right click and select “Export Selection As…”

Step 4: Choose the export type

Adobe Acrobat Pro DC can export the selection to 8 different formats:

  • Word Document (*.docx)
  • Word 97-2003 Document (*.doc)
  • Excel Workbook (*.xlsx)
  • PowerPoint Presentation (*.pptx)
  • Rich Text Format (*.rtf)
  • XML Spreadsheet 2003 (*.xml)
  • HTML (*.html, *.htm)
  • Comma Separated Values (*.csv)

The exported CSV file looks like this:


"Date","Open","High","Low","Close / Last","Volume"
2017-01-04T00:00:00.000,62.48,62.75,62.12,62.3,21325140
2017-01-03T00:00:00.000,62.79,62.84,62.125,62.58,20655190
2016-12-30T00:00:00.000,62.96,62.99,62.03,62.14,25575720
2016-12-29T00:00:00.000,62.86,63.2,62.73,62.9,10248460
2016-12-28T00:00:00.000,63.4,63.4,62.83,62.99,14348340
2016-12-27T00:00:00.000,63.21,64.07,63.21,63.28,11743650
2016-12-23T00:00:00.000,63.45,63.54,62.8,63.24,12399540
2016-12-22T00:00:00.000,63.84,64.1,63.405,63.55,22175270
2016-12-21T00:00:00.000,63.43,63.7,63.12,63.54,17084370
2016-12-20T00:00:00.000,63.69,63.8,63.025,63.54,26017470
2016-12-19T00:00:00.000,62.56,63.77,62.42,63.62,34318500
2016-12-16T00:00:00.000,62.95,62.95,62.115,62.3,42452660

Adobe Acrobat Pro is one of the most powerful tools for manipulating PDF files. In short, you can do almost anything you want with a PDF file, apart from some limitations that we will see in section 4 (dealing with rich media content).

Extract data manually with Adobe Reader

Adobe Reader is a simple application for reading PDF files. It has some limitations compared to its counterpart, Adobe Acrobat Pro. However, you can do some basic things such as copying a table’s contents and pasting them into your favorite spreadsheet app.

Step 1: Open the file with Adobe Reader

Step 2: Select the table’s content by dragging any desired rows and columns

Step 3: Open your favorite spreadsheet app and paste the selection into it. We’re using LibreOffice Calc in this article.

As seen in the figure below, we have to define the column delimiter in order to display the content correctly.

Step 4: Click OK

Using our spreadsheet software, we can then export to many other formats. In our case, LibreOffice gives us 15 available formats.
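
When the pasted text has to be handled in code rather than in a spreadsheet dialog, the delimiter can be guessed before splitting. The following is a minimal sketch, not part of any tool mentioned above: the class name DelimiterGuess and its method are hypothetical. It picks the first candidate character that occurs the same non-zero number of times on every line.

```csharp
using System;
using System.Linq;

public static class DelimiterGuess
{
    // Returns the first candidate that appears the same non-zero number of
    // times on every non-empty line, or null when no candidate qualifies.
    public static char? Detect(string text, char[] candidates)
    {
        string[] lines = text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0) return null;

        foreach (char candidate in candidates)
        {
            int expected = lines[0].Count(ch => ch == candidate);
            if (expected > 0 && lines.All(line => line.Count(ch => ch == candidate) == expected))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

This heuristic is deliberately simple. A header such as "Close / Last" contains spaces, so a space-delimited paste would not give a consistent count and the method would return null. In that case the delimiter still has to be chosen by hand.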

Adobe Reader is not as flexible as Adobe Acrobat Pro: it has no export features. Its main uses are viewing, printing, and filling out PDF documents.

The two previous sections show two ways to extract data from tables manually. Both work well and are very useful for small workloads. The next section shows how to extract data from PDF tables programmatically. We will focus mainly on PDF Extractor SDK.

Extracting data from PDF tables using C#

Prerequisites

In order to run the following programs, you have to install PDF Extractor SDK. You can download it at https://bytescout.com/products/developer/pdfsdk/index.html

PDF Extractor SDK (https://bytescout.com/products/developer/pdfextractorsdk/index.html) is one of Bytescout’s products. It allows developers to extract data from PDF files and export it to other formats. It is important to know that it does this without any additional software, unlike the Adobe SDK, which requires Adobe Acrobat to be installed.

After installing PDF Extractor SDK, all the required DLLs can be found in the folder C:\Program Files\Bytescout PDF Extractor SDK

.NET Compatibility

PDF Extractor SDK supports the following .NET Frameworks:

  • .NET Framework 2.0
  • .NET Framework 3.5 / .NET Framework 3.5 Client Profile
  • .NET Framework 4.0 / .NET Framework 4.0 Client Profile

You then need to add a reference to the Bytescout.PDFExtractor.dll library in your project.

PDF Extractor SDK, how does it work?

Before any data extraction, we need to locate the target table among all the tables in the PDF document. This is done by the Bytescout.PDFExtractor.TableDetector object, which can loop over the tables in the document.

The program below shows how to locate the N-th table (targetTableNumber variable) on the P-th page (targetPageNumber variable) of the PDF document.

using Bytescout.PDFExtractor;

namespace Topic4.Sample1.TableDetectorSample
{
    class Program
    {
        static void Main(string[] args)
        {
            //The path of the PDF file
            var pdfFile = @"sample1.pdf";

            // this is the index of the page containing the targeted table
            int targetPageNumber = 1;

            // this is the value of the table we are looking for, 1 or 2 in the current example
            int targetTableNumber = 1;

            // Create an instance of a TableDetector
            TableDetector tdetector = new TableDetector();

            // License informations here
            tdetector.RegistrationKey = "demo";
            tdetector.RegistrationName = "demo";
          
            // Load the document file
            tdetector.LoadDocumentFromFile(pdfFile);

            // Count the number of pages in the actual PDF file
            int pageCount = tdetector.GetPageCount();            

            // Loop over document pages ...
            for (int i = 0; i< pageCount; i++)
            {

                //... we are only interested in the targetPageNumber-th page
                if (targetPageNumber != (i + 1)) continue;

                //... if tables are found in the current page...
                if (tdetector.FindTable(i))
                {
                    // zero-based counter of the tables visited so far on this page
                    int tableLoopNumber = 0;
                
                    //... loop over the tables in the current page ...
                    do
                    {
                        if (tableLoopNumber == (targetTableNumber-1))
                        {
                    
                            //... targetTableNumber-th table in the targetPageNumber-th page is here
                            break;
                        }

                        tableLoopNumber++;
                    }
                    while (tdetector.FindNextTable());
                }
            }
        }
    }
}

Filters

The TableDetector class offers some useful properties to filter the search:

  •  DetectionMinNumberOfColumns
  •  DetectionMinNumberOfRows

After locating the right table, we want to gather some data from it. This is done with an instance of an extractor class: CSVExtractor, TextExtractor, JSONExtractor, XLSExtractor, and so on.

Export PDF table to CSV format with C#

We need to export the first PDF table of our case study document to CSV format. The previous program is updated as follows:

using Bytescout.PDFExtractor;

namespace Topic4.Sample1.TableDetectorSample
{
    class Program
    {
        static void Main(string[] args)
        {
            //The path of the PDF file
            var pdfFile = @"sample1.pdf";
            
            // this is the index of the page containing the targeted table
            int targetPageNumber = 1;

            // this is the value of the table we are looking for, 1 or 2 in the current example
            int targetTableNumber = 1;

            //Create an instance of a TableDetector
            TableDetector tdetector = new TableDetector();

            //License informations here
            tdetector.RegistrationKey = "demo";
            tdetector.RegistrationName = "demo";

            //Add some filters. Only tables meeting those criterias are considered
            tdetector.DetectionMinNumberOfColumns = 1;
            tdetector.DetectionMinNumberOfRows = 10; 

            //Load the document file
            tdetector.LoadDocumentFromFile(pdfFile);

            //Count the number of pages in the actual PDF file
            int pageCount = tdetector.GetPageCount();

           

            //Loop over document pages ...
            for (int i = 0; i< pageCount; i++)
            {

                //... we are only interested in the targetPageNumber-th page
                if (targetPageNumber != (i + 1)) continue;

                //... if tables are found in the current page...
                if (tdetector.FindTable(i))
                {
                    int tableLoopNumber = 0;

                    //... loop over the tables in the current page ...
                    do
                    {
                        if (tableLoopNumber == (targetTableNumber-1))
                        {

                            //Process the targetTableNumber-th table in the targetPageNumber-th page
                            

                            //Create the csv extractor object, set license information
                            CSVExtractor extractor = new CSVExtractor();                            
                            extractor.RegistrationName = "demo";
                            extractor.RegistrationKey = "demo";

                            //set the csv separator symbol - default value is comma
                            extractor.CSVSeparatorSymbol = ";";                            

                            //Load the pdf file into the extractor object
                            extractor.LoadDocumentFromFile(pdfFile);

                            // set extraction area for CSV extractor to rectangle given by table detector
                            extractor.SetExtractionArea(
                                tdetector.GetFoundTableRectangle_Left(),
                                tdetector.GetFoundTableRectangle_Top(),
                                tdetector.GetFoundTableRectangle_Width(),
                                tdetector.GetFoundTableRectangle_Height()
                            );

                            //Three methods to extract the table's content as csv 
                           
                            //returns a string using GetCSV method
                            string result_csv1 = extractor.GetCSV();

                            //returns a string using GetCSVFromPage method, page index is given as input parameter
                            string result_csv2 = extractor.GetCSVFromPage(targetPageNumber - 1);

                            // or directly save to the file system; the first argument is the zero-based page index
                            extractor.SavePageCSVToFile(targetPageNumber - 1, "page-" + targetPageNumber + "-table-" + targetTableNumber + ".csv");

                            break;
                        }

                        tableLoopNumber++;
                    }
                    while (tdetector.FindNextTable());
                }
            }
        }
    }
}

Once the table is located, we create an extractor object and set the area from which we want to extract data. The final CSV looks like this:


"Date";"Open";"High";"Low";"Close / Last";"Volume";
"01/04/2017";"117.55";"119.66";"117.29";"118.69";"19,594,560";
"01/03/2017";"116.03";"117.84";"115.51";"116.86";"20,635,600";
"12/30/2016";"116.595";"116.83";"114.7739";"115.05";"18,668,290";
"12/29/2016";"117";"117.531";"116.06";"116.35";"9,925,082";
"12/28/2016";"118.19";"118.25";"116.65";"116.92";"11,985,740";
"12/27/2016";"116.96";"118.68";"116.864";"118.01";"12,034,590";
"12/23/2016";"117";"117.56";"116.3";"117.27";"10,885,030";
"12/22/2016";"118.86";"118.99";"116.93";"117.4";"16,226,770";
"12/21/2016";"118.92";"119.2";"118.48";"119.04";"10,747,610";
"12/20/2016";"119.5";"119.77";"118.8";"119.09";"13,673,570";
"12/19/2016";"119.85";"120.36";"118.51";"119.24";"15,871,360";
"12/16/2016";"120.9";"121.5";"119.27";"119.87";"25,316,220";

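The output above keeps every value as quoted text, with thousands separators in the volume column. To work with the numbers, the file has to be parsed back into typed values. The following is a minimal sketch that does not use the SDK: the names StockRow and SemicolonCsv are hypothetical, and it assumes the six-column layout and the MM/dd/yyyy date format shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class StockRow
{
    public DateTime Date;
    public decimal Open, High, Low, Close;
    public long Volume;
}

public static class SemicolonCsv
{
    // Splits a line such as "a";"b";"c"; into unquoted fields.
    // Limitation: a semicolon inside a quoted value is not supported.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        foreach (string part in line.Split(';'))
        {
            string trimmed = part.Trim();
            if (trimmed.Length == 0) continue; // the trailing ';' yields an empty part
            fields.Add(trimmed.Trim('"'));
        }
        return fields;
    }

    // Parses the whole CSV text, skipping the header line.
    public static List<StockRow> Parse(string csv)
    {
        var rows = new List<StockRow>();
        CultureInfo culture = CultureInfo.InvariantCulture;
        string[] lines = csv.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 1; i < lines.Length; i++)
        {
            List<string> f = SplitLine(lines[i]);
            if (f.Count < 6) continue; // ignore incomplete rows

            rows.Add(new StockRow
            {
                Date = DateTime.ParseExact(f[0], "MM/dd/yyyy", culture),
                Open = decimal.Parse(f[1], NumberStyles.Number, culture),
                High = decimal.Parse(f[2], NumberStyles.Number, culture),
                Low = decimal.Parse(f[3], NumberStyles.Number, culture),
                Close = decimal.Parse(f[4], NumberStyles.Number, culture),
                // "19,594,560" needs AllowThousands to parse as an integer
                Volume = long.Parse(f[5], NumberStyles.AllowThousands, culture)
            });
        }
        return rows;
    }
}
```

A typical call would be SemicolonCsv.Parse(System.IO.File.ReadAllText("page-1-table-1.csv")), where the file name follows the pattern used by the program above.
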
Extract PDF table column with C#

The next program shows how to extract a specific column from a given table.

The class Bytescout.PDFExtractor.TextExtractor is used to locate a specific text pattern in the PDF document. We then define the extraction area and finally save the column content to a text file.

The Bytescout.PDFExtractor.TextExtractor class is not limited to PDF files: it can also locate and extract text from PNG, JPEG, BMP, and TIFF files.

We also need to add the System.Drawing assembly to our project, because we’re using the RectangleF class.

using Bytescout.PDFExtractor;
using System.Drawing;

namespace Topic4.Sample1.TextExtractorSample
{
    class Program
    {
        static void Main(string[] args)
        {
            //The path of the PDF file
            var pdfFile = "sample1.pdf";

            // this is the index of the page containing the targeted table
            int targetPageNumber = 1;

            // this is the value of the table we are looking for, 1 or 2 in the current example
            int targetTableNumber = 1;

            //Create an instance of a TableDetector
            TableDetector tdetector = new TableDetector();

            //License informations here
            tdetector.RegistrationKey = "demo";
            tdetector.RegistrationName = "demo";

            //Add some filters. Only tables meeting those criterias are considered
            tdetector.DetectionMinNumberOfColumns = 1;
            tdetector.DetectionMinNumberOfRows = 10;

            //Load the document file
            tdetector.LoadDocumentFromFile(pdfFile);

            //Count the number of pages in the actual PDF file
            int pageCount = tdetector.GetPageCount();



            //Loop over document pages ...
            for (int i = 0; i < pageCount; i++)
            {

                //... we are only interested in the targetPageNumber-th page
                if (targetPageNumber != (i + 1)) continue;

                //... if tables are found in the current page...
                if (tdetector.FindTable(i))
                {
                    int tableLoopNumber = 0;

                    //... loop over the tables in the current page ...
                    do
                    {
                        if (tableLoopNumber == (targetTableNumber - 1))
                        {                            
                           
                            //use a text extractor in order to extract text from the pdf document
                            TextExtractor extractor = new TextExtractor();
                            extractor.RegistrationName = "demo";
                            extractor.RegistrationKey = "demo";

                            // Load sample PDF document
                            extractor.LoadDocumentFromFile(pdfFile);

                            // set to extract text column by column
                            extractor.ExtractColumnByColumn = true;                         

                            //Try to locate the "Close / Last" column header in the current page 
                            bool close_last_header_found = extractor.Find(targetPageNumber - 1, "Close / Last", true);

                            //if found
                            if (close_last_header_found)
                            {
                                // Retrieve the boundary of the text
                                RectangleF rectangle = extractor.FoundText.Bounds;

                                //Set the selection area of the extractor                               
                                extractor.SetExtractionArea(
                                    rectangle.Left
                                    , rectangle.Top
                                    , rectangle.Width
                                    , rectangle.Height * 20
                                    );
                                //Save to file
                                extractor.SaveTextToFile("output.txt");

                                //Dispose
                                extractor.Dispose();                              
                            }
                            break;
                        }

                        tableLoopNumber++;
                    }
                    while (tdetector.FindNextTable());
                }
            }
        }
    }
}

The content of the result file looks like:

Close / Last
62.3
62.58
62.14
62.9
62.99
63.28
63.24
63.55
63.54
63.54
63.62
62.3

More generally, the Bytescout.PDFExtractor.TextExtractor class uses a rectangular region called the extraction area. The extraction area is defined by four parameters:

  •  the left and top coordinates locate the top-left corner of the extraction area
  •  the width sets the width of the extraction area
  •  the height sets the height of the extraction area

Only text inside the extraction area is gathered during the extraction phase.
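
The geometry can be tried out without the SDK. The sketch below uses only System.Drawing.RectangleF; the class ExtractionArea and its methods are hypothetical helpers, not SDK members. It builds a column-shaped area from the bounds of a header cell, in the same way the program above multiplies the header height, and checks whether a piece of text falls inside it.

```csharp
using System.Drawing;

public static class ExtractionArea
{
    // Builds an area with the same left, top and width as the header cell,
    // tall enough for lineCount lines of the header's height (header included).
    public static RectangleF ColumnBelowHeader(RectangleF headerBounds, int lineCount)
    {
        return new RectangleF(
            headerBounds.Left,
            headerBounds.Top,
            headerBounds.Width,
            headerBounds.Height * lineCount);
    }

    // True when the text bounds lie entirely inside the extraction area.
    public static bool Covers(RectangleF area, RectangleF textBounds)
    {
        return area.Contains(textBounds);
    }
}
```

Real tables usually have some spacing between rows, so a line count that exactly matches the number of rows may cut off the last ones. That is presumably why the program above uses a generous factor of 20 for 12 data rows.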

Parsing PDF table cell by cell with C# PDF API

With PDF Extractor SDK, we can walk through a table’s cells with the Bytescout.PDFExtractor.StructuredExtractor class, much like enumerating a matrix. The following program shows how to do that:

using Bytescout.PDFExtractor;
using System;

namespace Topic4.Sample1.TableStructureSample
{
    class Program
    {
        static void Main(string[] args)
        {
            //The path of the PDF file
            var pdfFile = "sample1.pdf";
            
            // this is the index of the page containing the targeted table
            int targetPageNumber = 1;
           
            // Create Bytescout.PDFExtractor.StructuredExtractor instance (former TableExtractor)
            StructuredExtractor extractor = new StructuredExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile(pdfFile);

            //Define the extraction area
            extractor.SetExtractionArea(new System.Drawing.RectangleF(20,380,600,200));
            
            for (int ipage = 0; ipage <extractor.GetPageCount(); ipage++)
            {
                //In the current program, we only need the first page
                if ((ipage + 1) != targetPageNumber) continue;

                //Prepare the page structure
                extractor.PrepareStructure(ipage);

                //Count the actual table rows 
                int rowCount = extractor.GetRowCount(ipage);
                int CellsAlreadyScanned = 0;

                //Loop over the row count...
                for (int row = 0; row < rowCount; row++)
                {
                    // ...then loop over the column
                    int columnCount = extractor.GetColumnCount(ipage, row);

                    for (int col = 0; col< columnCount; col++)
                    {
                        //The text of the table cell[row,col]
                        var cellValue = extractor.GetCellValue(ipage, row, col);                        
                        Console.Write(string.Format("{0}\t", cellValue));
                    }
                    Console.WriteLine("\n");
                    CellsAlreadyScanned += columnCount;
                }
            }
            Console.WriteLine("Press any key..");
            Console.ReadKey();
        }
    }
}

The program output is

PDF table To JSON using C#

The following program shows how to extract data from a PDF table and save it as a JSON file using the Bytescout.PDFExtractor.JSONExtractor class. In addition to the actual cell content, we can also retrieve some metadata (such as font name, font size, font style, and position).

using Bytescout.PDFExtractor;

namespace Topic4.Sample1.JsonExtractorSample
{
    class Program
    {
        static void Main(string[] args)
        {
            //The path of the PDF file            
            var pdfFile = @"sample1.pdf";

            // this is the index of the page containing the targeted table
            int targetPageNumber = 1;

            // this is the value of the table we are looking for, 1 or 2 in the current example
            int targetTableNumber = 1;

            //Create an instance of a TableDetector
            TableDetector tdetector = new TableDetector();

            //License informations here
            tdetector.RegistrationKey = "demo";
            tdetector.RegistrationName = "demo";

            //Add some filters. Only tables meeting those criterias are considered
            tdetector.DetectionMinNumberOfColumns = 1;
            tdetector.DetectionMinNumberOfRows = 10;

            //Load the document file
            tdetector.LoadDocumentFromFile(pdfFile);

            //Count the number of pages in the actual PDF file
            int pageCount = tdetector.GetPageCount();



            //Loop over document pages ...
            for (int i = 0; i < pageCount; i++)
            {

                //... we are only interested in the targetPageNumber-th page
                if (targetPageNumber != (i + 1)) continue;

                //... if tables are found in the current page...
                if (tdetector.FindTable(i))
                {
                    int tableLoopNumber = 0;

                    //... loop over the tables in the current page ...
                    do
                    {
                        if (tableLoopNumber == (targetTableNumber - 1))
                        {
                            //Process the targetTableNumber-th table in the targetPageNumber-th page


                            //Create the JSON extractor object, set license information
                            JSONExtractor extractor = new JSONExtractor();
                            extractor.RegistrationName = "demo";
                            extractor.RegistrationKey = "demo";
                           
                            //Load the pdf file into the extractor object
                            extractor.LoadDocumentFromFile(pdfFile);

                            // set extraction area for JSON extractor to rectangle given by table detector
                            extractor.SetExtractionArea(
                                tdetector.GetFoundTableRectangle_Left(),
                                tdetector.GetFoundTableRectangle_Top(),
                                tdetector.GetFoundTableRectangle_Width(),
                                tdetector.GetFoundTableRectangle_Height()
                            );
                          
                            // save directly to the file system; the first argument is the zero-based page index
                            extractor.SaveJSONToFile(targetPageNumber - 1, "page-" + targetPageNumber + "-table-" + targetTableNumber + ".json");

                            break;
                        }

                        tableLoopNumber++;
                    }
                    while (tdetector.FindNextTable());
                }
            }
        }
    }
}

Part of the result looks like this:

            {
             "text": {
                "@fontName": "Times",
                "@fontSize": "12.0",
                "@x": "59",
                "@y": "105",
                "@width": "54",
                "@height": "12",
                "#text": "01/04/2017"
              }
            },
            {
              "text": {
                "@fontName": "Times",
                "@fontSize": "12.0",
                "@x": "193",
                "@y": "105",
                "@width": "27",
                "@height": "12",
                "#text": "62.48"
              }
            }
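
The position metadata makes it possible to rebuild table rows from the text items. The following is a minimal sketch under stated assumptions: it uses System.Text.Json (available in recent .NET versions, not in the .NET Framework versions listed earlier), the class name TableRows is hypothetical, and the input is assumed to be a flat JSON array of items shaped like the fragment above. The real output file may nest these items more deeply.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.Json;

public static class TableRows
{
    // Groups text items that share the same @y value into one row and
    // orders the cells of each row from left to right by @x.
    public static List<List<string>> FromTextItems(string jsonArray)
    {
        var cells = new List<Tuple<double, double, string>>();

        using (JsonDocument doc = JsonDocument.Parse(jsonArray))
        {
            foreach (JsonElement item in doc.RootElement.EnumerateArray())
            {
                JsonElement text = item.GetProperty("text");
                // Coordinates are stored as strings in the fragment above.
                double x = double.Parse(text.GetProperty("@x").GetString(), CultureInfo.InvariantCulture);
                double y = double.Parse(text.GetProperty("@y").GetString(), CultureInfo.InvariantCulture);
                cells.Add(Tuple.Create(y, x, text.GetProperty("#text").GetString()));
            }
        }

        return cells
            .GroupBy(c => c.Item1)      // same @y: same row
            .OrderBy(g => g.Key)        // rows by ascending @y
            .Select(g => g.OrderBy(c => c.Item2).Select(c => c.Item3).ToList())
            .ToList();
    }
}
```

Grouping on exactly equal @y values works for the fragment shown, where both items have @y 105. For documents whose baselines differ slightly within a row, a small tolerance would be needed instead.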

Extract PDF table to XML using C#

The process is the same as for exporting to JSON. Instead of the JSONExtractor class we use the XMLExtractor class, and to save the XML to the file system we call the XMLExtractor.SaveXMLToFile method.
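
A minimal sketch of that substitution is given here. It needs PDF Extractor SDK to be installed and referenced. Only the XMLExtractor class and the SaveXMLToFile method name come from the text above. The registration properties, LoadDocumentFromFile, Dispose, and the single-argument form of SaveXMLToFile are assumed to mirror the other extractors used in this article, and the namespace and output file name are made up for the example.

```csharp
using Bytescout.PDFExtractor;

namespace Topic4.Sample1.XmlExtractorSample
{
    class Program
    {
        static void Main(string[] args)
        {
            // The path of the PDF file
            var pdfFile = @"sample1.pdf";

            // Create the XML extractor object, set license information
            // (assumed to work like JSONExtractor in the previous program)
            XMLExtractor extractor = new XMLExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load the PDF file into the extractor object
            extractor.LoadDocumentFromFile(pdfFile);

            // Save the extracted content as XML
            // (single-argument overload assumed, by analogy with SaveJSONToFile)
            extractor.SaveXMLToFile("sample1.xml");

            extractor.Dispose();
        }
    }
}
```

To restrict the export to one table, the TableDetector loop and the SetExtractionArea call from the JSON program could be reused, assuming XMLExtractor exposes the same method.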

Extract data from scanned documents / OCR

The rest of this article is about extracting data from scanned documents and OCR capabilities. We’ll see how to extract data with Adobe Acrobat DC and we’ll also see how to handle the data extraction process using C# and PDF Extractor SDK.

You can find below five scanned files that we’re going to use.

scan_sample1_600dpi_normal.pdf: document scanned at 600 dpi

scan_sample1_600dpi_handwritingnote.pdf: document scanned at 600 dpi with a handwritten note at the bottom of the page

scan_sample1_70dpi_handwritingnote.pdf: document scanned at 70 dpi with a handwritten note

scan_sample1_600dpi_badqualityprinting.pdf: printed with very poor quality and scanned at 600 dpi

scan_sample1_600dpi_badorientation.pdf: scanned at 600 dpi with bad orientation during the scan process

Extract data with Adobe Acrobat DC

When we open the scanned PDF file with Adobe Acrobat DC, it automatically tries to convert the page to editable content and displays the message below.

Once the pattern recognition is done, each cell of our table becomes editable. More generally, Adobe Acrobat DC has a powerful built-in OCR engine that automatically detects characters and text in the scanned page.

We can then export the modified document to many other formats.

Go to File > Export To >

You can download here the “Text (Plain)” version of our document

Extract data from scanned document with poor quality of printing and handwriting note

The corresponding demo file is scan_sample1_70dpi_handwritingnote.pdf. This file was scanned at a low resolution (70 dpi) and has a handwritten note at the bottom of the table.

Despite the low accuracy of OCR at 70 dpi, most of the data was reconstructed correctly. However, Acrobat had trouble detecting all the cell borders (the result file is available here), and pattern recognition failed on the handwritten note.

The same process applied to the same document scanned at 600 dpi is very accurate. The extraction performed well and all of the cell data was gathered successfully (here).

The file scan_sample1_600dpi_badqualityprinting.pdf (scanned at 600 dpi with poor ink quality) failed the extraction process completely. Adobe Acrobat DC did not recognize any patterns and treated it as a blank page.

The last file we try to extract is scan_sample1_600dpi_badorientation.pdf. What is particular about this file is that the page was badly oriented during the scan.

However, Adobe Acrobat automatically adjusts the page orientation during pattern recognition, as seen in the figure below.

The data extraction result is available here. We can see that almost all of the data was retrieved correctly.

Extract data with OCR from scanned documents using C# and PDF Extractor SDK

The following program extracts data from the PDF document scanned at 600 dpi under normal conditions:

using Bytescout.PDFExtractor;

namespace OCRExample
{
    class Program
    {
        static void Main(string[] args)
        {
            var pdfFile = @"scan_sample1_600dpi_normal.pdf";
            
            // Create Bytescout.PDFExtractor.JSONExtractor instance
            JSONExtractor extractor = new JSONExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            //Load the pdf document file
            extractor.LoadDocumentFromFile(pdfFile);

            //We are only interested in text
            extractor.OCRMode = OCRMode.TextFromImagesAndFonts;
            
            //Define the language data folder (verbatim string, so backslashes are not doubled)
            extractor.OCRLanguageDataFolder = @"C:\Program Files\Bytescout PDF Extractor SDK\net4.00\tessdata\";
            
            //The actual language is eng for english
            extractor.OCRLanguage = "eng";

            //We set the resolution to 600dpi
            extractor.OCRResolution = 600;

            //Save the file
            extractor.SaveJSONToFile("OCR_JSON_scan_sample1_600dpi_normal.json");
            
        }
    }
}

For each recognized pattern, the OCR engine adds a property named @OCRConfidence that indicates how good the recognition was: the higher the value, the more accurate the result. The OCR engine also returns the predicted font name, the font size, the coordinates of the data, its width and height in the PDF document, and of course the text value.

A partial output is shown below:

{
              "text": {
                "@fontName": "Times",
                "@fontSize": "30.0",
                "@OCRConfidence": "66.82",
                "@x": "184",
                "@y": "542",
                "@width": "32",
                "@height": "29",
                "#text": "115.921"
              }
            },
            {
              "text": {
                "@fontName": "Times",
                "@fontSize": "8.0",
                "@OCRConfidence": "86.21",
                "@x": "268",
                "@y": "555",
                "@width": "168",
                "@height": "8",
                "#text": "1192 ."
              }
            }
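
The confidence value can be used to flag text that deserves a manual check. The following is a minimal sketch, assuming System.Text.Json (available in recent .NET versions) and a hypothetical class name OcrConfidence. It walks the whole JSON tree, so it does not depend on how deeply the text items are nested, and it collects the text of every item whose @OCRConfidence is below a threshold chosen by the caller.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

public static class OcrConfidence
{
    // Returns the "#text" of every object whose "@OCRConfidence" is below threshold.
    public static List<string> BelowThreshold(string json, double threshold)
    {
        var result = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            Visit(doc.RootElement, threshold, result);
        }
        return result;
    }

    // Depth-first walk over objects and arrays.
    private static void Visit(JsonElement node, double threshold, List<string> result)
    {
        if (node.ValueKind == JsonValueKind.Object)
        {
            JsonElement confidence, text;
            if (node.TryGetProperty("@OCRConfidence", out confidence) &&
                node.TryGetProperty("#text", out text))
            {
                // The confidence is stored as a string in the output above.
                double value = double.Parse(confidence.GetString(), CultureInfo.InvariantCulture);
                if (value < threshold) result.Add(text.GetString());
            }
            foreach (JsonProperty property in node.EnumerateObject())
            {
                Visit(property.Value, threshold, result);
            }
        }
        else if (node.ValueKind == JsonValueKind.Array)
        {
            foreach (JsonElement child in node.EnumerateArray())
            {
                Visit(child, threshold, result);
            }
        }
    }
}
```

The threshold is a judgment call. Note that in the partial output above, the item with the higher confidence ("1192 .") still looks wrong, so a high value alone does not guarantee a correct reading.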

The OCR engine uses a set of language files (located at C:\Program Files\Bytescout PDF Extractor SDK\net4.00\tessdata\). The default installation contains four languages: English, German, French, and Spanish. The property JSONExtractor.OCRLanguageDataFolder points to this folder, and the property JSONExtractor.OCRLanguage is set to the actual language of the document. OCR is active when the property JSONExtractor.OCRMode is anything other than OCRMode.Off.

We can also add preprocessing filters to the OCRImagePreprocessingFilters property of the extractor object to help the OCR engine recognize patterns better. The following methods can be combined: AddContrast(), AddDeskew(), AddDilate(), AddGammaCorrection(), AddHorizontalLinesRemover(), AddMedian(), AddVerticalLinesRemover(). Depending on the case, processing time may vary from a few seconds to one minute or more per page.

Extract rich media contents with Adobe Acrobat DC

In Adobe Acrobat terminology, rich media means audio and video content: a 3D animation, an audio file, a Flash SWF animation or a video file in an H.264-compliant format. At the time of writing, Adobe Acrobat does not support extracting rich media content. This is one of its major shortcomings, and you need a third-party tool to do it. Fortunately, PDF Extractor SDK can do the job with only a few lines of code, as the next section shows.

Extract rich media contents from PDF with PDF Extractor SDK and C#

PDF content is not limited to text. Most PDF documents contain pictures or attached documents, and even more complex objects such as audio or video files may be embedded in the document.
The following examples show how to extract such objects using PDF Extractor SDK.
The basic steps are:
1- Create the extractor object. Its type depends on the kind of object we want to extract.
2- Locate the object in the document. We can loop over the existing objects to find the index of the targeted object.
3- Call the appropriate method of the extractor object created in step 1 to extract the data. The extractor object also exposes some useful information about the data, such as the file type and the data size.
4- Save the extracted data to the file system.
For more details about supported rich media content, please visit the official Adobe Acrobat help page https://helpx.adobe.com/acrobat/using/rich-media.html

Extract audio file mp3 from PDF document with PDF extractor SDK and C#

The Bytescout.PDFExtractor.MultimediaExtractor class is the most suitable component for extracting an embedded audio file from a PDF document.
The file Pdf_with_mp3.pdf contains an embedded MP3 audio object. We can extract the audio file using the following lines of code:

            MultimediaExtractor extractor = new MultimediaExtractor();
            extractor.LoadDocumentFromFile(@"Pdf_with_mp3.pdf");
            bool audioFound = extractor.GetFirstAudio();
            if (audioFound)
            {
                string audiofileName = "eldorado" + extractor.GetCurrentAudioExtension();
                extractor.SaveCurrentAudioToFile(audiofileName);
            }

Note: The embedded file originally had the “.mp3” extension; however, the method MultimediaExtractor.GetCurrentAudioExtension() returns the “.mpa” extension.

Some other useful methods are MultimediaExtractor.GetCurrentAudioBytesSize(), which returns the actual file size, MultimediaExtractor.GetDocumentAudioCount(), which returns the number of embedded audio files in the document, and MultimediaExtractor.GetNextAudio(), which moves to the next audio file.
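
The GetFirstAudio() and GetNextAudio() methods suggest a simple "first, then next" loop for visiting every embedded audio file. The sketch below shows that loop written against plain delegates, so it runs without the SDK. The helper name CursorWalk.Indices is hypothetical, and the sketch assumes that GetNextAudio() returns false once no further file exists, in the same way GetFirstAudio() returns false when nothing is found.

```csharp
using System;
using System.Collections.Generic;

public static class CursorWalk
{
    // Yields 0, 1, 2, ... once per item reachable through a first/next cursor.
    public static IEnumerable<int> Indices(Func<bool> moveFirst, Func<bool> moveNext)
    {
        if (!moveFirst()) yield break;   // the document contains no item at all
        int index = 0;
        do
        {
            yield return index++;
        } while (moveNext());            // stops when the cursor reports no further item
    }

    public static void Main()
    {
        // Stand-in for a document with three embedded audio files (hypothetical data).
        int remaining = 3;
        Func<bool> first = () => remaining-- > 0;
        Func<bool> next = () => remaining-- > 0;

        foreach (int i in Indices(first, next))
            Console.WriteLine("audio" + i);   // prints audio0, audio1, audio2
    }
}
```

With the SDK, the same helper could be called as CursorWalk.Indices(extractor.GetFirstAudio, extractor.GetNextAudio), saving each file inside the loop with extractor.SaveCurrentAudioToFile("audio" + i + extractor.GetCurrentAudioExtension()).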

Extract video file from PDF document with PDF extractor SDK and C#

You can extract any embedded video file using the following steps.
1- Create an instance of the Bytescout.PDFExtractor.MultimediaExtractor class to grab the video file.
2- Save the file to disk using the method MultimediaExtractor.SaveCurrentVideoToFile.

The following program shows how to extract the embedded video file (it must be H.264 compliant, otherwise the extraction will fail) from Pdf_with_video.pdf and save it to the file system.

            MultimediaExtractor extractor = new MultimediaExtractor();
            extractor.LoadDocumentFromFile(@"Pdf_with_video.pdf");
            bool videoFound = extractor.GetFirstVideo();
            if (videoFound)
            {
                string videofileName = "abc" + extractor.GetCurrentVideoExtension();
                extractor.SaveCurrentVideoToFile(videofileName);
            }

The method MultimediaExtractor.GetFirstVideo() locates the targeted video. Some other useful methods are:
– MultimediaExtractor.GetDocumentVideoCount() to get the total number of video objects in the current file.
– MultimediaExtractor.GetNextVideo() to move to the next video file.

Extract images from PDF file using Adobe Acrobat DC

Adobe Acrobat DC can extract the images embedded in a PDF document.

The following steps show you how to do that.

Step 1: Open the document in Adobe Acrobat DC

Step 2: Tools > Export PDF

Step 3: The last screen allows you to configure the image type

Extract images from PDF file using C#

PDF Extractor SDK can extract any image embedded in a PDF document. It fully supports GIF, TIFF, JPG… formats. You can achieve that in four steps:
1- Create an instance of the Bytescout.PDFExtractor.ImageExtractor class
2- Load the PDF document with the method Bytescout.PDFExtractor.ImageExtractor.LoadDocumentFromFile
3- Locate the image in the page
4- Save the image with the method Bytescout.PDFExtractor.ImageExtractor.SaveCurrentImageToFile

Note: We need to reference the System.Drawing assembly and import the System.Drawing.Imaging namespace in order to use the ImageFormat class.

You can download the document here

            ImageExtractor extractor = new ImageExtractor();
            extractor.LoadDocumentFromFile(@"Pdf_with_image.pdf");
            var firstImageFound = extractor.GetFirstImage();
            if (firstImageFound)
            {
                extractor.SaveCurrentImageToFile("firstimage.png", ImageFormat.Png);
            }

One interesting feature is the ability to choose the export format regardless of the initial format of the image. Depending on your needs, you can pass Png, Jpeg, Icon, Gif, Bmp or Exif here; they are all handled well by PDF Extractor SDK.

Extract embedded documents in PDF file

PDF Extractor SDK can extract any document embedded in a PDF file. This process involves four steps:
1- Create an instance of the Bytescout.PDFExtractor.AttachmentExtractor class
2- Load the PDF file using the Bytescout.PDFExtractor.AttachmentExtractor.LoadDocumentFromFile method
3- Locate the file with Bytescout.PDFExtractor.AttachmentExtractor.GetFileName, passing the file’s index as the input parameter.
4- Call the Save method to write the file to disk.

            AttachmentExtractor extractor = new AttachmentExtractor();
            extractor.LoadDocumentFromFile(@"Pdf_with_files.pdf");
            extractor.Save(0, extractor.GetFileName(0));
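
The name returned by GetFileName comes from the PDF itself, so when the document is not from a trusted source it may be worth cleaning the name before passing it to Save. The following is a minimal sketch of such a cleanup using only System.IO. The class AttachmentNames, the method SafeOutputPath and the fallback name are hypothetical and not part of the SDK.

```csharp
using System;
using System.IO;

public static class AttachmentNames
{
    // Builds an output path inside outputDir from a name stored in the PDF,
    // dropping any directory part so the file cannot be written elsewhere.
    public static string SafeOutputPath(string outputDir, string embeddedName, int index)
    {
        // Treat both '\' and '/' as directory separators, whatever the OS.
        string name = Path.GetFileName(embeddedName.Replace('\\', '/'));

        // Fall back to a generated name when nothing usable is left.
        if (string.IsNullOrWhiteSpace(name) || name == "." || name == "..")
            name = "attachment_" + index;   // hypothetical fallback name

        return Path.Combine(outputDir, name);
    }

    public static void Main()
    {
        // The directory part of the embedded name is discarded.
        Console.WriteLine(SafeOutputPath("out", "../../evil.txt", 0));
        Console.WriteLine(SafeOutputPath("out", "report.docx", 1));
    }
}
```

With the SDK, the call shown earlier would then become extractor.Save(0, AttachmentNames.SafeOutputPath("out", extractor.GetFileName(0), 0)), assuming the folder "out" already exists.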

Throughout this article, we’ve seen several ways to extract data from PDF documents. If you want to do it manually, Adobe Acrobat DC is definitely the best choice. However, this product is not free and you have to pay for a commercial license. An alternative is Adobe Reader, but it has some limitations. For automated extraction, we’ve seen that PDF Extractor SDK is a simple, complete and reliable tool for extracting PDF data. It supports many commonly used formats (XML, CSV, JSON, HTML, text and so on). For more complex OCR tasks, Adobe Acrobat is very reliable software: its pattern recognition error rate is quite low and it also supports many export formats. PDF Extractor SDK offers a powerful OCR engine, with many features available to developers to optimize the character recognition process. PDF Extractor SDK is also well suited when your business has to deal with rich media content. We’ve seen that it supports extracting audio and video from PDF files and is compatible with any CLR-compliant programming language, such as C# or VB.NET.