The Awesome ByteScout PDF Extractor Tools (Part 2)


In Part 1 of this multi-part tutorial on my experience as a developer with the ByteScout PDF Extractor SDK, I covered several easy but sophisticated tools and showed how to extract images from a PDF online, as well as how to extract one or more pages from a PDF.

Now, in Part 2, I want to delve into the nuts-and-bolts functions and show how to merge and split PDFs and how to find and remove text, among other useful methods. Before we do that, let’s look at how to extract a PDF to JSON, one of the most popular data formats online today, and one that makes it easy to scrape data from PDF files. Note how little code this takes in the following VBScript sample:

' Create Bytescout.PDFExtractor.JSONExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.JSONExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
 
' Load sample PDF document
extractor.LoadDocumentFromFile "../../sample3.pdf"
 
extractor.SaveJSONToFile "output.json"
 
WScript.Echo "Extracted data saved to 'output.json' file."

Extracting JSON with C#

I want to demonstrate that ByteScout is developer friendly and offers the same functionality whether you enjoy coding with VBScript, Visual Basic, C#, or any other .NET language. In the next code sample, we will extract PDF data to JSON in C#. The operative class in this code is Bytescout.PDFExtractor.JSONExtractor, and to extend the example above, let’s go ahead and extract a PDF with images to JSON as well, first with the images saved to external files and then with the images embedded as Base64 strings. Have a look at this code:

using System;
using Bytescout.PDFExtractor;
 
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.JSONExtractor instance
            JSONExtractor extractor = new JSONExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";
 
            // Load sample PDF document
            extractor.LoadDocumentFromFile("sample1.pdf");
 
            // Uncomment this line to get rid of empty nodes in JSON
            //extractor.PreserveFormattingOnTextExtraction = false;
 
            // Set output image format
            extractor.ImageFormat = OutputImageFormat.PNG;
             
            // Save images to external files
            extractor.SaveImages = ImageHandling.OuterFile;
            extractor.ImageFolder = "images"; // Folder for external images
            extractor.SaveJSONToFile("result_with_external_images.json");
 
            // Embed images into JSON as Base64 encoded string
            extractor.SaveImages = ImageHandling.Embed;
            extractor.SaveJSONToFile("result_with_embedded_images.json");
        }
    }
}

Parallel Processing Tasks With PDF Extractor

C# has solid built-in threading support, so naturally I want to expand this tutorial and illustrate how the Extractor tools can be run on multiple threads. The sample below uses the System.Threading namespace (Parallel_Processing is simply the namespace of the sample project), and a semaphore caps how many documents are processed at once. All of the methods I have covered so far in Parts 1 and 2 can be threaded to run in parallel to optimize processing speed when there are multiple PDF files to handle. Have a look at this basic intro to multithreaded coding with ByteScout:

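using System;
using System.IO;
using System.Threading;
using Bytescout.PDFExtractor;
namespace Parallel_Processing
{
    class Program
    {
        // Limit to 4 threads in queue.
        // Set this value to number of your processor cores for max performance.
        private static readonly Semaphore ThreadLimiter = new Semaphore(4, 4);
        static void Main(string[] args)
        {
            // Get all PDF files in a folder
            string[] files = Directory.GetFiles(@"..\..\..\..\", "*.pdf");
            // Array of events to wait
            ManualResetEvent[] doneEvents = new ManualResetEvent[files.Length];
            // The original listing breaks off here. The rest is a sketch of one way to finish it:
            // extract each PDF to a .txt file on a pool thread, at most 4 at a time.
            for (int i = 0; i < files.Length; i++)
            {
                ThreadLimiter.WaitOne(); // wait for a free slot
                string file = files[i];
                ManualResetEvent done = doneEvents[i] = new ManualResetEvent(false);
                ThreadPool.QueueUserWorkItem(state =>
                {
                    TextExtractor extractor = new TextExtractor { RegistrationName = "demo", RegistrationKey = "demo" };
                    try
                    {
                        extractor.LoadDocumentFromFile(file);
                        extractor.SaveTextToFile(Path.ChangeExtension(file, ".txt"));
                    }
                    finally { extractor.Dispose(); ThreadLimiter.Release(); done.Set(); }
                });
            }
            foreach (ManualResetEvent doneEvent in doneEvents) doneEvent.WaitOne(); // wait for all jobs
        }
    }
}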
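
To make the throttling idea concrete outside of .NET, here is a minimal sketch in standard C++ of the same pattern: a counting semaphore that lets at most a fixed number of jobs run at once. It does not call the SDK. The names Semaphore and processAll and the work callback are my own illustrations, and unlike the C# sample it starts one thread per file rather than using a thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Minimal counting semaphore (std::counting_semaphore would need C++20).
class Semaphore {
public:
    explicit Semaphore(int slots) : slots_(slots) { assert(slots > 0); }
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return slots_ > 0; });
        --slots_;
    }
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++slots_;
        }
        ready_.notify_one();
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    int slots_;
};

// Calls work(file) once per file with at most maxParallel calls in flight.
// Returns the highest number of jobs observed running at the same time.
// Sketch only: work is assumed not to throw.
template <typename Work>
int processAll(const std::vector<std::string>& files, int maxParallel, Work work) {
    Semaphore limiter(maxParallel);
    std::mutex statsMutex;
    int running = 0;
    int peak = 0;
    std::vector<std::thread> threads;
    for (const std::string& file : files) {
        limiter.acquire();  // block until a slot is free
        threads.emplace_back([&, file] {
            {
                std::lock_guard<std::mutex> lock(statsMutex);
                peak = std::max(peak, ++running);
            }
            work(file);  // e.g. extract one PDF here
            {
                std::lock_guard<std::mutex> lock(statsMutex);
                --running;
            }
            limiter.release();  // hand the slot to the next job
        });
    }
    for (std::thread& t : threads) t.join();
    return peak;
}
```

Calling processAll(files, 4, work) mirrors the Semaphore(4, 4) in the C# listing: the loop blocks in acquire() whenever four jobs are already running, so the returned peak never exceeds four.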

Nuts & Bolts of Merging PDF Documents

An important item in the Extractor toolbox is the ability to merge two or more documents into a single PDF. This tool has been especially useful to me in batch PDF processing, where a lot of tools are needed. This example switches to C++ and calls the SDK through its COM interface, but as previously noted, the same calls are available from any of the .NET languages. In the code sample below I first include the standard headers and import the PDF Extractor type library. Then we create the interface pointer, set the registration name and key, and set the paths and filenames to merge. The sample also extracts and splits pages first, so the merge step has a page to work with. Have a look at this code sample:

#include "stdafx.h"
#include "comip.h"

// you may also refer to the tlb from \net4.00\ folder
// you may also want to include the tlb file into the project so you could compile it and use intellisense for it
#import "c:\\Program Files\\Bytescout PDF Extractor SDK\\net2.00\\Bytescout.PDFExtractor.tlb" raw_interfaces_only

using namespace Bytescout_PDFExtractor;

int _tmain(int argc, _TCHAR* argv[])
{
// Initialize COM.
HRESULT hr = CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);

// Create the interface pointer.
_DocumentSplitterPtr pIDocumentSplitter(__uuidof(DocumentSplitter));

// Set the registration name and key
// Note: You should use _bstr_t or BSTR to pass string to the library because of COM requirements
_bstr_t bstrRegName(L"DEMO");
pIDocumentSplitter->put_RegistrationName(bstrRegName);

_bstr_t bstrRegKey(L"DEMO");
pIDocumentSplitter->put_RegistrationKey(bstrRegKey);

// you may enable optimization for extracted pages from documents
// pIDocumentSplitter->put_OptimizeSplittedDocuments(VARIANT_TRUE);

// Load sample PDF document
HRESULT sRes = S_OK;
//1. extract selected pages (!note: page numbers are 1-based)
_bstr_t bstrPath(L"..\\..\\sample2.pdf");
_bstr_t bstrParam(L"page2.pdf");
sRes = pIDocumentSplitter->ExtractPage(bstrPath, bstrParam, 2);

// 2. split the doc into 2 parts at page #2
// (!) Note: page numbers are 1-based
_bstr_t bstrPathInput(L"..\\..\\sample2.pdf");
_bstr_t bstrParam1(L"part1.pdf");
_bstr_t bstrParam2(L"part2.pdf");
sRes = pIDocumentSplitter->Split(bstrPathInput, bstrParam1, bstrParam2, 2);

// 3. merge page 2 extracted on step 1 along with base pdf
// Create the interface pointer.
_DocumentMergerPtr pIDocumentMerger(__uuidof(DocumentMerger));
//_bstr_t bstrRegName(L"DEMO");
pIDocumentMerger->put_RegistrationName(bstrRegName);
//_bstr_t bstrRegKey(L"DEMO");
pIDocumentMerger->put_RegistrationKey(bstrRegKey);

// merge 2 files into the 3rd one
_bstr_t bstrParamMerge1(L"page2.pdf");
_bstr_t bstrParamMerge2(L"..\\..\\sample2.pdf");
_bstr_t bstrParamMergeOutput(L"merged.pdf");

sRes = pIDocumentMerger->Merge2(bstrParamMerge1, bstrParamMerge2,bstrParamMergeOutput);

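// finally release both instances
// (use the smart pointers' own Release(); calling ->Release() would release
// each object a second time when the pointer goes out of scope)
pIDocumentSplitter.Release();
pIDocumentMerger.Release();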

// uninitialize ActiveX COM support
CoUninitialize();

return 0;
}

Getting User Supplied Form Field Data From PDFs

A handy capability in the PDF Extractor toolkit, which I have leveraged many times, is extracting the data users have entered into PDF form fields. In a web app where the PDF is provided with designated form fields, it is essential to think of these fields as user inputs. In this example, I am using a PDF form as an input method for collecting data from users, and then using PDF Extractor to retrieve the entered data. The sample exports the document as XML, selects every control node, and traces the filled text boxes and checked checkboxes. The following C# code sample illustrates how to quickly develop a function to extract the form data from a PDF:

using System;
using System.Diagnostics;
using System.Xml;
using Bytescout.PDFExtractor;
 
namespace ExtractFilledFormData
{
    static class Program
    {
        static void Main()
        {
            // Create XMLExtractor instance
            XMLExtractor extractor = new XMLExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";
 
            // Load sample PDF document
            extractor.LoadDocumentFromFile(@".\interactiveform.pdf");
 
            // Get PDF document text as XML
            string xmlText = extractor.GetXML();
 
            // Load XML
            XmlDocument xmlDocument = new XmlDocument();
            xmlDocument.LoadXml(xmlText);
 
            // Select all "control" nodes
            XmlNodeList formControls = xmlDocument.SelectNodes("//control");
            if (formControls != null)
            {
                foreach (XmlNode formControl in formControls)
                {
                    XmlAttribute typeAttribute = formControl.Attributes["type"];
 
                    // Trace filled textboxes
                    if (typeAttribute.Value == "editbox")
                    {
                        if (!String.IsNullOrEmpty(formControl.InnerText))
                            Trace.WriteLine("EDITBOX " + formControl.Attributes["id"].Value + ": " + formControl.InnerText);
                    }
                    // Trace checked checkboxes
                    else if (typeAttribute.Value == "checkbox")
                    {
                        if (formControl.Attributes["state"].Value == "1")
                            Trace.WriteLine("CHECKBOX " + formControl.Attributes["id"].Value + ": " + formControl.Attributes["state"].Value);
 
                    }
                }
            }
        }
    }
}

Comparing Two Documents

Another crucial capability provided in the ByteScout PDF Extractor SDK is comparing two PDF documents and generating a report in HTML format. After initializing COM, in the C++ code sample below, we load the two files to compare, comparison1.pdf and comparison2.pdf, into separate TextExtractor objects. Once the registration name and key are set, the TextComparer compares the two documents, and GenerateHtmlReport_2 writes a report detailing the differences to report.html. Paste this code sample into your editor to try it for yourself:

#include "stdafx.h"
#include "comip.h"
 
#import "c:\\Program Files\\Bytescout PDF Extractor SDK\\net4.00\\Bytescout.PDFExtractor.tlb" raw_interfaces_only
 
using namespace Bytescout_PDFExtractor;
 
int _tmain(int argc, _TCHAR* argv[])
{
    // Initialize COM.
    HRESULT hr = CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);
 
    // Load first document
    _TextExtractorPtr document1(__uuidof(TextExtractor));
    document1->put_RegistrationName(_bstr_t(L"DEMO"));
    document1->put_RegistrationKey(_bstr_t(L"DEMO"));
    document1->LoadDocumentFromFile(_bstr_t(L"..\\..\\comparison1.pdf"));
 
    // Load second  document
    _TextExtractorPtr document2(__uuidof(TextExtractor));
    document2->put_RegistrationName(_bstr_t(L"DEMO"));
    document2->put_RegistrationKey(_bstr_t(L"DEMO"));
    document2->LoadDocumentFromFile(_bstr_t(L"..\\..\\comparison2.pdf"));
 
    // Compare documents
    _TextComparerPtr comparer(__uuidof(TextComparer));
    comparer->put_RegistrationName(_bstr_t(L"DEMO"));
    comparer->put_RegistrationKey(_bstr_t(L"DEMO"));
    DECIMAL result;
    comparer->Compare((_BaseTextExtractorPtr) document1, (_BaseTextExtractorPtr) document2, &result);
 
    // Generate report
    VARIANT_BOOL ok;
    comparer->GenerateHtmlReport_2(_bstr_t(L"report.html"), &ok);
 
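    // Release via the smart pointers' own Release(); calling ->Release() would
    // release each object a second time when the pointer goes out of scope
    document1.Release();
    document2.Release();
    comparer.Release();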
 
    CoUninitialize();
 
    return 0;
}

Nuts & Bolts – Splitting a PDF into Two Parts

Nitty-gritty operations on PDFs are a breeze with the ByteScout PDF Extractor SDK. The goal in this example is to divide one PDF document into two. I have this example set up to split the document at page two, so that the second document starts with the contents of page two. After importing the Bytescout_PDFExtractor type library, we just need to name the two output files, part1.pdf and part2.pdf. This example uses C++ with COM (it is the same sample as in the merging section above, where the Split call is step 2), but you can also use the equivalent calls from ASP or VB. Have a look at this code snippet:

#include "stdafx.h"
#include "comip.h"
 
// you may also refer to the tlb from \net4.00\ folder
// you may also want to include the tlb file into the project so you could compile it and use intellisense for it
#import "c:\\Program Files\\Bytescout PDF Extractor SDK\\net2.00\\Bytescout.PDFExtractor.tlb" raw_interfaces_only
 
using namespace Bytescout_PDFExtractor;
 
int _tmain(int argc, _TCHAR* argv[])
{
    // Initialize COM.
    HRESULT hr = CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);
 
    // Create the interface pointer.
    _DocumentSplitterPtr pIDocumentSplitter(__uuidof(DocumentSplitter));
 
    // Set the registration name and key
    // Note: You should use _bstr_t or BSTR to pass string to the library because of COM requirements
    _bstr_t bstrRegName(L"DEMO");
    pIDocumentSplitter->put_RegistrationName(bstrRegName);
     
    _bstr_t bstrRegKey(L"DEMO");
    pIDocumentSplitter->put_RegistrationKey(bstrRegKey);
 
    // you may enable optimization for extracted pages from documents
    // pIDocumentSplitter->put_OptimizeSplittedDocuments(VARIANT_TRUE);
 
    // Load sample PDF document
    HRESULT sRes = S_OK;
    //1. extract selected pages (!note: page numbers are 1-based)
    _bstr_t bstrPath(L"..\\..\\sample2.pdf");
    _bstr_t bstrParam(L"page2.pdf");
    sRes = pIDocumentSplitter->ExtractPage(bstrPath, bstrParam, 2);
 
    // 2. split the doc into 2 parts at page #2
    // (!) Note: page numbers are 1-based
    _bstr_t bstrPathInput(L"..\\..\\sample2.pdf");
    _bstr_t bstrParam1(L"part1.pdf");
    _bstr_t bstrParam2(L"part2.pdf");
    sRes = pIDocumentSplitter->Split(bstrPathInput, bstrParam1, bstrParam2, 2);
 
    // 3. merge page 2 extracted on step 1 along with base pdf
    // Create the interface pointer.
    _DocumentMergerPtr pIDocumentMerger(__uuidof(DocumentMerger));
    //_bstr_t bstrRegName(L"DEMO");
    pIDocumentMerger->put_RegistrationName(bstrRegName); 	
    //_bstr_t bstrRegKey(L"DEMO");
    pIDocumentMerger->put_RegistrationKey(bstrRegKey);
 
    // merge 2 files into the 3rd one
    _bstr_t bstrParamMerge1(L"page2.pdf");
    _bstr_t bstrParamMerge2(L"..\\..\\sample2.pdf");
    _bstr_t bstrParamMergeOutput(L"merged.pdf");
 
    sRes = pIDocumentMerger->Merge2(bstrParamMerge1, bstrParamMerge2,bstrParamMergeOutput);
 
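    // finally release both instances
    // (use the smart pointers' own Release(); calling ->Release() would release
    // each object a second time when the pointer goes out of scope)
    pIDocumentSplitter.Release();
    pIDocumentMerger.Release();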
 
    // uninitialize ActiveX COM support
    CoUninitialize();
 
    return 0;
}
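
Since the page numbers in these calls are 1-based, it helps to write down which pages land in which output file. Here is a small sketch in plain C++, assuming, as described above, that the second part begins with the split page. The PageRange struct and the splitRanges function are hypothetical helpers for illustration and are not part of the SDK, so check the actual behavior against your own documents.

```cpp
#include <cassert>
#include <utility>

// Inclusive, 1-based range of pages. Hypothetical helper, not an SDK type.
struct PageRange {
    int first;
    int last;
};

// Returns the page ranges of the two parts when a document with
// totalPages pages is split at splitPage (1-based).
// Assumption: part two starts with splitPage, as described in the text.
std::pair<PageRange, PageRange> splitRanges(int totalPages, int splitPage) {
    // Both parts must contain at least one page.
    assert(splitPage >= 2 && splitPage <= totalPages);
    PageRange part1{1, splitPage - 1};
    PageRange part2{splitPage, totalPages};
    return std::make_pair(part1, part2);
}
```

Under that assumption, splitRanges(5, 2) puts page 1 alone in the first part and pages 2 through 5 in the second.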

Nuts & Bolts – Finding Text in a PDF

This example and the next are operations I use together regularly. Here, I am simply using the PDF Extractor to find a specified text string and its location. In the next example, I will demo how to delete a string from the document. As you may have guessed, what I am building toward here is a library of functions that will let you operate on PDFs programmatically like a word processor, using both single-threaded and multithreaded methods. Here is the code to find a string:

using System;
using System.Drawing;
using Bytescout.PDFExtractor;

namespace FindText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile(@".\sample1.pdf");

            // Set the matching mode.
            // WordMatchingMode.None - treats the search string as substring
            // WordMatchingMode.ExactMatch - treats the search string as separate word
            // WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader).
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

            int pageCount = extractor.GetPageCount();

            for (int i = 0; i < pageCount; i++)
            {
                // Search each page for "ipsum" string
                if (extractor.Find(i, "ipsum", false))
                {
                    do
                    {
                        Console.WriteLine("");
                        Console.WriteLine("Found on page " + i + " at location " + extractor.FoundText.Bounds.ToString());
                        Console.WriteLine("");

                        // Iterate through each element in the found text
                        foreach (SearchResultElement element in extractor.FoundText.Elements)
                        {
                            Console.WriteLine("Element #" + element.Index + " at left=" + element.Left + "; top=" + element.Top + "; width=" + element.Width + "; height=" + element.Height);
                            Console.WriteLine("Text: " + element.Text);
                            Console.WriteLine("Font is bold: " + element.FontIsBold);
                            Console.WriteLine("Font is italic:" + element.FontIsItalic);
                            Console.WriteLine("Font name: " + element.FontName);
                            Console.WriteLine("Font size:" + element.FontSize);
                            Console.WriteLine("Font color:" + element.FontColor);
                        }
                    }
                    while (extractor.FindNext());
                }
            }

            Console.WriteLine();
            Console.WriteLine("Press any key to continue...");
            Console.ReadLine();
        }
    }
}

Nuts & Bolts – Removing Text from a PDF

Continuing the theme of bread-and-butter mechanics with PDF Extractor, I want to demonstrate how to isolate and delete a specified chunk of text from a PDF file. Let’s also continue the C# coding workflow and import the System.Diagnostics and System.Drawing namespaces in addition to the usual ones. Check out the inline comments for specific details on customizing this code to your project’s needs. In this code, I am locating a chunk of text three ways (by search string, by rectangle, and by point), removing it from the doc, and saving each result to a new file. Have a look:

using System;
using System.Diagnostics;
using System.Drawing;
using Bytescout.PDFExtractor;
 
namespace RemoveText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.Remover instance
            Remover remover = new Remover("demo", "demo");
 
            // Load sample PDF document
            remover.LoadDocumentFromFile(@"sample1.pdf");
 
            // Remove text "LOREM IPSUM" and save edited document as "result1.pdf".
            // NOTE: The removed text might be larger than the search string. Currently the Remover deletes
            // the whole PDF text object containing the search string.
            remover.RemoveText(0, "LOREM IPSUM", true, @"result1.pdf");
 
            // Remove text objects contained in the specified rectangle or intersecting with it.
            // NOTE: The removed text might be larger than the specified rectangle. Currently the Remover is unable
            // to split PDF text objects.
            remover.RemoveText(0, new RectangleF(74f, 550f, 489f, 67f), @"result2.pdf");
 
            // Remove text object contained in the specified point.
            // NOTE: The removed text might be larger than a word in the specified point. Currently the Remover is able
            // to remove only the whole PDF text object containing the word.
            remover.RemoveText(0, new PointF(121f, 230f), @"result3.pdf");
             
            // Clean up.
            remover.Dispose();
 
            Console.WriteLine();
            Console.WriteLine("Press any key to continue and open result PDF files in default PDF viewer...");
            Console.ReadKey();
 
            Process.Start("result1.pdf");
            Process.Start("result2.pdf");
            Process.Start("result3.pdf");
        }
    }
}

Extracting Images from PDFs with ASP.NET

To round off this tutorial and complete my detailed experience with PDF Extractor SDK, I want to include a code sample that uses the SDK from an ASP.NET page. In this example, I am going to fetch the images embedded in a PDF doc, keep each one in memory as a PNG byte array, and send the first one back in the response. I’m going to create a list first to hold the images from the PDF. Have a look at this ASP.NET sample (C# code-behind):

using System;
using System.Collections.Generic;
using System.Drawing.Imaging;
using System.IO;
using Bytescout.PDFExtractor;
 
namespace ExtractImages
{
    public partial class _Default : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            String inputFile = Server.MapPath(@".\bin\sample1.pdf");
 
            // Create Bytescout.PDFExtractor.ImageExtractor instance
            ImageExtractor extractor = new ImageExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";
             
            // Load sample PDF document
            extractor.LoadDocumentFromFile(inputFile);
 
            // Array to keep extracted images
            List<byte[]> extractedImages = new List<byte[]>();
 
            // Initialize image enumeration
            if (extractor.GetFirstImage())
            {
                do
                {
                    // Extract image to memory
                    using (MemoryStream memoryStream = new MemoryStream())
                    {
                        extractor.SaveCurrentImageToStream(memoryStream, ImageFormat.Png);
                        // Keep image as byte array
                        extractedImages.Add(memoryStream.ToArray());
                    }
                }
                while (extractor.GetNextImage()); // Advance image enumeration
            }
             
            // Write first image to the output stream
 
            Response.Clear();
 
            Response.ContentType = "image/png";
            Response.AddHeader("Content-Disposition", "inline;filename=image.png");
 
            // Guard against PDFs that contain no images
            if (extractedImages.Count > 0)
                Response.BinaryWrite(extractedImages[0]);
             
            Response.End();
        }
    }
}

Full Power PDF Extractor Tools

I have used ByteScout’s PDF Extractor SDK in many projects and have regularly found it to be the easiest toolkit to integrate for getting full control of PDF docs. The real power comes from combining several tools at once, in parallel or in batch PDF processing. That is where you can do advanced surgical procedures and get exactly the results required. Proprietary data formats can be challenging to work with, but PDF Extractor puts all the control in your hands!
