Multiple Uses of PDF Extractor Powerful Toolkit

In this tutorial, we will show you how to use PDF Extractor SDK to perform several common PDF tasks in C#.
PDF Extractor SDK is a toolkit of PDF and image extraction engines for C# and VB.NET. You can integrate it into your application to extract data from PDF documents automatically.

In this brief guide, we will cover the following features of PDF Extractor SDK in C#:

how to find and extract tables from PDF to CSV
how to make a PDF searchable in C#
how to split PDF files based on keywords in C#
how to rotate a PDF file in C#
how to extract images from PDF in C#
how to merge PDF files in C#
merge PDF in C# – video guide

Extract PDF Tables to CSV Format

The following code snippet detects tables in a PDF and saves each one to a separate CSV file using PDF Extractor SDK in C#.

C# Code Sample to Extract Tables

Just copy-paste the following C# source code to see the program in action.

using System;
using Bytescout.PDFExtractor;

namespace ExtractTextByPages
{
	class Program
	{
		static void Main(string[] args)
		{
			// Create Bytescout.PDFExtractor.CSVExtractor instance
			CSVExtractor extractor = new CSVExtractor();
			extractor.RegistrationName = "demo";
			extractor.RegistrationKey = "demo";

			// Create Bytescout.PDFExtractor.TableDetector instance
			TableDetector tdetector = new TableDetector();
			tdetector.RegistrationName = "demo";
			tdetector.RegistrationKey = "demo";

			// Load the sample PDF document into both objects
			extractor.LoadDocumentFromFile("sample3.pdf");
			tdetector.LoadDocumentFromFile("sample3.pdf");

			// Get page count
			int pageCount = tdetector.GetPageCount();

			for (int i = 0; i < pageCount; i++)
			{
				int j = 1;
				// Find the first table on the page and continue if found
				if (tdetector.FindTable(i))
				{
					do
					{
						// Set the extraction area for the CSV extractor to the rectangle given by the table detector
						extractor.SetExtractionArea(
							tdetector.GetFoundTableRectangle_Left(),
							tdetector.GetFoundTableRectangle_Top(),
							tdetector.GetFoundTableRectangle_Width(),
							tdetector.GetFoundTableRectangle_Height()
						);

						// Save the table to a CSV file
						extractor.SavePageCSVToFile(i, "page-" + i + "-table-" + j + ".csv");
						j++;
					} while (tdetector.FindNextTable()); // Search for the next table
				}
			}

			// Open first output file in default associated application
			System.Diagnostics.Process.Start("page-0-table-1.csv");
		}
	}
}
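
The sample above opens page-0-table-1.csv unconditionally, which fails when the first page contains no table. As a minimal sketch, assuming the output files are written to the current directory with the page-N-table-M.csv naming used above, the following lists the files that were actually produced. It uses only the .NET base class library; the FindTableCsvFiles helper is a hypothetical name and not part of the SDK.

```csharp
using System;
using System.IO;

public static class ListTableCsvFiles
{
	// Hypothetical helper (not part of the SDK): returns the CSV files in the
	// given folder that match the "page-N-table-M.csv" naming used above.
	public static string[] FindTableCsvFiles(string folder)
	{
		string[] files = Directory.GetFiles(folder, "page-*-table-*.csv");
		Array.Sort(files, StringComparer.Ordinal);
		return files;
	}

	public static void Main()
	{
		string[] files = FindTableCsvFiles(Directory.GetCurrentDirectory());

		if (files.Length == 0)
		{
			Console.WriteLine("No tables were extracted.");
			return;
		}

		foreach (string file in files)
		{
			// Count the text lines in each file as a quick sanity check
			int lines = File.ReadAllLines(file).Length;
			Console.WriteLine(Path.GetFileName(file) + ": " + lines + " line(s)");
		}
	}
}
```

The first element of the returned array, if any, could then be opened instead of a hard-coded file name.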

Make a PDF File Searchable in C#

The code snippet below uses OCR to make a PDF searchable in C# with the help of ByteScout PDF Extractor SDK.

using Bytescout.PDFExtractor;

// To make OCR work you should add to your project references to Bytescout.PDFExtractor.dll and Bytescout.PDFExtractor.OCRExtension.dll 

namespace MakeSearchablePDF
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.SearchablePDFMaker instance
            SearchablePDFMaker searchablePDFMaker = new SearchablePDFMaker();
            searchablePDFMaker.RegistrationName = "demo";
            searchablePDFMaker.RegistrationKey = "demo";

            // Load sample PDF document
            searchablePDFMaker.LoadDocumentFromFile("sample_ocr.pdf");
            
            // Set the location of "tessdata" folder containing language data files
            searchablePDFMaker.OCRLanguageDataFolder = @"c:\Program Files\Bytescout PDF Extractor SDK\Redistributable\net2.00\tessdata\";

            // Set OCR language
            searchablePDFMaker.OCRLanguage = "eng"; // "eng" for English, "deu" for German, "fra" for French, "spa" for Spanish, etc., according to the files in /tessdata

            // Set PDF document rendering resolution
            searchablePDFMaker.OCRResolution = 300;

            // Run OCR and save the searchable PDF to a file
            searchablePDFMaker.MakePDFSearchable("output.pdf");

            // Open output file in default associated application
            System.Diagnostics.Process.Start("output.pdf");
        }
    }
}

If you need to make a PDF unsearchable, you can follow this step-by-step tutorial.

Split PDF in C# Based on a Keyword

The following code snippet searches every page of a PDF for a keyword and extracts each matching page into a separate PDF file using PDF Extractor SDK in C#. To split on a different term, change the value of the keyword variable.

using Bytescout.PDFExtractor;
using System.IO;

namespace FindAndExtractPageExample
{
	class Program
	{
		static void Main(string[] args)
		{
			string inputFile = "sample.pdf";
			string keyword = "demographic";

			TextExtractor extractor = new TextExtractor("demo", "demo");
			extractor.LoadDocumentFromFile(inputFile);

			// Search each page for the keyword
			for (int i = 0; i < extractor.GetPageCount(); i++)
			{
				if (extractor.Find(i, keyword, false))
				{
					// Extract the page containing the keyword
					ExtractPage(inputFile, i, "page" + i + ".pdf");
				}
			}
		}

		private static void ExtractPage(string inputFile, int pageIndex, string outputFile)
		{
			DocumentSplitter splitter = new DocumentSplitter("demo", "demo");

			if (pageIndex == 0)
			{
				if (splitter.GetPageCount(inputFile) == 1)
				{
					// No splitting required if the document has only one page
					File.Copy(inputFile, outputFile);
				}
				else
				{
					// Split at the second page (page numbering starts from 1 in this function).
					// The first part is the one-page document we are looking for.
					splitter.Split(inputFile, outputFile, "waste", 2);
					File.Delete("waste"); // Delete the unneeded part
				}
			}
			else
			{
				if (pageIndex == splitter.GetPageCount(inputFile) - 1)
				{
					// If this is the last page, split just before it.
					// The second part is the one-page document we are looking for.
					splitter.Split(inputFile, "waste", outputFile, pageIndex + 1);
					File.Delete("waste"); // Delete the unneeded part
				}
				else
				{
					// If the required page is in the middle of the document, two split operations are needed:
					// first cut off everything before the page, then cut off everything after it.
					splitter.Split(inputFile, "waste", "part", pageIndex + 1);
					File.Delete("waste");
					splitter.Split("part", outputFile, "waste", 2);
					File.Delete("part");
					File.Delete("waste");
				}
			}
		}
	}
}
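
The ExtractPage method above writes its intermediate results to the fixed file names "waste" and "part" in the current directory, so two runs in the same folder could overwrite each other's files. As a minimal sketch, using only the .NET base class library, unique temporary paths could be generated and passed to Split instead. The NewTempPdfPath helper is a hypothetical name and not part of the SDK.

```csharp
using System;
using System.IO;

public static class TempPaths
{
	// Hypothetical helper (not part of the SDK): builds a unique path in the
	// system temp folder. The file itself is not created here.
	public static string NewTempPdfPath(string prefix)
	{
		string name = prefix + "-" + Guid.NewGuid().ToString("N") + ".pdf";
		return Path.Combine(Path.GetTempPath(), name);
	}

	public static void Main()
	{
		string waste = NewTempPdfPath("waste");
		string part = NewTempPdfPath("part");

		Console.WriteLine(waste);
		Console.WriteLine(part);

		// Clean up afterwards; File.Delete does not throw if the file is missing
		File.Delete(waste);
		File.Delete(part);
	}
}
```

Wrapping the split calls in try/finally and deleting the temporary files in the finally block would also keep the folder clean when a split fails.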

Rotate a PDF using C# Source Code

PDF Extractor SDK can rotate a PDF file by a fixed angle (90 degrees in the sample below) in C#, VB.NET, and ASP.NET. To rotate a document, copy the code snippet below into your project.

using System.Diagnostics;
using Bytescout.PDFExtractor;

namespace RotateDocument
{
	class Program
	{
		static void Main(string[] args)
		{
			string inputFile = "sample1.pdf";

			using (DocumentRotator rotator = new DocumentRotator("demo", "demo"))
			{
				rotator.Rotate(inputFile, "result.pdf", RotationAngle.Deg90);
			}

			Process.Start("result.pdf");
		}
	}
}
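
The samples in this guide open their output with Process.Start(fileName). On .NET Framework this launches the associated application, but on .NET Core and .NET 5 or later UseShellExecute defaults to false, so the same call tries to run the file as an executable and throws. As a minimal sketch, assuming the result.pdf output name from the sample above, the file can be opened through the operating system shell instead. The CreateOpenInfo helper is a hypothetical name.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class OpenWithDefaultApp
{
	// Hypothetical helper: builds start info that asks the operating system
	// shell to open the file with its associated application.
	public static ProcessStartInfo CreateOpenInfo(string path)
	{
		return new ProcessStartInfo
		{
			FileName = Path.GetFullPath(path),
			UseShellExecute = true
		};
	}

	public static void Main()
	{
		string outputFile = "result.pdf"; // output name used in the sample above

		if (File.Exists(outputFile))
		{
			Process.Start(CreateOpenInfo(outputFile));
		}
		else
		{
			Console.WriteLine("Output file not found: " + outputFile);
		}
	}
}
```

Checking File.Exists first also avoids an exception when the previous step produced no output.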

Extract Images from PDF in C#

The following code snippet extracts every image from a PDF and saves each one as a PNG file. Copy it into your C# project to get started.

using System;
using System.Drawing.Imaging;
using Bytescout.PDFExtractor;

namespace ExtractAllImages
{
	class Program
	{
		static void Main(string[] args)
		{
			// Create Bytescout.PDFExtractor.ImageExtractor instance
			ImageExtractor extractor = new ImageExtractor();
			extractor.RegistrationName = "demo";
			extractor.RegistrationKey = "demo";
			
			// Load sample PDF document
			extractor.LoadDocumentFromFile("sample1.pdf");

			int i = 0;

			// Initialize image enumeration
			if (extractor.GetFirstImage())
			{
				do
				{
					string outputFileName = "image" + i + ".png";

					// Save image to file
					extractor.SaveCurrentImageToFile(outputFileName, ImageFormat.Png);

					i++;

				} while (extractor.GetNextImage()); // Advance image enumeration
			}

			// Open first output file in default associated application
			System.Diagnostics.Process.Start("image0.png");
		}
	}
}

Merge PDF Files in C#

The code snippet below merges several PDF files into a single document in C# using ByteScout PDF Extractor SDK.

using System.Diagnostics;
using Bytescout.PDFExtractor;

namespace MergeDocuments
{
	class Program
	{
		static void Main(string[] args)
		{
			string[] inputFiles = new string[] {"sample1.pdf", "sample2.pdf", "sample3.pdf"};

			using (DocumentMerger merger = new DocumentMerger("demo", "demo"))
			{
				merger.Merge(inputFiles, "result.pdf");
			}

			Process.Start("result.pdf");
		}
	}
}
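
The sample above lists its input files by hand. As a minimal sketch, using only the .NET base class library, the array could instead be built from all PDF files in a folder, sorted by name so that the merge order is predictable. The CollectPdfFiles helper is a hypothetical name and not part of the SDK.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MergeInputs
{
	// Hypothetical helper (not part of the SDK): returns the files in a folder
	// whose extension is ".pdf" (case-insensitive), sorted by path.
	public static string[] CollectPdfFiles(string folder)
	{
		List<string> result = new List<string>();

		foreach (string file in Directory.GetFiles(folder))
		{
			if (string.Equals(Path.GetExtension(file), ".pdf", StringComparison.OrdinalIgnoreCase))
			{
				result.Add(file);
			}
		}

		result.Sort(StringComparer.OrdinalIgnoreCase);
		return result.ToArray();
	}

	public static void Main()
	{
		string[] inputFiles = CollectPdfFiles(Directory.GetCurrentDirectory());

		foreach (string file in inputFiles)
		{
			Console.WriteLine(file);
		}
	}
}
```

The resulting array has the same string[] type as inputFiles in the sample. Writing the merged result to a different folder keeps it from being picked up as an input on the next run.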

Merge PDF in C# – Video Guide

You can also check out the live demo showing how to merge PDF files using PDF Extractor SDK.

These are just a few uses of the PDF Extractor toolkit for C# programming. If you’d like to learn more, check our SDK documentation.