ByteScout PDF Extractor SDK - C# - Index PDF Documents In Folder - ByteScout


How to index PDF documents in a folder in C# using ByteScout PDF Extractor SDK

Step-by-step tutorial: how to index PDF documents in a folder in C#

This tutorial is designed to help you implement folder indexing in your own application. ByteScout PDF Extractor SDK helps developers extract tables and data from unstructured documents such as PDFs, TIFFs, scans, images, and scanned or electronic forms. The library uses OCR, computer vision, and AI to provide features such as table detection, automatic table structure extraction, data restoration, and data restructuring and reconstruction. It accepts PDF, TIFF, PNG, and JPG files as input and can output data as CSV, XML, or JSON. It also includes a set of utilities such as a PDF splitter, a PDF merger, and a searchable PDF maker.

The C# sample below can speed up writing your application's code with ByteScout PDF Extractor SDK. It scans the .\Files folder, extracts text from each supported file (.pdf, .png, .jpg), and prints each file's name, creation date, and extracted content. Copy and paste the code into your code editor, add a reference to ByteScout PDF Extractor SDK, and you are ready to try it. Using ByteScout PDF Extractor SDK in C# is also explained in the documentation included with the product.

A free trial version of ByteScout PDF Extractor SDK is available on our website. The trial version also includes other code samples to help you with your C# application.

Try ByteScout PDF Extractor SDK today:  60 Day Free Trial (on-premise) or  Web API (on-demand version)

Program.cs
      
using Bytescout.PDFExtractor;
using System;
using System.Collections.Generic;
using System.IO;

namespace IndexDocsInFolder
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                // Output file list
                var lstAllFilesInfo = new List<FileIndexOutput>();

                // Get all files inside directory
                var allFiles = Directory.GetFiles(@".\Files", "*.*");

                // Iterate all files, and get details
                foreach (var itmFile in allFiles)
                {
                    // Get basic file information
                    FileInfo fileInfo = new FileInfo(itmFile);

                    // Check whether file is supported
                    if (_IsFileSupported(fileInfo))
                    {
                        // Fill file index model
                        var oFileIndex = new FileIndexOutput();
                        oFileIndex.fileName = fileInfo.Name;
                        oFileIndex.fileDate = fileInfo.CreationTime;
                        oFileIndex.content = _GetFileContent(fileInfo);

                        // Add to final list
                        lstAllFilesInfo.Add(oFileIndex);
                    }
                }

                // Print all output
                Console.WriteLine("Total {0} files indexed\n", lstAllFilesInfo.Count);

                foreach (var itmFileInfo in lstAllFilesInfo)
                {
                    Console.WriteLine("fileName: {0}", itmFileInfo.fileName);
                    // "HH" prints 24-hour time; "hh" without an AM/PM marker is ambiguous
                    Console.WriteLine("fileDate: {0}", itmFileInfo.fileDate.ToString("MMM dd yyyy HH:mm:ss"));
                    Console.WriteLine("content: {0}", itmFileInfo.content.Trim());
                    Console.WriteLine();
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }

            Console.WriteLine("Press Enter to exit...");
            Console.ReadLine();
        }

        /// <summary>
        /// Get File Content
        /// </summary>
        private static string _GetFileContent(FileInfo fileInfo)
        {
            // Lower-case the extension so that ".PDF" and ".pdf" are treated alike
            string fileExtension = fileInfo.Extension.ToLowerInvariant();

            if (fileExtension == ".pdf")
            {
                return _GetPdfFileContent(fileInfo);
            }
            else if (fileExtension == ".png" || fileExtension == ".jpg")
            {
                return _GetImageContent(fileInfo);
            }

            throw new Exception("File not supported.");
        }

        /// <summary>
        /// Get PDF File Content
        /// </summary>
        private static string _GetPdfFileContent(FileInfo fileInfo)
        {
            // Read all file content
            using (TextExtractor textExtractor = new TextExtractor("demo", "demo"))
            {
                // Load document
                textExtractor.LoadDocumentFromFile(fileInfo.FullName);

                return textExtractor.GetText();
            }
        }

        /// <summary>
        /// Get Image Content
        /// </summary>
        private static string _GetImageContent(FileInfo fileInfo)
        {
            // Read all file content (same "demo" registration as the PDF extractor above)
            using (TextExtractor extractor = new TextExtractor("demo", "demo"))
            {
                // Load document
                extractor.LoadDocumentFromFile(fileInfo.FullName);

                // Enable Optical Character Recognition (OCR)
                // in .Auto mode (SDK automatically checks if needs to use OCR or not)
                extractor.OCRMode = OCRMode.Auto;

                // Set the location of OCR language data files
                extractor.OCRLanguageDataFolder = @"c:\Program Files\Bytescout PDF Extractor SDK\ocrdata\";

                // Set OCR language
                // "eng" for English, "deu" for German, "fra" for French, "spa" for Spanish etc - according to files in "ocrdata" folder
                // Find more language files at https://github.com/bytescout/ocrdata
                extractor.OCRLanguage = "eng";

                // Set PDF document rendering resolution
                extractor.OCRResolution = 300;

                // Read all text
                return extractor.GetText();
            }
        }

        /// <summary>
        /// Check whether file is valid
        /// </summary>
        private static bool _IsFileSupported(FileInfo fileInfo)
        {
            // Get file extension (lower-cased for a case-insensitive comparison)
            string fileExtension = fileInfo.Extension.ToLowerInvariant();

            // Check whether file extension is valid
            return (fileExtension == ".pdf" || fileExtension == ".png" || fileExtension == ".jpg");
        }
    }

    /// <summary>
    /// FileIndexOutput class
    /// </summary>
    public class FileIndexOutput
    {
        public string fileName { get; set; }
        public DateTime fileDate { get; set; }
        public string content { get; set; }
    }
}
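
The sample above only collects and prints the extracted text. As a minimal sketch of how that text could then be searched, the helper below builds a simple word-to-file lookup from file name and content pairs. The class `IndexSearch` and its methods `BuildWordIndex` and `Find` are hypothetical names introduced here for illustration; they are not part of ByteScout PDF Extractor SDK, and the word splitting is deliberately naive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of ByteScout PDF Extractor SDK.
public static class IndexSearch
{
    // Characters treated as word boundaries (an assumption; adjust for your documents).
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Builds a case-insensitive lookup: word -> names of the files containing it.
    // Each input pair is (file name, extracted text).
    public static Dictionary<string, SortedSet<string>> BuildWordIndex(
        IEnumerable<KeyValuePair<string, string>> files)
    {
        var index = new Dictionary<string, SortedSet<string>>(StringComparer.OrdinalIgnoreCase);

        foreach (var file in files)
        {
            // Treat missing content as empty text
            var words = (file.Value ?? string.Empty)
                .Split(Separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
                if (!index.TryGetValue(word, out var fileNames))
                {
                    fileNames = new SortedSet<string>(StringComparer.OrdinalIgnoreCase);
                    index[word] = fileNames;
                }

                fileNames.Add(file.Key);
            }
        }

        return index;
    }

    // Returns the names of the files that contain the word, or an empty list.
    public static IReadOnlyList<string> Find(
        Dictionary<string, SortedSet<string>> index, string word)
    {
        return index.TryGetValue(word, out var fileNames)
            ? fileNames.ToList()
            : new List<string>();
    }
}
```

With the list built in Main, the pairs could be produced with `lstAllFilesInfo.Select(f => new KeyValuePair<string, string>(f.fileName, f.content))`, passed to `BuildWordIndex`, and then queried with `Find(index, "invoice")`. A dictionary keyed by word keeps each lookup cheap compared with rescanning every document's text for each query.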

