Extract text from PDF files in ASP.NET, C#, VB.NET, VBScript - ByteScout

Extract text from PDF files in ASP.NET, C#, VB.NET, VBScript

  • Home
  • /
  • Articles
  • /
  • Extract text from PDF files in ASP.NET, C#, VB.NET, VBScript

These samples show how to extract all text from PDF file into TXT file (plain text) using Bytescout PDF Extractor SDK.

If you want to code along then you need to have Bytescout SDK installed in your machine. You can get your free trial here.

We’ve provided source code in different languages/framework below. Select your programming language:


Input PDF file and output TXT file with extracted text (click to view full-size screenshot)

ASP.NET

using System;
using System.Data;
using System.Configuration;
using System.Collections;
using System.IO;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using Bytescout.PDFExtractor;

namespace ExtractAllText
{
	public partial class _Default : System.Web.UI.Page
	{
		protected void Page_Load(object sender, EventArgs e)
		{
			// This test file will be copied to the project directory on the pre-build event (see the project properties).
			String inputFile = Server.MapPath("sample2.pdf");

			// Create Bytescout.PDFExtractor.TextExtractor instance
			TextExtractor extractor = new TextExtractor();
			extractor.RegistrationName = "demo";
			extractor.RegistrationKey = "demo";
			
			// Load sample PDF document
			extractor.LoadDocumentFromFile(inputFile);

			Response.Clear();
			Response.ContentType = "text/html";

			// Save extracted text to output stream
			extractor.SaveTextToStream(Response.OutputStream);

			Response.End();
		}
	}
}

C#

using System;
using Bytescout.PDFExtractor;

namespace ExtractAllText
{
	class Program
	{
		static void Main(string[] args)
		{
			// Create Bytescout.PDFExtractor.TextExtractor instance
			TextExtractor extractor = new TextExtractor();
			extractor.RegistrationName = "demo";
			extractor.RegistrationKey = "demo";

			// Load sample PDF document
			extractor.LoadDocumentFromFile("sample2.pdf");

			// Save extracted text to file
			extractor.SaveTextToFile("output.txt");

			// Open output file in default associated application
			System.Diagnostics.Process.Start("output.txt");
		}
	}
}

VB.NET

Imports Bytescout.PDFExtractor

Class Program
	Friend Shared Sub Main(args As String())

		' Create Bytescout.PDFExtractor.TextExtractor instance
		Dim extractor As New TextExtractor()
		extractor.RegistrationName = "demo"
		extractor.RegistrationKey = "demo"

		' Load sample PDF document
		extractor.LoadDocumentFromFile("sample2.pdf")

		' Save extracted text to file
		extractor.SaveTextToFile("output.txt")

		' Open output file in default associated application
		System.Diagnostics.Process.Start("output.txt")
	End Sub
End Class

VBScript

' Create Bytescout.PDFExtractor.TextExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile("....sample2.pdf")

' Save extracted text to file
extractor.SaveTextToFile("output.txt")

' Open output file in default associated application
Set shell = CreateObject("WScript.Shell")
shell.Run "output.txt", 1, false
Set shell = Nothing

Set extractor = Nothing

Program Output is as below:

 

Though source code is pretty simple and straight forward, Let’s analyze it briefly.

1. Create Bytescout.PDFExtractor.TextExtractor instance.

TextExtractor extractor = new TextExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";

After creating instance we need to provide registration key and name, which we’ll get upon registration for Bytescout SDK.

2. Load PDF document

extractor.LoadDocumentFromFile("sample2.pdf");

Here we’re simply loading input pdf document. There are different version of this method for different scenarios like loading password protected pdf, loading only few pages of pdf document etc. We can also use “LoadDocumentFromStream” method to load document from memory or any other stream data.

3. Save extracted text to file

extractor.SaveTextToFile("output.txt");

In this step we’re performing text extraction along with saving it to output file. If we want to save output to stream then we can use “SaveTextToStream” method.

 

That’s all guys. I hope you get an idea on how to use Bytescout PDFExtractor assembly to extract text from PDF document.

Happy Coding!

Tutorials:

prev
next