How to find text in PDF file and get coordinates in ASP.NET, C#, VB.NET, VBScript using PDF Extractor SDK - ByteScout

The sample code below shows how to find text in PDF files and get its coordinates using the ByteScout PDF Extractor SDK.

Source code snippets are provided below for ASP.NET, C#, VB.NET, and VBScript.

Let’s see the code and we’ll analyze it later in this article.

ASP.NET

using System;
using Bytescout.PDFExtractor;

namespace ExtractText
{
	public partial class _Default : System.Web.UI.Page
	{
		/*
		IF YOU SEE TEMPORARY FOLDER ACCESS ERRORS: 
		Temporary folder access is required for web application when you use ByteScout SDK in it.
		If you are getting errors related to the access to temporary folder like "Access to the path 'C:\Windows\TEMP\... is denied" then you need to add permission for this temporary folder to make ByteScout SDK working on that machine and IIS configuration because ByteScout SDK requires access to temp folder to cache some of its data for more efficient work.
		SOLUTION:
		If your IIS Application Pool has "Load User Profile" option enabled the IIS provides access to user's temp folder. Check user's temporary folder
		If you are running Web Application under an impersonated account or IIS_IUSRS group, IIS may redirect all requests into separate temp folder like "c:\temp\".
		In this case
		- check the User or User Group your web application is running under
		- then add permissions for this User or User Group to read and write into that temp folder (c:\temp or c:\windows\temp\ folder)
		- restart your web application and try again
		*/

		protected void Page_Load(object sender, EventArgs e)
		{
			String inputFile = Server.MapPath(@".\bin\sample1.pdf");

			// Create Bytescout.PDFExtractor.TextExtractor instance
			TextExtractor extractor = new TextExtractor();
			extractor.RegistrationName = "demo";
			extractor.RegistrationKey = "demo";
			
			// Load sample PDF document
			extractor.LoadDocumentFromFile(inputFile);

			Response.Clear();
			Response.ContentType = "text/html";

			Response.Write("<pre>");
			
			// Write extracted text to output stream
			extractor.SaveTextToStream(Response.OutputStream);

			Response.Write("</pre>");

			Response.End();
		}
	}
}
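
The ASP.NET sample above writes the extracted text to the response but does not search it. As a rough sketch, the same `Find`/`FindNext` calls used in the console samples could be applied inside `Page_Load`. It assumes the same `.\bin\sample1.pdf` location and the same search word, and it HTML-encodes each line before writing it; the `FindText` namespace is only an illustrative name.

```csharp
using System;
using Bytescout.PDFExtractor;

namespace FindText
{
	public partial class _Default : System.Web.UI.Page
	{
		protected void Page_Load(object sender, EventArgs e)
		{
			String inputFile = Server.MapPath(@".\bin\sample1.pdf");

			TextExtractor extractor = new TextExtractor();
			extractor.RegistrationName = "demo";
			extractor.RegistrationKey = "demo";
			extractor.LoadDocumentFromFile(inputFile);
			extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

			Response.Clear();
			Response.ContentType = "text/html";
			Response.Write("<pre>");

			int pageCount = extractor.GetPageCount();
			for (int i = 0; i < pageCount; i++)
			{
				// Same search loop as the console samples
				if (extractor.Find(i, "ipsum", false))
				{
					do
					{
						string line = "Found on page " + i + " at location " + extractor.FoundText.Bounds;
						// Encode before writing into the HTML response
						Response.Write(Server.HtmlEncode(line) + "\r\n");
					}
					while (extractor.FindNext());
				}
			}

			Response.Write("</pre>");

			// Cleanup
			extractor.Dispose();

			Response.End();
		}
	}
}
```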

C#

using System;
using System.Drawing;
using Bytescout.PDFExtractor;

namespace FindText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile(@".\sample1.pdf");
            
            // Set the matching mode.
            // WordMatchingMode.None - treats the search string as substring
            // WordMatchingMode.ExactMatch - treats the search string as separate word
            // WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader).
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

            int pageCount = extractor.GetPageCount();

            for (int i = 0; i < pageCount; i++)
            {
                // Search each page for "ipsum" string
                if (extractor.Find(i, "ipsum", false))
                {
                    do
                    {
                        Console.WriteLine("");
                        Console.WriteLine("Found on page " + i + " at location " + extractor.FoundText.Bounds.ToString());
                        Console.WriteLine("");
                        // Iterate through each element in the found text
                        foreach (SearchResultElement element in extractor.FoundText.Elements)
                        {
                            Console.WriteLine("Element #" + element.Index + " at left=" + element.Left + "; top=" + element.Top + "; width=" + element.Width + "; height=" + element.Height);
                            Console.WriteLine("Text: " + element.Text);
                            Console.WriteLine("Font is bold: " + element.FontIsBold);
                            Console.WriteLine("Font is italic: " + element.FontIsItalic);
                            Console.WriteLine("Font name: " + element.FontName);
                            Console.WriteLine("Font size: " + element.FontSize);
                            Console.WriteLine("Font color: " + element.FontColor);
                        }
                    }
                    while (extractor.FindNext());
                }
            }

            // Cleanup
            extractor.Dispose();

            Console.WriteLine();
            Console.WriteLine("Press any key to continue...");
            Console.ReadKey();
        }
    }
}

VB.NET

Imports System
Imports System.Drawing
Imports Bytescout.PDFExtractor

Class Program
    Friend Shared Sub Main(args As String())

            ' Create Bytescout.PDFExtractor.TextExtractor instance
            Dim extractor As New TextExtractor()
            extractor.RegistrationName = "demo"
            extractor.RegistrationKey = "demo"

            ' Load sample PDF document
            extractor.LoadDocumentFromFile(".\sample1.pdf")
            
            ' Set the matching mode.
            ' WordMatchingMode.None - treats the search string as substring;
            ' WordMatchingMode.ExactMatch - treats the search string as separate word;
            ' WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader).
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch

            Dim pageCount As Integer = extractor.GetPageCount()

            For i As Integer = 0 To pageCount - 1
                ' Search each page for "ipsum" string
                If extractor.Find(i, "ipsum", False) Then
                    Do
                        Console.WriteLine("")
                        Console.WriteLine("Found on page " & i & " at location " & extractor.FoundText.Bounds.ToString())
                        Console.WriteLine("")
                        ' Iterate through each element in the found text
                        For Each element As SearchResultElement In extractor.FoundText.Elements
                            Console.WriteLine("Element #" & element.Index.ToString() & " at left=" & element.Left.ToString() & "; top=" & element.Top.ToString() & "; width=" & element.Width.ToString() & "; height=" & element.Height.ToString())
                            Console.WriteLine("Text: " & element.Text)
                            Console.WriteLine("Font is bold: " & element.FontIsBold.ToString())
                            Console.WriteLine("Font is italic: " & element.FontIsItalic.ToString())
                            Console.WriteLine("Font name: " & element.FontName)
                            Console.WriteLine("Font size: " & element.FontSize.ToString())
                            Console.WriteLine("Font color: " & element.FontColor.ToString())
                        Next
                    Loop While extractor.FindNext()
                End If
            Next

            ' Cleanup
            extractor.Dispose()

            Console.WriteLine()
            Console.WriteLine("Press any key to continue...")
            Console.ReadKey()
            
    End Sub
End Class

VBScript

' Create Bytescout.PDFExtractor.TextExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")

' Set the matching mode:
' 0 = WordMatchingMode.None - treats the search string as substring;
' 1 = WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader);
' 2 = WordMatchingMode.ExactMatch - treats the search string as separate word.
extractor.WordMatchingMode = 2

' Get page count
pageCount = extractor.GetPageCount()

For i = 0 To pageCount - 1
 
    If extractor.Find(i, "ipsum", false) Then ' parameters are: page index, string to find, case sensitivity.
        Do
            foundMessage = "Found word 'ipsum' on page #" & CStr(i) & " at { " & _
                "x = " & CStr(extractor.FoundText.Left) & "; " & _
                "y = " & CStr(extractor.FoundText.Top) & "; " & _
                "width = " & CStr(extractor.FoundText.Width) & "; " & _
                "height = " & CStr(extractor.FoundText.Height) & " }"

            elementInfo = ""

            ' Iterate through elements of the found text object
            For j = 0 to extractor.FoundText.ElementCount - 1
                Set element = extractor.FoundText.GetElement(j)	
                elementInfo = elementInfo & "Element #" & CStr(j) & " at { x = " & CStr(element.Left) & "; y = " & CStr(element.Top) & "; width = " & CStr(element.Width) & "; height = " & CStr(element.Height) & " }" & vbCRLF
                elementInfo = elementInfo & "Text: " & CStr(element.Text) & vbCRLF
                elementInfo = elementInfo & "Font is bold: " & CStr(element.FontIsBold) & vbCRLF
                elementInfo = elementInfo & "Font is italic: " & CStr(element.FontIsItalic) & vbCRLF
                elementInfo = elementInfo & "Font name: " & CStr(element.FontName) & vbCRLF
                elementInfo = elementInfo & "Font size: " & CStr(element.FontSize) & vbCRLF
                elementInfo = elementInfo & "Font color (as OLE_COLOR): " & CStr(element.FontColorAsOleColor) & vbCRLF & vbCRLF
            Next 

            WScript.Echo foundMessage & vbCRLF & vbCRLF & elementInfo

        Loop While extractor.FindNext
        
    End If

Next

WScript.Echo "Done"

Set extractor = Nothing

The C#, VB.NET, and VBScript snippets do the same thing (the ASP.NET sample only writes the extracted text to the response), so let’s review the C# snippet here.

We’re using the Bytescout.PDFExtractor library here. If you want to code along, you need to install the ByteScout PDF Extractor SDK on your machine. The SDK is available from ByteScout.

First, we create an instance of the “TextExtractor” class and set its registration name and key. We use the “demo” name and key here, which has its limitations but is fine for this demo. In production, replace them with your actual registration name and key.

// Create Bytescout.PDFExtractor.TextExtractor instance
TextExtractor extractor = new TextExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";
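
The full sample calls `Dispose` only at the very end, so an exception while loading or searching would skip it. One way to guard against that is a `try`/`finally` block. This is a minimal sketch, assuming the SDK is referenced and `sample1.pdf` is in the working directory; the class name `DisposeSketch` is only illustrative.

```csharp
using System;
using Bytescout.PDFExtractor;

class DisposeSketch
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor();
        try
        {
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            extractor.LoadDocumentFromFile(@".\sample1.pdf");
            Console.WriteLine("Pages: " + extractor.GetPageCount());
        }
        finally
        {
            // Runs whether or not an exception was thrown above
            extractor.Dispose();
        }
    }
}
```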

We load the input PDF file into the text extractor instance with the “LoadDocumentFromFile” method. If the input source is a stream rather than a file, the “LoadDocumentFromStream” method can be used instead.

// Load sample PDF document
extractor.LoadDocumentFromFile("sample1.pdf");
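
For the stream case, a sketch might look like the following. The article names `LoadDocumentFromStream` but does not show its signature, so passing a single `System.IO.Stream` argument is an assumption here. The stream is kept open while the extractor is in use, in case the document is read lazily.

```csharp
using System;
using System.IO;
using Bytescout.PDFExtractor;

class StreamLoadSketch
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor();
        extractor.RegistrationName = "demo";
        extractor.RegistrationKey = "demo";

        using (FileStream stream = File.OpenRead(@".\sample1.pdf"))
        {
            // Assumption: the method accepts a readable System.IO.Stream
            extractor.LoadDocumentFromStream(stream);
            Console.WriteLine("Pages: " + extractor.GetPageCount());

            // Dispose the extractor before the stream is closed
            extractor.Dispose();
        }
    }
}
```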

Then we get the number of pages in the PDF with the “GetPageCount” method and loop through every page to perform the search. Page indexes are zero-based, so the loop runs from 0 to pageCount - 1.

int pageCount = extractor.GetPageCount();

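Lastly, we use the “Find” method to search each page for the word “ipsum”. Its parameters are the page index, the string to find, and a case-sensitivity flag (false here, so the search ignores case).

We keep calling “FindNext” until it returns false, which means the page has no more matches. For each match, the coordinates of the found text and of each of its elements are printed to the console.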
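
The search loop described here can be condensed into a short sketch that collects one line per match instead of printing every element. It uses only the calls shown in the full C# sample; the `hits` list and the class name `FindLoopSketch` are illustrative additions, and the project is assumed to reference System.Drawing as the full sample does.

```csharp
using System;
using System.Collections.Generic;
using Bytescout.PDFExtractor;

class FindLoopSketch
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor();
        extractor.RegistrationName = "demo";
        extractor.RegistrationKey = "demo";
        extractor.LoadDocumentFromFile(@".\sample1.pdf");
        extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

        // Illustrative helper: one text line per match
        List<string> hits = new List<string>();

        int pageCount = extractor.GetPageCount();
        for (int i = 0; i < pageCount; i++)
        {
            // Find returns true when page i has at least one match
            if (extractor.Find(i, "ipsum", false))
            {
                do
                {
                    hits.Add("page index " + i + ": " + extractor.FoundText.Bounds);
                }
                while (extractor.FindNext()); // loop ends when FindNext returns false
            }
        }

        // Cleanup
        extractor.Dispose();

        Console.WriteLine(hits.Count + " match(es)");
        foreach (string hit in hits)
        {
            Console.WriteLine(hit);
        }
    }
}
```

Collecting the results first keeps the search separate from the output, which is convenient if the coordinates are needed later, for example to highlight the matches.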

That’s all, guys. I hope this article helps you understand how to find text and get its coordinates using the ByteScout PDF Extractor SDK.

Happy Coding!
