How to Find Text in PDF File and Get Coordinates in ASP.NET, C#, VB.NET, VBScript using PDF Extractor SDK - ByteScout

The sample code below shows how to find text in PDF files and get its coordinates using ByteScout PDF Extractor SDK.

We’ve provided source code snippets below for ASP.NET, C#, VB.NET, and VBScript.

Let’s see the code and we’ll analyze it later in this article.

ASP.NET Source Code

Take a look at this ASP.NET source code snippet. It loads a PDF file with ByteScout PDF Extractor SDK and writes the extracted text to the HTTP response; the search calls that return coordinates are shown in the C# snippet further below.

using System;
using Bytescout.PDFExtractor;

namespace ExtractText
{
	public partial class _Default : System.Web.UI.Page
	{
		/*
		IF YOU SEE TEMPORARY FOLDER ACCESS ERRORS: 
		Temporary folder access is required for web application when you use ByteScout SDK in it.
		If you are getting errors related to the access to temporary folder like "Access to the path 'C:\Windows\TEMP\... is denied" then you need to add permission for this temporary folder to make ByteScout SDK working on that machine and IIS configuration because ByteScout SDK requires access to temp folder to cache some of its data for more efficient work.
		SOLUTION:
		If your IIS Application Pool has "Load User Profile" option enabled the IIS provides access to user's temp folder. Check user's temporary folder
		If you are running Web Application under an impersonated account or IIS_IUSRS group, IIS may redirect all requests into separate temp folder like "c:\temp\".
		In this case
		- check the User or User Group your web application is running under
		- then add permissions for this User or User Group to read and write into that temp folder (c:\temp or c:\windows\temp\ folder)
		- restart your web application and try again
		*/

		protected void Page_Load(object sender, EventArgs e)
		{
			String inputFile = Server.MapPath(@".\bin\sample1.pdf");

			// Create Bytescout.PDFExtractor.TextExtractor instance
			TextExtractor extractor = new TextExtractor();
			extractor.RegistrationName = "demo";
			extractor.RegistrationKey = "demo";
			
			// Load sample PDF document
			extractor.LoadDocumentFromFile(inputFile);

			Response.Clear();
			Response.ContentType = "text/html";

			Response.Write("<pre>");
			
			// Write extracted text to output stream
			extractor.SaveTextToStream(Response.OutputStream);

			Response.Write("</pre>");

			// Cleanup (must run before Response.End(), which ends the request)
			extractor.Dispose();

			Response.End();
		}
	}
}

C# Source Code

Check out this C# code snippet useful in finding text in PDF and getting coordinates with the help of ByteScout PDF Extractor SDK.

using System;
using System.Drawing;
using Bytescout.PDFExtractor;

namespace FindText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile(@".\sample1.pdf");
            
            // Set the matching mode.
            // WordMatchingMode.None - treats the search string as substring
            // WordMatchingMode.ExactMatch - treats the search string as separate word
            // WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader).
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

            int pageCount = extractor.GetPageCount();

            for (int i = 0; i < pageCount; i++)
            {
                // Search each page for "ipsum" string
                if (extractor.Find(i, "ipsum", false))
                {
                    do
                    {
                        Console.WriteLine("");
                        Console.WriteLine("Found on page " + i + " at location " + extractor.FoundText.Bounds.ToString());
                        Console.WriteLine("");
                        // Iterate through each element in the found text
                        foreach (SearchResultElement element in extractor.FoundText.Elements)
                        {
                            Console.WriteLine("Element #" + element.Index + " at left=" + element.Left + "; top=" + element.Top + "; width=" + element.Width + "; height=" + element.Height);
                            Console.WriteLine("Text: " + element.Text);
                            Console.WriteLine("Font is bold: " + element.FontIsBold);
                            Console.WriteLine("Font is italic: " + element.FontIsItalic);
                            Console.WriteLine("Font name: " + element.FontName);
                            Console.WriteLine("Font size: " + element.FontSize);
                            Console.WriteLine("Font color: " + element.FontColor);
                        }
                    }
                    while (extractor.FindNext());
                }
            }

            // Cleanup
            extractor.Dispose();

            Console.WriteLine();
            Console.WriteLine("Press any key to continue...");
            Console.ReadLine();
        }
    }
}

VB.NET Source Code

The following VB.NET source code can be used for searching text in PDF documents and getting coordinates via ByteScout PDF Extractor SDK.

Imports System.Drawing
Imports Bytescout.PDFExtractor

Class Program
    Friend Shared Sub Main(args As String())

            ' Create Bytescout.PDFExtractor.TextExtractor instance
            Dim extractor As New TextExtractor()
            extractor.RegistrationName = "demo"
            extractor.RegistrationKey = "demo"

            ' Load sample PDF document
            extractor.LoadDocumentFromFile(".\sample1.pdf")
            
            ' Set the matching mode.
            ' WordMatchingMode.None - treats the search string as substring;
            ' WordMatchingMode.ExactMatch - treats the search string as separate word;
            ' WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader).
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch

            Dim pageCount As Integer = extractor.GetPageCount()

            For i As Integer = 0 To pageCount - 1
                ' Search each page for "ipsum" string
                If extractor.Find(i, "ipsum", False) Then
                    Do
                        Console.WriteLine("")
                        Console.WriteLine("Found on page " & i & " at location " & extractor.FoundText.Bounds.ToString())
                        Console.WriteLine("")
                        ' Iterate through each element in the found text
                        For Each element As SearchResultElement In extractor.FoundText.Elements
                            Console.WriteLine("Element #" & element.Index.ToString() & " at left=" & element.Left.ToString() & "; top=" & element.Top.ToString() & "; width=" & element.Width.ToString() & "; height=" & element.Height.ToString())
                            Console.WriteLine("Text: " & element.Text)
                            Console.WriteLine("Font is bold: " & element.FontIsBold.ToString())
                            Console.WriteLine("Font is italic: " & element.FontIsItalic.ToString())
                            Console.WriteLine("Font name: " & element.FontName)
                            Console.WriteLine("Font size: " & element.FontSize.ToString())
                            Console.WriteLine("Font color: " & element.FontColor.ToString())
                        Next
                    Loop While extractor.FindNext()
                End If
            Next

            ' Cleanup
            extractor.Dispose()

            Console.WriteLine()
            Console.WriteLine("Press any key to continue...")
            Console.ReadLine()
            
    End Sub
End Class

VBScript Source Code

Here’s a VBScript code sample that is handy for finding text in a PDF and getting its coordinates through ByteScout PDF Extractor SDK.

' Create Bytescout.PDFExtractor.TextExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")

' Set the matching mode:
' 0 = WordMatchingMode.None - treats the search string as substring;
' 1 = WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader);
' 2 = WordMatchingMode.ExactMatch - treats the search string as separate word.
extractor.WordMatchingMode = 2

' Get page count

pageCount = extractor.GetPageCount()

For i = 0 To PageCount - 1 
 
    If extractor.Find(i, "ipsum", false) Then ' parameters are: page index, string to find, case sensitivity.
        Do
            foundMessage = "Found word 'ipsum' on page #" & CStr(i) & " at { " & _
                "x = " & CStr(extractor.FoundText.Left) & "; " & _
                "y = " & CStr(extractor.FoundText.Top) & "; " & _
                "width = " & CStr(extractor.FoundText.Width) & "; " & _
                "height = " & CStr(extractor.FoundText.Height) & " }"

            elementInfo = ""

            ' Iterate through elements of the found text object
            For j = 0 to extractor.FoundText.ElementCount - 1
                Set element = extractor.FoundText.GetElement(j)	
                elementInfo = elementInfo & "Element #" & CStr(j) & " at { x = " & CStr(element.Left) & "; y = " & CStr(element.Top) & "; width = " & CStr(element.Width) & "; height = " & CStr(element.Height) & " }" & vbCRLF
                elementInfo = elementInfo & "Text: " & CStr(element.Text) & vbCRLF
                elementInfo = elementInfo & "Font is bold: " & CStr(element.FontIsBold) & vbCRLF
                elementInfo = elementInfo & "Font is italic: " & CStr(element.FontIsItalic) & vbCRLF
                elementInfo = elementInfo & "Font name: " & CStr(element.FontName) & vbCRLF
                elementInfo = elementInfo & "Font size: " & CStr(element.FontSize) & vbCRLF
                elementInfo = elementInfo & "Font color (as OLE_COLOR): " & CStr(element.FontColorAsOleColor) & vbCRLF & vbCRLF
            Next 

            WScript.Echo foundMessage & vbCRLF & vbCRLF & elementInfo

        Loop While extractor.FindNext
        
    End If

Next

WScript.Echo "Done"

Set extractor = Nothing

The C#, VB.NET, and VBScript snippets achieve the same functionality, so let’s review the C# code snippet here.

We’re using the Bytescout.PDFExtractor library here. If you want to code along, you need to install ByteScout PDF Extractor SDK on your machine. The SDK is available from ByteScout.

First, we create an instance of the “TextExtractor” class and set its registration name and key. We use the “demo” name and key here, which have limitations but are fine for this demo. In production, replace them with your actual registration name and key.

// Create Bytescout.PDFExtractor.TextExtractor instance
TextExtractor extractor = new TextExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";

We load the input PDF file into the text extractor instance with the “LoadDocumentFromFile” method. A stream can also serve as the input source through the “LoadDocumentFromStream” method.

// Load sample PDF document
extractor.LoadDocumentFromFile("sample1.pdf");
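As a rough sketch of the stream-based variant mentioned above, the code below opens the sample file as a FileStream and passes it to “LoadDocumentFromStream”. It assumes that the method accepts a System.IO.Stream and that the stream should stay open while the extractor is in use; neither detail is confirmed in this article, so check the SDK reference before relying on it. The class name LoadFromStreamExample is made up for illustration.

```csharp
using System;
using System.IO;
using Bytescout.PDFExtractor;

class LoadFromStreamExample
{
    static void Main()
    {
        // Keep the stream open for the whole lifetime of the extractor
        // (assumption: the SDK may read from it lazily).
        using (FileStream stream = File.OpenRead("sample1.pdf"))
        {
            TextExtractor extractor = new TextExtractor();
            try
            {
                extractor.RegistrationName = "demo";
                extractor.RegistrationKey = "demo";

                // Assumed signature: LoadDocumentFromStream(Stream)
                extractor.LoadDocumentFromStream(stream);

                Console.WriteLine("Pages: " + extractor.GetPageCount());
            }
            finally
            {
                // Same cleanup call as in the console sample
                extractor.Dispose();
            }
        }
    }
}
```

This form can be handy when the PDF comes from an upload or a database rather than from a file on disk.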

Then we get the number of pages in the PDF with the “GetPageCount” method and loop through all pages to perform the search.

int pageCount = extractor.GetPageCount();

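Lastly, we use the “Find” method to search each page for the word “ipsum”. Its parameters are the page index, the string to find, and a case-sensitivity flag.

A do-while loop then calls “FindNext” until no more matches are found on the page. The coordinates of each match are printed to the console.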
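The search loop described above might be condensed as follows. This is only a sketch: it gathers the matches into a list instead of printing them one by one, and the CollectMatchesExample class and the matches list are names made up for illustration. It uses only the SDK calls that already appear in the C# sample (Find, FindNext, FoundText.Bounds, Dispose).

```csharp
using System;
using System.Collections.Generic;
using Bytescout.PDFExtractor;

class CollectMatchesExample
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor();
        try
        {
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";
            extractor.LoadDocumentFromFile("sample1.pdf");
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

            // Hypothetical container: one line of text per match
            List<string> matches = new List<string>();
            int pageCount = extractor.GetPageCount();

            for (int i = 0; i < pageCount; i++)
            {
                // Arguments: page index, string to find, case sensitivity
                if (!extractor.Find(i, "ipsum", false))
                    continue; // no match on this page

                do
                {
                    matches.Add("Page index " + i + ": " + extractor.FoundText.Bounds.ToString());
                }
                while (extractor.FindNext()); // advance to the next match on the page
            }

            foreach (string match in matches)
                Console.WriteLine(match);
            Console.WriteLine(matches.Count + " match(es) found.");
        }
        finally
        {
            // Release the extractor even if loading or searching throws
            extractor.Dispose();
        }
    }
}
```

Wrapping the work in try/finally means Dispose still runs when an exception occurs, which the straight-line console sample does not guarantee.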

That’s all, guys! I hope you find this article useful for understanding how to find text and get its coordinates using ByteScout PDF Extractor SDK.

Happy Coding!
