These sample code snippets show how to find text in PDF files and get its coordinates using ByteScout PDF Extractor SDK.
We’ve provided source code snippets below for ASP.NET, C#, VB.NET, and VBScript.
Let’s see the code and we’ll analyze it later in this article.
Take a look at this ASP.NET source code snippet, which loads a PDF file with ByteScout PDF Extractor SDK and writes the extracted text to the page response.
using System;
using Bytescout.PDFExtractor;

namespace ExtractText
{
    public partial class _Default : System.Web.UI.Page
    {
        /*
        IF YOU SEE TEMPORARY FOLDER ACCESS ERRORS:
        ByteScout SDK requires access to a temporary folder when it is used in a web application, because it caches some of its data there to work more efficiently.
        If you get errors such as "Access to the path 'C:\Windows\TEMP\... is denied", grant permission for that temporary folder on the machine and in the IIS configuration.

        SOLUTION:
        If your IIS Application Pool has the "Load User Profile" option enabled, IIS provides access to the user's temp folder. Check the user's temporary folder.
        If you are running the web application under an impersonated account or the IIS_IUSRS group, IIS may redirect all requests to a separate temp folder such as "c:\temp\".
        In this case:
        - check the User or User Group your web application is running under
        - then add permissions for this User or User Group to read and write to that temp folder (c:\temp or c:\windows\temp\)
        - restart your web application and try again
        */
        protected void Page_Load(object sender, EventArgs e)
        {
            String inputFile = Server.MapPath(@".\bin\sample1.pdf");

            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile(inputFile);

            Response.Clear();
            Response.ContentType = "text/html";
            Response.Write("<pre>");

            // Write extracted text to output stream
            extractor.SaveTextToStream(Response.OutputStream);
            Response.Write("</pre>");

            // Release the extractor before the response is ended
            extractor.Dispose();
            Response.End();
        }
    }
}
Check out this C# code snippet, which finds text in a PDF and gets its coordinates with the help of ByteScout PDF Extractor SDK.
using System;
using System.Drawing;
using Bytescout.PDFExtractor;

namespace FindText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile(@".\sample1.pdf");

            // Set the matching mode.
            // WordMatchingMode.None - treats the search string as substring
            // WordMatchingMode.ExactMatch - treats the search string as separate word
            // WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader).
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

            int pageCount = extractor.GetPageCount();

            // Page indexes are zero-based
            for (int i = 0; i < pageCount; i++)
            {
                // Search each page for "ipsum" string (third argument: case sensitivity)
                if (extractor.Find(i, "ipsum", false))
                {
                    do
                    {
                        Console.WriteLine("");
                        Console.WriteLine("Found on page " + i + " at location " + extractor.FoundText.Bounds.ToString());
                        Console.WriteLine("");

                        // Iterate through each element in the found text
                        foreach (SearchResultElement element in extractor.FoundText.Elements)
                        {
                            Console.WriteLine("Element #" + element.Index + " at left=" + element.Left + "; top=" + element.Top + "; width=" + element.Width + "; height=" + element.Height);
                            Console.WriteLine("Text: " + element.Text);
                            Console.WriteLine("Font is bold: " + element.FontIsBold);
                            Console.WriteLine("Font is italic: " + element.FontIsItalic);
                            Console.WriteLine("Font name: " + element.FontName);
                            Console.WriteLine("Font size: " + element.FontSize);
                            Console.WriteLine("Font color: " + element.FontColor);
                        }
                    }
                    while (extractor.FindNext());
                }
            }

            // Cleanup
            extractor.Dispose();

            Console.WriteLine();
            Console.WriteLine("Press any key to continue...");
            Console.ReadLine();
        }
    }
}
The following VB.NET source code can be used for searching text in PDF documents and getting coordinates via ByteScout PDF Extractor SDK.
Imports System.Drawing
Imports Bytescout.PDFExtractor

Class Program
    Friend Shared Sub Main(args As String())
        ' Create Bytescout.PDFExtractor.TextExtractor instance
        Dim extractor As New TextExtractor()
        extractor.RegistrationName = "demo"
        extractor.RegistrationKey = "demo"

        ' Load sample PDF document
        extractor.LoadDocumentFromFile(".\sample1.pdf")

        ' Set the matching mode.
        ' WordMatchingMode.None - treats the search string as substring;
        ' WordMatchingMode.ExactMatch - treats the search string as separate word;
        ' WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader).
        extractor.WordMatchingMode = WordMatchingMode.ExactMatch

        Dim pageCount As Integer = extractor.GetPageCount()

        ' Page indexes are zero-based
        For i As Integer = 0 To pageCount - 1
            ' Search each page for "ipsum" string (third argument: case sensitivity)
            If extractor.Find(i, "ipsum", False) Then
                Do
                    Console.WriteLine("")
                    Console.WriteLine("Found on page " & i.ToString() & " at location " & extractor.FoundText.Bounds.ToString())
                    Console.WriteLine("")

                    ' Iterate through each element in the found text
                    For Each element As SearchResultElement In extractor.FoundText.Elements
                        Console.WriteLine("Element #" & element.Index.ToString() & " at left=" & element.Left.ToString() & "; top=" & element.Top.ToString() & "; width=" & element.Width.ToString() & "; height=" & element.Height.ToString())
                        Console.WriteLine("Text: " & element.Text)
                        Console.WriteLine("Font is bold: " & element.FontIsBold.ToString())
                        Console.WriteLine("Font is italic: " & element.FontIsItalic.ToString())
                        Console.WriteLine("Font name: " & element.FontName)
                        Console.WriteLine("Font size: " & element.FontSize.ToString())
                        Console.WriteLine("Font color: " & element.FontColor.ToString())
                    Next
                Loop While extractor.FindNext()
            End If
        Next

        ' Cleanup
        extractor.Dispose()

        Console.WriteLine()
        Console.WriteLine("Press any key to continue...")
        Console.ReadLine()
    End Sub
End Class
Here’s a VBScript code sample that is handy for finding text in a PDF and getting its coordinates through ByteScout PDF Extractor SDK.
' Create Bytescout.PDFExtractor.TextExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")

' Set the matching mode:
' 0 = WordMatchingMode.None - treats the search string as substring;
' 1 = WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader);
' 2 = WordMatchingMode.ExactMatch - treats the search string as separate word.
extractor.WordMatchingMode = 2

' Get page count
pageCount = extractor.GetPageCount()

For i = 0 To pageCount - 1
    ' Parameters are: page index, string to find, case sensitivity.
    If extractor.Find(i, "ipsum", False) Then
        Do
            foundMessage = "Found word 'ipsum' on page #" & CStr(i) & " at { " & _
                "x = " & CStr(extractor.FoundText.Left) & "; " & _
                "y = " & CStr(extractor.FoundText.Top) & "; " & _
                "width = " & CStr(extractor.FoundText.Width) & "; " & _
                "height = " & CStr(extractor.FoundText.Height) & " }"
            elementInfo = ""

            ' Iterate through elements of the found text object
            For j = 0 To extractor.FoundText.ElementCount - 1
                Set element = extractor.FoundText.GetElement(j)
                elementInfo = elementInfo & "Element #" & CStr(j) & " at { x = " & CStr(element.Left) & "; y = " & CStr(element.Top) & "; width = " & CStr(element.Width) & "; height = " & CStr(element.Height) & " }" & vbCRLF
                elementInfo = elementInfo & "Text: " & CStr(element.Text) & vbCRLF
                elementInfo = elementInfo & "Font is bold: " & CStr(element.FontIsBold) & vbCRLF
                elementInfo = elementInfo & "Font is italic: " & CStr(element.FontIsItalic) & vbCRLF
                elementInfo = elementInfo & "Font name: " & CStr(element.FontName) & vbCRLF
                elementInfo = elementInfo & "Font size: " & CStr(element.FontSize) & vbCRLF
                elementInfo = elementInfo & "Font color (as OLE_COLOR): " & CStr(element.FontColorAsOleColor) & vbCRLF & vbCRLF
            Next

            WScript.Echo foundMessage & vbCRLF & vbCRLF & elementInfo
        Loop While extractor.FindNext
    End If
Next

WScript.Echo "Done"
Set extractor = Nothing
The C#, VB.NET, and VBScript snippets achieve the same functionality, so let’s review the C# code snippet here.
We’re using the Bytescout.PDFExtractor library here. If you want to code along, you need to install the ByteScout SDK on your machine. The ByteScout SDK is available at this link.
First of all, we create an instance of the “TextExtractor” class and set its registration name and key. We use the “demo” name and key here, which has its limitations, but for this demo it’s okay. If you are using the SDK in production, replace them with your actual registration name and key.
// Create Bytescout.PDFExtractor.TextExtractor instance
TextExtractor extractor = new TextExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";
We load the input PDF file into the text extractor instance with the “LoadDocumentFromFile” method. A stream can also serve as the input source through the “LoadDocumentFromStream” method.
// Load sample PDF document
extractor.LoadDocumentFromFile("sample1.pdf");
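For the stream-based alternative mentioned above, here is a minimal sketch. It assumes that “LoadDocumentFromStream” accepts a standard System.IO.Stream and that the stream should stay open while the extractor reads from it; neither detail is confirmed in this article, so check the SDK reference before relying on it. The class name LoadFromStreamSketch is ours, chosen for illustration only.

```csharp
using System;
using System.IO;
using Bytescout.PDFExtractor;

class LoadFromStreamSketch
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor();
        extractor.RegistrationName = "demo";
        extractor.RegistrationKey = "demo";

        // Open the sample PDF read-only and keep the stream open
        // for as long as the extractor is working with it.
        using (FileStream stream = File.OpenRead(@".\sample1.pdf"))
        {
            // Assumption: the method takes a System.IO.Stream argument
            extractor.LoadDocumentFromStream(stream);

            Console.WriteLine("Pages: " + extractor.GetPageCount());
        }

        // Cleanup
        extractor.Dispose();
    }
}
```

This can be handy when the PDF comes from an upload or a database rather than from a file on disk.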
Then we get the number of pages in the PDF with the “GetPageCount” method and loop through all pages to perform the search.
int pageCount = extractor.GetPageCount();
Lastly, we use the “Find” method to search each page of the input file for the word “ipsum”.
When a page contains a match, a do-while loop calls “FindNext” until no more matches are left on that page. The coordinates of each found text are printed to the console.
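The search step can be condensed into the sketch below. It uses only the calls that appear in the full C# listing earlier in this article (“Find”, “FindNext”, “FoundText.Bounds”) and trims the per-element output so the loop structure is easier to see. The class name FindLoopSketch is ours, for illustration only.

```csharp
using System;
using Bytescout.PDFExtractor;

class FindLoopSketch
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor();
        extractor.RegistrationName = "demo";
        extractor.RegistrationKey = "demo";
        extractor.LoadDocumentFromFile(@".\sample1.pdf");
        extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

        int pageCount = extractor.GetPageCount();

        for (int i = 0; i < pageCount; i++)
        {
            // Find arguments: page index, string to find, case sensitivity.
            // Skip the page when it contains no match.
            if (!extractor.Find(i, "ipsum", false))
                continue;

            do
            {
                // Bounds holds the rectangle of the current match
                Console.WriteLine("Page index " + i + ": " + extractor.FoundText.Bounds);
            }
            while (extractor.FindNext()); // advance to the next match on this page
        }

        // Cleanup
        extractor.Dispose();
    }
}
```

Note that “Find” positions the extractor on the first match, which is why the body runs once before “FindNext” is called.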
That’s all, guys. I hope you find this article useful for understanding how to find text using the ByteScout SDK.
Happy Coding!