These code samples show how to find text in PDF files and get its coordinates using ByteScout PDF Extractor SDK.
We’ve provided source code snippets below for ASP.NET, C#, VB.NET, and VBScript.
Let’s look at the code first; we’ll analyze it later in this article.
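Take a look at this ASP.NET source code snippet, which uses ByteScout PDF Extractor SDK inside a web page. Note that this one loads a PDF and writes its extracted text to the response; the search for a word and its coordinates is shown in the C#, VB.NET, and VBScript snippets that follow.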
using System;
using Bytescout.PDFExtractor;

namespace ExtractText
{
    public partial class _Default : System.Web.UI.Page
    {
        /*
        IF YOU SEE TEMPORARY FOLDER ACCESS ERRORS:

        Temporary folder access is required for web application when you use ByteScout SDK in it.
        If you are getting errors related to the access to temporary folder like
        "Access to the path 'C:\Windows\TEMP\... is denied" then you need to add permission
        for this temporary folder to make ByteScout SDK working on that machine and IIS configuration
        because ByteScout SDK requires access to temp folder to cache some of its data for more efficient work.

        SOLUTION:

        If your IIS Application Pool has "Load User Profile" option enabled the IIS provides access
        to user's temp folder. Check user's temporary folder

        If you are running Web Application under an impersonated account or IIS_IUSRS group,
        IIS may redirect all requests into separate temp folder like "c:\temp\".

        In this case
        - check the User or User Group your web application is running under
        - then add permissions for this User or User Group to read and write into that temp folder
          (c:\temp or c:\windows\temp\ folder)
        - restart your web application and try again
        */

        protected void Page_Load(object sender, EventArgs e)
        {
            String inputFile = Server.MapPath(@".\bin\sample1.pdf");

            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile(inputFile);

            Response.Clear();
            Response.ContentType = "text/html";

            Response.Write(" <pre>");

            // Write extracted text to output stream
            extractor.SaveTextToStream(Response.OutputStream);

            Response.Write("</pre> ");

            Response.End();
        }
    }
}
Check out this C# code snippet, which finds text in a PDF and gets its coordinates with the help of ByteScout PDF Extractor SDK.
using System;
using System.Drawing;
using Bytescout.PDFExtractor;

namespace FindText
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Load sample PDF document
            extractor.LoadDocumentFromFile(@".\sample1.pdf");

            // Set the matching mode.
            // WordMatchingMode.None - treats the search string as substring
            // WordMatchingMode.ExactMatch - treats the search string as separate word
            // WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader).
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

            int pageCount = extractor.GetPageCount();

            for (int i = 0; i < pageCount; i++)
            {
                // Search each page for "ipsum" string
                if (extractor.Find(i, "ipsum", false))
                {
                    do
                    {
                        Console.WriteLine("");
                        Console.WriteLine("Found on page " + i + " at location " + extractor.FoundText.Bounds.ToString());
                        Console.WriteLine("");

                        // Iterate through each element in the found text
                        foreach (SearchResultElement element in extractor.FoundText.Elements)
                        {
                            Console.WriteLine("Element #" + element.Index + " at left=" + element.Left + "; top=" + element.Top + "; width=" + element.Width + "; height=" + element.Height);
                            Console.WriteLine("Text: " + element.Text);
                            Console.WriteLine("Font is bold: " + element.FontIsBold);
                            Console.WriteLine("Font is italic:" + element.FontIsItalic);
                            Console.WriteLine("Font name: " + element.FontName);
                            Console.WriteLine("Font size:" + element.FontSize);
                            Console.WriteLine("Font color:" + element.FontColor);
                        }
                    }
                    while (extractor.FindNext());
                }
            }

            // Cleanup
            extractor.Dispose();

            Console.WriteLine();
            Console.WriteLine("Press any key to continue...");
            Console.ReadLine();
        }
    }
}
The following VB.NET source code can be used for searching text in PDF documents and getting coordinates via ByteScout PDF Extractor SDK.
Imports System.Drawing
Imports Bytescout.PDFExtractor

Class Program
    Friend Shared Sub Main(args As String())

        ' Create Bytescout.PDFExtractor.TextExtractor instance
        Dim extractor As New TextExtractor()
        extractor.RegistrationName = "demo"
        extractor.RegistrationKey = "demo"

        ' Load sample PDF document
        extractor.LoadDocumentFromFile(".\sample1.pdf")

        ' Set the matching mode.
        ' WordMatchingMode.None - treats the search string as substring;
        ' WordMatchingMode.ExactMatch - treats the search string as separate word;
        ' WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader).
        extractor.WordMatchingMode = WordMatchingMode.ExactMatch

        Dim pageCount As Integer = extractor.GetPageCount()

        For i As Integer = 0 To pageCount - 1

            ' Search each page for "ipsum" string
            If extractor.Find(i, "ipsum", False) Then
                Do
                    Console.WriteLine("")
                    Console.WriteLine(("Found on page " & i & " at location ") + extractor.FoundText.Bounds.ToString())
                    Console.WriteLine("")

                    ' Iterate through each element in the found text
                    For Each element As SearchResultElement In extractor.FoundText.Elements
                        Console.WriteLine((((("Element #" + element.Index.ToString() & " at left=") + element.Left.ToString() & "; top=") + element.Top.ToString() & "; width=") + element.Width.ToString() & "; height=") + element.Height.ToString())
                        Console.WriteLine("Text: " + element.Text)
                        Console.WriteLine("Font is bold: " + element.FontIsBold.ToString())
                        Console.WriteLine("Font is italic:" + element.FontIsItalic.ToString())
                        Console.WriteLine("Font name: " + element.FontName)
                        Console.WriteLine("Font size:" + element.FontSize.ToString())
                        Console.WriteLine("Font color:" + element.FontColor.ToString())
                    Next
                Loop While extractor.FindNext()
            End If
        Next

        ' Cleanup
        extractor.Dispose()

        Console.WriteLine()
        Console.WriteLine("Press any key to continue...")
        Console.ReadLine()
    End Sub
End Class
Here’s a VBScript code sample that is handy for finding text in a PDF and getting its coordinates through ByteScout PDF Extractor SDK.
' Create Bytescout.PDFExtractor.TextExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile("..\..\sample1.pdf")

' Set the matching mode:
' 0 = WordMatchingMode.None - treats the search string as substring;
' 1 = WordMatchingMode.SmartMatch - will find the word in various forms (like Adobe Reader);
' 2 = WordMatchingMode.ExactMatch - treats the search string as separate word.
extractor.WordMatchingMode = 2

' Get page count
pageCount = extractor.GetPageCount()

For i = 0 To PageCount - 1
    If extractor.Find(i, "ipsum", false) Then ' parameters are: page index, string to find, case sensitivity.
        Do
            foundMessage = "Found word 'ipsum' on page #" & CStr(i) & " at { " & _
                "x = " & CStr(extractor.FoundText.Left) & "; " & _
                "y = " & CStr(extractor.FoundText.Top) & "; " & _
                "width = " & CStr(extractor.FoundText.Width) & "; " & _
                "height = " & CStr(extractor.FoundText.Height) & " }"

            elementInfo = ""

            ' Iterate through elements of the found text object
            For j = 0 to extractor.FoundText.ElementCount - 1
                Set element = extractor.FoundText.GetElement(j)
                elementInfo = elementInfo & "Element #" & CStr(j) & " at { x = " & CStr(element.Left) & "; y = " & CStr(element.Top) & "; width = " & CStr(element.Width) & "; height = " & CStr(element.Height) & vbCRLF
                elementInfo = elementInfo & "Text: " & CStr(element.Text) & vbCRLF
                elementInfo = elementInfo & "Font is bold: " & CStr(element.FontIsBold) & vbCRLF
                elementInfo = elementInfo & "Font is italic: " & CStr(element.FontIsItalic) & vbCRLF
                elementInfo = elementInfo & "Font name: " & CStr(element.FontName) & vbCRLF
                elementInfo = elementInfo & "Font size: " & CStr(element.FontSize) & vbCRLF
                elementInfo = elementInfo & "Font color (as OLE_COLOR): " & CStr(element.FontColorAsOleColor) & vbCRLF & vbCRLF
            Next

            WScript.Echo foundMessage & vbCRLF & vbCRLF & elementInfo
        Loop While extractor.FindNext
    End If
Next

WScript.Echo "Done"

Set extractor = Nothing
The C#, VB.NET, and VBScript snippets all do the same thing (the ASP.NET one only extracts the text), so let’s review the C# code snippet here.
We’re using the Bytescout.PDFExtractor library here. If you want to code along, you need to install ByteScout PDF Extractor SDK on your machine. The SDK is available for download from ByteScout.
First of all, we create an instance of the “TextExtractor” class and set its registration name and key. We’re using “demo” for both here, which has its limitations, but for this demo it’s okay. If you use the SDK in production, replace them with your actual registration name and key.
// Create Bytescout.PDFExtractor.TextExtractor instance
TextExtractor extractor = new TextExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";
Next, we load the input PDF file into the text extractor instance using the “LoadDocumentFromFile” method. If the input comes from a stream instead, we can use the “LoadDocumentFromStream” method.
// Load sample PDF document
extractor.LoadDocumentFromFile("sample1.pdf");
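For the stream option mentioned above, a minimal sketch might look like the following. It assumes that “LoadDocumentFromStream” accepts a standard System.IO.Stream; the class name and the use of a FileStream as the source are only for illustration, so check the SDK reference for the exact signature before relying on it.

```csharp
using System;
using System.IO;
using Bytescout.PDFExtractor;

class LoadFromStreamSketch // hypothetical class name, for illustration only
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor();
        extractor.RegistrationName = "demo";
        extractor.RegistrationKey = "demo";

        try
        {
            // Any readable stream should do; a FileStream keeps the sketch simple
            using (FileStream stream = File.OpenRead("sample1.pdf"))
            {
                // Assumption: LoadDocumentFromStream takes a System.IO.Stream
                extractor.LoadDocumentFromStream(stream);

                // Confirm the document was loaded by reading its page count
                Console.WriteLine("Pages: " + extractor.GetPageCount());
            }
        }
        finally
        {
            // Same cleanup call as in the full sample
            extractor.Dispose();
        }
    }
}
```

This can be handy in a web application where the PDF arrives as an upload and never touches the disk.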
Then we get the number of pages in the PDF using the “GetPageCount” method and loop through all the pages to perform the search.
int pageCount = extractor.GetPageCount();
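Lastly, we use the “Find” method to search for the word “ipsum” on each page. Its parameters are the page index, the string to find, and the case-sensitivity flag (false here). Because the sample sets “WordMatchingMode” to ExactMatch beforehand, “ipsum” is matched only as a separate word.
When “Find” returns true, a do-while loop calls “FindNext” until no more matches are left on the page. For each match, the coordinates of the found text (“FoundText.Bounds”) are printed to the console, followed by the position, text, and font details of each of its elements.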
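To make these last steps easier to follow, here is the search loop cut down to its core. It is a trimmed sketch of the full C# sample above, not a different technique: it only prints the page index and the bounds of each match, and it moves the cleanup into a finally block so that Dispose is called even if loading fails. The class name is made up for illustration.

```csharp
using System;
using Bytescout.PDFExtractor;

class FindLoopSketch // hypothetical class name, for illustration only
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor();
        extractor.RegistrationName = "demo";
        extractor.RegistrationKey = "demo";

        try
        {
            extractor.LoadDocumentFromFile("sample1.pdf");

            // Match "ipsum" only as a separate word
            extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

            int pageCount = extractor.GetPageCount();
            for (int i = 0; i < pageCount; i++)
            {
                // Find(page index, search string, case sensitivity) returns
                // true when the page contains at least one match
                if (!extractor.Find(i, "ipsum", false))
                    continue;

                do
                {
                    // FoundText describes the current match; Bounds is its rectangle
                    Console.WriteLine("Page index " + i + ": " + extractor.FoundText.Bounds.ToString());
                }
                while (extractor.FindNext()); // move on until the page has no more matches
            }
        }
        finally
        {
            extractor.Dispose();
        }
    }
}
```

Keep in mind that the page index is zero-based, so add 1 if you want to show a page number to a user.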
That’s all, guys! I hope this article helps you understand how to find text in PDF files and get its coordinates using ByteScout PDF Extractor SDK.
Happy Coding!