PDF Extractor SDK Explained: Search or Find Text in PDF - ByteScout


In this program, we're going to see how to find text in a PDF. First, we create a TextExtractor object. Then we load the document and set the word matching mode. We iterate through all the pages and, on each page, search for the text and then iterate through every match. Finally, we display the results, which are made up of search result elements.


How to Search PDF Files

Now add a sample PDF to the project; I already have one on the desktop. We are going to search this PDF for the word "ipsum". It occurs three times: twice on the first page and once on the second. Copy and paste the file into the Solution Explorer window. I'm also going to have it copied to the output directory (Copy always), and we are all set.

Search PDF Files

Now create the extractor: TextExtractor extractor = new TextExtractor("demo", "demo"). Here I am passing the registration name and key to the constructor itself. The next step is to load the document: extractor.LoadDocumentFromFile("sample_program2.pdf"). The third step is to set WordMatchingMode, which has three options: ExactMatch, None, and SmartMatch. ExactMatch matches the keyword only as a whole word. None treats the keyword as a substring, so it also finds partial matches where the keyword is contained in a longer word. For example, searching for "Document" would also match inside "LoadDocumentFromFile". SmartMatch searches for the word in its various forms, much like the search in Adobe Acrobat itself.
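The three setup steps above can be sketched as follows. This is a minimal sketch rather than the exact code from the demo: it assumes the SDK assembly is referenced by the project and that its types live in the Bytescout.PDFExtractor namespace, which the article does not state.

```csharp
using System;
using Bytescout.PDFExtractor; // assumed namespace; not stated in the article

class SetupSketch
{
    static void Main()
    {
        // Step 1: create the extractor, passing the registration name and key
        // ("demo" values, as in the article) to the constructor.
        TextExtractor extractor = new TextExtractor("demo", "demo");

        // Step 2: load the sample PDF that was copied to the output directory.
        extractor.LoadDocumentFromFile("sample_program2.pdf");

        // Step 3: choose how the keyword is matched.
        //   ExactMatch - whole words only
        //   None       - keyword treated as a substring
        //   SmartMatch - various word forms, similar to Acrobat's search
        extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

        Console.WriteLine("Document loaded, matching mode: " + extractor.WordMatchingMode);
    }
}
```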

Search PDF File

In this demo, we use the exact match, which is written as extractor.WordMatchingMode = WordMatchingMode.ExactMatch. We iterate through all the pages, using the page count taken from the PDF. On each page we run the search: if (extractor.Find(index, "ipsum", false)). The next step is to display the results with foreach (SearchResultElement element in extractor.FoundText.Elements) and print the properties of each element: Console.WriteLine($"Page: {index + 1}, Top: {element.Top}, Left: {element.Left}, Height: {element.Height}, Width: {element.Width}"). Let's also print the font details, such as whether the text is bold: Console.WriteLine($"Bold: {element.FontIsBold}, Italics: {element.FontIsItalic}, FontName: {element.FontName}, FontSize: {element.FontSize}, Color: {element.FontColor}").
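Put together, the search step described above might look like the sketch below. Several details are assumptions rather than facts from the article: the Bytescout.PDFExtractor namespace, the GetPageCount() call used to count the pages, and the reading of the third Find argument as a case-sensitivity flag. The element type is written as SearchResultElement, following the article; some SDK versions may expose an interface named ISearchResultElement instead. As written at this point in the walkthrough, the code reports only the first match on each page.

```csharp
using System;
using Bytescout.PDFExtractor; // assumed namespace; not stated in the article

class FirstMatchPerPage
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor("demo", "demo");
        extractor.LoadDocumentFromFile("sample_program2.pdf");
        extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

        // Assumption: GetPageCount() returns the number of pages in the loaded PDF.
        int pageCount = extractor.GetPageCount();

        for (int index = 0; index < pageCount; index++)
        {
            // Find takes a zero-based page index and the keyword; the last
            // argument is assumed to be the case-sensitivity flag.
            if (extractor.Find(index, "ipsum", false))
            {
                // A match can consist of several elements (text fragments).
                foreach (SearchResultElement element in extractor.FoundText.Elements)
                {
                    Console.WriteLine($"Page: {index + 1}, Top: {element.Top}, Left: {element.Left}, " +
                                      $"Height: {element.Height}, Width: {element.Width}");
                    Console.WriteLine($"Bold: {element.FontIsBold}, Italics: {element.FontIsItalic}, " +
                                      $"FontName: {element.FontName}, FontSize: {element.FontSize}, " +
                                      $"Color: {element.FontColor}");
                }
            }
        }
    }
}
```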

PDF Search

Now make sure we end with Console.ReadLine(); so that the console stays open and we can read the output, then start execution. Here we get two results, one on page 1 and one on page 2. But remember that the PDF itself contains three occurrences: two on page 1 and one on page 2. Let's see what went wrong. Find only returns the first match on a page; once it succeeds, we need to keep iterating with FindNext: do { ... } while (extractor.FindNext());.
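With that fix applied, the whole program might look like the following sketch. It rests on the same assumptions as before, none of which are confirmed by the article: the Bytescout.PDFExtractor namespace, a GetPageCount() method, the meaning of the boolean Find argument, and the SearchResultElement type name. The do/while form runs the display code for the first match and then repeats it for as long as FindNext() reports another match on the same page.

```csharp
using System;
using Bytescout.PDFExtractor; // assumed namespace; not stated in the article

class FindAllMatches
{
    static void Main()
    {
        TextExtractor extractor = new TextExtractor("demo", "demo");
        extractor.LoadDocumentFromFile("sample_program2.pdf");
        extractor.WordMatchingMode = WordMatchingMode.ExactMatch;

        // Assumption: GetPageCount() returns the number of pages in the loaded PDF.
        int pageCount = extractor.GetPageCount();

        for (int index = 0; index < pageCount; index++)
        {
            // Find locates the first match on the page, if there is one.
            if (extractor.Find(index, "ipsum", false))
            {
                do
                {
                    // FoundText describes the current match.
                    foreach (SearchResultElement element in extractor.FoundText.Elements)
                    {
                        Console.WriteLine($"Page: {index + 1}, Top: {element.Top}, Left: {element.Left}, " +
                                          $"Height: {element.Height}, Width: {element.Width}");
                        Console.WriteLine($"Bold: {element.FontIsBold}, Italics: {element.FontIsItalic}, " +
                                          $"FontName: {element.FontName}, FontSize: {element.FontSize}, " +
                                          $"Color: {element.FontColor}");
                    }
                }
                // FindNext advances to the next match on the same page and
                // returns false when there are no more.
                while (extractor.FindNext());
            }
        }

        // Keep the console window open so the output can be read.
        Console.ReadLine();
    }
}
```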

Find Text in PDF

So let us try again. Now it shows two results on page 1 and one on page 2. The output also reports text that is slightly colored, has a smaller font size, and is both italic and bold, so it is working correctly. This is how we can find text with the ByteScout PDF Extractor SDK. We also get fine-grained control over the found results, and can read most of the specific details about the found text.

Text Search with PDF Extractor SDK