How to Extract Document Info and Metadata using PDF Extractor SDK in C#

PDF Extractor SDK extracts not only document content such as text and images, but also basic and detailed information about a PDF document, including its metadata.

The InfoExtractor class implements the IInfoExtractor interface. All properties and methods of this interface are listed on the online documentation page: https://docs.bytescout.com/pdf-extractor-sdk-t-bytescout-pdfextractor-iinfoextractor

Initialization and document loading work the same way as in the other SDK extractors:

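InfoExtractor extractor = new InfoExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";
// Load the PDF to inspect ("sample.pdf" is a placeholder path)
extractor.LoadDocumentFromFile("sample.pdf");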

Basic document properties such as the title, author, subject, keywords, and bookmarks can be read through the corresponding extractor properties:

Console.WriteLine("Author:       " + extractor.Author);
Console.WriteLine("Subject:      " + extractor.Subject);
Console.WriteLine("Title:        " + extractor.Title);
Console.WriteLine("Keywords:     " + extractor.Keywords);
Console.WriteLine("Bookmarks:    " + extractor.Bookmarks);

To check whether the document is encrypted, use the Encrypted property. The EncryptionAlgorithm property reports the algorithm with which the PDF was encrypted:

Console.WriteLine("Encrypted:" + extractor.Encrypted);
Console.WriteLine($"EncryptionAlgorithm: {extractor.EncryptionAlgorithm}");

To check whether the document contains user-defined properties beyond the standard property names, use the InfoExtractor.CustomProperties dictionary, which holds the properties that users have added to the PDF.
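
As a rough sketch of how such a dictionary might be listed, the helper below formats key/value pairs one per line. It assumes the custom properties are exposed as string keys with string values, which should be checked against the SDK documentation. CustomPropertyPrinter and Format are hypothetical names that are not part of the SDK, and the sample dictionary only stands in for data that would come from the extractor.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of PDF Extractor SDK.
public static class CustomPropertyPrinter
{
    // Assumption: custom properties arrive as string key/value pairs.
    // Returns one "key: value" line per property, sorted by key.
    public static string Format(IEnumerable<KeyValuePair<string, string>> properties)
    {
        var lines = properties
            .OrderBy(p => p.Key, StringComparer.Ordinal)
            .Select(p => $"{p.Key}: {p.Value}");
        return string.Join(Environment.NewLine, lines);
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in data; in real code this would be read from the extractor.
        var sample = new Dictionary<string, string>
        {
            ["ReviewedBy"] = "J. Doe",
            ["Department"] = "Finance"
        };

        Console.WriteLine(CustomPropertyPrinter.Format(sample));
    }
}
```

Sorting by key keeps the output stable between runs, which helps when comparing the properties of several documents.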

The SDK can also read the metadata stream in XMP format, an XML-based metadata format embedded in PDF files. The InfoExtractor.GetMetadata() method returns the XMP metadata stream available in the document.
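
Because XMP is XML, the returned metadata can be examined with the standard .NET XML classes. The following is a minimal sketch, assuming the XMP packet is already available as a string; the exact return type of the SDK method should be confirmed in the documentation. XmpReader and FirstDcValue are hypothetical names, and the sample packet is illustrative data rather than output from a real document.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of PDF Extractor SDK.
public static class XmpReader
{
    // Standard namespaces used by XMP for Dublin Core and RDF elements.
    static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";
    static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns the first rdf:li value under the given Dublin Core element
    // (for example "title" or "creator"), or null if it is absent.
    public static string FirstDcValue(string xmp, string localName)
    {
        XDocument doc = XDocument.Parse(xmp);
        return doc.Descendants(Dc + localName)
                  .Descendants(Rdf + "li")
                  .Select(li => li.Value)
                  .FirstOrDefault();
    }
}

public static class Program
{
    public static void Main()
    {
        // Illustrative XMP packet; real packets usually carry more fields.
        string xmp = @"<x:xmpmeta xmlns:x='adobe:ns:meta/'>
  <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
    <rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
      <dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Sample Report</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>";

        Console.WriteLine("Title:   " + XmpReader.FirstDcValue(xmp, "title"));
        Console.WriteLine("Creator: " + XmpReader.FirstDcValue(xmp, "creator"));
    }
}
```

Looking elements up by namespace rather than by prefix matters here, since XMP producers are free to choose different prefixes for the same namespace.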
