Extractor SDK provides several ways to extract data from PDF documents. One of them is the XMLExtractor class.
XMLExtractor reads document data and transforms it into XML format.
The resulting XML has a table-like structure. Its root is the “document” element, whose “pageCount” and “pageCountWithOCRPerformed” attributes give the total number of pages in the document and the number of pages on which OCR analysis was performed, respectively.
The children of the “document” root are “page” elements, one for each page of the original document. Each “page” is in turn a collection of “row” elements, and each “row” is a collection of “column” elements:
<document pageCount="1" pageCountWithOCRPerformed="0">
  <page index="0" width="612" height="792" OCRWasPerformed="False">
    <row>
      <column>...</column>
      <column>...</column>
    </row>
    ...
  </page>
</document>
The actual data is contained in the columns. The element type depends on whether the data is a text node or a form field: the element is named “text” or “control”, respectively.
A text node has the attributes “fontName” and “fontSize”, plus “x”, “y”, “width”, and “height”, which give the original location and size of the text on the page. The element content is the text itself. For example:
<text fontName="Helvetica" fontSize="12.0" x="143.87" y="431.33" width="23.34" height="12.00">Text</text>
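As an illustration of how this structure can be consumed, here is a minimal Python sketch that walks the page, row, and column hierarchy and collects the “text” nodes. It uses only the standard library and makes no SDK calls. The sample XML and the function name read_text_nodes are hypothetical, and the sketch assumes that “text” elements sit somewhere inside “column” elements, as described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the output described above (not real SDK output).
SAMPLE_XML = """\
<document pageCount="1" pageCountWithOCRPerformed="0">
  <page index="0" width="612" height="792" OCRWasPerformed="False">
    <row>
      <column>
        <text fontName="Helvetica" fontSize="12.0" x="143.87" y="431.33" width="23.34" height="12.00">Text</text>
      </column>
      <column>
        <text fontName="Helvetica" fontSize="12.0" x="200.00" y="431.33" width="30.00" height="12.00">More</text>
      </column>
    </row>
  </page>
</document>
"""


def read_text_nodes(xml_string):
    """Return one dict per "text" element, in document order."""
    root = ET.fromstring(xml_string)
    nodes = []
    for page in root.findall("page"):
        for row_index, row in enumerate(page.findall("row")):
            for col_index, column in enumerate(row.findall("column")):
                # iter() also finds "text" elements nested deeper in the column.
                for text in column.iter("text"):
                    nodes.append({
                        "page": int(page.get("index")),
                        "row": row_index,
                        "column": col_index,
                        "font": text.get("fontName"),
                        "size": float(text.get("fontSize")),
                        "x": float(text.get("x")),
                        "y": float(text.get("y")),
                        "value": text.text or "",
                    })
    return nodes


if __name__ == "__main__":
    for node in read_text_nodes(SAMPLE_XML):
        print(node)
```

Keeping the row and column indexes next to each value preserves the table-like layout, which is convenient if the data is later written to a spreadsheet or CSV file.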
For a form field, the “control” element has, in addition to the “fontName”, “fontSize”, “x”, and “y” attributes, a “type” attribute that defines the form field type. Possible values of “type” are listbox, combobox, radiobutton, checkbox, and editbox:
<control type="combobox" id="Dropdown1" fontName="Helvetica" fontSize="12.0" x="141.87" y="107.53" width="72.00" height="40.00">
The content of a “control” element depends on the form field type. For instance, a combobox or listbox contains a set of “value” elements, and each selected value carries a “selected” attribute:
<control type="listbox" id="ListBox" fontName="Helvetica" fontSize="12.0" x="141.87" y="-56.91" width="100.00" height="144.00"><values><value>Item1</value><value selected="true">Item2</value><value selected="true">Item3</value></values></control>
Note that a listbox can have multiple selected values, whereas a combobox can have only one.
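The selected items of a listbox or combobox can be read back from the XML with a few lines of code. The following Python sketch uses only the standard library. The helper name selected_values is hypothetical, and the sketch assumes that a selected item is marked exactly as selected="true", as in the listbox example above.

```python
import xml.etree.ElementTree as ET

# The listbox example from above, reused as test input.
CONTROL_XML = (
    '<control type="listbox" id="ListBox" fontName="Helvetica" fontSize="12.0" '
    'x="141.87" y="-56.91" width="100.00" height="144.00">'
    '<values><value>Item1</value><value selected="true">Item2</value>'
    '<value selected="true">Item3</value></values></control>'
)


def selected_values(control):
    """Return the text of every "value" element marked selected="true"."""
    return [
        value.text or ""
        for value in control.iter("value")
        if value.get("selected") == "true"
    ]


if __name__ == "__main__":
    control = ET.fromstring(CONTROL_XML)
    print(control.get("type"), control.get("id"), selected_values(control))
```

For a combobox the returned list would hold at most one item, so callers that expect a single value can take the first element if the list is not empty.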
You can reduce the size of the output XML by setting the XMLExtractor.IndentedXML property to “false”. The output file is then smaller but harder to read, because the indentation is lost.
By default the property is set to “true”, which produces a more readable but larger file:
extractor.IndentedXML = true;
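To get a feel for the size trade-off, the following Python sketch serializes the same small document with and without indentation and compares the byte counts. It uses only the standard library's xml.dom.minidom and does not call the SDK, so the numbers only illustrate the effect and will differ from real XMLExtractor output. The sample XML and the function name xml_sizes are hypothetical.

```python
from xml.dom import minidom

# Hypothetical compact document, loosely shaped like the output described above.
COMPACT_XML = (
    '<document pageCount="1" pageCountWithOCRPerformed="0">'
    '<page index="0"><row><column><text>Text</text></column>'
    '<column><text>More</text></column></row></page></document>'
)


def xml_sizes(xml_string):
    """Return (compact_bytes, indented_bytes) for the same document."""
    root = minidom.parseString(xml_string).documentElement
    compact = root.toxml()                    # no added whitespace
    indented = root.toprettyxml(indent="  ")  # newlines and two-space indentation
    return len(compact.encode("utf-8")), len(indented.encode("utf-8"))


if __name__ == "__main__":
    compact_size, indented_size = xml_sizes(COMPACT_XML)
    print("compact:", compact_size, "bytes; indented:", indented_size, "bytes")
```

The overhead grows with nesting depth and element count, so documents with many small text nodes gain the most from turning indentation off.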
XMLExtractor has built-in OCR capabilities that let you extract text contained in embedded raster images. OCR is turned off by default and is controlled by the XMLExtractor.OCRMode property:
extractor.OCRMode = OCRMode.Auto;
To learn more about the available OCR modes, see this documentation link:
https://docs.bytescout.com/pdf-extractor-sdk-c-ocr-modes
For working with different OCR datasets you may find these links useful:
https://docs.bytescout.com/pdf-extractor-sdk-c-ocr-with-best-dataset
https://docs.bytescout.com/pdf-extractor-sdk-c-ocr-with-fast-dataset
Saving PDF data to an XML file is as simple as calling the XMLExtractor.SaveXMLToFile method:
extractor.SaveXMLToFile("E:/test.xml");
The XMLExtractor.SaveXMLToStream method saves the extracted data in XML format to the specified Stream instance.
The XMLExtractor.SaveImages property is set to ImageHandling.None by default, which is a reasonable default, so the resulting XML does not contain any embedded images. Setting it to ImageHandling.Embed produces XML that includes “image” elements whose “data” attributes hold the base64-encoded image content:
<image x="324.50" y="56.40" width="146.85" height="195.85" type="base64encoded" format="png" data="iVBORw0KGgoAAAANS……
Embedding images in the XML can lead to very large output files. To avoid this, use the ImageHandling.OuterFile setting, which extracts the document images and saves them to the folder specified by the XMLExtractor.ImageFolder property:
extractor.SaveImages = ImageHandling.OuterFile;
extractor.ImageFolder = "E:/Images";
In this case, the “data” attribute value contains the path to the saved image:
<image x="324.50" y="56.40" width="146.85" height="195.85" type="file" format="png" data="E:/Images\img-1-0.png" />
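A consumer of the XML has to handle both forms of the “image” element. The Python sketch below distinguishes them by the “type” attribute: it decodes the payload when the value is "base64encoded" and returns the file path when the value is "file". It uses only the standard library. The function name describe_image is hypothetical, and the embedded sample carries only the 8-byte PNG signature as a placeholder payload, not a real image.

```python
import base64
import xml.etree.ElementTree as ET

# Placeholder payload: just the PNG file signature, not a complete image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Hypothetical embedded-image element built around the placeholder payload.
EMBEDDED_XML = (
    '<image x="324.50" y="56.40" width="146.85" height="195.85" '
    'type="base64encoded" format="png" data="{}" />'
).format(base64.b64encode(PNG_SIGNATURE).decode("ascii"))

# The outer-file example from above (raw string keeps the backslash literal).
FILE_XML = (
    r'<image x="324.50" y="56.40" width="146.85" height="195.85" '
    r'type="file" format="png" data="E:/Images\img-1-0.png" />'
)


def describe_image(image):
    """Return ("bytes", decoded_data) or ("path", file_path) for an "image" element."""
    kind = image.get("type")
    data = image.get("data", "")
    if kind == "base64encoded":
        return "bytes", base64.b64decode(data)
    if kind == "file":
        return "path", data
    raise ValueError("unexpected image type: {!r}".format(kind))


if __name__ == "__main__":
    for xml_string in (EMBEDDED_XML, FILE_XML):
        kind, payload = describe_image(ET.fromstring(xml_string))
        print(kind, payload)
```

Decoded bytes can be written to a file opened in binary mode; the “format” attribute indicates which extension to use.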
You can choose the output image format with the XMLExtractor.ImageFormat property. The possible values are PNG, JPEG, GIF, and BMP:
extractor.ImageFormat = OutputImageFormat.JPEG;