PDF Extractor is a tool for extracting data from PDF files and scanned documents. It focuses on providing a structured representation of the original text, layout, images, vector graphics, and other content.
The following showcase demonstrates the difference between simply copying text from a PDF report that contains a table and extracting it with PDF Extractor:
The original sample PDF report was taken from this page:
This PDF has 18 pages of text and tables. Let’s take page 12, which contains both text and a table:
Now let’s select the chart and the table with text on this page:
Now let’s copy this text from the PDF viewer and paste it into Notepad to see what it looks like:
Figure 2: Negative Effects of Stress on Work Performance 0 10 20 30 40 50 60 70 80 1 3 5 7 9 11 13 15 1 42.8% Productivity 9 49.5% Cooperation 2 73.9% Job Satisfaction/Morale 10 45.2% Initiative 3 39.1% Decision Making Abilities 11 26.6% Reliability 4 52.7% Accuracy 12 39.4% Alertness 5 51.6% Creativity 13 35.5% Perseverance 6 28.0% Attention to Appearance 14 25.8% Tardiness 7 46.3% Organizational Skills 15 28.3% Absenteeism 8 65.2% Courtesy
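As you can see, there are some issues with this copied text: most visibly, the chart’s axis labels come out in a different order from the one they have on the page.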
Now let’s try to extract this text using ByteScout’s PDF extractor engine:
Figure 2: Negative Effects of Stress on Work Performance 15 13 11 9 7 5 3 1 0 10 20 30 40 50 60 70 80 1 42.8% Productivity 9 49.5% Cooperation 2 73.9% Job Satisfaction/Morale 10 45.2% Initiative 3 39.1% Decision Making Abilities 11 26.6% Reliability 4 52.7% Accuracy 12 39.4% Alertness 5 51.6% Creativity 13 35.5% Perseverance 6 28.0% Attention to Appearance 14 25.8% Tardiness 7 46.3% Organizational Skills 15 28.3% Absenteeism 8 65.2% Courtesy
As you can see, the issues are now fixed: the axis labels follow the page order, from 15 down to 1 and then from 0 to 80.
But this can be improved further, because the PDF Extractor engine can also extract text from a PDF into a JSON, XLS, XLSX, XML, or CSV representation. Let’s select the same area and use ByteScout’s free PDF Multitool app (which runs on the PDF Extractor engine) to demonstrate PDF to CSV extraction:
As you can see in the screenshot above, the CSV output has separate cells, so you can easily get the value of every cell in the original table. This also makes it easy to load data from a PDF as CSV into your own script or app without parsing the text any further.
The PDF Extractor engine can also generate JSON output. Here is a request to the cloud version of the engine, followed by its response:
curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/json2' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: YOUR_API_KEY' \
  --data-raw '{ "url": "http://www.wright.edu/~david.wilson/eng3000/samplereport.pdf", "inline": true }'

200
{
  "document": {
    "pageCount": "1",
    "pageCountWithOCRPerformed": "0",
    "page": {
      "index": "11",
      "OCRWasPerformed": "False",
      "row": [
        {
          "column": [
            { "text": "" },
            { "text": "" },
            { "text": { "fontName": "Times New Roman", "fontSize": "9.0", "fontStyle": "Bold", "x": "153.12", "y": "330.84", "width": "36.22", "height": "9.48", "text": "Figure 2:" } },
            { "text": { "fontName": "Times New Roman", "fontSize": "9.0", "fontStyle": "Bold", "x": "193.90", "y": "330.84", "width": "191.71", "height": "9.48", "text": "Negative Effects of Stress on Work Performance" } },
            ....
        {
          "column": [
            { "text": "" },
            { "text": "" },
            { "text": { "fontName": "Arial", "fontSize": "6.0", "x": "168.30", "y": "516.70", "width": "40.84", "height": "5.78", "text": "0 10" } },
            { "text": "" },
            { "text": { "fontName": "Arial", "fontSize": "6.0", "x": "235.33", "y": "516.70", "width": "6.87", "height": "5.78", "text": "20" } },
            { "text": { "fontName": "Arial", "fontSize": "6.0", "x": "268.39", "y": "516.70", "width": "6.93", "height": "5.78", "text": "30" } },
            { "text": { "fontName": "Arial", "fontSize": "6.0", "x": "301.51", "y": "516.70", "width": "6.87", "height": "5.78", "text": "40" } },
            { "text": { "fontName": "Arial", "fontSize": "6.0", "x": "334.56", "y": "516.70", "width": "6.87", "height": "5.78", "text": "50" } },
            { "text": { "fontName": "Arial", "fontSize": "6.0", "x": "367.68", "y": "516.70", "width": "6.87", "height": "5.78", "text": "60" } },
            { "text": { "fontName": "Arial", "fontSize": "6.0", "x": "400.74", "y": "516.70", "width": "6.87", "height": "5.78", "text": "70" } },
            { "text": { "fontName": "Arial", "fontSize": "6.0", "x": "433.80", "y": "516.70", "width": "6.93", "height": "5.78", "text": "80" } },
            { "text": "" }
          ]
        },
        {
          "column": [
            { "text": "" },
            { "text": "" },
            { "text": { "fontName": "Times New Roman", "fontSize": "8.0", "x": "164.22", "y": "549.24", "width": "4.01", "height": "7.98", "text": "1" } },
            { "text": { "fontName": "Times New Roman", "fontSize": "8.0", "x": "177.71", "y": "549.24", "width": "20.68", "height": "7.98", "text": "42.8%" } },
            ...
}
You can also get XML, XLS, XLSX, JSON, or HTML representations of the text and its structure. For example, here is PDF to XML output generated by the cloud version of the Web API:
curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/xml' \
  --header 'x-api-key: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data-raw '{ "url": "http://www.wright.edu/~david.wilson/eng3000/samplereport.pdf" }'

<?xml version="1.0" encoding="UTF-8"?>
<document pageCount="18" pageCountWithOCRPerformed="0">
<page index="0" width="612" height="792" OCRWasPerformed="False">
...
<row>
  <column><text fontName="Arial" fontSize="6.0" x="168.30" y="516.70" width="40.84" height="5.78">0 10</text></column>
  <column><text fontName="Arial" fontSize="6.0" x="235.33" y="516.70" width="6.87" height="5.78">20</text></column>
  <column><text fontName="Arial" fontSize="6.0" x="268.39" y="516.70" width="6.93" height="5.78">30</text></column>
  <column><text fontName="Arial" fontSize="6.0" x="301.51" y="516.70" width="6.87" height="5.78">40</text></column>
  <column><text fontName="Arial" fontSize="6.0" x="334.56" y="516.70" width="6.87" height="5.78">50</text></column>
  <column><text fontName="Arial" fontSize="6.0" x="367.68" y="516.70" width="6.87" height="5.78">60</text></column>
  <column><text fontName="Arial" fontSize="6.0" x="400.74" y="516.70" width="6.87" height="5.78">70</text></column>
  <column><text> </text></column>
  <column><text fontName="Arial" fontSize="6.0" x="433.80" y="516.70" width="6.93" height="5.78">80</text></column>
  <column><text> </text></column>
  <column><text> </text></column>
</row>
<row>
  <column><text fontName="Times New Roman" fontSize="8.0" x="164.22" y="549.24" width="4.01" height="7.98">1</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="177.71" y="549.24" width="20.68" height="7.98">42.8%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="209.19" y="549.24" width="39.62" height="7.98">Productivity</text></column>
  <column><text> </text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="338.45" y="549.24" width="4.01" height="7.98">9</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="353.26" y="549.24" width="20.68" height="7.98">49.5%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="384.74" y="549.24" width="39.62" height="7.98">Cooperation</text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
</row>
<row>
  <column><text fontName="Times New Roman" fontSize="8.0" x="164.22" y="558.42" width="4.00" height="7.98">2</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="177.70" y="558.42" width="20.68" height="7.98">73.9%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="209.18" y="558.42" width="76.29" height="7.98">Job Satisfaction/Morale</text></column>
  <column><text> </text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="334.43" y="558.42" width="8.01" height="7.98">10</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="353.24" y="558.42" width="20.68" height="7.98">45.2%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="384.71" y="558.42" width="28.94" height="7.98">Initiative</text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
</row>
<row>
  <column><text fontName="Times New Roman" fontSize="8.0" x="164.22" y="567.60" width="4.00" height="7.98">3</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="177.70" y="567.60" width="20.68" height="7.98">39.1%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="209.19" y="567.60" width="84.92" height="7.98">Decision Making Abilities</text></column>
  <column><text> </text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="334.43" y="567.60" width="8.00" height="7.98">11</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="353.23" y="567.60" width="20.66" height="7.98">26.6%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="384.69" y="567.60" width="33.80" height="7.98">Reliability</text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
</row>
<row>
  <column><text fontName="Times New Roman" fontSize="8.0" x="164.22" y="576.84" width="4.01" height="7.98">4</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="177.71" y="576.84" width="20.68" height="7.98">52.7%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="209.19" y="576.84" width="30.63" height="7.98">Accuracy</text></column>
  <column><text> </text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="334.38" y="576.84" width="8.02" height="7.98">12</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="353.20" y="576.84" width="20.68" height="7.98">39.4%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="384.68" y="576.84" width="30.29" height="7.98">Alertness</text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
</row>
<row>
  <column><text fontName="Times New Roman" fontSize="8.0" x="164.22" y="586.02" width="4.01" height="7.98">5</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="177.71" y="586.02" width="20.66" height="7.98">51.6%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="209.16" y="586.02" width="31.99" height="7.98">Creativity</text></column>
  <column><text> </text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="334.40" y="586.02" width="8.01" height="7.98">13</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="353.21" y="586.02" width="20.66" height="7.98">35.5%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="384.67" y="586.02" width="42.21" height="7.98">Perseverance</text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
</row>
<row>
  <column><text fontName="Times New Roman" fontSize="8.0" x="164.22" y="595.20" width="4.01" height="7.98">6</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="177.71" y="595.20" width="20.66" height="7.98">28.0%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="209.17" y="595.20" width="78.74" height="7.98">Attention to Appearance</text></column>
  <column><text> </text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="334.47" y="595.20" width="8.01" height="7.98">14</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="353.28" y="595.20" width="20.66" height="7.98">25.8%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="384.75" y="595.20" width="31.11" height="7.98">Tardiness</text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
</row>
<row>
  <column><text fontName="Times New Roman" fontSize="8.0" x="164.22" y="604.44" width="4.00" height="7.98">7</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="177.70" y="604.44" width="20.67" height="7.98">46.3%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="209.17" y="604.44" width="67.82" height="7.98">Organizational Skills</text></column>
  <column><text> </text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="334.41" y="604.44" width="8.01" height="7.98">15</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="353.22" y="604.44" width="20.68" height="7.98">28.3%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="384.70" y="604.44" width="41.33" height="7.98">Absenteeism</text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
  <column><text> </text></column>
</row>
<row>
  <column><text fontName="Times New Roman" fontSize="8.0" x="164.22" y="613.62" width="4.01" height="7.98">8</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="177.71" y="613.62" width="20.69" height="7.98">65.2%</text></column>
  <column><text fontName="Times New Roman" fontSize="8.0" x="209.20" y="613.62" width="28.90" height="7.98">Courtesy</text></column>
  <column><text> </text></column>
  <column> <text> </text>
  ...
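Because every non-empty cell in the XML above carries its own x and y attributes, a script can recover both the text and its position on the page. Here is a minimal Python sketch using the standard library's xml.etree.ElementTree on a trimmed copy of one row. The function name extract_rows is hypothetical, and the sample drops the font and size attributes for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of one <row> from the sample output (font attributes dropped).
SAMPLE = """
<document pageCount="18">
  <page>
    <row>
      <column><text x="164.22" y="549.24">1</text></column>
      <column><text x="177.71" y="549.24">42.8%</text></column>
      <column><text x="209.19" y="549.24">Productivity</text></column>
      <column><text> </text></column>
      <column><text x="338.45" y="549.24">9</text></column>
      <column><text x="353.26" y="549.24">49.5%</text></column>
      <column><text x="384.74" y="549.24">Cooperation</text></column>
      <column><text> </text></column>
    </row>
  </page>
</document>
"""


def extract_rows(xml_text):
    """Return one list per <row>, holding a dict for each non-empty cell."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for row in root.iter("row"):
        cells = []
        for text in row.iter("text"):
            value = (text.text or "").strip()
            if not value:
                continue  # placeholder cells contain only whitespace
            cells.append({
                "text": value,
                "x": float(text.get("x", "nan")),
                "y": float(text.get("y", "nan")),
            })
        rows.append(cells)
    return rows


rows = extract_rows(SAMPLE)
for cell in rows[0]:
    print(cell["x"], cell["text"])
```

Keeping the x coordinate makes it possible to tell the left half of the two-column table from the right half later, if a script needs that distinction.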
Or CSV output:
curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/csv' \
  --header 'x-api-key: YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data-raw '{ "url": "http://www.wright.edu/~david.wilson/eng3000/samplereport.pdf" }'

...
"negative influence on an individual's organizational efficiency and productivity.","","","","","","","","","","",
"","The findings that negatively affect work performance are shown in Figure 2.","","","","","","","","","",
"Figure 2:","Negative Effects of Stress on Work Performance","","","","","","","","","",
"15","","","","","","","","","","",
"13","","","","","","","","","","",
"11","","","","","","","","","","",
"9","","","","","","","","","","",
"7","","","","","","","","","","",
"5","","","","","","","","","","",
"3","","","","","","","","","","",
"1","","","","","","","","","","",
"0 10","20","30","40","50","60","70","","80","","",
"1","42.8%","Productivity","","9","49.5%","Cooperation","","","","",
"2","73.9%","Job Satisfaction/Morale","","10","45.2%","Initiative","","","","",
"3","39.1%","Decision Making Abilities","","11","26.6%","Reliability","","","","",
"4","52.7%","Accuracy","","12","39.4%","Alertness","","","","",
"5","51.6%","Creativity","","13","35.5%","Perseverance","","","","",
"6","28.0%","Attention to Appearance","","14","25.8%","Tardiness","","","","",
"7","46.3%","Organizational Skills","","15","28.3%","Absenteeism","","","","",
"8","65.2%","Courtesy","","","","","","","","",
"S-40","Copyright © Houghton Mifflin Company. All rights reserved.","","","","","","","","","",
"","","","","","","","","Sample Reports","","","","","",
"","","","","","","","","","","","","8","",
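In the CSV above, each table line holds up to two entries side by side (item number, percentage, label), surrounded by empty cells and non-table rows. The Python sketch below uses the standard csv module on a few of those rows and picks out every number/percentage/label triple. The function name table_records and the percentage pattern are illustrative assumptions, not part of the extractor.

```python
import csv
import io
import re

# A few rows copied from the sample CSV output.
SAMPLE = '''"Figure 2:","Negative Effects of Stress on Work Performance","","","","","","","","","",
"0 10","20","30","40","50","60","70","","80","","",
"1","42.8%","Productivity","","9","49.5%","Cooperation","","","","",
"2","73.9%","Job Satisfaction/Morale","","10","45.2%","Initiative","","","","",
"8","65.2%","Courtesy","","","","","","","","",
'''

# Matches cells such as "42.8%".
PERCENT = re.compile(r"^\d+(?:\.\d+)?%$")


def table_records(csv_text):
    """Return (number, percent, label) tuples sorted by item number."""
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        # A record is a percentage cell with an integer cell on its left
        # and a label cell on its right.
        for i in range(1, len(row) - 1):
            if PERCENT.match(row[i]) and row[i - 1].isdigit():
                number = int(row[i - 1])
                percent = float(row[i].rstrip("%"))
                records.append((number, percent, row[i + 1]))
    return sorted(records)


records = table_records(SAMPLE)
for number, percent, label in records:
    print(number, percent, label)
```

Matching on the percentage cell rather than on fixed column positions means the caption and axis rows are skipped without special handling.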
The PDF Extractor engine from ByteScout automates data extraction from scans and PDF documents. It extracts a structured representation of the text and preserves the original layout when needed. With CSV, XML, JSON, XLS, XLSX, and HTML output, it is easy to import the extracted data into your database, app, or script for further processing.
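To make the same calls from a script instead of curl, note that the three requests in this article differ only in the last path segment (json2, xml, csv) and in the optional "inline" flag. The Python sketch below builds such a request with the standard library without sending it. The function name build_request is hypothetical, and the shape of the response body is not assumed here.

```python
import json
import urllib.request

# Endpoint prefix taken from the curl examples in this article.
BASE_URL = "https://api.pdf.co/v1/pdf/convert/to/"


def build_request(output_format, pdf_url, api_key, inline=None):
    """Build (but do not send) the POST request used in the curl examples."""
    body = {"url": pdf_url}
    if inline is not None:
        body["inline"] = inline  # only the json2 example sets this flag
    return urllib.request.Request(
        BASE_URL + output_format,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
        },
        method="POST",
    )


request = build_request(
    "csv",
    "http://www.wright.edu/~david.wilson/eng3000/samplereport.pdf",
    "YOUR_API_KEY",
)
print(request.get_method(), request.full_url)

# Sending is left to the caller, for example:
# with urllib.request.urlopen(request) as response:
#     payload = response.read().decode("utf-8")
```

Replace YOUR_API_KEY with a real key before sending, and check the API documentation for how the converted output is returned.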