Why PDF Extractor is Different & its Advantages - ByteScout

How PDF Extractor is Different from Just Copy-Pasting Text from PDF and What are Advantages Provided by PDF Extractor SDK

PDF Extractor is a tool for extracting data from PDF files and scanned documents. It focuses on producing a structured representation of the original text, layout, images, vectors, and other page content.

Difference Between Copy-Pasting and Using PDF Extractor SDK or the PDF Extractor Web API via PDF.co

The following walkthrough shows the difference between simply copying text from a PDF report that contains a table and using PDF Extractor for the same task.

The original sample PDF report was taken from this page:

sample PDF report found by googling for PDF sample report

This PDF has 18 pages with text and tables. Let’s take page 12, which contains text and a table:

page 12 from a sample PDF report document

Now let’s select the chart and the table with text on this page:

text selection on page 12 in sample PDF report

Now let’s copy this text from the PDF viewer and paste it into Notepad to see what it looks like:

Figure 2: Negative Effects of Stress on Work Performance
0 10 20 30 40 50 60 70 80
1
3
5
7
9
11
13
15
1 42.8% Productivity 9 49.5% Cooperation
2 73.9% Job Satisfaction/Morale 10 45.2% Initiative
3 39.1% Decision Making Abilities 11 26.6% Reliability
4 52.7% Accuracy 12 39.4% Alertness
5 51.6% Creativity 13 35.5% Perseverance
6 28.0% Attention to Appearance 14 25.8% Tardiness
7 46.3% Organizational Skills 15 28.3% Absenteeism
8 65.2% Courtesy 

As you can see, the copied text has some issues:

  • the scale labels are inverted: 0 comes first and all vertical scale labels appear in reverse order compared with the original document
  • the table cells are not clearly separated; they simply follow one another in a row

Now let’s try to extract this text using ByteScout’s PDF extractor engine:

 Figure 2: Negative Effects of Stress on Work Performance
       15

       13

       11

        9

        7

        5

        3

        1
     0 10   20   30   40   50   60   70   80

         1  42.8%   Productivity                      9   49.5%   Cooperation
         2  73.9%   Job Satisfaction/Morale            10   45.2%   Initiative
         3  39.1%   Decision Making Abilities          11   26.6%   Reliability
         4  52.7%   Accuracy                        12   39.4%   Alertness
         5  51.6%   Creativity                       13   35.5%   Perseverance
         6  28.0%   Attention to Appearance            14   25.8%   Tardiness
         7  46.3%   Organizational Skills              15   28.3%   Absenteeism
         8  65.2%   Courtesy

As you can see, these issues are now fixed:

  • the labels for the vertical and horizontal axes appear in the same order as in the original
  • the cell values are separated by spaces
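
Because the cells in the layout-preserving output are separated by runs of spaces, a short script can already split them apart. The Python sketch below is only an illustration under one assumption: that cells are always separated by at least two spaces, while words inside a cell are separated by one. The helper name split_cells is hypothetical and is not part of any ByteScout API.

```python
import re

# Two lines copied from the layout-preserving output above.
lines = [
    "         1  42.8%   Productivity                      9   49.5%   Cooperation",
    "         2  73.9%   Job Satisfaction/Morale            10   45.2%   Initiative",
]

def split_cells(line):
    """Split a layout-preserved line into cells on runs of 2+ whitespace characters."""
    return re.split(r"\s{2,}", line.strip())

rows = [split_cells(line) for line in lines]
for row in rows:
    print(row)
```

This heuristic breaks down wherever two cells are only one space apart (the "0 10" axis labels above are such a case), which is one reason a structured output format is more reliable.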

But the PDF Extractor engine can improve on this further, because it can also extract text from PDF into JSON, XLS, XLSX, XML, or CSV. Let’s select the same area and use ByteScout’s free PDF Multitool app (which runs on the PDF Extractor engine) to demonstrate PDF to CSV extraction:

sample pdf report page 12 in pdf multitool doing pdf to csv

As you can see in the screenshot above, the CSV has separate cells, so you can easily get the value of every cell from the original table. This also makes it easy to load data from a PDF, via CSV, into your own script or app without parsing the text any further.

The PDF Extractor engine can also generate JSON output. Here is a request to the cloud version, followed by an excerpt of its response:


curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/json2' \
--header 'Content-Type: application/json' \
--header 'x-api-key: YOUR_API_KEY' \
--data-raw '{
    "url": "http://www.wright.edu/~david.wilson/eng3000/samplereport.pdf",
    "inline": true
}'

200
{
  "document": {
    "pageCount": "1",
    "pageCountWithOCRPerformed": "0",
    "page": {
      "index": "11",
      "OCRWasPerformed": "False",
      "row": [
        {
          "column": [
            {
              "text": ""
            },
            {
              "text": ""
            },
            {
              "text": {
                "fontName": "Times New Roman",
                "fontSize": "9.0",
                "fontStyle": "Bold",
                "x": "153.12",
                "y": "330.84",
                "width": "36.22",
                "height": "9.48",
                "text": "Figure 2:"
              }
            },
            {
              "text": {
                "fontName": "Times New Roman",
                "fontSize": "9.0",
                "fontStyle": "Bold",
                "x": "193.90",
                "y": "330.84",
                "width": "191.71",
                "height": "9.48",
                "text": "Negative Effects of Stress on Work Performance"
              }
            },
....
        {
          "column": [
            {
              "text": ""
            },
            {
              "text": ""
            },
            {
              "text": {
                "fontName": "Arial",
                "fontSize": "6.0",
                "x": "168.30",
                "y": "516.70",
                "width": "40.84",
                "height": "5.78",
                "text": "0 10"
              }
            },
            {
              "text": ""
            },
            {
              "text": {
                "fontName": "Arial",
                "fontSize": "6.0",
                "x": "235.33",
                "y": "516.70",
                "width": "6.87",
                "height": "5.78",
                "text": "20"
              }
            },
            {
              "text": {
                "fontName": "Arial",
                "fontSize": "6.0",
                "x": "268.39",
                "y": "516.70",
                "width": "6.93",
                "height": "5.78",
                "text": "30"
              }
            },
            {
              "text": {
                "fontName": "Arial",
                "fontSize": "6.0",
                "x": "301.51",
                "y": "516.70",
                "width": "6.87",
                "height": "5.78",
                "text": "40"
              }
            },
            {
              "text": {
                "fontName": "Arial",
                "fontSize": "6.0",
                "x": "334.56",
                "y": "516.70",
                "width": "6.87",
                "height": "5.78",
                "text": "50"
              }
            },
            {
              "text": {
                "fontName": "Arial",
                "fontSize": "6.0",
                "x": "367.68",
                "y": "516.70",
                "width": "6.87",
                "height": "5.78",
                "text": "60"
              }
            },
            {
              "text": {
                "fontName": "Arial",
                "fontSize": "6.0",
                "x": "400.74",
                "y": "516.70",
                "width": "6.87",
                "height": "5.78",
                "text": "70"
              }
            },
            {
              "text": {
                "fontName": "Arial",
                "fontSize": "6.0",
                "x": "433.80",
                "y": "516.70",
                "width": "6.93",
                "height": "5.78",
                "text": "80"
              }
            },
            {
              "text": ""
            }
          ]
        },
        {
          "column": [
            {
              "text": ""
            },
            {
              "text": ""
            },
            {
              "text": {
                "fontName": "Times New Roman",
                "fontSize": "8.0",
                "x": "164.22",
                "y": "549.24",
                "width": "4.01",
                "height": "7.98",
                "text": "1"
              }
            },
            {
              "text": {
                "fontName": "Times New Roman",
                "fontSize": "8.0",
                "x": "177.71",
                "y": "549.24",
                "width": "20.68",
                "height": "7.98",
                "text": "42.8%"
              }
            },
...
}
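
To show how such a response might be consumed, here is a minimal Python sketch that flattens the row/column structure into plain strings. It assumes the shape visible in the excerpt above: document.page is a single object, each row has a "column" list, and an empty cell carries "" where a filled cell carries an object with its own "text" key. The sample data is a trimmed copy of that excerpt, and the helper names cell_text and rows_as_text are hypothetical.

```python
import json

# Trimmed sample with the same shape as the response excerpt above.
sample = json.loads("""
{
  "document": {
    "page": {
      "index": "11",
      "row": [
        {"column": [
          {"text": ""},
          {"text": {"x": "164.22", "y": "549.24", "text": "1"}},
          {"text": {"x": "177.71", "y": "549.24", "text": "42.8%"}}
        ]}
      ]
    }
  }
}
""")

def cell_text(column):
    """Return the cell string; empty cells hold "" instead of an object."""
    value = column.get("text", "")
    return value["text"] if isinstance(value, dict) else value

def rows_as_text(page):
    """Convert one page object into a list of rows of plain strings."""
    return [[cell_text(col) for col in row["column"]] for row in page["row"]]

table = rows_as_text(sample["document"]["page"])
print(table)
```

The x and y values stay available in each cell object, so the same walk could keep coordinates if the position of a value on the page matters.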

Besides JSON, you can also get XML, XLS, XLSX, and HTML representations of the text and its structure. For example, here is PDF to XML output generated by the cloud version of the Web API:

curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/xml' \
--header 'x-api-key: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "http://www.wright.edu/~david.wilson/eng3000/samplereport.pdf"
}'


<?xml version="1.0" encoding="UTF-8"?>
<document pageCount="18" pageCountWithOCRPerformed="0">
  <page index="0" width="612" height="792" OCRWasPerformed="False">

...
<row>
<column>
<text fontName="Arial" fontSize="6.0" x="168.30" y="516.70" width="40.84" height="5.78">0 10</text>
</column>
<column>
<text fontName="Arial" fontSize="6.0" x="235.33" y="516.70" width="6.87" height="5.78">20</text>
</column>
<column>
<text fontName="Arial" fontSize="6.0" x="268.39" y="516.70" width="6.93" height="5.78">30</text>
</column>
<column>
<text fontName="Arial" fontSize="6.0" x="301.51" y="516.70" width="6.87" height="5.78">40</text>
</column>
<column>
<text fontName="Arial" fontSize="6.0" x="334.56" y="516.70" width="6.87" height="5.78">50</text>
</column>
<column>
<text fontName="Arial" fontSize="6.0" x="367.68" y="516.70" width="6.87" height="5.78">60</text>
</column>
<column>
<text fontName="Arial" fontSize="6.0" x="400.74" y="516.70" width="6.87" height="5.78">70</text>
</column>
<column>
<text> </text>
</column>
<column>
<text fontName="Arial" fontSize="6.0" x="433.80" y="516.70" width="6.93" height="5.78">80</text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
</row>
<row>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="164.22" y="549.24" width="4.01" height="7.98">1</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="177.71" y="549.24" width="20.68" height="7.98">42.8%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="209.19" y="549.24" width="39.62" height="7.98">Productivity</text>
</column>
<column>
<text> </text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="338.45" y="549.24" width="4.01" height="7.98">9</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="353.26" y="549.24" width="20.68" height="7.98">49.5%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="384.74" y="549.24" width="39.62" height="7.98">Cooperation</text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
</row>
<row>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="164.22" y="558.42" width="4.00" height="7.98">2</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="177.70" y="558.42" width="20.68" height="7.98">73.9%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="209.18" y="558.42" width="76.29" height="7.98">Job Satisfaction/Morale</text>
</column>
<column>
<text> </text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="334.43" y="558.42" width="8.01" height="7.98">10</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="353.24" y="558.42" width="20.68" height="7.98">45.2%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="384.71" y="558.42" width="28.94" height="7.98">Initiative</text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
</row>
<row>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="164.22" y="567.60" width="4.00" height="7.98">3</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="177.70" y="567.60" width="20.68" height="7.98">39.1%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="209.19" y="567.60" width="84.92" height="7.98">Decision Making Abilities</text>
</column>
<column>
<text> </text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="334.43" y="567.60" width="8.00" height="7.98">11</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="353.23" y="567.60" width="20.66" height="7.98">26.6%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="384.69" y="567.60" width="33.80" height="7.98">Reliability</text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
</row>
<row>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="164.22" y="576.84" width="4.01" height="7.98">4</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="177.71" y="576.84" width="20.68" height="7.98">52.7%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="209.19" y="576.84" width="30.63" height="7.98">Accuracy</text>
</column>
<column>
<text> </text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="334.38" y="576.84" width="8.02" height="7.98">12</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="353.20" y="576.84" width="20.68" height="7.98">39.4%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="384.68" y="576.84" width="30.29" height="7.98">Alertness</text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
</row>
<row>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="164.22" y="586.02" width="4.01" height="7.98">5</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="177.71" y="586.02" width="20.66" height="7.98">51.6%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="209.16" y="586.02" width="31.99" height="7.98">Creativity</text>
</column>
<column>
<text> </text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="334.40" y="586.02" width="8.01" height="7.98">13</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="353.21" y="586.02" width="20.66" height="7.98">35.5%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="384.67" y="586.02" width="42.21" height="7.98">Perseverance</text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
</row>
<row>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="164.22" y="595.20" width="4.01" height="7.98">6</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="177.71" y="595.20" width="20.66" height="7.98">28.0%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="209.17" y="595.20" width="78.74" height="7.98">Attention to Appearance</text>
</column>
<column>
<text> </text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="334.47" y="595.20" width="8.01" height="7.98">14</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="353.28" y="595.20" width="20.66" height="7.98">25.8%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="384.75" y="595.20" width="31.11" height="7.98">Tardiness</text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
</row>
<row>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="164.22" y="604.44" width="4.00" height="7.98">7</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="177.70" y="604.44" width="20.67" height="7.98">46.3%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="209.17" y="604.44" width="67.82" height="7.98">Organizational Skills</text>
</column>
<column>
<text> </text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="334.41" y="604.44" width="8.01" height="7.98">15</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="353.22" y="604.44" width="20.68" height="7.98">28.3%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="384.70" y="604.44" width="41.33" height="7.98">Absenteeism</text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
</column>
</row>
<row>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="164.22" y="613.62" width="4.01" height="7.98">8</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="177.71" y="613.62" width="20.69" height="7.98">65.2%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="209.20" y="613.62" width="28.90" height="7.98">Courtesy</text>
</column>
<column>
<text> </text>
</column>
<column>
<text> </text>
...
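
The XML rows can be read with any standard XML parser. The Python sketch below uses the standard library's xml.etree.ElementTree on one row copied from the output above (shortened to four columns). It assumes that placeholder cells are text elements with no x attribute, as in the excerpt. The helper name read_row is hypothetical.

```python
import xml.etree.ElementTree as ET

# One row from the XML output above, shortened to four columns.
snippet = """
<row>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="164.22" y="549.24" width="4.01" height="7.98">1</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="177.71" y="549.24" width="20.68" height="7.98">42.8%</text>
</column>
<column>
<text fontName="Times New Roman" fontSize="8.0" x="209.19" y="549.24" width="39.62" height="7.98">Productivity</text>
</column>
<column>
<text> </text>
</column>
</row>
"""

def read_row(row):
    """Return (value, x) pairs for every cell; x is None for placeholder cells."""
    cells = []
    for text in row.findall("./column/text"):
        value = (text.text or "").strip()
        x = text.get("x")
        cells.append((value, float(x) if x is not None else None))
    return cells

cells = read_row(ET.fromstring(snippet.strip()))
print(cells)
```

Keeping the x coordinate next to each value makes it possible to group cells into visual columns later, for example to tell the left half of the table from the right half.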

And here is the same document converted to CSV:

curl --location --request POST 'https://api.pdf.co/v1/pdf/convert/to/csv' \
--header 'x-api-key: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "http://www.wright.edu/~david.wilson/eng3000/samplereport.pdf"
}'

...
"negative influence on an individual's organizational efficiency and productivity.","","","","","","","","","","",
"","The findings that negatively affect work performance are shown in Figure 2.","","","","","","","","","",
"Figure 2:","Negative Effects of Stress on Work Performance","","","","","","","","","",
"15","","","","","","","","","","",
"13","","","","","","","","","","",
"11","","","","","","","","","","",
"9","","","","","","","","","","",
"7","","","","","","","","","","",
"5","","","","","","","","","","",
"3","","","","","","","","","","",
"1","","","","","","","","","","",
"0 10","20","30","40","50","60","70","","80","","",
"1","42.8%","Productivity","","9","49.5%","Cooperation","","","","",
"2","73.9%","Job Satisfaction/Morale","","10","45.2%","Initiative","","","","",
"3","39.1%","Decision Making Abilities","","11","26.6%","Reliability","","","","",
"4","52.7%","Accuracy","","12","39.4%","Alertness","","","","",
"5","51.6%","Creativity","","13","35.5%","Perseverance","","","","",
"6","28.0%","Attention to Appearance","","14","25.8%","Tardiness","","","","",
"7","46.3%","Organizational Skills","","15","28.3%","Absenteeism","","","","",
"8","65.2%","Courtesy","","","","","","","","",
"S-40","Copyright © Houghton Mifflin Company. All rights reserved.","","","","","","","","","",
"","","","","","","","","Sample Reports","","","","","",
"","","","","","","","","","","","","8","",
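
As an illustration of loading this CSV into a script, the Python sketch below parses three of the table rows shown above with the standard csv module and turns them into (rank, percentage, label) records. The column positions (0 to 2 for the left half of the table, 4 to 6 for the right half) are an assumption read off this particular excerpt and would differ for other documents.

```python
import csv
import io

# Three table rows copied from the CSV output above.
excerpt = '''"1","42.8%","Productivity","","9","49.5%","Cooperation","","","","",
"2","73.9%","Job Satisfaction/Morale","","10","45.2%","Initiative","","","","",
"8","65.2%","Courtesy","","","","","","","","",
'''

records = []
for row in csv.reader(io.StringIO(excerpt)):
    # Each line holds up to two (rank, percentage, label) triples.
    for start in (0, 4):
        rank, percent, label = row[start:start + 3]
        if rank:  # the right half is empty on the last table row
            records.append((int(rank), float(percent.rstrip("%")), label))

for record in records:
    print(record)
```

From here the records can be sorted, filtered, or written to a database without any further text parsing.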

Conclusion

The PDF Extractor engine from ByteScout automates data extraction from scans and PDF documents. It extracts a structured representation of the text and preserves the original layout when needed. With CSV, XML, JSON, XLS, XLSX, and HTML output, it is easy to import extracted data into your database, app, or script for further processing.
