Amazon AWS Textract vs PDF.co Web API - ByteScout


It is often said that “data is the new oil,” and many of today’s economic analyses support the claim: data may well become more valuable than oil. That raises two questions: where is this data stored, and how do we extract it in order to benefit from it?

Companies across industries and sectors produce huge amounts of data every day in many forms: physical documents, scanned documents, photos, digitally filled forms, spreadsheets, and electronic PDF files. A common problem is that this data is stored in unstructured formats that, in most cases, cannot easily be loaded into databases or software applications. To liberate and take advantage of the data trapped inside files, companies are looking for intelligent tools that can load and restore the original data from paper-based and electronic documents.

For example, regular OCR (optical character recognition) can extract text from images as raw, unstructured data. In most cases, though, raw text is not enough if you want to use and analyze the data: it requires manual review, editing, verification, and cleanup before it becomes valuable.

In this article, we give an overview of two tools: PDF.co Web API and Amazon Textract. Both products offer their own distinct features and have their own ecosystems.

We compare the two products under the following headings:

  1. Overview
  2. Features
  3. Product Modules
  4. Sample Request/Response
  5. Useful Links


Overview

Amazon Textract

Amazon Textract is a cloud-based service from Amazon AWS that extracts text from scanned documents. The input document must be provided either as a BLOB or as a file uploaded to the Amazon S3 storage service. Textract provides two kinds of operations: one detects text in the document, and the other extracts data from it. Text detection finds words, lines, and paragraphs; extraction can return data as raw text, tables, or forms. Textract works only in the cloud: you cannot run it on your own server, and you have to upload all your data to Amazon.

PDF.co Web API

PDF.co provides a Web API focused on scalable and intelligent data extraction. The API extracts text and data from both native electronic documents (PDF, RTF, DOC, DOCX, XLS) and scanned documents (PDF, JPG, PNG, TIFF). It can extract text with layout and text order preserved, structured data from tables, and form field values. PDF.co Web API also includes a Document Parser feature for template-based data extraction: you design a template and then use it for precise and accurate data extraction from documents. Other features include converting formats such as images, spreadsheets, and XML to PDF, and converting PDF to those formats.

Unlike Amazon Textract, PDF.co Web API is also available in an on-premise version for enterprise companies, as the ByteScout API Server product. You can install it on your own in-house server and use it with local files, with no Internet connection required. The on-premise version can work offline in isolated environments, and you can also deploy it to your own private cloud. Demo videos of the on-premise version are also available.

Features

Amazon Textract

The following are the main features provided by Amazon Textract:

  • Optical Character Recognition (OCR)
    • Amazon Textract uses OCR technology to detect and extract text from a scanned document.
  • Form Extraction
    • When the analysis service is called in “Form” mode, Amazon Textract tries to detect all forms in the document and returns their contents. It uses machine learning to detect form fields efficiently.
  • Table Extraction
    • AWS Textract provides a table detection and extraction feature which returns data from tables in a structured form.
  • Confidence Thresholds
    • For every piece of extracted information (text, form, or table), Amazon Textract provides a confidence score. The developer can use this score to decide whether or not to trust a given result.
  • Bounding Boxes
    • With Amazon Textract, all extracted data comes with coordinates. The result can be a line, a word, a table, or even a table cell; in every case it includes coordinates, so the developer can tell where in the document the extracted text is located.
  • On-Premise Version
    • Not available
  • Technical Support
    • Requires a separate purchase
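
To illustrate how a confidence score might be used, here is a minimal Python sketch that keeps only the blocks at or above a threshold. It assumes blocks shaped like the entries of the Textract response shown later in this article (with "BlockType", "Confidence", and "Text" keys). The function name, the 90.0 threshold, and the sample data are illustrative assumptions, not part of either product.

```python
from typing import Any, Dict, List


def filter_by_confidence(blocks: List[Dict[str, Any]],
                         min_confidence: float = 90.0) -> List[Dict[str, Any]]:
    """Return the blocks whose Confidence is at least min_confidence.

    Blocks with no Confidence value are treated as 0 and dropped.
    """
    return [b for b in blocks if b.get("Confidence", 0.0) >= min_confidence]


# Hypothetical sample data, not real Textract output.
sample_blocks = [
    {"BlockType": "LINE", "Confidence": 99.1, "Text": "Invoice No. 123456"},
    {"BlockType": "LINE", "Confidence": 42.7, "Text": "lnv0ice Dat3"},
    {"BlockType": "LINE", "Text": "no score attached"},
]

trusted = filter_by_confidence(sample_blocks)
print([b["Text"] for b in trusted])
```

A suitable threshold depends on the documents and on how costly a wrong value is, so it is usually worth tuning on real samples.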

PDF.co Web API

The following are some of the features provided by PDF.co Web API:

  • Optical Character Recognition (OCR)
    • PDF.co Web API automatically detects if documents need to be run through OCR during data extraction. If OCR is not required then it uses a more accurate native PDF engine to extract text and other data.
  • Forms and fields extraction as CSV, XML, JSON, TXT from PDF files and scanned images
    • PDF.co Web API extracts structured data from PDF in CSV, XML, JSON, TXT formats and returns results as values or as downloadable links to generated data. The latter is useful if you need to use the extracted data for post-processing with other tools and APIs.
  • Table Extraction from PDF and scanned documents
    • PDF.co Web API analyzes source documents and re-creates the original table structure using AI and machine learning. It can also locate tables automatically and extract them as CSV, XML, or JSON values.
  • Template-based Data Extraction from PDF and images that provides high accuracy
    • This is one of the hidden gems of PDF.co Web API. You can create Document Parser templates without coding and pass a template to the API, which parses the input document against it and returns the results as JSON. Templates are very flexible and can be updated easily, with no programming required.
  • Supports PDF, DOC, DOCX, XLS, RTF, JPG, PNG, TIFF documents and scans
    • PDF.co supports different formats as input. It can load CSV, XLS, JSON, Image formats like JPG, PNG and TIFF, Text, HTML, XML, etc.
  • The analysis returns confidence levels, fonts and text object coordinates information
    • PDF.co Web API can extract data from PDF files and scanned images and return the extracted text as JSON or XML, with extended information about confidence levels, font names, font sizes, font colors, and bounding rectangle coordinates for every text object.
  • Bounding Boxes
    • PDF.co Web API returns bounding box coordinates for text objects from both native PDF files and scanned documents.
  • PDF Tools: merging, splitting, compressing, redacting sensitive data, adding images and text
    • PDF.co provides an additional set of PDF tools for splitting PDFs, merging PDF files, compressing PDF documents, adding images to PDFs, searching and removing text, searching and replacing text, and redacting sensitive data.
  • Barcoding tools to automatically detect and decode barcodes inside images and PDF files
    • PDF.co provides an additional barcode reading feature. It can automatically detect and decode all popular barcodes such as Code 39, Code 128, EAN, UPC, QR Code, Datamatrix, PDF417, and many others. Noisy documents, damaged and rotated scans, and pictures are supported.
  • On-Premise Self-hosted Version
  • Zapier and Integromat integrations
    • The Zapier plugin is available at no additional fee.
    • Integromat integration is available through direct API calls (see the PDF.co API docs).
  • Technical Support
    • included for free!

Product Modules

Amazon Textract

Amazon AWS aims to provide basic APIs for everything, not just data extraction, and is better known as a provider of cloud-based servers and storage. Textract is built as one of the modules inside the Amazon AWS product ecosystem. The following is the most common use case, as often pitched by Amazon.

First, the input image file is loaded into an S3 bucket. It is then processed by a Lambda function, which invokes Textract. After the text is detected and extracted, its output can be stored in DynamoDB or another place of your choice. Finally, it is loaded into the Elasticsearch service, from where it can be consumed by third-party apps.
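
The first two steps of this pipeline can be sketched in Python roughly as follows. This is only an outline under stated assumptions: the function names are hypothetical, the event follows the usual shape of an S3 upload notification, and the Textract call is injected as a plain function so the sketch runs without AWS credentials (in a real Lambda function it could be the boto3 Textract client's detect_document_text method). Storing the result in DynamoDB and indexing it in Elasticsearch are left out.

```python
from typing import Any, Callable, Dict, List
from urllib.parse import unquote_plus


def textract_params_from_s3_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Build DetectDocumentText parameters from the first record of an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])
    return {"Document": {"S3Object": {"Bucket": bucket, "Name": key}}}


def handle_upload(event: Dict[str, Any],
                  detect_text: Callable[..., Dict[str, Any]]) -> List[str]:
    """Run text detection for an uploaded file and return the detected lines."""
    response = detect_text(**textract_params_from_s3_event(event))
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


# Hypothetical event, plus a stub standing in for the real Textract call.
fake_event = {"Records": [{"s3": {"bucket": {"name": "my-bucket"},
                                  "object": {"key": "scans/invoice+1.png"}}}]}


def fake_detect(**params: Any) -> Dict[str, Any]:
    return {"Blocks": [{"BlockType": "PAGE"},
                       {"BlockType": "LINE", "Text": "Your Company Name"}]}


print(textract_params_from_s3_event(fake_event))
print(handle_upload(fake_event, fake_detect))
```

Injecting the detection function keeps the event handling testable on its own, which is useful because each extra service in the chain is another place where the pipeline can fail.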

You also need to pay for all the intermediary services, such as storage and inbound and outbound traffic, which may add a hefty amount to the base price if you are processing large volumes of data.

Textract operations come in two versions: sync and async. For smaller documents, the sync version is a good choice because results are returned immediately. For larger documents that take a long time to process, the async option is the better fit. However, Amazon Textract limits documents to 3,000 pages.

PDF.co Web API

PDF.co Web API is provided by ByteScout, an established software maker that has focused on PDF data extraction and PDF processing technologies for more than a decade. That is why companies from the Fortune 100 list select ByteScout for their data extraction projects. PDF.co Web API uses the same technology that those Fortune 100 companies use for quick and smart data extraction. Best of all, PDF.co Web API can be used without any installation and is powered by the same scalable and secure infrastructure as Textract.

PDF.co makes no additional charges for storage or traffic; the base price covers both.

Depending on other services, directly or indirectly, increases the final cost, because every service is billed. With PDF.co, you simply get an API key, invoke the API, and consume the output from any programming language, from JavaScript, Java, C#, and Visual Basic to command-line cURL.

PDF.co also provides APIs for file uploads, secure downloads, re-use of the same file, monitoring background jobs, etc.

PDF.co is self-contained: no additional services or APIs need to be purchased.

PDF.co also includes a visual interface where you can try out how it works before integrating with the API.

Like Amazon Textract, PDF.co services can also be invoked in both sync and async mode (for large files).

Sample Request/Response

Amazon Textract

Let’s briefly look at the “DetectDocumentText” API for Textract. This endpoint detects text in the input document and, for each text object found, returns its coordinates.

Request Syntax:

{
  "Document": {
    "Bytes": "blob",
    "S3Object": {
      "Bucket": "string",
      "Name": "string",
      "Version": "string" 
    } 
  } 
}

Request Parameters:
The request accepts data in JSON format. The input document is provided either as base64-encoded bytes or as an Amazon S3 object, and it must be an image in JPEG or PNG format.
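
As a rough illustration of the two input options, the following Python sketch builds the JSON request body either from raw image bytes (base64-encoded, as described above) or from an S3 object reference. The function name and its arguments are assumptions made for this example. In practice an AWS SDK would normally build and sign the request for you, so treat this only as a picture of the payload shape.

```python
import base64
import json
from typing import Optional


def build_detect_request(image_bytes: Optional[bytes] = None,
                         bucket: Optional[str] = None,
                         name: Optional[str] = None) -> str:
    """Return a JSON request body that contains exactly one document source."""
    if image_bytes is not None and bucket is None and name is None:
        # Inline document: the bytes are base64-encoded for the JSON body.
        document = {"Bytes": base64.b64encode(image_bytes).decode("ascii")}
    elif image_bytes is None and bucket and name:
        # Document already uploaded to S3: reference it by bucket and key.
        document = {"S3Object": {"Bucket": bucket, "Name": name}}
    else:
        raise ValueError("Provide either image_bytes or both bucket and name")
    return json.dumps({"Document": document})


print(build_detect_request(bucket="my-bucket", name="scan.png"))
print(build_detect_request(image_bytes=b"\x89PNG"))
```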

Response Syntax:

{
  "Blocks": [ 
    { 
      "BlockType": "string",
      "ColumnIndex": number,
      "ColumnSpan": number,
      "Confidence": number,
      "EntityTypes": [ "string" ],
      "Geometry": { 
        "BoundingBox": { 
          "Height": number,
          "Left": number,
          "Top": number,
          "Width": number
        },
        "Polygon": [ 
          { 
            "X": number,
            "Y": number
          }
        ]
      },
      "Id": "string",
      "Page": number,
      "Relationships": [ 
        { 
          "Ids": [ "string" ],
          "Type": "string"
        }
      ],
      "RowIndex": number,
      "RowSpan": number,
      "SelectionStatus": "string",
      "Text": "string"
    }
  ],
  "DocumentMetadata": { 
    "Pages": number
  }
}
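
The "BoundingBox" values in this response are expressed as ratios of the page width and height rather than as pixels. The short Python sketch below converts such a box to pixel coordinates. The function name, the sample block, and the page size are assumptions chosen for illustration.

```python
from typing import Dict, Tuple


def bounding_box_to_pixels(box: Dict[str, float],
                           page_width: int,
                           page_height: int) -> Tuple[int, int, int, int]:
    """Convert a ratio-based BoundingBox to (left, top, right, bottom) pixels."""
    left = round(box["Left"] * page_width)
    top = round(box["Top"] * page_height)
    right = round((box["Left"] + box["Width"]) * page_width)
    bottom = round((box["Top"] + box["Height"]) * page_height)
    return left, top, right, bottom


# Hypothetical block and page size, for illustration only.
block = {"BlockType": "WORD", "Text": "Total",
         "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5,
                                      "Width": 0.5, "Height": 0.125}}}

print(bounding_box_to_pixels(block["Geometry"]["BoundingBox"], 1000, 800))
# prints (250, 400, 750, 500)
```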


For more developer guides, please refer to the documentation link provided in the “Useful Links” section.

PDF.co Web API

Let’s briefly look at the following API requests, which take an input document such as a PDF, PNG, or JPG file and extract text, JSON, or CSV from it.

Sample 1: PDF to Text with layout preserved

API Endpoint: https://api.pdf.co/v1/pdf/convert/to/text

Sample POST:

{
  "name" : "result.txt",
  "pages" : "",
  "password" : "",
  "url" : "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf"
}

(Don’t forget to set the x-api-key value, either as a URL parameter or as an HTTP header.)

The input document is the sample PDF referenced by the “url” parameter above.
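
For readers who prefer code to raw JSON, here is a minimal Python sketch that prepares the same POST request using only the standard library. The function name and the "YOUR_API_KEY" placeholder are assumptions for this example; the endpoint, the body fields, and the x-api-key header come from the sample above. The request is only built here, not sent.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/convert/to/text"


def build_pdf_to_text_request(api_key: str, source_url: str,
                              name: str = "result.txt") -> urllib.request.Request:
    """Prepare (but do not send) the PDF-to-text POST request."""
    payload = {"name": name, "pages": "", "password": "", "url": source_url}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_pdf_to_text_request(
    "YOUR_API_KEY",  # placeholder: replace with your own key
    "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request), then json.load() the reply.
```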

Response:

{
  "url": "https://pdf-temp-files.s3.amazonaws.com/d6fbd25870a94c8cb0f268e5619d6b21/result.txt",
  "pageCount": 1,
  "error": false,
  "status": 200,
  "name": "result.txt"
}
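
A response like this one can be checked before the result file is downloaded. The sketch below is one possible way to do that in Python; the exception class, the function name, and the shape of the failing response are assumptions for this example, while the "error", "status", and "url" fields are the ones shown above.

```python
import json


class PdfCoError(Exception):
    """Raised when the response reports a failure."""


def extract_result_url(response_text: str) -> str:
    """Return the result file URL, or raise if the response signals an error."""
    response = json.loads(response_text)
    if response.get("error") or response.get("status") != 200:
        raise PdfCoError(f"Request failed: {response}")
    return response["url"]


ok_response = """{
  "url": "https://pdf-temp-files.s3.amazonaws.com/d6fbd25870a94c8cb0f268e5619d6b21/result.txt",
  "pageCount": 1,
  "error": false,
  "status": 200,
  "name": "result.txt"
}"""

print(extract_result_url(ok_response))
```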

The file at the “url” returned in the response contains the text output, which reproduces the original layout.

For more developer guides, please refer to the documentation link provided in the “Useful Links” section or check API docs at https://apidocs.pdf.co

Sample 2: PDF to structured JSON

PDF to JSON extraction with PDF.co Web API

Endpoint: https://api.pdf.co/v1/pdf/convert/to/json

POST request:

{
  "name" : "result.txt",
  "inline" : "true",
  "url" : "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf"
}

Response from PDF.co Web API with structured JSON:

{
    "body": {
        "document": {
            "page": {
                "@index": "0",
                "row": [
                    {
                        "column": [
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "24.0",
                                    "@fontStyle": "Bold",
                                    "@color": "#538DD3",
                                    "@x": "36.00",
                                    "@y": "34.44",
                                    "@width": "242.81",
                                    "@height": "24.00",
                                    "#text": "Your Company Name"
                                }
                            },
                            {
                                "text": ""
                            },
                            {
                                "text": ""
                            },
                            {
                                "text": ""
                            }
                        ]
                    },
.............................
.......JSON parts skipped....
.............................

                    {
                        "column": [
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "36.00",
                                    "@y": "316.25",
                                    "@width": "22.58",
                                    "@height": "11.04",
                                    "#text": "Item"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "247.61",
                                    "@y": "316.25",
                                    "@width": "44.64",
                                    "@height": "11.04",
                                    "#text": "Quantity"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "398.95",
                                    "@y": "316.25",
                                    "@width": "26.91",
                                    "@height": "11.04",
                                    "#text": "Price"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "533.14",
                                    "@y": "316.25",
                                    "@width": "26.30",
                                    "@height": "11.04",
                                    "#text": "Total"
                                }
                            }
                        ]
                    },
                    {
                        "column": [
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@x": "36.00",
                                    "@y": "341.33",
                                    "@width": "30.62",
                                    "@height": "11.04",
                                    "#text": "Item 1"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@x": "286.13",
                                    "@y": "341.33",
                                    "@width": "6.12",
                                    "@height": "11.04",
                                    "#text": "1"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@x": "398.35",
                                    "@y": "341.33",
                                    "@width": "27.51",
                                    "@height": "11.04",
                                    "#text": "40.00"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@x": "531.94",
                                    "@y": "341.33",
                                    "@width": "27.50",
                                    "@height": "11.04",
                                    "#text": "40.00"
                                }
                            }
                        ]
                    },
.............................
.......JSON parts skipped....
.............................

                    {
                        "column": [
                            {
                                "text": ""
                            },
                            {
                                "text": ""
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "389.11",
                                    "@y": "425.83",
                                    "@width": "36.75",
                                    "@height": "11.04",
                                    "#text": "TOTAL"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "525.82",
                                    "@y": "425.83",
                                    "@width": "33.62",
                                    "@height": "11.04",
                                    "#text": "200.00"
                                }
                            }
                        ]
                    }
                ]
            }
        }
    },
    "pageCount": 1,
    "error": false,
    "status": 200,
    "name": "result.txt",
    "remainingCredits": 99263
}
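
To show how this structure might be consumed, the following Python sketch flattens the rows and columns into plain lists of cell text. It assumes, as in the sample above, that "row" and "column" are always lists and that an empty cell is represented by an empty string. The function names and the abridged sample data are illustrative only.

```python
from typing import Any, Dict, List


def cell_text(column: Dict[str, Any]) -> str:
    """Return the text of one cell; empty cells hold "" instead of an object."""
    text = column.get("text", "")
    if isinstance(text, dict):
        return text.get("#text", "")
    return text


def rows_from_response(response: Dict[str, Any]) -> List[List[str]]:
    """Flatten body.document.page.row[].column[] into rows of cell text."""
    page = response["body"]["document"]["page"]
    return [[cell_text(col) for col in row["column"]] for row in page["row"]]


# Abridged from the response above; most attributes are omitted.
sample = {"body": {"document": {"page": {"@index": "0", "row": [
    {"column": [{"text": {"@x": "36.00", "#text": "Item 1"}},
                {"text": {"@x": "286.13", "#text": "1"}},
                {"text": {"@x": "398.35", "#text": "40.00"}},
                {"text": {"@x": "531.94", "#text": "40.00"}}]},
    {"column": [{"text": ""}, {"text": ""},
                {"text": {"@x": "389.11", "#text": "TOTAL"}},
                {"text": {"@x": "525.82", "#text": "200.00"}}]},
]}}}}

for row in rows_from_response(sample):
    print(row)
```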

Sample 3: PDF to structured XML

PDF to XML extraction with PDF.co Web API

Endpoint: https://api.pdf.co/v1/pdf/convert/to/xml

POST request:

{
  "name": "Result.xml",
  "url": "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf"
}

Result.xml returned from PDF.co Web API that contains details for every text snippet along with coordinates, font and size information:

<?xml version="1.0" encoding="UTF-8"?><document><page index="0"><row><column><text fontName="Arial" fontSize="24.0" fontStyle="Bold" color="#538DD3" x="36.00" y="34.44" width="242.81" height="24.00">Your Company Name</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="76.94" width="66.62" height="11.04">Your Address</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="91.46" width="69.14" height="11.04">City, State Zip</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text></text></column><column><text></text></column><column><text></text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="461.02" y="115.94" width="98.42" height="11.04">Invoice No. 
123456</text></column></row><row><column><text></text></column><column><text></text></column><column><text></text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="436.54" y="130.46" width="122.90" height="11.04">Invoice Date 01/01/2016</text></column></row><row><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="36.00" y="154.94" width="63.62" height="11.04">Client Name</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="169.70" width="40.34" height="11.04">Address</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="184.22" width="69.14" height="11.04">City, State Zip</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="233.30" width="28.70" height="11.04">Notes</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="36.00" y="316.25" width="22.58" height="11.04">Item</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="247.61" y="316.25" width="44.64" height="11.04">Quantity</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="398.95" y="316.25" width="26.91" height="11.04">Price</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="533.14" y="316.25" width="26.30" height="11.04">Total</text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="341.33" width="30.62" height="11.04">Item 1</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="341.33" width="6.12" 
height="11.04">1</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="341.33" width="27.51" height="11.04">40.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="341.33" width="27.50" height="11.04">40.00</text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="362.45" width="30.62" height="11.04">Item 2</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="362.45" width="6.12" height="11.04">2</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="362.45" width="27.51" height="11.04">30.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="362.45" width="27.50" height="11.04">60.00</text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="383.57" width="30.62" height="11.04">Item 3</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="383.57" width="6.12" height="11.04">3</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="383.57" width="27.51" height="11.04">20.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="383.57" width="27.50" height="11.04">60.00</text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="404.93" width="30.62" height="11.04">Item 4</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="404.93" width="6.12" height="11.04">4</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="404.93" width="27.51" height="11.04">10.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="404.93" width="27.50" height="11.04">40.00</text></column></row><row><column><text></text></column><column><text></text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="389.11" y="425.83" width="36.75" height="11.04">TOTAL</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="525.82" 
y="425.83" width="33.62" height="11.04">200.00</text></column></row></page></document>
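
The XML result can be read with any standard XML parser. As a small sketch, the Python code below uses the standard library's ElementTree to turn the row and column elements into lists of cell text. The function name and the abridged sample (with most attributes removed) are assumptions for this example.

```python
import xml.etree.ElementTree as ET
from typing import List


def rows_from_xml(xml_text: str) -> List[List[str]]:
    """Return the text of every column, grouped by row."""
    # Parse from bytes so the encoding declaration in the prolog is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for row in root.iter("row"):
        rows.append([(col.findtext("text") or "").strip()
                     for col in row.findall("column")])
    return rows


# Abridged from Result.xml above; most attributes are omitted.
sample_xml = (
    '<?xml version="1.0" encoding="UTF-8"?><document><page index="0">'
    '<row><column><text x="36.00">Item 1</text></column>'
    '<column><text x="286.13">1</text></column>'
    '<column><text x="398.35">40.00</text></column>'
    '<column><text x="531.94">40.00</text></column></row>'
    '<row><column><text></text></column><column><text></text></column>'
    '<column><text x="389.11">TOTAL</text></column>'
    '<column><text x="525.82">200.00</text></column></row>'
    '</page></document>'
)

for row in rows_from_xml(sample_xml):
    print(row)
```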

For more API endpoints, please explore the PDF.co API docs at https://apidocs.pdf.co

Useful Links

PDF.co Web API

Product Page: https://pdf.co/
Pricing Policy: https://app.pdf.co/subscriptions
Developer Guide: https://apidocs.pdf.co/
Code Samples: https://github.com/bytescout/pdf-co-api-samples

Happy Coding! 🙂

 

About the Author

ByteScout Team

ByteScout Team of Writers

ByteScout has a team of professional writers specialized in different technical topics. We select the best writers to cover interesting and trending topics for our readers. We love developers and we hope our articles help you learn about programming and programmers.

 

 
