Amazon AWS Textract vs PDF.co Web API - ByteScout

Amazon AWS Textract vs PDF.co Web API

It is often said that “data is the new oil”, and according to the analysis reports of today’s economists, data will in future be more valuable than oil. This makes us wonder: where is this data, and what are its benefits?

To answer the second question first: the benefits are endless, but one of the most obvious is that data gives us the power to analyze and predict, which makes decisions easier. That brings us to the first question: where is this data?

Most companies today, across different sectors, produce huge amounts of data daily in various forms such as physical documents, scanned documents, photos, digitally filled forms, etc. A common problem is that this data is scanned or otherwise unstructured, which demands manual effort and hinders automation.

PDF.co Web API vs Amazon Textract

Plain OCR is not enough as a solution, since it simply returns raw text, which is of little use in most cases. What is needed is something beyond OCR that provides advanced text detection and extraction features.

In this article, we’ll take an overview of two such products: “PDF.co Web API” and “Amazon Textract”. Both products are unique in their own way, offer many distinct features, and have their own ecosystems, so we’ll compare the features that are similar in both.

Introduction

Amazon Textract

AWS Textract is a new cloud-based service provided by Amazon to automatically detect and extract text data from scanned documents. The input document needs to be provided either as a BLOB or as an S3 object. It basically provides two services: one to detect text in the document and another to analyze/extract text. It detects text at different levels such as pages, lines, and words, and one can extract data in modes such as Raw, Table, and Form. It works only in the cloud.

PDF.co Web API

PDF.co, by ByteScout, provides various RESTful services. PDF.co Web API is focused on scalable and smart data extraction. It can extract data from both native and scanned documents in various ways, such as extracting raw data or table data. It also provides template-based extraction, where data is extracted based on a predefined template. Other features include converting various formats such as images, spreadsheets, XML, etc. to PDF, or converting PDF to these formats.

Unlike Amazon Textract, PDF.co Web API is also available in an on-premise version as the ByteScout Cloud API Server product. You can install it on your own in-house server, where it can work with local files without an Internet connection. The on-premise version can work offline in isolated environments or with your own private cloud. Read more and check the demo videos about the on-premise version here

Features

Amazon Textract

The following are the main features provided by Amazon Textract:

  • Optical Character Recognition (OCR)
    • Amazon Textract uses OCR technology to detect and extract text from a scanned document.
  • Form Extraction
    • When the “Form” mode is passed to the analyze service, Amazon Textract tries to detect all forms and returns results accordingly. It uses machine learning to detect form fields efficiently.
  • Table Extraction
    • AWS Textract provides a table detection and extraction feature, which is useful where there’s a lot of structured data.
  • Confidence Thresholds
    • With all extracted information such as text, forms, and tables, Amazon Textract provides a confidence score for the result. The developer can use this score to decide whether or not to use the result.
  • Bounding Boxes
    • With AWS Textract, all extracted output comes with the coordinates of the result. The result can be of any type such as a line, word, table, or even a table cell, but each one contains its coordinates so that the developer can tell where in the document the extracted text is located.
  • On-Premise Version
    • Not available
  • Technical Support
    • Requires a separate purchase
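
As a rough illustration of how the confidence score might be used, the Python sketch below keeps only the text blocks whose confidence meets a threshold. The field names (`BlockType`, `Confidence`, `Text`) follow the Textract response shape; the function name `filter_blocks`, the sample data, and the threshold of 90 are assumptions made for this example.

```python
def filter_blocks(blocks, min_confidence=90.0, block_type="LINE"):
    """Return the text of blocks of one type at or above a confidence threshold."""
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == block_type
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Made-up sample shaped like the "Blocks" array of a Textract response.
sample_blocks = [
    {"BlockType": "LINE", "Confidence": 99.1, "Text": "Invoice No. 123456"},
    {"BlockType": "LINE", "Confidence": 42.7, "Text": "lnv0ice Dat3"},
    {"BlockType": "WORD", "Confidence": 98.5, "Text": "Invoice"},
]

confident_lines = filter_blocks(sample_blocks)
```

The right threshold depends on the use case: a search index can tolerate lower scores than, say, invoice totals that feed into accounting.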

PDF.co Web API

The following are some of the features provided by PDF.co Web API.

  • Optical Character Recognition (OCR)
    • PDF.co Web API automatically detects whether a document needs to be run through OCR during data extraction.
  • Forms and fields extraction as CSV, XML, JSON, TXT from PDF files and scanned images
    • PDF.co Web API extracts structured data from PDF in CSV, XML, JSON, and TXT formats and can return results both as downloadable links and inline in the response body.
  • Table Extraction from PDF and scanned documents
    • PDF.co Web API re-creates the original table structure from documents. It can also search and locate tables on pages or inside given regions and return them as CSV, XML, or JSON.
  • Can convert Word, XLS, HTML to PDF
    • PDF.co makes converting different formats to and from PDF very easy. It supports conversion to formats such as CSV, XLS, JSON, image formats (including TIFF), text, HTML, XML, etc. During conversion or data extraction, if the input file is a scanned document, the built-in OCR is applied.
  • Confidence levels, fonts, coordinates information
    • PDF.co Web API can extract data from PDF and scanned images and return results as JSON or XML that also contain the confidence level, font size, font color, and bounding rectangle coordinates for every text object.
  • Bounding Boxes
    • PDF.co Web API returns bounding box coordinates for text objects inside the XML and JSON returned after extraction from both native PDF files and scanned documents.
  • PDF Tools: merging, splitting, compressing, redacting sensitive data, adding images and text
    • PDF.co provides a set of PDF-related tools such as splitting PDFs, merging PDF files, compressing PDF documents, adding digital signatures to PDFs, searching and removing text, searching and replacing text, and redacting sensitive data.
  • Barcoding tools: generating barcodes and reading barcodes from images and PDF files
    • PDF.co provides additional barcoding features such as barcode generation and barcode reading. You can generate and read almost all barcode types, from Code 39, Code 128, EAN, and UPC to QR Code, Datamatrix, PDF417, and many others. The API can add barcodes to existing PDF documents and can stamp barcodes onto images. It also reads barcodes from noisy documents, damaged scans, and pictures.
  • Template-based Data Extraction from PDF and images
    • This is one of the many hidden gems of PDF.co Web API. You can create Document Parser templates without coding and pass a template to the API, which will parse the input document based on the template and return the results as JSON. Templates are very flexible and can be easily updated without any programming.
  • On-Premise Version
    • Available as the ByteScout Cloud API Server product
  • Zapier plugin
    • Zapier plugin is available at no additional fee! See this page to install PDF.co Zapier plugin
  • Technical Support
    • included for free!

Product Ecosystem

Amazon Textract

The Textract service is built to fit into the Amazon product ecosystem. The following is the most common use case, which is often pitched by Amazon. It seems that Amazon AWS is trying to provide basic APIs for everything and is currently not focused on data extraction. Amazon AWS is better known as a provider of cloud-hosted servers and storage.

Please note that this is only one of many use cases; the service can be used and integrated differently based on the requirements. As per the above image, here’s the flow of the service.

First of all, the input image file is loaded into an S3 bucket. Once that’s done, it’s further processed by Lambda and Textract is invoked. After the text is detected and extracted, its output can be stored in DynamoDB or other places of choice. Finally, it’s loaded into the Elasticsearch service, from where it can be consumed by third-party apps.
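
To make the first half of that flow a little more concrete, here is a minimal Python sketch of a Lambda handler that turns an S3 upload event into the `Document` parameter Textract expects. The function names and the optional `textract` argument are assumptions for illustration; in a real deployment the client would typically be `boto3.client("textract")`, and the DynamoDB and Elasticsearch steps are left out entirely.

```python
from urllib.parse import unquote_plus


def document_from_s3_event(event):
    """Build the Textract "Document" parameter from an S3 upload event."""
    record = event["Records"][0]["s3"]
    return {
        "S3Object": {
            "Bucket": record["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "Name": unquote_plus(record["object"]["key"]),
        }
    }


def lambda_handler(event, context, textract=None):
    """Hypothetical handler; `textract` would be boto3.client("textract")."""
    document = document_from_s3_event(event)
    if textract is None:
        return document  # nothing to call in this offline sketch
    response = textract.detect_document_text(Document=document)
    # Collect the detected lines; storing them elsewhere is not shown here.
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
```

Passing the client in as an argument keeps the handler easy to exercise locally with a stand-in object instead of a live AWS account.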

Textract services come in two versions: sync and async. For smaller documents, the sync version is a good choice, as results are returned immediately. For larger documents that take a long time to process, the async option is the best fit. However, both the sync and async versions are limited to documents of up to 3,000 pages.
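
With the async version, the caller starts a job and then checks back until it has finished. The Python sketch below shows one possible polling loop. The `get_status` callable is a hypothetical stand-in; with boto3 it could wrap `get_document_text_detection(JobId=job_id)["JobStatus"]`. The delay and attempt limits are arbitrary values chosen for the example.

```python
import time


def wait_for_job(get_status, job_id, delay_seconds=5.0, max_attempts=60):
    """Poll `get_status(job_id)` until the job leaves the IN_PROGRESS state."""
    for _ in range(max_attempts):
        status = get_status(job_id)
        if status != "IN_PROGRESS":
            return status  # e.g. SUCCEEDED or FAILED
        time.sleep(delay_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_attempts} polls")
```

For production workloads, a completion notification is usually a better fit than polling, but a loop like this is often enough for scripts and experiments.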

PDF.co Web API

PDF.co Web API is provided by ByteScout, an established software maker that has focused on PDF data extraction and PDF processing technologies for more than a decade. That is why Fortune 100 companies select ByteScout for their data extraction projects. PDF.co Web API uses the same technology that those companies use for quick and smart data extraction. Best of all, PDF.co Web API can be used without any installation and is powered by the same scalable and secure infrastructure that is used by Amazon Textract.

Depending on other services, directly or indirectly, increases the final cost, as every service is billed. In the case of PDF.co, one can simply get an API key, invoke the service, and consume the output as required from any programming language, from JavaScript, Java, C#, and Visual Basic to command-line cURL.

All the major services provided by PDF.co are self-contained, with no additional services such as storage or other APIs to purchase.

PDF.co also includes a visual interface where you can try out how it works before integrating with the API:

Like Amazon Textract, PDF.co services can also be invoked in both sync and async modes.

Sample Request/Response

Amazon Textract

Let’s briefly analyze the “DetectDocumentText” API for Textract. This endpoint detects whether the input document contains text and, if so, returns the coordinates of the text objects.

Request Syntax:

{
  "Document": {
    "Bytes": "blob",
    "S3Object": {
      "Bucket": "string",
      "Name": "string",
      "Version": "string" 
    } 
  } 
}

Request Parameters:
The request accepts data in JSON format. The input document is passed as base64-encoded bytes or as an Amazon S3 object. The document must be an image in JPEG or PNG format.
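
As a small sketch of what building that JSON body could look like, the Python function below accepts either raw image bytes or an S3 location and produces a request matching the syntax above. The function name `build_detect_request` and its parameters are assumptions for illustration. Note that the AWS SDKs generally take care of the base64 step themselves, so this mainly applies when assembling the raw request by hand.

```python
import base64
import json


def build_detect_request(image_bytes=None, bucket=None, name=None):
    """Build a DetectDocumentText JSON body from raw bytes or an S3 location."""
    if image_bytes is not None:
        # A raw JSON request carries the image as base64 text.
        document = {"Bytes": base64.b64encode(image_bytes).decode("ascii")}
    elif bucket and name:
        document = {"S3Object": {"Bucket": bucket, "Name": name}}
    else:
        raise ValueError("provide image_bytes, or both bucket and name")
    return json.dumps({"Document": document})
```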

Response Syntax:

{
  "Blocks": [ 
    { 
      "BlockType": "string",
      "ColumnIndex": number,
      "ColumnSpan": number,
      "Confidence": number,
      "EntityTypes": [ "string" ],
      "Geometry": { 
        "BoundingBox": { 
          "Height": number,
          "Left": number,
          "Top": number,
          "Width": number
        },
        "Polygon": [ 
          { 
            "X": number,
            "Y": number
          }
        ]
      },
      "Id": "string",
      "Page": number,
      "Relationships": [ 
        { 
          "Ids": [ "string" ],
          "Type": "string"
        }
      ],
      "RowIndex": number,
      "RowSpan": number,
      "SelectionStatus": "string",
      "Text": "string"
    }
  ],
  "DocumentMetadata": { 
    "Pages": number
  }
}
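
The `BoundingBox` values in this response are expressed as ratios of the overall page width and height rather than as pixels. The Python sketch below converts them to pixel coordinates for the `LINE` blocks of a response that has already been parsed into a dict. The function names and the page dimensions passed in are assumptions for illustration.

```python
def to_pixels(bounding_box, page_width, page_height):
    """Convert a ratio-based BoundingBox into (left, top, width, height) pixels."""
    return (
        round(bounding_box["Left"] * page_width),
        round(bounding_box["Top"] * page_height),
        round(bounding_box["Width"] * page_width),
        round(bounding_box["Height"] * page_height),
    )


def line_boxes(response, page_width, page_height):
    """Yield (text, pixel_box) for each LINE block in a response dict."""
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            box = block["Geometry"]["BoundingBox"]
            yield block["Text"], to_pixels(box, page_width, page_height)
```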


For more developer guides, please refer to the documentation link provided in the “Useful Links” section.

PDF.co Web API

Let’s briefly analyze the following API requests, which take input documents such as PDF, PNG, or JPG files and extract text, JSON, or CSV from them.

Sample 1: PDF to Text with layout preserved

API Endpoint: https://api.pdf.co/v1/pdf/convert/to/text

Sample POST:

{
  "name" : "result.txt",
  "pages" : "",
  "password" : "",
  "url" : "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf"
}

(Don’t forget to set the x-api-key URL parameter or HTTP header!)
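
For illustration, here is one way the request above might be prepared in Python using only the standard library. The API key placeholder, the example source URL, and the function name `build_request` are assumptions; the sketch also assumes the endpoint accepts a JSON body with the key sent as an HTTP header, and it only builds the request without sending it.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/convert/to/text"


def build_request(api_key, source_url, name="result.txt"):
    """Prepare (but do not send) a POST request for the PDF to text endpoint."""
    payload = {"name": name, "pages": "", "password": "", "url": source_url}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_request("YOUR_API_KEY", "https://example.com/sample.pdf")
# Sending it would look like: urllib.request.urlopen(request).read()
```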

Here’s the input document which we’ve passed along.

Response:

{
  "url": "https://pdf-temp-files.s3.amazonaws.com/d6fbd25870a94c8cb0f268e5619d6b21/result.txt",
  "pageCount": 1,
  "error": false,
  "status": 200,
  "name": "result.txt"
}

The file referenced by the ‘url’ parameter of the response contains the text output, reproducing the original layout.
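
A caller will usually want to check the `error` and `status` fields before fetching the result file. The Python sketch below does that for a response like the one above. The function name `result_url` is made up for this example, and the `message` field read on failure is an assumption about how errors are reported.

```python
import json


def result_url(response_text):
    """Return the result file URL from a response, or raise if it failed."""
    data = json.loads(response_text)
    if data.get("error") or data.get("status") != 200:
        # "message" is an assumed field name for the error description.
        raise RuntimeError(f"conversion failed: {data.get('message', data)}")
    return data["url"]


sample_response = json.dumps({
    "url": "https://pdf-temp-files.s3.amazonaws.com/d6fbd25870a94c8cb0f268e5619d6b21/result.txt",
    "pageCount": 1,
    "error": False,
    "status": 200,
    "name": "result.txt",
})
text_file_url = result_url(sample_response)
```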

For more developer guides, please refer to the documentation link provided in the “Useful Links” section or check API docs at https://apidocs.pdf.co

Sample 2: PDF to structured JSON

PDF to JSON extraction with PDF.co Web API

Endpoint: https://api.pdf.co/v1/pdf/convert/to/json

POST request:

{
  "name" : "result.txt",
  "inline" : "true",
  "url" : "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf"
}

Response from PDF.co Web API with structured JSON:

{
    "body": {
        "document": {
            "page": {
                "@index": "0",
                "row": [
                    {
                        "column": [
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "24.0",
                                    "@fontStyle": "Bold",
                                    "@color": "#538DD3",
                                    "@x": "36.00",
                                    "@y": "34.44",
                                    "@width": "242.81",
                                    "@height": "24.00",
                                    "#text": "Your Company Name"
                                }
                            },
                            {
                                "text": ""
                            },
                            {
                                "text": ""
                            },
                            {
                                "text": ""
                            }
                        ]
                    },
.............................
.......JSON parts skipped....
.............................

                    {
                        "column": [
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "36.00",
                                    "@y": "316.25",
                                    "@width": "22.58",
                                    "@height": "11.04",
                                    "#text": "Item"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "247.61",
                                    "@y": "316.25",
                                    "@width": "44.64",
                                    "@height": "11.04",
                                    "#text": "Quantity"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "398.95",
                                    "@y": "316.25",
                                    "@width": "26.91",
                                    "@height": "11.04",
                                    "#text": "Price"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "533.14",
                                    "@y": "316.25",
                                    "@width": "26.30",
                                    "@height": "11.04",
                                    "#text": "Total"
                                }
                            }
                        ]
                    },
                    {
                        "column": [
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@x": "36.00",
                                    "@y": "341.33",
                                    "@width": "30.62",
                                    "@height": "11.04",
                                    "#text": "Item 1"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@x": "286.13",
                                    "@y": "341.33",
                                    "@width": "6.12",
                                    "@height": "11.04",
                                    "#text": "1"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@x": "398.35",
                                    "@y": "341.33",
                                    "@width": "27.51",
                                    "@height": "11.04",
                                    "#text": "40.00"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@x": "531.94",
                                    "@y": "341.33",
                                    "@width": "27.50",
                                    "@height": "11.04",
                                    "#text": "40.00"
                                }
                            }
                        ]
                    },
.............................
.......JSON parts skipped....
.............................

                    {
                        "column": [
                            {
                                "text": ""
                            },
                            {
                                "text": ""
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "389.11",
                                    "@y": "425.83",
                                    "@width": "36.75",
                                    "@height": "11.04",
                                    "#text": "TOTAL"
                                }
                            },
                            {
                                "text": {
                                    "@fontName": "Arial",
                                    "@fontSize": "11.0",
                                    "@fontStyle": "Bold",
                                    "@x": "525.82",
                                    "@y": "425.83",
                                    "@width": "33.62",
                                    "@height": "11.04",
                                    "#text": "200.00"
                                }
                            }
                        ]
                    }
                ]
            }
        }
    },
    "pageCount": 1,
    "error": false,
    "status": 200,
    "name": "result.txt",
    "remainingCredits": 99263
}
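
Since the rows and columns are already laid out in the JSON, turning them into a plain table takes only a few lines. The Python sketch below flattens a response of the single-page shape shown above into a list of lists. The function names and the trimmed sample are assumptions for illustration, and documents with several pages may well need extra handling that is not covered here.

```python
def cell_text(column):
    """Return the text of one column entry; empty cells hold an empty string."""
    text = column.get("text", "")
    return text.get("#text", "") if isinstance(text, dict) else text


def rows_to_table(response):
    """Flatten body.document.page.row into a list of lists of strings."""
    rows = response["body"]["document"]["page"]["row"]
    return [[cell_text(col) for col in row["column"]] for row in rows]


# Trimmed-down sample with the same nesting as the response above.
sample = {
    "body": {"document": {"page": {"@index": "0", "row": [
        {"column": [{"text": {"#text": "Item"}}, {"text": {"#text": "Total"}}]},
        {"column": [{"text": ""}, {"text": {"#text": "200.00"}}]},
    ]}}}
}
table = rows_to_table(sample)
```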

Sample 3: PDF to structured XML

PDF to XML extraction with PDF.co Web API

Endpoint: https://api.pdf.co/v1/pdf/convert/to/xml

POST request:

{
  "name": "Result.xml",
  "url": "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf"
}

The content of the Result.xml file returned from PDF.co Web API with structured XML:

<?xml version="1.0" encoding="UTF-8"?>
<document>
  <page index="0">
    <row><column><text fontName="Arial" fontSize="24.0" fontStyle="Bold" color="#538DD3" x="36.00" y="34.44" width="242.81" height="24.00">Your Company Name</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="76.94" width="66.62" height="11.04">Your Address</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="91.46" width="69.14" height="11.04">City, State Zip</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row>
    <row><column><text></text></column><column><text></text></column><column><text></text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="461.02" y="115.94" width="98.42" height="11.04">Invoice No. 123456</text></column></row>
    <row><column><text></text></column><column><text></text></column><column><text></text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="436.54" y="130.46" width="122.90" height="11.04">Invoice Date 01/01/2016</text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="36.00" y="154.94" width="63.62" height="11.04">Client Name</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="169.70" width="40.34" height="11.04">Address</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="184.22" width="69.14" height="11.04">City, State Zip</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="233.30" width="28.70" height="11.04">Notes</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="36.00" y="316.25" width="22.58" height="11.04">Item</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="247.61" y="316.25" width="44.64" height="11.04">Quantity</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="398.95" y="316.25" width="26.91" height="11.04">Price</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="533.14" y="316.25" width="26.30" height="11.04">Total</text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="341.33" width="30.62" height="11.04">Item 1</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="341.33" width="6.12" height="11.04">1</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="341.33" width="27.51" height="11.04">40.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="341.33" width="27.50" height="11.04">40.00</text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="362.45" width="30.62" height="11.04">Item 2</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="362.45" width="6.12" height="11.04">2</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="362.45" width="27.51" height="11.04">30.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="362.45" width="27.50" height="11.04">60.00</text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="383.57" width="30.62" height="11.04">Item 3</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="383.57" width="6.12" height="11.04">3</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="383.57" width="27.51" height="11.04">20.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="383.57" width="27.50" height="11.04">60.00</text></column></row>
    <row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="404.93" width="30.62" height="11.04">Item 4</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="404.93" width="6.12" height="11.04">4</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="404.93" width="27.51" height="11.04">10.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="404.93" width="27.50" height="11.04">40.00</text></column></row>
    <row><column><text></text></column><column><text></text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="389.11" y="425.83" width="36.75" height="11.04">TOTAL</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="525.82" y="425.83" width="33.62" height="11.04">200.00</text></column></row>
  </page>
</document>
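
Because the XML keeps the row and column structure, it can be turned into CSV with the Python standard library alone. The sketch below is one possible approach; the function name `xml_to_csv` and the shortened sample document are assumptions for illustration, and font and coordinate attributes are simply ignored.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_csv(xml_text):
    """Convert <row>/<column>/<text> elements into CSV text."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in root.iter("row"):
        cells = [(col.findtext("text") or "").strip() for col in row.findall("column")]
        writer.writerow(cells)
    return out.getvalue()


# Shortened sample with the same element structure as Result.xml.
sample_xml = (
    '<document><page index="0">'
    "<row><column><text>Item</text></column><column><text>Total</text></column></row>"
    "<row><column><text>Item 1</text></column><column><text>40.00</text></column></row>"
    "<row><column><text></text></column><column><text>200.00</text></column></row>"
    "</page></document>"
)
csv_text = xml_to_csv(sample_xml)
```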

For more API endpoints, please explore the PDF.co API docs at https://apidocs.pdf.co

Useful Links

PDF.co Web API

Product Page: https://pdf.co/
Pricing Policy: https://app.pdf.co/subscriptions
Developer Guide: https://apidocs.pdf.co/
Code Samples: https://github.com/bytescout/pdf-co-api-samples

Happy Coding! 🙂
