It is often said that “data is the new oil”, and many of today’s economic analyses support that view: data may well become more valuable than oil. But that raises a question: where is this data stored, and how do we extract it so that we can benefit from it?
Most companies today, across industries and sectors, produce huge amounts of data every day in various forms: physical documents, scanned documents, photos, digitally filled forms, spreadsheets, and electronic PDF files. A common problem is that all of these are stored in unstructured formats that, in most cases, cannot easily be loaded into databases or software applications. To liberate and take advantage of the data trapped inside files, companies are looking for intelligent tools that can load and restore the original data from paper-based and electronic documents.
For example, regular OCR (optical character recognition) can extract text from images as raw, unstructured data. But in most cases raw, unstructured data is not enough if you want to use and analyze it: it requires manual review, editing, verification, and cleanup before it becomes valuable.
In this article, we give an overview of two tools: “PDF.co Web API” and “Amazon Textract”. Both products offer their own distinct features and have their own ecosystems.
We are going to compare features that are similar in both products:
Amazon Textract is a new cloud-based service introduced by Amazon AWS that extracts text from scanned documents. The input document must be provided either as a BLOB or as a file uploaded to the Amazon S3 storage service. It provides two services: one to detect text in the document and another to extract text. Detection finds words, lines, and paragraphs, and extraction can return data in modes such as raw text, table, and form. It works only in the cloud: you can’t run it on your own server, and you have to upload all your data to Amazon.
PDF.co provides a Web API focused on scalable and intelligent data extraction. The API extracts text and data both from native electronic documents (PDF, RTF, DOC, DOCX, XLS) and from scanned documents (PDF, JPG, PNG, TIFF). It can extract text with layout and text order preserved, structured data from tables, and form field values. PDF.co Web API also includes a Document Parser feature for template-based data extraction: you design a template once and then reuse it for precise and accurate extraction from documents. Other features include converting various formats (images, spreadsheets, XML, etc.) to PDF and converting PDF to these formats.
Unlike Amazon Textract, PDF.co Web API is also available in an on-premise version for enterprise companies, as the ByteScout API Server product. You can install it on your own in-house server and use it with local files, with no Internet connection required. The on-premise version can work offline in isolated environments, and you can also deploy it to your own private cloud. Demo videos of the on-premise version are also available.
The following are the main features provided by Amazon Textract:
The following are some of the features provided by PDF.co Web API:
Amazon AWS aims to provide basic APIs for everything, not just data extraction, and is best known as a provider of cloud-based servers and storage. The Textract service is built as one of the modules inside the AWS product ecosystem. The following is the most common use case, often pitched by Amazon.
In this flow, the input image file is first uploaded to an S3 bucket. It is then processed by AWS Lambda, which invokes Textract. After the text is detected and extracted, the output can be stored in DynamoDB or another store of your choice. Finally, it is loaded into the Elasticsearch service, from where it can be consumed by third-party apps.
You also need to pay for all the intermediary services, such as storage and inbound and outbound traffic, which may add a hefty amount to the base price if you are processing large volumes of data.
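The flow described above (an upload to S3 triggers Lambda, which calls Textract and stores the result) could be sketched roughly as follows. This is only a minimal illustration, not Amazon's reference implementation: the `textract` argument is assumed to be a boto3 Textract client, and `save_lines` is a hypothetical callback that stands in for DynamoDB or any other store.

```python
from urllib.parse import unquote_plus


def handle_s3_event(event, textract, save_lines):
    """Run text detection for every object named in an S3 event notification.

    textract   -- assumed to be a boto3 Textract client (boto3.client("textract"))
    save_lines -- hypothetical callback(bucket, key, lines) that persists the result
    """
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])

        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        # Keep only the text of LINE blocks; PAGE and WORD blocks are skipped.
        lines = [
            block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "LINE"
        ]
        save_lines(bucket, key, lines)
        processed.append(key)
    return processed
```

Passing the client and the storage callback in as arguments keeps the sketch independent of any particular storage service, so each billed component of the pipeline stays visible in the code.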
Textract operations come in two versions: sync and async. For smaller documents the sync version is a good choice, since results are returned right away. For larger documents that need long processing times, the async option is the better fit. However, both the sync and async versions are limited to documents of up to 3,000 pages on Amazon Textract.
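To make the async mode more concrete, here is a rough polling sketch. It assumes `textract` is a boto3 Textract client and uses its `start_document_text_detection` and `get_document_text_detection` calls; the function name, `poll_seconds`, and `max_polls` are illustrative choices of ours, not recommended values.

```python
import time


def detect_text_async(textract, bucket, name, poll_seconds=5.0, max_polls=120):
    """Start an asynchronous text-detection job and collect all result blocks.

    textract is assumed to be a boto3 Textract client; poll_seconds and
    max_polls are illustrative defaults.
    """
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Large results are paginated; follow NextToken until it is absent.
    blocks = list(result.get("Blocks", []))
    while result.get("NextToken"):
        result = textract.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks
```

Polling is the simplest option to show here; Textract can also publish a completion notification, which avoids repeated status calls for long-running jobs.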
PDF.co Web API is provided by ByteScout, an established software maker that has focused on PDF data extraction and PDF processing technologies for more than a decade. That is why companies from the Fortune 100 list select ByteScout for their data extraction projects, and PDF.co Web API uses the same technology for quick and smart data extraction. Best of all, PDF.co Web API can be used without any installation and is powered by the same scalable and secure infrastructure as Textract.
PDF.co does not charge extra for storage or traffic: the base price covers both.
Depending on other services, directly or indirectly, increases the final cost, because every service is billed. With PDF.co, you simply get an API key, invoke the API, and consume the output from any programming language, from JavaScript, Java, C#, and Visual Basic to command-line cURL.
PDF.co also provides APIs for file uploads, secure downloads, re-use of the same file, monitoring background jobs, etc.
PDF.co is self-contained: no additional services or APIs need to be purchased.
PDF.co also includes a visual interface where you can try how it works before integrating with the API.
Like Amazon Textract, PDF.co services can also be invoked in both sync and async mode (for large files).
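As a rough idea of what async mode looks like from the client side, the sketch below polls a background job until it finishes. Treat the details as assumptions to verify against the PDF.co API docs: the job-check endpoint URL, the `jobid` request field, and the `working` and `success` status values are not taken from this article. The `post` argument is a caller-supplied function that performs the authenticated HTTP call.

```python
import time

# Assumed endpoint; confirm it in the PDF.co API docs before relying on it.
JOB_CHECK_URL = "https://api.pdf.co/v1/job/check"


def wait_for_job(post, job_id, poll_seconds=3.0, max_polls=100):
    """Poll a background job until it stops reporting "working".

    post is a caller-supplied function post(url, payload) -> dict that makes
    the authenticated request. The "jobid" and "status" field names and the
    status values are assumptions.
    """
    for _ in range(max_polls):
        reply = post(JOB_CHECK_URL, {"jobid": job_id})
        status = reply.get("status")
        if status == "success":
            return reply
        if status != "working":
            raise RuntimeError(f"job {job_id} ended with status {status!r}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_polls} checks")
```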
Let’s briefly analyze the “DetectDocumentText” API for Textract. This endpoint detects whether the input document contains text and, if it does, returns the coordinates of the text objects.
Request Syntax:
{
  "Document": {
    "Bytes": "blob",
    "S3Object": {
      "Bucket": "string",
      "Name": "string",
      "Version": "string"
    }
  }
}
Request Parameters:
The request accepts data in JSON format. The input document is base64-encoded bytes or an Amazon S3 object. The document must be an image in JPEG or PNG format.
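In Python, the “Document” part of this request could be assembled as in the sketch below. The helper name `build_document` is ours and purely illustrative. It assumes you call Textract through boto3, where raw bytes are passed directly and the SDK is expected to take care of the base64 encoding that the JSON API requires.

```python
def build_document(image_bytes=None, bucket=None, name=None, version=None):
    """Build the "Document" argument for DetectDocumentText.

    Pass either raw image bytes or an S3 bucket/name pair, not both.
    """
    has_bytes = image_bytes is not None
    has_s3 = bucket is not None and name is not None
    if has_bytes == has_s3:
        raise ValueError("provide either image_bytes or bucket and name")

    if has_bytes:
        # With boto3 the raw bytes are passed as-is (assumption: the SDK
        # performs the base64 encoding for the underlying JSON request).
        return {"Bytes": image_bytes}

    s3_object = {"Bucket": bucket, "Name": name}
    if version is not None:
        s3_object["Version"] = version
    return {"S3Object": s3_object}
```

With a boto3 client this would be used as `textract.detect_document_text(Document=build_document(bucket="my-bucket", name="scan.png"))`, where the bucket and file names are placeholders.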
Response Syntax:
{ "Blocks": [ { "BlockType": "string", "ColumnIndex": number, "ColumnSpan": number, "Confidence": number, "EntityTypes": [ "string" ], "Geometry": { "BoundingBox": { "Height": number, "Left": number, "Top": number, "Width": number }, "Polygon": [ { "X": number, "Y": number } ] }, "Id": "string", "Page": number, "Relationships": [ { "Ids": [ "string" ], "Type": "string" } ], "RowIndex": number, "RowSpan": number, "SelectionStatus": "string", "Text": "string" } ], "DocumentMetadata": { "Pages": number } }
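To show how this response might be consumed, the sketch below picks out the LINE blocks and converts their bounding boxes to pixels. Textract reports bounding-box values as ratios of the page size, so the page dimensions have to be supplied by the caller. The function name and the `min_confidence` filter are illustrative assumptions.

```python
def lines_with_boxes(response, page_width, page_height, min_confidence=0.0):
    """Return (text, confidence, pixel_box) for each LINE block.

    pixel_box is (left, top, width, height) in pixels, scaled from the
    ratio-based BoundingBox values in the response.
    """
    found = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence < min_confidence:
            continue
        box = block["Geometry"]["BoundingBox"]
        pixel_box = (
            round(box["Left"] * page_width),
            round(box["Top"] * page_height),
            round(box["Width"] * page_width),
            round(box["Height"] * page_height),
        )
        found.append((block["Text"], confidence, pixel_box))
    return found
```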
For more developer guides, please refer to the documentation link provided in the “Useful Links” section.
Let’s briefly analyze the following API request, which takes input documents such as PDF, PNG, or JPG files and extracts text from them (JSON and CSV output are available as well).
API Endpoint: https://api.pdf.co/v1/pdf/convert/to/text
Sample POST:
{
  "name": "result.txt",
  "pages": "",
  "password": "",
  "url": "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf"
}
(Don’t forget to pass your API key, either as the x-api-key URL parameter or as the x-api-key HTTP header.)
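The same POST can be prepared from Python with nothing but the standard library. This is a minimal sketch: the function names are ours, the API key is a placeholder you supply, and the request is only built here, not sent, until you call `send`.

```python
import json
import urllib.request

API_URL = "https://api.pdf.co/v1/pdf/convert/to/text"


def build_pdf_to_text_request(api_key, file_url, name="result.txt", pages="", password=""):
    """Prepare (but do not send) the POST request shown above."""
    payload = {"name": name, "pages": pages, "password": password, "url": file_url}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        # The API key travels in the x-api-key header.
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send a prepared request and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

A call would then look like `send(build_pdf_to_text_request("YOUR_API_KEY", sample_url))`, where `sample_url` is the document URL from the sample POST.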
Here’s the input document which we’ve passed along.
Response:
{
"url": "https://pdf-temp-files.s3.amazonaws.com/d6fbd25870a94c8cb0f268e5619d6b21/result.txt",
"pageCount": 1,
"error": false,
"status": 200,
"name": "result.txt"
}
The file referenced by the “url” field of the response contains the text output, which reproduces the original layout.
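On the client side, handling this response comes down to checking the error flag and then downloading the file behind the URL. The sketch below uses the sample response from above. The `message` field used for error reporting and the UTF-8 decoding of the result file are assumptions, and both function names are ours.

```python
import urllib.request


def result_url(response):
    """Return the download URL from a conversion response, or raise on failure."""
    if response.get("error") or response.get("status") != 200:
        # "message" is assumed to hold the error text; fall back to the raw reply.
        raise RuntimeError(f"conversion failed: {response.get('message', response)}")
    return response["url"]


def fetch_text(url, timeout=60):
    """Download the converted file (assumed here to be UTF-8 text)."""
    with urllib.request.urlopen(url, timeout=timeout) as reply:
        return reply.read().decode("utf-8")


sample = {
    "url": "https://pdf-temp-files.s3.amazonaws.com/d6fbd25870a94c8cb0f268e5619d6b21/result.txt",
    "pageCount": 1,
    "error": False,
    "status": 200,
    "name": "result.txt",
}
print(result_url(sample))
```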
For more developer guides, please refer to the documentation link provided in the “Useful Links” section or check API docs at https://apidocs.pdf.co
PDF to JSON extraction with PDF.co Web API
Endpoint: https://api.pdf.co/v1/pdf/convert/to/json
POST request:
{
  "name": "result.txt",
  "inline": "true",
  "url": "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf"
}
Response from PDF.co Web API with structured JSON:
{ "body": { "document": { "page": { "@index": "0", "row": [ { "column": [ { "text": { "@fontName": "Arial", "@fontSize": "24.0", "@fontStyle": "Bold", "@color": "#538DD3", "@x": "36.00", "@y": "34.44", "@width": "242.81", "@height": "24.00", "#text": "Your Company Name" } }, { "text": "" }, { "text": "" }, { "text": "" } ] }, ............................. .......JSON parts skipped.... ............................. { "column": [ { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@fontStyle": "Bold", "@x": "36.00", "@y": "316.25", "@width": "22.58", "@height": "11.04", "#text": "Item" } }, { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@fontStyle": "Bold", "@x": "247.61", "@y": "316.25", "@width": "44.64", "@height": "11.04", "#text": "Quantity" } }, { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@fontStyle": "Bold", "@x": "398.95", "@y": "316.25", "@width": "26.91", "@height": "11.04", "#text": "Price" } }, { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@fontStyle": "Bold", "@x": "533.14", "@y": "316.25", "@width": "26.30", "@height": "11.04", "#text": "Total" } } ] }, { "column": [ { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@x": "36.00", "@y": "341.33", "@width": "30.62", "@height": "11.04", "#text": "Item 1" } }, { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@x": "286.13", "@y": "341.33", "@width": "6.12", "@height": "11.04", "#text": "1" } }, { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@x": "398.35", "@y": "341.33", "@width": "27.51", "@height": "11.04", "#text": "40.00" } }, { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@x": "531.94", "@y": "341.33", "@width": "27.50", "@height": "11.04", "#text": "40.00" } } ] }, ............................. .......JSON parts skipped.... ............................. 
{ "column": [ { "text": "" }, { "text": "" }, { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@fontStyle": "Bold", "@x": "389.11", "@y": "425.83", "@width": "36.75", "@height": "11.04", "#text": "TOTAL" } }, { "text": { "@fontName": "Arial", "@fontSize": "11.0", "@fontStyle": "Bold", "@x": "525.82", "@y": "425.83", "@width": "33.62", "@height": "11.04", "#text": "200.00" } } ] } ] } } }, "pageCount": 1, "error": false, "status": 200, "name": "result.txt", "remainingCredits": 99263 }
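Because the structured JSON keeps rows and columns, it can be flattened into a simple table with a few lines of Python. The sketch below follows the shape of the response above, where empty cells are plain empty strings and filled cells are objects with a “#text” key. It assumes, without confirmation from the docs, that a multi-page document returns “page” as a list rather than a single object.

```python
def rows_from_pdfco_json(result):
    """Flatten the row/column structure into a list of rows of cell strings."""
    pages = result["body"]["document"]["page"]
    # Assumption: several pages arrive as a list, a single page as a dict.
    if isinstance(pages, dict):
        pages = [pages]

    table = []
    for page in pages:
        for row in page.get("row", []):
            cells = []
            for column in row.get("column", []):
                text = column.get("text", "")
                # Empty cells are "" strings; filled cells are dicts with "#text".
                cells.append(text.get("#text", "") if isinstance(text, dict) else text)
            table.append(cells)
    return table
```

For the invoice sample, the header row would come out as `["Item", "Quantity", "Price", "Total"]`, ready to be written to CSV or loaded into a database.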
PDF to XML extraction with PDF.co Web API
Endpoint: https://api.pdf.co/v1/pdf/convert/to/xml
POST request:
{
  "name": "Result.xml",
  "url": "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf"
}
Result.xml returned from PDF.co Web API that contains details for every text snippet along with coordinates, font, and size information:
<?xml version="1.0" encoding="UTF-8"?><document><page index="0"><row><column><text fontName="Arial" fontSize="24.0" fontStyle="Bold" color="#538DD3" x="36.00" y="34.44" width="242.81" height="24.00">Your Company Name</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="76.94" width="66.62" height="11.04">Your Address</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="91.46" width="69.14" height="11.04">City, State Zip</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text></text></column><column><text></text></column><column><text></text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="461.02" y="115.94" width="98.42" height="11.04">Invoice No. 
123456</text></column></row><row><column><text></text></column><column><text></text></column><column><text></text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="436.54" y="130.46" width="122.90" height="11.04">Invoice Date 01/01/2016</text></column></row><row><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="36.00" y="154.94" width="63.62" height="11.04">Client Name</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="169.70" width="40.34" height="11.04">Address</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="184.22" width="69.14" height="11.04">City, State Zip</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="233.30" width="28.70" height="11.04">Notes</text></column><column><text></text></column><column><text></text></column><column><text></text></column></row><row><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="36.00" y="316.25" width="22.58" height="11.04">Item</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="247.61" y="316.25" width="44.64" height="11.04">Quantity</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="398.95" y="316.25" width="26.91" height="11.04">Price</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="533.14" y="316.25" width="26.30" height="11.04">Total</text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="341.33" width="30.62" height="11.04">Item 1</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="341.33" width="6.12" 
height="11.04">1</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="341.33" width="27.51" height="11.04">40.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="341.33" width="27.50" height="11.04">40.00</text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="362.45" width="30.62" height="11.04">Item 2</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="362.45" width="6.12" height="11.04">2</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="362.45" width="27.51" height="11.04">30.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="362.45" width="27.50" height="11.04">60.00</text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="383.57" width="30.62" height="11.04">Item 3</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="383.57" width="6.12" height="11.04">3</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="383.57" width="27.51" height="11.04">20.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="383.57" width="27.50" height="11.04">60.00</text></column></row><row><column><text fontName="Arial" fontSize="11.0" x="36.00" y="404.93" width="30.62" height="11.04">Item 4</text></column><column><text fontName="Arial" fontSize="11.0" x="286.13" y="404.93" width="6.12" height="11.04">4</text></column><column><text fontName="Arial" fontSize="11.0" x="398.35" y="404.93" width="27.51" height="11.04">10.00</text></column><column><text fontName="Arial" fontSize="11.0" x="531.94" y="404.93" width="27.50" height="11.04">40.00</text></column></row><row><column><text></text></column><column><text></text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="389.11" y="425.83" width="36.75" height="11.04">TOTAL</text></column><column><text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="525.82" 
y="425.83" width="33.62" height="11.04">200.00</text></column></row></page></document>
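The XML output can be read with Python's standard `xml.etree.ElementTree` module. The following sketch (the function name is ours) turns each row element into a list of cells and keeps the x and y coordinates where they are present. It relies only on the element and attribute names visible in the sample above.

```python
import xml.etree.ElementTree as ET


def rows_from_pdfco_xml(xml_text):
    """Return one list of cell dicts per <row>; empty cells have text ""."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iter("row"):
        cells = []
        for text in row.iter("text"):
            cell = {"text": (text.text or "").strip()}
            # Coordinates appear only on non-empty cells in the sample.
            if "x" in text.attrib and "y" in text.attrib:
                cell["x"] = float(text.attrib["x"])
                cell["y"] = float(text.attrib["y"])
            cells.append(cell)
        rows.append(cells)
    return rows
```

Keeping the coordinates alongside the text makes it possible to group cells by position later, for example to tell table columns apart by their x values.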
For more API endpoints please explore PDF.co API docs at https://apidocs.pdf.co
Useful Links
Product Page: https://pdf.co/
Pricing Policy: https://app.pdf.co/subscriptions
Developer Guide: https://apidocs.pdf.co/
Code Samples: https://github.com/bytescout/pdf-co-api-samples
Happy Coding! 🙂