The sample below demonstrates how to extract data from a PDF to text, XML or CSV in JavaScript and jQuery using the Cloud API (low level).
You may also find it useful to check this article: How to extract and convert spreadsheets between various file formats in JavaScript and jQuery using Cloud API.
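This sample covers the following functionality: uploading a PDF file, extracting its content as text, XML or CSV, and retrieving the PDF file info.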
First, let’s review the code, and then we’ll analyze it.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>PDFExtractor JQuery sample</title>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js"></script>
    <script src="pdf_extractor.js" type="text/javascript" encoding="UTF-8"></script>
</head>
<body>
    <form id="form" enctype="multipart/form-data">
        <p>
            <label>Copy-paste your API Key for api.pdf.co here</label>
            <input type="text" id="apiKey" placeholder="API Key" />
        </p>
        <p>
            <label>InputFile</label>
            <input type="file" name="file" id="inputFile" />
        </p>
        <p>
            <label>Extract</label>
            <select id="extractType">
                <option value="0" selected="selected">to Text</option>
                <option value="1">to XML</option>
                <option value="2">to CSV</option>
                <option value="3">get Info</option>
            </select>
        </p>
        <p>
            <label>Page Index (zero-based)</label>
            <input type="number" id="pageIndex" value="0">
        </p>
        <button type="button" id="submit">Extract</button>
    </form>
    <div id="errorBlock">
        <h2>Error:</h2>
        <h4>Code: <span id="statusCode"></span></h4>
        <ul id="errors"></ul>
    </div>
    <div id="resultBlock">
        <h2>Result:</h2>
        <pre id="result"></pre>
    </div>
</body>
</html>
$(document).ready(function () {
    $("#resultBlock").hide();
    $("#errorBlock").hide();
});

$(document).on("click", "#submit", function () {
    $("#resultBlock").hide();
    $("#errorBlock").hide();

    // Get your API key by registering at https://bytescout.com/
    var apiKey = $("#apiKey").val().trim();
    var urlUploadFile = "https://api.pdf.co/api/v1/file/upload?apiKey=" + apiKey;
    var formData = new FormData($("#form")[0]);
    var pageIndex = $("#pageIndex").val();

    $.ajax({
        url: urlUploadFile,
        type: "POST",
        data: formData,
        cache: false,
        contentType: false,
        processData: false,
        success: function (fileId) {
            switch ($("#extractType").val()) {
                case "0":
                    ExtractText(apiKey, fileId, pageIndex);
                    break;
                case "1":
                    ExtractXML(apiKey, fileId, pageIndex);
                    break;
                case "2":
                    ExtractCSV(apiKey, fileId, pageIndex);
                    break;
                case "3":
                    ExtractInfo(apiKey, fileId);
                    break;
            }
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function () {
                $("#errors").append($("<li></li>").html(this));
            });
        }
    });
});

function ExtractXML(apiKey, fileId, pageIndex) {
    var url = "https://api.pdf.co/api/v1/pdfextractor/xmlextractor/extract?apiKey=" + apiKey;
    var options = {
        "properties": {
            "startPageIndex": pageIndex,
            "endPageIndex": pageIndex,
            "extractInvisibleText": false
        },
        "inputType": "fileId",
        "input": fileId
    };
    $.ajax({
        url: url,
        type: "POST",
        data: JSON.stringify(options),
        contentType: "application/json",
        success: function (response) {
            $("#resultBlock").show();
            $("#result").text(xmlToString(response));
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function () {
                $("#errors").append($("<li></li>").html(this));
            });
        }
    });
}

function ExtractText(apiKey, fileId, pageIndex) {
    var url = "https://api.pdf.co/api/v1/pdfextractor/textextractor/extract?apiKey=" + apiKey;
    var options = {
        "properties": {
            "startPageIndex": pageIndex,
            "endPageIndex": pageIndex,
            "rtlTextAutoDetectionEnabled": true,
            "detectLinesInsteadOfParagraphs": false
        },
        "inputType": "fileId",
        "input": fileId
    };
    $.ajax({
        url: url,
        type: "POST",
        data: JSON.stringify(options),
        contentType: "application/json",
        success: function (response) {
            $("#resultBlock").show();
            $("#result").text(response);
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function () {
                $("#errors").append($("<li></li>").html(this));
            });
        }
    });
}

function ExtractCSV(apiKey, fileId, pageIndex) {
    var url = "https://api.pdf.co/api/v1/pdfextractor/csvextractor/extract?apiKey=" + apiKey;
    var options = {
        "properties": {
            "startPageIndex": pageIndex,
            "endPageIndex": pageIndex,
            "columnDetectionMode": "contentGroups",
            "extractInvisibleText": false
        },
        "InputType": "FileId",
        "Input": fileId
    };
    $.ajax({
        url: url,
        type: "POST",
        data: JSON.stringify(options),
        contentType: "application/json",
        success: function (response) {
            $("#resultBlock").show();
            $("#result").html(response);
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function () {
                $("#errors").append($("<li></li>").html(this));
            });
        }
    });
}

function ExtractInfo(apiKey, fileId) {
    var url = "https://api.pdf.co/api/v1/pdfextractor/infoextractor/extract?apiKey=" + apiKey;
    var options = {
        "inputType": "fileId",
        "input": fileId
    };
    $.ajax({
        url: url,
        type: "POST",
        data: JSON.stringify(options),
        contentType: "application/json",
        success: function (response) {
            $("#resultBlock").show();
            $("#result").html(response);
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function () {
                $("#errors").append($("<li></li>").html(this));
            });
        }
    });
}

function xmlToString(xmlData) {
    var xmlString;
    // IE
    if (window.ActiveXObject) {
        xmlString = xmlData.xml;
    }
    // code for Mozilla, Firefox, Opera, etc.
    else {
        xmlString = (new XMLSerializer()).serializeToString(xmlData);
    }
    return xmlString;
}
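This HTML file contains the structure for the following elements: an API key text input, a file input, an extraction type dropdown (to Text, to XML, to CSV, get Info), a zero-based page index input and an “Extract” button.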
Apart from these, the HTML file contains script references for the jQuery library and for the PDF extraction script. It also contains the error and result display containers.
This JavaScript file is built using the jQuery library. It handles the click event of the “Extract” button. Let’s analyze the click event handler logic.
First, we upload the user’s input file and receive the uploaded fileId in the response, as shown in the code below.
...
var formData = new FormData($("#form")[0]);
var pageIndex = $("#pageIndex").val();
$.ajax({
    url: urlUploadFile,
    type: "POST",
    data: formData,
    success: function (fileId) {
...
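The handler sends the upload request even when the API key is empty or no file has been chosen. A minimal sketch of a check that could run before the upload is shown here. `validateUploadInput` is a hypothetical helper that is not part of the original sample, and the `.pdf` extension test is an assumption about what the form should accept.

```javascript
// Hypothetical helper (not in the original sample): returns a list of
// problems found in the form values before the upload request is sent.
function validateUploadInput(apiKey, fileName) {
    var problems = [];
    if (typeof apiKey !== "string" || apiKey.trim() === "") {
        problems.push("API key is required.");
    }
    if (typeof fileName !== "string" || fileName === "") {
        problems.push("No input file selected.");
    } else if (!/\.pdf$/i.test(fileName)) {
        // Assumption: only files with a .pdf extension should be uploaded.
        problems.push("Input file must be a PDF.");
    }
    return problems;
}
```

In the click handler this could be called with the values of the `apiKey` and `inputFile` fields, skipping the upload when the returned array is not empty.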
Now that the file is uploaded and we have its fileId, we continue processing based on the requested extraction type, as follows.
Function Name | Purpose |
ExtractText(…) | Extract PDF data in text format |
ExtractXML(…) | Extract PDF data in XML format |
ExtractCSV(…) | Extract PDF data in CSV format |
ExtractInfo(…) | Get PDF file info |
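The `switch` in the upload callback can also be expressed as a lookup table. The sketch below maps each value of the `extractType` dropdown to the endpoint used by the matching function. `API_BASE`, `EXTRACTORS` and `endpointFor` are illustrative names that do not appear in the original script; the URLs are the ones used by the four functions in the table.

```javascript
// Illustrative lookup table: dropdown value -> extractor endpoint path.
var API_BASE = "https://api.pdf.co/api/v1/pdfextractor/";
var EXTRACTORS = {
    "0": "textextractor/extract", // ExtractText
    "1": "xmlextractor/extract",  // ExtractXML
    "2": "csvextractor/extract",  // ExtractCSV
    "3": "infoextractor/extract"  // ExtractInfo
};

// Returns the full endpoint URL for a dropdown value, or throws for
// a value that is not in the table.
function endpointFor(extractType) {
    if (!Object.prototype.hasOwnProperty.call(EXTRACTORS, extractType)) {
        throw new Error("Unknown extract type: " + extractType);
    }
    return API_BASE + EXTRACTORS[extractType];
}
```

Unlike the `switch`, which silently does nothing for an unexpected value, this version fails loudly.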
All these functions operate in the same way, so let’s explore the function “ExtractCSV” to get the overall idea.
First, we store the URL for the CSV extraction API call and pass the API key as a query string parameter.
var url = "https://api.pdf.co/api/v1/pdfextractor/csvextractor/extract?apiKey=" + apiKey;
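Plain concatenation works as long as the key consists of URL-safe characters. As a precaution against keys that contain characters such as `&` or `+`, the key can be passed through the standard `encodeURIComponent` function. `withApiKey` is a hypothetical helper name used only for this sketch.

```javascript
// Hypothetical helper: appends the API key as an encoded query parameter.
function withApiKey(endpoint, apiKey) {
    return endpoint + "?apiKey=" + encodeURIComponent(apiKey);
}

var url = withApiKey(
    "https://api.pdf.co/api/v1/pdfextractor/csvextractor/extract",
    "my-key" // placeholder value, not a real key
);
```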
Then we prepare the options parameter needed for CSV extraction by setting properties and keys. The properties “startPageIndex” and “endPageIndex” denote the zero-based start and end page indexes. We’ve set the property “columnDetectionMode” to “contentGroups”, which analyzes the extracted content by content groups. Please refer to the documentation for more information on the available options. If we also want to extract invisible text, we need to set the property “extractInvisibleText” to “true”. Apart from these basic properties, we need to pass the file input in the parameters “InputType” and “Input”.
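One detail worth noting: jQuery’s `.val()` returns a string, so the sample sends the page indexes as strings. The sketch below builds the same options object with the index converted to an integer first. `buildCsvOptions` is a hypothetical helper, and the article does not say whether the service requires numeric values, so treat the conversion as a precaution rather than a requirement.

```javascript
// Hypothetical helper: builds the options object used by ExtractCSV,
// converting the page index from the form's string value to an integer.
function buildCsvOptions(fileId, pageIndex, extractInvisibleText) {
    var index = parseInt(pageIndex, 10);
    if (isNaN(index) || index < 0) {
        throw new Error("Page index must be a non-negative integer.");
    }
    return {
        "properties": {
            "startPageIndex": index,
            "endPageIndex": index,
            "columnDetectionMode": "contentGroups",
            "extractInvisibleText": extractInvisibleText === true
        },
        // Key names follow the ExtractCSV function in the sample.
        "InputType": "FileId",
        "Input": fileId
    };
}
```

The result can be passed to `JSON.stringify` and sent as the request body, exactly as the sample does with its `options` variable.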
Once the input parameters are ready, we execute the POST request. When the result arrives, we display it in the result block. Simple!
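The error callbacks in the sample read `response.responseJSON.Errors` directly. That access fails when the response body is not JSON, for example on a network failure, where `responseJSON` is typically undefined. A defensive sketch follows; `errorMessages` is a hypothetical helper, and the fallback message text is an assumption.

```javascript
// Hypothetical helper: returns the list of error messages to display,
// falling back to a generic message when no JSON error list is present.
function errorMessages(response) {
    var body = response && response.responseJSON;
    if (body && Array.isArray(body.Errors) && body.Errors.length > 0) {
        return body.Errors.map(String);
    }
    var status = response && response.status ? response.status : 0;
    return ["Request failed (status " + status + ")."];
}
```

Each error callback could iterate over the returned array instead of over `response.responseJSON.Errors`.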
I hope this gives you a better understanding of how to extract data in various ways with the ByteScout API.
Happy Coding!
IMPORTANT:
Cloud API is deprecated and has been replaced by the more powerful and secure www.PDF.co Web API.