How to extract data from PDF to Text, XML or CSV in JavaScript & jQuery using Cloud API (low level) - ByteScout

The sample below demonstrates how to extract data from PDF to Text, XML or CSV in JavaScript & jQuery using the Cloud API (low level).

You may also find this article useful: How to extract and convert spreadsheets between various file formats in JavaScript and jQuery using Cloud API.

This sample covers the following functionality:

  • Converting PDF to TEXT
  • Converting PDF to XML
  • Converting PDF to CSV
  • Getting PDF file info like author, title, description, etc.

First, let’s review the code, and then we’ll analyze it.

Sample.html

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>PDFExtractor JQuery sample</title>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js"></script>
<script src="pdf_extractor.js" type="text/javascript" charset="UTF-8"></script>
</head>
<body>

<form id="form" enctype="multipart/form-data">
<p>
<label>Copy-paste your API Key for Bytescout.IO here</label>
<input type="text" id="apiKey" placeholder="API Key" />
</p>
<p>
<label>InputFile</label>
<input type="file" name="file" id="inputFile" />
</p>
<p>
<label>Extract</label>
<select id="extractType">
<option value="0" selected="selected">to Text</option>
<option value="1">to XML</option>
<option value="2">to CSV</option>
<option value="3">get Info</option>
</select>
</p>
<p>
<label>Page Index (zero-based)</label>
<input type="number" id="pageIndex" value="0">
</p>
<button type="button" id="submit">Extract</button>
</form>

<div id="errorBlock">
<h2>Error:</h2>
<h4>Code: <span id="statusCode"></span></h4>
<ul id="errors"></ul>
</div>

<div id="resultBlock">
<h2>Result:</h2>
<pre id="result"></pre>

</div>

</body>
</html>

pdf_extractor.js

$(document).ready(function () {
    $("#resultBlock").hide();
    $("#errorBlock").hide();
});

$(document).on("click", "#submit", function () {
    $("#resultBlock").hide();
    $("#errorBlock").hide();

    var apiKey = $("#apiKey").val().trim(); //Get your API key by registering at https://bytescout.com/

    var urlUploadFile = "https://bytescout.io/api/v1/file/upload?apiKey=" + apiKey;

    var formData = new FormData($("#form")[0]);
    var pageIndex = $("#pageIndex").val();

    $.ajax({
        url: urlUploadFile,
        type: "POST",
        data: formData,
        cache: false,
        contentType: false,
        processData: false,
        success: function (fileId) {
            switch ($("#extractType").val()) {
                case "0":
                    ExtractText(apiKey, fileId, pageIndex);
                    break;
                case "1":
                    ExtractXML(apiKey, fileId, pageIndex);
                    break;
                case "2":
                    ExtractCSV(apiKey, fileId, pageIndex);
                    break;
                case "3":
                    ExtractInfo(apiKey, fileId);
                    break;
            }
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function (index, message) {
                $("#errors").append($("<li></li>").text(message));
            });
        }
    });
});


function ExtractXML(apiKey, fileId, pageIndex) {
    var url = "https://bytescout.io/api/v1/pdfextractor/xmlextractor/extract?apiKey=" + apiKey;

    var options = {
        "properties": {
            "startPageIndex": pageIndex,
            "endPageIndex": pageIndex,
            "extractInvisibleText": false
        },
        "inputType": "fileId",
        "input": fileId
    };


    $.ajax({
        url: url,
        type: "POST",
        data: JSON.stringify(options),
        contentType: "application/json",
        success: function (response) {
            $("#resultBlock").show();
            $("#result").text(xmlToString(response));
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function (index, message) {
                $("#errors").append($("<li></li>").text(message));
            });
        }
    });
}

function ExtractText(apiKey, fileId, pageIndex) {
    var url = "https://bytescout.io/api/v1/pdfextractor/textextractor/extract?apiKey=" + apiKey;

    var options = {
        "properties": {
            "startPageIndex": pageIndex,
            "endPageIndex": pageIndex,
            "rtlTextAutoDetectionEnabled": true,
            "detectLinesInsteadOfParagraphs": false
        },
        "inputType": "fileId",
        "input": fileId
    };

    $.ajax({
        url: url,
        type: "POST",
        data: JSON.stringify(options),
        contentType: "application/json",
        success: function (response) {
            $("#resultBlock").show();
            $("#result").text(response);
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function (index, message) {
                $("#errors").append($("<li></li>").text(message));
            });
        }
    });
}

function ExtractCSV(apiKey, fileId, pageIndex) {
    var url = "https://bytescout.io/api/v1/pdfextractor/csvextractor/extract?apiKey=" + apiKey;

    var options = {
        "properties": {
            "startPageIndex": pageIndex,
            "endPageIndex": pageIndex,
            "columnDetectionMode": "contentGroups",
            "extractInvisibleText": false
        },
        "InputType": "FileId",
        "Input": fileId
    };

    $.ajax({
        url: url,
        type: "POST",
        data: JSON.stringify(options),
        contentType: "application/json",
        success: function (response) {
            $("#resultBlock").show();
            $("#result").text(response); // show the CSV as plain text, not as markup
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function (index, message) {
                $("#errors").append($("<li></li>").text(message));
            });
        }
    });
}

function ExtractInfo(apiKey, fileId) {
    var url = "https://bytescout.io/api/v1/pdfextractor/infoextractor/extract?apiKey=" + apiKey;

    var options = {
        "inputType": "fileId",
        "input": fileId
    };

    $.ajax({
        url: url,
        type: "POST",
        data: JSON.stringify(options),
        contentType: "application/json",
        success: function (response) {
            $("#resultBlock").show();
            $("#result").html(response);
        },
        error: function (response) {
            $("#errorBlock").show();
            $("#statusCode").html(response.status);
            $("#errors").html("");
            $.each(response.responseJSON.Errors, function (index, message) {
                $("#errors").append($("<li></li>").text(message));
            });
        }
    });
}


function xmlToString(xmlData) {

    var xmlString;
    //IE
    if (window.ActiveXObject) {
        xmlString = xmlData.xml;
    }
    // code for Mozilla, Firefox, Opera, etc.
    else {
        xmlString = (new XMLSerializer()).serializeToString(xmlData);
    }
    return xmlString;
}

Sample.html

This HTML file contains the form elements for the following inputs:

  • The API key
  • The input file to upload
  • The extraction type (Text, XML, CSV or Info)
  • The page index within the PDF file

In addition, the HTML file references the jQuery library and the PDF extraction script, and it contains the containers that display errors and results.

pdf_extractor.js

This JavaScript file is built on the jQuery library. Its main job is to handle the click event of the “Extract” button. Let’s analyze the click handler logic.

First, we upload the file selected by the user and receive the uploaded file’s fileId in the response, as shown in the code below.

...
var formData = new FormData($("#form")[0]);
var pageIndex = $("#pageIndex").val();

$.ajax({
url: urlUploadFile,
type: "POST",
data: formData,
success: function (fileId) {
...

Now that the file is uploaded and we have its fileId, processing continues according to the requested extraction type:

Function Name      Purpose
ExtractText(…)     Extract PDF data in text format
ExtractXML(…)      Extract PDF data in XML format
ExtractCSV(…)      Extract PDF data in CSV format
ExtractInfo(…)     Get PDF file info
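
The mapping from the selected extraction type to an endpoint can also be written as a lookup table instead of a switch statement. The following is a minimal sketch, not part of the original sample: the names API_BASE, EXTRACTOR_PATHS and buildExtractUrl are hypothetical, while the URL segments are copied from the listing above. It also URL-encodes the API key, which the original sample does not do.

```javascript
// Base URL and path segments are taken from the sample listing.
var API_BASE = "https://bytescout.io/api/v1/pdfextractor/";

// Keys are the values of the <select id="extractType"> options.
var EXTRACTOR_PATHS = {
    "0": "textextractor",
    "1": "xmlextractor",
    "2": "csvextractor",
    "3": "infoextractor"
};

// Hypothetical helper: returns the request URL for an extraction type,
// or null when the type is not one of the four known values.
function buildExtractUrl(extractType, apiKey) {
    if (!Object.prototype.hasOwnProperty.call(EXTRACTOR_PATHS, extractType)) {
        return null;
    }
    return API_BASE + EXTRACTOR_PATHS[extractType] +
        "/extract?apiKey=" + encodeURIComponent(apiKey);
}
```

A caller could pass $("#extractType").val() as the first argument and show an error when the result is null.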

All of these functions work the same way, so let’s explore “ExtractCSV” to get the overall idea.

First, we store the URL of the CSV extraction API call, passing the apiKey in the query string.

var url = "https://bytescout.io/api/v1/pdfextractor/csvextractor/extract?apiKey=" + apiKey;

Next, we prepare the options object required for CSV extraction. The properties “startPageIndex” and “endPageIndex” are the zero-based indexes of the first and last pages to process; the sample sets both to the same value, so a single page is extracted. The property “columnDetectionMode” is set to “contentGroups”, which tries to detect columns by analyzing groups of content. To extract invisible text as well, set the property “extractInvisibleText” to “true”. Please refer to the documentation for more information on the available options. Besides these properties, the uploaded file is passed in the parameters “InputType” and “Input”.
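
One detail worth noting: $("#pageIndex").val() returns a string, so the sample sends the page indexes as strings. The sketch below builds the same options object with the index parsed to a number. The helper name buildCsvOptions and the fallback to page 0 are assumptions made for illustration; whether the API requires a numeric value is not stated in this article.

```javascript
// Hypothetical helper: builds the request body for the CSV extractor.
// Property names and values mirror the ExtractCSV function in the sample.
function buildCsvOptions(fileId, pageIndexInput, includeInvisibleText) {
    var pageIndex = parseInt(pageIndexInput, 10);
    if (isNaN(pageIndex) || pageIndex < 0) {
        pageIndex = 0; // assumption: fall back to the first page
    }
    return {
        "properties": {
            "startPageIndex": pageIndex,
            "endPageIndex": pageIndex,
            "columnDetectionMode": "contentGroups",
            "extractInvisibleText": includeInvisibleText === true
        },
        "InputType": "FileId",
        "Input": fileId
    };
}
```

The result can be passed to JSON.stringify and sent as the request data, exactly as the sample does with its options variable.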

Once the input parameters are ready, we execute the POST request and, on success, display the result in the result container. Simple!
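
The error callbacks in the sample all repeat the same logic, and they assume the response always carries a JSON body with an Errors array. The sketch below pulls that logic into one function that also copes with a missing body, for example after a network failure. The name getErrorMessages and the fallback message text are hypothetical; the Errors property name is taken from the sample.

```javascript
// Hypothetical helper: returns the list of error messages for a failed request.
// "response" is the object jQuery passes to the error callback.
function getErrorMessages(response) {
    var body = response && response.responseJSON;
    if (body && Array.isArray(body.Errors) && body.Errors.length > 0) {
        return body.Errors.map(String);
    }
    // No JSON error list available: report the HTTP status instead.
    var status = response && response.status ? response.status : 0;
    return ["Request failed (status " + status + ")"];
}

// Possible use inside an error callback (requires jQuery and the sample page):
// $.each(getErrorMessages(response), function (index, message) {
//     $("#errors").append($("<li></li>").text(message));
// });
```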

I hope this gives you a better understanding of the various ways to extract data with the ByteScout API.

Happy Coding!
