Home
/
Blog
/
Guide on How to Convert Excel and PDF into Flat Files

Guide on How to Convert Excel and PDF into Flat Files

You might have come across the name flat-file and wondered what it is and why it is important. Well, a flat file is a file that contains records that do not have a structured interrelationship. It is simply a text file that does not contain any structured characters of processing from application files such as Excel and PDF.

A flat file is made up of a single data table. A flat file allows a user to specify data attributes e.g. columns and data types per table. These are then stored separately from applications such as PDF or word processor. Flat files are very common in data warehousing where they are used to import data.

Flat files have oftentimes been confused with relational databases. These are actually very different. While a flat file allows the specification of data attributes (e.g. columns and datatypes) for a table at a time and storing these independent of the application being used, a relational database, on the other hand, goes a step further and allows the specification of the information in multiple tables as well as the relationships between these tables.

Relational databases provide for more logical steps than flat files by allowing more rules of data that the tables must follow.

One of the commonly used flat files is the CSV file. This stands for a comma-separated values file. In a CSV file, data in tables is composed of American Standard Code for Information Interchange (known as ASCII) text. Here, the values contained in every cell of a table are separated by a comma while each row in the table is represented in a new line. The reason why CSV files are most preferred is that they represent relational data in a text file.

In this guide, we discuss how to convert both Excel and PDF files into CSV files using the PDF Extractor SDK and Spreadsheet SDK by ByteScout, Inc.

Step 1: Initialize

As the first step, we need to initialize our program by creating an extractor instance. This is as seen in the code below.

// Create Bytescout.PDFExtractor.CSVExtractor instance

CSVExtractor extractor = new CSVExtractor();

extractor.RegistrationName = "demo";

extractor.RegistrationKey = "demo";

It is in the initialization stage that we give the extractor instance a name and we input the registration key. The registration key will be provided upon purchase of the ByteScout SDK. As you can see in the code above, we are going to use “demo” for this guide.

You will need the actual key in the real programming environment.

Step 2: Load PDF document

The second step involves loading the PDF document that is to be converted into your program. This can be done in two ways; you can either use the document’s name or specify its location by using the file path to the document. Either of these will work perfectly.

See the code below;

// Load sample PDF document

extractor.LoadDocumentFromFile("sample3.pdf");

Step 3: Extract data

This is the step where we perform the actual extraction of data from the PDF document into a flat file. As seen in the code below, you can give the exact specifications of how you wish your document to be converted.

Begin VB.Form Form1

Caption = "PDF to CSV"

ClientHeight = 1095

ClientLeft = 120

ClientTop = 465

ClientWidth = 3675

LinkTopic = "Form1"

ScaleHeight = 1095

ScaleWidth = 3675

StartUpPosition = 3 'Windows Default

Begin VB.CommandButton cmd_pdf_to_csv

Caption = "Convert PDF to CSV"

Height = 855

Left = 120

TabIndex = 0

Top = 120

Width = 3495

End

Step 4: Save Output to CSV

After extraction, you need to save the converted data into CSV by using the code below.

//Peform Save to CSV file

extractor.SaveCSVToFile "output.csv"

This process is the same when using the other mentioned languages. The codes below show how to go about the above process using different languages.

How to convert Excel to CSV flat file (Using C#)

static void Main(string[] args)
{
// Load document
Spreadsheet document = new Spreadsheet();
document.LoadFromFile("SimpleReport.xls");
&nbsp;
// Specify worksheet to be exported as CSV
document.Workbook.Worksheets[0].SaveAsCSV("SimpleReport.csv");
&nbsp;
// Clean up
document.Close();
&nbsp;
// open output document in default viewer
Process.Start("SimpleReport.csv");
}

How to convert PDF to CSV flat file

Data in PDF files such as tables need to be extracted first before they are converted to CSV files. Here, we show how to extract tables from PDF and then convert them to CSV using C# and VBScript (Visual Basic 6) languages.

How to find a table in PDF and extract it as CSV in C#

using System;
using Bytescout.PDFExtractor;
namespace ExtractTextByPages
{
class Program
{
static void Main(string[] args)
{
// Create Bytescout.PDFExtractor.TextExtractor instance
CSVExtractor extractor = new CSVExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";
TableDetector tdetector = new TableDetector();
tdetector.RegistrationKey = "demo";
tdetector.RegistrationName = "demo";
// we should define what kind of tables we should detect
// so we set min required number of columns to 3
tdetector.DetectionMinNumberOfColumns = 3;
// and we set min required number of columns to 3
tdetector.DetectionMinNumberOfRows = 3;
// Load sample PDF document
extractor.LoadDocumentFromFile("sample3.pdf");
tdetector.LoadDocumentFromFile("sample3.pdf");
// Get page count
int pageCount = tdetector.GetPageCount();
for (int i = 0; i &lt; pageCount; i++)
{
int j = 1;
// find first table and continue if found
if (tdetector.FindTable(i))
do
{
// set extraction area for CSV extractor to rectangle given by table detector
extractor.SetExtractionArea(tdetector.GetFoundTableRectangle_Left(),
tdetector.GetFoundTableRectangle_Top(),
tdetector.GetFoundTableRectangle_Width(),
tdetector.GetFoundTableRectangle_Height()
);
// and finally save the table into CSV file
extractor.SavePageCSVToFile(i, "page-" + i + "-table-" + j + ".csv");
j++;
} while (tdetector.FindNextTable()); // search next table
}
// Open first output file in default associated application
System.Diagnostics.Process.Start("page-0-table-1.csv");
}
}
}

How to find a table in PDF and extract it as CSV in VBScript (Visual Basic 6)

Create Bytescout.PDFExtractor.TextExtractor object
Set tdetector= CreateObject("Bytescout.PDFExtractor.TableDetector")
tdetector.RegistrationName = "demo"
tdetector.RegistrationKey = "demo"
' Create Bytescout.PDFExtractor.CSVExtractor object
Set extractor = CreateObject("Bytescout.PDFExtractor.CSVExtractor")
extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"
' we should define what kind of tables we should detect
' so we set min required number of columns to 3
tdetector.DetectionMinNumberOfColumns = 3
' and we set min required number of columns to 3
tdetector.DetectionMinNumberOfRows = 3
' Load sample PDF document into table detector
tdetector.LoadDocumentFromFile("..\..\sample3.pdf")
' Load sample PDF document into CSV extractor
extractor.LoadDocumentFromFile "..\..\sample3.pdf"
' Get page count
pageCount = tdetector.GetPageCount()
For i=0 to PageCount-1
If tdetector.FindTable(i) Then ' parameters are: page index, string to find, case sensitivity
Do
MsgBox "Found a table on page #" & CStr(i) & " at left=" & CStr(tdetector.GetFoundTable
 Rectangle_Left) & "; top=" & CStr(tdetector.GetFoundTableRectangle_Top) & "; width=" & CStr
(tdetector.GetFoundTableRectangle_Width) & "; height=" & 
CStr(tdetector.GetFoundTableRectangle_Height)
' set extraction area to extract table data as CSV
extractor.SetExtractionArea tdetector.GetFoundTableRectangle_Left, tdetector.GetFoundTable
Rectangle_Top, 
tdetector.GetFoundTableRectangle_Width, tdetector.GetFoundTableRectangle_Height
' define filename to save CSV
CSVFileName = "page-" & CStr(i) & "-table-at-" & CStr(tdetector.GetFoundTableRectangle_Top) 
& ".csv"
' save CSV from this page (bounded by extraction area) into file
extractor.SavePageCSVToFile i, CSVFileName
MsgBox "Table saved into CSV as " & CSVFileName
' reset extraction area on the page
extractor.ResetExtractionArea
Loop While tdetector.FindNextTable
End If
Next
MsgBox "Done"
Set tdetector= Nothing