You might have come across the name flat-file and wondered what it is and why it is important. Well, a flat file is a file that contains records that do not have a structured interrelationship. It is simply a text file that does not contain any structured characters of processing from application files such as Excel and PDF.
A flat file is made up of a single data table. A flat file allows a user to specify data attributes e.g. columns and data types per table. These are then stored separately from applications such as PDF or word processor. Flat files are very common in data warehousing where they are used to import data.
Flat files have oftentimes been confused with relational databases. These are actually very different. While a flat file allows the specification of data attributes (e.g. columns and datatypes) for a table at a time and storing these independent of the application being used, a relational database, on the other hand, goes a step further and allows the specification of the information in multiple tables as well as the relationships between these tables.
Relational databases provide for more logical steps than flat files by allowing more rules of data that the tables must follow.
One of the commonly used flat files is the CSV file. This stands for a comma-separated values file. In a CSV file, data in tables is composed of American Standard Code for Information Interchange (known as ASCII) text. Here, the values contained in every cell of a table are separated by a comma while each row in the table is represented in a new line. The reason why CSV files are most preferred is that they represent relational data in a text file.
In this guide, we discuss how to convert both Excel and PDF files into CSV files using the PDF Extractor SDK and Spreadsheet SDK by ByteScout, Inc.
As the first step, we need to initialize our program by creating an extractor instance. This is as seen in the code below.
// Create Bytescout.PDFExtractor.CSVExtractor instance
CSVExtractor extractor = new CSVExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";
It is in the initialization stage that we give the extractor instance a name and we input the registration key. The registration key will be provided upon purchase of the ByteScout SDK. As you can see in the code above, we are going to use “demo” for this guide.
You will need the actual key in the real programming environment.
The second step involves loading the PDF document that is to be converted into your program. This can be done in two ways; you can either use the document’s name or specify its location by using the file path to the document. Either of these will work perfectly.
See the code below;
// Load sample PDF document
extractor.LoadDocumentFromFile("sample3.pdf");
This is the step where we perform the actual extraction of data from the PDF document into a flat file. As seen in the code below, you can give the exact specifications of how you wish your document to be converted.
Begin VB.Form Form1
Caption = "PDF to CSV"
ClientHeight = 1095
ClientLeft = 120
ClientTop = 465
ClientWidth = 3675
LinkTopic = "Form1"
ScaleHeight = 1095
ScaleWidth = 3675
StartUpPosition = 3 'Windows Default
Begin VB.CommandButton cmd_pdf_to_csv
Caption = "Convert PDF to CSV"
Height = 855
Left = 120
TabIndex = 0
Top = 120
Width = 3495
End
End
After extraction, you need to save the converted data into CSV by using the code below.
//Peform Save to CSV file
extractor.SaveCSVToFile "output.csv"
This process is the same when using the other mentioned languages. The codes below show how to go about the above process using different languages.
static void Main(string[] args) { // Load document Spreadsheet document = new Spreadsheet(); document.LoadFromFile("SimpleReport.xls"); // Specify worksheet to be exported as CSV document.Workbook.Worksheets[0].SaveAsCSV("SimpleReport.csv"); // Clean up document.Close(); // open output document in default viewer Process.Start("SimpleReport.csv"); }
Data in PDF files such as tables need to be extracted first before they are converted to CSV files. Here, we show how to extract tables from PDF and then convert them to CSV using C# and VBScript (Visual Basic 6) languages.
using System; using Bytescout.PDFExtractor; namespace ExtractTextByPages { class Program { static void Main(string[] args) { // Create Bytescout.PDFExtractor.TextExtractor instance CSVExtractor extractor = new CSVExtractor(); extractor.RegistrationName = "demo"; extractor.RegistrationKey = "demo"; TableDetector tdetector = new TableDetector(); tdetector.RegistrationKey = "demo"; tdetector.RegistrationName = "demo"; // we should define what kind of tables we should detect // so we set min required number of columns to 3 tdetector.DetectionMinNumberOfColumns = 3; // and we set min required number of columns to 3 tdetector.DetectionMinNumberOfRows = 3; // Load sample PDF document extractor.LoadDocumentFromFile("sample3.pdf"); tdetector.LoadDocumentFromFile("sample3.pdf"); // Get page count int pageCount = tdetector.GetPageCount(); for (int i = 0; i < pageCount; i++) { int j = 1; // find first table and continue if found if (tdetector.FindTable(i)) do { // set extraction area for CSV extractor to rectangle given by table detector extractor.SetExtractionArea(tdetector.GetFoundTableRectangle_Left(), tdetector.GetFoundTableRectangle_Top(), tdetector.GetFoundTableRectangle_Width(), tdetector.GetFoundTableRectangle_Height() ); // and finally save the table into CSV file extractor.SavePageCSVToFile(i, "page-" + i + "-table-" + j + ".csv"); j++; } while (tdetector.FindNextTable()); // search next table } // Open first output file in default associated application System.Diagnostics.Process.Start("page-0-table-1.csv"); } } }
Create Bytescout.PDFExtractor.TextExtractor object Set tdetector= CreateObject("Bytescout.PDFExtractor.TableDetector") tdetector.RegistrationName = "demo" tdetector.RegistrationKey = "demo" ' Create Bytescout.PDFExtractor.CSVExtractor object Set extractor = CreateObject("Bytescout.PDFExtractor.CSVExtractor") extractor.RegistrationName = "demo" extractor.RegistrationKey = "demo" ' we should define what kind of tables we should detect ' so we set min required number of columns to 3 tdetector.DetectionMinNumberOfColumns = 3 ' and we set min required number of columns to 3 tdetector.DetectionMinNumberOfRows = 3 ' Load sample PDF document into table detector tdetector.LoadDocumentFromFile("..\..\sample3.pdf") ' Load sample PDF document into CSV extractor extractor.LoadDocumentFromFile "..\..\sample3.pdf" ' Get page count pageCount = tdetector.GetPageCount() For i=0 to PageCount-1 If tdetector.FindTable(i) Then ' parameters are: page index, string to find, case sensitivity Do MsgBox "Found a table on page #" & CStr(i) & " at left=" & CStr(tdetector.GetFoundTable Rectangle_Left) & "; top=" & CStr(tdetector.GetFoundTableRectangle_Top) & "; width=" & CStr (tdetector.GetFoundTableRectangle_Width) & "; height=" & CStr(tdetector.GetFoundTableRectangle_Height) ' set extraction area to extract table data as CSV extractor.SetExtractionArea tdetector.GetFoundTableRectangle_Left, tdetector.GetFoundTable Rectangle_Top, tdetector.GetFoundTableRectangle_Width, tdetector.GetFoundTableRectangle_Height ' define filename to save CSV CSVFileName = "page-" & CStr(i) & "-table-at-" & CStr(tdetector.GetFoundTableRectangle_Top) & ".csv" ' save CSV from this page (bounded by extraction area) into file extractor.SavePageCSVToFile i, CSVFileName MsgBox "Table saved into CSV as " & CSVFileName ' reset extraction area on the page extractor.ResetExtractionArea Loop While tdetector.FindNextTable End If Next MsgBox "Done" Set tdetector= Nothing