Templates can be written in YAML or JSON. A template defines one or more keywords used to select the right template for a document, and expressions for the fields to be extracted.
A single template file can contain multiple templates. In a YAML file, separate templates with a --- line. In a JSON file, templates must be arranged as an array [].
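As a rough illustration of the array rule for JSON, the Python sketch below writes two minimal templates as one JSON array and reads them back. It assumes the JSON form uses the same key names as the YAML form; the second template and all of its values are invented for the example.

```python
import json

# Two minimal templates. Key names mirror the YAML sample; the assumption
# is that the JSON form uses the same names.
templates = [
    {
        "templateVersion": 2,
        "sourceId": "ACME Inc. Invoice",
        "detectionRules": {"keywords": [r"ACME Inc\.", "Invoice No"]},
        "fields": {"invoiceNumber": {"expression": "Invoice No.: {{ABC123+}}"}},
    },
    {
        "templateVersion": 2,
        "sourceId": "Example Ltd. Receipt",  # hypothetical second template
        "detectionRules": {"keywords": ["Example Ltd", "Receipt"]},
        "fields": {"total": {"expression": r"TOTAL\s+(\d+\.\d+)"}},
    },
]

# A JSON template file holds all templates in one top-level array.
json_text = json.dumps(templates, indent=2)

loaded = json.loads(json_text)
source_ids = [t["sourceId"] for t in loaded]
print(source_ids)
```

Note that backslashes in regular expressions are doubled in the serialized JSON text, because JSON strings require it.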
Sample YAML template showing the main features:
---
templateVersion: 2
templatePriority: 1
sourceId: ACME Inc. Invoice
culture: en-US
detectionRules:
  keywords:
  - ACME Inc\.
  - Invoice No
  - ABN 01 234 567 890
fields:
  companyName:
    expression: ACME Inc.
    static: true
  invoiceNumber:
    expression: 'Invoice No.: {{ABC123+}}'
    pageIndex: 0
  invoiceDate:
    expression: 'Invoice Date: {{123+}}'
    type: date
    dateFormat: MM/dd/yyyy
  billTo:
    rect:
    - 32
    - 150
    - 348
    - 70
    expression: '(?s)Bill to:(?<value>.*)'
    pageIndex: 0
  total:
    expression: TOTAL\s+(\d+\.\d+)
    type: decimal
tables:
- name: table1
  start:
    expression: Item\s+Quantity\s+Price\s+Total
  end:
    expression: TOTAL
  row:
    expression: ^\s*(?<description>\w+.*?)\s+(?<quantity>\d+)\s+(?<unitPrice>\d+\.\d{2})\s+(?<itemTotal>\d+\.\d{2})\s*$
  columns:
  - name: description
    type: string
  - name: quantity
    type: integer
  - name: unitPrice
    type: decimal
  - name: itemTotal
    type: decimal
  multipage: true
Templates are sorted and tried by templatePriority, then alphabetically. 0 is the highest priority, 999999 is the lowest.
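The ordering rule can be sketched in Python as a sort on two keys. The text does not say which name the alphabetical tie-break uses, so sorting by sourceId here is an assumption, and the template headers are invented.

```python
# Hypothetical template headers; only the two sort keys matter here.
templates = [
    {"sourceId": "Zeta Corp Invoice", "templatePriority": 1},
    {"sourceId": "ACME Inc. Invoice", "templatePriority": 1},
    {"sourceId": "Generic Invoice", "templatePriority": 999999},
    {"sourceId": "Priority Vendor", "templatePriority": 0},
]

def trial_order(items):
    # Lower templatePriority first (0 is the highest priority);
    # ties are broken alphabetically, here by sourceId (an assumption).
    return sorted(items, key=lambda t: (t["templatePriority"], t["sourceId"]))

order = [t["sourceId"] for t in trial_order(templates)]
print(order)
```

A catch-all template would typically get a large templatePriority value so that more specific templates are tried first.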
sourceId is a name that identifies the design of the document. It is passed to the result unchanged.
culture is the template culture, which affects the detection of dates and decimal numbers.
For example, if the en-US culture is set, the parser will expect dates in month-day-year sequence, and decimal numbers with . as the decimal symbol and , as the digit grouping symbol.
For the fr-FR culture, the parser will expect dates in day-month-year sequence, and decimal numbers with , as the decimal symbol and a space as the digit grouping symbol.
You can find the list of culture names at https://msdn.microsoft.com/en-us/library/cc233982.aspx.
Example:
culture: fr-FR
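To make the effect of the culture on decimal numbers concrete, here is a small Python sketch. The separator table is written by hand for the two cultures mentioned above and is only an illustration; the parser itself takes these settings from the culture.

```python
from decimal import Decimal

# Separators for the two cultures described above. The table is hand-written
# for illustration; the parser takes these from the culture itself.
SEPARATORS = {
    "en-US": {"decimal": ".", "group": ","},
    "fr-FR": {"decimal": ",", "group": " "},
}

def parse_decimal(text, culture):
    sep = SEPARATORS[culture]
    # Treat no-break spaces like ordinary spaces before removing grouping.
    cleaned = text.replace("\u00a0", " ").replace("\u202f", " ")
    cleaned = cleaned.replace(sep["group"], "")
    cleaned = cleaned.replace(sep["decimal"], ".")
    return Decimal(cleaned)

us_value = parse_decimal("1,234.56", "en-US")
fr_value = parse_decimal("1 234,56", "fr-FR")
print(us_value, fr_value)
```

Both calls yield the same number, which is why a template with the wrong culture can silently misread amounts.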
detectionRules lists a few keywords or regular expressions (Regex) that uniquely identify the document design.
Note that you must escape the symbols +*.[]()\$ with \ because they are Regex special characters.
Example:
detectionRules:
  keywords:
  - ACME Inc\.
  - \[CONFIDENTIAL\]
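The following Python sketch shows why the escaping matters. It uses Python's re module rather than the .NET regex engine, but these particular special characters behave the same way in both; the sample text is invented.

```python
import re

text = "Supplier: ACME IncX   [CONFIDENTIAL]"

# Unescaped, "." matches any character, so this keyword also matches "IncX".
dot_unescaped = re.search(r"ACME Inc.", text)
# Escaped, "\." matches only a literal dot, so "IncX" no longer matches.
dot_escaped = re.search(r"ACME Inc\.", text)

# Unescaped brackets form a character class that matches one single letter...
loose = re.search(r"[CONFIDENTIAL]", text)
# ...while the escaped form matches the literal text in brackets.
exact = re.search(r"\[CONFIDENTIAL\]", text)

print(dot_unescaped is not None, dot_escaped is None)
print(loose.group(0), exact.group(0))
```

An unescaped keyword can therefore select the wrong template for a document that merely looks similar.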
If your PDF file contains multiple documents to parse, the documentStart regular expression should indicate the beginning of each new document in the PDF file.
Example:
documentStart: TAX INVOICE
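Conceptually, documentStart cuts the extracted text at every match. The Python sketch below illustrates that idea on invented text; it is not the parser's implementation, and dropping any text before the first match is a simplification made here.

```python
import re

def split_documents(text, document_start):
    # Each match of the start expression begins a new document.
    starts = [m.start() for m in re.finditer(document_start, text)]
    if not starts:
        return [text]
    # Text before the first match is dropped in this sketch.
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

pdf_text = "TAX INVOICE\nNo. 1\nTotal 10.00\nTAX INVOICE\nNo. 2\nTotal 20.00\n"
docs = split_documents(pdf_text, r"TAX INVOICE")
print(len(docs))
```

The expression should match only once per document; a phrase that is repeated on every page would split a multi-page document into pieces.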
The fields section lists standalone fields to extract, for example the invoice number or the invoice date.
Field parameters:
Remarks:
If you use several macros in one expression, only the last one is passed to the result.
If the regex contains no capturing groups, the entire match goes to the result. With groups, the last group or the named group goes to the result.
Do not use both macros and regular expressions in the same expression.
Special case: the expression can also contain the name of a special function.
Currently available special functions:
$$funcFindCompany – searches the document for the company name from a predefined list of known companies.
Examples of expression parameter:
# The sequence detected by the macro will go to the result
expression: 'Invoice No.: {{ABC123}}'
# The entire match will go to the result
expression: \w{6}-\d{5}
# The last capturing group will go to the result
expression: 'Account number:\s+(\d+)'
# Only the match of <value> group will go to the result
expression: 'Total\s+(?:USD|€|\$|£|¥)?\s*(?<value>(\d+,?)+\.\d\d)'
# Special function
expression: $$funcFindCompany
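The rule for which part of a regex match becomes the field value can be sketched in Python. This is an illustration only: it assumes the named group is called value (as in the examples above) and that it takes precedence over the last group, and it converts the .NET named-group syntax to Python's before matching. The sample input strings are invented.

```python
import re

def to_python_regex(pattern):
    # .NET writes named groups as (?<name>...); Python's re needs (?P<name>...).
    # The lookahead leaves lookbehinds such as (?<=...) and (?<!...) untouched.
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)

def extract(pattern, text):
    m = re.search(to_python_regex(pattern), text)
    if m is None:
        return None
    if "value" in m.groupdict():   # named group wins (assumed name: value)
        return m.group("value")
    if m.re.groups > 0:            # otherwise the last capturing group
        return m.group(m.re.groups)
    return m.group(0)              # no groups: the entire match

whole = extract(r"\w{6}-\d{5}", "Ref ABCDEF-12345")
last = extract(r"Account number:\s+(\d+)", "Account number:   98765")
named = extract(r"Total\s+(?:USD|€|\$|£|¥)?\s*(?<value>(\d+,?)+\.\d\d)", "Total USD 1,234.56")
print(whole, last, named)
```

The third pattern shows why the named group is useful: its inner group (\d+,?) is the last capturing group, so without the name only a fragment of the amount would be returned.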
fields:
  companyName:
    expression: ACME Inc.
    static: true

fields:
  billTo:
    rect:
    - 10
    - 10
    - 200
    - 100
    expression: '(?s)Bill to:(?<value>.*)'
pageIndex is zero-based: 0 is the first page.
A parsed date is output in a form like 2018-01-04T00:00:00, but you can specify your own output format, e.g. yyyy-MM-dd.
If field1 is not successfully parsed, field1a will be used to replace field1 in the result:
fields:
  field1:
    rect:
    - 10
    - 10
    - 100
    - 25
  field1a:
    rect:
    - 10
    - 50
    - 100
    - 25
    coalesceWith: field1
Note 1: If you come across different number representations in the same document, you can override the template culture by appending a culture name to the data type name. That single field will then be parsed according to the specified culture.
Example:
type: decimal[fr-FR]
Note 2: The dateFormat and outputDateFormat can contain a format string defining the exact date format. Find the format string description here: https://docs.microsoft.com/en-us/dotnet/api/system.datetime.tryparseexact.
Example:
type: date
dateFormat: MM-dd-yyyy
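For readers more familiar with strptime-style formats, the Python sketch below translates the few .NET format tokens used in this document and parses a date with them. The mapping is deliberately incomplete and the input date is invented; the parser itself uses the .NET format strings directly.

```python
from datetime import datetime

# Only the custom format tokens used in this document are mapped; the full
# .NET format language has many more.
TOKENS = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d")]

def to_strptime(net_format):
    out = net_format
    for net, py in TOKENS:
        out = out.replace(net, py)
    return out

# Parse with an input format, then render with a different output format.
parsed = datetime.strptime("01-04-2018", to_strptime("MM-dd-yyyy"))
iso_text = parsed.isoformat()
custom_text = parsed.strftime(to_strptime("yyyy-MM-dd"))
print(iso_text, custom_text)
```

With MM-dd-yyyy the input is read as January 4; the same text under dd-MM-yyyy would be April 1, so the format must match the document.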
The dateFormat can also contain auto-format strings:
auto-MDY – the parser will try to detect the date format automatically, assuming the date is in month-day-year sequence.
auto-DMY – the parser will try to detect the date format automatically, assuming the date is in day-month-year sequence.
auto-YMD – the parser will try to detect the date format automatically, assuming the date is in year-month-day sequence.
auto – the parser will try to detect the format automatically, taking the date parts sequence from the template culture.
Example:
type: date
dateFormat: auto-DMY
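The idea behind the auto-format strings is that the separators are detected while the order of the parts is given. A minimal Python sketch of that idea follows; it handles numeric dates only, and treating two-digit years as 20xx is an assumption made for the example, not documented parser behavior.

```python
import re
from datetime import date

def parse_auto(text, order):
    # order is "MDY", "DMY" or "YMD", as in auto-MDY / auto-DMY / auto-YMD.
    parts = [int(p) for p in re.findall(r"\d+", text)]
    if len(parts) != 3:
        raise ValueError(f"expected three numeric parts in {text!r}")
    fields = dict(zip(order, parts))
    year = fields["Y"]
    if year < 100:
        year += 2000  # assumption: two-digit years are in the 2000s
    return date(year, fields["M"], fields["D"])

same_day = [
    parse_auto("12/31/2019", "MDY"),
    parse_auto("31.12.19", "DMY"),
    parse_auto("2019-12-31", "YMD"),
]
ambiguous = (parse_auto("04/01/2018", "MDY"), parse_auto("04/01/2018", "DMY"))
print(same_day, ambiguous)
```

The last pair shows why the order still has to be stated: the same text is April 1 under MDY and January 4 under DMY.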
The tables section describes tabular data you need to extract. A table is defined by regular expressions that find the table start, the table end, and the rows.
The tables section can contain multiple table definitions arranged as an array.
Table parameters:
start:
  expression – regular expression to find the start of the table, or:
  y – the coordinate that defines the top of the table,
  pageIndex – index of the page containing the table start and y coordinate.
end:
  expression – regular expression to find the end of the table, or:
  y – the coordinate that defines the bottom of the table.
Sub-item start:
  expression – regular expression to find the start of the sub-item.
Sub-item end:
  expression – regular expression to find the end of the sub-item.
row:
  expression – main regular expression to find a row. Names of capturing groups should correspond to column names in the columns array.
  subExpression1, subExpression2, subExpression3, subExpression4, subExpression5 – additional expressions to parse remaining parts of row data which the main expression cannot parse in one pass. Sub-expressions are executed after the main expression, on the text chunks between matches of the main expression. Can be used to parse hanging rows (wrapped multiline cells).
columns:
  name – [optional] column name;
  x – [optional] X coordinate of the left column edge in PDF Points;
  type – [optional] “string”, “integer”, “date”, or “decimal” (see type descriptions in fields section).
  dateFormat – [optional] see dateFormat description in fields section.
  outputDateFormat – [optional] see outputDateFormat description in fields section.
  coalesceWith – [optional] column name to merge the parsed value with.
Example:
columns:
- name: exam
  x: 0
  type: string
- name: examDate
  x: 100
  type: date
  dateFormat: auto-MDY
rowMergingRule:
  none – default, no rule.
  byBorders – combine lines within a table cell framed by border lines.
  hangingRows – join a table row that contains only a single cell to the previous row if there is no separating line between them. Useful for tables without borders between columns.
Example:
rowMergingRule: byBorders
multipage: true
Example:
Description | Interval | Quantity | Amount ($) |
---|---|---|---|
Basic Plan | Jan 1 – Jan 31 | 1 | 25.00 |
Basic Plan | Feb 1 – Feb 28 | 1 | 25.00 |
Total in USD: | | | 50.00 |
The table above can be parsed with regular expressions or with explicitly defined column coordinates.
Regex approach:
tables:
- name: table1
  start:
    expression: Amount \(\$\)
  end:
    expression: Total in USD
  row:
    expression: ^\s*(?<description>\w+.*)(?<interval>[a-zA-Z]{3} \d+ [-–] [a-zA-Z]{3} \d+)\s+(?<quantity>\d+)\s+(?<amount>\d+\.\d\d)
  columns:
  - name: description
    type: string
  - name: interval
    type: string
  - name: quantity
    type: integer
  - name: amount
    type: decimal
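To see how a row expression of this kind picks the rows out of the table text, here is a Python sketch that applies one to the sample table. It rewrites the named groups in Python's (?P<name>...) syntax, and accepting either a hyphen or an en dash inside the interval is an assumption about the extracted text. The column spacing in the sample lines is invented, and the start and end expressions are left out for brevity.

```python
import re

# The row expression with Python's (?P<name>...) syntax for named groups.
# Accepting both "-" and "–" between the dates is an assumption about how
# the interval appears in the extracted text.
ROW = re.compile(
    r"^\s*(?P<description>\w+.*)"
    r"(?P<interval>[a-zA-Z]{3} \d+ [-–] [a-zA-Z]{3} \d+)\s+"
    r"(?P<quantity>\d+)\s+(?P<amount>\d+\.\d\d)"
)

lines = [
    "Description   Interval          Quantity   Amount ($)",
    "Basic Plan    Jan 1 – Jan 31    1          25.00",
    "Basic Plan    Feb 1 – Feb 28    1          25.00",
    "Total in USD:                              50.00",
]

rows = []
for line in lines:
    m = ROW.match(line)
    if m:
        row = m.groupdict()
        rows.append({
            "description": row["description"].strip(),
            "interval": row["interval"],
            "quantity": int(row["quantity"]),
            "amount": float(row["amount"]),
        })
print(rows)
```

The header and total lines do not match the row expression, so only the two data rows are collected.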
If the regex approach is impossible for some complicated table, you can specify column coordinates explicitly. To visually determine the coordinates of a column, you can use the included Template Editor application: it shows the cursor coordinates in the toolbar.
Explicit column coordinates approach:
tables:
- name: table1
  start:
    expression: Description\s+Interval
  end:
    expression: Total in USD
  columns:
  - name: description
    x: 0
    type: string
  - name: interval
    x: 100
    type: string
  - name: quantity
    x: 150
    type: integer
  - name: amount
    x: 200
    type: decimal
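The coordinate approach can be pictured as sorting each piece of text into the column whose left edge lies at or before it. The Python sketch below illustrates this with the x values from the template above; the text positions are invented and the real parser's handling of edge cases may differ.

```python
from bisect import bisect_right

# Left edges of the columns, as in the template above (PDF points).
COLUMNS = [("description", 0), ("interval", 100), ("quantity", 150), ("amount", 200)]
EDGES = [x for _, x in COLUMNS]

def assign_columns(chunks):
    # chunks: (x, text) pairs for one row; the x positions are invented here.
    row = {name: "" for name, _ in COLUMNS}
    for x, text in sorted(chunks):
        # The chunk belongs to the last column whose left edge is <= x.
        index = max(bisect_right(EDGES, x) - 1, 0)
        name = COLUMNS[index][0]
        row[name] = (row[name] + " " + text).strip()
    return row

row = assign_columns([(12, "Basic Plan"), (104, "Jan 1 – Jan 31"), (160, "1"), (215, "25.00")])
print(row)
```

Because only left edges are given, each column implicitly extends to the left edge of the next one.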
Template options.
ocrLanguage:
  eng – English (default)
  deu – German
  fra – French
  spa – Spanish
  nld – Dutch
Example:
ocrLanguage: nld
ocrMode:
  auto – Optical character recognition (OCR) will be used only if a PDF document page contains no text, only raster images.
  forced – Force OCR to extract text from both images and fonts. Useful for PDF documents with mixed content (when a portion of the document text is drawn as an image).
  repairFonts – Some PDF documents use embedded fonts with a customized charset, making text extraction impossible. This mode renders the entire document and extracts the text using OCR.
Example:
ocrMode: forced
If the ocrMode option is not specified, the mode is defined by the DocumentParser.OCRMode property. See the documentation of the Document Parser SDK.
The expression parameter can contain macros or a regular expression (https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference).
Do not use both macros and regular expressions in the same expression.
Macros:
{{ABC}} – Detects continuous sequences of letters and the _ character.
{{ABC+}} – Detects continuous sequences of letters and _-+=/ characters.
{{ABC123}} – Detects continuous sequences of letters, digits, and the _ character.
{{ABC123+}} – Detects continuous sequences of letters, digits, and _-+=/ characters.
{{123}} – Detects continuous sequences of digits.
{{123+}} – Detects continuous sequences of digits and _-+=/ characters.
{{DATE}} – Detects short date patterns like the following: 12/31/2019, 31.12.19, 2019-12-31.
{{DATE+}} – Detects long date patterns like the following: Sep 23, 2019, 22 décembre 2010.
{{DECIMAL}} – Detects decimal numbers like the following: 12.34, -123,456.78, 123.456. The decimal separator and group separator are automatically taken from the template culture.
{{MONEY}} – Detects decimal numbers with a currency symbol like the following: USD 12.34, $123,456.78, 123.45 €. The decimal separator and group separator are automatically taken from the template culture.
{{ANY}} – Sequence of any characters, including spaces and new lines.
Remarks:
If the macro successfully detects an appropriate character sequence, it is passed to the parsing result for this field. If you use several macros in one expression, only the last one is passed to the result.
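One way to think about the character macros is as shorthand for regex character classes. The Python sketch below expands a few of them under that reading. The classes are approximations written for this example (ASCII letters only, which is an assumption), and the date, decimal, money and any-character macros are left out. The sample input strings are invented.

```python
import re

# Rough regex stand-ins for a few macros. ASCII letters only; how the real
# parser treats non-ASCII letters is not stated, so this is an assumption.
MACROS = {
    "ABC": r"[A-Za-z_]+",
    "ABC+": r"[A-Za-z_\-+=/]+",
    "ABC123": r"[A-Za-z0-9_]+",
    "ABC123+": r"[A-Za-z0-9_\-+=/]+",
    "123": r"[0-9]+",
    "123+": r"[0-9_\-+=/]+",
}

def macro_to_regex(expression):
    # Literal text is escaped; each known macro becomes a capturing group.
    parts = re.split(r"\{\{(.+?)\}\}", expression)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            out.append(re.escape(part))
        else:
            out.append("(" + MACROS[part] + ")")
    return "".join(out)

def extract(expression, text):
    m = re.search(macro_to_regex(expression), text)
    # With several macros, only the last one is passed to the result.
    return m.group(m.re.groups) if m else None

number = extract("Invoice No.: {{ABC123+}}", "Invoice No.: INV-2019/07 due")
day = extract("Invoice Date: {{123+}}", "Invoice Date: 07/15/2019")
last_only = extract("{{ABC}} {{123}}", "Order 42")
print(number, day, last_only)
```

Since the literal part of a macro expression is matched as plain text in this sketch, the dot in Invoice No. needs no escaping, unlike in a regular expression.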