ByteScout Document Parser SDK - Template Creation Guide - ByteScout

ByteScout Document Parser SDK – Template Creation Guide


Templates can be written in YAML or JSON. A template defines one or more keywords used to select the right template for a document, plus expressions for the fields to be extracted.
A single template file can contain multiple templates. In a YAML file, separate templates with a --- line. In a JSON file, arrange templates as an array [].
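
As a rough sketch of the multi-template layout described above, a YAML file holding two templates could look like the following. The company names, keywords, and field expressions are made up for illustration; only the keys themselves come from this guide.

```yaml
# Hypothetical file with two templates; company names and keywords are illustrative.
---
templateVersion: 2
templatePriority: 1
sourceId: ACME Inc. Invoice
detectionRules:
  keywords:
    - ACME Inc\.
fields:
  invoiceNumber:
    expression: 'Invoice No.: {{ABC123+}}'
---
templateVersion: 2
templatePriority: 2
sourceId: Globex Corp. Invoice
detectionRules:
  keywords:
    - Globex Corp\.
fields:
  invoiceNumber:
    expression: 'Invoice Number: {{123}}'
```

Each --- line starts a new template, so the parser can pick whichever one matches the keywords of the incoming document.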

Sample YAML template showing the main features:

---
templateVersion: 2
templatePriority: 1
sourceId: ACME Inc. Invoice 
culture: en-US

detectionRules:
  keywords:
    - ACME Inc\.
    - Invoice No
    - ABN 01 234 567 890

fields:
  companyName:
    expression: ACME Inc.
    static: true
  invoiceNumber:
    expression: 'Invoice No.: {{ABC123+}}'
    pageIndex: 0
  invoiceDate:
    expression: 'Invoice Date: {{123+}}'
    type: date
    dateFormat: MM/dd/yyyy
  billTo:
    rect:
      - 32
      - 150
      - 348
      - 70
    expression: '(?s)Bill to:(?<value>.*)'
    pageIndex: 0
  total:
    expression: TOTAL\s+(\d+\.\d+)
    type: decimal

tables:
  - name: table1
    start:
      expression: Item\s+Quantity\s+Price\s+Total
    end:
      expression: TOTAL
    row:
      expression: ^\s*(?<description>\w+.*?)\s+(?<quantity>\d+)\s+(?<unitPrice>\d+\.\d{2})\s+(?<itemTotal>\d+\.\d{2})\s*$
    columns:
      - name: description
        type: string
      - name: quantity
        type: integer
      - name: unitPrice
        type: decimal
      - name: itemTotal
        type: decimal
    multipage: true

templatePriority

Templates are sorted and tried in order of templatePriority, then alphabetically. 0 is the highest priority and 999999 is the lowest.

sourceId

A name that identifies the design of the document. It is passed to the result unchanged.

culture

Template culture that affects the detection of dates and decimal numbers.
For example, with the en-US culture the parser expects dates in month-day-year order and decimal numbers with . as the decimal symbol and , as the digit grouping symbol.
With the fr-FR culture, the parser expects dates in day-month-year order and decimal numbers with , as the decimal symbol and a space as the digit grouping symbol.
You can find the list of culture names at https://msdn.microsoft.com/en-us/library/cc233982.aspx.
Example:

culture: fr-FR

detectionRules

A few words or regular expressions (Regex) that uniquely identify the document design.
Note: you must escape the symbols +*.[]()\$ with \ because they are Regex special characters.
Example:

detectionRules:
  keywords:
    - ACME Inc\.
    - \[CONFIDENTIAL\]

documentStart

If your PDF file contains multiple documents to parse, the documentStart regular expression should match the beginning of each new document in the PDF file.
Example:

documentStart: TAX INVOICE
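
To show where the key sits, here is a minimal sketch. It assumes documentStart is a top-level key next to detectionRules and fields, as the example above suggests, and that the PDF bundles several invoices that each begin with the text TAX INVOICE. The sourceId and the field are illustrative.

```yaml
# Sketch only: assumes a PDF that bundles several invoices, each starting with "TAX INVOICE".
templateVersion: 2
sourceId: ACME Inc. Invoice Batch
detectionRules:
  keywords:
    - ACME Inc\.
documentStart: TAX INVOICE
fields:
  invoiceNumber:
    expression: 'Invoice No.: {{ABC123+}}'
```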

fields

Standalone fields to extract. For example, invoice number, invoice date, etc.

Field parameters:

  • expression – a macro expression (see Appendix 1) or a regular expression (Regex) that defines the data to be found and retrieved from the document.

Remarks:
If you use several macros in one expression, only the last one is passed to the result.
If the regex contains no capturing groups, the entire match goes to the result. With groups, the last group or the named group goes to the result.
Do not use both macros and regular expressions in the same expression.

Special case: the expression can also contain the name of a special function.
Currently available special functions:
$$funcFindCompany – searches the document for the company name from a predefined list of known companies.

Examples of expression parameter:

    # Macro (see Appendix 1)
    expression: 'Invoice No.: {{ABC123}}'

    # The entire match will go to the result
    expression: \w{6}-\d{5}
    
    # The last capturing group will go to the result
    expression: 'Account number:\s+(\d+)'
    
    # Only the match of <value> group will go to the result
    expression: 'Total\s+(?:USD|€|\$|£|¥)?\s*(?<value>(\d+,?)+\.\d\d)'
    
    # Special function
    expression: $$funcFindCompany
  • static – [optional] if this parameter is set to true, the parser will pass the contents of the expression parameter to the result unchanged.
    Example:

fields:
  companyName:
    expression: ACME Inc.
    static: true
  • rect – [optional] limits the text extraction to the specified area of the document. The rectangle is specified as top, left, width, height in PDF points (1 point = 1/72″).
    If used without the expression parameter, it will simply return the text extracted from the rectangle.
    If used with the expression parameter, the regex will only search in text extracted from the rectangle.
    Example:

fields:
  billTo:
    rect:
      - 10
      - 10
      - 200
      - 100
    expression: '(?s)Bill to:(?<value>.*)'
  • pageIndex – [optional] Zero-based page index to search the field in. Default is 0 (first page).
  • type – [optional] The expected datatype of the parsed value.
    Possible values:

    • string – used by default if the type is not specified; the matched Regex value will be passed to the result unchanged.
    • integer – the parser will try to convert the retrieved text to an integer number according to the template culture.
    • decimal – the parser will try to convert the retrieved text to a decimal number according to the template culture. See Note 1 below.
    • date – the retrieved text will be parsed as a date according to specified dateFormat or the template culture. See Note 2 below.
    • table – a special type used in conjunction with rect parameter. The data from the rect area will be extracted preserving the table structure.
  • dateFormat – [optional] The format string to parse the date. See Note 2 below.
  • outputDateFormat – [optional] Output date format. By default, a successfully parsed date is passed to the result in ISO 8601 format, e.g. 2018-01-04T00:00:00, but you can specify your own output format, e.g. yyyy-MM-dd.
  • rowMergingRule – [optional] defines a rule to merge multiline data in table cells. See rowMergingRule description in tables section.
  • coalesceWith – name of another field to coalesce with. If the specified field is not parsed, the current field replaces it. This is useful when you need two parsing criteria for varying data but want a single field in the result: if the first field fails, the second is used.
    Example: if field1 is not successfully parsed, field1a replaces field1 in the result:

fields:
  field1: 
    rect:
      - 10
      - 10
      - 100
      - 25
  field1a: 
    rect:
      - 10
      - 50
      - 100
      - 25
    coalesceWith: field1 
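
The remaining field parameters can be combined in the same way. The fragment below is a sketch only: the field names, label text, and rectangle coordinates are assumptions chosen for illustration, not values from a real document.

```yaml
# Illustrative fields; names, label text, and coordinates are assumptions.
fields:
  # Searched on the second page only (pageIndex is zero-based).
  paymentTerms:
    expression: 'Payment terms:\s+(?<value>.+)'
    pageIndex: 1
  # Parsed with an explicit input format and written to the result as yyyy-MM-dd.
  dueDate:
    expression: 'Due Date:\s+(\d{2}\.\d{2}\.\d{4})'
    type: date
    dateFormat: dd.MM.yyyy
    outputDateFormat: yyyy-MM-dd
  # Text of a rectangular area, extracted with its table structure preserved.
  lineItems:
    rect:
      - 200
      - 30
      - 550
      - 300
    type: table
    rowMergingRule: hangingRows
```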

Note 1: If you come across different number representations in the same document, you can override the template culture by appending a culture name in square brackets to the data type name. Only this field will be parsed according to the specified culture.
Example:

    type: decimal[fr-FR]
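
As a fuller sketch of Note 1, assume a document that is mostly formatted for en-US but prints one amount in the French style. The field names and label text below are hypothetical.

```yaml
# Hypothetical fragment: an en-US template with one amount printed French-style.
culture: en-US
fields:
  totalUsd:
    expression: 'Total USD:\s+(?<value>[\d,]+\.\d\d)'
    type: decimal
  totalEur:
    # The culture override applies to this field only.
    expression: 'Total EUR:\s+(?<value>\d+,\d\d)'
    type: decimal[fr-FR]
```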

Note 2: The dateFormat and outputDateFormat can contain a format string defining the exact date format. Find the format string description here: https://docs.microsoft.com/en-us/dotnet/api/system.datetime.tryparseexact.
Example:

    type: date
    dateFormat: MM-dd-yyyy

The dateFormat can also contain auto-format strings:
auto-MDY – the parser will try to detect the date format automatically, assuming the date is in month-day-year sequence.
auto-DMY – the parser will try to detect the date format automatically, assuming the date is in day-month-year sequence.
auto-YMD – the parser will try to detect the date format automatically, assuming the date is in year-month-day sequence.
auto – the parser will try to detect the format automatically, taking the date parts sequence from the template culture.
Example:

    type: date
    dateFormat: auto-DMY

tables

Tabular data you need to extract. It is defined by regular expressions to find the table start, the end, and rows.
The tables section can contain multiple table definitions arranged as an array.
Table parameters:

  • name – table name to distinguish tables in the result.
  • start – parameters that define the start of the table:
    expression – regular expression to find the start of the table, or:
    y – the coordinate that defines the top of the table,
    pageIndex – index of the page containing the table start and y coordinate.
  • end – parameters that define the end of the table.
    expression – regular expression to find the end of the table, or:
    y – the coordinate that defines the bottom of the table.
  • subItemStart – parameters that define the start of the table sub-item. Sub-items are useful for tables with complex multiline row data.
    expression – regular expression to find the start of the sub-item.
  • subItemEnd – parameters that define the end of the table sub-item.
    expression – regular expression to find the end of the sub-item.
  • row – [optional] parameters that define table rows:
    expression – main regular expression to find a row. Names of capturing groups should correspond to column names in columns array.
    subExpression1, subExpression2, subExpression3, subExpression4, subExpression5 – additional expressions to parse some remaining parts of row data which the main expression cannot parse in one pass. Sub-expressions are executed after the main expression for the text chunks between matches of the main expression. Can be used to parse hanging rows (wrapped multiline cells).
  • columns – [optional] array that defines column name, coordinate, data type and format.
    Column parameters:
    name – [optional] column name;
    x – [optional] X coordinate of the left column edge in PDF Points;
    type – [optional] “string”, “integer”, “date”, or “decimal” (see type descriptions in fields section).
    dateFormat – [optional] see dateFormat description in fields section.
    outputDateFormat – [optional] see outputDateFormat description in fields section.
    coalesceWith – [optional] column name to merge the parsed value with.
    Example:

  columns:
    - name: exam
      x: 0
      type: string
    - name: examDate
      x: 100
      type: date
      dateFormat: auto-MDY
  • rowMergingRule – [optional] For table type fields: defines a rule to merge multiline data in table cells.
    Valid values:
    none – default, no rule.
    byBorders – combine lines within a table cell framed by border lines.
    hangingRows – join a table row that contains only a single cell to the previous row if there is no separating line between them. Useful for tables without borders between columns.
    Example:

    rowMergingRule: byBorders
  • multipage – [optional] defines whether the table may continue on further pages.
    Example:

    multipage: true
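
When a table has no reliable start or end text, the y coordinates listed above can delimit it instead. A minimal sketch, in which the table name and all coordinate values are assumptions rather than measurements from a real document:

```yaml
# Illustrative values: the x and y coordinates are assumptions, not measured from a real document.
tables:
  - name: exams
    start:
      y: 180
      pageIndex: 0
    end:
      y: 620
    columns:
      - name: exam
        x: 0
        type: string
      - name: examDate
        x: 100
        type: date
        dateFormat: auto-MDY
```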

Example:

Description   Interval          Quantity   Amount ($)
Basic Plan    Jan 1 - Jan 31    1          25.00
Basic Plan    Feb 1 - Feb 28    1          25.00
Total in USD:                              50.00

The table above can be parsed with regular expressions or with explicitly defined column coordinates.
Regex approach:

tables:
  - name: table1
    start:
      expression: Amount \(\$\)
    end:
      expression: Total in USD
    row:
      expression: ^\s*(?<description>\w+.*)(?<interval>[a-zA-Z]{3} \d+ - [a-zA-Z]{3} \d+)\s+(?<quantity>\d+)\s+(?<amount>\d+\.\d\d)
    columns:
      - name: description
        type: string
      - name: interval
        type: string
      - name: quantity
        type: integer
      - name: amount
        type: decimal

If the regex approach is impossible for some complicated table, you can specify column coordinates explicitly. To visually determine the coordinates of a column, you can use the included Template Editor application: it shows the cursor coordinates in the toolbar.
Explicit column coordinates approach:

tables:
  - name: table1
    start:
      expression: Description\s+Interval
    end:
      expression: Total in USD
    columns:
      - name: description
        x: 0
        type: string
      - name: interval
        x: 100
        type: string
      - name: quantity
        x: 150
        type: integer
      - name: amount
        x: 200
        type: decimal

options

Template options.

  • ocrLanguage – The language for Optical Character Recognition (OCR). Document Parser SDK is shipped with 5 language files, but you can download more languages at https://github.com/bytescout/ocrdata.
    Valid values:
    eng – English (default)
    deu – German
    fra – French
    spa – Spanish
    nld – Dutch
    Example:

  ocrLanguage: nld
  • ocrMode – the OCR mode.
    Valid values:
    auto – OCR is used only if a PDF page contains no text, only raster images.
    forced – forces OCR to extract text from both images and fonts. Useful for PDF documents with mixed content (when a portion of the document text is drawn as an image).
    repairFonts – some PDF documents use embedded fonts with a customized charset, which makes text extraction impossible. This mode renders the entire document and extracts the text using OCR.

Example:

  ocrMode: forced

If the ocrMode option is not specified, the mode is defined by the DocumentParser.OCRMode property. See the Document Parser SDK documentation.
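
Put together, the two options might appear in a template as sketched below. This assumes options is a top-level key alongside fields and tables, which the indentation of the examples above suggests. The sourceId and keyword are illustrative.

```yaml
# Sketch only: assumes options is a top-level template key; sourceId and keyword are illustrative.
templateVersion: 2
sourceId: Scanned Dutch Invoice
detectionRules:
  keywords:
    - Factuur
options:
  ocrLanguage: nld
  ocrMode: forced
```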

APPENDIX 1

The expression parameter can contain macros or a regular expression (https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference).
Do not use both macros and regular expressions in the same expression.

Macros:

  • {{ABC}} – Detects continuous sequences of letters and _ character.
  • {{ABC+}} – Detects continuous sequences of letters and _-+=/ characters.
  • {{ABC123}} – Detects continuous sequences of letters, digits, and _ character.
  • {{ABC123+}} – Detects continuous sequences of letters, digits, and _-+=/ characters.
  • {{123}} – Detects continuous sequences of digits.
  • {{123+}} – Detects continuous sequences of digits and _-+=/ characters.
  • {{DATE}} – Detects short date patterns like the following: 12/31/2019, 31.12.19, 2019-12-31.
  • {{DATE+}} – Detects long date patterns like the following: Sep 23, 2019, 22 décembre 2010.
  • {{DECIMAL}} – Detects decimal numbers like the following: 12.34, -123,456.78, 123.456. The decimal separator and group separator are automatically taken from the template culture.
  • {{MONEY}} – Detects decimal numbers with currency symbol like the following: USD 12.34, $123,456.78, 123.45 €. The decimal separator and group separator are automatically taken from the template culture.
  • {{ANY}} – Sequence of any characters, including spaces and new lines.

Remarks:
If a macro successfully detects a matching character sequence, that sequence is passed to the parsing result for this field. If you use several macros in one expression, only the last one is passed to the result.
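
A few of the macros above, sketched as field definitions. The label text in each expression is an assumption about what the document contains, and the field names are hypothetical.

```yaml
# Illustrative fields; the label text in each expression is an assumption about the document.
culture: en-US
fields:
  orderNumber:
    # Two macros in one expression: only the last one ({{123}}) reaches the result.
    expression: 'Order {{ABC}} {{123}}'
  invoiceDate:
    expression: 'Invoice Date: {{DATE}}'
    type: date
    dateFormat: auto
  subtotal:
    # Separators for {{DECIMAL}} are taken from the template culture (en-US here).
    expression: 'Subtotal: {{DECIMAL}}'
    type: decimal
  amountDue:
    expression: 'Amount due: {{MONEY}}'
```

None of these expressions contains a regular expression, in line with the rule against mixing macros and Regex in one expression.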

 
