Image Data Extraction with Python - ByteScout
  • Home
  • /
  • Blog
  • /
  • Image Data Extraction with Python

Image Data Extraction with Python

Humans understand the image and its content by merely looking at it. Machines do not work the same way. It needs something more tangible, organized to understand, and give output. Optical Character Recognition (OCR) is the process, which helps the computer to understand the images. It enables the computer to recognize car plates using a traffic camera. Ocr kicks in to convert handwritten documents into a digital copy. The primary objective is to makes it a lot easier and faster for some people to do their jobs.

  1. Optical Character Recognition
  2. OCR in Python
  3. Basic Libraries
  4. Reading Image Data in Python
  5. Reading Grayscale Image in Python
  6. PyTesseract Library and Tool
  7. Steps of Tesseract OCR Process
  8. Setup PyTesseract
  9. OCR on Hand Writing Using PyTesseract
  10. Detection of Language with Tesseract
  11. Limitations of Tesseract

Image Data Extraction with Python

Optical Character Recognition

The OCR technology allows extracting text from an image or a scanned document. It can reorganize and extract any handwritten or printed character from the image. Optical Character Recognition is the process that detects the text content on images and translates the images into encoded text that the computer can easily understand. It scans the image, text, and graphic elements and converts them into a bitmap, a black and white dots matrix.

The image gets pre-processed afterward, where the brightness and contrast are adjusted to improve the process’s precision. OCR is not 100% precise, as it needs user/programmer involvement to correct a few elements missed in the scanning process. Natural Language Processing (NLP) is used to achieve error correction.

Several tools support using OCR but most of the commercial OCR engines are paid or allow using limited functionality on the free trial version. The users can also utilize OCR technology to extract text and tables from the PDFs and extract text from many non-editable formats.

OCR tools are also beneficial for various fields. Following are some of such fields where OCR can help speed up the process:

  • Pattern Recognition
  • Cognitive Computing
  • Automated Data Entry
  • Text Mining
  • Document Indexing

OCR in Python

Python is an efficient and easy-to-use programming language well known for image and text processing abilities. This language provides a huge number of libraries that can automatically complete most of the user tasks.

In python, Optical Character Recognition is achievable by using two different methods.

  1. Python built-in Libraries (scikit, matplotlib)
  2. PyTesseract Library and Tool

Basic Libraries

It is essential to comprehend how one reads and stores images on our machines before proceeding further—every Image forms by merging small but square boxes, known as pixels.

Image Extraction

The computer saves images in the form of a matrix of different numbers. The dimension of the matrix depends on the number of pixels according to the picture. Let’s suppose the previous image’s dimensions are 280 x 200 or (h x w). The dimensions are the number of pixels an image consists of height and width. (Height x width)

Pixel values represent the brightness and intensity of the picture, ranges between 0 – 255. 0 illustrates the black color, and 255 denotes white, respectively.

Reading Image Data in Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from skimage.io import imread, imshow

image = imread('MUFC.jpg')
Image.shape, image
imshow(image)

Image.shape () function extracts the matrix values of the image.

Imread () function reads the image from the path.

Reading Grayscale Image in Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from skimage.io import imread, imshow

image = imread('MUFC.jpg' as_gray = true)
Grey_image = image
imshow(grey_image)

Imread (destination, parameter) = as_gray is a sub-function that allows it to convert the picture in Black and White mode if the value is actual.

PyTesseract Library and Tool

Tesseract library in python is an optical character recognition (OCR) tool. It helps recognize and read the text embedded in images. Tesseract works as a stand-alone script, as it supports all image types sustained by the Pillow and Leptonica libraries, including all formats as jpeg, png, gif, BMP, tiff, and others. PyTesseract works by converting the test and graphic elements of the scanned image into a bitmap. A bitmap contains white and black dots. The image goes through the pre-processing phase adjusting the brightness and contrast before data extraction and conversion. Suppose used as a script, PyTesseract prints the documented text instead of writing it to a file.

PyTesseract can be an OCR in Python or a library and a wrapper for Goggle’s Tesseract OCR Engine. It can wrap the Python code around Tesseract OCR and provide the capability to work with various software structures. The users can utilize various Python OCR libraries to complete different tasks. Following are some of the libraries that the developers can use with Tesseract:

  • PYOCR
  • Testract
  • OpenCV
  • Leptonica
  • Pillow

Steps of Tesseract OCR Process

The Tesseract OCR process typically consists of some basic steps. Following is the breakdown of the process in steps for better user understanding:

  • API Request: The very first step to start the process is to use an API request because Tesseract OCR is only accessible by using an API integration. The users have to establish a connection between their solution and the Tesseract. Then they can send the API request from their solution to the Tesseract OCR engine.
  • Input Image: The users have to send an input image which they are using for text extraction in the API request.
  • Image Pre-Processing: This step ensures the quality of the input image. The quality of an input image must be as high as possible to achieve accurate results. For this purpose, the users can use OpenCV with Tesseract to increase the input image quality.
  • Data Extraction: The next step for the Tesseract OCR engine is to process the image and extract the data.
  • Text Conversion: The users can convert the extracted data from the input image into any format they require for the project. Tesseract supports PDF, text, TSV, HTML, and XML formats.
  • API Response Back: Once the process has ready the output, the users will get the API response with finalized output.

This Tesseract OCR process provides basic steps to provide an input image and get the output text. However, the users can use their customized solutions using different approaches and tools necessary for their use cases.

Setup PyTesseract

Python libraries are always the easiest to set up. It is usually the one-step if the user is aware of PIP. To use PyTesseract, the user needs two things:

  • Install the Python Library.
  • Install PyTesseract.
Pip install PyTesseract

Create the directory and initiate the project.

$ mkdir ocr_server && cd ocr_server && pipenv install --two

OCR Script

The user creates a primary function, which takes input from the user as an image and returns it in the text form.

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    text = pytesseract.image_to_string(Image.open(filename))
    return text
print(ocr_core('images/ocr_example_1.png'))

The function is quite simple, as in the initial five lines, the user is taking an image as an input from the Pillow library and PyTesseract library.

Attached picture in the code:

Image Data Extraction

The user then creates an ocr_core function. It inputs a file name and returns the text contained in the image.

The result of this code is:

Extract Image Data

OCR script works 100% on the digital text because it was elementary since this is digital text, picture-perfect and precise, unlike handwriting.

OCR on Hand Writing Using PyTesseract

A handwritten note is input this time on the OCR PyTesseract. Let’s see the results.

Note:

Image Info Extraction

Output:

The output of this note is as below:

Ad oviling writl

As it is evident that OCR may not entirely extract text from handwriting as it did with other images shown in the examples mentioned above.

The Tesseract engine may extract information about the orientation of the text in an image and variation. The orientation is a figure of the engine’s precision about the orientation identified to act as a guide. The script section represents the confidence marker also follows the writing system used in the text.

Detection of Language with Tesseract

Tesseract allows the user to detect the language. It is a built-in function in the form of a flag.

pytesseract.image_to_string(Image.open(filename), lang='ita')

 Please refer to the below-mentioned example without the language flag:

Extract Images

Output:

Ques allarm e local solo in so seg one cend 911

Without the flag, it is reading the text in the English language and trying to decrypt it. Hence, it has shown the word’s output, which the compiler could extract in the English language.

Now, refer to the below-mentioned example without the language flag:

Extract Images

Output:

Questo allarme è locale solo in caso di segnalazione di incendio 911

Without the language flag (), the OCR script missed some Italian words. As after leading the flag, it was able to detect all the Italian content. The translation is not yet possible, but this is still notable. Tesseract’s official documentation includes the supported languages.

Limitations of Tesseract

Tesseract is useful for many fields and tasks, as mentioned above. However, there are some limitations to the technology as well. Following are some of the limitations of the open-source solution:

  1. Tesseract is not proficient in identifying handwritten text.
  2. This technology is less accurate than some advanced embedded solutions with Artificial Intelligence.
  3. It does not support all of the file formats itself.
  4. The developers might require more resources and time to develop their solution using Tesseract.
  5. A particular threshold of Dots per inch (DPI) is limited for an image to be considered workable.
  6. There is a need for integration with AI solutions to automate certain document processes inducing verification and validations and better yield.
  7. Tesseract does not provide a Graphical User Interface (GUI). It means that the users can not connect it with an existing GUI or build a new one for it.
  8. If a document contains languages, Tesseract does not support will results in low output.
  9. It adopts a more precise image as input. A low-quality scan may provide low Output in OCR.
  10. It is not good at evaluating the regular reading order of documents. For example, a user may fail to identify that a document consists of two different columns and might try to join the word across both of those columns.
  11. Tesseract does not represent the font family’s information.

Through Tesseract and the Python-Tesseract library, users have been able to photograph images and receive output in the form of text. It is Optical Character Recognition, and it can be of boundless use in many situations.

In the examples mentioned above, the user has built a scanner. It inputs an image, returns the text in the pictorial, and integrates it into an interface. It enables a user to render the functionality in a more familiar medium and in a way that can serve various individuals simultaneously.

   

About the Author

ByteScout Team ByteScout Team of Writers ByteScout has a team of professional writers proficient in different technical topics. We select the best writers to cover interesting and trending topics for our readers. We love developers and we hope our articles help you learn about programming and programmers.  
prev
next