
3 Deep Learning Approach To Text Recognition

3.4 CNN Architecture

3.4.1 Convolutional Layers

The convolutional layer is considered the main building block of the CNN model. It is located at the start of the model and is typically followed by pooling layers in alternation over several iterations. The mathematical representation of the image dimensions and the filter (kernel) is explained below.

$\dim(\text{image}) = (n_H, n_W, n_C)$  (3.3)

Where:

n_H: the height of the image
n_W: the width of the image
n_C: the number of channels


In the case of an RGB image, n_C = 3: red, green, and blue. The filter K should be square, and its dimension f is chosen (usually an odd number) so that every pixel element can lie at the center of the kernel. When the filter is applied in the convolution, the kernel must have the same number of channels as the image, and it is also possible to apply a different filter to each channel of the image. The dimension of the filter is represented as follows.

$\dim(\text{filter}) = (f, f, n_C)$  (3.4)

Mathematically, for a given image and filter, we have:

$\text{conv}(I, K)_{x,y} = \sum_{i=1}^{f} \sum_{j=1}^{f} \sum_{k=1}^{n_C} K_{i,j,k} \, I_{x+i-1,\, y+j-1,\, k}$  (3.5)

Using the same notation as before for the height, the width, and the number of channels, the dimension of the convolution obtained by applying a kernel to the image is:

$\dim(\text{conv}(I, K)) = (n_H - f + 1,\; n_W - f + 1,\; 1)$  (3.6)

In the convolution layer, mathematical operations are performed to calculate the feature map of the image with the help of a kernel or filter. Suppose the input image is a 5 by 5 (5*5) matrix of pixels; it can be divided into 3 by 3 (3*3) patches so that a 3 by 3 kernel can be applied for feature extraction. Each 3 by 3 patch of the 5 by 5 image is multiplied element-wise by the 3 by 3 kernel matrix to obtain one value of the output feature map; in this example the pixel values of both matrices are either zero or one (0, 1). The visual and mathematical representation of the matrix image, with its dimensions, is shown step by step below.

Table 3.1: Input values of the image and kernel values


The left-hand table shows the input values (pixels) of the image for the convolution, and the right-hand table shows the convolution filter, also known as a mask or kernel, which is applied to every patch of the image.

Table 3.2: First patch of the image with the kernel and feature map

In Table 3.2, the left-hand box shows the element-wise multiplication of the first patch of the input image with the kernel, which yields the first value of the feature map shown in the second table.

Table 3.3: Second patch of the image with the kernel, and feature map

Table 3.3 shows the second iteration of the multiplication process, which produces the second value of the feature map. The process continues in this way until the last iteration, by which point all values of the feature map have been computed.


Table 3.4: Final patch of the image with the kernel, and feature map

This is the final step of the multiplication process, shown in the table, which produces the last value of the feature map.

Table 3.5: Input values of the image and final output value of feature map

Table 3.5 shows the result of the convolution, whose aim is to obtain the feature map of the input image. The features may describe an object, a piece of text, or any other kind of content such as a person, a dog, a cat, or a car. In this experiment, the convolution operation was performed on every patch of the input image by sliding the kernel over it. At every position, element-wise matrix multiplication was performed and the results were summed to give one value of the feature map. The yellow boxes show the input values of the image, the green box shows the mask or filter values, and the blue box shows the final feature-map values obtained through the convolution operation.
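A minimal sketch of this sliding-window operation is given below, assuming a hypothetical 5×5 binary image and 3×3 binary kernel rather than the exact values shown in the tables. Each feature-map entry is the element-wise product of one patch and the kernel, summed, and the output size follows Equation (3.6).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and compute
    each feature-map value as an element-wise multiply followed by a sum."""
    f = kernel.shape[0]                      # kernel is f x f
    out_h = image.shape[0] - f + 1           # n_H - f + 1, as in Eq. (3.6)
    out_w = image.shape[1] - f + 1           # n_W - f + 1
    feature_map = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + f, x:x + f]  # current f x f patch
            feature_map[y, x] = np.sum(patch * kernel)
    return feature_map

# Hypothetical 5x5 binary image and 3x3 binary kernel (not the thesis values).
image = np.array([[1, 0, 1, 1, 0],
                  [0, 1, 1, 0, 1],
                  [1, 1, 0, 1, 0],
                  [0, 1, 1, 1, 1],
                  [1, 0, 1, 0, 1]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

print(conv2d_valid(image, kernel))           # 3x3 feature map
```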

3.4.2 Pooling layers

The pooling layer is the second most important layer of the CNN model and comes after the convolutional layers. It is also known as the down-sampling layer because it reduces the size of the feature map received as input from the convolutional layer, which helps to address overfitting, reduces computational cost, and can improve accuracy. This layer can discard roughly 75% of the data without losing the essential information; in other words, it removes unnecessary information from the data to refine the result.

The main reductions performed in this layer are the size of the feature maps and the number of neural connections, which speeds up processing. No padding (zero padding) is needed to perform the stride over the feature map (Akhtar & Ragavendran, 2020).

There are three types of pooling: max, average, and sum pooling. The most important is max pooling, which takes the maximum value from the feature-map window in the selected region of the stride; average pooling calculates the average value of the selected window; and sum pooling adds up all the values of the selected window. Two common terms, the stride step and the window size, are used in these calculations. The stride step represents the movement of the selected region and is normally one (1), while the window is the selected region of the feature map over which the pooling is computed (Dertat, 2017).

Here is the mathematical calculation of an experimental example of the pooling layers (max, average, and sum), using a 2 by 2 window with a stride of 2; since the window size and the stride are both 2, the regions do not overlap.

Table 3.6: First iteration of pooling over a window of the feature map


In Table 3.6, the green box in the feature map marks the selected window region for pooling; in the next three yellow boxes, the green cell shows the result of the respective pooling method.

Table 3.7: Last iteration of pooling over a window of the feature map

Table 3.7 shows the results for the last selected region of the feature map; in the same way, the stride moves two steps at a time across the whole feature map to calculate the pooling value for every selected region.

Table 3.8: Results of pooling’s over a window of the feature map for every move

Table 3.8 presents the complete results of the experiment for each stride step of the window and for every pooling type.
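The same calculation can be sketched in code as follows, assuming a hypothetical 4×4 feature map rather than the exact values in Tables 3.6–3.8. A 2×2 window is moved with a stride of 2, so the regions do not overlap, and each region is reduced by max, average, or sum.

```python
import numpy as np

def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Apply max, average, or sum pooling with a square window."""
    out_h = (feature_map.shape[0] - window) // stride + 1
    out_w = (feature_map.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            region = feature_map[y * stride:y * stride + window,
                                 x * stride:x * stride + window]
            if mode == "max":
                out[y, x] = region.max()    # largest value in the window
            elif mode == "avg":
                out[y, x] = region.mean()   # average of the window values
            else:                           # "sum"
                out[y, x] = region.sum()    # total of the window values
    return out

# Hypothetical 4x4 feature map (not the values used in the tables above).
fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 8, 3],
               [4, 9, 5, 6]])

for mode in ("max", "avg", "sum"):
    print(mode)
    print(pool2d(fm, mode=mode))
```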

3.4.3 Fully connected layer

The fully connected layer is considered the last part of the CNN architecture and consists of a series of layers placed before the output. Internally it comprises an input layer, one or more hidden layers, and the output layer, as shown in Figure 3.5.


These layers consist of weights and biases that connect the layers, as every node in one layer is connected to the nodes of the next layer. The feature map from the pooling layer is flattened into an input vector and fed to the FC layer to classify each observed object. Within the series of FC layers, the activation function (ReLU) and the weight factor (W) are applied to the feature values at each node. The aim of the FC layer is to classify the type of object based on the resulting features (Gurucharan, 2020).

Figure 3.5: A fully connected layer in a deep network (Dertat, 2017)
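A minimal sketch of the flatten-and-classify behaviour described above is given below; the pooled feature map, the weight matrices, and the two output classes are hypothetical illustrations, not the configuration used in this thesis.

```python
import numpy as np

# Hypothetical 2x2 pooled feature map produced by the previous layers.
feature_map = np.array([[4.0, 6.0],
                        [8.0, 3.0]])

x = feature_map.flatten()                       # flatten into an input vector of length 4

# Hypothetical weights (W) and biases: one hidden layer of 5 nodes, 2 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # hidden layer parameters
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # output layer parameters

hidden = np.maximum(0.0, W1 @ x + b1)           # weighted sum + bias, then ReLU
scores = W2 @ hidden + b2                       # one score per class
predicted_class = int(np.argmax(scores))        # classification decision
print(scores, predicted_class)
```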

3.5 Activation function

The activation function is one of the most important components of the CNN model. In a neural network, the activation function helps the model learn complex patterns in the data and decides when a neuron should be activated. It takes input values from the previous layer and, by performing a mathematical operation, produces output values that are passed on to the next layer. The main responsibility of an activation function is to introduce non-linearity into the output of the model. There are three main kinds of activation functions, ReLU, tanh, and sigmoid, and each has its specific usage and importance in the field of artificial neural networks. Softmax is a generalization of the sigmoid function that is mainly used for classification, assigning an object to one of several classes. The Rectified Linear Unit (ReLU) is widely used in neural networks because it is fast, computationally inexpensive, and mathematically simple (Gurucharan, 2020).
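The functions mentioned above can be written out in a short sketch as follows (purely for illustration; the values of z are arbitrary).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # fast and cheap: max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                    # squashes values into (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))            # subtract the max for numerical stability
    return e / e.sum()                   # class probabilities that sum to 1

z = np.array([-2.0, 0.0, 3.0])           # example pre-activation values
for fn in (relu, sigmoid, tanh, softmax):
    print(fn.__name__, fn(z))
```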


3.6 Optical character recognition

OCR stands for optical character recognition, which is used for text recognition in multiple forms, such as handwritten text and printed digital text on various backgrounds. Humans can easily understand the content of an image or document simply by looking at it, while machines and computers cannot; it is for this reason that OCR exists. The aim of OCR tools is to recognize digital or handwritten text in an image or document, automate the computerized workflow, and encode the text into a computer-readable form. Such software is used to recognize and translate the text of various spoken languages into machine-readable form. The OCR process consists of several sub-processes that prepare the image so that the text can be extracted as accurately as possible. Firstly, the image is captured with a camera and saved in an image format such as JPEG or PNG, or as a PDF.

Secondly, the image or document is passed to a pre-processing stage, where the contrast and brightness of the image are controlled and adjusted. Thirdly, the localization process divides the image into different zones and focuses on the target area where the required text is located, which speeds up the extraction process. Fourthly, the target area containing the text is broken down into lines, words, and characters, and the software compares, recognizes, and identifies the text through various detection and recognition algorithms to produce the final output (Filip & Anuj, 2021).
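A rough sketch of these stages is shown below, using OpenCV for pre-processing and the pytesseract wrapper around Tesseract for recognition. The file name test.png and the Otsu thresholding step are assumptions made for this illustration, not the exact pipeline described by Filip and Anuj (2021).

```python
import cv2
import pytesseract

# 1) Acquisition: load a scanned image saved as JPEG/PNG.
image = cv2.imread("test.png")  # assumed example file name

# 2) Pre-processing: convert to grayscale and binarize to normalise
#    contrast and brightness (Otsu's threshold chosen for illustration).
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# 3)-4) Localization and recognition: Tesseract segments the page into
#       lines, words, and characters and returns the recognized text.
text = pytesseract.image_to_string(binary)
print(text)
```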

Figure 3.6: Architecture of Optical character recognition (Filip, & Anuj, 2021)


Figure 3.6 illustrates the OCR process: the input data, consisting of scanned documents, PDF documents, or plain images, are given to the OCR software, which processes them, extracts the text, and stores it in a database.

3.6.1 Uses of optical character recognition

Nowadays OCR is widely used in different areas and for various purposes to automate and digitalize systems, saving human effort and time. Previously, data had to be typed into the computer manually for digitalization and record keeping. With an OCR system, data are captured and stored in digital form simply by scanning the document: the text is extracted from the scanned document and converted into an editable text document, with no extra manual work needed. OCR is used in many sectors; some examples of its use for text recognition are listed below.

• Airports use OCR technology for passport identification.

• OCR is used for document processing, such as degree certificates, driving licences, and identity documents.

• The banking sector uses OCR to detect customer information and details on deposit slips, invoices, and other documents.

• Smart parking management systems use OCR to recognize vehicle number plates and assign parking spaces to different categories of vehicles, for example ambulances and VIPs (Joshi et al., 2015).

• OCR is used in shopping malls to recognize item prices through barcodes.

3.6.2 Types of OCR Software

Over the last decades, several OCR software packages have been used for reading, identifying, and recognizing text from different sources, especially from images. Most of them are implemented to recognize printed or handwritten text in scanned images or documents. Some are open source and free, such as Tesseract, Calamari, and Kraken, while others are paid services, such as the Google Vision API and ABBYY FineReader. Their results differ somewhat, but none achieves 100% accuracy, largely because of limits in image resolution. A list of useful OCR tools for printed and handwritten text mining follows.


• Tesseract OCR.

• OCRopus.

• Calamari.

• Kraken.

• Microsoft Azure Computer Vision.

• Google Cloud Vision.

• ABBYY FineReader.

• Amazon Textract.

• SwiftOCR.

• Attention OCR.

Of the OCR software listed above, four are especially common and popular: Tesseract OCR, Google Cloud Vision, ABBYY FineReader, and Amazon Textract. The comparison results and acceptance ratio of these four OCR tools are given in tabular form below. The information comes from an experiment based on applying the tools to various images containing both printed and handwritten text.

Table 3.9: Acceptance criteria of OCR tools (Fabian, 2020)

The main takeaway from the experiment is as follows: for machine-written text in well-scanned images, Tesseract OCR does a great job; for handwritten character recognition, Google Cloud Vision is the best choice; and when the document resolution is poor or the data is tabular, ABBYY FineReader is the best option of them all (Fabian, 2020).

3.6.2.1 Tesseract OCR

Tesseract OCR is a well-known and useful text recognition tool, largely because it is open source and free for any use. It was originally developed by HP between 1985 and 1995. Over time, many improvements and changes have increased its popularity. Currently it can recognize text in various languages, including French, English, Arabic, Dutch, and German, and it is managed and maintained by Google. This OCR engine works through the command line for image processing (Patel et al., 2012).

In this experiment, the tool was applied to a simple test image containing text; the image was given to the OCR through the command line, which converted the text into an output_file in editable text form.

Figure 3.7: Experimental results of a simple image using Tesseract OCR

Figure 3.7 shows the result of the experiment: the image (test.png) is given to the OCR through the command line, which produces an editable text file (output_file) containing the same text that was in the image (test.png).
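The experiment in Figure 3.7 can be reproduced approximately either from the command line or from Python through the pytesseract wrapper. The sketch below follows the file names shown in the figure (test.png and output_file); it is an illustration rather than the exact commands used.

```python
# Command-line equivalent (as in Figure 3.7):
#   tesseract test.png output_file
# which writes the recognized text to output_file.txt.

from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("test.png"))  # run Tesseract on the image
with open("output_file.txt", "w", encoding="utf-8") as f:   # save as an editable text file
    f.write(text)
```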

3.7 Related Work

Both Tesseract OCR and the Google Cloud Vision API have been used in numerous areas for text recognition from documents. They have performed well in text reading and extraction for different purposes, such as printed bills and invoices, automatic reading of multiple-choice answers, and feature extraction from identity documents. These models are capable of extracting and recognizing ordinary text, alphabets, and fonts of various languages.

The authors' work highlighted the importance of the Tesseract OCR model. They used the model to extract printed text from images of bills and invoices, targeting data such as the total amount, purchase date, and discount amount. Initially, OpenCV was used to remove the noise from the image of the bill or invoice. Afterwards, the image was passed to the Tesseract OCR engine for further processing, and the OCR engine extracted every single word from the scanned image to produce an editable text file. Lastly, the required text, such as the total bill and purchase date, was selected from the extracted text. The proposed method was reported to perform very well for multiple input images of bills and invoices (Sidhwa et al., 2018).

Nowadays the need for and usage of electronic documents have greatly increased in different workplaces, such as offices, schools and colleges, hospitals, supermarkets, and industries. These documents may be papers, letters, task descriptions, or invoices that are available in electronic format but need to be presented as editable text in a regional language. For this purpose a language translator is used to translate the content into the regional language, and the Tesseract OCR engine is used to recognize and extract the text from the documents. Using Tesseract OCR to recognize text in such documents in different languages is easier than with other OCR models, and its accuracy is high (Acharya et al., 2019).

Tesseract OCR is also popular and efficient for data retrieval from large data stores. Usually it is hard to obtain the text and information contained in an image. With this model it is possible to extract that information and then search images in a database by the keywords of a text message. After the text in the scanned image has been recognized and detected by Tesseract OCR, a text file containing the message is created. Different string-searching algorithms, such as Rabin–Karp, Knuth–Morris–Pratt, two-way string matching, and Boyer–Moore, can then be used to search for the required message in the text file. In this study, the authors used the Boyer–Moore algorithm for the search (Wankhede & Mohod, 2017).

Data extraction from identity documents such as passports, residence permit cards, military identification cards, student identification cards, and driver's licences is moving towards digitalization. Extracting information manually from images of such documents is difficult and tedious; it also leads to a high error rate and increases the evaluation time. This problem can be overcome with technology: automated information extraction can provide a better solution in terms of accuracy and efficiency than manual processing. To obtain good results, the image of the document is first passed through a pre-processing stage for noise removal, and then the information extraction process starts. Usually this process takes place in two steps: text detection and text recognition. Transfer learning was the method used in that study for text recognition from images of Italian identity documents. The results of the study report an accuracy of 88.44% and an error rate of 2.71% (Visalli et al., 2021).

Nowadays the number of visually impaired people is increasing day by day due to illness and accidents. Such people need help to perform their daily activities, for example by listening to written text read aloud. This problem can be addressed with recent technology: a smart reader is a device that can do this job efficiently, and its process consists of three steps. Firstly, a photograph is captured with the camera. Secondly, the text is extracted from the captured image by OCR. Thirdly, the text is converted into speech and a voice note is made to assist the impaired person. The