3 Deep Learning Approach To Text Recognition

3.2 Artificial Neural Network

3.2.2 Types of Artificial Neural Networks

Many types of neural networks can be used to solve real-life problems; some of them are explained below.

3.2.2.1 Feed-forward neural networks

A feed-forward neural network is a kind of neural network in which signals travel only in the forward direction, from the input to the output. There are no loops in the backward direction, which means that data can only flow from input to output.

Feedforward neural networks have fixed-size input and output data. They are widely used in applications such as pattern generation and pattern recognition, document segmentation, prediction, function approximation, and classification.

The mathematical representation of the feed-forward neural network model is explained below.

An artificial neuron takes a vector of input values x1, x2, ..., xn, and each input value is multiplied by its corresponding weight factor w1, w2, ..., wn. The weighted input values are summed, and a bias value (b) is added to the weighted sum to produce the net input z:

$$z = \sum_{i=1}^{n} x_i w_i + b \tag{3.1}$$

The net input is then passed through the activation function (g) to produce the final output a = g(z), which can be transmitted to the next neurons:

$$a = g(z) = g\left(\sum_{i=1}^{n} x_i w_i + b\right) \tag{3.2}$$

The activation function can be chosen according to the requirements of the task, whereas the weight factors and the bias value (b) are determined by the learning rule during the training phase of the neural network model.
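As a minimal sketch of Equations 3.1 and 3.2, the following Python snippet computes the output of a single neuron. The function names, the choice of the sigmoid as g, and the numeric values are assumptions made for illustration and are not taken from the original work.

```python
import math


def neuron_output(x, w, b, g):
    """Return a = g(z), where z = sum_i x_i * w_i + b (Equations 3.1 and 3.2)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(xi * wi for xi, wi in zip(x, w)) + b  # weighted sum plus bias
    return g(z)


def sigmoid(z):
    """Logistic sigmoid, used here only as one possible choice of g."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative values (assumptions, not taken from the text).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.5]
b = 0.0

# z = 0.5 - 2.0 + 1.5 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
a = neuron_output(x, w, b, sigmoid)
print(a)
```

In a trained network the values of w and b would come from the learning rule rather than being set by hand as they are here.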

The feedforward neural network is a classical neural network method that has long been applied to problems arising in various disciplines. The network consists of multiple layers, and each layer is fully connected to the next, so that the input data is processed directly into the output result. Because it has multiple layers, this network is also known as the multi-layer perceptron (Razavi & Tolson, 2011).

3.2.2.2 Recurrent Neural Networks

An RNN is a kind of artificial neural network that is widely used in text detection and text-to-speech conversion, and in natural language processing more generally, to find patterns in sequences of data. In this network, the output of a layer is saved and sent back to the input layer together with the input data, which helps to predict the outcomes of the following steps. Every node remembers some information from the previous step. In short, each node acts as a memory cell that stores information while data is passed from one layer to the next. If a prediction is incorrect, the system learns from the stored information and corrects the prediction through backpropagation (Schmidt, 2019).

Figure 3.2: Recurrent neural network architecture: (a) an example of a fully connected RNN; (b) an example of a simple RNN (Medsker et al., 2001)

Figure 3.2 (a) shows the architecture of a fully connected network that has no separate input nodes; each node receives input from the other nodes. Figure 3.2 (b) shows a simple RNN architecture that is used to learn a single character of a string through a feed-forward structure.
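The feedback idea described above can be sketched with a minimal Elman-style recurrence, h_t = tanh(W_x x_t + W_h h_(t-1) + b). This particular update rule, the function names, and the weights are assumptions chosen for illustration; the text does not specify an exact formulation.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for j in range(len(b)):
        z = b[j]
        z += sum(W_x[j][i] * x_t[i] for i in range(len(x_t)))        # input contribution
        z += sum(W_h[j][k] * h_prev[k] for k in range(len(h_prev)))  # memory contribution
        h_t.append(math.tanh(z))
    return h_t


def run_rnn(xs, W_x, W_h, b):
    """Feed a sequence through the cell, starting from a zero hidden state."""
    h = [0.0] * len(b)
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states


# Hypothetical one-dimensional example: the second input is zero, yet the
# second hidden state is non-zero because the first state is fed back.
states = run_rnn([[1.0], [0.0]], W_x=[[1.0]], W_h=[[1.0]], b=[0.0])
print(states)
```

The non-zero second state is the "memory" effect: the hidden state carries information from the earlier input into the later step.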

3.3 Convolutional neural networks

The convolutional neural network (CNN) is one of the most powerful and widely known deep learning networks and has been used in many computer vision applications such as image processing, pattern recognition, and object classification. It can also be used for speech recognition. A CNN extracts feature maps from the input image by applying a set of operations repeatedly, and it uses the result to classify the target object in the examined image. Large images can now be processed with a CNN: weight factors are applied to the pixels to extract features, and the features are passed through successive layers of the network for further refinement. This cycle is repeated several times across the hidden layers, such as the convolution layers and pooling layers. In these layers, the image is divided into patches, and a filter mask (with padding where needed) is applied to each patch to process and refine the output. The system can then assign the objects in the image to their classes. For example, if the input image contains a dog and a cat, a trained CNN can classify both objects and label them accordingly.

The name CNN comes from convolution, a mathematical operation on matrices. A CNN has multiple layers, which can be grouped by whether or not they have parameters. Convolutional layers and fully connected layers have trainable parameters, whereas pooling layers and non-linearity layers do not have any parameters at all. The CNN is considered one of the most important neural networks for image processing because such complex tasks are impractical to solve with traditional ANN methods. Its advantage over other neural networks is that it reduces the number of parameters compared with a traditional ANN, which is a main reason for its success on complex problems. A CNN also does not need to be given the location of the target region in the image. For example, in face detection there is no need to focus on the area of the image where the faces are located. Detection is based on features that are built up layer by layer: in the first layer edges are detected, in the second layer shapes are identified, and in later layers the face is detected (Albawi et al., 2017).

CNNs have been used in many image recognition applications to identify a target object in an image. They can be used to read numbers and characters as a string from an image, e.g., to capture a vehicle registration number from a number plate. They can be used on medical images to support diagnosis and to locate the affected part of the body. They can be used to detect the faces of animals or humans based on facial features, and to detect mechanical parts of an automobile in industry. In agriculture, they can be used to identify plant species from images of flowers, stems, and leaves (Gogul & Kumar, 2017). More generally, they can be used to detect any object in an image based on its features and to assign it to a class.

CNNs have been widely used to detect text on various surfaces, character by character. Natural scene images are one of the areas where CNNs are applied to automate text detection. Text recognition techniques aim to read the words depicted in an image and fall into two categories: character recognition and whole-word recognition. Normally, text in document images can be recognized with OCR techniques, which are well suited to identifying the words and characters in such documents.

In the case of scene images, however, OCR can fail because of characteristics such as font style, size, scene content, blurring, and other variations in the appearance of the scene image. Advanced CNN techniques are needed to solve this kind of problem (Jaderberg et al., 2017). Most OCR tools are designed to detect printed text in images, for which they produce highly accurate results (Saidane & Garcia, 2007).

Figure 3.3: CNN model used for text recognition (Jaderberg et al., 2017)

As shown in Figure 3.3, the input image contains only a single text word; it is divided into image patches and passed through several CNN layers so that the word is recognized accurately at the output stage.

3.4 CNN Architecture

All CNN-based models follow this architecture for classification and pattern recognition of objects in an image. The basic structure of the CNN model is shown in Figure 3.4; each part of the model is then explained individually in detail.

Figure 3.4: Architecture of Convolutional neural network (Dertat, 2017)

Figure 3.4 shows that the model takes an image as its input and passes it through a series of layers, including convolution and pooling layers, and finally through several fully connected layers, to recognize and identify the object and its features in the input image. Throughout this process, other components also play an important role in turning the input data into output data, such as the ReLU activation function, the bias terms, and SoftMax.

3.4.1 Convolutional Layers

The convolution layer is considered the main building block of the CNN model. It is located at the start of the model and is followed by pooling layers, and this sequence is repeated several times. The mathematical representation of the dimensions of the image and of the filter or kernel is given below.

$$\dim(\text{image}) = (n_H, n_W, n_C) \tag{3.3}$$

Where:

$n_H$: the size of the height
$n_W$: the size of the width
$n_C$: the number of channels

In the case of an RGB image, $n_C = 3$: we have the red, green, and blue channels. The filter K should be square, and its dimension, denoted by f, is chosen (typically as an odd number) so that each pixel can be placed at the center of the kernel. When the filter is applied in the convolution, the kernel must have the same number of channels as the image. It is possible to apply a different filter to each channel of the image. The dimension of the filter is represented as follows.

$$\dim(\text{filter}) = (f, f, n_C) \tag{3.4}$$

Mathematically, for a given image and filter, we have:

[Equation (3.5)]

Using the same notation as before for the height, the width, and the number of channels, the dimension of the convolution obtained by applying a kernel to the image can be stated more specifically:

[Equation (3.6)]

In the convolution layer, mathematical operations are performed to calculate the feature map of the image with the help of a kernel or filter. In this example, the input image is a 5 by 5 (5*5) matrix of pixels, which is divided into 3 by 3 (3*3) patches so that a 3 by 3 (3*3) kernel or filter can be applied for feature extraction. Each 3 by 3 patch of the 5 by 5 image is multiplied element by element with the 3 by 3 kernel matrix, and the products are summed to obtain one value of the output feature map; the pixel values of both matrices are either zero or one (0, 1). The visual and mathematical representation of the image matrix with its dimensions is shown step by step.

Table 3.1: Input values of the image and kernel values

The left-hand table shows the input values (pixels) of the image for the convolution, and the right-hand table shows the convolution filter, also known as the mask or kernel, which is applied to every patch of the image.

Table 3.2: First patch of the image with the kernel and feature map

In Table 3.2, the left-hand box shows the multiplication of the first patch of the input image with the kernel, which gives the first value of the feature map shown in the second table.

Table 3.3: Second patch of the image with the kernel, and feature map

Table 3.3 shows the second iteration of the multiplication process, which gives the second value of the feature map. The process continues in this way until the last iteration, at which point all values of the feature map have been found.

Table 3.4: Final patch of the image with the kernel, and feature map

Table 3.4 shows the final step of this multiplication process, which gives the last value of the feature map.

Table 3.5: Input values of the image and final output value of feature map

Table 3.5 shows the result of the convolution, the aim of which is to obtain the feature map of the input image. The features can describe an object, a piece of text, or anything else, such as a person, a dog, a cat, or a car. In this example, the convolution operation is performed on every patch of the input image by sliding the kernel over it. At every position of the input image, the patch and the kernel are multiplied element-wise and the products are summed to give one value of the feature map. The yellow boxes show the input values of the image, the green box shows the mask or filter values, and the blue box shows the final feature map values obtained through the convolution operation.
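The sliding-window procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: a single channel, no padding, and a stride of 1, with the output size computed as (n - f) / stride + 1 in each direction. The 5 by 5 image and 3 by 3 kernel values below are hypothetical 0/1 values chosen for the sketch; they are not the values of Table 3.1. As is common in CNN implementations, the kernel is not flipped, so strictly speaking the operation is a cross-correlation.

```python
def convolve2d_valid(image, kernel, stride=1):
    """Slide a square kernel over a 2-D image (no padding) and return the feature map."""
    n_h, n_w = len(image), len(image[0])
    f = len(kernel)
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            top, left = r * stride, c * stride
            total = 0
            for i in range(f):
                for j in range(f):
                    # element-wise product of the patch and the kernel, summed up
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 0/1 input image and kernel (assumed values for illustration).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d_valid(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

With a 5 by 5 input and a 3 by 3 kernel the feature map is 3 by 3, which corresponds to the nine patch positions walked through in the tables.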

3.4.2 Pooling layers

The pooling layer is the second most crucial layer of the CNN model and comes after the convolutional layers. It is also known as the down-sampling layer because it reduces the size of the feature map that it receives as input from the convolutional layer, which helps with problems such as overfitting, computational cost, and accuracy. This layer discards almost 75% of the data without losing the essential information. In other words, it removes unnecessary information from the data to refine the result.

The main things reduced in this layer are the size of the feature maps and the number of neural connections, which speeds up processing. No zero padding is needed to apply the stride to the feature map (Akhtar & Ragavendran, 2020).

There are three types of pooling: max, average, and sum pooling. The most important one is max pooling, which takes the maximum value in the selected window of the feature map; average pooling calculates the average value of the selected window, and sum pooling adds up all values of the selected window. Two parameters are used to calculate these values: the stride and the window size. The stride is the step by which the selected region moves and is normally set to one (1), while the window is the selected region of the feature map over which pooling is applied (Dertat, 2017).

The following example calculates max, average, and sum pooling using a 2 by 2 window and a stride of 2. Because the window size and the stride are both 2, the windows do not overlap.

Table 3.6: First iteration of pooling over a window of the feature map

In Table 3.6, the green box in the feature map marks the region of the window selected for pooling; in each of the three yellow boxes that follow, the green cell shows the result of the respective pooling method.

Table 3.7: Last iteration of pooling over a window of the feature map

Table 3.7 shows the result for the last selected region of the feature map. In the same way, the window moves two steps at a time across the whole feature map, and the pooling value is calculated for every selected region.

Table 3.8: Results of pooling over a window of the feature map for every move

Table 3.8 presents the complete results of the example for each stride step and for every pooling type.
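The three pooling variants can be sketched as follows, using a 2 by 2 window and a stride of 2 as in the example above. The function name and the 4 by 4 feature map values are assumptions made for illustration; they are not the values of Tables 3.6 to 3.8.

```python
def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Apply max, average, or sum pooling over square windows of a 2-D feature map."""
    reducers = {
        "max": max,
        "avg": lambda values: sum(values) / len(values),
        "sum": sum,
    }
    if mode not in reducers:
        raise ValueError(f"unknown pooling mode: {mode}")
    reduce_fn = reducers[mode]
    n_h, n_w = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, n_h - window + 1, stride):
        row = []
        for left in range(0, n_w - window + 1, stride):
            # collect the values of the currently selected window
            values = [feature_map[top + i][left + j]
                      for i in range(window) for j in range(window)]
            row.append(reduce_fn(values))
        pooled.append(row)
    return pooled


# Hypothetical 4 by 4 feature map (assumed values for illustration).
fm = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 0, 8],
]

print(pool2d(fm, mode="max"))  # [[6, 4], [7, 9]]
print(pool2d(fm, mode="avg"))  # [[3.75, 2.25], [4.0, 4.5]]
print(pool2d(fm, mode="sum"))  # [[15, 9], [16, 18]]
```

Here the 16 input values are reduced to 4 output values, which is the 75% reduction that a non-overlapping 2 by 2 window produces.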

3.4.3 Fully connected layer

The fully connected layer is considered the last part of the CNN architecture; it consists of a series of layers placed before the output layer. Internally, it comprises an input layer, several hidden layers, and the output layer, as shown in figure 3.3.

These layers consist of weights and biases that connect the layers, as every node in each layer is connected to every node in the next layer. The feature map output by the pooling layer is flattened into a vector and fed to the FC layer to classify each observed object. In the series of FC layers, the weights (W) and the activation function (ReLU) are applied to the feature values at each node. The FC layer aims to classify the types of objects based on their features (Gurucharan, 2020).
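The flatten, weight, and activation steps described above can be sketched as a tiny two-layer classifier. All names, layer sizes, and weight values here are hypothetical and chosen only to keep the arithmetic easy to follow; in a real network the weights and biases would be learned during training.

```python
import math


def flatten(feature_map):
    """Turn a 2-D feature map into a 1-D input vector."""
    return [value for row in feature_map for value in row]


def dense(x, W, b):
    """Fully connected layer: every output node sees every input value."""
    return [sum(w_ji * x_i for w_ji, x_i in zip(W_j, x)) + b_j
            for W_j, b_j in zip(W, b)]


def relu(v):
    return [max(0.0, z) for z in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pooled feature map and hand-picked weights (assumptions).
pooled = [[6, 4], [7, 9]]
W1 = [[0.1, 0.0, 0.0, 0.1], [-0.1, 0.0, -0.1, 0.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]

x = flatten(pooled)              # 4 input values
hidden = relu(dense(x, W1, b1))  # hidden layer with ReLU
probs = softmax(dense(hidden, W2, b2))  # class probabilities
print(probs)
```

The index of the largest entry in `probs` would be taken as the predicted class.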

Figure 3.5: A fully connected layer in a deep network (Dertat, 2017)

3.5 Activation function

The activation function is one of the most important components of the CNN model. An activation function in a neural network helps the model learn complex patterns in the data and decides when a neuron is activated. It takes the input value from the previous layer and applies a mathematical operation to produce the output value that is passed to the next layer. Its main responsibility is to introduce nonlinearity into the output values of the model. There are three main kinds of activation functions, ReLU, Tanh, and sigmoid, and each has its specific usage and importance in the field of artificial neural networks. SoftMax is a generalization of the sigmoid function to multiple classes and is mainly used for classification, to assign an object to one of several classes. The Rectified Linear Unit (ReLU) is widely used in neural networks because it is fast and computationally inexpensive and has a simple mathematical form (Gurucharan, 2020).
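For reference, the activation functions named above can be written in a few lines. This is a plain-Python sketch using their standard textbook definitions; the function names are chosen for illustration. The last line checks the relationship between SoftMax and sigmoid for the two-class case, where the SoftMax of the pair (z, 0) gives the same first value as sigmoid(z).

```python
import math


def relu(z):
    """Rectified Linear Unit: passes positive values, clips negatives to zero."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes z into the range (-1, 1)."""
    return math.tanh(z)


def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0), tanh(0.0)) # 0.5 0.0
print(abs(softmax([1.2, 0.0])[0] - sigmoid(1.2)) < 1e-12)  # True
```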

3.6 Optical character recognition

OCR is an acronym for optical character recognition, which is used to recognize text in multiple forms, such as handwritten text and digital text on various backgrounds. Humans can easily understand the content of an image or document by looking at it, whereas machines cannot understand the content in such a way. For this reason, OCR came into existence. The aim of OCR tools is to recognize digital or handwritten text in an image or document and to encode this text in computer-readable form, so that computerized systems can be automated. Such software is used to recognize text in various languages and translate it into machine-readable form. The OCR process consists of several subprocesses that process the image to obtain accurate results in the form of text. Firstly, the image is captured by the camera and saved in an image format such as JPEG or PNG, or in PDF format.

Secondly, the image or document is passed to the pre-processing stage, where the contrast and brightness of the image are adjusted. Thirdly, the localization process starts: the image is divided into zones, and attention is focused on the target area where the required text is located, which speeds up the start of the extraction process. Fourthly, the target area that contains the text is broken down into lines, words, and characters, and the software compares, recognizes, and identifies the text through various detection and recognition algorithms to produce the final output (Filip & Anuj, 2021).

Figure 3.6: Architecture of Optical character recognition (Filip, & Anuj, 2021)

Figure 3.6 illustrates the OCR process: the input data, consisting of scanned documents, PDF documents, or simple images, is given to the OCR software, which processes these documents and extracts the text in order to store it in the database.
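Two of the stages described above, pre-processing and breaking the text area into lines, can be illustrated with a deliberately simple sketch. The approach shown here (fixed-threshold binarization followed by a row-wise ink check) is an assumption chosen for illustration; it is not the method of any particular OCR tool, and the function names and the tiny grayscale image are hypothetical.

```python
def binarize(gray, threshold=128):
    """Map grayscale pixels (0-255) to 1 for dark ink and 0 for light background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def split_lines(binary):
    """Return (start_row, end_row) pairs, end exclusive, for runs of rows containing ink."""
    lines, start = [], None
    for r, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = r                  # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, r))   # a blank row ends the line
            start = None
    if start is not None:              # text reaches the bottom edge
        lines.append((start, len(binary)))
    return lines


# Hypothetical 6 by 4 grayscale image: two "text lines" separated by a blank row.
gray = [
    [255, 255, 255, 255],
    [255,  10,  20, 255],
    [255,  30, 255, 255],
    [255, 255, 255, 255],
    [ 40, 255, 255,  15],
    [255, 255, 255, 255],
]

binary = binarize(gray)
print(split_lines(binary))  # [(1, 3), (4, 5)]
```

Each row range could then be split further into words and characters before being handed to a recognition model.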
