Features Extraction of Tax Card by Using OCR Based Deep Learning Techniques


Qamar Uddin

Master's thesis

University of Eastern Finland School of Computing

Computer Science

June 2021


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu School of Computing

Computer Science

Qamar Uddin: Feature Extraction of Tax Card by Using OCR Based Deep Learning Techniques

Master’s Thesis, 56 p., Appendix 1 (5 p.)

Supervisors of the Master’s Thesis: Professor Xiao-Zhi Gao and Syed Muneer (CTO at FREE OY)

June 2021

Abstract: Recognizing objects and text characters in images has long been a popular research area in computer vision (CV). CV is applied to many kinds of problems, such as printed text recognition in documents, handwritten text recognition, number plate recognition, script classification of documents, and text recognition in scenes. Many optical character recognition (OCR) methods have been developed to address these problems. One well-known OCR tool, used in this study, is Py-Tesseract. This study aims to build a system that automatically detects the required features from a tax card: SSN, customer name, tax percent, additional tax percent, tax card year, and income limit. The proposed method is a CNN-based OCR (Tesseract) model that extracts these features from the image of the tax card. The results of the model are clear and reach an accuracy level of 80%.

Keywords: Computer Vision, Image Processing, Artificial Intelligence, Deep Learning, Convolutional Neural Network, Optical Character Recognition


Acknowledgment

This thesis was done at the School of Computing, University of Eastern Finland, during spring 2021. I am thankful to the University of Eastern Finland, its staff, and its teachers for offering me a great opportunity to pursue my master's studies. I am also thankful to my coordinator Dr. Oili Kohonen for the help, motivation, and guidance throughout the study.

I would like to express my humble thanks to Allah Almighty, who blessed me with the strength and knowledge to accomplish this achievement. I am thankful to the beloved Holy Prophet MUHAMMAD (PBUH), who is forever a source of knowledge and a role model for everyone.

I would like to express my sincere thanks to my supervisor Prof. Dr. Xiao-Zhi Gao for his continuous support, guidance, and advice throughout the study. It’s been a good experience and honor to work with him under his kind supervision.

I want to thank my co-supervisor Syed Muneer for helping me with the practical aspects of the project and for allowing me to work at FREE Laskutus Oy.

I would like to dedicate this thesis to my beloved Father. I am truly thankful and happy for every moment and every word of encouragement and motivation from my mom and dad. They have always been there for me, even from more than 5000 km away. Not to forget, special thanks to my friends, colleagues, brothers, and sisters for their love, support, prayers, and encouragement throughout my studies. This accomplishment would not have been possible without them.


List of Abbreviations

CV      Computer Vision
AI      Artificial Intelligence
ML      Machine Learning
DL      Deep Learning
NN      Neural Network
CNN     Convolutional Neural Network
RNN     Recurrent Neural Network
NLP     Natural Language Processing
OCR     Optical Character Recognition
HCI     Human-Computer Interaction
GIS     Geographical Information System
ASCII   American Standard Code for Information Interchange
AIS     Artificial Immune System
ACO     Ant Colony Optimiser
BCO     Bee Colony Optimization
GA      Genetic Algorithms
FLM     Fuzzy Logic Model
DBM     Deep Boltzmann Machine
FC      Fully Connected
RELU    Rectified Linear Unit
RE      Regular Expression


1 Introduction

For the last few decades, computer vision has been used in many fields to solve different kinds of problems, for example object recognition, face recognition, emotion recognition, visual search, gesture recognition, and text recognition. One of the most important areas where computer vision has been applied successfully is optical character recognition (OCR). OCR converts the text in a scanned or photographed image into a plain text file. Recognizing text in images is common in many real-life applications, including medical imaging, photos of documents, scanned documents, identity documents, subtitles on images or video, vehicle registration numbers on number plates, and vehicle chassis numbers (Robby et al., 2019).

Several OCR tools, including Py-tesseract, the Google Vision API, ABBYY FineReader, and E-aksharayan, have been used for text recognition in various domains and for different purposes. OCR models can recognize printed or handwritten text in multiple languages. The most common use case is to convert printed text from documents into machine-readable text files that can easily be edited with tools such as Microsoft Word and Google Docs.

Nowadays, OCR is considered one of the most prominent and widely used data entry methods. Before OCR technology, the only way to turn printed documents into editable text was to retype them manually on a typewriter. This was not only time-consuming but also produced many typographical errors. Thanks to the high performance and good output quality of OCR, paper-based documents in various languages and formats can easily be converted into editable text files (Thien & Minh, 2019).

This research applies OCR tools to scanned images of the Vero (Finnish Tax Administration) tax card to obtain useful customer information, for example the social security number, customer name, withholding tax percent, additional tax percent, tax year, and tax card number. The purpose of the study is to build a system that automatically detects these features from the image of the tax card in order to automate the verification and validation of customer data. The system was built and implemented at FREE Oy, Helsinki, which has thousands of customers' tax cards whose data had to be verified and validated manually. The main benefit of the system is that it makes data verification and validation faster, because it does this job automatically. Before the system was implemented, every customer's data was checked manually, which was very time-consuming.
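Once OCR has turned the tax card image into plain text, the listed fields can be located with regular expressions. The following is only a minimal sketch under stated assumptions: the sample text, the field labels, and every pattern are hypothetical and are not taken from a real Vero tax card or from the system built in this thesis. The SSN pattern approximates the Finnish personal identity code layout (six digits, a century sign, three digits, and a check character).

```python
import re

# Hypothetical OCR output; the layout and labels are invented for illustration.
SAMPLE_OCR_TEXT = """
VEROKORTTI 2021
Nimi: Matti Meikalainen
Henkilotunnus: 131052-308T
Perusprosentti 18,5 %
Lisaprosentti 38,0 %
Tuloraja 24000,00
"""

# Assumed patterns, one capture group per field.
PATTERNS = {
    "ssn": r"\b(\d{6}[+\-A]\d{3}[0-9A-Y])\b",
    "year": r"VEROKORTTI\s+(\d{4})",
    "name": r"Nimi:\s*(.+)",
    "tax_percent": r"Perusprosentti\s+(\d{1,2},\d)\s*%",
    "additional_tax_percent": r"Lisaprosentti\s+(\d{1,2},\d)\s*%",
    "income_limit": r"Tuloraja\s+(\d+,\d{2})",
}


def extract_features(ocr_text):
    """Return {field: matched string}, with None for fields that were not found."""
    features = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, ocr_text)
        features[field] = match.group(1).strip() if match else None
    return features


features = extract_features(SAMPLE_OCR_TEXT)
```

Returning None for a missing field, instead of raising an error, lets a later verification step treat an unreadable field as a failed check.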

1.1 Computer Vision

The term CV stands for computer vision, a field of techniques that enable a computer to see and understand the content of digital media such as images and video recordings. These techniques detect the required features in the raw media, process them, and convert the result into numerical values that the computer can use for further processing. For a person this task is trivial: even a child can understand the content of an image or a video clip after seeing it once. The challenge is to solve it with a computer and to apply the technology to real-world problems (Brownlee, 2019).

Computer vision is also considered a subfield of artificial intelligence and machine learning, whose methods it uses to extract useful information from digital media such as images and video. The useful information might be an object, an edge, text, a handwritten digit, or a three-dimensional model. In this sense AI, ML, and computer vision are interconnected (Brownlee, 2019).

In the last few decades, computer vision has been widely applied in fields such as agriculture, medical diagnostics and health care, mechanical industries, and text and character reading (OCR). In agriculture, computer vision techniques have been used to improve animal health and to increase the efficiency of livestock production. Vision algorithms can improve quality control and keep the production of livestock products in a steady flow, which speeds up the production process (Stuyft et al., 1991).

Various computer vision applications recognize things in an image or photograph in different ways. For example, object classification determines the class of the objects in the photograph, object identification identifies the type of a given object, object verification confirms whether the object is present in the image, object detection finds the location of the objects in the image, and object recognition determines both the type of the objects and their location. Initially, people thought that the problem could be solved easily, simply by linking a camera to a computer, but over time this prediction proved wrong, and the problem of computer vision remains unsolved. A few decades later, researchers realized that computer vision was not reaching the performance of human vision. The main challenge, extracting useful information from an image, is still under research, as researchers want to build a general-purpose seeing machine that matches the capability of human vision (Brownlee, 2019).

Computer vision overlaps with broad areas including image processing, pattern recognition, and computer graphics. It is one of the powerful disciplines of artificial intelligence: it takes an image as input and produces output by understanding the content of the image, for example by identifying the objects in it. Image processing has a different function. Image processing applies computational transformations to an image, such as sharpening, scaling, and segmentation, whereas computer vision builds a model and extracts data and useful information from the image. Image processing is also related to human-computer interaction. A reliable computer vision model is based on the following stages: image acquisition and processing, feature extraction, and image segmentation (Wiley & Lucas, 2018).

Figure 1.1: Object Detection Comparison (Voulodimos et al., 2018)


In figure 1.1, a deep learning technique known as a convolutional neural network (CNN) is used to perform computer vision tasks such as object detection, face recognition, and pose estimation. The ambition of neural networks is to create a system of artificial neurons that perceives and mimics the knowledge and reasoning of the human brain. In this figure, the model detects, identifies, and differentiates various objects in the scene image, and each detected object is enclosed in a bounding box.

Object detection, shown in figure 1.1, is one of the most prominent applications of computer vision. It identifies and differentiates certain classes of objects, such as birds, humans, cars, and airplanes, in a digital image. Panel (a) shows the ground truth of the objects, and the CNN tries to understand the various objects by identifying their features. In panel (b), the model locates the position of every single instance of an object in the image, based on the features identified in the previous step. In panel (c), the model classifies the objects into different classes and assigns a label to each individual object (Voulodimos et al., 2018).

1.2 Image Processing

Image processing is a subfield of computer vision and a form of signal processing that takes an image as input and processes it to produce either an image or a set of features or characteristics of the image as output. The raw image can be captured by a sensor or camera mounted on a satellite, or it can be taken from the ground. Image processing is used in applications such as medical imaging, remote sensing, agriculture, forensic and material science, graphics, the textile industry, media, and the printing industry. An image is a two-dimensional signal, and a human is usually involved in processing it in a useful way. Before digital image processing, the cost of processing was very high, and these tasks were done graphically with human involvement. After 2000, digital image processing techniques became very popular and useful because of faster processors and lower costs (Deepika et al., 2014).


Figure 1.2: Process of Image Processing (Tutorialspoint)

In figure 1.2, a picture taken with an ordinary digital camera is sent to a digital image processing system. The objective is to ignore all other details of the image and zoom in on the water drop in such a way that the quality of the image is preserved. The system processes the image with a processing algorithm and produces another image as output, with better quality, in which the selected part is magnified.

Image processing systems are based on two types of processing techniques: analog and digital. Analog image processing operates on two-dimensional analog signals through electrical means and can be used for hard copies such as printed documents and photographs. A common example is a television image. Analog signals have drawbacks: they take a lot of effort to store, and they introduce considerable noise into the image. Digital image processing, on the other hand, is a powerful and fast method that uses computer algorithms to process a digital image. Digital signals are less time-consuming to handle and can be stored easily. Digital image processing can produce a high-quality image at a low cost per processing stage (Deepika et al., 2014).

During the last few decades, several image processing methods have been proposed for enhancing and improving digital images obtained in different ways. Digital image processing has become very popular due to new technology such as powerful computers, large storage devices, and graphics software. With these technologies, a system can prepare the pictorial data of a photograph for human understanding and machine interpretation. The following techniques are used for processing digital images (Chitradevi & Srimathi, 2014):

• Image pre-processing.

• Image enhancement.

• Image segmentation.

• Feature extraction.

• Image classification.

Pre-processing is an important image processing technique that operates on an image at an early stage to remove undesired features before further processing. Image pre-processing methods exploit the redundancy in an image: neighbouring pixels that belong to the same object usually have the same or similar brightness values. To obtain a high-quality result, the following pre-processing operations are used: scaling, cropping, filtering, magnification, and reduction (Miljković, 2009).
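Two of the operations named above, cropping and magnification, can be sketched on a grayscale image stored as a nested list of brightness values. This is an illustrative sketch only: the function names are hypothetical, and magnification by pixel replication (nearest neighbour) is an assumption, since the text does not state which interpolation is meant.

```python
def crop(image, top, left, height, width):
    """Return the sub-image of the given size whose upper-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def magnify(image, factor):
    """Enlarge the image by an integer factor by repeating every pixel."""
    enlarged = []
    for row in image:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        enlarged.extend(list(wide_row) for _ in range(factor))
    return enlarged


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
region = crop(image, top=1, left=1, height=2, width=2)   # [[5, 6], [8, 9]]
zoomed = magnify(region, factor=2)                        # 4 x 4 image
```

Pixel replication keeps the original brightness values unchanged, which is why it is often a first choice when the goal is only to inspect a region more closely.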

Image enhancement refers to image processing techniques that modify a picture by changing the brightness values of its pixels to improve its visual appearance. It comprises several methods that improve the visual appearance of an image so that it is easier for humans to understand and for machines to interpret. In some cases, images captured by satellites or digital cameras lack contrast because of the illumination conditions, which leads to a poor-quality image. Such images may also contain different types of noise. The main objective of image enhancement is to convert the image into a form that is better suited for human understanding. Image enhancement is useful for purposes such as feature extraction and image analysis. Three enhancement techniques used to achieve the expected results are contrast stretching, noise filtering, and histogram modification (Chitradevi & Srimathi, 2014).
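Contrast stretching, the first of the three techniques, can be illustrated with a linear (min-max) mapping of the brightness range onto the full 0 to 255 scale. The sketch below assumes a grayscale image given as a nested list of integers; the linear mapping and the function name are illustrative choices, not the procedure of the cited authors.

```python
def contrast_stretch(image, new_min=0, new_max=255):
    """Linearly map the darkest pixel to new_min and the brightest to new_max."""
    pixels = [p for row in image for p in row]
    old_min, old_max = min(pixels), max(pixels)
    if old_max == old_min:
        # A flat image has no contrast to stretch; return an unchanged copy.
        return [list(row) for row in image]
    span = old_max - old_min
    return [
        [new_min + round((p - old_min) * (new_max - new_min) / span) for p in row]
        for row in image
    ]


low_contrast = [
    [100, 120],
    [160, 200],
]
stretched = contrast_stretch(low_contrast)
```

In this example the input only uses brightness values between 100 and 200, while the output spreads them over the whole available range.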


Figure 1.3: Noise Removal (Chitradevi & Srimathi, 2014)

Figure 1.3 illustrates noise filtering, a process that removes unwanted artifacts, such as blurry spots, from the image. It can also be used to filter out different kinds of noise.
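One common way to filter isolated noise pixels is a median filter, shown here as a hedged sketch. The choice of a 3 x 3 median filter is an assumption made for illustration; the figure does not state which filter was used. Border pixels are copied unchanged to keep the example short.

```python
from statistics import median


def median_filter(image):
    """Replace every interior pixel by the median of its 3 x 3 neighbourhood."""
    height, width = len(image), len(image[0])
    result = [list(row) for row in image]   # border pixels stay as they are
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            result[y][x] = median(window)
    return result


noisy = [
    [10, 10, 10],
    [10, 255, 10],   # a single bright noise pixel in the centre
    [10, 10, 10],
]
cleaned = median_filter(noisy)
```

Because the median ignores extreme values in the window, the single outlier is replaced by the brightness of its neighbours instead of being smeared over them, as an averaging filter would do.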

Image segmentation is a powerful image processing technique that partitions an image into multiple parts (objects or sets of pixels). It aims to change the representation of an image into a more meaningful form. It is also used to locate an object and its characteristics (edges, curves, and lines) in the image. A trained object classifier can then assign the objects to their specific classes, predicting the object type based on the object labels. Processing the whole image at once makes it difficult to get accurate results, which is the main reason for dividing the image into smaller pieces and obtaining useful information from them through segmentation. Because an image is a collection of pixels, the pixels can be grouped by their similar attributes during segmentation.
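Grouping pixels by a shared attribute can be sketched with a simple threshold followed by connected-region labelling. This is one basic segmentation approach chosen for illustration, not the method used in the thesis; the threshold value and the 4-neighbour connectivity are assumptions.

```python
def segment(image, threshold):
    """Label 4-connected regions of pixels brighter than threshold.

    Returns (labels, region_count); label 0 marks the background.
    """
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    region_count = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] <= threshold or labels[y][x]:
                continue
            # Start a new region and flood-fill it with an explicit stack.
            region_count += 1
            labels[y][x] = region_count
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] > threshold and not labels[ny][nx]:
                        labels[ny][nx] = region_count
                        stack.append((ny, nx))
    return labels, region_count


image = [
    [200, 200, 0, 0],
    [0, 0, 0, 180],
    [0, 0, 0, 190],
]
labels, count = segment(image, threshold=128)
```

Each labelled region can then be cropped and passed on separately, for example to a classifier or to an OCR step.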

Nowadays, image segmentation is one of the most widely used techniques in medical applications, where it identifies affected areas in an image, for example cancer cells or the extent of a skin disease. Identifying cancer cells indicates the seriousness of the disease. Segmentation makes it possible to examine the problem at a more granular level and to obtain meaningful results from the affected region of the image (Zhao & Xie, 2013).


Figure 1.4: Cancer Cells (source: Wikipedia)

Feature extraction (shown in figure 1.4) is an image processing technique that extracts useful information from various parts of the image in order to recognize and classify its content. It is used in different image processing applications, for example character recognition and object detection. In this research, feature extraction is used to extract features from an image.

Applications of this process include document verification and the reading of deposit slips, bank credit card details, and postal addresses. The method used in this research is optical character recognition based on feature selection and feature classification. These two components of feature extraction play a vital role in obtaining useful information from the image. Feature extraction is performed after the pre-processing stage of character recognition. Features describe the behaviour and shape of the image content, and they can be helpful for the pre-processing stage of image processing (Kumar & Bhatia, 2014).

There is still a need for work on various applications of character recognition using feature extraction. One such problem is detecting numeric features in the printed image of a tax card for automatic matching and verification. In this thesis, the study is conducted on images of the Vero (Finnish Tax Administration) tax card to extract the required customer features and automate tax card verification.

There are many applications in which text is detected and recognized automatically through OCR technology, such as multimedia systems, geographical information systems, and digital libraries. The OCR model converts the printed text in the image into computer-readable form (ASCII), and the system can then easily detect and recognize the required text in the converted output. Nowadays most information is presented either in printed documents or in the form of images or videos, which can easily be processed with this kind of technology (Wu et al., 1997).

1.3 Aims and Objectives

This thesis studies how deep-learning-based OCR models can be used to recognize and extract the printed text from the image of a tax card and to identify useful information in it. The problems addressed are how to replace the manual processing of tax cards and how to retrieve information from a database that holds many images containing text. OCR-based models make retrieving and processing tax card data efficient and quick. The tax card data was collected from FREE Oy, and two models (Tesseract and Google Cloud Vision OCR) were tested for proper evaluation.

The research for this thesis was carried out at FREE Oy, Helsinki, in a project on image processing of the Vero tax card. Customers using the service need to upload their Vero tax card into the FREE database. The idea of the project is to detect the required customer information from the tax card images uploaded to the company's system. One of the major objectives of this study is to automate the tax card module of the FREE database. When a customer uploads a tax card, the system automatically verifies and validates each customer feature from the image: it compares the data extracted from the tax card with the customer data that already exists in the database.
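The comparison between extracted and stored customer data can be pictured as a field-by-field check. The sketch below is hypothetical: the field names, the dictionary-based records, and the normalization rule are assumptions made for illustration and do not describe the actual FREE database or its schema.

```python
def normalize(value):
    """Lower-case the value and collapse whitespace so formatting differences are ignored."""
    if value is None:
        return None
    return " ".join(str(value).split()).casefold()


def verify_tax_card(extracted, stored, fields=("ssn", "name", "tax_percent", "year")):
    """Return {field: bool}, True where the extracted value matches the stored one.

    A field that OCR failed to extract (missing or None) never counts as a match.
    """
    report = {}
    for field in fields:
        found = normalize(extracted.get(field))
        expected = normalize(stored.get(field))
        report[field] = found is not None and found == expected
    return report


# Hypothetical example records.
extracted = {"ssn": "131052-308T", "name": "matti  meikalainen", "tax_percent": "18,5", "year": None}
stored = {"ssn": "131052-308T", "name": "Matti Meikalainen", "tax_percent": "20,0", "year": "2021"}
report = verify_tax_card(extracted, stored)
```

A per-field report, rather than a single pass or fail flag, shows which values would still need a manual check.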

The author’s contribution to this work includes the development, integration, testing, and implementation of Vero tax card processing in the FREE database. The methods implemented are deep-learning-based OCR models, namely Tesseract and Google Cloud Vision, which recognize the text in the tax card image. These models were chosen for two reasons: first, to speed up tax card evaluation, and second, because of their high accuracy and good performance in text recognition.

Previously, these models have been used for text recognition in various domains, such as supermarket receipts, bills from hotels, restaurants, and coffee shops, and objective-type exam papers in the educational sector. The novelty of this study lies in comparing the performance and processing speed of OCR models for the Vero (Finnish Tax Administration) tax card. The proposed system has been implemented successfully at FREE Oy for the automation of tax card processing.
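A comparison of accuracy and processing speed between OCR engines could be organized as in the following sketch. Everything here is a stand-in: the two engines are stub functions returning fixed strings, the file names are invented, and exact string equality is only one possible accuracy measure. It does not reproduce the evaluation procedure or the results of this thesis.

```python
import time


def evaluate_engine(ocr_engine, samples):
    """Return (accuracy, elapsed_seconds) for one engine.

    ocr_engine: callable that takes an image reference and returns recognized text.
    samples: list of (image_reference, expected_text) pairs.
    """
    start = time.perf_counter()
    correct = sum(1 for image, expected in samples if ocr_engine(image) == expected)
    elapsed = time.perf_counter() - start
    return correct / len(samples), elapsed


# Hypothetical test samples and stub engines standing in for real OCR calls.
SAMPLES = [("card_1.png", "18,5"), ("card_2.png", "20,0")]


def stub_engine_a(image):
    return {"card_1.png": "18,5", "card_2.png": "20,0"}[image]


def stub_engine_b(image):
    # Simulates a typical OCR confusion between the digit 0 and the letter O.
    return {"card_1.png": "18,5", "card_2.png": "2O,O"}[image]


accuracy_a, seconds_a = evaluate_engine(stub_engine_a, SAMPLES)
accuracy_b, seconds_b = evaluate_engine(stub_engine_b, SAMPLES)
```

Passing the engine as a callable keeps the measurement code identical for every model, so differences in the numbers come from the engines and not from the harness.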

1.4 Outline of the thesis

The thesis is organized as follows:

• Chapter 1: Gives a brief introduction to the thesis and explains computer vision, image processing, and some of their techniques that are used for object detection and for text recognition from images. It also states the aims and objectives of the thesis.

• Chapter 2: Summarizes traditional artificial intelligence and machine learning approaches related to computer vision, such as supervised and unsupervised learning.

• Chapter 3: Introduces the core concepts of the deep learning techniques used for text detection and compares optical character recognition methods.

• Chapter 4: Discusses the implementation of the program, explains the environment setup of the FREE platform, and presents an analysis of the results.

• Chapter 5: Presents the main conclusions and a brief discussion.


2 Artificial Intelligence and Machine Learning

This chapter briefly introduces the background and related work for computer vision, based on artificial intelligence and machine learning techniques.

2.1 Artificial Intelligence

The term artificial intelligence refers to the science and engineering of making machines, especially computer programs, intelligent enough to act like humans. AI-based systems are developed by studying how humans learn and perform activities: how the human brain thinks, how humans learn, and how they decide and work while solving real-life problems. AI is a branch of science and technology that draws on disciplines such as computer science, engineering, mathematics, biology, psychology, neuroscience, sociology, philosophy, and linguistics. Based on these disciplines, AI aims to develop computer programs with capabilities associated with human intelligence, such as reasoning, learning, thinking, and problem-solving (Kok et al., 2009).

Figure 2.1: Artificial Intelligence involvements (Tutorialspoint)


Figure 2.1 shows various applications of AI that are used in different fields to perform human-like activities. Problems in gaming, natural language processing, transportation, mental health, expert systems, vision systems, speech recognition, and handwritten text recognition have been addressed with AI. These are among the most important areas that directly affect the daily activities of human beings, and the use of AI in them speeds up work and brings new changes to the field.

Transportation is one of the major applications of AI, which has the potential to make traffic systems more efficient. It starts with helping cars, trains, ships, airplanes, and other types of traffic to function autonomously, making them smoother and more reliable. AI in transportation aims to address safety concerns and to reduce CO2 emissions and air pollution (nitrous oxides and particulates), which affect the environment and create challenges for human survival. Transport departments must find the best way to use this kind of technology to improve the transportation system and to offer efficient travel to the public. It can also improve the economy and productivity of a country. The following AI methods can improve transportation for the welfare of society: artificial neural networks, artificial immune systems, ant colony optimization, bee colony optimization, genetic algorithms, and fuzzy logic models (Abduljabbar et al., 2019).

2.2 Machine Learning

The word ‘learning’ refers to the process by which a living organism obtains knowledge or skills through instruction, experience, training, and observation. Examples include a plant learning how to respond to light and temperature, a bird learning how to fly and feed, a human learning how to ride a horse, and a baby learning how to observe and respond to a sound.

Machine learning is a branch of AI that can be defined as the ability of a system to acquire knowledge from training and experience. The machine learns from data that share similar patterns; the more data is provided to the machine, the more accurate the result will be. The machine produces results for new cases based on past data, because it can make predictions from the given factors. For a few decades, ML has been applied in many areas of research, including data mining, computer vision, medical diagnostics, credit card fraud detection, natural language processing, DNA sequencing, speech and handwriting recognition, biometrics, and mechanical and electrical robotics. ML aims to train the machine so that it can detect faults efficiently and highlight issues for improvement (Baştanlar & Özuysal, 2014).

Machine learning offers many methods that allow the machine to learn useful and important patterns directly from the data without human intervention. This ability helps the machine learn more efficiently and quickly through transfer learning, feature selection, and multitask learning. Machine learning has been applied in various applications and achieves state-of-the-art performance. Medical diagnostics is one of the most important applications of AI, where ML algorithms are used to locate the parts of the body most affected by disease. Such methods can detect and recognize feature patterns in a medical image. Feature extraction from medical images is very important for predicting improvement or deterioration in the diagnosis of interest (Erickson et al., 2017).

The motivation for machine learning grows day by day as it is applied in real-life applications to solve problems quickly without human intervention. Certain tasks are immensely hard to program by hand, including face and speech recognition, machine translation, spam filtering, robot motion control, and data mining. Such tasks are solved with the help of ML techniques (Lison, 2015).

Table 2.1: Examples of ML approach (Lison, 2015)


Table 2.1 summarizes some ML approaches and the kinds of data used as input and output for each technique.

2.3 Machine Learning Approach

Learning methods are classified into supervised learning, unsupervised learning, and semi-supervised learning, with reinforcement learning often listed as a further category. These learning approaches are further divided into classes and sub-classes such as classification, clustering, regression, and dimension reduction.

2.3.1 Supervised Learning

Supervised learning is one of the important ML techniques; it predicts and classifies things based on labeled data used to train the model. The training data consist of input-output pairs (I, O), where the correct output is already known to the system. When you learn a task under supervision, someone is watching to see whether you are getting it right or not. Consequently, supervised learning requires a complete set of labeled data for training the algorithm: each input example is tagged with the answer that the algorithm should produce. For example, a model can be trained on a labeled dataset of flower images. When a new, unlabeled image is given to the model, it compares the image with the training images and predicts the correct label for the flower (Lison, 2015).
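The idea of learning from (I, O) pairs and labeling a new input by comparison with the training examples can be sketched with a nearest-neighbour classifier. This is only one simple supervised method chosen for illustration; the two numeric features per flower and all values are hypothetical.

```python
import math


def predict_nearest(training_pairs, new_input):
    """Return the label (O) of the training input (I) closest to new_input."""
    nearest_input, nearest_label = min(
        training_pairs,
        key=lambda pair: math.dist(pair[0], new_input),
    )
    return nearest_label


# Hypothetical labeled training data: (petal length, petal width) -> flower label.
training_pairs = [
    ((1.4, 0.2), "setosa"),
    ((4.7, 1.4), "versicolor"),
    ((6.0, 2.5), "virginica"),
]

label_small = predict_nearest(training_pairs, (1.5, 0.3))
label_large = predict_nearest(training_pairs, (5.9, 2.3))
```

The labels in the training pairs are the only supervision the method receives; without them it could group similar flowers but could not name them.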

Supervised learning is performed in the context of regression and classification. In both, the aim is to find the association between input and output data so that the output can be predicted accurately. A central issue when choosing the complexity of a regression model is the bias-variance trade-off. With a small number of data points, a model that is too complex for the data tends to overfit, so a lower-complexity model is preferable. Commonly used supervised learning algorithms are logistic regression, linear regression, naive Bayes, support vector machines, artificial neural networks, and random forests (Devin, 2018).
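Linear regression, one of the algorithms listed above, finds the association between input and output by fitting a straight line with least squares. The following sketch uses the closed-form solution for a single input variable; the data points are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that follow y = 2x + 1 exactly.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept   # predicted output for the unseen input x = 5
```

A straight line has only two parameters, which makes it a low-complexity model in the sense of the bias-variance trade-off discussed above.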

Figure 2.2: Supervised learning algorithm process (ISHA,2018)

Figure 2.2 shows the training process of a supervised machine learning algorithm. The algorithm receives input data with labeled observations for training and, based on the knowledge gained from that training, can predict the correct result for new inputs.

2.3.2 Unsupervised Learning

Typically, unsupervised learning refers to a learning methodology that works with a large amount of unlabeled data. In other words, it is a machine learning technique applied to a dataset without explicit instructions on what to do or how to do it. The training data consist only of inputs, without any desired outputs or correct results. A model such as a neural network is used to find patterns and structure in the input data and to group the data into separate clusters from which useful information and features can be extracted.

Sometimes we do not have any output values and only have access to the set of input values. In that case, the system can still learn and identify patterns in the input data (Lison, 2015).

Figure 2.3: Unsupervised learning algorithm in features extraction (ISHA,2018)

Figure 2.3 explains the learning behaviour of unsupervised learning algorithms: the models automatically extract the features of the objects and find suitable patterns in the data for clustering purposes.

Unsupervised learning is a learning methodology in which the algorithm does not need any supervision; instead, the model is left to discover structure in the data on its own. It is mainly concerned with data that has not been classified. The aims of using unsupervised learning include the following: it can detect all kinds of unknown patterns in data; it helps to find features that can be useful for categorization; it takes place in real time, so the input data is analysed in the presence of the learner; and unlabeled data is easier to obtain than labeled data, which requires manual annotation. For example, suppose you want to train a machine to predict the time needed to travel from your workplace to home. It needs only some input data such as the time of day, the weather conditions, whether it is a holiday, and whether it is raining. Based on these input values the machine can predict the output value, which is the time required to travel from the workplace to home (ISHA,2018). Unsupervised learning algorithms fall into three main fields: clustering, anomaly detection, and association.
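
Clustering, the first of the three fields named above, can be sketched with a small k-means routine. This is a minimal illustration under stated assumptions: the one-dimensional observations, the two starting centroids, and the function name `kmeans_1d` are hypothetical and do not come from the thesis experiments.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around centroids without using any labels."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled observations forming two visible groups.
observations = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = kmeans_1d(observations, centroids=[0.0, 6.0])
print(centroids)
print(clusters)
```

No output values are supplied at any point; the grouping emerges only from the distances between the inputs.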

2.3.3 Reinforcement learning

Reinforcement learning (RL) is often described as lying between supervised and unsupervised learning. We may not always have direct access to the "right" output, but we can get a sense of its quality from the feedback received. RL is a mapping from situations to actions that maximizes a scalar reward or reinforcement signal. The learner is not told which actions to take; it must discover for itself which actions yield the highest reward. This form of machine learning is used by AI agents that try to find the best way to accomplish a particular objective or improve task performance. As the agent moves closer to the goal, it receives a reward. The overall aim is to choose the best next step so as to maximize the final reward from the actions (Sutton, 1992).
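
The mapping from situations to actions can be sketched with the Q-learning update rule on a toy problem. Everything in this sketch is an assumption made for illustration: the five-state corridor, the reward of 1 at the goal, the values of `ALPHA` and `GAMMA`, and the use of deterministic sweeps over all state-action pairs in place of random exploration.

```python
N_STATES = 5               # states 0..4 on a line; state 4 is the goal
ACTIONS = ("left", "right")
ALPHA, GAMMA = 0.5, 0.9    # learning rate and discount factor (assumed values)


def step(state, action):
    """Deterministic environment: return (next_state, reward)."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(sweeps=50):
    """Apply the Q-learning update to every state-action pair in turn."""
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(sweeps):
        for state in range(N_STATES - 1):      # the goal state is terminal
            for action in ACTIONS:
                next_state, reward = step(state, action)
                best_next = max(q[next_state].values())
                target = reward + GAMMA * best_next
                q[state][action] += ALPHA * (target - q[state][action])
    return q


q_table = train()
# Greedy policy: in each state pick the action with the highest learned value.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told that moving right is correct; the preference emerges because actions that lead towards the goal accumulate higher estimated reward.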

Figure 2.4: Machine learning Algorithms. (https://towardsdatascience.com/)

Figure 2.4 lists ML algorithms categorized into four classes: supervised learning (regression), supervised learning (classification), unsupervised learning (dimension reduction), and unsupervised learning (clustering). The figure serves as a cheat sheet of ML algorithms grouped under their parent categories.

3 Deep Learning Approach To Text Recognition

This section briefly explains deep learning approaches that have been used for text recognition, such as ANN, CNN, and OCR-based models.

3.1 Deep Learning

Deep learning is an advanced type of ML method that uses multiple layers to extract higher-level features from the input data. DL algorithms are inspired by the way the human brain processes data for decision-making. The word ‘deep’ in deep learning refers to the number of layers through which the data pass; each layer learns to transform its input into a more abstract and meaningful representation. For example, in image processing applications, the raw data may be an image or photograph of a human face.

The first layer may take in the collection of pixels, the second layer may identify and encode the edges in the image, the third layer may compose arrangements of edges, the fourth layer may identify the nose, lips, mouth, and eyes, and the fifth and final layer may recognize that there is a face in the image based on these collected features. Popular and influential deep learning models and methods include artificial neural networks, recurrent neural networks, feedforward networks, backpropagation, and deep Boltzmann machines.

With the rise and progression of deep learning techniques, computer vision has been substantially changed and reshaped. Text detection and recognition is one of the growing research areas in computer vision that has been influenced by this wave of change. Text is one of the most important creations of humankind and a carrier of natural language; it is useful in every field of work and serves as a medium of communication across time and space. Through DL techniques, text detection contributes to understanding and explaining the content of an image. Text in natural scene pictures or videos commonly carries basic semantic information, and scene text detection and recognition aim to locate the text in a scene image or video and recognize it automatically (Long et al, 2021).

Transfer learning is a tool in DL that can be used to address the basic problem of insufficient training data, as it transfers knowledge from a source domain to a target domain. Transfer learning has brought a large improvement to problems where the training data is insufficient to train the model. DL algorithms learn high-level features from massive data, which distinguishes DL techniques from traditional machine learning algorithms. DL algorithms automatically detect features of the data from the source domain by unsupervised or reinforcement learning algorithms, which saves user resources such as time and effort. In contrast, ordinary machine learning algorithms require features to be extracted manually, which places a heavy burden on users and is time-consuming (Tan et al., 2018).

Over the last few decades, deep learning and AI technologies have been used in various applications such as computer vision, autonomous driving, aviation, robotics, and natural language processing. The autonomous vehicle, or self-driving, revolution has brought large changes to industry because of deep learning techniques, and this technology is now moving from the laboratory to public roads. The deployment of this advanced technology has proved to be very beneficial for the public because it brings positive changes to our environment and society, such as fewer road accidents and less congestion (Grigorescu et al., 2020).

3.2 Artificial Neural Network

The human brain is the most precious and active organ of the body; it helps us to learn, understand, behave, act, and decide, and it controls many functions of the body such as thought, speech, and movement. The secret behind this ability lies in neurons. The neuron is considered the messenger of the body because it transfers information through impulses between different parts of the brain and to the rest of the nervous system. The best way to understand how an artificial neural network works is to study how the natural neural network inside the human brain works.

Neurons are the basic components of the brain and are responsible for learning and for holding knowledge and information. Modelled on the human brain, the artificial neural network was designed after the introduction of the artificial neuron. It takes input in raw form and uses different layers of artificial neurons to process those inputs into the desired output. The artificial neural network is not as efficient as the human brain, but both work in a similar way. The main difference between them is that the brain learns from experience, whereas the neural network learns by comparing different samples.

In image recognition, for instance, a neural network learns during the training cycle to identify images containing birds by observing sample images that have been labeled “birds” or “no birds”. Based on those results, the system can locate and identify birds in new images. Such a neural network starts from zero, with no knowledge of birds’ characteristics such as tail, wings, eyes, beak, or forehead. The system develops its own understanding of the relevant characteristics from the learning material being processed. In contrast to the neural network, the human brain does not start from zero because it is based on the evaluation process (Foote, 2021).

The structure of a neural network model consists of various components such as neurons, layers, weights, and activation functions. The neurons, also known as units or nodes, are categorized into input nodes, hidden nodes, and output nodes.

These nodes pick up information from one node and pass it to other nodes or, finally, to the output nodes as a result. This information passes through the different layers of the neural network. The input neurons receive information from the outside world in the form of a pattern or signal and transfer it to the hidden neurons that lie between the input and output neurons. The output neurons receive weighted information from the hidden neurons, process it through the activation function, and provide the result to the outside world as useful information. The weight factor is a very influential component in neural network training; weights can be positive, negative, or zero. A zero weight means that a neuron has no impact on the neurons it is connected to. Positive weight factors can increase the efficiency of the network, which improves the accuracy of the output and the quality of the solution (Mijwel et al., 2019).

Neural networks can be used in various applications across different domains to solve complex problems; some of them are listed here. In handwriting recognition, NNs convert handwritten digits into computer-readable form. In the stock exchange market, NNs have been used to predict upcoming downturns and the factors affecting the market. In image recognition, NNs differentiate and identify various objects within the same image. In voice recognition, a neural network can distinguish the pitch of the voice of different speakers. NNs can be used to build simulation and prediction-based approaches for critical problems such as weather forecasting, business downturns, and medical diagnosis. Neural networks can also be used in language translation, time series prediction, biometric recognition, and fraud detection (Mijwel et al., 2019).

3.2.1 Layers in Artificial Neural Network

There are three main types of layers in artificial neural networks, which can be further divided into subtypes; they are explained below.

Figure 3.1: Structure of Artificial Neural Network (Sordo, 2002)

Figure 3.1 shows the flow of information through the different layers of an artificial neural network, including the input layer, hidden layers, and output layer.

3.2.1.1 Input Layers

The input layer consists of input nodes that take information from the source in the form of signals and pass that information to the hidden layers for further processing. It is sometimes also known as the visible layer of the neural network.

3.2.1.2 Hidden layers

The hidden layers are the mapping layers between the input and output layers: they accept data from the input layer and provide information to the output layer. In these layers, the inputs of each unit are multiplied by their weight factors and summed, and the activation function is applied to the combined net result. There may be one or more hidden layers, depending on the situation and the needs of the data model.

3.2.1.3 Output Layers

The layer of nodes that produces the output variables is known as the output layer of the neural network. The output layer receives the processed information from the hidden layers.

3.2.2 Types of Artificial Neural Networks

Many types of neural networks can be used to solve various real-life problems; some of them are explained below.

3.2.2.1 Feed-forward neural networks

A feed-forward network is a kind of neural network in which signals travel only in the forward direction, from the input to the output. In feedforward neural networks there are no loops in the backward direction, which means that data can only flow from input to output.

Feedforward neural networks have fixed-size input and output data. They are widely used in various applications such as pattern generation and pattern recognition, document segmentation, prediction, function approximation, and classification.

The mathematical representation of Feed-forward neural network models is explained below.

The artificial neuron takes a vector of input values x1, x2, ..., xn, and each input value is multiplied by its weight factor w1, w2, ..., wn. The weighted input values are summed, and a bias value (b) is added to produce the net input value z.

$$z = \sum_{i=1}^{n} x_i w_i + b \qquad (3.1)$$

The net input z is then passed through the activation function (g) to produce the final output a=g(z), which can be transmitted to the next neurons.

$$a = g(z) = g\left(\sum_{i=1}^{n} x_i w_i + b\right) \qquad (3.2)$$

The activation function can be chosen according to the requirements of the problem, whereas the weight factors and the bias value (b) are determined by the learning rule during the training phase of the neural network model.
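
Equations (3.1) and (3.2) can be checked numerically with a short sketch of a single artificial neuron. The input values, weights, and bias below are hypothetical, and the sigmoid is only one possible choice for the activation function g.

```python
import math


def weighted_sum(x, w, b):
    """Equation (3.1): z = sum of x_i * w_i over all inputs, plus the bias b."""
    return sum(x_i * w_i for x_i, w_i in zip(x, w)) + b


def sigmoid(z):
    """One possible activation function g."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b, g=sigmoid):
    """Equation (3.2): a = g(z)."""
    return g(weighted_sum(x, w, b))


# Hypothetical inputs, weights and bias, chosen so that z is exactly zero.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0
print(weighted_sum(x, w, b))   # z
print(neuron_output(x, w, b))  # a = g(z)
```

In a trained network the values of `w` and `b` would come from the learning rule rather than being fixed by hand as they are here.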

The feedforward neural network is a classical neural network method that has long been used for problems arising in various disciplines. The network consists of multiple layers, and each layer is fully connected to the next so that the input data is processed directly into the output result. Because it has multiple layers, this network is also known as the multi-layer perceptron (Razavi & Tolson, 2011).
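
A forward pass through a small multi-layer perceptron can be sketched as follows. The weights are set by hand rather than learned, and the step activation and the XOR task are assumptions chosen only because they make the result easy to verify; they are not taken from the thesis.

```python
def step(z):
    """Threshold activation: the neuron fires only for positive net input."""
    return 1 if z > 0 else 0


def layer(inputs, weights, biases):
    """Fully connected layer: every output node sees every input node."""
    return [
        step(sum(x * w for x, w in zip(inputs, node_weights)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def xor_network(x1, x2):
    """Two-layer feed-forward pass: data flows from input to output only."""
    # Hidden layer: the first node behaves like OR, the second like AND.
    hidden = layer([x1, x2], weights=[[1, 1], [1, 1]], biases=[-0.5, -1.5])
    # Output layer: fires when OR is on and AND is off.
    output = layer(hidden, weights=[[1, -1]], biases=[-0.5])
    return output[0]


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

A single neuron cannot represent XOR, so the example also shows why the hidden layer is needed.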

3.2.2.2 Recurrent Neural Networks

The RNN is a kind of artificial neural network that is widely used in text detection and text-to-speech conversion to find patterns in sequences of data in natural language processing. In this network, the output of a layer is saved and fed back to the input, which helps to predict the outcome of later steps. Every node remembers some information that it had in the previous step. In short, each node acts as a memory cell that stores information while the data is passed from one step to the next. If the prediction is not correct, the system learns from the stored information and corrects itself through backpropagation to make the right prediction (Schmidt,2019).
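
The memory behaviour described above can be sketched with a single recurrent unit whose state is fed back at every step. The scalar weights `w_x` and `w_h`, the tanh activation, and the input sequences are hypothetical values chosen for illustration.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run one recurrent unit over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # The previous state h is fed back together with the new input x.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Two sequences that differ only in their first element.
with_signal = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
without_signal = rnn_forward([0.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
print(with_signal)
print(without_signal)
```

Although the last two inputs are identical in both sequences, the final states differ, because the unit still carries a trace of the first input.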

(a) An example of fully connected RNN (b) An example of simple RNN

Figure 3.2: Recurrent Neural Network Architecture (Medsker et al., 2001)

Figure 3.2 (a) shows the architecture of a fully connected network that does not have separate input nodes; each node gets input from the other nodes. Figure 3.2 (b) shows the simple architecture of an RNN, which is used to learn a single character of a string through a feed-forward structure.

3.3 Convolutional neural networks

The convolutional neural network (CNN) is one of the most powerful and widely used deep learning networks and has been applied in various computer vision tasks such as image processing, pattern recognition, and object classification. It can also be used for speech recognition. A CNN computes feature maps from the input image by repeatedly applying a set of operations and uses the result to classify the target object in the examined image. Nowadays, images with a large number of pixels are processed with CNNs: learned weight factors are applied to extract features, and the features are passed through successive layers of the network for further refinement. This cycle is repeated several times across the hidden layers, such as the convolution layers and pooling layers. In these layers, the image is divided into patches to which filters (masks) are applied, with padding where needed, to process and refine the output. The system can then classify the objects in the image into their classes. For example, if the training images contain dogs and cats, the trained CNN can classify both kinds of object into their classes and label them accordingly.

The name CNN is taken from a mathematical operation on matrices called convolution. A CNN has multiple types of layers, which can be divided into two groups. The first group, the convolutional layers and the fully connected layers, has trainable parameters; the second group, the pooling layers and the non-linearity layers, has no parameters at all. CNN is considered one of the most important neural networks for complex image processing problems, because such tasks are very hard to solve with traditional ANN methods. The advantage of CNN over other neural networks is that it reduces the number of parameters compared with an ANN, which is what allows it to succeed on complex problems. A CNN does not need to be given the features of the targeted zone of the image in advance. For example, in face detection, there is no need to specify the area of the image where the faces are located. The network builds on features learned from the first layer onwards: the first layer may detect edges, the second layer may identify shapes, and later layers may detect the face (Albawi et al., 2017).
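
The reduction in the number of parameters can be made concrete with a small calculation. The image size of 32 by 32 by 3, the 100 units, and the 3 by 3 filters are hypothetical figures chosen for illustration; the counting rules themselves are the standard ones for fully connected and convolutional layers.

```python
def dense_parameters(n_inputs, n_units):
    """Fully connected layer: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units


def conv_parameters(f, in_channels, n_filters):
    """Convolutional layer: each f x f x in_channels filter is shared across the image."""
    return (f * f * in_channels + 1) * n_filters


# Hypothetical 32 x 32 RGB image connected to 100 units or 100 filters of size 3 x 3.
n_inputs = 32 * 32 * 3
print(dense_parameters(n_inputs, 100))
print(conv_parameters(3, 3, 100))
```

Because each filter is reused at every position of the image, the convolutional layer needs only a small fraction of the parameters of the fully connected one.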

CNN has been used in various image recognition applications to identify a targeted object in an image. It can be used to detect numbers and characters as a string from an image, e.g., to capture the vehicle registration number from a number plate. It can be used on medical images to support diagnosis and to locate the affected part of the body. It can be used to detect the faces of animals or humans based on facial features. It can also be used to detect mechanical parts of an automobile in industry. In agriculture, it can be used to identify plant species from images of the flower, stem, and leaves (Gogul, & Kumar, 2017). More generally, it can be used to detect any object in an image based on its features and to classify the class of the object.

CNN has been widely used to detect text on various surfaces, character by character. Natural scene images are one of the areas where CNNs are applied to automate text detection. Text recognition techniques aim to recognize the words depicted in the image and fall into two categories: character recognition and whole-word recognition. Normally, text in document images is recognized using the OCR technique, which is well suited to identifying words and characters in such images.

In the case of scene images, OCR can fail because of characteristics such as font style, size, scene background, blurring effects, and other variations in the appearance of the scene. Such problems call for advanced CNN techniques (Jaderberg et al., 2017). Most OCR tools are used to detect printed text in images, for which they produce highly accurate results (Saidane, & Garcia, 2007).

Figure 3.3: CNN model used for text recognition (Jaderberg et al., 2017)

As shown in figure 3.3, the input image contains only a single text word, which is divided into image patches and passed through various CNN layers so that the word is recognized accurately at the output stage.

3.4 CNN Architecture

All CNN-based models follow this architecture for classification and pattern recognition of objects in an image. The fundamental sketch of the CNN model is shown in figure 3.4, and each part of the model is explained individually in detail.

Figure 3.4: Architecture of Convolutional neural network (Dertat, 2017)

Figure 3.4 shows that the model can take any sort of image as input and pass it through a series of layers, including convolution and pooling layers and finally several fully connected layers, to recognize the object and its features in the input image. In this process, other components also play an important role in transforming the input data into output data, such as the bias terms, the activation functions (for example ReLU), and SoftMax.
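
How the size of the data changes along such a pipeline can be traced with a short calculation. The sketch below uses the standard output-size rule for a window of size f moved with a given stride; the 28 by 28 input and the particular layer sizes are hypothetical and are not the configuration of the model in figure 3.4.

```python
def output_size(n, f, stride=1, padding=0):
    """Spatial size after sliding a window of size f over n positions."""
    return (n + 2 * padding - f) // stride + 1


def trace_shapes(height, width, channels, layers):
    """Return (layer name, (height, width, channels)) after each layer."""
    shapes = []
    for name, kind, f, stride, filters in layers:
        height = output_size(height, f, stride)
        width = output_size(width, f, stride)
        if kind == "conv":
            channels = filters  # one output channel per filter
        shapes.append((name, (height, width, channels)))
    return shapes


# Hypothetical stack: (name, kind, window size, stride, number of filters).
LAYERS = [
    ("conv1", "conv", 5, 1, 6),
    ("pool1", "pool", 2, 2, None),
    ("conv2", "conv", 5, 1, 16),
    ("pool2", "pool", 2, 2, None),
]

shapes = trace_shapes(28, 28, 1, LAYERS)
for name, shape in shapes:
    print(name, shape)

# The last feature maps are flattened into one vector for the fully connected layers.
final_height, final_width, final_channels = shapes[-1][1]
flattened = final_height * final_width * final_channels
print("flattened length:", flattened)
```

The spatial size shrinks at every stage while the number of channels grows, and the flattened vector at the end is what the fully connected layers receive.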

3.4.1 Convolutional Layers

The convolution layer is considered the main building block of the CNN model. It is located at the start of the model and is followed by a pooling layer, and this sequence is repeated several times. The mathematical representation of the dimensions of the image and of the filter, or kernel, is explained below.

$$\dim(\text{image}) = (n_H, n_W, n_C) \qquad (3.3)$$

Where:

n_H: the size of the height
n_W: the size of the width
n_C: the number of channels

In the case of an RGB image n_C = 3: we have red, green, and blue channels. The filter K should be square, and its dimension (f) is usually odd so that each pixel can be placed at the center of the kernel. When the filter is applied in a convolution, the kernel must have the same number of channels as the image. In this way, a different filter can be applied to each channel of the image. The dimension of the filter is represented as follows.

$$\dim(\text{filter}) = (f, f, n_C) \qquad (3.4)$$

Mathematically, for a given image and filter we have:

(3.5)

Using the same notation as before for the size of the height, the size of the width, and the number of channels, the dimension of the convolution obtained by applying a kernel to the image can be written more specifically as:

(3.6)

In the convolution layer, mathematical operations are performed with a kernel, or filter, to calculate the feature map of the image. Suppose the input image is a 5 by 5 (5*5) matrix of pixels. It is divided into overlapping 3 by 3 (3*3) patches to which a 3 by 3 (3*3) kernel is applied for feature extraction. Each 3 by 3 patch of the 5 by 5 image is multiplied element by element with the 3 by 3 kernel matrix, and the products are summed to give one value of the output feature map. In this example, the pixel values of both matrices are either zero or one (0, 1). The visual and mathematical representation of the image matrix with its dimensions is shown step by step.

Table 3.1: Input values of the image and kernel values

The left table shows the input values (pixels) of the image for the convolution, and the right table shows the convolution filter, also known as a mask or kernel, which is applied to every patch of the image.

Table 3.2: First patch of the image with the kernel and feature map

In table 3.2, the left box shows the multiplication of the first patch of the input image with the kernel, which gives the first value of the feature map shown in the second table.

Table 3.3: Second patch of the image with the kernel, and feature map

Table 3.3 shows the second iteration of the multiplication process, which gives the second value of the feature map. The process continues in this way until the last iteration, at which point all values of the feature map have been found.

Table 3.4: Final patch of the image with the kernel, and feature map

This is the final step of the multiplication process, as shown in the table, which gives the last value of the feature map.

Table 3.5: Input values of the image and final output value of feature map

Table 3.5 shows the result of the convolution, whose aim is to obtain the feature map of the input image. The feature map may capture features of an object, a text, or anything else such as a person, a dog, a cat, or a car. In this experiment, we performed the convolution operation on every patch of the input image by sliding the kernel over it. At every position of the input image, we performed element-wise multiplication and summed the products to obtain one value of the feature map. The yellow boxes show the input values of the image, the green box shows the mask or filter values, and the blue box shows the final feature map values obtained through the convolution operation.
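
The sliding-window procedure can be reproduced in a few lines of code. The 5 by 5 image and the 3 by 3 kernel below are illustrative binary values assumed for this sketch; they are not claimed to be the entries of tables 3.1 to 3.5.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    f = len(kernel)
    out_rows = len(image) - f + 1
    out_cols = len(image[0]) - f + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Element-wise multiplication of the patch with the kernel, then sum.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(f)
                for j in range(f)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative binary image and kernel (assumed values).
IMAGE = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
KERNEL = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(IMAGE, KERNEL)
for row in feature_map:
    print(row)
```

A 5 by 5 input with a 3 by 3 kernel yields a 3 by 3 feature map, one value for each position of the kernel.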

3.4.2 Pooling layers

The pooling layer is the second most crucial layer of the CNN model and comes after the convolutional layers. It is also known as the down-sampling layer because it reduces the size of the feature map that it receives from the Conv layer, which helps with problems such as overfitting, computational cost, and accuracy. This layer discards almost 75% of the data while preserving the essential information. In other words, it removes unnecessary information from the data to refine the result.

The main things reduced in this layer are the size of the feature maps and some neural connections, which speeds up processing. There is no need for padding (zero padding) when striding over the feature map (Akhtar, & Ragavendran, 2020).

There are three types of pooling: max, average, and sum pooling. The most important one is max pooling, which takes the maximum value from the selected window of the feature map; average pooling calculates the average value of the selected window, and sum pooling adds up all values of the selected window. Two terms are commonly used in calculating these values: the stride step and the window size. The stride step is the distance by which the selected region moves and is normally one (1), while the window is the region of the feature map selected for pooling (Dertat, 2017).

Here is a worked example of pooling layers (max, avg, and sum) using a 2 by 2 window and a stride of 2. Because the window size and the stride are both 2, the windows do not overlap.

Table 3.6: First iteration of pooling over a window of the feature map

In table 3.6, the green box in the feature map represents the region of the window selected for pooling, and in each of the next three yellow boxes the green cell represents the result of the respective pooling method.

Table 3.7: last iteration of pooling over a window of the feature map

Table 3.7 shows the results for the last selected region of the feature map. In the same way, the window moves two steps at a time across the whole feature map to calculate the pooling value for every selected region.

Table 3.8: Results of pooling’s over a window of the feature map for every move

Table 3.8 presents the complete results of the experiment for each stride step over the feature map and for every pooling type.
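
The three pooling types can be compared with a short sketch that uses a 2 by 2 window and a stride of 2, as in the example above. The 4 by 4 feature map is a set of assumed values for illustration and is not taken from tables 3.6 to 3.8.

```python
def pool(feature_map, size, stride, reduce_fn):
    """Apply reduce_fn to every size x size window, moving by `stride`."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Assumed 4 x 4 feature map.
FEATURE_MAP = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

max_pooled = pool(FEATURE_MAP, size=2, stride=2, reduce_fn=max)
avg_pooled = pool(FEATURE_MAP, size=2, stride=2, reduce_fn=average)
sum_pooled = pool(FEATURE_MAP, size=2, stride=2, reduce_fn=sum)
print(max_pooled)
print(avg_pooled)
print(sum_pooled)
```

Each 2 by 2 window of four values is reduced to a single value, so the 4 by 4 map shrinks to 2 by 2 and three quarters of the values are discarded.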

3.4.3 Fully connected layer

The fully connected layer is considered the last part of the CNN architecture; it consists of a series of layers placed before the output layer. Internally it has an input layer, several hidden layers, and an output layer, as shown in image 3.3.

These layers consist of weights and biases that connect the layers, as every node in each layer is connected to the nodes of the next layer. The feature map from the pooling layer is flattened into an input vector, which is fed to the FC layer to classify each observed object. In the series of FC layers, a weight factor (W) and the activation function (ReLU) are applied to the feature map values at each node. The FC layer aims to classify the types of objects based on their features (Gurucharan, 2020).

Figure 3.5: A fully connected layer in a deep network (Dertat, 2017)

3.5 Activation function

The activation function is one of the most important components of the CNN model. An activation function in a neural network helps the network to learn complex patterns in the data and decides when a neuron is activated. It takes the input value from the previous layer and applies a mathematical operation to produce the output value that is passed to the next layer. The main responsibility of an activation function is to bring nonlinearity into the output values of the model. There are three main kinds of activation functions, ReLU, Tanh, and sigmoid, and each has its specific usage and importance in the field of artificial neural networks. SoftMax is a generalization of the sigmoid function to multiple classes and is mainly used to classify an object into one of several classes. The Rectified Linear Unit (ReLU) is widely used in NNs because it is fast and inexpensive to compute and has a simpler mathematical form (Gurucharan, 2020).
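
The activation functions named above have short standard definitions, sketched here in plain Python. The input scores passed to `softmax` are hypothetical values used only to show its behaviour.

```python
import math


def relu(z):
    """Rectified Linear Unit: passes positive values, clips negatives to zero."""
    return max(0.0, z)


def sigmoid(z):
    """Squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real value into the interval (-1, 1)."""
    return math.tanh(z)


def softmax(scores):
    """Turns a list of scores into class probabilities that sum to one."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
print(tanh(0.0))

# Hypothetical class scores: the largest score receives the largest probability.
probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities)
```

ReLU needs only a comparison, which is one reason it is cheap to compute, whereas sigmoid, Tanh, and SoftMax all require exponentials.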

3.6 Optical character recognition

OCR is an acronym for optical character recognition, which is used for text recognition in multiple forms, such as handwriting recognition and digital text recognition against various backgrounds. Humans can easily understand the content of an image or document by looking at it, while machines cannot understand the content in such a way. This is the reason OCR came into existence. The aim of OCR tools is to recognize digital or handwritten text in an image or document and to encode this text in computer-readable form so that computerized systems can be automated. Such software is used to recognize and convert text in various languages into machine-readable form. The OCR process consists of several subprocesses that process the image to obtain accurate results in the form of text. Firstly, the image is captured with a camera or scanner and saved in an image format such as JPEG or PNG, or in PDF format.

Secondly, the image or document is passed to the pre-processing stage, where the contrast and brightness of the image are adjusted. Thirdly, the localization process starts: the image is divided into zones, and attention is focused on the targeted area where the required text is located, which speeds up the extraction process. Fourthly, the targeted area containing the text is broken down into lines, words, and characters, to which the software applies various detection and recognition algorithms to compare, recognize, and identify the text and produce the final output (Filip, & Anuj, 2021).

Figure 3.6: Architecture of Optical character recognition (Filip, & Anuj, 2021)
