Biomedical Named Entity Recognition: Extracting Food/Diet related Named Entities for Biomedical Relation Detection


Academic year: 2022



BIOMEDICAL NAMED ENTITY RECOGNITION

Extracting Food/Diet related named entities for biomedical relation detection

Master of Science Thesis
Faculty of Information and Communication Technology
Examiners: Prof. Frank Emmert-Streib, Prof. Joni Kämäräinen, Prof. Olli Yli-Harja
November 2020


ABSTRACT

M. Nadeesha Thilini Perera: Biomedical Named Entity Recognition
Master of Science Thesis
Tampere University
Master of Science in Information Technology
November 2020

Named Entity Recognition (NER) became significant in the biomedical field when biology started becoming more of an information science due to vast amounts of data in digital archives.

Many online biomedical journal search engines currently employ named entity recognition to some extent, so that the search space can be reduced. BioNER is also the first step in many biomedical text mining tasks such as summarization, document classification, mining associations between biological entities, and extracting evidence-based biological networks. While there are many BioNER tools with state-of-the-art performance for entities such as genes, proteins, diseases, chemicals, and species, there is little research using machine learning approaches for food and dietary compound entity extraction. Even in existing systems like FoodIE and the NCBO tagger, the focus is on extracting generic food names rather than dietary compounds such as minerals (iodine, zinc, calcium, and iron), phytochemicals (alkaloids, organosulfides, and flavonoids), and vitamins (folic acid, riboflavin, and biotin). This thesis aims to evaluate dictionary-based modeling and conditional random fields to develop a system capable of mining food and dietary compounds, including the above entity types, with state-of-the-art performance.

Ontology-based and rule-based approaches to food named entity recognition have been explored previously; therefore, we focused on purely dictionary-based and machine learning-based approaches for the NER system. As the machine learning approach, we use a simple yet powerful method for named entity recognition, Conditional Random Fields (CRF). Accordingly, we chose several food databases to compile the vocabulary for the dictionary-based method and selected the FoodBase corpus of 1000 recipe articles for training the machine learning-based system. Both systems use rich text features and are evaluated on the FoodBase corpus so that a comparison can be drawn with existing systems. Both of our systems were also evaluated for sensitivity to the size and quality of the training data, using F1-score, recall, precision, balanced accuracy, and standard error as scoring metrics.

Our results show a state-of-the-art F-score of 96.83% for named entity identification using the CRF-based system, with a standard error of 0.000345, which is 0.8% higher than the FoodIE system on the same FoodBase corpus. The system furthermore showed a balanced accuracy of over 97% in identifying named entity term boundaries using the IOB format. The dictionary model showed significant results as well, with an F-score of 95.7%. While the dictionary model was stable with respect to precision, the CRF model performed well on rare terms. In the future, we hope to achieve even better entity identification scores by combining our two models in an ensemble approach. Furthermore, this work will be extended to develop a disease-diet relation network extraction method.

Keywords: Natural Language Processing, Biomedical Text Mining, Named Entity Recognition, Machine Learning

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


PREFACE

When I started the Master’s degree, I was adamant that my thesis would be in a biomedical-related field due to personal reasons. Fortunately, I was introduced to Professor Frank Emmert-Streib, who directed me to a topic I was immediately interested in: biomedical text mining. I spent a few months, with his guidance, grasping the main concepts of the field and its research gaps, which then turned into a review paper published in the Frontiers journal:

Named Entity Recognition and Relation Detection for Biomedical Information Extraction (Perera et al. 2020).

The research gap we were interested in was the lack of a comprehensive system that can extract disease-diet networks. There is currently much attention on dietary compounds and how they positively or negatively affect diseases, but no system that can mine these associations with evidence from biomedical publications. However, we realized there were other topic-related gaps in the research. For example, a named entity recognition tool was lacking that could identify dietary entities like nutrients and chemical constituents. Therefore, we decided to start there and first develop a named entity recognition tool for food and dietary compounds, resulting in my Master’s Thesis. Here we discuss topics in biomedical named entity recognition, followed by methods, resources, and results.

In completing the thesis, my primary gratitude is to my supervisor, Associate Professor Frank Emmert-Streib, for believing in me and giving me the opportunity to work on the thesis under his guidance. He gave much encouragement and valuable advice, which helped me immensely to improve myself as a researcher as well as in my academics.

Coming after a long sabbatical from my career, his support has been invaluable in getting me reacquainted. Further, I am thankful to all of my colleagues at the TA wing and in my Data Engineering and Machine Learning major, who supported me not only academically but also by taking care of my little daughter, who needs extra attention, while I was sitting for exams and studying for courses.

Finally and most importantly, I thank my husband and my mother, who have always been the pillars of my success, and my darling daughter, the love of my life, whose struggles at birth and in life are what drive me to go into research in the biomedical field. I dedicate my work in their names, always.

Tampere, 26th November 2020

M. Nadeesha Thilini Perera


CONTENTS

1 Introduction
2 Background
 2.1 Challenges in BioNER
 2.2 Pre-processing
 2.3 Feature Processing
  2.3.1 Rich Text Features
  2.3.2 One-hot vector word representation
  2.3.3 Cluster-based word representation
  2.3.4 Distributional word representation
  2.3.5 GloVe
  2.3.6 Word2Vec
  2.3.7 fastText
  2.3.8 BERT/BioBERT
  2.3.9 ELMo
 2.4 BioNER Modeling
  2.4.1 Rule-based Models
  2.4.2 Dictionary-based Models
  2.4.3 Machine Learning Models
  2.4.4 Hybrid Models
 2.5 Evaluation
 2.6 Post Processing
 2.7 Applications and End-to-End Systems
3 Methods
 3.1 Tools and Datasets
  3.1.1 Tools
  3.1.2 Datasets
 3.2 Dictionary-based Method
  3.2.1 Preprocessing
  3.2.2 Feature Extraction
  3.2.3 Model Developing
  3.2.4 Model Evaluation
 3.3 Conditional Random Fields based Method
  3.3.1 Preprocessing
  3.3.2 Feature Extraction
  3.3.3 Model Training
  3.3.4 Model Evaluation
4 Results
 4.1 Dictionary-based Model
 4.2 Conditional Random Fields based Model
 4.3 Comparison of Models with regards to the NCBO Annotator
5 Discussion
6 Conclusion
References
Appendix A Appendix


1 INTRODUCTION

The phrase "Named Entity" first came into use in 1996 at the 6th Message Understanding Conference (MUC) (Nadeau and Sekine 2007), which is when extracting specific proper-noun entities from unstructured text sources became essential for commercial and defensive purposes. In general, Named Entity Recognition (NER) first involves automatically scanning through unstructured text to locate "entities" and normalizing them. Subsequently, these entities may require classifying into person names, organizations (such as companies, government organizations, committees), locations (like cities, countries, rivers), and time expressions (Mansouri et al. 2008). NER systems have undoubtedly become an invaluable tool in many fields, including research where one has to scan through millions of textual data sources looking for specific information. NER has also become a prerequisite for many other text mining tasks like summarizing, clustering, and classification.

In this thesis, however, we will focus specifically on how NER is becoming a vital tool in the biomedical field. While a great deal of progress has been made in general NER methods in the last decade, much territory remains unexplored in biomedical named entity recognition (BioNER). NER started becoming significant in the biomedical field when biology started becoming more of an information science concerned with analyzing large amounts of data in a growing number of digital biological databases and journal article publications (Leser and Hakenberg 2005). At the time of writing this thesis, PubMed (Bethesda (MD) 2005) reports containing over 30 million citations and Medline (Bethesda (MD) 2019) reports over 25 million references in their databases. Undoubtedly, this volume makes it difficult for researchers to keep up with biomedical literature even in more specific subject areas. Many online biomedical database search engines currently employ named entity recognition to reduce the search space. Furthermore, BioNER is also the first step in mining interactions and associations between biological entities, such as genes and diseases, diseases and drugs, and drugs and side effects, which are imperative in the medical research field. By using computational methods to narrow down the most specific associations that need to be further pursued and experimented on, time and financial efficiency are promoted in biological experiments.

Figure 1.1 shows an overview of trends in BioNER research in the form of scientific publication counts. We extracted the details of the publications that correspond to several combinations of terms related to "Biomedical Named Entity Recognition" from the Web of Science (WoS) between 2001 and 2019 and categorized them by general BioNER keywords, i.e., gene/protein, drugs/chemicals, diseases, and anatomy/species. The resulting counts of articles in each category were plotted chronologically. It can be observed that there has been a steadily increasing number of publications in general BioNER and positive growth in nearly every sub-category since the early 2000s. As such, it can be predicted that this trend will presumably continue in the next few years.

Figure 1.1. Publication trends in biomedical Named Entity Recognition (x-axis: year; y-axis: number of publications; named entity types: General BioNER, Gene (DNA/RNA), Protein, Disease, Chemical, Drug, Anatomy, Species). The numbers of published articles were obtained from Web of Science (WoS); the legend shows the different queries used for the search of WoS.

There are currently sophisticated BioNER tools that can tag not only concrete entity names but also abstract or conceptual entity names. Some examples of such entities are Genes/Proteins, Drugs, Adverse Effects, Metabolites, Diseases, Tissues, SNPs (Single-nucleotide Polymorphisms), Organs, Toxins, Pathways, Gene/Protein Sequences, and MeSH (Medical Subject Headings) terms (Liu et al. 2015). However, while some of these entity categories, like genes, proteins, chemicals, and diseases, have received extra attention, others, like food and dietary entities, have been less prioritized. Consequently, several systems can perform NER on genes, chemicals, or diseases with state-of-the-art (SOTA) performance. In contrast, there is only one food named entity recognition tool, FoodIE (Popovski et al. 2019a), and a few other ontology-based web services (which will be discussed in more detail in Section 2.7). Since most of these are not capable of capturing food constituents such as minerals (iodine, zinc, calcium, and iron), phytochemicals (alkaloids, organosulfides, carotenoids, and flavonoids), and vitamins (folic acid, riboflavin, and biotin), a critical research gap remains in the biomedical NER field. There is currently much attention on understanding how dietary compounds interact with drugs and with diseases such as cancers, diabetes, and cardiovascular disease, and how different foods can promote health benefits for general well-being. To explore these research areas with the help of current literature, it is vital to have a food and dietary named entity system capable of extracting food constituents with state-of-the-art accuracy.

Figure 1.2. The main steps in designing a BioNER system (Pre-processing, Feature Processing, NER Modelling, and Post Processing, transforming unstructured text into structured, annotated text), with an example from the manually annotated GENIA Corpus article MEDLINE:95343554 (Routes and Cook 1995).

In a generic BioNER system, the main steps consist of preprocessing, feature processing, model formulating/training, and post-processing, as illustrated in figure 1.2. In the preprocessing stage, data is cleaned, tokenized, and, in some cases, normalized to reduce ambiguity at the feature processing step. Feature processing includes the different methods used to extract the features that best represent the classes in question and then convert them into an appropriate representation for modeling. Notably, while dictionary- and rule-based methods can take features in their textual format, most machine learning methods require the tokens to be represented as real-valued numbers.

Selected features are then used to train or develop models capable of capturing entities, after which the output may go through a post-processing step to further increase accuracy.
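As a minimal illustration of these four stages, the sketch below chains toy implementations of each step; the function names, the feature set, and the dictionary lookup standing in for a trained model are illustrative assumptions, not part of any system described in this thesis.

```python
import re

def preprocess(text):
    """Clean and tokenize raw text into a list of lowercased tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def extract_features(tokens):
    """Map each token to a simple rich-text feature dict."""
    return [{"token": t, "is_digit": t.isdigit(), "suffix": t[-3:]} for t in tokens]

def model(features, lexicon):
    """Toy dictionary lookup standing in for a trained NER model."""
    return [(f["token"], "FOOD" if f["token"] in lexicon else "O") for f in features]

def postprocess(tags):
    """Keep only entity tokens to produce the structured output."""
    return [t for t in tags if t[1] != "O"]

lexicon = {"zinc", "riboflavin"}
tokens = preprocess("Zinc and riboflavin are studied.")
print(postprocess(model(extract_features(tokens), lexicon)))
# [('zinc', 'FOOD'), ('riboflavin', 'FOOD')]
```

In a real system, each stage is considerably richer; the point here is only the data flow from unstructured text to structured annotations.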

Our primary research goals in this thesis include:

1. Conducting an exhaustive review on biomedical named entity recognition for rela- tion detection and identifying research gaps,

2. Implementing and comparing the dictionary-based and the machine learning-based (using conditional random fields) named entity recognition models for food and dietary compounds,

3. Evaluating the stability of dictionary-based model for quality of test data and quan- tity of the dictionary term count,

4. Evaluating the CRF model stability for quality and quantity of training and testing data,

5. Building an ensemble method using the dictionary-based model and CRF model for higher performance in precision, recall, and F-score,

6. Comparing our implemented models with existing systems in the context of biomedical disease-related dietary entity extraction recall scores.

Hence, in the background section, we will follow the above NER system structure to describe the literature on each sub-step separately and extensively, using sections from our published review paper (Perera et al. 2020), written as a foundational step to this thesis.

The methods section will predominantly describe the two end-to-end systems that we developed and compared against each other and against earlier systems. The methods section will be followed by the results achieved and a discussion of those results. Finally, we will conclude with future work that will extend the thesis project.


2 BACKGROUND

There are several sub-steps involved in NER and BioNER. Consequently, the available literature in the field predominantly focuses on improving different parts of the process, rather than only on end-to-end systems. Therefore, we will organize this literature review into the different functional sections of the NER process, namely, Challenges in 2.1, Pre-processing in 2.2, Feature Processing in 2.3, Modelling in 2.4, Evaluation in 2.5, and Post-processing in 2.6. Finally, there will be two more sections: the first about the tools and data sources available for biomedical named entity recognition tasks, and the second about applications and research gaps, to justify selecting food and nutrition named entity recognition as our thesis topic.

2.1 Challenges in BioNER

There are several reasons why developing a comprehensive system to capture named entities is challenging. Such a system requires defining the types of NEs and category guidelines for those types, resolving semantic issues such as word sense disambiguation to deal with multi-class entities, and capturing the correct boundaries of a NE (Marrero et al. 2013). There are also additional domain-specific difficulties in developing a BioNER system; all of these factors complicate feature extraction and, subsequently, the system evaluation (Nayel et al. 2019). In this section, we will address some of these problems and proposed resolutions related to feature extraction, whereas the evaluation-related issues and resolutions will be discussed in section 2.5.

Text preprocessing and feature extraction for BioNER require the isolation of entities.

However, as in any natural language, many articles contain ambiguities stemming from the equivocal use of synonyms, homonyms, multi-word/nested NEs, and other ambiguities more specific to the biomedical domain, such as naming conventions (Nayel et al. 2019). For instance, the same entity names can be written differently in different articles, e.g., "Lymphocytic Leukemia" and "Lymphoblastic Leukaemia" (synonyms/British and American spelling differences). Some names may share the same head noun in an article, such as in "91 and 84 kDa proteins" corresponding to the nested terms "91 kDa protein" and "84 kDa protein", in which case the entity recognizer needs to take the context into account. There are various ways of resolving these ambiguities, using different techniques, e.g., name normalization and noun head resolving (D’Souza and Ng 2012; H. Li et al. 2017).
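A minimal sketch of spelling-variant name normalization is shown below; the variant table is a tiny illustrative assumption, not a real terminological resource, and real normalizers also handle synonyms, case, and word order.

```python
# Map British spellings and case variants onto one canonical form so
# that "Lymphoblastic Leukaemia" and "lymphoblastic leukemia" resolve
# to the same normalized name. The variant table is illustrative only.
SPELLING_VARIANTS = {"leukaemia": "leukemia", "tumour": "tumor", "anaemia": "anemia"}

def normalize(name):
    words = name.lower().split()
    return " ".join(SPELLING_VARIANTS.get(w, w) for w in words)

print(normalize("Lymphoblastic Leukaemia"))
# lymphoblastic leukemia
```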

Additionally, there are two distinct semantic-related issues resulting from metonymy, polysemy, and abbreviation usage. While most terms in the biomedical field have a specific meaning, there are still terms, e.g., gene and protein names, that can be used interchangeably; an example of which is GLP1R, used to refer to both the gene and the protein. Such complications may need ontologies and UMLS concepts to help resolve the entity category (Jovanović and Bagheri 2017). There are also terms used to describe a disease in layman's terms and drugs with ambiguous brand names. For example, diseases like Alice in Wonderland syndrome, Laughing Death, and Foreign Accent Syndrome, and drug names such as Sonata, Yasmin, and Lithium are easy culprits in confusing a BioNER system if there is no semantic analysis involved. For this reason, recent research work, e.g., (C. Zhang et al. 2019; Pesaranghader et al. 2019; Duque et al. 2018; Y. Wang, K. Zheng et al. 2018), discusses techniques for word sense disambiguation in biomedical text mining.

Another critical issue is the excessive usage of abbreviations with ambiguous meanings, such as "CLD", which could refer to "Cholesterol-lowering Drug", "Chronic Liver Disease", "Congenital Lung Disease", or "Chronic Lung Disease". Given the differences in meaning and BioNE class, it is crucial to identify the correct one. Despite being a sub-task of word sense disambiguation, authors like Gaudan et al. 2005 and Schwartz and Hearst 2002 have focused explicitly on abbreviation resolving due to its importance.
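One simple, hypothetical way to resolve such an abbreviation is to score each candidate expansion by its word overlap with the surrounding context; the sketch below is illustrative only and much weaker than the dedicated methods cited above.

```python
def resolve_abbreviation(abbrev, context, expansions):
    """Pick the expansion whose words overlap most with the context."""
    ctx = set(context.lower().split())
    def score(expansion):
        return len(set(expansion.lower().split()) & ctx)
    return max(expansions[abbrev], key=score)

# The candidate list mirrors the "CLD" example from the text.
expansions = {"CLD": ["Cholesterol-lowering Drug", "Chronic Liver Disease",
                      "Congenital Lung Disease", "Chronic Lung Disease"]}
sentence = "Patients with cirrhosis of the liver were diagnosed with CLD."
print(resolve_abbreviation("CLD", sentence, expansions))
# Chronic Liver Disease
```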

Whereas most of the above issues are a result of the lack of standard nomenclature in some biomedical domains, even the most standardized biological entity names can contain long chains of words, numbers, and control characters (for example, "2,4,4,6-Tetramethylcyclohexa-2,5-dien-1-one", "epidemic transient diaphragmatic spasm"). Such long named entities make the BioNER task complex, causing issues in defining boundaries for sequences of words referring to a biological entity. However, correct boundary definitions are essential in evaluating and training systems, especially when penalizing is required for failing to capture the complete entity (long NE capture) (Campos et al. 2012). One of the most commonly used solutions for the multi-word capturing challenge is to use a multi-segment representation (SR) model to tag words in a text as a combination of Inside, Outside, Beginning, Ending, Single, Rear, or Front, using standards like IOB, IOBES, IOE, or FROBES (Nayel et al. 2019; Keretna et al. 2015).
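The IOB scheme can be illustrated with a short sketch that converts token-level entity spans into B-/I-/O tags; the span format (start, end, type) and the tag names are a common convention, assumed here for illustration.

```python
def iob_tags(tokens, entities):
    """Tag tokens with B-/I-/O labels given (start, end, type) entity spans."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype          # Beginning of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # Inside the entity
    return tags

# The nested-term example "91 kDa protein" as a three-token entity:
tokens = ["91", "kDa", "protein", "binds", "DNA"]
print(iob_tags(tokens, [(0, 3, "PROT")]))
# ['B-PROT', 'I-PROT', 'I-PROT', 'O', 'O']
```

Tagging each token rather than each entity is what lets a sequence model learn boundaries of arbitrarily long names.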

2.2 Pre-processing

For general NLP tasks, preprocessing includes data cleaning, tokenizing, stopping, stemming or extracting lemmas, sentence boundary detection, and spelling and case normalization (Miner et al. 2012); however, based on the application, the usage of these steps can vary.

However, preprocessing in BioNER commonly comprises data cleaning, tokenization, name normalization, and abbreviation and head noun resolving measures to lessen complications in the feature processing step. Some studies follow the TTL model (Tokenization, Tagging, and Lemmatization) suggested by Ion 2007 as a standard preprocessing framework for biomedical text mining applications (Mitrofan and Ion 2017) in order to minimize complications. In this approach, the main steps include sentence splitting and segmenting words into meaningful chunks (tokens), part-of-speech (POS) tagging, and grouping tokens based on similar meanings, i.e., lemmatization using linguistic rules.
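The TTL steps can be sketched as follows; the regex-based sentence splitter and the naive suffix-stripping "lemmatizer" are simplifying assumptions, since real systems use linguistic rules and POS information, and POS tagging itself is omitted here.

```python
import re

def ttl_preprocess(text):
    """A toy TTL-style pass: sentence splitting, tokenization, and a
    naive suffix-stripping 'lemmatizer' standing in for rule-based
    lemmatization. Returns (token, lemma) pairs per sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sent in sentences:
        tokens = re.findall(r"\w+|[^\w\s]", sent)
        lemmas = [re.sub(r"(ing|ies|s)$", "", t.lower()) for t in tokens]
        result.append(list(zip(tokens, lemmas)))
    return result

print(ttl_preprocess("Vitamins regulate signaling. Iron binds proteins."))
```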

2.3 Feature Processing

In systems that use rules and dictionaries, orthographic and morphological feature extraction focusing on word formation is the prime choice. Hence, they depend heavily on techniques based on word formation and language syntax. Examples include regular expressions to identify the presence of words beginning with capital letters and entity-type-specific characters, suffixes, and prefixes, counting the number of characters, and part-of-speech (POS) analysis to extract nouns/noun phrases (Campos et al. 2012).
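For illustration, a few such orthographic and morphological cues can be computed with simple regular expressions; the particular feature set below is an assumption for the sketch, not the one used by any cited system.

```python
import re

def orthographic_features(token):
    """Illustrative orthographic/morphological cues of the kind used by
    rule- and dictionary-based BioNER systems."""
    return {
        "init_cap": token[:1].isupper(),                       # capitalized?
        "all_caps": token.isupper(),                           # e.g. gene symbols
        "has_digit": bool(re.search(r"\d", token)),            # e.g. "IL-2"
        "has_hyphen": "-" in token,
        "has_greek": bool(re.search(r"alpha|beta|gamma", token.lower())),
        "suffix3": token[-3:],                                 # e.g. "-ase"
        "length": len(token),
    }

print(orthographic_features("IL-2"))
```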

For most machine learning approaches, feature processing is mostly concerned with real-valued word representations (WR), since most machine learning methods require a real-valued input (Levy and Goldberg 2014). While the simplest of these use bag-of-words or POS tags with term frequencies or a binary representation (one-hot encoding), the more advanced formulations also perform a dimensionality reduction, e.g., using clustering-based or distributional representations (Turian et al. 2010).

However, the current state-of-the-art (SOTA) method for feature representation in biomedical text mining is word embeddings, due to their sensitivity to even hidden semantic/syntactic details (Pennington et al. 2014). For a word embedding, a real-valued vector representing a word is learned in an unsupervised or semi-supervised way from a text corpus. While the groundwork for word embedding was laid by (Collobert and Weston 2008; Collobert, Weston et al. 2011), over the last few years much progress has been made in neural network-based text embedding, taking into account the context, semantics, and syntax for NLP applications (S. Wang et al. 2020). Below, we discuss some of the most significant word representation approaches and those embedding methods applicable to the biomedical field.

2.3.1 Rich Text Features

The most commonly used rich text features in BioNER are linguistic, orthographic, morphological, contextual, and lexicon features (Campos et al. 2012), all of which are extensively used when it comes to rule-based and dictionary-based NER modeling. To further elaborate on the feature types mentioned above: linguistic features generally focus on a given text's grammatical syntax, such as parts of speech. In contrast, orthographic features emphasize word-formation characteristics such as the presence of uppercase letters, specific symbols, or the number of occurrences of a particular digit. Comparatively, morphological features prioritize the common characteristics that can quickly identify a named entity, for instance, a suffix or prefix. Contextual features are equally crucial in word embedding approaches as well, where the preceding and succeeding token characteristics of a word are used to represent that word. Finally, lexicon features provide additional domain specificity to named entities by maintaining extensive dictionaries with tokens, synonyms, and trigger words to enhance the performance of the system (Campos et al. 2012).

2.3.2 One-hot vector word representation

The one-hot-encoded vector is the most basic word embedding method. For a vocabulary of size N, each word is assigned a binary vector of length N, where all components are zero except the one corresponding to the index of the word (Braud and Denis 2015). Usually, this index is obtained from a ranking of all words, where the rank corresponds to the index. The biggest issue with this representation is the size of the word vector, since for a large corpus, word vectors are very high-dimensional and predominantly sparse. Besides, each word's frequency and contextual information, which can be vital in specific applications, are lost in this representation.
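A one-hot encoding over a toy vocabulary can be sketched as follows, using sorted order as the assumed ranking:

```python
def one_hot(vocab, word):
    """Binary vector of length N with a single 1 at the word's index;
    sorted order stands in for the ranking of all words."""
    index = sorted(vocab).index(word)
    return [1 if i == index else 0 for i in range(len(vocab))]

vocab = {"calcium", "iron", "zinc"}
print(one_hot(vocab, "iron"))
# [0, 1, 0]
```

Note that the vector length equals the vocabulary size, which is exactly the sparsity problem described above.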

2.3.3 Cluster-based word representation

In clustering-based word representation, the basic idea is that each cluster of words should contain words with contextually similar information. The algorithm most frequently used for this approach is Brown clustering (Brown et al. 1992). Specifically, Brown clustering is a hierarchical agglomerative clustering capable of representing contextual relationships of words by a binary tree. Notably, the binary tree structure is learned from word probabilities, and the clusters of words are obtained by maximizing their mutual information. The leaves of the binary tree represent the words, and the paths from the root to the leaves can encode each word as a binary vector. Furthermore, similar paths and similar parents/grandparents among words indicate a close semantic/syntactic relationship. While similar to a one-hot vector word representation, this approach reduces the representation vector's dimension and sparsity while including contextual information (B. Tang et al. 2014).


2.3.4 Distributional word representation

The distributional word representation method uses co-occurrence matrices to extract hidden semantic information. The first step involves obtaining a co-occurrence matrix F with dimensions V × C, where V is the vocabulary size and C the number of contexts, and each F_ij gives the frequency of a word i ∈ V co-occurring with context j ∈ C. Hence, in this approach, it is necessary for the preprocessing to perform stop-word filtering, since high frequencies of unrelated words can negatively affect the results. In the second step, a statistical approximation or unsupervised learning function g() is applied to the matrix F to reduce its dimensionality, such that f = g(F), where the resulting f is a matrix of dimensions V × d with d ≪ C. The rows of this matrix represent the words in the vocabulary, and the columns give the components of each word vector (Turian et al. 2010). Some of the most common functions g() used for dimensionality reduction include clustering (Turian et al. 2010), self-organizing semantic maps (Turian et al. 2010), Latent Dirichlet Allocation (LDA) (Turian et al. 2010), Latent Semantic Analysis (LSA) (Sahlgren 2006), Random Indexing (Sahlgren 2006), and Hyperspace Analogue to Language (HAL) (Sahlgren 2006). The main disadvantage of these models is that they become computationally expensive for large data sets.
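The two steps can be sketched with a toy co-occurrence matrix and truncated SVD playing the role of g(); the counts, the word labels in the comments, and the choice of SVD are illustrative assumptions.

```python
import numpy as np

# A toy V x C co-occurrence matrix F (V = 3 words, C = 3 contexts);
# rows 0 and 1 share a context profile, row 2 does not.
F = np.array([[4, 0, 1],   # e.g. "iron"
              [3, 0, 2],   # e.g. "zinc"
              [0, 5, 1]])  # e.g. "recipe"

# g(): truncate the SVD to d dimensions, giving f = g(F) of shape V x d.
U, S, Vt = np.linalg.svd(F, full_matrices=False)
d = 2
f = U[:, :d] * S[:d]       # rows are the d-dimensional word vectors

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Contextually similar words keep similar reduced vectors.
print(cos(f[0], f[1]) > cos(f[0], f[2]))
# True
```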

2.3.5 GloVe

GloVe (Global Vectors) is a word embedding method inspired by the distributional and cluster-based word representations. Its name emphasizes that global, corpus-wide statistics are captured by the method, as opposed to word2vec in section 2.3.6, where local statistics of words are assessed (Pennington et al. 2014). This embedding method uses an unsupervised learning algorithm to derive vector representations for words. After that, the contextual distance among words, defined by logarithmic probability, is used to create linear sub-structural patterns in the vector space. The method bases itself on how word-word co-occurrence probabilities evaluated on a given corpus can interpret the words' semantic dependence. For example, if we consider two words i and j, a simplified version of an equation for GloVe is given by

w_i^T · w̃_j = log(P_ij) = log(X_ij / X_i). (2.1)

Here w_i ∈ R^d is the word vector for word i, w̃_j ∈ R^d is the contextual word vector, and P_ij = P(j|i) = X_ij / X_i is the probability of co-occurrence between the words i and j. An in-depth description of GloVe can be found in (Pennington et al. 2014).


2.3.6 Word2Vec

Word2Vec is a state-of-the-art word representation model using a two-layer shallow neural network. It takes a textual corpus as input, creates a vocabulary out of it, and produces a multidimensional vector representation for each word as output. The word vectors position themselves in the vector space such that words with a common contextual meaning are closer to each other. There are two algorithms in the Word2Vec architecture, i.e., Continuous Bag-of-Words (CBoW) and Continuous Skip-Gram, both using a window size to capture the local context of each word. While the former predicts the current word from its close contextual words within the window (with no consideration of the order of those words), the latter uses the current word to predict the words that surround it. The network ultimately outputs either a vector representing a word (in CBoW) or vectors representing a set of words (in skip-gram). Figure 2.1 illustrates the basic mechanisms of the two architectures of word2vec, CBOW and Skip-Gram. Details about these algorithms can be found in (Mikolov, K. Chen et al. 2013; Mikolov, Sutskever et al. 2013); parameter learning of Word2Vec is explained in (Rong 2014).

Figure 2.1. An overview of the Continuous Bag-of-Words Algorithm and the Skip-Gram Algorithm. CBOW predicts the current word based on surrounding words, whereas Skip-Gram predicts surrounding words based on the current word. Here w(t) represents a word sequence.
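The difference between the two algorithms can be illustrated by the training pairs they generate from a window; the sketch below is a simplification (it ignores subsampling, negative sampling, and the networks themselves), and the example tokens are made up.

```python
def training_pairs(tokens, window=2):
    """Generate (context, target) pairs as CBOW sees them; skip-gram
    inverts each pair into (target, context_word) examples."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))                    # CBOW: context -> word
        skipgram.extend((target, c) for c in context)     # Skip-gram: word -> context
    return cbow, skipgram

cbow, sg = training_pairs(["iron", "rich", "leafy", "greens"], window=1)
print(cbow[1])  # (['iron', 'leafy'], 'rich')
print(sg[:2])   # [('iron', 'rich'), ('rich', 'iron')]
```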

2.3.7 fastText

fastText, introduced by researchers at Facebook, is an extension of Word2Vec. Instead of directly learning the vector representation of a word, it first learns the word as represented by its character N-grams. For example, if someone is embedding the word collagen using a 3-gram character representation, the representation would be <co, col, oll, lla, lag, age, gen, en>, where < and > indicate the boundaries of the word. fastText effectively represents suffixes/prefixes, the meanings of short words, and the embedding of rare words, even when those are not present in a training corpus, since the training uses characters rather than words (Joulin et al. 2016). This embedding method has been adapted to the biomedical domain by Pylieva et al. 2018 and Y. Wang, J. Wang et al. 2018.
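The boundary-marked n-gram representation can be reproduced with a few lines; this sketch covers only the n-gram extraction, not the embedding training itself.

```python
def char_ngrams(word, n=3):
    """Character n-grams with < and > marking word boundaries, as in
    fastText's subword representation."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("collagen"))
# ['<co', 'col', 'oll', 'lla', 'lag', 'age', 'gen', 'en>']
```

Because any out-of-vocabulary word shares n-grams with known words, its vector can be composed from subword vectors, which is why rare biomedical terms benefit from this scheme.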

2.3.8 BERT/BioBERT

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2018) is a relatively new approach to text embedding that has been successfully applied directly to several biomedical text mining tasks (Peng et al. 2019). BERT uses the Transformer learning model to learn contextual token embeddings of a given sentence bidirectionally (from both left and right, averaged over a sentence). The learning is done using the encoders and decoders of the Transformer model combined with Masked Language Modeling, which trains the network to predict the original text. While the general NLP model was pre-trained with unlabeled data from standard English corpora, in the domain-specific version BioBERT (J. Lee et al. 2020), the authors take the pre-trained BERT model and use its weights as initial weights. They then train the BERT model again with PubMed abstracts and PubMed Central full-text articles. The model can further be fine-tuned using benchmark corpora such as those in Table 2.1 to make it more task- or NE-type-oriented.
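The Masked Language Modeling objective mentioned above can be sketched as an input-corruption step. The simplified Python sketch below covers only the masking itself; the 15% selection rate and the 80% [MASK] replacement follow the BERT paper, while omitting the random-word swap and the toy sentence are assumptions made for brevity:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style input corruption: sample ~mask_rate of the positions as
    prediction targets and replace 80% of them with [MASK]. (Simplified:
    the remaining 20% stay unchanged here, whereas full BERT also swaps
    some selected tokens for random vocabulary words.)"""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}  # position -> original token the model must reconstruct
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            if rng.random() < 0.8:
                corrupted[i] = "[MASK]"
    return corrupted, targets

sentence = "riboflavin deficiency is associated with anemia in some patients".split()
masked, targets = mask_tokens(sentence, mask_rate=0.5, seed=1)
```

The network is then trained to predict the entries of `targets` from the corrupted sequence, which forces it to encode bidirectional context for every position.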

2.3.9 ELMo

ELMo (Embeddings from Language Models) is a context-dependent deep word embedding model that learns word vectors using a bidirectional LSTM and a language model. This method can model both syntactic-semantic variations in words and linguistic characteristics such as polysemy, i.e., different meanings of a word based on context (Peters et al. 2018). Like BERT, ELMo can be pre-trained on large text corpora and then be fine-tuned to address specific NLP problems, such as entity recognition or event extraction.

2.4 BioNER Modeling

Modeling methods in BioNER can be divided into four categories: rule-based, dictionary-based, machine learning based, and hybrid models (Eltyeb and Salim 2014). However, in recent years, the focus has shifted to either pure machine learning approaches or hybrid techniques combining rules and dictionaries with machine learning methods.

While supervised learning methods heavily dominate machine learning approaches in the literature, some semi-supervised and even unsupervised learning approaches are also used. Examples of such work are discussed briefly in the sections below.


The earliest machine learning approaches for BioNER focused on Support Vector Machines (SVM), Hidden Markov Models (HMM), and decision trees. However, most NER research currently utilizes deep learning with sequential data and Conditional Random Fields (CRF).

2.4.1 Rule-based Models

Rule-based approaches, unlike decision trees or statistical methods, use handcrafted rules to capture named entities and classify them based on their orthographic and morphological features. For instance, it is conventional in the English language to start proper names, i.e., named entities, with a capital letter. Hence entities with features like uppercase letters, symbols, digits, suffixes, and prefixes can be captured, for example, using regex expressions. Additionally, part-of-speech taggers can be used to fragment sentences and capture noun phrases. It is common practice, in this case, to include the complete phrases as entities if at least one part of them is identified as an entity.

An example of the earliest rule-based BioNER systems is PASTA (Protein Active Site Template Acquisition (Gaizauskas et al. 2003)), in which entity tagging was performed heuristically by defining 12 classes of technical terms, including scope guidelines. Each document is first analyzed for sections with technical text, split into tokens, and analyzed for semantic and syntactic features, before morphological and lexical features are extracted. The system then uses handcrafted rules to tag and classify terms into the 12 categories of technical terms. The terms are tagged with their respective classes using the SGML (Standard Generalized Markup Language) format. One of the most recent systems significant to this thesis is the food named entity recognition system FoodIE (Popovski et al. 2019a), which uses rule sets to extract generic food entities from recipes. Additionally, hybrid systems like C.-H. Wei, Kao et al. 2012 and Eftimov et al. 2017 show how combining heuristic rules with dictionaries may result in higher state-of-the-art F-scores. The two techniques complement each other: rules compensate for the limits of exact dictionary matches, and dictionaries refine results extracted through rules.

The main drawbacks of rule-based systems are the time-consuming processes involved in handcrafting rules to cover all possible patterns of interest and the ineffectiveness of such rules towards unseen terms. However, in an instance where an entity class is well defined, it is possible to formulate meticulous rule-based systems that achieve both high precision and recall. For example, most species entity tagging systems rely on binomial nomenclature (the two-term naming system of species), which provides clearly defined entity boundaries, qualifying as an ideal candidate for a rule-based NER system.
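As a sketch of such a rule, a binomial-nomenclature matcher can be written as a single regular expression. The pattern below is a deliberate simplification: it over-generates on ordinary capitalized word pairs and misses subspecies and strain designations.

```python
import re

# Genus capitalized (or abbreviated, as in "E."), species epithet in lowercase.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{3,})\b")

text = "Escherichia coli and Homo sapiens share pathways; E. coli grows fast."
print([m.group(0) for m in BINOMIAL.finditer(text)])
# ['Escherichia coli', 'Homo sapiens', 'E. coli']
```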


2.4.2 Dictionary-based Models

Dictionary-based methods use large databases or ontologies of named entities, and possibly trigger terms of different categories, as a reference to locate and tag entities in a given text. While scanning texts for exact matches of terms included in the dictionaries is a straightforward and precise way of performing named entity recognition, the recall of these systems tends to be lower due to the ever-expanding nature of biomedical jargon and its synonym, spelling, and word-order variations. Some systems have been using inexact or fuzzy matching by automatically generating extended dictionaries to account for spelling variations and partial matches.
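Exact dictionary matching can be sketched as a greedy longest-match-first scan over tokens; the toy dietary-compound dictionary below is an assumption for illustration:

```python
def dict_tag(tokens, dictionary, max_len=4):
    """Greedy longest-match dictionary tagging over a token sequence.
    Returns (start, end, matched phrase) triples."""
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                entities.append((i, i + n, phrase))
                i += n
                break
        else:           # no dictionary entry starts at token i
            i += 1
    return entities

diet_terms = {"folic acid", "zinc", "riboflavin"}
tokens = "Folic acid and zinc intake reduced deficiency".split()
print(dict_tag(tokens, diet_terms))  # [(0, 2, 'folic acid'), (3, 4, 'zinc')]
```

Trying the longest candidate phrase first ensures that "folic acid" is tagged as one entity rather than stopping at a shorter match.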

One prominent example of a dictionary-based BioNER model is in the association mining tool Polysearch (Cheng et al. 2008), where the system keeps several comprehensive dictionary thesauri to make tagging and normalization of entities rather trivial. Another example is Whatizit (Rebholz-Schuhmann 2013), a class-specific text annotator tool available online, with separate modules for different NE types. This BioNER is built using controlled vocabularies (CV) extracted from standard online databases. For instance, WhatizitChemical uses a CV from ChEBI and OSCAR3, WhatizitDisease uses a CV of disease terms extracted from MedlinePlus, WhatizitDrugs uses a CV extracted from DrugBank, WhatizitGO uses gene ontology terms, and finally WhatizitOrganism uses a CV extracted from the NCBI taxonomy. Similarly, LINNAEUS (Gerner et al. 2010) is a dictionary-based NER package designed explicitly to recognize and normalize species name entities in text. The system achieves a significant recall of 94% at the mention level and 98% at the document level, despite being dictionary-based.

More recent state-of-the-art tools have also shown a preference for dictionary-based hybrid NER, owing to its high accuracy on previously known data. Moreover, since it involves exact/inexact matching, the main requirement for high accuracy is only a thoroughly composed dictionary of all possible related jargon.

2.4.3 Machine Learning Models

Currently, the most frequently used methods for named entity recognition are machine learning approaches. While some studies focus on purely machine learning-based models, others utilize hybrid systems that combine machine learning with rule-based or dictionary-based approaches. Overall, these present state-of-the-art methods.

This section discusses three principal machine learning methodologies: supervised, semi-supervised, and unsupervised learning. These also include Deep Neural Networks (DNN) and Conditional Random Fields (CRF), because newer studies focus on using LSTM/Bi-LSTM coupled with CRFs. Furthermore, in Section 2.4.4, we discuss a few hybrid approaches.


Supervised methods

The first supervised machine learning methods used were Support Vector Machines (Kazama et al. 2002), Hidden Markov Models (Shen et al. 2003), decision trees, and naive Bayesian methods (Nobata, Collier et al. 1999). However, the milestone publication by (Lafferty et al. 2001) on Conditional Random Fields (CRF), which takes the probability of contextual dependency between words into account, shifted the focus away from the independence assumptions made in Bayesian inference and towards graphical probability models.

Conditional Random Fields

CRFs are a special case of conditionally-trained finite-state machines, where the final result is a statistical-graphical model. These models perform well with sequential data, making them ideal for language modeling tasks such as NER (Settles 2004), in both the general and biomedical domains. While conditional random fields are similar to Hidden Markov Models (HMM), they have several differences, the most significant being that CRFs are undirected, conditional probability models, whereas HMMs are directed models that assume independence and do not account for dependencies between the input data. Figure 2.2 illustrates the main differences between CRF, HMM, and MEMM (Maximum-Entropy Markov Models) in the linear-chain context. MEMM, similar to CRF, models conditional probability; however, it is prone to the label-bias problem caused by the model's directed nature. CRF, in contrast, normalizes conditional probabilities globally over the whole model rather than per local state transition, so the label-bias problem does not occur. As such, when modeling text sequences with succeeding and preceding contextual information, CRF performs much better than MEMM and HMM in NLP tasks.

Figure 2.2. Hidden Markov Model, Maximum-Entropy Markov Model, and Conditional Random Fields represented as linear chains to illustrate learning with context.
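Decoding the best label sequence in such a linear-chain model is done with the Viterbi algorithm over per-token emission scores and label-transition scores. The sketch below uses invented scores rather than a trained CRF; it only illustrates how a transition score can override a locally preferred label:

```python
def viterbi(emissions, transitions, labels):
    """Most likely label path, given per-token emission scores and
    label-transition scores (log-space, so scores add)."""
    # best[label] = (score of best path ending in label, that path)
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for scores in emissions[1:]:
        best = {
            lab: max((prev_score + transitions.get((prev, lab), 0.0) + scores[lab],
                      path + [lab])
                     for prev, (prev_score, path) in best.items())
            for lab in labels
        }
    return max(best.values())[1]

labels = ["O", "B", "I"]
emissions = [                         # one score dict per token: "folic acid intake"
    {"O": 0.0, "B": 1.0, "I": 0.0},
    {"O": 0.5, "B": 0.0, "I": 0.4},   # emission alone slightly prefers O here
    {"O": 1.0, "B": 0.0, "I": 0.0},
]
transitions = {("B", "I"): 1.0, ("O", "I"): -2.0, ("I", "I"): 0.5}  # unlisted pairs score 0
print(viterbi(emissions, transitions, labels))  # ['B', 'I', 'O']
```

The second token is labeled I despite its emission scores favoring O, because the B-to-I transition score rewards continuing the entity; this context-sensitivity is precisely what distinguishes a CRF from a per-token classifier.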


Deep learning

In the last five years, there has been a general shift in the literature towards deep neural network models in machine learning and, consequently, in biomedical NLP (Perera et al. 2020; LeCun et al. 2015; Emmert-Streib, Yang et al. 2020). For instance, there have already been BioNER models trained with feed-forward neural networks (FFNN) (Furrer et al. 2019), recurrent neural networks (RNN), and convolutional neural networks (CNN) (Zhu et al. 2017) with state-of-the-art performances. Among these, RNNs have received the most attention due to their ability to model sequential data well. Consequently, several variations of RNNs, e.g., Elman-type, Jordan-type, and unidirectional and bidirectional LSTM models, have been explored in the NER domain (L. Li, Jin and D. Huang 2015).

Neural network (NN) language models are essential since they excel at dimension reduction of word representations and help improve NLP application performance immensely (Jing et al. 2019). (Bengio et al. 2003) introduced the earliest NN language model as a feed-forward neural network architecture focused on "fighting the curse of dimensionality." This FFNN, which first learns a distributed continuous space of word vectors, is also the inspiration behind the CBOW and Skip-Gram models of feature space modeling. The generated distributed word vectors are fed into a neural network that estimates each word's conditional probability in the context of the others. However, this model has several drawbacks, the first being that it is limited to pre-specifiable contextual information. Secondly, it is impossible to use timing and sequential information in FFNNs, which would allow language to be represented in its natural state, as a sequence of words instead of a probable word space (Jing et al. 2019).

Similarly, convolutional neural networks (CNN) are used in the literature to extract contextual information from embedded word and character spaces. In Y. Kim et al. 2016, such a CNN has been applied to a general English language model, with each word represented as character embeddings. The CNN then filters the embeddings and creates a feature vector to represent the word. Extending this approach to biomedical text processing, Zhu et al. 2017 generate embeddings for characters, words, and POS tags, which are then combined to represent words and fed to a CNN layer with several filters. The CNN eventually outputs a vector representing the local features of each term, which can then be tagged by a CRF layer.

Researchers have later started exploring recurrent neural networks for language modeling to facilitate language being represented as a collection of sequential tokens. Elman-type and Jordan-type networks are simple recurrent neural networks, where contextual information is fed into the system as weights either in the hidden layers (in the former type) or the output layer (in the latter type). The main issue with these simple RNNs is that they face the vanishing gradient problem, making it difficult for the network to retain temporal information long-term, as a recurrent language model would benefit from.

Long Short-Term Memory (LSTM) neural networks were introduced to compensate for both of the weaknesses of the previous DNN models (CNN and simple RNN).

Hence, they are the most common choice and are often used in language modeling tasks. LSTMs can learn long-term dependencies through a singular unit called a memory cell, which not only can retain information for a long time but also has gates to control which input, output, and memory data to preserve and which to forget. An extension of the above model is the bi-directional LSTM, where learning can be done with both past and future information (hence both directions), allowing more freedom to build a contextual language model. In contrast, unidirectional LSTM models learn based on only past data (L. Li, Jin, Jiang et al. 2016).

To achieve the best results, Bi-LSTM and CRF models are generally combined with word-level and character-level embeddings in a structure as illustrated in Fig. 2.3 (Yoon et al. 2019; Habibi et al. 2017; X. Wang et al. 2018; Ling et al. 2019; Giorgi and Bader 2019; Weber et al. 2019). Here a pre-trained lookup table produces word embeddings, and a secondary Bi-LSTM is trained to render character-level embeddings, both of which are then combined to acquire x1, x2, ..., xn as word representations (Habibi et al. 2017).

These vectors then become the input to a bi-directional LSTM, and the outputs of the forward and backward paths, hf and hb, are combined through an activation function and inserted into a CRF layer. This layer is ordinarily configured to predict the class of each word using an IOB format (Inside-Outside-Beginning).
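The tag sequence predicted by the CRF layer is converted into entity spans in a final decoding step; a minimal sketch using B-/I-/O tags with entity types:

```python
def iob_to_spans(tokens, tags):
    """Convert IOB tags (B-x begins an entity, I-x continues it, O is outside)
    into (start, end, type, text) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last entity
        boundary = tag.startswith("B-") or tag == "O" or (start is not None and tag[2:] != etype)
        if boundary and start is not None:
            spans.append((start, i, etype, " ".join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]               # also tolerates an I- without a B-
    return spans

tokens = "High folic acid intake lowers homocysteine".split()
tags = ["O", "B-DIET", "I-DIET", "O", "O", "B-CHEM"]
print(iob_to_spans(tokens, tags))
# [(1, 3, 'DIET', 'folic acid'), (5, 6, 'CHEM', 'homocysteine')]
```

The entity types DIET and CHEM in the example are illustrative assumptions, not labels from a specific corpus.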

Semi-Supervised methods

Semi-supervised learning is usually used when a small amount of labeled data and a larger amount of unlabeled data are available, which is often the case for biomedical collections. If labeled data are expressed as X(x1, x2, ..., xn) -> L(l1, l2, ..., ln), where X is the set of data and L is the set of labels, the task is to develop a model that accurately maps Y(y1, y2, ..., ym) -> L(l1, l2, ..., lm), where m > n and Y is the set of unlabeled data that needs mapping to labels.

Whereas literature using a semi-supervised approach is scarcer in BioNER, Munkhdalai et al. 2015 describe how domain knowledge has been incorporated into chemical and biomedical NER using semi-supervised learning by extending the existing BioNER system BANNER. The pipeline runs the labeled and unlabeled data in two parallel pipelines. In one pipeline, labeled data is processed through NLP techniques to extract rich features such as word and character n-grams, lemmas, and orthographic information, as in BANNER. In the second pipeline, the unlabeled data corpus is cleaned, tokenized, and run through Brown hierarchical clustering and word2vec algorithms to extract word representation vectors, which are then clustered using k-means. All of the extracted features from labeled and unlabeled data are then used to train a BioNER model using conditional random fields. The system's authors emphasize that it does not use lexical features or dictionaries and performs well in the BioCreative II gene-mention task.

Figure 2.3. Structure of the Bi-LSTM-CRF architecture for Named Entity Recognition.

Unsupervised methods

While unsupervised machine learning has the potential to organize new high-throughput data without previous processing and to improve an existing system's ability to process previously unseen information, it is not often the first choice for developing BioNER systems. However, S. Zhang and Elhadad 2013 introduced a system which uses an unsupervised approach to BioNER with the concepts of seed knowledge and signature similarities between entities.

In the above approach, first, semantic types and semantic groups are collected from UMLS (Unified Medical Language System) for each entity type, e.g., protein, DNA, RNA, cell type, and cell line, as seed concepts to represent the domain knowledge. Second, the candidate corpora are processed using a noun phrase chunker and an inverse document frequency filter, which formulates word sense disambiguation vectors for a given named entity using a clustering approach. The next step generates the signature vectors for each entity class, exploiting the fact that entities of the same class tend to have contextually similar words. The final step compares the candidate named entity signatures and entity class signatures by calculating similarities. The method, however, achieves only a highest F-score of 67.2 for protein entities. Nevertheless, Sabbir et al. 2017, using a similar approach to implement word sense disambiguation with an existing knowledge base of concepts extracted through UMLS, managed to achieve over 90% accuracy in their BioNER model. These unsupervised methods also tend to work well when dealing with ambiguous biomedical entities.

2.4.4 Hybrid Models

There are currently several state-of-the-art applications of BioNER that combine the best aspects of all three methods above. Most of them combine machine learning with either dictionaries or sets of rules (heuristic/derived), but other approaches also exist, combining dictionaries and rule sets. Since machine learning approaches have been shown to yield better recall values, whereas both dictionary-based and rule-based approaches tend to have better precision values, the hybrid methods tend to have improved F-scores.

For instance, OrganismTagger (Naderi et al. 2011) uses the binomial nomenclature rules for naming species to tag organism names in text and combines this with an SVM to ensure that it captures organism names that do not follow the binomial rules. In contrast, SR4GN (C.-H. Wei, Kao et al. 2012), which is also a species tagger, utilizes rules to capture species names and a dictionary lookup to reevaluate the accuracy of the tagged entities.

Furthermore, state-of-the-art tools such as Gimli (Campos et al. 2013), ChemSpot (Rocktäschel et al. 2012), and DNorm (Leaman, Islamaj Doğan et al. 2013) use Conditional Random Fields with a thesaurus of their own field-specific taxonomy to improve recall. In contrast, OGER++ (Furrer et al. 2019), which performs multi-class BioNER, utilizes a feed-forward neural network structure followed by a dictionary lookup to improve precision.

On the other hand, some systems have combined statistical machine learning approaches with rule-based models to achieve better results, as described in more recent work (Soomro et al. 2017). This study uses the probability analysis of orthographic, POS, n-gram, affix, and contextual features with Bayesian, naive-Bayesian, and partial decision tree models to formulate classification rules.

2.5 Evaluation

It is essential to use standardized evaluation scores to assess and compare NER systems using gold-standard corpora. A frequently used error measure for evaluating NER is the F-score, which is a combination of precision and recall (Mansouri et al. 2008; Emmert-Streib, Moutari et al. 2019).


Precision, recall and F-score are defined as follows (Campos et al. 2012):

Precision = Relevant Names Recognized / Total Names Recognized
          = TP / (TP + FP)                                        (2.2)

Recall = Relevant Names Recognized / Relevant Names in Corpus
       = TP / (TP + FN)                                           (2.3)

F-score = 2 × (Precision × Recall) / (Precision + Recall)         (2.4)

where TP denotes true positives, FP false positives, and FN false negatives.

However, when precision, recall, and F1-scores are calculated for a model that involves more than two categories with unbalanced data, for example, a system that outputs boundaries for named entities using the IOBES format, the metric scores are likely to become unbalanced too. The reason is that such documents naturally tend to contain more O (outside) segments than I (inside) or B (beginning) segments. To alleviate the problem of these unbalanced classes, one can use the Balanced Accuracy for each class (I, O, B, E, S), which is defined as

Balanced Accuracy = (Sensitivity + Specificity) / 2               (2.5)

where sensitivity is the recall (or the true positive rate) of the system, and specificity, defined as Specificity = TN / (TN + FP), is the true negative (TN) rate of the system (Q. Wei and Dunbrack Jr 2013). This metric gives a more accurate overall score for unbalanced multi-class evaluation than F-scores.
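The scores in Equations 2.2-2.5 can be computed directly from the four confusion counts; a small sketch, with counts invented purely for illustration:

```python
def ner_scores(tp, fp, fn, tn):
    """Precision, recall, F-score (Eqs. 2.2-2.4) and balanced accuracy (Eq. 2.5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # sensitivity / true positive rate
    f_score = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)            # true negative rate
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, f_score, balanced_accuracy

# e.g. an entity class with 80 correct tags, 20 spurious, 20 missed, 880 true negatives
p, r, f, ba = ner_scores(tp=80, fp=20, fn=20, tn=880)
print(round(p, 2), round(r, 2), round(f, 2), round(ba, 2))  # 0.8 0.8 0.8 0.89
```

Note how the large true-negative count (the many O segments) lifts the balanced accuracy above the F-score for the same predictions, which is why the two metrics behave differently on unbalanced tag distributions.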

A problem with scoring a NER system in the above ways is that it requires defining true positives, false positives, and false negatives, and hence the degree of correctness of the tagged entities, in order to calculate precision and recall. The degree of correctness, in turn, depends on the pre-defined boundaries of the captured phrases. To illustrate this, consider the example phrase "acute lymphocytic leukemia". If the system tags "lymphocytic leukemia" but misses "acute", we need to decide whether it is still a true positive. The decision depends on the accuracy requirement of the BioNER: for a system that collects information on patients with leukemia in general, it may be acceptable to count the above tag as a true positive. In contrast, if one is looking for treatment regimens for rapidly progressing leukemia types, it may be necessary to capture the whole term, including "acute"; the above would then be considered a false positive.

One possible solution is to relax the matching criteria to a certain degree, since an exact match criterion tends to reduce the measured performance of a BioNER system. The effects of such approaches have been evaluated, e.g., using left or right matching, partial or approximate matching, name fragment matching, co-term matching, and multiple-tagging matching.

Some approaches also apply semantic relaxation, such as "categorical relaxation," which merges several entity types to reduce ambiguity, e.g., by joining the DNA, RNA, and protein categories, or by combining cell line and cell type entities into one class. In Fig. 2.4, we show an example of the different ways to evaluate "acute lymphocytic leukemia". For a thorough discussion of this topic, the reader is referred to Tsai, Wu et al. 2006.

Figure 2.4. An example of different matching criteria to evaluate Named Entity Recognition for the phrase "Acute lymphocytic leukemia", from the strictest to the loosest assessment: exact matching ("Acute lymphocytic leukemia"), left/right matching ("lymphocytic leukemia" / "Acute lymphocytic"), approximate matching ("lymphocytic" or "myeloid"), and partial matching ("leukemia") (Tsai, Wu et al. 2006).
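The matching criteria of Fig. 2.4 can be expressed as comparisons between a predicted and a gold token span. In the sketch below, the exact interpretation of "left", "right", and "partial" is an assumption, since implementations of these relaxations differ:

```python
def match(pred, gold, criterion="exact"):
    """Compare a predicted (start, end) token span against a gold span."""
    ps, pe = pred
    gs, ge = gold
    if criterion == "exact":
        return (ps, pe) == (gs, ge)
    if criterion == "left":            # left boundary must agree
        return ps == gs and pe <= ge
    if criterion == "right":           # right boundary must agree
        return pe == ge and ps >= gs
    if criterion == "partial":         # any token overlap counts
        return max(ps, gs) < min(pe, ge)
    raise ValueError(criterion)

# gold "Acute lymphocytic leukemia" covers tokens 0..3; the prediction misses "Acute"
gold, pred = (0, 3), (1, 3)
print([match(pred, gold, c) for c in ("exact", "left", "right", "partial")])
# [False, False, True, True]
```

Under the strict criterion the prediction is a false positive, while the relaxed criteria count it as a true positive, which is exactly the boundary-dependence discussed above.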

Until recently, there was also an evaluation-related problem stemming from the scarcity of comprehensively labeled data to test the systems (which also affected the training of the machine learning methods). This scarcity was a significant problem for BioNER until the mid-2000s, since most of the gold standard corpora were annotated by human experts and thus were small and prone to annotator dependency (Leser and Hakenberg 2005). However, with growing biological databases and evolving NER technologies, the availability of labeled data for training and testing has increased drastically in recent years. Presently, there is not only a considerable number of labeled data sets available, but also problem-specific text corpora, entity-specific databases, and thesauri accessible to researchers. However, there is still a scarcity of labeled data for food compounds and nutrition, where the only available corpus is FoodBase (Popovski et al. 2019b), an annotated collection of 1000 recipes. In the medical domain, articles often use phytochemical and nutritional terms, which a system trained simply with generic food names may not be capable of capturing. This limitation may prove an obstacle when developing, for example, a system that finds relations between dietary compounds and diseases.

The most frequently used general-purpose biomedical corpora for training and testing are GENETAG (Tanabe et al. 2005), JNLPBA (M.-S. Huang et al. 2019), various BioCreative corpora, GENIA (J.-D. Kim et al. 2003) (which also includes several levels of linguistic/semantic features), and CRAFT (Bada et al. 2012). In Table 2.1, we show an overview of 10 text corpora often used for benchmarking a BioNER system.

Corpus | Year | Text Type | Training Data Type | Data Size
FoodBase (Popovski et al. 2019b) | 2019 | Recipes from AllRecipes | Food entities | 1000 recipes
ChEBI (Shardlow et al. 2018) | 2018 | Abstracts/Full text | Chemical Entities of Biological Interest | 199 abstracts / 100 full texts (15,000 mentions)
CHEMDNER (Krallinger et al. 2015) | 2015 | PubMed abstracts | Chemicals and drugs | 10,000 (84,355 entity mentions)
NCBI Disease (Dogan et al. 2014) | 2014 | PubMed abstracts | Diseases | 793 (6892 disease mentions)
CRAFT (Bada et al. 2012) | 2012 | Full text | Cell type, Chemical Entities of Biological Interest, NCBI Taxonomy, protein, sequence, gene, DNA, RNA | 97 (140,000 annotations)
AnEM (Ohta et al. 2012) | 2012 | Abstracts/Full text | Pathology, anatomical structures/substances | 500 (3000 mentions)
NaCTeM Metabolite and Enzyme (Nobata, Dobson et al. 2011) | 2011 | Medline abstracts | Metabolites and enzymes | 296
LINNAEUS (Gerner et al. 2010) | 2010 | Full-text documents | Species | 100
GENETAG (Tanabe et al. 2005) | 2005 | Sentences | Gene, protein | 20,000 sentences
JNLPBA (M.-S. Huang et al. 2019) | 2004 | Abstracts | DNA, RNA, protein, cell type, cell line | 2000 (+404 test set)

Table 2.1. Benchmark corpora used for analyzing BioNER systems.

Additionally, some of the named-entity-specific databases that have comprehensive collections of jargon include Gene Ontology (Consortium 2004), Chemical Entities of Biological Interest (Shardlow et al. 2018), DrugBank (D. S. Wishart et al. 2017), the Human Protein Reference Database (Keshava Prasad et al. 2008), Online Mendelian Inheritance in Man (Amberger et al. 2018), FooDB (D. Wishart 2014), the Toxins and Toxin-Targets Database (D. Wishart et al. 2014), the International Classification of Diseases (ICD-11) by WHO (Organization et al. 2018), the Metabolic Pathways and Enzymes Database (Caspi et al. 2017), the Human Metabolome Database (Jewell et al. 2007), and the USDA food and nutrients database (Haytowitz and Pehrsson 2018). The majority of these have been used by Liu et al. 2015 to compile their thesauri and databases.


2.6 Post Processing

While not all systems require or use post-processing, it can improve the quality and accuracy of the output by resolving abbreviation ambiguities, disambiguating classes and terms, and fixing parenthesis-mismatching instances (Bhasuran et al. 2016). For example, if a certain BioNE is only tagged in one place of the text, yet the same or a co-referring term exists elsewhere in the text untagged, post-processing would make sure these missed NEs are tagged with their respective class. Also, in the case of a partial entity being tagged in a multi-word BioNE, this step would enable the complete NE to be annotated. Where some abbreviations are wrongly classified or fail to be tagged, some systems use tools such as the BioC abbreviation resolver (Intxaurrondo et al. 2017) at this step to improve the annotation of abbreviated NEs. Furthermore, failure to tag an NE can also stem from unbalanced parentheses in isolated entities, which can likewise be addressed during post-processing. Interestingly, (Q. Wei, T. Chen et al. 2016) describe using a complete rule-based BioNER model for post-processing in disease mention tagging to improve the F-score.
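The missed-mention case described above can be handled by propagating tags from annotated surface forms to identical untagged occurrences; a simplified sketch (a real system would additionally respect token boundaries to avoid substring hits):

```python
def propagate_tags(entities, text):
    """Tag every occurrence of an already-tagged surface form.
    entities: {surface form: entity class}; returns (start, end, class) offsets."""
    found = []
    for surface, cls in entities.items():
        start = 0
        while (pos := text.find(surface, start)) != -1:
            found.append((pos, pos + len(surface), cls))
            start = pos + len(surface)
    return sorted(found)

text = "BRCA1 is expressed in breast tissue. Mutated BRCA1 fails to repair DNA."
print(propagate_tags({"BRCA1": "GENE"}, text))
# [(0, 5, 'GENE'), (45, 50, 'GENE')]
```

Here the second, initially untagged mention of BRCA1 receives the same class as the first, which is exactly the repair step described in the paragraph above.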

One important sub-task that is essential in post-processing is resolving coreferences to extract stronger associations between entities. Coreferences are terms that refer to a named entity without using its proper name, instead using some form of anaphora, cataphora, split reference, or compound noun phrase (Sukthanker et al. 2020). For example, in the sentence "BRCA1 and BRCA2 are proteins expressed in breast tissue where they are responsible for either restoring or, if irreparable, destroying damaged DNA.", the anaphora "they" refers to the proteins BRCA1 and BRCA2, and resolving this helps associate the proteins with their purpose. When it comes to biomedical coreference resolution, one must understand that generalized methods may not be very effective, given that common personal pronouns are used less often. Some approaches that have been used in the biomedical text mining literature are heuristic rule sets, statistical approaches, and machine learning-based methods. Most of the earlier systems commonly used mention-pair-based binary classification and rule sets to filter coreferences such that only domain-significant ones are tagged (J. Zheng et al. 2011). While the rule-set methods have provided state-of-the-art precision, they often do not have good recall. Hence, a sieve architecture (Bell et al. 2016) has been introduced, which arranges rules starting from high-precision-low-recall to low-precision-high-recall. More recently, deep learning methods have been used successfully for coreference resolution in the general NER domain without using syntactic parsers, for example in (K. Lee et al. 2017). This system has been applied to biomedical coreference resolution in (Trieu et al. 2018) with domain-specific feature enhancements. It is worth mentioning that the CRAFT corpus cited in Table 2.1 includes an improved version that can be used for coreference resolution in biomedical texts (Cohen et al. 2017).


2.7 Applications and End-to-End Systems

One of the most significant applications of Biomedical Named Entity Recognition is narrowing down the search space when exploring millions of online biomedical journal archives for literature on specific topics. It is essential for researchers to view articles that do not merely respond to a search term but are also contextually related to the query terms. For example, if "sequenced genes in chromosome 9" is searched for, the search engine should be able to locate all genes and then pick the ones that belong to chromosome 9. This process of locating and classifying entities is where BioNER becomes vital.

The second significant application is in information extraction tasks, such as relationship mining, for which recognizing and classifying biomedical named entities are crucial first steps. Association mining tasks are also often extended beyond extracting relationships from databases and full-text articles to building up networks of interactions, such as prostate cancer and gene interactions, diet and disease associations, or drug-drug interactions with side effects, where BioNER is again an essential first step. These networks help summarize copious amounts of full-text research into a format that is easier for the human researcher to visualize and analyze.

The next application of BioNER tools is in EHR (Electronic Health Record) mining for patient diagnosis and treatment. When a complicated set of symptoms and test results from a patient is presented, mining previous patient data could help narrow down the diagnosis and identify possibly effective treatment regimens. The same method can also be used in medical chat-bot systems designed to respond to patient queries and provide necessary advice or support.

In Table 2.2, we provide an overview of BioNER tools that are available for gene, protein, chemical, disease, food, species, cell type, and cell line named entity recognition tasks. While there are several other tools, our selection criterion was to cover the earliest successful implementations, benchmark tools, and the most recent tools using novel approaches.

As observed in Table 2.2, while most biomedical named entities like genes, diseases, proteins, and chemicals have multiple systems that can identify and tag them, systems that can tag dietary compounds are very scarce. We could only find the FoodIE tool (Popovski et al. 2019a), a rule-based food NER tagger available in R and C++ as open-source code. While the system shows a state-of-the-art F1-score of 96% on the food entity corpus FoodBase (Popovski et al. 2019b) with 1000 recipes, its performance has not been evaluated on a biomedical corpus. However, an affiliated system named drNER (Eftimov et al. 2017) was also designed using rule sets to extract evidence-based dietary recommendations from biomedical texts. According to the authors, since there
