

2.3 Chomsky Grammars

2.3.1 Grammars, Languages, and Hierarchies

Definition 2.3. A Chomsky grammar is a quadruple G = (N, T, S, P), where N and T are disjoint alphabets, called the nonterminal and the terminal alphabet, respectively, S ∈ N is the axiom, and P is a finite set of rules (also called productions or rewriting rules) of the form α → β, with α ∈ (N ∪ T)*N(N ∪ T)* and β ∈ (N ∪ T)*. In the sequel, for any rule p of the form α → β, α and β are called the left-hand side and the right-hand side of p, respectively. It is also said that rule p rewrites α. For any u ∈ N ∪ T occurring in β, it is said that p produces u. If β ∈ T*, then the rule p is called terminal. Otherwise, p is a nonterminal rule. If β = λ, then p is called an erasing rule. Otherwise, p is a non-erasing rule. If α, β ∈ N, then p is a chaining rule.

For each v, w ∈ (N ∪ T)*, one says that v directly derives w in G, written v ⇒G w, or also v ⇒p w, if there exist u1, u2, α, β ∈ (N ∪ T)* such that v = u1αu2, w = u1βu2, and p ∈ P is a rule of the form α → β. This is also called a derivation step in G. A sequence of the form S ⇒G w1 ⇒G w2 ⇒G ... ⇒G ws−1 ⇒G ws = w, such that wi ∈ (N ∪ T)*, for 1 ≤ i ≤ s−1, and ws = w ∈ T*, is called a derivation in G of w. It is also called a terminal derivation in G, since it leads to a string composed only of terminals. Each wi ∈ (N ∪ T)*, 1 ≤ i ≤ s−1, is called a sentential form, while s is the length of the derivation. Depending on the position of α in v (in the definition of the ⇒G relation), there are several types of derivations in a Chomsky grammar. The most important are the leftmost and rightmost derivation orders.
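The derivation-step relation can be sketched in a few lines of code. The following is an illustrative sketch, not part of the thesis: sentential forms are strings over single-character symbols, and all function and variable names are our own.

```python
def derivation_steps(v, rules):
    """Yield every w such that v directly derives w (v => w), i.e. every
    result of replacing one occurrence of a rule's left-hand side alpha
    in v = u1 + alpha + u2 by its right-hand side beta."""
    for alpha, beta in rules:
        start = 0
        while True:
            i = v.find(alpha, start)
            if i == -1:
                break
            # v = u1 alpha u2  -->  w = u1 beta u2
            yield v[:i] + beta + v[i + len(alpha):]
            start = i + 1

# Context-free grammar for {a^n b^n | n >= 1}: S -> aSb | ab
rules = [("S", "aSb"), ("S", "ab")]
print(sorted(set(derivation_steps("aSb", rules))))  # ['aaSbb', 'aabb']
```

From the sentential form aSb, one step either deepens the nesting (aaSbb) or terminates the derivation (aabb).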

Definition 2.4. Let G = (N, T, S, P) be a Chomsky grammar. A terminal derivation D: S = w0 ⇒pi1 w1 ⇒pi2 ... ⇒pis ws = w is a leftmost (rightmost) derivation of w if, for each 1 ≤ j ≤ s, wj−1 = uj−1αjvj−1 ⇒pij uj−1βjvj−1 = wj and uj−1 ∈ T* (vj−1 ∈ T*, respectively), where pij is the rule αj → βj in P.
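For context-free grammars, a leftmost derivation always rewrites the leftmost nonterminal. The following sketch (our own illustration, not from the thesis) searches breadth-first for a leftmost derivation of a given word; it assumes single-character symbols and a grammar without erasing rules, so sentential forms longer than the goal can be pruned.

```python
from collections import deque

def leftmost_derivation(goal, rules, nonterminals, start="S"):
    """Return the sequence of sentential forms of a leftmost derivation of
    goal in a context-free grammar, or None if goal is not derivable.
    Assumes no erasing rules (pruning on length is then sound)."""
    queue = deque([(start, [start])])
    while queue:
        v, path = queue.popleft()
        if v == goal:
            return path
        # index of the leftmost nonterminal; its prefix is all-terminal
        i = next((k for k, c in enumerate(v) if c in nonterminals), None)
        if i is None or len(v) > len(goal):
            continue
        for lhs, rhs in rules:
            if lhs == v[i]:
                queue.append((v[:i] + rhs + v[i + 1:], path + [v[:i] + rhs + v[i + 1:]]))
    return None

rules = [("S", "aSb"), ("S", "ab")]
print(leftmost_derivation("aabb", rules, {"S"}))  # ['S', 'aSb', 'aabb']
```

For this grammar each sentential form contains at most one nonterminal, so the leftmost and rightmost derivations coincide.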

Chapter 2. Elements of Formal Language Theory

Denote by ⇒G* the reflexive and transitive closure of ⇒G. The language generated by G is defined as L(G) = {w ∈ T* | S ⇒G* w}.

The beauty of Chomsky grammars lies in their vast applicability in computer science and computational linguistics, in fields like parsing theory and natural language processing. Chomsky grammars form the best-known framework for describing the syntax2 of programming languages. They are also among the most suitable generative devices for studying and manipulating the syntax and semantics of natural languages. This is possible due to the simplicity of Chomsky grammar rules and to their applicability in infinitely many contexts. Depending on the type of their rules, Chomsky grammars can be divided into several classes. Below we recall the most important of them, which will be used throughout this thesis.

Definition 2.5. Let G = (N, T, S, P) be a Chomsky grammar.

- If no restrictions are imposed on the rules in P (Definition 2.3), then G is called a phrase-structure (PS), or unrestricted, or type 0 grammar.

- If each rule in P is either of the form α → β, where α = α′Aα′′ and β = α′xα′′, with α′, α′′ ∈ (N ∪ T)*, A ∈ N, and x ∈ (N ∪ T)+, or of the form S → λ, in which case S does not appear on the right-hand side of any rule in P, then G is called a context-sensitive (CS) or type 1 grammar.

- If each rule in P is of the form α → β, with |α| ≤ |β|, then G is called a monotonous or length-increasing grammar. The grammar may additionally contain the rule S → λ, provided that S does not occur on the right-hand side of any rule in P.

- If each rule in P is of the form α → β, with α ∈ N and β ∈ (N ∪ T)*, then G is called a context-free (CF) or type 2 grammar.

- If each rule in P is of the form α → β, with α ∈ N and β = aX, where a ∈ T and X ∈ N*, such that there is no other rule in P that rewrites α and produces a, then G is an s-grammar [126].

- If each rule in P is of the form α → β, with α ∈ N and β ∈ TN*, and for each pair of distinct rules X → aY, X′ → bY′, with X, X′ ∈ N, Y, Y′ ∈ N*, a, b ∈ T, we have a ≠ b, then G is called an ss-grammar [137].

- If each rule in P is of the form α → β, with α ∈ N and β ∈ T* ∪ T*NT*, then G is called a linear (LIN) grammar.

- If each rule in P is of the form α → β, with α ∈ N and β ∈ T* ∪ T*N, then G is a right-linear grammar. If for each rule in P, α ∈ N and β ∈ T* ∪ NT*, then G is called a left-linear grammar. A right- or left-linear grammar is called a regular (REG) or type 3 grammar.
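The rule-form conditions above are purely syntactic, so membership of a grammar in these classes can be checked mechanically. The following sketch (an illustration of ours, not from the thesis) classifies a grammar given as (lhs, rhs) string pairs, with uppercase letters playing the nonterminals.

```python
def classify(rules):
    """Return the most restrictive of a few grammar classes from
    Definition 2.5 whose rule forms all rules of the grammar satisfy."""
    def is_nt(c):
        return c.isupper()

    # context-free: lhs is a single nonterminal
    cf = all(len(lhs) == 1 and is_nt(lhs) for lhs, _ in rules)
    # linear: additionally at most one nonterminal on each right-hand side
    lin = cf and all(sum(is_nt(c) for c in rhs) <= 1 for _, rhs in rules)
    # right-linear: rhs in T* or T*N (nonterminal only at the right end)
    rlin = lin and all(all(not is_nt(c) for c in rhs[:-1]) for _, rhs in rules)

    if rlin:
        return "right-linear (regular)"
    if lin:
        return "linear"
    if cf:
        return "context-free"
    return "unrestricted"

print(classify([("S", "aS"), ("S", "a")]))    # right-linear (regular)
print(classify([("S", "aSb"), ("S", "ab")]))  # linear
print(classify([("S", "SS"), ("S", "ab")]))   # context-free
```

Note that the classes are cumulative: every regular grammar is linear, and every linear grammar is context-free, mirroring the inclusions of the Chomsky hierarchy.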

2The syntax of a programming language defines i. the form of the instructions that describe the programming language (which can be seen as rules in a grammar), ii. the symbols from which these instructions can be composed (which can be considered elements of the grammar alphabet), iii. the manner in which these symbols can be combined to provide correct instructions, and finally, iv. the manner in which instructions can be used to provide a correct program.

Context-sensitive grammars and monotonous grammars generate the same class of languages. The classes of languages generated by REG, LIN, CF, CS, and PS grammars are denoted by REGL, LINL, CFL, CSL, and PSL, respectively. The class PSL equals the class of recursively enumerable languages, also denoted by REL3. Between these classes of languages the following inclusions (the Chomsky hierarchy) hold: REGL ⊂ LINL ⊂ CFL ⊂ CSL ⊂ PSL = REL.
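The strictness of these inclusions is witnessed by classic separating languages: {a^n b^n} is context-free but not regular, and {a^n b^n c^n} is context-sensitive but not context-free. A small sketch of membership checkers for these two witnesses (our own illustration):

```python
def anbn(w):
    """Membership in {a^n b^n | n >= 0}: context-free, not regular."""
    n = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * n + "b" * n

def anbncn(w):
    """Membership in {a^n b^n c^n | n >= 0}: context-sensitive, not context-free."""
    n = len(w) // 3
    return len(w) % 3 == 0 and w == "a" * n + "b" * n + "c" * n

print(anbn("aabb"), anbn("aab"))       # True False
print(anbncn("abc"), anbncn("aabbc"))  # True False
```

That these checkers run in linear time does not contradict the hierarchy: the hierarchy classifies the generative power of grammars, not the running time of ad hoc recognizers for individual languages.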

Sometimes, due to notation constraints, the class of languages generated by grammars of type X, where X ∈ {REG, LIN, CF, CS, PS}, is also denoted by L(X).

A terminal derivation in a context-free grammar G can be graphically sketched through a derivation tree associated with the word that it generates.

Definition 2.6. Let G = (N, T, S, P) be a context-free grammar and w ∈ L(G). The derivation tree of w is a rooted tree T = (V, E) whose vertices are labeled by elements of N ∪ T ∪ {λ}, the axiom S is the label of the root, and the set of edges E is built as follows:

- Vertices labeled by X1, X2, ..., Xk ∈ N ∪ T are children of the root, enumerated from left to right in this order, if and only if there is a rule in P of the form S → X1X2...Xk applied at the beginning of the derivation of w. If Xi ∈ T then the vertex labeled by Xi is a leaf; otherwise it is an internal node, 1 ≤ i ≤ k. A vertex labeled by λ is a child of the root if and only if P contains a rule of the form S → λ (in which case w = λ).

- An internal node labeled by X ∈ N has Y1, Y2, ..., Ym ∈ N ∪ T as children, enumerated from left to right in this order, if there exists a rule in P of the form X → Y1Y2...Ym applied during the derivation of w, independently of the derivation order. If Yi ∈ T then the vertex labeled by Yi is a leaf; otherwise it is an internal node, 1 ≤ i ≤ m. A vertex labeled by λ is a child of an internal node labeled by X if and only if there is a rule in P of the form X → λ; such a vertex is also a leaf of the tree.
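As a concrete illustration (ours, not from the thesis), here is the derivation tree of "aabb" in the grammar S → aSb | ab, encoded as nested (label, children) tuples, together with the function that reads off its yield, i.e. the leaf labels from left to right:

```python
# Derivation of "aabb": S => aSb => aabb.
# The root S has children a, S, b; the inner S has children a, b.
tree = ("S", [("a", []), ("S", [("a", []), ("b", [])]), ("b", [])])

def yield_of(node):
    """Concatenate the leaf labels of a derivation tree, left to right."""
    label, children = node
    if not children:
        return label
    return "".join(yield_of(c) for c in children)

print(yield_of(tree))  # aabb
```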

The labels of the leaves of a derivation tree as defined above, read from left to right, give the word w. A context-free grammar G is ambiguous if there exists at least one word in L(G) that has more than one derivation tree or, equivalently, if there exists at least one word in L(G) that has more than one leftmost derivation (Definition 2.4).
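Ambiguity can be exhibited by counting leftmost derivations. The textbook grammar S → S+S | a is ambiguous: the word a+a+a has two leftmost derivations (corresponding to left- and right-associative parse trees). A small counting sketch, with names of our own choosing:

```python
def count_leftmost(goal, rules, nonterminals, v="S"):
    """Count the leftmost derivations of goal from sentential form v in a
    context-free grammar without erasing rules (length pruning is sound)."""
    if v == goal:
        return 1
    if len(v) > len(goal):
        return 0
    # the leftmost nonterminal is the only one a leftmost step may rewrite
    i = next((k for k, c in enumerate(v) if c in nonterminals), None)
    if i is None:
        return 0  # all-terminal but not the goal
    return sum(count_leftmost(goal, rules, nonterminals, v[:i] + rhs + v[i + 1:])
               for lhs, rhs in rules if lhs == v[i])

rules = [("S", "S+S"), ("S", "a")]
print(count_leftmost("a+a+a", rules, {"S"}))  # 2
```

A count greater than one for any word certifies that the grammar is ambiguous; note that this says nothing about the language itself, which may well have another, unambiguous grammar.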

Otherwise, the grammar is unambiguous. A language is unambiguous if there exists at least one unambiguous grammar generating it. A language for

3A language is recursively enumerable if there exists a Turing machine that accepts it, in the sense that the machine says yes when it accepts the input (halting in a final state), and otherwise either halts in a non-final state or fails to halt. Recursively enumerable languages are also known as semi-decidable languages. A language is decidable, or recursive, if there exists a Turing machine that halts on every input word, answering yes if the word is accepted, and no otherwise. In other words, there is an algorithm that decides the membership problem.


which there is no unambiguous grammar generating it is called an inherently ambiguous language.

Note that the Chomsky hierarchy lists classes of languages according to the complexity of the rules that define the grammars. In general, in any hierarchy of languages built on the basis of grammar complexity, the fewer restrictions are imposed on the grammar rules, the larger the class of languages generated by those grammars.

The syntax of almost all programming languages can be specified using the Backus-Naur form4 [5], a grammar mechanism proved to be equivalent to the context-free grammar formalism. Hence, the set of all strings, including programs (i.e., the source code of a valid program), that can be generated using this syntax is a context-free language. However, checking the correctness of a program, i.e., building a parser or compiler for it, is a more complex problem that lies beyond the context-free formalism. A parser should be able to handle all possible errors and ambiguities, and fix them; this depends on the context in which an error (or an ambiguity) occurs.

Checking the validity of a program is thus a matter of context-sensitivity.

Another reason for which context-free grammars are used to define the syntax of programming languages is that parsers for context-free languages are easier and more efficient to build than parsers for context-sensitive languages. Parsing context-free languages takes less time than parsing context-sensitive languages.

Deterministic context-free languages can be recognized (or parsed) in O(n) time, linear context-free languages require O(n^2) time, while (unrestricted) context-free grammars can be parsed in O(n^3) time (on-line computation). Context-sensitive languages are harder to parse. It is presumed that (unrestricted) context-sensitive grammars cannot be parsed in polynomial time, as the membership problem for context-sensitive languages is PSPACE-complete. If, reductio ad absurdum, all context-sensitive languages could be parsed in polynomial time, then we would have P = NP (see also Subsection 4.2.1); this is not a contradiction, but it is widely believed to be false.
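The O(n^3) bound for context-free recognition is achieved, for instance, by the classical CYK algorithm, which requires the grammar in Chomsky normal form (every rule A → BC or A → a). The following compact sketch is our own illustration, with a toy CNF grammar for {a^n b^n | n ≥ 1}:

```python
def cyk(word, unit, binary, start="S"):
    """CYK recognizer: table[i][j] holds the nonterminals deriving
    word[i..j]; three nested loops over spans give the O(n^3) bound."""
    n = len(word)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, c in enumerate(word):
        table[i][i] = {A for A, a in unit if a == c}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):          # split word[i..j] at position k
                for A, (B, C) in binary:
                    if B in table[i][k] and C in table[k + 1][j]:
                        table[i][j].add(A)
    return start in table[0][n - 1]

# CNF grammar for {a^n b^n | n >= 1}:
# S -> AB | AX, X -> SB, A -> a, B -> b
unit = [("A", "a"), ("B", "b")]
binary = [("S", ("A", "B")), ("S", ("A", "X")), ("X", ("S", "B"))]
print(cyk("aabb", unit, binary), cyk("abab", unit, binary))  # True False
```

The two outer loops range over O(n^2) table cells and the split point k contributes another factor of n, which is exactly the cubic behavior mentioned above.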

On the other hand, the syntax of natural languages is more complex than that of programming languages, due to the ambiguity of their semantics. A word may have different meanings, depending on the context in which it is used5. This makes context-free grammars insufficient for describing natural languages, requiring the use of context-sensitive

4There are programming languages for which the Backus-Naur form, or any other context-free grammar, does not suffice to describe their syntax. ALGOL-like languages or C(++)-like languages are such examples.

5This should not happen in programming languages, where each symbol has a precise meaning valid in a precise context, exactly as provided by the syntax.

grammars. For a convincing argument for the non-context-freeness of natural languages the reader is referred to [195].

Since the context-free formalism is too weak to cover all syntactic and semantic phenomena in natural languages, and the context-sensitive formalism is too strong, one of the main questions in computational linguistics, still under debate, is what properties a grammar must satisfy in order to properly define a given natural language.

The most widely accepted, and the most efficient (computationally and syntactically), grammar formalism in this respect may be considered the mildly context-sensitive grammar formalism [120], which generates a class of languages that properly contains the CFL class and is strictly included in CSL. Moreover, any language in this class admits only nested and limited crossing dependencies, has the constant growth property, and can be parsed in polynomial time.