On a Regular Superset Approximation for Context-Free Languages

A regular language R may be considered a superset approximation for a context-free language L if L ⊆ R. A good approximation for L is one for which the set R − L is as small as possible. There are many methods for finding a regular approximation of a context-free language. The most significant ones consist in building, through several transformations applied to the original pushdown automaton (or context-free grammar), the most appropriate finite automaton (regular grammar) recognizing (generating) a regular superset approximation of the original context-free language. The accuracy of the approximation depends on the transformations applied to the considered devices.
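To make the notions of superset approximation and of the excess set R − L concrete, the following is a minimal sketch on a toy instance. The language L = {aⁿbⁿ | n ≥ 1}, the candidate R = a⁺b⁺, and the helper names `in_L`, `words` and `excess` are illustrative assumptions, not taken from the text; the check is only exhaustive up to a bounded word length.

```python
import re
from itertools import product

def in_L(w):
    """Membership in the toy context-free language L = { a^n b^n | n >= 1 }."""
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n

# Candidate regular superset R = a+ b+ (an assumption chosen for illustration).
R = re.compile(r"a+b+")

def words(alphabet, max_len):
    """Yield every word over `alphabet` of length 0..max_len."""
    for k in range(max_len + 1):
        for letters in product(alphabet, repeat=k):
            yield "".join(letters)

def excess(max_len):
    """Return the words of R - L up to max_len, checking L ⊆ R on that range."""
    extra = []
    for w in words("ab", max_len):
        in_R = R.fullmatch(w) is not None
        if in_L(w):
            assert in_R, f"{w!r} is in L but not in R, so R is not a superset"
        elif in_R:
            extra.append(w)
    return extra

if __name__ == "__main__":
    print(excess(4))
```

The smaller the list returned by `excess` for a given bound, the tighter the candidate R is on words of that length; this bounded count is only a rough proxy for the accuracy measures cited in the text.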

However, the perfect regular superset (or subset) approximation for an arbitrary context-free language cannot be built. For surveys on approximation methods and their practical applications in computational linguistics (especially in parsing theory) the reader is referred to [155,158]. Methods to measure the accuracy of a regular approximation can be found in [43,76,194].

In the sequel we propose a new approximation technique that emerges from the Chomsky-Schützenberger theorem. In brief, the method consists in transforming the original context-free grammar into a context-free grammar in Dyck normal form. For this grammar we build the refined extended dependency graph G.e^S described in Section 3.3. From G.e^S we derive a state diagram A_e for a finite automaton and a regular grammar G_r = (N_r, T, P_r, S) that generates a regular (superset) approximation for L(G_k) (which is nothing other than the image through ϕ of the language R_m built in Section 3.3).

Let G_k = (N_k, T, P_k, S) be an arbitrary context-free grammar in Dyck normal form, and G.e^S = (V_e, E_e) the extended dependency graph of G_k. Recall that V_e = {[⁻_i | [_i ∈ N⁽¹⁾ ∪ N_r⁽²⁾ ∪ N⁽³⁾} ∪ {]⁻_j | ]_j ∈ N_l⁽²⁾ ∪ N_r⁽²⁾ ∪ N⁽³⁾} ∪ {S}, in which some of the vertices may be ~-marked in order to prevent repetition of the same bracket when building the digraph associated with a plus-height regular expression. In brief, the state diagram A_e can be built by skipping in G.e^S all left brackets in N_r⁽²⁾ and all brackets in N⁽³⁾, and labeling the edges with the symbol produced by the left or right bracket in N⁽²⁾ ∪ N⁽¹⁾. This reasoning is applied no matter whether the vertex in V_e is ~-marked or not. Therefore, we avoid ~-marker specifications when building A_e, unless this is strictly necessary. Denote by s_f the accepting state of A_e. The start state of A_e is s_S, where S is the axiom of G_k.

Chapter 3. Homomorphic Representations and Regular Approximations of Languages

i s_f, respectively. In both cases, this is labeled by λ. We set in P_r a rule of the form ]^q_i → λ or ]^qt_i → λ, respectively.

The new grammar G_r = (N_r, T, P_r, S), in which the set of rules P_r is built as above and N_r = {]⁻_i | ]_i ∈ N⁽²⁾} ∪ {[⁻_i, ]⁻_i | [_i, ]_i ∈ N⁽¹⁾}, is a regular grammar generating a regular superset approximation for L(G_k). Recall that some of the brackets in N_r may also be ~-marked (by distinct symbols). It is easy to observe that L(G_r) = ϕ(R_m), where ϕ is the homomorphism in the proof of Theorem 3.6.

Note that, since the regular language in the Chomsky-Schützenberger theorem is an approximation of the trace-language, R_m depends on the considered context-free grammar in Dyck normal form. Since there exist several other grammars generating L = L(G_k), setting these grammars in Dyck normal form yields other trace-languages, and consequently other regular languages of type R_m can be built. The best approximation for L is the regular language with the fewest words that are not in L.

Denote by G_L the set of grammars in Dyck normal form generating L, by ℛ_m the set of all regular languages obtained from the refined extended dependency graphs associated with grammars in G_L, and by A_L = {ϕ(R_m) | R_m ∈ ℛ_m} the set of all superset regular approximations of L. It is easy to observe that A_L, with the inclusion relation on sets, is a partially ordered subset of context-free languages. A_L has an infimum equal to the context-free language it approximates, but it does not have a least element. Indeed, as proved in [22, 92, 93, 94], there is no algorithm to build, for a given context-free language L, the simplest context-free grammar that generates L, where by the simplest grammar we refer to a grammar with a minimal number of nonterminals, rules, or loops (grammatical levels encountered during derivations). Hence, there is no possibility to identify the simplest context-free grammar in Dyck normal form that generates L. Therefore, there is no algorithm to build the minimal superset approximation for L.

Consequently, A_L does not have a least element.

It would be interesting to further study how the (refined) extended dependency graphs (see Construction 3.2 and Section 3.3) associated with grammars in Dyck normal form generating a certain context-free language L vary depending on the structure of these grammars¹⁰, and what makes the structure of the regular language R_m (hence the regular superset approximation) simpler. In other words, one would like to find a hierarchy on A_L depending on the structure of the grammars in Dyck normal form that generate L. This may also provide an appropriate measure to compare languages in A_L. On the other hand, for an ambiguous grammar G_k, there exist several paths (hence regular expressions) in the refined extended dependency graph which “approximate” the same word in L(G_k).

Apparently, finding an unambiguous grammar for L(G_k) may refine the language R_m. The main disadvantage is that, again, in general there is no algorithm to solve this problem. Moreover, even if it is possible to find an unambiguous grammar for L(G_k), it is doubtful that the corresponding regular language R_m is finer than the others.

In [92] it is also proved that the cost of “simplicity” is ambiguity. In other words, finding an unambiguous grammar for L = L(G_k) may lead to an increase in the size (e.g., number of nonterminals, rules, levels, etc.) of the respective grammar, which, again, may enlarge R_m with useless words. Therefore, a challenging matter that deserves further attention is whether unambiguity is more powerful than “simplicity” in determining a more refined regular superset approximation for a certain context-free language (with respect to the method proposed in this section).

Example 3.4. a. The regular grammar that generates the regular superset approximation of the linear context-free language in Example 3.1 is G_r = ({S, ]₁, ]^t₂, ]^t₃, ]₄, ]^t₅, ]₆, [^t₇, ]^t₇}, {a, b, c, d}, S, P_r), where¹¹ P_r = {S → a]₁, ]₁ → b]₄, ]₄ → b]₆, ]₆ → a]₁ / a[^t₇, [^t₇ → a]^t₇, ]^t₇ → d]^t₅, ]^t₅ → c]^t₃, ]^t₃ → b]^t₂, ]^t₂ → c]^t₃, ]^t₂ → d]^t₅, ]^t₂ → λ}. The language generated by G_r is L(G_r) = {(abb)^m aa (d(cb)ⁿ)^p | n, m, p ≥ 1} = (abb)⁺aa(d(cb)⁺)⁺ = h(R). The transition diagram associated with the finite automaton accepting L(G_r) is sketched in Figure 3.1.

¹⁰ For instance, how does the extended dependency graph associated with a non-self-embedding grammar in Dyck normal form look, and what is the corresponding regular superset approximation?

¹¹ Note that, since there is only one dependency graph, which yields only one plus-height regular expression, there is no need for the labeling procedure described in Section 3.3.


Figure 3.5: The transition diagram A_e built from G.e^S in Example 3.3. Each bracket [_i (S, ]_i) in A_e corresponds to the state s_[i (s_S, s_]i) (see Example 3.4.b.). S is the initial vertex; vertices colored in green lead to the final state.

b. The regular grammar that generates the regular superset approximation of the context-free language in Examples 3.2 and 3.3 is G_r = ({S, ]^3t₂, ..., ]^7t₂, ]^2t₃, ]^3t₃, ]^4t₃, ]^6t₃, [^1t₄, ..., [^7t₄, ]^1t₄, ..., ]^7t₄, ]^1t₅, ]²₇, ..., ]⁷₇, ¯]⁴₇, ¯]⁶₇}, {a, b, c}, S, P_r), where P_r = {S → c[^1t₄, [^it₄ → c]^it₄, ]^1t₄ → b]^1t₅, ]^jt₄ → a]^jt₃, ]^mt₄ → b]^mt₂, ]^1t₅ → a]²₇, ]^j₇ → a]^j₇, ]ⁿ₇ → c[^nt₄, ]^k₇ → a¯]^k₇, ¯]^k₇ → c[^kt₄, ]^2t₃ → a]³₇ / a]⁶₇ / a]⁷₇, ]^jt₃ → a]^jt₃, ]^3t₃ → a]³₇ / a]⁴₇ / a]⁵₇, ]^kt₃ → b]^kt₂, ]^lt₂ → a]²₇ / λ, ]^ht₂ → b]^3t₂, ]^3t₂ → b]^3t₂ / a]²₇ / λ | i ∈ {1, 2, 3, 4, 5, 6, 7}, j ∈ {2, 3, 4, 6}, h ∈ {4, 5}, k ∈ {4, 6}, l ∈ {6, 7}, m ∈ {5, 7}, n ∈ {2, 3, 5, 7}}. The transition diagram associated with the finite automaton that accepts L(G_r) is sketched in Figure 3.5.
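For part a of Example 3.4, the stated equality L(G_r) = (abb)⁺aa(d(cb)⁺)⁺ can be sanity-checked by simulating the right-linear rules as a nondeterministic automaton and comparing with the regular expression on all short words. This is only a sketch under assumptions: the nonterminal names are ASCII stand-ins for the bracket symbols, the rule printed with left-hand side [₄ is read as ]₄ (the only such symbol in the nonterminal set), and the helper names `generates` and `agree_up_to` are hypothetical. A bounded check is not a proof of equality.

```python
import re
from itertools import product

# Right-linear rules of G_r from Example 3.4.a, as nonterminal -> [(terminal, next)].
# ASCII stand-ins (assumption): "]1" for ]_1, "]t2" for ]^t_2, "[t7" for [^t_7, etc.
# The rule printed as "[4 -> b]6" is read here as "]4 -> b]6" (assumption).
RULES = {
    "S":   [("a", "]1")],
    "]1":  [("b", "]4")],
    "]4":  [("b", "]6")],
    "]6":  [("a", "]1"), ("a", "[t7")],
    "[t7": [("a", "]t7")],
    "]t7": [("d", "]t5")],
    "]t5": [("c", "]t3")],
    "]t3": [("b", "]t2")],
    "]t2": [("c", "]t3"), ("d", "]t5")],
}
# Nonterminals with a lambda-rule; they play the role of accepting states.
LAMBDA = {"]t2"}

def generates(word):
    """Decide whether G_r derives `word`, tracking the set of reachable nonterminals."""
    current = {"S"}
    for ch in word:
        current = {nxt for nt in current for (t, nxt) in RULES.get(nt, []) if t == ch}
        if not current:
            return False
    return bool(current & LAMBDA)

# Regular expression claimed in the example for L(G_r).
CLAIMED = re.compile(r"(abb)+aa(d(cb)+)+")

def agree_up_to(max_len):
    """Compare grammar and regular expression on every word over {a,b,c,d} up to max_len."""
    for k in range(max_len + 1):
        for letters in product("abcd", repeat=k):
            w = "".join(letters)
            if generates(w) != (CLAIMED.fullmatch(w) is not None):
                return False
    return True

if __name__ == "__main__":
    print(generates("abbaadcb"), agree_up_to(9))
```

Tracking a set of nonterminals rather than a single one is needed because ]₆ has two rules on the same terminal a, so the corresponding automaton is nondeterministic.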
