• Ei tuloksia

Further Refinements of the Regular Language in the Chomsky-Sch¨ utzenberger

One of the main disadvantage of considering ∗-height regular expressions in building the extended dependency graph associated with a context-free grammar in Dyck nor-mal form is that some ∗-loops composed of right brackets in Nr(2)∪N(3) may not be symmetrically arranged according to their corresponding left brackets in Nr(2)∪N(3), if we consider their corresponding core segment as a symmetrical center. This is due to the possibility of having “λ-loops”. This deficiency does not affect the intersection with a Dyck language, but it has the disadvantage of enlarging considerable the regular language in the Chomsky-Sch¨utzenberger theorem. This can be avoided by considering only loops described in terms of +-Kleene closure.

Another disfunction of the extended dependency graph built through Construction3.2is the concatenation of a regular expressionr.e(l,X)[t

i

with another regular expressionr.e(r,X[t 0) i

, r.e(r,X[t 0)

i

6= r.e(r,X)[t

i (due to the common tie [ti that marks a core segment). This can be avoided by a renaming procedure of the regular expressions we want to concatenate.

All these additional modifications in building an extended dependency graph are useful only if we want to refine the regular language that satisfies the Chomsky-Sch¨utzenberger theorem (with regards to the grammar in Dyck normal form). This will be further handled (in Section3.4) to build a tighter approximation for the context-free language it characterizes.

Each regular expression of a finite star-height can be described as a finite union of regular expressions in terms of +-Kleene closure (shortly plus-height). For instance the∗-height regular expression ]6[2(]7[3)]7[t4 inR]6

[t4 can be forked into ]6[2(]7[3)+]7[t4 and ]6[2]7[t4. The plus-height of a regular expression, can be defined analogous to the star-height of a regular expression in [98], as follows.

Definition 3.11. Let Σ be a finite alphabet. The plus-heighth(r) of a regular expression r is defined recursively as follows: i. h(λ) =h(∅) =h(a) = 0 for a∈Σ, ii. h(r1∪r2) = h(r1r2) = max{h(r1), h(r2)}, and h(r+) =h(r) + 1.

Chapter 3. Homomorphic Representations and Regular Approximations of Languages Note that for any star-height regular expression it is possible to build a digraph, with an initial vertex vi and a final vertex vf, such that all paths in this digraph, from vi to vf, to provide the respective regular expression (which can be done in a similar manner as in Construction3.2). However, if the regular expression is described in terms of plus-height then this statement may not be true (due to the repetition of some symbols). To force this statement be true, also for plus-height regular expressions, each repetition of a bracket is marked by a distinct symbol (e.g., ]6[2(]7[3)+]7[t4 becomes ]6[2(]7[3)+¯]7[t4), and then, for the new plus-height regular expression obtained in this way, we build a digraph with the above property. In order to recover the initial plus-height regular expression from the associated digraph, a homomorphism that maps all the marked brackets (by distinct symbols) into the initial one must be applied. Each time it is required, we refer to such a vertex as a~-marked vertex. Therefore, due to the technical transformations described above and the symmetrical considerations used in the construction of a trace-language, we may assume to work only with plus-height regular expressions.

LetGk = (Nk, T, Pk, S) be an arbitrary context-free grammar in Dyck normal form, and GX the dependency graph of Gk (see Construction 3.1). Denote by PX

[ti the set of all plus-height regular expressions over ˜Nk ∪ {X} that can be read in GX, starting from the initial vertex X and ending in the final vertex [ti. The cardinality of P[Xt

i is finite.

Now, we consider the same homomorphism as defined for the case of the set RX[t i

i we build a new plus-height regular expression r.e(r,X[t )

i = hrG(r.e(l,X)[t

i the set of all (plus-height) regular expressionsr.eX[t

i obtained as

Note that linear context-free languages do not need an extended dependency graph. The set of all regular expressions P.eS suffices to build a regular language in the Chomsky-Sch¨utzenberger theorem (see Theorem3.10) that cannot be further adjusted by using the graphical method proposed in this section. Furthermore,|R.eS| ≤ |P.eS|. Equality takes place only when each regular expression inR.eS is a plus-height regular expression (see Example3.1). For the case of context-free languages the plus-height regular expressions inP.emust be linked with each other in such a way it approximates, as much as possible, the trace-language associated with the respective context-free language.

In order to find an optimal connection of the regular expressions in P.e, we consider the following labeling procedure of elements in P.e. Denote by c0 the cardinality of P.eS and by cj the cardinality of P.e]j, where ]j ∈ N(3). Each regular expression r ∈ P.eS is labeled by a unique q, 1 ≤ q ≤ c0, and each regular expression r ∈ P.e]j,

is labeled by a unique q, such that Pi−1

r=0cr + 1 ≤ q ≤ Pi

r=0cr, 1 ≤ i ≤ s, and s = |{]j|]j ∈ N(3)}|. Denote by rq the labeled version of r. To preserve symmetric structures that characterize trace-words of context-free languages, then when we link regular expressions in P.e between them, each bracket in a regular expression rq is upper labeled by q. Exception makes the first bracket occurring in rq (which is a bracket in {]j|]j ∈ N(3)}). Now, a refined extended digraph can be built similar to that described in Construction3.2. To have a better picture of how the labeled regular expressions must be linked to each other, and where further relabeling procedures may be required (to obtain a better approximation of the trace-language), we first build for each plus-height regular expressionrq∈ P.e]j, ]j ∈N(3), a digraph and then connect all these digraphs to each other. Denote by Gq,]j the digraph associated with rq ∈ P.e]j, such that ]j is the initial vertex and the final vertex is the last bracket occurring in rq. Each digraphGq,]j read from the initial vertex ]j to the final vertex provides the regular expression rq. Hence, any digraph Gq,]j has vertices labeled by brackets of the forms {[qj|[j∈ N(1) ∪Nr(2) ∪N(3)} ∪ {]qj|]j ∈ Nl(2) ∪Nr(2) ∪N(3)}, c0 ≤ q ≤ Ps

r=0cr, with the exception of the initial vertex ]j, ]j ∈ N(3). Some of vertices in Gq,]j, besides the q-index, may also be~-marked, in order to prevent repetitions of the same vertex which may occur in a plus-height regular expression. As the construction of the dependency graph does not depend on~-markers, unless it is necessary, we avoid~-marked notations in further explanations when building this dependency digraph.

The adjacent vertex Y to ]j, in Gq,]j, is calledsibling. Any edge of the form ]l ]k, where ]l ∈ N(3), ]k ∈ Nr(2)∪N(3), is called dummy edge, while ]l (]k, if ]k ∈ N(3), and the adjacent vertex to ]k is a right bracket in Nr(2) ∪N(3)) is a dummy vertex. An edge that is not a dummy edge is called stable edge. Denote by G]j the set of all digraphs Gq,]j, i.e., their initial vertex is ]j. Any digraphGq,]j has only one vertex labeled by a bracket [qk, [k∈N(1), which stands for a core segment in a trace-word. Right brackets ]qj, ]j ∈ Nr(2) ∪N(3), must be symmetrically arranged according to their left pairwise [qj, [j∈ Nr(2)∪N(3), that occur at the left side of [qk. A dummy vertex labeled by ]qj, ]j ∈N(3), allows the connection with any digraph in G]j. A digraph in G]j with a final vertex labeled by a bracket [k, [k ∈N(1), or by a bracket ]l , ]l∈Nr(2), is calledterminal, because the vertex [k or ]l , respectively, does not allow more connections.

Next we describe the procedure that builds a refined extended digraph with the property that reading this digraph (in which each loop is a plus-loop) from the initial vertexS to all its final vertices, we obtain those (plus-height) regular expressions that form a regular language that provides a refined approximation of the corresponding trace-language.

Step 1. First we build a digraphG.eSthat describes all (plus-height) regular expressions in P.eS. This can be done by connecting all digraphs in GS to S. Since each bracket

Chapter 3. Homomorphic Representations and Regular Approximations of Languages labeling a vertex in Gq,S, 1≤q ≤c0, is uniquely labeled by q, and there exists a finite number of brackets, G.eS is correct (in the sense that it is finite and any vertex occurs only one time). The initial vertex ofG.eS isS. If a graph inGS has a final vertex labeled by a bracket [qi, [i∈N(1) or by a bracket ]qi, ]i ∈Nr(2), then this is also a final vertex in G.eS. IfGkis a grammar in linear-Dyck normal form thenG.eS, built in this way, suffices to build the regular language in the Chomsky-Sch¨utzenberger theorem. The set of all paths from S to each final vertex to which we apply the homomorphismhk, defined in the proof of Theorem3.10, yields a regular languageRmthat cannot be further adjusted, such that the Chomsky-Sch¨utzenberger theorem still holds. Therefore, Rm is minimal with respect to the grammar Gk and the Chomsky-Sch¨utzenberger theorem, i.e., the equalityϕ(DK∩Rm) =ϕ( L(Gk)) still holds, where ϕis the homomorphism defined in the proof of Theorem 3.6.

Step 2. For each vertex ]qj existing inG.eS, such that ]j ∈N(3), we connect all digraphs inG]j toG.eS. This can be done by adding to G.eS a new edge ]qjY, for each sibling Y of ]j (in G]j). If Z is the adjacent vertex of ]qj (in the former version of G.eS), i.e., ]qjZ is a dummy edge, then we remove inG.eS the edge ]qjZ, while in Gq0,]j ∈ G]j (connected to G.eS through ]qj) we remove the vertex ]j and consequently, the edge ]jY. For the moment, all the other edges inGq0,]j are preserved inG.eS, too. Besides, ifV is the final vertex of Gq0,]j, then a new edgeV Z is added to G.eS. IfV ∈ {[k|[k∈N(1)} ∪ {]l |]l ∈ Nr(2)}, i.e.,Gq0,]j is a terminal digraph then the edgeV Z is aglue edge, i.e., it is a stable edge that makes the connection of Gq0,]j into G.eS (or more precisely the connection of Gq0,]j to the digraph Gq,S in which it has been inserted). Otherwise, V Z is a dummy edge, which will be removed at a further connection with a digraph in GV. Since for the case of linear context-free languages generated by a grammar in linear-Dyck normal form, G.eS does not contain any dummy vertex, the construction of G.eS is completed atStep 1.

A vertex in G.eS labeled by a bracket ]qj, ]j ∈ N(3), that has no adjacent vertex, i.e., the out degree of the vertex labeled by ]qj is 0, is called apop vertex. When connecting a digraph Gq0,]j toG.eS, through a pop vertex, if Gq0,]j is a terminal digraph, then the final vertex of Gq0,]j becomes a final vertex of G.eS. If Gq0,]j is not a terminal digraph, then the final vertex ofGq0,]j becomes a pop vertex for G.eS.

If there exist more than one vertex labeled by an upper indexed6 bracket ]qj¯, ]j ∈N(3), then, ifGq0,]j has been already added toG.eS there is no need to add another “copy” of Gq0,]j. It is enough to connect ]qj¯to the digraph existing inG.eS, i.e., to add a new edge ]¯qjY, whereY is a sibling of ]j inGq0,]j. This observation holds for any element inG]j. The procedure described at Step 2 is repeated for each new dummy or pop vertex added to

6AsG.eS is finite, there cannot exist inG.eS two right brackets ]j, ]jN(3), upper indexed by the same value.

G.eS. For each transformation performed on G.eS, we maintain the same notationG.eS for the new obtained digraph. The construction ofG.eS ends up then when each vertex ]j, ]j ∈N(3), has been connected to a digraph in G]j, i.e., no dummy and pop vertices exist in G.eS. The only permissible contexts under which a bracket ]j , ]j ∈N(3), may occur, in the final version ofG.eS, are of the forms ]i ]j [k, [h]j [k, ]i ]j ]l , and [h]j]l , where ]i∈Nr(2), [h ∈N(1), [k∈N(1)∪Nr(2)∪N(3), ]l∈Nl(2).

There are several refinements that can be done on G.eS in order to the resulted regu-lar language to better approximate the trace-language associated with the considered context-free language. Two peculiar situations may occur when adding digraphs toG.eS: I1. First, suppose that during the construction of G.eS by subsequently connecting digraphs between them, starting from ]qj, ]j ∈N(3), we reach a terminal digraph with a final vertex ]qk0, ]k ∈Nr(2), such that ]qk0 is linked toZ, forming thus a stable (glue) edge ]qk0Z. Denote by℘= ]qj...]qk0Z the path (inG.eS) from ]qj to ]qk0Z, obtained at this stage. If the vertex preceding ]qk0 in℘is ]qj0, ]j ∈N(3), i.e.,℘= ]qj...]qj0]qk0Z, then connecting ]qj0 (]qj0]qk0 is a dummy edge), through its siblings, to digraphs in G]j another edge ]qk0]qk0 preceded by ]qj0]qk0, is added to G.eS, i.e., ℘ becomes ℘ = ]qj...]qj0(]qk0)2Z. Since ]qj0]qk0 is a dummy edge, the vertex ]qj0 must be again connected to digraphs in G]j, and so on, until ]qj0 is connected to a terminal digraph Gq,]¯ j ∈ G]j, ¯q 6=q0, that has a final vertex labeled by a bracket ]qm¯, ]m ∈Nr(2) (m and knot necessarily distinct), or by a bracket [qm¯, [m∈N(1) such that ]¯qm is not preceded by a bracket of the form ]j , ]j ∈ N(3). Then ℘ will be either of the form ]qj1]qm¯(]qk0)+Z or of the form ]qj1[qm¯(]qk0)+Z, respectively. On the other hand, sinceGq,]¯ j ∈ G]j the digraphGq,]¯ j can be added toG.eS, through ]qj, from the very first beginning, avoiding thus the plus-loop (]qk0)+, i.e., there should exist in G.eS a new path℘0 = ]qj2]qm¯Z or℘0= ]qj2[qm¯Z (where ℘2 is a path inGq,]¯ j). This allows two other new paths to be created in G.eS, i.e., ¯℘ = ]qj2]¯qm(]qk0)+Z (or ℘00 = ]qj2[qm¯(]qk0)+Z) and

¯

0 = ]qj1]qm¯Z(or ¯℘00= ]qj1[qm¯Z), which are of no use in approximating the trace-language (hence in building the regular language in the Chomsky-Sch¨utzenberger theorem). Paths

¯

℘and ¯℘0 (℘00, ¯℘00) do not affect the intersection with the Dyck language but they enlarge the regular language with useless words. In order to avoid the paths ¯℘and ¯℘0 (or℘00, ¯℘00) the terminal digraphGq,]¯ j receives a new label ˜q, besides of label ¯q (which is maintained to allow ℘ to be produced). To allow the shorter path ℘0 to be created, instead of Gq,]¯ j the terminal digraph Gq,]˜ j is connected to G.eS through the dummy vertex ]qj. Hence ℘0 becomes ℘0 = ]qj...]qm˜Z (or ℘0 = ]qj...[qm˜Z), while ℘ remains ]qj...]qm¯(]qk0)+Z (or ]qj...[qm¯(]qk0)+Z, respectively). This relabeling procedure is used for any case similar to that described above7 encountered during the computation ofG.eS. As there may exist a

7For instance, ]qk0 may also be a dummy vertex and ]qk0Z a dummy edge.

Chapter 3. Homomorphic Representations and Regular Approximations of Languages finite number8of plus-loops inG.eS, there will be a finite number of “relabeled” digraphs (not necessarily terminal). A loop (not necessarily a self-loop) may be reached through different paths that must be “renamed” (if we want to avoid that loop).

I2. Another situation that requires a relabeling procedure may occur when connecting a digraph to G.eS through a pop vertex. Suppose that ]qj, ]j ∈ N(3), is a pop vertex, and the digraphGq0,]j that must be added toG.eS has been already connected through a dummy vertex labeled by ]¯qj (i.e.,Gq0,]j has been already inserted inG.eS). According to the procedure described atStep 2 the vertex ]qj is linked to the sibling of ]j inGq0,]j already existing inG.eS. Since the connection ofGq0,]j toG.eS has been done through a dummy vertex, the final vertex inGq0,]j cannot be neither a final vertex inG.eS (ifGq0,]j is a terminal digraph) nor a pop vertex.

To forbid a pop vertex ]j to overlap with a dummy vertex ]j, each of the digraphs connected to G.eS through a pop vertex, is renamed by a new label. Denote by ¯G]j the labeled version ofG]j. Then connections through pop vertices will be done by using only digraphs in ¯G]j. However, any dummy vertex ]j , that is not a pop vertex, obtained by connecting digraphs in ¯G]j toG.eS should be connected to the original digraphs inG]j, unless a relabeling procedure described atI1 is required.

Denote by ¯Nk = {[i |[i∈ N(1) ∪Nr(2)∪N(3)} ∪ {]j |]j ∈ Nl(2)∪Nr(2)∪N(3)}, the set of vertices composing G.eS, in which some brackets may be ~-marked (by distinct ~-markers). To reach the regular language in the Chomsky-Sch¨utzenberger theorem we denote byRG the set of all regular expressions obtained by readingG.eS from the initial vertexSto any final vertex. First, suppose thatGkdoes not have an extended grammar.

We have K = k and Dk0 = L(Gk). Consider the homomorphism hk: ¯Nk ∪ {S} → {[i,]i|[i,]i∈Nr(2)∪N(3)}∪{[i]i|[i,]i∈Nl(2)∪N(1)}∪{λ}, defined byhk(S) =λ,hk([i ) = [i, hk(]i ) = ]i, for any ]i ∈ Nr(2), hk(]i ) = [i]i, for any ]i ∈ Nl(2), and hk([i ) = [i]i, for any [i∈ N(1). Then Rm = hk(RG) is a regular language with Dk ∩Rm = L(Gk).

Furthermore,Rm is a strength refinement ofR, such that the Chomsky-Sch¨utzenberger theorem still holds. This is because when building regular expressions inP.eeachr.e(l,X)[t

i

is linked only to its right pairwiser.e(r,X)[t i

(due to plus-height considerations and labeling procedures). In this way all plus-loops in r.e(l,X)[t

i are correctly mirrored (through hrG) into its correct pairwiser.e(r,X)[t

i

. The case ofλ-loops is taken by the relabeling procedure described at I1. This is also applicable each time we want to fork a path in G.eS in order to avoid useless loops on that path. The relabeling procedure I2 allows to leave G.eS without re-loading another useless path. That is why the regular language Rm

built this way is a tighter approximation of L(Gk). A finer language thanRm can be

8The plus-height of a regular expression obtained from any digraph in G]j is finite related to the length of the strings inL(Gk).

Figure 3.3: a. - e. Graphs associated with regular expressions inP.e(Example3.2). Initial vertices are colored in red, final vertices are colored in blue, while purple vertices mark a core

segment. ¯]47 is a marked vertex to allow the plus-loop ([43]47)+.

found by searching for a more efficient grammar in Dyck normal form, with respect to the number of rules and nonterminals.

If Gk has an extended grammar Gk+p = (Nk+p, T, Pk+p, S) (built as in the proof of Theorem 3.6) then RG is augmented with ∇e ={S[tk+1, ..., S[tk+p} and hk is extended to hK, hK: ¯Nk ∪ {S} ∪ {[tk+1, ...,[tk+p} → {[i,]i|([i,]i) ∈ Nr(2)∪N(3)} ∪ {[i]i|([i,]i) ∈ Nl(2)∪N(1)} ∪ {[tk+1]tk+1, ...,[tk+p]tk+p} ∪ {λ}, hK(x) = hk(x), x /∈ {[tk+1, ...,[tk+p}, and hK([tk+j) = [tk+j]tk+j, 1≤j≤p,K=k+p.

Example 3.3. Consider the context-free grammar in Example 3.2with the dependen-cy graphs drawn in Figure3.2. The setP.eof labeled plus-loop regular expressions, built from these dependency graphs, is composed ofS([11)+[15[1t4 ]1t5(]11)+ (which corresponds to G1,S in Figure 3.3.a), ]1[26([23]27)+[2t4(]2t3 )+]26 (with the associated digraph G2,]1 in Fig-ure 3.3.b), ]6[32[36([33]37)+[3t4(]3t3 )+]36]3t2 (with the associated digraph G3,]6 in Figure 3.3.c), ]6[42(]47[43)+]47[4t4 (]4t3)+]4t2 or the ~-marked version ]6[42(]47[43)+¯]47[4t4 (]4t3)+]4t2 (with the associ-ated digraphG4,]6 in Figure 3.3.d), and ]6[52]57[5t4 ]t52 (with the associated digraph G5,]6 in Figure 3.3.e).

The extended dependency graph built with respect to the refinement procedure is sketched in Figure3.4. The terminal digraphsG6,]6 andG7,]6 are introduced with respect to the relabeling procedure I1, in order to prevent the loop yielded by the ”iterated“

digraph G3,]6 to occur between G2,]1 and G6,]6 (or G7,]6). It also forbids the self-loop (]3t2 )+to be linked to G6,]6 (or toG7,]6), then when the digraphG3,]6 is not added to the corresponding path. Due to the self-loop (]11)+, in which ]11 is a pop vertex, we did not applied the relabeling procedure described atI2 (applying it leads to the same result).

Chapter 3. Homomorphic Representations and Regular Approximations of Languages

Figure 3.4: The refined dependency graph of the context-free grammar in Example 3.2 and Example3.3. S is the initial vertex, vertices colored in green are final vertices, vertices colored in blue are dummy vertices, vertices colored in purple mark a core segment. Orange edges emphasize symetrical structures built with respect to the structure of the trace-language.

Green edges are glue edges.

3.4 On a Regular Superset Approximation for