• Ei tuloksia

Patterns with unrestricted length wildcards (P3)

In document Pattern Discovery from Biosequences (sivua 58-63)

1+pl+p patterns that match this substring. StringScan containO(ln)substrings of length less thanl, hence the total number of patterns that can match substrings of length at mostlinSisO(xnpxlx+1).

The patterns are constructed incrementally by expanding shorter patterns one position at a time. The work used for creating each pattern is proportional to the number of occurrences of each pattern. In the worst case it isO(n). Hence, the worst case time complexity of Algorithm 3.10 isO(xn2pxlx+1). In reality only the patterns that occur frequently in S, i.e. the patterns that match at least K different substrings ofS, are constructed.

In Algorithm 3.6 for discovering the frequent substrings, all the position lists of the nodes at the same level in the trie were disjoint. Now, allowing group characters, different patterns can match exactly the same substrings ofS. It means that the catenation of the position lists at the same level of the trie could grow exponentially in the number of group positions in the patterns.

In the breadth-first search these overlapping positions would all have to be stored. However, if the search for patterns is performed in the depth-first order, only the position lists of the nodes (and the immediate siblings of these nodes) along a single path from the root would be needed. The position lists of siblings do not overlap, hence at most n positions are stored in them. This means that during the construction of the trie of depth l, only O(ln) positions are stored in the nodes along the path from the root to the leaf at depth l. Hence, the space requirement of Algorithm 3.10 isO(T +ln), whereT is the size of the trie of frequently occurring patterns andlnis the amount of memory used for storing the position lists along any single path from the root to leaf at depthl(the length of the longest pattern).

3.3 Patterns with unrestricted length wildcards (P3)

Given a stringS, patterns with unrestricted length wildcards matchingSare com-monly called subsequences1 ofS. Hence, the patterns from class P3 can also be referred to as subsequences. In this section we consider the discovery of patterns of form2+(+), i.e. the patterns whose first and last symbol are in.

In order to identify frequently occurring subsequences we need to be able to count the number of occurrences of each subsequence. As previously, we will

1Note that subsequence and substring are not the same concept although the string and sequence in general are used interchangeably. See also (Gusfield 1997).

52 3 DISCOVERY OF FREQUENTLY OCCURRING PATTERNS

identify the occurrence of a pattern by the location of the last position of the pattern in the sequenceS. This definition is not free of problems though.

Example 3.12 Given a stringS =ABAACAsome of the subsequences ofS are, for example, AB*C, A*A*C,and A*A. According to our definition, A*Ahas three occurrences inSandA*A*CandAB*A*Conly one. The two different sub-sequencesA*CandB*Care different patterns even though they “occur” at exactly the same positions. Note that patternBoccurs once inS, while its refinementB*A has three occurrences.

Example 3.13 LetS be a string S 2 of lengthn. Let us consider the string

S 0

=Sca

n, such thatcoccurs only once inS0. Every patternca, where is a substring ofSoccursntimes inS0even thoughcoccurs only once.

It is clear from these examples that a prefix of the subsequence pattern(see

cabove) may have fewer occurrences than the patternitself (ca). Hence, the definition of the frequency of occurrences is perhaps not so intuitive any more.

In practice our pattern discovery algorithm requires all prefixes of a pattern to be frequent.

For multi-sequence inputs we can use a different criteria for counting occur-rences. Namely, we can count the number of sequences that contain a match (one or more occurrences) by the pattern . Then the ambiguity of counting pattern match locations is avoided. First, however, we consider a single-sequence input.

3.3.1 Frequent subsequences of a stringS(P3:A)

Problem P3:A Given a stringS 2 and an integer K, construct all patterns

with unrestricted length wildcards (patterns of type P3) such thathas at least

Koccurrences inS.

We will extend our pattern trie construction algorithms so that symbols of the type c,c 2 , can also be used as labels for nodes. When the wildcard is used to extend the pattern, it is immediately followed by a character from. For example, if = fA;C;G;Tg, the pattern trie is built over the set of node labels

fA;C;G;T;A;C;G;Tg.

3.3 Patterns with unrestricted length wildcards (P3) 53 Algorithm 3.14 (P3:A) Frequent subsequence generation

Input: Input stringS, thresholdK

Output: All frequently occurring subsequences (patterns over[fcjc2g) Method:

1. R oot new node 2. R oot:label

3. R oot:pos (1;2;:::;jSj) 4. enqueue(Q;R oot)

5. whileN dequeue(Q) 6. foreachp2N:pos 7. addp+1toSet(S[p])

8. if N6=R oot then // Create position lists for ’*c’ extensions 9. p smallest value fromN:pos

10. w 0

11. whilep+w<jSj

12. addp+w+1toSet(0S[p+w]0)

13. w w+1

14. foreachc2([00)wherejSet(c)jK

15. P new node

16. P:char c 17. N:child(c) P 18. P:pos Set(c)

19. enqueue(Q;P;front) // Enqueue to the front of queue 20. foreachc2([00)deleteSet(c)

21. deleteN:pos 22. end

Every position ofS is considered as a possible first character of pattern in Algorithm 3.14. During the process of extending patterns, first all single character extensions are considered (see line 7). Next, all extensions with a wildcard fol-lowed by a character from the alphabet ofSare added. Note that the test on line 8 does not allow to start a pattern with a wildcard.

When we add nodes corresponding to wildcards followed by a fixed character, only the first occurrence of a pattern prefix needs to be considered (lines 9 – 13).

This is enough for creating the maximal set of occurrences.

There is an exponential number of subsequences ofS(in the length ofS). This is easy to see by noting that each of the positions of Scan be either represented in pattern by the character itself or a wildcard. Additionally, for each pattern potentially all its occurrences need to be enumerated. Hence, the worst case time

54 3 DISCOVERY OF FREQUENTLY OCCURRING PATTERNS

complexity of Algorithm 3.14 isO(n2n).

The problem statement P3:A required to enumerate only the frequently occur-ring subsequences ofS. Note that we require each prefix of a subsequence pattern to be a frequent subsequence pattern as well, as we think it is more intuitive in this case.

The space complexity. The maximum number of occurrences of any pattern

isO(n). The sum of the total number of occurrences of all the extensions to a pattern of typec isO(n), as the occurrences must be mutually exclusive.

Hence, the total number of occurrences of all the extensions X isO(n). Note that the search is performed in a depth-first order, and hence only the positions along a single search path need to be stored at any given moment. Therefore the maximum number of positions needed to store during the search is O(ln) wherelis the length of the longest path in the tree corresponding to the longest commonly occurring pattern. The pattern trie size itself accounts for the rest of the space requirement. If the traversed paths are deleted immediately this space can be reduced. Otherwise the trie size is proportional to the total number of frequent subsequences.

3.3.2 Subsequences common to multiple input strings (P3:B)

The problem statement P3:A at the first glance seems impractical for both the practical relevance as well as the time complexity analysis point of view. Note that the main complexity of the algorithm is due to the length of the input sequenceS. If the input data consists of a set of sequencesSi, we get the following problem statement.

Problem P3:B Given a set of strings Sn = fS1;S2;:::;Sng, Si 2 , and an integerK, construct all patterns of type P3 such thathas at least one occurrence in at leastKof the sequencesS1

;S

2

;:::;S

n.

Note that the longest common subsequence problem is directly related to the sequence alignment problem (Gusfield 1997). Hence, Problem P3:B can be viewed as a multiple sequence alignment problem where all local alignments com-mon to at leastKinput sequences are searched for.

3.3 Patterns with unrestricted length wildcards (P3) 55 Algorithm 3.15 (P3:B) Frequent subsequences common to a set of sequences Input: Input stringsSn=fS1

;:::;S

n

g, thresholdK

Output: Patterns over[fcjc2gthat occur in at leastKdifferent strings from

S

8. s 0// for keeping track of sequence numberSs

9. foreachp2N:pos 10. addp+1toSet(S[p])

11. if N 6=R ootandp2Siandi>s // ifpis within a new sequenceSi

12. s i // Remember which sequenceSiis being studied

13. w 0

14. whileS[p+w]6=0#0

15. addp+w+1toSet(S[p+w])

16. w w+1

17. foreach characterc2([fcjc2g)wherecountseq(Set(c))K

18. P new node

19. P:char c 20. N:child(c) P 21. P:pos Set(c)

22. enqueue(Q;P;front) // Enqueue to the front of queue 23. foreachc2([fcjc2g)deleteSet(c)

24. deleteN:pos 25. end

Algorithm 3.15 does not generate subsequences that would cross delimiters between original stringsSi. Otherwise it is only a small modification from Algo-rithm 3.14, and constructs all possible subsequences common to at leastK input sequencesSi.

Note that no extension to a subsequencecan be more frequent thanitself, as we only count the number of sequences matches, not the positions (how many occurrences in total). The subsequences that are not common to enough many original sequencesSiare eliminated.

56 3 DISCOVERY OF FREQUENTLY OCCURRING PATTERNS

The space required to store position lists at any time during the algorithm execution is potentially the same as for Algorithm 3.14. However, the number of all potential subsequences isO(n2max(jSij)).

The time complexity of the Algorithm 3.15 isO(n22max(jSij)).

In document Pattern Discovery from Biosequences (sivua 58-63)