Patterns with unrestricted length wildcards (P3)

1+pl+p patterns that match this substring. String^Scan contain^O(ln)substrings of length less than^l, hence the total number of patterns that can match substrings of length at most^lin^Sis^O(xnp^x^l^x+1⁾.

The patterns are constructed incrementally by expanding shorter patterns one position at a time. The work used for creating each pattern is proportional to the number of occurrences of each pattern. In the worst case it is^O(n). Hence, the worst case time complexity of Algorithm 3.10 is^O(xn²^p^x^l^x+1⁾. In reality only the patterns that occur frequently in ^S, i.e. the patterns that match at least ^K different substrings of^S, are constructed.

In Algorithm 3.6 for discovering the frequent substrings, all the position lists of the nodes at the same level in the trie were disjoint. Now, allowing group characters, different patterns can match exactly the same substrings of^S. It means that the catenation of the position lists at the same level of the trie could grow exponentially in the number of group positions in the patterns.

In the breadth-first search these overlapping positions would all have to be stored. However, if the search for patterns is performed in the depth-first order, only the position lists of the nodes (and the immediate siblings of these nodes) along a single path from the root would be needed. The position lists of siblings do not overlap, hence at most ⁿ positions are stored in them. This means that during the construction of the trie of depth ^l, only ^O(ln) positions are stored in the nodes along the path from the root to the leaf at depth ^l. Hence, the space requirement of Algorithm 3.10 is^O(T ⁺^ln), where^T is the size of the trie of frequently occurring patterns and^lnis the amount of memory used for storing the position lists along any single path from the root to leaf at depth^l(the length of the longest pattern).

3.3 Patterns with unrestricted length wildcards (P3)

Given a string^S, patterns with unrestricted length wildcards matching^Sare com-monly called subsequences¹ of^S. Hence, the patterns from class P3 can also be referred to as subsequences. In this section we consider the discovery of patterns of form²⁺⁽⁺⁾, i.e. the patterns whose first and last symbol are in.

In order to identify frequently occurring subsequences we need to be able to count the number of occurrences of each subsequence. As previously, we will

1Note that subsequence and substring are not the same concept although the string and sequence in general are used interchangeably. See also (Gusfield 1997).

52 3 DISCOVERY OF FREQUENTLY OCCURRING PATTERNS

identify the occurrence of a pattern by the location of the last position of the pattern in the sequence^S. This definition is not free of problems though.

Example 3.12 Given a string^S ⁼ABAACAsome of the subsequences of^S are, for example, AB*C, A*A*C,and A*A. According to our definition, A*Ahas three occurrences in^SandA*A*CandAB*A*Conly one. The two different sub-sequencesA*CandB*Care different patterns even though they “occur” at exactly the same positions. Note that patternBoccurs once in^S, while its refinementB*A has three occurrences.

Example 3.13 Let^S be a string ^S ² of lengthⁿ. Let us consider the string

S 0

=Sca

n, such that^coccurs only once in^S⁰. Every pattern^c^a, where is a substring of^Soccursⁿtimes in^S⁰even though^coccurs only once.

It is clear from these examples that a prefix of the subsequence pattern(see

cabove) may have fewer occurrences than the patternitself (^ca). Hence, the definition of the frequency of occurrences is perhaps not so intuitive any more.

In practice our pattern discovery algorithm requires all prefixes of a pattern to be frequent.

For multi-sequence inputs we can use a different criteria for counting occur-rences. Namely, we can count the number of sequences that contain a match (one or more occurrences) by the pattern . Then the ambiguity of counting pattern match locations is avoided. First, however, we consider a single-sequence input.

3.3.1 Frequent subsequences of a string^S(P3:A)

Problem P3:A Given a string^S ² and an integer ^K, construct all patterns

with unrestricted length wildcards (patterns of type P3) such thathas at least

Koccurrences in^S.

We will extend our pattern trie construction algorithms so that symbols of the type ^c,^c ² , can also be used as labels for nodes. When the wildcard is used to extend the pattern, it is immediately followed by a character from. For example, if ⁼ ^fA;^C;^G;^Tg, the pattern trie is built over the set of node labels

fA;C;G;T;A;C;G;Tg.

3.3 Patterns with unrestricted length wildcards (P3) 53 Algorithm 3.14 (P3:A) Frequent subsequence generation

Input: Input string^S, threshold^K

Output: All frequently occurring subsequences (patterns over^[^fcjc²^g) Method:

1. ^{R oot} new node 2. R oot:label

3. ^{R oot:pos} ^(1;^2;^:^:^:^;^jS^j) 4. ^enqueue(Q;^{R oot)}

5. while^N ^dequeue(Q) 6. foreach^p²^N:pos 7. add^p⁺¹to^Set(S[p])

8. if ^N⁶⁼^{R oot} then // Create position lists for ’*c’ extensions 9. ^p smallest value from^N:pos

10. ^w ⁰

11. while^p⁺^w^<^jS^j

12. add^p⁺^w⁺¹to^Set(⁰^S[p⁺^w]⁰⁾

13. ^w w+1

14. foreach^c²⁽^[⁰⁰⁾where^jSet(c)j^K

15. ^P new node

16. ^P:char ^c 17. ^N:child(c) ^P 18. ^P:pos ^Set(c)

19. ^enqueue(Q;^P;^f^ront) // Enqueue to the front of queue 20. foreach^c²⁽^[⁰⁰)deleteSet(c)

21. delete^N:pos 22. end

Every position of^S is considered as a possible first character of pattern in Algorithm 3.14. During the process of extending patterns, first all single character extensions are considered (see line 7). Next, all extensions with a wildcard fol-lowed by a character from the alphabet of^Sare added. Note that the test on line 8 does not allow to start a pattern with a wildcard.

When we add nodes corresponding to wildcards followed by a fixed character, only the first occurrence of a pattern prefix needs to be considered (lines 9 – 13).

This is enough for creating the maximal set of occurrences.

There is an exponential number of subsequences of^S(in the length of^S). This is easy to see by noting that each of the positions of ^Scan be either represented in pattern by the character itself or a wildcard. Additionally, for each pattern potentially all its occurrences need to be enumerated. Hence, the worst case time

54 3 DISCOVERY OF FREQUENTLY OCCURRING PATTERNS

complexity of Algorithm 3.14 is^O(n2ⁿ⁾.

The problem statement P3:A required to enumerate only the frequently occur-ring subsequences of^S. Note that we require each prefix of a subsequence pattern to be a frequent subsequence pattern as well, as we think it is more intuitive in this case.

The space complexity. The maximum number of occurrences of any pattern

is^O(n). The sum of the total number of occurrences of all the extensions to a pattern of type^c is^O(n), as the occurrences must be mutually exclusive.

Hence, the total number of occurrences of all the extensions ^X is^O(n). Note that the search is performed in a depth-first order, and hence only the positions along a single search path need to be stored at any given moment. Therefore the maximum number of positions needed to store during the search is ^O(ln) where^lis the length of the longest path in the tree corresponding to the longest commonly occurring pattern. The pattern trie size itself accounts for the rest of the space requirement. If the traversed paths are deleted immediately this space can be reduced. Otherwise the trie size is proportional to the total number of frequent subsequences.

3.3.2 Subsequences common to multiple input strings (P3:B)

The problem statement P3:A at the first glance seems impractical for both the practical relevance as well as the time complexity analysis point of view. Note that the main complexity of the algorithm is due to the length of the input sequence^S. If the input data consists of a set of sequences^Si, we get the following problem statement.

Problem P3:B Given a set of strings ^Sⁿ ⁼ ^fS¹^;^S²^;^:^:^:^;^Sⁿ^g, ^Sⁱ ² , and an integer^K, construct all patterns of type P3 such thathas at least one occurrence in at least^Kof the sequences^S1

;:::;S

Note that the longest common subsequence problem is directly related to the sequence alignment problem (Gusfield 1997). Hence, Problem P3:B can be viewed as a multiple sequence alignment problem where all local alignments com-mon to at least^Kinput sequences are searched for.

3.3 Patterns with unrestricted length wildcards (P3) 55 Algorithm 3.15 (P3:B) Frequent subsequences common to a set of sequences Input: Input strings^Sⁿ⁼^fS1

;:::;S

g, threshold^K

Output: Patterns over^[^fc^jc²^gthat occur in at least^Kdifferent strings from

8. ^s ⁰// for keeping track of sequence number^Ss

9. foreach^p²^N:pos 10. add^p⁺¹to^Set(S^[p])

11. if ^N ⁶⁼^{R oot}and^p²^Siandⁱ^>^s // if^pis within a new sequence^Si

12. ^s ⁱ // Remember which sequence^Siis being studied

13. ^w ⁰

14. while^S[p⁺^w]⁶⁼⁰^#⁰

15. add^p⁺^w⁺¹to^Set(S[p⁺^w])

16. ^w w+1

17. foreach character^c²⁽^[^fcjc²^g)wherecountseq(Set(c))K

18. ^P new node

19. ^P:char ^c 20. ^N:child(c) ^P 21. ^P:pos ^Set(c)

22. ^enqueue(Q;^P;^f^ront) // Enqueue to the front of queue 23. foreach^c²⁽^[^fcjc²^g)delete^Set(c)

24. delete^N:pos 25. end

Algorithm 3.15 does not generate subsequences that would cross delimiters between original strings^Sⁱ. Otherwise it is only a small modification from Algo-rithm 3.14, and constructs all possible subsequences common to at least^K input sequences^Si.

Note that no extension to a subsequencecan be more frequent thanitself, as we only count the number of sequences matches, not the positions (how many occurrences in total). The subsequences that are not common to enough many original sequences^Sⁱare eliminated.

56 3 DISCOVERY OF FREQUENTLY OCCURRING PATTERNS

The space required to store position lists at any time during the algorithm execution is potentially the same as for Algorithm 3.14. However, the number of all potential subsequences is^O(n2^max(jSⁱ^j)⁾.

The time complexity of the Algorithm 3.15 is^O(n²²^max(jSij)⁾.

In document Pattern Discovery from Biosequences (sivua 58-63)