Sequence Pattern Mining in Data Streams

Sequential pattern mining in data stream environments is an interesting data mining problem. Finding sequential patterns in static databases has been studied extensively in past years; however, mining sequential patterns in data streams is still an active research field. In this research, a new greedy sequence pattern mining algorithm for data streams is introduced; it is used to find the strongly supported sequences. The proposed algorithm is based on the sequence tree, which is used to find sequential patterns in static databases. It divides the stream into batches, or windows, and each batch updates the sequence tree built from the previous windows. An example is given to explain how the algorithm works. We also show the efficiency and effectiveness of the proposed algorithm on a synthetic dataset and show how it is suited to data stream environments. We show experimentally that the proposed algorithm is more efficient than the PrefixSpan algorithm in CPU time for any support below 30% and in memory usage for any support below 60%.


Introduction
In recent years, new applications have emerged, such as network traffic analysis, wireless sensor networks and user web clicks, and these have introduced a new kind of data called data streams (Marascu and Masseglia, 2005). A data stream is an ordered sequence of items which, in many applications, can be read only once and cannot be stored in a database because of the huge size and volume of the data. Sequential pattern mining in data streams is concerned with finding sequence patterns without scanning the data multiple times, even though the data is huge and generated continuously (Muthukrishnan, 2005). In this research we focus on the problem of mining sequence patterns in data streams such as web clicks and wireless sensor networks, and we propose a new algorithm for mining sequential patterns in data streams.
In recent years many contributions have been published on mining frequent patterns (Tiwari et al., 2010), mining sequential patterns in transaction databases (Agrawal & Ramakishnan, 1998), and mining web frequent multi-dimensional sequential patterns (Hwang et al., 2011). A recent survey (Rao et al., 2013) gives an overview of the most widely used sequence pattern mining algorithms and a comparative study of them. Another survey (Vijayarani & Sathya, 2012) presents the various applications of data streams and analyzes papers on frequent pattern mining in data streams. An incremental algorithm for mining sequential patterns based on the sequence tree was introduced in (Liu et al., 2012); however, this algorithm is not used in data stream environments because it stores all patterns, even those that are not frequent, which causes problems where memory is limited. Moreover, the memory allocated to the sequence tree will be larger than the size of the sequence database.
By traversing the sequence tree we can get all frequent sequences with the requested minimum support. The following are the frequent sequence patterns with Min_sup = 2 for the tree in Figure 1, which corresponds to the sequences in Table 1: {<a>, <b>, <c>, <d>, <ad>, <ac>, <ab>, <da>, <dab>, <db>}.
To make the sequence tree suitable for data streams, where traffic is always active, windows of sequences are considered for each specific period of time; for example, the window size may be 20 sequences. The sequence tree is then updated for each window. Because memory is limited in data stream environments such as sensor networks, the tree is pruned after each specified number of windows, and the branches with support less than the minimum support are removed.

Problem Definition
Suppose that in the stream environment the monitored item set is $I = \{i_1, i_2, \ldots, i_n\}$, where $n$ is the total number of items. A sequence $S$ is an ordered list of elements drawn from $I$, denoted by $\langle s_1 s_2 \ldots s_m \rangle$. A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is a subsequence of $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$.
Given a sequence database table T(Sid, Sequence), the support of a sequence S, denoted by support(S), is the number of sequences in the table that contain S. Min_sup is a user-defined threshold that determines whether a sequence is frequent: if support(S) ≥ Min_sup, the sequence S is frequent. Thus the problem of sequence pattern mining in data streams is to find the strongly frequent sequential patterns with support of at least Min_sup, taking into consideration the memory and processing power limitations (Lee et al., 2014).
The following example illustrates sequences in a data stream environment. We assume that each element contains only one item; this is common in sensor network environments, where each sensor is specialized in sensing one phenomenon.

Table 1. Sequence database
Id    Sequence
1     adc
2     dab
3     ac
4     cdab
5     adb

Table 1 shows a batch of sequence data obtained from a data stream environment. With Min_sup = 2, the sequence <ac>, for example, is frequent because it appears twice (in sequences 1 and 3).
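As a minimal sketch of the definitions above, the support of a pattern can be computed by testing it as a subsequence of each row of Table 1. The function names `is_subsequence` and `support` are illustrative and do not come from the paper; sequences are assumed to be strings with one single-character item per element.

```python
def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` consumes the iterator, so matches must appear in order.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database)


# Table 1: one single-item element per position.
table1 = ["adc", "dab", "ac", "cdab", "adb"]

print(support("ac", table1))   # 2 (sequences 1 and 3)
print(support("dab", table1))  # 2 (sequences 2 and 4)
```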

Proposed Algorithm
The idea is to divide the incoming stream into windows, or batches, where the window size W is a user-defined value. For each item in the window we check its support, and we build the sequence tree only from items whose support reaches the minimum support. After each specified number of windows P (the prune time), we prune the tree to get rid of branches with low support.
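The windowing step can be sketched as a generator that cuts any iterable of sequences into consecutive batches of W sequences. The name `windows` is hypothetical, and letting the last batch be shorter than W is an assumption; the paper does not say how a partial window is handled.

```python
from itertools import islice


def windows(stream, size):
    """Yield consecutive lists of at most `size` sequences from `stream`."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return  # the stream is exhausted
        yield batch


# A short list stands in for a live stream here.
stream = iter(["adc", "dab", "ac", "cdab", "adb"])
for n, w in enumerate(windows(stream, 2), start=1):
    print(n, w)
```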
The sequence tree is built from what we call the permutation of a sequence, that is, the set of all its order-preserving subsequences. Given the sequence S = <abcd>, the permutation set is {<a>, <b>, <c>, <d>, <ab>, <ac>, <ad>, <bc>, <bd>, <cd>, <abc>, <abd>, <acd>, <bcd>, <abcd>}. Starting from the root node, each sequence in the permutation set is inserted into the tree, and the count is incremented only for the last item of that sequence. The last item of each sequence therefore marks a possible new path in the tree.
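One way to enumerate this set is with `itertools.combinations`, which keeps items in their original order. This is a sketch, not the authors' implementation: the name `subsequences` is illustrative, and items are assumed to be single characters.

```python
from itertools import combinations


def subsequences(sequence):
    """All distinct non-empty order-preserving subsequences of `sequence`."""
    found = set()
    for length in range(1, len(sequence) + 1):
        for combo in combinations(sequence, length):
            found.add("".join(combo))
    # Sort by length, then alphabetically, for a stable listing.
    return sorted(found, key=lambda s: (len(s), s))


# For four distinct items there are 2**4 - 1 = 15 non-empty subsequences.
print(subsequences("abcd"))
```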
Output: Tree of frequent patterns with support ≥ Min_sup.

    Initialize the root node, root = {NULL}
    Initialize the supported items list, Sup_list = {}
    while (true)    /* streams are always active, so the algorithm runs forever */
        foreach item i ∈ w_n do
            calculate support(i)
            if support(i) ≥ Min_sup then
                add i to Sup_list
            end if
        end for
        foreach sequence s ∈ w_n do
            call pruneSequence(s, Sup_list)    /* remove items with support less than Min_sup */
            calculate permutation(s)
            foreach sequence ps ∈ permutation_list do
                add ps to the tree, starting from the root node
                increment the count only for the last item in the sequence
            end for
        end for
        if n = P    /* if number of processed windows = prune time */
            call pruneTree()    /* remove sequences with support less than Min_sup */
        end if
    end while

Lemma 1: In the above algorithm, an item i from a sequence S is considered a frequent item if and only if support(i) ≥ Min_sup.
Proof: Before a batch of sequences enters the algorithm, the occurrences of each item in the batch are counted, and this guarantees that only items whose count reaches the minimum support are considered.
Lemma 2: For a sequence S, if length(S) > number of tree levels, then S is not a frequent sequence.
Proof: Suppose we have a sequence $S = \langle i_1 i_2 \ldots i_n \rangle$ with n items and the number of tree levels is m, where m < n. Since the length of the sequence is n, we need n levels in the tree to store the complete sequence, but we have only m levels and m is less than n. So the sequence S will not fit entirely in the tree, and hence it is not a frequent sequence.
Using Lemma 1 and Lemma 2 we can see that the algorithm prunes the items with support < Min_sup. Some frequent patterns will be dropped from the final list of frequent sequences, because the proposed algorithm prunes the sequences that are not frequent at the end of each preset prune time (P).
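The algorithm in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' Java implementation: the tree is stored flat, as a counter keyed by root-to-node path; an item's support is the number of sequences in the window that contain it; a pattern is frequent when its count is at least Min_sup, as in the worked examples; and pruning runs after every `prune_time` windows. The names `process_window`, `prune_tree` and `mine` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def subsequences(seq):
    """Distinct non-empty order-preserving subsequences of `seq`, as tuples."""
    return {c for k in range(1, len(seq) + 1) for c in combinations(seq, k)}


def process_window(tree, window, min_sup):
    """Update `tree` (path tuple -> count) with one window of sequences."""
    # Item support: number of sequences in the window containing the item.
    item_support = Counter(item for seq in window for item in set(seq))
    supported = {i for i, s in item_support.items() if s >= min_sup}
    for seq in window:
        kept = [item for item in seq if item in supported]  # pruneSequence
        for path in subsequences(kept):
            tree[path] += 1  # the count goes to the last node of the path


def prune_tree(tree, min_sup):
    """Drop every path whose count is below `min_sup` (pruneTree)."""
    for path in [p for p, count in tree.items() if count < min_sup]:
        del tree[path]


def mine(windows, min_sup, prune_time):
    """Process windows in order, pruning after every `prune_time` windows."""
    tree = Counter()
    for n, window in enumerate(windows, start=1):
        process_window(tree, window, min_sup)
        if n % prune_time == 0:  # assumed reading of "if n = P"
            prune_tree(tree, min_sup)
    return tree


# Table 1 treated as a single window, with Min_sup = 2 and prune time 1.
tree = mine([["adc", "dab", "ac", "cdab", "adb"]], min_sup=2, prune_time=1)
print(sorted("".join(p) for p in tree))
```

Run on Table 1 as a single window, the sketch keeps the ten patterns listed in the Introduction for Min_sup = 2.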

Complexity
In the worst case, the time complexity of processing one window with DSSPM is the complexity of calculating the sequence permutation, $O(m^2)$, where $m$ is the number of items in the sequence, plus the complexity of building the sequence tree for that window, $O(n \cdot m)$, where $n$ is the number of sequences in the window. The total time complexity is $O(n^2)$, which suits data stream environments such as web clicks and wireless sensor networks.
www.ccsenet.org/cis Computer and Information Science Vol. 8, No. 3; 2015

Structure of Sequence Tree
The sequence tree is a data storage structure; it is used to store the frequent patterns with their support values. The root node of the tree is an empty node and is used only as a link to the frequent sequences. Each tree node has two attributes: one stores the item and the other stores the support of the item inside the sequence. Any path from the root node to a leaf node is a frequent pattern, and its support is the support of the leaf node. The support of any parent node is equal to or larger than the support of its children.
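A node with these two attributes, plus links to its children, might look like the following sketch. The class and function names are illustrative, and the inserted patterns are made up for the demonstration. The traversal reports every node rather than only the leaves, which matches pattern lists that contain both <a> and a longer pattern starting with a.

```python
class Node:
    """A sequence-tree node: an item and its support inside the path."""

    def __init__(self, item=None):
        self.item = item      # None for the empty root node
        self.support = 0
        self.children = {}    # item -> Node


def insert(root, pattern):
    """Walk or create the path for `pattern`; increment only its last node."""
    node = root
    for item in pattern:
        node = node.children.setdefault(item, Node(item))
    node.support += 1


def patterns(node, prefix=""):
    """Yield (pattern, support) for every path below `node`."""
    for item, child in node.children.items():
        yield prefix + item, child.support
        yield from patterns(child, prefix + item)


root = Node()
for p in ["a", "b", "ab", "a", "b", "ab", "a"]:  # hypothetical insertions
    insert(root, p)
print(dict(patterns(root)))  # {'a': 3, 'ab': 2, 'b': 2}
```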
Suppose the following two windows, shown in Table 2, each of five sequences, arrive at the analysis node in a data stream environment. The goal is to find sequence patterns with Min_sup = 3. Let the prune time be 2 and the item set be I = {a, b, c, d, e}.
First we find the support of each item in W1: {a = 5, b = 4, c = 2, d = 3, e = 3}. Since support(c) < Min_sup, we ignore item c when we build the sequence tree. After processing W1, we have the tree shown in Figure 2.
Next we find the support of each item in W2: {a = 5, b = 1, c = 3, d = 2, e = 5}. Since support(b) < Min_sup and support(d) < Min_sup, we ignore both items b and d. The sequence tree after processing W2 is shown in Figure 3.
Now we have to prune the tree: we remove all paths with support less than Min_sup. The final tree is shown in Figure 4.
Figure 4. Sequence tree after pruning
Traversing the tree, we get the sequence patterns with support ≥ Min_sup: {<a>:10, <b>:4, <c>:3, <d>:3, <e>:8, <ae>:7, <bd>:3}.
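The single-item counts in the final result can be checked from the two per-window support lists alone. The sketch below assumes that an item adds its window support to the tree only in windows where that support reaches Min_sup; it does not reproduce Table 2 or the longer patterns.

```python
from collections import Counter

MIN_SUP = 3
w1 = {"a": 5, "b": 4, "c": 2, "d": 3, "e": 3}  # item supports in W1
w2 = {"a": 5, "b": 1, "c": 3, "d": 2, "e": 5}  # item supports in W2

totals = Counter()
for window in (w1, w2):
    for item, sup in window.items():
        if sup >= MIN_SUP:  # items below Min_sup are ignored in that window
            totals[item] += sup

print(dict(totals))  # {'a': 10, 'b': 4, 'd': 3, 'e': 8, 'c': 3}
```

All five totals are at least Min_sup = 3, so every single item survives the pruning step, in agreement with the counts for <a>, <b>, <c>, <d> and <e> above.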
The prune time and Min_sup values can be tuned according to the memory limitation of the processing node; decreasing Min_sup and increasing the prune time give more accurate results but increase the memory required to build the tree.
We use a greedy algorithm to find the frequent patterns; it does not give an optimal answer, which is normal for this class of algorithms. We remove items with support < Min_sup from the sequences, and we prune the sequences with support < Min_sup after the prune time P.

Experimental Results and Performance Analysis
The algorithm was implemented in Java on a PC running Windows XP with a dual-core processor (2.8 GHz) and 4 GB of memory. We evaluated the proposed algorithm with synthetic data generated by the sequence pattern mining framework SPMF (Fournier et al., 2015); each sequence has 17 unique transactions, and each transaction is represented by a letter in {a-q}. To show the efficiency of the proposed algorithm, we compared it with the PrefixSpan algorithm (Pei et al., 2013). Figure 5 reports the time needed to find all frequent patterns in 100,000 sequences with different values of minimum support; each window contains 100 sequences and the pruning time is 100 windows. As shown in Figure 5 (a), the execution time of PrefixSpan grows as the minimum support decreases, while the execution time of DSSPM increases only slightly. Moreover, DSSPM outperforms PrefixSpan in execution time for all minimum support values.
In Figure 5 (c) we compare both algorithms in terms of the number of frequent patterns found. Clearly, with low minimum support PrefixSpan is more effective, while with high support values DSSPM becomes more effective and can find more frequent patterns.
To show the effectiveness of the proposed algorithm, we compared the 10 most frequent sequences found when the minimum support was set to 10%. The results are shown in Table 3.

Table 3. Most frequent sequences with minimum support of 10%
Sequence    Support (DSSPM)    Support (PrefixSpan)
adhji       154                17604
iebfa       152                17557
ebfgh       148                17676
bfegi       147                17668
gfaej       153                17709
fhgbc       152                17677
cadhj       171                17622
dhjib       151                17763
hjieb       144                17689
jiebf       154                17642

DSSPM is a greedy algorithm; it looks only at the current window when pruning the infrequent sequence patterns, which explains its low support values compared with PrefixSpan. It can still find the frequent sequences, which makes it suited to data stream environments.

Conclusion
A new algorithm based on the sequence tree was introduced for sequence pattern mining in data stream environments. The depth of the tree equals the maximum length of the supported sequences, which makes it suitable for data stream environments where memory is limited. The algorithm inputs Min_sup and prune time are user-defined variables that can be tuned according to the environment and the user's needs. If the processing node has enough memory and processing power, the user can decrease Min_sup or increase the prune time to get more precise results; this improves the results of the greedy algorithm.