-
Notifications
You must be signed in to change notification settings - Fork 1
Finding Frequent (Sub)sequences
For finding frequent (sub)sequences the PrefixSpan algorithm by Pei et al. (2004) is used1. The algorithm uses the 'divide and conquer' principle, and is partially inspired by the FP-tree structure used to mine sets of unordered items.
During each iteration, a new prefix is generated based on the current prefix and nodes observed to be frequent in the set of paths. If this prefix is observed in the set of paths it is added to the set of frequent sequences. After this, all suffixes of sequences which start with this prefix are then mined recursively, with the suffix of a sequence being everything that follows said prefix. This improves upon the a priori method by foregoing the need to generate candidate sequences during each iteration. A pseudocode representation is given below.
Procedure PrefixSpace(all_paths, minimum_support)
frequent_items = all nodes in all_paths with occurrence >= minimum_support
return PrefixSpace([], frequent_items, all_paths, [])
EndProcedure
SubProcedure PrefixSpace(prefix, frequent_items, projected_paths, frequent_sequences)
for all item in frequent_items
if (prefix + item) in a path in projected_paths or
(prefix.last + item) in a path in projected_paths
new_prefix <- prefix + item
frequent_sequences <- frequent_sequences U new_prefix
new_projected_paths <- empty_set
for all sequence in projected_paths
if item starts with prefix
new_projected_paths <- new_projected_paths U (item - prefix)
end if
end for
PrefixSpace(new_prefix, frequent_items, new_projected_paths, frequent_sequences)
end if
end for
return frequent_sequences
SubEndProcedure
[1] Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., ... & Hsu, M. C. (2004). Mining sequential patterns by pattern-growth: The prefixspan approach. IEEE Transactions on knowledge and data engineering, 16(11), 1424-1440.
🐑