Trie-based Output Itemset Sampling
From LRDE
- Authors
- Lamine Diop, Cheikh Talibouya Diop, GiacomettiArnaud, Dominique Li, Arnaud Soulet
- Where
- 2022 IEEE International Conference on Big Data (Big Data)
- Place
- Osaka, Japan
- Type
- inproceedings
- Publisher
- IEEE
- Keywords
- IA
- Date
- 2022-12-12
Abstract
Pattern sampling algorithms produce interesting patterns with a probability proportional to a given utility measure. Utility changes need quick re-preprocessing when sampling patterns from large databases. In this context, existing sampling techniques require storing all data in memorywhich is costly. To tackle these issues, this work enriches D. Knuth's trie structure, avoiding 1) the need to access the database to sample since patterns are drawn directly from the enriched trie and 2) the necessity to reprocess the whole dataset when the utility changes. We define the trie of occurrences that our first algorithm TPSpace (Trie-based Pattern Space) uses to materialize all of the database patterns. Factorizing transaction prefixes compresses the transactional database. TPSampling (Trie-based Pattern Sampling), our second algorithm, draws patterns from a trie of occurrences under a length-based utility measure. Experiments show that TPSampling produces thousands of patterns in seconds.
Bibtex (lrde.bib)
@InProceedings{ diop.22.ieeebigdata, author = {Diop, Lamine and Diop, Cheikh Talibouya and Giacometti, Arnaud and Li, Dominique and Soulet, Arnaud}, booktitle = {2022 IEEE International Conference on Big Data (Big Data)}, title = {Trie-based Output Itemset Sampling}, year = {2022}, address = {Osaka, Japan}, month = dec, abstract = {Pattern sampling algorithms produce interesting patterns with a probability proportional to a given utility measure. Utility changes need quick re-preprocessing when sampling patterns from large databases. In this context, existing sampling techniques require storing all data in memory, which is costly. To tackle these issues, this work enriches D. Knuth's trie structure, avoiding 1) the need to access the database to sample since patterns are drawn directly from the enriched trie and 2) the necessity to reprocess the whole dataset when the utility changes. We define the trie of occurrences that our first algorithm TPSpace (Trie-based Pattern Space) uses to materialize all of the database patterns. Factorizing transaction prefixes compresses the transactional database. TPSampling (Trie-based Pattern Sampling), our second algorithm, draws patterns from a trie of occurrences under a length-based utility measure. Experiments show that TPSampling produces thousands of patterns in seconds.}, pages = {1-10}, publisher = {IEEE}, doi = {10.1109/BigDataXXXX}, note = {accepted} }