Trie-based Output Itemset Sampling

From LRDE

Abstract

Pattern sampling algorithms produce interesting patterns with a probability proportional to a given utility measure. Utility changes need quick re-preprocessing when sampling patterns from large databases. In this context, existing sampling techniques require storing all data in memorywhich is costly. To tackle these issues, this work enriches D. Knuth's trie structure, avoiding 1) the need to access the database to sample since patterns are drawn directly from the enriched trie and 2) the necessity to reprocess the whole dataset when the utility changes. We define the trie of occurrences that our first algorithm TPSpace (Trie-based Pattern Space) uses to materialize all of the database patterns. Factorizing transaction prefixes compresses the transactional database. TPSampling (Trie-based Pattern Sampling), our second algorithm, draws patterns from a trie of occurrences under a length-based utility measure. Experiments show that TPSampling produces thousands of patterns in seconds.


Bibtex (lrde.bib)

@InProceedings{	  diop.22.ieeebigdata,
  author	= {Diop, Lamine and Diop, Cheikh Talibouya and Giacometti,
		  Arnaud and Li, Dominique and Soulet, Arnaud},
  booktitle	= {2022 IEEE International Conference on Big Data (Big
		  Data)},
  title		= {Trie-based Output Itemset Sampling},
  year		= {2022},
  address	= {Osaka, Japan},
  month		= dec,
  abstract	= {Pattern sampling algorithms produce interesting patterns
		  with a probability proportional to a given utility measure.
		  Utility changes need quick re-preprocessing when sampling
		  patterns from large databases. In this context, existing
		  sampling techniques require storing all data in memory,
		  which is costly. To tackle these issues, this work enriches
		  D. Knuth's trie structure, avoiding 1) the need to access
		  the database to sample since patterns are drawn directly
		  from the enriched trie and 2) the necessity to reprocess
		  the whole dataset when the utility changes. We define the
		  trie of occurrences that our first algorithm TPSpace
		  (Trie-based Pattern Space) uses to materialize all of the
		  database patterns. Factorizing transaction prefixes
		  compresses the transactional database. TPSampling
		  (Trie-based Pattern Sampling), our second algorithm, draws
		  patterns from a trie of occurrences under a length-based
		  utility measure. Experiments show that TPSampling produces
		  thousands of patterns in seconds.},
  pages		= {1-10},
  publisher	= {IEEE},
  doi		= {10.1109/BigDataXXXX},
  note		= {accepted}
}