Content Based Sampling over Transactional Data Streams

This paper investigates the problem of sampling from transactional data streams. We introduce CFISDS as a content based sampling algorithm that works on a landmark window model of data streams and preserve more informed sample in sample space. This algorithm that work based on closed frequent itemset mining tasks, first initiate a concept lattice using initial data, then update lattice structure using an incremental mechanism.Incremental mechanism insert, update and delete nodes in/from concept lattice in batch manner. Presented algorithm extracts the final samples on demand of user. Experimental results show the accuracy of CFISDS on synthetic and real datasets, despite on CFISDS algorithm is not faster than exist sampling algorithms such as Z and DSS.





References:
[1] J. Vitter; "Random sampling with a reservoir", ACM Transactions
on Mathematical Software, Vol 11(1), pp.37- 57, 1985 .
[2] P.B.Gibbons, Y. Matias. "New sampling-based summary statistics
for improving approximate query answers". In Proceedings of
ACM SIGMOD international conference on management of data,
1998, pp.331-342.
[3] C. Estan, G. Varghese. "New directions in traffic measurement and
accounting" . In: Proceedings of 1st ACM SIGCOMM workshop
on Internet measurement, 2001, pp. 75-80.
[4] G.Manku, R.Motwani. "Approximate frequency counts over data
streams". In Proc. 2002 Int. Conf. Very Large Data Bases, 2002,
pp. 346-357.
[5] E.D.Demaine, A.L-opez-Ortiz, J.I.Munro. "Frequency estimation
of Internet packet streams with limited space". In: Proceedings of
10th European symposium on algorithms, 2002, pp. 348-360.
[6] M. Dash, W. Ng. "Efficient Reservoir Sampling for Transactional
Data Streams" . Sixth IEEE International Conference on Data
Mining - Workshops, 2006, pp. 662-666 .
[7] W.Ng, M.Dash. "Which Is Better for Frequent Pattern Mining:
Approximate Counting or Sampling?". In Proceedings of the 11th
international Conference on Data Warehousing and Knowledge
Discovery , 2009, p.p. 151-162.
[8] J. Pei, J. Han, and R. Mao. Closet: An efficient algorithm for
mining frequent closed itemsets. In ACM SIGMOD the Internatial
Workshop on Data Mining and Knowledge Discovery, 2000.
[9] J.Wang, J.Han and J.Pei: CLOSET+: Searching for the Best
Strategies for Mining Frequent Closed Itemsets; In Proceeding of
ACM SIGKDD the InternationalConference on Knowledge
Discovery and Data Mining, 2003
[10] M.J.Zaki and C.Hsiao: Charm: An efficient algorithm for closed
itemset mining; In Proceedings of SDM the SIAM International
Conference on Data Mining, 2002
[11] C. Lucchesse, S. Orlando, and R. Perego. DCI-Closed: A fast and
memory efficient algorithm to mine frequent closed itemsets. In B.
Goethals, M. J. Zaki, and R. Bayardo, editors, Proceedings of the
IEEE ICDM Workshop on Frequent Itemset Mining
Implementations (FIMI 2004), volume 126 of CEUR Workshop
Proceedings, Brighton, UK, 1 November 2004.
[12] G. Grahne and J. Zhu. Efficiently using prefix-trees in mining
frequent itemsets. In B. Goethals and M. J. Zaki, editors,
Proceedings of the IEEE ICDM Workshop on Frequent Itemset
Mining Implementations (FIMI 2003), volume 90 of CEUR
Workshop Proceedings, Melbourne, Florida, USA, 19 November
2003.
[13] T.Uno , T.Asai , Y.Uchida , H.Arimura . LCM: An efficient
algorithm for enumerating frequent closed item sets , In
Proceedings of Workshop on Frequent itemset Mining
Implementations , 2003 .
[14] B.N. Ranganath, M.N. Murty. "Stream-Close: Fast Mining of
Closed Frequent Itemsets in High Speed Data Streams". ICDM
Workshops 2008, 2008, pp. 516-525.
[15] H.Li, H.Chen. "Moment+: Mining Closed Frequent Itemsets over
Data Stream", ADMA 2008, 2008, pp. 612-619.
[16] Y. Chi, H. Wang, P.S. Yu, R.R. Muntz. "Moment: Maintaining
Closed Frequent Itemsets over a Stream Sliding Window", In Proc.
Fourth IEEE Int-l Conf. Data Mining, 2004, pp. 59-66.
[17] V. Choi. (2006, Jun) "Faster Algorithms for Constructing a
Concept Galois Lattice." CoRR .vol abs/cs/0602069. Available:
http://arxiv.org/abs/cs/0602069 [Jun. 1,2006].
[18] M.J. Zaki. C.-J. Hsiao. "Efficient Algorithms for Mining Closed
Itemsets and Their Lattice Structure". IEEE Trans. on Knowl. and
Data Eng. Vol.17, No.4. pp. 462-478. 2005.
[19] P. Valtchev, R. Missaoui, R. Godin. "A framework for incremental
generation of closed itemsets". Discrete Applied Mathematics
vol.156 (6). pp. 924-949. 2008.
[20] J.H. Chang, W.S. Lee. "Finding recently frequent itemsets
adaptively over online transactional data streams". Inf. Syst.
Vol.31, No.8. pp. 849-869. 2006.
[21] F. Alqadah , R. Bhatnagar. "Similarity Measures in Formal
Concept Analysis". ISAIM 2010, Fort Lauderdale, Florida , 2010.
[22] IlliMine. Package for Data Mining in C++ ,[Online] Available:
http://illimine.cs.uiuc.edu.