A Monte Carlo Method for Data Stream Analysis

Data stream analysis is the process of computing summaries and derived values from large volumes of data generated continuously at a rapid rate. The nature of a stream does not allow revisiting individual data elements, and processing must be fast enough to produce timely results. These requirements constrain algorithm design to balance correctness against response time. Several techniques have been proposed in recent years to address these challenges; they can be categorized as either data-oriented or task-oriented. The data-oriented approach analyzes a subset of the data or a smaller transformed representation, whereas the task-oriented approach solves the problem directly via approximation techniques. We propose a hybrid approach to the data stream analysis problem: the stream is statistically transformed to a smaller size, and its characteristics are then computationally approximated. We adopt a Monte Carlo method in the approximation step, and the data reduction is performed both horizontally and vertically through our EMR sampling method. The proposed method is analyzed by a series of experiments; we apply our algorithm to clustering and classification tasks to evaluate the utility of our approach.
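To make the hybrid idea concrete, the sketch below illustrates the two stages in their simplest generic form: a one-pass reservoir sample stands in for the horizontal data reduction (EMR sampling itself is the authors' method and is not reproduced here), and a bootstrap-style Monte Carlo estimate over the retained sample stands in for the approximation step. All function names and the synthetic stream are illustrative assumptions, not the paper's implementation.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """One-pass reservoir sampling: keeps a uniform random sample of
    size k without revisiting any element (a horizontal reduction)."""
    rng = rng or random.Random(0)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            # Replace a retained element with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = x
    return sample

def monte_carlo_mean(sample, draws, rng=None):
    """Monte Carlo estimate of the stream mean: averages the means of
    `draws` bootstrap resamples drawn from the retained sample."""
    rng = rng or random.Random(1)
    n = len(sample)
    estimates = [sum(rng.choices(sample, k=n)) / n for _ in range(draws)]
    return sum(estimates) / draws

# Synthetic stream consumed in a single pass; true mean is 49.5.
stream = (float(i % 100) for i in range(100_000))
s = reservoir_sample(stream, k=500)
estimate = monte_carlo_mean(s, draws=200)
```

With k = 500 retained elements, the Monte Carlo estimate lands close to the true mean of 49.5 (within ordinary sampling error), while memory stays constant regardless of stream length.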



