Data stream analysis is the process of computing summaries and derived values from large volumes of data that are generated continuously and at a rapid rate. The nature of a stream does not allow revisiting individual data elements, and processing must be fast enough to produce timely analysis results. These requirements constrain algorithm design, forcing a balance between correctness and response time. Several techniques have been proposed in recent years to address these challenges; they can be categorized as either data-oriented or task-oriented. A data-oriented approach analyzes a subset of the data or a smaller transformed representation, whereas a task-oriented scheme solves the problem directly via approximation techniques. We propose a hybrid approach to the data stream analysis problem: the stream is both statistically transformed to a smaller size and computationally approximated in its characteristics. We adopt a Monte Carlo method in the approximation step, and data reduction is performed horizontally and vertically through our EMR sampling method. The proposed method is evaluated through a series of experiments, in which we apply our algorithm to clustering and classification tasks to assess the utility of the approach.
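The two ingredients named above, one-pass data reduction and Monte Carlo approximation of a summary statistic, can be illustrated with a minimal sketch. This is not the paper's EMR method (whose details are not given here); it uses classic reservoir sampling for the horizontal reduction and simple Monte Carlo resampling for the approximation step, purely as an assumed illustration of the general scheme:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """One-pass uniform sample of k items from a stream (horizontal
    reduction): each element is seen once and never revisited."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            # Element i replaces a reservoir slot with probability k/(i+1),
            # which keeps the sample uniform over everything seen so far.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = x
    return sample

def monte_carlo_mean(sample, n_draws, seed=0):
    """Approximate a summary statistic (here, the mean) by Monte Carlo
    resampling from the reduced sample instead of the full stream."""
    rng = random.Random(seed)
    draws = [rng.choice(sample) for _ in range(n_draws)]
    return sum(draws) / n_draws
```

For example, reducing a stream of 100,000 values to a 100-element reservoir and resampling from it yields a mean estimate close to the true stream mean, at a small, fixed memory cost.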
@article{"International Journal of Information, Control and Computer Sciences:53761",
  author   = "Kittisak Kerdprasop and Nittaya Kerdprasop and Pairote Sattayatham",
  title    = "A Monte Carlo Method to Data Stream Analysis",
  keywords = "Data Stream, Monte Carlo, Sampling, Density Estimation",
  volume   = "2",
  number   = "8",
  pages    = "2666-6",
}