Analysis of Relation between Unlabeled and Labeled Data to Self-Taught Learning Performance
Obtaining labeled data in supervised learning is often
difficult and expensive, so a learning algorithm trained on a small
labeled set tends to overfit. As a result, some researchers have
focused on using unlabeled data, which need not follow the same
generative distribution as the labeled data, to construct high-level
features that improve performance on supervised learning tasks. In
this paper, we investigate the impact of the relationship between
unlabeled and labeled data on classification performance.
Specifically, we apply unlabeled datasets with different degrees of
relation to the labeled data to a handwritten digit classification
task based on the MNIST dataset. Our experimental results show that
the higher the degree of relation between unlabeled and labeled data,
the better the classification performance. Although unlabeled data
drawn from a generative distribution completely different from that
of the labeled data yields the lowest classification performance, the
resulting accuracy is still high. This suggests that unsupervised
learning can expand the applicability of supervised learning
algorithms.
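The pipeline the abstract describes, learning high-level features from unlabeled data and then training a supervised classifier on those features, can be sketched as follows. This is a minimal illustration assuming a single-hidden-layer autoencoder as the unsupervised feature learner (the paper's keywords name an autoencoder); the NumPy implementation, array shapes, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of self-taught learning: learn features on an
# unlabeled pool, then reuse the encoder on the small labeled set.
# All sizes and hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X_unlabeled, n_hidden=64, lr=0.1, epochs=50):
    """Learn an encoding (W1, b1) from unlabeled data by minimizing
    squared reconstruction error with plain batch gradient descent."""
    n, d = X_unlabeled.shape
    W1 = rng.normal(0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X_unlabeled @ W1 + b1)    # hidden code
        R = H @ W2 + b2                       # reconstruction
        E = R - X_unlabeled                   # reconstruction error
        dW2 = H.T @ E / n; db2 = E.mean(0)
        dH = (E @ W2.T) * (1 - H**2)          # backprop through tanh
        dW1 = X_unlabeled.T @ dH / n; db1 = dH.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1

def encode(X, W1, b1):
    """Map raw inputs to the learned high-level features."""
    return np.tanh(X @ W1 + b1)

# Toy stand-ins for the unlabeled pool and the (small) labeled set;
# in the paper's setting these would be image data such as MNIST digits.
X_unlab = rng.random((500, 100))
X_lab   = rng.random((50, 100))
y_lab   = rng.integers(0, 2, 50)

W1, b1 = train_autoencoder(X_unlab)
Z = encode(X_lab, W1, b1)   # features for the supervised classifier

# Any off-the-shelf classifier can now be trained on Z, e.g.
# sklearn.linear_model.LogisticRegression().fit(Z, y_lab)
```

In the experiments the abstract outlines, X_unlab would be swapped for unlabeled image sets with varying degrees of relation to the MNIST digits, and the downstream classification accuracy compared across those pools.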
@article{"International Journal of Electrical, Electronic and Communication Sciences:56571",
  author   = "Ekachai Phaisangittisagul and Rapeepol Chongprachawat",
  title    = "Analysis of Relation between Unlabeled and Labeled Data to Self-Taught Learning Performance",
  journal  = "International Journal of Electrical, Electronic and Communication Sciences",
  keywords = "Autoencoder, high-level feature, MNIST dataset, self-taught learning, supervised learning",
  volume   = "7",
  number   = "4",
  pages    = "386-5",
}