Learning to Order Terms: Supervised Interestingness Measures in Terminology Extraction

Term Extraction, a key data preparation step in Text Mining, extracts terms, i.e. relevant collocations of words attached to specific concepts (e.g., genetic algorithms and decision trees are terms associated with the concept "Machine Learning"). In this paper, the task of extracting interesting collocations is cast as a supervised learning problem, exploiting a few collocations manually labelled as interesting or not interesting. From these examples, the ROGER algorithm learns a numerical function that induces a ranking on the collocations. This ranking is optimized using genetic algorithms, maximizing the trade-off between the false positive and true positive rates (the Area Under the ROC Curve, AUC). The approach uses a particular representation for word collocations, namely the vector of values of the standard statistical interestingness measures attached to each collocation. As this representation is general (across corpora and natural languages), generality tests were performed by applying the ranking function learned from an English corpus in Biology to a French corpus of Curricula Vitae, and vice versa, showing good robustness of the approach compared to the state-of-the-art Support Vector Machine (SVM).
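The core idea described above can be illustrated with a minimal sketch: each collocation is a vector of interestingness-measure values, a linear function scores it, and an evolutionary loop selects the weight vectors whose induced ranking maximizes the AUC (computed here as the Wilcoxon-Mann-Whitney statistic). This is an illustrative simplification, not the actual ROGER implementation; the linear scoring form, the (mu+lambda)-style evolution loop, and all parameter values are assumptions for the sake of the example.

```python
import random

def auc(scores, labels):
    # AUC via the Wilcoxon-Mann-Whitney statistic: the fraction of
    # (positive, negative) pairs that the scores rank correctly
    # (ties count for half).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def score(weights, x):
    # Linear ranking function over the vector of interestingness measures
    # attached to one collocation.
    return sum(w * xi for w, xi in zip(weights, x))

def evolve(X, y, dim, pop_size=20, generations=100, sigma=0.1, seed=0):
    # (mu+lambda)-style evolutionary loop: mutate weight vectors with
    # Gaussian noise and keep the pop_size best individuals by AUC.
    rng = random.Random(seed)
    pop = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        children = [[w + rng.gauss(0, sigma) for w in parent] for parent in pop]
        pool = pop + children
        pool.sort(key=lambda w: auc([score(w, x) for x in X], y), reverse=True)
        pop = pool[:pop_size]
    return pop[0]

if __name__ == "__main__":
    # Toy data: each row is a (hypothetical) vector of two interestingness
    # measures for one collocation; 1 = labelled interesting, 0 = not.
    X = [[0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3]]
    y = [1, 1, 0, 0]
    best = evolve(X, y, dim=2)
    print(auc([score(best, x) for x in X], y))
```

Because AUC depends only on the order of the scores, not their magnitude, any monotone transformation of the learned function leaves the fitness unchanged; this is why a rank-based criterion, rather than classification accuracy, suits the ranking task described in the abstract.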
