Interactive, Topic-Oriented Search Support by a Centroid-Based Text Categorisation

Centroid terms are single words that semantically and
topically characterise text documents and so may serve as their
very compact representation in automatic text processing. In the
present paper, centroids are used to measure the relevance of text
documents with respect to a given search query. Thus, a new graph-based
paradigm for searching texts in large corpora is proposed
and evaluated against keyword-based methods. First experimental results
are promising and demonstrate the usefulness of the centroid-based
search procedure. In particular, the routing of search queries in
interactive and decentralised search systems is shown to improve
considerably when this approach is applied. A detailed discussion of
further fields of application completes this contribution.



