Weighted-Distance Sliding Windows and Cooccurrence Graphs for Supporting Entity-Relationship Discovery in Unstructured Text

The problem of Entity relation discovery in structured
data, a well covered topic in literature, consists in searching within
unstructured sources (typically, text) in order to find connections
among entities. These can be a whole dictionary, or a specific
collection of named items. In many cases machine learning and/or
text mining techniques are used for this goal. These approaches
might be unfeasible in computationally challenging problems, such
as processing massive data streams. A faster approach consists in collecting the cooccurrences of any
two words (entities) in order to create a graph of relations - a
cooccurrence graph. Indeed each cooccurrence highlights some grade
of semantic correlation between the words because it is more common
to have related words close each other than having them in the
opposite sides of the text. Some authors have used sliding windows for such problem: they
count all the occurrences within a sliding windows running over the
whole text. In this paper we generalise such technique, coming up
to a Weighted-Distance Sliding Window, where each occurrence of
two named items within the window is accounted with a weight
depending on the distance between items: a closer distance implies
a stronger evidence of a relationship. We develop an experiment in
order to support this intuition, by applying this technique to a data
set consisting in the text of the Bible, split into verses.




References:
[1] “10 key marketing trends for 2017,” https://www-01.ibm.com/common/
ssi/cgi-bin/ssialias?htmlfid=WRL12345USEN, accessed: 2018-05-15.
[2] E. Agichtein and L. Gravano, “Snowball: Extracting Relations from
Large Plain-Text Collections,” Proceedings of the fifth ACM conference
on Digital libraries - DL ’00, vol. I, no. 58, pp. 85–94, 2000. [3] E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, A. Voskoboynik,
E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik,
“Snowball: a prototype system for extracting relations from large text
collections,” in Proceedings of the 2001 ACM SIGMOD international
conference on Management of data - SIGMOD ’01, vol. 30, no. 2. New
York, New York, USA: ACM Press, 2001, p. 612.
[4] J. Zhu, Z. Nie, X. Liu, B. Zhang, and J.-R. Wen, “StatSnowball :
a Statistical Approach to Extracting Entity,” Proceedings of the 18th
international conference on World wide web (WWW2009), pp. 101–110,
2009. (Online). Available: http://www2009.eprints.org/11/1/p101.pdf.
[5] H. Cunningham, “Information Extraction, Automatic Introduction:
Extraction and Retrieval,” Oxford: Elsevier. Heath S B Kortmann
B Miller J, vol. 5, pp. 665–677, 2006. (Online). Available:
http://www.elsevier.com/locate/permissionusematerial.
[6] P. Pantel and M. Pennacchiotti, “Espresso: Leveraging Generic Patterns
for Automatically Harvesting Semantic Relations,” Acl2006, no. July,
pp. 113–120, 2006.
[7] A. O¨ zgu¨r, B. Cetin, and H. Bingl, “Co-Occurrence Network
Of Reuters News,” International Journal of Modern Physics C,
vol. 19, no. 05, pp. 689–702, may 2008. (Online). Available:
http://www.worldscientific.com/doi/abs/10.1142/S0129183108012431.
[8] H. Sayyadi, M. Hurst, and A. Maykov, “Event Detection and
Tracking in Social Streams *,” Proceedings of the Third International
ICWSM Conference (2009) Event, 2009. (Online). Available: https:
//www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/170/493.
[9] K. Tanaka-Ishii and H. Iwasaki, “Clustering Co-Occurrence Graph
based on Transitivity,” Fifth Workshop on Very Large Corpora, 1997.
(Online). Available: http://www.aclweb.org/anthology/W97-0111.
[10] D. Benz, C. K¨orner, A. Hotho, G. Stumme, and M. Strohmaier,
“One Tag to Bind Them All: Measuring Term Abstractness in
Social Metadata,” in Extended Semantic Web Conference. Springer,
Berlin, Heidelberg, 2011, pp. 360–374. (Online). Available: http:
//link.springer.com/10.1007/978-3-642-21064-8{ }25.
[11] J. V´eronis, “HyperLex: lexical cartography for information retrieval,”
Computer Speech & Language, vol. 18, no. 3, pp. 223–252, jul 2004.
(Online). Available: https://www.sciencedirect.com/science/article/pii/
S0885230804000142.
[12] E. Agirre, D. Mart´ınez, O. L´opez De Lacalle, and A. Soroa, “Two
graph-based algorithms for state-of-the-art WSD,” Proceedings of
the 2006 Conference on Empirical Methods in Natural Language
Processing, pp. 585–593, 2006.
[13] R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Texts,”
Proceedings of the 2004 conference on empirical methods in natural
language processing, 2004. (Online). Available: http://www.aclweb.org/
anthology/W04-3252.
[14] Y. Matsuo and M. Ishizuka, “Keyword Extraction From
A Single Document Using Word Co-occurrence Statistical
Information,” International Journal on Artificial Intelligence Tools,
vol. 13, no. 01, pp. 157–169, mar 2004. (Online). Available:
http://www.worldscientific.com/doi/abs/10.1142/S0218213004001466.
[15] Y. Liu, F. Wang, C. Kang, Y. Gao, and Y. Lu, “Analyzing relatedness
by toponym co-occurrences on web pages,” Transactions in GIS, 2014.
[16] X. Zhong, J. Liu, Y. Gao, and L. Wu, “Analysis of co-occurrence
toponyms in web pages based on complex networks,” Physica A:
Statistical Mechanics and its Applications, vol. 466, pp. 462–475,
jan 2017. (Online). Available: https://www.sciencedirect.com/science/
article/pii/S0378437116306409.
[17] Z. Zhang and V. Saligrama, “PRISM: Person Re-Identification via
Structured Matching,” IEEE Transaction On Pattern Analysis And
Machine Intelligence, 2015.
[18] I. Ali and A. Melton, “Semantic-Based Text Document Clustering
Using Cognitive Semantic Learning and Graph Theory,” in 2018 IEEE
12th International Conference on Semantic Computing (ICSC). IEEE,
jan 2018, pp. 243–247. (Online). Available: https://ieeexplore.ieee.org/
document/8334465/.
[19] B. Lemaire and G. Denhi`ere, “Incremental Construction of an
Associative Network from a Corpus,” cogprints.org, 2004. (Online).
Available: http://cogprints.org/3779/.
[20] K. Ghoorchian, S. Girdzijauskas, and F. Rahimian, “DeGPar: Large
Scale Topic Detection Using Node-Cut Partitioning on Dense Weighted
Graphs,” in 2017 IEEE 37th International Conference on Distributed
Computing Systems (ICDCS). IEEE, jun 2017, pp. 775–785. (Online).
Available: http://ieeexplore.ieee.org/document/7980020/.
[21] L. Jiang, P. Luo, J. Wang, Y. Xiong, B. Lin, M. Wang, and N. An,
“GRIAS: An entity-relation graph based framework for discovering
entity aliases,” Proceedings - IEEE International Conference on Data
Mining, ICDM, pp. 310–319, 2013.[22] D. V. Kalashnikov and S. Mehrotra, “Domain-independent data cleaning
via analysis of entity-relationship graph,” ACM Transactions on
Database Systems, vol. 31, no. 2, pp. 716–767, 2006.
[23] M. Laclav´ık, ˇ S. Dlugolinsk´y, and M. Ciglan, “Discovering Relations by
Entity Search in Lightweight Semantic Text Graphs,” Computing And
Informatics, vol. 33, no. 4, pp. 877–906, 2015. (Online). Available:
http://www.cai.sk/ojs/index.php/cai/article/viewArticle/2242.
[24] “Bible: Characters in the bible,” Collins Dictionary.
(Online). Available: https://www.collinsdictionary.com/word-lists/
bible-characters-in-the-bible.
[25] T. Bodruk, “Bible: Json + xml,” https://github.com/thiagobodruk/bible.
[26] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open
source software for exploring and manipulating networks,”
2009. (Online). Available: http://www.aaai.org/ocs/index.php/ICWSM/
09/paper/view/154.