A Multilanguage Source Code Retrieval System Using Structural-Semantic Fingerprints

Mohamed Amine Ouddan and Hassane Essafi

Source code retrieval is of immense importance in software engineering. Retrieving and extracting information from source code documents is vital throughout the development cycle of large software systems. Two main subtasks arise from these activities: code duplication prevention and plagiarism detection. In this paper, we propose a source code retrieval system based on a two-level fingerprint representation that captures, respectively, the structural and the semantic information within source code. A sequence alignment technique is applied to these fingerprints to quantify the similarity between source code portions. The specific purpose of the system is to detect plagiarism and duplicated code between programs written in different programming languages belonging to the same class, such as C, C++, Java and C#. These four languages are supported by the current version of the system, which is designed so that it can be easily adapted to any programming language.
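
To make the alignment step concrete, the following is a minimal sketch of how a similarity score between two fingerprints might be computed. It is not the system's implementation. The abstract does not name the alignment algorithm, so Smith-Waterman local alignment is assumed here, and the scoring parameters, the normalisation, the function names and the fingerprint symbols (FUNC, LOOP and so on) are all hypothetical choices made for illustration.

```python
from typing import Sequence


def local_alignment_score(a: Sequence[str], b: Sequence[str],
                          match: int = 2, mismatch: int = -1,
                          gap: int = -1) -> int:
    """Best Smith-Waterman local alignment score between two symbol sequences.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for x in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def similarity(a: Sequence[str], b: Sequence[str], match: int = 2) -> float:
    """Normalise the score by the best score the shorter sequence could reach."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return local_alignment_score(a, b, match=match) / (match * shortest)


# Hypothetical structural fingerprints: one abstract symbol per construct,
# so that equivalent C and Java code maps to the same symbols.
c_fingerprint = ["FUNC", "DECL", "LOOP", "IF", "ASSIGN", "RETURN"]
java_fingerprint = ["CLASS", "FUNC", "DECL", "LOOP", "IF", "ASSIGN", "RETURN"]

print(similarity(c_fingerprint, java_fingerprint))
```

Local rather than global alignment is used in this sketch because plagiarised or duplicated code typically appears as a fragment embedded in a larger program, so the score should reward the best matching region rather than penalise the unrelated code around it.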



