An Automatic Bayesian Classification System for File Format Selection
This paper presents an approach for classifying unstructured format
descriptions in order to identify file formats. The main contribution of
this work is the use of data mining techniques to support file format
selection from just an unstructured text description that comprises the
most important format features for a particular organisation. The file
format identification method then employs a file format classifier and
associated configurations to provide digital preservation experts with
an estimate of the required file format. Our goal is to make use of a
format specification knowledge base aggregated from different Web
sources in order to select a file format for a particular institution.
Using the naive Bayes method, the decision support system recommends to
an expert a file format for their institution. The proposed methods
facilitate the selection of a file format and contribute to the quality
of a digital preservation process. The presented approach is meant to
facilitate decision making for the preservation of digital content in
libraries and archives using domain expert knowledge and specifications
of file formats. To facilitate decision making, the aggregated
information about the file formats is presented as a file format
vocabulary that comprises the most common terms that are characteristic
of all researched formats. The goal is to suggest a particular file
format based on this vocabulary for analysis by an expert. A sample file
format calculation and its results, including probabilities, are
presented in the evaluation section.
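
The idea of ranking candidate formats by probability from a free-text description can be illustrated with a minimal naive Bayes sketch. Everything in it is an assumption made for illustration: the toy knowledge base, its terms, the uniform prior and the add-one smoothing are hypothetical and are not taken from the paper's knowledge base or configuration.

```python
import math
from collections import Counter

# Hypothetical toy knowledge base: format name -> descriptive terms.
# The terms are illustrative only and do not come from the paper.
KNOWLEDGE_BASE = {
    "PDF/A": "document text archival open standard embedded fonts",
    "TIFF": "image raster lossless uncompressed archival",
    "JPEG": "image raster lossy compressed web",
}


def train(knowledge_base):
    """Count term occurrences per format and collect the shared vocabulary."""
    counts = {fmt: Counter(text.split()) for fmt, text in knowledge_base.items()}
    vocabulary = set().union(*counts.values())
    return counts, vocabulary


def classify(description, counts, vocabulary):
    """Return (format, probability) pairs, most probable format first."""
    # Terms outside the vocabulary carry no evidence and are ignored.
    terms = [t for t in description.lower().split() if t in vocabulary]
    log_scores = {}
    for fmt, term_counts in counts.items():
        total = sum(term_counts.values())
        # Assumed uniform prior over formats.
        score = math.log(1.0 / len(counts))
        for t in terms:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((term_counts[t] + 1) / (total + len(vocabulary)))
        log_scores[fmt] = score
    # Normalise the log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    weights = {f: math.exp(s - peak) for f, s in log_scores.items()}
    norm = sum(weights.values())
    ranked = ((f, w / norm) for f, w in weights.items())
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


counts, vocabulary = train(KNOWLEDGE_BASE)
ranking = classify("lossless archival image", counts, vocabulary)
for fmt, prob in ranking:
    print(f"{fmt}: {prob:.3f}")
```

With this toy data the description "lossless archival image" ranks TIFF first. In practice the ranking would be passed to an expert for review, in line with the decision support role described above.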
@article{"International Journal of Information, Control and Computer Sciences:70289",
  author = "Roman Graf and Sergiu Gordea and Heather M. Ryan",
  title = "An Automatic Bayesian Classification System for File Format Selection",
  keywords = "Data mining, digital libraries, digital preservation, file format.",
  volume = "9",
  number = "6",
  pages = "1484-6",
}