Research of Data Cleaning Methods Based on Dependency Rules
This paper introduces the concept and principles of data cleaning,
analyzes the types and causes of dirty data, and proposes the key
steps of a typical cleaning process. It puts forward a scalable and
versatile data cleaning framework. For data with attribute dependency
relations, it formally defines several violation-data discovery
algorithms that find inconsistent data for all target columns that
depend on condition attributes, whether the data is structured (SQL)
or unstructured (NoSQL), and it gives six data cleaning methods based
on these algorithms.
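To make the idea of dependency-based violation discovery concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes the simplest case, a plain functional dependency in which the condition columns determine one target column, and the function name `find_fd_violations` and the sample records are hypothetical. Records are plain dictionaries, so the same check could apply to rows from a SQL table or to documents from a NoSQL store.

```python
from collections import defaultdict


def find_fd_violations(rows, condition_cols, target_col):
    """Return groups of rows that agree on condition_cols but disagree on target_col.

    Hypothetical helper for illustration; assumes all values are hashable.
    """
    # Group records by the values of their condition attributes.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row.get(col) for col in condition_cols)
        groups[key].append(row)

    # A group violates the dependency when it holds more than one target value.
    violations = {}
    for key, members in groups.items():
        target_values = {member.get(target_col) for member in members}
        if len(target_values) > 1:
            violations[key] = members
    return violations


# Illustrative records (made up): "zip" is assumed to determine "city".
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
violations = find_fd_violations(rows, ["zip"], "city")
```

In this sketch the two records sharing zip "10001" are reported together as one violating group; deciding which value to keep is the separate repair step.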
[1] Lee, M. L., Ling, T. W., Low, W. L. IntelliClean: A knowledge-based
intelligent data cleaner. In: Proceedings of the 6th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining.
Boston: ACM Press, 2000. 290-294.
[2] Galhardas, H., Florescu, D., Shasha, D., et al. AJAX: an extensible data
cleaning tool. In: Chen, W. D., Naughton, J. F., Bernstein, P. A., eds.
Proceedings of the 2000 ACM SIGMOD International Conference on
Management of Data. Texas: ACM, 2000. 590.
[3] Raman, V., Hellerstein, J. Potter's Wheel: an interactive data cleaning
system. In: Apers, P., Atzeni, P., Ceri, S., et al., eds. Proceedings of the
27th International Conference on Very Large Data Bases. Roma: Morgan
Kaufmann, 2001. 381-390.
[4] Dasu, T., Johnson, T. Exploratory data mining and data cleaning (M). John
Wiley, 2003.
[5] Ye, H. Z., Wu, D., Chen, S. An Open Data Cleaning Framework Based on
Semantic Rules for Continuous Auditing (C). In: Proceedings of the 2nd
International Conference on Computer Engineering and Technology,
Chengdu, China. 2010: 158-162.
[6] S. Song and L. Chen. Differential dependencies: Reasoning and
discovery. ACM Trans. Database Syst., 36(3):16, 2011.
[7] D. Z. Wang, X. L. Dong, A. D. Sarma, M. J. Franklin, and A. Y. Halevy.
Functional dependency generation and applications in pay-as-you-go data
integration systems. In WebDB, 2009.
@article{"International Journal of Information, Control and Computer Sciences:71042",
  author = "Yang Bao and Shi Wei Deng and Wang Qun Lin",
  title = "Research of Data Cleaning Methods Based on Dependency Rules",
  abstract = "This paper introduces the concept and principles of data cleaning, analyzes the types and causes of dirty data, and proposes the key steps of a typical cleaning process. It puts forward a scalable and versatile data cleaning framework. For data with attribute dependency relations, it formally defines several violation-data discovery algorithms that find inconsistent data for all target columns that depend on condition attributes, whether the data is structured (SQL) or unstructured (NoSQL), and it gives six data cleaning methods based on these algorithms.",
  keywords = "Data cleaning, dependency rules, violation data discovery, data repair.",
  volume = "9",
  number = "10",
  pages = "2189-5",
}