Data Quality Enhancement with String Length Distribution

The volume of manufacturing data that can be collected is increasing
rapidly. At the same time, large-scale ("mega") recalls are becoming a
serious social problem. Under these circumstances, there is a growing
need to prevent mega recalls through defect analysis, such as root
cause analysis and anomaly detection, using manufacturing data.
However, classifying strings in manufacturing data with traditional
methods takes too long to meet the requirements of quick defect
analysis. We therefore present the String Length Distribution
Classification (SLDC) method, which classifies strings correctly in a
short time. The method learns character features, in particular the
string length distribution, from the Product IDs and Machine IDs in the
BOM and asset list. Applying the proposed method to strings in actual
manufacturing data, we verified that string classification time can be
reduced by 80%. This result suggests that the requirement of quick
defect analysis can be fulfilled.
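
The idea of learning a string length distribution from reference lists can be illustrated with a small sketch. This is not the SLDC implementation, which the abstract does not describe in detail; it only assumes that each class (for example, Product ID or Machine ID) is characterized by the relative frequency of its string lengths, and that a new string is assigned to the class whose learned distribution gives its length the highest probability. The sample IDs, the function names, and the `floor` smoothing value are all hypothetical.

```python
from collections import Counter


def learn_length_distribution(samples):
    """Return {length: relative frequency} for a list of reference strings."""
    counts = Counter(len(s) for s in samples)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def classify(value, distributions, floor=1e-6):
    """Pick the class whose learned length distribution best explains len(value).

    `floor` is a hypothetical smoothing value for lengths never seen in training.
    """
    scores = {
        label: dist.get(len(value), floor)
        for label, dist in distributions.items()
    }
    return max(scores, key=scores.get)


# Hypothetical reference data standing in for a BOM and an asset list.
product_ids = ["PRD-000123", "PRD-004567", "PRD-089012"]
machine_ids = ["M01", "M02", "M117"]

distributions = {
    "product_id": learn_length_distribution(product_ids),
    "machine_id": learn_length_distribution(machine_ids),
}

print(classify("PRD-555555", distributions))
print(classify("M09", distributions))
```

Length alone cannot separate classes whose IDs share the same lengths, so a practical classifier would presumably combine it with other character features, as the abstract's mention of "character features" suggests.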



