Definition[edit | edit source]
Overview[edit | edit source]
Data quality can also be affected by the structure and consistency of the data being analyzed. The presence of duplicate records, the lack of data standards, the timeliness of updates, and human error can significantly impact the effectiveness of searching and data mining techniques, which are sensitive to subtle differences that may exist in the data.
All data collection efforts suffer accuracy concerns to some degree. Ensuring the accuracy of information can require costly protocols that may not be cost effective if the data is not of inherently high economic value. To improve data quality, it is sometimes necessary to “clean” the data, which can involve the removal of duplicate records, normalizing the values used to represent information in the database (e.g., ensuring that “no” is represented as a 0 throughout the database, and not sometimes as a 0, sometimes as an N, etc.), accounting for missing data points, removing unneeded data fields, identifying anomalous data points (e.g., an individual whose age is shown as 142 years), and standardizing data formats (e.g., changing dates so they all include MM/DD/YYYY).
|“||Data quality is intimately related to false positives and false negatives, in that it is intuitively obvious that using data of poor quality is likely to result in larger numbers of false positives and false negatives than would be the case if the data were of high quality.||”|