The IT Law Wiki
 

Latest revision as of 05:21, 30 November 2011

Definition

Data quality refers to the accuracy and completeness of the data in a database.

Overview

Data quality can also be affected by the structure and consistency of the data being analyzed. The presence of duplicate records, the lack of data standards, the timeliness of updates, and human error can significantly impact the effectiveness of searching and data mining techniques, which are sensitive to subtle differences that may exist in the data.

All data collection efforts suffer from accuracy concerns to some degree. Ensuring the accuracy of information can require costly protocols that may not be justified if the data is not of inherently high economic value. To improve data quality, it is sometimes necessary to “clean” the data, which can involve removing duplicate records, normalizing the values used to represent information in the database (e.g., ensuring that “no” is always represented as a 0, and not sometimes as a 0 and sometimes as an N), accounting for missing data points, removing unneeded data fields, identifying anomalous data points (e.g., an individual whose age is shown as 142 years), and standardizing data formats (e.g., changing dates so they all use the MM/DD/YYYY format).
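The cleaning steps listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the field names (`consent`, `joined`, `fax`), the accepted spellings of "yes" and "no", the list of input date formats, and the age limit of 120 are hypothetical and would differ in any real database.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are illustrative only.
RAW = [
    {"id": 1, "name": "Ann Lee", "consent": "N", "age": "34", "joined": "2011-11-30", "fax": ""},
    {"id": 1, "name": "Ann Lee", "consent": "N", "age": "34", "joined": "2011-11-30", "fax": ""},
    {"id": 2, "name": "Bo Chan", "consent": "0", "age": "142", "joined": "30.11.2011", "fax": ""},
    {"id": 3, "name": "Cy Diaz", "consent": "yes", "age": "", "joined": "11/05/2010", "fax": ""},
]

NO_VALUES = {"0", "n", "no", "false"}      # assumed spellings of "no"
YES_VALUES = {"1", "y", "yes", "true"}     # assumed spellings of "yes"
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")  # assumed input formats


def normalize_flag(value):
    """Map the many spellings of yes/no onto 1/0; None if unrecognized."""
    v = value.strip().lower()
    if v in NO_VALUES:
        return 0
    if v in YES_VALUES:
        return 1
    return None


def standardize_date(text):
    """Rewrite a date as MM/DD/YYYY; None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None


def clean(records, max_age=120):
    """Return (cleaned rows, ids of rows with an anomalous age)."""
    seen, cleaned, flagged = set(), [], []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:                     # drop exact duplicate records
            continue
        seen.add(key)
        # Missing ages are kept as None rather than guessed.
        age = int(rec["age"]) if rec["age"].strip() else None
        row = {                             # the unneeded "fax" field is dropped
            "id": rec["id"],
            "name": rec["name"],
            "consent": normalize_flag(rec["consent"]),
            "age": age,
            "joined": standardize_date(rec["joined"]),
        }
        if age is not None and not 0 <= age <= max_age:
            flagged.append(row["id"])       # anomalous value: flag, do not delete
        cleaned.append(row)
    return cleaned, flagged


cleaned, flagged = clean(RAW)
```

In this sketch an anomalous value is flagged for review rather than silently deleted, and a missing value is recorded as missing rather than filled in; both are design choices of the example, not requirements stated in the text.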

Data quality is intimately related to false positives and false negatives, in that it is intuitively obvious that using data of poor quality is likely to result in larger numbers of false positives and false negatives than would be the case if the data were of high quality.[1]
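A small Python sketch can make the false negative half of this point concrete. The names, the target list, and the `normalize` helper are hypothetical and chosen only for illustration: a search that compares strings exactly misses records that refer to the same person but differ in case or spacing, while the same search over normalized values finds them.

```python
def normalize(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


# Hypothetical data: one target name and four stored records.
TARGETS = ["John Q. Public"]
RECORDS = ["john q. public ", "John Q. Public", "JOHN  Q. PUBLIC", "Jane Roe"]


def exact_hits(records, targets):
    """Match records against targets character for character."""
    return [r for r in records if r in targets]


def normalized_hits(records, targets):
    """Match records against targets after normalizing both sides."""
    wanted = {normalize(t) for t in targets}
    return [r for r in records if normalize(r) in wanted]


missed = len(normalized_hits(RECORDS, TARGETS)) - len(exact_hits(RECORDS, TARGETS))
```

Here the exact search returns one record and the normalized search returns three, so two matching records would be false negatives under the exact search. False positives arise in the opposite way, for example when stale or wrongly merged records match a target they should not; the sketch does not model that case.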

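References

1. Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Program Assessment, at 38.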