Definitions

Although the use and sophistication of data mining (also called content mining) have increased in both the government and the private sector, data mining remains an ambiguous term. According to some experts, data mining overlaps a wide range of analytical activities, including data profiling, data warehousing, online analytical processing, and enterprise analytical applications.[1] Some of the terms used to describe data mining or similar analytical activities include "factual data analysis" and "predictive analytics."

General

Data mining is:

  • searches of one or more electronic databases of information concerning U.S. persons, by or on behalf of an agency or employee of the government.[2]
  • [t]he process or techniques used to analyze large sets of existing information to discover previously unrevealed patterns or correlations.[3]
  • the process of knowledge discovery, predictive modeling, and analytics. Traditionally, this involves the discovery of patterns and relationships from structured databases of historical occurrences. However, data mining technology has expanded to include different processes, technologies, and methodologies.[4]

Government reports

Government reports have defined data mining variously:

  • The Government Accountability Office (GAO) defined data mining in its May 2004 report entitled Data Mining: Federal Efforts Cover a Wide Range of Uses as "the application of database technology and techniques — such as statistical analysis and modeling — to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results."
  • The Congressional Research Service (CRS) defined data mining in its January 27, 2006, report to Congress entitled, "Data Mining and Homeland Security: An Overview," in more generic terms. It states that data mining "involves the uses of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets." The report describes data mining as using a "discovery approach" in which algorithms examine data relationships to identify patterns. It distinguishes this method from analytical tools that use a "verification based approach," where the user develops a hypothesis and then uses data to test the hypothesis.
  • The Department of Homeland Security Office of the Inspector General (DHS OIG) defined data mining in its August 2006 Survey of DHS Data Mining Activities simply as “the process of knowledge discovery, predictive modeling, and analytics.” It stated that this has traditionally involved the discovery of patterns and relationships from structured databases of historical occurrences.

House Committee report

House Conference Report No. 109-699 defines data mining as

a query or search or other analysis of 1 or more electronic databases, whereas — (A) at least 1 of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement; (B) a department or agency of the Federal Government or a non-Federal entity acting on behalf of the Federal Government is conducting the query or search or other analysis to find a predictive pattern indicating terrorist or criminal activity; and (C) the search does not use a specific individual’s personal identifiers to acquire information concerning that individual.

This definition is to be used by government departments and agencies in evaluating whether or not their information processing activities constitute data mining activities.

Federal legislation

The Federal Agency Data Mining Reporting Act of 2007 defines data mining as:

a program involving pattern-based[5] queries, searches, or other analyses of 1 or more electronic databases, where —
(A) a department or agency of the Federal Government, or a non-Federal entity acting on behalf of the Federal Government, is conducting the queries, searches, or other analyses to discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity on the part of any individual or individuals;
(B) the queries, searches, or other analyses are not subject-based and do not use personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals, to retrieve information from the database or databases; and
(C) the purpose of the queries, searches, or other analyses is not solely —
(i) the detection of fraud, waste, or abuse in a Government agency or program; or
(ii) the security of a Government computer system.[6]

The Act expressly excludes queries, searches, or analyses that are conducted solely in electronic databases of publicly-available information: telephone directories, news reporting services, databases of legal and administrative rulings, and other databases and services providing public information without a fee.[7]

Two aspects of the Act's definition of "data mining" are worth emphasizing. First, the definition is limited to pattern-based electronic searches, queries, or analyses; activities that use only personally identifiable information (PII) or other terms specific to individuals (e.g., a license plate number or vessel registration number) as search terms are excluded from the definition. Second, the definition is limited to searches, queries, or analyses that are conducted for the purpose of identifying predictive patterns or anomalies that are indicative of terrorist or criminal activity by an individual or individuals. Research in electronic databases that produces only a summary of historical trends, therefore, is not "data mining" under the Act.

Overview

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.

[Data mining] is a convergence of many fields of academic research in both applied mathematics and computer science, including statistics, databases, artificial intelligence, and machine learning. Like other technologies, advances in data mining have a research and development stage, in which new algorithms and computer programs are developed, and they have subsequent phases of commercialization and application.[8]

The two most common types of data mining are pattern-based queries and subject-based queries.

  • Pattern-based queries search for data elements that match or depart from a predetermined pattern, such as unusual travel patterns that might indicate a terrorist threat.
  • Subject-based queries search for any available information on a predetermined subject using a specific identifier. This identifier could be linked to an individual (such as a person's name or Social Security Number) or an object (such as a bar code or registration number). For example, one could initiate a search for information related to an automobile license plate number.

In practice, many data-mining systems use a combination of pattern-based and subject-based queries.
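The difference between the two query types can be sketched in a few lines of Python. This is only an illustration: the `TravelRecord` fields, the sample records, and the "one-way, cash, booked at the last minute" pattern are all invented for the example and do not describe any actual screening rule or system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TravelRecord:
    traveler_id: str        # hypothetical identifier
    one_way: bool
    paid_cash: bool
    booked_days_ahead: int


# Invented sample data for illustration only.
RECORDS = [
    TravelRecord("T-001", one_way=False, paid_cash=False, booked_days_ahead=30),
    TravelRecord("T-002", one_way=True, paid_cash=True, booked_days_ahead=0),
    TravelRecord("T-003", one_way=True, paid_cash=False, booked_days_ahead=14),
    TravelRecord("T-002", one_way=False, paid_cash=False, booked_days_ahead=21),
]


def subject_based_query(records, traveler_id):
    """Start from a known identifier and return everything linked to it."""
    return [r for r in records if r.traveler_id == traveler_id]


def pattern_based_query(records):
    """Start from a pattern and return whichever records match it,
    without naming any individual in advance (the pattern is an assumption)."""
    return [
        r for r in records
        if r.one_way and r.paid_cash and r.booked_days_ahead <= 1
    ]


if __name__ == "__main__":
    print(subject_based_query(RECORDS, "T-002"))
    print(pattern_based_query(RECORDS))
```

A combined system might first run the pattern-based query and then use the identifiers it returns as inputs to subject-based queries.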

Uses of data mining

Data mining enables corporations and government agencies to analyze massive volumes of data quickly and relatively inexpensively. The use of this type of information retrieval has been driven by the exponential growth in the volumes and availability of information collected by the public and private sectors, as well as by advances in computing and data storage capabilities. In response to these trends, generic data mining tools are increasingly available for — or built into — major commercial database applications. Today, mining can be performed on many types of data, including those in structured, textual, spatial, Web, or multimedia forms.

Data mining applications can use a variety of parameters to examine the data. They include association (patterns where one event is connected to another event), sequence or path analysis (patterns where one event leads to another event, such as the birth of a child and purchasing diapers), classification (identification of new patterns), clustering (finding and visually documenting groups of previously unknown facts), and forecasting (discovering patterns from which one can make reasonable predictions regarding future activities).
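As a rough illustration of the association parameter, the sketch below computes two measures commonly used in association analysis: support (the share of transactions containing both items) and confidence (the share of transactions containing the first item that also contain the second). The basket contents and the function names are assumptions made for the example, not part of any cited report.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets for illustration only.
BASKETS = [
    {"diapers", "wipes", "milk"},
    {"diapers", "wipes"},
    {"milk", "bread"},
    {"diapers", "milk"},
]


def pair_support(baskets):
    """Fraction of baskets in which each pair of items appears together."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (alphabetical) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


if __name__ == "__main__":
    print(pair_support(BASKETS)[("diapers", "wipes")])
    print(confidence(BASKETS, "diapers", "wipes"))
```

Production tools typically apply the same idea to much larger item sets and prune rare combinations early; this version only counts pairs.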

Data mining has become increasingly common in both the public and private sectors. Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research, and increase sales. For example, the insurance and banking industries use data mining applications to detect fraud and assist in risk assessment (e.g., credit scoring). Using customer data collected over several years, companies can develop models that predict whether a customer is a good credit risk, or whether an accident claim may be fraudulent and should be investigated more closely.

The medical community sometimes uses data mining to help predict the effectiveness of a procedure or medicine. Pharmaceutical firms use data mining of chemical compounds and genetic material to help guide research on new treatments for diseases. Retailers can use information collected through affinity programs (e.g., shoppers’ club cards, frequent flyer points, contests) to assess the effectiveness of product selection and placement decisions and coupon offers, and to determine which products are often purchased together. Companies such as telephone service providers and music clubs can use data mining to create a “churn analysis” to assess which customers are likely to remain as subscribers and which ones are likely to switch to a competitor.

The proliferation of data mining has raised implementation and oversight issues, including concerns about the quality of the data being analyzed, the interoperability of the databases and software, and potential infringements on privacy.

In the public sector, data mining applications were initially used to detect fraud and waste, but their use has since expanded to purposes such as measuring and improving program performance. The most frequent public-sector uses of data mining are in the following areas:

  • improving service or performance;
  • detecting fraud, waste, and abuse;
  • analyzing scientific and research information;
  • managing human resources;
  • detecting criminal activities or patterns; and
  • analyzing intelligence and detecting terrorist activities.[9]


Data quality

Data quality is a multifaceted issue that represents one of the biggest challenges for data mining. Data quality refers to the accuracy and completeness of the data. It can also be affected by the structure and consistency of the data being analyzed. The presence of duplicate records, the lack of data standards, the timeliness of updates, and human error can significantly reduce the effectiveness of the more complex data mining techniques, which are sensitive to subtle differences that may exist in the data. To improve data quality, it is sometimes necessary to "clean" the data. Cleaning can involve removing duplicate records, normalizing the values used to represent information in the database (e.g., ensuring that "no" is represented as a 0 throughout the database, and not sometimes as a 0, sometimes as an N, etc.), accounting for missing data points, removing unneeded data fields, identifying anomalous data points (e.g., an individual whose age is shown as 142 years), and standardizing data formats (e.g., changing dates so they all use the MM/DD/YYYY format).
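The cleaning steps described above can be sketched as a small Python routine. The field names, sample rows, accepted date formats, and the 120-year age threshold are all assumptions chosen for illustration; a real project would set these from its own data standards.

```python
from datetime import datetime

# Invented raw rows for illustration only.
RAW = [
    {"name": "Ann Lee", "opted_in": "N", "age": "34", "joined": "2004-05-01"},
    {"name": "Ann Lee", "opted_in": "0", "age": "34", "joined": "05/01/2004"},
    {"name": "Bo Cruz", "opted_in": "yes", "age": "142", "joined": "2003.06.01"},
    {"name": "Cy Park", "opted_in": "no", "age": "", "joined": "06/15/2005"},
]

YES_NO = {"y": 1, "yes": 1, "1": 1, "n": 0, "no": 0, "0": 0}
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%Y.%m.%d")  # assumed input formats
MAX_PLAUSIBLE_AGE = 120  # assumed threshold for flagging anomalies


def normalize_date(text):
    """Return the date as MM/DD/YYYY, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None


def clean(rows):
    """Normalize values, drop exact duplicates, and flag implausible ages."""
    cleaned, flagged, seen = [], [], set()
    for row in rows:
        age_text = row["age"].strip()
        age = int(age_text) if age_text.isdigit() else None  # missing -> None
        record = {
            "name": row["name"].strip(),
            "opted_in": YES_NO.get(row["opted_in"].strip().lower()),
            "age": age,
            "joined": normalize_date(row["joined"].strip()),
        }
        key = tuple(record.values())
        if key in seen:  # duplicate once values are normalized
            continue
        seen.add(key)
        if age is not None and age > MAX_PLAUSIBLE_AGE:
            flagged.append(record)  # kept, but marked for review
        cleaned.append(record)
    return cleaned, flagged


if __name__ == "__main__":
    cleaned, flagged = clean(RAW)
    print(cleaned)
    print(flagged)
```

Note that the two "Ann Lee" rows only become recognizable as duplicates after normalization, which is why deduplication runs on the normalized record rather than on the raw row. Anomalous records are flagged rather than deleted, since whether to correct or discard them is a judgment the sketch does not try to make.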

All data collection efforts suffer from accuracy concerns to some degree. Ensuring the accuracy of information can require costly protocols that may not be cost effective if the data is not of inherently high economic value. In well-managed data mining projects, the original data collecting organization is likely to be aware of the data’s limitations and account for these limitations accordingly. However, such awareness may not be communicated or heeded when data is used for other purposes. For example, the accuracy of information collected through a shopper’s club card may suffer for a variety of reasons, including the lack of identity authentication when a card is issued, cashiers using their own cards for customers who do not have one, and customers who use multiple cards.[10] For the purposes of marketing to consumers, the impact of these inaccuracies on the individual is negligible. If, however, a government agency were to use that information to target individuals based on food purchases associated with particular religious observances, an outcome based on inaccurate information could be, at the least, a waste of the agency’s resources and an unpleasant experience for the misidentified individual.

Anti-terrorism activities

Since the terrorist attacks of September 11, 2001, data mining has been seen increasingly as a useful tool to help detect terrorist threats by improving the collection and analysis of public and private sector data. One response to these concerns was the creation of the Information Awareness Office (IAO) at the Defense Advanced Research Projects Agency (DARPA) in January 2002. The role of IAO was "in part to bring together, under the leadership of one technical office director, several existing DARPA programs focused on applying information technology to combat terrorist threats."[11] The mission statement for IAO suggested that the emphasis on these technology programs was to "counter asymmetric threats by achieving total information awareness useful for preemption, national security warning, and national security decision making."[12]

In a report on information sharing and analysis to address the challenges of homeland security, it was noted that agencies at all levels of government are now interested in collecting and mining large amounts of data from commercial sources.[13] The report noted that agencies may use such data not only for investigations of known terrorists, but also to perform large-scale data analysis and pattern discovery in order to discern potential terrorist activity by unknown individuals. Such use of data mining by federal agencies has raised public and congressional concerns regarding privacy.

Legal issues

Federal government access to and mining of information on individuals held in a multiplicity of databases, public and private, raises a plethora of issues — both legal and policy. To what extent should the government be able to gather and mine information about individuals to aid the war on terrorism?[14] Should unrestricted access to personal information be permitted? Should limitations, if any, be imposed on the government’s access to personal information? In resolving these issues, the current state of the law in this area may be consulted. The following is a description of selected information access, collection and disclosure laws and regulations that relate to these issues.

Laws governing federal government access to information

Generally there are no blanket prohibitions on federal government access to publicly available information (e.g., real property records, liens, mortgages, etc.). Occasionally a statute will specifically authorize access to such data. The USA PATRIOT Act of 2001, for example, in transforming the Treasury Department’s Financial Crimes Enforcement Network (FinCEN) from an administratively established bureau to one established by statute, specified that it was to provide government-wide access to information collected under the anti-money laundering laws, records maintained by other government offices, as well as privately and publicly held information.

Other government agencies have also availed themselves of computer software products that provide access to a range of personal information. The FBI reportedly purchases personal information from ChoicePoint, Inc., a provider of identification and credential verification services, for data analysis.[15]

The Federal Agency Data Mining Reporting Act of 2007 requires the Department of Homeland Security to provide Congress a detailed description of each DHS activity that meets the Act’s definition of “data mining,” including the methodology and technology used, the sources of the data being analyzed, the legal authority for the activity, a discussion of the activity’s efficacy in achieving its purpose, and an analysis of the activity’s impact on privacy and the policies and procedures in place to protect the privacy and due process rights of individuals.[16]

Privacy concerns

Mining government and private databases containing personal information creates a range of privacy concerns. Through data mining, government agencies can quickly and efficiently obtain information on individuals or groups by exploiting large databases containing personal information aggregated from public and private records. Information can be developed about a specific individual or about unknown individuals whose behavior or characteristics fit a specific pattern. Before data aggregation and data mining came into use, personal information contained in paper records stored at widely dispersed locations, such as courthouses or other government offices, was relatively difficult to gather and analyze. As one expert noted, data mining technologies that provide for easy access and analysis of aggregated data challenge the concept of privacy protection afforded to individuals through the inherent inefficiency of government agencies analyzing paper, rather than aggregated, computer records.[17]

Privacy concerns about mined or analyzed personal data also include concerns about the quality and accuracy of the mined data; the use of the data for other than the original purpose for which the data were collected without the consent of the individual (mission creep); the protection of the data against unauthorized access, modification, or disclosure; and the right of individuals to know about the collection of personal information, how to access that information, and how to request a correction of inaccurate information.[18]

Some observers contend that tradeoffs may need to be made regarding privacy to ensure security. Other observers suggest that existing laws and regulations regarding privacy protections are adequate, and that these initiatives do not pose any threats to privacy. Still other observers argue that not enough is known about how data mining projects will be carried out, and that greater oversight is needed. There is also some disagreement over how privacy concerns should be addressed. Some observers suggest that technical solutions are adequate. In contrast, some privacy advocates argue in favor of creating clearer policies and exercising stronger oversight. As data mining efforts move forward, Congress may consider a variety of questions, including the degree to which government agencies should use and mix commercial data with government data, whether data sources are being used for purposes other than those for which they were originally designed, and the possible application of the 1974 Privacy Act to these initiatives.

References

  1. See Lou Agosta, "Data Mining Is Dead—Long Live Predictive Analytics!" (Forrester Research) (Oct. 30, 2003) (full-text).
  2. Safeguarding Privacy in the Fight Against Terrorism, at viii n.*.
  3. NICCS, Explore Terms: A Glossary of Common Cybersecurity Terminology (full-text).
  4. Survey of DHS Data Mining Activities, at 4 (footnotes omitted).
  5. The limitation to predictive, "pattern-based" data mining is significant because analysis performed within the ODNI and its constituent elements for counterterrorism and similar purposes is also performed using various types of link analysis tools. These tools start with a known or suspected terrorist or other subject of foreign intelligence interest and use various methods to uncover links between the known subject and potential associates or other persons with whom that subject is or has been in contact. The Act does not include such analyses within its definition of "data mining" because such analyses are not "pattern-based." Rather, these analyses rely on inputting the "personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals," which is excluded from the definition of "data mining" under the Act.
  6. 42 U.S.C. §2000ee-3(b)(1).
  7. Id. §2000ee-3(b)(2).
  8. Big Data and Privacy: A Technological Perspective, at 24.
  9. See Data Mining: Federal Efforts Cover a Wide Range of Uses.
  10. Department of Defense, Technology and Privacy Advisory Comm., Safeguarding Privacy in the Fight Against Terrorism 40 (Mar. 2004).
  11. Department of Defense, Report to Congress Regarding the Terrorism Information Awareness Program, Executive Summary, at 2 (May 20, 2003).
  12. Id. at 1 (emphasis added).
  13. Creating a Trusted Information Network for Homeland Security (Markle Foundation) (Dec. 2003).
  14. The Markle Foundation Task Force on National Security in the Information Age has proposed guidelines to allow the effective use of information (including the use of data mining technologies) in the war against terrorism while respecting individuals’ interests in the use of private information. See Markle Foundation Task Force on National Security in the Information Age: Protecting America’s Freedom in the Information Age 32-34 (Oct. 2002).
  15. Glenn R. Simpson, "Big Brother-in-Law: If the FBI Hopes to Get the Goods on You, It May Ask ChoicePoint — U.S. Agencies’ Growing Use of Outside Data Suppliers Raises Privacy Concerns," Wall St. J., Apr. 13, 2001 (The company “specialize[s] in doing what the law discourages the government from doing on its own — culling, sorting and packaging data on individuals from scores of sources, including credit bureaus, marketers and regulatory agencies.”)
  16. 42 U.S.C. §2000ee-3(c)(2).
  17. K.A. Taipale, “Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data,” 5 Columbia Sci. & Tech. L. Rev. (2003-04).
  18. These privacy concerns are reflected in the Fair Information Practices proposed in 1980 by the Organization for Economic Cooperation and Development and endorsed by the U.S. Department of Commerce in 1981. These practices govern collection limitation, purpose specification, use limitation, data quality, security safeguards, openness, individual participation, and accountability.
