Data mining

Overview
Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.

Data mining enables corporations and government agencies to analyze massive volumes of data quickly and relatively inexpensively. The use of this type of information retrieval has been driven by the exponential growth in the volumes and availability of information collected by the public and private sectors, as well as by advances in computing and data storage capabilities. In response to these trends, generic data mining tools are increasingly available for, or built into, major commercial database applications. Today, mining can be performed on many types of data, including those in structured, textual, spatial, Web, or multimedia forms.

Data mining applications can use a variety of parameters to examine the data. They include association (patterns where one event is connected to another event), sequence or path analysis (patterns where one event leads to another event, such as the birth of a child leading to purchases of diapers), classification (assigning records to categories based on patterns identified in the data), clustering (finding and visually documenting groups of previously unknown facts), and forecasting (discovering patterns from which one can make reasonable predictions regarding future activities).
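
As a rough illustration of the association technique described above, the following Python sketch computes two common measures for a rule such as "shoppers who buy diapers also buy baby wipes": support (how often both items appear together) and confidence (how often the second item appears when the first does). The baskets, item names, and the `rule_metrics` helper are invented for illustration and are not drawn from any real data set or tool.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all transactions containing both items
    confidence = share of transactions containing the antecedent that
                 also contain the consequent
    """
    total = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent in t)
    with_both = sum(
        1 for t in transactions if antecedent in t and consequent in t
    )
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical shopping baskets (assumed data, for illustration only).
baskets = [
    {"diapers", "baby wipes", "milk"},
    {"diapers", "baby wipes"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

support, confidence = rule_metrics(baskets, "diapers", "baby wipes")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

With these four baskets, the rule has a support of 0.50 and a confidence of about 0.67. Practical association mining tools typically search over many candidate rules and keep only those above chosen thresholds; this sketch evaluates a single rule.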

Data mining has become increasingly common in both the public and private sectors. Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research, and increase sales. For example, the insurance and banking industries use data mining applications to detect fraud and assist in risk assessment (e.g., credit scoring). Using customer data collected over several years, companies can develop models that predict whether a customer is a good credit risk, or whether an accident claim may be fraudulent and should be investigated more closely.

The medical community sometimes uses data mining to help predict the effectiveness of a procedure or medicine. Pharmaceutical firms use data mining of chemical compounds and genetic material to help guide research on new treatments for diseases. Retailers can use information collected through affinity programs (e.g., shoppers’ club cards, frequent flyer points, contests) to assess the effectiveness of product selection, placement decisions, and coupon offers, and to identify which products are often purchased together. Companies such as telephone service providers and music clubs can use data mining to perform “churn analysis,” which assesses which customers are likely to remain as subscribers and which ones are likely to switch to a competitor.

The proliferation of data mining has raised implementation and oversight issues, including concerns about the quality of the data being analyzed, the interoperability of the databases and software, and potential infringements on privacy.

In the public sector, data mining applications were initially used to detect fraud and waste, but they are now also used for purposes such as measuring and improving program performance. The most frequent public-sector uses of data mining are in the following areas:


 * improving service or performance;
 * detecting fraud, waste, and abuse;
 * analyzing scientific and research information;
 * managing human resources;
 * detecting criminal activities or patterns; and
 * analyzing intelligence and detecting terrorist activities.

Definitions
Although the use and sophistication of data mining have increased in both the government and the private sector, data mining remains an ambiguous term. According to some experts, data mining overlaps a wide range of analytical activities, including data profiling, data warehousing, online analytical processing, and enterprise analytical applications. Some of the terms used to describe data mining or similar analytical activities include “factual data analysis” and “predictive analytics.”

Government reports
Government reports have defined "data mining" variously:
 * The Government Accountability Office (GAO) defined data mining in its May 2004 report entitled "Data Mining: Federal Efforts Cover a Wide Range of Uses" as “the application of database technology and techniques &mdash; such as statistical analysis and modeling &mdash; to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results.”
 * The Congressional Research Service (CRS) defined data mining in more generic terms in its January 27, 2006, report to Congress entitled "Data Mining and Homeland Security: An Overview." It states that data mining “involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.” The report describes data mining as using a “discovery approach” in which algorithms examine data relationships to identify patterns. It distinguishes this method from analytical tools that use a “verification based approach,” where the user develops a hypothesis and then uses data to test the hypothesis.
 * The Department of Homeland Security Office of the Inspector General (DHS OIG) defined data mining in its August 2006 "Survey of DHS Data Mining Activities" simply as “the process of knowledge discovery, predictive modeling, and analytics.” It stated that this has traditionally involved the discovery of patterns and relationships from structured databases of historical occurrences.

House Committee report
The [http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=2006_record&docid=cr28se06-150 House Conf. Rept. No. 109-699] has defined “data mining” as


 * a query or search or other analysis of 1 or more electronic databases, whereas &mdash; (A) at least 1 of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement; (B) a department or agency of the Federal Government or a non-Federal entity acting on behalf of the Federal Government is conducting the query or search or other analysis to find a predictive pattern indicating terrorist or criminal activity; and (C) the search does not use a specific individual’s personal identifiers to acquire information concerning that individual.

This definition is to be used by government departments and agencies in evaluating whether or not their information processing activities constitute data mining activities.

Federal legislation
The Federal Agency Data Mining Reporting Act of 2007 defines “data mining” as:
 * a program involving pattern-based queries, searches, or other analyses of 1 or more electronic databases, where &mdash;
 ** (A) a department or agency of the Federal Government, or a non-Federal entity acting on behalf of the Federal Government, is conducting the queries, searches, or other analyses to discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity on the part of any individual or individuals;
 ** (B) the queries, searches, or other analyses are not subject-based and do not use personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals, to retrieve information from the database or databases; and
 ** (C) the purpose of the queries, searches, or other analyses is not solely &mdash;
 *** (i) the detection of fraud, waste, or abuse in a Government agency or program; or
 *** (ii) the security of a Government computer system.

The Act expressly excludes queries, searches, or analyses that are conducted solely in electronic databases of publicly available information: telephone directories, news reporting services, databases of legal and administrative rulings, and other databases and services providing public information without a fee.

Two aspects of the Act’s definition of “data mining” are worth emphasizing. First, the definition is limited to pattern-based electronic searches, queries, or analyses; activities that use only personally identifiable information (PII) or other terms specific to individuals (e.g., a license plate number or vessel registration number) as search terms are excluded from the definition. Second, the definition is limited to searches, queries, or analyses that are conducted for the purpose of identifying predictive patterns or anomalies that are indicative of terrorist or criminal activity by an individual or individuals. Research in electronic databases that produces only a summary of historical trends, therefore, is not “data mining” under the Act.
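
The contrast between subject-based and pattern-based queries can be illustrated with a short Python sketch. Everything in it is hypothetical: the vessel records, the field names, and the selection criteria are invented solely to contrast the two query styles. It does not reflect any agency's system, and it is not a statement of the Act's legal test.

```python
# Hypothetical vessel-arrival records (invented for illustration).
ARRIVALS = [
    {"vessel_reg": "VR-1001", "port_calls_30d": 2, "cargo_declared": True},
    {"vessel_reg": "VR-1002", "port_calls_30d": 14, "cargo_declared": False},
    {"vessel_reg": "VR-1003", "port_calls_30d": 3, "cargo_declared": True},
]


def subject_based_lookup(records, vessel_reg):
    """Retrieve records using an identifier tied to one specific subject."""
    return [r for r in records if r["vessel_reg"] == vessel_reg]


def pattern_based_search(records, min_port_calls):
    """Retrieve records that match a pattern; no identifier is supplied."""
    return [
        r for r in records
        if r["port_calls_30d"] >= min_port_calls and not r["cargo_declared"]
    ]


by_subject = subject_based_lookup(ARRIVALS, "VR-1001")
by_pattern = pattern_based_search(ARRIVALS, min_port_calls=10)
```

The first query starts from a known registration number, which is the kind of search term the passage above describes as excluded. The second starts from a pattern and returns whichever records happen to match it. Whether a real program falls within the definition also depends on the other statutory elements, such as its purpose.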

Data quality
Data quality is a multifaceted issue that represents one of the biggest challenges for data mining. Data quality refers to the accuracy and completeness of the data. Data quality can also be affected by the structure and consistency of the data being analyzed. The presence of duplicate records, the lack of data standards, the timeliness of updates, and human error can significantly reduce the effectiveness of the more complex data mining techniques, which are sensitive to subtle differences that may exist in the data. To improve data quality, it is sometimes necessary to “clean” the data, which can involve removing duplicate records, normalizing the values used to represent information in the database (e.g., ensuring that “no” is represented as a 0 throughout the database, and not sometimes as a 0, sometimes as an N, etc.), accounting for missing data points, removing unneeded data fields, identifying anomalous data points (e.g., an individual whose age is shown as 142 years), and standardizing data formats (e.g., changing dates so they all follow the MM/DD/YYYY format).
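
Several of the cleaning steps described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the field names (`name`, `opted_in`, `age`, `signup`), the accepted spellings of "yes" and "no", the age cutoff of 120, the input date formats, and the rule for deciding that two records are duplicates are all hypothetical choices made for the example.

```python
from datetime import datetime

NO_VALUES = {"0", "n", "no"}             # assumed spellings of "no"
YES_VALUES = {"1", "y", "yes"}           # assumed spellings of "yes"
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")  # assumed input date formats
MAX_PLAUSIBLE_AGE = 120                  # assumed cutoff for anomalous ages


def normalize_flag(value):
    """Map the various spellings of yes/no to 1/0; None if unrecognized."""
    text = str(value).strip().lower()
    if text in NO_VALUES:
        return 0
    if text in YES_VALUES:
        return 1
    return None


def normalize_date(value):
    """Return the date as MM/DD/YYYY, or None if it cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None


def clean(records):
    """Deduplicate, normalize, and flag anomalies in a list of dict records."""
    seen = set()
    cleaned = []
    for rec in records:
        row = {
            "name": rec["name"].strip(),
            "opted_in": normalize_flag(rec["opted_in"]),
            "age": rec["age"],
            "signup": normalize_date(rec["signup"]),
        }
        # Assumed duplicate rule: same name (case-insensitive) and signup date.
        key = (row["name"].lower(), row["signup"])
        if key in seen:
            continue
        seen.add(key)
        # Flag, rather than silently drop, missing or implausible ages.
        age = row["age"]
        row["age_suspect"] = age is None or not (0 <= age <= MAX_PLAUSIBLE_AGE)
        cleaned.append(row)
    return cleaned


raw = [
    {"name": "A. Smith", "opted_in": "N", "age": 34, "signup": "2006-01-27"},
    {"name": "a. smith ", "opted_in": "0", "age": 34, "signup": "01/27/2006"},
    {"name": "B. Jones", "opted_in": "yes", "age": 142, "signup": "08/15/2006"},
]
cleaned = clean(raw)
```

Here the second record is dropped as a duplicate of the first once names and dates are normalized, and the record with an age of 142 is kept but flagged for review. Flagging rather than deleting is a design choice for this sketch; whether an anomalous value should be corrected, removed, or retained depends on the project.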

All data collection efforts suffer from accuracy concerns to some degree. Ensuring the accuracy of information can require costly protocols that may not be cost-effective if the data is not of inherently high economic value. In well-managed data mining projects, the original data collecting organization is likely to be aware of the data’s limitations and to account for them accordingly. However, such awareness may not be communicated or heeded when data is used for other purposes. For example, the accuracy of information collected through a shopper’s club card may suffer for a variety of reasons, including the lack of identity authentication when a card is issued, cashiers using their own cards for customers who do not have one, and customers who use multiple cards. For the purposes of marketing to consumers, the impact of these inaccuracies on the individual is negligible. If, however, a government agency were to use that information to target individuals based on food purchases associated with particular religious observances, an outcome based on inaccurate information could be, at the least, a waste of resources for the government agency and an unpleasant experience for the misidentified individual.

Anti-terrorism activities
Since the terrorist attacks of September 11, 2001, data mining has been seen increasingly as a useful tool to help detect terrorist threats by improving the collection and analysis of public and private sector data. One response was the creation of the Information Awareness Office (IAO) at the Defense Advanced Research Projects Agency (DARPA) in January 2002. The role of IAO was “in part to bring together, under the leadership of one technical office director, several existing DARPA programs focused on applying information technology to combat terrorist threats.” The mission statement for IAO suggested that the emphasis on these technology programs was to “counter asymmetric threats by achieving total information awareness useful for preemption, national security warning, and national security decision making.”

A report on information sharing and analysis to address the challenges of homeland security noted that agencies at all levels of government are now interested in collecting and mining large amounts of data from commercial sources. The report added that agencies may use such data not only for investigations of known terrorists, but also to perform large-scale data analysis and pattern discovery in order to discern potential terrorist activity by unknown individuals. Such use of data mining by federal agencies has raised public and congressional concerns regarding privacy.

Legal issues
Federal government access to and mining of information on individuals held in a multiplicity of databases, public and private, raises a range of legal and policy issues. To what extent should the government be able to gather and mine information about individuals to aid the war on terrorism? Should unrestricted access to personal information be permitted? What limitations, if any, should be imposed on the government’s access to personal information? In resolving these issues, the current state of the law in this area may be consulted. The following is a description of selected information access, collection, and disclosure laws and regulations that relate to these issues.

Laws governing federal government access to information
Generally, there are no blanket prohibitions on federal government access to publicly available information (e.g., real property records, liens, mortgages, etc.). Occasionally a statute will specifically authorize access to such data. The USA PATRIOT Act, for example, in transforming the Treasury Department’s Financial Crimes Enforcement Network (FinCEN) from an administratively established bureau to one established by statute, specified that it was to provide government-wide access to information collected under the anti-money laundering laws, records maintained by other government offices, as well as privately and publicly held information.

Other government agencies have also availed themselves of computer software products that provide access to a range of personal information. The FBI reportedly purchases personal information from ChoicePoint, Inc., a provider of identification and credential verification services, for data analysis.

Privacy concerns
Mining government and private databases containing personal information creates a range of privacy concerns. Through data mining, government agencies can quickly and efficiently obtain information on individuals or groups by exploiting large databases containing personal information aggregated from public and private records. Information can be developed about a specific individual or about unknown individuals whose behavior or characteristics fit a specific pattern. Before data aggregation and data mining came into use, personal information contained in paper records stored at widely dispersed locations, such as courthouses or other government offices, was relatively difficult to gather and analyze. As one expert noted, data mining technologies that provide for easy access and analysis of aggregated data challenge the concept of privacy protection afforded to individuals through the inherent inefficiency of government agencies analyzing paper, rather than aggregated, computer records.

Privacy concerns about mined or analyzed personal data also include concerns about the quality and accuracy of the mined data; the use of the data, without the consent of the individual, for purposes other than the one for which the data were originally collected (mission creep); the protection of the data against unauthorized access, modification, or disclosure; and the right of individuals to know about the collection of personal information, how to access that information, and how to request a correction of inaccurate information.

Some observers contend that tradeoffs may need to be made regarding privacy to ensure security. Other observers suggest that existing laws and regulations regarding privacy protections are adequate, and that these initiatives do not pose any threats to privacy. Still other observers argue that not enough is known about how data mining projects will be carried out, and that greater oversight is needed. There is also some disagreement over how privacy concerns should be addressed. Some observers suggest that technical solutions are adequate. In contrast, some privacy advocates argue in favor of creating clearer policies and exercising stronger oversight. As data mining efforts move forward, Congress may consider a variety of questions, including the degree to which government agencies should use and mix commercial data with government data, whether data sources are being used for purposes other than those for which they were originally designed, and the possible application of the 1974 Privacy Act to these initiatives.