Data Mining

SIC 7372

INDUSTRY SNAPSHOT

Data mining is a process in which special software applications are used to analyze multiple sets of information and find patterns or relationships between them. Once patterns or links have been established between information sets (normally large databases), data mining techniques enable users to make informed decisions or predictions based on the results of their analyses.

Data mining applications use algorithms to produce desired results. Simply stated, an algorithm is a problem-solving method that uses a sequence of steps or actions to achieve an objective. In recent years, this class of software has grown more powerful due to the increased processing power and data storage capacity of modern computer systems.

During the 1980s, the earliest data mining applications were somewhat limited in scope. While applications were more robust by the mid-1990s, they still required expert users who were versed in statistical analysis. By the early 2000s, things had changed. A new, more user-friendly generation of data mining software allowed everyday professionals to run analyses themselves for a variety of purposes. Corporate professionals used these applications to solve pressing business problems, law enforcement agencies used them to predict crime patterns and catch wrongdoers, and financially strapped state governments used them to identify tax-delinquent individuals and businesses.

Data mining applications are marketed by leading software industry heavyweights, including IBM, Oracle, SPSS, and SAS Institute Inc. These companies market their offerings within the broader software industry, which achieved sales of approximately $170 billion in the mid-2000s.

ORGANIZATION AND STRUCTURE

Data mining applications make up just one of the many segments within the larger software industry. To best understand this emerging segment, one need only look to the end users who apply data mining applications to a wide swath of challenges. It is they who drive the development of new products and services.

In mid-2005, KDnuggets—a leading newsletter about data mining, knowledge discovery, Web mining, and related topics—conducted a survey to determine the industries and fields where data mining techniques were being applied. Of the 421 votes received, the leading 10 categories were customer relationship management (12%), banking (12%), credit scoring (8%), direct marketing/fundraising (8%), fraud detection (7%), retail (6%), insurance (6%), telecommunications (5%), manufacturing (5%), and science (4%).

As for the specific data mining techniques employed by end users, many of which are highly technical, a subsequent April 2006 KDnuggets poll of 176 voters revealed that decision trees/rules (51.1%), clustering (39.8%), and regression (38.1%) led the top ten, followed by statistics (36.4%), association rules (30.7%), visualization (21.6%), support vector machines (SVM) (17.6%), neural networks (17.6%), sequence/time series analysis (13.6%), and Bayesian methods (13.6%).

Some industry observers estimate that roughly half of data mining users come from larger companies with annual revenues exceeding $1 billion, while nearly 20 percent are from smaller organizations or consultancies with revenues of less than $500,000.

Regardless of the industry in which it is used, the size of the organization using it, or the specific techniques being applied, data mining is more suitable for certain purposes than it is for others. According to the November/December 1999 issue of IEEE Intelligent Systems, "Good data-mining application areas require knowledge-based decisions; have accessible, sufficient, and relevant data; have a changing environment; have sub-optimal current methods; will not be obsoleted by imminent new technology; and provide a high payoff for the correct decisions."

Applications of Data Mining

While the basic principle behind data mining remains essentially the same from one setting to the next, the benefits vary depending on the context in which the technique is applied. In the business realm, companies use data mining to gain new insight about their customers. For example, research from consumer goods giant Procter & Gamble revealed that fathers often purchased beer during after-work trips to the store for items like milk and diapers. This type of intelligence is extremely useful to store planners, who determine where to place merchandise. One leading national clothing retailer stopped stocking all of its stores the same way when data mining revealed that demand for colors and sizes of the same garments varied considerably among major metropolitan markets. The company was thus able to sell more effectively by tailoring the product mix to each individual market.
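The kind of co-purchase pattern described above is commonly expressed as an association rule with a support and a confidence value. The following is a minimal sketch of that idea, not the method used by any company named here. The function name `pair_rules`, the `min_support` threshold, and the sample baskets are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    support    = share of baskets containing both items
    confidence = share of baskets with the antecedent that also
                 contain the consequent
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields a rule in both directions.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first; names break ties for a stable order.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Hypothetical shopping baskets (made-up data, for illustration only).
baskets = [
    ["diapers", "milk", "beer"],
    ["diapers", "beer"],
    ["milk", "bread"],
    ["diapers", "milk", "beer"],
    ["bread", "beer"],
]

rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, every basket that contains diapers also contains beer, so the rule from diapers to beer ranks first. Production tools typically consider larger item sets and additional measures, which this sketch leaves out.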

Pizza Hut is one example of how a leading restaurant chain used data mining to increase sales. By late 2003, Pizza Hut laid claim to the fast food industry's largest consumer data warehouse, containing information on some 40 million households (40 to 50 percent of the U.S. market). Pizza Hut's initial database included approximately 10 years of telephone pizza orders. Using Teradata Warehouse Miner software, the chain was able to improve the accuracy of its data and increase the profitability of its direct marketing efforts.

Businesses also use data mining to up-sell or cross-sell to customers. This works extremely well online, where leading retailers like Amazon.com and L.L. Bean combine detailed data about customers' viewing and purchase histories to recommend other relevant products. However, recommendations that appear to be irrelevant can irritate customers. This is one reason why some leading retailers continue to combine data mining and human insight to produce meaningful recommendations.

The business sector's use of data mining extends beyond customer interactions. For example, the sheer size of some companies has been known to cause duplication of effort. Data mining was used to prevent one division of a leading aerospace company from embarking on a multi-million dollar project that was already being conducted in another division. Data mining also is employed by NASDAQ to prevent illegal stock trades, which are sometimes attempted in advance of negative company information disclosures.

Within the education market, school administrators use data mining to identify patterns among a wide range of data (including teacher seniority, attendance, disciplinary action, immigration patterns, student demographics, test scores, grades, and medical problems) and determine which students are likely to have trouble or drop out of school. Such analysis also enables educators to tailor curricula to students in a more individualized way and choose which new programs to develop. Some systems enable educators to analyze more than 200 different combinations of data. One of the main drawbacks to using data mining in the education market is the cost. This was especially true during the early and mid-2000s, when many school districts faced budget deficits.

Dire financial circumstances were also a factor in data mining investments for many state governments, but for the opposite reason. Despite budget shortfalls that in many cases exceeded the $1 billion mark, many states pressed on with multi-million-dollar data mining projects because of their potential to maximize tax collections. States like Iowa, Texas, and Virginia used data mining to compare tax collection records with a diverse array of private and public information sources (including federal tax returns, property tax records, driver's license records, aircraft and boat registrations, credit reports, state contract listings, building permits, customs declarations, and unemployment insurance records) and identify tax-delinquent individuals or companies. Some systems were capable of comparing information from as many as 100 different databases in one search.
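The cross-referencing described above can be illustrated with a very small sketch. It assumes, purely for illustration, that every source shares a common taxpayer identifier. Real systems would likely have to match on names and addresses that do not line up exactly. The function name `find_non_filers`, the source names, and the identifiers are all hypothetical.

```python
def find_non_filers(tax_filers, external_sources):
    """Map each ID absent from tax_filers to the sources that mention it.

    tax_filers: iterable of IDs that appear in the state's tax records.
    external_sources: dict of source name -> iterable of IDs in that source.
    """
    filed = set(tax_filers)
    flagged = {}
    for source_name, ids in external_sources.items():
        for taxpayer_id in ids:
            if taxpayer_id not in filed:
                flagged.setdefault(taxpayer_id, []).append(source_name)
    return flagged


# Hypothetical records (made-up identifiers, for illustration only).
tax_filers = ["A100", "A101", "A102"]
external_sources = {
    "boat_registrations": ["A100", "B200"],
    "building_permits": ["A101", "B200", "C300"],
}

flagged = find_non_filers(tax_filers, external_sources)
for taxpayer_id, sources in sorted(flagged.items()):
    print(taxpayer_id, "appears in", ", ".join(sources))
```

An identifier that turns up in several outside sources but not in the tax records is a candidate for follow-up, not proof of delinquency.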

In its May 8, 2002, issue, the Wall Street Journal reported that this technique allowed Texas to collect $158 million between 1999 and 2001 alone. Iowa and Virginia reported collections in the $30 million range over a three- to five-year period. States were expected to begin using similar data mining approaches to detect welfare fraud. In these cases, eligibility records could be compared with information about house or car purchases. Data mining was especially attractive to many states because leading industry vendors were willing to provide or upgrade data mining systems at no cost in exchange for a cut of tax recoveries over a specified time period.

Government use of data mining to identify lawbreakers led to outcries from consumer privacy proponents, including the Washington, D.C.-based Electronic Privacy Information Center. These groups asserted that governments did not have the right to cull through non-public information sources, and that such behavior was an invasion of privacy. The same criticisms were applied to the federal government's efforts to use data mining in the war against terrorism. This was especially true of the Pentagon's Total Information Awareness (TIA) project, a $64 million effort headed by retired Admiral John Poindexter to develop a system for analyzing public and private records for terrorist information.

In addition to government initiatives directed at uncovering terrorists, federal, state, and local law enforcement agencies were...
