INTRODUCTION
Data mining is the process of extracting previously unknown information from large databases or data warehouses and using it to make crucial business decisions. Data mining tools find patterns in the data and infer rules from them. The extracted information can be used to form a prediction or classification model, identify relations between database records, or provide a summary of the databases being mined. Those patterns and rules can be used to guide decision making and forecast the effect of those decisions, and data mining can speed analysis by focusing attention on the most important variables.
BACKGROUND
We are drowning in data, but starving for knowledge. In recent years the volume of stored information has increased dramatically; some researchers suggest that it doubles every year. Disk storage per person (DSP) is one way to measure the growth in personal data: Edelstein (2003) estimated that this figure grew from 28 MB in 1996 to 472 MB in 2000.
Data mining appears to be the most promising solution to the dilemma of having too much data and too little knowledge. By using pattern recognition technologies and statistical and mathematical techniques to sift through warehoused information, data mining helps analysts recognize significant facts, relationships, trends, patterns, exceptions, and anomalies. The use of data mining can advance a company's position by creating a sustainable competitive advantage. Data warehousing and mining is the science of managing and analyzing large datasets and discovering novel patterns (Davenport & Harris, 2007; Wang, 2006; Olafsson, 2006).
Data mining is taking off for several reasons: organizations are gathering more data about their businesses, storage costs have dropped enormously, competitive pressures are rising, companies want to leverage existing information technology investments, and the cost/performance ratio of computer systems has improved dramatically. Another reason is the rise of data warehousing. In the past, it was often necessary to gather the data, cleanse it, and merge it; now, in many cases, the data are already sitting in a data warehouse ready to be used.
Over the last 40 years, the tools and techniques for processing data and information have evolved from databases to data warehousing and further to data mining. Data warehousing applications have become business-critical, and data mining can extract even more value from these huge repositories of information. Data mining is a multidisciplinary field that draws on databases, statistics, artificial intelligence, pattern recognition, machine learning, information theory, control theory, operations research, information retrieval, data visualization, and high-performance, parallel, and distributed computing (Zhou, 2003; Hand, Mannila, & Smyth, 2001).
Many of the underlying statistical models emerged long ago, and machine learning has marked a milestone in the evolution of computer science. Although data mining is still in its infancy, it is now being used in a wide range of industries and for a range of tasks in a variety of contexts (Wang, 2003; Lavoie, Dempsey, & Connaway, 2006). Data mining is synonymous with knowledge discovery in databases, knowledge extraction, data/pattern analysis, data archeology, data dredging, data snooping, data fishing, information harvesting, and business intelligence (Han & Kamber, 2001).
MAIN FOCUS
Functionalities and Tasks
The common types of information that can be derived from data mining operations are associations, sequences, classifications, clusters, and forecasting. Associations occur when occurrences are linked in a single event. One of the most popular association applications is market basket analysis, which uses frequency and probability functions to estimate the percentage chance of occurrences. Business strategists can leverage market basket analysis through techniques such as cross-selling and up-selling. In sequences, events are linked over time; this is particularly applicable in e-business for Website analysis.
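The chapter does not prescribe a particular tool, but a minimal market basket sketch in Python illustrates the frequency and probability functions involved. The transactions, item names, and the rule "bread implies butter" are illustrative assumptions.

```python
# Minimal market basket analysis sketch: support and confidence for a
# hypothetical rule "bread -> butter" over a small set of transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the rule holds when its antecedent appears."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "butter"}, transactions))       # 0.5
print(confidence({"bread"}, {"butter"}, transactions))  # about 0.67
```

A strategist would read the second number as "customers who buy bread also buy butter about two times out of three," which is the kind of percentage chance the text refers to.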
Classification is probably the most common data mining activity today. It recognizes patterns that describe the group to which an item belongs. It does this by examining existing items that have already been classified and inferring a set of rules from them. Clustering is related to classification but differs in that no groups have yet been defined; using clustering, the data mining tool discovers different groupings within the data. The resulting groups, or clusters, help the end user make sense of vast amounts of data (Kudyba & Hoptroff, 2001). All of these applications may involve predictions. The fifth application type, forecasting, is a different form of prediction: it estimates the future value of continuous variables based on patterns within the data.
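A minimal clustering sketch, assuming scikit-learn's KMeans and an invented two-column customer table, shows the key contrast with classification: no class labels are supplied, and the tool discovers the groupings itself.

```python
# Clustering sketch: the algorithm is given unlabeled records and finds groups.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer records: [annual spend, visits per month]
customers = np.array([
    [200, 1], [220, 2], [250, 1],     # low-spend, infrequent visitors
    [1500, 8], [1600, 9], [1450, 7],  # high-spend, frequent visitors
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: two discovered customer groups
```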
Algorithms and Methodologies
Neural Networks
Often discussed under the umbrella of artificial intelligence (AI), neural networks rely on predictive algorithms. The technology shares many characteristics with regression: the application examines historical data, fits a functional form that relates the explanatory variables to the target variable while minimizing the error between what the model produces and what actually occurred in the past, and then applies this function to future data. Neural networks are more complex, however, as they incorporate intensive program architectures to identify linear, non-linear, and patterned relationships in historical data.
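A minimal sketch of this fit-then-apply pattern, assuming scikit-learn's MLPRegressor and a made-up historical series, is shown below. The network is trained to minimize error on past explanatory/target pairs and then applied to an unseen input.

```python
# Neural-network regression sketch: fit to historical data, predict future data.
import numpy as np
from sklearn.neural_network import MLPRegressor

X_hist = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # explanatory variable
y_hist = np.array([2.1, 4.0, 6.2, 7.9, 10.1])           # target variable

model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
model.fit(X_hist, y_hist)            # minimize error against what actually occurred
print(model.predict([[6.0]]))        # apply the learned function to future data
```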
Decision Trees
Megaputer (2006) noted that this method can be applied only to classification tasks. Applying it to a training set produces a hierarchical structure of classifying rules of the type "if ... then ...", which takes the form of a tree. To decide which class an object or situation should be assigned to, one answers the questions located at the tree nodes, starting from the root. Following this procedure eventually leads to one of the final nodes (called leaves), where the analyst finds the class to which the considered object should be assigned.
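A minimal sketch, assuming scikit-learn and an invented training set, makes the "if ... then ..." structure concrete: the fitted tree can be printed as the hierarchy of questions that leads from the root to a leaf.

```python
# Decision-tree sketch: learn classifying rules from a labeled training set.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training set: [age, income] -> buys product (1) or not (0)
X = [[25, 30000], [45, 80000], [35, 60000], [50, 20000], [23, 25000]]
y = [0, 1, 1, 0, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the if/then rules
```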
Genetic Algorithms (or Evolutionary Programming)
Genetic algorithms, a biologically inspired search method, borrow mechanisms of inheritance to find solutions. Biological systems demonstrate flexibility, robustness, and efficiency, and many are good at adapting to their environments. Biological mechanisms such as reproduction, crossover, and mutation can therefore be used as an approach to computer-based problem solving. An initial population of solutions is created randomly, and only a fixed number of candidate solutions are kept from one generation to the next. Solutions that are less fit tend to die off, echoing the biological notion of "survival of the fittest".
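A minimal plain-Python sketch of the loop just described follows: random initial population, survival of the fittest, crossover, and mutation. The fitness function (maximize -(x - 3)^2) is an illustrative assumption.

```python
# Genetic-algorithm sketch: evolve a population toward the best solution.
import random

def fitness(x):
    return -(x - 3.0) ** 2               # the best possible solution is x = 3

population = [random.uniform(-10, 10) for _ in range(20)]  # random initial population
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]          # the less fit half "dies off"
    children = []
    while len(children) < 10:
        a, b = random.sample(survivors, 2)
        child = (a + b) / 2.0            # crossover: blend two parent solutions
        child += random.gauss(0, 0.1)    # mutation: small random change
        children.append(child)
    population = survivors + children    # fixed population size per generation

print(max(population, key=fitness))      # converges close to 3.0
```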
Regression Analysis
This technique involves specifying a functional form that best describes the relationship between the explanatory (driving, or independent) variables and the target (dependent) variable the decision maker is looking to explain. Business analysts typically use regression to identify the quantitative relationships that exist between variables and to forecast into the future.
Regression models also enable analysts to perform "what if" or sensitivity analysis. Examples include estimating how response rates would change if a particular marketing or promotional campaign were launched, or how certain compensation policies would affect employee performance.
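A minimal sketch of both uses, assuming scikit-learn's LinearRegression and invented campaign data, is given below: the fitted coefficient quantifies the variable relationship, and predicting at a new input answers a simple "what if" question.

```python
# Regression sketch: quantify a relationship, then forecast and ask "what if".
import numpy as np
from sklearn.linear_model import LinearRegression

spend = np.array([[10], [20], [30], [40]])   # explanatory variable: promo spend ($k)
response = np.array([1.2, 2.1, 3.2, 3.9])    # target variable: response rate (%)

reg = LinearRegression().fit(spend, response)
print(reg.coef_[0])          # change in response rate per extra $1k of spend
print(reg.predict([[50]]))   # "what if" we spend $50k on the next campaign?
```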
Logistic Regression
Logistic regression should be used when you want to predict the outcome of a dichotomous (e.g., yes/no) variable. The method is used for data that are not normally distributed (i.e., do not follow a bell-shaped curve), such as categorical (coded) data. When a dependent variable can take only one of two values, such as "will graduate" or "will not graduate", a normal distribution cannot be obtained.
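A minimal sketch using the graduation example, with scikit-learn and invented GPA figures as assumptions, shows how the model returns both a predicted class and a probability for the dichotomous outcome.

```python
# Logistic-regression sketch for a yes/no (dichotomous) outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

gpa = np.array([[2.0], [2.5], [3.0], [3.3], [3.6], [3.9]])  # explanatory variable
graduated = np.array([0, 0, 0, 1, 1, 1])                    # 1 = "will graduate"

clf = LogisticRegression().fit(gpa, graduated)
print(clf.predict([[3.2]]))        # predicted class for a new student
print(clf.predict_proba([[3.2]]))  # estimated probability of each outcome
```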
Memory Based Reasoning (MBR) or the Nearest Neighbor Method
To forecast a future situation, or to make a correct decision, such systems find the closest past analogs of the present situation and choose the solution that was the right one in those past situations. The drawback of this approach is that there is no guarantee the resulting clusters provide any value to the end user; they may simply not make sense in the context of the overall business environment. Because of these limitations, no predictive, "what if", or variable/target analysis can be performed.
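A minimal nearest-neighbor sketch, assuming scikit-learn's KNeighborsClassifier and an invented loan history, shows the core mechanic: a new situation is given the same outcome as its closest past analog.

```python
# Memory-based reasoning / nearest-neighbor sketch.
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical past situations: [credit score, income] -> loan repaid (1) or not (0)
past_cases = [[600, 30000], [620, 35000], [700, 60000], [720, 65000]]
outcomes   = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=1).fit(past_cases, outcomes)
print(knn.predict([[710, 62000]]))  # decision copied from the closest past case
```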
The key differentiator between classification and segmentation on the one hand, and the regression and neural network technologies mentioned above on the other, is the inability of the former to perform sensitivity analysis or forecasting.
Applications and Benefits
Data mining can be used widely in science and business for analyzing databases, gathering data, and solving problems. According to Berry and Linoff (2004), the benefits data mining can provide for businesses are limitless. Here are just a few examples:
• Identify best prospects and then retain them as customers.
By concentrating marketing efforts only on the best prospects, companies save time and money and increase the effectiveness of their marketing operations.
• Predict cross-sell opportunities and make recommendations.
Both traditional and Web-based operations can help customers quickly locate products of interest to them and simultaneously increase the value of each communication with the customers.
• Learn parameters influencing trends in sales and margins.
In the majority of cases we have no clear idea which combination of parameters influences an operation (a black box). In these situations data mining is the only real option.
• Segment markets and personalize communications.
There might be distinct groups of customers, patients, or natural phenomena that require different approaches in their handling.
The importance of collecting data that reflect specific business or scientific activities to achieve competitive advantage is widely recognized. Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range companies. However, the bottleneck of turning this data into information is the difficulty of extracting knowledge about the system being studied from the collected data. Human analysts without special tools can no longer make sense of enormous volumes of data that require processing in order to make informed business decisions (Kudyba & Hoptroff, 2001).
The applications of data mining are everywhere: from biomedical data (Hu and Xu, 2005) to mobile user data (Goh and Taniar, 2005); from data warehousing (Tjioe and Taniar, 2005) to intelligent web personalization (Zhou, Cheung, & Fong, 2005); from analyzing clinical outcome (Hu, Song, Han, Yoo, Prestrud, Brennan, & Brooks, 2005) to mining crime patterns (Bagui, 2006).
Potential Pitfalls
Data Quality
Data quality refers to the accuracy and completeness of the data. It is a multifaceted issue that represents one of the biggest challenges for data mining.
The data quality problem has become more important with the emergence of large volumes of data, since many business and industrial applications critically rely on the quality of information stored in diverse databases and data warehouses. As Seifert (2004) emphasized, data quality can be affected by the structure and consistency of the data being analyzed. Other factors, such as the presence of duplicate records, the lack of data standards, the timeliness of updates, and human error, can also significantly impact the effectiveness of complex data mining techniques, which are sensitive to subtle differences in data. Improving data quality sometimes requires cleaning the data: removing duplicate records, standardizing the values or symbols used in the database to represent certain information, accounting for missing data points, removing unneeded data fields, and identifying abnormal data points.
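A minimal cleaning sketch with pandas (the library choice, column names, and rules are illustrative assumptions) walks through several of the steps just listed.

```python
# Data-cleaning sketch: duplicates, inconsistent symbols, missing values, unneeded fields.
import pandas as pd

df = pd.DataFrame({
    "customer":  ["Acme", "Acme", "Beta Co", "Gamma"],
    "state":     ["NY", "NY", "new york", None],
    "revenue":   [100.0, 100.0, 250.0, 80.0],
    "legacy_id": [1, 1, 2, 3],                      # unneeded data field
})

df = df.drop_duplicates()                            # remove duplicate records
df["state"] = df["state"].replace({"new york": "NY"})  # standardize symbols
df["state"] = df["state"].fillna("UNKNOWN")          # account for missing data points
df = df.drop(columns=["legacy_id"])                  # remove unneeded data fields
print(df)
```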
Interoperability
Interoperability refers to the ability of computer systems and/or data to work with other systems or data using common standards or processes. Until recently, some government agencies elected not to gamble with any level of open access and operated isolated information systems. But isolated data is in many ways useless data: bits of valuable information on the September 11, 2001 hijackers' activities may have been stored in a variety of databases at the federal, state, and local government levels, but that information was not collated and made available to those who needed it to glimpse a complete picture of the growing threat. Seifert (2004) therefore suggested that interoperability is a critical part of the larger effort to improve interagency collaboration and information sharing. For public data mining, interoperability of databases and software is important to enable the search and analysis of multiple databases simultaneously, and it also ensures the compatibility of the data mining activities of different agencies.
Standardization
Standardization allows you to arrange customer information in a consistent format. Among the biggest challenges are inconsistent abbreviations, misspellings, and variant spellings. The types of data that can be appended include demographic, geographic, psychographic, behavioristic, event-driven, and computed data. Matching allows you to identify similar data within and across your data sources; one of its greatest challenges is creating a system that incorporates your "business rules," or criteria for determining what constitutes a match.
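A minimal plain-Python sketch of standardization plus rule-based matching follows; the abbreviation table and the 0.85 similarity cutoff stand in for whatever business rules an organization actually adopts.

```python
# Standardization and matching sketch: expand abbreviations, then compare names
# against a similarity threshold that acts as the "business rule" for a match.
from difflib import SequenceMatcher

ABBREVIATIONS = {"inc.": "incorporated", "co.": "company", "intl": "international"}

def standardize(name):
    words = name.lower().split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def is_match(a, b, threshold=0.85):
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio() >= threshold

print(is_match("Acme Inc.", "Acme Incorporated"))  # True: treated as the same customer
print(is_match("Acme Inc.", "Apex Industries"))    # False: below the threshold
```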
Preventing Decay
The worst enemy of information is time. And information decays at different rates (Berry & Linoff, 2004). Cleaning your database is a large accomplishment, but it will be short-lived if you fail to implement procedures for keeping it clean at the source. According to the second law of thermodynamics, ordered systems tend to disorder, and a database is a very ordered system. Contacts move. Companies grow. Knowledge workers enter new customer information incorrectly.
Some information simply starts out wrong, the result of data input errors such as typos, transpositions, omissions, and other mistakes; these are often easy to avoid. Finding ways to successfully implement new data quality technologies as part of a comprehensive program not only increases the quality of your customer information but also saves time, reduces frustration, improves customer relations, and ultimately increases revenue. Without constant attention, your information quality will disintegrate.
No Generalizations to a Population
In statistics a population is defined, a sample is collected to make inferences about that population, and the model is specified before looking at the data; the data cannot be re-used. Data mining does not attempt generalizations to a population: the database itself is considered the population. With the computing power of modern computers, data miners can use the whole database, making sampling redundant, and the data can be re-used. In data mining it is common practice to try hundreds of models and keep the one that fits best, which makes the interpretation of statistical significance difficult. Machine learning is the data mining counterpart of regression: a training set is used to train the system to find the dependent variable.
FUTURE TRENDS
Predictive Analysis
Augusta (2004) suggested that predictive analysis is one of the major future trends for data mining. Rather than being just about mining large amounts of data, predictive analytics aims to actually understand the data content and to forecast based on it, although this requires complex programming and a great deal of business acumen. Practitioners are looking to do more than simply archive data, which is what data mining is currently known for; they want not just to process the data but to understand it more clearly, which in turn allows better predictions about future behavior. With predictive analytics, the program scours the data and tries to form, or help form, new hypotheses itself. This shows great promise and would be a boon for industries everywhere.
Diversity of Application Domains
Tuzhilin (2006) coined the term the "Data Mining and X" phenomenon, where X constitutes a broad range of fields in which data mining is used for analyzing data. This has produced a cross-fertilization of ideas generated within a diverse population of researchers interacting across the traditional boundaries of their disciplines. The next generation of data mining applications covers a large number of fields, from traditional businesses to advanced scientific research. Kantardzic and Zurada (2005) observed that, with new tools, methodologies, and infrastructure, this trend of diversification will continue each year.
CONCLUSION
The emergence of new information technologies has given us much more data and many more options for how to use it. Yet managing that flood of data, and making it useful and available to decision makers, remains a major organizational challenge. Data mining allows the extraction of diamonds of knowledge from huge historical mines of data. It helps to predict outcomes of future situations, to optimize business decisions, to increase the value of each customer and communication, and to improve customer satisfaction.
The management of data requires understanding and a skill set far beyond mere programming. Managing data mining is a new undertaking, as analysts will have to sift through more and more information daily due to the ever-increasing size of the Web and the growing volume of consumer purchases. Data mining can have enormous rewards if properly used, and we have an unprecedented opportunity for the future if we can avoid data mining's pitfalls.
KEY TERMS
Data Mining: The process of automatically searching large volumes of data for patterns. Data mining is a relatively recent topic in computing.
Data Visualization: A technology for helping users to see patterns and relationships in large amounts of data by presenting the data in graphical form.
Explanatory Variables: Those variables that explain the variation of a particular target variable. Also called driving, descriptive, or independent variables; the terms are used interchangeably.
Information Quality Decay: Quality of some data goes down when facts about real world objects change over time, but those facts are not updated in the database.
Information Retrieval: The art and science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data.
Machine Learning: Concerned with the development of algorithms and techniques that allow computers to "learn".
Neural Networks: A technology, often discussed under the umbrella of artificial intelligence (AI), that utilizes predictive algorithms.
Pattern Recognition: The act of taking in raw data and taking an action based on the category of the data. It is a field within the area of machine learning.
Predictive Analysis: Use of data mining techniques, historical data, and assumptions about future conditions to predict outcomes of events.
Segmentation: Another major group within the world of data mining, involving technology that not only identifies statistically significant relationships between explanatory and target variables, but also determines noteworthy segments within variable categories that illustrate prevalent impacts on the target variable.