What Is Data Mining?
Data mining involves the automatic extraction of valuable insights from extensive data repositories. By utilizing data mining techniques, organizations can analyze large datasets to uncover hidden patterns and trends. Additionally, these techniques enable predictions about future behaviors, such as estimating a customer's spending in both online and physical retail environments.
Not all information discovery tasks are considered to be data mining.
Queries such as searching for specific records in a database or locating web pages with certain keywords can be efficiently handled by database management and information retrieval systems. These systems utilize advanced computer science methods, including indexing structures and query processing algorithms, to organize and retrieve data from extensive repositories. Moreover, data mining techniques have been integrated to further enhance performance by improving the relevance of search results to user queries.
Data Mining and Knowledge Discovery in Databases
Data mining plays a crucial role in knowledge discovery in databases (KDD), which involves transforming raw data into valuable information. This comprehensive process includes several key steps, ranging from data preprocessing to the postprocessing of mining results.
Figure 1.1 The process of knowledge discovery in databases (KDD), with preprocessing steps (feature selection, dimensionality reduction, normalization, data subsetting) and postprocessing steps (filtering patterns, visualization, pattern interpretation).
Data can be stored in various formats, such as flat files, spreadsheets, or relational tables, and may be centralized or distributed across multiple locations. The primary goal of data preprocessing is to convert raw input data into a suitable format for analysis. This process involves several crucial steps, including merging data from different sources, cleaning the data to eliminate noise and duplicates, and selecting relevant records and features for the data mining task. Due to the diverse methods of data collection and storage, preprocessing is often the most labor-intensive and time-consuming phase in the knowledge discovery process.

"Closing the loop" refers to the integration of data mining results into decision support systems, particularly in business applications where insights enhance campaign management tools for effective marketing promotions. This integration necessitates a postprocessing step to filter valid and useful results, with visualization techniques enabling analysts to examine data from multiple perspectives. Additionally, hypothesis testing methods can be employed during postprocessing to discard misleading data mining outcomes.

Motivating Challenges
Traditional data analysis methods frequently struggle to address the challenges presented by big data applications. These difficulties have highlighted the need for innovative solutions, leading to the development of data mining techniques to effectively tackle these specific challenges.
Scalability Because of advances in data generation and collection, data sets with sizes of terabytes, petabytes, or even exabytes are becoming common.
To effectively manage large data sets, data mining algorithms must be scalable, employing specialized search strategies to address exponential search challenges. This scalability often necessitates innovative data structures for efficient record access and may involve out-of-core algorithms for data sets that exceed main memory capacity. Additionally, enhancing scalability can be achieved through sampling techniques or the development of parallel and distributed algorithms.
A general overview of techniques for scaling up data mining algorithms is given in Appendix F.
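As a concrete illustration of the sampling strategy mentioned above, the following sketch shows reservoir sampling, a classic single-pass technique for drawing a fixed-size uniform random sample from a data set far too large to hold in main memory. The stream of one million integers is only a stand-in for a large data file.

```python
# A minimal sketch of reservoir sampling (a sampling-based route to scalability):
# keep a fixed-size uniform random sample of an arbitrarily large stream in one pass.
import random

def reservoir_sample(stream, k, seed=0):
    """Return k records chosen uniformly at random from an iterable of any size."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)        # fill the reservoir first
        else:
            j = rng.randint(0, i)           # replace elements with decreasing probability
            if j < k:
                reservoir[j] = record
    return reservoir

# Example: sample 5 records from a "stream" of one million records.
print(reservoir_sample(range(1_000_000), k=5))
```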
High dimensionality is increasingly common in data sets, particularly in fields like bioinformatics, where advancements in microarray technology have led to gene expression data with thousands of attributes. Data sets with temporal or spatial elements, such as temperature measurements taken over time and across various locations, also exhibit high dimensionality. Traditional data analysis methods, designed for low-dimensional data, often struggle with these high-dimensional data sets due to challenges like the curse of dimensionality and rapidly increasing computational complexity as the number of features grows.
The increasing complexity and heterogeneity of data in fields like business, science, and medicine necessitate advanced data analysis techniques that can accommodate diverse attributes beyond traditional continuous or categorical types. Modern data sets often include non-traditional formats such as web and social media content, which encompasses text, images, and videos, as well as intricate DNA sequences and climate data with varied measurements over time and space. Effective mining of these complex data objects requires an understanding of their inherent relationships, including temporal and spatial autocorrelation, graph connectivity, and hierarchical structures found in semi-structured text and XML documents.
Data ownership and distribution can complicate analysis when necessary information is not centralized or controlled by a single organization. Often, data is spread across various locations and managed by multiple entities, necessitating the creation of advanced distributed data mining techniques to effectively gather and analyze this fragmented information.
Distributed data mining algorithms encounter several key challenges, including minimizing communication requirements for efficient computation, effectively integrating results from diverse data sources, and ensuring data security and privacy.
Non-traditional analysis diverges from the traditional statistical method, which relies on a hypothesize-and-test framework involving hypothesis formulation, data collection through experiments, and subsequent analysis. This conventional approach is labor-intensive, often necessitating the generation and evaluation of thousands of hypotheses. To address this challenge, data mining techniques have emerged to automate the hypothesis generation and evaluation process. Additionally, the data sets used in data mining are frequently opportunistic samples rather than carefully curated random samples, highlighting the need for more flexible analytical methods.
1.3 The Origins of Data Mining
Data mining, once seen as a step within the knowledge discovery in databases (KDD) framework, has evolved into a significant academic discipline in computer science, encompassing data preprocessing, mining, and postprocessing. Its roots date back to the late 1980s, when workshops on knowledge discovery in databases began to unite researchers from various fields to explore the application of computational techniques for extracting actionable insights from large datasets. These workshops eventually transformed into popular conferences, attracting both academic and industry professionals. The growing interest from businesses in hiring data mining experts has significantly contributed to the rapid expansion of this field.
The field builds on methodologies and algorithms that researchers had previously developed and used. In particular, data mining draws on statistical concepts such as sampling, estimation, and hypothesis testing, as well as search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning.
Data mining has rapidly integrated concepts from various fields such as optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval to effectively address the challenges associated with mining big data.
A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing.
Figure 1.2 Data mining as a confluence of many disciplines, including AI, machine learning, statistics, database technology, and parallel and distributed computing.
Data mining integrates various disciplines and utilizes advanced techniques, including high-performance parallel computing, to effectively manage and analyze large data sets.
Distributed techniques are crucial for managing data size and are particularly beneficial when data cannot be centralized. This approach highlights the interconnectedness of data mining with various other domains, as illustrated in Figure 1.2.
Data Science and Data-Driven Discovery
Data science is an interdisciplinary field focused on extracting valuable insights from data using various tools and techniques. While it is considered an emerging domain with its own identity, data science incorporates methods from diverse areas such as data mining, statistics, artificial intelligence, machine learning, pattern recognition, database technology, and distributed computing.
Data science has emerged as a distinct field, acknowledging that traditional data analysis disciplines often lack the comprehensive tools necessary for addressing the diverse challenges presented by new applications.
Analyzing data from social media and the Web presents unique challenges for social scientists, necessitating a diverse set of computational, mathematical, and statistical skills. Unlike traditional research methods reliant on surveys, contemporary studies leverage web mining, natural language processing (NLP), network analysis, data mining, and advanced statistics. This modern approach enables researchers to quantitatively measure human behavior on a large scale, highlighting the importance of collaboration with skilled analysts to effectively navigate the complexities of such vast amounts of data.
Thus, data science is, by necessity, a highly interdisciplinary field that builds on the continuing work of many fields.
Data science employs a data-driven approach that focuses on uncovering patterns and relationships within large datasets, often minimizing the need for extensive domain expertise. A prime illustration of this success is seen in the advancements of neural networks and deep learning, which have excelled in challenging tasks such as object recognition in images and speech recognition. However, this is merely one example; significant progress has been made across various other domains of data analysis, many of which will be explored further in this book.
Some cautions on potential limitations of a purely data-driven approach are given in the Bibliographic Notes.
Data Mining Tasks

Data mining tasks are generally divided into two major categories:
Predictive tasks aim to forecast the value of a specific attribute using other attributes' values The attribute being predicted is referred to as the target or dependent variable, while the attributes utilized for making the prediction are known as explanatory or independent variables.
Descriptive tasks aim to identify patterns such as correlations, trends, clusters, trajectories, and anomalies within data, summarizing the underlying relationships These tasks are typically exploratory and often necessitate post-processing techniques to validate and clarify the findings.
Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.
Predictive modeling involves creating a model that predicts a target variable based on explanatory variables. It encompasses two main tasks: classification, which deals with discrete target variables, and regression, which focuses on continuous target variables. For instance, predicting if a web user will buy from an online bookstore is a classification task due to the binary nature of the target variable.
Figure 1.3 Four of the core data mining tasks.
Forecasting stock prices is a regression task due to the continuous nature of the price attribute. The primary objective is to develop a model that minimizes the error between the predicted and actual values of the target variable. Predictive modeling has diverse applications, including identifying customers likely to engage with marketing campaigns, predicting ecological disruptions, and diagnosing diseases based on medical test results.
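To make the idea of minimizing the error between predicted and actual target values concrete, the following sketch fits a simple least-squares regression model; the data and coefficients are synthetic and purely illustrative, not part of the book's examples.

```python
# A minimal sketch of regression as predictive modeling: choose model
# coefficients that minimize the squared error between predicted and actual
# values of a continuous target variable. The data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 1)) * 10                        # one explanatory variable
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, 100)      # continuous target with noise

X_aug = np.column_stack([X, np.ones(len(X))])        # add an intercept column
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)        # least-squares fit

predictions = X_aug @ w
mse = np.mean((predictions - y) ** 2)
print(f"slope={w[0]:.2f}, intercept={w[1]:.2f}, mean squared error={mse:.2f}")
```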
To predict the species of an Iris flower, we classify it into one of three types: Setosa, Versicolour, or Virginica, using a dataset that includes various flower characteristics. The well-known Iris dataset from the UCI Machine Learning Repository provides the essential data, including sepal width, sepal length, petal length, and petal width. A plot illustrates the relationship between petal width and petal length for 150 flowers, with petal width categorized into low, medium, and high based on specific intervals.
Petal length is similarly categorized into three groups, low, medium, and high, corresponding to the intervals [0, 2.5), [2.5, 5), and [5, ∞), respectively. These classifications of petal width and length enable the derivation of the following rules:
Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.
Although these classification rules do not encompass all of the flowers, they effectively categorize the majority. Notably, flowers of the Setosa species are distinctly separated from the Versicolour and Virginica species in terms of petal width and length; however, there is some overlap between the latter two species regarding these characteristics.
Figure 1.4 Petal width versus petal length for 150 Iris flowers.
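The rules above can be tried directly on the Iris data. The sketch below loads the data set through scikit-learn and applies the rules; the petal-length cut points (2.5 cm and 5 cm) come from the text, while the petal-width cut points (0.75 cm and 1.75 cm) are assumed here purely for illustration.

```python
# A minimal sketch of the rule-based classification of Iris flowers described above.
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # cm
petal_width = iris.data[:, 3]    # cm

def bin_value(x, low_cut, high_cut):
    """Discretize a measurement into 'low', 'medium', or 'high'."""
    if x < low_cut:
        return "low"
    if x < high_cut:
        return "medium"
    return "high"

correct = 0
for pl, pw, label in zip(petal_length, petal_width, iris.target):
    length_bin = bin_value(pl, 2.5, 5.0)     # intervals [0, 2.5), [2.5, 5), [5, inf)
    width_bin = bin_value(pw, 0.75, 1.75)    # assumed cut points for petal width
    if width_bin == "low" and length_bin == "low":
        prediction = "setosa"
    elif width_bin == "medium" and length_bin == "medium":
        prediction = "versicolor"
    elif width_bin == "high" and length_bin == "high":
        prediction = "virginica"
    else:
        prediction = None                    # the rules do not cover every flower
    if prediction == iris.target_names[label]:
        correct += 1

print(f"The rules correctly classify {correct} of {len(petal_length)} flowers")
```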
Association analysis aims to uncover patterns that highlight strongly associated features within data, often represented as implication rules or feature subsets. Due to the exponential nature of its search space, the primary objective is to efficiently extract the most interesting patterns. This technique has valuable applications, such as identifying groups of functionally related genes, analyzing web pages frequently accessed together, and exploring the relationships among various components of Earth's climate system.
Market basket analysis utilizes point-of-sale data from grocery store transactions to uncover patterns in consumer behavior. By applying association analysis, we can identify items frequently purchased together, such as the rule {Diapers} → {Milk}, indicating that customers who buy diapers often also buy milk. This insight can be leveraged to create effective cross-selling opportunities for related products.
Transaction ID   Items
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
10               {Tea, Eggs, Cookies, Diapers, Milk}
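To show how such a rule is quantified, the sketch below computes the support and confidence of {Diapers} → {Milk} over a tiny, hypothetical transaction list modeled on the rows above.

```python
# A minimal sketch of evaluating the association rule {Diapers} -> {Milk}.
# The transaction list is hypothetical, for illustration only.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Coffee", "Eggs"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Diapers", "Milk"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

antecedent, consequent = {"Diapers"}, {"Milk"}
n = len(transactions)
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
antecedent_only = sum(1 for t in transactions if antecedent <= t)

support = both / n                     # fraction of transactions containing both item sets
confidence = both / antecedent_only    # how often Milk appears given Diapers
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```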
Cluster analysis identifies groups of closely related observations, ensuring that items within the same cluster exhibit greater similarity than those in different clusters. This technique is widely applied in various fields, including customer segmentation, identifying oceanic regions that significantly influence Earth's climate, and data compression.
Document clustering allows for the organization of news articles by topic, as demonstrated by a collection of articles that can be categorized into two distinct groups. Each article is represented by word-frequency pairs, where the word (w) is linked to its occurrence count (c) within the text. The first cluster includes four articles focused on economic news, while the second cluster encompasses four articles related to health care. An effective clustering algorithm should accurately identify these two groups by analyzing the similarities in word usage across the articles.
Table 1.2 Collection of news articles (each row gives the article number followed by its word: frequency pairs).
1 dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2 machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3 job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4 domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5 patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6 pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7 death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8 medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
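As a sketch of how a clustering algorithm could recover the two groups, the following code builds a document-term matrix from the word-frequency pairs in Table 1.2 and applies k-means from scikit-learn; with only eight short articles, the result is illustrative rather than definitive.

```python
# A minimal sketch of clustering the articles of Table 1.2 by word usage.
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

articles = [
    {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2, "index": 3},
    {"domestic": 3, "forecast": 2, "gain": 1, "market": 2, "sale": 3, "price": 2},
    {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    {"pharmaceutical": 2, "company": 3, "drug": 2, "vaccine": 1, "flu": 3},
    {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 3, "director": 2},
    {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 1},
]

X = DictVectorizer(sparse=False).fit_transform(articles)   # document-term matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # articles 1-4 (economy) and 5-8 (health care) should separate
```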
Anomaly detection involves identifying observations that significantly differ from the majority of the data, referred to as anomalies or outliers. The primary objective of an anomaly detection algorithm is to accurately identify true anomalies while minimizing the misclassification of normal data as anomalies. An effective anomaly detector should achieve a high detection rate alongside a low false alarm rate. This technique has various applications, including fraud detection, network intrusion identification, monitoring unusual disease patterns, and assessing ecosystem disturbances such as droughts, floods, fires, and hurricanes.
Credit card companies monitor transactions and personal data of cardholders, including credit limits, age, income, and address. Due to the rarity of fraudulent transactions compared to legitimate ones, anomaly detection methods are utilized to establish user profiles based on normal transaction behavior. When a new transaction occurs, it is assessed against the user's established profile, and if significant discrepancies are detected, the transaction is marked as potentially fraudulent.
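A minimal version of this profile-based approach is sketched below: the profile is simply the mean and standard deviation of past transaction amounts, and a new transaction is flagged if it deviates by more than a chosen number of standard deviations. The amounts and threshold are hypothetical.

```python
# A minimal sketch of profile-based anomaly detection for credit card transactions.
import statistics

past_amounts = [23.5, 41.0, 18.2, 35.7, 29.9, 44.1, 31.3, 26.8]   # normal history
mean = statistics.mean(past_amounts)
stdev = statistics.stdev(past_amounts)

def is_suspicious(amount, threshold=3.0):
    """Flag a transaction more than `threshold` standard deviations from the mean."""
    z_score = abs(amount - mean) / stdev
    return z_score > threshold

print(is_suspicious(37.0))    # False: consistent with the user's profile
print(is_suspicious(950.0))   # True: far outside normal behavior
```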
1.5 Scope and Organization of the Book
This book presents key principles and techniques in data mining from an algorithmic viewpoint, essential for comprehending the application of data mining technology across diverse data types. It also serves as an introductory resource for readers aspiring to conduct research in this field.
This book begins with a technical exploration of data in Chapter 2, covering fundamental types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity, which are crucial for effective data analysis. Chapters 3 and 6 focus on classification, with Chapter 3 introducing decision tree classifiers and key concepts such as overfitting, underfitting, model selection, and performance evaluation. Building on this foundation, Chapter 6 delves into various classification techniques, including rule-based systems, nearest neighbor classifiers, Bayesian classifiers, artificial neural networks (including deep learning), support vector machines, and ensemble classifiers, while also addressing multiclass and imbalanced class challenges.
These topics can be covered independently.
Association analysis is thoroughly examined in Chapters 4 and 7. Chapter 4 covers the fundamentals, including frequent itemsets, association rules, and key algorithms for their generation, along with special types of frequent itemsets, such as maximal, closed, and hyperclique itemsets, that are important for data mining. The chapter also addresses evaluation measures for association analysis. In Chapter 7, advanced topics are explored, focusing on the application of association analysis to both categorical and continuous data, as well as data organized within a concept hierarchy, which categorizes objects hierarchically, such as store items.
This chapter explores the extension of association analysis to identify sequential patterns that involve order, as well as patterns in graphs and negative relationships, where the presence of one item indicates the absence of another.
Cluster analysis is discussed in Chapters 5 and 8. Chapter 5 first describes the different types of clusters, and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN.
Methods for validating the results of a clustering algorithm are also discussed. Additional clustering techniques are introduced in Chapter 8, including fuzzy and probabilistic clustering, self-organizing maps (SOM), graph-based clustering, spectral clustering, and density-based clustering. The chapter also addresses scalability challenges and key considerations for choosing an appropriate clustering algorithm.
Chapter 9 focuses on anomaly detection, introducing fundamental definitions and exploring various methods, including statistical, distance-based, density-based, clustering-based, reconstruction-based, one-class classification, and information theoretic approaches. Chapter 10 complements previous discussions by delving into essential statistical concepts.
To avoid spurious results in data mining, it is crucial to understand concepts such as statistical hypothesis testing, p-values, the false discovery rate, and permutation testing. The book also includes Appendices A through F, which provide a concise overview of essential topics like linear algebra, dimensionality reduction, statistics, regression, optimization, and methods for scaling data mining techniques to handle big data effectively.
Data mining is a rapidly evolving field that has grown significantly, making it too extensive to be fully addressed in a single book. Important topics like data quality are mentioned briefly, with selected references available in the Bibliographic Notes section of the relevant chapter. Additionally, subjects not included in this book, such as mining streaming data and privacy-preserving data mining, are also referenced in the Bibliographic Notes of this chapter.
Bibliographic Notes

Data mining has inspired a variety of textbooks catering to different aspects of the field. Introductory texts include works by Dunham, Han et al., Hand et al., Roiger and Geatz, Zaki and Meira, and Aggarwal. For those interested in business applications, notable titles are by Berry and Linoff, Pyle, and Parr Rud. Books focusing on statistical learning are authored by Cherkassky and Mulier, as well as Hastie et al., while works emphasizing machine learning and pattern recognition include contributions from Duda et al., Kantardzic, Mitchell, and Webb [57], and Witten and Frank [58]. There are also some more specialized books:
The field of data mining encompasses a diverse range of research contributions, including Chakrabarti's work on web mining, Fayyad et al.'s early articles on data mining and visualization techniques, Grossman et al.'s insights into applications in science and engineering, Kargupta and Chan's exploration of distributed data mining, Wang et al.'s advancements in bioinformatics, and Zaki and Ho's studies on parallel data mining.
Several prominent conferences focus on data mining, including the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
Data mining papers can also be found in other major conferences such as the Conference and Workshop on Neural Information Processing Systems (NIPS), the International Conference on Machine Learning (ICML), the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the National Conference on Artificial Intelligence (AAAI), the IEEE International Conference on Big Data, and the IEEE International Conference on Data Science and Advanced Analytics (DSAA).
Prominent journals in the field of data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, ACM Transactions on Knowledge Discovery from Data, Statistical Analysis and Data Mining, and Information Systems. Additionally, there are several open-source data mining tools available, such as Weka and scikit-learn. Recently, advanced data mining software like Apache Mahout and Apache Spark has emerged, specifically designed to tackle large-scale challenges on distributed computing platforms.
Data mining is a crucial component of the knowledge discovery process, as outlined by Fayyad et al. [19]. Numerous articles have explored its definition and its connections to other disciplines, especially statistics.
Chen et al. provide a database perspective on data mining, while Ramakrishnan and Grama discuss its general principles and various viewpoints. Hand and Friedman highlight the distinctions between data mining and statistics, with Lambert examining the role of statistics in large data sets and the interplay between the two fields. Glymour et al. emphasize the insights that statistics can offer to data mining. Additionally, Smyth et al. note that the evolution of data mining is influenced by new data types and applications, including streams, graphs, and text, while Han et al. explore emerging applications, and Smyth outlines key research challenges in the field.
Wu et al. [59] explore the transformation of data mining research into practical applications, while Grossman et al. [24] focus on the establishment of data mining standards. Additionally, Bradley [7] addresses the scalability of data mining algorithms for handling large datasets.
The rise of new data mining applications has introduced significant challenges, particularly regarding privacy breaches in areas like web commerce and healthcare. Consequently, there is an increasing focus on creating data mining algorithms that preserve user privacy. This has led to the development of privacy-preserving data mining techniques, which involve mining encrypted or randomized data; key references in this area include works by Agrawal and Srikant, Clifton et al., Kargupta et al., and Vassilios et al.

A related concern is bias in predictive models, particularly in applications such as screening job applicants and making prison parole decisions. Evaluating the fairness of these applications is difficult because the predictive models used are often black box models that are not easily interpretable.
Data science and its related fields hold significant potential for knowledge discovery, although much of this potential remains untapped. Data-driven analysis relies primarily on observational data collected during the regular operations of various organizations, which often carries sampling biases that complicate the identification of causal factors. Consequently, interpreting predictive models built from this data can be challenging. As a result, theory, experimentation, and computational simulations will remain essential methodologies in many scientific disciplines.
A solely data-driven approach can overlook valuable existing knowledge in a specific field, leading to suboptimal model performance This may result in inaccurate predictions or an inability to adapt to new scenarios.
While a model with high predictive accuracy may suffice for practical applications in certain fields, many sectors, particularly medicine and science, prioritize understanding the underlying domain. Recent efforts are focused on developing theory-guided data science, which integrates existing domain knowledge to enhance insights and effectiveness.
Exercises
1 Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
(b) Dividing the customers of a company according to their profitability.
(c) Computing the total sales of a company.
(d) Sorting a student database based on student identification numbers.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
(f) Predicting the future stock price of a company using historical records.
(g) Monitoring the heart rate of a patient for abnormalities.
(h) Monitoring seismic waves for earthquake activities.
(i) Extracting the frequencies of a sound wave.
2 Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques such as clustering, classification, association rule mining, and anomaly detection can be applied. For example, clustering could group similar search queries to deliver more relevant results, classification could categorize web pages by content type to aid indexing, association rule mining could uncover relationships among user search patterns for personalized recommendations, and anomaly detection could flag unusual search behavior indicative of fraud or spam.
3 For each of the following data sets, explain whether or not data privacy is an important issue.

(a) Census data collected from 1900–1950.
(b) IP addresses and visit times of web users who visit your website.
(c) Images from Earth-orbiting satellites.
(d) Names and addresses of people from the telephone book.
(e) Names and email addresses collected from the Web.
This chapter discusses several data-related issues that are important for successful data mining:
Data sets vary significantly, characterized by different attributes that can be categorized as either quantitative or qualitative. Additionally, many data sets possess unique features, such as time series data or objects that have explicit relationships with one another.
The analysis of data is heavily influenced by its type, as different tools and techniques are tailored to specific data forms. Moreover, advancements in data mining research are frequently motivated by the emergence of new application areas that require innovative methods for handling diverse data types.
Data quality is crucial for effective analysis, as imperfect data can hinder accurate results. While many data mining techniques can handle some flaws, prioritizing data quality significantly enhances analysis outcomes. Common data quality issues include noise and outliers, missing or inconsistent data, duplicates, and biases that render the data unrepresentative of the intended phenomenon or population. Addressing these issues is essential for improving overall data integrity and analytical insights.
Preprocessing is essential for enhancing raw data's suitability for data mining, aiming to improve data quality and align it with specific analytical techniques. For instance, continuous attributes like length may need to be converted into discrete categories such as short, medium, or long to facilitate certain methods. Additionally, reducing the number of attributes in a dataset is common, as many techniques perform better with a smaller set of relevant features.
Analyzing data by exploring relationships among data objects allows for more effective analysis techniques, such as clustering, classification, and anomaly detection. By computing the similarity or distance between pairs of objects, analysts can leverage these relationships instead of relying solely on the data objects themselves. The choice of similarity or distance measure is crucial and should be tailored to the specific type of data and the intended application.
Example 2.1 (An Illustration of Data-Related Issues) To further illustrate the importance of these issues, consider the following hypothetical situation.
You receive an email from a medical researcher concerning a project that you are eager to work on.
I’ve attached the data file that I mentioned in my previous email.
Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields.
I will be out of town for a couple of days and won't be able to provide additional information about the data. However, I hope this won't hinder your progress too much. Once I return, could we schedule a meeting to discuss your preliminary results? I may also invite a few other team members to join us.
Thanks and see you in a couple of days.
Despite some misgivings, you proceed to analyze the data. A glance at the first few rows of the file reveals nothing unusual, so you set aside your doubts and begin the analysis. Although the data set comprises only 1,000 lines, smaller than you had hoped, you find yourself making notable progress after two days of work.
You arrive for the meeting, and while waiting for others to arrive, you strike up a conversation with a statistician who is working on the project. When she learns that you have been analyzing the project's data, she asks you for a brief summary of your results.
Statistician: So, you got the data for all the patients?
Data Miner: Yes. I haven't had much time for analysis, but I do have a few interesting results.
Statistician: Amazing. There were so many data issues with this set of patients that I couldn't do much.
Data Miner: Oh? I didn’t hear about any possible problems.
Statistician: Oh? Then you must not have heard about field 5, the variable we want to predict. It's well known among people who analyze this kind of data that results are better if you work with the logarithm of the values, but I only found that out later myself. Was it mentioned to you?

Data Miner: No.
Statistician: But surely you heard about what happened to field 4? It's supposed to be measured on a scale from 1 to 10, with 0 indicating a missing value, but because of a data entry error, all the 10s were changed into 0s. Unfortunately, since some of the patients have missing values for this field, it's impossible to say whether a recorded 0 is a real 0 or a 10. Quite a few of the records have that problem.
Data Miner: Interesting Were there any other problems?
Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
Statistician: Anyway, given all those problems, I’m surprised you were able to accomplish anything.
Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.
Statistician: What? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Statistician: Oh, no! I just remembered. We assigned ID numbers after we sorted the records based on field 5. There is a strong connection, but it's meaningless. Sorry.
While this example illustrates an extreme case, it highlights the critical need for understanding your data. This chapter will explore the four key issues identified earlier, detailing fundamental challenges and common strategies for addressing them.
Types of Data
Attributes and Measurement
In this section, we explore the various types of attributes that describe data objects. We begin by defining what an attribute is and then discuss the concept of attribute types. Finally, we outline the commonly encountered types of attributes in data analysis.
We start with a more detailed definition of an attribute.
Definition 2.1 An attribute is a property or characteristic of an object that can vary, either from one object to another or from one time to another.
Eye color is a symbolic attribute that varies among individuals, with limited possible values such as brown, black, blue, green, and hazel. In contrast, temperature is a numerical attribute that can change over time and has an unlimited range of potential values.
At the most basic level, attributes are not about numbers or symbols.
However, to discuss and more precisely analyze the characteristics of objects, we assign numbers or symbols to them To do this in a well-defined way, we need a measurement scale.
Definition 2.2 A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
Formally, the process of measurement is the application of a measure- ment scale to associate a value with a particular attribute of a specific object.
Measurement is an integral part of our daily lives, often occurring in various forms, such as checking our weight on a bathroom scale, categorizing individuals by gender, or counting chairs to ensure adequate seating for a meeting. In each instance, we translate the physical attributes of objects into numerical or symbolic values, highlighting the importance of measurement in understanding our environment.
With this background, we can discuss the type of an attribute, a concept that is important in determining if a particular data analysis technique is consistent with a specific type of attribute.
The Type of an Attribute
Attributes can be measured using various measurement scales, and the properties of an attribute may differ from those of the values used to measure it. This means that the characteristics of the values representing an attribute do not necessarily align with the characteristics of the attribute itself. Two examples illustrate this concept.
In the context of employee data, two key attributes are ID and age, both represented as integers. While calculating the average age of employees is meaningful, determining an average employee ID is not, as the primary function of an ID is to ensure uniqueness among employees. Therefore, the only relevant operation for employee IDs is to check for equality. This distinction is not apparent when using integers to represent employee IDs. Conversely, the age attribute reflects its integer representation more closely, though there are limitations, such as the fact that ages have a maximum value, unlike integers.
In Figure 2.1, various line segments illustrate how their length can be represented numerically in two distinct ways. Each line segment, starting from the top, is created by repeatedly appending the first segment to itself, making all subsequent segments multiples of the first. The right side of the figure reflects both the ordering and additivity of these lengths, while the left side only shows their order. This example highlights that measurement methods can vary in their ability to capture the complete properties of an attribute.
Understanding the type of an attribute is crucial, as it reveals which characteristics of the measured values align with the inherent properties of the attribute. This knowledge helps prevent misguided actions, such as calculating the average of employee IDs.
Figure 2.1 The measurement of the length of line segments on two different scales of measurement: a mapping of lengths to numbers that captures only the order properties of length (left), and a mapping that captures both the order and additivity properties of length (right).
The Different Types of Attributes
A straightforward method to define an attribute's type is by associating its properties with numerical characteristics. For instance, the attribute of length shares several properties with numbers, allowing for meaningful comparisons and ordering of objects based on their length. Furthermore, it is possible to discuss differences and ratios in length, highlighting the numerical operations typically employed to describe various attributes.
There are four types of attributes in data analysis: nominal, ordinal, interval, and ratio. Each type is defined by specific properties and statistical operations, with ratio attributes encompassing all the properties and operations of the preceding types. Thus, any valid property or operation for nominal, ordinal, and interval attributes is also applicable to ratio attributes. In other words, the definition of the attribute types is cumulative. However, statistical operations that are appropriate for one attribute type are not necessarily appropriate for the attribute types that precede it; for example, the mean is meaningful for interval and ratio attributes but not for nominal or ordinal ones.
Table 2.2 Different attribute types.

Categorical (Qualitative)

Nominal — The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, gender. Statistical operations: mode, entropy, contingency correlation, χ² test.

Ordinal — The values of an ordinal attribute provide enough information to order objects (<, >). Examples: {good, better, best}, grades, street numbers. Statistical operations: median, percentiles, rank correlation, run tests, sign tests.

Numeric (Quantitative)

Interval — For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, −). Examples: calendar dates, temperature in Celsius or Fahrenheit. Statistical operations: mean, standard deviation, Pearson's correlation, t and F tests.

Ratio — For ratio attributes, both differences and ratios are meaningful (×, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Statistical operations: geometric mean, harmonic mean, percent variation.
Nominal and ordinal attributes, known as categorical or qualitative attributes, differ from quantitative attributes, which include the interval and ratio types. While qualitative attributes, such as employee ID, may be represented by numbers, they function more as symbols rather than possessing numerical properties. In contrast, quantitative attributes are numerical and exhibit most numerical characteristics, existing as either integer-valued or continuous forms.
The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute. Indeed, S. S. Stevens, the psychologist who originally defined these types of attributes, characterized them in terms of such permissible transformations, which are summarized in Table 2.3.

Table 2.3 Transformations that define attribute levels.

Nominal — Any one-to-one mapping, e.g., a permutation of values. (If all employee ID numbers are reassigned, it will not make any difference.)

Ordinal — An order-preserving change of values, i.e., new value = f(old value), where f is a monotonic function. (An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.)

Interval — new value = a × old value + b, where a and b are constants. (The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree.)

Ratio — new value = a × old value. (Length can be measured in either meters or feet without altering its meaning.)
Statistical operations applicable to specific attribute types produce consistent results when the attributes undergo meaning-preserving transformations. For example, the average length of a set of objects differs numerically when measured in meters rather than feet, but both measurements convey the same physical length. Table 2.3 lists the meaning-preserving transformations corresponding to the four attribute types of Table 2.2.
Temperature serves as a clear example of the concepts discussed, as it can be classified as either an interval or a ratio attribute based on the measurement scale used.
On the Kelvin scale, a temperature of 2 K is physically twice that of 1 K, but this relationship does not hold for the Celsius or Fahrenheit scales: a temperature of 2°C (or 2°F) is not physically twice as warm as a temperature of 1°C (or 1°F). This is because the zero points of the Celsius and Fahrenheit scales are arbitrary, making the ratio of temperatures measured in these units physically meaningless.
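A tiny numerical check makes the point: converting Celsius temperatures to Kelvin before taking a ratio gives a physically meaningful number, whereas the ratio of the Celsius values themselves does not.

```python
# A minimal sketch contrasting ratios on the Kelvin (ratio) and Celsius (interval) scales.
def celsius_to_kelvin(c):
    return c + 273.15

t1_c, t2_c = 1.0, 2.0                                   # 1 degree C and 2 degrees C
ratio_celsius = t2_c / t1_c                             # 2.0, but physically meaningless
ratio_kelvin = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(ratio_celsius)            # 2.0
print(round(ratio_kelvin, 4))   # ~1.0036 -- 2 degrees C is not "twice as hot" as 1 degree C
```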
Describing Attributes by the Number of Values
An independent way of distinguishing between attributes is by the number of values they can take.
Discrete A discrete attribute has a finite or countably infinite set of values.
Types of Data Sets
As the field of data mining evolves, a diverse range of data sets becomes available for analysis. This section groups the most common types of data sets into three main categories: record data, graph-based data, and ordered data. While these categories cover many possibilities, they are not an exhaustive classification of all data set types.
General Characteristics of Data Sets
Before delving into specific types of data sets, it's essential to highlight three key characteristics—dimensionality, distribution, and resolution—that significantly influence the choice of data mining techniques employed.
Dimensionality refers to the number of attributes that objects in a data set possess, and analyzing low-dimensional data is qualitatively different from dealing with moderate or high-dimensional data. The challenges associated with high-dimensional data are often termed the "curse of dimensionality," highlighting the need for dimensionality reduction during data preprocessing. This topic will be explored in greater detail later in this chapter and in Appendix B.
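One symptom of the curse of dimensionality can be seen in a few lines of code: as the number of attributes grows, the nearest and farthest neighbors of a random point become almost equally far away, which undermines distance-based analysis. The data below is simulated purely for illustration.

```python
# A minimal sketch of distance concentration in high-dimensional data.
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))              # 500 random objects with `dim` attributes
    query = rng.random(dim)                      # one query object
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative distance contrast={contrast:.3f}")   # shrinks as dim grows
```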
Distribution The distribution of a data set is the frequency of occurrence of various values or sets of values for the attributes comprising data objects.
The distribution of a data set reflects how objects are concentrated across different areas of the data space. Statisticians have identified numerous distribution types, such as the Gaussian (normal) distribution, and have outlined their characteristics. While statistical methods for describing distributions can provide valuable analytical tools, many data sets exhibit distributions that do not conform to standard statistical models.
Many data mining algorithms do not rely on specific statistical distributions for the data they analyze. However, certain general characteristics of distributions can significantly influence the results. For instance, when a categorical attribute serves as a class variable, a distribution where one category dominates at 95% while others collectively account for only 5% can complicate classification efforts. This skewness in distribution poses challenges in data analysis, highlighting the importance of understanding distribution characteristics.
Sparsity is a specific type of skewed data where most attributes of an object have values of 0, often resulting in fewer than 1% of values being non-zero. This characteristic is beneficial in practice, as it allows for the storage and manipulation of only the non-zero values, leading to significant savings in computation time and storage requirements. Additionally, certain data mining algorithms, particularly association rule mining algorithms, are designed to work well with sparse data.
Finally, note that often the attributes in sparse data sets are asymmetric attributes.
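The storage savings from sparsity can be illustrated with a sparse matrix format such as SciPy's compressed sparse row (CSR) representation, which keeps only the non-zero values and their positions; the matrix below is artificial.

```python
# A minimal sketch of exploiting sparsity: store only the non-zero entries.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 1000))
dense[0, 5] = 3.0              # a handful of non-zero values
dense[42, 7] = 1.0
dense[999, 123] = 2.5

sparse = csr_matrix(dense)     # compressed sparse row format
print(dense.nbytes)                                                        # 8,000,000 bytes
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)   # a few thousand bytes
```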
Data can be obtained at various levels of resolution, each revealing different properties and patterns. For example, the Earth's surface appears uneven at a resolution of a few meters but smooth at a resolution of tens of kilometers. The visibility of patterns in data is influenced by resolution; overly fine resolutions may obscure patterns in noise, while overly coarse resolutions can cause patterns to vanish entirely. Atmospheric pressure variations, for instance, can indicate storm movements on an hourly scale, but these phenomena become undetectable over a monthly scale.
Much data mining work assumes that the data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes).
Record data, depicted in Figure 2.2(a), represents the simplest form of data storage, where records and data fields lack explicit relationships and each record shares the same set of attributes. This data is typically housed in flat files or relational databases. While relational databases offer more than just a collection of records, data mining primarily utilizes them as a convenient means to access records without leveraging the additional relational information. Various types of record data are detailed below and illustrated in Figure 2.2.
Figure 2.2 Different variations of record data: (a) record data, (b) transaction data, (c) data matrix, (d) document-term matrix.
Transaction or market basket data refers to a special type of record data in which each record (transaction) involves a set of items. In a grocery store, for example, the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products are the items. This data is termed "market basket data" because it reflects the products in a consumer's shopping basket. Transaction data is a collection of sets of items, but it can be viewed as a set of records whose fields are asymmetric attributes. Most often, the attributes are binary, indicating whether an item was purchased; more generally, the attributes can be discrete or continuous, such as the number of items purchased or the amount spent on those items. Figure 2.2(b) shows a sample transaction data set. Each row represents the purchases of a particular customer at a particular time.
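The sketch below turns the five transactions shown in Figure 2.2(b) into records with binary asymmetric attributes, one column per item.

```python
# A minimal sketch of representing market basket transactions as binary records:
# 1 if the item was purchased in the transaction, 0 otherwise.
transactions = [
    {"Bread", "Soda", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Soda", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Soda", "Diapers", "Milk"},
]

items = sorted(set().union(*transactions))       # the full set of binary attributes
binary_records = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)
for row in binary_records:
    print(row)
```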
A data matrix is a structured representation of data objects, where each object is represented as a point in a multidimensional space defined by a fixed set of numeric attributes. The matrix consists of rows corresponding to individual data objects and columns representing distinct attributes, making it possible to apply standard matrix operations to transform and manipulate the data. As a result, the data matrix serves as the standard format for most statistical data analysis, facilitating efficient handling of numeric information.
A sparse data matrix is characterized by having attributes of the same type that are asymmetric, where only non-zero values hold significance. A typical example of a sparse data matrix is transaction data, which consists solely of 0–1 entries. Document data also serves as a common illustration of this concept.
The "bag of words" approach allows for the representation of a document as a term vector, where each term serves as a component of the vector, with its value indicating the frequency of occurrence within the document This method creates a document-term matrix, where documents are represented as rows and terms as columns To optimize storage, only the non-zero entries of these sparse data matrices are typically retained.
A graph can sometimes be a convenient and powerful representation for data.
We consider two specific cases: (1) the graph captures relationships among data objects and (2) the data objects themselves are represented as graphs.
Data relationships among objects are crucial for conveying significant information, and in such cases the data is often represented as a graph. In this context, data objects are mapped to nodes, while their relationships are depicted as links, incorporating properties like direction and weight. For instance, web pages on the World Wide Web contain text and links to other pages, which web search engines analyze to extract content and assess relevance. The links to and from each page are vital for understanding a page's relevance to search queries. Additionally, social networks exemplify this graph data representation, where individuals are the data objects and their interactions on social media define their relationships.
Objects with internal structure, characterized by subobjects and their relationships, are often represented as graphs. For instance, the structure of chemical compounds can be depicted as graphs, where nodes represent atoms and edges signify chemical bonds. A ball-and-stick diagram of benzene illustrates this concept, showcasing its carbon and hydrogen atoms. Graph representations enable the identification of frequently occurring substructures within a set of compounds and help determine whether these substructures are correlated with specific chemical properties, such as melting point or heat of formation. Frequent graph mining, a specialized area of data mining that focuses on this analysis, is explored in Section 7.5.
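A graph representation of a compound like benzene can be sketched directly with an adjacency list, where atoms become nodes and bonds become edges; the structure below is the familiar six-carbon ring with one hydrogen per carbon (bond types such as aromatic bonds are ignored for simplicity).

```python
# A minimal sketch of representing a chemical compound (benzene) as a graph:
# atoms are nodes and chemical bonds are undirected edges in an adjacency list.
benzene = {f"C{i}": [] for i in range(6)}
benzene.update({f"H{i}": [] for i in range(6)})

def add_bond(graph, a, b):
    graph[a].append(b)
    graph[b].append(a)

for i in range(6):
    add_bond(benzene, f"C{i}", f"C{(i + 1) % 6}")   # the carbon ring
    add_bond(benzene, f"C{i}", f"H{i}")             # one hydrogen per carbon

print(benzene["C0"])   # ['C1', 'H0', 'C5'] -- the neighbors of one carbon atom
```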
For some types of data, the attributes have relationships that involve order in time or space Different types of ordered data are described next and are shown in Figure 2.4.