Description

linking research with practice and knowledge with ethical decision-making. In this written assignment, students will demonstrate how the research in this course connects to, and can be put into practice within, their own careers.

Assignment:

Provide a reflection of at least 500 words (or 2 pages, double spaced) on how the knowledge, skills, or theories of this course have been applied, or could be applied, in a practical manner to your current work environment. If you are not currently working, describe times when you have observed, or could observe, these theories and this knowledge being applied to an employment opportunity in your field of study.

INTRODUCTION TO DATA MINING


SECOND EDITION

PANG-NING TAN

Michigan State University

MICHAEL STEINBACH

University of Minnesota

ANUJ KARPATNE

University of Minnesota

VIPIN KUMAR

University of Minnesota

330 Hudson Street, NY NY 10013

Director, Portfolio Management: Engineering, Computer Science & Global Editions: Julian Partridge

Specialist, Higher Ed Portfolio Management: Matt Goldstein

Portfolio Management Assistant: Meghan Jacoby

Managing Content Producer: Scott Disanno

Content Producer: Carole Snyder

Web Developer: Steve Wright

Rights and Permissions Manager: Ben Ferrini

Manufacturing Buyer, Higher Ed, Lake Side Communications Inc (LSC): Maura Zaldivar-Garcia

Inventory Manager: Ann Lam

Product Marketing Manager: Yvonne Vannatta

Field Marketing Manager: Demetrius Hall

Marketing Assistant: Jon Bryant

Cover Designer: Joyce Wells, jWellsDesign

Full-Service Project Management: Chandrasekar Subramanian, SPi Global

Copyright ©2019 Pearson Education, Inc. All rights reserved. Manufactured in the United States of America. This publication is protected by Copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit www.pearsonhighed.com/permissions/.

Many of the designations by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Library of Congress Cataloging-in-Publication Data on File

Names: Tan, Pang-Ning, author. | Steinbach, Michael, author. | Karpatne, Anuj, author. | Kumar, Vipin, 1956- author.
Title: Introduction to Data Mining / Pang-Ning Tan, Michigan State University, Michael Steinbach, University of Minnesota, Anuj Karpatne, University of Minnesota, Vipin Kumar, University of Minnesota.
Description: Second edition. | New York, NY : Pearson Education, [2019] | Includes bibliographical references and index.
Identifiers: LCCN 2017048641 | ISBN 9780133128901 | ISBN 0133128903
Subjects: LCSH: Data mining.
Classification: LCC QA76.9.D343 T35 2019 | DDC 006.3/12–dc23 LC record available at https://lccn.loc.gov/2017048641


ISBN-10: 0133128903

ISBN-13: 9780133128901

To our families …

Preface to the Second Edition

Since the first edition, roughly 12 years ago, much has changed in the field of data analysis. The volume and variety of data being collected continue to increase, as has the rate (velocity) at which it is being collected and used to make decisions. Indeed, the term Big Data has been used to refer to the massive and diverse data sets now available. In addition, the term data science has been coined to describe an emerging area that applies tools and techniques from various fields, such as data mining, machine learning, statistics, and many others, to extract actionable insights from data, often big data.

The growth in data has created numerous opportunities for all areas of data analysis. The most dramatic developments have been in the area of predictive modeling, across a wide range of application domains. For instance, recent advances in neural networks, known as deep learning, have shown impressive results in a number of challenging areas, such as image classification, speech recognition, and text categorization and understanding. While not as dramatic, other areas, e.g., clustering, association analysis, and anomaly detection, have also continued to advance. This new edition is in response to those advances.

Overview

As with the first edition, the second edition of the book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, predictive modeling, association analysis, cluster analysis, anomaly detection, and avoiding false discoveries. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. As before, classification, association analysis, and cluster analysis are each covered in a pair of chapters. The introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the following chapter discusses advanced concepts and algorithms. As before, our objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference.

To help readers better understand the concepts that have been presented, we provide an extensive set of examples, figures, and exercises. The solutions to the original exercises, which are already circulating on the web, will be made public. The exercises are mostly unchanged from the last edition, with the exception of new exercises in the chapter on avoiding false discoveries. New exercises for the other chapters and their solutions will be available to instructors via the web. Bibliographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. These have also been significantly updated. The book also contains a comprehensive subject and author index.

What is New in the Second Edition?

Some of the most significant improvements in the text have been in the two chapters on classification. The introductory chapter uses the decision tree classifier for illustration, but the discussion of many topics that apply across all classification approaches has been greatly expanded and clarified, including topics such as overfitting, underfitting, the impact of training set size, model complexity, model selection, and common pitfalls in model evaluation. Almost every section of the advanced classification chapter has been significantly updated. The material on Bayesian networks, support vector machines, and artificial neural networks has been significantly expanded. We have added a separate section on deep networks to address the current developments in this area. The discussion of evaluation, which occurs in the section on imbalanced classes, has also been updated and improved.

The changes in association analysis are more localized. We have completely reworked the section on the evaluation of association patterns (introductory chapter), as well as the sections on sequence and graph mining (advanced chapter). Changes to cluster analysis are also localized. The introductory chapter adds a K-means initialization technique and updates the discussion of cluster evaluation. The advanced clustering chapter adds a new section on spectral graph clustering. Anomaly detection has been greatly revised and expanded. Existing approaches (statistical, nearest neighbor/density-based, and clustering-based) have been retained and updated, while new approaches have been added: reconstruction-based, one-class classification, and information-theoretic. The reconstruction-based approach is illustrated using autoencoder networks that are part of the deep learning paradigm. The data chapter has been updated to include discussions of mutual information and kernel-based techniques.
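
The reconstruction-based idea (score each point by how poorly a compressed model of the normal data can reproduce it) does not require a deep network to illustrate. The sketch below is our own illustration rather than an example from the book: it uses a rank-1 PCA reconstruction in NumPy as a simple stand-in for an autoencoder, and the synthetic data set with its planted anomaly is entirely made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal points lie near a 1-D line in 2-D; one extra point is placed far off it.
normal = rng.normal(size=(100, 1)) @ np.array([[1.0, 2.0]])
normal += rng.normal(scale=0.05, size=normal.shape)
data = np.vstack([normal, [[3.0, -4.0]]])   # last row is the planted anomaly

# Fit a rank-1 PCA model and reconstruct each point from its 1-D projection.
mean = data.mean(axis=0)
centered = data - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
component = vt[:1]                           # top principal direction
reconstruction = centered @ component.T @ component + mean

# The reconstruction error serves as the anomaly score: points the
# low-rank model cannot reproduce well get high scores.
scores = np.linalg.norm(data - reconstruction, axis=1)
print(int(np.argmax(scores)))                # index of the most anomalous point
```

An autoencoder plays the same role as the PCA projection here, but with a learned nonlinear compression instead of a linear one.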

The last chapter, which discusses how to avoid false discoveries and produce valid results, is completely new and is novel among contemporary textbooks on data mining. It supplements the discussions in the other chapters with a discussion of the statistical concepts (statistical significance, p-values, false discovery rate, permutation testing, etc.) relevant to avoiding spurious results, and then illustrates these concepts in the context of data mining techniques. This chapter addresses the increasing concern over the validity and reproducibility of results obtained from data analysis. The addition of this last chapter is a recognition of the importance of this topic and an acknowledgment that a deeper understanding of this area is needed for those analyzing data.
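
As a taste of the statistical machinery involved, the sketch below, a minimal illustration of our own rather than an example from the book, estimates a p-value for a difference in group means with a permutation test; the data and sample sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical samples; group_b is shifted, so a real effect exists.
group_a = rng.normal(loc=0.0, size=50)
group_b = rng.normal(loc=1.0, size=50)

observed = group_b.mean() - group_a.mean()

# Null hypothesis: the group labels are exchangeable.  Shuffle the pooled
# values many times and count how often a difference at least as large
# as the observed one appears by chance.
pooled = np.concatenate([group_a, group_b])
count = 0
n_perm = 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[50:].mean() - pooled[:50].mean()
    if diff >= observed:
        count += 1

p_value = (count + 1) / (n_perm + 1)   # add-one correction avoids a zero p-value
print(p_value < 0.05)
```

The same recipe (compute a statistic, regenerate it many times under a null that destroys the structure of interest, compare) underlies the randomization methods the chapter applies to classification, association analysis, and clustering results.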

The data exploration chapter and the appendices have been deleted from the print edition of the book, but they will remain available on the web. A new appendix provides a brief discussion of scalability in the context of big data.

To the Instructor

As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive knowledge of statistics or databases, our book requires minimal prerequisites. No database knowledge is needed, and we assume only a modest background in statistics or mathematics, although such a background will make for easier going in some sections. As before, the book, and more specifically, the chapters covering major data mining topics, are designed to be as self-contained as possible. Thus, the order in which topics can be covered is quite flexible. The core material is covered in Chapters 2 (data), 3 (classification), 5 (association analysis), 7 (clustering), and 9 (anomaly detection). We recommend at least a cursory coverage of Chapter 10 (Avoiding False Discoveries) to instill in students some caution when interpreting the results of their data analysis. Although the introductory data chapter (2) should be covered first, the basic classification (3), association analysis (5), and clustering (7) chapters can be covered in any order. Because anomaly detection (9) builds on classification (3) and clustering (7), those two chapters should precede Chapter 9. Various topics can be selected from the advanced classification, association analysis, and clustering chapters (4, 6, and 8, respectively) to fit the schedule and interests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of the course.

Support Materials

Support materials available to all readers of this book can be found at http://www-users.cs.umn.edu/~kumar/dmbook.

PowerPoint lecture slides

Suggestions for student projects

Data mining resources, such as algorithms and data sets

Online tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software

Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use. The book's resources will be mirrored at www.pearsonhighered.com/cs-resources. Comments and suggestions, as well as reports of errors, can be sent to the authors through dmbook@cs.umn.edu.

Acknowledgments

Many people contributed to the first and second editions of the book. We begin by acknowledging our families, to whom this book is dedicated. Without their patience and support, this project would have been impossible.

We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contributions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data mining classes. Some of the exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data mining groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertöz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the University of Minnesota and Michigan State University who worked with early drafts of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei.

Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Rajiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang.

Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable LaTeX assistance in creating an author index. Irene Moulitsas also provided assistance with LaTeX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures.

We would like to acknowledge our colleagues at the University of Minnesota and Michigan State who have helped create a positive environment for data mining research. They include Arindam Banerjee, Dan Boley, Joyce Chai, Anil Jain, Ravi Janardan, Rong Jin, George Karypis, Claudia Neuhauser, Haesun Park, William F. Punch, György Simon, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Maneesh Bhargava, Steve Cannon, Alok Choudhary, Imme Ebert-Uphoff, Auroop Ganguly, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Rama Nemani, Nikunj Oza, Chris Potter, Lisiane Pruinelli, Nagiza Samatova, Jonathan Shapiro, Kevin Silverstein, Brian Van Ness, Bonnie Westra, Nevin Young, and Zhi-Li Zhang.

The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, NOAA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. In particular, Kamal Abdali, Mitra Basu, Dick Brackney, Jagdish Chandra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica Darema, Richard Hirsch, Chandrika Kamath, Tsengdar Lee, Raju Namburu, N. Radhakrishnan, James Sidoran, Sylvia Spengler, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, Aidong Zhang, and Xiaodong Zhang have been supportive of our research in data mining and high-performance computing.

It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Matt Goldstein, Kathy Smith, Carole Snyder, and Joyce Wells. We would also like to thank George Nichols, who helped with the art work, and Paul Anagnostopoulos, who provided LaTeX support.

We are grateful to the following Pearson reviewers: Leman Akoglu (Carnegie Mellon University), Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joydeep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), Jingrui He (Arizona State University), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California-Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polytechnic Institute).

Over the years since the first edition, we have also received numerous comments from readers and students who have pointed out typos and various other issues. We are unable to mention these individuals by name, but their input is much appreciated and has been taken into account for the second edition.

Contents

Preface to the Second Edition

1 Introduction
1.1 What Is Data Mining?
1.2 Motivating Challenges
1.3 The Origins of Data Mining
1.4 Data Mining Tasks
1.5 Scope and Organization of the Book
1.6 Bibliographic Notes
1.7 Exercises

2 Data
2.1 Types of Data
2.1.1 Attributes and Measurement
2.1.2 Types of Data Sets
2.2 Data Quality
2.2.1 Measurement and Data Collection Issues
2.2.2 Issues Related to Applications
2.3 Data Preprocessing
2.3.1 Aggregation
2.3.2 Sampling
2.3.3 Dimensionality Reduction
2.3.4 Feature Subset Selection
2.3.5 Feature Creation
2.3.6 Discretization and Binarization
2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
2.4.1 Basics
2.4.2 Similarity and Dissimilarity between Simple Attributes
2.4.3 Dissimilarities between Data Objects
2.4.4 Similarities between Data Objects
2.4.5 Examples of Proximity Measures
2.4.6 Mutual Information
2.4.7 Kernel Functions*
2.4.8 Bregman Divergence*
2.4.9 Issues in Proximity Calculation
2.4.10 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises

3 Classification: Basic Concepts and Techniques
3.1 Basic Concepts
3.2 General Framework for Classification
3.3 Decision Tree Classifier
3.3.1 A Basic Algorithm to Build a Decision Tree
3.3.2 Methods for Expressing Attribute Test Conditions
3.3.3 Measures for Selecting an Attribute Test Condition
3.3.4 Algorithm for Decision Tree Induction
3.3.5 Example Application: Web Robot Detection
3.3.6 Characteristics of Decision Tree Classifiers
3.4 Model Overfitting
3.4.1 Reasons for Model Overfitting
3.5 Model Selection
3.5.1 Using a Validation Set
3.5.2 Incorporating Model Complexity
3.5.3 Estimating Statistical Bounds
3.5.4 Model Selection for Decision Trees
3.6 Model Evaluation
3.6.1 Holdout Method
3.6.2 Cross-Validation
3.7 Presence of Hyper-parameters
3.7.1 Hyper-parameter Selection
3.7.2 Nested Cross-Validation
3.8 Pitfalls of Model Selection and Evaluation
3.8.1 Overlap between Training and Test Sets
3.8.2 Use of Validation Error as Generalization Error
3.9 Model Comparison*
3.9.1 Estimating the Confidence Interval for Accuracy
3.9.2 Comparing the Performance of Two Models
3.10 Bibliographic Notes
3.11 Exercises

4 Classification: Alternative Techniques
4.1 Types of Classifiers
4.2 Rule-Based Classifier
4.2.1 How a Rule-Based Classifier Works
4.2.2 Properties of a Rule Set
4.2.3 Direct Methods for Rule Extraction
4.2.4 Indirect Methods for Rule Extraction
4.2.5 Characteristics of Rule-Based Classifiers
4.3 Nearest Neighbor Classifiers
4.3.1 Algorithm
4.3.2 Characteristics of Nearest Neighbor Classifiers
4.4 Naïve Bayes Classifier
4.4.1 Basics of Probability Theory
4.4.2 Naïve Bayes Assumption
4.5 Bayesian Networks
4.5.1 Graphical Representation
4.5.2 Inference and Learning
4.5.3 Characteristics of Bayesian Networks
4.6 Logistic Regression
4.6.1 Logistic Regression as a Generalized Linear Model
4.6.2 Learning Model Parameters
4.6.3 Characteristics of Logistic Regression
4.7 Artificial Neural Network (ANN)
4.7.1 Perceptron
4.7.2 Multi-layer Neural Network
4.7.3 Characteristics of ANN
4.8 Deep Learning
4.8.1 Using Synergistic Loss Functions
4.8.2 Using Responsive Activation Functions
4.8.3 Regularization
4.8.4 Initialization of Model Parameters
4.8.5 Characteristics of Deep Learning
4.9 Support Vector Machine (SVM)
4.9.1 Margin of a Separating Hyperplane
4.9.2 Linear SVM
4.9.3 Soft-margin SVM
4.9.4 Nonlinear SVM
4.9.5 Characteristics of SVM
4.10 Ensemble Methods
4.10.1 Rationale for Ensemble Method
4.10.2 Methods for Constructing an Ensemble Classifier
4.10.3 Bias-Variance Decomposition
4.10.4 Bagging
4.10.5 Boosting
4.10.6 Random Forests
4.10.7 Empirical Comparison among Ensemble Methods
4.11 Class Imbalance Problem
4.11.1 Building Classifiers with Class Imbalance
4.11.2 Evaluating Performance with Class Imbalance
4.11.3 Finding an Optimal Score Threshold
4.11.4 Aggregate Evaluation of Performance
4.12 Multiclass Problem
4.13 Bibliographic Notes
4.14 Exercises

5 Association Analysis: Basic Concepts and Algorithms
5.1 Preliminaries
5.2 Frequent Itemset Generation
5.2.1 The Apriori Principle
5.2.2 Frequent Itemset Generation in the Apriori Algorithm
5.2.3 Candidate Generation and Pruning
5.2.4 Support Counting
5.2.5 Computational Complexity
5.3 Rule Generation
5.3.1 Confidence-Based Pruning
5.3.2 Rule Generation in Apriori Algorithm
5.3.3 An Example: Congressional Voting Records
5.4 Compact Representation of Frequent Itemsets
5.4.1 Maximal Frequent Itemsets
5.4.2 Closed Itemsets
5.5 Alternative Methods for Generating Frequent Itemsets*
5.6 FP-Growth Algorithm*
5.6.1 FP-Tree Representation
5.6.2 Frequent Itemset Generation in FP-Growth Algorithm
5.7 Evaluation of Association Patterns
5.7.1 Objective Measures of Interestingness
5.7.2 Measures beyond Pairs of Binary Variables
5.7.3 Simpson's Paradox
5.8 Effect of Skewed Support Distribution
5.9 Bibliographic Notes
5.10 Exercises

6 Association Analysis: Advanced Concepts
6.1 Handling Categorical Attributes
6.2 Handling Continuous Attributes
6.2.1 Discretization-Based Methods
6.2.2 Statistics-Based Methods
6.2.3 Non-discretization Methods
6.3 Handling a Concept Hierarchy
6.4 Sequential Patterns
6.4.1 Preliminaries
6.4.2 Sequential Pattern Discovery
6.4.3 Timing Constraints*
6.4.4 Alternative Counting Schemes*
6.5 Subgraph Patterns
6.5.1 Preliminaries
6.5.2 Frequent Subgraph Mining
6.5.3 Candidate Generation
6.5.4 Candidate Pruning
6.5.5 Support Counting
6.6 Infrequent Patterns*
6.6.1 Negative Patterns
6.6.2 Negatively Correlated Patterns
6.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
6.6.4 Techniques for Mining Interesting Infrequent Patterns
6.6.5 Techniques Based on Mining Negative Patterns
6.6.6 Techniques Based on Support Expectation
6.7 Bibliographic Notes
6.8 Exercises

7 Cluster Analysis: Basic Concepts and Algorithms
7.1 Overview
7.1.1 What Is Cluster Analysis?
7.1.2 Different Types of Clusterings
7.1.3 Different Types of Clusters
7.2 K-means
7.2.1 The Basic K-means Algorithm
7.2.2 K-means: Additional Issues
7.2.3 Bisecting K-means
7.2.4 K-means and Different Types of Clusters
7.2.5 Strengths and Weaknesses
7.2.6 K-means as an Optimization Problem
7.3 Agglomerative Hierarchical Clustering
7.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
7.3.2 Specific Techniques
7.3.3 The Lance-Williams Formula for Cluster Proximity
7.3.4 Key Issues in Hierarchical Clustering
7.3.5 Outliers
7.3.6 Strengths and Weaknesses
7.4 DBSCAN
7.4.1 Traditional Density: Center-Based Approach
7.4.2 The DBSCAN Algorithm
7.4.3 Strengths and Weaknesses
7.5 Cluster Evaluation
7.5.1 Overview
7.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
7.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
7.5.4 Unsupervised Evaluation of Hierarchical Clustering
7.5.5 Determining the Correct Number of Clusters
7.5.6 Clustering Tendency
7.5.7 Supervised Measures of Cluster Validity
7.5.8 Assessing the Significance of Cluster Validity Measures
7.5.9 Choosing a Cluster Validity Measure
7.6 Bibliographic Notes
7.7 Exercises

8 Cluster Analysis: Additional Issues and Algorithms
8.1 Characteristics of Data, Clusters, and Clustering Algorithms
8.1.1 Example: Comparing K-means and DBSCAN
8.1.2 Data Characteristics
8.1.3 Cluster Characteristics
8.1.4 General Characteristics of Clustering Algorithms
8.2 Prototype-Based Clustering
8.2.1 Fuzzy Clustering
8.2.2 Clustering Using Mixture Models
8.2.3 Self-Organizing Maps (SOM)
8.3 Density-Based Clustering
8.3.1 Grid-Based Clustering
8.3.2 Subspace Clustering
8.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
8.4 Graph-Based Clustering
8.4.1 Sparsification
8.4.2 Minimum Spanning Tree (MST) Clustering
8.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
8.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
8.4.5 Spectral Clustering
8.4.6 Shared Nearest Neighbor Similarity
8.4.7 The Jarvis-Patrick Clustering Algorithm
8.4.8 SNN Density
8.4.9 SNN Density-Based Clustering
8.5 Scalable Clustering Algorithms
8.5.1 Scalability: General Issues and Approaches
8.5.2 BIRCH
8.5.3 CURE
8.6 Which Clustering Algorithm?
8.7 Bibliographic Notes
8.8 Exercises

9 Anomaly Detection
9.1 Characteristics of Anomaly Detection Problems
9.1.1 A Definition of an Anomaly
9.1.2 Nature of Data
9.1.3 How Anomaly Detection is Used
9.2 Characteristics of Anomaly Detection Methods
9.3 Statistical Approaches
9.3.1 Using Parametric Models
9.3.2 Using Non-parametric Models
9.3.3 Modeling Normal and Anomalous Classes
9.3.4 Assessing Statistical Significance
9.3.5 Strengths and Weaknesses
9.4 Proximity-based Approaches
9.4.1 Distance-based Anomaly Score
9.4.2 Density-based Anomaly Score
9.4.3 Relative Density-based Anomaly Score
9.4.4 Strengths and Weaknesses
9.5 Clustering-based Approaches
9.5.1 Finding Anomalous Clusters
9.5.2 Finding Anomalous Instances
9.5.3 Strengths and Weaknesses
9.6 Reconstruction-based Approaches
9.6.1 Strengths and Weaknesses
9.7 One-class Classification
9.7.1 Use of Kernels
9.7.2 The Origin Trick
9.7.3 Strengths and Weaknesses
9.8 Information Theoretic Approaches
9.8.1 Strengths and Weaknesses
9.9 Evaluation of Anomaly Detection
9.10 Bibliographic Notes
9.11 Exercises

10 Avoiding False Discoveries
10.1 Preliminaries: Statistical Testing
10.1.1 Significance Testing
10.1.2 Hypothesis Testing
10.1.3 Multiple Hypothesis Testing
10.1.4 Pitfalls in Statistical Testing
10.2 Modeling Null and Alternative Distributions
10.2.1 Generating Synthetic Data Sets
10.2.2 Randomizing Class Labels
10.2.3 Resampling Instances
10.2.4 Modeling the Distribution of the Test Statistic
10.3 Statistical Testing for Classification
10.3.1 Evaluating Classification Performance
10.3.2 Binary Classification as Multiple Hypothesis Testing
10.3.3 Multiple Hypothesis Testing in Model Selection
10.4 Statistical Testing for Association Analysis
10.4.1 Using Statistical Models
10.4.2 Using Randomization Methods
10.5 Statistical Testing for Cluster Analysis
10.5.1 Generating a Null Distribution for Internal Indices
10.5.2 Generating a Null Distribution for External Indices
10.5.3 Enrichment
10.6 Statistical Testing for Anomaly Detection
10.7 Bibliographic Notes
10.8 Exercises

Author Index
Subject Index
Copyright Permissions

1 Introduction

Rapid advances in data collection and storage technology, coupled with the ease with which data can be generated and disseminated, have triggered the explosive growth of data, leading to the current age of big data. Deriving actionable insights from these large data sets is increasingly important in decision making across almost all areas of society, including business and industry; science and engineering; medicine and biotechnology; and government and individuals. However, the amount of data (volume), its complexity (variety), and the rate at which it is being collected and processed (velocity) have simply become too great for humans to analyze unaided. Thus, there is a great need for automated tools for extracting useful information from big data despite the challenges posed by its enormity and diversity.

Data mining blends traditional data analysis methods with sophisticated algorithms for processing this abundance of data. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some applications that require more advanced techniques for data analysis.

Business and Industry Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) has allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data, such as web server logs from e-commerce websites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.

Data mining techniques can be used to support a wide range of business intelligence applications, such as customer profiling, targeted marketing, workflow management, store layout, fraud detection, and automated buying and selling. An example of the last application is high-speed stock trading, where decisions on buying and selling have to be made in less than a second using data about financial transactions. Data mining can also help retailers answer important business questions, such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?" These questions have inspired the development of such data mining techniques as association analysis (Chapters 5 and 6).
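
As a toy illustration of the kind of patterns association analysis looks for (our own sketch, not code from the book), the snippet below computes the support and confidence measures developed in Chapter 5 on a handful of hypothetical market-basket transactions; the item names are made up.

```python
# Hypothetical market-basket transactions (item names are made up).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"diapers", "beer"}))       # 3 of the 5 transactions -> 0.6
print(confidence({"diapers"}, {"beer"}))  # 3 of the 4 diaper baskets -> 0.75
```

An association rule such as {diapers} → {beer} is reported when both its support and its confidence exceed user-chosen thresholds; the algorithms in Chapters 5 and 6 are about finding all such rules efficiently.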

As the Internet continues to revolutionize the way we interact and make decisions

in our everyday lives, we are generating massive amounts of data about our online

experiences, e.g., web browsing, messaging, and posting on social networking

websites. This has opened several opportunities for business applications that use

web data. For example, in the e-commerce sector, data about our online viewing or

shopping preferences can be used to provide personalized recommendations of

products. Data mining also plays a prominent role in supporting several other

Internet-based services, such as filtering spam messages, answering search

queries, and suggesting social updates and connections. The large corpus of text,

images, and videos available on the Internet has enabled a number of

advancements in data mining methods, including deep learning, which is discussed

in Chapter 4. These developments have led to great advances in a number of

applications, such as object recognition, natural language translation, and

autonomous driving.

Another domain that has undergone a rapid big data transformation is the use of

mobile sensors and devices, such as smart phones and wearable computing

devices. With better sensor technologies, it has become possible to collect a variety

of information about our physical world using low-cost sensors embedded on

everyday objects that are connected to each other, termed the Internet of Things

(IoT). This deep integration of physical sensors in digital systems is beginning to

generate large amounts of diverse and distributed data about our environment,

which can be used for designing convenient, safe, and energy-efficient home

systems, as well as for urban planning of smart cities.

Medicine, Science, and Engineering Researchers in medicine, science, and

engineering are rapidly accumulating data that is key to significant new discoveries.

For example, as an important step toward improving our understanding of the

Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that

continuously generate global observations of the land surface, oceans, and

atmosphere. However, because of the size and spatio-temporal nature of the data,

traditional methods are often not suitable for analyzing these data sets. Techniques

developed in data mining can aid Earth scientists in answering questions such as

the following: "What is the relationship of the frequency and intensity of ecosystem disturbances, such as droughts and hurricanes, to global warming?"; "How are land surface precipitation and temperature affected by ocean surface temperature?"; and "How well can we predict the beginning and end of the growing season for a region?"

As another example, researchers in molecular biology hope to use the large

amounts of genomic data to better understand the structure and function of genes.

In the past, traditional methods in molecular biology allowed scientists to study only

a few genes at a time in a given experiment. Recent breakthroughs in microarray

technology have enabled scientists to compare the behavior of thousands of genes

under various situations. Such comparisons can help determine the function of

each gene, and perhaps isolate the genes responsible for certain diseases.

However, the noisy, high-dimensional nature of the data requires new data analysis

methods. In addition to analyzing gene expression data, data mining can also be

used to address other important biological challenges such as protein structure

prediction, multiple sequence alignment, the modeling of biochemical pathways,

and phylogenetics.

Another example is the use of data mining techniques to analyze electronic health

record (EHR) data, which has become increasingly available. Not very long ago,

studies of patients required manually examining the physical records of individual

patients and extracting very specific pieces of information pertinent to the particular

question being investigated. EHRs allow for a faster and broader exploration of

such data. However, there are significant challenges since the observations on any

one patient typically occur during their visits to a doctor or hospital and only a small

number of details about the health of the patient are measured during any particular

visit.

Currently, EHR analysis focuses on simple types of data, e.g., a patient's blood

pressure or the diagnosis code of a disease. However, large amounts of more

complex types of medical data are also being collected, such as

electrocardiograms (ECGs) and neuroimages from magnetic resonance imaging

(MRI) or functional magnetic resonance imaging (fMRI). Although challenging to analyze, this data also provides vital information about patients. Integrating and analyzing such data with traditional EHR and genomic data is one of the

capabilities needed to enable precision medicine, which aims to provide more

personalized patient care.

1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in large

data repositories. Data mining techniques are deployed to scour large data sets in

order to find novel and useful patterns that might otherwise remain unknown. They

also provide the capability to predict the outcome of a future observation, such as

the amount a customer will spend at an online or a brick-and-mortar store.

Not all information discovery tasks are considered to be data mining. Examples

include queries, e.g., looking up individual records in a database or finding web

pages that contain a particular set of keywords. This is because such tasks can be

accomplished through simple interactions with a database management system or

an information retrieval system. These systems rely on traditional computer science

techniques, which include sophisticated indexing structures and query processing

algorithms, for efficiently organizing and retrieving information from large data

repositories. Nonetheless, data mining techniques have been used to enhance the

performance of such systems by improving the quality of the search results based

on their relevance to the input queries.

Data Mining and Knowledge Discovery in Databases

Data mining is an integral part of knowledge discovery in databases (KDD),

which is the overall process of converting raw data into useful information, as

shown in Figure 1.1. This process consists of a series of steps, from data

preprocessing to postprocessing of data mining results.

Figure 1.1.

The process of knowledge discovery in databases (KDD).

The input data can be stored in a variety of formats (flat files, spreadsheets, or

relational tables) and may reside in a centralized data repository or be distributed

across multiple sites. The purpose of preprocessing is to transform the raw input

data into an appropriate format for subsequent analysis. The steps involved in data

preprocessing include fusing data from multiple sources, cleaning data to remove

noise and duplicate observations, and selecting records and features that are

relevant to the data mining task at hand. Because of the many ways data can be

collected and stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
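As a toy illustration of these preprocessing steps, the sketch below fuses two hypothetical record sources, drops duplicate observations, and keeps only the features relevant to the mining task (the record layout, field names, and helper function are all invented for illustration):

```python
# Hypothetical preprocessing sketch: fuse two sources, remove duplicate
# observations, and select only the relevant features for the mining task.
def preprocess(source_a, source_b, relevant_features):
    fused = source_a + source_b                      # fuse data from multiple sources
    seen, cleaned = set(), []
    for record in fused:                             # drop exact duplicate records
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    # keep only the features relevant to the analysis
    return [{f: r[f] for f in relevant_features if f in r} for r in cleaned]

store_a = [{"id": 1, "age": 34, "noise": "x"}, {"id": 2, "age": 51, "noise": "y"}]
store_b = [{"id": 2, "age": 51, "noise": "y"}]   # duplicate of a record in store_a
preprocess(store_a, store_b, ["id", "age"])
# -> [{'id': 1, 'age': 34}, {'id': 2, 'age': 51}]
```

Real preprocessing also handles noise, missing values, and schema mismatches between sources, which this sketch omits.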

"Closing the loop" is a phrase often used to refer to the process of integrating data

mining results into decision support systems. For example, in business

applications, the insights offered by data mining results can be integrated with

campaign management tools so that effective marketing promotions can be

conducted and tested. Such integration requires a postprocessing step to ensure

that only valid and useful results are incorporated into the decision support system.

An example of postprocessing is visualization, which allows analysts to explore the

data and the data mining results from a variety of viewpoints. Hypothesis testing

methods can also be applied during postprocessing to eliminate spurious data

mining results. (See Chapter 10.)

1.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often encountered

practical difficulties in meeting the challenges posed by big data applications. The

following are some of the specific challenges that motivated the development of

data mining.

Scalability

Because of advances in data generation and collection, data sets with sizes of

terabytes, petabytes, or even exabytes are becoming common. If data mining

algorithms are to handle these massive data sets, they must be scalable. Many

data mining algorithms employ special search strategies to handle exponential

search problems. Scalability may also require the implementation of novel data

structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into

main memory. Scalability can also be improved by using sampling or developing

parallel and distributed algorithms. A general overview of techniques for scaling up

data mining algorithms is given in Appendix F.

High Dimensionality

It is now common to encounter data sets with hundreds or thousands of attributes

instead of the handful common a few decades ago. In bioinformatics, progress in

microarray technology has produced gene expression data involving thousands of

features. Data sets with temporal or spatial components also tend to have high

dimensionality. For example, consider a data set that contains measurements of

temperature at various locations. If the temperature measurements are taken

repeatedly for an extended period, the number of dimensions (features) increases

in proportion to the number of measurements taken. Traditional data analysis

techniques that were developed for low-dimensional data often do not work well for

such high-dimensional data due to issues such as the curse of dimensionality (to be discussed in Chapter 2). Also, for some data analysis algorithms, the

computational complexity increases rapidly as the dimensionality (the number of

features) increases.

Heterogeneous and Complex Data

Traditional data analysis methods often deal with data sets containing attributes of

the same type, either continuous or categorical. As the role of data mining in

business, science, medicine, and other fields has grown, so has the need for

techniques that can handle heterogeneous attributes. Recent years have also seen

the emergence of more complex data objects. Examples of such non-traditional

types of data include web and social media data containing text, hyperlinks,

images, audio, and videos; DNA data with sequential and three-dimensional

structure; and climate data that consists of measurements (temperature, pressure,

etc.) at various times and locations on the Earth's surface. Techniques developed

for mining such complex objects should take into consideration relationships in the

data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML

documents.

Data Ownership and Distribution

Sometimes, the data needed for an analysis is not stored in one location or owned

by one organization. Instead, the data is geographically distributed among

resources belonging to multiple entities. This requires the development of

distributed data mining techniques. The key challenges faced by distributed data

mining algorithms include the following: (1) how to reduce the amount of

communication needed to perform the distributed computation, (2) how to

effectively consolidate the data mining results obtained from multiple sources, and

(3) how to address data security and privacy issues.

Non-traditional Analysis

The traditional statistical approach is based on a hypothesize-and-test paradigm. In

other words, a hypothesis is proposed, an experiment is designed to gather the

data, and then the data is analyzed with respect to the hypothesis. Unfortunately,

this process is extremely labor-intensive. Current data analysis tasks often require

the generation and evaluation of thousands of hypotheses, and consequently, the

development of some data mining techniques has been motivated by the desire to

automate the process of hypothesis generation and evaluation. Furthermore, the

data sets analyzed in data mining are typically not the result of a carefully designed

experiment and often represent opportunistic samples of the data, rather than

random samples.

1.3 The Origins of Data Mining

While data mining has traditionally been viewed as an intermediate process within

the KDD framework, as shown in Figure 1.1, it has emerged over the years as

an academic field within computer science, focusing on all aspects of KDD,

including data preprocessing, mining, and postprocessing. Its origin can be traced

back to the late 1980s, following a series of workshops organized on the topic of

knowledge discovery in databases. The workshops brought together researchers

from different disciplines to discuss the challenges and opportunities in applying

computational techniques to extract actionable knowledge from large databases.

The workshops quickly grew into hugely popular conferences that were attended by

researchers and practitioners from both academia and industry. The success of

these conferences, along with the interest shown by businesses and industry in

recruiting new hires with a data mining background, has fueled the tremendous

growth of this field.

The field was initially built upon the methodology and algorithms that researchers

had previously used. In particular, data mining researchers draw upon ideas, such

as (1) sampling, estimation, and hypothesis testing from statistics and (2) search

algorithms, modeling techniques, and learning theories from artificial intelligence,

pattern recognition, and machine learning. Data mining has also been quick to

adopt ideas from other areas, including optimization, evolutionary computing,

information theory, signal processing, visualization, and information retrieval, and

extending them to solve the challenges of mining big data.

A number of other areas also play key supporting roles. In particular, database

systems are needed to provide support for efficient storage, indexing, and query

processing. Techniques from high performance (parallel) computing are often

important in addressing the massive size of some data sets. Distributed techniques

can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Figure 1.2.

Data mining as a confluence of many disciplines.

Data Science and Data-Driven Discovery

Data science is an interdisciplinary field that studies and applies tools and

techniques for deriving useful insights from data. Although data science is regarded

as an emerging field with a distinct identity of its own, the tools and techniques

often come from many different areas of data analysis, such as data mining,

statistics, AI, machine learning, pattern recognition, database technology, and

distributed and parallel computing. (See Figure 1.2.)

The emergence of data science as a new field is a recognition that, often, none of

the existing areas of data analysis provides a complete set of tools for the data

analysis tasks that are often encountered in emerging applications. Instead, a

broad range of computational, mathematical, and statistical skills is often required.

To illustrate the challenges that arise in analyzing such data, consider the following

example. Social media and the Web present new opportunities for social scientists

to observe and quantitatively measure human behavior on a large scale. To

conduct such a study, social scientists work with analysts who possess skills in

areas such as web mining, natural language processing (NLP), network analysis,

data mining, and statistics. Compared to more traditional research in social

science, which is often based on surveys, this analysis requires a broader range of

skills and tools, and involves far larger amounts of data. Thus, data science is, by

necessity, a highly interdisciplinary field that builds on the continuing work of many

fields.

The data-driven approach of data science emphasizes the direct discovery of

patterns and relationships from data, especially in large quantities of data, often

without the need for extensive domain knowledge. A notable example of the

success of this approach is represented by advances in neural networks, i.e., deep

learning, which have been particularly successful in areas which have long proved

challenging, e.g., recognizing objects in photos or videos and words in speech, as

well as in other application areas. However, note that this is just one example of the

success of data-driven approaches, and dramatic improvements have also

occurred in many other areas of data analysis. Many of these developments are

topics described later in this book.

Some cautions on potential limitations of a purely data-driven approach are given in

the Bibliographic Notes.

1.4 Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks The objective of these tasks is to predict the value of a particular

attribute based on the values of other attributes. The attribute to be predicted is

commonly known as the target or dependent variable, while the attributes used

for making the prediction are known as the explanatory or independent

variables.

Descriptive tasks Here, the objective is to derive patterns (correlations, trends,

clusters, trajectories, and anomalies) that summarize the underlying relationships in

data. Descriptive data mining tasks are often exploratory in nature and frequently

require postprocessing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.

Figure 1.3.

Four of the core data mining tasks.

Predictive modeling refers to the task of building a model for the target variable as

a function of the explanatory variables. There are two types of predictive modeling

tasks: classification, which is used for discrete target variables, and regression,

which is used for continuous target variables. For example, predicting whether a

web user will make a purchase at an online bookstore is a classification task

because the target variable is binary-valued. On the other hand, forecasting the

future price of a stock is a regression task because price is a continuous-valued

attribute. The goal of both tasks is to learn a model that minimizes the error

between the predicted and true values of the target variable. Predictive modeling

can be used to identify customers who will respond to a marketing campaign,

predict disturbances in the Earth's ecosystem, or judge whether a patient has a

particular disease based on the results of medical tests.

Example 1.1 (Predicting the Type of a Flower).

Consider the task of predicting a species of flower based on the characteristics

of the flower. In particular, consider classifying an Iris flower as one of the

following three Iris species: Setosa, Versicolour, or Virginica. To perform this

task, we need a data set containing the characteristics of various flowers of

these three species. A data set with this type of information is the well-known

Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set

contains four other attributes: sepal width, sepal length, petal length, and petal

width. Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), and [1.75, ∞), respectively. Also, petal length is broken into the categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), and [5, ∞),

respectively. Based on these categories of petal width and length, the following

rules can be derived:

Petal width low and petal length low implies Setosa.

Petal width medium and petal length medium implies Versicolour.

Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not perfect)

job of classifying most of the flowers. Note that flowers from the Setosa species

are well separated from the Versicolour and Virginica species with respect to

petal width and length, but the latter two species overlap somewhat with

respect to these attributes.

Figure 1.4.

Petal width versus petal length for 150 Iris flowers.
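The three rules in Example 1.1 can be written directly as code. The sketch below only restates the rules as given; the function names are ours, and the cut-points come from the intervals in the example:

```python
# Interval-based classification rules from Example 1.1
# (cut-points taken from the intervals stated in the text).
def bucket(value, low_cut, high_cut):
    """Map a measurement to the category low, medium, or high."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"

def classify_iris(petal_length, petal_width):
    width = bucket(petal_width, 0.75, 1.75)
    length = bucket(petal_length, 2.5, 5.0)
    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    if width == "high" and length == "high":
        return "Virginica"
    return "unclassified"   # the rules do not cover every flower

classify_iris(1.4, 0.2)   # -> 'Setosa'
classify_iris(4.5, 1.4)   # -> 'Versicolour'
classify_iris(6.0, 2.2)   # -> 'Virginica'
```

As the example notes, some Versicolour and Virginica flowers fall outside these intervals, which is why the final branch returns "unclassified".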

Association analysis is used to discover patterns that describe strongly

associated features in the data. The discovered patterns are typically represented

in the form of implication rules or feature subsets. Because of the exponential size

of its search space, the goal of association analysis is to extract the most

interesting patterns in an efficient manner. Useful applications of association

analysis include finding groups of genes that have related functionality, identifying

web pages that are accessed together, or understanding the relationships between

different elements of Earth's climate system.

Example 1.2 (Market Basket Analysis).

The transactions shown in Table 1.1 illustrate point-of-sale data collected at

the checkout counters of a grocery store. Association analysis can be applied

to find items that are frequently bought together by customers. For example, we

may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.

Table 1.1. Market basket data.

Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}
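Using the transactions in Table 1.1, the strength of the rule {Diapers} → {Milk} can be checked by hand or with a short sketch such as the one below (the metric names follow standard association-analysis usage; the helper function itself is ours):

```python
# Support and confidence of the rule {Diapers} -> {Milk} over Table 1.1.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def rule_metrics(antecedent, consequent, transactions):
    n = len(transactions)
    # transactions containing both sides of the rule
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    # transactions containing the antecedent alone
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

rule_metrics({"Diapers"}, {"Milk"}, transactions)
# -> (0.5, 1.0): diapers appear in 5 of 10 baskets, and every one also contains milk
```

Association analysis algorithms such as those in Chapters 5 and 6 search for all rules exceeding user-specified support and confidence thresholds without enumerating every candidate explicitly.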

Cluster analysis seeks to find groups of closely related observations so that

observations that belong to the same cluster are more similar to each other than

observations that belong to other clusters. Clustering has been used to group sets

of related customers, find areas of the ocean that have a significant impact on the

Earth's climate, and compress data.

Example 1.3 (Document Clustering).

The collection of news articles shown in Table 1.2 can be grouped based on

their respective topics. Each article is represented as a set of word-frequency

pairs (w : c), where w is a word and c is the number of times the word appears

in the article. There are two natural clusters in the data set. The first cluster

consists of the first four articles, which correspond to news about the economy,

while the second cluster contains the last four articles, which correspond to

news about health care. A good clustering algorithm should be able to identify

these two clusters based on the similarity between words that appear in the

articles.

Table 1.2. Collection of news articles.

Article   Word-frequency pairs
1         dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2         machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3         job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4         domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5         patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6         pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7         death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8         medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
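A clustering algorithm needs a notion of similarity between articles. One common choice, shown in this illustrative sketch, is cosine similarity between word-frequency vectors; the article variables below reproduce three rows of Table 1.2:

```python
import math

# Cosine similarity between two word-frequency vectors (dicts of word -> count).
def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

article1 = {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2}
article2 = {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1}
article5 = {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2}

# The two economy articles share words (industry, country), so they are more
# similar to each other than either is to a health-care article.
cosine(article1, article2) > cosine(article1, article5)   # -> True
```

A clustering algorithm such as K-means (Chapter 7) would then group articles whose pairwise similarities are high, recovering the economy and health-care clusters.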

Anomaly detection is the task of identifying observations whose characteristics

are significantly different from the rest of the data. Such observations are known as

anomalies or outliers. The goal of an anomaly detection algorithm is to discover

the real anomalies and avoid falsely labeling normal objects as anomalous. In other

words, a good anomaly detector must have a high detection rate and a low false

alarm rate. Applications of anomaly detection include the detection of fraud,

network intrusions, unusual patterns of disease, and ecosystem disturbances, such

as droughts, floods, fires, hurricanes, etc.

Example 1.4 (Credit Card Fraud Detection).

A credit card company records the transactions made by every credit card

holder, along with personal information such as credit limit, age, annual income,

and address. Since the number of fraudulent cases is relatively small compared

to the number of legitimate transactions, anomaly detection techniques can be

applied to build a profile of legitimate transactions for the users. When a new

transaction arrives, it is compared against the profile of the user. If the

characteristics of the transaction are very different from the previously created

profile, then the transaction is flagged as potentially fraudulent.
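A minimal sketch of this idea, assuming the "profile" is just the mean and standard deviation of a user's past transaction amounts (a real system would use many more features than the amount alone):

```python
import statistics

# Flag a new transaction amount as anomalous if it lies more than
# `threshold` standard deviations away from the user's historical mean.
def is_anomalous(history, new_amount, threshold=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(new_amount - mean) / stdev > threshold

history = [25.0, 40.0, 32.0, 28.0, 35.0, 30.0, 27.0, 33.0]
is_anomalous(history, 31.0)    # typical purchase -> False
is_anomalous(history, 900.0)   # far outside the profile -> True
```

The threshold controls the trade-off described above: lowering it raises the detection rate but also the false alarm rate.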

1.5 Scope and Organization of

the Book

This book introduces the major principles and techniques used in data mining from

an algorithmic perspective. A study of these principles and techniques is essential

for developing a better understanding of how data mining technology can be

applied to various kinds of data. This book also serves as a starting point for

readers who are interested in doing research in this field.

We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing

techniques, and measures of similarity and dissimilarity. Although this material can

be covered quickly, it provides an essential foundation for data analysis. Chapters 3 and 4 cover classification. Chapter 3 provides a foundation by

discussing decision tree classifiers and several issues that are important to all classification methods: overfitting, underfitting, model selection, and performance evaluation.

Using this foundation, Chapter 4 describes a number of other important

classification techniques: rule-based systems, nearest neighbor classifiers,

Bayesian classifiers, artificial neural networks, including deep learning, support

vector machines, and ensemble classifiers, which are collections of classifiers. The

multiclass and imbalanced class problems are also discussed. These topics can be

covered independently.

Association analysis is explored in Chapters 5 and 6. Chapter 5 describes the basics of association analysis: frequent itemsets, association rules,

and some of the algorithms used to generate them. Specific types of frequent

itemsets (maximal, closed, and hyperclique) that are important for data mining are

also discussed, and the chapter concludes with a discussion of evaluation

measures for association analysis. Chapter 6 considers a variety of more

advanced topics, including how association analysis can be applied to categorical

and continuous data or to data that has a concept hierarchy. (A concept hierarchy

is a hierarchical categorization of objects, e.g., store items → clothing → shoes → sneakers.) This chapter also describes

how association analysis can be extended to find sequential patterns (patterns

involving order), patterns in graphs, and negative relationships (if one item is

present, then the other is not).

Cluster analysis is discussed in Chapters 7 and 8. Chapter 7 first describes the different types of clusters, and then presents three specific clustering

techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is

followed by a discussion of techniques for validating the results of a clustering

algorithm. Additional clustering concepts and techniques are explored in Chapter 8, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM),

graph-based clustering, spectral clustering, and density-based clustering. There is

also a discussion of scalability issues and factors to consider when selecting a

clustering algorithm.

Chapter 9 is on anomaly detection. After some basic definitions, several

different types of anomaly detection are considered: statistical, distance-based,

density-based, clustering-based, reconstruction-based, one-class classification,

and information theoretic. The last chapter, Chapter 10, supplements the discussions in the other chapters with a discussion of the statistical concepts

important for avoiding spurious results, and then discusses those concepts in the

context of data mining techniques studied in the previous chapters. These

techniques include statistical hypothesis testing, p-values, the false discovery rate,

and permutation testing. Appendices A through F give a brief review of important

topics that are used in portions of the book: linear algebra, dimensionality

reduction, statistics, regression, optimization, and scaling up data mining

techniques for big data.

The subject of data mining, while relatively young compared to statistics or machine

learning, is already too large to cover in a single book. Selected references to

topics that are only briefly covered, such as data quality, are provided in the

Bibliographic Notes section of the appropriate chapter. References to topics not

covered in this book, such as mining streaming data and privacy-preserving data

mining, are provided in the Bibliographic Notes of this chapter.

1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks

include those by Dunham [16], Han et al. [29], Hand et al. [31], Roiger and Geatz

[50], Zaki and Meira [61], and Aggarwal [2]. Data mining books with a stronger

emphasis on business applications include the works by Berry and Linoff [5], Pyle

[47], and Parr Rud [45]. Books with an emphasis on statistical learning include

those by Cherkassky and Mulier [11], and Hastie et al. [32]. Similar books with an

emphasis on machine learning or pattern recognition are those by Duda et al. [15],

Kantardzic [34], Mitchell [43], Webb [57], and Witten and Frank [58]. There are also

some more specialized books: Chakrabarti [9] (web mining), Fayyad et al. [20]

(collection of early articles on data mining), Fayyad et al. [18] (visualization),

Grossman et al. [25] (science and engineering), Kargupta and Chan [35]

(distributed data mining), Wang et al. [56] (bioinformatics), and Zaki and Ho [60]

(parallel data mining).

There are several conferences related to data mining. Some of the main

conferences dedicated to this field include the ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining (KDD), the IEEE

International Conference on Data Mining (ICDM), the SIAM International

Conference on Data Mining (SDM), the European Conference on Principles and

Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia

Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining

papers can also be found in other major conferences such as the Conference and

Workshop on Neural Information Processing Systems (NIPS), the International

Conference on Machine Learning (ICML), the ACM SIGMOD/PODS conference,

the International Conference on Very Large Data Bases (VLDB), the Conference on

Information and Knowledge Management (CIKM), the International Conference on

Data Engineering (ICDE), the National Conference on Artificial Intelligence (AAAI),

the IEEE International Conference on Big Data, and the IEEE International

Conference on Data Science and Advanced Analytics (DSAA).

Journal publications on data mining include IEEE Transactions on Knowledge and

Data Engineering, Data Mining and Knowledge Discovery, Knowledge and

Information Systems, ACM Transactions on Knowledge Discovery from Data,

Statistical Analysis and Data Mining, and Information Systems. Various open-source data mining software packages are available, including Weka [27] and Scikit-learn [46]. More recently, data mining software such as Apache Mahout and Apache Spark has been developed for large-scale problems on distributed computing platforms.

There have been a number of general articles on data mining that define the field or

its relationship to other fields, particularly statistics. Fayyad et al. [19] describe data

mining and how it fits into the total knowledge discovery process. Chen et al. [10]

give a database perspective on data mining. Ramakrishnan and Grama [48]

provide a general discussion of data mining and present several viewpoints. Hand

[30] describes how data mining differs from statistics, as does Friedman [21].

Lambert [40] explores the use of statistics for large data sets and provides some

comments on the respective roles of data mining and statistics. Glymour et al. [23]

consider the lessons that statistics may have for data mining. Smyth et al. [53]

describe how the evolution of data mining is being driven by new types of data and

applications, such as those involving streams, graphs, and text. Han et al. [28]

consider emerging applications in data mining and Smyth [52] describes some

research challenges in data mining. Wu et al. [59] discuss how developments in

data mining research can be turned into practical tools. Data mining standards are

the subject of a paper by Grossman et al. [24]. Bradley [7] discusses how data

mining algorithms can be scaled to large data sets.

The emergence of new data mining applications has produced new challenges that

need to be addressed. For instance, concerns about privacy breaches as a result of

data mining have escalated in recent years, particularly in application domains such

as web commerce and health care. As a result, there is growing interest in

developing data mining algorithms that maintain user privacy. Developing

techniques for mining encrypted or randomized data is known as privacy-preserving data mining. Some general references in this area include papers by

Agrawal and Srikant [3], Clifton et al. [12], and Kargupta et al. [36]. Verykios et al.

[55] provide a survey. Another area of concern is the bias in predictive models that

may be used for some applications, e.g., screening job applicants or deciding

prison parole [39]. Assessing whether such applications are producing biased

results is made more difficult by the fact that the predictive models used for such

applications are often black box models, i.e., models that are not interpretable in

any straightforward way.

Data science, its constituent fields, and more generally, the new paradigm of

knowledge discovery they represent [33], have great potential, some of which has

been realized. However, it is important to emphasize that data science works

mostly with observational data, i.e., data that was collected by various

organizations as part of their normal operation. The consequence of this is that

sampling biases are common and the determination of causal factors becomes

more problematic. For this and a number of other reasons, it is often hard to

interpret the predictive models built from this data [42, 49]. Thus, theory,

experimentation and computational simulations will continue to be the methods of

choice in many areas, especially those related to science.

More importantly, a purely data-driven approach often ignores the existing

knowledge in a particular field. Such models may perform poorly, for example,

predicting impossible outcomes or failing to generalize to new situations. However,

if the model does work well, e.g., has high predictive accuracy, then this approach

may be sufficient for practical purposes in some fields. But in many areas, such as

medicine and science, gaining insight into the underlying domain is often the goal.

Some recent work attempts to address these issues in order to create theory-guided data science, which takes pre-existing domain knowledge into account [17,

37].

Recent years have witnessed a growing number of applications that rapidly

generate continuous streams of data. Examples of stream data include network

traffic, multimedia streams, and stock prices. Several issues must be considered

when mining data streams, such as the limited amount of memory available, the

need for online analysis, and the change of the data over time. Data mining for

stream data has become an important area in data mining. Some selected

publications are Domingos and Hulten [14] (classification), Giannella et al. [22]

(association analysis), Guha et al. [26] (clustering), Kifer et al. [38] (change

detection), Papadimitriou et al. [44] (time series), and Law et al. [41] (dimensionality

reduction).

Another area of interest is recommender and collaborative filtering systems [1, 6, 8,

13, 54], which suggest movies, television shows, books, products, etc. that a

person might like. In many cases, this problem, or at least a component of it, is

treated as a prediction problem and thus, data mining techniques can be applied [4,

51].

Bibliography

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender

systems: A survey of the state-of-the-art and possible extensions. IEEE

Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.

[2] C. Aggarwal. Data Mining: The Textbook. Springer, 2015.

[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of 2000

ACM SIGMOD Intl. Conf. on Management of Data, pages 439–450, Dallas,

Texas, 2000. ACM Press.

[4] X. Amatriain and J. M. Pujol. Data mining methods for recommender systems. In

Recommender Systems Handbook, pages 227–262. Springer, 2015.

[5] M. J. A. Berry and G. Linoff. Data Mining Techniques: For Marketing, Sales, and

Customer Relationship Management. Wiley Computer Publishing, 2nd edition,

2004.

[6] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. Recommender systems

survey. Knowledge-Based Systems, 46:109–132, 2013.

[7] P. S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining

algorithms to large databases. Communications of the ACM, 45(8):38–43, 2002.

[8] R. Burke. Hybrid recommender systems: Survey and experiments. User

Modeling and User-Adapted Interaction, 12(4):331–370, 2002.

[9] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.

Morgan Kaufmann, San Francisco, CA, 2003.

[10] M.-S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database

Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866–883, 1996.

[11] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and

Methods. Wiley-IEEE Press, 2nd edition, 1998.

[12] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for data mining. In

National Science Foundation Workshop on Next Generation Data Mining, pages

126–133, Baltimore, MD, November 2002.

[13] C. Desrosiers and G. Karypis. A comprehensive survey of neighborhood-based

recommendation methods. Recommender Systems Handbook, pages 107–144,

2011.

[14] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the

6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 71–80, Boston,

Massachusetts, 2000. ACM Press.

[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley &

Sons, Inc., New York, 2nd edition, 2001.

[16] M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall,

2006.

[17] J. H. Faghmous, A. Banerjee, S. Shekhar, M. Steinbach, V. Kumar, A. R.

Ganguly, and N. Samatova. Theory-guided data science for climate change.

Computer, 47(11):74–78, 2014.

[18] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization

in Data Mining and Knowledge Discovery. Morgan Kaufmann Publishers, San

Francisco, CA, September 2001.

[19] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to

Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and

Data Mining, pages 1–34. AAAI Press, 1996.

[20] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors.

Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[21] J. H. Friedman. Data Mining and Statistics: What's the Connection?

Unpublished. www-stat.stanford.edu/~jhf/ftp/dm-stat.ps, 1997.

[22] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in

Data Streams at Multiple Time Granularities. In H. Kargupta, A. Joshi, K.

Sivakumar, and Y. Yesha, editors, Next Generation Data Mining, pages 191–212. AAAI/MIT, 2003.

[23] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical Themes and

Lessons for Data Mining. Data Mining and Knowledge Discovery, 1(1):11–28,

1997.

[24] R. L. Grossman, M. F. Hornick, and G. Meyer. Data mining standards

initiatives. Communications of the ACM, 45(8):59–61, 2002.

[25] R. L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu,

editors. Data Mining for Scientific and Engineering Applications. Kluwer

Academic Publishers, 2001.

[26] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering

Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data

Engineering, 15(3):515–528, May/June 2003.

[27] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten.

The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1),

2009.

[28] J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon. Emerging

scientific applications in data mining. Communications of the ACM, 45(8):54–58,

2002.

[29] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques.

Morgan Kaufmann Publishers, San Francisco, 3rd edition, 2011.

[30] D. J. Hand. Data Mining: Statistics and More? The American Statistician, 52(2):

112–118, 1998.

[31] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press,

2001.

[32] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical

Learning: Data Mining, Inference, Prediction. Springer, 2nd edition, 2009.

[33] T. Hey, S. Tansley, K. M. Tolle, et al. The fourth paradigm: data-intensive

scientific discovery, volume 1. Microsoft research Redmond, WA, 2009.

[34] M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms.

Wiley-IEEE Press, Piscataway, NJ, 2003.

[35] H. Kargupta and P. K. Chan, editors. Advances in Distributed and Parallel

Knowledge Discovery. AAAI Press, September 2002.

[36] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the Privacy Preserving

Properties of Random Data Perturbation Techniques. In Proc. of the 2003 IEEE

Intl. Conf. on Data Mining, pages 99Ã¢â‚¬â€œ106, Melbourne, Florida, December 2003.

IEEE Computer Society.

[37] A. Karpatne, G. Atluri, J. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S.

Shekhar, N. Samatova, and V. Kumar. Theory-guided Data Science: A New

Paradigm for Scientific Discovery from Data. IEEE Transactions on Knowledge

and Data Engineering, 2017.

[38] D. Kifer, S. Ben-David, and J. Gehrke. Detecting Change in Data Streams. In

Proc. of the 30th VLDB Conf., pages 180Ã¢â‚¬â€œ191, Toronto, Canada, 2004. Morgan

Kaufmann.

[39] J. Kleinberg, J. Ludwig, and S. Mullainathan. A Guide to Solving Social

Problems with Machine Learning. Harvard Business Review, December 2016.

[40] D. Lambert. What Use is Statistics for Massive Data? In ACM SIGMOD

Workshop on Research Issues in Data Mining and Knowledge Discovery, pages

54–62, 2000.

[41] M. H. C. Law, N. Zhang, and A. K. Jain. Nonlinear Manifold Learning for Data

Streams. In Proc. of the SIAM Intl. Conf. on Data Mining, Lake Buena Vista,

Florida, April 2004. SIAM.

[42] Z. C. Lipton. The mythos of model interpretability. arXiv preprint

arXiv:1606.03490, 2016.

[43] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.

[44] S. Papadimitriou, A. Brockwell, and C. Faloutsos. Adaptive, unsupervised

stream mining. VLDB Journal, 13(3):222–239, 2004.

[45] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk and

Customer Relationship Management. John Wiley & Sons, New York, NY, 2001.

[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M.

Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D.

Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine

Learning in Python. Journal of Machine Learning Research, 12:2825–2830,

2011.

[47] D. Pyle. Business Modeling and Data Mining. Morgan Kaufmann, San

Francisco, CA, 2003.

[48] N. Ramakrishnan and A. Grama. Data Mining: From Serendipity to Science—Guest Editors' Introduction. IEEE Computer, 32(8):34–37, 1999.

[49] M. T. Ribeiro, S. Singh, and C. Guestrin. Why Should I Trust You?: Explaining

the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining, pages

1135–1144. ACM, 2016.

[50] R. Roiger and M. Geatz. Data Mining: A Tutorial Based Primer. Addison-Wesley, 2002.

[51] J. Schafer. The Application of Data-Mining to Recommender Systems.

Encyclopedia of Data Warehousing and Mining, 1:44–48, 2009.

[52] P. Smyth. Breaking out of the Black-Box: Research Challenges in Data Mining.

In Proc. of the 2001 ACM SIGMOD Workshop on Research Issues in Data

Mining and Knowledge Discovery, 2001.

[53] P. Smyth, D. Pregibon, and C. Faloutsos. Data-driven evolution of data mining

algorithms. Communications of the ACM, 45(8):33–37, 2002.

[54] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques.

Advances in Artificial Intelligence, 2009:4, 2009.

[55] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y.

Theodoridis. State-of-the-art in privacy preserving data mining. SIGMOD

Record, 33(1):50–57, 2004.

[56] J. T. L. Wang, M. J. Zaki, H. Toivonen, and D. E. Shasha, editors. Data Mining

in Bioinformatics. Springer, September 2004.

[57] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition,

2002.

[58] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and

Techniques. Morgan Kaufmann, 3rd edition, 2011.

[59] X. Wu, P. S. Yu, and G. Piatetsky-Shapiro. Data Mining: How Research Meets

Practical Development? Knowledge and Information Systems, 5(2):248–261,

2003.

[60] M. J. Zaki and C.-T. Ho, editors. Large-Scale Parallel Data Mining. Springer,

September 2002.

[61] M. J. Zaki and W. Meira Jr. Data Mining and Analysis: Fundamental Concepts

and Algorithms. Cambridge University Press, New York, 2014.

1.7 Exercises

1. Discuss whether or not each of the following activities is a data mining task.

a. Dividing the customers of a company according to their gender.

b. Dividing the customers of a company according to their profitability.

c. Computing the total sales of a company.

d. Sorting a student database based on student identification numbers.

e. Predicting the outcomes of tossing a (fair) pair of dice.

f. Predicting the future stock price of a company using historical records.

g. Monitoring the heart rate of a patient for abnormalities.

h. Monitoring seismic waves for earthquake activities.

i. Extracting the frequencies of a sound wave.

2. Suppose that you are employed as a data mining consultant for an Internet

search engine company. Describe how data mining can help the company by giving

specific examples of how techniques, such as clustering, classification, association

rule mining, and anomaly detection can be applied.

3. For each of the following data sets, explain whether or not data privacy is an

important issue.

a. Census data collected from 1900–1950.

b. IP addresses and visit times of web users who visit your website.

c. Images from Earth-orbiting satellites.

d. Names and addresses of people from the telephone book.

e. Names and email addresses collected from the Web.

2 Data

This chapter discusses several data-related issues that are

important for successful data mining:

The Type of Data Data sets differ in a number of ways. For example, the attributes

used to describe data objects can be of different types—quantitative or qualitative—

and data sets often have special characteristics; e.g., some data sets contain

time series or objects with explicit relationships to one another. Not surprisingly, the

type of data determines which tools and techniques can be used to analyze the

data. Indeed, new research in data mining is often driven by the need to

accommodate new application areas and their new types of data.

The Quality of the Data Data is often far from perfect. While most data mining

techniques can tolerate some level of imperfection in the data, a focus on

understanding and improving data quality typically improves the quality of the

resulting analysis. Data quality issues that often need to be addressed include the

presence of noise and outliers; missing, inconsistent, or duplicate data; and data

that is biased or, in some other way, unrepresentative of the phenomenon or

population that the data is supposed to describe.
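The quality issues listed above can be screened for mechanically before any mining begins. Below is a minimal sketch of such checks on a handful of made-up records (the field names, values, and the 0–120 plausible-age range are illustrative assumptions, not from the text):

```python
# Simple data quality screening: spot missing values, duplicate records,
# and values outside a plausible domain range. All data here is made up.

records = [
    {"id": 1, "age": 29},
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": 31},
    {"id": 3, "age": 31},     # duplicate record
    {"id": 4, "age": 250},    # likely noise or data entry error
]

# Missing values: fields recorded as None.
missing = [r for r in records if r["age"] is None]

# Duplicates: records whose full contents have been seen before.
seen, duplicates = set(), []
for r in records:
    key = (r["id"], r["age"])
    if key in seen:
        duplicates.append(r)
    seen.add(key)

# Outliers via a domain range check (assumed plausible ages: 0 to 120).
ages = [r["age"] for r in records if r["age"] is not None]
outliers = [a for a in ages if not (0 <= a <= 120)]

print(len(missing), len(duplicates), outliers)  # 1 1 [250]
```

Checks like these are cheap, and as the dialogue later in this chapter illustrates, skipping them can invalidate an entire analysis.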

Preprocessing Steps to Make the Data More Suitable for Data Mining Often,

the raw data must be processed in order to make it suitable for analysis. While one

objective may be to improve data quality, other goals focus on modifying the data

so that it better fits a specified data mining technique or tool. For example, a

continuous attribute, e.g., length, sometimes needs to be transformed into an

attribute with discrete categories, e.g., short, medium, or long, in order to apply a

particular technique. As another example, the number of attributes in a data set is

often reduced because many techniques are more effective when the data has a

relatively small number of attributes.
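The discretization step described above can be sketched in a few lines; the threshold values below are illustrative assumptions, not taken from the text:

```python
# Discretizing a continuous attribute ("length") into the categories
# short / medium / long. The cut points are hypothetical.

def discretize_length(length, short_max=10.0, medium_max=50.0):
    if length <= short_max:
        return "short"
    if length <= medium_max:
        return "medium"
    return "long"

lengths = [3.2, 12.5, 75.0]
print([discretize_length(x) for x in lengths])  # ['short', 'medium', 'long']
```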

Analyzing Data in Terms of Its Relationships One approach to data analysis is to

find relationships among the data objects and then perform the remaining analysis

using these relationships rather than the data objects themselves. For instance, we

can compute the similarity or distance between pairs of objects and then perform

the analysis—clustering, classification, or anomaly detection—based on these

similarities or distances. There are many such similarity or distance measures, and

the proper choice depends on the type of data and the particular application.
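As a small sketch of this relationship-based view (the data points are made up), one can compute pairwise Euclidean distances and find each object's nearest neighbor, a building block shared by clustering, classification, and anomaly detection:

```python
import math

# Relationship-based analysis: work with distances between objects
# rather than the objects themselves. The points are illustrative.

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

objects = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0)]

def nearest_neighbor(i):
    """Index of the object closest to object i (excluding i itself)."""
    others = [j for j in range(len(objects)) if j != i]
    return min(others, key=lambda j: euclidean(objects[i], objects[j]))

for i in range(len(objects)):
    print(i, "->", nearest_neighbor(i))
```

Swapping in a different distance or similarity measure changes the analysis without touching the rest of the pipeline, which is exactly why the choice of measure matters.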

Example 2.1 (An Illustration of Data-Related Issues).

To further illustrate the importance of these issues, consider the following

hypothetical situation. You receive an email from a medical researcher

concerning a project that you are eager to work on.

Hi,

I've attached the data file that I mentioned in my previous email. Each line contains

the information for a single patient and consists of five fields. We want to predict the

last field using the other fields. I don't have time to provide any more information

about the data since I'm going out of town for a couple of days, but hopefully that

won't slow you down too much. And if you don't mind, could we meet when I get back

to discuss your preliminary results? I might invite a few other members of my team.

Thanks and see you in a couple of days.

Despite some misgivings, you proceed to analyze the data. The first few rows of

the file are as follows:

012  232  33.5  0  10.7

020  121  16.9  2  210.1

027  165  24.0  0  427.6

⋮

A brief look at the data reveals nothing strange. You put your doubts aside and start

the analysis. There are only 1000 lines, a smaller data file than you had hoped for,

but two days later, you feel that you have made some progress. You arrive for the

meeting, and while waiting for others to arrive, you strike up a conversation with a

statistician who is working on the project. When she learns that you have also been

analyzing the data from the project, she asks if you would mind giving her a brief

overview of your results.

Statistician: So, you got the data for all the patients?

Data Miner: Yes. I haven't had much time for analysis, but I do have a few

interesting results.

Statistician: Amazing. There were so many data issues with this set of

patients that I couldn't do much.

Data Miner: Oh? I didn't hear about any possible problems.

Statistician: Well, first there is field 5, the variable we want to predict. It's

common knowledge among people who analyze this type of data that results

are better if you work with the log of the values, but I didn't discover this until

later. Was it mentioned to you?

Data Miner: No.

Statistician: But surely you heard about what happened to field 4? It's

supposed to be measured on a scale from 1 to 10, with 0 indicating a

missing value, but because of a data entry error, all 10's were changed into

0's. Unfortunately, since some of the patients have missing values for this

field, it's impossible to say whether a 0 in this field is a real 0 or a 10. Quite a

few of the records have that problem.

Data Miner: Interesting. Were there any other problems?

Statistician: Yes, fields 2 and 3 are basically the same, but I assume that

you probably noticed that.

Data Miner: Yes, but these fields were only weak predictors of field 5.

Statistician: Anyway, given all those problems, I'm surprised you were able

to accomplish anything.

Data Miner: True, but my results are really quite good. Field 1 is a very

strong predictor of field 5. I'm surprised that this wasn't noticed before.

Statistician: What? Field 1 is just an identification number.

Data Miner: Nonetheless, my results speak for themselves.

Statistician: Oh, no! I just remembered. We assigned ID numbers after we

sorted the records based on field 5. There is a strong connection, but it's

meaningless. Sorry.

Although this scenario represents an extreme situation, it emphasizes the

importance of "knowing your data." To that end, this chapter will address each of

the four issues mentioned above, outlining some of the basic challenges and

standard approaches.

2.1 Types of Data

A data set can often be viewed as a collection of data objects. Other names for a

data object are record, point, vector, pattern, event, case, sample, instance,

observation, or entity. In turn, data objects are described by a number of attributes

that capture the characteristics of an object, such as the mass of a physical object

or the time at which an event occurred. Other names for an attribute are variable,

characteristic, field, feature, or dimension.

Example 2.2 (Student Information).

Often, a data set is a file, in which the objects are records (or rows) in the file

and each field (or column) corresponds to an attribute. For example, Table 2.1

shows a data set that consists of student information. Each row

corresponds to a student and each column is an attribute that describes some

aspect of a student, such as grade point average (GPA) or identification

number (ID).

Table 2.1. A sample data set containing student information.

Student ID   Year        Grade Point Average (GPA)   …

1034262      Senior      3.24                        …

1052663      Freshman    3.51                        …

1082246      Sophomore   3.62                        …

⋮

Although record-based data sets are common, either in flat files or relational

database systems, there are other important types of data sets and systems for

storing data. In Section 2.1.2, we will discuss some of the types of data sets

that are commonly encountered in data mining. However, we first consider

attributes.

2.1.1 Attributes and Measurement

In this section, we consider the types of attributes used to describe data objects.

We first define an attribute, then consider what we mean by the type of an attribute,

and finally describe the types of attributes that are commonly encountered.

What Is an Attribute?

We start with a more detailed definition of an attribute.

Definition 2.1.

An attribute is a property or characteristic of an object that can vary,

either from one object to another or from one time to another.

For example, eye color varies from person to person, while the temperature of an

object varies over time. Note that eye color is a symbolic attribute with a small

number of possible values {brown, black, blue, green, hazel, etc.}, while

temperature is a numerical attribute with a potentially unlimited number of values.

At the most basic level, attributes are not about numbers or symbols. However, to

discuss and more precisely analyze the characteristics of objects, we assign

numbers or symbols to them. To do this in a well-defined way, we need a

measurement scale.

Definition 2.2.

A measurement scale is a rule (function) that associates a numerical

or symbolic value with an attribute of an object.

Formally, the process of measurement is the application of a measurement scale

to associate a value with a particular attribute of a specific object. While this may

seem a bit abstract, we engage in the process of measurement all the time. For

instance, we step on a bathroom scale to determine our weight, we classify

someone as male or female, or we count the number of chairs in a room to see if

there will be enough to seat all the people coming to a meeting. In all these cases,

the "physical value" of an attribute of an object is mapped to a numerical or

symbolic value.

With this background, we can discuss the type of an attribute, a concept that is

important in determining if a particular data analysis technique is consistent with a

specific type of attribute.

The Type of an Attribute

It is common to refer to the type of an attribute as the type of a measurement

scale. It should be apparent from the previous discussion that an attribute can be

described using different measurement scales and that the properties of an

attribute need not be the same as the properties of the values used to measure it.

In other words, the values used to represent an attribute can have properties that

are not properties of the attribute itself, and vice versa. This is illustrated with two

examples.

Example 2.3 (Employee Age and ID Number).

Two attributes that might be associated with an employee are ID and age (in

years). Both of these attributes can be represented as integers. However, while

it is reasonable to talk about the average age of an employee, it makes no

sense to talk about the average employee ID. Indeed, the only aspect of

employees that we want to capture with the ID attribute is that they are distinct.

Consequently, the only valid operation for employee IDs is to test whether they

are equal. There is no hint of this limitation, however, when integers are used to

represent the employee ID attribute. For the age attribute, the properties of the

integers used to represent age are very much the properties of the attribute.

Even so, the correspondence is not complete because, for example, ages have

a maximum, while integers do not.
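Example 2.3 can be made concrete with a few lines of code (the ID and age values below are made up): both attributes are stored as integers, yet only some integer operations are meaningful for each.

```python
from statistics import mean

# Employee IDs and ages, both represented as integers.
ids = [1034262, 1052663, 1082246]
ages = [29, 41, 35]

# Distinctness (= and !=) is meaningful for both attributes.
all_distinct = len(set(ids)) == len(ids)

# Averaging is meaningful for age but not for ID, even though both
# computations succeed mechanically.
avg_age = mean(ages)   # a sensible summary of the attribute
avg_id = mean(ids)     # computable, but tells us nothing about employees

print(all_distinct, avg_age)
```

Nothing in the integer representation warns us that `avg_id` is meaningless; that knowledge comes from the attribute's type, not its values.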

Example 2.4 (Length of Line Segments).

Consider Figure 2.1, which shows some objects—line segments—and how

the length attribute of these objects can be mapped to numbers in two different

ways. Each successive line segment, going from the top to the bottom, is

formed by appending the topmost line segment to itself. Thus, the second line

segment from the top is formed by appending the topmost line segment to itself

twice, the third line segment from the top is formed by appending the topmost

line segment to itself three times, and so forth. In a very real (physical) sense,

all the line segments are multiples of the first. This fact is captured by the

measurements on the right side of the figure, but not by those on the left side.

More specifically, the measurement scale on the left side captures only the

ordering of the length attribute, while the scale on the right side captures both

the ordering and additivity properties. Thus, an attribute can be measured in a

way that does not capture all the properties of the attribute.

Figure 2.1.

The measurement of the length of line segments on two different scales of

measurement.

Knowing the type of an attribute is important because it tells us which properties of

the measured values are consistent with the underlying properties of the attribute,

and therefore, it allows us to avoid foolish actions, such as computing the average

employee ID.

The Different Types of Attributes

A useful (and simple) way to specify the type of an attribute is to identify the

properties of numbers that correspond to underlying properties of the attribute. For

example, an attribute such as length has many of the properties of numbers. It

makes sense to compare and order objects by length, as well as to talk about the

differences and ratios of length. The following properties (operations) of numbers

are typically used to describe attributes.

1. Distinctness: = and ≠

2. Order: <, ≤, >, and ≥

3. Addition: + and −

4. Multiplication: × and /

Given these properties, we can define four types of attributes: nominal, ordinal,

interval, and ratio. Table 2.2 gives the definitions of these types, along with

information about the statistical operations that are valid for each type. Each

attribute type possesses all of the properties and operations of the attribute types

above it. Consequently, any property or operation that is valid for nominal, ordinal,

and interval attributes is also valid for ratio attributes. In other words, the definition

of the attribute types is cumulative. However, this does not mean that the statistical

operations appropriate for one attribute type are appropriate for the attribute types

above it.

Table 2.2. Different attribute types.

Categorical (Qualitative)

Nominal
  Description: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, gender
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)

Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (×, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

Nominal and ordinal attributes are collectively referred to as categorical or

qualitative attributes. As the name suggests, qualitative attributes, such as

employee ID, lack most of the properties of numbers. Even if they are represented

by numbers, i.e., integers, they should be treated more like symbols. The remaining

two types of attributes, interval and ratio, are collectively referred to as quantitative

or numeric attributes. Quantitative attributes are represented by numbers and have

most of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.

The types of attributes can also be described in terms of transformations that do

not change the meaning of an attribute. Indeed, S. S. Stevens, the psychologist

who originally defined the types of attributes shown in Table 2.2, defined them in

terms of these permissible transformations. For example, the meaning of a

length attribute is unchanged if it is measured in meters instead of feet.

The statistical operations that make sense for a particular type of attribute are those

that will yield the same results when the attribute is transformed by using a

transformation that preserves the attribute's meaning. To illustrate, the average

length of a set of objects is different when measured in meters rather than in feet,

but both averages represent the same length. Table 2.3 shows the meaning-preserving transformations for the four attribute types of Table 2.2.

Table 2.3. Transformations that define attribute levels.

Categorical (Qualitative)

  Nominal
    Transformation: Any one-to-one mapping, e.g., a permutation of values.
    Comment: If all employee ID numbers are reassigned, it will not make any difference.

  Ordinal
    Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
    Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)

  Interval
    Transformation: new_value = a × old_value + b, a and b constants.
    Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).

  Ratio
    Transformation: new_value = a × old_value.
    Comment: Length can be measured in meters or feet.
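The idea that a valid statistic survives the permissible transformation can be sketched in a few lines of Python; the lengths and the meters-to-feet factor below are illustrative:

```python
# Sketch: the mean, a statistic appropriate for a ratio attribute such as
# length, gives the "same" result under the permissible transformation
# new_value = a * old_value, here meters -> feet with a = 3.28084.
lengths_m = [1.0, 2.5, 4.0]
a = 3.28084  # feet per meter

mean_m = sum(lengths_m) / len(lengths_m)
mean_ft = sum(x * a for x in lengths_m) / len(lengths_m)

# The two averages are different numbers but represent the same length:
assert abs(mean_ft - mean_m * a) < 1e-9
```

Averaging and then converting units gives the same result as converting and then averaging, which is exactly what it means for the mean to respect the ratio scale.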

Example 2.5 (Temperature Scales).

Temperature provides a good illustration of some of the concepts that have

been described. First, temperature can be either an interval or a ratio attribute,

depending on its measurement scale. When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of a temperature of 1°. This is not true when temperature is measured on either the Celsius or Fahrenheit scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much different than a temperature of 2° Fahrenheit (Celsius). The problem is that the zero points of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit temperatures is not physically meaningful.
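The same point can be checked numerically; a short sketch using the standard Celsius-to-Kelvin offset:

```python
# Sketch: ratios are meaningful on the Kelvin scale but not on Celsius,
# because the Celsius zero point is arbitrary (offset by 273.15).
def celsius_to_kelvin(c):
    return c + 273.15

t1_c, t2_c = 10.0, 20.0   # 20 C is NOT "twice as hot" as 10 C
ratio_c = t2_c / t1_c     # 2.0 on the Celsius scale, but...
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)  # about 1.035
```

The Celsius ratio of 2 evaporates once the arbitrary zero point is removed, which is why ratios of Celsius temperatures carry no physical meaning.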

Describing Attributes by the Number of Values

An independent way of distinguishing between attributes is by the number of values

they can take.

Discrete A discrete attribute has a finite or countably infinite set of values. Such

attributes can be categorical, such as zip codes or ID numbers, or numeric, such as

counts. Discrete attributes are often represented using integer variables. Binary

attributes are a special case of discrete attributes and assume only two values,

e.g., true/false, yes/no, male/female, or 0/1. Binary attributes are often represented

as Boolean variables, or as integer variables that only take the values 0 or 1.

Continuous A continuous attribute is one whose values are real numbers.

Examples include attributes such as temperature, height, or weight. Continuous

attributes are typically represented as floating-point variables. Practically, real

values can be measured and represented only with limited precision.

In theory, any of the measurement scale types (nominal, ordinal, interval, and ratio) could be combined with any of the types based on the number of attribute values (binary, discrete, and continuous). However, some combinations occur only infrequently or do not make much sense. For instance, it is difficult to think of a realistic data set that contains a continuous binary attribute. Typically, nominal and ordinal attributes are binary or discrete, while interval and ratio attributes are continuous. However, count attributes, which are discrete, are also ratio attributes.

Asymmetric Attributes

For asymmetric attributes, only presence (a non-zero attribute value) is regarded

as important. Consider a data set in which each object is a student and each

attribute records whether a student took a particular course at a university. For a

specific student, an attribute has a value of 1 if the student took the course

associated with that attribute and a value of 0 otherwise. Because students take

only a small fraction of all available courses, most of the values in such a data set

would be 0. Therefore, it is more meaningful and more efficient to focus on the non-zero values. To illustrate, if students are compared on the basis of the courses they

don't take, then most students would seem very similar, at least if the number of

courses is large. Binary attributes where only non-zero values are important are

called asymmetric binary attributes. This type of attribute is particularly important

for association analysis, which is discussed in Chapter 5 . It is also possible to

have discrete or continuous asymmetric features. For instance, if the number of

credits associated with each course is recorded, then the resulting data set will

consist of asymmetric discrete or continuous attributes.
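One way to see why the 0-0 matches should be ignored is to compare a simple matching measure against the Jaccard coefficient, a common similarity measure for asymmetric binary data, on two sparse enrollment vectors (the vectors below are invented for illustration):

```python
# Sketch: two students' course-enrollment vectors over a hypothetical
# catalog of 10 courses (1 = took the course, 0 = did not).
s1 = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
s2 = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

matches_00 = sum(a == b == 0 for a, b in zip(s1, s2))
matches_11 = sum(a == b == 1 for a, b in zip(s1, s2))
mismatches = sum(a != b for a, b in zip(s1, s2))

# Simple matching counts shared absences, so sparse vectors look alike:
smc = (matches_00 + matches_11) / len(s1)         # 0.8
# The Jaccard coefficient ignores the 0-0 matches:
jaccard = matches_11 / (matches_11 + mismatches)  # 1/3
```

With a realistic catalog of thousands of courses, the simple matching score would approach 1 for almost any pair of students, while the Jaccard coefficient would still reflect how little their actual coursework overlaps.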

General Comments on Levels of Measurement

As described in the rest of this chapter, there are many diverse types of data. The

previous discussion of measurement scales, while useful, is not complete and has

some limitations. We provide the following comments and guidance.

Distinctness, order, and meaningful intervals and ratios are only four

properties of data; many others are possible. For instance, some data is

inherently cyclical, e.g., position on the surface of the Earth or time. As another

example, consider set-valued attributes, where each attribute value is a set of

elements, e.g., the set of movies seen in the last year. Define one set of

elements (movies) to be greater (larger) than a second set if the second set is a

subset of the first. However, such a relationship defines only a partial order that

does not match any of the attribute types just defined.
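This subset-based comparison can be sketched with Python's built-in set operators, which implement exactly such a partial order (the movie titles are invented for illustration):

```python
# Sketch of the subset-based partial order on set-valued attribute values:
# one value is "greater" than another only if it properly contains it,
# so some pairs of values are simply incomparable.
movies_a = {"Alien", "Arrival", "Solaris"}
movies_b = {"Alien", "Arrival"}
movies_c = {"Heat", "Arrival"}

print(movies_b < movies_a)                       # True: proper subset
print(movies_a < movies_c, movies_c < movies_a)  # False False: incomparable
```

Because neither of the last two sets contains the other, the ordering is only partial, unlike the total orders assumed by the ordinal, interval, and ratio types.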

The numbers or symbols used to capture attribute values may not capture

all the properties of the attributes or may suggest properties that are not

there. An illustration of this for integers was presented in Example 2.3, i.e., averages of IDs and out-of-range ages.

Data is often transformed for the purpose of…
