Counterfeiting banknotes has been a problem since the introduction of color photocopiers and computer image scanners. The banking industry has suffered from counterfeits because they contribute to inflation and reduce the value of real money. Assume that you are a data mining expert who works in the banking industry.

The dataset banknotes.csv contains 5 variables (or columns), and description-bank.docx contains a description of the dataset. The end goal is to build an appropriate model (or tool) to successfully predict forgery. Using SAS Studio, perform the following tasks:

Explore the dataset by providing summary statistics and graphical summaries of all the variables.

Explain some of the key aspects of data in part 1.

Examine if the dataset has any anomalies. Describe the method(s) you used as well as the results.

Examine whether there are any associations among the variables. Describe the approaches as well as the results.

Using one of the clustering techniques, analyze all the variables. Explain the results.

Using one of the classification techniques from the course, build the model that predicts forgery. Explain why you think the model you've chosen is most appropriate for this dataset.

Evaluate the model. How well does the model fit? Can you improve the model? Explain.

ATTACHED:

1. Dataset

2. Documentation for dataset

3. What I currently have completed

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA:
University of California, School of Information and Computer Science.
Data Set Information:
Data were extracted from images that were taken from genuine and forged banknote-like specimens. For
digitization, an industrial camera usually used for print inspection was used. The final images have 400 x
400 pixels. Due to the object lens and distance to the investigated object, gray-scale pictures with a
resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from
the images.
Attribute Information:
1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. kurtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous)
5. class (integer): 1- genuine, 2-forged
Portfolio Project Option 2: Counterfeiting Banknotes
Rachael Herman
MIS450: Data Mining
Professor Mamdouh Babi
August 7, 2022
Portfolio Project Option 2: Counterfeiting Banknotes
The portfolio project uses the banknotes.csv file to build an appropriate model that will
successfully predict forgery. This dataset was uploaded into SAS in the milestone for this
project. Summary statistics and a graphical summary of histograms are provided, along with a
discussion of key aspects from that data. ____ is used for anomaly detection, and associations are
identified using ____. The clustering technique used in this project to analyze the variables is
____, and results are explained in detail. A classification model to predict forgeries is provided,
along with reasoning for its use. Finally, the model is evaluated, and its fit is reviewed
along with any opportunities to improve the model.
Figure 1
Summary Statistics and Histograms of Banknotes Dataset
There are two classes identified in the dataset, with 1 meaning the observation is genuine
and 2 that it is forged. Basic statistics are computed for each of these two classes to look more
closely at legitimate and counterfeit items. The data show there are 762 genuine entries and 610
forged entries. The Wavelet Transform tool extracts features from the images, which are
recorded as continuous values in four variables:
1. V1 = variance of Wavelet Transformed image, which measures the spread of the values
around their mean (Sturdivant et al., 2016).
2. V2 = skewness of Wavelet Transformed image, which measures the asymmetry of the
distribution of the image features.
3. V3 = kurtosis of Wavelet Transformed image, which describes the weight of the tails and the
sharpness of the peak of a frequency distribution (Sturdivant et al., 2016).
4. V4 = entropy of image, which measures the degree of randomness in the image features (Thum,
1984).
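The four definitions above can be illustrated with a short Python sketch. This is not the SAS Studio workflow used in the project, and the `pixels` list is a hypothetical toy sample rather than data from banknotes.csv. The population formulas and the use of excess kurtosis (subtracting 3) are conventions chosen here; statistical packages often apply sample-size corrections, so their reported values may differ slightly.

```python
import math
from collections import Counter

def moments(values):
    """Return (mean, variance, skewness, excess kurtosis) using population formulas."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return mean, m2, skewness, excess_kurtosis

def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of the values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Hypothetical gray levels, for illustration only
pixels = [0, 0, 1, 1, 2, 2, 3, 3]
mean, variance, skewness, kurtosis = moments(pixels)
entropy = shannon_entropy(pixels)
print(mean, variance, skewness, kurtosis, entropy)
```

A symmetric sample like this one has zero skewness, and four equally frequent gray levels give an entropy of 2 bits.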
Summary statistics and histograms show a left skew in genuine entries (class
= 1) for variance, skewness, and entropy. Kurtosis has a right skew and appears to be multimodal.
Forgeries (class = 2) show a right skew in variance and kurtosis, with a left skew in
entropy. This suggests there is higher variance, skewness, and entropy in the images. When
classes are not separated, variance is multimodal but relatively normally distributed. Skewness is
left-skewed and multimodal, kurtosis is right-skewed and bimodal, and entropy is left-skewed.
Items with larger values of entropy and kurtosis tend to be class = 2 (forged). Higher variance
and skewness values tend to be class = 1 (genuine).
Figure 2
Summary Statistics of Continuous Variables Separated by Class
Figure 3
Summary Statistics of Continuous Variables not Separated by Class
Figure 4
Histogram With Classes in V1 = Variance
Figure 5
Histogram Without Classes in V1
Figure 6
Histogram With Classes in V2 = Skewness
Figure 7
Histogram Without Classes in V2
Figure 8
Histogram With Classes in V3 = Kurtosis
Figure 9
Histogram Without Classes in V3
Figure 10
Histogram With Classes in V4 = Entropy
Figure 11
Histogram Without Classes in V4
Anomaly Detection
An important task in data analysis is to look for any outliers or anomalies in the dataset.
To do this, one must identify objects that do not conform to the normal patterns of behavior in
the dataset (Tan et al., 2018).
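One common way to flag such non-conforming values is the 1.5 x IQR rule, sketched below in Python. This is an assumption offered for illustration, not the method selected for this project, and it is not SAS Studio code. The `sample` values and the function name `iqr_outliers` are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical feature values with one obvious extreme point
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0]
print(iqr_outliers(sample))
```

The rule looks at one variable at a time, so it would need to be applied separately to each of the four continuous variables, and possibly within each class.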
Clustering Technique for Variable Analysis
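One candidate technique is k-means. The Python sketch below shows the assign-and-update loop (Lloyd's algorithm) under stated assumptions: it is not the technique this project settled on, it is not SAS Studio code, and the `points` and starting centroids are hypothetical two-dimensional values rather than rows from banknotes.csv.

```python
def kmeans(points, centroids, iterations=10):
    """Run Lloyd's algorithm from the given starting centroids.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members
        for j in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical points forming two well-separated groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(centroids, labels)
```

With k = 2, the resulting cluster labels could be cross-tabulated against the class column to see how well the clusters line up with genuine and forged notes.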
Variable Associations
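A common starting point for associations between continuous variables is the Pearson correlation coefficient. The Python sketch below is an illustrative assumption, not the approach chosen for this project and not SAS Studio code. The columns `v1`, `v2`, and `v3` hold hypothetical values, not data from banknotes.csv.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical columns: v2 rises with v1, v3 falls as v1 rises
v1 = [1.0, 2.0, 3.0, 4.0, 5.0]
v2 = [2.0, 4.0, 6.0, 8.0, 10.0]
v3 = [5.0, 4.0, 3.0, 2.0, 1.0]
r12 = pearson(v1, v2)
r13 = pearson(v1, v3)
print(r12, r13)
```

Values near +1 or -1 indicate a strong linear association, and values near 0 indicate little linear association. Computing the coefficient for every pair of the four variables gives a correlation matrix.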
Classification Model
A decision tree will be used for this dataset to predict forgery. A decision tree
classification model determines the class of an observation or instance by asking a series of
questions about its attributes, with the questions organized into a hierarchical structure (Tan et al., 2018).
Advancement of digitization with scan and print techniques has led to serious counterfeiting
issues, making it difficult to identify forgery with the naked eye (Upadhyaya et al., 2018).
A decision tree can instead split on the variance, skewness, kurtosis, and entropy
extracted from the banknote images, making it a good model for this task.
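Each "question" in a decision tree is typically a threshold test on one attribute, chosen to make the resulting groups as pure as possible. The Python sketch below shows how a single split could be chosen using Gini impurity. It is a simplified illustration under assumptions, not the SAS Studio procedure used in the project: the `variance_values` and `classes` lists are hypothetical, and a full tree would repeat this search recursively over all four variables.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' question."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical variance values with class labels (1 = genuine, 2 = forged)
variance_values = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
classes = [2, 2, 2, 1, 1, 1]
threshold, impurity = best_split(variance_values, classes)
print(threshold, impurity)
```

In this toy sample, the question "is variance <= 0?" separates the two classes completely, so the weighted impurity after the split is zero.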
Model Evaluation
Summary statistics and histograms help with visualizing the data. After review of the
documentation and analysis of these summaries, I was able to determine a good classification
model. A decision tree will help predict forgery by asking specific questions about variance,
skewness, kurtosis, and entropy of the banknote images. This data will prove beneficial in the
final portfolio project, which will design a complete model for forgery prediction.
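Once predictions are available, model fit is usually judged from a confusion matrix. The Python sketch below computes accuracy, precision, and recall with forgery (class = 2) as the positive class. The `actual` and `predicted` lists are hypothetical and do not represent results from this project, and the function name `evaluate` is illustrative only.

```python
def evaluate(actual, predicted, positive=2):
    """Return confusion-matrix counts and basic metrics for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # forged, flagged
    fn = sum(a == positive and p != positive for a, p in pairs)  # forged, missed
    fp = sum(a != positive and p == positive for a, p in pairs)  # genuine, flagged
    tn = sum(a != positive and p != positive for a, p in pairs)  # genuine, passed
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels (1 = genuine, 2 = forged), for illustration only
actual = [2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
predicted = [2, 2, 2, 1, 1, 1, 1, 1, 1, 2]
metrics = evaluate(actual, predicted)
print(metrics)
```

For forgery detection, recall on the forged class deserves particular attention, since a missed forgery is generally more costly than a genuine note flagged for review.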
References
Sturdivant, R., Pardoe, I., Berrier, I., & Watts, K. (2016). Statistics for Data Analytics. zyBook
[online].
Tan, P.N., Steinbach, M., Karpatne, A., & Kumar, V. (2018). Introduction to data mining (2nd
ed.). Pearson.
Thum, C. (1984). Measurement of entropy of an image with application to image focusing.
Optica Acta: International Journal of Optics, 31(2), 203-211. DOI: 10.1080/713821475
Upadhyaya, A., Shokeen, V., & Srivastava, G. (2018). Decision tree model for classification of
fake and genuine banknotes using SPSS. World Review of Entrepreneurship,
Management and Sustainable Development, 14(6), 683-693. DOI:
10.1504/WREMSD.2018.097696