BUS 8375 – Assignment 2 – Tabulated Data
TABLE 1

Respondent | Age | Exam Mark (%) | Essay Mark (%) | Gender | Year in College | IQ
1          | 21  | 87            | 83             | M      | 2               | 80
2          | 19  | 83            | 80             | M      | 1               | 100
3          | 23  | 85            | 86             | M      | 4               | 98
4          | 21  | 81            | 75             | F      | 1               | 76
5          | 21  | 81            | 75             | F      | 3               | 82
6          | 20  | 67            | 68             | F      | 3               | 99
7          | 26  | 75            | 88             | F      | 2               | 120
8          | 24  | 92            | 78             | F      | 4               | 115
9          | 26  | 78            | 92             | M      | 4               | 126
10         | 30  | 89            | 95             | F      | 3               | 129
11         | 21  | 72            | 80             | F      | 1               | 86
12         | 19  | 81            | 65             | M      | 2               | 80
13         | 17  | 75            | 77             | M      | 1               | 70
14         | 19  | 76            | 85             | F      | 1               | 99
15         | 35  | 80            | 83             | F      | 3               | 99
16         | 27  | 75            | 60             | F      | 2               | 60
17         | 21  | 85            | 80             | M      | 3               | 89
18         | 27  | 79            | 75             | M      | 4               | 70
19         | 21  | 90            | 93             | F      | 3               | 140
20         | 22  | 97            | 95             | M      | 3               | 165
21         | 21  | 90            | 82             | M      | 2               | 115
22         | 19  | 87            | 86             | F      | 3               | 119
23         | 32  | 95            | 90             | M      | 2               | 120
24         | 19  | 68            | 57             | F      | 3               | 89

Source: Course textbook, page 298
Understanding Causality and Big Data: Complexities, Challenges, and Tradeoffs
This example illustrates two concepts from statistics that play a key role in Data Science and Big Data: correlation and causality. Correlation means that two readings behave together (e.g. smoking and cancer), while causality means that one is the cause of the other. The key point is that if there is causality, removing the first will change or remove the second. That is not the case with correlation.
Correlation does not mean Causation!
Srinath Perera
Mar 30, 2016
“Does smoking cause cancer?”
We have heard that a lot of smokers have lung cancer. However, can we mathematically tell that smoking causes cancer?
We can look at cancer patients and check how many of them smoke. We can look at smokers and check whether they develop cancer. Let’s assume the answers come up 100%. That is, hypothetically, we can see a 1–1 relationship between smoking and cancer.
Ok, great, so can we claim that smoking causes cancer? Apparently, it is not easy to make that claim. Let’s assume that there is a gene that causes cancer and also makes people like to smoke. If that is the case, we will see the 1–1 relationship between cancer and smoking, yet the cancer is caused by the gene. That means there may be an innocent explanation for the 1–1 relationship we saw between cancer and smoking.
This difference is critical when deciding how to react to an observation. If there is causality between A and B, then A is responsible; we might decide to punish A in some way, or we might decide to control A. However, correlation alone does not warrant such actions.
For example, as described in the post The Blagojevich Upside, the state of Illinois found that having books at home is highly correlated with better test scores, even if the kids have not read them. So they decided to distribute books. In retrospect, we can easily find a common cause: having books in a home is likely an indicator of how studious the parents are, which helps with better scores. Sending books home, however, is unlikely to change anything.
You see correlation without causality when there is a common cause that drives both readings. This is a common theme of the discussion. You can find a detailed discussion of causality in the talk “Challenges in Causality” by Isabelle Guyon.
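To make the common-cause idea concrete, here is a minimal simulation sketch in Python. All probabilities are made up for illustration: a hidden "gene" drives both smoking and cancer, so the two move together even though neither causes the other.

```python
# A hypothetical common cause: "gene" raises the chance of both smoking and cancer.
import random

random.seed(42)
n = 10_000
gene = [random.random() < 0.3 for _ in range(n)]       # hidden common cause
smokes = [g and random.random() < 0.9 for g in gene]   # gene -> likes to smoke
cancer = [g and random.random() < 0.9 for g in gene]   # gene -> cancer

both = sum(s and c for s, c in zip(smokes, cancer))
print("P(cancer | smoker) =", both / sum(smokes))      # high, despite no causal link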
Can we prove Causality?
Great, so how can we show causality? Causality is measured through randomized experiments (a.k.a. randomized trials or A/B tests). A randomized experiment selects samples and randomly breaks them into two groups, called the control and the variation. We then apply the cause (e.g. send a book home) to the variation group and measure the effects (e.g. test scores). Finally, we measure the causality by comparing the effects in the control and variation groups. This is how medications are tested. To be precise, if the error bars of the two groups do not overlap, then there is causality. Check https://www.optimizely.com/ab-testing/ for more details.
However, that is not always practical. For example, if you want to prove that smoking causes cancer, you would first need to select a population, place them randomly into two groups, make half of them smoke, and make sure the other half does not smoke. Then you would wait some 50 years and compare.
Did you see the catch? It is not good enough to compare smokers and non-smokers, as there may be a common cause, like the gene, that makes them do so. To prove causality, you would need to randomly pick people and ask some of them to smoke. Well, that is not ethical, so this experiment can never be done. Actually, this argument has been used before, e.g. https://en.wikipedia.org/wiki/A_Frank_Statement.
This can get funnier. If you want to prove that greenhouse gasses cause global warming, you need to find another copy of Earth, apply greenhouse gasses to one, and wait a few hundred years!
To summarize, causality can sometimes be very hard to prove, and you really need to differentiate between correlation and causality.
The following are examples of situations where causality is needed:
• Before punishing someone
• Diagnosing a patient
• Measuring the effectiveness of a new drug
• Evaluating the effect of a new policy (e.g. a new tax)
• Changing a behavior
Big Data and Causality
Most big data datasets are observational data collected from the real world. Hence, there is no control group. Therefore, most of the time all you can show is correlation, and it is very hard to prove causality.
There are two reactions to this problem.
The first: “Big data guys do not understand what they are doing. It is stupid to try to draw conclusions without a randomized experiment.”
I find this view blind. Obviously, there is a lot of interesting knowledge in observational data. If we can find a way to use it, that will let us apply these techniques in many more applications. We need to figure out a way to use it and stop complaining. If current statistics does not know how to do it, we need to find a way.
The second: “Forget causality! Correlation is enough.”
I find this view lazy. Playing ostrich does not make the problem go away. This kind of crude generalization makes people do stupid things and can limit the adoption of Big Data technologies.
We need to find the middle ground!
When do we need Causality?
The answer depends on what we are going to do with the data. For example, if we are just going to recommend a product based on the data, chances are that correlation is enough. However, if we are making a life-changing decision or a major policy decision, we might need causality. Let us investigate both types of cases.
Correlation is enough when the stakes are low, or when we can later verify our decision. The following are a few examples.
1. When the stakes are low (e.g. marketing, recommendations) — when showing an advertisement or recommending a product to buy, one has more freedom to make an error.
2. As a starting point for an investigation — correlation is never enough to prove someone guilty; however, it can show us useful places to start digging.
3. Sometimes it is hard to know which things are connected, but easy to verify the quality of a given choice. For example, if you are trying to match candidates to a job or decide on good dating pairs, correlation might be enough. In both of these cases, given a pair, there are good ways to verify the fit.
There are other cases where causality is crucial. The following are a few examples.
1. Finding the cause of a disease
2. Policy decisions (would a $15 minimum wage be better? would free health care be better?)
3. When the stakes are too high (shutting down a company, passing a verdict in court, sending a book to each kid in the state)
4. When we are acting on the decision (firing an employee)
Even in these cases, correlation can be useful for finding good experiments to run. You can find factors that are correlated and design experiments to test for causality, which will reduce the number of experiments you need to do. In the book example, the state could have run an experiment by selecting a population, sending the book to half of them, and looking at the outcome.
In some cases, you can build your system to inherently run experiments that let you measure causality. Google is famous for A/B testing every small thing, down to the placement of a button and the shade of a color. When they roll out a new feature, they select a population, roll out the feature to only part of that population, and compare the two groups.
So in all of these cases, correlation is pretty useful. However, the key is to make sure that the decision makers understand the difference when they act on the results.
Closing Remarks
Causality can be a pretty hard thing to prove. Since most big data is observational data, often we can only show correlation, not causality. If we mix up the two, we can end up doing stupid things.
The most important thing is to have a clear understanding at the point when we act on the decisions. Sometimes, when the stakes are low, correlation might be enough. In other cases, it is best to run an experiment to verify our claims. Finally, some systems might warrant building experiments into the system itself, letting you draw strong causality conclusions. Choose wisely!
Source: https://medium.com/making-sense-of-data/understanding-causality-and-big-data-complexities-challenges-and-tradeoffs-db6755e8e220
Retrieved: Dec. 19, 2019
BUSINESS RESEARCH AND DATA
ANALYSIS
LECTURE 10
QUANTIFIED DATA ANALYSIS
BUS8375 – 2022
TODAY’S AGENDA
• Lecture: Quantitative Data Analysis – Ch 15 and 16
• Assignment 2.
• Next lectures.
• Quiz 2 next week, after the lecture.
QUANTIFIED DATA ANALYSIS
OBJECTIVES
• Demonstrate the ability to get data ready for quantified
analysis.
• Describe the various processes by which one can get a
feel for the data in the study.
COMMENTS ON TEXTBOOK
• A substantial amount of material has been added to Chapters 15 and 16. Please make sure that you understand it. If you don’t, research the material on your own to grasp how it all works. There are lots of GREAT YouTube videos.
• For some of you, this will revisit material that you covered in your undergraduate studies.
• The manipulation of the data assumes that you have mastered Excel, as studied in your other course in the GBM program. Again, use YouTube videos to complement it.
• Chapter 14 – the following material is NOT covered:
• All elements related to the Excelsior Enterprises case and the associated software used in the chapter (SPSS). You will use Excel instead.
• Testing the goodness of measures.
CREATE ORDER
OUT OF
CHAOS
THE JOY OF
VISUALIZATION
IT IS ALL ABOUT GETTING
THE MESSAGE TO THE OTHER PERSON!
COMPILATION OF THE DATA
• So, you have administered a questionnaire as part of your primary research. You now have raw data that has to be formatted, compiled and displayed properly.
• Each respondent generates information that needs to be combined with that of all the other respondents.
• This raw data needs to be formatted in a manner that allows the researcher to understand its meaning and possibly start to extract information.
CODING OR TRANSFORMATION OF DATA
• In some instances, the answers might need to be simplified to convert them into a usable format. This often happens with textual answers describing a status (education level) or feelings (happy, sad, melancholic, angry, etc.). A number is associated with each word.
• At times, you will find that some of the answers don’t make sense or are inconsistent with previous answers. You will then need to decide whether to discard only the invalid answer or all the data for that particular respondent.
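As a minimal sketch of this coding step (the category names and numeric codes below are illustrative, not taken from the course data; in Excel the same thing is typically done with a lookup table):

```python
# Assigning numeric codes to textual answers.
education_codes = {"high school": 1, "college": 2, "bachelor": 3, "master": 4}

responses = ["college", "master", "high school", "colege"]  # note the typo
coded = [education_codes.get(r) for r in responses]
print(coded)  # [2, 4, 1, None] -> None flags an invalid answer to review or discard
```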
TABULATING THE DATA
• The raw data then needs to be entered into a data processor that can display the data in various ways and process it to extract relationships.
• Many software packages exist for this purpose. We will use Excel as our processor.
• Here is the data that we will use in ICA 7 today.
GETTING A FEEL FOR THE DATA FROM QUESTIONS
[Diagram: quantified analysis branches into four areas: A. Measures of central tendency, B. Measures of dispersion, C. Visual summaries, D. Measures of relation.]
A. MEASURES OF CENTRAL TENDENCY
• Mode: the most frequent value.
• Median: the point where there is an equal number of samples on each side.
• Mean: the average of a group of numbers: the sum of the x’s divided by n.
• Experimental data: 3, 4, 4, 5, 6, 8.
• Population (the whole): size N, values X.
• Sample (the extract): size n, values x.
EXAMPLE
• Individually.
• Determine the median, the mode and the mean for the following numbers:
• 2 4 8 4 6 2 7 8 4 3 8 9 4 3 5
• Median: _____
• Mode: _____
• Mean (arithmetic average): _____
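A minimal Python sketch for checking your answers (the Excel equivalents are AVERAGE, MEDIAN and MODE.SNGL):

```python
# Mean, median, and mode of the slide's data set.
import statistics

data = [2, 4, 8, 4, 6, 2, 7, 8, 4, 3, 8, 9, 4, 3, 5]
print("mean  :", statistics.mean(data))    # sum of x's / n
print("median:", statistics.median(data))  # middle value of the sorted list
print("mode  :", statistics.mode(data))    # most frequent value
```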
B. MEASURE OF DISPERSION
• What is this table telling us?
• Range?
• Min.: ___
• Max.: ___
• Creation of intervals.
HISTOGRAM
• Using the data on the previous page.
• The X axis is for the range of measurement; the midpoint of each interval is used.
• The Y axis is for the measurement.
• An “x” is placed at the measurement for each segment.
• This allows us to add a line that traces the measurement at each midpoint, creating a distribution curve. We will use this curve to calculate the dispersion.
[Sketch: a column of x’s above each interval midpoint]
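Here is a minimal sketch of the frequency count behind such a histogram, using the exam marks from Table 1 (the 5-point interval width is an illustrative choice):

```python
# Frequency table for a histogram: count exam marks (Table 1) per interval.
exam = [87, 83, 85, 81, 81, 67, 75, 92, 78, 89, 72, 81,
        75, 76, 80, 75, 85, 79, 90, 97, 90, 87, 95, 68]

width = 5
lo = min(exam) // width * width          # lowest interval start (65)
for start in range(lo, max(exam) + 1, width):
    count = sum(start <= x < start + width for x in exam)
    midpoint = start + width / 2
    print(f"{start}-{start + width - 1} (midpoint {midpoint}): {'x' * count}")
```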
HISTOGRAM TO DISTRIBUTION CURVE
• Second level of analysis, as we try to figure out the “spread” and behaviour of the data. Range: min. to max.
• Graphic representation of sampling an event, activity, etc.
MEASURE OF DISPERSION
• What can you tell me by looking at these 3 superimposed graphs? Don’t forget, you are looking at a data distribution pattern here.
• If this were an exam, which class would you want to be in?
MEASURES OF SHAPE
• Compared to a NORMAL distribution, is the data leaning one way or the other?
• Skewness = 3(µ: mean − Md: median) / σ: standard deviation.
• The higher the number, the more skewed the distribution.
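A short sketch of the slide’s skewness formula, reusing the earlier example data (using the sample standard deviation here is an assumption):

```python
# Pearson's skewness: 3 * (mean - median) / standard deviation.
import statistics

data = [2, 4, 8, 4, 6, 2, 7, 8, 4, 3, 8, 9, 4, 3, 5]
skew = 3 * (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)
print("skewness:", round(skew, 2))  # > 0: tail to the right; < 0: tail to the left
```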
MEASURE OF DISPERSION
• A company builds advanced computers. Daily production data: 5, 9, 16, 17 and 18. Total production of 65.
• Average daily output (mean): 13 (i.e. total production / no. of days).
• Deviation from the mean (µ): how does each daily output compare with the average? (The sum of the deviations is always zero or near zero.)
Deviation = Daily production − Mean
VARIANCE AND STANDARD DEVIATION
• Variance is the average of the squared deviations.
• Standard deviation is the most popular way of measuring the spread of data.
• PAY ATTENTION! When calculating the standard deviation of sample data, instead of using N as the denominator, we use n−1. The rest of the formula is the same, i.e. std dev of sample = sqrt(sum of squares / (n−1)).
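A minimal sketch working through the production example above, showing the N vs. n−1 distinction (Excel equivalents: STDEV.P and STDEV.S):

```python
# Variance and standard deviation for the daily production data (5, 9, 16, 17, 18).
data = [5, 9, 16, 17, 18]
mean = sum(data) / len(data)                    # 65 / 5 = 13
deviations = [x - mean for x in data]           # sums to 0
sum_sq = sum(d ** 2 for d in deviations)        # 64 + 16 + 9 + 16 + 25 = 130

var_pop = sum_sq / len(data)                    # population: divide by N -> 26.0
var_sam = sum_sq / (len(data) - 1)              # sample: divide by n-1 -> 32.5
print("population std dev:", var_pop ** 0.5)    # Excel: STDEV.P -> ~5.10
print("sample std dev    :", var_sam ** 0.5)    # Excel: STDEV.S -> ~5.70
```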
VERY USEFUL TOOL – NORMAL CURVE
• ±σ, ±2σ and ±3σ: the extent of the distribution, in standard deviations.
NORMAL CURVE – SEEN IN A DIFFERENT WAY
• ±σ, ±2σ and ±3σ: the extent of the distribution, in standard deviations.
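The shares of a normal distribution that fall within these bands (roughly 68%, 95% and 99.7%) can be verified with a short sketch:

```python
# Share of a normal distribution within +/-1, +/-2, and +/-3 standard deviations.
from statistics import NormalDist

z = NormalDist()                                # standard normal: mean 0, sigma 1
for k in (1, 2, 3):
    share = z.cdf(k) - z.cdf(-k)
    print(f"within +/-{k} sigma: {share:.1%}")  # ~68.3%, ~95.4%, ~99.7%
```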
C. VISUAL SUMMARY – 1 VARIABLE
• A histogram is a type of vertical bar chart used to depict a frequency distribution.
• It can depict multiple years on the same graph.
VISUAL SUMMARY – 1 VARIABLE
• A pie chart is a circular depiction of data where the area of the whole pie represents 100% of the data and the slices represent the breakdown of the sublevels.
D. MEASURE OF RELATION – 2 VARIABLES
• A scatterplot is a two-dimensional graph of pairs of points from two numerical variables.
• What are the 2 key components of this scatterplot?
• What would be a pair of data?
• What can you deduce by looking at this graph?
• Correlation (relationship) is NOT causation!
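A minimal plotting sketch using the Age and Exam Mark columns from Table 1 as the pairs (assumes the matplotlib package is installed):

```python
# Scatterplot of paired data from Table 1: each point is one respondent
# (x = Age, y = Exam Mark).
import matplotlib.pyplot as plt

age  = [21, 19, 23, 21, 21, 20, 26, 24, 26, 30, 21, 19,
        17, 19, 35, 27, 21, 27, 21, 22, 21, 19, 32, 19]
exam = [87, 83, 85, 81, 81, 67, 75, 92, 78, 89, 72, 81,
        75, 76, 80, 75, 85, 79, 90, 97, 90, 87, 95, 68]

plt.scatter(age, exam)
plt.xlabel("Age")
plt.ylabel("Exam Mark (%)")
plt.title("Table 1: Age vs. Exam Mark")
plt.show()
```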
THE DANGER OF SCATTERPLOTS
• What is the message in
this graph?
• Does this graph make
sense?
• Why?
• DANGER!
Source: Cairo, A. (2019), “Does Obesity Shorten Life?”, Scientific American, 321(3), p. 100.
MEASURE OF CORRELATION
• How one parameter relates to another parameter.
• Correlation, not causation, i.e. not what leads to what.
• The higher the number, the greater the link between the two.
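A short sketch computing the Pearson correlation coefficient between the Exam and Essay marks from Table 1 (Excel equivalent: CORREL):

```python
# Pearson correlation between exam and essay marks from Table 1.
import statistics

exam  = [87, 83, 85, 81, 81, 67, 75, 92, 78, 89, 72, 81,
         75, 76, 80, 75, 85, 79, 90, 97, 90, 87, 95, 68]
essay = [83, 80, 86, 75, 75, 68, 88, 78, 92, 95, 80, 65,
         77, 85, 83, 60, 80, 75, 93, 95, 82, 86, 90, 57]

mx, my = statistics.mean(exam), statistics.mean(essay)
cov = sum((x - mx) * (y - my) for x, y in zip(exam, essay))
r = cov / (sum((x - mx) ** 2 for x in exam) ** 0.5
           * sum((y - my) ** 2 for y in essay) ** 0.5)
print("r =", round(r, 3))  # closer to +/-1 means a stronger linear relationship
```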
CORRELATION – EXAMPLES
• Many parameters can be analysed in 2 dimensions. Analysis in 3 (or more) dimensions is also possible but becomes more challenging (AI for optimum solutions).
CORRELATION AND CAUSATION
• Easily confused.
• Should ice cream be banned in the summer?
• Is there something wrong with this graph?
• What conclusion can we generate from this information?
• What is the cause of the number of drownings?
• DANGER, DANGER!
• Correlation is generally symmetrical; causation is directional.
CAUSATION
• At times, easy to see. You want to move an object? You push it. That is causation.
• The direct relationship, called dependency, has to be proven through a series of experiments.
• This dependency can be multi-dimensional, i.e. involve more than one variable. E.g. an increase in sales can be caused by more than a price reduction.
• READ the excellent article loaded on eConestoga.
RAW DATA – 1
RAW DATA – 2
ASSIGNMENT 2
• Raw data is only a means to understand a variable.
• To understand it, we need to convert this raw data into usable information.
• Then we need to integrate this data into a user-friendly format.
• A variety of tools are available that will allow us to let the data “talk” to us.
• You must master Excel.
• Assignment 2 has a master data file. Use this file to answer the 4 questions listed in the second file for Assignment 2. Show the details of your work. Submit the answer for each question in its corresponding worksheet in the Excel file.
KEY WORDS AND CONCEPTS
• Raw data
• Coding
• Tabulation
• Data processor
• Population
• Sample
• Central tendency
• Median
• Mode
• Mean
• Range
• Histogram
• Pie chart
• Distribution curve
• Normal curve
• Variation
• Standard deviation
• Measure of dispersion
• Scatterplot
• Correlation
• Causation
WRAP-UP OF LECTURE
• Raw data is only a means to understand a variable.
• To understand it, we need to convert this raw data into usable information.
• Then we need to integrate this data into a user-friendly format.
• A variety of tools are available that will allow us to let the data “talk” to us.
• You must master Excel.
NEXT SESSION
• Week 11 – Lec. 10:
• Lecture: Qualitative Analysis – Ch 17
• Quiz 2.
• Final Exam Review
BUSINESS RESEARCH AND DATA
ANALYSIS
LECTURE 11
QUALITATIVE ANALYSIS
BUS8375 – 2022
TODAY’S AGENDA
• Lecture: Qualitative Analysis, Chapter 17.
• Quiz 2 is after the lecture today.
• The Group Project report submission was due before the start of the lecture today.
LECTURE
QUALITATIVE ANALYSIS
CH 17
OBJECTIVES
• Discuss the 3 important steps in qualitative analysis:
• Data reduction
• Data display
• Drawing conclusions.
• Discuss reliability and validity.
COMMENTS ON TEXTBOOK
• Chapter 17, with the exception of:
• Some other methods of gathering and analyzing qualitative data
• Big Data.
CREATE ORDER
OUT OF
CHAOS
CHAOS
• After proceeding with a series of interviews, observations and questionnaires (i.e. with open-ended questions), researchers end up with a large quantity of text: RAW DATA (i.e. chaos).
• This valuable data (assuming that the process and the questions used were appropriate (so many concerns here)) needs to be compiled in a manner that will allow the researcher to extract the messages communicated by the respondents.
• This is MOST CHALLENGING, particularly compared to quantified analysis, where a variety of mathematical tools can process raw data that is numerical.
• One approach is recognized as the benchmark for this analysis:
1. Data reduction
2. Data display
3. Drawing conclusions.
1. DATA REDUCTION – 1
• The ultimate goal is to get the data to “talk to us”.
• The raw data is overwhelming and needs to be reduced into segments that can be managed individually, then categorized.
• This process is NOT LINEAR (i.e. continuous) but ITERATIVE (i.e. back and forth), because adjustments will be needed as progress is made.
• FIRST STEP: the raw data needs to be rearranged into groups through a coding process. Coding takes recurring themes and provides titles under which these ideas are grouped. These are also called CATEGORIES.
• This process might require re-reading the raw data a few times (i.e. iteration) to see patterns and connections emerge between the various statements compiled in the raw data.
1. DATA REDUCTION – 2
• Coding units (also called categories) are selected: words (2 to 10 max.) that describe the various components of what is in the raw data.
• This creates a list of codes and categories that will be used to analyze the raw data.
• The coding is done for 2 types of information:
• Categories that are being studied, e.g. theme, issue, idea (rows)
• Categories of answers for each category being studied (columns)
• The goal is to identify recurring topics, comments, suggestions, etc. that can be combined through similarities.
• From the mass of collected data, the researcher has to identify common threads.
• The challenge is deciding which data is NOT USED, as there are one-off comments that are “outliers”, which cannot be used. These are discarded as not meaningful.
1. DATA REDUCTION – 3
• At times you might want to quantify the frequency with which a category is mentioned in the raw data. This can be applied to both rows and columns.
• The categorisation of columns can vary from row to row, depending on the questions asked and the answers found in the raw data. This provides flexibility in the analysis of the parameters measured, i.e. the rows.
• Often a column is added where comments can be inserted for the categories in the rows.
• Finally, in each cell at the intersection of a row and a column, information is provided that either quantifies the intersection (i.e. its frequency) or describes its meaning.
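As a minimal sketch of this frequency step (the categories and coded snippets below are illustrative, not from a real survey):

```python
# Counting how often each coded category appears in the reduced raw data.
from collections import Counter

coded_snippets = ["pricing", "support", "pricing", "delivery",
                  "support", "pricing", "usability"]
frequency = Counter(coded_snippets)
for category, count in frequency.most_common():
    print(f"{category}: {count}")   # e.g. pricing: 3, support: 2, ...
```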
2. DATA DISPLAY – 1
• The display created by the researcher will vary from project to project. Its design needs to adapt to the coding/categories that the researcher created through the questions, structure and survey design.
• The display will generally be a spreadsheet, with one axis listing a series of parameters (areas studied – variables) and the other axis listing another series of parameters (feedback received on each variable).
• The goal is to establish relationships between these 2 sets of parameters, generating a series of “messages” from which the researchers will be able to make observations, possibly leading to recommendations.
• The number of categories on each axis will also depend on the situation being analysed.
• 3 components: Rows, Columns and Cells.
2. DATA DISPLAY – 2
• Once the number of rows (areas studied) and columns (comments in the raw data related to each area studied, i.e. each row) is determined, the spreadsheet is created.
• Quantification can be used to measure the frequency with which each category is mentioned. This can be applied to the rows (areas studied) or to individual cells. This begins the interpretation of the data.
• Often, a column of comments is added, allowing the researcher to insert comments for the rows (areas studied) and to identify a particular point (e.g. a pattern or relationship) that is worth mentioning.
• In addition to a spreadsheet, other methods can be used: networks or diagrams that allow the researcher to present relationships between concepts in the data.
3. DRAWING CONCLUSIONS
• After a few iterations, the researcher will be able to better understand the information gathered, i.e. the answers or the feedback provided.
• The iterative process is key to understanding the data, as it forces reflection not only on the content but also on the relationships (i.e. the structure) in the data.
• The interpretation of the data will lead the researcher to draw conclusions from it.
• At the same time, the researcher has to take the feedback provided (i.e. the conclusions) and relate it to what is “possible”, as other parameters (availability of resources ($, people or time), rules/laws/ethics, relevancy, etc.) might constrain the implementation of the findings.
RELIABILITY AND VALIDITY
• Key points need to be considered in regard to the quality of the data:
• The quality of the survey (I-O-Q); questionnaire and sampling design
• The “honesty” of the respondents; any misunderstanding of the questions or the presence of bias in the answers
• The “honesty” of the researcher in structuring and compiling the data.
• Category reliability (for both rows and columns) will be affected by the above points, as they might impact the selection of the categories used.
• One approach to increasing reliability and validity is to have various people involved at each stage, checking the work done in the previous stage. Also, the raw data can be analysed (creation of the spreadsheet) individually by 2 or 3 people, then compared to see the similarities or differences.
KEY WORDS AND CONCEPTS
• Qualitative data
• Category reliability
• Data reduction
• Data display
• Data coding
• Reliability
• Validity
• Continuous process
• Iterative process
• Patterns
• Connections
• Outliers
WRAP-UP OF LECTURE
• Textual answers provide an opportunity for researchers to gather “open” answers that might not be captured in a quantified analysis (closed-ended questions).
• The raw data gathered can be substantial and complex to distill.
• This data needs to be simplified, then compiled within a structure (often a spreadsheet) that provides categories from which, through the interpretation of the structured data, it will be possible to identify patterns and relationships.
• High reliability and validity are more challenging to achieve than for quantified data analysis.
NEXT SESSION
• Lecture 12: Course Overview of processes and tools.
• Discussion on Final Exam.
• Wrap-up of course.
