
Glossary
Complete this glossary with the definition and the chapter / module where you can find it.
Note: You can turn in your completed glossary at the end of the course for bonus points.
Term: Definition
Alpha:
Chapter
Alternative Hypothesis:
Associations:
Bias:
Bivariate:
Bivariate analysis:
Case:
Categorical:
Categories (response options):
Causation:
Census:
Changes:
PUH 250 Glossary
Page 1 of 6
Chi-square test:
Collapsed:
Comparisons:
Confidence Interval:
Constant:
Contingency table:
Continuous variable:
Control:
Convenience Sampling:
Correlation:
Correlation coefficient:
Cross tabulation:
Cumulative Percent:
Data: A collection of facts, not necessarily numeric, such as: age, gender, hair
color, weight, temperature, etc. (Chapter 1)
Dependent (outcome) variable:
Descriptive statistics:
Dichotomous:
Dispersion:
Evidence:
Exhaustive:
Experiment:
Frequency / Frequency Distribution:
Generalizability:
Hypotheses:
Hypothesis:
Independent (predictor) variable:
Inference:
Inferential Statistics:
Interquartile Range:
Lower bound:
Mean:
Median:
Mode:
Model:
µ (mu):
Mutually exclusive:
N:
No plausible alternative:
Nominal:
Non-random or convenience sample:
Null hypothesis:
Observation:
Ordinal:
p-value:
Parameters:
Pearson’s r (or simply r):
Percent:
Population: A well-defined collection of objects, such as: students (Chapter 1)
Proportion:
Quantiles:
Range (Minimum, Maximum):
Representative:
Research question:
Resources:
Row, column, total percent:
Sample size:
Sample:
Sampling:
Simple random sample:
Simple Random Sampling:
Standard Deviation:
Statistical Significance:
Stratified Sampling:
Study designs:
t-test:
Temporal precedence:
Unit of analysis:
Univariate analysis:
Upper bound:
Variable:
Variable type:
Variance:
Biostatistics for
Public Health
Stacey S. Cofield, PhD
Erika L. Austin, PhD, MPH
Stacey S. Cofield, PhD, and Erika L. Austin are Associate Professors in the Department
of Biostatistics in the School of Public Health (SOPH) at the University of Alabama at
Birmingham (UAB). Drs. Cofield and Austin have been teaching introductory
biostatistics for over 15 years at both the graduate and undergraduate levels. All
contents here © Stacey S. Cofield and Erika L. Austin, 2021.
Stacey S. Cofield, PhD
Associate Professor, Biostatistics
Associate Dean of Recruitment, Retention, & Diversity
Erika L. Austin, PhD, MPH
Associate Professor, Biostatistics
Associate Dean of Student & Academic Affairs
School of Public Health
Preface
How to Use this Book
Chapter 1
The Role of Biostatistics in Public Health
Chapter 2
Continuous Variables
Chapter 3
Categorical Variables
Chapter 4
Chapter 5
Collecting the Evidence
Chapter 6
Review Chapters 1-5
Chapter 7
Chapter 8
Chapter 9
Determining Significance
Chapter 10 Comparing Means
Chapter 11 Comparing Proportions
Chapter 12 Correlation
Chapter 14 Review
Preface
This textbook has been specifically designed for this course to align with the course
modules. Each chapter contains: an introduction, the objectives for the chapter (that
are associated with the module and course objectives in the course), important
definitions for that chapter, software examples, and practice problems for you to
complete and evaluate your progress in the course. Each chapter will conclude with
important questions you should be able to answer after completing the chapter and
associated module on Canvas.
In addition to the practice problems in these chapters, there will be practice exercises
on Canvas for you to complete and turn in for credit. There are also lecture
notes/videos, software demos and additional bonus exercises available on Canvas.
This textbook alone is not sufficient for complete understanding and comprehension of
the materials presented in this course. This textbook is one of several materials to help
you succeed!
[Figure: Canvas, Learning Checks, Practice, Textbook, Software]
Chapter 1 The Role of Biostatistics in Public Health
Module Learning Objectives (Course Learning Objectives): At the end of
this module students should be able to:
• MO 1.1: List several of the ways that biostatistics provide the evidence
necessary for public health research and practice (CLO 2)
• MO 1.2: Justify the importance of a particular public health problem on the
basis of an evaluation of the evidence (CLO 1)
Definitions
• Data
• Population
• Census
• Sample
• Variable
• Descriptive Statistics
• Inferential Statistics
• Sampling
• Simple Random Sampling
• Stratified Sampling
• Convenience Sampling
• Parameters
• Model
• Hypotheses
• Statistical Significance
• Evidence
• Associations
• Changes
• Resources
• Comparisons
What are statistics? What is the practice of biostatistics?
These are two different questions! Statistics are just numbers but the practice of
biostatistics involves measuring variability of numbers to interpret results.
Statistics can be used to analyze data after an experiment has been carried out but can
(and should) also be used to make suggestions for how experiments can be designed to
reduce variation and produce better, more accurate, consistent, and predictive results.
There are numbers, formulas, and defined scientific processes involved in answering a
statistical question. However, keep in mind that statistics as a mathematical discipline is
a different discipline, called theoretical statistics, that is NOT this class.
We will be approaching statistics as an applied discipline using some basic levels of
math. Instead of using theorems, properties, and abstract math, we’re going to use case
studies and real data to illustrate some fundamental points about using statistics to
make sense out of data and answer questions.
We’ll begin with some definitions (there will be more on each of these topics in future
chapters):
• Data: A collection of facts, not necessarily numeric, such as: age, gender, hair color,
weight, temperature, etc.
• Population: A well-defined collection of objects, such as: students (at UAB, in
engineering), paint colors (from 1 company, from multiple companies), etc.
• Census vs Sample:
  o If you collect information on all of the objects in a population, that is a census.
  o If you collect information on some of the population, that is a sample.
  o You rarely have a true census. Instead, you try to collect a sample that is
    representative of the population about which you want to make inferences.
  o Note: The US Census is actually a sample, not a true census.
• Variable: A measurement on an object that can change from one object to another.
Usually denoted with lower case letters: x, y, z
  o Numerical variables: age, height, time. These are variables that can
    (theoretically) be measured on an (infinite) continuous numeric scale.
    ▪ There is an inherent order to numeric variables: there is a minimum and
      a maximum value and potential values in between.
    ▪ Numeric variables are often summarized with means and standard
      deviations, medians and/or ranges.
    ▪ These variables can be grouped into categories, but that is not how the
      data was originally collected: it was collected as a number.
  o Categorical variables: gender, hair color, school class. These are variables
    that are measured in mutually exclusive (non-overlapping) categories. There
    may or may not be an inherent order to categorical variables.
    ▪ Nominal variables are in name only and there is no defined order as to
      which is better or higher, e.g. hair color.
    ▪ Ordinal variables have an inherent order, e.g. low, medium, high; or an
      ordinal scale 1, 2, 3, 4, 5, where a 4.5 isn’t possible and doesn’t make
      sense.
    ▪ These variables are often summarized with the number (n) and percentage
      (%) in each group.
    ▪ These variables generally can’t be converted back into continuous numeric
      variables; means and standard deviations don’t make sense with grouped
      variables.
• Descriptive Statistics: Often called summary statistics, such as the number of
subjects (N), the mean of values (µ), variance (s²), standard deviation (s). These can be
depicted using plots, such as: histograms, box plots, and scatter plots.
• Inferential Statistics: The process of using data to make generalizations to a
population, such as: confidence intervals, estimation, prediction, etc. Inference is a
conclusion that patterns in the data are present in the population.
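As a minimal sketch of the descriptive statistics defined above, the following Python snippet computes N, the mean, the sample variance (s²), and the sample standard deviation (s) using only the standard library. The `ages` values are hypothetical and are not data from this course; the course itself uses JMP for these calculations.

```python
import statistics

# Hypothetical ages (in years) for a small sample; not course data.
ages = [19, 21, 22, 20, 23, 21, 20, 22]

n = len(ages)                          # N: number of subjects
mean_age = statistics.mean(ages)       # sample mean
var_age = statistics.variance(ages)    # sample variance s^2 (divides by n - 1)
sd_age = statistics.stdev(ages)        # sample standard deviation s

print(f"N = {n}, mean = {mean_age:.2f}, s^2 = {var_age:.2f}, s = {sd_age:.2f}")
```

Note that the standard deviation is simply the square root of the variance, which is why reports usually give one or the other, not both.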
When collecting data, make sure you collect a good sample to avoid a biased sample.
For example, if you are trying to summarize how students feel about a political issue,
ask men and women, Republicans and Democrats, freshmen and seniors, etc. There are
several sampling procedures:
• Simple random sampling: the simplest sampling procedure involves selecting
a subset of n objects from the population, such that each object has an equal
chance of being selected
• Stratified sampling: sampling a subset of n from each group, such as each gender,
each age group, or each school class
• Convenience sampling: when it isn’t possible to get a simple random sample,
you sample what you have available to you
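The three sampling procedures above can be sketched in a few lines of Python. This is an illustration only: the population of 12 students, the function names, and the choice to model a convenience sample as "the first n people listed" are all assumptions made for the example.

```python
import random

random.seed(250)  # fixed seed so the sketch is repeatable

# Hypothetical population of 12 students, each tagged with a school class.
population = [
    {"id": i, "school_class": cls}
    for i, cls in enumerate(
        ["Freshman"] * 4 + ["Sophomore"] * 4 + ["Senior"] * 4, start=1
    )
]

def simple_random_sample(pop, n):
    """Select n objects so that each object has an equal chance of selection."""
    return random.sample(pop, n)

def stratified_sample(pop, key, n_per_stratum):
    """Select n_per_stratum objects at random from each group (stratum)."""
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, n_per_stratum))
    return sample

def convenience_sample(pop, n):
    """Take whoever is easiest to reach; here, simply the first n listed."""
    return pop[:n]

srs = simple_random_sample(population, 6)
strat = stratified_sample(population, "school_class", 2)
conv = convenience_sample(population, 6)
```

In this sketch the convenience sample contains only freshmen and sophomores, which shows how sampling what is available can leave out part of the population.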
Often the goal of a study is to declare a causal relationship between a response and
predictors. The response could be change in blood pressure and the predictors could be
age, gender, exercise, and weight. Unless the study is designed well, with a
random sample and a temporal association, you won’t be able to declare a causal
relationship.
Cause and effect relationships should only be drawn from randomized experiments.
Observational studies, where the subjects are not randomly chosen or allocated for study,
can establish correlation between a response and predictors, but not causation.
Keep in mind that
Correlation ≠ Causation
Just because two things are associated does not mean that one causes the other.
Inferences to populations should only be drawn from random sampling studies, such as
randomized clinical trials and designed laboratory experiments. Some other definitions
that statisticians use:
• Parameters: These are unknown coefficients (variables) in the model that need to
be estimated, such as the mean or standard deviation. Unless you have a census
(all subjects in a population), these are never truly known, only estimated.
• Model: The statistical model is an equation that predicts the response (or outcome)
as a function of other variables.
• Hypotheses: Usually stated in terms of null and alternative hypotheses: the question you
are trying to answer and the alternative (or opposite) of that question. In statistics,
the null hypothesis is usually the current standard or what you are trying to show is
no longer valid. The alternative is what you are trying to show by statistically
“rejecting” the null. You never prove or disprove a hypothesis; you reject or fail to
reject, or accept or fail to accept, hypotheses.
• Statistical Significance: A precise statistical term that does not equate to practical
or clinical significance. This usually means that the data provides evidence that the
estimated parameter is not the same as the null value (assumed value).
Asking the question: Public Health research usually starts with a question. Is blood
pressure different by gender? Is blood pressure associated with age? With caloric
intake? With amount of exercise? Is high blood pressure associated with an increased
risk of having a stroke? Can medication lower blood pressure?
To answer these questions, and questions like these, we need to learn about data,
about how to ask the right questions, how to use that data to answer the question, and
how to convey what we learned.
How do we answer the question? How do we use statistics to answer the question? If
you think about how you make decisions every day, this can be applied to making
statistical decisions: use what you see, use what you know, use what you can show:
• Begin by writing down what you understand
• Outline what the data says and form clear and succinct questions pertaining
to what the data may imply (or what you would like to show)
• Form a scientific question to determine if the results are random
• Compare the data from each side of the question and decide what to believe
• Write down what you found and what it means
Biostatistics and Public Health
According to the American Public Health Association (https://www.apha.org/what-is-public-health):
Public health promotes and protects the health of people and the
communities where they live, learn, work and play.
While a doctor treats people who are sick, those of us working in public
health try to prevent people from getting sick or injured in the first place. We
also promote wellness by encouraging healthy behaviors.
From conducting scientific research to educating about health, people in the
field of public health work to assure the conditions in which people can be
healthy. That can mean vaccinating children and adults to prevent the
spread of disease. Or educating people about the risks of alcohol and
tobacco. Public health sets safety standards to protect workers and develops
Public health works to track disease outbreaks, prevent injuries and shed
light on why some of us are more likely to suffer from poor health than
others. The many facets of public health include speaking out for laws that
to stay healthy and giving science-based solutions to problems.
Public health saves money, improves our quality of life, helps children thrive
and reduces human suffering.
So how does Biostatistics fit in with public health? Biostatistics is a way of providing
evidence to support conclusions about disease, injury, treatment, behaviors, etc.
Biostatistics does this in 4 ways:
1. Associations: Biostatistics can help find associations between exposures
& outcomes, treatments & disease changes, comorbidities, behaviors &
physical changes, behaviors & diseases, etc.
2. Changes: Biostatistics is used to track changes over time
3. Targeting resources: Biostatistics can help determine disparities to target
interventions for change. Where can the greatest changes be made? What
group or groups should be focused on for change?
4. Comparisons: Biostatistics can compare impacts and find differences
We began with an introduction to biostatistics and defined some commonly used terms.
We’ll get to each of these terms and how they apply to the steps to answer a question
with data. First, we will examine the types of data and variables and study methods that
can be used to collect data for analysis.
Chapter 1: Wrap up Questions
• What is the difference between a census and a sample?
• What are the two main types of variables?
• What are the two types of categorical variables?
• What is the difference between descriptive statistics and inferential statistics?
• How do we determine the importance of different public health problems?
• What can Biostatistics do for public health?
• What are the four types of evidence provided by biostatistics?
Chapter 1: Problems
An older apartment building is having water quality problems. To determine if
there have been any adverse health outcomes in the building, you are going to collect
information on the residents of the building:
1. If you have collected data on all persons living in the apartment building, is this
a census or a sample?
2. If you have collected data on people that were home during your visits to that
building, is that a census or a sample? If it is a sample, what type of sample is
this?
3. If you asked people to give their age and you collected it as Distribution (this means go to the Analyze Menu and select
Distribution)
3. Select your variable, then click on Y, Columns
4. Press OK
For this example:
1. Open PUH 250 Module 2 Example.jmp
2. Analyze > Distribution
3. Select Days Exercise
4. Press OK
Example Interpretations:
• Visitors to the UAB Rec Center were asked how many days in a week they
exercise and, on average, how many minutes they exercise (N=10). The mean
response was 3.7 (SD 1.9) days per week.
• Visitors to the UAB Rec Center were asked how many days in a week they
exercise and, on average, how many minutes they exercise (N=10). The
median response was 4.0 (IQR 2.3, 5.0) days per week.
• Visitors to the UAB Rec Center were asked how many days in a week they
exercise and, on average, how many minutes they exercise (N=10). The
median response was 4.0 (range 0, 7) days per week. → Note that here, this is
not that informative since there are only 0-7 days in a week.
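The reporting styles in the example interpretations above can be sketched in Python as follows. The `days` values are hypothetical and are NOT the values in the course example file, so the results will not match the numbers quoted above. Quartile rules also differ slightly between software packages, so the quartiles from Python's `statistics.quantiles` may not agree exactly with JMP.

```python
import statistics

# Hypothetical responses: days of exercise per week for 10 people.
# These are NOT the values in the course example file.
days = [1, 2, 3, 3, 4, 4, 5, 5, 6, 7]

mean_days = statistics.mean(days)
sd_days = statistics.stdev(days)
median_days = statistics.median(days)

# Quartiles; software packages use slightly different quantile rules,
# so these may not match JMP exactly.
q1, q2, q3 = statistics.quantiles(days, n=4)

# Minimum, first quartile, median, third quartile, maximum.
five_number = (min(days), q1, median_days, q3, max(days))

print(f"Mean (SD): {mean_days:.1f} ({sd_days:.1f})")
print(f"Median (IQR): {median_days:.1f} ({q1:.1f}, {q3:.1f})")
print(f"Median (range): {median_days:.1f} ({min(days)}, {max(days)})")
print("Five-number summary:", five_number)
```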
When you have a continuous variable, you can describe the center and spread of the
data using the mean (SD), median (IQR), median (min, max), or the five-number
summary; depending upon the shape of the distribution and how much information you
Chapter 2: Wrap up Questions
• What is the middle point of a distribution of numbers?
• What is the best estimate of a distribution of numbers?
• What is the most common value of a distribution of numbers?
• What are the 3 measures of spread for a distribution of numbers?
Chapter 2: Problems
Open JMP, under the Help menu > Sample Data, open Diabetes.jmp. This is a sample
dataset of 442 participants with diabetes.
Use Analyze > Distribution to answer the following questions for Age, BMI, Total
Cholesterol, Glucose, and HDL:
1. What is the mean (SD)?
2. What is the five-number summary?
3. Is the data roughly symmetric or is it skewed?
4. Would you report the mean (SD) or the five-number summary?
Chapter 3. Categorical Variables
Module Learning Objectives (Course Learning Objectives): At the end of this
module students should be able to:
• MO 3.1: Identify the characteristics of a categorical variable (CLO 1, 3)
• MO 3.2: Determine variable type based on examination of the categories
(response options) (CLO 1, 2, 5)
• MO 3.3: Order variables from lower to higher level of measurement (CLO 1, 2, 5)
• MO 3.4: Create variables with mutually exclusive and exhaustive categories
(response options) (CLO 1, 2, 5)
• MO 3.5: Define proportion and percentage (CLO 1, 3)
• MO 3.6: Calculate proportions and percentages (CLO 1, 3, 5)
• MO 3.7: Interpret proportions and percentages (CLO 1, 3, 5)
• MO 3.8: Distinguish between categorical and continuous variables (CLO 1, 2)
Definitions for this module:
• Data
• Descriptive statistics
• Inferential statistics
• Variable
• Constant
• Categories (response options)
• Variable type
• Categorical
• Nominal
• Dichotomous
• Ordinal
• Mutually exclusive
• Exhaustive
• Percent
• Proportion
• Collapsed
Recall that Biostatistics is a way of using math to analyze the health of populations.
Biostatistics helps us to organize information (data) to look for patterns (descriptive
statistics) in our data so that we can make more general statements about the
population (inferential statistics). Since Biostatistics uses math, we do need numbers
to use for analysis. We saw this when we looked at continuous variables: variables
that are numeric from a measurable and consistent scale. But what about questions that
have answers that are not collected as numbers? Consider the following question:
• Have you ever been told you have high blood pressure?
Sure, we know that we can collect actual blood pressure measures (in mmHg) and then
determine if someone meets the criteria for having high blood pressure, but what if we
can’t see them in a clinic? What if we are simply asking the Yes/No question?
Specifically, the question asked on NHANES (the National Health and Nutrition
Examination Survey) is:
• Have you ever been told by a doctor or other health professional that you had
hypertension, also called high blood pressure?
The responses from this question (and others) allowed for the creation of Figure 1 (seen
below) in the Hypertension Prevalence Among Adults Aged 18 and Over: United States,
2017-2018 NCHS Data Brief, Number 364, April 2020 (found in Module 2).
In this case, hypertension is determined from a Yes/No response. It is measured in
categories, as is sex (men and women). Sometimes numbers are assigned to
these categories (Yes = 1, No = 0; Men = 1, Women = 2) but these numbers don’t have
any inherent meaning (we could have used Yes = 2, No = 1; Men = 0, Women = 1).
Don’t worry, you don’t actually have to assign the numbers for analysis; this is done in
the background in software.
Of course, we don’t just look around the world and convert everything we see to
numbers; we are specific about what data we collect. Each individual
question or aspect we collect data on is called a variable. As we’ve seen, variables can have
natural numeric responses (like height, weight) but they can also have categories as
response options. The term variable implies the responses have options, as opposed
to a constant, which always has the same value. Variables are associated with the
question being asked and the values are the responses.
• Learning Check: In the NCHS data brief, looking at Figures 1, 2, and 3, there
are 5 variables used in the results. Can you identify them?
When we have a natural numeric response and use the variable in that way, those are
called continuous variables (as we saw in Chapter 2). Other responses are measured
more simply (out of ease or design) by putting the responses into categories; these are
categorical variables. The difference between continuous and categorical variables is
the response options. Variables range from very simple to more complex levels of
measurement. The table below outlines our 2 types of variable classes.
Variable                  Response Options
Continuous (Numeric)      • Variables have response options that are numbers, where the
                            numbers have an inherent meaning and an associated scale
                            (units) of measurement
Categorical (Character)   • Dichotomous variables only have 2 possible responses
                          • Nominal variables have 3 or more response options that are
                            in name only (words) that are not in a specific preferred order
                          • Ordinal variables have response options that can be put in
                            order or ranked. These can be either words (ranging from less
                            to more) or groups of numbers captured as a group (1-3, 4-6,
                            7-9).
• Learning Check: In Figures 1, 2, and 3 in the NCHS Data Brief, there are 2
dichotomous variables, 2 nominal variables, and 1 ordinal variable. Can you
identify them?
The categories (response options) for variables need to be carefully and specifically
designed to accurately capture all the possible responses that people could give. For
example:
• How many times have you been told you have high blood pressure?
• Response options: 0 times, 1-2 times, 3-5 times, more than 5 times
These categories are mutually exclusive, meaning that any response selected fits
into only one category; there is no overlap. Often,
we do see categories with bounds that do overlap: 0-2, 3-5, 5 or more. In this case, if
your answer were 5, you could be in the 3-5 or the 5 or more category. These are NOT mutually
exclusive.
Categories must also be exhaustive, meaning that every possible response must be in
one of the categories. What if the responses were 0-2, 3-4, 5-6? Then what if a person
had been told 10 times they had high blood pressure? These categories would not be
exhaustive because those with responses 7 or greater have nowhere to go. How do you
allow for exhaustive response options without having too many categories?
• For categorical, include an “or more” option or an “other” option
• For continuous, allow respondents to write in the number
• It is also very good practice to offer the following options: don’t know, unsure, or prefer
not to answer. Depending on the question this can allow for everyone to answer
the question, even if they don’t know or do not want to give specific information.
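To make the two requirements concrete, here is a small Python sketch that checks whether a set of count categories is mutually exclusive and exhaustive, using the three sets of response options discussed above. The function name `check_categories`, the `(low, high)` representation, and the cutoff of 20 are assumptions made for this illustration; `high=None` stands for an open-ended "or more" category.

```python
def check_categories(bins, max_value):
    """Check integer count categories given as (low, high) pairs.

    Use high=None for an open-ended "or more" category.
    Returns (mutually_exclusive, exhaustive) for values 0..max_value.
    """
    def matches(value):
        return [
            (low, high) for low, high in bins
            if value >= low and (high is None or value <= high)
        ]

    counts = [len(matches(v)) for v in range(max_value + 1)]
    mutually_exclusive = all(c <= 1 for c in counts)  # no value fits two categories
    exhaustive = all(c >= 1 for c in counts)          # every value fits somewhere
    return mutually_exclusive, exhaustive

good = [(0, 0), (1, 2), (3, 5), (6, None)]   # 0, 1-2, 3-5, more than 5
overlapping = [(0, 2), (3, 5), (5, None)]    # 0-2, 3-5, 5 or more
incomplete = [(0, 2), (3, 4), (5, 6)]        # 0-2, 3-4, 5-6

for name, bins in [("good", good), ("overlapping", overlapping),
                   ("incomplete", incomplete)]:
    print(name, check_categories(bins, max_value=20))
```

The overlapping set fails the first check because a response of 5 fits two categories, and the incomplete set fails the second because responses of 7 or more fit none.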
Describing Categorical Variables
Once you have designed your categorical variable, you need to think about how to
describe the responses. Similar to continuous measures, you will use a frequency, the
number of times something occurs. Again, you will use software to organize the data and
create a frequency distribution (also called a distribution), which gives the number of
times each response occurs and divides this number by the total number of responses,
N (also called the Sample Size), to determine the percent of each response (%).
It is important to note the difference between two commonly used terms with categorical
frequency:
• Percent: 0-100%. This is very common in everyday language:
  o Number of responses in that category / N (total number of responses) * 100
• Proportion or Probability: 0-1. This is a statistical version of the same concept. If
you multiply a proportion by 100% then it becomes a percent. You’ll see Prob in
JMP (and other software) but when you are reporting results, make sure to report
percentages (people understand percentages):
  o Number of responses in that category / N (total number of responses)
Consider this example for the same question of “How many times have you been told
you have High Blood Pressure?” with the response options of 0 (Never), 1-2, 3-5, or 6
or more. Here is the data with output from JMP (Analyze > Distribution):
JMP output shows the:
• Histogram: Chart with bars for each level
• Level: the response options
• Count: Also called N, the number of times the
response was recorded
• Prob: the probability or proportion out of the total
number of responses
  o To go from Prob to %, multiply Prob * 100
  o E.g., 0.45 * 100 = 45.0%
  o 45.0% (9) of 20 respondents indicated they have
    never been told they have high blood pressure.
• Total: total number with a response (note that this
may be different from the total number of people in
your data set)
• N Missing: the number in your overall data set that
do not have a response to this variable.
  o Total + N Missing = Number of rows in your data
    set
Distributions: Times High BP
[Histogram of Times High BP with bars for 6 or more, 3-5, 1-2, and 0 (Never)]
Frequencies
Level        Count   Prob
0 (Never)        9   0.45000
1-2              4   0.20000
3-5              4   0.20000
6 or more        3   0.15000
Total           20   1.00000
N Missing        0
4 Levels
This could also be shown in a Cumulative Frequency table:
Table 3.1
Frequency Distribution for Number of Times Told has High BP
Number of Times   Number of Responses   % of all responses   Cumulative %
0 (Never)                           9                 45.0           45.0
1-2                                 4                 20.0           65.0
3-5                                 4                 20.0           85.0
6 or more                           3                 15.0          100.0
N =                                20                 100%
Data Source: NHANES Simulation (← this is made up data!)
Again, you’ll have a Cumulative % that adds together each of the percentages up to a
certain point in the distribution. With mutually exclusive and exhaustive response
options, all of your percentages should total 100%.
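The arithmetic behind Table 3.1 can be sketched in Python as follows. This is only an illustration of the calculations (the course uses JMP to produce them); the counts are the simulated values from the table, and the variable names are chosen for this example.

```python
# Counts from the simulated (made-up) example in Table 3.1.
counts = {"0 (Never)": 9, "1-2": 4, "3-5": 4, "6 or more": 3}

total = sum(counts.values())  # N, the sample size

rows = []
cumulative = 0.0
for level, count in counts.items():
    proportion = count / total    # between 0 and 1 ("Prob" in JMP)
    percent = proportion * 100    # between 0 and 100
    cumulative += percent         # running total of the percentages
    rows.append((level, count, proportion, percent, cumulative))

print(f"{'Level':<10} {'Count':>5} {'Prob':>6} {'%':>6} {'Cum %':>6}")
for level, count, proportion, percent, cum in rows:
    print(f"{level:<10} {count:>5} {proportion:>6.2f} {percent:>6.1f} {cum:>6.1f}")
print(f"{'Total':<10} {total:>5}")
```

Because the categories are mutually exclusive and exhaustive, the last cumulative percentage is 100.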
• Learning Check: Which type of categorical variable is this? Dichotomous,
Nominal, or Ordinal?
Interpretation guidelines for frequency distributions:
• Description of the variable (what question was asked?)
• The total N (and any N missing, that is, are there people that didn’t answer the
question?)
• Percentage and N of cases in each category or percentages associated with
common numeric responses: e.g. 45.0 (9) as seen above
• Source of the data
Since categorical variables are more simply described, there are fewer options for
description. The most common are:
• N(%): this is the number of responses (N) for each option with the percentage out of
the total. It is critical to report not just the % but also the N. This is because you can
get 20.0% in many different ways (for example, 4 of 20 or 200 of 1,000).
  o The most common response was Never having been told they have high
    blood pressure, 45.0% (9).
  o Or you will also see it as N(%) or 9 (45.0), either is fine.
• Mode: The mode is the most common response in a distribution. For a continuous
variable it is the number that occurs most frequently out of all the responses. Be
sure to report the actual response, not the frequency. The mode is appropriate to
report for all variable types, those collected as a number and those collected as a
category.
  o Mode = 0 (Never) in this example
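As a brief sketch of finding the mode for a categorical variable, the snippet below rebuilds the 20 simulated responses from the counts in this example and picks out the most common level with Python's `collections.Counter`. The list of individual responses is reconstructed for illustration.

```python
from collections import Counter

# Simulated (made-up) responses matching the counts in the example above.
responses = ["0 (Never)"] * 9 + ["1-2"] * 4 + ["3-5"] * 4 + ["6 or more"] * 3

frequencies = Counter(responses)
mode_level, mode_count = frequencies.most_common(1)[0]

# Report the response itself (the level), not how often it occurred.
print(f"Mode = {mode_level} (chosen by {mode_count} of {len(responses)})")
```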
In common conversation, you would ask how much spread there is in the responses. Recall that with
continuous data, there were 3 main ways to describe dispersion: range, IQR, or SD.
With categorical responses, the frequency distribution helps to demonstrate the spread
of data (histograms are a common way to display categorical responses). But is there
variance associated with categorical responses?
Yes... but... you don’t need to worry about reporting the variance with categorical
responses. The variance here is a function of the proportions. That is, if you know the
proportion, you can get the variance from that value. So unlike continuous variables,
where you need to report both a mean and SD because you can have the same mean
but different SDs or the same SD with different means (statistical note: mean and SD
are independent!), with categorical responses you only need to report the n and proportion
of each response.
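To see why the variance adds no new information here, consider one category at a time. Assume (for illustration) that each response is coded X = 1 if it falls in that category and X = 0 otherwise, and let p be the proportion in the category. Under that coding, the variance depends only on p:

```latex
\operatorname{Var}(X) = p\,(1 - p),
\qquad \text{e.g. } p = 0.45 \;\Rightarrow\; \operatorname{Var}(X) = 0.45 \times 0.55 = 0.2475
```

So once the proportion for "0 (Never)" is reported as 0.45, its variance is already determined.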
So what do you report? In general:
â€¢ N (%)
â€¢ Cumulative % if the responses are ordinal
â€¢ Mode if that is informative
â€¢ Sometimes it is useful to report or show graphically all the categories with the
frequencies of each category
Steps in JMP:
2. Analyze > Distribution (this means go to the Analyze Menu and select
Distribution)
3. Select your variable, then click on Y, Columns
4. Press OK
For this example:
1. Open PUH 250 Module 3 Example.jmp
2. Analyze > Distribution
3. Select Number of Times High Blood Pressure
4. Press OK
Example Interpretations:
blood pressures. The most common response was that they had never been told they had
high blood pressure, 9 (45.0%), with the same number of respondents saying
they had been told 1-2 times or 3-5 times, both 4 (20.0%), and only 3 (15.0%)
reporting they had been told 6 or more times.
• You could also reference a table or figure and use fewer words: Twenty (20)
pressures. The most common response was that they had never been told they had high
blood pressure, 9 (45.0%), with 11 (55.0%) reporting 1 or more times (Table X or
Figure X).
The second interpretation here collapsed the categories for simpler reporting.
That is, it took all the categories where someone responded 1-2, 3-5, or 6 or more
and combined them (added them up): 4+4+3 = 11, then divided by the total of 20 → 11/20 =
0.55 → 0.55*100 = 55.0%.
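The collapsing step described above can be sketched in Python. This is an illustration of the arithmetic only, not of JMP's Recode feature; the `recode` mapping and the collapsed labels "Never" and "1 or more" are names chosen for this example.

```python
counts = {"0 (Never)": 9, "1-2": 4, "3-5": 4, "6 or more": 3}

# Hypothetical recode map: every original category maps to a collapsed one.
recode = {
    "0 (Never)": "Never",
    "1-2": "1 or more",
    "3-5": "1 or more",
    "6 or more": "1 or more",
}

collapsed = {}
for level, count in counts.items():
    new_level = recode[level]
    collapsed[new_level] = collapsed.get(new_level, 0) + count

total = sum(collapsed.values())
for level, count in collapsed.items():
    print(f"{level}: {count} ({count / total * 100:.1f}%)")
```

Collapsing never changes the total N; it only moves counts into broader categories.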
• You can do this in 2 ways in JMP:
  o Recode option: easy if you have only a small number of original categories
  o Create a formula: if you have a lot of categories or if you want to go from a
    continuous variable to categories
  o Both will be demonstrated in the practice JMP videos
Continuous vs Categorical
If you can collect a variable as a continuous number, then do so! You can always
create categories later. Sometimes it is not possible to collect a number, even when
something is truly quantitative in nature. For example:
• It would be possible to collect the age someone was when they first had an
alcoholic beverage, but depending upon when you asked the question, people
may not remember exactly. In this case you would likely give them response
options like Fit Y by X
3. Put your Y (Helmet Use) in the Y, Response box
4. Put your X (Sex) in the X, Factor box
5. Press OK
Aside: JMP will tell you what analysis will be run if you look at the data types of
the X and Y in the Fit Y by X platform window.
Y: Helmet Use is a Nominal variable (red histogram)
X: Sex is a Nominal variable (red histogram)
Let’s look at the JMP Output you get:
The first thing you see is the Mosaic Plot. This
plot shows you the proportion of Females (Left)
where Helmet Use = Yes (Blue, Top) and = No
(Red, Bottom). It also shows the same for Males
(Right).
The far right, thin bar shows the proportion of
Helmet Use overall (without regard to Sex). Here
you can see that the observed proportion of Yes
for Females is a bit higher (more Blue) than for
Males. You can also see, however, that the split
of Yes/No is fairly close to the overall split of
Yes/No in the far right bar.
Note: The Mosaic Plot is only useful for YOU. This is not a plot you will turn in for an
assignment or include in a paper or your data brief. There are other ways to display this
information.
Next in JMP Output: Contingency Table
• The predictor (X) is in the rows (side)
• The outcome (Y) is in the columns (top)
• The numbers at the end of each row and the
bottom of each column show the number that have
that response, while the bottom right corner shows
the total number of observations (30):
o Row Totals: 15 females, 15 males
o Column Totals: 17 report not wearing a helmet,
13 report wearing a helmet
• The top number in each cell is the number of cases
with that combination of responses:
o 8 females report not wearing a helmet
o 7 females report wearing a helmet
o 9 males report not wearing a helmet
o 6 males report wearing a helmet
• There are 3 % values given for each combination of responses
o Total %: this is the number / the total:
▪ 8 / 30 × 100 = 26.67, or 26.7% of all respondents are female AND do not wear a helmet
o Col %: this is the number / the total in that column:
▪ 8 / 17 × 100 = 47.06, or 47.1% of those that did not wear a helmet are female
o Row %: this is the number / the total in that row:
▪ 8 / 15 × 100 = 53.33, or 53.3% of females did not wear a helmet
▪ This is the most natural way to talk about contingency tables. You will
generally use Row % to report and compare values.
These are also called conditional probabilities. You are conditioning on one
variable and reporting the % of the other. Of Females… this % of the outcome…
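To check the three percentages by hand, the arithmetic can be sketched in Python. The cell counts are the ones from the helmet table above; the dictionary layout and the helper name `pct` are assumptions for this illustration and have nothing to do with how JMP stores the table.

```python
# Cell counts from the helmet example.
# Rows = Sex (predictor), columns = Helmet Use (outcome).
table = {
    "Female": {"No": 8, "Yes": 7},
    "Male": {"No": 9, "Yes": 6},
}

# Row totals: number of respondents of each sex.
row_totals = {sex: sum(cells.values()) for sex, cells in table.items()}

# Column totals: number of respondents giving each helmet answer.
col_totals = {}
for cells in table.values():
    for answer, n in cells.items():
        col_totals[answer] = col_totals.get(answer, 0) + n

grand_total = sum(row_totals.values())


def pct(n, total):
    """Percentage of n out of total, rounded to 1 decimal place."""
    return round(100 * n / total, 1)


# The three percentages for the Female / No cell (8 respondents).
n = table["Female"]["No"]
total_pct = pct(n, grand_total)          # of all respondents
col_pct = pct(n, col_totals["No"])       # of those not wearing a helmet
row_pct = pct(n, row_totals["Female"])   # of females

print(total_pct, col_pct, row_pct)
```

The only thing that changes between the three percentages is the denominator, which is why it matters to say which group you are conditioning on.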
Interpreting contingency table analysis
1. For a 2×2 table, select the outcome of interest to report. Usually this is the presence
of some condition, attitude, behavior, event, etc. Occasionally we want to report the “No”
response (the lack or absence of something), but this is less common. For a
larger table, you can report on the most interesting or unexpected outcome based
upon the patterns you observe.
2. Reading down the column for the outcome of interest (in this example, helmet use =
Yes), find the row percentages.
3. Assess the row percentages for each level of the predictor:
• Are the percentages the same, regardless of the category of the predictor? If so,
there is no relationship between the predictor and the outcome in this sample
• Are the percentages similar, regardless of the category of the predictor? If so,
there may be limited evidence of a relationship or a weak relationship between
the predictor and the outcome in this sample
• Are the percentages different (by more than about 10 percentage points)? If so,
there may be evidence of a strong relationship between the predictor and the
outcome in this sample
In this case, we will select the Yes, wears a helmet response. Reading down the Yes
column for the row percentages for each gender, we observe that:
• 46.7% of females report wearing a helmet
• 40.0% of males report wearing a helmet
• We have limited evidence of a relationship between gender and wearing a
helmet when riding a ZYP bike in this sample from Birmingham, AL.
• There will be a statistical test for this to formally decide between our H0 and
HA – more on this later in the course
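The rule of thumb in step 3 can be written down as a small Python sketch. The function name, the wording of the returned labels, and the treatment of "about 10" as a cutoff of exactly 10 percentage points are assumptions for illustration; this is a descriptive guide only, not the statistical test mentioned above.

```python
def describe_difference(pct_a, pct_b, threshold=10.0):
    """Apply the rule of thumb from step 3 to two row percentages.

    threshold is the 'about 10' percentage-point cutoff from the text.
    It is a rough descriptive guide, not a statistical test.
    """
    difference = abs(pct_a - pct_b)
    if difference == 0:
        return "no relationship in this sample"
    if difference <= threshold:
        return "limited evidence of a relationship (or a weak relationship)"
    return "possible evidence of a strong relationship"


# Row percentages for Helmet Use = Yes, from the table above.
female_yes = round(100 * 7 / 15, 1)  # 7 of 15 females
male_yes = round(100 * 6 / 15, 1)    # 6 of 15 males

verdict = describe_difference(female_yes, male_yes)
print(female_yes, male_yes, verdict)
```

Here the two row percentages differ by fewer than 10 percentage points, which matches the "limited evidence" wording used in the interpretation above.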
Comparisons of Group Means: Categorical predictor and continuous outcome
Using this same Fit Y by X approach, we can compare the mean value on an outcome
variable separately for two (or more) groups of a predictor variable.
Appropriate ways of posing research questions for a comparison of group means:
• Do females report a greater average number of times riding a ZYP bike?
• Is there a gender difference in average number of times riding a ZYP bike?
• Does the average number of times riding a ZYP bike vary by gender?
Put in terms of the hypotheses:
• H0: There is no difference in mean number of rides by gender
• HA: There is a difference in mean number of rides by gender
You could also write it this way:
• H0: The mean number of times people ride a bike is the same for each sex
• HA: The mean number of times people ride a bike is not the same for each sex
Steps in JMP:
1. Make sure your data is recoded!
2. Analyze > Fit Y by X
3. Put your Y (Zyp times) in the Y, Response box
4. Put your X (Sex) in the X, Factor box
5. Press OK
Y: Zyp times is a continuous variable (blue triangle)
X: Sex is a Nominal variable (red histogram)
Let’s look at the JMP Output: Bivariate Plot of Zyp Times by Sex
There are two groups on the X
axis (bottom horizontal line), and
the observed Zyp times on the Y
axis (vertical left line). You see a
dot for each Zyp time reported but
be careful, this isn’t all the people
that responded. There may be
multiple people with the same
response. There are only 14 dots
here and we know there were 30
responses.
The grey horizontal line (just
below 5) shows the overall mean
of Zyp times. You would already know this by doing Analyze > Distribution on Zyptimes
overall when you started examining your data with univariate analysis (which you
ALWAYS do before you start bivariate analysis). So how do you get the means by each
sex?
Click on the red triangle next to Oneway Analysis of… and select “Means and Std Dev”
Important things to notice about the comparison of group means:
• The predictor is on the X axis
• The outcome is on the Y axis
• The plot shows the overall mean
(grey bar)
• And means for each group (small
middle blue dash, hard to see)
• The bottom panel shows the mean
and SD for each group:
o The mean (SD) for females = 3.7
(3.9)
o The mean (SD) for males = 4.5
(4.5)
• JMP gives you way too many decimal
places. Usually 1 is sufficient; it is
up to YOU to edit this for your data.
Interpreting the comparison of group means
• The mean number of times riding a ZYP bike for females is 3.7 (SD = 3.9), while the
mean number for males is 4.5 (SD = 4.5) in this sample from Birmingham, AL.
• For now, we will say it is observed that the males report riding about 1 time more
per month compared to females (4.5 vs 3.7).
• Again, there is a statistical test to help us pick between H0 and HA, more on that
later in the course.
What about medians, range, or IQR?
You can also get that from the red
triangle dropdown by selecting
Quantiles:
• Now you have a Box Plot (see the
boxes?) by Sex
• And you get the Quantiles, similar
to what you saw in the Univariate
analysis from Analyze > Distribution
option
• There will be a statistical test for
comparing medians, too.
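The group summaries described above (mean, SD, and median for each level of the predictor) can be sketched in Python with the standard `statistics` module. The ride counts below are made up for illustration and are NOT the course data, so the results will not match the 3.7 (3.9) and 4.5 (4.5) shown in the JMP output. `stdev` is the sample standard deviation (it divides by n - 1).

```python
from statistics import mean, median, stdev

# Hypothetical (made-up) ride counts per person, NOT the course data set.
rides_by_sex = {
    "Female": [1, 2, 3, 6],
    "Male": [2, 4, 6, 8],
}


def summarize(values):
    """Return n, mean, sample SD, and median, with 1 decimal for mean and SD."""
    return {
        "n": len(values),
        "mean": round(mean(values), 1),
        "sd": round(stdev(values), 1),  # sample SD (divides by n - 1)
        "median": median(values),
    }


summary = {sex: summarize(values) for sex, values in rides_by_sex.items()}

# Report in the "mean (SD)" style used in the interpretation above.
for sex, stats in summary.items():
    print(f"{sex}: n = {stats['n']}, mean (SD) = {stats['mean']} ({stats['sd']}), "
          f"median = {stats['median']}")
```

Rounding happens only at the reporting step, which is the same habit recommended above for trimming JMP's extra decimal places.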
Correlation Analysis: Continuous predictor and continuous outcome
Using this same Fit Y by X approach, we can assess the relationship between a
predictor and an outcome variable when both are continuous variables. You may
remember correlation from high school math class, when you graphed coordinates and
looked at the pattern they formed. You may remember it as “Rise over Run” or Y = mx +
b.
In statistics, we use this concept with scatterplots and the correlation coefficient to
describe the relationship between two continuous variables. For correlation, there are
also multiple ways to pose the question but they are all asking the same thing: is there a
relationship between two variables?
Appropriate ways of posing research questions for correlation analysis:
• Is there an association between age of rider and the number of times riding a ZYP
bike?
• Is there a relationship between the number of times riding a ZYP bike and the
age of the rider?
• Are age of rider and number of times riding a ZYP bike correlated?
• Note: in statistics, correlation is a very specific term for two continuous variables.
For other relationships, say association or relationship.
Put in terms of the hypotheses:
• H0: There is no relationship between age and number of rides
• HA: There is a relationship between age and number of rides
Steps in JMP:
1. Make sure your data is recoded!
2. Analyze > Fit Y by X
3. Put your Y (Zyp times) in the Y, Response box
4. Put your X (Age) in the X, Factor box.
• Why is age the predictor? Because age in this case is fixed (independent)
and we are trying to predict frequency of rides.
5. Press OK
Y: Zyp times is a continuous variable (blue triangle)
X: age is a continuous variable (blue triangle)
Look! Bivariate Analysis!
Let’s look at the JMP Output: Bivariate Fit of ZYP time by Age
• Here, ZYP Time is on the Y axis
• Age is on the X axis
• There is a dot for each unique pair of
responses (for example, age = 19,
zyptimes = 1 is the lower left dot. Hover
over the point in JMP and it will tell you
this is row 1)
• If there are multiple pairs with the same
set of responses, there will be only 1 dot
Getting correlation: Again, the red drop down
triangle:
• Ask for a 0.95 Density Ellipse
• You’ll then need to click on the grey triangle next
to Bivariate Normal Ellipse to open up the results
box.
Important things to notice about the correlation analysis:
• The predictor is on the X axis
• The outcome is on the Y axis
• The scatterplot shows the:
o Strength: is it a tight or spread
out cluster of points?
o Direction: is it positive
(increases left to right) or
negative (decreases left to
right)?
• The bottom panel gives the precise
value and direction of the
correlation
o Here it is negative, -0.37
o You also get the mean (SD) of
each variable
• For now, ignore the p-value, this is
the statistical test, and (you guessed it) more on that later in the course
Interpreting correlation
The correlation coefficient (written as “r”, also called Pearson’s correlation or Pearson’s
r) ranges from a negative 1 to a positive 1 or -1.0 to +1.0 (with 0 in the middle)
• Values nearer to -1.0 are a stronger negative correlation (inverse relationship, as
X increases, Y decreases)
• Values nearer to 0 indicate no association or a weaker relationship
• Values nearer to +1.0 are a stronger positive correlation (as X increases, Y
increases)
• What about values in the middle?
o Between 0 and 0.30 is considered weak
o Between 0.31 and 0.50 is moderate
o Over 0.50 is a strong correlation
o This is true for both positive and negative correlations, just make sure to
say if it is a positive or negative (inverse) relationship
In this case, we would say:
• There is a moderate, negative correlation between age and number of times
riding a ZYP bike (r = -0.37) in this sample from Birmingham, AL.
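As a rough illustration of what sits behind the number JMP reports, the sketch below computes Pearson's r from paired values and applies the weak / moderate / strong labels listed above. The function names are assumptions for this sketch, the example pairs are made up (not the ZYP data), and values between the stated bands (for example 0.305) are simply assigned to the next band up.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def describe_r(r):
    """Label strength and direction using the cutoffs listed above."""
    size = abs(r)
    if size == 0:
        return "no correlation"
    if size <= 0.30:
        strength = "weak"
    elif size <= 0.50:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


# The value reported in the JMP output above.
print(describe_r(-0.37))

# Made-up pairs where Y is exactly 2 * X, so the points fall on a straight line.
perfect = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(perfect, describe_r(perfect))
```

The label depends only on the absolute value of r; the sign is reported separately as the direction, which is the same two-part wording used in the interpretation sentence above.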
Interpretation guidelines for all bivariate analyses
• State the question
o Include how each variable is measured (including the question wording and the
variable type or levels)
• What is the N in the sample and what is the N on both variables (must have both to
be in a bivariate analysis)
• Univariate summary measures: n(%) for each level or mean (SD) / median (IQR or
Range)
• The bivariate summary (as applicable from above)
• The interpretation sentence, based on the bivariate analysis conducted.
• Include the sample information and source of the data
• In all of the above, make sure to use real words and terms NOT the variable names
Chapter 4: Wrap up Questions
• What is a research question?
• What are the null (H0) and alternative hypotheses (HA)?
• What are the 3 types of bivariate analyses?
• What are the two variable types needed for each bivariate analysis?
• What are the summaries reported for each type of analysis?
Chapter 4: Problems
Using the same Sample Data from Chapter 2: open Diabetes.jmp. This is a sample
dataset of 442 participants with diabetes.
Use Fit Y by X to answer the following (make sure that Gender is Recoded, 1 = Male, 2
= Female):
1. For Y = Y Binary and X = Gender:
• What is the question you are asking?
• What is the name of this analysis?
• What is the overall N(%) with High Glucose?
• What is the N(%) of Males with High Glucose? Of Females?
• Write the interpretation according to the guidelines for this type of analysis
2. For Y = Glucose and X = Gender:
• What is the question you are asking?
• What is the name of this analysis?
• What is the overall mean (SD) of Glucose? (You need to use Analyze >
Distribution for this, like you did in Chapter 2)
• What is the mean (SD) of Glucose for Males? For Females?
• Write the interpretation according to the guidelines for this type of analysis
3. For Y = Glucose and X = Age:
• What is the question you are asking?
• What is the name of this analysis?
• What is the overall mean (SD) of age and Glucose? You can get these from
Analyze > Distribution but they are also shown in the JMP output for this
analysis.
• What is the correlation?
• Write the interpretation according to the guidelines for this type of analysis
Chapter 5.
Collecting the Evidence
Module Learning Objectives (Course Learning Objectives): At the end of this
module students should be able to:
• MO 5.1: Define population and a population sample (CLO 1, 2)
• MO 5.2: Identify sampling techniques (CLO 1, 2)
• MO 5.3: Differentiate between observational and experimental study designs
(CLO 1, 2)
• MO 5.4: Define causality (CLO 1, 2, 3)
• MO 5.5: Identify the criteria for concluding a causal relationship (CLO 1, 2, 3, 4)
Definitions for this module:
• Population
• Sample
• Representative
• Simple random sample
• Non-random or convenience sample
• Bias
• Generalizability
• Inference
• Unit of analysis
• Case
• Control
• Study designs:
o Observation
o Experiment
• Causation:
o Temporal precedence
o Correlation
o No plausible alternative
In the last few modules, we have looked at descriptive summary statistics (univariate
statistics) and using two variables to determine if they are related (bivariate statistics).
We briefly talked about identifying the population of reference so that your results have
context. That is, you wouldn’t want to report results from a study in only adult women
and try to apply those results to adolescent males. So it is important to understand
where you got your data sample, from which population, so you can make statements
Population
A population is defined as “the whole number of people or inhabitants in a country or
region or the total of individuals occupying an area or making up a whole” according to
Merriam-Webster Dictionary. This means that you can define a population very
broadly (United States) or very narrowly (student athletes at UAB). But in either case,
population refers to the entire group of things meeting your definition. Very often, we
won’t actually know the true number in our populations, e.g. people with arthritis. Why?
Because some people won’t ever get diagnosed. That is why you often hear about a
population in research being defined as “people DIAGNOSED with” a condition.
Statistically, a population is a well-defined group that is of interest for a research
question. Here also, it may not be possible to actually collect information on all people
40-65 years of age with arthritis. So we turn to samples from the population.
Sampling
Although our goal in public health is to describe and hopefully improve the health of
populations, we usually only study samples. It is often too expensive, time-consuming,
and logistically challenging to study a whole population, and luckily the power of
inferential statistics means we don’t have to (more on this in the second half of the
semester!). Inference or inferential statistics means we can take the results from our
sample and infer (or conclude) they apply to the population, too.
The key is selecting a sample that is representative (meaning it shares the same
characteristics) of the population from which it was drawn. The best way to do this is
through simple random sampling – putting the names of every single person in the
population in a hat, shaking it up, and then randomly selecting a subset of names. But
as you can imagine, simple random sampling is often not so simple: Where do we get
all the names? How do we find all the people whose names we drew and convince them
to participate in our study? And just how big is this hat???
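The "names in a hat" idea can be sketched in Python with the standard `random` module. The population list, the sample size, and the seed value below are all made up for illustration; the key property is that every member of the population has the same chance of being drawn, and nobody is drawn twice.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k members without replacement, each equally likely to be chosen.

    seed is optional and only makes the draw repeatable for demonstration.
    """
    rng = random.Random(seed)
    return rng.sample(population, k)


# Hypothetical population: a complete list of 200 student IDs (the "hat").
population = [f"student_{i:03d}" for i in range(1, 201)]

# Draw 20 names from the hat. The seed value is arbitrary.
sample = simple_random_sample(population, k=20, seed=250)
print(sample)
```

Notice that the sketch starts from a complete list of the population. That list is exactly the part that is hard to get in practice, which is the point of the questions above.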
Though simple random sampling is the ideal procedure, it is rarely followed. More often
we draw a non-random or convenience sample, trying as much as possible to avoid
drawing it in such a way that it is also decidedly non-representative. For example, we
want to avoid sampling only one type of person, or sampling only at a specific time or
location. Doing so is likely to introduce bias, meaning that our sample will represent
some portions of the population but not others. This can make certain characteristics,
behaviors, outcomes, etc. appear more or less common than they actually are, limiting
the generalizability (our ability to accurately describe a population using a sample) of
our results. Ideally we want to include lots of different types of people in our sample, to
account for all the different types of people that make up the population.
Although we tend to think in terms of individuals making up our samples and the overall
US population as “the” population, samples and populations can be made up of
anything. Imagine we’re conducting a study of air pollution in US cities. We could collect
air pollution data in a sample of cities, with the goal of generalizing to all US cities. The
unit of analysis here is a city – that’s who (or what) we’re collecting data from and who
the data describe. Each city is a case, or individual unit in our study.
Study Design
One of the most interesting aspects of public health research is the variety – there are
countless ways to collect public health data. All of these data collection techniques can
generally be sorted into two key study designs, based on the role of the researcher. In
observational designs, the researcher directly observes data; for example, by asking
questions using a survey (like the YRBS) or by taking measurements of contaminants in
the air (like in the air pollution study described above).
In experimental designs, by contrast, the researcher is more involved. An experimental
study starts with the researcher randomly sorting study participants into two groups: the
experimental group and the control group. The researcher then provides some
treatment or stimulus to the experimental group – gives them nutritional counseling,
provides them with a medication – whatever the researcher is attempting to determine
the effect of. The control group does not receive this same treatment. The researcher
then examines the outcome of interest in both groups to see if being in the experimental
group has led to a different value on the outcome.
Experimental designs can be hard to implement, and sometimes they aren’t possible due to
ethical concerns (you certainly couldn’t assign one group to smoke and one group not to
smoke to determine whether smoking causes cancer!). But experimental designs are
an incredibly powerful tool for establishing cause and effect because study
participants are randomized into either the experimental or control group.
Randomization means that the two groups will be very similar in their characteristics
with only their assignment to group (experimental or control) differing, so the effect of
whatever treatment they receive in the experiment will be clear. This is why
experimental designs are considered the “gold standard” in the research world.
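Random assignment to the two groups can be sketched in Python as a shuffle followed by a split. The participant IDs, the seed, and the function name are made up for illustration; real trials often use more elaborate randomization schemes, so treat this as the simplest possible version of the idea.

```python
import random


def randomize(participants, seed=None):
    """Randomly split participants into experimental and control groups.

    seed is optional and only makes the assignment repeatable for demonstration.
    """
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the original list is unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"experimental": shuffled[:half], "control": shuffled[half:]}


# Hypothetical study participants.
participants = [f"participant_{i:02d}" for i in range(1, 21)]

groups = randomize(participants, seed=5)
print(groups["experimental"])
print(groups["control"])
```

Because chance alone decides who lands in which group, neither the researcher nor the participant can steer people with particular characteristics into the treatment group.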
Causality
As noted above, experimental designs are generally necessary to establish causality. Causality
refers to situations in which values on a predictor variable are understood to directly
cause values on the outcome variable (like in the smoking -> cancer example we looked
at earlier in the semester). We often speak casually about causality, but in fact it is very
challenging to establish evidence of a causal relationship. To do so, three key criteria
must be met (see additional information on Hill’s Criteria in online Module):
1. Temporal precedence. This sounds fancy, but it really just means that the cause
has to come before the effect in time. This can be difficult to establish, especially in
observational designs that just collect data at one time point. How do we determine
that X causes Y, rather than Y causing X? If we observe both variables at the same
time, we struggle to establish which occurred first. This is one reason experimental
designs are so powerful – we can be certain that the treatment (X) occurs before the
outcome (Y) we’re interested in!
2. Correlation/Association. This is exactly like it sounds and refers to whether there is
an association or relationship between the two variables we’re interested in.
Statistically, we have multiple ways to establish this, but the two most common are:
• Pearson’s correlation coefficient (written as “r”) – this is a measure of
association for two continuous variables. That is, what happens to Y as X
increases? It can be negative (indicating an inverse relationship, Y decreases as
X increases) or positive (Y increases as X increases). It ranges from -1.0 to +1.0,
with 0 meaning no association. Closer to -1.0 or +1.0 indicates a stronger
association. From the SBP Weight data, see the relationship below where as
weight (X) increases, SBP (Y) also increases, with an r = 0.33, indicating a
moderate positive relationship.
[Figure: scatterplot of Systolic Blood Pressure (mmHg) on the Y axis against Weight (Pounds) on the X axis]
• Contingency tables – this technique allows us to look at the association
between two categorical variables. We arrange the responses on the predictor
in the rows and responses on the outcome in the columns, and calculate the
percentage of cases in each row experiencing the outcome of interest. Recall
this contingency table from our discussion of how researchers established the
evidence that smoking is associated with lung cancer.
3. No plausible alternative explanation. This third criterion can be challenging to
establish. It requires that we demonstrate that the apparent association between
the two variables we’re interested in cannot be explained away by a third
variable. The video on How Ice Cream Kills! presents some great examples of
situations in which a third variable explains an apparent association, but this
tends to be much more complicated in public health research when we’re dealing
with real people who live wonderfully complicated lives.
Again, you can see why experimental designs are so powerful – if we just manipulate
one thing, the treatment, it’s much easier to rule out the effect of a third variable. Not so
in observational designs, where we measure lots of variables but struggle to know how
they fit together and in what order.
• Observational designs are good for determining correlations between variables, but
not good at establishing temporal precedence or excluding alternative explanations.
• Experimental designs, when they’re done well, can address all three criteria for
causality.
It is critical to know that:
• Observational designs can establish correlation and, only in very special
circumstances (very rare), establish causality.
• Experimental designs can establish both correlation and causality, when properly
designed.
Correlation does not equal Causation!
Correlation ≠ Causation!
Chapter 5: Wrap up Questions
• What is a population?
• How is a sample different from a population?
• How do you obtain a representative sample?
• What is the difference between an observational and experimental study
design?
• When can you draw a conclusion of correlation from a study?
Chapter 5: Problems
biostatistics courses, and you obtain your samples in the following ways, what
type of samples are they? Convenience or simple random sample?
• Asking all students in your class to fill out a survey?
• Asking the teacher to generate a list of students such that each student has
the same chance of being asked to complete the survey?
2. What population would these samples represent? List out the likely
characteristics of students in undergraduate biostatistics courses.
3. Could a single survey tell you about correlation? Or causation?
4. What if you asked students at the start of class and after the class the same set
of questions, could this draw causation? Why or why not?
5. What if you divided up students into two different types of study groups and
followed their course performance by group? Could this draw causation? Why or
why not?
Chapter 6.
Review Chapter 1-5
The module objectives from Chapters 1-5 outline what you should know or do thus far:
• MO 1.1: List several of the ways that biostatistics provide the evidence
necessary for public health research and practice
• MO 1.2: Justify the importance of a particular public health problem on the basis of
an evaluation of the evidence
• MO 2.1: Identify the characteristics of a continuous variable
• MO 2.2: Define measures of central tendency: mean, median, mode
• MO 2.3: Define measures of spread (dispersion): variance, standard deviation,
range, inter-quartile range
• MO 2.4: Calculate measures of central tendency: mean, median, mode
• MO 2.5: Calculate measures of spread (dispersion): variance, standard
deviation, range, inter-quartile range
• MO 2.6: Interpret measures of central tendency: mean, median, mode
• MO 2.7: Interpret measures of spread (dispersion): variance, standard deviation,
range, inter-quartile range
• MO 3.1: Identify the characteristics of a categorical variable
• MO 3.2: Determine variable type based on examination of the categories
(response options)
• MO 3.3: Order variables from lower to higher level of measurement
• MO 3.4: Create variables with mutually exclusive and exhaustive categories
(response options)
• MO 3.5: Define proportion and percentage
• MO 3.6: Calculate proportions and percentages
• MO 3.7: Interpret proportions and percentages
• MO 3.8: Distinguish between categorical and continuous variables
• MO 4.1: Define research questions and hypotheses
• MO 4.2: Define predictor and outcome variables
• MO 4.3: Distinguish between independent (predictor) and dependent (outcome)
variables in a research scenario
• MO 4.4: Define the null and alternative hypotheses
• MO 4.5: Differentiate between the null and alternative hypotheses
• MO 4.6: Determine the appropriate summary statistics for a bivariate analysis
• MO 5.1: Define population and a population sample
• MO 5.2: Identify sampling techniques
• MO 5.3: Differentiate between observational and experimental study designs
• MO 5.4: Define causality
• MO 5.5: Identify the criteria for concluding a causal relationship
Below is a listing of what has been covered broken out into:
• Knowledge: definitions and concepts
• Skills: what you do in JMP and what to answer and report
You may want to make notes of your own below and note which module contains the
materials so that you can find them during the exam (hint, use the listing of Module
Objectives).
Definitions and concepts
• Evidence: Identify how biostatistics provides evidence for public health
• Creating variables: including the question and response options
• Variable types: distinguishing dichotomous, nominal, ordinal, and continuous
variables:
o Mutually exclusive and exhaustive categories of variables
• Creating and interpreting frequency distributions: what must be included in
the interpretation
• Measures of central tendency: calculation and interpretation of mode, median,
and mean; for which variable type each is appropriate
• Measures of dispersion: calculation and interpretation of range and interquartile
range; interpretation of standard deviation
• Describing distributions: know when to use mean & standard deviation and
when to report a five-number summary
• Shape of distributions: modality, symmetry, skewed right and left distributions
• Outliers: recognizing if and when they should be excluded from analysis
(remember, sometimes they are errors)
• Samples, populations: be able to identify from a research description
o Understand the difference between sample and population
• Independent (predictor) and dependent (outcome) variables: be able to define and
identify from a research description
• Writing research questions based on independent and dependent
variables: make sure they end in a question mark and take the form of:
o How does [Independent variable] influence [Dependent variable]? or
o Are [one category of the IV] or [other category of the IV] more likely to
[DV]?
Preparing and Examining Data (with JMP)
• Recoding to give variables meaningful levels for categorical variables
• Recoding to combine categories/values
• Using Analyze > Distribution to examine categorical and continuous data
o For continuous: what is the shape of the distribution?
o For categorical: what is the outcome level of interest?
• Using the By command in Distribution to analyze separate groups/samples
• Using Analyze > Fit Y by X to describe an outcome by a predictor
For each Fit Y by X, be able to answer:
• Which variable is Y (dependent / outcome)?
• Which variable is X (independent / predictor)?
• Which null hypothesis would be appropriate, given the variables you have?
• Which alternative hypothesis would be appropriate, given the variables that you
have?
• Are you looking for:
o Contingency Table
o Group Means
o Correlation
• What are the appropriate summary statistics for your outcome?
• What can you conclude about the relationship between these variables, based on
the observed summary statistics?
• Make sure to state your question in words (not variable names)
• Include relevant summary statistics, including the N (number in your sample)
• Give the source of your data
Answering questions with data follows the general pattern:
• Begin by writing down what you understand
• Outline what the data says and form clear and succinct questions pertaining
to what the data may imply (or what you would like to show)
• Form a scientific question to determine if the results are random
• Compare the data from each side of the question and decide what to believe
• Write down what you found and what it means
Chapter 6: Problems
The worksheet for Chapter 6 is found in Module 6.
Chapter 7.
Midterm Exam
There is no new material in this chapter. You will focus on taking the Midterm Exam. As
you do that, consider the module objectives that have been addressed through Module
5 (reference materials in Chapter 6).
Read each objective and make sure you understand them and/or how to address them
using JMP.
When approaching an analysis, you need to:
• Define the questions that you want to address
• Review the variables and variable types
• Clean or make sure the data is clean and ready for use
• Summarize each variable as appropriate for the type of variable and distribution
• Create a table or descriptive figures for summary, as applicable (i.e. Table 1)
• Make a statistical decision from the analysis output and record the applicable results
• Write summary conclusions in general language, including the statistical results but
not focusing on statistical language
Chapter 8.
Midterm Exam Reflection and Revision
For this chapter, you will review key concepts on variable types, sampling strategies,
and how to summarize data and analyze bivariate data. After the review, you will have
the opportunity to resubmit your Midterm Exam.
Review what you missed on the midterm exam, determine which objective the question
is associated with and then return to that chapter and module to help determine the
Chapter 9.
Determining Significance
Module Learning Objectives (Course Learning Objectives): At the end of this
module students should be able to:
• MO 4.4 (review): Define the null and alternative hypotheses (CLO 1)
• MO 4.5 (review): Differentiate between the null and alternative hypotheses
(CLO 1, 2)
• MO 9.1: Define the 4 steps of hypothesis testing (CLO 1, 2, 3, 5)
• MO 9.2: Calculate the p-value using JMP to answer your statistical question for
a Contingency table (Chi-square test), Group Means (t-test), and
Correlation (t-test) analysis (CLO 2, 5)
• MO 9.3: Define statistical inference (CLO 1, 3)
• MO 9.4: Estimate the Confidence Interval with JMP (CLO 5)
• MO 9.5: Interpret the Confidence Interval (CLO 3)
Definitions
• Hypothesis
• Null Hypothesis
• Alternative Hypothesis
• Alpha
• P-value
• Statistical Significance
• Confidence Interval
• Upper bound
• Lower bound
• Inference
So far, we’ve looked at univariate and bivariate statistical approaches to describe the
distributions of variables in a particular sample. This is a critical first step in any
statistical analysis, but our ultimate goal is usually to describe the population from
which that sample was drawn. The dataset holds only a sample, yet we want to
describe the whole population.
Using data from a single sample to describe a population, known as statistical
inference, requires that we take into account the fact that some samples are more
similar to the populations from which they were drawn than others. We have some
control over this: as we said previously, randomly selected samples are more likely to
be representative of the population, as are larger samples. Much of this, however, is
due to random chance: sometimes we draw a representative sample, but sometimes
we do not. This is called sampling variability, and luckily for us it follows known
laws of probability. We won’t get into any probability calculations in this course, but a
basic understanding of the role of probability or random chance in statistical analysis is
essential.
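Sampling variability can be seen directly by drawing many random samples from the same population. The short Python sketch below is only an illustration and is not part of the JMP workflow used in this course. The population (10,000 adults, 60% of whom like coffee), the sample sizes, and the seed are all made-up assumptions.

```python
import random
import statistics


def sample_proportions(population, sample_size, n_samples, rng):
    """Draw n_samples simple random samples; return the proportion of 1s in each."""
    return [
        sum(rng.sample(population, sample_size)) / sample_size
        for _ in range(n_samples)
    ]


rng = random.Random(250)  # fixed seed (arbitrary) so the sketch is repeatable

# Hypothetical population of 10,000 adults: 1 = likes coffee, 0 = does not.
# By construction, exactly 60% of this made-up population likes coffee.
population = [1] * 6000 + [0] * 4000

small = sample_proportions(population, sample_size=25, n_samples=1000, rng=rng)
large = sample_proportions(population, sample_size=400, n_samples=1000, rng=rng)

for label, props in (("n = 25", small), ("n = 400", large)):
    print(
        f"{label}: sample proportions from {min(props):.2f} to {max(props):.2f}, "
        f"standard deviation {statistics.stdev(props):.3f}"
    )
```

Every sample comes from the same population, yet the sample proportions differ from one draw to the next. The proportions from the larger samples cluster more tightly around the true 60%, which is why larger random samples are more likely to be representative.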
Hypothesis Testing (Review from Module 4)
Hypothesis testing is the most commonly used (and most misunderstood!) approach
for determining whether a relationship exists between two variables in the population
(remember, we can directly observe a relationship between variables in the sample
using contingency tables, correlation, and group means, but our goal now is to describe
the population).
In hypothesis testing, we start by assuming that there is NO relationship between two
variables in the population (H0 or null hypothesis). We then examine evidence from the
sample and consider how likely it is that we would observe a relationship between
variables in the sample IF there were truly NO relationship between the variables in the
population.
For example, in adults aged 30-60:
• We assume that there is no relationship between liking coffee and being scared of
clowns. (Yes, this is a silly example.)
• We would collect data from people aged 30-60 (ideally a random sample, not from a
clown college or while in a coffee shop)
• We would then use Fit Y by X to determine if there is a relationship between liking
coffee and being scared of clowns. There are two possible outcomes:
1. If the relationship between two variables in the sample is relatively weak, we
conclude that it is likely just due to sampling variability: we happened to draw
a sample in which there appears to be some relationship, but it’s not very strong
and probably doesn’t reflect a true relationship between those variables in the
population. We stay with H0: we do not have evidence of a relationship.
2. If the relationship between two variables in the sample is strong, we have
reason to believe that it is likely NOT due to sampling variability: we have
evidence of a true relationship between those variables in the population. We
will reject the null hypothesis and conclude there is a relationship.
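The two outcomes above can be sketched with a Pearson chi-square test on a 2x2 contingency table. The Python example below is a minimal illustration, not the course's JMP procedure. Both tables of counts are invented, the function name is hypothetical, and the cutoff of 0.05 for alpha is an assumed conventional value.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table of counts.

    Returns (chi-square statistic, p-value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if there were NO relationship (H0).
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # A 2x2 table has 1 degree of freedom, where P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


ALPHA = 0.05  # assumed cutoff for this sketch

# Invented counts. Rows: likes coffee (yes, no). Columns: scared of clowns (yes, no).
weak_sample = [[26, 74], [24, 76]]    # 26% vs 24% scared of clowns
strong_sample = [[40, 60], [10, 90]]  # 40% vs 10% scared of clowns

chi2_weak, p_weak = chi_square_2x2(weak_sample)
chi2_strong, p_strong = chi_square_2x2(strong_sample)

for label, chi2, p in (("weak", chi2_weak, p_weak), ("strong", chi2_strong, p_strong)):
    decision = "reject H0" if p < ALPHA else "stay with H0"
    print(f"{label} relationship: chi-square = {chi2:.2f}, p = {p:.4g} -> {decision}")
```

In the weak table the two groups differ by only two percentage points, a gap that sampling variability alone could easily produce, so the p-value is large and we stay with H0. In the strong table the gap is thirty percentage points, the p-value falls far below the assumed alpha, and we reject H0.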
This makes intuitive sense: if we see a strong relationship between two variables in the
sample, we get excited and say, “Wow, there’s really something going on here!”