read the book and answer the chapter 1 questions 5-8

Glossary

Complete this glossary with the definition and the chapter / module where you can find more information on the topic.

Note: You can turn in your completed glossary at the end of the course for bonus points

Term: Definition (Chapter)

Alpha:

Alternative Hypothesis:

Associations:

Bias:

Bivariate:

Bivariate analysis:

Case:

Categorical:

Categories (response options):

Causation:

Census:

Changes:

PUH 250 Glossary

Page 1 of 6


Chi-square test:

Collapsed:

Comparisons:

Confidence Interval:

Constant:

Contingency table:

Continuous variable:

Control:

Convenience Sampling:

Correlation:

Correlation coefficient:

Cross tabulation:

Cumulative Percent:

Data: A collection of facts, not necessarily numeric, such as: age, gender, hair color, weight, temperature, etc. (Chapter 1)

Dependent (outcome) variable:


Descriptive statistics:

Dichotomous:

Dispersion:

Evidence:

Exhaustive:

Experiment:

Frequency / Frequency Distribution:

Generalizability:

Hypotheses:

Hypothesis:

Independent (predictor) variable:

Inference:

Inferential Statistics:

Interquartile Range:

Lower bound:


Mean:

Median:

Mode:

Model:

µ (mu):

Mutually exclusive:

N:

No plausible alternative:

Nominal:

Non-random or convenience sample:

Null hypothesis:

Observation:

Ordinal:

p-value:

Parameters:


Pearson's r (or simply r):

Percent:

Population: A well-defined collection of objects, such as: students (Chapter 1)

Proportion:

Quantiles:

Range (Minimum, Maximum):

Representative:

Research question:

Resources:

Row, column, total percent:

Sample size:

Sample:

Sampling:

Simple random sample:

Simple Random Sampling:


Standard Deviation:

Statistical Significance:

Stratified Sampling:

Study designs:

t-test:

Temporal precedence:

Unit of analysis:

Univariate analysis:

Upper bound:

Variable:

Variable type:

Variance:


Biostatistics for

Public Health

Undergraduates

Stacey S. Cofield, PhD

Erika L. Austin, PhD, MPH

About the Authors

Stacey S. Cofield, PhD, and Erika L. Austin are Associate Professors in the Department

of Biostatistics in the School of Public Health (SOPH) at the University of Alabama at

Birmingham (UAB). Drs. Cofield and Austin have been teaching introductory

biostatistics for over 15 years at both the graduate and undergraduate levels. All

contents here © Stacey S. Cofield and Erika L. Austin, 2021.

Stacey S. Cofield, PhD

Associate Professor, Biostatistics

Associate Dean of Recruitment, Retention, & Diversity

UAB Graduate School

Erika L. Austin, PhD, MPH

Associate Professor, Biostatistics

Associate Dean of Student & Academic Affairs

School of Public Health

Table of Contents

Preface: How to Use this Book
Chapter 1: The Role of Biostatistics in Public Health
Chapter 2: Continuous Variables
Chapter 3: Categorical Variables
Chapter 4: Asking the Questions
Chapter 5: Collecting the Evidence
Chapter 6: Review Chapters 1-5
Chapter 7: Testing your Knowledge
Chapter 8: Assessing your Knowledge
Chapter 9: Determining Significance
Chapter 10: Comparing Means
Chapter 11: Comparing Proportions
Chapter 12: Correlation
Chapter 13: Reading Research
Chapter 14: Review

Preface

This textbook has been specifically designed for this course to align with the course

modules. Each chapter contains: an introduction, the objectives for the chapter (that

are associated with the module and course objectives in the course), important

definitions for that chapter, software examples, and practice problems for you to

complete and evaluate your progress in the course. Each chapter will conclude with

important questions you should be able to answer after completing the chapter and

associated module on Canvas.

In addition to the practice problems in these chapters, there will be practice exercises

on Canvas for you to complete and turn in for credit. There are also lecture

notes/videos, software demos and additional bonus exercises available on Canvas.

This textbook alone is not sufficient for complete understanding and comprehension of

the materials presented in this course. This textbook is one of several materials to help

you succeed!

[Figure: course materials that work together: Canvas, Learning Checks, Practice, Textbook, Software]

Chapter 1 The Role of Biostatistics in Public Health

Module Learning Objectives (Course Learning Objectives): At the end of this module students should be able to:
• MO 1.1: List several of the ways that biostatistics provide the evidence necessary for public health research and practice (CLO 2)
• MO 1.2: Justify the importance of a particular public health problem on the basis of an evaluation of the evidence (CLO 1)

Definitions
• Data
• Population
• Census
• Sample
• Variable
• Descriptive Statistics
• Inferential Statistics
• Sampling
• Simple Random Sampling
• Stratified Sampling
• Convenience Sampling
• Parameters
• Model
• Hypotheses
• Statistical Significance
• Evidence
• Associations
• Changes
• Resources
• Comparisons

What are statistics? What is the practice of biostatistics?

These are two different questions! Statistics are just numbers but the practice of

biostatistics involves measuring variability of numbers to interpret results.

Statistics can be used to analyze data after an experiment has been carried out but can

(and should) also be used to make suggestions for how experiments can be designed to

reduce variation and produce better, more accurate, consistent, and predictive results.

Biostatistics for Public Health Undergraduates

5

There are numbers, formulas, and defined scientific processes involved in answering a

statistical question. However, keep in mind that statistics as a mathematical discipline is

a different discipline, called theoretical statistics, that is NOT this class.

We will be approaching statistics as an applied discipline using some basic levels of math. Instead of using theorems, properties, and abstract math, we're going to use case studies and real data to illustrate some fundamental points about using statistics to make sense out of data and answer questions.

We'll begin with some definitions (there will be more on each of these topics in future chapters):
• Data: A collection of facts, not necessarily numeric, such as: age, gender, hair color, weight, temperature, etc.
• Population: A well-defined collection of objects, such as: students (at UAB, in engineering), paint colors (from 1 company, from multiple companies), etc.
• Census vs Sample:
  o If you collect information on all of the objects in a population, that is a census.
  o If you collect information on some of the population, that is a sample.
  o Rarely do you have a true census; what you try to do is collect a sample that is representative of the population about which you want to make inferences.
  o Note: The US Census is actually a sample, not a true census.
• Variable: A measurement on an object that can change from one object to another. Usually denoted with lower case letters: x, y, z
  o Numerical variables: age, height, time. These are variables that can (theoretically) be measured on an (infinite) continuous numeric scale.
    § There is an inherent order to numeric variables: there is a minimum and a maximum value and potential values in between.
    § Numeric variables are often summarized with means and standard deviations, medians and/or ranges.
    § These variables can be grouped into categories, but that is not how the data was originally collected; it was collected as a number.
  o Categorical variables: gender, hair color, school class. These are variables that are measured in mutually exclusive (non-overlapping) categories. There may or may not be an inherent order to categorical variables.
    § Nominal variables are in name only and there is no defined order as to which is better or higher, e.g. hair color.
    § Ordinal variables have an inherent order, e.g. low, medium, high; or an ordinal scale 1, 2, 3, 4, 5, where a 4.5 isn't possible and doesn't make sense.


    § These variables are often summarized with the number (n) and percentage (%) in each group.
    § These variables generally can't be reverted into continuous numeric variables; means and standard deviations don't make sense with grouped variables.
• Descriptive Statistics: Often called summary statistics, such as the number of subjects (N), the mean of values (µ), variance (s²), standard deviation (s). Can be depicted using plots, such as: histograms, box and scatter plots.
• Inferential Statistics: The process of using data to make generalizations to a population, such as: confidence intervals, estimation, prediction, etc. Inference is a conclusion that patterns in the data are present in the population.

When collecting data, make sure you collect a good sample to avoid a biased sample. For example, if you are trying to summarize how students feel about a political issue, ask men and women, Republicans and Democrats, freshmen and seniors, etc. There are several sampling procedures:
• Simple random sampling: the simplest sampling procedure involves selecting a subset of n objects from the population, such that each object has an equal chance of being selected
• Stratified sampling: sampling a subset of n from each gender, each age group, or each school class
• Convenience sampling: when it isn't possible to get a simple random sample, you sample what you have available to you
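The three sampling procedures above can be sketched in a few lines of Python. This is only an illustration, not part of the course software (the course uses JMP): the population of 12 students, the `school_class` field, and the function names are all made up for the example, and the convenience sample is simulated by simply taking the first people in the list.

```python
import random
from collections import defaultdict

# Hypothetical population: 12 students with a school class (made-up data).
population = [
    {"id": i, "school_class": cls}
    for i, cls in enumerate(
        ["freshman"] * 4 + ["sophomore"] * 3 + ["junior"] * 3 + ["senior"] * 2
    )
]

rng = random.Random(250)  # fixed seed so the sketch is repeatable


def simple_random_sample(objects, n):
    """Every object has an equal chance of being selected."""
    return rng.sample(objects, n)


def stratified_sample(objects, key, n_per_stratum):
    """Draw a simple random sample of n_per_stratum from each stratum."""
    strata = defaultdict(list)
    for obj in objects:
        strata[obj[key]].append(obj)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def convenience_sample(objects, n):
    """Take whoever is available first; here, simply the first n in the list."""
    return objects[:n]


srs = simple_random_sample(population, 4)
strat = stratified_sample(population, "school_class", 1)
conv = convenience_sample(population, 4)

print([s["school_class"] for s in srs])
print([s["school_class"] for s in strat])
print([s["school_class"] for s in conv])
```

In this sketch the stratified sample always contains one student from every class, while the convenience sample contains only freshmen because they happen to be listed first, which is the kind of bias the text warns about.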

Often the goal of a study is to declare a causal relationship between a response and predictors. The response could be change in blood pressure and the predictors could be age, gender, exercise, and weight. Unless the study is designed well, ensuring a random sample, with a temporal association, you won't be able to declare a causal relationship.

Cause and effect relationships should only be drawn from randomized experiments. Observational studies, where the subjects are not randomly chosen or allocated for study, can establish correlation between a response and predictors, not causation.

Keep in mind that

Correlation ≠ Causation

Just because two things are associated does not mean that one causes the other.

Inferences to populations should only be drawn from random sampling studies, such as


randomized clinical trials and designed laboratory experiments. Some other definitions that statisticians use:
• Parameters: These are unknown coefficients (variables) in the model that need to be estimated, such as the mean or standard deviation. Unless you have a census (all subjects in a population), these are never truly known, only estimated.
• Model: The statistical model is an equation that predicts the response (or outcome) as a function of other variables.
• Hypotheses: Usually in terms of null and alternative hypotheses. The question you are trying to answer and the alternative (or opposite) of that question. In statistics, the null hypothesis is usually the current standard or what you are trying to show is no longer valid. The alternative is what you are trying to show by statistically "rejecting" the null. You never prove or disprove a hypothesis; you reject or fail to reject, or accept or fail to accept, hypotheses.
• Statistical Significance: A precise statistical term that does not equate to practical or clinical significance. This usually means that the data provides evidence that the estimated parameter is not the same as the null value (assumed value).

Asking the question: Public Health research usually starts with a question. Is blood

pressure different by gender? Is blood pressure associated with age? With caloric

intake? With amount of exercise? Is high blood pressure associated with an increased

risk of having a stroke? Can medication lower blood pressure?

To answer these questions, and questions like these, we need to learn about data, about how to ask the right questions, how to use that data to answer the question, and how to convey what we learned.

How do we answer the question? How do we use statistics to answer the question? If you think about how you make decisions every day, this can be applied to making a statistical decision: use what you see, use what you know, use what you can show:
• Begin by writing down what you understand
• Outline what the data says and form clear and succinct questions pertaining to what the data may imply (or what you would like to show)
• Form a scientific question to determine if the results are random
• Compare the data from each side of the question and decide what to believe
• Write down what you found and what it means


Biostatistics and Public Health

According to the American Public Health Association (https://www.apha.org/what-is-public-health):

Public health promotes and protects the health of people and the

communities where they live, learn, work and play.

While a doctor treats people who are sick, those of us working in public

health try to prevent people from getting sick or injured in the first place. We

also promote wellness by encouraging healthy behaviors.

From conducting scientific research to educating about health, people in the

field of public health work to assure the conditions in which people can be

healthy. That can mean vaccinating children and adults to prevent the

spread of disease. Or educating people about the risks of alcohol and

tobacco. Public health sets safety standards to protect workers and develops

school nutrition programs to ensure kids have access to healthy food.

Public health works to track disease outbreaks, prevent injuries and shed

light on why some of us are more likely to suffer from poor health than

others. The many facets of public health include speaking out for laws that

promote smoke-free indoor air and seatbelts, spreading the word about ways

to stay healthy and giving science-based solutions to problems.

Public health saves money, improves our quality of life, helps children thrive

and reduces human suffering.

So how does Biostatistics fit in with public health? Biostatistics is a way of providing

evidence to support conclusions about disease, injury, treatment, behaviors, etc.

Biostatistics does this in 4 ways:

1. Associations: Biostatistics can help find associations between exposures

& outcomes, treatments & disease changes, comorbidities, behaviors &

physical changes, behaviors & diseases, etc.

2. Changes: Biostatistics is used to track changes over time

3. Targeting resources: Biostatistics can help determine disparities to target

interventions for change. Where can the greatest changes be made? What

group or groups should be focused on for change?

4. Comparisons: Biostatistics can compare impacts and find differences

We began with an introduction to biostatistics and defined some commonly used terms.

We'll get to each of these terms and how they apply to the steps to answer a question


with data. First, we will examine the types of data and variables and study methods that

can be used to collect data for analysis.

Chapter 1: Wrap up Questions
• What is the difference between a census and a sample?
• What are the two main types of variables?
• What are the two types of categorical variables?
• What is the difference between descriptive statistics and inferential statistics?
• How do we determine the importance of different public health problems?
• What can Biostatistics do for public health?
• What are the four types of evidence provided by biostatistics?


Chapter 1: Problems

An older apartment building is having water quality problems. In order to determine if there have been any adverse health outcomes in the building, you are going to collect information on the residents of the building:

1. If you have collected data on all persons living in the apartment building, is this

a census or a sample?

2. If you have collected data on people that were home during your visits to that

building, is that a census or a sample? If it is a sample, what type of sample is

this?

3. If you asked people to give their age and you collected it as Distribution (this means go to the Analyze Menu and select

Distribution)

3. Select your variable, then click on Y, Columns

4. Press OK

For this example:

1. Open PUH 250 Module 2 Example.jmp

2. Analyze > Distribution

3. Select Days Exercise

4. Press OK


Example Interpretations:
• Visitors to the UAB Rec Center were asked how many days in a week they exercise and, on average, how many minutes they exercise (N=10). The mean response was 3.7 (SD 1.9) days per week.
• Visitors to the UAB Rec Center were asked how many days in a week they exercise and, on average, how many minutes they exercise (N=10). The median response was 4.0 (IQR 2.3, 5.0) days per week.
• Visitors to the UAB Rec Center were asked how many days in a week they exercise and, on average, how many minutes they exercise (N=10). The median response was 4.0 (range 0, 7) days per week. → Note that here, the range is not that informative since there are only 0-7 days in a week.

When you have a continuous variable, you can describe the center and spread of the

data using the mean (SD), median (IQR), median (min, max), or the five-number

summary; depending upon the shape of the distribution and how much information you

want to convey about your data.
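The summaries named above (mean (SD), median (IQR), median (min, max), and the five-number summary) can also be computed outside JMP. The sketch below uses Python's standard `statistics` module on ten made-up responses; these are not the values in the course example file, so the results will not match the interpretations above. Quartile rules differ between software packages, so the IQR bounds printed here may differ slightly from what JMP reports for the same data.

```python
import statistics

# Hypothetical responses: days of exercise per week for 10 people
# (made-up data, not the values in the course example file).
days = [0, 2, 3, 3, 4, 4, 5, 5, 5, 7]

n = len(days)
mean = statistics.mean(days)
sd = statistics.stdev(days)  # sample standard deviation (divides by n - 1)
median = statistics.median(days)
q1, _, q3 = statistics.quantiles(days, n=4)  # quartile rules vary by package
low, high = min(days), max(days)

print(f"N = {n}")
print(f"Mean (SD): {mean:.1f} ({sd:.1f})")
print(f"Median (IQR): {median:.1f} ({q1:.1f}, {q3:.1f})")
print(f"Median (min, max): {median:.1f} ({low}, {high})")
print(f"Five-number summary: {low}, {q1:.1f}, {median:.1f}, {q3:.1f}, {high}")
```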

Chapter 2: Wrap up Questions
• What is the middle point of a distribution of numbers?
• What is the best estimate of a distribution of numbers?
• What is the most common value of a distribution of numbers?
• What are the 3 measures of spread for a distribution of numbers?


Chapter 2: Problems

Open JMP, under the Help menu > Sample Data, open Diabetes.jmp. This is a sample

dataset of 442 participants with diabetes.

Use Analyze > Distribution to answer the following questions for Age, BMI, Total

Cholesterol, Glucose, and HDL:

1. What is the mean (SD)?

2. What is the five-number summary?

3. Is the data roughly symmetric or is it skewed?

4. Would you report the mean (SD) or the five-number summary?


Chapter 3. Categorical Variables

Module Learning Objectives (Course Learning Objectives): At the end of this module students should be able to:
• MO 3.1: Identify the characteristics of a categorical variable (CLO 1, 3)
• MO 3.2: Determine variable type based on examination of the categories (response options) (CLO 1, 2, 5)
• MO 3.3: Order variables from lower to higher level of measurement (CLO 1, 2, 5)
• MO 3.4: Create variables with mutually exclusive and exhaustive categories (response options) (CLO 1, 2, 5)
• MO 3.5: Define proportion and percentage (CLO 1, 3)
• MO 3.6: Calculate proportions and percentages (CLO 1, 3, 5)
• MO 3.7: Interpret proportions and percentages (CLO 1, 3, 5)
• MO 3.8: Distinguish between categorical and continuous variables (CLO 1, 2)

Definitions for this module:
• Data
• Descriptive statistics
• Inferential statistics
• Variable
• Constant
• Categories (response options)
• Variable type
• Categorical
• Nominal
• Dichotomous
• Ordinal
• Mutually exclusive
• Exhaustive
• Percent
• Proportion
• Collapsed

Recall that Biostatistics is a way of using math to analyze the health of populations.

Biostatistics helps us to organize information (data) to look for patterns (descriptive

statistics) in our data so that we can make more general statements about the

population (inferential statistics). Since Biostatistics uses math, we do need numbers

to use for analysis. We saw this when we looked at continuous variables, that is, variables


that are numeric from a measurable and consistent scale. But what about questions that have answers that are not collected as numbers? Consider the following question:
• Have you ever been told you have high blood pressure?

Sure, we know that we can collect actual blood pressure measures (in mmHg) and then determine if someone meets the criteria for having high blood pressure, but what if we can't see them in a clinic? What if we are simply asking the Yes/No question? Specifically, the question asked on the NHANES (National Health and Nutrition Examination Survey) is:
• Have you ever been told by a doctor or other health professional that you had hypertension, also called high blood pressure?

The responses from this question (and others) allowed for the creation of Figure 1 (seen

below) in the Hypertension Prevalence Among Adults Aged 18 and Over: United States,

2017-2018 NCHS Data Brief, Number 364, April 2020 (found in Module 2).


In this case, hypertension is determined from a Yes/No response. It is measured in categories, as is sex (men and women). Sometimes numbers are assigned to these categories (Yes = 1, No = 0; Men = 1, Women = 2), but these numbers don't have any inherent meaning (we could have used Yes = 2, No = 1; Men = 0, Women = 1). Don't worry, you don't actually have to assign the numbers for analysis; this is done in the background in software.
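To see why the numbers assigned to categories carry no inherent meaning, the short Python sketch below counts the same made-up Yes/No responses under two different codings. The responses, the coding dictionaries, and the helper function are hypothetical and are only meant to illustrate the idea; this is not how any particular software stores categories internally.

```python
from collections import Counter

# Hypothetical Yes/No responses (made-up data).
responses = ["Yes", "No", "No", "Yes", "No"]

# Two different, equally valid numeric codings of the same categories.
coding_a = {"Yes": 1, "No": 0}
coding_b = {"Yes": 2, "No": 1}


def counts_by_label(values, coding):
    """Encode the labels as numbers, count the codes, then map back to labels."""
    decode = {code: label for label, code in coding.items()}
    code_counts = Counter(coding[v] for v in values)
    return {decode[code]: count for code, count in code_counts.items()}


counts_a = counts_by_label(responses, coding_a)
counts_b = counts_by_label(responses, coding_b)
print(counts_a)
print(counts_b)
```

Both codings give the same counts per category, which is why the choice of numbers does not affect a frequency summary.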

Of course, we don't just look around the world and convert everything we see to numbers; we are specific about what data we collect. Each individual question or aspect we collect data on is called a variable. As we've seen, variables can have natural numeric responses (like height, weight) but they can also have categories as response options. The term variable implies the responses have options, as opposed to a constant, which always has the same value. Variables are associated with the question being asked and the values are the responses.
• Learning Check: In the NCHS data brief, looking at Figures 1, 2, and 3, there are 5 variables used in the results. Can you identify them?

When we have a natural numeric response and use the variable in that way, those are called continuous variables (as we saw in Chapter 2). Other responses are measured more simply (out of ease or design) by putting the responses into categories; these are categorical variables. The difference between continuous and categorical variables is the response options. Variables range from very simple to more complex levels of measurement. The table below outlines our 2 types of variable classes.

Variable: Continuous (Numeric)
Response Options:
• Variables have response options that are numbers, where the numbers have an inherent meaning and an associated scale (units) of measurement

Variable: Categorical (Character)
Response Options:
• Dichotomous variables only have 2 possible responses
• Nominal variables have 3 or more response options that are in name only (words) that are not in a specific preferred order
• Ordinal variables have response options that can be put in order or ranked. These can be either words (ranging from less to more) or groups of numbers captured as a group (1-3, 4-6, 7-9).

• Learning Check: In Figures 1, 2, and 3 in the NCHS Data Brief, there are 2 dichotomous variables, 2 nominal variables, and 1 ordinal variable. Can you identify them?


The categories (response options) for variables need to be carefully and specifically designed to accurately capture all the possible responses that people could give. For example:
• How many times have you been told you have high blood pressure?
• Response options: 0 times, 1-2 times, 3-5 times, more than 5 times

These categories are mutually exclusive, meaning that any response selected fits into only one category because there is no overlap. Often, we do see categories with bounds that do overlap: 0-2, 3-5, 5 or more. In this case, if your answer were 5, you could be in either the 3-5 or the 5 or more category. These are NOT mutually exclusive.

Categories must also be exhaustive, meaning that every possible response must be in one of the categories. What if the responses were 0-2, 3-4, 5-6? Then what if a person had been told 10 times they had high blood pressure? These categories would not be exhaustive because those with responses of 7 or greater have nowhere to go. How do you allow for exhaustive response options without having too many categories?
• For categorical, include an "or more" option or an "other" option
• For continuous, allow respondents to write in the number
• It is also very good practice to give the following: don't know, unsure, or prefer not to answer. Depending on the question, this can allow everyone to answer the question, even if they don't know or do not want to give specific information.
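The two requirements above, mutually exclusive and exhaustive, can be checked mechanically for count-type response options. The Python sketch below is one possible way to do so and is not part of the course materials: each category is written as a hypothetical (label, low, high) triple, `None` stands for an open-ended "or more" bound, and the check only tries whole-number answers from 0 up to an arbitrary cut-off of 20.

```python
def categories_for(value, categories):
    """Return the labels of every category whose bounds contain value."""
    return [
        label
        for label, low, high in categories
        if value >= low and (high is None or value <= high)
    ]


def check_categories(categories, max_value=20):
    """Return (mutually_exclusive, exhaustive) for answers 0..max_value."""
    matches = [len(categories_for(v, categories)) for v in range(max_value + 1)]
    mutually_exclusive = all(m <= 1 for m in matches)  # no answer fits twice
    exhaustive = all(m >= 1 for m in matches)          # every answer fits somewhere
    return mutually_exclusive, exhaustive


# The three sets of response options discussed in the text.
good = [("0 times", 0, 0), ("1-2 times", 1, 2), ("3-5 times", 3, 5),
        ("more than 5 times", 6, None)]
overlapping = [("0-2", 0, 2), ("3-5", 3, 5), ("5 or more", 5, None)]
incomplete = [("0-2", 0, 2), ("3-4", 3, 4), ("5-6", 5, 6)]

print(check_categories(good))
print(check_categories(overlapping))
print(check_categories(incomplete))
```

Under these assumptions, only the first set passes both checks: the second fails because an answer of 5 fits two categories, and the third fails because answers of 7 or more fit none.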

Describing Categorical Variables

Once you have designed your categorical variable, you need to think about how to describe the responses. As with continuous measures, you will use a frequency, the number of times something occurs. Again, you will use software to organize the data and create a frequency distribution (also called a distribution), which determines the number of times each response occurs and divides this number by the total number of responses, N (also called the Sample Size), to determine the percent of each response (%).

It is important to note the difference between two commonly used terms with categorical frequency:
• Percent: 0-100%. This is very common in everyday language:
  o Number of responses in that category / N (total number of responses) * 100
• Proportion or Probability: 0-1. This is a statistical version of the same concept. If you multiply a proportion by 100, it becomes a percent. You'll see Prob in JMP (and other software) but when you are reporting results, make sure to report percentages (people understand percentages):
  o Number of responses in that category / N (total number of responses)


Consider this example for the same question of "How many times have you been told you have High Blood Pressure?" with the response options of 0 (Never), 1-2, 3-5, or 6 or more. Here is the data with output from JMP (Analyze > Distribution):

JMP output shows the:
• Histogram: Chart with bars for each level
• Level: the response options
• Count: Also called N, the number of times the response was recorded
• Prob: the probability or proportion out of the total number of responses
  o To go from Prob to %, multiply Prob * 100
  o E.g., 0.45 * 100 = 45.0%
  o 45.0% (9) of 20 respondents indicated they have never been told they have high blood pressure.
• Total: total number with a response (note that this may be different from the total number of people in your overall data set)
• N Missing: the number in your overall data set that do not have a response to this variable.
  o Total + N Missing = Number of rows in your data set

Distributions: Times High BP

[Histogram: one bar per level: 0 (Never), 1-2, 3-5, 6 or more]

Frequencies
Level        Count   Prob
0 (Never)    9       0.45000
1-2          4       0.20000
3-5          4       0.20000
6 or more    3       0.15000
Total        20      1.00000
N Missing    0
4 Levels

This could also be shown in a Cumulative Frequency table:

Table 3.1
Frequency Distribution for Number of Times Told has High BP

Number of Times   Number of Responses   % of all responses   Cumulative %
0 (Never)         9                     45.0                 45.0
1-2               4                     20.0                 65.0
3-5               4                     20.0                 85.0
6 or more         3                     15.0                 100.0
N =               20                    100%

Data Source: NHANES Simulation (← this is made up data!)

Again, you'll have a Cumulative % that adds together each of the frequencies up to a certain point in the distribution. With mutually exclusive and exhaustive response options, all of your % should total to 100%.
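The frequency distribution above (count, proportion, percent, and cumulative percent) can be reproduced with a short Python sketch. This is an illustration only, not JMP output: the list of responses is rebuilt from the counts shown in the example (9, 4, 4, 3), and the variable names are chosen for this sketch.

```python
from collections import Counter

# Ordered response options for the question.
levels = ["0 (Never)", "1-2", "3-5", "6 or more"]

# Made-up responses rebuilt from the example counts: 9, 4, 4, 3.
responses = ["0 (Never)"] * 9 + ["1-2"] * 4 + ["3-5"] * 4 + ["6 or more"] * 3

counts = Counter(responses)
total = len(responses)  # N, the total number of responses

rows = []
cumulative = 0.0
for level in levels:
    count = counts[level]
    proportion = count / total      # 0-1 scale (Prob in JMP)
    percent = 100 * count / total   # 0-100 scale
    cumulative += percent           # running total of the percents
    rows.append((level, count, proportion, percent, cumulative))

print(f"{'Level':<10} {'N':>3} {'Prob':>5} {'%':>6} {'Cum %':>6}")
for level, count, proportion, percent, cum in rows:
    print(f"{level:<10} {count:>3} {proportion:>5.2f} {percent:>6.1f} {cum:>6.1f}")
print(f"N = {total}")
```

Because the response options are mutually exclusive and exhaustive, the last cumulative percent in the sketch comes out to 100.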


• Learning Check: Which type of categorical variable is this? Dichotomous, Nominal, or Ordinal?

Interpretation guidelines for frequency distributions:
• Description of the variable (what question was asked?)
• The total N (and any N missing; that is, are there people who didn't answer the question?)
• Percentage and N of cases in each category or percentages associated with common numeric responses: e.g. 45.0 (9) as seen above
• Source of the data

Since categorical variables are more simply described, there are fewer options for description. The most common are:
• N(%): this is the number of responses (N) for each option with the percentage out of the total. It is critical to report not just the % but also the N. This is because you can get 20.0% in many different ways (for example, 1 of 5 or 200 of 1,000).
  o The most common response was that respondents have Never been told they have high blood pressure, 45.0% (9).
  o Or you will also see it as N(%) or 9 (45.0); either is fine.
• Mode: The mode is the most common response in a distribution. For a continuous variable it is the number that occurs most frequently out of all the responses. Be sure to report the actual number, not the frequency. The mode is appropriate to report for all variable types, those collected as a number and those collected as a category.
  o Mode = 0 (Never) in this example

Measures of Dispersion: the spread of your data

In common conversation, you would ask how much spread is in responses? Recall with

continuous data, there were 3 main ways to describe dispersion: range, IQR, or SD.

With categorical responses, the frequency distribution helps to demonstrate the spread

of data (histograms are a common way to display categorical responses). But is there

variance associated with categorical responses?

Yes... but... you don't need to worry about reporting the variance with categorical responses. The variance here is a function of the proportions. That is, if you know the proportion, you can get the variance from that value. So unlike continuous variables, where you need to report both a mean and SD because you can have the same mean but different SDs or the same SD with different means (statistical note: mean and SD are independent!), with categorical responses you only need to report the n and proportion of each response.
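As a small check of the statement that the variance is a function of the proportion, the Python sketch below codes one response category as 1 and everything else as 0. For such a 0/1 variable with proportion p, the population variance (dividing by N) equals p(1 - p). The data are made up to match the 9-of-20 example, and the use of the population rather than the sample variance is an assumption of this sketch.

```python
import math
import statistics

# 9 of 20 made-up respondents answered "0 (Never)"; code that answer as 1
# and every other answer as 0.
indicator = [1] * 9 + [0] * 11

p = sum(indicator) / len(indicator)  # proportion answering "0 (Never)"

# Population variance computed directly from the 0/1 data (divides by N).
variance_from_data = statistics.pvariance(indicator)

# The same variance computed from the proportion alone.
variance_from_p = p * (1 - p)

print(f"p = {p:.2f}")
print(f"variance from data = {variance_from_data:.4f}")
print(f"p * (1 - p)        = {variance_from_p:.4f}")
```

The two variance values agree, so once the proportion is reported the variance adds no new information.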


So what do you report? In general:
• N (%)
• Cumulative % if the responses are ordinal
• Mode if that is informative
• Sometimes it is useful to report or show graphically all the categories with the frequencies of each category

Steps in JMP:

1. Open your data

2. Analyze > Distribution (this means go to the Analyze Menu and select

Distribution)

3. Select your variable, then click on Y, Columns

4. Press OK

For this example:

1. Open PUH 250 Module 3 Example.jmp

2. Analyze > Distribution

3. Select Number of Times High Blood Pressure

4. Press OK

Example Interpretations:

• Twenty (20) adults were asked how many times they had been told they had high

blood pressure. The most common response was never having been told they had

high blood pressure, 9 (45.0%). The same number of respondents said

they had been told 1-2 times or 3-5 times, 4 (20.0%) each, and only 3 (15.0%)

reported they had been told 6 or more times.

• You could also reference a table or figure and use fewer words: Twenty (20)

adults were asked how many times they had been told they had high blood

pressure. Nine (45.0%) reported that they had never been told they had high

blood pressure, while the majority, 11 (55.0%), reported 1 or more times (Table X or

Figure X).

The second interpretation here collapsed the categories for simpler reporting.

That is, it took all the categories where someone responded 1-2, 3-5, or 6 or more

and combined them / added them up: 4+4+3 = 11, then divided by the total of 20: 11/20 =

0.55, and 0.55*100 = 55.0%.
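The same arithmetic can be sketched outside JMP. The short Python example below is only an illustration: it uses the counts reported in the interpretation above (the category labels are paraphrased) to build the n (%) and cumulative % columns, then collapses the three "told at least once" categories.

```python
# Counts reported in the example above; the category labels are paraphrased.
counts = {"Never": 9, "1-2 times": 4, "3-5 times": 4, "6 or more times": 3}
total = sum(counts.values())  # 20 respondents

# Build (label, n, %, cumulative %) for each category, in order.
cumulative = 0
table = []
for label, n in counts.items():
    cumulative += n
    table.append((label, n, round(100 * n / total, 1), round(100 * cumulative / total, 1)))

for row in table:
    print(row)

# Collapse every category except "Never" into "1 or more times".
one_or_more = sum(n for label, n in counts.items() if label != "Never")
collapsed_pct = round(100 * one_or_more / total, 1)
print(one_or_more, collapsed_pct)
```

The cumulative % column is only meaningful here because the categories are ordinal.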


• You can do this in 2 ways in JMP:

o Recode option: easy if you have only a small number of original categories

o Create a formula: if you have a lot of categories or if you want to go from a

continuous variable to categories

o Both will be demonstrated in the practice JMP videos

Continuous vs Categorical

If you can collect a variable as a continuous number, then do so! You can always

create categories later. Sometimes it is not possible to collect a number, even when

something is truly quantitative in nature. For example:

• It would be possible to collect the age someone was when they first had an

alcoholic beverage but depending upon when you asked the question, people

may not remember exactly. In this case you would likely give them response

options like […]

2. Analyze > Fit Y by X

3. Put your Y (Helmet Use) in the Y, Response box

4. Put your X (Sex) in the X, Factor box

5. Press OK

Aside: JMP is going to tell you what analysis will be run, if you look at the data types on

the X and Y in the Fit Y by X platform window.

Y: Helmet use is a Nominal variable (red histogram)

X: Sex is a Nominal variable (red histogram)

Let’s look at the JMP Output you get:

The first thing you see is the Mosaic Plot. This

plot shows you the proportion of Females (Left)

where Helmet Use = Yes (Blue, Top) and = No

(Red, Bottom). It also shows the same for Males

(Right).

The far right, thin bar shows the proportion of

Helmet Use overall (without regard to Sex). Here

you can see that the observed proportion of Yes

for Females is a bit higher (more Blue) than for

Males. You can also see, however, that the split

of Yes/No is fairly close to the overall split of

Yes/No in the far right bar.

Note: The Mosaic Plot is only useful for YOU. This is not a plot you will turn in for an

assignment or include in a paper or your data brief. There are other ways to display this

information.

Next in JMP Output: Contingency Table

• The predictor (X) is in the rows (side)

• The outcome (Y) is in the columns (top)

• The numbers at the end of each row and the

bottom of each column show the number that have

that response, while the bottom right corner shows

the total number of observations (30):

o Row Totals: 15 females, 15 males

o Column Totals: 17 report not wearing a helmet,

13 report wearing a helmet

• The top number in each cell is the number of cases

with that combination of responses:

o 8 females report not wearing a helmet

o 7 females report wearing a helmet

o 9 males report not wearing a helmet

o 6 males report wearing a helmet

• There are 3 % values given for each combination of responses

o Total %: this is the number / the total:

▪ 8 / 30 = 0.2667 or 26.7% of all respondents are female AND did not wear a helmet

o Col %: this is the number / the total in that column:

▪ 8 / 17 = 0.4706 or 47.1% of those that did not wear a helmet are female

o Row %: this is the number / the total in that row:

▪ 8 / 15 = 0.5333 or 53.3% of females did not wear a helmet

▪ This is the most natural way to talk about contingency tables. You will

generally use Row % to report and compare values.

These are also called conditional probabilities. You are conditioning on one

variable and reporting the % of the other. Of Females…this % of the outcome…
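If it helps to see where the three percentages come from, here is a small Python sketch using the cell counts from the helmet table. It is not how JMP computes its output, just the same arithmetic written out; the dictionary layout and the function name are our own choices.

```python
# Cell counts from the helmet example: rows are Sex (X), columns are Helmet Use (Y).
table = {
    "Female": {"No": 8, "Yes": 7},
    "Male": {"No": 9, "Yes": 6},
}

row_totals = {sex: sum(cells.values()) for sex, cells in table.items()}
col_totals = {}
for cells in table.values():
    for answer, n in cells.items():
        col_totals[answer] = col_totals.get(answer, 0) + n
grand_total = sum(row_totals.values())

def percents(sex, answer):
    """Return (Total %, Col %, Row %) for one cell, rounded to 1 decimal place."""
    n = table[sex][answer]
    return (
        round(100 * n / grand_total, 1),
        round(100 * n / col_totals[answer], 1),
        round(100 * n / row_totals[sex], 1),
    )

print(percents("Female", "No"))
print(percents("Female", "Yes"), percents("Male", "Yes"))
```

Only the denominator changes between the three values: the grand total, the column total, or the row total.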

Interpreting contingency table analysis

1. For a 2×2 table, select the outcome of interest to report. Usually this is the presence

of some condition, attitude, behavior, event, etc. Occasionally we want to report the “No”

response (or the absence of something), but this is less common. For a

larger table, you can report on the most interesting or unexpected outcome based

upon the patterns you observe.

2. Reading down the column for the outcome of interest (in this example, helmet use =

Yes), find the row percentages.

3. Assess the row percentages for each level of the predictor:

• Are the percentages the same, regardless of the category of the predictor? If so,

there is no relationship between the predictor and the outcome in this sample

• Are the percentages similar, regardless of the category of the predictor? If so,

there may be limited evidence of a relationship or a weak relationship between

the predictor and the outcome in this sample

• Are the percentages different (by more than about 10 percentage points)? If so, there may be

evidence of a strong relationship between the predictor and the outcome in this

sample

In this case, we will select the Yes, wears a helmet response. Reading down the Yes

column for the row percentages for each gender, we observe that:

• 46.7% of females report wearing a helmet

• 40.0% of males report wearing a helmet

• We have limited evidence of a relationship between gender and wearing a

helmet when riding a ZYP bike in this sample from Birmingham, AL.

• There will be a statistical test for this to formally decide between our H0 and

HA – more on this later in the course
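The rule of thumb for reading row percentages can be written as a tiny Python helper. This is only a sketch of the informal guideline, not a statistical test: the function name is ours, and we assume "similar" means a gap of up to about 10 percentage points.

```python
def describe_difference(pct_a, pct_b, threshold=10.0):
    """Apply the informal rule of thumb to two row percentages."""
    gap = abs(pct_a - pct_b)
    if gap == 0:
        return "no relationship observed"
    if gap <= threshold:
        return "limited evidence of a relationship"
    return "possible strong relationship"

female_yes = 100 * 7 / 15  # row % of females wearing a helmet
male_yes = 100 * 6 / 15    # row % of males wearing a helmet
print(round(female_yes - male_yes, 1), describe_difference(female_yes, male_yes))
```

The gap between females and males is under 10 percentage points, which matches the "limited evidence" wording used in the interpretation.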


Comparisons of Group Means: Categorical predictor and continuous outcome

Using this same Fit Y by X approach, we can compare the mean value on an outcome

variable separately for two (or more) groups of a predictor variable.

Appropriate ways of posing research questions for comparisons of group means:

• Do females report a greater average number of times riding a ZYP bike?

• Is there a gender difference in average number of times riding a ZYP bike?

• Does the average number of times riding a ZYP bike vary by gender?

Put in terms of the hypotheses:

• H0: There is no difference in mean number of rides by gender

• HA: There is a difference in mean number of rides by gender

You could also write it this way:

• H0: The mean number of times people ride a bike is the same for each sex

• HA: The mean number of times people ride a bike is not the same for each sex

Steps in JMP:

1. Make sure your data is recoded!

2. Analyze > Fit Y by X

3. Put your Y (Zyp times) in the Y, Response box

4. Put your X (Sex) in the X, Factor box

5. Press OK

Y: Zyp times is a continuous variable (blue triangle)

X: Sex is a Nominal variable (red histogram)

Let’s look at the JMP Output: Bivariate Plot of Zyp Times by Sex

There are two groups on the X axis (bottom horizontal line), and the observed Zyp times

on the Y axis (vertical left line). You see a dot for each Zyp time reported but be careful,

this isn’t all the people that responded. There may be multiple people with the same

response. There are only 14 dots here and we know there were 30 responses.

The grey horizontal line (just

below 5) shows the overall mean

of Zyp times. You would already know this by doing Analyze > Distribution on Zyptimes

overall when you started examining your data with univariate analysis (which you

ALWAYS do before you start bivariate analysis). So how do you get the means by each

sex?

Click on the red triangle next to Oneway Analysis of… and select “Means and Std Dev”


Important things to notice about the comparison of group means:

• The predictor is on the X axis

• The outcome is on the Y axis

• The plot shows the overall mean (grey bar)

• And means for each group (small middle blue dash, hard to see)

• The bottom panel shows the mean and SD for each group:

o The mean (SD) for females = 3.7 (3.9)

o The mean (SD) for males = 4.5 (4.5)

• JMP gives you way too many decimal places. Usually 1 is sufficient; this is

up to YOU to edit for your data.

Interpreting the comparison of group means

• The mean number of times riding a ZYP bike for females is 3.7 (SD = 3.9), while the

mean number for males is 4.5 (SD = 4.5) in this sample from Birmingham, AL.

• For now, we will say it is observed that the males report riding about 1 time more

per month compared to females (4.5 vs 3.7).

• Again, there is a statistical test to help us pick between H0 and HA, more on that

later in the course.
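For readers who want to see what "Means and Std Dev" is doing, here is a minimal Python sketch of a mean (SD) by group. The ride counts below are invented for illustration only; they are not the ZYP survey data and will not reproduce 3.7 (3.9) or 4.5 (4.5).

```python
from statistics import mean, stdev

# Hypothetical rides-per-month values, invented for illustration only.
rides = [
    ("Female", 0), ("Female", 2), ("Female", 3), ("Female", 7),
    ("Male", 1), ("Male", 4), ("Male", 5), ("Male", 10),
]

# Split the outcome (rides) by the levels of the predictor (sex).
by_group = {}
for sex, n in rides:
    by_group.setdefault(sex, []).append(n)

# Mean and sample SD for each group, rounded to 1 decimal place.
summary = {
    sex: (round(float(mean(values)), 1), round(stdev(values), 1))
    for sex, values in by_group.items()
}
for sex, (m, sd) in summary.items():
    print(f"{sex}: mean (SD) = {m} ({sd})")
```

Rounding to 1 decimal place follows the reporting advice above.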

What about medians, range, or IQR? You can also get that from the red triangle dropdown

by selecting Quantiles:

• Now you have a Box Plot (see the boxes?) by Sex

• And you get the Quantiles, similar to what you saw in the Univariate analysis from the

Analyze > Distribution option

• There will be a statistical test for comparing medians, too.


Correlation Analysis: Continuous predictor and continuous outcome

Using this same Fit Y by X approach, we can assess the relationship between a

predictor and an outcome variable when both are continuous variables. You may

remember correlation from high school math class, when you graphed coordinates and

looked at the pattern they formed. You may remember it as “Rise over Run” or Y = mx + b?

In statistics, we use this concept with scatterplots and the correlation coefficient to

describe the relationship between two continuous variables. For correlation, there are

also multiple ways to pose the question but they are all asking the same thing: is there a

relationship between two variables?

Appropriate ways of posing research questions for correlation analysis:

• Is there an association between age of rider and the number of times riding a ZYP

bike?

• Is there a relationship between the number of times riding a ZYP bike and the

age of the rider?

• Are age of rider and number of times riding a ZYP bike correlated?

• Note: in statistics, correlation is a very specific term for two continuous variables.

For other relationships, say association or relationship.

Put in terms of the hypotheses:

• H0: There is no relationship between age and number of rides

• HA: There is a relationship between age and number of rides

Steps in JMP:

1. Make sure your data is recoded!

2. Analyze > Fit Y by X

3. Put your Y (Zyp times) in the Y, Response box

4. Put your X (Age) in the X, Factor box.

• Why is age the predictor? Because age in this case is fixed (independent)

and we are trying to predict frequency of rides.

5. Press OK


Y: Zyp times is a continuous variable (blue triangle)

X: age is a continuous variable (blue triangle)

Look! Bivariate Analysis!

Let’s look at the JMP Output: Bivariate Fit of ZYP time by Age

• Here, ZYP Time is on the Y axis

• Age is on the X axis

• There is a dot for each unique pair of responses (for example, age = 19,

zyptimes = 1 is the lower left dot. Hover over the point in JMP and it will tell you

this is row 1)

• If there are multiple pairs with the same set of responses, there will be only 1 dot

Getting correlation: Again, the red drop down triangle:

• Ask for a 0.95 Density Ellipse

• You’ll then need to click on the grey triangle next to Bivariate Normal Ellipse to open

up the results box.


Important things to notice about correlation analysis:

• The predictor is on the X axis

• The outcome is on the Y axis

• The scatterplot shows the:

o Strength: is it a tight or spread out cluster of points?

o Direction: is it positive (increases left to right) or negative (decreases left to right)?

• The bottom panel gives the precise value and direction of the correlation

o Here it is negative, -0.37

o You also get the mean (SD) of each variable

• For now, ignore the p-value, this is

the statistical test, and (you guessed it) more on that later in the course

Interpreting correlation

The correlation coefficient (written as “r”, also called Pearson’s correlation or Pearson’s

r) ranges from negative 1 to positive 1, or -1.0 to +1.0 (with 0 in the middle)

• Values nearer to -1.0 are a stronger negative correlation (inverse relationship, as

X increases, Y decreases)

• Values nearer to 0 indicate no association or a weaker relationship

• Values nearer to +1.0 are a stronger positive correlation (as X increases, Y

increases)

• What about values in the middle?

o Between 0 and 0.30 is considered weak

o Between 0.31 and 0.50 is moderate

o Over 0.50 is a strong correlation

o This is true for both positive and negative correlations, just make sure to

say if it is a positive or negative (inverse) relationship

In this case, we would say:

• There is a moderate, negative correlation between age and number of times

riding a ZYP bike (r = -0.37) in this sample from Birmingham, AL.
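As a rough illustration of what sits behind the number JMP reports, the Python sketch below computes Pearson's r from its textbook formula and labels it with the weak / moderate / strong cutoffs given above. The function names are ours, and the ages and ride counts are invented for illustration; only the final line uses the r = -0.37 reported for the ZYP sample.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def describe(r):
    """Label r using the cutoffs from the text: weak, moderate, or strong."""
    size = abs(r)
    strength = "weak" if size <= 0.30 else "moderate" if size <= 0.50 else "strong"
    direction = "positive" if r > 0 else "negative" if r < 0 else "no"
    return f"{strength}, {direction} correlation"

# Hypothetical ages and ride counts, invented for illustration (not the ZYP data).
ages = [19, 22, 25, 31, 40, 52]
rides = [6, 1, 7, 2, 5, 1]
r = pearson_r(ages, rides)
print(round(r, 2), describe(r))

# The value reported in the text for the ZYP sample.
print(describe(-0.37))
```

The label depends only on the size of r, and the sign supplies the direction, which is why the same cutoffs work for positive and negative correlations.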


Interpretation guidelines for all bivariate analyses

• State the question

o Include how each variable is measured (including the question wording and the

variable type or levels)

• What is the N in the sample and what is the N on both variables (must have both to

be in a bivariate analysis)

• Univariate summary measures: n(%) for each level or mean (SD) / median (IQR or

Range)

• The bivariate summary (as applicable from above)

• The interpretation sentence, based on the bivariate analysis conducted.

• Include the sample information and source of the data

• In all of the above, make sure to use real words and terms NOT the variable names

Chapter 4: Wrap up Questions

• What is a research question?

• What are the null (H0) and alternative hypotheses (HA)?

• What are the 3 types of bivariate analyses?

• What are the two variable types needed for each bivariate analysis?

• What are the summaries reported for each type of analysis?


Chapter 4: Problems

Using the same Sample Data from Chapter 2: open Diabetes.jmp. This is a sample

dataset of 442 participants with diabetes.

Use Fit Y by X to answer the following (make sure that Gender is Recoded, 1 = Male, 2

= Female):

1. For Y = Y Binary and X = Gender:

• What is the question you are asking?

• What is the name of this analysis?

• What is the overall N(%) with High Glucose?

• What is the N(%) of Males with High Glucose? Of Females?

• Write the interpretation according to the guidelines for this type of analysis

2. For Y = Glucose and X = Gender:

• What is the question you are asking?

• What is the name of this analysis?

• What is the overall mean (SD) of Glucose? (You need to use Analyze >

Distribution for this, like you did in Chapter 2)

• What is the mean (SD) of Glucose for Males? For Females?

• Write the interpretation according to the guidelines for this type of analysis

3. For Y = Glucose and X = Age:

• What is the question you are asking?

• What is the name of this analysis?

• What is the overall mean (SD) of age and Glucose? You can get these from

Analyze > Distribution but they are also shown in the JMP output for this

analysis.

• What is the correlation?

• Write the interpretation according to the guidelines for this type of analysis


Chapter 5.

Collecting the Evidence

Module Learning Objectives (Course Learning Objectives): At the end of this

module students should be able to:

• MO 5.1: Define population and a population sample (CLO 1, 2)

• MO 5.2: Identify sampling techniques (CLO 1, 2)

• MO 5.3: Differentiate between observational and experimental study designs

(CLO 1, 2)

• MO 5.4: Define causality (CLO 1, 2, 3)

• MO 5.5: Identify the criteria for concluding a causal relationship (CLO 1, 2, 3, 4)

Definitions for this module:

• Population

• Sample

• Representative

• Simple random sample

• Non-random or convenience sample

• Bias

• Generalizability

• Inference

• Unit of analysis

• Case

• Control

• Study designs:

o Observation

o Experiment

• Causation:

o Temporal precedence

o Correlation

o No plausible alternative

In the last few modules, we have looked at descriptive summary statistics (univariate

statistics) and using two variables to determine if they are related (bivariate statistics).

We briefly talked about identifying the population of reference so that your results have

context. That is, you wouldn’t want to report results from a study in only adult women

and try to apply those results to adolescent males. So it is important to understand

where you got your data sample, from which population, so you can make statements

about the results.


Population

A population is defined as “the whole number of people or inhabitants in a country or

region or the total of individuals occupying an area or making up a whole” according to

the Merriam-Webster Dictionary. This means that you can define a population very

broadly (United States) or very narrowly (student athletes at UAB). But in either case,

population refers to the entire group of things meeting your definition. Very often, we

won’t actually know the true number in our populations, e.g. people with arthritis. Why?

Because some people won’t ever get diagnosed. That is why you often hear about a

population in research being defined as “people DIAGNOSED with” a condition.

Statistically, a population is a well-defined group that is of interest for a research

question. Here also, it may not be possible to actually collect information on all people

40-65 years of age with arthritis. So we turn to samples from the population.

Sampling

Although our goal in public health is to describe and hopefully improve the health of

populations, we usually only study samples. It is often too expensive, time-consuming,

and logistically challenging to study a whole population, and luckily the power of

inferential statistics means we don’t have to (more on this in the second half of the

semester!). Inference or inferential statistics means we can take the results from our

sample and infer (or conclude) they apply to the population, too.

The key is selecting a sample that is representative (meaning it shares the same

characteristics) of the population from which it was drawn. The best way to do this is

through simple random sampling – putting the names of every single person in the

population in a hat, shaking it up, and then randomly selecting a subset of names. But

as you can imagine, simple random sampling is often not so simple: Where do we get

all the names? How do we find all the people whose names we drew and convince them

to participate in our study? And just how big is this hat???
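The "names in a hat" idea can be sketched in a few lines of Python. This is only an illustration: the list of 500 students is a made-up sampling frame, and the seed is fixed simply so the same draw can be repeated.

```python
import random

# Hypothetical sampling frame: every member of the population must be listed.
population = [f"Student {i}" for i in range(1, 501)]

rng = random.Random(250)             # fixed seed only so the draw can be repeated
sample = rng.sample(population, 30)  # 30 names drawn without replacement

print(len(sample), len(set(sample)))
print(sample[:5])
```

Every student in the list has the same chance of being drawn, which is what makes the sample "simple random". The hard part in practice is the first line: getting a complete list of the population.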

Though simple random sampling is the ideal procedure, it is rarely followed. More often

we draw a non-random or convenience sample, trying as much as possible to avoid

drawing it in such a way that it is also decidedly non-representative. For example, we

want to avoid sampling only one type of person, or sampling only at a specific time or

location. Doing so is likely to introduce bias, meaning that our sample will represent

some portions of the population but not others. This can make certain characteristics,

behaviors, outcomes, etc. appear more or less common than they actually are, limiting

the generalizability (our ability to accurately describe a population using a sample) of


our results. Ideally we want to include lots of different types of people in our sample, to

account for all the different types of people that make up the population.

Although we tend to think in terms of individuals making up our samples and the overall

US population as “the” population, samples and populations can be made up of

anything. Imagine we’re conducting a study of air pollution in US cities. We could collect

air pollution data in a sample of cities, with the goal of generalizing to all US cities. The

unit of analysis here is a city – that’s who (or what) we’re collecting data from and who

the data describe. Each city is a case, or individual unit in our study.

Study Design

One of the most interesting aspects of public health research is the variety – there are

countless ways to collect public health data. All of these data collection techniques can

generally be sorted into two key study designs, based on the role of the researcher. In

observational designs, the researcher directly observes data; for example, by asking

questions using a survey (like the YRBS) or by taking measurements of contaminants in

the air (like in the air pollution study described above).

In experimental designs, by contrast, the researcher is more involved. An experimental

study starts with the researcher randomly sorting study participants into two groups: the

experimental group and the control group. The researcher then provides some

treatment or stimulus to the experimental group – gives them nutritional counseling,

provides them with a medication – whatever the researcher is attempting to determine

the effect of. The control group does not receive this same treatment. The researcher

then examines the outcome of interest in both groups to see if being in the experimental

group has led to a different value on the outcome.

Experimental designs can be hard to implement, and sometimes it isn’t possible due to

ethical concerns (you certainly couldn’t assign one group to smoke and one group not to

smoke to determine whether smoking causes cancer!). But, experimental designs are

an incredibly powerful tool for establishing cause and effect due to the fact that study

participants are randomized into either the experimental or control group.

Randomization means that the two groups will be very similar in their characteristics

with only their assignment to group (experimental or control) differing, so the effect of

whatever treatment they receive in the experiment will be clear. This is why

experimental designs are considered the “gold standard” in the research world.
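Random assignment itself is simple to sketch. In the Python example below the participant IDs are hypothetical and the seed is fixed only so the example is repeatable; the point is that chance alone, not the researcher or the participant, decides who lands in each group.

```python
import random

# Hypothetical participant IDs; the group names follow the text.
participants = [f"P{i:02d}" for i in range(1, 21)]

rng = random.Random(5)  # fixed seed only so the example is repeatable
shuffled = participants[:]
rng.shuffle(shuffled)

# First half of the shuffled list gets the treatment, second half does not.
half = len(shuffled) // 2
experimental = shuffled[:half]
control = shuffled[half:]
print(len(experimental), len(control))
```

Note the difference from sampling: random sampling decides who is in the study, while random assignment decides which group each person already in the study joins.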


Causality

As noted above, experimental designs are usually necessary to establish causality. Causality

refers to situations in which values on a predictor variable are understood to directly

cause values on the outcome variable (like in the smoking -> cancer example we looked

at earlier in the semester). We often speak casually about causality, but in fact it is very

challenging to establish evidence of a causal relationship. To do so, three key criteria

must be met (see additional information on Hill’s Criteria in online Module):

1. Temporal precedence. This sounds fancy, but it really just means that the cause

has to come before the effect in time. This can be difficult to establish, especially in

observational designs that just collect data at one time point. How do we determine

that X causes Y, rather than Y causing X? If we observe both variables at the same

time, we struggle to establish which occurred first. This is one reason experimental

designs are so powerful – we can be certain that the treatment (X) occurs before the

outcome (Y) we’re interested in!

2. Correlation/Association. This is exactly like it sounds and refers to whether there is

an association or relationship between the two variables we’re interested in.

Statistically, we have multiple ways to establish this, but the two most common are:

• Pearson’s correlation coefficient (written as “r”) – this is a measure of

association for two continuous variables. That is, what happens to Y as X

increases? It can be negative (indicating an inverse relationship, Y decreases as

X increases) or positive (Y increases as X increases). It ranges from -1.0 to +1.0,

with 0 meaning no association. Closer to -1.0 or +1.0 indicates a stronger

association. From the SBP Weight data, see the relationship below where as

weight (X) increases, SBP (Y) also increases, with an r = 0.33, indicating a

moderate positive relationship.

[Figure: scatterplot of Systolic Blood Pressure (mmHg) on the Y axis against Weight (Pounds) on the X axis]

• Contingency tables – this technique allows us to look at the association

between two categorical variables. We arrange the responses on the predictor

in the rows and responses on the outcome in the columns, and calculate the

percentage of cases in each row experiencing the outcome of interest. Recall

this contingency table from our discussion of how researchers established the

evidence that smoking is associated with lung cancer.

3. No plausible alternative explanation. This third criterion can be challenging to

establish. It requires that we demonstrate that the apparent association between

the two variables we’re interested in cannot be explained away by a third

variable. The video on How Ice Cream Kills! presents some great examples of

situations in which a third variable explains an apparent association, but this

tends to be much more complicated in public health research when we’re dealing

with real people who live wonderfully complicated lives.

Again, you can see why experimental designs are so powerful – if we just manipulate

one thing, the treatment, it’s much easier to rule out the effect of a third variable. Not so

in observational designs, where we measure lots of variables but struggle to know how

they fit together and in what order.

• Observational designs are good for determining correlations between variables, but

not good at establishing temporal precedence or excluding alternative explanations.

• Experimental designs, when they’re done well, can address all three criteria for

causality.


It is critical to know that:

• Observational designs can establish correlation and, only in very special

circumstances (very rare), establish causality.

• Experimental designs can establish both correlation and causality, when properly

designed.

Correlation does not equal Causation!

Correlation ≠ Causation!

Chapter 5: Wrap up Questions

• What is a population?

• How is a sample different from a population?

• How do you obtain a representative sample?

• What is the difference between an observational and experimental study

design?

• When can you draw a conclusion of correlation from a study?


Chapter 5: Problems

1. If your research question is about students at UAB taking undergraduate

biostatistics courses, and you obtain your samples in the following ways, what

type of samples are they: convenience or simple random sample?

• Asking all students in your class to fill out a survey?

• Asking the teacher to generate a list of students such that each student has

the same chance of being asked to complete the survey?

2. What population would these samples represent? List out the likely

characteristics of students in undergraduate biostatistics courses.

3. Could a single survey tell you about correlation? Or causation?

4. What if you asked students at the start of class and after the class the same set

of questions, could this draw causation? Why or why not?

5. What if you divided up students into two different types of study groups and

followed their course performance by group? Could this draw causation? Why or

why not?


Chapter 6.

Review Chapter 1-5

The module objectives from Chapters 1-5 outline what you should know or do thus far:

• MO 1.1: List several of the ways that biostatistics provide the evidence

necessary for public health research and practice

• MO 1.2: Justify the importance of a particular public health problem on the basis of

an evaluation of the evidence

• MO 2.1: Identify the characteristics of a continuous variable

• MO 2.2: Define measures of central tendency: mean, median, mode

• MO 2.3: Define measures of spread (dispersion): variance, standard deviation,

range, inter-quartile range

• MO 2.4: Calculate measures of central tendency: mean, median, mode

• MO 2.5: Calculate measures of spread (dispersion): variance, standard

deviation, range, inter-quartile range

• MO 2.6: Interpret measures of central tendency: mean, median, mode

• MO 2.7: Interpret measures of spread (dispersion): variance, standard deviation,

range, inter-quartile range

• MO 3.1: Identify the characteristics of a categorical variable

• MO 3.2: Determine variable type based on examination of the categories

(response options)

• MO 3.3: Order variables from lower to higher level of measurement

• MO 3.4: Create variables with mutually exclusive and exhaustive categories

(response options)

• MO 3.5: Define proportion and percentage

• MO 3.6: Calculate proportions and percentages

• MO 3.7: Interpret proportions and percentages

• MO 3.8: Distinguish between categorical and continuous variables

• MO 4.1: Define research questions and hypotheses

• MO 4.2: Define predictor and outcome variables

• MO 4.3: Distinguish between independent (predictor) and dependent (outcome)

variables in a research scenario

• MO 4.4: Define the null and alternative hypotheses

• MO 4.5: Differentiate between the null and alternative hypotheses

• MO 4.6: Determine the appropriate summary statistics for a bivariate analysis

• MO 5.1: Define population and a population sample

• MO 5.2: Identify sampling techniques

• MO 5.3: Differentiate between observational and experimental study designs

• MO 5.4: Define causality

• MO 5.5: Identify the criteria for concluding a causal relationship


Below is a listing of what has been covered broken out into:

• Knowledge: definitions and concepts

• Skills: what you do in JMP and what to answer and report

You may want to make notes of your own below and note which module contains the

materials so that you can find them during the exam (hint, use the listing of Module

Objectives).

Definitions and concepts

• Evidence: Identify how biostatistics provides evidence for public health

• Creating variables: including the question and response options

• Variable types: distinguishing dichotomous, nominal, ordinal, and continuous

variables:

o Mutually exclusive and exhaustive categories of variables

• Creating and interpreting frequency distributions: what must be included in

the interpretation

• Measures of central tendency: calculation and interpretation of mode, median,

and mean; for which variable type each is appropriate

• Measures of dispersion: calculation and interpretation of range and interquartile

range; interpretation of standard deviation

• Describing distributions: know when to use mean & standard deviation and

when to report a five-number summary

• Shape of distributions: modality, symmetry, skewed right and left distributions

• Outliers: recognizing if and when they should be excluded from analysis

(remember, sometimes they are errors)

• Samples, populations: be able to identify from a research description

o Understand the difference between sample and population

• Independent (predictor) and dependent variables: be able to define and

identify from a research description

• Writing research questions based on independent and dependent

variables: make sure they end in a question mark and take the form of:

o How does [Independent variable] influence [Dependent variable]? or

o Are [one category of the IV] or [other category of the IV] more likely to

[DV]?
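The measures of central tendency and dispersion listed above can also be computed outside JMP. The following is a minimal sketch in Python, assuming a small made-up list of ages; the data and variable names are hypothetical and not from the course. Note that software packages use slightly different quartile rules, so the interquartile range here may not match JMP exactly.

```python
import statistics

# Hypothetical data: ages (in years) of nine survey respondents.
# The value 90 is included to show how an outlier pulls the mean upward.
ages = [22, 25, 25, 31, 34, 38, 41, 47, 90]

# Measures of central tendency
mode = statistics.mode(ages)      # most frequent value
median = statistics.median(ages)  # middle value once the data are sorted
mean = statistics.mean(ages)      # arithmetic average

# Measures of dispersion
data_range = (min(ages), max(ages))           # reported as (minimum, maximum)
q1, q2, q3 = statistics.quantiles(ages, n=4)  # quartile cut points
iqr = q3 - q1                                 # width of the middle 50% of values
sd = statistics.stdev(ages)                   # sample standard deviation

print(f"mode={mode}, median={median}, mean={mean:.1f}")
print(f"range={data_range}, IQR={iqr}, SD={sd:.1f}")
```

Because of the one very large value, the mean sits above the median, which is the kind of skewed distribution where a five-number summary is usually the better description.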

Preparing and Examining Data (with JMP)

• Recoding to give variables meaningful levels for categorical variables

• Recoding to combine categories/values

• Using Analyze > Distribution to examine categorical and continuous data

o For continuous: what is the shape of the distribution?

o For categorical: what is the outcome level of interest?

• Using the By command in Distribution to analyze separate groups/samples

• Using Analyze > Fit Y by X to describe an outcome by a predictor

For each Fit Y by X, be able to answer:

• Which variable is Y (dependent / outcome)?

• Which variable is X (independent / predictor)?

• Which null hypothesis would be appropriate, given the variables you have?

• Which alternative hypothesis would be appropriate, given the variables that you have?

• Are you looking for a:

o Contingency Table

o Group Means

o Correlation

• What are the appropriate summary statistics for your outcome?

• What can you conclude about the relationship between these variables, based on the observed summary statistics?
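One way to practice the "Are you looking for" question above is to write the decision down as a rule based on the variable types of X and Y. The sketch below is only an illustration in Python; the function name and the "categorical" / "continuous" labels are assumptions made for this example, not part of JMP.

```python
def suggest_analysis(x_type, y_type):
    """Suggest a bivariate summary from the types of X (predictor) and Y (outcome).

    Both arguments are the strings "categorical" or "continuous"
    (an encoding assumed for this sketch).
    """
    if x_type == "categorical" and y_type == "categorical":
        return "Contingency Table"
    if x_type == "categorical" and y_type == "continuous":
        return "Group Means"
    if x_type == "continuous" and y_type == "continuous":
        return "Correlation"
    # A continuous X with a categorical Y is not one of the three options listed.
    return "Not one of the three options in this list"


# Hypothetical research scenarios
print(suggest_analysis("categorical", "categorical"))  # e.g. smoker (yes/no) by asthma (yes/no)
print(suggest_analysis("categorical", "continuous"))   # e.g. smoker (yes/no) by blood pressure
print(suggest_analysis("continuous", "continuous"))    # e.g. age by blood pressure
```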

Writing about your results:

• Make sure to state your question in words (not variable names)

• Include relevant summary statistics, including the N (number in your sample)

• Answer the question based upon your observed results

• Give the source of your data

Answering questions with data follows the general pattern:

• Begin by writing down what you understand

• Outline what the data says and form clear and succinct questions pertaining to what the data may imply (or what you would like to show)

• Form a scientific question to determine if the results are random

• Compare the data from each side of the question and decide what to believe

• Write down what you found and what it means

Chapter 6: Problems

The worksheet for Chapter 6 is found in Module 6.

Chapter 7.

Midterm Exam

There is no new material in this chapter. You will focus on taking the Midterm Exam. As

you do that, consider the module objectives that have been addressed through Module

5 (reference materials in Chapter 6).

Read each objective and make sure you understand it and how to address it using JMP.

When approaching an analysis, you need to:

• Define the questions that you want to address

• Review the variables and variable types

• Clean the data, or confirm that it is clean and ready for use

• Summarize each variable as appropriate for the type of variable and distribution

• Create a table or descriptive figures for summary, as applicable (i.e., Table 1)

• Answer your questions using statistical analysis appropriate for your variable types

• Make a statistical decision from the analysis output and record the applicable results

• Write summary conclusions in general language, including the statistical results but not focusing on statistical language

Chapter 8.

Midterm Exam Reflection and Revision

For this chapter, you will review key concepts on variable types, sampling strategies,

and how to summarize data and analyze bivariate data. After the review, you will have

the opportunity to resubmit your Midterm Exam.

Review what you missed on the midterm exam, determine which objective the question

is associated with and then return to that chapter and module to help determine the

correct answer.

Chapter 9.

Determining Significance

Module Learning Objectives (Course Learning Objectives): At the end of this module students should be able to:

• MO 4.4 (review): Define the null and alternative hypotheses (CLO 1)

• MO 4.5 (review): Differentiate between the null and alternative hypotheses (CLO 1, 2)

• MO 9.1: Define the 4 steps of hypothesis testing (CLO 1, 2, 3, 5)

• MO 9.2: Calculate the p-value using JMP to answer your statistical question for a Contingency table (Chi-square test), Group Means (t-test), and Correlation (t-test) analysis (CLO 2, 5)

• MO 9.3: Define statistical inference (CLO 1, 3)

• MO 9.4: Estimate the Confidence Interval with JMP (CLO 5)

• MO 9.5: Interpret the Confidence Interval (CLO 3)

Definitions

• Hypothesis

• Null Hypothesis

• Alternative Hypothesis

• Alpha

• P-value

• Statistical Significance

• Confidence Interval

• Upper bound

• Lower bound

• Inference

So far, we’ve looked at univariate and bivariate statistical approaches to describe the distributions of variables in a particular sample. This is a critical first step in any statistical analysis, but our ultimate goal is usually to describe the population from which a particular sample was drawn. Although we have a sample of a population in a dataset, our goal is to describe the whole population.

Using data from a single sample to describe a population – known as statistical inference – requires that we take into account the fact that some samples are more similar to the populations from which they were drawn than others. We have some control over this – as we said previously, randomly selected samples are more likely to be representative of the population, as are larger samples. Much of this, however, is due to random chance – sometimes we draw a representative sample, but sometimes we do not. This is called sampling variability, and luckily for us it is based on known laws of probability. We won’t get into any probability calculations in this course, but a basic understanding of the role of probability or random chance in statistical analysis is essential.
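Sampling variability can be seen directly with a small simulation. The sketch below is a Python illustration, not something done in JMP for this course. It assumes a made-up population in which 30% of people have some trait, draws many simple random samples of two different sizes, and compares how much the sample proportions bounce around. All numbers and names are hypothetical.

```python
import random
import statistics

random.seed(250)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: 10,000 people, 30% of whom have the trait (1 = yes).
population = [1] * 3000 + [0] * 7000


def sample_proportions(sample_size, n_samples=1000):
    """Draw many simple random samples; return the proportion with the trait in each."""
    return [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(n_samples)
    ]


small = sample_proportions(25)   # 1,000 samples of 25 people each
large = sample_proportions(400)  # 1,000 samples of 400 people each

# The standard deviation of the sample proportions measures sampling variability.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"n=25:  proportions from {min(small):.2f} to {max(small):.2f}")
print(f"n=400: proportions from {min(large):.2f} to {max(large):.2f}")
print(f"spread (n=25) = {spread_small:.3f}, spread (n=400) = {spread_large:.3f}")
```

Every sample comes from the same population, yet the sample proportions differ from sample to sample, and they differ less when the samples are larger.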

Hypothesis Testing (Review from Module 4)

Hypothesis testing is the most commonly used – and most misunderstood! – approach for determining whether a relationship exists between two variables in the population (remember, we can directly observe a relationship between variables in the sample using contingency tables, correlation, and group means, but our goal now is to describe the population).

In hypothesis testing, we start by assuming that there is NO relationship between two

variables in the population (H0 or null hypothesis). We then examine evidence from the

sample and consider how likely it is that we would observe a relationship between

variables in the sample IF there was truly NO relationship between the variables in the

population.

For example, in adults aged 30-60:

• We assume that there is no relationship between liking coffee and being scared of clowns (yes, this is a silly example).

• We would collect data from people aged 30-60 (ideally a random sample, not from a clown college or while in a coffee shop).

• We would then use Fit Y by X to determine if there is a relationship between liking coffee and being scared of clowns. There are two options:

1. If the relationship between two variables in the sample is relatively weak, we conclude that it is likely just due to sampling variability – we happened to draw a sample in which there appears to be some relationship, but it’s not very strong and probably doesn’t reflect a true relationship between those variables in the population. We conclude there is no evidence of a relationship and stay with H0.

2. If the relationship between two variables in the sample is strong, we have reason to believe that it is likely NOT due to sampling variability – we have evidence of a true relationship between those variables in the population. We will reject the null hypothesis and conclude there is a relationship.

This makes intuitive sense – if we see a strong relationship between two variables in the sample, we get excited and say, “Wow, there’s really something going on here!”
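The reasoning behind the two options can be sketched with a simulation of the coffee and clowns example. This Python illustration uses a shuffling (permutation) approach rather than the Chi-square test that JMP reports, and every number in it is made up for the example. The idea is to ask: if there were truly NO relationship in the population, how often would sampling variability alone produce a difference at least as large as the one in our sample?

```python
import random

random.seed(9)  # fixed seed so the sketch gives the same result each run

# Hypothetical sample of 200 adults aged 30-60 (made-up numbers).
# 1 = scared of clowns, 0 = not scared of clowns.
coffee_likers = [1] * 42 + [0] * 78  # 120 people, 35% scared
non_likers = [1] * 20 + [0] * 60     # 80 people, 25% scared


def diff_in_proportions(group_a, group_b):
    """Proportion scared in group_a minus proportion scared in group_b."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)


observed = diff_in_proportions(coffee_likers, non_likers)

# If H0 is true, liking coffee is unrelated to fearing clowns, so shuffling the
# outcomes across the two groups mimics sampling variability under H0.
pooled = coffee_likers + non_likers
n_shuffles = 5000
as_extreme = 0
for _ in range(n_shuffles):
    random.shuffle(pooled)
    shuffled_diff = diff_in_proportions(pooled[:120], pooled[120:])
    # The tiny tolerance guards against floating-point rounding in the comparison.
    if abs(shuffled_diff) >= abs(observed) - 1e-12:
        as_extreme += 1

share = as_extreme / n_shuffles
print(f"observed difference in proportions: {observed:.2f}")
print(f"share of shuffles at least this extreme: {share:.3f}")
```

A small share means the observed relationship would rarely arise from sampling variability alone, which corresponds to option 2. A large share means sampling variability can easily explain it, which corresponds to option 1.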

Where we get confused is in the formal application of hypothesis testing – remember, we start by stating that there is NO relationship between two variables and then we look for evidence to support or reject that claim. This is similar to the “innocent until proven guilty” approach we take in the legal system.

But as scientists, it feels strange (backwards even) to start off by saying that we don’t

think there is a relationship between two variabl…
