The project will
be a short exploratory data analysis using a real data set to investigate some research
questions. At a minimum, the project must involve importing and cleaning data and
creating visualizations and/or statistical summaries that help to address the research
questions. The most successful projects will incorporate multiple and varied aspects of
the coding techniques we cover in class. There is no required page length. The document
should be written in R Markdown and turned in as a .pdf or .html file. Note that the R
Markdown file for the final project should be well-organized (i.e., with sections and
headers) and readable (i.e., it should be free from typographical errors and the writing
should flow well). Data from Kaggle and machine learning repositories are not allowed.
PLOS ONE is my first recommendation for places to look for data.
Here are some links to publicly available data resources:
PLOS ONE – open access journal that requires all data be made publicly
available; search for a randomized trial; papers published after 2016 are more
likely to have data files available
data.gov – open data from the US government; search for a randomized trial
Harvard Datahub for Field Experiments – data from randomized experiments in
the social sciences
openICPSR – social, behavioral, and health sciences research data
Journal of Open Psychology Data – openly available data from a variety of papers
Data – Open Access Journal – open access journal on data in science
The National Center for Education Statistics, including ECLS
The Youth Risk Behavior Survey, available from the CDC
The Current Population Survey, available from the U.S. Census Bureau
The Fragile Families & Child Wellbeing Study, available from Princeton
HUDM 5026 – Introduction to Data Analysis and
Graphics in R
01 – Introduction
Why use R?
• It’s free.
• It accommodates all manner of basic statistical analysis and many, many advanced
and new methods.
• There is an active community of programmers, academics, and developers who
continually work on improving R and creating and improving auxiliary software such as
contributed packages and IDEs like RStudio.
• Many new procedures and capabilities come out first in R and are often supported by
publications in peer-reviewed journals, such as the Journal of Statistical Software.
• R runs on all the major operating platforms, including Mac, PC, and Unix.
• R’s ability to produce publication-quality plots and graphics is unparalleled.
Some recent examples of plots I have made in R can be found on my faculty webpage
at https://www.tc.columbia.edu/faculty/bsk2131/. In particular, see PDF versions of
papers entitled Variable Selection for Causal Effect Estimation and Heterogeneous Subgroup
Downloading R and RStudio
Visit the Comprehensive R Archive Network (CRAN) website at https://cran.r-project.org/
to download R for your computer’s operating system. If you have an older version of R
on your machine, now is the time to download the most recent version from CRAN. This is
especially important because some of the packages we will use are not compatible with older
versions of R.
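To see which version of R is currently installed, something like the following can be run in the console. This is a minimal sketch: the threshold "4.0.0" is a hypothetical value chosen for illustration, not a stated course requirement.

```r
# Print the full version string of the running R installation.
R.version.string

# getRversion() returns a version object that can be compared to a string.
# "4.0.0" is a hypothetical minimum used only for illustration.
min_version <- "4.0.0"
is_recent <- getRversion() >= min_version
is_recent
```

If is_recent prints FALSE, the installed copy of R is older than the chosen threshold.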
Although you don’t need to use RStudio to work with R, it makes R easier to work
with. RStudio is a separate download; go to https://www.rstudio.com/ and download
and install the right version for your operating system. You will likely want to put an
RStudio shortcut icon in your menu bar or start menu. Open RStudio and notice the pane
design for integrated viewing of different processes. Go to the “RStudio” drop-down menu
and select “Preferences”. Then select “Panes”. You will see a screen that looks like the figure
below. My preference is to put the source pane in the top left, the console pane in the
bottom left, the environment/history pane in the top right, and the plots/help pane in the
bottom right.
Next, go back to the “Preferences” menu and select the “General” tab. I prefer to uncheck
all boxes except the three shown in the figure below. Furthermore, I recommend setting the
drop-down menu so that RStudio never saves your workspace on exit. The rationale for
setting this to never save is that if you don’t, R will save whatever data is in your workspace
(i.e., objects visible using ls()) in an .RData file so that the next time you open RStudio, it
will all be available in your working environment. While, in principle, this is a nice idea, in
practice, you end up storing far more data and objects than are necessary, and the clutter
causes RStudio to open slowly and become glitchy.
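As a rough illustration of what the workspace contains, the sketch below creates two throwaway objects, lists them with ls(), and removes them with rm(). The object names are hypothetical, and the arrow <- used to create them is the assignment operator.

```r
# Create two throwaway objects (hypothetical names) with the assignment arrow.
scratch_a <- 1
scratch_b <- "hello"

# List the names of the objects currently stored in the workspace.
ls()

# Remove only the two objects named here; nothing else is touched.
rm(scratch_a, scratch_b)

# Check whether either name is still present in the workspace.
still_there <- c("scratch_a", "scratch_b") %in% ls()
still_there
```

Objects removed this way are gone for good, which is harmless when the code that created them is saved in a source file and can simply be run again.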
Instead of storing important information in your workspace, I will encourage you to begin
to think of your source file, which is a text file that contains your saved lines of code, as the
best place to store your important information in R. Most operations you will ask R to do
will take only a fraction of a second to run, so storing your code and then running the code
each time you need it is a good habit to get into. To that end, you will work on writing
efficient code that is understandable, so that the next time you read it you can see clearly
what you were trying to do when you wrote it. Adding comments to code is a big part of
making it understandable.
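A short, commented snippet of the kind that might live in a source file could look like the sketch below. The variable names and the input value are hypothetical; the point is only that the comments state the goal and explain each step, so the code can be rerun and understood later.

```r
# Goal: convert a temperature from Fahrenheit to Celsius.
temp_f <- 98.6                    # hypothetical input value, in Fahrenheit
temp_c <- (temp_f - 32) * 5 / 9   # standard conversion formula
temp_c                            # print the result, in Celsius
```

Because the whole calculation is stored as text, nothing needs to be kept in the workspace; rerunning the lines reproduces the result.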
There are two file type extensions that we will use a lot in this course: “.R” and “.Rdata”.
At some point in the near future, you will want to instruct your computer to open both
those file types with RStudio by default. The way to do this varies based on the operating
system you are using, but typically it can be done by right-clicking on the file and choosing
“open with” and then selecting the option to make RStudio the default.
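For a sense of what an .Rdata file holds, the sketch below saves one object to a file with save() and restores it with load(). The file name and object name are hypothetical, and the file is written to R's temporary directory so that nothing is added to your own folders.

```r
# Build a path inside R's temporary directory (hypothetical file name).
rdata_path <- file.path(tempdir(), "example_objects.Rdata")

exam_score <- 87                      # hypothetical object to store
save(exam_score, file = rdata_path)   # write the object to the .Rdata file

rm(exam_score)                        # remove it from the workspace
load(rdata_path)                      # restore it from the file
exam_score
```

A .R file, by contrast, is plain text containing code rather than stored objects.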
Tip for macOS users
If you use macOS, download XQuartz from https://www.xquartz.org/.
The Four RStudio Panes
The console pane is where you may interact directly with the R command line. If you type
code in the console and press enter (or return), your code will run, and, if called for, R will
produce output, also in the console. Let’s try it. Do some basic math in the console, like 5
+ 5, and press enter. You should see the answer 10 printed as output. Note that you must
use the asterisk “*” for multiplication and the forward slash “/” for division.
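A few lines that could be typed into the console one at a time are sketched below. The particular numbers are arbitrary choices for illustration.

```r
5 + 5     # addition; prints 10
12 - 7    # subtraction; prints 5
6 * 7     # multiplication uses the asterisk; prints 42
20 / 4    # division uses the forward slash; prints 5
```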
The history of code run in your console will be recorded in the history pane. Go to the
top right pane and click on the history tab. You should see all the code that you just ran in
the console. Click on a line of code in the history pane that you want to run again. Then,
with your cursor on the line (you don’t have to highlight the whole line) find and click the
“To Console” button. You should notice that the line now appears in your console.
In general, as I mentioned above, you will work in the source pane rather than the
console because the source pane allows you to save your code as text and adds intelligent
color coding and tabbing to help make code easier to read and debug. Go to the “File” menu
and select “New File” and “R Script”. A new text file should open up in your source
pane. Save the file. You may send code from your history to your source text file by clicking
on “To Source”. Try that as well. You should now see the line in your source pane. To run
a line of code in your source file, put your cursor anywhere on that line and, on macOS,
push command and return at the same time. On Windows, push control and enter at the
same time. The line will run and the cursor will move to the next line. This is a convenient
way to move through a document. If you wish to run a specific part of a line, or more than
one line at a time, simply select the code you wish to run and then push command and
return or control and enter. There is also a “Run” button at the top of the RStudio source
pane in case you prefer to point and click to run.
The CRAN Website and the R Community
In addition to downloading R, you may also visit the Comprehensive R Archive Network
(CRAN) to access various help manuals at https://cran.r-project.org/manuals.html.
If you have a question about something R-related, a Google search is typically a great first
step toward finding an answer. Check out the R questions on Stack Overflow at
https://stackoverflow.com/. Chances are, if you have an R-related question, someone
else has already asked about it on one of the Stack Exchange sites. The Quick-R website
is another resource. It is available at http://www.statmethods.net/.
Working with Code in the Source Pane
Create a new R syntax file for today by going to File → New File → R Script, then
save the file and call it “01_Intro.R”. Then locate the file where you saved it, right-click
on the file, and either get properties or get info and select “open with”. Find the option to
change the default so that all .R files open with RStudio by default. Then, double-click on
“01_Intro.R” and verify that it opens in RStudio. If not, try again.
Once you have written code in the syntax file, there are a few ways to run it.
• Select the code you wish to run and then click the “Run” button. This option will run
all the selected code.
• Select the code you wish to run and simultaneously push command and return on
macOS or control and enter on Windows. This option will run all the selected code.
• Put your cursor anywhere on a line you wish to run and simultaneously push
command and return on macOS or control and enter on Windows. This option will
only run one line at a time.
Using R as a calculator
• The hashtag “#” is the comment character in R. Anything on a line following a hashtag
will be ignored.
• As we have seen, R will do arithmetic operations using the usual symbols.
• Other math functions include sqrt for square root, exp for the exponential function,
log for log base e, and the trigonometric functions sin, cos, and tan.
• R will follow order of operations. For example, running 5 + 3 * 4 will return 17, not
32.
• Parentheses may be used as well. For example, running (5 + 3) * 4 will return 32.
• Scientific notation can be specified with the letter e, which is interpreted as “times ten
to the power of” when written in a numerical expression. For example, 2e2 is 200.
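The points in the list above can be tried out with a few lines like the following. This is a small sketch; apart from the three examples taken from the list, the input values are arbitrary choices for illustration.

```r
# Everything after the # symbol on a line is a comment and is ignored by R.
sqrt(16)        # square root; prints 4
exp(0)          # exponential function; prints 1
log(exp(1))     # log base e; prints 1
sin(0)          # sine of zero; prints 0

5 + 3 * 4       # multiplication happens before addition; prints 17
(5 + 3) * 4     # parentheses force the addition first; prints 32

2e2             # scientific notation: 2 times ten to the power of 2; prints 200
```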
Activity 1 Use R as a calculator by writing and saving code to your syntax file and then
running it. Experiment with comments, order of operations, and scientific notation.
Experiment with all three methods given above for running the code in your syntax file.
• Before we get into assigning values, look at your environment tab in the
environment/history pane. It should be empty at this point, meaning that no variables have
been assigned, and nothing is stored in your R workspace. To check this with code, you
may enter ls(), which will list all the names of objects stored in your environment.
• The help page for the assign function notes that the first two arguments passed to the
function are called x and value, where x needs to be a character string representing
the name of the variable you want to create, and value is the value you want the
variable to have. Let’s call the variable var1, and let’s assign it a value of 5133.
• The value of var1 may be overwritten. Suppose we want to update var1 to be its old
value less 5000. See the source file for code to do just that.
• There is a shortcut for the assign function that involves using the less than symbol
and the hyphen to construct an assignment arrow, <-.
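The assignment steps described in the list above might be written as follows. This is one possible sketch and not necessarily the code in the course source file; var1 and the values 5133 and 5000 come from the text, while var2 is a hypothetical name added to show the arrow form.

```r
# The first two arguments of assign() are named x and value.
names(formals(assign))[1:2]

# Create var1 with the value 5133; the name is passed as a character string.
assign(x = "var1", value = 5133)
var1

# Overwrite var1 with its old value less 5000.
assign(x = "var1", value = var1 - 5000)
var1   # prints 133

# The same two steps using the assignment arrow (var2 is a hypothetical name).
var2 <- 5133
var2 <- var2 - 5000
var2   # prints 133
```

After running these lines, both var1 and var2 should appear in the environment tab.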