R is free, open-source software for statistical computing and graphics. Identical copies of R are mirrored on a collection of sites around the world called the Comprehensive R Archive Network (CRAN). Select the mirror nearest to you (e.g. Berkeley) to download R.
RStudio Desktop is an integrated development environment (IDE) that makes it easier to work with R (e.g. visualizing data, plotting data, saving history, and organizing your workspace).
1 # vector of length 1. R has no "scalar" type
c(1,3,4) # vector of 3 items
c(30,150, 120, 0) # vector of 4 items
100:200 # vector of 100, 101, 102, ..., 199, 200
seq(100,200) # same as above. seq = sequence
200:100 # 200 to 100, by -1.
rep(c(1,2,3), 10 ) # repeat c(1,2,3) ten times
c(1,2,3, c(4,5)) # same as c(1,2,3,4,5). c() inside another c() flattens
There are several “primitive types” in R, including “numeric” (double-precision floating point, which is the default and most common numerical type), “integer” (which you will not often make yourself), “logical”, “character” (strings), and “factor” (which is not exactly like the others… more on this later).
c(1,1,2,3,5,8) # numeric (floating point) is the default type
c(1L, 10L, 20L) # suffixing whole numbers with L creates an integer vector.
# usually not something you do manually
c(TRUE, FALSE, FALSE, F, T) # logical, used primarily for subsetting (see below)
# T, F are shorthand for TRUE, FALSE
c("hello", "world") # string/character vector... not used very often
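If you are unsure which type R has given a vector, one option is to ask with class(). The short sketch below applies it to one vector of each type mentioned above; the variable names are made up for illustration.

```r
# One example vector per type (names are illustrative only)
num <- c(1, 1, 2, 3, 5, 8)
int <- c(1L, 10L, 20L)
lgl <- c(TRUE, FALSE, T)
chr <- c("hello", "world")
fct <- factor(c("low", "high", "low"))

class(num)  # "numeric"
class(int)  # "integer"
class(lgl)  # "logical"
class(chr)  # "character"
class(fct)  # "factor"
```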
There are three different syntaxes for assigning to variables: <-, ->, and =. Variable names can contain alphanumeric characters, '.', and '_'.
Yes, the dot/period character can be part of a variable name.
So height.of.women and number.of.drinks.tim.has.had.today are valid variable names.
x <- c(123)
y <- c(10, 11, 3:5, 999)
10:100 -> z # note the direction of the assignment
this.is.a.long.var.name <- c(x,y,z) # combine vectors above
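As a small sketch of the assignment forms side by side (the variable names here are invented for the example), all three of the following create ordinary variables:

```r
a <- c(1, 2, 3)          # leftward arrow, the most common form
c(4, 5, 6) -> b          # rightward arrow, value on the left
total.items = c(a, b)    # equals sign also assigns at the top level

total.items              # 1 2 3 4 5 6
length(total.items)      # 6
```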
R indices start from 1, not 0!
x[1] # square bracket retrieves that item from vector
x[c(1,2,3)] # first three items.
x[1:10] # first ten items
x[21:30] # items 21 to 30
x[-1] # return all items BUT item 1
x[-1:-10] # return all items BUT first 10
x[-length(x)] # return all items BUT last item
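To make the indexing rules concrete, here is a minimal worked example on a small vector (v is a made-up name) where the results are easy to check by eye:

```r
v <- c(10, 20, 30, 40, 50)

v[1]           # 10, the first item
v[c(1, 3)]     # 10 30
v[2:4]         # 20 30 40
v[-1]          # 20 30 40 50
v[-length(v)]  # 10 20 30 40
v[6]           # NA: asking past the end gives NA rather than an error
```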
c(TRUE, FALSE, T, F) # a four item logical vector. TRUE and T, FALSE and F are the same thing
c(T,T,F) & c(F,T,T) # returns c(F,T,F), the item-wise logical-and
c(T,T,F) | c(F,T,T) # returns c(T,T,T), the item-wise logical-or
x <- c(1:5, 5:1) # create a sample vector c(1,2,3,4,5,5,4,3,2,1)
x > 3 # returns c(F, F, F, T, T, T, T, F, F, F)
# with a T where the condition is true and an F where it isn't.
# remember T is same as TRUE, F is FALSE
x >= 2 & x <= 4 # returns c(F,T,T,T,F,F,T,T,T,F)
x[x >= 2 & x <= 4] # returns every item which has a T, so: c( 2,3,4,4,3,2)
x[x > 3] # c(4,5,5,4).
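Logical vectors are useful beyond subsetting. As a brief sketch using the same sample vector (big is a made-up name), base R functions such as sum(), which(), any(), and all() can summarize a condition:

```r
x <- c(1:5, 5:1)   # 1 2 3 4 5 5 4 3 2 1
big <- x > 3       # logical vector, TRUE where the item exceeds 3

sum(big)      # 4, because TRUE counts as 1 and FALSE as 0
which(big)    # 4 5 6 7, the positions where the condition holds
any(x > 5)    # FALSE, no item is greater than 5
all(x >= 1)   # TRUE, every item is at least 1
x[!big]       # 1 2 3 3 2 1, the items that fail the condition
```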
length(x) # length of x
rev(x) # reverse
mean(x) # mean
sd(x) # standard deviation
var(x) # variance
max(x) # max
min(x) # minimum
diff(x) # returns c(x[2]-x[1], x[3]-x[2], ...), one item shorter than the original
diff(x)/x[-length(x)] # fractional change from each item to the next
cumsum(x) # cumulative sum. each item is the sum of itself and all previous items.
sin(x); cos(x); exp(x); log(x) # math. each returns a vector of the same length as x, evaluated at each item.
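A tiny worked example may help with diff() and cumsum(). The vector below is invented so that the arithmetic can be checked by hand:

```r
price <- c(2, 4, 8)

diff(price)                          # 2 4   (4-2, 8-4)
diff(price) / price[-length(price)]  # 1 1   (2/2, 4/4: each step doubles the value)
cumsum(price)                        # 2 6 14  (2, 2+4, 2+4+8)
rev(price)                           # 8 4 2
```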
R allows you to be relatively flexible when calling functions.
All parameters of functions are named.
For example, the first two parameters to seq are called from and to.
You can call the function with purely positional arguments, by name, or mixed.
The following four calls are equivalent:
seq(1, 10) # positional arguments/parameters. R assumes, by the position
# of parameters, that 1 and 10 are values for 'from' and 'to'
seq(1, to=10) # first is positional and assumed to be a value for "from".
seq(to=10, from=1) # named arguments can be in any order
seq(to=10, 1) # named arguments can even come before positional arguments
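One way to convince yourself that these calls really are equivalent is to compare their results with identical(). The sketch below also passes seq's third parameter, by, by name; the variable names are illustrative.

```r
positional <- seq(1, 10)
mixed      <- seq(1, to = 10)
named      <- seq(to = 10, from = 1)
named.1st  <- seq(to = 10, 1)

identical(positional, mixed)      # TRUE
identical(positional, named)      # TRUE
identical(positional, named.1st)  # TRUE

seq(1, 10, by = 3)                # 1 4 7 10, 'by' sets the step size
```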
? can be used to look up a specific function/concept in R.
?? can be used to search, useful if you don't know the exact name.
?seq # display various help pages
?plot
?rep
??plot # search
Real data has missing values. In any sort of modelling and analysis, these missing values must either be
systematically ignored, factored into the model, or otherwise dealt with.
In R, a missing value is encoded as NA. It is not a string; it is a special value that can be a part of any type of vector.
NA # literal NA
x <- c(1,2,3, NA, 4,5,6) # vector with a missing value/gap/NA in the middle
is.na(x) # c(F,F,F,T,F,F,F), with a "T" where it is NA
! is.na(x) # c(T,T,T,F,T,T,T) logical inverse of above.
mean(x) # notice, mean returns NA
mean(x, na.rm=T) # with this syntax, we can ignore the NA when calculating the mean
# this will protect you!
sd(x, na.rm=T) # ditto for sd().
x <- x[!is.na(x)] # returns x with the NA's removed c(1,2,3,4,5,6).
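It may help to see how NA propagates through ordinary operations, which is why is.na() is needed rather than a comparison with ==. A minimal sketch, using the same example vector:

```r
x <- c(1, 2, 3, NA, 4, 5, 6)

NA + 1                 # NA, arithmetic with a missing value is missing
NA == NA               # NA, so x == NA cannot be used to find missing values
sum(is.na(x))          # 1, the number of missing values in x
mean(x, na.rm = TRUE)  # 3.5, the mean of the six non-missing values
```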
Along with vectors, lists are another fundamental data structure in R.
The primary difference between vectors and lists is that lists can hold different primitive types.
l <- list(1, "c", TRUE) # the items are stored as is
v <- c(1, "c", TRUE) # the items are "coerced" to c("1", "c", "TRUE") because
# vectors must be the same primitive type in R, and R does
# its best by converting to a more generic primitive type
unlist(l) # converts list to a vector with generic primitive type
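Lists can be accessed just like vectors by passing an integer (or a vector of integers). Note that single brackets return a sub-list; double brackets, e.g. l[[1]], return the item itself.
l[1] # list containing 1
l[2] # list containing "c"
l[3] # list containing TRUE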
v[1] # "1"
v[2] # "c"
v[3] # "TRUE"
# Notice the items in the vector are all characters,
# and the items in the list are numeric, character, and logical.
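The difference between single and double brackets on a list is easy to miss. As a short sketch on the same list l: single brackets keep the list wrapper, while double brackets pull out the item itself.

```r
l <- list(1, "c", TRUE)

class(l[1])     # "list", single brackets return a (shorter) list
class(l[[1]])   # "numeric", double brackets return the item itself
l[[1]] + 1      # 2, arithmetic works on the extracted item
length(l[2:3])  # 2, a sub-list holding the second and third items
```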
Items of lists can have names:
prof <- list(height = 175,
             name = "Coye",
             beard = "Epic",
             favorite.number = 3.14,
             employed = TRUE
)
When items are given names, they can be accessed via a special syntax using the $ operator:
prof$name # "Coye"
prof$height # 175
prof$beard # "Epic"
They can still be accessed by index, where the index is the item's position in the list:
prof[4] # 3.14, because favorite.number was the fourth argument
prof[5] # TRUE, because 'employed' was the fifth item
Because lists can store different data types, lists can contain lists. This means that
lists can be used to store arbitrarily complex data.
This can be useful when you're converting data from XML or other hierarchical data formats.
prof2 <- list(height = 175,
              name = "Coye",
              beard = "Epic",
              favorite.number = 3.14,
              employed = TRUE,
              courses.taught = list(
                list(
                  name = "quantitative methods",
                  course.id = "i271b"
                ),
                list(
                  name = "underwater basket weaving",
                  course.id = "i203"
                )
              )
)
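To reach into a nested list, the $ and [[ ]] operators can be chained. The sketch below rebuilds a smaller version of the structure above (prof3 and course.list are made-up names) so that it runs on its own:

```r
course.list <- list(
  list(name = "quantitative methods", course.id = "i271b"),
  list(name = "underwater basket weaving", course.id = "i203")
)
prof3 <- list(name = "Coye", courses.taught = course.list)

prof3$courses.taught[[2]]$course.id  # "i203", second course, then its id
length(prof3$courses.taught)         # 2 courses

# sapply applies a function to each item and collects the results in a vector
course.names <- sapply(prof3$courses.taught, function(course) course$name)
course.names  # "quantitative methods" "underwater basket weaving"
```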
Therefore, lists can also be thought of as a dictionary, hash table, associative array, map, or whatever you want to call a data structure which associates a key (string) to some value. However, it is not as efficient since it does not actually hash keys.
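As a rough sketch of this dictionary-style usage (the ages list and its keys are invented for the example), keys can be added, looked up, tested for, and removed with base R:

```r
ages <- list(alice = 31, bob = 27)

ages[["carol"]] <- 45     # add a new key
names(ages)               # "alice" "bob" "carol"
"bob" %in% names(ages)    # TRUE, check whether a key exists

ages[["bob"]] <- NULL     # assigning NULL removes the key
names(ages)               # "alice" "carol"
ages$carol                # 45
```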
A data frame is, at its core, a list where every item is a vector, with all vectors having the same length. Think of it as a spreadsheet, or a (denormalized) table in a relational database. It is the primary data structure that most interesting R functions expect. Therefore, putting everything you need into a data frame is the goal of your data pre-processing step.
We can manually create a data frame with the data.frame function:
data.frame(x = c("a", "B", "X"), y = c(3.14, 69, 420), z = c("pizza", "falafel", "curry"))
# which creates a data frame with 3 rows, 3 columns named x, y, and z:
# x y z
# 1 a 3.14 pizza
# 2 B 69.00 falafel
# 3 X 420.00 curry
Another one:
scotch <- data.frame(
name = c("Dalmore", "Lagavulin", "Dalwhinnie", "Ardbeg"),
age = c(12, 16, 15, 10),
region = c("Highland", "Islay", "Speyside", "Islay"), # more on factors later
abv = c(40, 56.5, 43, 46)
)
# name age region abv
# 1 Dalmore 12 Highland 40.0
# 2 Lagavulin 16 Islay 56.5
# 3 Dalwhinnie 15 Speyside 43.0
# 4 Ardbeg 10 Islay 46.0
View(scotch) # view a formatted data frame
This is an error though, since the number of items in each column is not the same:
data.frame(x = c("a", "B", "X"), y = c(3.14, 69, 420), z = c("pizza", "falafel", "curry", "burrito"))
Since data frames are lists, you can use the $ notation to get at individual column vectors:
scotch$age # 12 16 15 10
scotch$abv # 40.0 56.5 43.0 46.0
You can also get specific items within a data frame using either the vector notation:
scotch$abv[2] # 56.5. nothing new here. scotch$abv is a vector, and [2] gets the second item
OR we can use new notation which allows you to specify both the row and column at the same time.
The part before the , is the row, after is the column:
scotch[2, "abv"] # 56.5. row 2, column "abv"
scotch[1, "region"] # "Highland". row 1, column "region"
Now, here's the kicker: we can use all of the vector subsetting/indexing techniques we learned earlier with this notation. For example:
scotch[c(1,2,3), "abv"] # 40.0 56.5 43.0. abv for row 1,2,3
scotch[c(1,2), c("abv","region")] # abv and region for row 1,2
We can even slice/subset:
scotch$region == "Islay" # FALSE TRUE FALSE TRUE, because the 2nd and 4th are the Islay malts
scotch[scotch$region == "Islay", "abv"]
# returns:
# 56.5 46.0
# we are selecting the abv column for rows where region == "Islay".
# yep, Islay scotch are stronger.
Another example:
mean(scotch$age) # mean age is 13.25. nothing special yet
scotch$age < mean(scotch$age) # TRUE FALSE FALSE TRUE, because Dalmore and Ardbeg are the young'uns
scotch[scotch$age < mean(scotch$age), c("name", "age", "region", "abv")]
# returns:
# name age region abv
# 1 Dalmore 12 Highland 40
# 4 Ardbeg 10 Islay 46
# because we selected all four columns where the rows met the condition that the age was less than average
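Conditions can also be combined with & and |, just as with plain vectors. The sketch below rebuilds the scotch data frame so that it is self-contained; stringsAsFactors = FALSE is an addition here, used so the text columns stay character vectors regardless of R version.

```r
scotch <- data.frame(
  name   = c("Dalmore", "Lagavulin", "Dalwhinnie", "Ardbeg"),
  age    = c(12, 16, 15, 10),
  region = c("Highland", "Islay", "Speyside", "Islay"),
  abv    = c(40, 56.5, 43, 46),
  stringsAsFactors = FALSE
)

# rows that are from Islay AND stronger than 50% abv; keep only the name column
strong.islay <- scotch[scotch$region == "Islay" & scotch$abv > 50, "name"]
strong.islay   # "Lagavulin"

# rows aged 15 years or more; keep two columns
older <- scotch[scotch$age >= 15, c("name", "age")]
nrow(older)    # 2 (Lagavulin and Dalwhinnie)
```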
R actually comes with many example data frames built-in.
You can take a look at a list of all available sample data sets with ls("package:datasets").
Some fun and self-explanatory examples include USArrests, mtcars, trees, and iris.
In particular, iris is a famous data set popularized by R. A. Fisher, the chain-smoking father of statistics, and contains various measurements of flower parts. As it is a good example of quantitative data, it is used in a few examples below.
str, head, tail, and summary will likely be the first tools you reach for when dealing with a fresh data frame.
str(iris) # str for 'str'ucture: dissects the data structure, very useful for debugging:
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris) # first few
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
tail(iris) # last few
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 145 6.7 3.3 5.7 2.5 virginica
# 146 6.7 3.0 5.2 2.3 virginica
# 147 6.3 2.5 5.0 1.9 virginica
# 148 6.5 3.0 5.2 2.0 virginica
# 149 6.2 3.4 5.4 2.3 virginica
# 150 5.9 3.0 5.1 1.8 virginica
summary(iris) # summary stats
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
# 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
# Median :5.800 Median :3.000 Median :4.350 Median :1.300
# Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
# 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
# Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
# Species
# setosa :50
# versicolor:50
# virginica :50
nrow(iris) # number of observations in data set
ncol(iris) # number of variables in data set
plot(iris) # try it!
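The inspection functions combine naturally with the subsetting notation from earlier. As a brief sketch on the built-in iris data (setosa.sepal is a made-up name), we can check the dimensions and then summarize one species:

```r
dim(iris)            # 150 5, rows then columns
names(iris)          # the five column names
table(iris$Species)  # counts per species: 50 each

# sepal lengths for the setosa rows only
setosa.sepal <- iris[iris$Species == "setosa", "Sepal.Length"]
length(setosa.sepal)  # 50
mean(setosa.sepal)    # 5.006
```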
read.table is the primary function you'll be using to read various delimiter-separated text files. If you look up its manual page with ?read.table, you'll see there are many different options. Some of the most used are:
header, which tells the function to interpret the first line as a list of column names. FALSE by default.
sep, for separator. By default, the function will split on any whitespace. You'll most likely want to set an explicit separator, such as sep="\t" for tab or sep="," for comma.
na.strings, for specifying what strings should be interpreted as a missing value. So, if your data file happens to use . and -999 as missing values, you can pass na.strings=c("-999", "."). c("NA") by default.
# movies
movies <- read.table(
"http://courses.ischool.berkeley.edu/i271b/f15/data/movies-subset.txt",
header=T,
sep=";"
)
# Note, each observation is numbered and R was able to determine that the numbers
# don't correspond to a variable.
# google daily
goog <- read.table(
"http://courses.ischool.berkeley.edu/i271b/f15/data/goog-daily.txt",
header=T,
sep=","
)
# complicated example
berkeley.temperature <- read.table(
"http://courses.ischool.berkeley.edu/i271b/f15/data/37.78N-122.03W-TAVG-Trend.txt",
blank.lines.skip=T, # ignores blank lines
comment.char="%", # removes lines commented out with a %
na.strings="NaN" # changes 'NaN' in .txt file to NA in data frame
)
# Note, the columns are not named something meaningful, because R cannot decipher the hierarchical
# column names. We can use the following to assign a new name to a column:
colnames(berkeley.temperature)[1] <- 'year'
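To experiment with these options without downloading anything, read.table also accepts its input as a string through the text argument. The sketch below uses a small made-up data set (raw.text and its columns are invented) in which . and -999 stand for missing values:

```r
# A tiny comma-separated "file" held in a string; the values are made up
raw.text <- "id,temp,rain
1,20.5,.
2,-999,3.2
3,18.0,0.0"

weather <- read.table(
  text = raw.text,
  header = TRUE,               # first line holds the column names
  sep = ",",                   # fields are separated by commas
  na.strings = c("-999", ".")  # both markers become NA
)

is.na(weather$temp)              # FALSE TRUE FALSE
is.na(weather$rain)              # TRUE FALSE FALSE
mean(weather$temp, na.rm = TRUE) # 19.25
```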
Packages are used in R to provide additional functionality beyond the base functionality of R. Anyone can write an R package, and many have been written for advanced statistics and graphing. ggplot2 is a popular package for graphing within R. We will be using it in this course.
install.packages("ggplot2") # installs the ggplot2 package in the R environment
library(ggplot2) # loads the package
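Since install.packages downloads the package every time it is run, you may prefer to install only when the package is missing. One possible sketch uses requireNamespace to check first; install.if.missing is a hypothetical helper written for this example, not something that ships with R.

```r
# Hypothetical helper: install a package only if it is not already available
install.if.missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so this call finds it and installs nothing
stats.available <- install.if.missing("stats")
stats.available  # TRUE
```

Calling install.if.missing("ggplot2") followed by library(ggplot2) would then behave like the two lines above, minus the repeated download.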
See Google's complete style guide for more information, but some common conventions to be aware of:
# Assignment
a <- 5 # good (Tip: use alt-minus sign to automatically create spaces and assignment symbol)
a<-5 # bad
a = 5 # bad
# Naming variables
new.variable # preferred
newVariable # accepted
new_variable # bad
# Spacing
library(ggplot2) # good
library( ggplot2 ) # bad
# Comments
#this is a bad comment
# This is a good comment.