i271b R Workshop Notes - Fundamentals

Installing R and RStudio

R is free, open-source software for statistical computation and graphics. An identical version of R is replicated on a collection of sites around the world on what's called the Comprehensive R Archive Network (CRAN). Select the site nearest to you (e.g. Berkeley) to download R.

RStudio desktop is an integrated development environment (IDE) which makes it easier to work with R (e.g. visualizing data, plotting data, saving history, workspace organization, etc.).

Creating vectors from scratch

1                  # vector of length 1.  R has no "scalar" type
c(1,3,4)           # vector of 3 items
c(30,150, 120, 0)  # vector of 4 items
100:200            # vector of 100, 101, 102, ..., 199, 200
seq(100,200)       # same as above. seq = sequence
200:100            # 200 to 100, by -1.
rep(c(1,2,3), 10 ) # repeat c(1,2,3) ten times
c(1,2,3, c(4,5))   # same as c(1,2,3,4,5). c() inside another c() flattens
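
seq and rep accept a few more arguments than the calls above show. As a small sketch (the variable names are made up for illustration), by and length.out control the spacing of a sequence, while each and times control how rep repeats:

```r
evens   <- seq(0, 10, by = 2)          # 0 2 4 6 8 10
quarter <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
each3   <- rep(c(1, 2, 3), each = 2)   # 1 1 2 2 3 3
times3  <- rep(c(1, 2, 3), times = 2)  # 1 2 3 1 2 3
length(100:200)                        # 101, both endpoints are included
```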

Different Types of Vectors

There are several “primitive types” in R, including “numeric” (double-precision floating point, which is the default and most common numerical type), “integer” (which you will not often make yourself), “logical”, “character” (strings), and “factor” (which is not exactly like the others… more on this later).

c(1,1,2,3,5,8)                 # numeric (floating point) is the default type
c(1L, 10L, 20L)                # suffixing whole numbers with L creates an integer vector.
                               # usually not something you do manually
c(TRUE, FALSE, FALSE, F, T)    # logical, used primarily for subsetting (see below)
                               # T, F are shorthand for TRUE, FALSE
c("hello", "world")            # string/character vector... not used very often
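
If you are ever unsure what type a vector has, class() will tell you. The sketch below also shows what happens when types are mixed inside c(); the variable name mixed is just for illustration:

```r
class(c(1, 1, 2, 3))        # "numeric"
class(c(1L, 10L, 20L))      # "integer"
class(c(TRUE, FALSE))       # "logical"
class(c("hello", "world"))  # "character"

# mixing types coerces everything to the most general type present
mixed <- c(1, TRUE, "a")
mixed                       # "1" "TRUE" "a"
class(mixed)                # "character"
as.numeric("3.14")          # 3.14, explicit conversion back to numeric
```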

Variable Assignment

There are three different syntaxes for assigning to variables: <-, ->, and =. Variable names can contain letters, digits, '.', and '_' characters, but cannot start with a digit or '_'.

Yes, the dot/period character can be part of a variable name.
So height.of.women, number.of.drinks.tim.has.had.today are valid variable names.

x <- c(123)
y <- c(10, 11, 3:5, 999)
10:100 -> z                         # note the direction of the assignment
this.is.a.long.var.name <- c(x,y,z) # combine vectors above
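
As a quick check that the three assignment forms really do the same thing, here is a minimal sketch (a, b, and d are throwaway names):

```r
a <- c(1, 2, 3)   # leftward arrow, the most common form
c(1, 2, 3) -> b   # rightward arrow
d = c(1, 2, 3)    # equals sign
identical(a, b)   # TRUE
identical(a, d)   # TRUE

y <- c(10, 11, 3:5, 999)
y                 # 10 11 3 4 5 999, the 3:5 is flattened into the vector
length(y)         # 6
```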

Accessing vectors by index

R indices start from 1, not 0!

x[1]          # square bracket retrieves that item from vector
x[c(1,2,3)]   # first three items.
x[1:10]       # first ten items
x[21:30]      # items 21 to 30
x[-1]         # return all items BUT item 1
x[-1:-10]     # return all items BUT first 10
x[-length(x)] # return all items BUT last item
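
To see actual values come back, it helps to index a vector that is long enough. A small sketch with a made-up ten-item vector v:

```r
v <- seq(10, 100, by = 10)  # 10 20 30 ... 100, an example vector of length 10
v[1]            # 10
v[c(1, 2, 3)]   # 10 20 30
v[8:10]         # 80 90 100
v[-1]           # 20 30 ... 100
v[-length(v)]   # 10 20 ... 90
v[-(1:3)]       # 40 50 ... 100, same as v[-1:-3]
v[11]           # NA, indexing past the end gives NA rather than an error
```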

Subsetting by logical vectors

c(TRUE, FALSE, T, F)  # a four item  logical vector. TRUE and T, FALSE and F are the same thing
c(T,T,F) & c(F,T,T)   # returns c(F,T,F), the item-wise logical-and
c(T,T,F) | c(F,T,T)   # returns c(T,T,T), the item-wise logical-or

x <- c(1:5, 5:1)      # create a sample vector c(1,2,3,4,5,5,4,3,2,1)
x > 3                 # returns c(F, F, F, T, T, T, T, F, F, F)
                      # with a T where condition is true and F where it isn't.
                      # remember T is same as TRUE, F is FALSE
x >= 2 & x <= 4       # returns c(F,T,T,T,F,F,T,T,T,F)
x[x >= 2 & x <= 4]    # returns every item which has a T, so: c( 2,3,4,4,3,2)
x[x > 3]              # c(4,5,5,4).
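
Logical vectors are useful for more than subsetting. As a sketch (big is a made-up name), TRUE counts as 1 and FALSE as 0 in arithmetic, so sum() and mean() give counts and proportions:

```r
x <- c(1:5, 5:1)
big <- x > 3   # logical vector
sum(big)       # 4, the number of items greater than 3
mean(big)      # 0.4, the proportion of items greater than 3
which(big)     # 4 5 6 7, positions of the TRUE items
any(x > 5)     # FALSE
all(x >= 1)    # TRUE
x[!big]        # 1 2 3 3 2 1, the items that are NOT greater than 3
```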

Interesting functions

length(x)              # length of x
rev(x)                 # reverse
mean(x)                # mean
sd(x)                  # standard deviation
var(x)                 # variance
max(x)                 # max
min(x)                 # minimum
diff(x)                # returns c(x[2]-x[1], x[3]-x[2], ...), sized one less than the original
diff(x)/x[-length(x)]  # relative change from each item to the next (multiply by 100 for percent)
cumsum(x)              # cumulative sum. each item is the sum of all items up to and including it.
sin(x); cos(x); exp(x); log(x) # math. each returns a vector of same length as x, evaluated at each item.
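
A worked example makes diff and cumsum easier to follow. The prices in p below are invented purely for illustration:

```r
p <- c(100, 110, 99, 120)   # hypothetical daily prices
diff(p)                     # 10 -11 21
diff(p) / p[-length(p)]     # 0.1 -0.1 0.2121..., change relative to the previous item
cumsum(p)                   # 100 210 309 429
rev(p)                      # 120 99 110 100
range(p)                    # 99 120, the min and max together
```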

Calling Functions

R allows you to be relatively flexible when calling functions. All parameters of functions are named. For example, the first two parameters to seq are called from and to. You can call the function with purely positional arguments, by name, or mixed. The following four calls are equivalent:

seq(1, 10)         # positional arguments/parameters. R assumes, by the position
                   # of parameters, that 1 and 10 are values for 'from' and 'to'
seq(1, to=10)      # first is positional and assumed to be a value for "from".
seq(to=10, from=1) # named parameters can be in any order
seq(to=10, 1)      # named parameters can even come before positional ones
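
You can confirm the equivalence yourself with identical(). A minimal sketch, where a, b, d, and e are throwaway names:

```r
a <- seq(1, 10)
b <- seq(1, to = 10)
d <- seq(to = 10, from = 1)
e <- seq(to = 10, 1)
identical(a, b) && identical(a, d) && identical(a, e)  # TRUE

# naming arguments pays off when a function has several of them
seq(from = 0, to = 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00
args(sd)                           # shows the parameter names of a function
```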

Looking for help

? can be used to look up a specific function/concept in R. ?? can be used to search, useful if you don't know the exact name.

?seq   # display various help pages
?plot
?rep
??plot # search

Missing Values/NA's

Real data has missing values. In any sort of modelling and analysis, these missing values must either be systematically ignored, factored into the model, or otherwise dealt with.
In R, a missing value is encoded as NA. It is not a string; it is a special value that can be part of any type of vector.

NA                        # literal NA
x <- c(1,2,3, NA, 4,5,6)  # vector with a missing value/gap/NA in the middle
is.na(x)                  # c(F,F,F,T,F,F,F), with a "T" where it is NA
! is.na(x)                # c(T,T,T,F,T,T,T)  logical inverse of above.
mean(x)                   # notice, mean returns NA
mean(x, na.rm=T)          # with this syntax, we can ignore the NA when calculating the mean
                          # this will protect you!
sd(x, na.rm=T)            # ditto for sd().
x <- x[!is.na(x)]         # x is now c(1,2,3,4,5,6), with the NA removed.
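
NA tends to spread through calculations, which is why the na.rm argument matters. A short sketch of how NA behaves:

```r
x <- c(1, 2, 3, NA, 4, 5, 6)
sum(is.na(x))          # 1, the number of missing values
which(is.na(x))        # 4, the position of the missing value
NA + 1                 # NA, arithmetic with NA gives NA
NA > 3                 # NA, and so do comparisons
x == NA                # all NA. this is why is.na() exists
mean(x, na.rm = TRUE)  # 3.5
c("a", NA)             # NA inside a character vector, still not the string "NA"
```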

Lists

Along with vectors, lists are another fundamental data structure in R.

The primary difference between vectors and lists is that lists can hold different primitive types.

l <- list(1, "c", TRUE)   # the items are stored as is
v <- c(1, "c", TRUE)      # the items are "coerced" to c("1", "c", "TRUE") because
                          # vectors must be the same primitive type in R, and R does 
                          # its best by converting to a more generic primitive type
unlist(l)                 # converts list to a vector with generic primitive type

Lists can be accessed just like vectors by passing an integer (or a vector of integers):

l[1] # 1 
l[2] # "c" 
l[3] #  TRUE
v[1] #  "1"
v[2] # "c" 
v[3] #  "TRUE"
# Notice the items in the vector are all characters,
# and the items in the list are numeric, character, and logical.
# Strictly speaking, l[1] returns a list of length 1 holding the item;
# use double brackets, l[[1]], to get the item itself.
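
The difference between single and double brackets matters as soon as you want to compute with a list item. A small sketch:

```r
l <- list(1, "c", TRUE)
class(l[1])     # "list", single brackets return a list of length 1
class(l[[1]])   # "numeric", double brackets return the item itself
l[[1]] + 1      # 2
# l[1] + 1      # would be an error: a list is not a number
l[2:3]          # a list holding "c" and TRUE
length(l)       # 3
```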

Items of lists can have names:

prof <- list(height = 175,            
            name = "Coye",
            beard = "Epic",
            favorite.number = 3.14,
            employed = TRUE
        )

When items are given names, they can be accessed via a special syntax using the $ operator:

prof$name   # "Coye"
prof$height # 175
prof$beard  # "Epic"

They can still be accessed by index, in which case the index is the item's position in the list:

prof[4] # 3.14, because favorite.number was the fourth item
prof[5] # TRUE, because 'employed' was the fifth item

Because lists can store different data types, lists can contain lists. This means that lists can be used to store arbitrarily complex data.
This can be useful when you're converting data from XML or other hierarchical data formats.

prof2 <- list(height = 175,
            name = "Coye",
            beard = "Epic",
            favorite.number = 3.14,
            employed = TRUE,
            courses.taught = list(
                list(
                    name = "quantitative methods",
                    course.id = "i271b"
                ),
                list(
                    name = "underwater basket weaving",
                    course.id = "i203"
                )
            )
        )
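
Getting items back out of a nested list is a matter of chaining $ and [[ ]]. The sketch below uses a trimmed-down copy of the list above; ids is a made-up name, and the small anonymous function passed to sapply is just one way of pulling the same field out of every course:

```r
prof2 <- list(
    name = "Coye",
    courses.taught = list(
        list(name = "quantitative methods", course.id = "i271b"),
        list(name = "underwater basket weaving", course.id = "i203")
    )
)
prof2$courses.taught[[1]]$name        # "quantitative methods"
prof2$courses.taught[[2]]$course.id   # "i203"
length(prof2$courses.taught)          # 2
ids <- sapply(prof2$courses.taught, function(course) course$course.id)
ids                                   # "i271b" "i203"
str(prof2)                            # prints the whole nested structure
```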

Therefore, lists can also be thought of as a dictionary, hash table, associative array, map, or whatever you want to call a data structure which associates a key (string) to some value. However, it is not as efficient since it does not actually hash keys.
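
As a rough sketch of using a list like a dictionary (the ages list and key variable are invented for illustration), keys can be looked up, added, tested for, and removed:

```r
ages <- list(Dalmore = 12, Lagavulin = 16)  # keys are the item names
ages[["Lagavulin"]]         # 16, look up by key
ages$Ardbeg <- 10           # add a new key
names(ages)                 # "Dalmore" "Lagavulin" "Ardbeg"
"Ardbeg" %in% names(ages)   # TRUE, check whether a key exists
ages$Dalmore <- NULL        # assigning NULL removes a key
names(ages)                 # "Lagavulin" "Ardbeg"
key <- "Ardbeg"
ages[[key]]                 # 10, [[ ]] accepts a key stored in a variable
```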

Data frames

A data frame is, at its core, a list where every item is a vector, with all vectors having the same length. Think of it as a spreadsheet, or a (denormalized) table in a relational database. It is the primary data structure that most interesting R functions expect, and therefore: Putting everything you need into a data frame is the goal of your data pre-processing step.

We can manually create a data frame with the data.frame function:

data.frame(x = c("a", "B", "X"), y = c(3.14, 69, 420), z = c("pizza", "falafel", "curry"))

# which creates a data frame with 3 rows, 3 columns named x, y, and z:
#   x      y       z
# 1 a   3.14   pizza
# 2 B  69.00 falafel
# 3 X 420.00   curry

Another one:

scotch <- data.frame(
    name = c("Dalmore", "Lagavulin", "Dalwhinnie", "Ardbeg"), 
    age = c(12, 16, 15, 10), 
    region = c("Highland", "Islay", "Speyside", "Islay"),  # more on factors later
    abv = c(40, 56.5, 43, 46)
)

#         name age   region  abv
# 1    Dalmore  12 Highland 40.0
# 2  Lagavulin  16    Islay 56.5
# 3 Dalwhinnie  15 Speyside 43.0
# 4     Ardbeg  10    Islay 46.0

View(scotch)    # view a formatted data frame

This is an error though, since the number of items in each column is not the same:

data.frame(x = c("a", "B", "X"), y = c(3.14, 69, 420), z = c("pizza", "falafel", "curry", "burrito"))
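
The sketch below checks the "data frame is a list of equal-length vectors" idea directly, and catches the unequal-length error with tryCatch so the script keeps running. The names df, result, and recycled are made up. One caveat worth knowing: a shorter column is recycled rather than rejected when its length divides the longer one evenly.

```r
df <- data.frame(x = c("a", "B", "X"), y = c(3.14, 69, 420))
is.list(df)    # TRUE
length(df)     # 2, the number of columns
names(df)      # "x" "y"
dim(df)        # 3 2, rows then columns

# columns of length 3 and 4 are rejected
result <- tryCatch(
    data.frame(x = c(1, 2, 3), y = c(1, 2, 3, 4)),
    error = function(e) "error"
)
result         # "error"

# but lengths 2 and 4 are accepted: x is recycled to 1 2 1 2
recycled <- data.frame(x = c(1, 2), y = c(1, 2, 3, 4))
nrow(recycled) # 4
```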

Since data frames are lists, you can use the $ notation to get at individual column vectors:

scotch$age # 12 16 15 10
scotch$abv # 40.0 56.5 43.0 46.0

You can also get specific items within a data frame using either the vector notation

scotch$abv[2] # 56.5. nothing new here. scotch$abv is a vector, and [2] gets the second item

OR we can use new notation which allows you to specify both the row and column at the same time. The part before the , is the row, after is the column:

scotch[2, "abv"]    # 56.5.  row 2, column "abv"
scotch[1, "region"] # "Highland".  row 1, column "region"

Now, here's the kicker: we can use all of the vector subsetting/indexing techniques from earlier with this notation. For example:

scotch[c(1,2,3), "abv"]           # 40.0 56.5 43.0. abv for row 1,2,3
scotch[c(1,2), c("abv","region")] # abv and region for row 1,2

We can even slice/subset:

scotch$region == "Islay" # FALSE TRUE FALSE TRUE, because the 2nd and 4th are the Islay malts

scotch[scotch$region == "Islay", "abv"] 
# returns:
# 56.5 46.0
# we are selecting the abv column for rows where region == "Islay".
# yep, Islay scotch are stronger.

Another example:

mean(scotch$age)              # mean age is 13.25. nothing special yet
scotch$age < mean(scotch$age) # TRUE FALSE FALSE  TRUE, because Dalmore and Ardbeg are the young'uns
scotch[scotch$age < mean(scotch$age), c("name", "age", "region", "abv")]

# returns:
#      name age   region abv
# 1 Dalmore  12 Highland  40
# 4  Ardbeg  10    Islay  46
# because we selected all four columns where the rows met the condition that the age was less than average
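
A few more variations on the same idea, as a sketch: conditions can be combined with &, the column part can be left empty to keep every column, and order() returns row positions that sort the data frame. The scotch data frame is recreated here so the example runs on its own.

```r
scotch <- data.frame(
    name = c("Dalmore", "Lagavulin", "Dalwhinnie", "Ardbeg"),
    age = c(12, 16, 15, 10),
    region = c("Highland", "Islay", "Speyside", "Islay"),
    abv = c(40, 56.5, 43, 46)
)
scotch[scotch$region == "Islay" & scotch$abv > 50, "name"]  # Lagavulin
scotch[scotch$age < mean(scotch$age), ]        # empty column part keeps all columns
scotch[order(scotch$age), "name"]              # Ardbeg Dalmore Dalwhinnie Lagavulin
mean(scotch[scotch$region == "Islay", "abv"])  # 51.25
```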

Example data frames

R actually comes with many example data frames built in. You can take a look at a list of all available sample data sets with ls("package:datasets"). Some fun and self-explanatory examples include USArrests, mtcars, trees, and iris.

In particular, iris is a famous data set popularized by R. A. Fisher, the chain-smoking father of statistics, and contains various measurements of flower parts. As it is a good example of quantitative data, it is used in a few examples below.

Looking at data

str, head, tail, summary will likely be the first tools you reach for when dealing with a fresh data frame.

str(iris) # str for 'str'ucture. dissects the data structure, very useful for debugging:

# 'data.frame': 150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris) # first few

#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

tail(iris) # last few

#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 145          6.7         3.3          5.7         2.5 virginica
# 146          6.7         3.0          5.2         2.3 virginica
# 147          6.3         2.5          5.0         1.9 virginica
# 148          6.5         3.0          5.2         2.0 virginica
# 149          6.2         3.4          5.4         2.3 virginica
# 150          5.9         3.0          5.1         1.8 virginica

summary(iris) # summary stats

#   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#        Species  
#  setosa    :50  
#  versicolor:50  
#  virginica :50  

nrow(iris)   # number of observations in data set
ncol(iris)    # number of variables in data set

plot(iris) # try it!
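
A few other quick checks that pair well with the functions above, shown as a sketch on the built-in iris data:

```r
dim(iris)                   # 150 5, rows then columns
names(iris)                 # the column names
table(iris$Species)         # 50 of each species
mean(iris$Sepal.Length)     # about 5.843, matches the summary() output
sapply(iris[, 1:4], mean)   # mean of each of the four numeric columns
head(iris, 3)               # first 3 rows instead of the default 6
```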

Reading in data frames

read.table is the primary function you'll be using to read various delimiter-separated text files. If you look up its manual page ?read.table, you'll see there are many different options. Some of the most used are:

header, which tells the function to interpret the first line as a list of column names. FALSE by default.
sep, for separator. By default, the function will split on any whitespace. You'll most likely want to set an explicit separator, such as sep="\t" for tab or sep="," for comma.
na.strings, for specifying which strings should be interpreted as missing values. So, if your data file happens to use . and -999 as missing values, you can pass na.strings=c("-999", "."). "NA" by default.

# movies
movies <- read.table(
    "http://courses.ischool.berkeley.edu/i271b/f15/data/movies-subset.txt", 
    header=T, 
    sep=";"
) 
# Note, each observation is numbered and R was able to determine that the numbers
# don't correspond to a variable.

# google daily
goog <- read.table(
    "http://courses.ischool.berkeley.edu/i271b/f15/data/goog-daily.txt", 
    header=T, 
    sep=","
)

# complicated example 
berkeley.temperature <- read.table(
    "http://courses.ischool.berkeley.edu/i271b/f15/data/37.78N-122.03W-TAVG-Trend.txt", 
    blank.lines.skip=T,       # ignores blank lines
    comment.char="%",         # removes lines commented out with a %
    na.strings="NaN"          # changes 'NaN' in .txt file to NA in data frame
    ) 

# Note, the columns are not named something meaningful, because R cannot decipher the hierarchical
# column names. We can use the following to assign a new name to a column:
colnames(berkeley.temperature)[1] <- 'year'
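
If you want to experiment with header, sep, and na.strings without downloading anything, read.table can also read from a string through its text argument. The sketch below uses a tiny invented data set (raw and small are made-up names):

```r
raw <- "name;age;abv
Dalmore;12;40
Lagavulin;16;-999
Ardbeg;.;46"

small <- read.table(
    text = raw,                  # read from a string instead of a file or URL
    header = TRUE,               # first line holds the column names
    sep = ";",
    na.strings = c("-999", ".")  # both markers become NA
)
small$age    # 12 16 NA
small$abv    # 40 NA 46
str(small)   # age and abv are read as numbers, not strings
```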

Loading in packages

Packages are used in R to provide additional functionality beyond the base functionality of R. Anyone can write an R package, and many have been written for advanced statistics and graphing. ggplot2 is a popular package for graphing within R. We will be using it in this course.

install.packages("ggplot2")       # installs the ggplot2 package in the R environment
library(ggplot2)                  # loads the package
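
install.packages only needs to be run once per machine, while library has to be run in every new session. As a hedged sketch, one way to avoid reinstalling each time is to check first with requireNamespace (has.ggplot2 is a made-up name):

```r
has.ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)  # TRUE if installed
if (has.ggplot2) {
    library(ggplot2)
    print(packageVersion("ggplot2"))
} else {
    message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
}
```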

R Style Guide

See Google's complete style guide for more information, but some common conventions to be aware of:

# Assignment
a <- 5              # good (Tip: use alt-minus sign to automatically create spaces and assignment symbol)
a<-5                # bad
a = 5               # bad

# Naming variables
new.variable        # preferred
newVariable         # accepted
new_variable        # bad

# Spacing 
library(ggplot2)    # good
library( ggplot2 )  # bad

# Comments
#this is a bad comment
# This is a good comment.