Base packages: utils, stats, datasets, …
Recommended packages: boot, class, KernSmooth, etc.
getwd() = return current working directory
setwd() = set current working directory
?function = brings up help for that function
dir.create("path/foldername", recursive = TRUE) = create directories/subdirectories
unlink(directory, recursive = TRUE) = delete directory and subdirectories
ls() = list all objects in the local workspace
list.files(recursive = TRUE) = list all files, including those in subdirectories
args(function) = returns arguments for the function
file.create("name") = create file
file.exists("name") = return TRUE/FALSE depending on whether the file exists in the working directory
file.info("name") = return file info
file.info("name")$property = returns value for the specific attribute
file.rename("name1", "name2") = rename file
file.copy("name1", "name2") = copy file
file.path("name1") = return path of file
<- = assignment operator
# = comment
an expression is typed at the prompt, press enter and the result is returned
print(x) = explicitly printing
[1] at the beginning of the output = which element of the vector is being shown
numeric objects = double precision real numbers (decimals)
integer = specified by adding L to the end of a number (ex. 1L)
Inf = infinity, can be used in calculations
NaN = not a number/undefined
sqrt(value) = square root of value
variable <- value = assignment of a value to a variable name
vector <- c(value1, value2, ...) = creates a vector with specified values
vector1*vector2 = element by element multiplication (rather than matrix multiplication)
computations on vectors (+, -, ==, /, etc.) are done element by element by default
%*% = force matrix multiplication between vectors/matrices
vector("class", n) = creates empty vector of length n and specified class
vector("numeric", 3) = creates 0 0 0
c() = concatenate
T, F = shorthand for TRUE and FALSE
1+0i = complex numbers
as.numeric(x), as.logical(x), as.character(x), as.complex(x) = convert object from one class to another
as.numeric(c("a", "b")) = nonsensical coercion returns NA values (with a warning)
as.list(data.frame) = converts a data.frame object into a list object
as.character(list) = converts list into a character vector
x <- c(NA, 2, "D") will create a vector of character class
list() = special vector with different classes of elements
list = vector of objects of different classes
elements of a list use [[]], elements of other vectors use []
TRUE, FALSE, and NA = logical values, generated as the result of logical conditions comparing two objects/values
paste(characterVector, collapse = " ") = join together elements of the vector, separated by the collapse parameter
paste(vec1, vec2, sep = " ") = join together different vectors, separated by the sep parameter
LETTERS, letters = predefined vectors for all 26 upper and lower case letters
unique(values) = returns vector with all duplicates removed
matrix can contain only 1 type of data
data.frame can contain multiple types
matrix(values, nrow = n, ncol = m) = creates an n by m matrix
dim(m) <- c(2, 5) = turns a 10 element vector m into a 2 by 5 matrix
rbind(x, y), cbind(x, y) = combine rows/columns; can be used on vectors or matrices
* and / = element by element computation between two matrices
%*% = matrix multiplication
dim(obj) = dimensions of an object (returns NULL if a vector)
dim(obj) <- c(4, 5) = assign dim attribute to an object
# initiate a vector
x <- c(NA, 1, "cx", NA, 2, "dsa")
class(x)
## [1] "character"
x
## [1] NA    "1"   "cx"  NA    "2"   "dsa"
# convert to matrix
dim(x) <- c(3, 2)
class(x)
## [1] "matrix"
# (R >= 4.0 reports "matrix" "array")
x
##      [,1] [,2]
## [1,] NA   NA
## [2,] "1"  "2"
## [3,] "cx" "dsa"
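A minimal sketch of the element-by-element versus matrix multiplication distinction and of the implicit coercion described above. The object names (a, b, mixed) are illustrative and not from the original notes.

```r
# Two 2 x 2 matrices; matrix() fills column by column
a <- matrix(1:4, nrow = 2, ncol = 2)
b <- matrix(5:8, nrow = 2, ncol = 2)

elementwise <- a * b    # multiplies matching positions
product <- a %*% b      # row-by-column matrix product

# Mixing types in c() coerces every element to the most general class
mixed <- c(NA, 2, "D")
mixed_class <- class(mixed)   # "character"
```

The two results have the same shape but different values, which is why %*% must be requested explicitly.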
data.frame(var = 1:4, var2 = c(….)) = creates a data frame
nrow(), ncol() = returns row and column numbers
data.frame(vector, matrix) = takes any number of arguments and returns a single object of class “data.frame” composed of original objects
as.data.frame(obj) = converts object to data frame
data frames are usually created through read.table() and read.csv()
data.matrix() = converts a data frame to a numeric matrix
colMeans(matrix) or rowMeans(matrix) = returns means of the columns/rows of a matrix/data frame in a vector
as.numeric(rownames(df)) = returns row indices for rows of a data frame with unnamed rows
attributes of R objects: names, dimnames, row.names, dim (matrices, arrays), class, length, or any user-defined ones
attributes(obj), class(obj) = return attributes/class for an R object
attr(object, "attribute") <- "value" = creates/assigns a value to a new/existing attribute for the object
names attribute
names(x) = returns names (NULL if no name exists)
names(x) <- c("a", …) = can be used to assign names to vectors
list(a = 1, b = 2, …) = a, b are names
dimnames(matrix) <- list(c("a", "b"), c("c" , "d")) = assign names to matrices
colnames(data.frame) = return column names (can be used to set column names as well, similar to dim())
row.names = names of rows in the data frame (attribute)
array(data, dim, dimnames)
data = data to be stored in array
dim = dimensions of the array
dim = c(2, 2, 5) = 3 dimensional array \(\rightarrow\) creates 5 2x2 arrays
dimnames = add names to the dimensions
dimnames must be given as a list
list must correspond in length to the dimensions of the array
dimnames(x) <- list(c("a", "b"), c("c", "d"), c("e", "f", "g", "h", "i")) = set the names for row, column, and third dimension respectively (2 x 2 x 5 in this case)
dim() function can be used to create arrays from vectors or matrices
x <- rnorm(20); dim(x) <- c(2, 2, 5) = converts a 20 element vector to a 2x2x5 array
factors are treated specially by modeling functions such as lm(), glm()
factor(c("a", "b"), levels = c("1", "2")) = creates factor; levels sets the allowed values and their order (values not listed in levels become NA)
levels argument can be used to specify baseline levels vs other levels
table(factorVar) = how many of each are in the factor
NaN or NA = missing values
NaN = undefined mathematical operations
NA = any value not available or missing in the statistical sense
any operation involving NA results in NA
is.na(), is.nan() = use to test if each element of the vector is NA and NaN
cannot compare NA (with ==) as it is not a value but a placeholder for a quantity that is not available
sum(my_na) = sum of a logical vector (TRUE = 1 and FALSE = 0) is effectively the number of TRUEs
NA Values
is.na() = creates logical vector where TRUE is where the value is NA, FALSE is where a value exists
complete.cases(obj1, obj2) = creates logical vector where TRUE is where both values exist, and FALSE is where any is NA
complete.cases(data.frame) = creates logical vector indicating which observations/rows are complete
data.frame[logicalVector, ] = returns all observations with complete data
Imputing Missing Values = replacing missing values with estimates (can be averages from all other data with similar conditions)
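The NA helpers above can be sketched on a small made-up vector and data frame; the values and the names x and df are illustrative assumptions, not data from the notes.

```r
x <- c(1, NA, 3, NaN, 5)

na_count <- sum(is.na(x))     # is.na() is TRUE for both NA and NaN
nan_count <- sum(is.nan(x))   # is.nan() is TRUE only for NaN
clean <- x[!is.na(x)]         # keep the non-missing elements

df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
good <- complete.cases(df)    # TRUE only where every column has a value
complete_rows <- df[good, ]
```

Note that x == NA would return NA for every element, which is why is.na() is used instead.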
1:20 = creates a sequence of numbers from first number to second number
?':' = operators must be enclosed in quotes to bring up their help
seq(1, 20, by=0.5) = sequence 1 to 20 by increment of .5
length=30 argument can be used to specify number of values generated
length(variable) = length of vector/sequence
seq_along(vector) or seq(along.with = vector) = create vector that is same length as another vector
rep(0, times = 40) = creates a vector with 40 zeroes
rep(c(1, 2), times = 10) = repeats combination of numbers 10 times
rep(c(1, 2), each = 10) = repeats first value 10 times followed by second value 10 times
x[0] returns numeric(0), not error
x[3000] returns NA (not out of bounds/error)
[] = always returns object of same class, can select more than one element of an object (ex. [1:2])
[[]] = can extract one element from list or data frame, returned object not necessarily list/data frame
$ = can extract elements from list/data frame that have names associated with them, not necessarily same class
x[1:10] = first 10 elements of vector x
x[is.na(x)] = returns all NA elements
x[!is.na(x)] = returns all non NA elements
x > 0 = returns logical vector comparing all elements to 0 (TRUE/FALSE for all values except NA, and NA for NA elements, since NA is a placeholder)
x[x>"a"] = selects all elements bigger than a (lexicographical order in place)
x[logicalIndex] = select all elements where logical index = TRUE
x[-c(2, 10)] = returns everything but the second and tenth element
vect <- c(a = 1, b = 2, c = 3) = names values of a vector with corresponding names
names(vect) = returns element names for object
names(vect) <- c("a", "b", "c") = assign/change names of vector
identical(obj1, obj2) = returns TRUE if two objects are exactly equal
all.equal(obj1, obj2) = returns TRUE if two objects are near equal
x <- list(foo = 1:4, bar = 0.6)
x[1] or x["foo"] = returns the list object foo
x[[2]] or x[["bar"]] or x$bar = returns the content of the second element from the list (in this case vector without name attribute)
$ can’t extract multiple elements
x[c(1, 3)] = extract multiple elements of list
x[[name]] = extract using variable, whereas $ must match name of element
x[[c(1, 3)]] or x[[1]][[3]] = extract nested elements of list (third element of the first object extracted from the list)
x[1, 2] = extract the (row, column) element
x[,2] or x[1,] = extract the entire column/row
x[ , 11:17] = subset the x data.frame with all rows, but only columns 11 to 17
drop = FALSE = keeps the result a matrix instead of dropping it to a vector
x[1, 2, drop = F]
partial matching works with [[]] and $
$ automatically partial matches the name (x$a)
[[]] can partial match by adding exact = FALSE
x[["a", exact = FALSE]]
<, >= = less than, greater or equal to
== = exact equality
!= = inequality
A | B = union
A & B = intersection
! = negation
& or | evaluates every instance/element in vector
&& or || evaluate only a single element (operands longer than one are an error in R >= 4.3)
isTRUE(condition) = returns TRUE or FALSE of the condition
xor(arg1, arg2) = exclusive OR, one argument must equal TRUE and one must equal FALSE
which(condition) = find the indices of elements that satisfy the condition (TRUE)
any(condition) = TRUE if one or more of the elements in logical vector is TRUE
all(condition) = TRUE if all of the elements in logical vector are TRUE
class(), dim(), nrow(), ncol(), names() = use to understand dataset
object.size(data.frame) = returns how much space the dataset is occupying in memory
head(data.frame, 10), tail(data.frame, 10) = returns first/last 10 rows of data; default = 6
summary() = provides different output for each variable, depending on class
table(data.frame$variable) = table of all values of the variable, and how many observations there are for each
str(data.frame) = structure of data, provides data class, number of observations vs variables, and the name and class of each variable with a preview of its contents
View(data.frame) = opens and views the content of the data frame
split()
split(x, f, drop = FALSE)
x = vector/list/data frame
f = factor/list of factors
drop = whether empty factor levels should be dropped
interaction(gl(2, 5), gl(5, 2)) = 1.1, 1.2, … 2.5
gl(n, m) = generate factor levels
n = number of levels
m = number of repetitions
split function can do this by passing in list(f1, f2) as the f argument
split(data, list(gl(2, 5), gl(5, 2))) = splits the data into 1.1, 1.2, … 2.5 levels
apply()
apply(x, MARGIN = 2, FUN, ...)
x = array
MARGIN = 2 (column), 1 (row)
FUN = function
… = other arguments that need to be passed to other functions
apply(x, 1, sum) or apply(x, 1, mean) = find row sums/means
apply(x, 2, sum) or apply(x, 2, mean) = find column sums/means
apply(x, 1, quantile, probs = c(0.25, 0.75)) = find the 25th and 75th percentiles of each row
a <- array(rnorm(2*2*10), c(2, 2, 10)) = create 10 2x2 matrices
apply(a, c(1, 2), mean) = returns the 2x2 matrix of means taken across the 10 matrices
lapply()
loops over a list, evaluates a function on each element, and always returns a list
lapply(x, FUN, ...) = takes list/vector as input, applies a function to each element of the list, returns a list of the same length
x = list (if not list, will be coerced into list through “as.list”, if not possible \(\rightarrow\) error)
data.frames are treated as lists of columns and can be used here
FUN = function (without parentheses)
anonymous functions are acceptable here as well (i.e. function(x) x[,1])
… = other/additional arguments to be passed for FUN (i.e. min, max for runif())
lapply(data.frame, class) = the data.frame is a list of vectors, the class value for each vector is returned in a list (name of function, class, is without parentheses)
lapply(values, function(elem) elem[2]) = example of an anonymous function
sapply()
performs the same function as lapply() except it simplifies the result
if the result cannot be simplified, sapply returns a list (same as lapply())
vapply()
safer version of sapply in that it allows you to specify the format for the result
vapply(flags, class, character(1)) = returns the class of values in the flags variable in the form of character of length 1 (1 value)
tapply()
tapply(data, INDEX, FUN, ..., simplify = FALSE) = apply a function over subsets of a vector
data = vector
INDEX = factor/list of factors
FUN = function
… = arguments to be passed to function
simplify = whether to simplify the result
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10); tapply(x, f, mean) = returns the mean of each group (f level) of x data
mapply()
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE)
FUN = function
… = arguments to apply over
MoreArgs = list of other arguments to FUN
SIMPLIFY = whether the result should be simplified
mapply(rep, 1:4, 4:1)
## [[1]]
## [1] 1 1 1 1
##
## [[2]]
## [1] 2 2 2
##
## [[3]]
## [1] 3 3
##
## [[4]]
## [1] 4
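The apply-family functions above differ mainly in what they return. A side-by-side sketch on small made-up inputs (scores, group and m are illustrative names, not from the notes):

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20))

as_list <- lapply(scores, mean)               # always a list
as_vector <- sapply(scores, mean)             # simplified to a named numeric vector
checked <- vapply(scores, mean, numeric(1))   # like sapply, but the result type is enforced

# tapply: apply a function to subsets of a vector defined by a factor
group <- factor(c("x", "x", "y", "y"))
by_group <- tapply(c(1, 3, 10, 30), group, mean)

# apply: work over the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
col_sums <- apply(m, 2, sum)
```

vapply() is the safer choice inside functions, since it fails immediately if an element does not produce the declared type and length.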
aggregate()
computes summary statistics of data subsets (similar to running multiple tapply at the same time)
aggregate(list(name = dataToCompute), list(name = factorVar1, name = factorVar2), function, na.rm = TRUE)
dataToCompute = this is what the function will be applied on
factorVar1, factorVar2 = factor variables to split the data by
function = what is applied to the subsets of data, can be sum/mean/median/etc
na.rm = TRUE \(\rightarrow\) removes NA values
sample(values, n, replace = FALSE) = generate random samples
values = values to sample from
n = number of values generated
replace = with or without replacement
sample(1:6, 4, replace = TRUE, prob=c(.2, .2…)) = choose four values from the range specified with replacement (same numbers can show up twice), with probabilities specified
sample(vector) = can be used to permute/rearrange elements of a vector
sample(c(y, z), 100) = select 100 random elements from combination of values y and z
sample(10) = random permutation of the integers 1 to 10 (sample of size 10 without repeats)
r*** function (for “random”) \(\rightarrow\) random number generation (ex. rnorm)
d*** function (for “density”) \(\rightarrow\) calculate density (ex. dunif)
p*** function (for “probability”) \(\rightarrow\) cumulative distribution (ex. ppois)
q*** function (for “quantile”) \(\rightarrow\) quantile function (ex. qbinom)
pnorm(q) = \(\Phi(q)\) and qnorm(p) = \(\Phi^{-1}(p)\), where \(\Phi\) is the standard normal cumulative distribution function
set.seed() = sets seed for random number generator to ensure that the same data/analysis can be reproduced
rbinom(1, size = 100, prob = 0.7) = returns a binomial random variable that represents the number of successes in a given number of independent trials
1 = corresponds to the number of observations
size = 100 = corresponds with the number of independent trials that culminate in each resultant observation
prob = 0.7 = probability of success
rnorm(n, mean = m, sd = s) = generate n random samples from the normal distribution (mean = 0, std deviation = 1 by default)
rnorm(1000) = 1000 draws from the standard normal distribution
n = number of observations generated
mean = m = specified mean of distribution
sd = s = specified standard deviation of distribution
dnorm(x, mean = 0, sd = 1, log = FALSE)
log = evaluate on log scale
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
lower.tail = left side, FALSE = right
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
lower.tail = left side, FALSE = right
rpois(n, lambda) = generate random samples from the Poisson distribution
n = number of observations generated
lambda = \(\lambda\) parameter for the Poisson distribution or rate
rpois(n, r) = generating Poisson data
n = number of values
r = rate
ppois(n, r) = cumulative distribution
ppois(2, 2) = \(\Pr(X \le 2)\)
replicate(n, rpois()) = repeat operation n times
set.seed(20)
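# linear model y = 0.5 + 2x + e, with either a normal or a binary predictor
x <- rnorm(100) # normal predictor
x <- rbinom(100, 1, 0.5) # binomial predictor (overwrites the normal one)
e <- rnorm(100, 0, 2) # noise term
y <- 0.5 + 2 * x + e
# Poisson model with log(mu) = 0.5 + 0.3x
x <- rnorm(100)
log.mu <- 0.5 + 0.3 * x
y <- rpois(100, exp(log.mu))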
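How the r/d/p/q prefixes and set.seed() fit together can be sketched as follows. The seed value 42 and the variable names are arbitrary choices for illustration.

```r
# Same seed, same draws: this is what makes a simulation reproducible
set.seed(42)
first <- rnorm(5)
set.seed(42)
second <- rnorm(5)

# p and q functions are inverses of each other
p <- pnorm(1.96)                          # cumulative probability, about 0.975
q <- qnorm(p)                             # back to 1.96
upper <- pnorm(1.96, lower.tail = FALSE)  # right tail, equal to 1 - p

# The cumulative distribution is the running sum of the density
cum <- ppois(2, 2)                        # Pr(X <= 2) for rate 2
manual <- sum(dpois(0:2, 2))

perm <- sample(10)                        # random permutation of 1:10
```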
Date = date class, stored as number of days since 1970-01-01
POSIXct = time class, stored as number of seconds since 1970-01-01
POSIXlt = time class, stored as list of sec, min, hour, etc.
Sys.Date() = today’s date
unclass(obj) = returns what obj looks like internally
Sys.time() = current time in POSIXct class
t2 <- as.POSIXlt(Sys.time()) = time in POSIXlt class
t2$min = return min of time (only works for POSIXlt class)
weekdays(date), months(date), quarters(date) = returns weekdays, months, and quarters of time/date input
strptime(string, "%B %d, %Y %H:%M") = convert string into time format using the format specified
difftime(time1, time2, units = 'days') = difference in times by the specified unit
data(set) = load data
plot(data) = R plots the data as best as it can
x = variable, x axis
y = variable
xlab, ylab = corresponding labels
main, sub = title, subtitle
col = 2 or col = "red" = color
pch = 2 = different symbols for points
xlim, ylim = c(v1, v2) = restrict range of plot
boxplot(x ~ y, data = d) = creates boxplot for x vs y variables using the data.frame provided
hist(x, breaks) = plots histogram of the data
breaks = 100 = split data into 100 bins
read.table(), read.csv() = most common, read text files (rows, columns) and return a data frame
readLines() = read lines of text, returns character vector
source(file) = read R code
dget() = read R code files (R objects that have been deparsed)
load(), unserialize() = read binary objects
corresponding writing functions: write.table(), writeLines(), dump(), dput(), save(), serialize()
read.table() arguments:
file = name of file/connection
header = indicator if file contains header
sep = string indicating how columns are separated
colClasses = character vector indicating what each column is in terms of class
nrows = number of rows in dataset
comment.char = char indicating beginning of comment
skip = number of lines to skip in the beginning
stringsAsFactors = should characters be coded as factors (defaulted to TRUE before R 4.0.0, FALSE since)
read.table can be used with only the file argument to create a data.frame
read.csv() = read.table except default sep is comma and header = TRUE (read.table defaults are sep = "", meaning any whitespace, and header = FALSE)
numRow x numCol x 8 bytes/numeric value = size required in bytes
comment.char = "" to save time if there are no comments in the file
colClasses can make reading data much faster
nrows = n = number of rows to read in (can help with memory usage)
initial <- read.table("file", nrows = 100) = read first 100 lines
classes <- sapply(initial, class) = determine what classes the columns are
tabAll <- read.table("file", colClasses = classes) = load in the entire file with determined classes
dump and dput preserve metadata
dput(obj, file = "file.R") = creates R code to store all data and metadata in “file.R” (ex. data, class, names, row.names)
dget("file.R") = loads the file/R code and reconstructs the R object
dput can only be used on one object, whereas dump can be used on multiple objects
dump(c("obj1", "obj2"), file= "file2.R") = stores two objects
source("file2.R") = loads the objects
url() = function can read from webpages
file() = read uncompressed files
gzfile(), bzfile() = read compressed files (gzip, bzip2)
file(description = "", open = "") = file syntax, creates connection
description = description of file
open = r - read only, w - writing, a - appending, rb/wb/ab - reading/writing/appending binary
close() = closes connection
readLines() = can be used to read lines after connection has been established
download.file(fileURL, destfile = "fileName", method = "curl")
fileURL = url of the file that needs to be downloaded
destfile = "fileName" = specifies where the file is to be saved
dir/fileName = directories can be referenced here
method = "curl" = necessary for downloading files from “https://” links on Macs
method = "auto" = should work on all other machines
if, else = testing a condition
for = execute a loop a fixed number of times
while = execute a loop while a condition is true
repeat = execute an infinite loop
break = break the execution of a loop
next = skip an iteration of a loop
return = exit a function
apply functions are more useful for command-line interactive work
if - else
# basic structure
if(<condition>) {
## do something
} else {
## do something else
}
# if tree
if(<condition1>) {
## do something
} else if(<condition2>) {
## do something different
} else {
## do something different
}
y <- if(x>3){10} else {0} = slightly different implementation than normal, focus on assigning value
for
# basic structure
for(i in 1:10) {
# print(i)
}
# nested for loops
x <- matrix(1:6, 2, 3)
for(i in seq_len(nrow(x))) {
for(j in seq_len(ncol(x))) {
# print(x[i, j])
}
}
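The loops above use seq_len() rather than 1:n. A small sketch of why, using an empty vector as the assumed edge case (the name empty is illustrative):

```r
empty <- character(0)

naive_index <- 1:length(empty)   # 1:0 counts down, giving c(1, 0)
safe_index <- seq_along(empty)   # integer(0), so a loop body never runs

iterations <- 0
for (i in safe_index) {
  iterations <- iterations + 1   # not reached for an empty vector
}
```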
for(letter in x) = loop through each letter in a character vector
seq_along(vector) = create a number sequence from 1 to length of the vector
seq_len(length) = create a number sequence that starts at 1 and ends at length specified
while
count <- 0
while(count < 10) {
# print(count)
count <- count + 1
}
repeat and break
repeat initiates an infinite loop
the only way to exit a repeat loop is to call break
x0 <- 1
tol <- 1e-8
repeat {
x1 <- computeEstimate()
if(abs(x1 - x0) < tol) {
break
} else {
x0 <- x1 # requires algorithm to converge
}
}
it is better to set a hard limit on the number of iterations (e.g. using a for loop) and then report whether convergence was achieved or not.
next and return
next = (no parentheses) skips an element, to continue to the next iteration
return = signals that a function should exit and return a given value
for(i in 1:100) {
if(i <= 20) {
## Skip the first 20 iterations
next
}
## Do something here
}
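Returning to the repeat example, the advice to cap the number of iterations and report convergence can be sketched with a for loop. Here compute_estimate() is a hypothetical stand-in for the undefined computeEstimate() (one Babylonian step towards sqrt(2)), and max_iter = 50 is an arbitrary limit.

```r
# Hypothetical update step: one Babylonian iteration towards sqrt(2)
compute_estimate <- function(x) (x + 2 / x) / 2

x0 <- 1
tol <- 1e-8
max_iter <- 50        # hard limit, so the loop always terminates
converged <- FALSE

for (i in seq_len(max_iter)) {
  x1 <- compute_estimate(x0)
  if (abs(x1 - x0) < tol) {
    converged <- TRUE # successive estimates agree to within tol
    break
  }
  x0 <- x1
}
```

After the loop, converged says whether the tolerance was met and i holds the number of iterations used.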
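name <- function(arg1, arg2, …){ }
arg1 = 10 = arguments can be given default values
an argument can also default to NULL
na.rm = found in many functions, can be set to TRUE to remove NA values from calculation
structure
f <- function(<arguments>) {
## Do something interesting
}
functions have named arguments (i.e. x = mydata) which can be used to specify default values
sd(x = mydata) (matching by name)
formals() = returns all formal arguments
args() = return all arguments you can specify
partial matching: instead of typing data = x, use d = x
f <- function (a, b) {a^2}
f(5) will not produce an error, because b is never evaluated (lazy evaluation)
... argument
... = used to pass extra arguments on to another function (i.e. mean = 1, sd = 2)
... = also used when the number of arguments is not known in advance, as in paste(), cat()
arguments that come after ... must be explicitly matched and cannot be partially matched
search list, the order in which R looks up a symbol (as returned by search()):
.GlobalEnv
package:stats
package:graphics
package:grDevices
package:utils
package:datasets
package:methods
Autoloads
package:base
.GlobalEnv = everything defined in the current workspace
a package loaded with library() gets put in position 2 of the above search list
make.power <- function(n) {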
pow <- function(x) {
x^n
}
pow
}
cube <- make.power(3) # defines a function with only n defined (x^3)
square <- make.power(2) # defines a function with only n defined (x^2)
cube(3) # defines x = 3
## [1] 27
square(3) # defines x = 3
## [1] 9
# returns the free variables in the function
ls(environment(cube))
## [1] "n" "pow"
# retrieves the value of n in the cube function
get("n", environment(cube))
## [1] 3
y <- 10
f <- function(x) {
y <- 2
y^2 + g(x)
}
g <- function(x) {
x*y
}
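What the two definitions above evaluate to can be checked directly. The sketch below repeats them so it runs on its own, then adds g_dynamic() and f_dynamic(), which are illustrative helpers that only emulate dynamic scoping by looking y up in the caller's frame. R itself always uses lexical scoping.

```r
y <- 10
f <- function(x) {
  y <- 2
  y^2 + g(x)
}
g <- function(x) {
  x * y   # y is a free variable: looked up where g was defined, so y = 10
}
lexical_result <- f(3)           # 2^2 + 3 * 10 = 34

# Emulated dynamic scoping: fetch y from the calling function's frame
g_dynamic <- function(x) {
  x * get("y", envir = parent.frame())
}
f_dynamic <- function(x) {
  y <- 2
  y^2 + g_dynamic(x)
}
dynamic_result <- f_dynamic(3)   # 2^2 + 3 * 2 = 10
```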
Lexical scoping (what R uses)
f(3) \(\rightarrow\) calls g(x)
y isn’t defined locally in g(x) \(\rightarrow\) searches in parent environment (working environment/global workspace)
finds y \(\rightarrow\) y = 10
Dynamic scoping
f(3) \(\rightarrow\) calls g(x)
y isn’t defined locally in g(x) \(\rightarrow\) searches in calling environment (f function)
finds y \(\rightarrow\) y <- 2
optimization functions (optim, nlm, optimize) require you to pass a function whose argument is a vector of parameters
# write constructor function
make.NegLogLik <- function(data, fixed=c(FALSE,FALSE)) {
params <- fixed
function(p) {
params[!fixed] <- p
mu <- params[1]
sigma <- params[2]
a <- -0.5*length(data)*log(2*pi*sigma^2)
b <- -0.5*sum((data-mu)^2) / (sigma^2)
-(a + b)
}
}
# initialize seed and print function
set.seed(1); normals <- rnorm(100, 1, 2)
nLL <- make.NegLogLik(normals); nLL
## function(p) {
## params[!fixed] <- p
## mu <- params[1]
## sigma <- params[2]
## a <- -0.5*length(data)*log(2*pi*sigma^2)
## b <- -0.5*sum((data-mu)^2) / (sigma^2)
## -(a + b)
## }
## <environment: 0x7ff878f72bb8>
# Estimating Parameters
optim(c(mu = 0, sigma = 1), nLL)$par
## mu sigma
## 1.218239 1.787343
# Fixing sigma = 2
nLL <- make.NegLogLik(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
## [1] 1.217775
# Fixing mu = 1
nLL <- make.NegLogLik(normals, c(1, FALSE))
optimize(nLL, c(1e-6, 10))$minimum
## [1] 1.800596
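As a sanity check on the numerical estimates above, the normal model also has closed-form maximum likelihood estimates: the sample mean, and the standard deviation computed with divisor n. The sketch below uses a simplified neg_log_lik() without the fixed-parameter machinery. That function and the names mu_hat and sigma_hat are illustrative, not part of the original notes.

```r
set.seed(1)
normals <- rnorm(100, 1, 2)

# Negative log-likelihood of a normal sample; p = c(mu, sigma)
neg_log_lik <- function(p) {
  mu <- p[1]
  sigma <- p[2]
  0.5 * length(normals) * log(2 * pi * sigma^2) +
    0.5 * sum((normals - mu)^2) / sigma^2
}

fit <- optim(c(mu = 0, sigma = 1), neg_log_lik)

# Closed-form maximum likelihood estimates (divide by n, not n - 1)
mu_hat <- mean(normals)
sigma_hat <- sqrt(mean((normals - mu_hat)^2))
```

fit$par should agree with c(mu_hat, sigma_hat) up to the optimizer's tolerance.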
message: generic notification/diagnostic message, execution continues
message() = generate message
warning: something’s wrong but not fatal, execution continues
warning() = generate warning
error: fatal problem occurred, execution stops
stop() = generate error
condition: generic concept for indicating something unexpected can occur
invisible() = suppresses auto printing
the random number generator must be controlled to reproduce problems (use set.seed to pinpoint problem)
traceback: prints out function call stack after error occurs
debug: flags function for debug mode, allows to step through function one line at a time
debug(function) = enter debug mode
browser: suspends the execution of function wherever it’s placed
trace: allows inserting debugging code into a function at specific places
recover: error handler, freezes at point of error
options(error = recover) = instead of console, brings up menu (similar to browser)
# system.time example
system.time({
n <- 1000
r <- numeric(n)
for (i in 1:n) {
x <- rnorm(n)
r[i] <- mean(x)
}
})
## user system elapsed
## 0.155 0.004 0.191
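The timing result can also be stored and queried rather than just printed. A minimal sketch, using a smaller n than the example above to keep it quick (the names timing and elapsed_seconds are illustrative):

```r
n <- 200
timing <- system.time({
  r <- numeric(n)
  for (i in seq_len(n)) {
    r[i] <- mean(rnorm(n))
  }
})

# Individual components can be pulled out by name
elapsed_seconds <- timing[["elapsed"]]
user_seconds <- timing[["user.self"]]
```

Storing the result this way makes it easy to compare two implementations on their elapsed time.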
system.time(expression)
takes an R expression and returns the amount of time needed to execute it
multiple lines can be wrapped in {}
returns an object of class proc_time
elapsed time can be smaller than user time when multi-threaded libraries are used (vecLib/Accelerate, ATLAS, ACML, MKL)
Rprof() = useful for complex code only
Rprof() generates Rprof.out file by default
Rprof("output.out") = specify the output file
Rprof() should not be used together with system.time()
summaryRprof() = summarizes Rprof() output, 2 methods for normalizing data
reads the Rprof.out file by default, can specify output file summaryRprof("output.out")
by.total = divide time spent in each function by total run time
by.self = first subtracts out time spent in functions above in call stack, and calculates ratio to total
$sample.interval = 0.02 \(\rightarrow\) interval
$sampling.time = 7.41 \(\rightarrow\) seconds, elapsed time
by.total = generally less informative: most time is attributed to top level functions (i.e. lm()), but the function simply calls helper functions to do work so it is not useful to know about the top level function times
by.self = more useful as it focuses on each individual call/function
unlist(rss) = flattens a list object into a vector
ls("package:elasticnet") = list objects/functions in an attached package