2  Introduction to R language

R is a dynamically typed, interpreted language. It is a highly capable environment for computing and graphics, and it is often labeled a glue language. It has built-in functions for statistical computing.

Many great resources on the R language, focused on data science and statistical data processing, are already written and freely available online. We name just a few and limit the introduction here to the bare minimum used within this class.

High quality R sources:

Newcomers to the R ecosystem are strongly encouraged to print and use the various official cheat sheets created by the R community and Posit Inc.

Some basic commands

Code
version   # check your R version first; it should be 4.0 or newer
getwd()   # get the working directory
ls()      # list objects in the current environment
print()   # print an object to the console
rm()      # remove objects from the environment
View()    # inspect the contents of an object
...

Then there are commands that work with the file system, such as

Code
list.files()  
dir.create()
dir.exists()
file.exists()
...

You can also invoke any shell/cmd command, together with its arguments, through the system2() interface.
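The file-system helpers can be tried safely inside R's temporary directory. The following is a minimal sketch; the folder name demo_dir and the file name notes.txt are made up for illustration.

```r
# Work inside R's per-session temporary directory so no project files are touched
demo_dir <- file.path(tempdir(), "demo_dir")    # hypothetical folder name

dir.create(demo_dir, showWarnings = FALSE)      # create the folder
dir.exists(demo_dir)                            # TRUE once the folder exists

demo_file <- file.path(demo_dir, "notes.txt")   # hypothetical file name
writeLines("hello", demo_file)                  # write a one-line text file
file.exists(demo_file)                          # TRUE
list.files(demo_dir)                            # "notes.txt"
```

Building paths with file.path() instead of pasting strings keeps the code portable across operating systems.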

2.1 Getting help

Help is requested from the console in the form help(<function name>) or ?<function name>. To look directly at the code of a function, enter its name in the console without parentheses, or use View(<function name>). R also offers help.search(<topic>), with the shortcut ??, which performs a full-text search of the help across installed packages. Beyond the local installation, RSiteSearch() searches the online R documentation and opens the results in the default browser. Thematic help pages are also very useful: ?Logic, ?Constants, ?Control, ?Arithmetic, ?Syntax, ?Special etc.

Exercise

Try to find the help page for DateTimeClasses. a) What do POSIXct and POSIXlt represent? b) What is the difference between them? c) Find a function for calculating \(5!\).

2.2 R as scientific calculator

2.2.1 Arithmetic operations

Code
1 + 2           # addition
## [1] 3
1 - 2           # subtraction 
## [1] -1
1 / 2           # division
## [1] 0.5
1 * 2           # multiplication
## [1] 2
1 %/% 2         # integer division
## [1] 0
1 %% 2          # modulo operator
## [1] 1

2.2.2 Special values

R is familiar with the concept of \(\pm\infty\), hence the values -Inf and Inf are at your disposal. You will most probably meet them as results of computations that numerically approach \(\frac{\pm1}{0}\). Other special values are NULL (the empty object), NA (not available, i.e. a missing value) and NaN (not a number). The concept of a missing value is particularly important, since it has a significant impact on computed results (see the mean() example below). By default NA is of type logical. A missing value can also be given explicitly in other data types: NA_real_ (matches double), NA_integer_, NA_complex_ and NA_character_; these are all usable when pre-allocating memory for data structures. Try to find out how the functions na.omit(), is.na() and complete.cases() are used.

Code
x <- 1:10                  # general sequence of numbers
x[c(5, 6)] <- NA           # set some elements to NA (missing)
print(x)
mean(x)                    # without removal of missing values
mean(x, na.rm = TRUE)      # with removal of missing values
 [1]  1  2  3  4 NA NA  7  8  9 10
[1] NA
[1] 5.5
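The special values and the helper functions named above can be explored with a short sketch. The vector y and the data frame obs are made-up illustrations.

```r
1 / 0                    # Inf
-1 / 0                   # -Inf
0 / 0                    # NaN
length(NULL)             # 0: NULL is an empty object, not a missing value

is.na(c(1, NA, NaN))     # FALSE TRUE TRUE: NaN also counts as missing
is.nan(c(1, NA, NaN))    # FALSE FALSE TRUE

y <- c(2, NA, 5)                     # made-up data with one missing value
y_clean <- na.omit(y)                # drops the NA and remembers its position
sum(y_clean)                         # 7

obs <- data.frame(a = c(1, NA, 3),   # made-up table with gaps
                  b = c("u", "v", NA))
complete.cases(obs)                  # TRUE FALSE FALSE: only row 1 has no NA
```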

2.2.3 Set operations

For manipulating sets there are a few essential functions, union(), intersect() and setdiff(), and the operator %in%.

Code
set_A <- c("a", "a", "b", "c", "D")
set_B <- c("a", "b", "d")
union(set_A, set_B)
## [1] "a" "b" "c" "D" "d"
intersect(set_A, set_B)
## [1] "a" "b"
set_A %in% set_B
## [1]  TRUE  TRUE  TRUE FALSE FALSE
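The remaining function, setdiff(), is not symmetric, which a short sketch with the same two sets makes visible.

```r
set_A <- c("a", "a", "b", "c", "D")
set_B <- c("a", "b", "d")

setdiff(set_A, set_B)   # elements of set_A missing from set_B: "c" "D"
setdiff(set_B, set_A)   # elements of set_B missing from set_A: "d"
"D" %in% set_B          # FALSE: the comparison is case-sensitive
unique(set_A)           # duplicates removed: "a" "b" "c" "D"
```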

Operators fall into arithmetic, relational, logical and assignment categories; the table also lists the set functions.

| Sign | Meaning |
|---|---|
| + , - , * , / , %% , %/% , ** or ^ , %*% | arithmetic operators (plus, minus, multiply, divide, modulo, integer division, power and matrix multiplication) |
| > , >= , < , <= , == , != | relational operators (greater than, greater than or equal, less than, less than or equal, equal, not equal) |
| ! , & , && , \| , \|\| | logical operators (negation, and, or) |
| ~ | formula (functional relationship) |
| <- , = , <<- , -> | assignment operators |
| $ | indexing by name in heterogeneous structures |
| : | range (sequence) |
| isTRUE() , all() , any() , %in% , setdiff() | logical tests and set functions |
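A few of these operators in action; a minimal sketch with made-up values.

```r
7 %/% 2                          # integer division: 3
7 %% 2                           # remainder: 1
(7 %/% 2) * 2 + 7 %% 2 == 7      # TRUE: quotient and remainder rebuild 7
2^3 == 2**3                      # TRUE: both spell the power operator

c(TRUE, FALSE) & c(TRUE, TRUE)   # element-wise and: TRUE FALSE
TRUE && FALSE                    # single-value and, as used in if(): FALSE

m <- matrix(1:4, nrow = 2)       # 2 x 2 matrix, filled column by column
m %*% m                          # matrix product, unlike the element-wise m * m
```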

2.2.4 Mathematical functions

| Function | Meaning |
|---|---|
| log(x) | logarithm of \(x\) to the base \(e\) |
| exp(x) | \(e^x\) |
| log(x, n) | logarithm of \(x\) to the base \(n\) |
| log10(x) | logarithm of \(x\) to the base \(10\) |
| sqrt(x) | square root of \(x\) |
| factorial(x) | \(x!\) |
| choose(n, k) | binomial coefficient \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) |
| ceiling(x) | smallest integer not less than \(x\) |
| floor(x) | largest integer not greater than \(x\) |
| trunc(x) | integer part of \(x\), i.e. rounding toward 0 |
| round(x, digits) | round \(x\) to digits decimal places |
| signif(x, digits) | round \(x\) to digits significant digits |
| cos(x) , sin(x) , tan(x) | trigonometric functions, \(x\) in radians |
| acos(x) , asin(x) , atan(x) | inverse trigonometric functions |
| abs(x) | absolute value |

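The rounding functions are easy to confuse, especially for negative numbers. A short sketch on a made-up vector shows the differences.

```r
x <- c(-2.7, -0.5, 0.5, 2.7)   # made-up test values

ceiling(x)   # rounds up:         -2  0  1  3
floor(x)     # rounds down:       -3 -1  0  2
trunc(x)     # rounds toward 0:   -2  0  0  2

round(3.14159, digits = 2)     # 3.14
signif(3.14159, digits = 2)    # 3.1

choose(5, 2)                   # 10 ways to pick 2 items out of 5
factorial(5) / (factorial(2) * factorial(3))   # the same value from the formula
sin(pi / 2)                    # 1: the argument is in radians
```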
Exercise

Evaluate the following expressions:
a) \(1 + 3 \cdot (2 / 3)\:\mathrm{mod}\:3\)
b) \(\dfrac{\sin(2.3)}{\cos(\pi)}\)
c) \(\sum\limits_{i = 1}^{53}i\)
d) \(\dfrac{-\infty}{0}\), \(\dfrac{-\infty}{\infty}\), \(\dfrac{0}{0}\)
e) \(\left(\dfrac{2}{35}\right)^{0.5} \cdot 3 \cdot (2 / 3)\)
f) \(20!\)
g) \(\int_{0}^{3\pi} \sin(x) dx\)

Matrix operations

Let’s say we have a set of linear equations

\[ \begin{matrix} 2x& - 3y& &= 3\\ & - 2y& + 4z &= 9\\ 2x& + 13y& + 9z&= 10 \end{matrix}\\ \tag{2.1}\]

Solving Equation 2.1 is a one-liner:

Code
A <- matrix(data = c(2, -3, 0, 0, -2, 4, 2, 13, 9), nrow = 3, byrow = TRUE)
B <- c(3, 9, 10)
solve(A, B)
[1]  0.5304878 -0.6463415  1.9268293
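It is good practice to check a computed solution by substituting it back into the system. A minimal sketch; the name sol is chosen here for illustration.

```r
A <- matrix(data = c(2, -3, 0, 0, -2, 4, 2, 13, 9), nrow = 3, byrow = TRUE)
B <- c(3, 9, 10)

sol <- solve(A, B)                    # solution vector (x, y, z)
A %*% sol                             # multiplying back reproduces B as a 3 x 1 matrix
all.equal(as.vector(A %*% sol), B)    # TRUE, up to floating-point rounding
det(A) != 0                           # TRUE: a non-zero determinant means a unique solution
```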

2.3 R as programming language

2.3.1 Variables and name conventions

Using spaces and diacritical marks in names is strongly discouraged, as in the Czech translation of the term “variable”, proměnná. Most programmers use either camelNotation or snake_notation for naming. R is case-sensitive, so camelNotation and CamelNotation are two different things. Variable names must not contain spaces, quotes, or arithmetic, logical or relational operators, nor special characters such as =, - or `. Objects cannot be named after key words.

Key words

if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_

It is not recommended to include a dot in a name, as in morava.prutoky, or to reuse the names of commonly used functions. Because R is case-sensitive, X does not equal x.

Exercise

Intuitively, we might be tempted to load our data into a variable called data. This is a poor choice, however, since data() is a function that accesses the datasets shipped with the basic R installation. Try it out.

Some cases of possible but wrong naming

aaa, Morávka průtok [m/s], moje.proměnná

2.3.2 Rules of quotation marks and parentheses

Both are paired characters in R. Brackets come in three versions: round brackets (parentheses), square brackets and curly brackets (braces). Each has a specific, non-overlapping use.

  • () follow directly after a function name and delimit the space where the function arguments are specified.
  • [] are always used with the name of an object (vector, array, list, …) and signal subsetting of the object.
  • {} mark a block of code that should be executed as one unit.

Quotation marks introduce text strings. Both “double” and ‘single’ quotes can be used at will; they just need to be closed with the same type. Backticks are also common and are used, for example, to delimit a non-standard column name in a structure.
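The three kinds of brackets and the quoting rules can be seen together in a small sketch. All object names, and the column name flow rate, are made up for illustration.

```r
v <- c(10, 20, 30)        # (): arguments passed to the function c()
v[2]                      # []: select the second element, 20

avg <- {                  # {}: a block evaluated as one unit
  s <- sum(v)
  s / length(v)           # the value of the last expression is the value of the block
}
avg                       # 20

identical("text", 'text')       # TRUE: both quote types create the same string
quoted <- 'He said "hi"'        # the other quote type can be embedded freely

# Backticks allow access to a non-standard name (here with a space)
d <- data.frame(`flow rate` = 1:2, check.names = FALSE)
d$`flow rate`                   # 1 2
```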

2.3.3 Functions

You can define your own functions using the function() construct. If you work in RStudio, just type fun and press Tab to insert a snippet offered by the IDE. The action produces the following template.

Code
name <- function(variables) {
  ...
}

Here name is the name of the function we would like to create and variables are its arguments. The space between { and } is called the body of the function and contains all the computation that is carried out when the function is called.

Here is an example of creating our own function to calculate the weighted mean

\[ \bar{x} = \dfrac{\sum\limits_{i=1}^{n} w_ix_i}{\sum\limits_{i=1}^{n}w_i}, \] where \(x_i\) are the individual measurements and \(w_i\) their weights.

We define a simple function for that purpose and run an example.

Code
w_mean <- function(x, w = rep(1, length(x))) {
  # default: equal weights, one weight per element of x
  sum(x * w) / sum(w)
}
w_mean(1:10)
[1] 5.5

Here is a different example, a function that finds the number in a vector x nearest to a given reference value:

Code
x <- rnorm(100)
# return the element(s) of x closest to the reference value
nejblizsi_hodnota <- function(x, value) {
  x[which(abs(x - value) == min(abs(x - value)))]
}

cat("The value in x closest to 0 is:", nejblizsi_hodnota(x = x, value = 0))
The value in x closest to 0 is: 0.008426125

We can test whether we get the same result as R's built-in weighted.mean() function using all.equal().

Code
all.equal(w_mean(x = 1:5, w = c(0.25, 0.25, 1, 2, 3)), 
          weighted.mean(x = 1:5, w = c(0.25, 0.25, 1, 2, 3)))
[1] TRUE

Any argument without a default value in the function definition has to be supplied in the function call. You will frequently see functions that accept ..., the so-called three-dot construct or ellipsis. The ellipsis allows any number of additional arguments to be passed in a function call, after all the named ones.
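A common use of the ellipsis is to forward extra arguments to another function. The following is a minimal sketch; scaled_mean() is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: multiplies the mean of x by a factor and
# forwards any extra arguments (...) unchanged to mean()
scaled_mean <- function(x, factor, ...) {
  factor * mean(x, ...)
}

scaled_mean(c(1, 2, NA), factor = 10)                 # NA: the missing value propagates
scaled_mean(c(1, 2, NA), factor = 10, na.rm = TRUE)   # 15: na.rm travelled through ...

# 'factor' has no default value, so leaving it out is an error
msg <- tryCatch(scaled_mean(1:3), error = function(e) conditionMessage(e))
msg
```

Catching the error with tryCatch() keeps the script running and lets us inspect the message.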

2.3.4 Data types

The basic types are logical, integer, numeric, complex, character and raw. We will also encounter some additional types, such as Date. Since R is dynamically typed, the user does not need to declare variables before using them. The type of an object also changes without notice depending on the stored values, and the chain goes from the least general type to the most general one (logical, integer, numeric, complex, character). The examples and the coercion table below give a summary.

Code
TRUE    # logical, also T as short version
## [1] TRUE
1L      # integer
## [1] 1
1.2     # numeric
## [1] 1.2
1+3i    # complex
## [1] 1+3i
"A"     # character, also 'A'
## [1] "A"

These types represent the individual elements of data structures.

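Basic types and coercions: the result of converting the row type to the column type with as.___(). The character row assumes text that cannot be interpreted as a value of the target type.

| from \ to | logical | integer | numeric | complex | character |
|---|---|---|---|---|---|
| logical | logical | integer | numeric | complex | character |
| integer | logical | integer | numeric | complex | character |
| numeric | logical | integer (fractional part dropped) | numeric | complex | character |
| complex | logical | integer + warning | numeric + warning | complex | character |
| character | NA | NA_integer_ + warning | NA_real_ + warning | NA_complex_ + warning | character |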

Two families of functions are connected to data types: is.___() and as.___(). The former test the data type, the latter coerce it. Also try class() and mode().

Code
is.character("ABC")
[1] TRUE
Code
as.integer(11 + 1i)
Warning: imaginary parts discarded in coercion
[1] 11
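The silent type change along the chain logical, integer, numeric, complex, character can be observed by combining values with c(). A minimal sketch:

```r
class(c(TRUE, 2L))         # "integer": the logical value is promoted
class(c(TRUE, 2L, 3.5))    # "numeric"
class(c(3.5, 2+0i))        # "complex"
class(c(3.5, "a"))         # "character": everything becomes text

as.integer(3.9)            # 3: the fractional part is dropped, not rounded
as.numeric("3.14")         # 3.14: text that looks like a number converts cleanly
as.logical(c("TRUE", "yes", "T"))   # TRUE NA TRUE: "yes" is not recognised
is.numeric(1L)             # TRUE: integers also count as numeric
```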
Exercise
  1. Create in any way a vector x of 10 different numerical values, where \(x\in\mathbb{R}\).
  2. Write an expression to select numbers between -5 and 5 from this vector.
  3. Convert to integer type and discuss the result.
  4. Append three elements “A”, “B” and “C” to the vector. How has the vector changed?

2.3.5 Data structures

Vectors

Atomic vectors are single-type linear structures. Their elements can be of any one type: logical, integer, numeric, complex or character. The vector is the basic building block of the R language; there is no separate scalar quantity, since a single value is simply a vector of length one. The concept of a vector is understood here in the mathematical sense, as a vector of values representing a point in \(n\)-dimensional space.

\[ \mathbf{\mathrm{u}} = \begin{pmatrix} 1\\ 1.5\\ -14\\ 7.223\\ \end{pmatrix}, \qquad \mathbf{\mathrm{v}} = \begin{pmatrix} \mathrm{TRUE}\\ \mathrm{FALSE}\\ \mathrm{TRUE}\\ \mathrm{TRUE}\\ \end{pmatrix}, \qquad \mathbf{\mathrm{u^T}} = \begin{pmatrix} 1 & 1.5 & -14 & 7.223\\ \end{pmatrix} \]

Many functions create a vector. Among the most used are vector(mode = "numeric", length = 10) and c(), and a vector also results from the subset operators [ and [[.

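An important rule is tied to vectors: value recycling. When two vectors of different lengths meet in an operation, the shorter one is repeated until it matches the length of the longer one.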

Code
v <- c(1.4, 2.0, 6.1, 2.7)
u <- c(2.0, 1.3)
u + v      # adding two vectors; the length of v is a multiple of the length of u
u * v      # multiplying the same two vectors
u * 2.3    # multiplying by a single numeric value
[1] 3.4 3.3 8.1 4.0
[1]  2.80  2.60 12.20  3.51
[1] 4.60 2.99
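Recycling can be written out explicitly with rep(), and R warns when the longer length is not a multiple of the shorter one. A minimal sketch; the vector w is made up for the illustration.

```r
u <- c(2.0, 1.3)
v <- c(1.4, 2.0, 6.1, 2.7)

# u + v behaves as if u were first repeated to the length of v
identical(u + v, rep(u, times = 2) + v)   # TRUE

w <- c(1, 2, 3)   # length 3 does not divide length 4
warned <- tryCatch({ v + w; FALSE }, warning = function(cond) TRUE)
warned            # TRUE: R would still recycle, but it issues a warning
```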

Working with vectors
Code
# vector creation by different approaches: sequences, repetitions and sampling
x <- 1:10
x <- seq(from = 10, to = 1)
x <- vector(mode = "numeric", length = 10)
x <- replicate(n = 10, expr = eval(2))
x <- sample(x = 10, size = 10, replace = TRUE)
x <- rep(x = 15, times = 2)
x <- rnorm(n = 10, mean = 2, sd = 20)
t(x) * x                            # t(x) transposes x into a 1 x 10 matrix; * then multiplies element-wise
names(x) <- LETTERS[seq_along(x)]   # naming the elements of a vector
x[x > 0]                            # selection of elements based on a condition
x[1:3]                              # selection of elements based on an index
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]    [,7]     [,8]
[1,] 16.88178 7.716215 495.0691 1146.226 3.849214 134.5762 28.2889 495.4629
         [,9]    [,10]
[1,] 203.6558 733.8789
        B         F         G         H         I 
 2.777808 11.600697  5.318731 22.258997 14.270804 
         A          B          C 
 -4.108745   2.777808 -22.250148 
Code
```{r}
#| label: test-code-annotation
V <- vector(mode = "numeric", length = 0) # create an empty numeric vector
V[1] <- "A"                               # assigning text coerces V to character
```

Matrices and arrays

If an object has more than one dimension, it is treated as an array. A matrix is the special case of a two-dimensional array. Both object types have accompanying functions such as colSums() and rowMeans().

Matrix-related functions

| Function | Meaning |
|---|---|
| nrow(), ncol() | number of rows, columns in a matrix |
| dim() | both dimensions (rows and columns) at once |
| det() | matrix determinant |
| eigen() | eigenvalues, eigenvectors |
| colnames() | column names of a matrix |
| rowSums() | row sums of a matrix |
| colMeans() | column means of a matrix |
| M[m, ] | selection of the \(m\)-th row of a matrix |
| M[ ,n] | selection of the \(n\)-th column of a matrix |

Code
x <- c(1:10)
dim(x) <- c(2, 5)   # reshape the vector to dimension 2 x 5
x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
Code
M <- matrix(data = 0, nrow = 5, ncol = 2) # create a 5 x 2 matrix of zeros
M[1, 1] <- 1                              # set the single value in row 1, column 1
M[, 1] <- 1.5                             # store 1.5 in the whole first column
M[c(1,3), 1:2] <- rnorm(2)                # store two random numbers in rows 1 and 3 (recycled over both columns)

colMeans(M) 
## [1]  0.5236945 -0.3763055
rowSums(M)
## [1] -1.229034  1.500000 -2.534021  1.500000  1.500000

Matrices can hold any single data type, e.g.

\[ M = \left(\begin{matrix} \mathrm{A} & \mathrm{B}\\ \mathrm{C} & \mathrm{D} \end{matrix}\right),\qquad N = \left(\begin{matrix} 1+i & 5-3i\\ 10+2i & i \end{matrix}\right) \]
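The two matrices above can be built directly, and a small numeric matrix illustrates det() and eigen(). A minimal sketch; the diagonal matrix P is made up for the illustration.

```r
# matrix() fills column by column, so the values are listed in that order
M <- matrix(c("A", "C", "B", "D"), nrow = 2)
N <- matrix(c(1+1i, 10+2i, 5-3i, 0+1i), nrow = 2)

typeof(M)          # "character"
typeof(N)          # "complex"
M[2, ]             # second row: "C" "D"
dim(N)             # 2 2

P <- matrix(c(2, 0, 0, 3), nrow = 2)   # made-up diagonal matrix
det(P)             # 6, the product of the diagonal
eigen(P)$values    # 3 2: eigenvalues in decreasing order
```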

Data frames

The data.frame structure is the workhorse of elementary data processing. It is a possibly heterogeneous, table-like structure that allows different data types (even other structures) to be stored in different columns. A column of a data frame is called a variable and a row represents a single observation. If the data satisfy this condition, we say they are in tidy format. Processing tidy data is a big topic within the R community, and the curious reader is encouraged to follow the development of the tidyverse package ecosystem.

Code
thaya <- data.frame(date = NA, 
                    runoff = NA, 
                    precipitation = NA) # new data.frame with one row of NA and variables 'date', 'runoff' and 'precipitation'
#thaya$runoff <- rnorm(100, 1, 2)
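A data frame is more often created with its columns filled right away. The sketch below uses the same three variables; the name thaya_demo and all values are synthetic and serve only as an illustration.

```r
set.seed(1)   # make the random runoff values reproducible

thaya_demo <- data.frame(
  date          = seq(as.Date("2020-01-01"), by = "day", length.out = 5),
  runoff        = rnorm(5, mean = 1, sd = 2),   # synthetic values
  precipitation = c(0, 1.2, 0, 3.4, 0.5)        # synthetic values
)

dim(thaya_demo)              # 5 observations (rows), 3 variables (columns)
sapply(thaya_demo, class)    # each column keeps its own type
thaya_demo$precipitation     # a single variable is an ordinary vector
thaya_demo[2, ]              # a single observation is a one-row data frame
```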

Lists

The list is the most general basic data structure. A list can store vectors, matrices, data frames and also other lists. The list structure places no restrictions on the lengths of the objects it holds.

Code
l <- list() # empty list creation 
l["A"] <- 1
print(l)
$A
[1] 1
Code
l$A <- 2
print(l)
$A
[1] 2
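The difference between [ and [[ matters most for lists. A minimal sketch with a made-up list named station:

```r
# Elements of different types and lengths, including a nested list
station <- list(
  id     = 1L,
  values = c(2.5, 3.5),
  labels = c("a", "b", "c"),
  nested = list(flag = TRUE)
)

station[["values"]]    # [[ returns the element itself: a numeric vector
station["values"]      # [ returns a list of length 1 containing the element
station$nested$flag    # $ can be chained into nested lists: TRUE
lengths(station)       # length of every element: 1 2 3 1
```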

Other objects

Although R is intended as a functional programming language, more than one object-oriented paradigm is implemented in it. As new R users we first encounter an OOP system in functions like summary() and plot(), which are so-called S3 generic functions. We will further work with the S4 system when processing geospatial data using proxy libraries like sf and terra. OOP is a complex topic and will not be discussed further in this text. For further study we recommend the OOP sections of Advanced R by Hadley Wickham.
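To give a feeling for what an S3 generic does, here is a minimal sketch. The class gauge and its summary method are hypothetical and exist only for this illustration.

```r
# An S3 object is an ordinary structure with a class attribute
g <- structure(list(name = "demo", flow = c(1.2, 3.4, 2.2)),
               class = "gauge")          # hypothetical class name

# A method is a function named <generic>.<class>
summary.gauge <- function(object, ...) {
  avg <- mean(object$flow)
  cat("Gauge", object$name, "has mean flow", avg, "\n")
  invisible(avg)                         # return the mean without printing it twice
}

res <- summary(g)    # summary() dispatches to summary.gauge() based on class(g)
class(g)             # "gauge"
```

The same call summary() therefore behaves differently for a data frame, a model or our gauge object, depending only on the class.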

2.3.6 Control flow

Conditions and loops govern the flow of a computation; they are the building blocks of algorithms.

2.3.6.1 Conditions

A condition in code creates a branching of the computation. Placing a condition creates at least two options, of which only one is carried out. A condition is created with the if(), ifelse() or switch() construct. We can again call up a snippet in RStudio, resulting in

Code
if (condition) {
  ...
}

switch (object,
  case = action
)

ifelse(test, yes, no)

if()
Code
A <- 1
if(A >= 1) {
  cat("A is larger than or equal to 1.")
}
A is larger than or equal to 1.
Code
A <- 5
# the chain of conditions closes at the first condition that evaluates to TRUE
if(A >= 2) {
  cat("A is larger than or equal to 2.")
} else if(A > 2) {
  cat("A is larger than 2.")
}
A is larger than or equal to 2.

ifelse()

ifelse() is the vectorized form of a condition; a typical call looks like

Code
x <- -5:5
cat("Element x - 3 is more than 0: ", ifelse(x - 3 > 0, yes = "Yes", no = "No"))
Element x - 3 is more than 0:  No No No No No No No No No Yes Yes

switch()

Code
variant <- "B"
2 * (switch(
      variant,
        "A" = 2,     # the "A" variant does not match,
        "B" = 3))    # the "B" variant matches instead, so the result is 2 * 3 = 6
[1] 6
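switch() also supports a default value and shared branches. A minimal sketch; describe() is a hypothetical helper written for this illustration.

```r
# Hypothetical helper wrapping switch()
describe <- function(variant) {
  switch(variant,
         "A" = "first",
         "B" = ,                  # empty branch: falls through to the next value
         "C" = "second or third",
         "unknown")               # unnamed last value acts as the default
}

describe("A")   # "first"
describe("B")   # "second or third"
describe("Z")   # "unknown"
```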
Exercise

Implement the following grading scheme:

| Grade | Result |
|---|---|
| A | 90 % - 100 % |
| B | 75 % - 89 % |
| C | 60 % - 74 % |
| D | < 60 % |

2.3.7 Loops

Loops (cycles) give us the ability to execute a single statement or a block of code in {} multiple times. There are three key words for loop construction, and they differ in their use cases.

for cycle

Probably the most common loop, for is used when the number of iterations is known before the loop starts. The iteration is therefore explicitly finite.

Code
for (variable in vector) {
  ...
}

An example

Code
for(i in 1:4) cat(i, ". iteration", "\n", sep = "")
1. iteration
2. iteration
3. iteration
4. iteration

while cycle

while is used when it is impossible to state in advance how many times something should be repeated. The logic is rather: as long as some condition is met, repeat what is inside the body. It is also used for intentionally infinite loops, e.g. in operating systems.

Code
i <- 1
while(i < 5) {
  cat("Iteration ", i, "\n", sep = "")
  i <- i + 1
}
Iteration 1
Iteration 2
Iteration 3
Iteration 4

repeat cycle

repeat is used when the body has to run at least once; the code inside is evaluated repeatedly until a condition is met and break ends the loop.

Code
i <- 1
repeat {                       # execute the body in a loop
  cat("Iteration", i, "\n")
  i <- i + 1
  if(i >= 5) break             # once the condition is met, break stops the cycle
}
Iteration 1 
Iteration 2 
Iteration 3 
Iteration 4 

break and next

Two statements control the flow of iteration. Whenever break is called, the rest of the body is skipped and the loop ends. Whenever next is called, the rest of the body is skipped and the next iteration starts.
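Both statements can be combined in one loop. A minimal sketch; the vector kept is introduced here only to record which iterations ran to the end of the body.

```r
kept <- integer(0)            # collects the values that reach the end of the body

for (i in 1:10) {
  if (i %% 2 == 0) next       # even number: skip the rest and continue with i + 1
  if (i > 7) break            # first odd number above 7: leave the loop completely
  cat("i =", i, "\n")
  kept <- c(kept, i)
}

kept                          # 1 3 5 7
```

The loop never reaches 10: at i = 9 the break statement ends it.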

Exercise
  1. Create a cycle which, for the numbers \(x=\{1, 2, 3, 4, 5\}\), writes out \(x^3\).
  2. Calculate the cumulative sum of these numbers.
  3. Calculate the factorial of the number x.
  4. With the help of the readline() function (which requests input from the user), read a number and print it in a loop. If the given number is negative, the loop ends.