2 Introduction to R language

R is a dynamically typed interpreted language. It is a highly capable enviornment for computing and graphics, for which R is often labeled a glue language. It has built-in functions for statistical computing.

Many great sources for R language with the focus on data science and statistical data processing are already written and free online. We will just name a few and limit the introduction at this site to bare minimum, which we use within this class.

High quality R sources:

CRAN manuals like Introduction to R at https://cran.r-project.org/doc/manuals/R-intro.pdf
The R Book is a very comprehensive source, although strongly oriented towards statistics
freeCodeCamp’s R programming tutorial is a very mild introduction to scripting language
Advanced R (2nd edition) from Hadley Wickham contains a lot of information about langugae fundamentals and extra information.

For the newcomer to R ecosystem we highly encourage to print and use the various official cheat sheets, created by the R community and Posit Inc.

Some basic commands

Code

version   # start by checking version of your R, should be at least 4.0 and above
getwd()   # get working directory
ls()      # list object in current environment
print()   # evaluate to console
rm()      # removes object from 
View()    # show internals of and object
...

Then there are commands that work with file structure such as

Code

list.files()  
dir.create()
dir.exists()
file.exists()
...

and you can also invoke any shell/cmd command plus attributes with system2() interface.

2.1 Getting help

It is entered into the console in the form help(<function name>) or ?<function name>. If we would like to look directly into the code of the function, it is also possible, we just enter the name of the function in the console without brackets, or use the command View(<function name>). In addition, R also has help.search(<function name>) under the shortcut ??, which searches for full-text help across installed packages. Furthermore, it is still possible to search the R language mailing list using the function RSiteSearch(), which opens a new window of the predefined browser. In addition, thematically integrated help cards are very useful: ?Logical, ?Constants, ?Control, ?Arithmetic, ?Syntax, ?Special etc.

Exercise

Try to find help to DateTimeClasses. a) What do the POSIXct a POSIXlt represent? b) What is the difference between them? c) Find a function for calculating $5!$

2.2 R as scientific calculator

2.2.1 Arithmetic operations

Code

1 + 2           # addition
## [1] 3
1 - 2           # subtraction 
## [1] -1
1 / 2           # division
## [1] 0.5
1 * 2           # multiplication
## [1] 2
1 %/% 2         # integer division
## [1] 0
1 %% 2          # modulo oprator
## [1] 1

2.2.2 Special values

R is familiar with the concept of $\pm\infty$, hence -Inf and Inf values are at disposal. You will get them most probably as results from computation heading to $\frac{\pm1}{0}$ numerically. There are other special values like NULL (null value), NA (not assigned) and NaN (not a number). The concept of not assigned is one that is particularly important, since it has significant impact on the computed result ({(code-mean-rm?)}). NA is of default type logical. Otherwise it si possible to specify missing value in all other data type like NA_real_ (matches double), NA_integer_, NA_complex_ and NA_character_, these are all usable in pre-allocation of memory for data structures. Try to find the usage of functions na.omit(), is.na(), complete.cases().

Code

x <- seq(1:10)
x[c(5,6)] <- NA
print(x)
mean(x)
mean(x, na.rm = TRUE)

1: General sequence of numbers
2: change some elements to not assigned
3: without removal
4: and with removal

 [1]  1  2  3  4 NA NA  7  8  9 10
[1] NA
[1] 5.5

2.2.3 Set operations

For manipulating sets, there are a couple of essential functions union(), intersect(), setdiff() and operator %in%.

Code

set_A <- c("a", "a", "b", "c", "D")
set_B <- c("a", "b", "d")
union(set_A, set_B)
## [1] "a" "b" "c" "D" "d"
intersect(set_A, set_B)
## [1] "a" "b"
set_A %in% set_B
## [1]  TRUE  TRUE  TRUE FALSE FALSE

The operators fall in arithmetic, relation, assign categories and we also put set functions here.

Sign	Meaning
`+` , `-` , `` , `/` , `%%` , `%/%` , `` nebo `^`, `%%`	arithmetic operators (plus, minus, multiply, divide, modulo, integer division, power and matrix multiplication)
`>` ,`>=` , `<` , `<=` , `==` , `!=`	relation operators (larger/smaller than, equal, not equal)
`!` , `&` , `&&` , `\|` , `\|\|`	logical (negation, and, or)
`~`	functional relationship
`<-` , `=`, `<<-`, `->`	assign operator
`$`	naming indexation in heterogenic structures
`:`	rangea
`isTRUE()` , `all()` , `any()` , `%in%` , `setdiff()`	set functions

2.2.4 Mathematical functions

Function	Meaning
`log(x)`	logarithm $x$ to the base $e$
`exp(x)`	$x(e^x)$
`log(x, n)`	logarithm $x$ base $n$
`log10(x)`	logarithm $x$ base $10$
`sqrt(x)`	square root from $x$
`factorial`(x)	$x!$
`choose(n, x)`	binomial coefficients \[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]
`ceiling(x)`	smallest integer large than $x$
`floor(x)`	largest integer before $x$
`trunc(x)`	closest number between $x$ a 0
`round(x, digits)`	round $x$ to $n$ decimals
`signif(x, digits)`	round $x$ to $n$ significant numbers
`cos(x)` , `sin(x)` , `tan(x)`	function ins rad
`acos(x)` , `asin(x)` , `atan(x)`	inverse trigonometric functions
`abs(x)`	absolute value

Exercise

Evaluate the following expressions:
a) $1 + 3 \cdot (2 / 3)\:\mathrm{mod}\:3$
b) $\dfrac{\sin(2.3)}{\cos(\pi)}$
c) $\sum\limits_{i = 1}^{53}i$
d) $\dfrac{-\infty}{0}$, $\dfrac{-\infty}{\infty}$, $\dfrac{0}{0}$
e) $\left(\dfrac{2}{35}\right)^{0.5} \cdot 3 \cdot (2 / 3)$
f) $20!$
g) $\int_{0}^{3\pi} \sin(x) dx$

Matrix operations

Let’s say we have a set of linear equations

\[ \begin{matrix} 2x& - 3y& &= 3\\ & - 2y& + 4z &= 9\\ 2x& + 13y& + 9z&= 10 \end{matrix}\\ \tag{2.1}\]

Solving {Equation 2.1} is a one-liner:

Code

A <- matrix(data = c(2, -3, 0, 0, -2, 4, 2, 13, 9), nrow = 3, byrow = TRUE)
B <- c(3, 9, 10)
solve(A, B)

[1]  0.5304878 -0.6463415  1.9268293

2.3 R as programming language

2.3.1 Variables and name conventions

It is highly discouraged using spaces and diacritical marks in naming, like the Czech translation of the term “variable” - proměnná. Most programmers use either camelNotation or snake_notation for naming purposes. Obviously the R is case-sensitive so camelNotation and CamelNotation are two different things. Variables do not contain spaces, quotes, arithmetical, logical nor relational operators neither they contain special characters like =, -, ``. Objects cannot be named by key words.

Key words

if, else, repeat, while, function, for, in, next, repeat, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_

It is not recommended to inlude dot in the name, like morava.prutoky, and to match the names with commonly used functions. R is “case-sensitive” which means, that X does not equal x.

Exercise

Intuitively, we might be guided to load the data into the data variable. This is the wrong however, since data() is a function to access datasets that are part of the basic R installation. Try it out.

Some cases of possible but wrong naming

aaa, Morávka průtok [m/s], moje.proměnná

2.3.2 Rules of quotation marks and parenthesses

Both represent the paired characters in R. Parenthesses are used in three versions: classical, square brackets and curly brackets (braces). All of them have specific non-overlaping usage.

() are always to be found right next to a function name they delineate the space where function arguments are to be specified.
[] are always use with the name of the object (vector, array, list, …) and signalize subselecting from the object.
{} mark a block of code, which should be executed at once.

Quotation marks introduce text strings. Both “double” and ‘single’ quotes can be used completely at will, they just need to be closed with the same type. Back quotes are also common and are used, for example, to delimit a non-standard column name in a structure.

2.3.3 Functions

You can define own functions using the function() construct. If you work in ****RStudio, just type fun and tabulate a snippet from the IDE help. The action produces {(code-function-snippet?)}.

Code

name <- function(variables) {
  ...
}

name is the name of the function we would like to create and variables are the arguments of that function. Space between the {and } is called a body of a function and contains all the computation which is invoked when the function is called.

Let’s put Here an example of creating own function to calculate weighted mean

\[ \bar{x} = \dfrac{\sum\limits_{i=1}^{n} w_ix_i}{\sum\limits_{i=1}^{n}w_i}, \] where $x_iw_i$ are the individual weighted measurements.

We define a simple function for that purpose and run an example.

Code

w_mean <- function(x, w = 1/length(x)) {
  sum(x*w)/sum(w)
}
w_mean(1:10)

[1] 55

Here is a different example:

Code

x <- rnorm(100)
nejblizsi_hodnota <- function(x, value) {
  x[which(abs(x - value) == min(abs(x - value)))]
}

cat("Hodnota nejblíže 0 z vektoru x je:" , nejblizsi_hodnota(x = x, value = 0))

1: Example of function, which seeks the neares number from a vector x to a certain referential value.

Hodnota nejblíže 0 z vektoru x je: -0.00756462

We can test if we get the same result as the primitive function from R using all.equal() statement.

Code

all.equal(w_mean(x = 1:5, w = c(0.25, 0.25, 1, 2, 3)), 
          weighted.mean(x = 1:5, w = c(0.25, 0.25, 1, 2, 3)))

[1] TRUE

Any argument without default value in the function definition has to be provided on function call. You can frequently see functions with the possibility to specify ... a so-called three dot construct or ellipsis. The ellipsis allows for adding any number of arguments to a function call, after all the named ones.

2.3.4 Data types

The basic types are logical, integer, numeric, complex, character and raw. There are some additional types which we will encounter like Date. Since R is dynamically typed, it is not necessary for the user to declare variables before using them. Also the type changes without notice based on the stored values, where the chain goes from the least complex to the most. The summary is in the following table

Code

TRUE    # logical, also T as short version
## [1] TRUE
1L      # integer
## [1] 1
1.2     # numeric
## [1] 1.2
1+3i    # complex
## [1] 1+3i
"A"     # character, also 'A'
## [1] "A"

They represent the individual elements of data structures. R dynamically typed and does not require declarations before usage.

Basic types and coercions.
	logical	integer	numeric	complex	character
logical	`logical`	`integer`	`numeric`	`complex`	`character`
integer	`logical`	`integer`	`numeric`	`complex`	`character`
numeric	`logical`	`numeric`	`numeric`	`complex`	`character`
complex	`logical`	`integer` + warning	`numeric` + warning	`complex`	`character`
character	`NA_logical`	`NA_integer` + warning	`NA_numeric` + warning	`NA_complex` + warning	`character`

Two types of functions are connected to data types: is.___ a as.___. Is is either questioning or coertion of data type. Try also class(), mode().

Code

is.character("ABC")

[1] TRUE

Code

as.integer(11 + 1i)

Warning: imaginary parts discarded in coercion

[1] 11

Exercise

Create in any way a vector x of 10 different numerical values, where $x\in\mathbb{R}$.
Write an expression to select numbers between -5 and 5 from this vector.
Convert to integer type and discuss the result.
Add 3 positions “A”, “B” and “C” to the vector, has the vector changed?

2.3.5 Data structures

Vectors

Atomic vectors are single-type linear structures. They can contain elements of any type, from logical, integer, numeric, complex, character. A vector is a basic building structure in the R language, there is nothing like a scalar quantity here. The concept of vector is understood here in the mathematical sense as a vector of values representing a point in $n$-dimensional space.

\[ \mathbf{\mathrm{u}} = \begin{pmatrix} 1\\ 1.5\\ -14\\ 7.223\\ \end{pmatrix}, \qquad \mathbf{\mathrm{v}} = \begin{pmatrix} \mathrm{TRUE}\\ \mathrm{FALSE}\\ \mathrm{TRUE}\\ \mathrm{TRUE}\\ \end{pmatrix}, \qquad \mathbf{\mathrm{u^T}} = \begin{pmatrix} 1 & 1.5 & -14 & 7.233\\ \end{pmatrix} \]

Many functions lead to creation of a vector, among the most used are vector(mode = "numeric", length = 10), function c(), or using subset operators [ or [[.

An important rule is tied to vectors - value recycling.

Code

v <- c(1.4, 2.0, 6.1, 2.7)
u <- c(2.0, 1.3)
u + v
u * v
u * 2.3

1: Adding two vectors while length of second is the multiple of the first
2: Multiplying two vectors while length of second is the multiple of the first
3: Multipling with single numeric value

[1] 3.4 3.3 8.1 4.0
[1]  2.80  2.60 12.20  3.51
[1] 4.60 2.99

Working with vectors

Code

x <- 1:10
x <- seq(10:1)
x <- vector(mode = "numeric", length = 10)
x <- replicate(n = 10, expr = eval(2))
x <- sample(x = 10, size = 10, replace = TRUE)
x <- rep(x = 15, times = 2)
x <- rnorm(n = 10, mean = 2, sd = 20)
t(x) * x
names(x) <- LETTERS[1:length(x)]
x[x > 0]
x[1:3]

1: Vector creation $\boldsymbol{\mathrm{x}}$ by different approaches. Sequences, repeats, repetitions and sampling
2: Transposition of vector.
3: Naming elements of a vector.
4: Selection of elements from a vector based on a condition.
5: Selection of elements from a vector based on an index.

         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
[1,] 79.85034 39.35162 45.08142 14.08098 314.5691 1315.004 14.31418 26.11836
         [,9]    [,10]
[1,] 761.2817 511.2969
        A         C         D         E         F         H         J 
 8.935902  6.714270  3.752463 17.736097 36.262987  5.110613 22.611875 
        A         B         C 
 8.935902 -6.273087  6.714270

Code

```{r}
#| label: test-code-annotation
V <- vector(mode = "numeric", length = 0) # empty numeric vector creation
V[1] <- "A"
```

Matrices and arrays

If the object has more than one dimension, it is treated as an array. A special type of array is a matrix. Both object types have accompanying functions like colSums(), rowMeans().

List of `matrix` bounded functions
Function	Meaning
`nrow()`, `ncol()`	number of rows, columns in matrix
`dim()`	dtto
`det()`	matrix determinant
`eigen()`	eigenvalues, eigenvectors
`colnames()`	column names in matrix
`rowSums()`	row sums in matrix
`colMeans()`	column means of matrix
`M[m, ]`	Selection of $m$-th row of matrix
`M[ ,n]`	Selection of $n$-th column of matrix

Code

x <- c(1:10)
dim(x) <- c(2, 5)
x

1: Conversion to $2\times 2$ dimension

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

Code

M <- matrix(data = 0, nrow = 5, ncol = 2) # empty matrix creation
M[1, 1] <- 1                              # add single value at origin
M[, 1] <- 1.5                             # store 1.5 to the whole first column
M[c(1,3), 1:2] <- rnorm(2)                # store random numbers to first two rows

colMeans(M) 
## [1]  0.703141 -0.196859
rowSums(M)
## [1] -1.0855828  1.5000000 -0.8830068  1.5000000  1.5000000

It is possible to have matrices containing any data type, e.g.

\[ M = \left(\begin{matrix} \mathrm{A} & \mathrm{B}\\ \mathrm{C} & \mathrm{D} \end{matrix}\right),\qquad N = \left(\begin{matrix} 1+i & 5-3i\\ 10+2i & i \end{matrix}\right) \]

Data frames

data.frame structure is the workhorse of elementary data processing. It is a possibly heterogenic table-like structure, allowing storage of multiple data types (even other structures) in different columns. A column in any data frame is called a variable and row represents a single observation. If the data suffice this single condition, we say they are in tidy format. Processing tidy data is a big topic withing the R community and curious reader is encouraged to follow the development in tidyverse package ecosystem.

Code

thaya <- data.frame(date = NA, 
                    runoff = NA, 
                    precipitation = NA) # new empty data.frame with variables 'date', 'runoff', 'precipitation' and 'temperature'
#thaya$runoff <- rnorm(100, 1, 2)

Lists

List is the most general basic data structure. It is possible to store vectors, matrices, data frames and also other lists within a list. List structure does not pose any limitations on the internal objects lengths.

Code

l <- list() # empty list creation 
l["A"] <- 1
print(l)

$A
[1] 1

Code

l$A <- 2
print(l)

$A
[1] 2

Other objects

Although R is intended as functional programming language, more than one object oriented paradigm is implemented in the language. As new R users we encounter first OOP system in functions like summary and plot, which represent so called S3 generic functions. We will further work with S4 system when processing geospatial data using proxy libraries like sf and terra. The OOP is very complex and will not be further discussed within this text. For further study we recommend OOP sections in Advanced R by Hadley Wickham.

2.3.6 Control flow

Condition and cycles govern the run of the general flow of calculation, they are the building blocks of algorithms.

2.3.6.1 Conditions

A condition in code creates branching of computation. Placing a condition creates at least two options from which only one is to be satisfied. The condition is created either by if()/ifelse() or switch() construct. We can again call for a snippet from RStudio help resulting in

Code

if (condition) {
  ...
}

switch (object,
  case = action
)

ifelse(test, TRUE, FALSE)

`if()`

Code

A <- 1
if(A >= 1) {
  cat("A larger than or equal 1.")
}

A larger than or equal 1.

Code

A <- 5
if(A >= 2) {
  cat("A is larger than or equal 2.")
} else if(A > 2) {
  cat("A is larger than 2.")
}

1: The chain of conditions will close at the first evaluation which happens to be TRUE.

A is larger than or equal 2.

`ifelse()`

Vectorized condition, in general looks like

Code

x <- -5:5
cat("Element x + 3 is more than 0: ", ifelse(x - 3 > 0, yes = "Yes", no = "No"))

Element x + 3 is more than 0:  No No No No No No No No No Yes Yes

`switch()`

Code

variant <- "B"
2 * (switch(
      variant,
        "A" = 2,
        "B" = 3))

1: “A” variant did not happen,
2: instead the “B” variant is truthful, so the expression is evaluated as $2\cdot 3 = 6$

[1] 6

Exercise

Create a following grading scheme:

Grade	Result
A	90 % - 100 %
B	75 % - 89 %
C	60 % - 74 %
D	< 60 %

2.3.7 Loops

Loops (cycles) provide use with the ability to execute single statement of a block of code in {} multiple times. There are three key words for loop construction. They differ in use cases.

`for` cycle

Probably the most common loop is used when you know the number of iterations prior to calling. The iteration is therefore explicitly finite.

Code

for (variable in vector) {
  ...
}

An example

Code

for(i in 1:4) cat(i, ". iteration", "\n", sep = "")

1. iteration
2. iteration
3. iteration
4. iteration

`while` cycle

while is used in when it is impossible to state how many times something should be repeated. The case is rather in the form while some condition is or is not met, repeat what is inside the body. It is also used in intentionally infinite loop e.g. operating systems.

Code

i <- 1
while(i < 5) {
  cat("Iteration ", i, "\n", sep = "")
  i <- i + 1
}

Iteration 1
Iteration 2
Iteration 3
Iteration 4

`repeat` cycle

In the cases when we need the repetition at least once, we will evaluate the code inside until a condition is met.

Code

i <- 1
repeat {
  cat("Iteration", i, "\n")
  i <- i + 1
  if(i >= 5) break
}

1: Execute in loop,
2: if a condition is met, break stops the cycle.

Iteration 1 
Iteration 2 
Iteration 3 
Iteration 4

`break` and `next`

There are two statements which controls the iteration flow. Anytime break is called, the rest of the body is skipped and the loop ends. Anytime next is called, the rest of the body is skipped and next iteration is started.

Exercise

Create a cycle, which for the numbers $x={1, 2, 3, 4, 5}$ writes out $x^3$.
Calculates the cumulative sum for these.s
Calculates the factorial number for the number x.
Withe the help of readline() function (requests a number from the user), prints the number. If the given number is negative, the loop ends.

Some basic commands

2.1 Getting help

2.2 R as scientific calculator

2.2.1 Arithmetic operations

2.2.2 Special values

2.2.3 Set operations

2.2.4 Mathematical functions

Matrix operations

2.3 R as programming language

2.3.1 Variables and name conventions

Key words

Exercise

Some cases of possible but wrong naming

2.3.2 Rules of quotation marks and parenthesses

2.3.3 Functions

2.3.4 Data types

2.3.5 Data structures

Vectors

Working with vectors

Matrices and arrays

Data frames

Lists

Other objects

2.3.6 Control flow

2.3.6.1 Conditions

if()

ifelse()

switch()

2.3.7 Loops

for cycle

while cycle

repeat cycle

break and next

`if()`

`ifelse()`

`switch()`

`for` cycle

`while` cycle

`repeat` cycle

`break` and `next`