1  Introduction to R language

R is a dynamically typed interpreted language. It is a highly capable enviornment for computing and graphics, for which R is often labeled a glue language. It has built-in functions for statistical computing.

Many great sources for R language with the focus on data science and statistical data processing are already written and free online. We will just name a few and limit the introduction at this site to bare minimum, which we use within this class.

High quality R sources:

For the newcomer to R ecosystem we highly encourage to print and use the various official cheat sheets, created by Posit Inc..

Some basic commands

Code
version   # start by checking version of your R, should be at least 4.0 and above
getwd()   # get working directory
ls()      # list object in current environment
print()   # evaluate to console
rm()      # removes object from 
View()    # show internals of and object
...

Then there are commands that work with file structure such as

Code
list.files()  
dir.create()
dir.exists()
file.exists()
...

and you can also invoke any shell/cmd command plus attributes with system2() interface.

1.1 R as scientific calculator

Arithmetic operations

Code
1 + 2           # addition
## [1] 3
1 - 2           # subtraction 
## [1] -1
1 / 2           # division
## [1] 0.5
1 * 2           # multiplication
## [1] 2
1 %/% 2         # integer division
## [1] 0
1 %% 2          # modulo oprator
## [1] 1

Special values

R is familiar with the concept of \(\pm\infty\), hence -Inf and Inf values are at disposal. You will get them most probably as results from computation heading to \(\frac{\pm1}{0}\) numerically. There are other special values like NULL (null value), NA (not assigned) and NaN (not a number). The concept of not assigned is one that is particularly important, since it has significant impact on the computed result ({(code-mean-rm?)}).

Code
x <- seq(1:10)             # general sequence of numbers
x[c(5,6)] <- NA            # change some elements to not assigned
print(x)
 [1]  1  2  3  4 NA NA  7  8  9 10
Code
mean(x)                    # without removal
[1] NA
Code
mean(x, na.rm = TRUE)      # and with removal
[1] 5.5

Set operations

For manipulating sets, there are a couple of essential functions union(), intersect(), setdiff() and operator %in%.

Code
set_A <- c("a", "a", "b", "c", "D")
set_B <- c("a", "b", "d")
union(set_A, set_B)
## [1] "a" "b" "c" "D" "d"
intersect(set_A, set_B)
## [1] "a" "b"
set_A %in% set_B
## [1]  TRUE  TRUE  TRUE FALSE FALSE

Matrix operations

For the purpose of following examples let’s use an arbitrary matrix \(M\) and a vectors \(U\) and \(V\).

\[ \mathbf{A} = \left(\begin{matrix} 2x& - 3y& &= 3\\ & - 2y& + 4z &= 9\\ 2x& + 13y& + 9z&= 10 \end{matrix}\right),\\ \tag{1.1}\]

\[ \mathbf{u} = \begin{pmatrix} 1\\ -3\\ 8\\ \end{pmatrix}, \mathbf{v} = \begin{pmatrix} 1\\ -3\\ 8\\ \end{pmatrix} \tag{1.2}\]

Solving a system of linear equations {Equation 1.1} is a one-liner:

Code
A <- matrix(data = c(2, -3, 0, 0, -2, 4, 2, 13, 9), nrow = 3, byrow = TRUE)
B <- c(3, 9, 10)
solve(A, B)
[1]  0.5304878 -0.6463415  1.9268293

1.2 R as programming language

1.2.1 Variables and name conventions

It is possible. We highly discourage using diacritical marks in naming, like the Czech translation of the term “variable” - proměnná. Most programmers use either camelNotation or snake_notation for naming purposes. Obviously the R is case-sensitive so camelNotation and CamelNotation are two different things. Variables do not contain spaces, quotes, arithmetical, logical nor relational operators neither they contain special characters like =, -, ``.

1.2.2 Functions

You can define own functions using the function() construct. If you work in ****RStudio, just type fun and tabulate a snippet from the IDE help. The action produces {(code-function-snippet?)}.

Code
name <- function(variables) {
  ...
}

name is the name of the function we would like to create and variables are the arguments of that function. Space between the {and } is called a body of a function and contains all the computation which is invoked when the function is called.

Let’s put Here an example of creating own function to calculate weighted mean

\[ \bar{x} = \dfrac{\sum\limits_{i=1}^{n} w_ix_i}{\sum\limits_{i=1}^{n}w_i}, \] where \(x_iw_i\) are the individual weighted measurements.

We define a simple function for that purpose and run an example.

Code
w_mean <- function(x, w = 1/length(x)) {
  sum(x*w)/sum(w)
}
w_mean(1:10)
[1] 55

We can test if we get the same result as the primitive function from R using all.equal() statement.

Code
all.equal(w_mean(x = 1:5, w = c(0.25, 0.25, 1, 2, 3)), 
          weighted.mean(x = 1:5, w = c(0.25, 0.25, 1, 2, 3)))
[1] TRUE

Any argument without default value in the function definition has to be provided on function call. You can frequently see functions with the possibility to specify ... a so-called three dot construct or ellipsis. The ellipsis allows for adding any number of arguments to a function call, after all the named ones.

1.2.3 Data types

The basic types are logical, integer, numeric, complex, character and raw. There are some additional types which we will encounter like Date. Since R is dynamically typed, it is not necessary for the user to declare variables before using them. Also the type changes without notice based on the stored values, where the chain goes from the least complex to the most. The summary is in the following table

Code
TRUE    # logical, also T as short version
## [1] TRUE
1L      # integer
## [1] 1
1.2     # numeric
## [1] 1.2
1+3i    # complex
## [1] 1+3i
"A"     # character, also 'A'
## [1] "A"

1.2.4 Data structures

Vectors

Atomic vectors are single-type linear structures. They can contain elements of any type, from logical, integer, numeric, complex, character.

Code
```{r}
#| label: test-code-annotation
V <- vector(mode = "numeric", length = 0) # empty numeric vector creation
V[1] <- "A"
```

Matrices and arrays

If the object has more than one dimension, it is treated as an array. A special type of array is a matrix. Both object types have accompanying functions like colSums(), rowMeans().

Code
M <- matrix(data = 0, nrow = 5, ncol = 2) # empty matrix creation
M[1, 1] <- 1                              # add single value at origin
M[, 1] <- 1.5                             # store 1.5 to the whole first column
M[c(1,3), 1:2] <- rnorm(2)                # store random numbers to first two rows

colMeans(M) 
## [1]  0.767096 -0.132904
rowSums(M)
## [1] -0.7383973  1.5000000 -0.5906430  1.5000000  1.5000000

It is possible to have matrices containing any data type, e.g.

\[ M = \left(\begin{matrix} \mathrm{A} & \mathrm{B}\\ \mathrm{C} & \mathrm{D} \end{matrix}\right),\qquad N = \left(\begin{matrix} 1+i & 5-3i\\ 10+2i & i \end{matrix}\right) \]

Data frames

data.frame structure is the workhorse of elementary data processing. It is a possibly heterogenic table-like structure, allowing storage of multiple data types (even other structures) in different columns. A column in any data frame is called a variable and row represents a single observation. If the data suffice this single condition, we say they are in tidy format. Processing tidy data is a big topic withing the R community and curious reader is encouraged to follow the development in tidyverse package ecosystem.

Code
thaya <- data.frame(date = NA, 
                    runoff = NA, 
                    precipitation = NA) # new empty data.frame with variables 'date', 'runoff', 'precipitation' and 'temperature'
#thaya$runoff <- rnorm(100, 1, 2)

Lists

List is the most general basic data structure. It is possible to store vectors, matrices, data frames and also other lists within a list. List structure does not pose any limitations on the internal objects lengths.

Code
l <- list() # empty list creation 
l["A"] <- 1
print(l)
$A
[1] 1
Code
l$A <- 2
print(l)
$A
[1] 2

Other objects

Although R is intended as functional programming language, more than one object oriented paradigm is implemented in the language. As new R users we encounter first OOP system in functions like summary and plot, which represent so called S3 generic functions. We will further work with S4 system when processing geospatial data using proxy libraries like sf and terra. The OOP is very complex and will not be further discussed within this text. For further study we recommend OOP sections in Advanced R by Hadley Wickham.

1.2.5 Conditions

A condition in code creates branching of computation. Placing a condition creates at least two options from which only one is to be satisfied. The condition is created either by if()/ifelse() or switch() construct. We can again call for a snippet from RStudio help resulting in

Code
if (condition) {
  ...
}

switch (object,
  case = action
)

ifelse(test, TRUE, FALSE)

The condition can be branched to larger structures like

Code
temperature <- 30
if (temperature > 30) {
  cat("The temperature is hot.")
} else if (temperature > 15) {
  cat("The temperature is warm.")
}
The temperature is warm.

1.2.6 Repetition

Loops (cycles) provide use with the ability to execute single statement of a block of code in {} multiple times. There are three key words for loop construction. They differ in use cases.

1.2.6.1 for cycle

Probably the most common loop is used when you know the number of iterations prior to calling. The iteration is therefore explicitly finite.

Code
for (variable in vector) {
  ...
}

An example

Code
for(i in 1:4) cat(i, ". iteration", "\n", sep = "")
1. iteration
2. iteration
3. iteration
4. iteration

1.2.6.2 while cycle

while is used in when it is impossible to state how many times something should be repeated. The case is rather in the form while some condition is or is not met, repeat what is inside the body. It is also used in intentionally infinite loop e.g. operating systems.

1.2.6.3 repeat cycle

In the cases when we need the repetition at least once, we will evaluate the code inside until a condition is met.

Code
repeat({
  x <- rnorm(1)
  cat(x)
  if(x > 0) break
})
-1.0191321.047872

1.2.6.4 break and next

There are two statements which controls the iteration flow. Anytime break is called, the rest of the body is skipped and the loop ends. Anytime next is called, the rest of the body is skipped and next iteration is started.

I would like to have text here

Sentence becomes longer, it should automatically stay in their column

and here

More text

1.2.7 Exercise

  1. create a sequence of numbers calling
  2. What type is NA, why would you say is it?