R is a dynamically typed, interpreted language. It is a highly capable environment for computing and graphics, and it is often labeled a glue language. It has built-in functions for statistical computing.
Many great sources on the R language with a focus on data science and statistical data processing are already written and freely available online. We will name just a few and limit the introduction here to the bare minimum that we use within this class.
Advanced R (2nd edition) by Hadley Wickham contains a lot of information about language fundamentals and beyond.
Newcomers to the R ecosystem are highly encouraged to print and use the various official cheat sheets created by the R community and Posit Inc.
Some basic commands
Code
version  # start by checking the version of your R, it should be 4.0 or above
getwd()  # get the working directory
ls()     # list objects in the current environment
print()  # print an object to the console
rm()     # remove an object from the environment
View()   # show the internals of an object
...
Then there are commands that work with file structure such as
and you can also invoke any shell/cmd command together with its arguments through the system2() interface.
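As a minimal sketch of commands that work with the file structure, the following uses a few base R functions inside a temporary directory. The choice of functions (file.path(), dir.create(), file.create(), list.files(), file.exists(), unlink()) and the folder and file names are assumptions made for illustration, not a list taken from this text.

```r
# Work inside a throw-away directory so nothing in your project is touched
sandbox <- file.path(tempdir(), "r_intro_demo")   # hypothetical folder name
dir.create(sandbox, showWarnings = FALSE)

# Create two empty files and list the content of the directory
file.create(file.path(sandbox, c("a.txt", "b.csv")))
found <- list.files(sandbox)
print(found)

# Test whether a single file exists
has_a <- file.exists(file.path(sandbox, "a.txt"))
print(has_a)

# Clean up: remove the directory together with its content
unlink(sandbox, recursive = TRUE)
```

Building paths with file.path() instead of pasting strings keeps the code portable between operating systems.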
2.1 Getting help
Help is requested in the console in the form help(<function name>) or ?<function name>. It is also possible to look directly into the code of a function: just enter the name of the function in the console without parentheses, or use the command View(<function name>). In addition, R has help.search("<keyword>"), available under the shortcut ??, which performs a full-text search of the help across installed packages. Furthermore, it is possible to search R documentation online using the function RSiteSearch(), which opens the results in the default browser. Thematic help pages are also very useful: ?Logical, ?Constants, ?Control, ?Arithmetic, ?Syntax, ?Special etc.
Exercise
Try to find help for DateTimeClasses. a) What do POSIXct and POSIXlt represent? b) What is the difference between them? c) Find a function for calculating \(5!\).
R is familiar with the concept of \(\pm\infty\), hence the values -Inf and Inf are at your disposal. You will most probably get them as results of computations heading numerically to \(\frac{\pm1}{0}\). There are other special values such as NULL (the null object), NA (not available, i.e. a missing value) and NaN (not a number). The concept of a missing value is particularly important, since it has a significant impact on computed results, as the example below shows. NA is of type logical by default. It is also possible to specify a missing value of other data types: NA_real_ (matches double), NA_integer_, NA_complex_ and NA_character_; these are all usable when pre-allocating memory for data structures. Try to find the usage of the functions na.omit(), is.na() and complete.cases().
Code
x <- 1:10               # 1: general sequence of numbers
x[c(5, 6)] <- NA        # 2: change some elements to missing values
print(x)
mean(x)                 # 3: without removal
mean(x, na.rm = TRUE)   # 4: and with removal
[1] 1 2 3 4 NA NA 7 8 9 10
[1] NA
[1] 5.5
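The functions mentioned above can be tried on a small example. This is only a sketch: the vector and the small table of observations are made-up values, and the variable names are chosen for illustration.

```r
x <- c(2.5, NA, 4.0, NaN, Inf)

is.na(x)       # TRUE for NA and also for NaN
is.nan(x)      # TRUE only for NaN
is.finite(x)   # FALSE for NA, NaN and Inf

# na.omit() drops NA and NaN but keeps Inf; as.numeric() strips its attributes
x_clean <- as.numeric(na.omit(x))
print(x_clean)

# complete.cases() marks the rows of a table that contain no missing value
obs <- data.frame(runoff = c(1.2, NA, 0.8), precip = c(5, 3, NA))
keep <- complete.cases(obs)
print(obs[keep, ])

# pre-allocation of a numeric vector with typed missing values
buffer <- rep(NA_real_, 3)
typeof(buffer)
```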
2.2.3 Set operations
For manipulating sets, there are a few essential functions: union(), intersect(), setdiff() and the operator %in%.
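A short sketch of these set functions on two character vectors follows. The station codes are hypothetical and serve only as an illustration.

```r
upstream   <- c("A", "B", "C", "D")   # hypothetical station codes
downstream <- c("C", "D", "E")

all_st    <- union(upstream, downstream)       # elements found in either set
both_st   <- intersect(upstream, downstream)   # elements found in both sets
only_up   <- setdiff(upstream, downstream)     # in the first set but not in the second
is_member <- "B" %in% downstream               # membership test for a single value

print(all_st)
print(both_st)
print(only_up)
print(is_member)
```

Note that setdiff() is not symmetric: swapping the arguments returns the elements found only in the second vector.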
A <- matrix(data = c(2, -3, 0, 0, -2, 4, 2, 13, 9), nrow = 3, byrow = TRUE)
B <- c(3, 9, 10)
solve(A, B)  # solve the linear system A x = B
[1] 0.5304878 -0.6463415 1.9268293
2.3 R as programming language
2.3.1 Variables and name conventions
Using spaces and diacritical marks in names is highly discouraged, as in the Czech translation of the term “variable”, proměnná. Most programmers use either camelNotation or snake_notation for naming purposes. R is case-sensitive, so camelNotation and CamelNotation are two different things. Variable names should not contain spaces, quotes, arithmetic, logical or relational operators, or special characters such as =, - or `. Objects cannot be named by key words.
Key words
if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_
It is not recommended to include a dot in the name, like morava.prutoky, or to reuse the names of commonly used functions.
Exercise
Intuitively, we might be tempted to load the data into a variable named data. This is wrong, however, since data() is a function for accessing datasets that are part of the basic R installation. Try it out.
Some cases of possible but wrong naming
aaa, Morávka průtok [m/s], moje.proměnná
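The naming rules can be checked directly in the console. The sketch below uses made-up variable names; make.names() is a base R function that turns arbitrary strings into syntactically valid names, so it shows which candidates would need repair.

```r
runoff_mean <- 1.8   # snake_notation
runoffMean  <- 2.4   # camelNotation, a different object

# R is case-sensitive: only the exact spelling exists
lower_exists <- exists("runoff_mean")
upper_exists <- exists("Runoff_mean")
print(c(lower_exists, upper_exists))

# Valid names pass unchanged; spaces, key words and leading digits are repaired
fixed <- make.names(c("runoff_mean", "my variable", "if", "2nd"))
print(fixed)
```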
2.3.2 Rules of quotation marks and parentheses
Both are paired characters in R. Brackets are used in three versions: classical parentheses, square brackets and curly brackets (braces). All of them have specific, non-overlapping usage.
() are always found right next to a function name; they delineate the space where function arguments are specified.
[] are always used with the name of an object (vector, array, list, …) and signal subselecting from the object.
{} mark a block of code which should be executed at once.
Quotation marks introduce text strings. Both “double” and ‘single’ quotes can be used interchangeably; they just need to be closed with the same type. Backticks are also common and are used, for example, to delimit a non-standard column name in a structure.
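The roles of the individual paired characters can be illustrated in a few lines. This is a sketch with made-up values; the column name flow [m3/s] is a hypothetical example of a non-standard name.

```r
v <- c(a = 1, b = 2, c = 3)

total  <- sum(v)      # (): arguments of a function call
second <- v[2]        # []: subselection, the element keeps its name
value  <- v[["b"]]    # [[ ]]: a single element without its name

result <- {           # {}: a block of code, its value is the last expression
  y <- total * 2
  y + 1
}

same <- identical("text", 'text')   # both quote types create the same string

# backticks delimit a non-standard column name
flows <- data.frame(`flow [m3/s]` = c(1.2, 3.4), check.names = FALSE)
print(flows$`flow [m3/s]`)
```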
2.3.3 Functions
You can define your own functions using the function() construct. If you work in RStudio, just type fun and press Tab to insert a snippet. The action produces the following template.
Code
name <- function(variables) {
  ...
}
Here, name is the name of the function we would like to create and variables are the arguments of that function. The space between { and } is called the body of the function and contains all the computation which is invoked when the function is called.
Here is an example of creating a function that calculates the weighted mean
\[
\bar{x} = \dfrac{\sum\limits_{i=1}^{n} w_ix_i}{\sum\limits_{i=1}^{n}w_i},
\] where \(x_i\) are the individual measurements and \(w_i\) are their weights.
We define a simple function for that purpose and run an example.
Code
w_mean <- function(x, w = rep(1, length(x))) {  # equal weights by default
  sum(x * w) / sum(w)
}
w_mean(1:10)
[1] 5.5
Here is a different example:
Code
x <- rnorm(100)
# 1: a function which seeks the number in a vector x nearest to a reference value
nejblizsi_hodnota <- function(x, value) {
  x[which(abs(x - value) == min(abs(x - value)))]
}
cat("The value closest to 0 in vector x is:", nejblizsi_hodnota(x = x, value = 0))
The value closest to 0 in vector x is: 0.008426125
We can test whether w_mean() gives the same result as the built-in weighted.mean() function by using all.equal().
Code
all.equal(
  w_mean(x = 1:5, w = c(0.25, 0.25, 1, 2, 3)),
  weighted.mean(x = 1:5, w = c(0.25, 0.25, 1, 2, 3))
)
[1] TRUE
Any argument that has no default value in the function definition and is used by the function has to be provided in the function call. You will frequently see functions that allow you to specify ..., the so-called three-dot construct or ellipsis. The ellipsis allows any number of additional arguments to be passed in a function call, typically after all the named ones.
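The ellipsis is most often used to forward arguments to other functions. The sketch below defines two hypothetical helpers, describe() and count_extra(), which are not part of base R; mean() and sd() are base functions that accept na.rm.

```r
# Everything passed through ... is forwarded to mean() and sd()
describe <- function(x, ...) {
  c(mean = mean(x, ...), sd = sd(x, ...))
}

obs <- c(1, 2, NA, 4)
plain   <- describe(obs)                 # the missing value propagates
cleaned <- describe(obs, na.rm = TRUE)   # na.rm travels through ...
print(plain)
print(cleaned)

# The arguments captured by ... can be collected into a list
count_extra <- function(...) length(list(...))
n_extra <- count_extra(1, "a", TRUE)
print(n_extra)
```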
2.3.4 Data types
The basic types are logical, integer, numeric, complex, character and raw. There are some additional types which we will encounter, like Date. Since R is dynamically typed, it is not necessary to declare variables before using them. The type also changes without notice based on the stored values: mixed values are coerced along the chain from the least general type to the most general one (logical, integer, numeric, complex, character). The basic literals look as follows; the coercions are summarized in the table further below.
Code
TRUE   # logical, also T as short version
## [1] TRUE
1L     # integer
## [1] 1
1.2    # numeric
## [1] 1.2
1+3i   # complex
## [1] 1+3i
"A"    # character, also 'A'
## [1] "A"
These types represent the individual elements of data structures.
Basic types and coercions.

| from / to | logical | integer | numeric | complex | character |
|---|---|---|---|---|---|
| logical | logical | integer | numeric | complex | character |
| integer | logical | integer | numeric | complex | character |
| numeric | logical | numeric | numeric | complex | character |
| complex | logical | integer + warning | numeric + warning | complex | character |
| character | NA_logical | NA_integer + warning | NA_numeric + warning | NA_complex + warning | character |
Two families of functions are connected to data types: is.___() and as.___(). They either test the data type or coerce a value to it. Try also class() and mode().
Code
is.character("ABC")
[1] TRUE
Code
as.integer(11 + 1i)
Warning: imaginary parts discarded in coercion
[1] 11
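The coercion chain can be observed with typeof() when values of different types are combined in one vector. The following is a small sketch with arbitrary values; the variable names are chosen only for illustration.

```r
# Combining values coerces them to the most general type present
t1 <- typeof(c(TRUE, 1L))     # "integer"
t2 <- typeof(c(1L, 1.5))      # "double", i.e. numeric
t3 <- typeof(c(1.5, 1+0i))    # "complex"
t4 <- typeof(c(1.5, "A"))     # "character"
print(c(t1, t2, t3, t4))

# Explicit coercion in the opposite direction
ok   <- as.integer("12")                      # works, the text is a number
fail <- suppressWarnings(as.integer("abc"))   # NA, normally with a warning
lgl  <- as.logical(c(0, 1, 5))                # zero is FALSE, the rest TRUE
n    <- sum(c(TRUE, FALSE, TRUE))             # logicals count as 0 and 1
print(list(ok, fail, lgl, n))
```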
Exercise
Create in any way a vector x of 10 different numerical values, where \(x\in\mathbb{R}\).
Write an expression to select numbers between -5 and 5 from this vector.
Convert to integer type and discuss the result.
Append three elements “A”, “B” and “C” to the vector. Has the vector changed?
2.3.5 Data structures
Vectors
Atomic vectors are single-type linear structures. They can contain elements of any one type: logical, integer, numeric, complex or character. A vector is the basic building structure in the R language; there is no separate scalar quantity here, a single value is simply a vector of length one. The concept of a vector is understood here in the mathematical sense, as a vector of values representing a point in \(n\)-dimensional space.
Many functions lead to the creation of a vector. Among the most used are vector(mode = "numeric", length = 10), the function c(), and the subset operators [ and [[.
An important rule tied to vectors is value recycling: when two vectors of different lengths meet in an operation, the shorter one is repeated to match the length of the longer one.
Code
v <- c(1.4, 2.0, 6.1, 2.7)
u <- c(2.0, 1.3)
u + v    # 1: adding two vectors where the length of one is a multiple of the other
u * v    # 2: multiplying two vectors where the length of one is a multiple of the other
u * 2.3  # multiplying by a single value, which is recycled over all elements
1. Creation of the vector \(\boldsymbol{\mathrm{x}}\) by different approaches: sequences, repetitions and sampling.
2. Transposition of a vector.
3. Naming elements of a vector.
4. Selection of elements from a vector based on a condition.
5. Selection of elements from a vector based on an index.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 16.88178 7.716215 495.0691 1146.226 3.849214 134.5762 28.2889 495.4629
[,9] [,10]
[1,] 203.6558 733.8789
B F G H I
2.777808 11.600697 5.318731 22.258997 14.270804
A B C
-4.108745 2.777808 -22.250148
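The operations listed above (creation, transposition, naming and selection) can be sketched as follows. The code is an illustration under assumptions, not the code that produced the output above: the seed, the distribution and all values are made up.

```r
set.seed(42)   # assumed seed, used only to make the sketch reproducible

# 1: vector creation by different approaches
x1 <- seq(from = 0, to = 1, by = 0.25)   # sequence
x2 <- rep(c(1, 2), times = 3)            # repetition
x3 <- sample(1:100, size = 10)           # random sampling without replacement
x  <- rnorm(10, mean = 0, sd = 10)       # random numbers

# 2: transposition turns the vector into a 1 x 10 matrix
xt <- t(x)
print(dim(xt))

# 3: naming elements of a vector
names(x) <- LETTERS[1:10]

# 4: selection based on a condition
positive <- x[x > 0]

# 5: selection based on an index, by name or by position
by_name     <- x[c("A", "B", "C")]
by_position <- x[1:3]
print(by_name)
```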
If an object has more than one dimension, it is treated as an array. A matrix is a special type of array with exactly two dimensions. Both object types have accompanying functions like colSums() and rowMeans().
M <- matrix(data = 0, nrow = 5, ncol = 2)  # create a matrix filled with zeros
M[1, 1] <- 1                               # add a single value at the origin
M[, 1] <- 1.5                              # store 1.5 in the whole first column
M[c(1, 3), 1:2] <- rnorm(2)                # store two random numbers in rows 1 and 3 (recycled over both columns)
colMeans(M)
## [1] 0.5236945 -0.3763055
rowSums(M)
## [1] -1.229034 1.500000 -2.534021 1.500000 1.500000
It is possible to have matrices containing any data type, e.g.
\[
M = \left(\begin{matrix}
\mathrm{A} & \mathrm{B}\\
\mathrm{C} & \mathrm{D}
\end{matrix}\right),\qquad
N = \left(\begin{matrix}
1+i & 5-3i\\
10+2i & i
\end{matrix}\right)
\]
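The two matrices above can be created in R as a quick check. In this sketch the object names M_chr and N_cplx are chosen only for illustration, to avoid overwriting the numeric matrix M.

```r
# Character matrix, filled row by row
M_chr <- matrix(c("A", "B", "C", "D"), nrow = 2, byrow = TRUE)

# Complex matrix, filled row by row
N_cplx <- matrix(c(1+1i, 5-3i, 10+2i, 0+1i), nrow = 2, byrow = TRUE)

print(typeof(M_chr))    # "character"
print(typeof(N_cplx))   # "complex"

# Indexing works the same way for every element type
print(M_chr[2, 1])

# Column sums are also defined for complex matrices
col_totals <- colSums(N_cplx)
print(col_totals)
```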
Data frames
The data.frame structure is the workhorse of elementary data processing. It is a possibly heterogeneous table-like structure that allows storage of multiple data types (even other structures) in different columns. A column of a data frame is called a variable and a row represents a single observation. If the data satisfy this condition, we say they are in tidy format. Processing tidy data is a big topic within the R community, and the curious reader is encouraged to follow the development of the tidyverse package ecosystem.
Code
thaya <- data.frame(date = NA, runoff = NA, precipitation = NA)  # new data.frame with a single row of NA in the variables 'date', 'runoff' and 'precipitation'
# thaya$runoff <- rnorm(100, 1, 2)
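A data frame is usually created with all its rows at once. The following sketch builds a small table with the same three variables; the object name thaya_demo, the dates, the seed and all values are made up for illustration.

```r
set.seed(1)   # assumed seed, only for reproducibility
n <- 5

thaya_demo <- data.frame(
  date          = seq(as.Date("2020-01-01"), by = "day", length.out = n),
  runoff        = rnorm(n, mean = 1, sd = 2),
  precipitation = NA_real_   # a single value is recycled to n rows
)

# Store one made-up observation in an existing column
thaya_demo$precipitation[2] <- 4.6

# Inspect the structure: one variable per column, one observation per row
str(thaya_demo)

# Select the rows where precipitation is known
wet <- thaya_demo[!is.na(thaya_demo$precipitation), ]
print(wet)
```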
Lists
A list is the most general basic data structure. It is possible to store vectors, matrices, data frames and also other lists within a list. The list structure does not pose any limitation on the lengths of the internal objects.
Code
l <- list()  # empty list creation
l["A"] <- 1
print(l)
$A
[1] 1
Code
l$A <- 2
print(l)
$A
[1] 2
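A list holding objects of different kinds and lengths can look like the sketch below. The object station and all its elements are hypothetical.

```r
station <- list(
  name  = "demo",                                 # character vector of length 1
  flows = c(1.2, 3.4, 2.2),                       # numeric vector
  grid  = matrix(1:4, nrow = 2),                  # matrix
  meta  = list(source = "made up", year = 2020)   # nested list
)

# Elements may have different lengths
sizes <- lengths(station)
print(sizes)

# $ and [[ ]] return the element itself, [ ] returns a shorter list
second_flow <- station$flows[2]
year        <- station[["meta"]]$year
sub         <- station["flows"]
print(class(sub))

# Assigning NULL removes an element
station$grid <- NULL
print(names(station))
```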
Other objects
Although R is intended as a functional programming language, more than one object-oriented system is implemented in it. As new R users, we first encounter OOP in functions like summary() and plot(), which are so-called S3 generic functions. We will also work with the S4 system when processing geospatial data using libraries like sf and terra. OOP is a complex topic and will not be discussed further in this text. For further study, we recommend the OOP sections of Advanced R by Hadley Wickham.
2.3.6 Control flow
Conditions and loops govern the flow of a calculation; they are the building blocks of algorithms.
2.3.6.1 Conditions
A condition in code creates a branching of the computation. Placing a condition creates at least two branches, of which only one is executed. A condition is created by the if(), ifelse() or switch() construct. We can again insert the corresponding snippets in RStudio, resulting in
Code
if (condition) {
  ...
}
switch (object,
  case = action
)
ifelse(test, TRUE, FALSE)
if()
Code
A <- 1
if (A >= 1) {
  cat("A is larger than or equal to 1.")
}
A is larger than or equal to 1.
Code
A <- 5
# 1: the chain of conditions closes at the first condition which evaluates to TRUE
if (A >= 2) {
  cat("A is larger than or equal to 2.")
} else if (A > 2) {
  cat("A is larger than 2.")
}
A is larger than or equal to 2.
ifelse()
ifelse() is a vectorized condition, which is evaluated for each element separately. In general it looks like
Code
x <- -5:5
cat("Element x - 3 is more than 0:", ifelse(x - 3 > 0, yes = "Yes", no = "No"))
Element x - 3 is more than 0: No No No No No No No No No Yes Yes
switch()
Code
variant <- "B"
2 * (switch(variant,
  "A" = 2,    # 1: the "A" variant did not happen,
  "B" = 3))   # 2: instead the "B" variant matches, so the expression is evaluated as 2 * 3 = 6
[1] 6
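The three constructs can be compared side by side on one small task. The sketch below is only an illustration: the functions classify_flow() and unit_factor(), the thresholds and the units are all made up.

```r
# if / else if / else works on a single value
classify_flow <- function(q) {
  if (q < 1) {
    "low"
  } else if (q < 10) {
    "medium"
  } else {
    "high"
  }
}
single <- classify_flow(5)
print(single)

# ifelse() does the same for a whole vector; calls can be nested
q <- c(0.4, 5, 25)
many <- ifelse(q < 1, "low", ifelse(q < 10, "medium", "high"))
print(many)

# switch() selects by name; an unnamed last argument serves as the default
unit_factor <- function(unit) {
  switch(unit,
    "m3/s" = 1,
    "l/s"  = 0.001,
    stop("unknown unit: ", unit)
  )
}
factor_ls <- unit_factor("l/s")
print(factor_ls)
```

Using if() with a vector of length greater than one is an error in recent versions of R, which is why ifelse() is used for the vector q.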
Exercise
Create the following grading scheme:

| Grade | Result |
|---|---|
| A | 90 % - 100 % |
| B | 75 % - 89 % |
| C | 60 % - 74 % |
| D | < 60 % |
2.3.7 Loops
Loops (cycles) provide us with the ability to execute a single statement or a block of code in {} multiple times. There are three key words for loop construction; they differ in their use cases.
for cycle
The for loop is probably the most common one. It is used when you know the number of iterations prior to calling. The iteration is therefore explicitly finite.
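A minimal sketch of a for loop follows. The vector runoff and the station names are made-up values; seq_along() is a base R function that generates the indices of a vector.

```r
runoff  <- c(1.2, 0.8, 3.1)            # made-up values
doubled <- numeric(length(runoff))     # pre-allocate the result

# Iterate over the positions of the vector
for (i in seq_along(runoff)) {
  doubled[i] <- 2 * runoff[i]
  cat("Step", i, "of", length(runoff), "\n")
}
print(doubled)

# A for loop can also iterate directly over the elements
for (name in c("A", "B")) {
  cat("Station", name, "\n")
}
```

Pre-allocating the result and filling it by index avoids growing the vector in every iteration.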
while cycle
The while loop is used when it is impossible to state in advance how many times something should be repeated. It rather takes the form: while some condition is (or is not) met, repeat what is inside the body. It is also used for intentionally infinite loops, e.g. in operating systems.
Code
i <- 1
while (i < 5) {
  cat("Iteration ", i, "\n", sep = "")
  i <- i + 1
}
Iteration 1
Iteration 2
Iteration 3
Iteration 4
repeat cycle
The repeat loop is used when the body has to be executed at least once; the code inside is evaluated repeatedly until a condition is met.
Code
i <- 1
repeat {
  cat("Iteration", i, "\n")   # 1: execute in a loop,
  i <- i + 1
  if (i >= 5) break           # 2: if the condition is met, break stops the cycle.
}
Iteration 1
Iteration 2
Iteration 3
Iteration 4
break and next
There are two statements which control the iteration flow. Whenever break is called, the rest of the body is skipped and the loop ends. Whenever next is called, the rest of the body is skipped and the next iteration is started.
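Both statements can be combined in one loop. In this sketch the range 1:10 and the limit 7 are arbitrary choices for illustration.

```r
kept <- integer(0)

for (i in 1:10) {
  if (i %% 2 == 0) next   # even number: skip the rest of the body
  if (i > 7) break        # first odd number above 7: leave the loop
  kept <- c(kept, i)
}

print(kept)
```

Only the odd numbers up to 7 are collected; the loop ends when i reaches 9.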
Exercise
Create a cycle which, for the numbers \(x=\{1, 2, 3, 4, 5\}\), writes out \(x^3\).
Calculate the cumulative sum of these numbers.
Calculate the factorial of the number x.
With the help of the readline() function (which requests a number from the user), print the given number. If the given number is negative, the loop ends.