A dataset is a collection of measurements used to create or evaluate assumptions about a population. In this session, only univariate data are studied.
At the end of the lecture you should be familiar with the following concepts:
hydrological year
data screening and cleaning
basic data summation and aggregation
descriptive statistics
visualization of statistical data
2.1 Exploratory Data Analysis (EDA)
EDA is not a strictly formalized set of procedures; rather, it is a general approach to how we study, describe, extract, evaluate and report on data. Hence the workflow is specific to the data at hand.
When studying data, we will look at certain statistical features, which represent estimated characteristics of the whole data set.
2.1.1 Measures of location
The farthest one can reduce a data set, while still retaining any information at all, is to a single value. Measures of location describe the central tendency of the data.
2.1.1.1 Mean
The arithmetic mean (or average) is simply the sum of the observations divided by the number of observations.
\[
\text{mean} = \dfrac{\text{sum of data}}{\text{number of data}}, \qquad \bar{x} = \dfrac{1}{n}\sum\limits_{i=1}^{n} x_i
\]
Code
set.seed(123) # setting the seed ensures that we get the same generated numbers every time
x <- rnorm(n = 30, mean = 50, sd = 10) # some random numbers generated from the normal distribution
x
##  [1] 44.39524 47.69823 65.58708 50.70508 51.29288 67.15065 54.60916 37.34939
##  [9] 43.13147 45.54338 62.24082 53.59814 54.00771 51.10683 44.44159 67.86913
## [17] 54.97850 30.33383 57.01356 45.27209 39.32176 47.82025 39.73996 42.71109
## [25] 43.74961 33.13307 58.37787 51.53373 38.61863 62.53815
mean(x) # the arithmetic mean of the sample
2.1.1.2 Median
The median is another central tendency, based on the sorted data set. It is the \(50\)th percentile (the \(0.5\) quantile) of the data. Half of the values are smaller than the median, and half are larger.
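A quick sketch on a small made-up sample (the values are chosen purely for illustration) shows the median and also why it is more robust to extremes than the mean:

```r
y <- c(10, 20, 30, 40, 1000) # hypothetical sample with one extreme value
median(y)                    # middle of the sorted values -> 30
mean(y)                      # the mean is pulled up by the outlier -> 220
```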
2.1.2 Measures of variability
With measures of variability we describe the spread of the data around its central tendency. Some degree of variability is a naturally occurring phenomenon.
2.1.2.1 Variation range
The simplest measure of spread is the variation range. It is computed as the difference between the two extremes of the data.
\[
R = \max(x) - \min(x)
\]
Code
max(x) - min(x)
[1] 37.5353
2.1.2.2 Variance
We calculate the variance as the average of the squares of the deviations of individual character values from their arithmetic mean.
\[
\sigma^2 = \dfrac{1}{n}\sum\limits_{i=1}^{n} (x_i - \bar{x})^2
\]
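As a sketch on a small illustrative sample: note that the formula above is the population variance (denominator \(n\)), whereas R's built-in `var()` computes the sample variance (denominator \(n-1\)):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)      # small illustrative sample
n <- length(x)
pop_var <- sum((x - mean(x))^2) / n # population variance (1/n), as in the formula -> 4
var(x)                              # R's var() divides by n - 1 instead -> 4.571429
```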
2.1.2.3 Standard deviation
The standard deviation is defined as the square root of the variance.
Code
sd(x)
## [1] 9.810307
var(x)^0.5
## [1] 9.810307
2.1.2.4 Coefficient of variation
The coefficient of variation is given by the ratio of the standard deviation to the arithmetic mean; it follows from the definition of this ratio that it is a dimensionless indicator.
\[
\nu = \frac{s}{\bar{x}}
\]
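A minimal sketch with hypothetical measurement values:

```r
x <- c(95, 100, 105, 98, 102) # hypothetical measurements
cv <- sd(x) / mean(x)         # dimensionless: sd and mean share the same units
cv                            # -> approximately 0.0381
```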
2.1.2.5 Interquartile range
The \(\text{IQR}\) is the upper quartile (\(75\)th percentile) minus the lower quartile (\(25\)th percentile).
\[
\text{IQR} = Q_{\text{III}} - Q_{\text{I}}
\]
Code
IQR(x)
[1] 11.60016
2.2 Hydrological data
Data sets containing hydrological data most commonly, although not exclusively, come in tabular (rectangular) shape. Let’s take a look at sample data from MOOPEX, a simple curated large-sample data set.
Let’s add some features to the data. For the purposes of the analysis it would be beneficial to add year, month, quarter, decade and hyear, the hydrological year.
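A sketch of deriving such features from a date column; the November–October convention for the hydrological year used here is common in Central Europe, and the column names are assumptions, not necessarily those of the lecture’s data set:

```r
# hypothetical data frame with a Date column `date`
d <- data.frame(date = as.Date(c("2000-10-31", "2000-11-01", "2001-03-15")))
d$year  <- as.integer(format(d$date, "%Y"))
d$month <- as.integer(format(d$date, "%m"))
# November and December are assigned to the following hydrological year
d$hyear <- ifelse(d$month >= 11, d$year + 1, d$year)
d
```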
How many freezing days in late spring (May) and summer?
10 highest and 10 lowest temperatures
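Both tasks can be sketched with logical indexing and sorting. The data and column names (`month`, `tmin`, `tmax`) below are synthetic placeholders standing in for the lecture’s data set:

```r
set.seed(1)
d <- data.frame(month = rep(1:12, each = 30),
                tmin  = rnorm(360, 5, 8),
                tmax  = rnorm(360, 15, 8))
# freezing days (minimum temperature below 0 degC) in May through August
sum(d$tmin < 0 & d$month %in% 5:8)
# 10 highest and 10 lowest temperatures
head(sort(d$tmax, decreasing = TRUE), 10)
head(sort(d$tmax), 10)
```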
These functions are based on grouping. In hydrology, the natural groups involve - stations/basins, decades/years/months, groundwater regions, etc.
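In base R such grouped summaries are typically done with `aggregate()`; a sketch on synthetic data (the column names are made up):

```r
# mean discharge per station and year
df <- data.frame(station = rep(c("A", "B"), each = 4),
                 year    = rep(c(2001, 2002), times = 4),
                 q       = c(1, 2, 3, 4, 5, 6, 7, 8))
aggregate(q ~ station + year, data = df, FUN = mean)
```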
2.3.1 Box-plot
Box plots come in handy when we want to visualize the important quantiles of a variable across the levels of a categorical variable.
Code
station <- read.delim("./data/6242910_Q_Day.Cmd.txt", skip = 36, header = TRUE, sep = ";",
                      col.names = c("date", "time", "value"))
station$date <- as.Date(station$date, format = "%Y-%m-%d")
station_agg <- aggregate(station$value ~ as.factor(data.table::month(station$date)),
                         FUN = "quantile", probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
names(station_agg) <- c("Month", "Discharge")
par(font.main = 1, adj = 0.1, xaxs = "i", yaxs = "i")
boxplot(data = station_agg, Discharge ~ Month,
        main = "Distribution of discharge in months",
        frame = FALSE, ylim = c(0, 20), xlab = "", ylab = bquote(Discharge),
        font.main = 1)
2.3.2 Q-Q plot
The so-called rankit graph produces a quantile-quantile plot of the values from the selected data. With it one can check whether the distribution of the data fits an assumed distribution, e.g. the Normal. qqline() then adds the corresponding theoretical line.
Code
par(font.main = 1, adj = 0.1, xaxs = "i", yaxs = "i")
qqnorm(mpx$tmax, pch = 21, col = "black", bg = "lightgray", cex = 0.5, main = "")
qqline(mpx$tmax)
2.3.3 Empirical distribution function
A large portion of the data processed in hydrology originates from time series of streamflow or runoff. This enables us to construct the empirical probability of exceeding a certain value \(x_k\) in the data, \(P(X\geq x_k)\), simply using the well-known definition
\[
P = \dfrac{m}{n}
\] where \(m\) is the number of values reaching or exceeding \(x_k\) and \(n\) is the length of the series. This equation is valid strictly for \(n \rightarrow \infty\).
There are several empirical formulas in use to calculate the empirical exceedance probability like the one from Čegodajev
\[
P = \dfrac{m - 0.3}{n + 0.4}
\] In R we can utilize a higher-order function called ecdf()
Code
ecdf_data <- ecdf(na.omit(mpx$r))
max(mpx$r, na.rm = TRUE)
ecdf_data(35) # proportion of values less than or equal to 35
1
Create the empirical cumulative distribution function (ECDF) for the data
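A sketch of constructing the empirical exceedance probabilities with the Čegodajev formula; the runoff series here is synthetic, and the names `r_sorted` and `exceedance_prob` are illustrative:

```r
set.seed(42)
r <- rgamma(100, shape = 2, scale = 5)   # synthetic runoff series
r_sorted <- sort(r, decreasing = TRUE)   # largest value gets rank m = 1
m <- seq_along(r_sorted)                 # ranks 1..n
n <- length(r_sorted)
exceedance_prob <- (m - 0.3) / (n + 0.4) # Cegodajev plotting position
head(exceedance_prob)
```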
We calculate the return period as inverse to the exceedance probability
Code
return_period <- 1 / exceedance_prob
plot(return_period, r_sorted, type = "l", xlab = "Return Period",
     ylab = "Runoff Value", main = "Empirical Return Period")
2.4 Hydrological indices
Hydrological indices originate from the combination of aggregation and summation methods with the measures of location and dispersion.
2.4.1 Runoff coefficient
The concept of the runoff coefficient comes from the presumption that, over a long time period, a roughly constant fraction of precipitation becomes runoff:
\[
\varphi [-] = \dfrac{R}{P}
\] where \(R\) represents long-term runoff and \(P\) long-term precipitation, typically over a 30-year so-called Normal period, a 30-year window shifted by decade; the current one being 1991–2020.
Code
R <- mean(aggregate(r ~ year, FUN = mean, data = mpx)[, "r"])
P <- mean(aggregate(prec ~ year, FUN = mean, data = mpx)[, "prec"])
R / P
[1] 0.493947
2.4.2 Flow duration curve
Code
M <- c(30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 355, 364)
m_day <- function(streamflow) {
  quantile(na.omit(streamflow), 1 - M / 365.25)
}
plot(M, m_day(mpx$r), type = "b", pch = 21, bg = "darkgray",
     xlab = "", ylab = "Discharge", main = "M-day discharge")
2.4.3 Baseflow
The total runoff comprises direct runoff, hypodermic (subsurface) runoff and baseflow. From the baseflow separation process, the baseflow index is usually calculated as \(\text{BFI}=\dfrac{\text{baseflow}}{\text{total runoff}}\)
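A minimal sketch of the index on a synthetic series; the running-minimum "separation" used here is a crude stand-in for a proper baseflow separation method, purely for illustration:

```r
q <- c(5, 6, 20, 15, 9, 7, 6, 5, 8, 12) # synthetic daily runoff
w <- 2                                  # half-width of the running-minimum window
# crude baseflow approximation: local minimum of the runoff in a moving window
baseflow <- sapply(seq_along(q), function(i) {
  min(q[max(1, i - w):min(length(q), i + w)])
})
bfi <- sum(baseflow) / sum(q)           # baseflow index
bfi
```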