2 Statistical processing of hydrological dataset

A dataset is a collection of measurements which is used to create or evaluate assumptions about a population. Within this session, only univariate data are studied.

At the end of the lecture you should be familiar with following concepts:

hydrological year
data screening and cleaning
basic data summation and aggregation
descriptive statistics
visualization of statistical data

2.1 Exploratory Data Analysis (EDA)

EDA is not a strictly formalized set of procedures, rather it is an general approach on how to study, describe, extract, evaluate and report about the data. Hence the workflow is specific to data in process.
When studying data, we will look at certain statistical features, which represent esteemed characteristcs for the whole data set.

2.1.1 Measures of location

The farthest one can reduce data, while still retaining any information at all is by a single value. They describe a central tendency of the data.

2.1.1.1 Mean

The arithmetic mean (or average) is simply the sum of observations devided by the number of observations.

\[ \text{mean} = \dfrac{\text{sum of data}}{\text{number of data}}, \qquad \bar{x} = \dfrac{1}{n}\sum\limits_{i=1}^{n} x_i \]

Code

set.seed(123) # using the set seed will ensure that we will get the samme generated numbers every time
x <- rnorm(n = 30, mean = 50, sd = 10) # some random-generated numbers from the normal distribution
x
##  [1] 44.39524 47.69823 65.58708 50.70508 51.29288 67.15065 54.60916 37.34939
##  [9] 43.13147 45.54338 62.24082 53.59814 54.00771 51.10683 44.44159 67.86913
## [17] 54.97850 30.33383 57.01356 45.27209 39.32176 47.82025 39.73996 42.71109
## [25] 43.74961 33.13307 58.37787 51.53373 38.61863 62.53815

2.1.1.2 Median

The median is another central tendency based on the sorted data set. It is the \(50\)th quantile in the data. A half of the values is smaller than median, and a half is larger.

\[ \tilde{x} = \]

Code

median(x)
## [1] 49.26267
(sort(x)[length(x)/2] + sort(x)[length(x)/2 + 1])/2
## [1] 49.26267

2.1.1.3 Mode

For continuous variables in real numbers, computation of modus does make sense when applied on bins of values.

Code

 [1] 44.39524 47.69823 65.58708 50.70508 51.29288 67.15065 54.60916 37.34939
 [9] 43.13147 45.54338 62.24082 53.59814 54.00771 51.10683 44.44159 67.86913
[17] 54.97850 30.33383 57.01356 45.27209 39.32176 47.82025 39.73996 42.71109
[25] 43.74961 33.13307 58.37787 51.53373 38.61863 62.53815

Code

cut(x, breaks = 10)

 [1] (41.6,45.3] (45.3,49.1] (64.1,67.9] (49.1,52.9] (49.1,52.9] (64.1,67.9]
 [7] (52.9,56.6] (34.1,37.8] (41.6,45.3] (45.3,49.1] (60.4,64.1] (52.9,56.6]
[13] (52.9,56.6] (49.1,52.9] (41.6,45.3] (64.1,67.9] (52.9,56.6] (30.3,34.1]
[19] (56.6,60.4] (41.6,45.3] (37.8,41.6] (45.3,49.1] (37.8,41.6] (41.6,45.3]
[25] (41.6,45.3] (30.3,34.1] (56.6,60.4] (49.1,52.9] (37.8,41.6] (60.4,64.1]
10 Levels: (30.3,34.1] (34.1,37.8] (37.8,41.6] (41.6,45.3] ... (64.1,67.9]

Code

y <- cut(x, breaks = seq(from = -10, to = 100, by = 10))
table(y)

y
 (-10,0]   (0,10]  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,70] 
       0        0        0        0        6        9       10        5 
 (70,80]  (80,90] (90,100] 
       0        0        0

Code

barplot(height = table(y))

2.1.1.4 \(n-\)th quantile

Code

quantile(x, probs = c(0.1, 0.9))

     10%      90% 
38.49171 62.84304

2.1.2 Measures of spread (variability)

With measures of variability we describe the spread of the data around its central tendency. Some degree of variability is a natural occurring phenomenon.

2.1.2.1 Variation range

The simples measure of spread is the variation range. It is computed as the difference of both end extremes of the data.

\[ R = \max{x} - \min{x} \]

Code

max(x) - min(x)

[1] 37.5353

2.1.2.2 Variance

We calculate the variance as the average of the squares of the deviations of individual character values from their arithmetic mean.

\[ \sigma^2 = \frac{1}{n}(x_i - \bar{x})^2 \]

2.1.2.3 Standard deviation

The standard deviation is defined as sqruare root of variance.

Code

sd(x)
## [1] 9.810307
var(x)^0.5
## [1] 9.810307

2.1.2.4 Coefficient of variation

The coefficient of variation is given by the ratio of the standard deviation and the arithmetic mean, and it follows from the definition of this ratio that it is a dimensionless indicator.

\[ \nu = \frac{s}{\bar{x}} \]

2.1.2.5 Interquartile range

The \(\text{IQR}\) is the upper quartile (\(75\)th percentile) minus the lower quartile (\(25\)th percentile)

\[ \text{IQR} = Q_{\text{III}} - Q_{\text{I}} \]

Code

IQR(x)

[1] 11.60016

2.2 Hydrological data

Data sets containing hydrological data are most commonly, although not exclusively, in tabular (rectangular) shape. Let’s take a look at sample data from MOOPEX. It is a simple curated large sample data set.

This dataset covers the same 438 catchments in daily step with measured discharge. The dataset is publicly available at https://hydrology.nws.noaa.gov/pub/gcip/mopex/US_Data/Us_438_Daily/

Code

url <- "./data/01138000.dly"
mpx <- read.fwf(file = url, 
                widths = c(8, rep(10, times = 5)), 
                header = FALSE)
names(mpx) <- c("date", "prec", "pet", "r", "tmax", "tmin")
mpx$date <- gsub(x = mpx$date, pattern = " ", replacement = "0")
mpx$date <- as.Date(mpx$date, format = "%Y%m%d")
mpx[which(mpx$r < 0), "r"] <- NA 
head(mpx)
##         date prec   pet      r    tmax     tmin
## 1 1948-01-01 0.00 0.080 0.2620 -2.9167 -11.2556
## 2 1948-01-02 4.44 0.081 0.2501 -2.9556 -11.8500
## 3 1948-01-03 4.74 0.081 0.2501 -0.1778  -5.6444
## 4 1948-01-04 0.00 0.081 0.2620 -1.8778  -4.2222
## 5 1948-01-05 0.00 0.083 0.2501  0.8778  -4.3389
## 6 1948-01-06 1.78 0.084 0.2620 -0.2889  -5.4833

Let’s add some features to the data. For the purpose of the analysis it would be beneficial to add year, month, quarters, decade and hyear as hydrological year.

2.3 Aggregation and summation

Code

mpx$year <- as.integer(format(mpx$date, "%Y"))
mpx$month <- as.integer(format(mpx$date, "%m"))

Try to add quart variable using quarters() function.

Code

mpx$hyear <- ifelse(mpx$month > 10, as.integer(mpx$year + 1), as.integer(mpx$year))

Now exercise some basic calculations

Fraction of days with no precipitation
How many freezing days occur?
How many freezing days in late spring (May) and summer?
10 highest and 10 lowest temperatures

These functions are based on grouping. In hydrology, the natural groups involve - stations/basins, decades/years/months, groundwater regions, etc.

2.3.1 Box-plot

Come handy, when we want to visualize the important quantiles related to any categorical variable.

Code

station <- read.delim("./data/6242910_Q_Day.Cmd.txt", skip = 36, header = TRUE, sep = ";", col.names = c("date", "time", "value"))
station$date <- as.Date(station$date, format = "%Y-%m-%d")

station_agg <- aggregate(station$value ~ as.factor(data.table::month(station$date)), 
                         FUN = "quantile", 
                         probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
names(station_agg) <- c("Month", "Discharge")
par(font.main = 1, 
    adj = 0.1, 
    xaxs = "i", 
    yaxs = "i")
boxplot(data = station_agg, 
        Discharge ~ Month, 
        main = "Distribution of discharge in months", 
        frame = FALSE, 
        ylim = c(0,20), 
        xlab = "", 
        ylab = bquote(Discharge), font.main = 1)

2.3.2 Q-Q plot

The so called rankit graph produces a quantile-quantile graph of the values from selected data. This one can compare if the distribution of the data fit with the assumed distribution, e.q. Normal. qqline then adds the theoretical line, which outputs

Code

par(font.main = 1, 
    adj = 0.1, 
    xaxs = "i", 
    yaxs = "i")
qqnorm(mpx$tmax, 
       pch = 21, 
       col = "black", 
       bg = "lightgray", cex = 0.5, 
       main = "")
qqline(mpx$tmax)

2.3.3 Empirical distribution function

Large portion of the data which is processed in Hydrology originates from time series of streamflow or runoff. This enables us to construct empirical probability of exceeding certain value in the data \(P(X\geq x_k)\), simply using the well know definition

\[ P = \dfrac{m}{n} \] where \(m\) is the number of reaching or exceeding of value \(x_k\) and \(n\) is the length of the series. This equation is valid strictly for \(n \rightarrow \infty\).
There are several empirical formulas in use to calculate the empirical exceedance probability like the one from Čegodajev

\[ P = \dfrac{m - 0.3}{n + 0.4} \] In R we can utilize a highe order function called ecdf()

Code

ecdf_data <- ecdf(na.omit(mpx$r))
max(mpx$r, na.rm = TRUE)
ecdf_data(35) # percent of data lower than 35

1: Create the empirical cumulative distribution function (ECDF) for the data

[1] 35.7248
[1] 0.9998328

2.3.4 Exceedance curve

Code

r_sorted <- sort(mpx$r, decreasing = TRUE)

# Empirical
n <- length(r_sorted)
exceedance_prob <- 1:n / n

plot(exceedance_prob, r_sorted, type = "l", xlab = "Exceedance Probability", ylab = "Runoff Value", main = "Empirical Exceedance Curve")

2.3.5 Return period

We calculate the return period as inverse to the exceedance probability

Code

return_period <- 1/exceedance_prob 

plot(return_period, r_sorted, type = "l", xlab = "Return Period", ylab = "Runoff Value", main = "Empirical Return Period")

2.4 Hydrological indices

Originated from the combination of aggregation and summation methods and the measures of location and dispersion.

2.4.1 Runoff coefficient

The concept of runoff coefficient comes from the presumption, that over long-time period

\[ \varphi [-] = \dfrac{R}{P} \] where \(R\) represents runoff and \(P\) precipitation in long term, typically 30 year so called Normal period, which is a decadal moving 30 window; the current being 1991–2020.

Code

R <- mean(aggregate(r ~ year, FUN = mean, data = mpx)[, "r"])
P <- mean(aggregate(prec ~ year, FUN = mean, data = mpx)[, "prec"])

R/P

[1] 0.493947

2.4.2 Flow duration curve

Code

M <- c(30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 355, 364)
m_day <- function(streamflow) {
  quantile(na.omit(streamflow), 1 - M/365.25)
}
plot(M, m_day(mpx$r), 
     type = "b", 
     pch = 21,
     bg = "darkgray",
     xlab = "", 
     ylab = "Discharge", 
     main = "M-day discharge")

2.4.3 Baseflow

The Total runoff comprises from direct runoff, hypodermic runoff and baseflow. From the baseflow separation process usually the baseflow index is calculated \(\text{BFI}=\dfrac{\text{baseflow}}{\text{total runoff}}\)