3 Processing of hydrological dataset

The following concepts will be discussed:

Concepts

A hydrologic year,
data screening and cleaning,
summation and aggregation of data,
descriptive statistics,
visualization of statistical data.

A dataset is a collection of measurements which is used to create or evaluate assumptions about a population. Within this session, only univariate data are studied.

3.1 Exploratory Data Analysis (EDA)

EDA is not a strictly formalized set of procedures, rather it is an general approach on how to study, describe, extract, evaluate and report about the data. Hence the workflow is specific to data in process.
When studying data, we will look at certain statistical features, which represent esteemed characteristics of the whole data set.

Code

set.seed(123)
x <- rnorm(n = 30, mean = 50, sd = 10)
x
##  [1] 44.39524 47.69823 65.58708 50.70508 51.29288 67.15065 54.60916 37.34939
##  [9] 43.13147 45.54338 62.24082 53.59814 54.00771 51.10683 44.44159 67.86913
## [17] 54.97850 30.33383 57.01356 45.27209 39.32176 47.82025 39.73996 42.71109
## [25] 43.74961 33.13307 58.37787 51.53373 38.61863 62.53815

1: Using the set seed will ensure that we will get the same generated numbers every time.
2: Some random-generated numbers from the Normal distribution.

Exercise

Try to play with the function rnorm() by chaging the parameters. What is the function of set.seed()?

3.1.1 Measures of location

The farthest one can reduce data, while still retaining any information at all is by a single value. They describe a central tendency of the data.

3.1.1.1 Mean

The arithmetic mean (or average) is simply the sum of observations devided by the number of observations.

\[ \bar{x} = \dfrac{1}{n}\sum\limits_{i=1}^{n} x_i \]

3.1.1.2 Median

The median is another central tendency based on the sorted data set. It is the \(50\)th quantile in the data. A half of the values is smaller than median, and a half is larger.

Code

median(x)
## [1] 49.26267
(sort(x)[length(x)/2] + sort(x)[length(x)/2 + 1])/2
## [1] 49.26267

3.1.1.3 Mode

For continuous variables in real numbers, computation of mode can be done either by cutting the values into meaningful bins or using the kernel density estimate.

Code

cut(x, breaks = 10)

 [1] (41.6,45.3] (45.3,49.1] (64.1,67.9] (49.1,52.9] (49.1,52.9] (64.1,67.9]
 [7] (52.9,56.6] (34.1,37.8] (41.6,45.3] (45.3,49.1] (60.4,64.1] (52.9,56.6]
[13] (52.9,56.6] (49.1,52.9] (41.6,45.3] (64.1,67.9] (52.9,56.6] (30.3,34.1]
[19] (56.6,60.4] (41.6,45.3] (37.8,41.6] (45.3,49.1] (37.8,41.6] (41.6,45.3]
[25] (41.6,45.3] (30.3,34.1] (56.6,60.4] (49.1,52.9] (37.8,41.6] (60.4,64.1]
10 Levels: (30.3,34.1] (34.1,37.8] (37.8,41.6] (41.6,45.3] ... (64.1,67.9]

Code

y <- cut(x, breaks = seq(from = -10, to = 100, by = 10))
table(y)

y
 (-10,0]   (0,10]  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,70] 
       0        0        0        0        6        9       10        5 
 (70,80]  (80,90] (90,100] 
       0        0        0

Code

barplot(height = table(y))

Exercise

1. Change the cut() function up so that the vector \(x\) is cut into 20 bins.
2. Add a single number to vector x so that the sample mean rises to

3.1.1.4 \(n^{\mathrm{th}}\) quantile

Code

quantile(x, probs = c(0.1, 0.9))

     10%      90% 
38.49171 62.84304

3.1.2 Measures of spread

With measures of variability we describe the spread of the data around its central tendency. Some degree of variability is a natural occurring phenomenon.

3.1.2.1 Variation range

The simples measure of spread is the variation range. It is computed as the difference of both end extremes of the data.

\[ R = \max{x} - \min{x} \]

Code

max(x) - min(x)

[1] 37.5353

3.1.2.2 Variance and sample variance

We calculate the variance as the average of the squares of the deviations of individual character values from their arithmetic mean \(\bar{x}\).

\[ \begin{array}{rl} \sigma^2 =& \dfrac{1}{n}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2\\ s^2 =& \dfrac{1}{n-1}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2 \end{array} \] Usually we do not have the ability to work with the whole population data. \(s^2\), the sample variance formula needs to be used for and unbiased estimate.

3.1.2.3 Standard deviation

The standard deviation is defined as square root of variance.

Code

sd(x)
## [1] 9.810307
var(x)^0.5
## [1] 9.810307

3.1.2.4 Coefficient of variation

The coefficient of variation is given by the ratio of the standard deviation and the arithmetic mean, and it follows from the definition of this ratio that it is a dimensionless indicator.

\[ \nu = \frac{s}{\bar{x}} \]

3.1.2.5 Interquartile range

The \(\text{IQR}\) is the upper quartile (\(75\)th percentile) minus the lower quartile (\(25\)th percentile)

\[ \text{IQR} = Q_{\text{III}} - Q_{\text{I}} \]

Code

IQR(x)

[1] 11.60016

Exercise

Using the knowledge from the Chapter 2 create own function to calculate the sample variance according the formula.

3.2 Hydrological data

Datasets containing hydrological data are quite commonly, although not exclusively, in tabular (rectangular) shape. Let’s take a look at sample data from MOPEX. It is a fairly extensive curated dataset.

This dataset covered some 438 catchments in daily step with measured discharge. The dataset used to be publicly available at https://hydrology.nws.noaa.gov/pub/gcip/mopex/US_Data/Us_438_Daily, the site is unavailable now. Several even more extensive datasets recently came out.

Code

url <- "data/01138000.dly"
mpx_01138000 <- read.fwf(file = url,
                widths = c(8, rep(10, times = 5)),
                header = FALSE)
names(mpx_01138000) <- c("date", "prec", "pet", "r", "tmax", "tmin")
mpx_01138000$date <- gsub(x = mpx_01138000$date,
                          pattern = " ",
                          replacement = "0")
mpx_01138000$date <- as.Date(x = mpx_01138000$date,
                             format = "%Y%m%d")
mpx_01138000[which(mpx_01138000$r < 0), "r"] <- NA
rbind(head(mpx_01138000, n = 5),
      tail(mpx_01138000, n = 5))

1: This file is stored locally as part of this site. You have to change to your path.
2: Load the data in fixed width format using the read.fwf() function.
3: Provide variable names, since there aren’t any.
4: Now there is a trouble with dates. If you evaluate sequentially, you could see that the date format expects “YYYY-mm-dd” format, we have to fill the blank spots with zeroes to comply.
5: After that, some variables contain non-physical values like -9999, these would make troubles in statistical processing, we have to replace them with NA (not assigned).
6: Let’s display the 5 first and 5 last rows of the result.

            date prec   pet      r    tmax     tmin
1     1948-01-01 0.00 0.080 0.2620 -2.9167 -11.2556
2     1948-01-02 4.44 0.081 0.2501 -2.9556 -11.8500
3     1948-01-03 4.74 0.081 0.2501 -0.1778  -5.6444
4     1948-01-04 0.00 0.081 0.2620 -1.8778  -4.2222
5     1948-01-05 0.00 0.083 0.2501  0.8778  -4.3389
20450 2003-12-27 0.59 0.095     NA  0.1556  -5.6667
20451 2003-12-28 0.00 0.091     NA  1.6444 -12.8111
20452 2003-12-29 0.20 0.088     NA  4.1667 -12.5500
20453 2003-12-30 2.03 0.086     NA  4.8889  -9.4222
20454 2003-12-31 0.84 0.084     NA  3.5278  -1.7500

Exercise

Data coming from field measurements are usually accompanied with metadata. This could be information about when the data were collected, the site location, precision, missing values etc. Sometimes the metadata are in a header of the file. Download and process this file in a similar manner using help of this read.txt() function.

3.3 Aggregation and summation

Let’s add some features to the data. For the purpose of the analysis it would be beneficial to add year, month, quarters, decade and hyear as hydrological year.

Code

mpx_01138000$year <- as.integer(format(mpx_01138000$date, "%Y"))
mpx_01138000$month <- as.integer(format(mpx_01138000$date, "%m"))

1: With the help of special characters we are extracting the years and months consecutively.

Exercise

1. Add quart variable to mpx_01138000 using quarters() function.
2. Provide the fraction of days with no precipitation:
3. How many freezing days (Tmin < 0) occured?
4. How many freezing days occured in late spring (May) and summer?
5. Find 3 highest and 3 lowest temperatures (delimit by a comma).

Code

mpx_01138000$hyear <- ifelse(test = mpx_01138000$month > 10,
                             yes = as.integer(mpx_01138000$year + 1),
                             no = as.integer(mpx_01138000$year))

1: The hydrological year in Czechia is defined from \(1^{\mathrm{st}}\) November to \(31^{\mathrm{st}}\) October next year.

These functions are based on grouping. In hydrology, the natural groups involve - stations/basins, decades/years/months, groundwater regions, etc.

3.3.1 Box-plot

Come handy, when we want to visualize the important quantiles related to any categorical variable.

Code

station <- read.delim(file = "./data/6242910_Q_Day.Cmd.txt", 
                      skip = 36, 
                      header = TRUE, 
                      sep = ";", 
                      col.names = c("date", "time", "value"))
station$date <- as.Date(station$date, 
                        format = "%Y-%m-%d")

station_agg <- aggregate(station$value ~ as.factor(data.table::month(station$date)), 
                         FUN = "quantile", 
                         probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

names(station_agg) <- c("Month", "Discharge")
par(font.main = 1, 
    adj = 0.1, 
    xaxs = "i", 
    yaxs = "i")
boxplot(data = station_agg, 
        Discharge ~ Month, 
        main = "Distribution of discharge in months", 
        frame = FALSE, 
        ylim = c(0,20), 
        xlab = "", 
        ylab = bquote(Discharge), font.main = 1)

3.3.2 Q-Q plot

The so called rankit graph produces a quantile-quantile graph of the values from selected data. This one can compare if the distribution of the data fit with the assumed distribution, e.q. Normal. qqline then adds the theoretical line.

Code

par(font.main = 1, 
    adj = 0.1, 
    xaxs = "i", 
    yaxs = "i")
qqnorm(mpx_01138000$tmax, 
       pch = 21, 
       col = "black", 
       bg = "lightgray", cex = 0.5, 
       main = "")
qqline(mpx_01138000$tmax)

3.3.3 Empirical distribution function

Large portion of the data which is processed in Hydrology originates from time series of streamflow or runoff. This enables us to construct empirical probability of exceeding certain value in the data \(P(X\geq x_k)\), simply using the well know definition

\[ P = \dfrac{m}{n} \] where \(m\) is the number of reaching or exceeding of value \(x_k\) and \(n\) is the length of the series. This equation is valid strictly for \(n \rightarrow \infty\).
There are several empirical formulas in use to calculate the empirical exceedance probability like the one from Čegodajev

\[ P = \dfrac{m - 0.3}{n + 0.4} \] In R we can utilize a highe order function called ecdf()

Code

ecdf_data <- ecdf(na.omit(mpx_01138000$r))
ecdf_data(35)

1: Create the empirical cumulative distribution function (ECDF) for the data.
2: Calculate percent of data lower than 35.

[1] 0.9998328

3.3.4 Exceedance curve

Provides an information on how many times or for how long a streamflow value was reached or exceeded within a certain period.

Code

r_sorted <- na.omit(sort(mpx_01138000$r, decreasing = TRUE))

# Empirical
n <- length(r_sorted)
exceedance_prob <- 1:n / n

rank <- seq(1:n)
return_period <- (n + 1) / rank
exceedance_prob <- 1 / return_period

plot(x = exceedance_prob, 
     y = r_sorted, 
     type = "l", 
     xlab = "Exceedance Probability", 
     ylab = "Runoff Value", 
     main = "Empirical Exceedance Curve")

3.4 Hydroclimatic indices

Originated from the combination of aggregation and summation methods and the measures of location and dispersion.

3.4.1 Runoff coefficient

The concept of runoff coefficient comes from the presumption, that over long-time period

\[ \varphi [-] = \dfrac{R}{P} \] where \(R\) represents runoff and \(P\) precipitation in long term, typically 30 year so called Climatic period, which is a decadal 30 year reference moving window; the latest being 1991–2020.

Code

R <- mean(aggregate(r ~ year, FUN = mean, data = mpx_01138000)[, "r"])
P <- mean(aggregate(prec ~ year, FUN = mean, data = mpx_01138000)[, "prec"])

R/P

[1] 0.493947

Exercise

Calculate the runoff coefficient \(\varphi\) for the two latest decades separately.

3.4.2 Flow duration curve

This is an essential curve in energy as well as river ecology sector.

Code

M <- c(30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 355, 364)
m_day <- function(streamflow) {
  quantile(na.omit(streamflow), 1 - M/365.25)
}
plot(M, m_day(mpx_01138000$r), 
     type = "b", 
     pch = 21,
     bg = "darkgray",
     xlab = "", 
     ylab = "Discharge", 
     main = "M-day discharge")