A dataset is a collection of measurements which is used to create or evaluate assumptions about a population. Within this session, only univariate data are studied.
3.1 Exploratory Data Analysis (EDA)
EDA is not a strictly formalized set of procedures, rather it is an general approach on how to study, describe, extract, evaluate and report about the data. Hence the workflow is specific to data in process.
When studying data, we will look at certain statistical features, which represent esteemed characteristics of the whole data set.
The median is another central tendency based on the sorted data set. It is the \(50\)th quantile in the data. A half of the values is smaller than median, and a half is larger.
For continuous variables in real numbers, computation of mode can be done either by cutting the values into meaningful bins or using the kernel density estimate.
Change the cut() function up so that the vector \(x\) is cut into 20 bins.
Add a single number to vector x so that the sample mean rises to
3.1.1.4\(n^{\mathrm{th}}\) quantile
Code
quantile(x, probs =c(0.1, 0.9))
10% 90%
38.49171 62.84304
3.1.2 Measures of spread
With measures of variability we describe the spread of the data around its central tendency. Some degree of variability is a natural occurring phenomenon.
3.1.2.1 Variation range
The simples measure of spread is the variation range. It is computed as the difference of both end extremes of the data.
\[
R = \max{x} - \min{x}
\]
Code
max(x) -min(x)
[1] 37.5353
3.1.2.2 Variance and sample variance
We calculate the variance as the average of the squares of the deviations of individual character values from their arithmetic mean \(\bar{x}\).
\[
\begin{array}{rl}
\sigma^2 =& \dfrac{1}{n}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2\\
s^2 =& \dfrac{1}{n-1}\sum\limits_{i=1}^{n}(x_i - \bar{x})^2
\end{array}
\] Usually we do not have the ability to work with the whole population data. \(s^2\), the sample variance formula needs to be used for and unbiased estimate.
3.1.2.3 Standard deviation
The standard deviation is defined as square root of variance.
Code
sd(x)## [1] 9.810307var(x)^0.5## [1] 9.810307
3.1.2.4 Coefficient of variation
The coefficient of variation is given by the ratio of the standard deviation and the arithmetic mean, and it follows from the definition of this ratio that it is a dimensionless indicator.
\[
\nu = \frac{s}{\bar{x}}
\]
3.1.2.5 Interquartile range
The \(\text{IQR}\) is the upper quartile (\(75\)th percentile) minus the lower quartile (\(25\)th percentile)
\[
\text{IQR} = Q_{\text{III}} - Q_{\text{I}}
\]
Code
IQR(x)
[1] 11.60016
Exercise
Using the knowledge from the Chapter 2 create own function to calculate the sample variance according the formula.
3.2 Hydrological data
Datasets containing hydrological data are quite commonly, although not exclusively, in tabular (rectangular) shape. Let’s take a look at sample data from MOPEX. It is a fairly extensive curated dataset.
This dataset covered some 438 catchments in daily step with measured discharge. The dataset used to be publicly available at https://hydrology.nws.noaa.gov/pub/gcip/mopex/US_Data/Us_438_Daily, the site is unavailable now. Several even more extensive datasets recently came out.
This file is stored locally as part of this site. You have to change to your path.
2
Load the data in fixed width format using the read.fwf() function.
3
Provide variable names, since there aren’t any.
4
Now there is a trouble with dates. If you evaluate sequentially, you could see that the date format expects “YYYY-mm-dd” format, we have to fill the blank spots with zeroes to comply.
5
After that, some variables contain non-physical values like -9999, these would make troubles in statistical processing, we have to replace them with NA (not assigned).
6
Let’s display the 5 first and 5 last rows of the result.
date prec pet r tmax tmin
1 1948-01-01 0.00 0.080 0.2620 -2.9167 -11.2556
2 1948-01-02 4.44 0.081 0.2501 -2.9556 -11.8500
3 1948-01-03 4.74 0.081 0.2501 -0.1778 -5.6444
4 1948-01-04 0.00 0.081 0.2620 -1.8778 -4.2222
5 1948-01-05 0.00 0.083 0.2501 0.8778 -4.3389
20450 2003-12-27 0.59 0.095 NA 0.1556 -5.6667
20451 2003-12-28 0.00 0.091 NA 1.6444 -12.8111
20452 2003-12-29 0.20 0.088 NA 4.1667 -12.5500
20453 2003-12-30 2.03 0.086 NA 4.8889 -9.4222
20454 2003-12-31 0.84 0.084 NA 3.5278 -1.7500
Exercise
Data coming from field measurements are usually accompanied with metadata. This could be information about when the data were collected, the site location, precision, missing values etc. Sometimes the metadata are in a header of the file. Download and process this file in a similar manner using help of this read.txt() function.
3.3 Aggregation and summation
Let’s add some features to the data. For the purpose of the analysis it would be beneficial to add year, month, quarters, decade and hyear as hydrological year.
The hydrological year in Czechia is defined from \(1^{\mathrm{st}}\) November to \(31^{\mathrm{st}}\) October next year.
These functions are based on grouping. In hydrology, the natural groups involve - stations/basins, decades/years/months, groundwater regions, etc.
3.3.1 Box-plot
Come handy, when we want to visualize the important quantiles related to any categorical variable.
Code
station <-read.delim(file ="./data/6242910_Q_Day.Cmd.txt", skip =36, header =TRUE, sep =";", col.names =c("date", "time", "value"))station$date <-as.Date(station$date, format ="%Y-%m-%d")station_agg <-aggregate(station$value ~as.factor(data.table::month(station$date)), FUN ="quantile", probs =c(0.1, 0.25, 0.5, 0.75, 0.9))names(station_agg) <-c("Month", "Discharge")par(font.main =1, adj =0.1, xaxs ="i", yaxs ="i")boxplot(data = station_agg, Discharge ~ Month, main ="Distribution of discharge in months", frame =FALSE, ylim =c(0,20), xlab ="", ylab =bquote(Discharge), font.main =1)
3.3.2 Q-Q plot
The so called rankit graph produces a quantile-quantile graph of the values from selected data. This one can compare if the distribution of the data fit with the assumed distribution, e.q. Normal. qqline then adds the theoretical line.
Code
par(font.main =1, adj =0.1, xaxs ="i", yaxs ="i")qqnorm(mpx_01138000$tmax, pch =21, col ="black", bg ="lightgray", cex =0.5, main ="")qqline(mpx_01138000$tmax)
3.3.3 Empirical distribution function
Large portion of the data which is processed in Hydrology originates from time series of streamflow or runoff. This enables us to construct empirical probability of exceeding certain value in the data \(P(X\geq x_k)\), simply using the well know definition
\[
P = \dfrac{m}{n}
\] where \(m\) is the number of reaching or exceeding of value \(x_k\) and \(n\) is the length of the series. This equation is valid strictly for \(n \rightarrow \infty\).
There are several empirical formulas in use to calculate the empirical exceedance probability like the one from Čegodajev
\[
P = \dfrac{m - 0.3}{n + 0.4}
\] In R we can utilize a highe order function called ecdf()
Originated from the combination of aggregation and summation methods and the measures of location and dispersion.
3.4.1 Runoff coefficient
The concept of runoff coefficient comes from the presumption, that over long-time period
\[
\varphi [-] = \dfrac{R}{P}
\] where \(R\) represents runoff and \(P\) precipitation in long term, typically 30 year so called Climatic period, which is a decadal 30 year reference moving window; the latest being 1991–2020.
Code
R <-mean(aggregate(r ~ year, FUN = mean, data = mpx_01138000)[, "r"])P <-mean(aggregate(prec ~ year, FUN = mean, data = mpx_01138000)[, "prec"])R/P
[1] 0.493947
Exercise
Calculate the runoff coefficient \(\varphi\) for the two latest decades separately.
3.4.2 Flow duration curve
This is an essential curve in energy as well as river ecology sector.
Code
M <-c(30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 355, 364)m_day <-function(streamflow) {quantile(na.omit(streamflow), 1- M/365.25)}plot(M, m_day(mpx_01138000$r), type ="b", pch =21,bg ="darkgray",xlab ="", ylab ="Discharge", main ="M-day discharge")