pytesmo.metrics package
Metrics for pairs and triplets
This module provides functions to calculate metrics and their associated confidence intervals.
Pairwise Metrics
For pairwise metrics, confidence intervals can be calculated analytically (if available) or via bootstrapping. Example with bias:
import numpy as np
from pytesmo.metrics import bias, with_analytical_ci, with_bootstrapped_ci
x, y = np.random.randn(100), np.random.randn(100)
# only metric calculation
b = bias(x, y)
# with analytical CI
b, lower, upper = with_analytical_ci(bias, x, y, alpha=0.05)
# using bootstrapping instead (uses nsamples=1000 by default)
b, lower, upper = with_bootstrapped_ci(bias, x, y, alpha=0.05)
Analytical CIs are available for:
- bias
- rmsd
- nrmsd
- ubrmsd
- mse
- pearson_r
- spearman_r
- kendall_tau
You can use pytesmo.metrics.has_analytical_ci() to check programmatically whether analytical CIs are available.
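To illustrate the idea behind an analytical CI, the sketch below computes a paired-difference normal-theory interval for the bias with NumPy/SciPy. Note that `bias_with_ci` is a hypothetical helper for illustration only, and pytesmo's actual CI formulas may differ in detail:

```python
import numpy as np
from scipy import stats

def bias_with_ci(x, y, alpha=0.05):
    # Hypothetical helper: paired-difference normal-theory CI for bias.
    # pytesmo's with_analytical_ci(bias, x, y) plays the analogous role.
    d = x - y
    n = len(d)
    b = d.mean()                           # bias = mean(x) - mean(y)
    se = d.std(ddof=1) / np.sqrt(n)        # standard error of the mean difference
    t = stats.t.ppf(1 - alpha / 2, n - 1)  # two-sided t quantile
    return b, b - t * se, b + t * se

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)
b, lower, upper = bias_with_ci(x, y)
```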
Triple Collocation Metrics
Triple collocation metrics can be calculated with the function pytesmo.metrics.tcol_metrics(). For bootstrapping CIs, use pytesmo.metrics.tcol_metrics_with_bootstrapped_ci():
from pytesmo.metrics import tcol_metrics, tcol_metrics_with_bootstrapped_ci
x, y, z = ...
# only metrics
snr, err_std, beta = tcol_metrics(x, y, z)
# with cis, using nsamples=1000 and alpha=0.05 (default)
snr_and_ci, err_std_and_ci, beta_and_ci = (
tcol_metrics_with_bootstrapped_ci(x, y, z)
)
snr, lower_snr, upper_snr = snr_and_ci
Developer notes
Pairwise metrics are implemented in pytesmo.metrics.pairwise, triple collocation metrics in pytesmo.metrics.tcol.
- pytesmo.metrics.RSS(x, y)
Residual sum of squares.
- Parameters:
x (numpy.ndarray) – Observations.
y (numpy.ndarray) – Predictions.
- Returns:
res – Residual sum of squares.
- Return type: float
- pytesmo.metrics.aad(x, y)[source]
Average (=mean) absolute deviation (AAD).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
d – Mean absolute deviation.
- Return type: float
- pytesmo.metrics.bias(x, y)
Difference of the mean values.
Sign of output depends on argument order. We calculate mean(x) - mean(y).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
bias – Bias between x and y.
- Return type: float
- pytesmo.metrics.ecol(data, correlated=None, err_cov=None, abs_est=True)[source]
Extended collocation analysis to obtain estimates of:
signal variances
error variances
signal-to-noise ratios [dB]
error cross-covariances (and -correlations)
based on an arbitrary number of N>3 data sets.
EACH DATA SET MUST BE MEMBER OF >= 1 TRIPLET THAT FULFILLS THE CLASSICAL TRIPLE COLLOCATION ASSUMPTIONS
- Parameters:
data (pd.DataFrame) – Temporally matched input data sets in each column
correlated (tuple of tuples (string)) – A tuple containing tuples of data set names (column names), between which the error cross-correlation shall be estimated. E.g. [['AMSR-E','SMOS'],['GLDAS','ERA']] estimates error cross-correlations between (AMSR-E and SMOS), and (GLDAS and ERA), respectively.
err_cov – A priori known error cross-covariances that shall be included in the estimation (to obtain unbiased estimates)
abs_est – Force absolute values for signal and error variance estimates (to mitigate the issue of estimation uncertainties)
- Returns:
A dictionary with the following entries (<name> corresponds to the data set (DataFrame column) names):
- sig_<name> (signal variance of <name>)
- err_<name> (error variance of <name>)
- snr_<name> (SNR (in dB) of <name>)
- err_cov_<name1>_<name2> (error covariance between <name1> and <name2>)
- err_corr_<name1>_<name2> (error correlation between <name1> and <name2>)
Notes
Rescaling parameters can be derived from the signal variances, e.g., scaling <src> against <ref>:
beta = np.sqrt(sig_<ref> / sig_<src>)
rescaled = (data[<src>] - data[<src>].mean()) * beta + data[<ref>].mean()
Examples
# Just random numbers for demonstration
ds1 = np.random.normal(0, 1, 500)
ds2 = np.random.normal(0, 1, 500)
ds3 = np.random.normal(0, 1, 500)
ds4 = np.random.normal(0, 1, 500)
ds5 = np.random.normal(0, 1, 500)

# Three data sets without cross-correlated errors: this is equivalent
# to standard triple collocation.
df = pd.DataFrame({'ds1': ds1, 'ds2': ds2, 'ds3': ds3},
                  index=np.arange(500))
res = ecol(df)

# Five data sets, where data sets (1 and 2), and (3 and 4), are assumed
# to have cross-correlated errors.
df = pd.DataFrame({'ds1': ds1, 'ds2': ds2, 'ds3': ds3, 'ds4': ds4,
                   'ds5': ds5}, index=np.arange(500))
correlated = [['ds1', 'ds2'], ['ds3', 'ds4']]
res = ecol(df, correlated=correlated)
References
[Gruber2016] Gruber, A., Su, C. H., Crow, W. T., Zwieback, S., Dorigo, W. A., & Wagner, W. (2016). Estimating error cross-correlations in soil moisture data sets using extended collocation analysis. Journal of Geophysical Research: Atmospheres, 121(3), 1208-1219.
- pytesmo.metrics.has_analytical_ci(metric_func)[source]
Checks whether an analytical CI implementation is available.
- Parameters:
metric_func (callable) – Function that calculates metric value. Must be from pytesmo.metrics.
- Returns:
has_ci – True if there is a function with name metric_func.__name__ + "_ci" in pytesmo.metric_cis.
- Return type: bool
- pytesmo.metrics.index_of_agreement(o, p)[source]
The index of agreement was proposed by Willmott (1981) to overcome the insensitivity of Nash-Sutcliffe efficiency E and R^2 to differences in the observed and predicted means and variances (Legates and McCabe, 1999). The index of agreement represents the ratio of the mean square error and the potential error (Willmott, 1984). The potential error in the denominator represents the largest value that the squared difference of each pair can attain. The range of d is similar to that of R^2 and lies between 0 (no correlation) and 1 (perfect fit).
- Parameters:
o (numpy.ndarray) – Observations.
p (numpy.ndarray) – Predictions.
- Returns:
d – Index of agreement.
- Return type: float
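The standard Willmott formulation described above can be sketched as follows (a textbook reimplementation with NumPy, assumed to be equivalent to pytesmo's version):

```python
import numpy as np

def index_of_agreement(o, p):
    # Willmott's d: 1 - MSE / potential error. The denominator is the
    # largest value the squared pair differences can attain, given the
    # observed mean.
    pe = np.sum((np.abs(p - o.mean()) + np.abs(o - o.mean())) ** 2)
    return 1 - np.sum((o - p) ** 2) / pe

o = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([1.1, 1.9, 3.2, 3.8])
d = index_of_agreement(o, p)
```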
- pytesmo.metrics.kendall_tau(x, y)[source]
Wrapper for scipy.stats.kendalltau
- Parameters:
x (numpy.array) – First input vector.
y (numpy.array) – Second input vector.
- Returns:
tau – Kendall’s tau statistic
- Return type: float
See also
scipy.stats.kendalltau
- pytesmo.metrics.mad(x, y)[source]
Median absolute deviation (MAD).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
d – Median absolute deviation.
- Return type: float
- pytesmo.metrics.msd(x, y)[source]
Mean square deviation/mean square error.
For validation, MSD (same as MSE) is defined as
.. math::
MSD = \frac{1}{n}\sum\limits_{i=1}^n (x_i - y_i)^2
MSD can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSD &= MSD_{corr} + MSD_{var} + MSD_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates the full MSD; the functions msd_corr, msd_var, and msd_bias can be used to calculate the individual components.
- Parameters:
x (numpy.ndarray) – First input vector
y (numpy.ndarray) – Second input vector
- Returns:
msd – Mean square deviation
- Return type: float
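The decomposition above can be verified numerically. The sketch below reimplements the three components with NumPy (population statistics, ddof=0); the helper names mirror, but are not, the pytesmo functions:

```python
import numpy as np

def msd(x, y):
    return np.mean((x - y) ** 2)

def msd_components(x, y):
    # Correlation, variance, and bias terms of the decomposition above.
    r = np.corrcoef(x, y)[0, 1]
    sx, sy = x.std(), y.std()          # population std (ddof=0)
    msd_corr = 2 * sx * sy * (1 - r)
    msd_var = (sx - sy) ** 2
    msd_bias = (x.mean() - y.mean()) ** 2
    return msd_corr, msd_var, msd_bias

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 200)
y = 0.5 * x + 0.3 + rng.normal(0, 0.5, 200)
c, v, b = msd_components(x, y)
# The three components sum to the full MSD.
```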
- pytesmo.metrics.mse_bias(x, y)
Bias component of MSE.
MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSE &= MSE_{corr} + MSE_{var} + MSE_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates the term \(MSE_{bias} = (\mu_x - \mu_y)^2\).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
mse_bias – Bias component of MSE.
- Return type: float
- pytesmo.metrics.mse_corr(x, y)
Correlation component of MSE.
MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSE &= MSE_{corr} + MSE_{var} + MSE_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates the term \(MSE_{corr} = 2\sigma_x\sigma_y(1-r)\).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
mse_corr – Correlation component of MSE.
- Return type: float
- pytesmo.metrics.mse_decomposition(x, y)
Mean square deviation/mean square error.
For validation, MSE (same as MSD) is defined as
.. math::
MSE = \frac{1}{n}\sum\limits_{i=1}^n (x_i - y_i)^2
MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSE &= MSE_{corr} + MSE_{var} + MSE_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates all components as well as the sum.
- Parameters:
x (numpy.ndarray) – First input vector
y (numpy.ndarray) – Second input vector
- Returns:
mse (float) – Mean square deviation
mse_corr (float) – Correlation component of MSE.
mse_bias (float) – Bias component of the MSE.
mse_var (float) – Variance component of the MSE.
- pytesmo.metrics.mse_var(x, y)
Variance component of MSE.
MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSE &= MSE_{corr} + MSE_{var} + MSE_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates the term \(MSE_{var} = (\sigma_x - \sigma_y)^2\).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
mse_var – Variance component of MSE.
- Return type: float
- pytesmo.metrics.nash_sutcliffe(o, p)[source]
Nash Sutcliffe model efficiency coefficient E. The range of E lies between 1.0 (perfect fit) and -inf.
- Parameters:
o (numpy.ndarray) – Observations.
p (numpy.ndarray) – Predictions.
- Returns:
E – Nash Sutcliffe model efficiency coefficient E.
- Return type: float
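E can be written as one minus the ratio of the model's squared error to the variance of the observations; the sketch below uses this standard textbook formula (assumed to match pytesmo's implementation):

```python
import numpy as np

def nash_sutcliffe(o, p):
    # E = 1 - sum((o - p)^2) / sum((o - mean(o))^2)
    return 1 - np.sum((o - p) ** 2) / np.sum((o - o.mean()) ** 2)

o = np.array([1.0, 2.0, 3.0, 4.0])
```

E is 1 for a perfect fit, 0 for a model that only predicts the observed mean, and negative for anything worse.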
- pytesmo.metrics.nrmsd(x, y, ddof=0)[source]
Normalized root-mean-square deviation (nRMSD).
This normalizes RMSD by max(max(x), max(y)) - min(min(x), min(y)).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero. DEPRECATED: ddof is deprecated and might be removed in future versions.
- Returns:
nrmsd – Normalized root-mean-square deviation (nRMSD).
- Return type: float
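The normalization described above can be sketched directly (a hypothetical reimplementation, ignoring the deprecated ddof argument):

```python
import numpy as np

def nrmsd(x, y):
    # RMSD divided by the combined data range of x and y.
    data_range = max(x.max(), y.max()) - min(x.min(), y.min())
    return np.sqrt(np.mean((x - y) ** 2)) / data_range

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 100)
y = x + rng.normal(0, 0.1, 100)
```

Because numerator and denominator both scale linearly with the data, nRMSD is invariant under a common rescaling of x and y.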
- pytesmo.metrics.pearson_r(x, y)[source]
Pearson’s linear correlation coefficient.
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
r – Pearson’s correlation coefficent.
- Return type: float
See also
scipy.stats.pearsonr
- pytesmo.metrics.rmsd(x, y, ddof=0)[source]
Root-mean-square deviation (RMSD).
This is the root of MSD (see pytesmo.metrics.msd()). If x and y have the same mean (i.e. mean(x - y) = 0), RMSD corresponds to the square root of the variance of x - y.
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
ddof (int, optional, DEPRECATED) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero. DEPRECATED: ddof is deprecated and might be removed in future versions.
- Returns:
rmsd – Root-mean-square deviation.
- Return type: float
- pytesmo.metrics.rolling_pr_rmsd(timestamps, data, window_size, center, min_periods)
Computation of rolling Pearson R and RMSD.
- Parameters:
timestamps (float64) – Time stamps as julian dates.
data (numpy.ndarray) – Time series data in 2d array.
window_size (float64) – Window size in fraction of days.
center (bool) – Set window at the center and include window_size in both directions.
min_periods (int) – Minimum number of observations in window required for computation.
- Returns:
pr_arr (numpy.ndarray) – Rolling Pearson R and p-value.
rmsd_arr (numpy.ndarray) – Rolling RMSD.
- pytesmo.metrics.spearman_r(x, y)[source]
Spearman’s rank correlation coefficient.
- Parameters:
x (numpy.array) – First input vector.
y (numpy.array) – Second input vector.
- Returns:
rho – Spearman correlation coefficient
- Return type: float
See also
scipy.stats.spearmanr
- pytesmo.metrics.tcol_metrics(x, y, z, ref_ind=0)[source]
Triple collocation based estimation of signal-to-noise ratio, absolute errors, and rescaling coefficients
- Parameters:
x (1D numpy.ndarray) – first input dataset
y (1D numpy.ndarray) – second input dataset
z (1D numpy.ndarray) – third input dataset
ref_ind (int) – Index of reference data set for estimating scaling coefficients. Default: 0 (x)
- Returns:
snr (numpy.ndarray) – signal-to-noise (variance) ratio [dB]
err_std (numpy.ndarray) – SCALED error standard deviation
beta (numpy.ndarray) – scaling coefficients (i_scaled = i * beta_i)
Notes
This function estimates the triple collocation errors, the scaling parameter \(\beta\), and the signal-to-noise ratio directly from the covariances of the dataset. For a general overview and how this function and pytesmo.metrics.tcol_error() are related, please see [Gruber2015].
Estimation of the error variances from the covariances of the datasets (e.g. \(\sigma_{XY}\) for the covariance between \(x\) and \(y\)) is done using the following formula:
\[\sigma_{\varepsilon_x}^2 = \sigma_{X}^2 - \frac{\sigma_{XY}\sigma_{XZ}}{\sigma_{YZ}}\]
\[\sigma_{\varepsilon_y}^2 = \sigma_{Y}^2 - \frac{\sigma_{YX}\sigma_{YZ}}{\sigma_{XZ}}\]
\[\sigma_{\varepsilon_z}^2 = \sigma_{Z}^2 - \frac{\sigma_{ZY}\sigma_{ZX}}{\sigma_{YX}}\]
\(\beta\) can also be estimated from the covariances:
\[\beta_x = 1\]
\[\beta_y = \frac{\sigma_{XZ}}{\sigma_{YZ}}\]
\[\beta_z = \frac{\sigma_{XY}}{\sigma_{ZY}}\]
The signal to noise ratio (SNR) is also calculated from the variances and covariances:
\[\text{SNR}_X[dB] = -10\log\left(\frac{\sigma_{X}^2\sigma_{YZ}}{\sigma_{XY}\sigma_{XZ}}-1\right)\]
\[\text{SNR}_Y[dB] = -10\log\left(\frac{\sigma_{Y}^2\sigma_{XZ}}{\sigma_{YX}\sigma_{YZ}}-1\right)\]
\[\text{SNR}_Z[dB] = -10\log\left(\frac{\sigma_{Z}^2\sigma_{XY}}{\sigma_{ZX}\sigma_{ZY}}-1\right)\]
It is given in dB to make it symmetric around zero. A value of zero means that the signal variance and the noise variance are equal; +3 dB means that the signal variance is twice as high as the noise variance.
References
[Gruber2015] Gruber, A., Su, C., Zwieback, S., Crow, W., Dorigo, W., Wagner, W. (2015). Recent advances in (soil moisture) triple collocation analysis. International Journal of Applied Earth Observation and Geoinformation, in review
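The SNR formulas above translate directly into code; the sketch below estimates them from the sample covariance matrix (a simplified reimplementation for illustration, not pytesmo's tcol_metrics, which additionally returns err_std and beta):

```python
import numpy as np

def tcol_snr(x, y, z):
    # SNR [dB] from the covariance formulas above.
    c = np.cov(np.vstack((x, y, z)))
    snr_x = -10 * np.log10(c[0, 0] * c[1, 2] / (c[0, 1] * c[0, 2]) - 1)
    snr_y = -10 * np.log10(c[1, 1] * c[0, 2] / (c[0, 1] * c[1, 2]) - 1)
    snr_z = -10 * np.log10(c[2, 2] * c[0, 1] / (c[0, 2] * c[1, 2]) - 1)
    return np.array([snr_x, snr_y, snr_z])

# Synthetic triplet: common signal plus independent noise of half the
# signal's standard deviation, i.e. a true SNR of 10*log10(4) ~ 6 dB.
rng = np.random.default_rng(1)
sig = rng.normal(0, 1, 100_000)
snr = tcol_snr(sig + rng.normal(0, 0.5, 100_000),
               sig + rng.normal(0, 0.5, 100_000),
               sig + rng.normal(0, 0.5, 100_000))
```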
- pytesmo.metrics.tcol_metrics_with_bootstrapped_ci(x, y, z, ref_ind=0, alpha=0.05, method='percentile', nsamples=1000, minimum_data_length=100)[source]
Evaluates triple collocation metrics and bootstraps confidence interval.
This calculates the SNR, error standard deviation, and scaling parameter value using Triple Collocation Analysis and uses bootstrapping to find confidence intervals.
- Parameters:
x (np.ndarray) – Data to be compared.
y (np.ndarray) – Data to be compared.
z (np.ndarray) – Data to be compared.
ref_ind (int) – Index of reference data set for estimating scaling coefficients. Default: 0 (x)
alpha (float, optional) – Confidence level, default is 0.05.
method (str, optional) –
The method to use to calculate confidence intervals. Available methods are:
"percentile" (default): Uses the percentiles of the bootstrapped metric distribution.
"basic": Uses the percentiles of the differences to the original metric value.
"BCa": Bias-corrected and accelerated bootstrap.
For more info and a comparison of the methods, see [Gilleland2010], especially Table 6 on page 38.
nsamples (int, optional) – Number of bootstrap samples, default is 1000.
minimum_data_length (int, optional) – Minimum amount of data required to do bootstrapping. Default is 100.
- Returns:
snr_result (tuple) – (snr, lower, upper), where each entry is an array of length 3.
err_std_result (tuple) – (err_std, lower, upper), where each entry is an array of length 3.
beta_result (tuple) – (beta, lower, upper), where each entry is an array of length 3. Note that beta is always 1 for the reference dataset (see ref_ind), therefore the lower and upper values are set to 1 too.
References
[Gilleland2010] Gilleland, E. (2010). Confidence Intervals for Forecast Verification (No. NCAR/TN-479+STR). University Corporation for Atmospheric Research. doi:10.5065/D6WD3XJM
- pytesmo.metrics.ubrmsd(x, y, ddof=0)[source]
Unbiased root-mean-square deviation (uRMSD).
This corresponds to RMSD with the mean bias removed beforehand, that is
.. math::
ubRMSD = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n \left((x_i - \bar{x}) - (y_i - \bar{y})\right)^2}
NOTE: If you are scaling the data beforehand to have zero mean bias, this is exactly the same as RMSD.
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero. DEPRECATED: ddof is deprecated and might be removed in future versions.
- Returns:
ubrmsd – Unbiased root-mean-square deviation (uRMSD).
- Return type: float
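The note above can be checked numerically: removing the mean bias from one input first makes plain RMSD coincide with ubRMSD (a sketch with hypothetical NumPy helpers, not the pytesmo functions):

```python
import numpy as np

def rmsd(x, y):
    return np.sqrt(np.mean((x - y) ** 2))

def ubrmsd(x, y):
    # RMSD of the anomalies, i.e. after removing each series' mean.
    return rmsd(x - x.mean(), y - y.mean())

rng = np.random.default_rng(7)
x = rng.normal(0, 1, 500)
y = x + 0.3 + rng.normal(0, 0.2, 500)   # biased, noisy copy of x
debiased_y = y - (y.mean() - x.mean())  # shift y to match the mean of x
```

Since rmsd(x, y)^2 = ubrmsd(x, y)^2 + bias(x, y)^2, ubRMSD is never larger than RMSD.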
- pytesmo.metrics.with_analytical_ci(metric_func, x, y, alpha=0.05)[source]
Evaluates metric and analytical confidence interval.
This calculates the metric value and an analytical confidence interval. For this to work, the metric function must be from pytesmo.pairwise_metric_cis. Use pytesmo.metrics.has_analytical_ci() to check whether it's available.
- Parameters:
metric_func (callable or str) – Function that calculates the metric value. Must be from pytesmo.metrics, and have an analytical CI function implemented in pytesmo.metric_cis. The metric function must have the following signature: (x : np.ndarray, y : np.ndarray) -> float. Alternatively, it can be the name of the function in pytesmo.metrics.
x (np.ndarray) – Data to be compared.
y (np.ndarray) – Data to be compared.
alpha (float, optional) – Confidence level, default is 0.05.
- Returns:
m (float) – Metric value
lower (float) – Lower bound of confidence interval
upper (float) – Upper bound of confidence interval
- Raises:
ValueError : – If no analytical CI function is available.
- pytesmo.metrics.with_bootstrapped_ci(metric_func, x, y, alpha=0.05, method='percentile', nsamples=1000, minimum_data_length=100)[source]
Evaluates metric and bootstraps confidence interval.
This calculates the metric value and uses bootstrapping to find a confidence interval for the metric. This works only for pairwise metrics; use pytesmo.metrics.tcol_metrics_with_bootstrapped_ci() for TCA metrics.
- Parameters:
metric_func (callable) – Function that calculates metric value. The metric function must have the following signature:
(x : np.ndarray, y : np.ndarray) -> float
x (np.ndarray) – Data to be compared.
y (np.ndarray) – Data to be compared.
alpha (float, optional) – Confidence level, default is 0.05.
method (str, optional) –
The method to use to calculate confidence intervals. Available methods are:
"percentile" (default): Uses the percentiles of the bootstrapped metric distribution.
"basic": Uses the percentiles of the differences to the original metric value.
"BCa": Bias-corrected and accelerated bootstrap.
For more info and a comparison of the methods, see [Gilleland2010], especially Table 6 on page 38.
nsamples (int, optional) – Number of bootstrap samples, default is 1000.
minimum_data_length (int, optional) – Minimum amount of data required to do bootstrapping. Default is 100.
- Returns:
m (float) – Metric value for pairwise metrics
lower (float or array of floats) – Lower bound of confidence interval
upper (float or array of floats) – Upper bound of confidence interval
References
[Gilleland2010] Gilleland, E. (2010). Confidence Intervals for Forecast Verification (No. NCAR/TN-479+STR). University Corporation for Atmospheric Research. doi:10.5065/D6WD3XJM
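A minimal version of the default "percentile" method can be sketched as follows (a hypothetical standalone helper; pytesmo's with_bootstrapped_ci additionally implements "basic" and "BCa" and enforces minimum_data_length):

```python
import numpy as np

def percentile_bootstrap_ci(metric_func, x, y, alpha=0.05, nsamples=1000):
    # Resample pairs with replacement and take percentiles of the
    # bootstrapped metric distribution as the confidence interval.
    rng = np.random.default_rng(0)
    n = len(x)
    samples = np.empty(nsamples)
    for i in range(nsamples):
        idx = rng.integers(0, n, size=n)   # paired resampling
        samples[i] = metric_func(x[idx], y[idx])
    m = metric_func(x, y)
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return m, lower, upper

bias = lambda a, b: a.mean() - b.mean()
rng = np.random.default_rng(1)
x = rng.normal(0.5, 1, 200)
y = rng.normal(0.0, 1, 200)
m, lower, upper = percentile_bootstrap_ci(bias, x, y)
```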