pytesmo.metrics package
Metrics for pairs and triplets
This module provides functions to calculate metrics and their associated confidence intervals.
Pairwise Metrics
For pairwise metrics, confidence intervals can be calculated analytically (if available) or via bootstrapping. Example with bias:
import numpy as np
from pytesmo.metrics import bias, with_analytical_ci, with_bootstrapped_ci
x, y = np.random.randn(100), np.random.randn(100)
# only metric calculation
b = bias(x, y)
# with analytical CI
b, lower, upper = with_analytical_ci(bias, x, y, alpha=0.05)
# using bootstrapping instead (uses nsamples=1000 by default)
b, lower, upper = with_bootstrapped_ci(bias, x, y, alpha=0.05)
Analytical CIs are available for:
- bias
- rmsd
- nrmsd
- ubrmsd
- mse
- pearson_r
- spearman_r
- kendall_tau
You can use pytesmo.metrics.has_analytical_ci() to check programmatically whether analytical CIs are available.
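To illustrate the idea behind an analytical CI, the sketch below computes a paired-difference normal-theory interval for the bias with NumPy/SciPy. Note that `bias_with_ci` is a hypothetical helper for illustration only, and pytesmo's actual CI formulas may differ in detail:

```python
import numpy as np
from scipy import stats

def bias_with_ci(x, y, alpha=0.05):
    # Hypothetical helper: paired-difference normal-theory CI for bias.
    # pytesmo's with_analytical_ci(bias, x, y) plays the analogous role.
    d = x - y
    n = len(d)
    b = d.mean()                           # bias = mean(x) - mean(y)
    se = d.std(ddof=1) / np.sqrt(n)        # standard error of the mean difference
    t = stats.t.ppf(1 - alpha / 2, n - 1)  # two-sided t quantile
    return b, b - t * se, b + t * se

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)
b, lower, upper = bias_with_ci(x, y)
```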
Triple Collocation Metrics
Triple collocation metrics can be calculated with the function pytesmo.metrics.tcol_metrics(). For bootstrapping CIs, use pytesmo.metrics.tcol_metrics_with_bootstrapped_ci():
from pytesmo.metrics import tcol_metrics, tcol_metrics_with_bootstrapped_ci
x, y, z = ...
# only metrics
snr, err_std, beta = tcol_metrics(x, y, z)
# with cis, using nsamples=1000 and alpha=0.05 (default)
snr_and_ci, err_std_and_ci, beta_and_ci = (
tcol_metrics_with_bootstrapped_ci(x, y, z)
)
snr, lower_snr, upper_snr = snr_and_ci
Developer notes
Pairwise metrics are implemented in pytesmo.metrics.pairwise, triple collocation metrics in pytesmo.metrics.tcol.
- pytesmo.metrics.RSS(x, y)
Residual sum of squares.
- Parameters:
x (numpy.ndarray) – Observations.
y (numpy.ndarray) – Predictions.
- Returns:
res – Residual sum of squares.
- Return type: float
- pytesmo.metrics.aad(x, y)[source]
Average (=mean) absolute deviation (AAD).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
d – Mean absolute deviation.
- Return type: float
- pytesmo.metrics.bias(x, y)
Difference of the mean values.
Sign of output depends on argument order. We calculate mean(x) - mean(y).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
bias – Bias between x and y.
- Return type: float
- pytesmo.metrics.ecol(data, correlated=None, err_cov=None, abs_est=True)[source]
Extended collocation analysis to obtain estimates of:
signal variances
error variances
signal-to-noise ratios [dB]
error cross-covariances (and -correlations)
based on an arbitrary number of N>3 data sets.
EACH DATA SET MUST BE MEMBER OF >= 1 TRIPLET THAT FULFILLS THE CLASSICAL TRIPLE COLLOCATION ASSUMPTIONS
- Parameters:
data (pd.DataFrame) – Temporally matched input data sets in each column
correlated (tuple of tuples (string)) – A tuple containing tuples of data set names (column names), between which the error cross-correlation shall be estimated. E.g. [['AMSR-E','SMOS'],['GLDAS','ERA']] estimates error cross-correlations between (AMSR-E and SMOS), and (GLDAS and ERA), respectively.
err_cov – A priori known error cross-covariances that shall be included in the estimation (to obtain unbiased estimates)
abs_est – Force absolute values for signal and error variance estimates (to mitigate the issue of estimation uncertainties)
- Returns:
A dictionary with the following entries (<name> corresponds to the data set (DataFrame column) names):
- sig_<name> (signal variance of <name>)
- err_<name> (error variance of <name>)
- snr_<name> (SNR (in dB) of <name>)
- err_cov_<name1>_<name2> (error covariance between <name1> and <name2>)
- err_corr_<name1>_<name2> (error correlation between <name1> and <name2>)
Notes
Rescaling parameters can be derived from the signal variances, e.g., scaling <src> against <ref>:
beta = np.sqrt(sig_<ref> / sig_<src>)
rescaled = (data[<src>] - data[<src>].mean()) * beta + data[<ref>].mean()
Examples
# Just random numbers for demonstration
ds1 = np.random.normal(0, 1, 500)
ds2 = np.random.normal(0, 1, 500)
ds3 = np.random.normal(0, 1, 500)
ds4 = np.random.normal(0, 1, 500)
ds5 = np.random.normal(0, 1, 500)

# Three data sets without cross-correlated errors: this is equivalent
# to standard triple collocation.
df = pd.DataFrame({'ds1': ds1, 'ds2': ds2, 'ds3': ds3},
                  index=np.arange(500))
res = ecol(df)

# Five data sets, where data sets (1 and 2), and (3 and 4), are assumed
# to have cross-correlated errors.
df = pd.DataFrame({'ds1': ds1, 'ds2': ds2, 'ds3': ds3, 'ds4': ds4,
                   'ds5': ds5}, index=np.arange(500))
correlated = [['ds1', 'ds2'], ['ds3', 'ds4']]
res = ecol(df, correlated=correlated)
References
[Gruber2016] Gruber, A., Su, C. H., Crow, W. T., Zwieback, S., Dorigo, W. A., & Wagner, W. (2016). Estimating error cross-correlations in soil moisture data sets using extended collocation analysis. Journal of Geophysical Research: Atmospheres, 121(3), 1208-1219.
- pytesmo.metrics.has_analytical_ci(metric_func)[source]
Checks whether an analytical CI implementation is available.
- Parameters:
metric_func (callable) – Function that calculates metric value. Must be from pytesmo.metrics.
- Returns:
has_ci – True if there is a function with name metric_func.__name__ + "_ci" in pytesmo.metric_cis.
- Return type: bool
- pytesmo.metrics.index_of_agreement(o, p)[source]
The index of agreement was proposed by Willmott (1981) to overcome the insensitivity of Nash-Sutcliffe efficiency E and R^2 to differences in the observed and predicted means and variances (Legates and McCabe, 1999). The index of agreement represents the ratio of the mean square error and the potential error (Willmott, 1984). The potential error in the denominator represents the largest value that the squared difference of each pair can attain. The range of d is similar to that of R^2 and lies between 0 (no correlation) and 1 (perfect fit).
- Parameters:
o (numpy.ndarray) – Observations.
p (numpy.ndarray) – Predictions.
- Returns:
d – Index of agreement.
- Return type: float
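The standard Willmott formulation described above can be sketched as follows (a textbook reimplementation with NumPy, assumed to be equivalent to pytesmo's version):

```python
import numpy as np

def index_of_agreement(o, p):
    # Willmott's d: 1 - MSE / potential error. The denominator is the
    # largest value the squared pair differences can attain, given the
    # observed mean.
    pe = np.sum((np.abs(p - o.mean()) + np.abs(o - o.mean())) ** 2)
    return 1 - np.sum((o - p) ** 2) / pe

o = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([1.1, 1.9, 3.2, 3.8])
d = index_of_agreement(o, p)
```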
- pytesmo.metrics.kendall_tau(x, y)[source]
Wrapper for scipy.stats.kendalltau
- Parameters:
x (numpy.array) – First input vector.
y (numpy.array) – Second input vector.
- Returns:
tau – Kendall’s tau statistic
- Return type: float
See also
scipy.stats.kendalltau
- pytesmo.metrics.mad(x, y)[source]
Median absolute deviation (MAD).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
d – Median absolute deviation.
- Return type: float
- pytesmo.metrics.msd(x, y)[source]
Mean square deviation/mean square error.
For validation, MSD (same as MSE) is defined as
.. math::
MSD = \frac{1}{n}\sum\limits_{i=1}^n (x_i - y_i)^2
MSD can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSD &= MSD_{corr} + MSD_{var} + MSD_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates the full MSD; the functions msd_corr, msd_var, and msd_bias can be used to calculate the individual components.
- Parameters:
x (numpy.ndarray) – First input vector
y (numpy.ndarray) – Second input vector
- Returns:
msd – Mean square deviation
- Return type: float
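The decomposition above can be verified numerically. The sketch below reimplements the three components with NumPy (population statistics, ddof=0); the helper names mirror, but are not, the pytesmo functions:

```python
import numpy as np

def msd(x, y):
    return np.mean((x - y) ** 2)

def msd_components(x, y):
    # Correlation, variance, and bias terms of the decomposition above.
    r = np.corrcoef(x, y)[0, 1]
    sx, sy = x.std(), y.std()          # population std (ddof=0)
    msd_corr = 2 * sx * sy * (1 - r)
    msd_var = (sx - sy) ** 2
    msd_bias = (x.mean() - y.mean()) ** 2
    return msd_corr, msd_var, msd_bias

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 200)
y = 0.5 * x + 0.3 + rng.normal(0, 0.5, 200)
c, v, b = msd_components(x, y)
# The three components sum to the full MSD.
```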
- pytesmo.metrics.mse_bias(x, y)
Bias component of MSE.
MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSE &= MSE_{corr} + MSE_{var} + MSE_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates the term \(MSE_{bias} = (\mu_x - \mu_y)^2\).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
mse_bias – Bias component of MSE.
- Return type: float
- pytesmo.metrics.mse_corr(x, y)
Correlation component of MSE.
MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSE &= MSE_{corr} + MSE_{var} + MSE_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates the term \(MSE_{corr} = 2\sigma_x\sigma_y(1-r)\).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
mse_corr – Correlation component of MSE.
- Return type: float
- pytesmo.metrics.mse_decomposition(x, y)
Mean square deviation/mean square error.
For validation, MSE (same as MSD) is defined as
.. math::
MSE = \frac{1}{n}\sum\limits_{i=1}^n (x_i - y_i)^2
MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSE &= MSE_{corr} + MSE_{var} + MSE_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates all components as well as the sum.
- Parameters:
x (numpy.ndarray) – First input vector
y (numpy.ndarray) – Second input vector
- Returns:
mse (float) – Mean square deviation
mse_corr (float) – Correlation component of MSE.
mse_bias (float) – Bias component of the MSE.
mse_var (float) – Variance component of the MSE.
- pytesmo.metrics.mse_var(x, y)
Variance component of MSE.
MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).
.. math::
MSE &= MSE_{corr} + MSE_{var} + MSE_{bias}\\ &= 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2
This function calculates the term \(MSE_{var} = (\sigma_x - \sigma_y)^2\).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
mse_var – Variance component of MSE.
- Return type: float
- pytesmo.metrics.nash_sutcliffe(o, p)[source]
Nash Sutcliffe model efficiency coefficient E. The range of E lies between 1.0 (perfect fit) and -inf.
- Parameters:
o (numpy.ndarray) – Observations.
p (numpy.ndarray) – Predictions.
- Returns:
E – Nash Sutcliffe model efficiency coefficient E.
- Return type: float
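E can be written as one minus the ratio of the model's squared error to the variance of the observations; the sketch below uses this standard textbook formula (assumed to match pytesmo's implementation):

```python
import numpy as np

def nash_sutcliffe(o, p):
    # E = 1 - sum((o - p)^2) / sum((o - mean(o))^2)
    return 1 - np.sum((o - p) ** 2) / np.sum((o - o.mean()) ** 2)

o = np.array([1.0, 2.0, 3.0, 4.0])
```

E is 1 for a perfect fit, 0 for a model that only predicts the observed mean, and negative for anything worse.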
- pytesmo.metrics.nrmsd(x, y, ddof=0)[source]
Normalized root-mean-square deviation (nRMSD).
This normalizes RMSD by max(max(x), max(y)) - min(min(x), min(y)).
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero. DEPRECATED: ddof is deprecated and might be removed in future versions.
- Returns:
nrmsd – Normalized root-mean-square deviation (nRMSD).
- Return type: float
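The normalization described above can be sketched directly (a hypothetical reimplementation, ignoring the deprecated ddof argument):

```python
import numpy as np

def nrmsd(x, y):
    # RMSD divided by the combined data range of x and y.
    data_range = max(x.max(), y.max()) - min(x.min(), y.min())
    return np.sqrt(np.mean((x - y) ** 2)) / data_range

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 100)
y = x + rng.normal(0, 0.1, 100)
```

Because numerator and denominator both scale linearly with the data, nRMSD is invariant under a common rescaling of x and y.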
- pytesmo.metrics.pearson_r(x, y)[source]
Pearson’s linear correlation coefficient.
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
- Returns:
r – Pearson’s correlation coefficent.
- Return type: float
See also
scipy.stats.pearsonr
- pytesmo.metrics.rmsd(x, y, ddof=0)[source]
Root-mean-square deviation (RMSD).
This is the root of MSD (see pytesmo.metrics.msd()). If x and y have the same mean (i.e. mean(x - y) = 0), RMSD corresponds to the square root of the variance of x - y.
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
ddof (int, optional, DEPRECATED) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero. DEPRECATED: ddof is deprecated and might be removed in future versions.
- Returns:
rmsd – Root-mean-square deviation.
- Return type: float
- pytesmo.metrics.rolling_pr_rmsd(timestamps, data, window_size, center, min_periods)
Computation of rolling Pearson R and RMSD.
- Parameters:
timestamps (float64) – Time stamps as julian dates.
data (numpy.ndarray) – Time series data in 2d array.
window_size (float64) – Window size in fraction of days.
center (bool) – Set window at the center and include window_size in both directions.
min_periods (int) – Minimum number of observations in window required for computation.
- Returns:
pr_arr (numpy.ndarray) – Rolling Pearson R and p-value.
rmsd_arr (numpy.ndarray) – Rolling RMSD.
- pytesmo.metrics.spearman_r(x, y)[source]
Spearman’s rank correlation coefficient.
- Parameters:
x (numpy.array) – First input vector.
y (numpy.array) – Second input vector.
- Returns:
rho – Spearman correlation coefficient
- Return type: float
See also
scipy.stats.spearmanr
- pytesmo.metrics.tcol_metrics(x, y, z, ref_ind=0)[source]
Triple collocation based estimation of signal-to-noise ratio, absolute errors, and rescaling coefficients
- Parameters:
x (1D numpy.ndarray) – first input dataset
y (1D numpy.ndarray) – second input dataset
z (1D numpy.ndarray) – third input dataset
ref_ind (int) – Index of reference data set for estimating scaling coefficients. Default: 0 (x)
- Returns:
snr (numpy.ndarray) – signal-to-noise (variance) ratio [dB]
err_std (numpy.ndarray) – SCALED error standard deviation
beta (numpy.ndarray) – scaling coefficients (i_scaled = i * beta_i)
Notes
This function estimates the triple collocation errors, the scaling parameter \(\beta\), and the signal-to-noise ratio directly from the covariances of the dataset. For a general overview and how this function and pytesmo.metrics.tcol_error() are related, please see [Gruber2015].
Estimation of the error variances from the covariances of the datasets (e.g. \(\sigma_{XY}\) for the covariance between \(x\) and \(y\)) is done using the following formula:
\[\sigma_{\varepsilon_x}^2 = \sigma_{X}^2 - \frac{\sigma_{XY}\sigma_{XZ}}{\sigma_{YZ}}\]
\[\sigma_{\varepsilon_y}^2 = \sigma_{Y}^2 - \frac{\sigma_{YX}\sigma_{YZ}}{\sigma_{XZ}}\]
\[\sigma_{\varepsilon_z}^2 = \sigma_{Z}^2 - \frac{\sigma_{ZY}\sigma_{ZX}}{\sigma_{YX}}\]
\(\beta\) can also be estimated from the covariances:
\[\beta_x = 1\]
\[\beta_y = \frac{\sigma_{XZ}}{\sigma_{YZ}}\]
\[\beta_z = \frac{\sigma_{XY}}{\sigma_{ZY}}\]
The signal to noise ratio (SNR) is also calculated from the variances and covariances:
\[\text{SNR}_X[dB] = -10\log\left(\frac{\sigma_{X}^2\sigma_{YZ}}{\sigma_{XY}\sigma_{XZ}}-1\right)\]
\[\text{SNR}_Y[dB] = -10\log\left(\frac{\sigma_{Y}^2\sigma_{XZ}}{\sigma_{YX}\sigma_{YZ}}-1\right)\]
\[\text{SNR}_Z[dB] = -10\log\left(\frac{\sigma_{Z}^2\sigma_{XY}}{\sigma_{ZX}\sigma_{ZY}}-1\right)\]
It is given in dB to make it symmetric around zero. A value of zero means that the signal variance and the noise variance are equal; +3 dB means that the signal variance is twice as high as the noise variance.
References
[Gruber2015] Gruber, A., Su, C., Zwieback, S., Crow, W., Dorigo, W., Wagner, W. (2015). Recent advances in (soil moisture) triple collocation analysis. International Journal of Applied Earth Observation and Geoinformation, in review
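The SNR formulas above translate directly into code; the sketch below estimates them from the sample covariance matrix (a simplified reimplementation for illustration, not pytesmo's tcol_metrics, which additionally returns err_std and beta):

```python
import numpy as np

def tcol_snr(x, y, z):
    # SNR [dB] from the covariance formulas above.
    c = np.cov(np.vstack((x, y, z)))
    snr_x = -10 * np.log10(c[0, 0] * c[1, 2] / (c[0, 1] * c[0, 2]) - 1)
    snr_y = -10 * np.log10(c[1, 1] * c[0, 2] / (c[0, 1] * c[1, 2]) - 1)
    snr_z = -10 * np.log10(c[2, 2] * c[0, 1] / (c[0, 2] * c[1, 2]) - 1)
    return np.array([snr_x, snr_y, snr_z])

# Synthetic triplet: common signal plus independent noise of half the
# signal's standard deviation, i.e. a true SNR of 10*log10(4) ~ 6 dB.
rng = np.random.default_rng(1)
sig = rng.normal(0, 1, 100_000)
snr = tcol_snr(sig + rng.normal(0, 0.5, 100_000),
               sig + rng.normal(0, 0.5, 100_000),
               sig + rng.normal(0, 0.5, 100_000))
```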
- pytesmo.metrics.tcol_metrics_with_bootstrapped_ci(x, y, z, ref_ind=0, alpha=0.05, method='percentile', nsamples=1000, minimum_data_length=100)[source]
Evaluates triple collocation metrics and bootstraps confidence interval.
This calculates the SNR, error standard deviation, and scaling parameter value using Triple Collocation Analysis and uses bootstrapping to find confidence intervals.
- Parameters:
x (np.ndarray) – Data to be compared.
y (np.ndarray) – Data to be compared.
z (np.ndarray) – Data to be compared.
ref_ind (int) – Index of reference data set for estimating scaling coefficients. Default: 0 (x)
alpha (float, optional) – Confidence level, default is 0.05.
method (str, optional) –
The method to use to calculate confidence intervals. Available methods are:
"percentile" (default): Uses the percentiles of the bootstrapped metric distribution.
"basic": Uses the percentiles of the differences to the original metric value.
"BCa": Bias-corrected and accelerated bootstrap.
For more info and a comparison of the methods, see [Gilleland2010], especially Table 6 on page 38.
nsamples (int, optional) – Number of bootstrap samples, default is 1000.
minimum_data_length (int, optional) – Minimum amount of data required to do bootstrapping. Default is 100.
- Returns:
snr_result (tuple) – (snr, lower, upper), where each entry is an array of length 3.
err_std_result (tuple) – (err_std, lower, upper), where each entry is an array of length 3.
beta_result (tuple) – (beta, lower, upper), where each entry is an array of length 3. Note that beta is always 1 for the reference dataset (see ref_ind), therefore the lower and upper values are set to 1 too.
References
[Gilleland2010] Gilleland, E. (2010). Confidence Intervals for Forecast Verification (No. NCAR/TN-479+STR). University Corporation for Atmospheric Research. doi:10.5065/D6WD3XJM
- pytesmo.metrics.ubrmsd(x, y, ddof=0)[source]
Unbiased root-mean-square deviation (uRMSD).
This corresponds to RMSD with the mean bias removed beforehand, that is
.. math::
ubRMSD = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n \left((x_i - \bar{x}) - (y_i - \bar{y})\right)^2}
NOTE: If you are scaling the data beforehand to have zero mean bias, this is exactly the same as RMSD.
- Parameters:
x (numpy.ndarray) – First input vector.
y (numpy.ndarray) – Second input vector.
ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero. DEPRECATED: ddof is deprecated and might be removed in future versions.
- Returns:
ubrmsd – Unbiased root-mean-square deviation (uRMSD).
- Return type: float
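The note above can be checked numerically: removing the mean bias from one input first makes plain RMSD coincide with ubRMSD (a sketch with hypothetical NumPy helpers, not the pytesmo functions):

```python
import numpy as np

def rmsd(x, y):
    return np.sqrt(np.mean((x - y) ** 2))

def ubrmsd(x, y):
    # RMSD of the anomalies, i.e. after removing each series' mean.
    return rmsd(x - x.mean(), y - y.mean())

rng = np.random.default_rng(7)
x = rng.normal(0, 1, 500)
y = x + 0.3 + rng.normal(0, 0.2, 500)   # biased, noisy copy of x
debiased_y = y - (y.mean() - x.mean())  # shift y to match the mean of x
```

Since rmsd(x, y)^2 = ubrmsd(x, y)^2 + bias(x, y)^2, ubRMSD is never larger than RMSD.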
- pytesmo.metrics.with_analytical_ci(metric_func, x, y, alpha=0.05)[source]
Evaluates metric and analytical confidence interval.
This calculates the metric value and an analytical confidence interval. For this to work, the metric function must be from pytesmo.pairwise_metric_cis. Use pytesmo.metrics.has_analytical_ci() to check whether it's available.
- Parameters:
metric_func (callable or str) – Function that calculates the metric value. Must be from pytesmo.metrics, and have an analytical CI function implemented in pytesmo.metric_cis. The metric function must have the following signature: (x : np.ndarray, y : np.ndarray) -> float. Alternatively, it can be the name of the function in pytesmo.metrics.
x (np.ndarray) – Data to be compared.
y (np.ndarray) – Data to be compared.
alpha (float, optional) – Confidence level, default is 0.05.
- Returns:
m (float) – Metric value
lower (float) – Lower bound of confidence interval
upper (float) – Upper bound of confidence interval
- Raises:
ValueError : – If no analytical CI function is available.
- pytesmo.metrics.with_bootstrapped_ci(metric_func, x, y, alpha=0.05, method='percentile', nsamples=1000, minimum_data_length=100)[source]
Evaluates metric and bootstraps confidence interval.
This calculates the metric value and uses bootstrapping to find a confidence interval for the metric. This works only for pairwise metrics; use pytesmo.metrics.tcol_metrics_with_bootstrapped_ci() for TCA metrics.
- Parameters:
metric_func (callable) – Function that calculates metric value. The metric function must have the following signature:
(x : np.ndarray, y : np.ndarray) -> float
x (np.ndarray) – Data to be compared.
y (np.ndarray) – Data to be compared.
alpha (float, optional) – Confidence level, default is 0.05.
method (str, optional) –
The method to use to calculate confidence intervals. Available methods are:
"percentile" (default): Uses the percentiles of the bootstrapped metric distribution.
"basic": Uses the percentiles of the differences to the original metric value.
"BCa": Bias-corrected and accelerated bootstrap.
For more info and a comparison of the methods, see [Gilleland2010], especially Table 6 on page 38.
nsamples (int, optional) – Number of bootstrap samples, default is 1000.
minimum_data_length (int, optional) – Minimum amount of data required to do bootstrapping. Default is 100.
- Returns:
m (float) – Metric value for pairwise metrics
lower (float or array of floats) – Lower bound of confidence interval
upper (float or array of floats) – Upper bound of confidence interval
References
[Gilleland2010] Gilleland, E. (2010). Confidence Intervals for Forecast Verification (No. NCAR/TN-479+STR). University Corporation for Atmospheric Research. doi:10.5065/D6WD3XJM
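A minimal version of the default "percentile" method can be sketched as follows (a hypothetical standalone helper; pytesmo's with_bootstrapped_ci additionally implements "basic" and "BCa" and enforces minimum_data_length):

```python
import numpy as np

def percentile_bootstrap_ci(metric_func, x, y, alpha=0.05, nsamples=1000):
    # Resample pairs with replacement and take percentiles of the
    # bootstrapped metric distribution as the confidence interval.
    rng = np.random.default_rng(0)
    n = len(x)
    samples = np.empty(nsamples)
    for i in range(nsamples):
        idx = rng.integers(0, n, size=n)   # paired resampling
        samples[i] = metric_func(x[idx], y[idx])
    m = metric_func(x, y)
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return m, lower, upper

bias = lambda a, b: a.mean() - b.mean()
rng = np.random.default_rng(1)
x = rng.normal(0.5, 1, 200)
y = rng.normal(0.0, 1, 200)
m, lower, upper = percentile_bootstrap_ci(bias, x, y)
```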