pytesmo.metrics package

Metrics for pairs and triplets

This module provides functions to calculate metrics and their associated confidence intervals.

Pairwise Metrics

For pairwise metrics, confidence intervals can be calculated analytically (if available) or via bootstrapping. Example with bias:

import numpy as np

from pytesmo.metrics import bias, with_analytical_ci, with_bootstrapped_ci

x, y = np.random.randn(100), np.random.randn(100)

# only metric calculation
b = bias(x, y)

# with analytical CI
b, lower, upper = with_analytical_ci(bias, x, y, alpha=0.05)

# using bootstrapping instead (uses nsamples=1000 by default)
b, lower, upper = with_bootstrapped_ci(bias, x, y, alpha=0.05)

Analytical CIs are available for

  • bias

  • rmsd

  • nrmsd

  • ubrmsd

  • mse

  • pearson_r

  • spearman_r

  • kendall_tau

You can use pytesmo.metrics.has_analytical_ci() to check programmatically whether analytical CIs are available.

Triple Collocation Metrics

Triple collocation metrics can be calculated with the function pytesmo.metrics.tcol_metrics(). For bootstrapped CIs, use pytesmo.metrics.tcol_metrics_with_bootstrapped_ci():

from pytesmo.metrics import tcol_metrics, tcol_metrics_with_bootstrapped_ci

x, y, z = ...

# only metrics
snr, err_std, beta = tcol_metrics(x, y, z)

# with CIs, using nsamples=1000 and alpha=0.05 (defaults)
snr_and_ci, err_std_and_ci, beta_and_ci = (
    tcol_metrics_with_bootstrapped_ci(x, y, z)
)
snr, lower_snr, upper_snr = snr_and_ci

Developer notes

Pairwise metrics are implemented in pytesmo.metrics.pairwise, triple collocation metrics in pytesmo.metrics.tcol.

pytesmo.metrics.RSS(x, y)

Residual sum of squares.

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

res – Residual sum of squares.

Return type:

float
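
The definition above can be computed directly with NumPy; a minimal sketch (independent of the pytesmo implementation):

```python
import numpy as np

def rss(x, y):
    """Residual sum of squares: sum of squared differences."""
    return np.sum((x - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 5.0])
print(rss(x, y))  # -> 4.0
```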

pytesmo.metrics.aad(x, y)[source]

Average (=mean) absolute deviation (AAD).

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

d – Mean absolute deviation.

Return type:

float
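
A minimal NumPy equivalent of the definition (a sketch, not the pytesmo implementation itself):

```python
import numpy as np

def aad(x, y):
    """Average (mean) absolute deviation between two vectors."""
    return np.mean(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 5.0])
print(aad(x, y))  # -> 1.0  (mean of |diffs| = 1, 0, 2)
```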

pytesmo.metrics.bias(x, y)

Difference of the mean values.

Sign of output depends on argument order. We calculate mean(x) - mean(y).

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

bias – Bias between x and y.

Return type:

float
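
Since the sign depends on argument order, a tiny NumPy sketch (independent of pytesmo) makes the convention concrete:

```python
import numpy as np

def bias(x, y):
    """Difference of means; the sign depends on argument order."""
    return np.mean(x) - np.mean(y)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 4.0])
print(bias(x, y))  # -> -1.0
print(bias(y, x))  # -> 1.0
```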

pytesmo.metrics.ecol(data, correlated=None, err_cov=None, abs_est=True)[source]

Extended collocation analysis to obtain estimates of:

  • signal variances

  • error variances

  • signal-to-noise ratios [dB]

  • error cross-covariances (and -correlations)

based on an arbitrary number of N>3 data sets.

Each data set must be a member of at least one triplet that fulfills the classical triple collocation assumptions.

Parameters:
  • data (pd.DataFrame) – Temporally matched input data sets in each column

  • correlated (tuple of tuples (string)) – A tuple containing tuples of data set names (column names) between which the error cross-correlation shall be estimated, e.g. [['AMSR-E','SMOS'],['GLDAS','ERA']] estimates error cross-correlations between (AMSR-E and SMOS) and (GLDAS and ERA), respectively.

  • err_cov – A priori known error cross-covariances that shall be included in the estimation (to obtain unbiased estimates)

  • abs_est – Force absolute values for signal and error variance estimates (to mitigate the issue of estimation uncertainties)

Returns:

A dictionary with the following entries, where <name> corresponds to the data set (DataFrame column) names:

  • sig_<name> (signal variance of <name>)

  • err_<name> (error variance of <name>)

  • snr_<name> (SNR (in dB) of <name>)

  • err_cov_<name1>_<name2> (error covariance between <name1> and <name2>)

  • err_corr_<name1>_<name2> (error correlation between <name1> and <name2>)

Notes

Rescaling parameters can be derived from the signal variances, e.g., scaling <src> against <ref>:

beta = np.sqrt(sig_<ref> / sig_<src>)
rescaled = (data[<src>] - data[<src>].mean()) * beta + data[<ref>].mean()

Examples

# Just random numbers for demonstrations
ds1 = np.random.normal(0,1,500)
ds2 = np.random.normal(0,1,500)
ds3 = np.random.normal(0,1,500)
ds4 = np.random.normal(0,1,500)
ds5 = np.random.normal(0,1,500)

# Three data sets without cross-correlated errors: This is equivalent
# to standard triple collocation.
df = pd.DataFrame({'ds1':ds1,'ds2':ds2,'ds3':ds3},
                  index=np.arange(500))
res = ecol(df)

# Five data sets, where data sets (1 and 2), and (3 and 4), are assumed
# to have cross-correlated errors.
df = pd.DataFrame({'ds1':ds1,'ds2':ds2,'ds3':ds3,'ds4':ds4,'ds5':ds5},
                  index=np.arange(500),)
correlated = [['ds1','ds2'],['ds3','ds4']]
res = ecol(df,correlated=correlated)

References

[Gruber2016]

Gruber, A., Su, C. H., Crow, W. T., Zwieback, S., Dorigo, W. A., & Wagner, W. (2016). Estimating error cross-correlations in soil moisture data sets using extended collocation analysis. Journal of Geophysical Research: Atmospheres, 121(3), 1208-1219.

pytesmo.metrics.has_analytical_ci(metric_func)[source]

Checks whether an analytical CI implementation is available.

Parameters:

metric_func (callable) – Function that calculates metric value. Must be from pytesmo.metrics.

Returns:

has_ci – True if there is a function named metric_func.__name__ + "_ci" in pytesmo.metric_cis.

Return type:

bool

pytesmo.metrics.index_of_agreement(o, p)[source]

The index of agreement was proposed by Willmott (1981) to overcome the insensitivity of the Nash-Sutcliffe efficiency E and R^2 to differences in the observed and predicted means and variances (Legates and McCabe, 1999). The index of agreement represents the ratio of the mean square error to the potential error (Willmott, 1984). The potential error in the denominator is the largest value that the squared difference of each pair can attain. The range of d is similar to that of R^2 and lies between 0 (no correlation) and 1 (perfect fit).

Parameters:
  • o (numpy.ndarray) – Observations.

  • p (numpy.ndarray) – Predictions.

Returns:

d – Index of agreement.

Return type:

float
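
The standard Willmott (1981) formula can be written directly in NumPy; a sketch of that definition (the pytesmo implementation may differ in detail):

```python
import numpy as np

def index_of_agreement(o, p):
    """Willmott's index of agreement:
    d = 1 - sum((o - p)^2) / sum((|p - mean(o)| + |o - mean(o)|)^2)."""
    o_mean = np.mean(o)
    potential_error = np.sum((np.abs(p - o_mean) + np.abs(o - o_mean)) ** 2)
    return 1 - np.sum((o - p) ** 2) / potential_error

o = np.array([1.0, 2.0, 3.0, 4.0])
print(index_of_agreement(o, o))  # perfect fit -> 1.0
```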

pytesmo.metrics.kendall_tau(x, y)[source]

Wrapper for scipy.stats.kendalltau

Parameters:
  • x (numpy.array) – First input vector.

  • y (numpy.array) – Second input vector.

Returns:

tau – Kendall’s tau statistic

Return type:

float

pytesmo.metrics.mad(x, y)[source]

Median absolute deviation (MAD).

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

d – Median absolute deviation.

Return type:

float

pytesmo.metrics.msd(x, y)[source]

Mean square deviation/mean square error.

For validation, MSD (same as MSE) is defined as

\[MSD = \frac{1}{n}\sum\limits_{i=1}^n (x_i - y_i)^2\]

MSD can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).

\[MSD = MSD_{corr} + MSD_{var} + MSD_{bias} = 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2\]

This function calculates the full MSD; the functions mse_corr, mse_var, and mse_bias (documented below) can be used to calculate the individual components.

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

msd – Mean square deviation

Return type:

float
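
The decomposition above is an exact identity when population statistics (ddof=0) are used; a plain-NumPy check, independent of pytesmo:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 1000)
y = rng.normal(0.5, 2, 1000)

msd = np.mean((x - y) ** 2)

sx, sy = np.std(x), np.std(y)        # population std (ddof=0)
r = np.corrcoef(x, y)[0, 1]
msd_corr = 2 * sx * sy * (1 - r)     # non-perfect correlation term
msd_var = (sx - sy) ** 2             # difference-in-variances term
msd_bias = (np.mean(x) - np.mean(y)) ** 2  # bias term

# the three components sum to the full MSD (up to floating point)
print(np.isclose(msd, msd_corr + msd_var + msd_bias))  # -> True
```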

pytesmo.metrics.mse_bias(x, y)

Bias component of MSE.

MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).

\[MSE = MSE_{corr} + MSE_{var} + MSE_{bias} = 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2\]

This function calculates the term \(MSE_{bias} = (\mu_x - \mu_y)^2\).

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

mse_bias – Bias component of MSE.

Return type:

float

pytesmo.metrics.mse_corr(x, y)

Correlation component of MSE.

MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).

\[MSE = MSE_{corr} + MSE_{var} + MSE_{bias} = 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2\]

This function calculates the term \(MSE_{corr} = 2\sigma_x\sigma_y(1-r)\).

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

mse_corr – Correlation component of MSE.

Return type:

float

pytesmo.metrics.mse_decomposition(x, y)

Mean square deviation/mean square error.

For validation, MSE (same as MSD) is defined as

\[MSE = \frac{1}{n}\sum\limits_{i=1}^n (x_i - y_i)^2\]

MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).

\[MSE = MSE_{corr} + MSE_{var} + MSE_{bias} = 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2\]

This function calculates all components as well as their sum.

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

  • mse (float) – Mean square deviation

  • mse_corr (float) – Correlation component of MSE.

  • mse_bias (float) – Bias component of the MSE.

  • mse_var (float) – Variance component of the MSE.

pytesmo.metrics.mse_var(x, y)

Variance component of MSE.

MSE can be decomposed into a term describing the deviation of x and y attributable to non-perfect correlation (r < 1), a term depending on the difference in variances between x and y, and the difference in means between x and y (bias).

\[MSE = MSE_{corr} + MSE_{var} + MSE_{bias} = 2\sigma_x\sigma_y (1-r) + (\sigma_x - \sigma_y)^2 + (\mu_x - \mu_y)^2\]

This function calculates the term \(MSE_{var} = (\sigma_x - \sigma_y)^2\).

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

mse_var – Variance component of MSE.

Return type:

float

pytesmo.metrics.nash_sutcliffe(o, p)[source]

Nash Sutcliffe model efficiency coefficient E. The range of E lies between 1.0 (perfect fit) and -inf.

Parameters:
  • o (numpy.ndarray) – Observations.

  • p (numpy.ndarray) – Predictions.

Returns:

E – Nash Sutcliffe model efficiency coefficient E.

Return type:

float
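
The standard definition, E = 1 - sum((o - p)^2) / sum((o - mean(o))^2), can be sketched in plain NumPy (independent of the pytesmo implementation). Predicting the observed mean gives E = 0; a perfect fit gives E = 1:

```python
import numpy as np

def nash_sutcliffe(o, p):
    """Nash-Sutcliffe efficiency from its definition."""
    return 1 - np.sum((o - p) ** 2) / np.sum((o - np.mean(o)) ** 2)

o = np.array([1.0, 2.0, 3.0, 4.0])
print(nash_sutcliffe(o, o))                       # perfect fit -> 1.0
print(nash_sutcliffe(o, np.full(4, np.mean(o))))  # predicting the mean -> 0.0
```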

pytesmo.metrics.nrmsd(x, y, ddof=0)[source]

Normalized root-mean-square deviation (nRMSD).

This normalizes RMSD by max(max(x), max(y)) - min(min(x), min(y)).

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

  • ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default, ddof is zero. DEPRECATED: ddof is deprecated and might be removed in future versions.

Returns:

nrmsd – Normalized root-mean-square deviation (nRMSD).

Return type:

float
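
The normalization described above can be reproduced in plain NumPy; a sketch of the default ddof=0 case, independent of pytesmo:

```python
import numpy as np

def nrmsd(x, y):
    """RMSD normalized by the combined value range of x and y."""
    rmsd = np.sqrt(np.mean((x - y) ** 2))
    value_range = max(np.max(x), np.max(y)) - min(np.min(x), np.min(y))
    return rmsd / value_range

x = np.array([0.0, 2.0, 4.0])
y = np.array([1.0, 3.0, 5.0])
print(nrmsd(x, y))  # -> 0.2  (RMSD of 1.0 over a value range of 5.0)
```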

pytesmo.metrics.pearson_r(x, y)[source]

Pearson’s linear correlation coefficient.

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

Returns:

r – Pearson’s correlation coefficent.

Return type:

float

pytesmo.metrics.rmsd(x, y, ddof=0)[source]

Root-mean-square deviation (RMSD).

This is the root of MSD (see pytesmo.metrics.msd()). If x and y have the same mean (i.e., mean(x - y) = 0), RMSD corresponds to the square root of the variance of x - y.

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

  • ddof (int, optional, DEPRECATED) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default, ddof is zero. DEPRECATED: ddof is deprecated and might be removed in future versions.

Returns:

rmsd – Root-mean-square deviation.

Return type:

float

pytesmo.metrics.rolling_pr_rmsd(timestamps, data, window_size, center, min_periods)

Computation of rolling Pearson R and RMSD.

Parameters:
  • timestamps (float64) – Time stamps as julian dates.

  • data (numpy.ndarray) – Time series data in 2d array.

  • window_size (float64) – Window size in fraction of days.

  • center (bool) – Set window at the center and include window_size in both directions.

  • min_periods (int) – Minimum number of observations in window required for computation.

Returns:

  • pr_arr (numpy.array) – Rolling Pearson R and p-value.

  • rmsd_arr (numpy.array) – Rolling RMSD
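
The idea of a time-based centered window can be sketched in plain NumPy. This is a simplified illustration, not the pytesmo implementation: `rolling_pr_rmsd_sketch` is a hypothetical name, and it returns only the correlation (the pytesmo version also returns p-values and honors min_periods):

```python
import numpy as np

def rolling_pr_rmsd_sketch(timestamps, x, y, window_size):
    """Centered rolling Pearson R and RMSD over a window in days."""
    r_arr, rmsd_arr = [], []
    for t in timestamps:
        m = np.abs(timestamps - t) <= window_size  # centered window
        r_arr.append(np.corrcoef(x[m], y[m])[0, 1])
        rmsd_arr.append(np.sqrt(np.mean((x[m] - y[m]) ** 2)))
    return np.array(r_arr), np.array(rmsd_arr)

t = np.arange(10, dtype=float)  # time stamps, one per day
x = np.sin(t)
y = np.sin(t) + 0.1             # shifted copy: r = 1, RMSD = 0.1 everywhere
r, rm = rolling_pr_rmsd_sketch(t, x, y, window_size=2.0)
print(np.allclose(r, 1.0), np.allclose(rm, 0.1))  # -> True True
```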

pytesmo.metrics.spearman_r(x, y)[source]

Spearman’s rank correlation coefficient.

Parameters:
  • x (numpy.array) – First input vector.

  • y (numpy.array) – Second input vector.

Returns:

rho – Spearman correlation coefficient

Return type:

float

See also

scipy.stats.spearmanr

pytesmo.metrics.tcol_metrics(x, y, z, ref_ind=0)[source]

Triple collocation based estimation of signal-to-noise ratio, absolute errors, and rescaling coefficients

Parameters:
  • x (1D numpy.ndarray) – first input dataset

  • y (1D numpy.ndarray) – second input dataset

  • z (1D numpy.ndarray) – third input dataset

  • ref_ind (int) – Index of reference data set for estimating scaling coefficients. Default: 0 (x)

Returns:

  • snr (numpy.ndarray) – signal-to-noise (variance) ratio [dB]

  • err_std (numpy.ndarray) – SCALED error standard deviation

  • beta (numpy.ndarray) – scaling coefficients (i_scaled = i * beta_i)

Notes

This function estimates the triple collocation errors, the scaling parameter \(\beta\) and the signal to noise ratio directly from the covariances of the dataset. For a general overview and how this function and pytesmo.metrics.tcol_error() are related please see [Gruber2015].

Estimation of the error variances from the covariances of the datasets (e.g. \(\sigma_{XY}\) for the covariance between \(x\) and \(y\)) is done using the following formula:

\[\sigma_{\varepsilon_x}^2 = \sigma_{X}^2 - \frac{\sigma_{XY}\sigma_{XZ}}{\sigma_{YZ}}\]
\[\sigma_{\varepsilon_y}^2 = \sigma_{Y}^2 - \frac{\sigma_{YX}\sigma_{YZ}}{\sigma_{XZ}}\]
\[\sigma_{\varepsilon_z}^2 = \sigma_{Z}^2 - \frac{\sigma_{ZY}\sigma_{ZX}}{\sigma_{YX}}\]

\(\beta\) can also be estimated from the covariances:

\[\beta_x = 1\]
\[\beta_y = \frac{\sigma_{XZ}}{\sigma_{YZ}}\]
\[\beta_z=\frac{\sigma_{XY}}{\sigma_{ZY}}\]

The signal to noise ratio (SNR) is also calculated from the variances and covariances:

\[\text{SNR}_X[dB] = -10\log\left(\frac{\sigma_{X}^2\sigma_{YZ}} {\sigma_{XY}\sigma_{XZ}}-1\right)\]
\[\text{SNR}_Y[dB] = -10\log\left(\frac{\sigma_{Y}^2\sigma_{XZ}} {\sigma_{YX}\sigma_{YZ}}-1\right)\]
\[\text{SNR}_Z[dB] = -10\log\left(\frac{\sigma_{Z}^2\sigma_{XY}} {\sigma_{ZX}\sigma_{ZY}}-1\right)\]

It is given in dB to make it symmetric around zero. If the value is zero it means that the signal variance and the noise variance are equal. +3dB means that the signal variance is twice as high as the noise variance.
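
The covariance-based estimators above translate directly into NumPy. A sketch under stated assumptions (x as the reference, matching ref_ind=0; `tcol_sketch` is a hypothetical name, not the pytesmo implementation):

```python
import numpy as np

def tcol_sketch(x, y, z):
    """SNR [dB] and scaling coefficients beta from the covariance matrix."""
    c = np.cov(np.vstack((x, y, z)))
    snr = -10 * np.log10([
        c[0, 0] * c[1, 2] / (c[0, 1] * c[0, 2]) - 1,
        c[1, 1] * c[0, 2] / (c[0, 1] * c[1, 2]) - 1,
        c[2, 2] * c[0, 1] / (c[0, 2] * c[1, 2]) - 1,
    ])
    beta = np.array([1.0, c[0, 2] / c[1, 2], c[0, 1] / c[2, 1]])
    return snr, beta

# synthetic triplet: common signal, independent noise, different scalings
rng = np.random.default_rng(0)
truth = rng.normal(0, 1, 100000)
x = truth + rng.normal(0, 0.5, 100000)
y = 2 * truth + rng.normal(0, 0.5, 100000)
z = 0.5 * truth + rng.normal(0, 0.5, 100000)

snr, beta = tcol_sketch(x, y, z)
print(np.round(beta, 1))  # beta_y recovers ~0.5, beta_z ~2.0 (rescaling to x)
```

With signal variance 1 and noise variance 0.25, SNR_X comes out near 10*log10(4), i.e. about 6 dB, illustrating the dB interpretation above.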

References

[Gruber2015]

Gruber, A., Su, C., Zwieback, S., Crow, W., Dorigo, W., Wagner, W. (2015). Recent advances in (soil moisture) triple collocation analysis. International Journal of Applied Earth Observation and Geoinformation, in review

pytesmo.metrics.tcol_metrics_with_bootstrapped_ci(x, y, z, ref_ind=0, alpha=0.05, method='percentile', nsamples=1000, minimum_data_length=100)[source]

Evaluates triple collocation metrics and bootstraps confidence interval.

This calculates the SNR, error standard deviation, and scaling parameter value using Triple Collocation Analysis and uses bootstrapping to find confidence intervals.

Parameters:
  • x (np.ndarray) – Data to be compared.

  • y (np.ndarray) – Data to be compared.

  • z (np.ndarray) – Data to be compared.

  • ref_ind (int) – Index of reference data set for estimating scaling coefficients. Default: 0 (x)

  • alpha (float, optional) – Confidence level, default is 0.05.

  • method (str, optional) –

    The method to use to calculate confidence intervals. Available methods are:

    • ”percentile” (default): Uses the percentiles of the bootstrapped metric distribution.

    • ”basic”: Uses the percentiles of the differences to the original metric value.

    • ”BCa”: Bias-corrected and accelerated bootstrap.

    For more info and a comparison of the methods, see [Gilleland2010], especially Table 6 on page 38.

  • nsamples (int, optional) – Number of bootstrap samples, default is 1000.

  • minimum_data_length (int, optional) – Minimum amount of data required to do bootstrapping. Default is 100.

Returns:

  • snr_result (tuple) – (snr, lower, upper), where each entry is an array of length 3.

  • err_std_result (tuple) – (err_std, lower, upper), where each entry is an array of length 3.

  • beta_result (tuple) – (beta, lower, upper), where each entry is an array of length 3. Note that beta is always 1 for the reference dataset (see ref_ind), therefore the lower and upper values are set to 1 too.

References

[Gilleland2010]

Gilleland, E. (2010). Confidence Intervals for Forecast Verification (No. NCAR/TN-479+STR). University Corporation for Atmospheric Research. doi:10.5065/D6WD3XJM

pytesmo.metrics.ubrmsd(x, y, ddof=0)[source]

Unbiased root-mean-square deviation (uRMSD).

This corresponds to RMSD with mean biases removed beforehand, that is

\[ubRMSD = \sqrt{\frac{1}{n}\sum\limits_{i=1}^n \left((x_i - \bar{x}) - (y_i - \bar{y})\right)^2}\]

NOTE: If you are scaling the data beforehand to have zero mean bias, this is exactly the same as RMSD.

Parameters:
  • x (numpy.ndarray) – First input vector.

  • y (numpy.ndarray) – Second input vector.

  • ddof (int, optional) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default, ddof is zero. DEPRECATED: ddof is deprecated and might be removed in future versions.

Returns:

ubrmsd – Unbiased root-mean-square deviation (uRMSD).

Return type:

float
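
Removing the mean bias beforehand implies the identity ubRMSD^2 + bias^2 = RMSD^2, which a plain-NumPy check (independent of pytesmo) confirms:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 500)
y = x + 0.3 + rng.normal(0, 0.2, 500)  # biased, noisy copy of x

rmsd = np.sqrt(np.mean((x - y) ** 2))
bias = np.mean(x) - np.mean(y)
ubrmsd = np.sqrt(np.mean(((x - np.mean(x)) - (y - np.mean(y))) ** 2))

# ubRMSD removes the mean bias: ubRMSD^2 + bias^2 == RMSD^2
print(np.isclose(ubrmsd ** 2 + bias ** 2, rmsd ** 2))  # -> True
```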

pytesmo.metrics.with_analytical_ci(metric_func, x, y, alpha=0.05)[source]

Evaluates metric and analytical confidence interval.

This calculates the metric value and an analytical confidence interval. For this to work, the metric function must be from pytesmo.metrics and have an analytical CI function implemented in pytesmo.metric_cis. Use pytesmo.metrics.has_analytical_ci() to check whether one is available.

Parameters:
  • metric_func (callable or str) – Function that calculates metric value. Must be from pytesmo.metrics, and have an analytical CI function implemented in pytesmo.metric_cis. The metric function must have the following signature: (x : np.ndarray, y : np.ndarray) -> float Alternatively can be the name of the function in pytesmo.metrics.

  • x (np.ndarray) – Data to be compared.

  • y (np.ndarray) – Data to be compared.

  • alpha (float, optional) – Confidence level, default is 0.05.

Returns:

  • m (float) – Metric value

  • lower (float) – Lower bound of confidence interval

  • upper (float) – Upper bound of confidence interval

Raises:

ValueError : – If no analytical CI function is available.

pytesmo.metrics.with_bootstrapped_ci(metric_func, x, y, alpha=0.05, method='percentile', nsamples=1000, minimum_data_length=100)[source]

Evaluates metric and bootstraps confidence interval.

This calculates the metric value and uses bootstrapping to find a confidence interval for the metric. This works only for pairwise metrics; use pytesmo.metrics.tcol_metrics_with_bootstrapped_ci() for TCA metrics.

Parameters:
  • metric_func (callable) – Function that calculates metric value. The metric function must have the following signature: (x : np.ndarray, y : np.ndarray) -> float

  • x (np.ndarray) – Data to be compared.

  • y (np.ndarray) – Data to be compared.

  • alpha (float, optional) – Confidence level, default is 0.05.

  • method (str, optional) –

    The method to use to calculate confidence intervals. Available methods are:

    • ”percentile” (default): Uses the percentiles of the bootstrapped metric distribution.

    • ”basic”: Uses the percentiles of the differences to the original metric value.

    • ”BCa”: Bias-corrected and accelerated bootstrap.

    For more info and a comparison of the methods, see [Gilleland2010], especially Table 6 on page 38.

  • nsamples (int, optional) – Number of bootstrap samples, default is 1000.

  • minimum_data_length (int, optional) – Minimum amount of data required to do bootstrapping. Default is 100.

Returns:

  • m (float) – Metric value for pairwise metrics

  • lower (float or array of floats) – Lower bound of confidence interval

  • upper (float or array of floats) – Upper bound of confidence interval
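
The "percentile" method described above can be sketched in plain NumPy. `bootstrap_ci` below is a hypothetical, simplified stand-in for this function (percentile method only, no minimum_data_length check), not the pytesmo implementation:

```python
import numpy as np

def bootstrap_ci(metric_func, x, y, alpha=0.05, nsamples=1000, seed=0):
    """Percentile-bootstrap confidence interval for a pairwise metric."""
    rng = np.random.default_rng(seed)
    n = len(x)
    samples = np.empty(nsamples)
    for i in range(nsamples):
        idx = rng.integers(0, n, n)           # resample pairs with replacement
        samples[i] = metric_func(x[idx], y[idx])
    m = metric_func(x, y)
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return m, lower, upper

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 200)
y = x + rng.normal(0, 0.5, 200)
bias = lambda a, b: np.mean(a) - np.mean(b)
m, lo, hi = bootstrap_ci(bias, x, y)
print(lo <= m <= hi)  # -> True
```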

References

[Gilleland2010]

Gilleland, E. (2010). Confidence Intervals for Forecast Verification (No. NCAR/TN-479+STR). University Corporation for Atmospheric Research. doi:10.5065/D6WD3XJM

Submodules