pytesmo package

Submodules

pytesmo.df_metrics module

Module contains wrappers for methods in pytesmo.metrics which can be given pandas.DataFrames instead of single numpy.arrays . If the DataFrame has more columns than the function has input parameters the function will be applied pairwise

Created on Aug 14, 2013

@author: Christoph Paulik Christoph.Paulik@geo.tuwien.ac.at

exception pytesmo.df_metrics.DataFrameDimensionError[source]

Bases: exceptions.Exception

__delattr__

x.__delattr__(‘name’) <==> del x.name

__format__()

default object formatter

__getattribute__

x.__getattribute__(‘name’) <==> x.name

__getitem__

x.__getitem__(y) <==> x[y]

__getslice__

x.__getslice__(i, j) <==> x[i:j]

Use of negative indices is not supported.

__hash__
__reduce_ex__()

helper for pickle

__repr__
__setattr__

x.__setattr__(‘name’, value) <==> x.name = value

__sizeof__() → int

size of object in memory, in bytes

__str__
args
message
pytesmo.df_metrics.RSS(df)[source]

Redidual sum of squares

Returns:result – with column names of df for which the calculation was done as name of the element separated by ‘_and_’
Return type:namedtuple
pytesmo.df_metrics.bias(df)[source]

Bias

Returns:bias – of shape (len(df.columns),len(df.columns))
Return type:pandas.Dataframe
pytesmo.df_metrics.kendalltau(df)[source]

Wrapper for scipy.stats.kendalltau

Returns:result – with column names of df for which the calculation was done as name of the element separated by ‘_and_’
Return type:namedtuple
pytesmo.df_metrics.mse(df)[source]

Mean square error (MSE) as a decomposition of the RMSD into individual error components

Returns:result – with column names of df for which the calculation was done as name of the element separated by ‘_and_’
Return type:namedtuple
pytesmo.df_metrics.nash_sutcliffe(df)[source]

Nash Sutcliffe model efficiency coefficient

Returns:result – with column names of df for which the calculation was done as name of the element separated by ‘_and_’
Return type:namedtuple
pytesmo.df_metrics.nrmsd(df)[source]

Normalized root-mean-square deviation

Returns:result – with column names of df for which the calculation was done as name of the element separated by ‘_and_’
Return type:namedtuple
pytesmo.df_metrics.pairwise_apply(df, method, comm=False)[source]

Compute given method pairwise for all columns, excluding NA/null values

Parameters:
  • df (pandas.DataFrame) – input data, method will be applied to each column pair
  • method (function) – method to apply to each column pair. has to take 2 input arguments of type numpy.array and return one value or tuple of values
Returns:

results

Return type:

pandas.DataFrame

pytesmo.df_metrics.pearsonr(df)[source]

Wrapper for scipy.stats.pearsonr

Returns:result – with column names of df for which the calculation was done as name of the element separated by ‘_and_’
Return type:namedtuple
pytesmo.df_metrics.rmsd(df)[source]

Root-mean-square deviation

Returns:result – with column names of df for which the calculation was done as name of the element separated by ‘_and_’
Return type:namedtuple
pytesmo.df_metrics.spearmanr(df)[source]

Wrapper for scipy.stats.spearmanr

Returns:result – with column names of df for which the calculation was done as name of the element separated by ‘_and_’
Return type:namedtuple

See also

pytesmo.metrics.spearmenr(), scipy.stats.spearmenr()

pytesmo.df_metrics.tcol_error(df)[source]

Triple collocation error estimate In this case df has to have exactly 3 columns, since triple wise application of a function is not yet implemented and would probably return a complicated structure

Returns:result – with column names of df
Return type:namedtuple
pytesmo.df_metrics.ubrmsd(df)[source]

Unbiased root-mean-square deviation

Returns:result – with column names of df for which the calculation was done as name of the element separated by ‘_and_’
Return type:namedtuple

pytesmo.metrics module

pytesmo.metrics.RSS(o, p)[source]

Residual sum of squares.

Parameters:
Returns:

res – Residual sum of squares.

Return type:

float

pytesmo.metrics.aad(o, p)[source]

Average (=mean) absolute deviation (AAD).

Parameters:
Returns:

d – Mean absolute deviation.

Return type:

float

pytesmo.metrics.bias(o, p)[source]

Difference of the mean values.

Parameters:
Returns:

bias – Bias between observations and predictions.

Return type:

float

pytesmo.metrics.index_of_agreement(o, p)[source]

Index of agreement was proposed by Willmot (1981), to overcome the insenstivity of Nash-Sutcliffe efficiency E and R^2 to differences in the observed and predicted means and variances (Legates and McCabe, 1999). The index of agreement represents the ratio of the mean square error and the potential error (Willmot, 1984). The potential error in the denominator represents the largest value that the squared difference of each pair can attain. The range of d is similar to that of R^2 and lies between 0 (no correlation) and 1 (perfect fit).

Parameters:
Returns:

d – Index of agreement.

Return type:

float

pytesmo.metrics.kendalltau(o, p)[source]

Wrapper for scipy.stats.kendalltau

Parameters:
Returns:

  • Kendall’s tau (float) – The tau statistic
  • p-value (float) – The two-sided p-palue for a hypothesis test whose null hypothesis is an absence of association, tau = 0.

pytesmo.metrics.mad(o, p)[source]

Median absolute deviation (MAD).

Parameters:
Returns:

d – Median absolute deviation.

Return type:

float

pytesmo.metrics.mse(o, p, ddof=0)[source]

Mean square error (MSE) as a decomposition of the RMSD into individual error components. The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator and its bias. For an unbiased estimator, the MSE is the variance of the estimator. Like the variance, MSE has the same units of measurement as the square of the quantity being estimated. The delta degree of freedom keyword (ddof) can be used to correct for the case the true variance is unknown and estimated from the population. Concretely, the naive sample variance estimator sums the squared deviations and divides by n, which is biased. Dividing instead by n - 1 yields an unbiased estimator.

Parameters:
  • o (numpy.ndarray) – Observations.
  • p (numpy.ndarray) – Predictions.
  • ddof (int, optional) – Delta degree of freedom.The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Returns:

  • mse (float) – Mean square error (MSE).
  • mse_corr (float) – Correlation component of MSE.
  • mse_bias (float) – Bias component of the MSE.
  • mse_var (float) – Variance component of the MSE.

pytesmo.metrics.nash_sutcliffe(o, p)[source]

Nash Sutcliffe model efficiency coefficient E. The range of E lies between 1.0 (perfect fit) and -inf.

Parameters:
Returns:

E – Nash Sutcliffe model efficiency coefficient E.

Return type:

float

pytesmo.metrics.nrmsd(o, p, ddof=0)[source]

Normalized root-mean-square deviation (nRMSD).

Parameters:
  • o (numpy.ndarray) – Observations.
  • p (numpy.ndarray) – Predictions.
  • ddof (int, optional) – Delta degree of freedom.The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Returns:

nrmsd – Normalized root-mean-square deviation (nRMSD).

Return type:

float

pytesmo.metrics.pearson_conf(r, n, c=95)[source]

Calcalates the confidence interval of a given pearson correlation coefficient using a fisher z-transform, only valid for correlation coefficients calculated from a bivariate normal distribution

Parameters:
  • r (float or numpy.ndarray) – Correlation coefficient
  • n (int or numpy.ndarray) – Number of observations used in determining the correlation coefficient
  • c (float) – Level of confidence in percent, from 0-100.
Returns:

  • r_lower (float or numpy.ndarray) – Lower confidence boundary.
  • r_upper (float or numpy.ndarray) – Upper confidence boundary.

pytesmo.metrics.pearsonr(o, p)[source]

Wrapper for scipy.stats.pearsonr. Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.

Parameters:
Returns:

  • r (float) – Pearson’s correlation coefficent.
  • p-value (float) – 2 tailed p-value.

pytesmo.metrics.pearsonr_recursive(x, y, n_old=0, sum_xi_yi=0, sum_xi=0, sum_yi=0, sum_x2=0, sum_y2=0)[source]

Calculate pearson correlation in a recursive manner based on

r_{xy} = \frac{n\sum x_iy_i-\sum x_i\sum y_i} {\sqrt{n\sum x_i^2-(\sum x_i)^2}~\sqrt{n\sum y_i^2-(\sum y_i)^2}}

Parameters:
  • x (numpy.ndarray) – New values for x
  • y (numpy.ndarray) – New values for y
  • n_old (float, optional) – number of observations from previous pass
  • sum_xi_yi (float, optional) –

    \sum x_iy_i

    from previous pass

  • sum_xi (float, optional) –

    \sum x_i

    from previous pass

  • sum_yi (float, optional) –

    \sum y_i

    from previous pass

  • sum_x2 (float, optional) –

    \sum x_i^2

    from previous pass

  • sum_y2 (float, optional) –

    \sum y_i^2

    from previous pass

Returns:

  • r (float) – Pearson correlation coefficient
  • params (tuple) – tuple of (n_new, sum_xi_yi, sum_xi, sum_yi, sum_x2, sum_y2) . Can be used when calling the next iteration as *params.

pytesmo.metrics.rmsd(o, p, ddof=0)[source]

Root-mean-square deviation (RMSD). It is implemented for an unbiased estimator, which means the RMSD is the square root of the variance, also known as the standard error. The delta degree of freedom keyword (ddof) can be used to correct for the case the true variance is unknown and estimated from the population. Concretely, the naive sample variance estimator sums the squared deviations and divides by n, which is biased. Dividing instead by n -1 yields an unbiased estimator

Parameters:
  • o (numpy.ndarray) – Observations.
  • p (numpy.ndarray) – Predictions.
  • ddof (int, optional) – Delta degree of freedom.The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Returns:

rmsd – Root-mean-square deviation.

Return type:

float

pytesmo.metrics.spearmanr(o, p)[source]

Wrapper for scipy.stats.spearmanr. Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.

Parameters:
Returns:

  • rho (float) – Spearman correlation coefficient
  • p-value (float) – The two-sided p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated

See also

scipy.stats.spearmenr()

pytesmo.metrics.tcol_error(x, y, z)[source]

Triple collocation error estimate of three calibrated/scaled datasets.

Parameters:
Returns:

  • e_x (float) – Triple collocation error for x.
  • e_y (float) – Triple collocation error for y.
  • e_z (float) – Triple collocation error for z.

Notes

This function estimates the triple collocation error based on already scaled/calibrated input data. It follows formula 4 given in [Scipal2008].

\sigma_{\varepsilon_x}^2 = \langle (x-y)(x-z) \rangle

\sigma_{\varepsilon_y}^2 = \langle (y-x)(y-z) \rangle

\sigma_{\varepsilon_z}^2 = \langle (z-x)(z-y) \rangle

where the \langle\rangle brackets mean the temporal mean.

References

[Scipal2008]Scipal, K., Holmes, T., De Jeu, R., Naeimi, V., & Wagner, W. (2008). A possible solution for the problem of estimating the error structure of global soil moisture data sets. Geophysical Research Letters, 35(24), .
pytesmo.metrics.tcol_snr(x, y, z, ref_ind=0)[source]

triple collocation based estimation of signal-to-noise ratio, absolute errors, and rescaling coefficients

Parameters:
  • x (1D numpy.ndarray) – first input dataset
  • y (1D numpy.ndarray) – second input dataset
  • z (1D numpy.ndarray) – third input dataset
  • ref_ind (int) – index of reference data set for estimating scaling coefficients. default: 0 (x)
Returns:

  • snr (numpy.ndarray) – signal-to-noise (variance) ratio [dB]
  • err_std (numpy.ndarray) – SCALED error standard deviation
  • beta (numpy.ndarray) – scaling coefficients (i_scaled = i * beta_i)

Notes

This function estimates the triple collocation errors, the scaling parameter \beta and the signal to noise ratio directly from the covariances of the dataset. For a general overview and how this function and pytesmo.metrics.tcol_error() are related please see [Gruber2015].

Estimation of the error variances from the covariances of the datasets (e.g. \sigma_{XY} for the covariance between x and y) is done using the following formula:

\sigma_{\varepsilon_x}^2 = \sigma_{X}^2 - \frac{\sigma_{XY}\sigma_{XZ}}{\sigma_{YZ}}

\sigma_{\varepsilon_y}^2 = \sigma_{Y}^2 - \frac{\sigma_{YX}\sigma_{YZ}}{\sigma_{XZ}}

\sigma_{\varepsilon_z}^2 = \sigma_{Z}^2 - \frac{\sigma_{ZY}\sigma_{ZX}}{\sigma_{YX}}

\beta can also be estimated from the covariances:

\beta_x = 1

\beta_y = \frac{\sigma_{XZ}}{\sigma_{YZ}}

\beta_z=\frac{\sigma_{XY}}{\sigma_{ZY}}

The signal to noise ratio (SNR) is also calculated from the variances and covariances:

\text{SNR}_X[dB] = -10\log\left(\frac{\sigma_{X}^2\sigma_{YZ}}{\sigma_{XY}\sigma_{XZ}}-1\right)

\text{SNR}_Y[dB] = -10\log\left(\frac{\sigma_{Y}^2\sigma_{XZ}}{\sigma_{YX}\sigma_{YZ}}-1\right)

\text{SNR}_Z[dB] = -10\log\left(\frac{\sigma_{Z}^2\sigma_{XY}}{\sigma_{ZX}\sigma_{ZY}}-1\right)

It is given in dB to make it symmetric around zero. If the value is zero it means that the signal variance and the noise variance are equal. +3dB means that the signal variance is twice as high as the noise variance.

References

[Gruber2015]Gruber, A., Su, C., Zwieback, S., Crow, W., Dorigo, W., Wagner, W. (2015). Recent advances in (soil moisture) triple collocation analysis. International Journal of Applied Earth Observation and Geoinformation, in review
pytesmo.metrics.ubrmsd(o, p, ddof=0)[source]

Unbiased root-mean-square deviation (uRMSD).

Parameters:
  • o (numpy.ndarray) – Observations.
  • p (numpy.ndarray) – Predictions.
  • ddof (int, optional) – Delta degree of freedom.The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Returns:

urmsd – Unbiased root-mean-square deviation (uRMSD).

Return type:

float

pytesmo.scaling module

Created on Apr 17, 2013

@author: Christoph Paulik christoph.paulik@geo.tuwien.ac.at

pytesmo.scaling.add_scaled(df, method='linreg', label_in=None, label_scale=None)[source]

takes a dataframe and appends a scaled time series to it. If no labels are given the first column will be scaled to the second column of the DataFrame

Parameters:
  • df (pandas.DataFrame) – input dataframe
  • method (string) – scaling method
  • label_in (string, optional) – the column of the dataframe that should be scaled to that with label_scale default is the first column
  • label_scale (string, optional) – the column of the dataframe the label_in column should be scaled to default is the second column
Returns:

df – input dataframe with new column labeled label_in+’_scaled_’+method

Return type:

pandas.DataFrame

pytesmo.scaling.cdf_match(src, ref, min_val=None, max_val=None, nbins=100)[source]

computes cumulative density functions of src and ref at their respective bin-edges by 5th order spline interpolation; then matches CDF of src to CDF of ref.

This function does not make sure that the percentiles are unique so it can happen that multiple measurements are scaled to one point or that there are NaN values in the output array.

Parameters:
  • src (numpy.array) – input dataset which will be scaled
  • ref (numpy.array) – src will be scaled to this dataset
  • min_val (float, optional) – Minimum allowed value, output data is capped at this value
  • max_val (float, optional) – Maximum allowed value, output data is capped at this value
  • nbins (int, optional) – Number of bins to use for estimation of the CDF
Returns:

CDF matched values – dataset src with CDF as ref

Return type:

numpy.array

pytesmo.scaling.gen_cdf_match(src, perc_src, perc_ref, min_val=None, max_val=None, k=1)[source]

General cdf matching:

  1. computes discrete cumulative density functions of src- and ref at the given percentiles
  2. computes continuous CDFs by k-th order spline fitting
  3. CDF of src is matched to CDF of ref
Parameters:
  • src (numpy.array) – input dataset which will be scaled
  • perc_src (numpy.array) – percentiles of src
  • perc_ref (numpy.array) – percentiles of reference data estimated through method of choice, must be same size as perc_src
  • min_val (float, optional) – Minimum allowed value, output data is capped at this value
  • max_val (float, optional) – Maximum allowed value, output data is capped at this value
  • k (int, optional) – Order of spline to fit
Returns:

CDF matched values – dataset src with CDF as ref

Return type:

numpy.array

pytesmo.scaling.lin_cdf_match(src, ref, min_val=None, max_val=None, percentiles=[0, 5, 10, 30, 50, 70, 90, 95, 100])[source]

computes cumulative density functions of src and ref at their respective bin-edges by linear interpolation; then matches CDF of src to CDF of ref.

This function does not make sure that the percentiles are unique so it can happen that multiple measurements are scaled to one point or that there are NaN values in the output array.

Parameters:
  • src (numpy.array) – input dataset which will be scaled
  • ref (numpy.array) – src will be scaled to this dataset
  • min_val (float, optional) – Minimum allowed value, output data is capped at this value
  • max_val (float, optional) – Maximum allowed value, output data is capped at this value
  • percentiles (list or numpy.ndarray) – Percentiles to use for CDF matching
Returns:

CDF matched values – dataset src with CDF as ref

Return type:

numpy.array

pytesmo.scaling.lin_cdf_match_stored_params(src, perc_src, perc_ref, min_val=None, max_val=None)[source]
Parameters:
  • src (numpy.array) – input data to scale
  • perc_src (numpy.array) – percentiles of src estimated through method of choice
  • perc_ref (numpy.array) – percentiles of reference data estimated through method of choice, must be same size as perc_src
  • min_val (float, optional) – Minimum allowed value, output data is capped at this value
  • max_val (float, optional) – Maximum allowed value, output data is capped at this value
pytesmo.scaling.linreg(src, ref)[source]

scales the input datasets using linear regression

Parameters:
  • src (numpy.array) – input dataset which will be scaled
  • ref (numpy.array) – src will be scaled to this dataset
Returns:

scaled dataset – dataset scaled using linear regression

Return type:

numpy.array

pytesmo.scaling.mean_std(src, ref)[source]

scales the input datasets so that they have the same mean and standard deviation afterwards

Parameters:
  • src (numpy.array) – input dataset which will be scaled
  • ref (numpy.array) – src will be scaled to this dataset
Returns:

scaled dataset – dataset src with same mean and standard deviation as ref

Return type:

numpy.array

pytesmo.scaling.min_max(src, ref)[source]

scales the input datasets so that they have the same minimum and maximum afterwards

Parameters:
  • src (numpy.array) – input dataset which will be scaled
  • ref (numpy.array) – src will be scaled to this dataset
Returns:

scaled dataset – dataset src with same maximum and minimum as ref

Return type:

numpy.array

pytesmo.scaling.scale(df, method='linreg', reference_index=0)[source]

takes pandas.DataFrame and scales all columns to the column specified by reference_index with the chosen method

Parameters:
  • df (pandas.DataFrame) – containing matched time series that should be scaled
  • method (string, optional) – method definition, has to be a function in globals() that takes 2 numpy.array as input and returns one numpy.array of same length
  • reference_index (int, optional) – default 0, column index of reference dataset in dataframe
Returns:

scaled data – all time series of the input DataFrame scaled to the one specified by reference_index

Return type:

pandas.DataFrame

pytesmo.temporal_matching module

Provides a temporal matching function

pytesmo.temporal_matching.df_match(reference, *args, **kwds)[source]

Finds temporal match between the reference pandas.DataFrame (index has to be datetime) and n other pandas.DataFrame (index has to be datetime).

Parameters:
  • reference (pandas.DataFrame or pandas.TimeSeries) – The index of this dataframe will be the reference.
  • *args

    The index of this dataframe(s) will be matched. If it is a pandas.Series then it has to have a name. Otherwise no column name can be assigned to the matched DataFrame.

  • window (float) – Fraction of days of the maximum pos./neg. distance allowed, i.e. the value of window represents the half-winow size (e.g. window=0.5, will search for matches between -12 and +12 hours) (default: None)
  • dropna (boolean) – Drop rows containing only NaNs (default: False)
  • dropduplicates (boolean) – Drop duplicated temporal matched (default: False)
  • asym_window (string, optional) – <= stands for using a smaller and equal only for the left/smaller side of the window comparison >= stands for using a larger and equal only for the right/larger side of the window comparison The default is to use <= and >= for both sides of the search window
Returns:

temporal_matched_args – Dataframe with index from matched reference index

Return type:

pandas.DataFrame or tuple of pandas.DataFrame

pytesmo.temporal_matching.matching(reference, *args, **kwargs)[source]

Finds temporal match between the reference pandas.TimeSeries (index has to be datetime) and n other pandas.TimeSeries (index has to be datetime).

Parameters:
  • reference (pandas.TimeSeries) – The index of this Series will be the reference.
  • *args

    The index of these Series(s) will be matched.

  • window (float) – Fraction of days of the maximum pos./neg. distance allowed, i.e. the value of window represents the half-winow size (e.g. window=0.5, will search for matches between -12 and +12 hours) (default: None)
Returns:

temporal_match – containing the index of the reference Series and a column for each of the other input Series

Return type:

pandas.DataFrame

pytesmo.utils module

Module containing utility functions that do not fit into other modules

pytesmo.utils.element_iterable(el)[source]

Test if a element is iterable

Parameters:el (object) –
Returns:iterable – if True then then el is iterable if Fales then not
Return type:boolean
pytesmo.utils.ensure_iterable(el)[source]

Ensure that an object is iterable by putting it into a list.

Parameters:el (object) –
Returns:iterable – [el]
Return type:list
pytesmo.utils.interp_uniq(src)[source]

replace non unique values by their linear interpolated value This method interpolates iteratively like it is done in IDL.

Parameters:src (numpy.array) – array to ensure uniqueness of
Returns:src – interpolated unique values in array of same size as src
Return type:numpy.array
pytesmo.utils.ml_percentile(in_data, percentiles)[source]

Calculate percentiles in the way Matlab and IDL do it.

By using interpolation between the lowest an highest rank and the minimum and maximum outside.

Parameters:
Returns:

perc – values of the percentiles

Return type:

numpy.ndarray

Module contents