pytesmo.validation_framework package

Submodules

pytesmo.validation_framework.adapters module

class pytesmo.validation_framework.adapters.AnomalyAdapter(cls, window_size=35, columns=None)[source]

Bases: pytesmo.validation_framework.adapters.BasicAdapter

Takes the pandas DataFrame that the read_ts or read method of the instance returns and calculates the anomaly of the time series based on a moving average.

Parameters
  • cls (class instance) – Must have a read_ts or read method returning a pandas.DataFrame

  • window_size (float, optional) – The window size [days] of the moving-average window used to calculate the anomaly reference. Default: 35 (days)

  • columns (list, optional) – columns in the dataset for which to calculate anomalies.

calc_anom(data)[source]
read(*args, **kwargs)[source]
read_ts(*args, **kwargs)[source]
class pytesmo.validation_framework.adapters.AnomalyClimAdapter(cls, columns=None, **kwargs)[source]

Bases: pytesmo.validation_framework.adapters.BasicAdapter

Takes the pandas DataFrame that the read_ts or read method of the instance returns and calculates the anomaly of the time series based on the climatology.

Parameters
  • cls (class instance) – Must have a read_ts or read method returning a pandas.DataFrame

  • columns (list, optional) – columns in the dataset for which to calculate anomalies.

  • kwargs – Any additional arguments will be given to the calc_climatology function.

calc_anom(data)[source]
read(*args, **kwargs)[source]
read_ts(*args, **kwargs)[source]
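
A minimal usage sketch for both anomaly adapters, wrapping a hypothetical DummyReader (all names and data below are illustrative):

    import numpy as np
    import pandas as pd
    from pytesmo.validation_framework.adapters import AnomalyAdapter, AnomalyClimAdapter

    class DummyReader:
        """Hypothetical reader; real readers return actual time series."""
        def read_ts(self, *args):
            idx = pd.date_range("2015-01-01", periods=3 * 365, freq="D")
            sm = 0.2 + 0.1 * np.sin(2 * np.pi * idx.dayofyear / 365.0)
            return pd.DataFrame({"sm": sm}, index=idx)

    reader = DummyReader()

    # Anomalies relative to a 35-day moving average:
    anom = AnomalyAdapter(reader, window_size=35, columns=["sm"]).read_ts()

    # Anomalies relative to the climatology; extra kwargs are passed on
    # to calc_climatology:
    clim_anom = AnomalyClimAdapter(reader, columns=["sm"]).read_ts()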
class pytesmo.validation_framework.adapters.BasicAdapter(cls, data_property_name='data')[source]

Bases: object

Base class for the other adapters. It works around data readers that don’t return a DataFrame (e.g. ASCAT) and removes unnecessary timezone information from the data.

read(*args, **kwargs)[source]
read_ts(*args, **kwargs)[source]
class pytesmo.validation_framework.adapters.MaskingAdapter(cls, op, threshold, column_name=None)[source]

Bases: pytesmo.validation_framework.adapters.BasicAdapter

Transform the given class to return a boolean dataset given the operator and threshold. This class calls the read_ts and read methods of the given instance and applies boolean masking to the returned data using the given operator and threshold.

Parameters
  • cls (object) – has to have a read_ts or read method

  • op (string) – one of ‘<’, ‘<=’, ‘==’, ‘>=’, ‘>’, ‘!=’

  • threshold – value to use as the threshold combined with the operator

  • column_name (string, optional) – name of the column to which the read dataset is restricted before applying the masking

read(*args, **kwargs)[source]
read_ts(*args, **kwargs)[source]
class pytesmo.validation_framework.adapters.SelfMaskingAdapter(cls, op, threshold, column_name)[source]

Bases: pytesmo.validation_framework.adapters.BasicAdapter

Transform the given (reader) class to return a dataset that is masked based on the given column, operator, and threshold. This class calls the read_ts or read method of the given reader instance, applies the operator/threshold to the specified column, and masks the whole dataframe with the result.

Parameters
  • cls (object) – has to have a read_ts or read method

  • op (string) – one of ‘<’, ‘<=’, ‘==’, ‘>=’, ‘>’, ‘!=’

  • threshold – value to use as the threshold combined with the operator

  • column_name (string) – name of the column to apply the threshold to

read(*args, **kwargs)[source]
read_ts(*args, **kwargs)[source]
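
The difference between the two masking adapters in a short sketch; DummyReader and all values are illustrative:

    import pandas as pd
    from pytesmo.validation_framework.adapters import MaskingAdapter, SelfMaskingAdapter

    class DummyReader:
        """Hypothetical reader; real readers return actual time series."""
        def read_ts(self, *args):
            idx = pd.date_range("2020-01-01", periods=5, freq="D")
            return pd.DataFrame({"sm": [0.1, 0.5, 0.3, 0.8, 0.2],
                                 "flag": [0, 0, 1, 0, 1]}, index=idx)

    reader = DummyReader()

    # MaskingAdapter returns a boolean dataset (True where sm < 0.4),
    # e.g. for use as a masking dataset in the Validation class:
    bool_ds = MaskingAdapter(reader, "<", 0.4, "sm").read_ts()

    # SelfMaskingAdapter returns the original columns, keeping only the
    # rows where flag == 0:
    masked = SelfMaskingAdapter(reader, "==", 0, "flag").read_ts()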

pytesmo.validation_framework.data_manager module

class pytesmo.validation_framework.data_manager.DataManager(datasets, ref_name, period=None, read_ts_names='read_ts')[source]

Bases: object

Class to handle the data management.

Parameters
  • datasets (dict of dicts) –

    Keys: string, dataset names

    Values: dict containing the following fields

    ‘class’: object – Class containing the method read_ts for reading the data.

    ‘columns’: list – List of columns which will be used in the validation process.

    ‘args’: list, optional – Args for reading the data.

    ‘kwargs’: dict, optional – Kwargs for reading the data.

    ‘grids_compatible’: boolean, optional – If set to True the grid point index is used directly when reading the other dataset; if False then lon, lat are used and a nearest neighbour search is necessary. Default: False

    ‘use_lut’: boolean, optional – If set to True the grid point index (obtained from a calculated lut between reference and other) is used when reading the other dataset; if False then lon, lat are used and a nearest neighbour search is necessary. Default: False

    ‘lut_max_dist’: float, optional – Maximum allowed distance in meters for the lut calculation. Default: None

    A minimal construction sketch follows the parameter list below.

  • ref_name (string) – Name of the reference dataset

  • period (list, optional) – Of type [datetime start, datetime end]. If given then the two input datasets will be truncated to start <= dates <= end.

  • read_ts_names (string or dict of strings, optional) – If a method name other than ‘read_ts’ should be used for reading the data, it can be specified here. If it is a dict then specify a function name for each dataset.
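
A minimal construction sketch, assuming a hypothetical DummyReader in place of real dataset readers:

    import numpy as np
    import pandas as pd
    from pytesmo.validation_framework.data_manager import DataManager

    class DummyReader:
        """Hypothetical reader; real readers look data up by gpi or lon/lat."""
        def __init__(self, column):
            idx = pd.date_range("2020-01-01", periods=365, freq="D")
            self.df = pd.DataFrame({column: np.random.rand(365)}, index=idx)
        def read_ts(self, *args):
            return self.df

    datasets = {
        "ISMN": {"class": DummyReader("soil moisture"),
                 "columns": ["soil moisture"]},
        "ASCAT": {"class": DummyReader("sm"), "columns": ["sm"],
                  "grids_compatible": True},
    }

    dm = DataManager(datasets, ref_name="ISMN")
    df_dict = dm.get_data(1, 16.0, 48.0)  # {'ISMN': <DataFrame>, 'ASCAT': <DataFrame>}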

use_lut(other_name)

Returns lut between reference and other if use_lut for other dataset was set to True.

get_result_names()

Return result names based on the reference and the other dataset names.

read_reference(*args)

Function to read and prepare the reference dataset.

read_other(other_name, *args)

Function to read and prepare the other datasets.

property ds_dict
get_data(gpi, lon, lat)[source]

Get all the data from this manager for a certain grid point, longitude, latitude combination.

Parameters
  • gpi (int) – grid point index

  • lon (float) – grid point longitude

  • lat (float) – grid point latitude

Returns

df_dict – Dictionary with dataset names as keys and pandas.DataFrames containing the data for the point as values. The dict will be empty if no data is available.

Return type

dict of pandas.DataFrames

get_luts()[source]

Returns luts between reference and others if use_lut for other datasets was set to True.

Returns

luts – Dictionary with the other dataset names as keys and the lut between reference and other (or None) as values

Return type

dict

get_other_data(gpi, lon, lat)[source]

Get all the data for the non-reference datasets from this manager for a certain grid point, longitude, latitude combination.

Parameters
  • gpi (int) – grid point index

  • lon (float) – grid point longitude

  • lat (float) – grid point latitude

Returns

other_dataframes – Dictionary with dataset names as keys and pandas.DataFrames containing the data for the point as values. The dict will be empty if no data is available.

Return type

dict of pandas.DataFrames

get_results_names(n=2)[source]
read_ds(name, *args)[source]

Function to read and prepare a dataset.

Calls read_ts of the dataset.

Takes either 1 (gpi) or 2 (lon, lat) arguments.

Parameters
  • name (string) – Name of the other dataset.

  • gpi (int) – Grid point index

  • lon (float) – Longitude of point

  • lat (float) – Latitude of point

Returns

data_df – Data DataFrame.

Return type

pandas.DataFrame or None

read_other(name, *args)[source]

Function to read and prepare a dataset.

Calls read_ts of the dataset.

Takes either 1 (gpi) or 2 (lon, lat) arguments.

Parameters
  • name (string) – Name of the other dataset.

  • gpi (int) – Grid point index

  • lon (float) – Longitude of point

  • lat (float) – Latitude of point

Returns

data_df – Data DataFrame.

Return type

pandas.DataFrame or None

read_reference(*args)[source]

Function to read and prepare the reference dataset.

Calls read_ts of the dataset. Takes either 1 (gpi) or 2 (lon, lat) arguments.

Parameters
  • gpi (int) – Grid point index

  • lon (float) – Longitude of point

  • lat (float) – Latitude of point

Returns

ref_df – Reference dataframe.

Return type

pandas.DataFrame or None

pytesmo.validation_framework.data_manager.flatten(seq)[source]
pytesmo.validation_framework.data_manager.get_result_names(ds_dict, refkey, n=2)[source]

Return result names for all possible dataset combinations that include the reference dataset.

Parameters
  • ds_dict (dict) – Dict of lists containing the dataset names as keys and a list of the columns to read from the dataset as values.

  • refkey (string) – dataset name to use as a reference

  • n (int) – Number of datasets to combine with each other. If n=2, two datasets are always combined into one result; if n=3, three datasets are combined into one result, and so on. n has to be <= the total number of datasets.

Returns

results_names – Containing all combinations of (referenceDataset.column, otherDataset.column)

Return type

list of tuples
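
For illustration, a sketch of the expected combinations (dataset and column names are made up):

    from pytesmo.validation_framework.data_manager import get_result_names

    ds_dict = {"ISMN": ["soil moisture"], "ASCAT": ["sm"], "ERA": ["sm_era"]}
    combos = get_result_names(ds_dict, "ISMN", n=2)
    # Expected to contain pairs such as:
    #   (('ISMN', 'soil moisture'), ('ASCAT', 'sm'))
    #   (('ISMN', 'soil moisture'), ('ERA', 'sm_era'))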

pytesmo.validation_framework.data_scalers module

Data scaler classes to be used together with the validation framework.

class pytesmo.validation_framework.data_scalers.CDFStoreParamsScaler(path, grid, percentiles=[0, 5, 10, 30, 50, 70, 90, 95, 100])[source]

Bases: object

CDF scaling using stored parameters if available. If stored parameters are not available they are calculated and written to disk.

Parameters
  • path (string) – Path where the data is/should be stored

  • grid (pygeogrids.grids.CellGrid instance) – Grid on which the data is stored. Should be the same as the spatial reference grid of the validation framework instance in which this scaler is used.

  • percentiles (list or np.ndarray) – Percentiles to use for CDF matching
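
A minimal construction sketch, assuming pygeogrids’ genreg_grid helper for building a CellGrid (path and grid resolution are illustrative):

    import tempfile
    from pygeogrids.grids import genreg_grid
    from pytesmo.validation_framework.data_scalers import CDFStoreParamsScaler

    # Regular 1x1 degree grid, split into 5 degree cells.
    grid = genreg_grid(1, 1).to_cell_grid(5.0)
    scaler = CDFStoreParamsScaler(path=tempfile.mkdtemp(), grid=grid)
    # Pass `scaler` as the `scaling` argument of the Validation class; the
    # percentiles are then loaded from disk per gpi, or calculated and
    # stored on first use.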

calc_parameters(data)[source]

Calculate the percentiles used for CDF matching.

Parameters

data (pandas.DataFrame) – temporally matched dataset

Returns

parameters – Dictionary with the column names of the input DataFrame as keys and numpy.ndarrays with the percentiles as values

Return type

dictionary

get_parameters(data, gpi)[source]

Function to get the scaling parameters. Tries to load them; if they are not found, they are calculated and stored.

Parameters
  • data (pandas.DataFrame) – temporally matched dataset

  • gpi (int) – grid point index of self.grid

Returns

params – Dictionary with the column names of the input DataFrame as keys and numpy.ndarrays with the percentiles as values

Return type

dictionary

load_parameters(gpi)[source]
scale(data, reference_index, gpi_info)[source]

Scale all columns in data to the column at the reference_index.

Parameters
  • data (pandas.DataFrame) – temporally matched dataset

  • reference_index (int) – Which column of the data contains the scaling reference.

  • gpi_info (tuple) – Tuple of at least (gpi, lon, lat), where gpi has to be a grid point index of this scaler’s grid.

Raises

ValueError – if scaling is not successful

store_parameters(gpi, parameters)[source]

Store parameters for a gpi into a netCDF file.

Parameters
  • gpi (int) – grid point index of self.grid

  • parameters (dictionary) – Dictionary with the column names of the input DataFrame as keys and numpy.ndarrays with the percentiles as values

class pytesmo.validation_framework.data_scalers.DefaultScaler(method)[source]

Bases: object

Scaling class that implements the scaling based on a given method from the pytesmo.scaling module.

Parameters

method (string) – The data will be scaled into the reference space using the method specified by this string.

scale(data, reference_index, gpi_info)[source]

Scale all columns in data to the column at the reference_index.

Parameters
  • data (pandas.DataFrame) – temporally matched dataset

  • reference_index (int) – Which column of the data contains the scaling reference.

  • gpi_info (tuple) – Tuple of at least (gpi, lon, lat), where gpi has to be a grid point index of this scaler’s grid.

Raises

ValueError – if scaling is not successful
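
A short sketch, assuming the ‘mean_std’ method from pytesmo.scaling (data and gpi_info are illustrative):

    import numpy as np
    import pandas as pd
    from pytesmo.validation_framework.data_scalers import DefaultScaler

    index = pd.date_range("2020-01-01", periods=100, freq="D")
    data = pd.DataFrame({
        "ref": np.random.randn(100),                  # scaling reference
        "other": 2.5 * np.random.randn(100) + 10.0,   # different mean/std
    }, index=index)

    scaler = DefaultScaler(method="mean_std")
    scaled = scaler.scale(data, reference_index=0, gpi_info=(0, 16.0, 48.0))
    # 'other' is now scaled to (approximately) the mean and standard
    # deviation of 'ref'.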

pytesmo.validation_framework.metric_calculators module

Created on Sep 24, 2013

Metric calculators usable together with the validation framework core

@author: Christoph.Paulik@geo.tuwien.ac.at

class pytesmo.validation_framework.metric_calculators.BasicMetrics(other_name='k1', calc_tau=False, metadata_template=None)[source]

Bases: pytesmo.validation_framework.metric_calculators.MetadataMetrics

This class computes the basic metrics: Pearson’s R, Spearman’s rho, optionally Kendall’s tau, RMSD, and BIAS.

It also stores information about gpi, lat, lon and the number of observations.

Parameters
  • other_name (string, optional) – Name of the column of the non-reference / other dataset in the pandas DataFrame

  • calc_tau (boolean, optional) – If True then tau is calculated as well. This is set to False by default since the calculation of Kendall’s tau is rather slow and can significantly impact the performance of e.g. global validation studies

calc_metrics(data, gpi_info)[source]

calculates the desired statistics

Parameters
  • data (pandas.DataFrame) – with 2 columns, the first column is the reference dataset named ‘ref’ the second column the dataset to compare against named ‘other’

  • gpi_info (tuple) – of (gpi, lon, lat)

Notes

Kendall tau calculation is optional at the moment because the scipy implementation is very slow, which is problematic for global comparisons
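
A sketch of calling the calculator directly; column naming follows the ‘ref’/‘k1’ convention, and the result keys listed in the final comment are indicative only:

    import numpy as np
    import pandas as pd
    from pytesmo.validation_framework.metric_calculators import BasicMetrics

    index = pd.date_range("2020-01-01", periods=365, freq="D")
    ref = np.random.rand(365)
    data = pd.DataFrame({
        "ref": ref,                              # reference column named 'ref'
        "k1": ref + 0.1 * np.random.rand(365),   # other column, default name 'k1'
    }, index=index)

    calc = BasicMetrics(other_name="k1")
    results = calc.calc_metrics(data, gpi_info=(0, 16.0, 48.0))
    # `results` is a dict of one-element numpy arrays with entries such as
    # R, p_R, rho, p_rho, RMSD, BIAS, n_obs, gpi, lon, lat.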

class pytesmo.validation_framework.metric_calculators.BasicMetricsPlusMSE(other_name='k1', metadata_template=None)[source]

Bases: pytesmo.validation_framework.metric_calculators.BasicMetrics

Basic Metrics plus Mean squared Error and the decomposition of the MSE into correlation, bias and variance parts.

calc_metrics(data, gpi_info)[source]

calculates the desired statistics

Parameters
  • data (pandas.DataFrame) – with 2 columns, the first column is the reference dataset named ‘ref’ the second column the dataset to compare against named ‘other’

  • gpi_info (tuple) – of (gpi, lon, lat)

Notes

Kendall tau calculation is optional at the moment because the scipy implementation is very slow, which is problematic for global comparisons

class pytesmo.validation_framework.metric_calculators.BasicSeasonalMetrics(result_path=None, other_name='k1', metadata_template=None)[source]

Bases: pytesmo.validation_framework.metric_calculators.MetadataMetrics

This class just computes basic metrics on a seasonal basis. It also stores information about gpi, lat, lon and number of observations.

calc_metrics(data, gpi_info)[source]

calculates the desired statistics

Parameters
  • data (pandas.DataFrame) – with 2 columns, the first column is the reference dataset named ‘ref’ the second column the dataset to compare against named ‘other’

  • gpi_info (tuple) – Grid point info (i.e. gpi, lon, lat)

class pytesmo.validation_framework.metric_calculators.FTMetrics(frozen_flag=2, other_name='k1', metadata_template=None)[source]

Bases: pytesmo.validation_framework.metric_calculators.MetadataMetrics

This class computes Freeze/Thaw metrics. Calculated metrics are:

  • SSF frozen/temp unfrozen

  • SSF unfrozen/temp frozen

  • SSF unfrozen/temp unfrozen

  • SSF frozen/temp frozen

It also stores information about gpi, lat, lon and the total number of observations.

calc_metrics(data, gpi_info)[source]

calculates the desired statistics

Parameters
  • data (pandas.DataFrame) – with 2 columns, the first column is the reference dataset named ‘ref’ the second column the dataset to compare against named ‘other’

  • gpi_info (tuple) – of (gpi, lon, lat)

Notes

Kendall tau is not calculated at the moment because the scipy implementation is very slow, which is problematic for global comparisons

class pytesmo.validation_framework.metric_calculators.HSAF_Metrics(other_name1='k1', other_name2='k2', dataset_names=None, metadata_template=None)[source]

Bases: pytesmo.validation_framework.metric_calculators.MetadataMetrics

This class computes metrics as defined by the H-SAF consortium in order to prove the operational readiness of a product. It also stores information about gpi, lat, lon and number of observations.

calc_metrics(data, gpi_info)[source]

calculates the desired statistics

Parameters
  • data (pandas.DataFrame) – with 3 columns, the first column is the reference dataset named ‘ref’, the second and third columns are the datasets to compare against, named ‘k1’ and ‘k2’

  • gpi_info (tuple) – Grid point info (i.e. gpi, lon, lat)

class pytesmo.validation_framework.metric_calculators.IntercomparisonMetrics(other_names=['k1', 'k2', 'k3'], calc_tau=False, dataset_names=None, metadata_template=None)[source]

Bases: pytesmo.validation_framework.metric_calculators.MetadataMetrics

Compare basic metrics of multiple satellite datasets to one reference dataset: Pearson’s R and p, Spearman’s rho and p, optionally Kendall’s tau, RMSD, BIAS, ubRMSD, and MSE.

Parameters
  • other_names (iterable, optional (default: [‘k1’, ‘k2’, ‘k3’])) – Names of the columns of the non-reference / other datasets in the pandas DataFrame

  • calc_tau (boolean, optional) – If True then tau is calculated as well. This is set to False by default since the calculation of Kendall’s tau is rather slow and can significantly impact the performance of e.g. global validation studies

  • dataset_names (list) – Names of the original datasets; used to find the lookup table for the DataFrame columns.

calc_metrics(data, gpi_info)[source]

calculates the desired statistics

Parameters
  • data (pandas.DataFrame) – with >2 columns, the first column is the reference dataset named ‘ref’ other columns are the datasets to compare against named ‘other_i’

  • gpi_info (tuple) – of (gpi, lon, lat)

Notes

Kendall tau calculation is optional at the moment because the scipy implementation is very slow, which is problematic for global comparisons

class pytesmo.validation_framework.metric_calculators.MetadataMetrics(other_name='k1', metadata_template=None)[source]

Bases: object

This class sets up the gpi info and metadata (if used) in the results template. This is used as the basis for all other metric calculators.

Parameters
  • other_name (string, optional) – Name of the column of the non-reference / other dataset in the pandas DataFrame

  • metadata_template (dictionary, optional) – A dictionary containing additional fields (and types) of the form {‘field’: np.float32([np.nan])}. Allows users to specify information in the job tuple, i.e. jobs.append((idx, metadata[‘longitude’], metadata[‘latitude’], metadata_dict)), which is then propagated to the final netCDF results file.

calc_metrics(data, gpi_info)[source]

Adds the gpi info and metadata to the results.

Parameters
  • data (pandas.DataFrame) – See the individual calculators for more information; not directly used here.

  • gpi_info (tuple) – of (gpi, lon, lat) or, optionally, (gpi, lon, lat, metadata) where metadata is a dictionary

class pytesmo.validation_framework.metric_calculators.TCMetrics(other_name1='k1', other_name2='k2', calc_tau=False, dataset_names=None, metadata_template=None)[source]

Bases: pytesmo.validation_framework.metric_calculators.BasicMetrics

This class computes triple collocation metrics as defined in the QA4SM project. It uses two satellite datasets and one reference dataset as input only. It can be extended to perform intercomparisons between the possible triples when more than 3 datasets are given.

calc_metrics(data, gpi_info)[source]

calculates the desired statistics

Parameters
  • data (pandas.DataFrame) – with >2 columns, the first column is the reference dataset named ‘ref’ other columns are the data sets to compare against named ‘other_i’

  • gpi_info (tuple) – of (gpi, lon, lat)

Notes

Kendall tau calculation is optional at the moment because the scipy implementation is very slow, which is problematic for global comparisons

pytesmo.validation_framework.metric_calculators.get_dataset_names(ref_key, datasets, n=3)[source]

Get dataset names in the correct order as used in the validation framework

  • reference dataset = ref

  • first other dataset = k1

  • second other dataset = k2

This is important to correctly iterate through the H-SAF metrics and to save each metric with the name of the used datasets

Parameters
  • ref_key (basestring) – Name of the reference dataset

  • datasets (dict) – Dictionary of dictionaries as provided to the validation framework in order to perform the validation process.

Returns

dataset_names – List of the dataset names in correct order

Return type

list
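
An illustrative call; the reader objects are omitted (None) in this sketch since only the dataset names and columns matter for deriving the order:

    from pytesmo.validation_framework.metric_calculators import get_dataset_names

    datasets = {
        "ISMN": {"class": None, "columns": ["soil moisture"]},
        "ASCAT": {"class": None, "columns": ["sm"]},
        "ERA": {"class": None, "columns": ["sm_era"]},
    }
    names = get_dataset_names("ISMN", datasets, n=3)
    # expected: ['ISMN', 'ASCAT', 'ERA'], i.e. ref, k1, k2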

pytesmo.validation_framework.results_manager module

Created on 01.06.2015 @author: Andreea Plocon andreea.plocon@geo.tuwien.ac.at

pytesmo.validation_framework.results_manager.build_filename(root, key)[source]

Create a save path/filename that does not exceed 255 characters

Parameters
  • root (str) – Directory where the file should be stored

  • key (list of tuples) – The keys are joined to create a filename. If the joined keys are too long, the filename is shortened.

Returns

fname – Full path to the netcdf file to store

Return type

str

pytesmo.validation_framework.results_manager.netcdf_results_manager(results, save_path, zlib=True)[source]

Function for writing the results of the validation process to a netCDF file.

Parameters
  • results (dict of dicts) – Keys: Combinations of (referenceDataset.column, otherDataset.column) Values: dict containing the results from metric_calculator

  • save_path (string) – Path where the file/files will be saved.

  • zlib (boolean, optional) – If set to True the variables in the resulting netCDF file are compressed. Default: True
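
A minimal sketch; the result key and the metric names are illustrative:

    import tempfile
    import numpy as np
    from pytesmo.validation_framework.results_manager import netcdf_results_manager

    results = {
        (("ISMN", "soil moisture"), ("ASCAT", "sm")): {
            "R": np.array([0.81], dtype=np.float32),
            "BIAS": np.array([-0.02], dtype=np.float32),
            "n_obs": np.array([320], dtype=np.int32),
            "lon": np.array([16.0]),
            "lat": np.array([48.0]),
        }
    }
    netcdf_results_manager(results, save_path=tempfile.mkdtemp())
    # One netCDF file per result key is written into save_path.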

pytesmo.validation_framework.start_validation module

pytesmo.validation_framework.start_validation.func(job)[source]

Function which calls the start_processing method implemented in setup_code.

pytesmo.validation_framework.start_validation.start_validation(setup_code)[source]

Perform the validation with IPython parallel processing.

Parameters

setup_code (string) – Path to .py file containing the setup for the validation.

pytesmo.validation_framework.temporal_matchers module

Created on Sep 24, 2013

@author: Christoph.Paulik@geo.tuwien.ac.at

class pytesmo.validation_framework.temporal_matchers.BasicTemporalMatching(window=0.5)[source]

Bases: object

Temporal matching object

Parameters

window (float) – window size to use for temporal matching. A match in the other dataset will only be found if it is within +- window size days of a point in the reference

combinatory_matcher(df_dict, refkey, n=2)[source]

Basic temporal matcher that always matches one DataFrame to the reference DataFrame, resulting in matched DataFrame pairs.

If the input dict has the keys ‘data1’ and ‘data2’ then the output dict will have the key (‘data1’, ‘data2’). The new key is stored as a tuple to avoid any issues with string concatenation.

During matching the column names of the dataframes will be transformed into MultiIndex to ensure unique names.

Parameters
  • df_dict (dict of pandas.DataFrames) – dictionary containing the spatially colocated DataFrames.

  • refkey (string) – key into the df_dict of the DataFrame that should be taken as a reference.

  • n (int) – number of datasets to match at once

Returns

matched – Dictionary containing matched DataFrames. The key is put together from the keys of the input dict as a tuple of the keys of the datasets this dataframe contains.

Return type

dict of pandas.DataFrames

match(reference, *args)[source]

Takes the reference and another DataFrame and returns a joined DataFrame. In this case the reference dataset for the grid is also the temporal reference dataset.

pytesmo.validation_framework.temporal_matchers.df_name_multiindex(df, name)[source]

Rename columns of a DataFrame by using new column names that are tuples of (name, column_name) to ensure unique column names that can also be split again. This transforms the columns to a MultiIndex.
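
A small illustration (‘ASCAT’ and the column names are made up):

    import pandas as pd
    from pytesmo.validation_framework.temporal_matchers import df_name_multiindex

    df = pd.DataFrame({"sm": [0.1, 0.2], "sm_flag": [0, 1]})
    renamed = df_name_multiindex(df, "ASCAT")
    # renamed.columns is a MultiIndex: [('ASCAT', 'sm'), ('ASCAT', 'sm_flag')]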

pytesmo.validation_framework.validation module

class pytesmo.validation_framework.validation.Validation(datasets, spatial_ref, metrics_calculators, temporal_matcher=None, temporal_window=0.041666666666666664, temporal_ref=None, masking_datasets=None, period=None, scaling='lin_cdf_match', scaling_ref=None)[source]

Bases: object

Class for the validation process.

Parameters
  • datasets (dict of dicts, or pytesmo.validation_framework.data_manager.DataManager) –

    Keys: string, dataset names

    Values: dict containing the following fields

    ‘class’: object – Class containing the method read_ts for reading the data.

    ‘columns’: list – List of columns which will be used in the validation process.

    ‘args’: list, optional – Args for reading the data.

    ‘kwargs’: dict, optional – Kwargs for reading the data.

    ‘grids_compatible’: boolean, optional – If set to True the grid point index is used directly when reading the other dataset; if False then lon, lat are used and a nearest neighbour search is necessary.

    ‘use_lut’: boolean, optional – If set to True the grid point index (obtained from a calculated lut between reference and other) is used when reading the other dataset; if False then lon, lat are used and a nearest neighbour search is necessary.

    ‘lut_max_dist’: float, optional – Maximum allowed distance in meters for the lut calculation.

  • spatial_ref (string) – Name of the dataset used as the spatial, temporal and scaling reference. The temporal and scaling references can be changed if needed; see the optional parameters temporal_ref and scaling_ref.

  • metrics_calculators (dict of functions) –

    The keys of the dict are tuples with the structure (n, k), with n >= 2 and n >= k. n is the number of datasets that should be temporally matched to the reference dataset and k is how many columns the metric calculator will get at once. This makes it possible, for example, to temporally match 3 datasets with 3 columns in total and then hand the combinations of these columns to the metric calculator in sets of 2 by specifying the dictionary like:

    { (3, 2): metric_calculator}
    

    The values are functions that take an input DataFrame with the columns ‘ref’ for the reference and ‘n1’, ‘n2’ and so on for the other datasets, as well as a dictionary mapping the column names to the names of the original datasets. In this way multiple metric calculators can be applied to different combinations of n input datasets. A condensed setup sketch follows the parameter list below.

  • temporal_matcher (function, optional) – function that takes a dict of dataframes and a reference_key. It performs the temporal matching on the data and returns a dictionary of matched DataFrames that should be evaluated together by the metric calculator.

  • temporal_window (float, optional) – Window to allow in temporal matching in days. The window is allowed on both sides of the timestamp of the temporal reference data. Only used with the standard temporal matcher.

  • temporal_ref (string, optional) – If the temporal matching should use another dataset than the spatial reference as a reference dataset then give the dataset name here.

  • period (list, optional) – Of type [datetime start, datetime end]. If given then the two input datasets will be truncated to start <= dates <= end.

  • masking_datasets (dict of dictionaries) – Same format as the datasets with the difference that the read_ts method of these datasets has to return pandas.DataFrames with only boolean columns. True means that the observations at this timestamp should be masked and False means that they should be kept.

  • scaling (string, None or class instance) – Method for scaling the other datasets into the reference space: a method name from pytesmo.scaling, a scaler class instance (see pytesmo.validation_framework.data_scalers), or None to turn scaling off.

  • scaling_ref (string, optional) – If the scaling should be done to another dataset than the spatial reference then give the dataset name here.
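
As referenced in the metrics_calculators description above, a condensed end-to-end sketch. DummyReader stands in for real dataset readers, and all names, coordinates and settings are illustrative:

    import tempfile
    import numpy as np
    import pandas as pd
    from pytesmo.validation_framework.validation import Validation
    from pytesmo.validation_framework.metric_calculators import BasicMetrics
    from pytesmo.validation_framework.results_manager import netcdf_results_manager

    class DummyReader:
        """Hypothetical reader; real readers look data up by gpi or lon/lat."""
        def __init__(self, column):
            idx = pd.date_range("2020-01-01", periods=365, freq="D")
            self.df = pd.DataFrame({column: np.random.rand(365)}, index=idx)
        def read_ts(self, *args):
            return self.df

    datasets = {
        "ISMN": {"class": DummyReader("soil moisture"),
                 "columns": ["soil moisture"]},
        "ASCAT": {"class": DummyReader("sm"), "columns": ["sm"],
                  "grids_compatible": True},
    }

    process = Validation(
        datasets,
        spatial_ref="ISMN",
        metrics_calculators={(2, 2): BasicMetrics(other_name="k1").calc_metrics},
        temporal_window=1 / 24.0,  # +- 1 hour around each reference timestamp
        scaling="lin_cdf_match",
    )

    # Process one (gpi, lon, lat) point; real runs iterate over
    # process.get_processing_jobs() or a list of points.
    results = process.calc([1], [16.0], [48.0])
    netcdf_results_manager(results, save_path=tempfile.mkdtemp())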

calc(job)[source]

Takes either a cell or a gpi_info tuple and performs the validation.

get_processing_jobs()[source]

Returns processing jobs that this process can understand.

calc(gpis, lons, lats, *args)[source]

The argument iterables (lists or numpy.ndarrays) are processed one after the other in tuples of the form (gpis[n], lons[n], lats[n], arg1[n], ..).

Parameters
  • gpis (iterable) – Grid point indices, identifiers by which the spatial reference dataset can be read. This is a list, numpy.ndarray, or any other iterable containing these identifiers.

  • lons (iterable) – Longitudes of the points identified by the gpis. Has to be the same size as gpis.

  • lats (iterable) – latitudes of the points identified by the gpis. Has to be the same size as gpis.

  • args (iterables) – Any additional arguments have to have the same size as the gpis iterable. They are given to the metric calculators as metadata. Common usage is e.g. the long name or network name of an in situ station.

Returns

compact_results –

Keys: result names, combinations of (referenceDataset.column, otherDataset.column)

Values: dict containing the elements returned by metrics_calculator

Return type

dict of dicts

get_data_for_result_tuple(n_matched_data, result_tuple)[source]

Extract a dataframe for a given result tuple from the matched dataframes.

Parameters
  • n_matched_data (dict of pandas.DataFrames) – DataFrames in which n datasets were temporally matched. The key is a tuple of the dataset names.

  • result_tuple (tuple) – Tuple describing which datasets and columns should be extracted. ((dataset_name, column_name), (dataset_name2, column_name2))

Returns

data – pandas DataFrame with columns extracted from the temporally matched datasets

Return type

pd.DataFrame

get_processing_jobs()[source]

Returns processing jobs that this process can understand.

Returns

jobs – List of cells or gpis to process.

Return type

list

k_datasets_from(n_matched_data, result_names)[source]

Extract k datasets from n temporally matched ones.

This is used to send combinations of k datasets to metrics calculators expecting only k datasets.

Parameters
  • n_matched_data (dict of pandas.DataFrames) – DataFrames in which n datasets were temporally matched. The key is a tuple of the dataset names.

  • result_names (list) – result names to extract

Yields
  • data (pd.DataFrame) – pandas DataFrame with k columns extracted from the temporally matched datasets

  • result (tuple) – Tuple describing which datasets and columns are in the returned data. ((dataset_name, column_name), (dataset_name2, column_name2))

mask_dataset(ref_df, gpi_info)[source]

Mask the temporal reference dataset with the data read through the masking datasets.

Parameters
  • ref_df (pandas.DataFrame) – Data of the temporal reference dataset

  • gpi_info (tuple) – tuple of at least (gpi, lon, lat)

Returns

mask – boolean array of the size of the temporal reference read

Return type

numpy.ndarray

perform_validation(df_dict, gpi_info)[source]

Perform the validation for one grid point index and return the matched datasets as well as the calculated metrics.

Parameters
  • df_dict (dict of pandas.DataFrames) – DataFrames read by the data readers for each dataset

  • gpi_info (tuple) – tuple of at least, (gpi, lon, lat)

Returns

  • matched_n (dict of pandas.DataFrames) – temporally matched data stored by (n, k) tuples

  • results (dict) – Dictionary of calculated metrics stored by dataset combination tuples.

  • used_data (dict) – The DataFrame used for calculation of each set of metrics.

temporal_match_datasets(df_dict)[source]

Temporally match all the requested combinations of datasets.

Parameters

df_dict (dict of pandas.DataFrames) – DataFrames read by the data readers for each dataset

Returns

matched_n – For each (n, k) in the metrics calculators, the n temporally matched DataFrames

Return type

dict of pandas.DataFrames

temporal_match_masking_data(ref_df, gpi_info)[source]

Temporal match the masking data to the reference DataFrame

Parameters
  • ref_df (pandas.DataFrame) – Reference data to which the masking data is temporally matched

  • gpi_info (tuple) – tuple of at least (gpi, lon, lat)
Returns

matched_masking – Contains the temporally matched masking data. This dict has only one key, a tuple containing the names of the matched datasets.

Return type

dict of pandas.DataFrames

pytesmo.validation_framework.validation.args_to_iterable(*args, **kwargs)[source]

Convert arguments to iterables.

Parameters
  • args (iterables or not) – arguments

  • n (int, optional) – number of explicit arguments

Module contents