pytesmo.validation_framework.validation module

class pytesmo.validation_framework.validation.Validation(datasets, spatial_ref, metrics_calculators, temporal_matcher=None, temporal_window=0.041666666666666664, temporal_ref=None, masking_datasets=None, period=None, scaling='cdf_match', scaling_ref=None)[source]

Bases: object

Class for the validation process.

Parameters:
  • datasets (dict of dicts or DataManager) –

    Either a pytesmo.validation_framework.data_manager.DataManager instance, or a dict with:

    Keys:

    string, dataset names

    Values:

    dict, containing the following fields

    'class': object

    Class providing the method read_ts for reading the data.

    'columns': list

    List of columns which will be used in the validation process.

    'args': list, optional

    Args for reading the data.

    'kwargs': dict, optional

    Kwargs for reading the data.

    'grids_compatible': boolean, optional

    If set to True, the grid point index is used directly when reading the other dataset; if False, lon/lat are used and a nearest neighbour search is necessary.

    'use_lut': boolean, optional

    If set to True, the grid point index (obtained from a lookup table calculated between the reference and the other dataset) is used when reading the other dataset; if False, lon/lat are used and a nearest neighbour search is necessary.

    'lut_max_dist': float, optional

    Maximum allowed distance in meters for the lut calculation.

  • spatial_ref (string) – Name of the dataset used as the spatial, temporal, and scaling reference. The temporal and scaling references can be changed if needed; see the optional parameters temporal_ref and scaling_ref.

  • metrics_calculators (dict of functions) –

    The keys of the dict are tuples of the form (n, k), with n >= 2 and n >= k. Currently, n must equal the total number of datasets. n is the number of datasets that are temporally matched to the reference dataset, and k is how many columns the metric calculator receives at once. This makes it possible, for example, to temporally match 3 datasets with 3 columns in total and then pass the combinations of these columns to the metric calculator in sets of 2 by specifying the dictionary like:

    { (3, 2): metric_calculator}
    

    The values are functions that take an input DataFrame with the columns ‘ref’ for the reference and ‘n1’, ‘n2’ and so on for other datasets as well as a dictionary mapping the column names to the names of the original datasets. In this way multiple metric calculators can be applied to different combinations of n input datasets.

  • temporal_matcher (function, optional) – function that takes a dict of dataframes and a reference_key. It performs the temporal matching on the data and returns a dictionary of matched DataFrames that should be evaluated together by the metric calculator.

  • temporal_window (float, optional) – Window to allow in temporal matching, in days; the default (1/24) corresponds to one hour. The window is applied on both sides of the timestamp of the temporal reference data. Only used with the standard temporal matcher.

  • temporal_ref (string, optional) – If the temporal matching should use another dataset than the spatial reference as a reference dataset then give the dataset name here.

  • period (list, optional) – Of type [datetime start, datetime end]. If given then the two input datasets will be truncated to start <= dates <= end.

  • masking_datasets (dict of dictionaries) – Same format as datasets, with the difference that the read method of these datasets has to return pandas.DataFrames with only boolean columns. True means that the observations at this timestamp should be masked; False means that they should be kept.

  • scaling (str or None or class instance) – Name of the scaling method to apply before metric calculation (default: 'cdf_match'), or None to perform no scaling.

  • scaling_ref (string, optional) – If the scaling should be done to another dataset than the spatial reference then give the dataset name here.
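The (n, k) semantics of metrics_calculators can be illustrated with a plain-Python sketch (the dataset and column names below are hypothetical): with n=3 columns temporally matched, a (3, 2) key passes each 2-column combination to the metric calculator separately.

```python
from itertools import combinations

# Three temporally matched columns (hypothetical "dataset.column" names)
matched_columns = ["ref.sm", "n1.sm", "n2.sm"]

n, k = 3, 2  # as in {(3, 2): metric_calculator}

# Each k-sized combination is handed to the metric calculator in turn
column_sets = list(combinations(matched_columns, k))
print(column_sets)
# [('ref.sm', 'n1.sm'), ('ref.sm', 'n2.sm'), ('n1.sm', 'n2.sm')]
```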


calc(gpis, lons, lats, *args, rename_cols=True, only_with_reference=False, handle_errors='raise') Mapping[Tuple[str], Mapping[str, ndarray]][source]

The argument iterables (lists or numpy.ndarrays) are processed one after the other in tuples of the form (gpis[n], lons[n], lats[n], arg1[n], ...).

Parameters:
  • gpis (iterable) – The grid point indices are identifiers by which the spatial reference dataset can be read. This can be a list, a numpy.ndarray, or any other iterable containing these indices.

  • lons (iterable) – Longitudes of the points identified by the gpis. Has to be the same size as gpis.

  • lats (iterable) – latitudes of the points identified by the gpis. Has to be the same size as gpis.

  • args (iterables) – Any additional arguments have to have the same size as the gpis iterable. They are passed to the metrics calculators as metadata. A common usage is e.g. the long name or network name of an in situ station.

  • rename_cols (bool, optional) – Whether to rename the columns to “ref”, “k1”, … before passing the dataframe to the metrics calculators. Default is True.

  • only_with_reference (bool, optional) – If this is enabled, only combinations that include the reference dataset (from the data manager) are calculated.

  • handle_errors (str, optional (default: 'raise')) –

    Governs how to handle errors:

    * `raise`: If an error occurs during validation, raise exception.
    * `ignore`: If an error occurs, assign the correct return code
      to the result template and continue with the next GPI.
    

Returns:

compact_results

Keys:

result names, combinations of (referenceDataset.column, otherDataset.column)

Values:

dict containing the elements returned by metrics_calculator

Return type:

dict of dicts
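The shape of the returned dict can be sketched as follows (dataset names, columns, and metric values are hypothetical; the real metric values are numpy arrays, shown here as plain lists for brevity):

```python
# Hypothetical layout of the compact_results dict returned by calc():
# keys are tuples of (dataset, column) pairs, values hold the elements
# produced by the metrics calculator for that combination.
compact_results = {
    (("ISMN", "soil_moisture"), ("ASCAT", "sm")): {"R": [0.80], "RMSD": [0.03]},
    (("ISMN", "soil_moisture"), ("ERA5", "swvl1")): {"R": [0.75], "RMSD": [0.05]},
}

for result_name, metrics in compact_results.items():
    (ref_ds, ref_col), (other_ds, other_col) = result_name
    print(f"{ref_ds}.{ref_col} vs {other_ds}.{other_col}: R={metrics['R'][0]}")
```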

dummy_validation_result(gpi_info, rename_cols=True, only_with_reference=False) Mapping[Tuple[str], List[Mapping[str, ndarray]]][source]

Creates an empty result dictionary to be used if perform_validation fails.

get_data_for_result_tuple(n_matched_data, result_tuple)[source]

Extract a dataframe for a given result tuple from the matched dataframes.

Parameters:
  • n_matched_data (dict of pandas.DataFrames) – DataFrames in which n datasets were temporally matched. The key is a tuple of the dataset names.

  • result_tuple (tuple) – Tuple describing which datasets and columns should be extracted: ((dataset_name, column_name), (dataset_name2, column_name2))

Returns:

data – pandas DataFrame with columns extracted from the temporally matched datasets

Return type:

pd.DataFrame
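A stdlib-only toy analogue of this extraction step (the real method operates on pandas.DataFrames keyed by dataset-name tuples; all names below are hypothetical):

```python
# n datasets temporally matched, keyed by the tuple of dataset names;
# each "frame" maps (dataset, column) pairs to data (lists stand in
# for DataFrame columns here).
n_matched_data = {
    ("ISMN", "ASCAT", "ERA5"): {
        ("ISMN", "soil_moisture"): [0.21, 0.25],
        ("ASCAT", "sm"): [0.19, 0.27],
        ("ERA5", "swvl1"): [0.22, 0.24],
    }
}

def extract(n_matched_data, result_tuple):
    # Pick the matched frame that contains every requested dataset,
    # then keep only the requested (dataset, column) pairs.
    wanted = {ds for ds, _ in result_tuple}
    for key, frame in n_matched_data.items():
        if wanted.issubset(key):
            return {col: frame[col] for col in result_tuple}

data = extract(n_matched_data, (("ISMN", "soil_moisture"), ("ASCAT", "sm")))
print(data)  # the two requested columns from the matched data
```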

get_processing_jobs()[source]

Returns processing jobs that this process can understand.

Returns:

jobs – List of cells or gpis to process.

Return type:

list

k_datasets_from(n_matched_data, result_names, include_scaling_ref=True)[source]

Extract k datasets from n temporally matched ones.

This is used to send combinations of k datasets to metrics calculators expecting only k datasets.

Parameters:
  • n_matched_data (dict of pandas.DataFrames) – DataFrames in which n datasets were temporally matched. The key is a tuple of the dataset names.

  • result_names (list) – result names to extract

  • include_scaling_ref (boolean, optional) – if set the scaling reference will always be included. Should only be disabled for getting the masking datasets

Yields:
  • data (pd.DataFrame) – pandas DataFrame with k columns extracted from the temporally matched datasets

  • result (tuple) – Tuple describing which datasets and columns are in the returned data. ((dataset_name, column_name), (dataset_name2, column_name2))

mask_dataset(ref_df, gpi_info)[source]

Mask the temporal reference dataset with the data read through the masking datasets.

Parameters:
  • ref_df (pandas.DataFrame) – Data of the temporal reference dataset.

  • gpi_info (tuple) – tuple of at least (gpi, lon, lat)

Returns:

mask – boolean array of the size of the temporal reference read

Return type:

numpy.ndarray
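The masking semantics described for masking_datasets (True = mask, False = keep) can be sketched with plain lists; the combination rule shown here (mask if any masking column flags the observation) is one plausible reading, not necessarily the library's exact behaviour:

```python
# Hypothetical masking columns, already temporally matched to the
# reference time stamps: True = mask the observation, False = keep it.
mask_cols = {
    "frozen_soil": [True, False, False, True],
    "snow_cover":  [False, False, True, False],
}

# Mask an observation if ANY masking column flags it (assumption).
mask = [any(flags) for flags in zip(*mask_cols.values())]
print(mask)  # [True, False, True, True]
```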

perform_validation(df_dict, gpi_info, rename_cols=True, only_with_reference=False, handle_errors='raise') Mapping[Tuple[str], List[Mapping[str, ndarray]]][source]

Perform the validation for one grid point index and return the matched datasets as well as the calculated metrics.

Parameters:
  • df_dict (dict of pandas.DataFrames) – DataFrames read by the data readers for each dataset

  • gpi_info (tuple) – tuple of at least (gpi, lon, lat)

  • rename_cols (bool, optional) – Whether to rename the columns to “ref”, “k1”, … before passing the dataframe to the metrics calculators. Default is True.

  • only_with_reference (bool, optional (default: False)) – Only compute metrics for dataset combinations where the reference is included.

Returns:

  • matched_n (dict of pandas.DataFrames) – temporally matched data stored by (n, k) tuples

  • results (dict) – Dictionary of calculated metrics, stored by dataset combination tuples.

  • used_data (dict) – The DataFrame used for calculation of each set of metrics.

Raises:
  • eh.TemporalMatchingError – If temporal matching failed.

  • eh.NoTempMatchedDataError – If there is insufficient data or the temporal matching did not return data.

  • eh.ScalingError – If scaling failed.
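The 'raise' vs 'ignore' policy of the handle_errors option in calc() can be sketched as follows (validate_one stands in for the per-point validation; the function and error names are illustrative, not the library's):

```python
def run_per_gpi(gpis, validate_one, handle_errors="raise"):
    # Toy sketch of the handle_errors policy: either propagate the
    # exception, or record a status code and continue with the next GPI.
    results = {}
    for gpi in gpis:
        try:
            results[gpi] = validate_one(gpi)
        except Exception as err:
            if handle_errors == "raise":
                raise
            # 'ignore': assign a return code and move on
            results[gpi] = {"status": type(err).__name__}
    return results

def validate_one(gpi):
    if gpi == 2:
        raise ValueError("temporal matching returned no data")
    return {"status": "OK"}

out = run_per_gpi([1, 2, 3], validate_one, handle_errors="ignore")
print(out)
# {1: {'status': 'OK'}, 2: {'status': 'ValueError'}, 3: {'status': 'OK'}}
```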

temporal_match_datasets(df_dict)[source]

Temporally match all the requested combinations of datasets.

Parameters:

df_dict (dict of pandas.DataFrames) – DataFrames read by the data readers for each dataset

Returns:

matched_n – for each (n, k) in the metrics calculators the n temporally matched dataframes

Return type:

dict of pandas.DataFrames
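The windowed nearest-neighbour matching performed by the standard temporal matcher can be sketched with plain floats (times in days; all values are hypothetical, and the real matcher operates on pandas time series):

```python
# Reference timestamps and another dataset's (timestamp -> value) series
ref_times = [0.0, 1.0, 2.0]
other = {0.01: 0.2, 1.5: 0.3, 2.02: 0.25}

window = 1 / 24  # one hour in days, cf. the default temporal_window

matched = {}
for t in ref_times:
    # nearest timestamp of the other series within +/- window of t
    candidates = [(abs(ot - t), ot) for ot in other if abs(ot - t) <= window]
    if candidates:
        matched[t] = other[min(candidates)[1]]

print(matched)  # {0.0: 0.2, 2.0: 0.25} -- 1.5 is too far from 1.0
```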

temporal_match_masking_data(ref_df, gpi_info)[source]

Temporal match the masking data to the reference DataFrame

Parameters:
  • ref_df (pandas.DataFrame) – Data of the temporal reference dataset.

  • gpi_info (tuple) – tuple of at least (gpi, lon, lat)

Returns:

matched_masking – Contains temporally matched masking data. This dict has only one key being a tuple that contains the matched datasets.

Return type:

dict of pandas.DataFrames

pytesmo.validation_framework.validation.args_to_iterable(*args, **kwargs)[source]

Convert arguments to iterables.

Parameters:
  • args (iterables or not) – Arguments to convert to iterables.

  • n (int, optional) – number of explicit arguments
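The idea behind such a conversion can be sketched with a hypothetical helper (this is an illustration of the concept, not the actual implementation of args_to_iterable, which handles *args and an optional n keyword):

```python
def as_iterable(x):
    # Pass iterables through unchanged; wrap scalar arguments in a
    # single-element list so everything can be zipped together.
    try:
        iter(x)
        return x
    except TypeError:
        return [x]

print(as_iterable(5), as_iterable([1, 2]))  # [5] [1, 2]
```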