pygeochemtools.geochem

Geochemical data manipulation module

pygeochemtools.geochem.make_sarig_element_dataset(path: str, element: str, dh_only: bool = True, export: bool = False, out_path: Optional[str] = None) -> pandas.DataFrame

Create a ‘clean’ single element drillhole dataset derived from the sarig_rs_chem_exp.csv.

This isolates the selected element from the whole dataset, converts below detection limit (BDL) values to a low, non-zero value, drops rows that contain other symbols such as ‘>’ and ‘-’, converts oxides to elements, and converts all values to ppm. It also adds chem methods to the dataset where possible to allow further EDA.

This data is used to create input data for further processing. This function uses dask to handle very large input datasets.

Important note: the sarig_rs_chem_exp.csv data is in a long format, with each individual analysis as a single row!

This dataset may need additional EDA and cleaning prior to further processing. In that case set export to True to do further processing on the returned dataset.

Parameters
  • path (str) – Path to main sarig_rs_chem_exp.csv input file.

  • element (str) – The element to extract and create a sub-dataset of.

  • dh_only (bool) – Whether to filter to drillholes only or return all sample types. Defaults to True.

  • export (bool) – Whether to export a csv version of the element dataset. Defaults to False.

  • out_path (str, optional) – Path to place output file. Defaults to path.

Returns

Dataframe of cleaned geochemical data

Return type

pd.DataFrame

pygeochemtools.geochem.sarig_long_to_wide(path: str, elements: Optional[List[str]] = None, sample_type: Optional[List[str]] = None, drillholes: Optional[Union[List[int], bool]] = None, include_units: bool = False, export_methods: bool = False, export: bool = False, out_path: Optional[str] = None) -> pandas.DataFrame

Convert sarig long form data to wide form.

Takes optional list of elements, sample types or drillhole numbers and filters large dataset based on these inputs. Has the option to include or exclude units with the values. Can also export an additional methods file.

It handles duplicate values based on sample_id and element_id by taking the first duplicate value initially, then catching the second duplicate, performing a second pivot, and appending the duplicates to the final table. It does not handle more than two duplicates of the same value, in which case it will return only the first value.

Parameters
  • path (str) – Path to main sarig_rs_chem_exp.csv input file.

  • elements (Optional[List[str]]) – List of elements to filter dataset to.

  • sample_type (Optional[List[str]]) – List of sample types to filter dataset to.

  • drillholes (Optional[Union[List[int], bool]]) – List of drillhole numbers to filter dataset to.

  • include_units (bool) – Option to include units in the data export. Defaults to False.

  • export_methods (bool) – Option to include methods file in the data export. Defaults to False.

  • export (bool) – Option to export data to a csv file. Defaults to False.

  • out_path (Optional[str]) – Optional path to output export file location. Defaults to path.

Returns

Dataframe with filtered datapoints converted to a wide form data structure.

Return type

pd.DataFrame

pygeochemtools.geochem.aggregation

Functions to calculate the max chem value down hole

pygeochemtools.geochem.aggregation.max_dh_chem(input_data: Union[str, pandas.DataFrame], drillhole_id: str) -> pandas.DataFrame

Function to aggregate the processed elemental geochemical data and return a dataframe containing max value in each drillhole.

Requires long format data.

Parameters
  • input_data (Union[str, pd.DataFrame]) – Path to clean and processed single element dataset in csv format or Pandas dataframe of clean and processed single element dataset.

  • drillhole_id (str) – drillhole identifier in dataset.

Raises

ValueError – Error raised if input file is not a valid csv file

Returns

Dataframe containing only the maximum value from each drill hole

Return type

pd.DataFrame
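
The aggregation described above amounts to a group-by on the drillhole identifier followed by a maximum. The following is a minimal sketch of that idea in plain Python, not the library's implementation. The helper name max_per_drillhole and the column names DRILLHOLE_NUMBER and DEPTH_FROM are assumptions made for illustration; only converted_ppm appears elsewhere in this documentation.

```python
def max_per_drillhole(rows, drillhole_id, value):
    """Return, for each drillhole, the row holding its maximum value.

    Hypothetical helper: `rows` is a list of dicts in long format,
    `drillhole_id` and `value` are the names of the relevant keys.
    """
    best = {}
    for row in rows:
        key = row[drillhole_id]
        # Keep the row only if it beats the current maximum for this hole.
        if key not in best or row[value] > best[key][value]:
            best[key] = row
    return list(best.values())


# Illustrative data; the column names are assumptions.
rows = [
    {"DRILLHOLE_NUMBER": 1, "DEPTH_FROM": 0.0, "converted_ppm": 12.0},
    {"DRILLHOLE_NUMBER": 1, "DEPTH_FROM": 1.0, "converted_ppm": 40.0},
    {"DRILLHOLE_NUMBER": 2, "DEPTH_FROM": 0.0, "converted_ppm": 7.5},
]
result = max_per_drillhole(rows, "DRILLHOLE_NUMBER", "converted_ppm")
```

Each drillhole contributes exactly one row to the result, which is why the input needs to be a single element dataset in long format.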

pygeochemtools.geochem.aggregation.max_dh_chem_interval(input_data: Union[str, pandas.DataFrame], interval: int, drillhole_id: str, start_depth_label: str, end_depth_label: str) -> pandas.DataFrame

Function to aggregate the processed single element geochemical data and return a dataframe containing the max value in each interval down hole for each drillhole.

Requires long format data.

Parameters
  • input_data (Union[str, pd.DataFrame]) – Input single element geochemical data, in long form, as either a path to a csv input file or a pandas dataframe.

  • interval (int) – The interval, in whole meters, over which to aggregate down hole.

  • drillhole_id (str) – Column header containing the drill hole identifier.

  • start_depth_label (str) – Column header containing the start or from depth data.

  • end_depth_label (str) – Column header containing the finish or to depth data.

Raises

ValueError – Error if input file is not a valid csv file

Returns

Dataframe containing the maximum value for each specified interval.

Return type

pd.DataFrame
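
One way to picture the interval aggregation is to assign each sample to a fixed-length depth bin and take the maximum per drillhole and bin. The sketch below does this in plain Python under a loud assumption: a sample is binned by its start depth only, and the end depth is ignored. The library may treat samples that span two intervals differently. The helper name, the output keys and the column names are all hypothetical.

```python
def max_per_interval(rows, interval, drillhole_id, start_depth_label, value):
    """Maximum value per (drillhole, depth bin).

    Assumption: each sample is assigned to the bin containing its start depth.
    """
    best = {}
    for row in rows:
        # Integer index of the bin, e.g. with interval=10, 0-10 m is bin 0.
        bin_index = int(row[start_depth_label] // interval)
        key = (row[drillhole_id], bin_index)
        if key not in best or row[value] > best[key]:
            best[key] = row[value]
    return [
        {
            drillhole_id: hole,
            "interval_from": bin_index * interval,
            "interval_to": (bin_index + 1) * interval,
            "max_value": max_value,
        }
        for (hole, bin_index), max_value in sorted(best.items())
    ]


# Illustrative data; the column names are assumptions.
rows = [
    {"DRILLHOLE_NUMBER": 1, "DEPTH_FROM": 0.0, "converted_ppm": 5.0},
    {"DRILLHOLE_NUMBER": 1, "DEPTH_FROM": 5.0, "converted_ppm": 9.0},
    {"DRILLHOLE_NUMBER": 1, "DEPTH_FROM": 12.0, "converted_ppm": 3.0},
    {"DRILLHOLE_NUMBER": 2, "DEPTH_FROM": 1.0, "converted_ppm": 7.0},
]
intervals = max_per_interval(
    rows, 10, "DRILLHOLE_NUMBER", "DEPTH_FROM", "converted_ppm"
)
```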

pygeochemtools.geochem.conversions

Functions to perform conversions on geochem data

pygeochemtools.geochem.conversions.convert_oxides(df: pandas.DataFrame, element: str, value: str) -> pandas.DataFrame

Convert selected oxides to elements

Parameters
  • df (pd.DataFrame) – Input dataframe

  • element (str) – Oxide to convert. Can be any of: ‘Fe2O3’, ‘FeO’, ‘U3O8’, ‘CoO’, ‘NiO’

  • value (str) – Name of column containing geochemical data values.

Returns

Dataframe with oxides converted in place

Return type

pd.DataFrame
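
Converting an oxide concentration to its element concentration is a multiplication by the mass fraction of the element in the oxide. The sketch below derives that factor from standard atomic masses for the five oxides listed above. It is an illustration of the arithmetic, not the library's code, and the names ATOMIC_MASS, OXIDES and oxide_to_element_factor are hypothetical. The factors the library actually uses may be rounded differently.

```python
# Standard atomic masses (g/mol), rounded to three decimals.
ATOMIC_MASS = {"O": 15.999, "Fe": 55.845, "Co": 58.933, "Ni": 58.693, "U": 238.029}

# oxide -> (element, number of element atoms, number of oxygen atoms)
OXIDES = {
    "Fe2O3": ("Fe", 2, 3),
    "FeO": ("Fe", 1, 1),
    "U3O8": ("U", 3, 8),
    "CoO": ("Co", 1, 1),
    "NiO": ("Ni", 1, 1),
}


def oxide_to_element_factor(oxide):
    """Mass fraction of the element in the oxide."""
    element, n_element, n_oxygen = OXIDES[oxide]
    element_mass = n_element * ATOMIC_MASS[element]
    return element_mass / (element_mass + n_oxygen * ATOMIC_MASS["O"])


def convert_oxide_value(oxide, value):
    """Convert an oxide concentration to the element concentration (same units)."""
    return value * oxide_to_element_factor(oxide)


fe_from_fe2o3 = convert_oxide_value("Fe2O3", 10.0)  # 10 units of Fe2O3 as Fe
```

The conversion leaves the units unchanged, so it can be applied before or after converting to ppm.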

pygeochemtools.geochem.conversions.convert_ppm(df: pandas.DataFrame, value: str, units: str, convert_wtperc: bool = True) -> pandas.DataFrame

Create a new column called ‘converted_ppm’ containing the values converted to ppm.

Parameters
  • df (pd.DataFrame) – Input dataframe

  • value (str) – Name of column containing geochemical data values.

  • units (str) – Name of column containing geochemical data units.

  • convert_wtperc (bool) – Whether to convert wt% to ppm. Defaults to True.

Returns

Dataframe with new ‘converted_ppm’ column

Return type

pd.DataFrame
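
The unit arithmetic behind a ppm conversion is fixed: 1 wt% equals 10,000 ppm and 1 ppb equals 0.001 ppm. The sketch below applies those factors to single values. The unit strings ("ppm", "ppb", "%") and the helper name to_ppm are assumptions; the spellings used in the SARIG export and handled by the library may differ.

```python
# Multipliers to reach ppm. The unit strings are assumed spellings.
PPM_FACTORS = {
    "ppm": 1.0,
    "ppb": 0.001,    # 1 ppb = 0.001 ppm
    "%": 10_000.0,   # 1 wt% = 10,000 ppm
}


def to_ppm(value, unit):
    """Convert one value to ppm; raises KeyError for an unknown unit."""
    return value * PPM_FACTORS[unit]


iron_ppm = to_ppm(1.5, "%")
gold_ppm = to_ppm(250.0, "ppb")
```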

pygeochemtools.geochem.create_dataset

Functions to load and filter input geochem data

class pygeochemtools.geochem.create_dataset.LoadAndFilter

Bases: object

Class to load and filter geochem datasets from csv input.

__init__() -> None

Dask dataframe object

list_columns()

Return the column headers from the dataset

list_elements()

Return a list of elements in the dataset

list_sample_types()

Return a list of sample types in the dataset

load_chem_data(path: str) -> None

Not implemented yet. Function to load generic datasets.

Parameters

path (str) – Path to input csv file.

load_sarig_data(path: str) -> None

Load data from the sarig_rs_chem_exp.csv dataset.

This function uses dask to handle very large input datasets.

Warning

The sarig_rs_chem_exp.csv data is in a long format, with each individual analysis as a single row!

Parameters

path (str) – Path to main sarig_rs_chem_exp.csv input file.

sarig_filter(sample_type: Optional[List[str]] = None, elements: Optional[List[str]] = None, drillholes: Optional[Union[List[int], bool]] = None) -> pandas.DataFrame

Filter sarig dataset.

Reduce the size of the sarig_rs_chem_exp.csv dataset by filtering samples based on a list of elements, sample types and/or drillhole numbers, or a combination of all three.

Parameters
  • sample_type (Optional[List[str]], optional) – List of sample types to include. Defaults to None.

  • elements (Optional[List[str]], optional) – List of elements to include. Defaults to None.

  • drillholes (Optional[Union[List[int], bool]], optional) – Either a list of drillhole numbers to filter to, or True to filter dataset to just those samples from drillholes. Defaults to None.

Raises

MemoryError – If the filtered dataset is still too large to fit in available memory.

Returns

Dataframe containing only those samples belonging to the listed sample types

Return type

pd.DataFrame
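
The combined filter can be read as: a row is kept only if it passes every filter that was supplied, and a filter left as None is skipped. The sketch below shows that logic on a list of dicts. It is not the library's implementation, which works on a dask dataframe, and the helper name filter_rows and the column names SAMPLE_TYPE, ELEMENT and DRILLHOLE_ID are hypothetical stand-ins for the real SARIG column headers.

```python
def filter_rows(rows, sample_type=None, elements=None, drillholes=None):
    """Keep rows that pass every supplied filter (None means no filter)."""
    kept = []
    for row in rows:
        if sample_type is not None and row["SAMPLE_TYPE"] not in sample_type:
            continue
        if elements is not None and row["ELEMENT"] not in elements:
            continue
        if drillholes is True and row["DRILLHOLE_ID"] is None:
            continue  # True keeps any sample that has a drillhole id
        if isinstance(drillholes, list) and row["DRILLHOLE_ID"] not in drillholes:
            continue
        kept.append(row)
    return kept


# Illustrative data; column names and sample types are assumptions.
rows = [
    {"SAMPLE_TYPE": "Drill core", "ELEMENT": "Cu", "DRILLHOLE_ID": 101},
    {"SAMPLE_TYPE": "Soil", "ELEMENT": "Cu", "DRILLHOLE_ID": None},
    {"SAMPLE_TYPE": "Drill core", "ELEMENT": "Au", "DRILLHOLE_ID": 102},
]
copper_in_holes = filter_rows(rows, elements=["Cu"], drillholes=True)
hole_102 = filter_rows(rows, drillholes=[102])
```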

sarig_filter_drillhole_element(element: str, dh_only: bool) -> pandas.DataFrame

Create a ‘clean’ single element dataset derived from the sarig_rs_chem_exp.csv.

This isolates samples from drillholes (ones that have a drill hole id) and the selected element from the whole dataset and is used to create input data for further processing.

Parameters
  • element (str) – The element to extract and create a sub-dataset of.

  • dh_only (bool) – Whether to filter to drillholes only or return all sample types.

Returns

Dataframe filtered to the desired element.

Return type

pd.DataFrame

pygeochemtools.geochem.create_dataset.add_sarig_chem_method(df: pandas.DataFrame) -> pandas.DataFrame

Add normalised chem method columns to dataset.

Function to map normalised chem method types onto the SARIG CHEM_METHOD_CODE column. The chem methods provided in the SARIG dataset relate to individual lab codes. This function maps those codes, where known, to a generic analysis method, digestion and fusion type.

This is useful for further EDA and cleaning of data, as some methods are no longer applicable, or contain too much noise.

Parameters

df (pd.DataFrame) – Input dataframe

Returns

Dataframe with ‘CHEM_METHOD_CODE’ mapped to three new columns: ‘DETERMINATION’, ‘DIGESTION’ and ‘FUSION’

Return type

pd.DataFrame

pygeochemtools.geochem.create_dataset.clean_dataset(df: pandas.DataFrame, value: str, dash_BDL_indicator: bool = False) -> pandas.DataFrame

Remove non-numeric characters.

Clean non-numeric characters from dataframe and flag below detection limit rows (1), and greater than measurable rows (2) in new BDL column.

Parameters
  • df (pd.DataFrame) – Input dataframe to clean.

  • value (str) – Name of column containing geochemical data values.

  • dash_BDL_indicator (bool) – Indicator if the ‘-’ sign indicates below detection limits or not. Defaults to False.

Returns

Cleaned dataframe

Return type

pd.DataFrame
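
The flagging scheme described above (1 for below detection limit, 2 for greater than measurable) can be sketched for a single raw value as follows. This is an illustration under assumptions, not the library's code: the helper name clean_value is hypothetical, symbols are assumed to appear as a prefix, and unparseable values are returned as None.

```python
import re


def clean_value(raw, dash_bdl_indicator=False):
    """Return (numeric_value, bdl_flag) for one raw assay string.

    Flag 1 = below detection limit, 2 = greater than measurable, 0 = neither.
    """
    text = str(raw).strip()
    flag = 0
    if text.startswith("<") or (dash_bdl_indicator and text.startswith("-")):
        flag = 1
    elif text.startswith(">"):
        flag = 2
    # Strip everything except digits and the decimal point.
    digits = re.sub(r"[^0-9.]", "", text)
    try:
        number = float(digits)
    except ValueError:
        number = None  # nothing numeric was left to parse
    return number, flag


below = clean_value("<10")
above = clean_value(">5000")
plain = clean_value("12.5")
dashed = clean_value("-3", dash_bdl_indicator=True)
```

Keeping the flag separate from the cleaned number lets a later step decide how to substitute below detection limit values.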

pygeochemtools.geochem.create_dataset.handle_BDL(df: pandas.DataFrame, units: str) -> pandas.DataFrame

Convert below detection limit values to low, non-zero values.

Converts below detection limit values, like “<10”, to low numeric ppm values. All BDL values are converted to 0.001 ppm, except ppb values, which are converted to 0.00001 ppm.

Note

Requires clean_dataset() function to be run to create the “BDL” flag column first.

Parameters
  • df (pd.DataFrame) – Input dataframe to clean.

  • units (str) – Name of the units column header in df.

Returns

DataFrame with BDL values converted to low ppm values in the “converted_ppm” column.

Return type

pd.DataFrame
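
The substitution rule above can be sketched on a list of dicts: rows flagged with BDL equal to 1 get 0.001 ppm, or 0.00001 ppm when the original unit was ppb. The two constants and the BDL and converted_ppm column names come from the description above. The helper name handle_bdl, the UNIT column name and the "ppb" unit string are assumptions.

```python
BDL_PPM = 0.001            # substitute for BDL values in most units
BDL_PPB_AS_PPM = 0.00001   # substitute for BDL values reported in ppb


def handle_bdl(rows, units):
    """Return copies of rows with flagged BDL values replaced in 'converted_ppm'."""
    out = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's data
        if row["BDL"] == 1:
            is_ppb = row[units] == "ppb"
            new_row["converted_ppm"] = BDL_PPB_AS_PPM if is_ppb else BDL_PPM
        out.append(new_row)
    return out


# Illustrative data; the UNIT column name is an assumption.
rows = [
    {"UNIT": "ppm", "BDL": 1, "converted_ppm": 10.0},
    {"UNIT": "ppb", "BDL": 1, "converted_ppm": 0.005},
    {"UNIT": "ppm", "BDL": 0, "converted_ppm": 42.0},
]
handled = handle_bdl(rows, "UNIT")
```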

pygeochemtools.geochem.normalisation

Normalisation functions

pygeochemtools.geochem.normalisation.normalise_crustal_abundace(df: pandas.DataFrame, element: str, ppm_column_name: str) -> pandas.DataFrame

Create a column with ppm element values normalised against average crustal abundance.

Uses the average crustal abundance values of Rudnick and Gao, 2004.

Note

Must be an elemental value and not an oxide. New elements can be added to the config file.

Parameters
  • df (pd.DataFrame) – Dataframe containing elemental ppm values to normalise.

  • element (str) – The element to normalise, used to retrieve the abundance value from the config file.

  • ppm_column_name (str) – Column header containing the ppm values to normalise.

Returns

Dataframe with Normalised_crustal_abund_(ppm) column added.

Return type

pd.DataFrame
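
Normalising against crustal abundance is a division of each ppm value by a single reference value for the element. The sketch below shows that step on a list of dicts. The abundance used here (20.0 ppm) is a placeholder chosen for easy arithmetic; it is not a value from Rudnick and Gao or from the library's config file, and the helper name normalise is hypothetical.

```python
def normalise(rows, ppm_column_name, crustal_abundance_ppm):
    """Add 'Normalised_crustal_abund_(ppm)' = value / crustal abundance."""
    return [
        {
            **row,
            "Normalised_crustal_abund_(ppm)": row[ppm_column_name]
            / crustal_abundance_ppm,
        }
        for row in rows
    ]


# Placeholder abundance for illustration only; not a published value.
PLACEHOLDER_ABUNDANCE_PPM = 20.0

rows = [{"converted_ppm": 40.0}, {"converted_ppm": 5.0}]
normalised = normalise(rows, "converted_ppm", PLACEHOLDER_ABUNDANCE_PPM)
```

A result above 1 means the sample is enriched relative to the reference value, and a result below 1 means it is depleted.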

pygeochemtools.geochem.transform

Data transformation

pygeochemtools.geochem.transform.long_to_wide(df: pandas.DataFrame, sample_id: str, element_id: str, value: str, units: str, include_units: bool = False) -> pandas.DataFrame

Convert geochemical data tables from long to wide form.

This function takes a dataframe of long form geochemical data, i.e. data with one row per element, and runs a pivot to convert it to a standard wide form data with one row per sample and each element in a separate column.

It handles duplicate values based on sample_id and element_id by taking the first duplicate value initially, then catching the second duplicate, performing a second pivot, and appending the duplicates to the final table. It does not handle more than two duplicates of the same value, in which case it will return only the first value.

Parameters
  • df (pd.DataFrame) – Dataframe containing long form data.

  • sample_id (str) – Name of column containing sample IDs.

  • element_id (str) – Name of column containing geochemical element names.

  • value (str) – Name of column containing geochemical data values.

  • units (str) – Name of column containing geochemical data units.

  • include_units (bool, optional) – Whether to include units in the output. Defaults to False.

Returns

Dataframe converted to wide table format with one sample per row and columns for each element. Contains only sample_id and element/unit values.

Return type

pd.DataFrame
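
The duplicate handling described above can be pictured as two passes over the (sample, element) pairs: the first occurrence goes into the main wide table, a second occurrence goes into a second table that is appended afterwards, and any further occurrences are dropped. The sketch below is one reading of that description in plain Python, not the library's pandas pivot. The helper name long_to_wide_sketch and the column names are assumptions.

```python
def long_to_wide_sketch(rows, sample_id, element_id, value):
    """Pivot long rows to wide rows, keeping at most two values per pair."""
    first, second = {}, {}
    for row in rows:
        sample, element = row[sample_id], row[element_id]
        if element not in first.setdefault(sample, {}):
            first[sample][element] = row[value]   # first occurrence
        elif element not in second.setdefault(sample, {}):
            second[sample][element] = row[value]  # second occurrence
        # Third and later occurrences are ignored.
    wide = [{sample_id: s, **values} for s, values in first.items()]
    # Append the rows built from the second occurrences.
    wide += [{sample_id: s, **values} for s, values in second.items()]
    return wide


# Illustrative data; the column names are assumptions.
rows = [
    {"SAMPLE_NO": "A", "ELEMENT": "Cu", "VALUE": 10.0},
    {"SAMPLE_NO": "A", "ELEMENT": "Au", "VALUE": 0.5},
    {"SAMPLE_NO": "A", "ELEMENT": "Cu", "VALUE": 12.0},  # duplicate, kept
    {"SAMPLE_NO": "A", "ELEMENT": "Cu", "VALUE": 99.0},  # third copy, dropped
    {"SAMPLE_NO": "B", "ELEMENT": "Cu", "VALUE": 3.0},
]
wide = long_to_wide_sketch(rows, "SAMPLE_NO", "ELEMENT", "VALUE")
```

In this reading, a sample with a duplicated analysis appears twice in the output: once in the main block and once in the appended block.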

pygeochemtools.geochem.transform.sarig_methods_wide(df: pandas.DataFrame, sample_id: str, element_id: str) -> pandas.DataFrame

Create a corresponding methods table to match the pivoted wide form data.

Note

This requires the input dataframe to already have had methods mapping applied by running the pygeochemtools.geochem.create_dataset.add_sarig_chem_method function.

Parameters
  • df (pd.DataFrame) – Dataframe containing long form data.

  • sample_id (str) – Name of column containing sample IDs.

  • element_id (str) – Name of column containing geochemical element names.

Returns

Dataframe with mapped geochemical methods converted to wide form with one method per sample.

Return type

pd.DataFrame