dataria.MATH

Functions

correlation([df, endpoint_url, query, col1, col2, ...])

Compute correlations between two columns of a DataFrame, including support for categorical (string) data.

plot_correlation_heatmap(correlation_df[, corr_col, ...])

Generate a heatmap plot from a correlation DataFrame.

upset([df, endpoint_url, query, col_item, col_sets, ...])

Generate an UpSet plot and/or a DataFrame suitable for upset.js from set membership data.

fuzzy_compare(df1[, df2, match_by, grouping_var, ...])

Perform a fast, memory-efficient fuzzy comparison between two DataFrames with strict AND logic across multiple columns.

fuzzy_compare_deprecated([df1, df2, ...])

Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.

fuzzy_compare_legacy([df1, df2, additional_vars_df1, ...])

Legacy implementation, kept because the newer function has not yet been tested extensively.

Module Contents

dataria.MATH.correlation(df=None, endpoint_url=None, query=None, col1=None, col2=None, sep=';', edges=0, csv_filename='correlations.csv', heatmap=True, heatmap_kwargs={}, dummies_matrix=True, save_PNG=True, verbose=True)

Compute correlations between two columns of a DataFrame, including support for categorical (string) data.

This function calculates Pearson correlations, handles dummy encoding for categorical data, supports SPARQL-based data retrieval, and can generate a heatmap of the results.

Parameters:
  • df (pd.DataFrame, optional) – The input DataFrame. If not provided, endpoint_url and query must be given so that the data can be retrieved via SPARQL.

  • endpoint_url (str, optional) – SPARQL endpoint URL.

  • query (str, optional) – SPARQL query string.

  • col1 (str) – Name of the first column to compare.

  • col2 (str) – Name of the second column to compare.

  • sep (str, optional) – Separator for multi-value string fields. Default is ‘;’.

  • edges (int, optional) – If > 0, only returns top and bottom N correlations.

  • csv_filename (str, optional) – File path to save the result as CSV.

  • heatmap (bool, optional) – Whether to generate a heatmap. Default is True.

  • heatmap_kwargs (dict, optional) – Additional kwargs passed to the heatmap function.

  • dummies_matrix (bool, optional) – If True (default), encode col1 and col2 as binary dummy variables. If False, use a bag-of-words encoding that counts how often each value occurs.

  • save_PNG (bool, optional) – Whether to save the heatmap as a PNG file.

  • verbose (bool, optional) – Whether to print insights into the dataframe.

Returns:

A DataFrame containing correlation values and p-values.

Return type:

pd.DataFrame
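
The dummy-encoding and correlation steps described above can be illustrated with a minimal standard-library sketch. It is not dataria's implementation: the sample rows and the helpers dummy_columns and pearson are hypothetical names chosen for illustration, and p-values are omitted.

```python
from itertools import product
from math import sqrt

# Hypothetical sample rows; multi-value fields use ';' as the separator.
rows = [
    {"genre": "drama;comedy", "award": "yes"},
    {"genre": "drama", "award": "yes"},
    {"genre": "comedy", "award": "no"},
    {"genre": "action", "award": "no"},
]

def dummy_columns(rows, col, sep=";"):
    """Return {value: [0/1 per row]} for every distinct value found in `col`."""
    split = [{v.strip() for v in r[col].split(sep) if v.strip()} for r in rows]
    values = sorted(set().union(*split))
    return {v: [int(v in s) for s in split] for v in values}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")

d1 = dummy_columns(rows, "genre")
d2 = dummy_columns(rows, "award")

# Correlate every dummy of the first column with every dummy of the second,
# sorted from the strongest positive to the strongest negative correlation.
pairs = sorted(
    ((a, b, pearson(d1[a], d2[b])) for a, b in product(d1, d2)),
    key=lambda t: t[2],
    reverse=True,
)

# Mimic edges=N: keep only the N highest and N lowest correlations.
edges = 1
top_and_bottom = pairs[:edges] + pairs[-edges:]
```

In this toy data, "drama" and "yes" occur in exactly the same rows, so that pair ends up at the top of the list and ("drama", "no") at the bottom.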

dataria.MATH.plot_correlation_heatmap(correlation_df, corr_col='Correlation', save_PNG=True, title='Correlation Heatmap', figsize=(10, 8), **heatmap_kwargs)

Generate a heatmap plot from a correlation DataFrame.

Supports various matrix formats (wide, long, 1x1) and visual customization.

Parameters:
  • correlation_df (pd.DataFrame) – DataFrame containing correlation values.

  • corr_col (str) – Name of the correlation column. Default is “Correlation”.

  • save_PNG (bool) – Whether to save the heatmap as a PNG.

  • title (str) – Title of the plot.

  • figsize (tuple) – Size of the figure (width, height).

  • **heatmap_kwargs – Additional arguments for seaborn.heatmap or seaborn.barplot.

Returns:

None
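
To make the difference between the long and wide formats concrete, the following sketch pivots long-format records into a wide matrix using only the standard library. The key names "Var1" and "Var2" and the helper long_to_wide are assumptions for illustration; the column names that plot_correlation_heatmap actually expects are not stated here.

```python
# Hypothetical long-format correlation records; "Var1"/"Var2" are assumed names.
long_rows = [
    {"Var1": "drama", "Var2": "yes", "Correlation": 1.0},
    {"Var1": "drama", "Var2": "no", "Correlation": -1.0},
    {"Var1": "comedy", "Var2": "yes", "Correlation": 0.0},
    {"Var1": "comedy", "Var2": "no", "Correlation": 0.0},
]

def long_to_wide(rows, row_key="Var1", col_key="Var2", corr_col="Correlation"):
    """Pivot long records into (row_labels, col_labels, matrix).

    Cells without a matching record are filled with None.
    """
    row_labels = sorted({r[row_key] for r in rows})
    col_labels = sorted({r[col_key] for r in rows})
    lookup = {(r[row_key], r[col_key]): r[corr_col] for r in rows}
    matrix = [[lookup.get((rl, cl)) for cl in col_labels] for rl in row_labels]
    return row_labels, col_labels, matrix

row_labels, col_labels, matrix = long_to_wide(long_rows)
```

The wide form has one row per value of the first variable and one column per value of the second, which is the shape a heatmap grid is drawn from.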

dataria.MATH.upset(df=None, endpoint_url=None, query=None, col_item='item', col_sets='set', sep=';', csv_filename='upset_data.csv', plot_upset=True, png_filename='upset_plot.png', verbose=True, **upset_kwargs)

Generate an UpSet plot and/or a DataFrame suitable for upset.js from set membership data.

Useful for visualizing overlapping categories or tag combinations. Data can be provided via SPARQL or directly as a DataFrame.

Parameters:
  • df (pd.DataFrame, optional) – Input DataFrame.

  • endpoint_url (str, optional) – SPARQL endpoint URL.

  • query (str, optional) – SPARQL query.

  • col_item (str) – Name of the item column. Default is “item”.

  • col_sets (str) – Name of the set membership column. Default is “set”.

  • sep (str) – Separator used in the set column. Default is ‘;’.

  • csv_filename (str) – File path to save the transformed DataFrame.

  • plot_upset (bool) – Whether to generate an UpSet plot.

  • png_filename (str) – File path to save the UpSet plot as PNG.

  • verbose (bool, optional) – Whether to print insights into the dataframe.

  • **upset_kwargs – Additional arguments for up.UpSet().

Returns:

The transformed DataFrame (one-hot encoded for sets).

Return type:

pd.DataFrame
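
The one-hot transformation of set membership data can be sketched as follows. This is a standard-library illustration under stated assumptions, not the function's actual code: the sample rows and the helper one_hot_sets are hypothetical, and the real function returns a pd.DataFrame rather than a list of dicts.

```python
from collections import Counter

# Hypothetical input: one row per item, set memberships joined by ';'.
rows = [
    {"item": "A", "set": "red;blue"},
    {"item": "B", "set": "blue"},
    {"item": "C", "set": "red;blue;green"},
]

def one_hot_sets(rows, col_item="item", col_sets="set", sep=";"):
    """Return one dict per item with a boolean flag for every known set."""
    memberships = {
        r[col_item]: {s.strip() for s in r[col_sets].split(sep) if s.strip()}
        for r in rows
    }
    all_sets = sorted(set().union(*memberships.values()))
    return [
        {col_item: item, **{s: (s in sets) for s in all_sets}}
        for item, sets in memberships.items()
    ]

encoded = one_hot_sets(rows)

# Count how many items share each exact combination of sets; these counts
# correspond to the intersection bars an UpSet plot displays.
combo_counts = Counter(
    tuple(name for name, flag in row.items() if flag is True) for row in encoded
)
```

Each distinct combination of sets becomes one key in combo_counts, so overlapping categories can be inspected without plotting.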

dataria.MATH.fuzzy_compare(df1, df2=None, match_by=None, grouping_var=None, treat_empty_as_match=False, match_all=True, unique_rows=False, additional_vars=None, csv_filename=None, verbose=True)

Perform a fast, memory-efficient fuzzy comparison between two DataFrames with strict AND logic across multiple columns.

Each (column, threshold) pair in match_by must satisfy its threshold for a group to be retained if match_all=True.

Parameters:
  • df1 (pd.DataFrame) – Primary DataFrame for comparison.

  • df2 (pd.DataFrame, optional) – Secondary DataFrame to compare against. If None, df1 is used (self-join).

  • match_by (list of (str, int)) – List of tuples specifying columns to compare and their fuzzy thresholds (0–100). Columns with threshold >=100 are treated as exact matches.

  • grouping_var (str, optional) – Column name used to define group identities between df1 and df2.

  • treat_empty_as_match (bool, default=False) – If True, empty strings (“”) count as perfect matches (score = 100).

  • match_all (bool, default=True) – If True, groups are retained only if all match_by conditions meet their respective thresholds (strict AND logic). If False, any matching condition is sufficient (OR logic).

  • unique_rows (bool, default=False) – For self-joins, exclude symmetric duplicates.

  • additional_vars (list of str, optional) – Additional columns to include in the aggregated output for context.

  • csv_filename (str, optional) – Path to save the aggregated results as CSV. If None, results are not saved.

  • verbose (bool, default=True) – If True, print progress and summary information.

Returns:

Aggregated match statistics per (group1, group2) pair, including per-column min/avg/max fuzzy scores and optional contextual fields.

Return type:

pd.DataFrame
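
The AND/OR semantics of match_by and match_all can be sketched on a single pair of records. This is a simplified illustration, not the library's code: difflib.SequenceMatcher stands in for whatever fuzzy scorer dataria uses, the helpers score and pair_matches are hypothetical, the sketch works on row pairs rather than groups, and treating a pair as "empty" when either value is an empty string is an assumption.

```python
from difflib import SequenceMatcher

def score(a, b, treat_empty_as_match=False):
    """Similarity on a 0-100 scale; difflib is a stand-in scorer."""
    # Assumption: a pair counts as "empty" when either value is "".
    if a == "" or b == "":
        return 100.0 if treat_empty_as_match else 0.0
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_matches(row1, row2, match_by, match_all=True, treat_empty_as_match=False):
    """Apply every (column, threshold) condition and combine with AND or OR."""
    checks = []
    for col, threshold in match_by:
        if threshold >= 100:
            # Thresholds of 100 or more are treated as exact matches.
            ok = row1[col] == row2[col]
        else:
            ok = score(row1[col], row2[col], treat_empty_as_match) >= threshold
        checks.append(ok)
    return all(checks) if match_all else any(checks)

a = {"title": "The Great Gatsby", "year": "1925"}
b = {"title": "The Great Gatsbi", "year": "1926"}
match_by = [("title", 90), ("year", 100)]

and_result = pair_matches(a, b, match_by, match_all=True)   # title passes, year fails
or_result = pair_matches(a, b, match_by, match_all=False)   # one passing condition suffices
```

Here the titles differ by one character and clear the 90 threshold, while the years differ and fail the exact-match condition, so the pair is rejected under AND logic and accepted under OR logic.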

dataria.MATH.fuzzy_compare_deprecated(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)

Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.

Supports optional grouping, label filtering, and aggregation of match statistics.

Parameters:
  • df1 (pd.DataFrame, optional) – First DataFrame.

  • df2 (pd.DataFrame, optional) – Second DataFrame. If not provided, df1 is used.

  • additional_vars_df1 (list, optional) – List of columns from df1 that will be aggregated in the result.

  • additional_vars_df2 (list, optional) – List of columns from df2 that will be aggregated in the result.

  • endpoint_url (str, optional) – SPARQL endpoint (used if df1 is None).

  • query (str, optional) – SPARQL query (used if df1 is None).

  • grouping_var (str, optional) – Column name used for grouping (must exist in both DataFrames).

  • label_var (str or list, optional) – Label conditions, each one of:
      - str, (col, “identical”), or (col, 100): require an exact match on this column.
      - (col, int < 100): require a fuzzy match on this column with the given threshold.

  • element_var (str) – Column containing the string values to compare with fuzzy matching.

  • threshold (int, optional) – Fuzzy matching threshold for element_var (0–100). Default: 95.

  • match_all (bool, optional) – If True, only include groups where all matches exceed the threshold.

  • unique_rows (bool, optional) – If True, suppress duplicate pairings (self-joins).

  • csv_filename (str, optional) – File path to save the aggregated results. If None, skip saving.

  • verbose (bool, optional) – If True, print debug info and head of results.

Returns:

Aggregated match statistics between df1 and df2 (or within df1).

Return type:

pd.DataFrame
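
The accepted forms of label_var can be read as shorthand for (column, threshold) pairs, similar to the match_by argument of fuzzy_compare. The normalizer below is a hypothetical sketch of that reading, not code taken from the library; the helper name normalize_label_var and its error handling are assumptions.

```python
def normalize_label_var(label_var):
    """Return a list of (column, threshold) pairs; threshold 100 means exact match."""
    if label_var is None:
        return []
    # A bare string or a single (col, rule) tuple is wrapped into a list.
    if isinstance(label_var, (str, tuple)):
        label_var = [label_var]
    conditions = []
    for entry in label_var:
        if isinstance(entry, str):
            # A plain column name requires an exact match.
            conditions.append((entry, 100))
            continue
        col, rule = entry
        if rule == "identical":
            conditions.append((col, 100))
        elif isinstance(rule, int) and 0 <= rule <= 100:
            # Values below 100 are fuzzy thresholds; 100 is an exact match.
            conditions.append((col, rule))
        else:
            raise ValueError(f"unsupported label condition: {entry!r}")
    return conditions

single = normalize_label_var("name")
mixed = normalize_label_var([("name", "identical"), ("city", 85)])
```

With this reading, "name" and ("name", "identical") are equivalent, and ("city", 85) requests a fuzzy match on the city column with a threshold of 85.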

dataria.MATH.fuzzy_compare_legacy(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)

Legacy implementation, kept because the newer function has not yet been tested extensively. Performs fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.

Supports optional grouping, label filtering, and aggregation of match statistics.

Parameters:
  • df1 (pd.DataFrame, optional) – First DataFrame.

  • df2 (pd.DataFrame, optional) – Second DataFrame. If not provided, df1 is used.

  • additional_vars_df1 (list, optional) – List of columns that will be aggregated in the result (using first per group).

  • additional_vars_df2 (list, optional) – List of columns that will be aggregated in the result (using first per group).

  • endpoint_url (str, optional) – SPARQL endpoint.

  • query (str, optional) – SPARQL query.

  • grouping_var (str, optional) – Column name used for grouping.

  • label_var (str, optional) – Optional label for filtering matches.

  • element_var (str) – Column containing the string values to compare.

  • threshold (int) – Fuzzy matching threshold (0–100). Default: 95.

  • match_all (bool) – If True, only include groups where all scores exceed the threshold.

  • unique_rows (bool) – If True, suppress duplicate pairings.

  • csv_filename (str) – File path to save the results.

  • verbose (bool, optional) – Whether to print insights into the dataframe.

Returns:

Aggregated match statistics between df1 and df2 (or within df1).

Return type:

pd.DataFrame