dataria.MATH¶

Functions¶

`correlation`([df, endpoint_url, query, col1, col2, ...])	Compute correlations between two columns of a DataFrame, including support for categorical (string) data.
`plot_correlation_heatmap`(correlation_df[, corr_col, ...])	Generate a heatmap plot from a correlation DataFrame.
`upset`([df, endpoint_url, query, col_item, col_sets, ...])	Generate an UpSet plot and/or a DataFrame suitable for upset.js from set membership data.
`fuzzy_compare`(df1[, df2, match_by, grouping_var, ...])	Perform a fast, memory-efficient fuzzy comparison between two DataFrames with strict AND logic across multiple columns.
`fuzzy_compare_deprecated`([df1, df2, ...])	Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.
`fuzzy_compare_legacy`([df1, df2, additional_vars_df1, ...])	Legacy implementation, retained because the newer function has not yet been tested extensively.

Module Contents¶

dataria.MATH.correlation(df=None, endpoint_url=None, query=None, col1=None, col2=None, sep=';', edges=0, csv_filename='correlations.csv', heatmap=True, heatmap_kwargs={}, dummies_matrix=True, save_PNG=True, verbose=True)¶

Compute correlations between two columns of a DataFrame, including support for categorical (string) data.

This function calculates Pearson correlations, handles dummy encoding for categorical data, supports SPARQL-based data retrieval, and can generate a heatmap of the results.

Parameters:

df (pd.DataFrame, optional) – The input DataFrame. If not provided, endpoint_url and query must be given so that the data can be retrieved via SPARQL.
endpoint_url (str, optional) – SPARQL endpoint URL.
query (str, optional) – SPARQL query string.
col1 (str) – Name of the first column to compare.
col2 (str) – Name of the second column to compare.
sep (str, optional) – Separator for multi-value string fields. Default is ‘;’.
edges (int, optional) – If greater than 0, return only the N highest and N lowest correlations. Default is 0.
csv_filename (str, optional) – File path to save the result as CSV.
heatmap (bool, optional) – Whether to generate a heatmap. Default is True.
heatmap_kwargs (dict, optional) – Additional kwargs passed to the heatmap function.
dummies_matrix (bool, optional) – If True (the default), treat col1 and col2 as binary dummy variables. If False, use a bag-of-words encoding based on count frequencies.
save_PNG (bool, optional) – Whether to save the heatmap as a PNG file.
verbose (bool, optional) – Whether to print summary information about the DataFrame.

Returns:

A DataFrame containing correlation values and p-values.

Return type:

pd.DataFrame
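
The dummy-encoding idea behind `correlation` can be sketched with the standard library alone. This is a minimal illustration, not dataria's implementation: the helper names (`split_values`, `dummy_columns`, `pearson`) and the sample data are hypothetical, p-values are omitted, and the final slice only approximates what `edges` is described as doing.

```python
from math import sqrt


def split_values(cell, sep=";"):
    """Split a multi-value string cell into a set of stripped, non-empty tokens."""
    return {token.strip() for token in cell.split(sep) if token.strip()}


def dummy_columns(cells, sep=";"):
    """Build one binary (0/1) indicator column per distinct token."""
    parsed = [split_values(cell, sep) for cell in cells]
    tokens = sorted(set().union(*parsed))
    return {tok: [1 if tok in row else 0 for row in parsed] for tok in tokens}


def pearson(x, y):
    """Pearson correlation coefficient; returns None if either column is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Hypothetical sample data: one entry per record, values separated by ";".
genres = ["drama;comedy", "drama", "comedy", "drama;thriller"]
awards = ["yes", "yes", "no", "yes"]

left = dummy_columns(genres)
right = dummy_columns(awards)

# Correlate every dummy column of the first field with every one of the second.
results = [(a, b, pearson(left[a], right[b])) for a in left for b in right]
results = [r for r in results if r[2] is not None]
results.sort(key=lambda r: r[2], reverse=True)

# Assumed reading of `edges`: keep the N strongest positive and negative pairs.
edges = 1
top_and_bottom = results[:edges] + results[-edges:]
print(top_and_bottom)
```

In this sample, "drama" and "yes" occur in exactly the same records, so that pair sits at the top of the list and "drama" with "no" at the bottom.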

dataria.MATH.plot_correlation_heatmap(correlation_df, corr_col='Correlation', save_PNG=True, title='Correlation Heatmap', figsize=(10, 8), **heatmap_kwargs)¶

Generate a heatmap plot from a correlation DataFrame.

Supports various matrix formats (wide, long, 1x1) and visual customization.

Parameters:

correlation_df (pd.DataFrame) – DataFrame containing correlation values.
corr_col (str) – Name of the correlation column. Default is “Correlation”.
save_PNG (bool) – Whether to save the heatmap as a PNG.
title (str) – Title of the plot.
figsize (tuple) – Size of the figure (width, height).
**heatmap_kwargs – Additional arguments for seaborn.heatmap or seaborn.barplot.

Returns:

None
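
The difference between the long and wide formats mentioned above can be illustrated as follows. This is only a sketch: the column names `var1` and `var2` and the helper `long_to_wide` are assumptions made for the example, and the layout that `plot_correlation_heatmap` actually expects may differ.

```python
def long_to_wide(rows, corr_col="Correlation"):
    """Pivot long-format rows into a nested dict: wide[row_label][col_label] = value."""
    wide = {}
    for row in rows:
        wide.setdefault(row["var1"], {})[row["var2"]] = row[corr_col]
    return wide


# Hypothetical long-format result: one row per pair of variables.
long_rows = [
    {"var1": "drama", "var2": "yes", "Correlation": 1.0},
    {"var1": "drama", "var2": "no", "Correlation": -1.0},
    {"var1": "comedy", "var2": "yes", "Correlation": -0.58},
]

wide = long_to_wide(long_rows)
print(wide)
```

Pairs missing from the long format are simply absent from the nested dict; a plotting routine would have to decide how to render such gaps.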

dataria.MATH.upset(df=None, endpoint_url=None, query=None, col_item='item', col_sets='set', sep=';', csv_filename='upset_data.csv', plot_upset=True, png_filename='upset_plot.png', verbose=True, **upset_kwargs)¶

Generate an UpSet plot and/or a DataFrame suitable for upset.js from set membership data.

Useful for visualizing overlapping categories or tag combinations. Data can be provided via SPARQL or directly as a DataFrame.

Parameters:

df (pd.DataFrame, optional) – Input DataFrame.
endpoint_url (str, optional) – SPARQL endpoint URL.
query (str, optional) – SPARQL query.
col_item (str) – Name of the item column. Default is “item”.
col_sets (str) – Name of the set membership column. Default is “set”.
sep (str) – Separator used in the set column. Default is ‘;’.
csv_filename (str) – File path to save the transformed DataFrame.
plot_upset (bool) – Whether to generate an UpSet plot.
png_filename (str) – File path to save the UpSet plot as PNG.
verbose (bool, optional) – Whether to print summary information about the DataFrame.
**upset_kwargs – Additional arguments for up.UpSet().

Returns:

The transformed DataFrame (one-hot encoded for sets).

Return type:

pd.DataFrame
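
The one-hot transformation described for `upset` can be sketched without pandas. The helpers `one_hot_sets` and `intersection_sizes` and the sample records are hypothetical, and the DataFrame that dataria actually returns may be laid out differently.

```python
from collections import Counter


def one_hot_sets(records, sep=";"):
    """Turn (item, "a;b;c") pairs into one row per item with a boolean per set."""
    membership = {}
    for item, sets_cell in records:
        names = {s.strip() for s in sets_cell.split(sep) if s.strip()}
        membership.setdefault(item, set()).update(names)
    all_sets = sorted(set().union(*membership.values()))
    table = {
        item: {name: name in names for name in all_sets}
        for item, names in membership.items()
    }
    return all_sets, table


def intersection_sizes(table):
    """Count how many items share each exact combination of sets."""
    return Counter(
        tuple(name for name, member in row.items() if member)
        for row in table.values()
    )


# Hypothetical input: an item column and a ";"-separated set column.
records = [
    ("item1", "A;B"),
    ("item2", "A"),
    ("item3", "A;B"),
    ("item4", "B;C"),
]

all_sets, table = one_hot_sets(records)
sizes = intersection_sizes(table)
print(all_sets)
print(sizes)
```

The counts per exact combination are the quantities an UpSet plot draws as bars, one bar per combination of sets.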

dataria.MATH.fuzzy_compare(df1, df2=None, match_by=None, grouping_var=None, treat_empty_as_match=False, match_all=True, unique_rows=False, additional_vars=None, csv_filename=None, verbose=True)¶

Perform a fast, memory-efficient fuzzy comparison between two DataFrames with strict AND logic across multiple columns.

If match_all=True, a group is retained only when every (column, threshold) pair in match_by meets its threshold.
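
The AND/OR logic over (column, threshold) pairs can be sketched as follows. The sketch makes several assumptions: `difflib.SequenceMatcher` stands in for whatever fuzzy scorer dataria uses, the check runs on a single pair of rows rather than on groups, an empty value on either side counts as a match when `treat_empty_as_match` is set, and the helper names `score` and `pair_matches` are hypothetical.

```python
from difflib import SequenceMatcher


def score(a, b, treat_empty_as_match=False):
    """Similarity on a 0-100 scale (difflib used as a stand-in scorer)."""
    if treat_empty_as_match and (a == "" or b == ""):
        return 100.0
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def pair_matches(row1, row2, match_by, match_all=True, treat_empty_as_match=False):
    """Evaluate every (column, threshold) pair; thresholds >= 100 mean exact equality."""
    checks = []
    for column, threshold in match_by:
        a, b = row1[column], row2[column]
        if threshold >= 100:
            passed = a == b or (treat_empty_as_match and "" in (a, b))
        else:
            passed = score(a, b, treat_empty_as_match) >= threshold
        checks.append(passed)
    # match_all=True is strict AND logic; match_all=False is OR logic.
    return all(checks) if match_all else any(checks)


# Hypothetical rows: the titles are nearly identical, the years differ.
row1 = {"title": "The Hobbit", "year": "1937"}
row2 = {"title": "The Hobbitt", "year": "1938"}
match_by = [("title", 85), ("year", 100)]

print(pair_matches(row1, row2, match_by, match_all=True))
print(pair_matches(row1, row2, match_by, match_all=False))
```

With AND logic the differing year rejects the pair even though the titles are close; with OR logic the title alone is enough to accept it.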

Parameters:

df1 (pd.DataFrame) – Primary DataFrame for comparison.
df2 (pd.DataFrame, optional) – Secondary DataFrame to compare against. If None, df1 is used (self-join).
match_by (list of (str, int)) – List of tuples specifying columns to compare and their fuzzy thresholds (0–100). Columns with threshold >=100 are treated as exact matches.
grouping_var (str, optional) – Column name used to define group identities between df1 and df2.
treat_empty_as_match (bool, default=False) – If True, empty strings (“”) count as perfect matches (score = 100).
match_all (bool, default=True) – If True, groups are retained only if all match_by conditions meet their respective thresholds (strict AND logic). If False, any matching condition is sufficient (OR logic).
unique_rows (bool, default=False) – For self-joins, exclude symmetric duplicates.
additional_vars (list of str, optional) – Additional columns to include in the aggregated output for context.
csv_filename (str, optional) – Path to save the aggregated results as CSV. If None, results are not saved.
verbose (bool, default=True) – If True, print progress and summary information.

Returns:

Aggregated match statistics per (group1, group2) pair, including per-column min/avg/max fuzzy scores and optional contextual fields.

Return type:

pd.DataFrame
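
The aggregation step described under Returns can be sketched for a single compared column. The helper `aggregate_scores` and its input tuples are hypothetical, and the handling of `unique_rows` (keeping only pairs whose first group does not sort after the second) is one plausible reading of "exclude symmetric duplicates", not a confirmed description of dataria's behavior.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(pair_scores, unique_rows=False):
    """Collapse row-level scores into min/avg/max per (group1, group2) pair."""
    buckets = defaultdict(list)
    for group1, group2, value in pair_scores:
        if unique_rows and group1 > group2:
            # In a self-join, (b, a) repeats the information already held by (a, b).
            continue
        buckets[(group1, group2)].append(value)
    return {
        key: {"min": min(vals), "avg": mean(vals), "max": max(vals)}
        for key, vals in buckets.items()
    }


# Hypothetical row-level scores from a self-join: (group1, group2, score).
pair_scores = [
    ("g1", "g2", 90.0),
    ("g1", "g2", 100.0),
    ("g2", "g1", 90.0),
    ("g2", "g1", 100.0),
]

print(aggregate_scores(pair_scores))
print(aggregate_scores(pair_scores, unique_rows=True))
```

Without `unique_rows`, both ("g1", "g2") and ("g2", "g1") appear in the result with identical statistics; with it, only the first remains.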

dataria.MATH.fuzzy_compare_deprecated(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)¶

Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.

Supports optional grouping, label filtering, and aggregation of match statistics.

Parameters:

df1 (pd.DataFrame, optional) – First DataFrame.
df2 (pd.DataFrame, optional) – Second DataFrame. If not provided, df1 is used.
additional_vars_df1 (list, optional) – List of columns from df1 that will be aggregated in the result.
additional_vars_df2 (list, optional) – List of columns from df2 that will be aggregated in the result.
endpoint_url (str, optional) – SPARQL endpoint (used if df1 is None).
query (str, optional) – SPARQL query (used if df1 is None).
grouping_var (str, optional) – Column name used for grouping (must exist in both DataFrames).
label_var (str or list, optional) – Label conditions, one of:
- str, (col, “identical”), or (col, 100): require an exact match on this column
- (col, int < 100): require a fuzzy match on this column with the given threshold
element_var (str) – Column containing the string values to compare with fuzzy matching.
threshold (int, optional) – Fuzzy matching threshold for element_var (0–100). Default: 95.
match_all (bool, optional) – If True, only include groups where all matches exceed the threshold.
unique_rows (bool, optional) – If True, suppress duplicate pairings (self-joins).
csv_filename (str, optional) – File path to save the aggregated results. If None, skip saving.
verbose (bool, optional) – If True, print debug info and head of results.

Returns:

Aggregated match statistics between df1 and df2 (or within df1).

Return type:

pd.DataFrame
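
The accepted forms of `label_var` can be reduced to uniform (column, threshold) tuples, as sketched below. The helper `normalize_label_var` is hypothetical, and the sketch assumes that a list holds several conditions while a bare string or tuple is a single condition; the real function may parse its input differently.

```python
def normalize_label_var(label_var):
    """Map each accepted label_var form to a (column, threshold) tuple."""
    if label_var is None:
        return []
    # Assumption: a list contains several conditions, anything else is one condition.
    specs = label_var if isinstance(label_var, list) else [label_var]
    normalized = []
    for spec in specs:
        if isinstance(spec, str):
            # A bare column name requires an exact match.
            normalized.append((spec, 100))
        else:
            column, rule = spec
            normalized.append((column, 100 if rule == "identical" else int(rule)))
    return normalized


print(normalize_label_var("label"))
print(normalize_label_var(("label", "identical")))
print(normalize_label_var([("label", 90), "type"]))
```

After normalization, a threshold of 100 marks an exact-match condition and anything lower marks a fuzzy one.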

dataria.MATH.fuzzy_compare_legacy(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)¶

Legacy implementation, retained because the newer function has not yet been tested extensively. Performs fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.

Supports optional grouping, label filtering, and aggregation of match statistics.

Parameters:

df1 (pd.DataFrame, optional) – First DataFrame.
df2 (pd.DataFrame, optional) – Second DataFrame. If not provided, df1 is used.
additional_vars_df1 (list, optional) – List of columns that will be aggregated in the result (using first per group).
additional_vars_df2 (list, optional) – List of columns that will be aggregated in the result (using first per group).
endpoint_url (str, optional) – SPARQL endpoint.
query (str, optional) – SPARQL query.
grouping_var (str, optional) – Column name used for grouping.
label_var (str, optional) – Optional label for filtering matches.
element_var (str) – Column containing the string values to compare.
threshold (int) – Fuzzy matching threshold (0–100). Default: 95.
match_all (bool) – If True, only include groups where all scores exceed the threshold.
unique_rows (bool) – If True, suppress duplicate pairings.
csv_filename (str) – File path to save the results.
verbose (bool, optional) – Whether to print summary information about the DataFrame.

Returns:

Aggregated match statistics between df1 and df2 (or within df1).

Return type:

pd.DataFrame
