dataria.MATH
Functions
| correlation | Compute correlations between two columns of a DataFrame, including support for categorical (string) data. |
| plot_correlation_heatmap | Generate a heatmap plot from a correlation DataFrame. |
| upset | Generate an UpSet plot and/or a DataFrame suitable for upset.js from set membership data. |
| fuzzy_compare | Perform a fast, memory-efficient fuzzy comparison between two DataFrames. |
| fuzzy_compare_deprecated | Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column. |
| fuzzy_compare_legacy | Legacy version, retained because the new function has not been tested much yet. |
Module Contents
- dataria.MATH.correlation(df=None, endpoint_url=None, query=None, col1=None, col2=None, sep=';', edges=0, csv_filename='correlations.csv', heatmap=True, heatmap_kwargs={}, dummies_matrix=True, save_PNG=True, verbose=True)
Compute correlations between two columns of a DataFrame, including support for categorical (string) data.
This function calculates Pearson correlations, handles dummy encoding for categorical data, supports SPARQL-based data retrieval, and can generate a heatmap of the results.
- Parameters:
df (pd.DataFrame, optional) – The input DataFrame. If not provided, SPARQL must be used.
endpoint_url (str, optional) – SPARQL endpoint URL.
query (str, optional) – SPARQL query string.
col1 (str) – Name of the first column to compare.
col2 (str) – Name of the second column to compare.
sep (str, optional) – Separator for multi-value string fields. Default is ‘;’.
edges (int, optional) – If > 0, return only the top and bottom N correlations.
csv_filename (str, optional) – File path to save the result as CSV.
heatmap (bool, optional) – Whether to generate a heatmap. Default is True.
heatmap_kwargs (dict, optional) – Additional kwargs passed to the heatmap function.
dummies_matrix (bool, optional) – If True (default), encode the values of col1 and col2 as binary dummy variables. If False, use a bag-of-words encoding with frequency counts.
save_PNG (bool, optional) – Whether to save the heatmap as a PNG file.
verbose (bool, optional) – Whether to print insights into the dataframe.
- Returns:
A DataFrame containing correlation values and p-values.
- Return type:
pd.DataFrame
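The dummy-encoding step described above can be illustrated with a minimal, standard-library sketch. It is not dataria's implementation: the helper names (`split_values`, `dummy_columns`, `pearson`) and the sample data are hypothetical, and p-values are omitted. It only shows how multi-value cells split on `sep` become 0/1 columns that can then be correlated pairwise.

```python
from math import sqrt

def split_values(cell, sep=";"):
    """Split a multi-value cell into a set of stripped, non-empty values."""
    return {part.strip() for part in cell.split(sep) if part.strip()}

def dummy_columns(cells, sep=";"):
    """Return {value: [0/1 per row]} for every distinct value in the column."""
    row_sets = [split_values(cell, sep) for cell in cells]
    values = sorted(set().union(*row_sets))
    return {v: [1 if v in s else 0 for s in row_sets] for v in values}

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (sx * sy)

# Hypothetical data: one row per record, multi-value cells separated by ';'
genres = ["poem;letter", "letter", "poem", "letter;essay"]
languages = ["de", "fr", "de", "fr;en"]

genre_dummies = dummy_columns(genres)
language_dummies = dummy_columns(languages)

# One correlation per (genre value, language value) pair
correlations = {
    (g, l): pearson(genre_dummies[g], language_dummies[l])
    for g in genre_dummies
    for l in language_dummies
}
```

With real data, the equivalent call would pass the two column names as `col1` and `col2` and the same separator as `sep`.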
- dataria.MATH.plot_correlation_heatmap(correlation_df, corr_col='Correlation', save_PNG=True, title='Correlation Heatmap', figsize=(10, 8), **heatmap_kwargs)
Generate a heatmap plot from a correlation DataFrame.
Supports various matrix formats (wide, long, 1x1) and visual customization.
- Parameters:
correlation_df (pd.DataFrame) – DataFrame containing correlation values.
corr_col (str) – Name of the correlation column. Default is “Correlation”.
save_PNG (bool) – Whether to save the heatmap as a PNG.
title (str) – Title of the plot.
figsize (tuple) – Size of the figure (width, height).
**heatmap_kwargs – Additional arguments for seaborn.heatmap or seaborn.barplot.
- Returns:
None
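As a rough illustration of the difference between the long and wide formats mentioned above, the following sketch pivots long-form records into a matrix using only the standard library. The key names `Variable1` and `Variable2`, the helper `long_to_wide`, and the values are assumptions made for this example; only `Correlation` matches the documented default of `corr_col`.

```python
def long_to_wide(records, row_key, col_key, value_key="Correlation"):
    """Pivot long-form records into (row labels, column labels, matrix)."""
    rows = sorted({r[row_key] for r in records})
    cols = sorted({r[col_key] for r in records})
    lookup = {(r[row_key], r[col_key]): r[value_key] for r in records}
    # Missing (row, col) combinations become None
    matrix = [[lookup.get((row, col)) for col in cols] for row in rows]
    return rows, cols, matrix

# Hypothetical long-form correlation records (illustrative values only)
records = [
    {"Variable1": "poem", "Variable2": "de", "Correlation": 1.0},
    {"Variable1": "poem", "Variable2": "fr", "Correlation": -1.0},
    {"Variable1": "letter", "Variable2": "de", "Correlation": -0.58},
    {"Variable1": "letter", "Variable2": "fr", "Correlation": 0.58},
]

row_labels, col_labels, matrix = long_to_wide(records, "Variable1", "Variable2")
```

A wide matrix like this is the shape a heatmap typically expects, with one cell per pair of labels.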
- dataria.MATH.upset(df=None, endpoint_url=None, query=None, col_item='item', col_sets='set', sep=';', csv_filename='upset_data.csv', plot_upset=True, png_filename='upset_plot.png', verbose=True, **upset_kwargs)
Generate an UpSet plot and/or a DataFrame suitable for upset.js from set membership data.
Useful for visualizing overlapping categories or tag combinations. Data can be provided via SPARQL or directly as a DataFrame.
- Parameters:
df (pd.DataFrame, optional) – Input DataFrame.
endpoint_url (str, optional) – SPARQL endpoint URL.
query (str, optional) – SPARQL query.
col_item (str) – Name of the item column. Default is “item”.
col_sets (str) – Name of the set membership column. Default is “set”.
sep (str) – Separator used in the set column. Default is ‘;’.
csv_filename (str) – File path to save the transformed DataFrame.
plot_upset (bool) – Whether to generate an UpSet plot.
png_filename (str) – File path to save the UpSet plot as PNG.
verbose (bool, optional) – Whether to print insights into the dataframe.
**upset_kwargs – Additional arguments for up.UpSet().
- Returns:
The transformed DataFrame (one-hot encoded for sets).
- Return type:
pd.DataFrame
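The one-hot transformation mentioned in the return value can be sketched as follows. This is a standard-library illustration under stated assumptions, not the function's actual code: the helper `one_hot_sets` and the sample rows are hypothetical. It turns `(item, "setA;setB")` pairs into a boolean membership table and counts how many items fall into each exact combination of sets, which is the quantity an UpSet plot displays.

```python
from collections import Counter

def one_hot_sets(rows, sep=";"):
    """rows: iterable of (item, 'setA;setB') pairs.

    Returns (sorted set names, {item: {set name: bool}}).
    """
    memberships = {}
    for item, sets in rows:
        # Merge memberships if the same item appears on several rows
        memberships.setdefault(item, set()).update(
            s.strip() for s in sets.split(sep) if s.strip()
        )
    set_names = sorted(set().union(*memberships.values()))
    table = {
        item: {name: name in member for name in set_names}
        for item, member in memberships.items()
    }
    return set_names, table

# Hypothetical input: item column and ';'-separated set membership column
rows = [("a", "X;Y"), ("b", "X"), ("c", "X;Y"), ("d", "Z")]
set_names, table = one_hot_sets(rows)

# Number of items belonging to exactly each combination of sets
combo_sizes = Counter(
    tuple(name for name in set_names if flags[name]) for flags in table.values()
)
```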
- dataria.MATH.fuzzy_compare(df1, df2=None, match_by=None, grouping_var=None, treat_empty_as_match=False, match_all=True, unique_rows=False, additional_vars=None, csv_filename=None, verbose=True)
Perform a fast, memory-efficient fuzzy comparison between two DataFrames with strict AND logic across multiple columns.
If match_all=True, a group is retained only when every (column, threshold) pair in match_by meets its threshold.
- Parameters:
df1 (pd.DataFrame) – Primary DataFrame for comparison.
df2 (pd.DataFrame, optional) – Secondary DataFrame to compare against. If None, df1 is used (self-join).
match_by (list of (str, int)) – List of tuples specifying columns to compare and their fuzzy thresholds (0–100). Columns with threshold >=100 are treated as exact matches.
grouping_var (str, optional) – Column name used to define group identities between df1 and df2.
treat_empty_as_match (bool, default=False) – If True, empty strings (“”) count as perfect matches (score = 100).
match_all (bool, default=True) – If True, groups are retained only if all match_by conditions meet their respective thresholds (strict AND logic). If False, any matching condition is sufficient (OR logic).
unique_rows (bool, default=False) – For self-joins, exclude symmetric duplicates.
additional_vars (list of str, optional) – Additional columns to include in the aggregated output for context.
csv_filename (str, optional) – Path to save the aggregated results as CSV. If None, results are not saved.
verbose (bool, default=True) – If True, print progress and summary information.
- Returns:
Aggregated match statistics per (group1, group2) pair, including per-column min/avg/max fuzzy scores and optional contextual fields.
- Return type:
pd.DataFrame
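The AND/OR semantics of `match_by` and `match_all` can be sketched at the level of a single pair of rows. Several points here are assumptions rather than documented behavior: `difflib.SequenceMatcher` stands in for whatever scorer the library uses, an empty value on either side scores 0 unless `treat_empty_as_match` is set, and the real function applies its logic to groups rather than to individual rows. The helper names and sample records are hypothetical.

```python
from difflib import SequenceMatcher

def score(a, b, exact=False, treat_empty_as_match=False):
    """Similarity on a 0-100 scale (difflib is a stand-in scorer)."""
    if a == "" or b == "":
        # Assumption: without the flag, an empty value never matches
        return 100.0 if treat_empty_as_match else 0.0
    if exact:
        return 100.0 if a == b else 0.0
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def pair_matches(row1, row2, match_by, match_all=True, treat_empty_as_match=False):
    """Apply every (column, threshold) condition to one pair of rows."""
    results = []
    for column, threshold in match_by:
        exact = threshold >= 100  # thresholds >= 100 mean exact comparison
        s = score(row1[column], row2[column], exact, treat_empty_as_match)
        results.append(s >= min(threshold, 100))
    # AND logic when match_all is True, OR logic otherwise
    return all(results) if match_all else any(results)

# Hypothetical records
r1 = {"title": "Faust. Eine Tragödie", "year": "1808"}
r2 = {"title": "Faust, eine Tragödie", "year": "1808"}
r3 = {"title": "Faust. Eine Tragödie", "year": "1832"}

match_by = [("title", 85), ("year", 100)]

same_work = pair_matches(r1, r2, match_by)                          # both conditions hold
other_year = pair_matches(r1, r3, match_by)                         # year differs, AND fails
other_year_or = pair_matches(r1, r3, match_by, match_all=False)     # title alone suffices
```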
- dataria.MATH.fuzzy_compare_deprecated(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)
Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.
Supports optional grouping, label filtering, and aggregation of match statistics.
- Parameters:
df1 (pd.DataFrame, optional) – First DataFrame.
df2 (pd.DataFrame, optional) – Second DataFrame. If not provided, df1 is used.
additional_vars_df1 (list, optional) – List of columns from df1 that will be aggregated in the result.
additional_vars_df2 (list, optional) – List of columns from df2 that will be aggregated in the result.
endpoint_url (str, optional) – SPARQL endpoint (used if df1 is None).
query (str, optional) – SPARQL query (used if df1 is None).
grouping_var (str, optional) – Column name used for grouping (must exist in both DataFrames).
label_var (str or list, optional) – Label conditions. Each condition takes one of two forms: a str, (col, “identical”), or (col, 100) requires an exact match on that column; (col, int < 100) requires a fuzzy match on that column with the given threshold.
element_var (str) – Column containing the string values to compare with fuzzy matching.
threshold (int, optional) – Fuzzy matching threshold for element_var (0–100). Default: 95.
match_all (bool, optional) – If True, only include groups where all matches exceed the threshold.
unique_rows (bool, optional) – If True, suppress duplicate pairings (self-joins).
csv_filename (str, optional) – File path to save the aggregated results. If None, skip saving.
verbose (bool, optional) – If True, print debug info and head of results.
- Returns:
Aggregated match statistics between df1 and df2 (or within df1).
- Return type:
pd.DataFrame
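The accepted forms of `label_var` can be made concrete with a small normalization sketch. The helper `normalize_label_conditions` is hypothetical and not part of dataria. It maps each documented form to a `(column, threshold)` pair, where a threshold of 100 stands for an exact match. Treating a single tuple as one condition is an assumption of this sketch.

```python
def normalize_label_conditions(label_var):
    """Return a list of (column, threshold) pairs; 100 means exact match."""
    if label_var is None:
        return []
    if isinstance(label_var, (str, tuple)):
        label_var = [label_var]  # a single condition becomes a one-item list
    conditions = []
    for cond in label_var:
        if isinstance(cond, str):
            conditions.append((cond, 100))  # bare column name: exact match
            continue
        column, rule = cond
        threshold = 100 if rule == "identical" else int(rule)
        if not 0 <= threshold <= 100:
            raise ValueError(f"threshold for {column!r} must be within 0-100")
        conditions.append((column, threshold))
    return conditions

# Hypothetical column names
single = normalize_label_conditions("name")
mixed = normalize_label_conditions([("name", "identical"), ("place", 90)])
```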
- dataria.MATH.fuzzy_compare_legacy(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)
Legacy version, retained because the new function has not been tested much yet. Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.
Supports optional grouping, label filtering, and aggregation of match statistics.
- Parameters:
df1 (pd.DataFrame, optional) – First DataFrame.
df2 (pd.DataFrame, optional) – Second DataFrame. If not provided, df1 is used.
additional_vars_df1 (list, optional) – List of columns that will be aggregated in the result (using first per group).
additional_vars_df2 (list, optional) – List of columns that will be aggregated in the result (using first per group).
endpoint_url (str, optional) – SPARQL endpoint.
query (str, optional) – SPARQL query.
grouping_var (str, optional) – Column name used for grouping.
label_var (str, optional) – Optional label for filtering matches.
element_var (str) – Column containing the string values to compare.
threshold (int) – Fuzzy matching threshold (0–100). Default: 95.
match_all (bool) – If True, only include groups where all scores exceed the threshold.
unique_rows (bool) – If True, suppress duplicate pairings.
csv_filename (str) – File path to save the results.
verbose (bool, optional) – Whether to print insights into the dataframe.
- Returns:
Aggregated match statistics between df1 and df2 (or within df1).
- Return type:
pd.DataFrame