dataria.MATH
============

.. py:module:: dataria.MATH


Functions
---------

.. autoapisummary::

   dataria.MATH.correlation
   dataria.MATH.plot_correlation_heatmap
   dataria.MATH.upset
   dataria.MATH.fuzzy_compare
   dataria.MATH.fuzzy_compare_deprecated
   dataria.MATH.fuzzy_compare_legacy


Module Contents
---------------

.. py:function:: correlation(df=None, endpoint_url=None, query=None, col1=None, col2=None, sep=';', edges=0, csv_filename='correlations.csv', heatmap=True, heatmap_kwargs={}, dummies_matrix=True, save_PNG=True, verbose=True)

   Compute correlations between two columns of a DataFrame, including support for categorical (string) data.

   This function calculates Pearson correlations, handles dummy encoding for categorical data, supports SPARQL-based data retrieval, and can generate a heatmap of the results.

   :param df: The input DataFrame. If not provided, `endpoint_url` and `query` must be given so the data can be retrieved via SPARQL.
   :type df: pd.DataFrame, optional
   :param endpoint_url: SPARQL endpoint URL.
   :type endpoint_url: str, optional
   :param query: SPARQL query string.
   :type query: str, optional
   :param col1: Name of the first column to compare.
   :type col1: str
   :param col2: Name of the second column to compare.
   :type col2: str
   :param sep: Separator for multi-value string fields. Default is ';'.
   :type sep: str, optional
   :param edges: If greater than 0, return only the N highest and N lowest correlations. Default is 0.
   :type edges: int, optional
   :param csv_filename: File path to save the result as CSV.
   :type csv_filename: str, optional
   :param heatmap: Whether to generate a heatmap. Default is True.
   :type heatmap: bool, optional
   :param heatmap_kwargs: Additional kwargs passed to the heatmap function.
   :type heatmap_kwargs: dict, optional
   :param dummies_matrix: If True (default), encode the values of col1 and col2 as binary dummy variables. If False, use a bag-of-words encoding with frequency counts.
   :type dummies_matrix: bool, optional
   :param save_PNG: Whether to save the heatmap as a PNG file.
   :type save_PNG: bool, optional
   :param verbose: Whether to print insights into the dataframe.
   :type verbose: bool, optional

   :returns: A DataFrame containing correlation values and p-values.
   :rtype: pd.DataFrame
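
   The dummy-encoding idea described above can be illustrated without the library. The following is a minimal sketch, not dataria's implementation: it assumes multi-value cells are split on ``sep``, each distinct value becomes a 0/1 indicator per row, and indicator pairs are compared with Pearson's r. The helpers ``dummies`` and ``pearson`` and the sample data are hypothetical, and p-values are not computed here.

   .. code-block:: python

      from math import sqrt

      def dummies(cells, sep=";"):
          """Map each distinct value to a 0/1 indicator list, one entry per row."""
          split = [{v.strip() for v in cell.split(sep) if v.strip()} for cell in cells]
          values = sorted(set().union(*split))
          return {v: [int(v in row) for row in split] for v in values}

      def pearson(x, y):
          """Pearson correlation coefficient of two equal-length numeric lists."""
          n = len(x)
          mx, my = sum(x) / n, sum(y) / n
          cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
          sx = sqrt(sum((a - mx) ** 2 for a in x))
          sy = sqrt(sum((b - my) ** 2 for b in y))
          # A constant indicator has no variance, so r is undefined.
          return cov / (sx * sy) if sx and sy else float("nan")

      # Hypothetical sample: one row per record, multi-value cells joined by ";".
      genre = ["drama;comedy", "drama", "comedy", "drama;thriller"]
      medium = ["film", "film;tv", "tv", "film"]

      d1, d2 = dummies(genre), dummies(medium)
      # One correlation per (value of column 1, value of column 2) pair.
      pairs = {(a, b): pearson(d1[a], d2[b]) for a in d1 for b in d2}

   In this sample, ``drama`` and ``film`` occur in exactly the same rows, so their indicators are perfectly correlated, while ``comedy`` and ``tv`` are uncorrelated.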


.. py:function:: plot_correlation_heatmap(correlation_df, corr_col='Correlation', save_PNG=True, title='Correlation Heatmap', figsize=(10, 8), **heatmap_kwargs)

   Generate a heatmap plot from a correlation DataFrame.

   Supports various matrix formats (wide, long, 1x1) and visual customization.

   :param correlation_df: DataFrame containing correlation values.
   :type correlation_df: pd.DataFrame
   :param corr_col: Name of the correlation column. Default is "Correlation".
   :type corr_col: str
   :param save_PNG: Whether to save the heatmap as a PNG.
   :type save_PNG: bool
   :param title: Title of the plot.
   :type title: str
   :param figsize: Size of the figure (width, height).
   :type figsize: tuple
   :param \*\*heatmap_kwargs: Additional arguments for `seaborn.heatmap` or `seaborn.barplot`.

   :returns: None
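
   To illustrate the difference between the long and wide layouts mentioned above, here is a minimal sketch in plain Python. It is not the library's code. It assumes a long table holds one record per variable pair; the key names ``var1`` and ``var2`` and the values are hypothetical, and only ``Correlation`` comes from the documented default of ``corr_col``.

   .. code-block:: python

      def long_to_wide(rows, row_key, col_key, value_key="Correlation"):
          """Pivot long-format records into a nested dict: wide[row][col] = value."""
          wide = {}
          for rec in rows:
              wide.setdefault(rec[row_key], {})[rec[col_key]] = rec[value_key]
          return wide

      # Hypothetical long-format input: one record per variable pair.
      long_rows = [
          {"var1": "drama", "var2": "film", "Correlation": 1.0},
          {"var1": "drama", "var2": "tv", "Correlation": -0.58},
          {"var1": "comedy", "var2": "film", "Correlation": -0.58},
          {"var1": "comedy", "var2": "tv", "Correlation": 0.0},
      ]

      # Wide layout: rows are var1 values, columns are var2 values.
      wide = long_to_wide(long_rows, "var1", "var2")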


.. py:function:: upset(df=None, endpoint_url=None, query=None, col_item='item', col_sets='set', sep=';', csv_filename='upset_data.csv', plot_upset=True, png_filename='upset_plot.png', verbose=True, **upset_kwargs)

   Generate an UpSet plot and/or a DataFrame suitable for upset.js from set membership data.

   Useful for visualizing overlapping categories or tag combinations. Data can be provided via SPARQL or directly as a DataFrame.

   :param df: Input DataFrame.
   :type df: pd.DataFrame, optional
   :param endpoint_url: SPARQL endpoint URL.
   :type endpoint_url: str, optional
   :param query: SPARQL query.
   :type query: str, optional
   :param col_item: Name of the item column. Default is "item".
   :type col_item: str
   :param col_sets: Name of the set membership column. Default is "set".
   :type col_sets: str
   :param sep: Separator used in the set column. Default is ';'.
   :type sep: str
   :param csv_filename: File path to save the transformed DataFrame.
   :type csv_filename: str
   :param plot_upset: Whether to generate an UpSet plot.
   :type plot_upset: bool
   :param png_filename: File path to save the UpSet plot as PNG.
   :type png_filename: str
   :param verbose: Whether to print insights into the dataframe.
   :type verbose: bool, optional
   :param \*\*upset_kwargs: Additional arguments for `up.UpSet()`.

   :returns: The transformed DataFrame (one-hot encoded for sets).
   :rtype: pd.DataFrame
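
   The one-hot transformation can be pictured with a small standard-library sketch. This is an illustration under stated assumptions, not dataria's implementation: it assumes each record pairs an item with a ``sep``-joined list of set names and that repeated items are merged. The helper ``one_hot_sets`` and the sample records are hypothetical.

   .. code-block:: python

      from collections import Counter

      def one_hot_sets(records, sep=";"):
          """Return (set_names, table) where table[item] maps set name -> bool."""
          membership = {}
          for item, sets_cell in records:
              names = {s.strip() for s in sets_cell.split(sep) if s.strip()}
              membership.setdefault(item, set()).update(names)
          set_names = sorted(set().union(*membership.values()))
          table = {
              item: {name: name in member for name in set_names}
              for item, member in membership.items()
          }
          return set_names, table

      # Hypothetical (item, set) records; a cell may list several sets.
      records = [
          ("Q1", "painter;sculptor"),
          ("Q2", "painter"),
          ("Q3", "sculptor;architect"),
          ("Q1", "architect"),
      ]

      set_names, table = one_hot_sets(records)

      # Count the items that share each exact combination of sets,
      # which is the quantity an UpSet plot draws as bars.
      combos = Counter(
          frozenset(name for name, flag in row.items() if flag)
          for row in table.values()
      )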


.. py:function:: fuzzy_compare(df1, df2=None, match_by=None, grouping_var=None, treat_empty_as_match=False, match_all=True, unique_rows=False, additional_vars=None, csv_filename=None, verbose=True)

   Perform a fast, memory-efficient fuzzy comparison between two DataFrames
   across multiple columns.

   With `match_all=True` (the default), a group is retained only if every
   (column, threshold) pair in `match_by` meets its threshold (strict AND logic).
   With `match_all=False`, a single satisfied pair is sufficient (OR logic).

   :param df1: Primary DataFrame for comparison.
   :type df1: pd.DataFrame
   :param df2: Secondary DataFrame to compare against. If None, `df1` is used (self-join).
   :type df2: pd.DataFrame, optional
   :param match_by: List of tuples specifying columns to compare and their fuzzy thresholds
                    (0–100). Columns with threshold >=100 are treated as exact matches.
   :type match_by: list of (str, int)
   :param grouping_var: Column name used to define group identities between df1 and df2.
   :type grouping_var: str, optional
   :param treat_empty_as_match: If True, empty strings ("") count as perfect matches (score = 100).
   :type treat_empty_as_match: bool, default=False
   :param match_all: If True, groups are retained only if *all* match_by conditions meet
                     their respective thresholds (strict AND logic). If False, any matching
                     condition is sufficient (OR logic).
   :type match_all: bool, default=True
   :param unique_rows: If True, exclude symmetric duplicates in self-joins, so that only one of the pairs (A, B) and (B, A) is kept.
   :type unique_rows: bool, default=False
   :param additional_vars: Additional columns to include in the aggregated output for context.
   :type additional_vars: list of str, optional
   :param csv_filename: Path to save the aggregated results as CSV. If None, results are not saved.
   :type csv_filename: str, optional
   :param verbose: If True, print progress and summary information.
   :type verbose: bool, default=True

   :returns: Aggregated match statistics per (group1, group2) pair, including
             per-column min/avg/max fuzzy scores and optional contextual fields.
   :rtype: pd.DataFrame
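
   The threshold logic can be sketched for a single pair of rows as follows. This is a simplified illustration, not the library's code: ``difflib.SequenceMatcher`` stands in for whatever scorer dataria actually uses, the helpers ``score`` and ``row_matches`` are hypothetical, and the sketch assumes ``treat_empty_as_match`` applies when either value is empty. Grouping and aggregation are omitted.

   .. code-block:: python

      from difflib import SequenceMatcher

      def score(a, b, treat_empty_as_match=False):
          """Similarity on a 0-100 scale; difflib is a stand-in scorer."""
          if a == "" or b == "":
              # Assumption: an empty value on either side triggers this rule.
              return 100.0 if treat_empty_as_match else 0.0
          return 100.0 * SequenceMatcher(None, a, b).ratio()

      def row_matches(r1, r2, match_by, match_all=True, treat_empty_as_match=False):
          """Check every (column, threshold) pair and combine with AND or OR."""
          results = []
          for col, threshold in match_by:
              if threshold >= 100:
                  ok = r1[col] == r2[col]  # thresholds >= 100 mean exact equality
              else:
                  ok = score(r1[col], r2[col], treat_empty_as_match) >= threshold
              results.append(ok)
          return all(results) if match_all else any(results)

      # Hypothetical rows: fuzzy match on "name", exact match on "city".
      a = {"name": "Johann Sebastian Bach", "city": "Leipzig"}
      b = {"name": "Johan Sebastian Bach", "city": "Leipzig"}
      c = {"name": "Johann Sebastian Bach", "city": "Eisenach"}
      match_by = [("name", 90), ("city", 100)]

      both_ok = row_matches(a, b, match_by)                    # AND: both conditions hold
      city_fails = row_matches(a, c, match_by)                 # AND: city differs
      any_ok = row_matches(a, c, match_by, match_all=False)    # OR: name alone suffices

   Under AND logic the differing ``city`` rejects the pair ``(a, c)``; under OR logic the matching ``name`` is enough to keep it.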


.. py:function:: fuzzy_compare_deprecated(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)

   Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.

   Supports optional grouping, label filtering, and aggregation of match statistics.

   :param df1: First DataFrame.
   :type df1: pd.DataFrame, optional
   :param df2: Second DataFrame. If not provided, df1 is used.
   :type df2: pd.DataFrame, optional
   :param additional_vars_df1: List of columns from df1 that will be aggregated in the result.
   :type additional_vars_df1: list, optional
   :param additional_vars_df2: List of columns from df2 that will be aggregated in the result.
   :type additional_vars_df2: list, optional
   :param endpoint_url: SPARQL endpoint (used if df1 is None).
   :type endpoint_url: str, optional
   :param query: SPARQL query (used if df1 is None).
   :type query: str, optional
   :param grouping_var: Column name used for grouping (must exist in both DataFrames).
   :type grouping_var: str, optional
   :param label_var: Label conditions. Each condition takes one of these forms:

                     - str, (col, "identical"), or (col, 100): require an exact match on this column.
                     - (col, int < 100): require a fuzzy match on this column with the given threshold.

   :type label_var: str or list, optional
   :param element_var: Column containing the string values to compare with fuzzy matching.
   :type element_var: str
   :param threshold: Fuzzy matching threshold for element_var (0–100). Default: 95.
   :type threshold: int, optional
   :param match_all: If True, only include groups where all matches exceed the threshold.
   :type match_all: bool, optional
   :param unique_rows: If True, suppress duplicate pairings (self-joins).
   :type unique_rows: bool, optional
   :param csv_filename: File path to save the aggregated results. If None, skip saving.
   :type csv_filename: str, optional
   :param verbose: If True, print debug info and head of results.
   :type verbose: bool, optional

   :returns: Aggregated match statistics between df1 and df2 (or within df1).
   :rtype: pd.DataFrame
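
   The accepted ``label_var`` forms can all be reduced to (column, threshold) pairs. The sketch below shows one way to do that; ``normalize_label_var`` is a hypothetical helper written for illustration, and it is an assumption that the function handles its input in a comparable way.

   .. code-block:: python

      def normalize_label_var(label_var):
          """Turn the documented label_var forms into a list of (column, threshold)."""
          if label_var is None:
              return []
          items = label_var if isinstance(label_var, list) else [label_var]
          conditions = []
          for item in items:
              if isinstance(item, str):
                  conditions.append((item, 100))  # bare column name: exact match
              else:
                  col, level = item
                  # "identical" is a synonym for a threshold of 100.
                  threshold = 100 if level == "identical" else int(level)
                  conditions.append((col, threshold))
          return conditions

      # Hypothetical column names.
      exact = normalize_label_var("title")
      mixed = normalize_label_var([("title", "identical"), ("author", 85)])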


.. py:function:: fuzzy_compare_legacy(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)

   Legacy implementation, retained because the newer function has not yet been tested extensively.

   Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.

   Supports optional grouping, label filtering, and aggregation of match statistics.

   :param df1: First DataFrame.
   :type df1: pd.DataFrame, optional
   :param df2: Second DataFrame. If not provided, df1 is used.
   :type df2: pd.DataFrame, optional
   :param additional_vars_df1: List of columns that will be aggregated in the result (using first per group).
   :type additional_vars_df1: list, optional
   :param additional_vars_df2: List of columns that will be aggregated in the result (using first per group).
   :type additional_vars_df2: list, optional
   :param endpoint_url: SPARQL endpoint.
   :type endpoint_url: str, optional
   :param query: SPARQL query.
   :type query: str, optional
   :param grouping_var: Column name used for grouping.
   :type grouping_var: str, optional
   :param label_var: Optional label for filtering matches.
   :type label_var: str, optional
   :param element_var: Column containing the string values to compare.
   :type element_var: str
   :param threshold: Fuzzy matching threshold (0–100). Default: 95.
   :type threshold: int
   :param match_all: If True, only include groups where all scores exceed the threshold.
   :type match_all: bool
   :param unique_rows: If True, suppress duplicate pairings.
   :type unique_rows: bool
   :param csv_filename: File path to save the results.
   :type csv_filename: str
   :param verbose: Whether to print insights into the dataframe.
   :type verbose: bool, optional

   :returns: Aggregated match statistics between df1 and df2 (or within df1).
   :rtype: pd.DataFrame