dataria.MATH
============

.. py:module:: dataria.MATH


Functions
---------

.. autoapisummary::

   dataria.MATH.correlation
   dataria.MATH.plot_correlation_heatmap
   dataria.MATH.upset
   dataria.MATH.fuzzy_compare
   dataria.MATH.fuzzy_compare_deprecated
   dataria.MATH.fuzzy_compare_legacy


Module Contents
---------------

.. py:function:: correlation(df=None, endpoint_url=None, query=None, col1=None, col2=None, sep=';', edges=0, csv_filename='correlations.csv', heatmap=True, heatmap_kwargs={}, dummies_matrix=True, save_PNG=True, verbose=True)

   Compute correlations between two columns of a DataFrame, including support for categorical (string) data.

   This function calculates Pearson correlations, handles dummy encoding for categorical data,
   supports SPARQL-based data retrieval, and can generate a heatmap of the results.

   :param df: The input DataFrame. If not provided, SPARQL must be used.
   :type df: pd.DataFrame, optional
   :param endpoint_url: SPARQL endpoint URL.
   :type endpoint_url: str, optional
   :param query: SPARQL query string.
   :type query: str, optional
   :param col1: Name of the first column to compare.
   :type col1: str
   :param col2: Name of the second column to compare.
   :type col2: str
   :param sep: Separator for multi-value string fields. Default is ';'.
   :type sep: str, optional
   :param edges: If > 0, return only the top and bottom N correlations.
   :type edges: int, optional
   :param csv_filename: File path to save the result as CSV.
   :type csv_filename: str, optional
   :param heatmap: Whether to generate a heatmap. Default is True.
   :type heatmap: bool, optional
   :param heatmap_kwargs: Additional kwargs passed to the heatmap function.
   :type heatmap_kwargs: dict, optional
   :param dummies_matrix: If True (default), treat col1 and col2 as binary dummy variables.
                          If False, use bag-of-words frequency counts instead.
   :type dummies_matrix: bool, optional
   :param save_PNG: Whether to save the heatmap as a PNG file.
   :type save_PNG: bool, optional
   :param verbose: Whether to print insights into the dataframe.
   :type verbose: bool, optional

   :returns: A DataFrame containing correlation values and p-values.
   :rtype: pd.DataFrame


.. py:function:: plot_correlation_heatmap(correlation_df, corr_col='Correlation', save_PNG=True, title='Correlation Heatmap', figsize=(10, 8), **heatmap_kwargs)

   Generate a heatmap plot from a correlation DataFrame.

   Supports various matrix formats (wide, long, 1x1) and visual customization.

   :param correlation_df: DataFrame containing correlation values.
   :type correlation_df: pd.DataFrame
   :param corr_col: Name of the correlation column. Default is "Correlation".
   :type corr_col: str
   :param save_PNG: Whether to save the heatmap as a PNG.
   :type save_PNG: bool
   :param title: Title of the plot.
   :type title: str
   :param figsize: Size of the figure (width, height).
   :type figsize: tuple
   :param \*\*heatmap_kwargs: Additional arguments for `seaborn.heatmap` or `seaborn.barplot`.

   :returns: None


.. py:function:: upset(df=None, endpoint_url=None, query=None, col_item='item', col_sets='set', sep=';', csv_filename='upset_data.csv', plot_upset=True, png_filename='upset_plot.png', verbose=True, **upset_kwargs)

   Generate an UpSet plot and/or a DataFrame suitable for upset.js from set membership data.

   Useful for visualizing overlapping categories or tag combinations.
   Data can be provided via SPARQL or directly as a DataFrame.

   :param df: Input DataFrame.
   :type df: pd.DataFrame, optional
   :param endpoint_url: SPARQL endpoint URL.
   :type endpoint_url: str, optional
   :param query: SPARQL query.
   :type query: str, optional
   :param col_item: Name of the item column. Default is "item".
   :type col_item: str
   :param col_sets: Name of the set membership column. Default is "set".
   :type col_sets: str
   :param sep: Separator used in the set column. Default is ';'.
   :type sep: str
   :param csv_filename: File path to save the transformed DataFrame.
   :type csv_filename: str
   :param plot_upset: Whether to generate an UpSet plot.
   :type plot_upset: bool
   :param png_filename: File path to save the UpSet plot as PNG.
   :type png_filename: str
   :param verbose: Whether to print insights into the dataframe.
   :type verbose: bool, optional
   :param \*\*upset_kwargs: Additional arguments for `up.UpSet()`.

   :returns: The transformed DataFrame (one-hot encoded for sets).
   :rtype: pd.DataFrame


.. py:function:: fuzzy_compare(df1, df2=None, match_by=None, grouping_var=None, treat_empty_as_match=False, match_all=True, unique_rows=False, additional_vars=None, csv_filename=None, verbose=True)

   Perform a fast, memory-efficient fuzzy comparison between two DataFrames
   with strict AND logic across multiple columns.

   Each (column, threshold) pair in `match_by` must satisfy its threshold for
   a group to be retained if `match_all=True`.

   :param df1: Primary DataFrame for comparison.
   :type df1: pd.DataFrame
   :param df2: Secondary DataFrame to compare against. If None, `df1` is used (self-join).
   :type df2: pd.DataFrame, optional
   :param match_by: List of tuples specifying columns to compare and their fuzzy thresholds (0–100).
                    Columns with threshold >=100 are treated as exact matches.
   :type match_by: list of (str, int)
   :param grouping_var: Column name used to define group identities between df1 and df2.
   :type grouping_var: str, optional
   :param treat_empty_as_match: If True, empty strings ("") count as perfect matches (score = 100).
   :type treat_empty_as_match: bool, default=False
   :param match_all: If True, groups are retained only if *all* match_by conditions meet their
                     respective thresholds (strict AND logic). If False, any matching condition
                     is sufficient (OR logic).
   :type match_all: bool, default=True
   :param unique_rows: For self-joins, exclude symmetric duplicates.
   :type unique_rows: bool, default=False
   :param additional_vars: Additional columns to include in the aggregated output for context.
   :type additional_vars: list of str, optional
   :param csv_filename: Path to save the aggregated results as CSV.
      If None, results are not saved.
   :type csv_filename: str, optional
   :param verbose: If True, print progress and summary information.
   :type verbose: bool, default=True

   :returns: Aggregated match statistics per (group1, group2) pair, including
             per-column min/avg/max fuzzy scores and optional contextual fields.
   :rtype: pd.DataFrame


.. py:function:: fuzzy_compare_deprecated(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)

   Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.

   Supports optional grouping, label filtering, and aggregation of match statistics.

   :param df1: First DataFrame.
   :type df1: pd.DataFrame, optional
   :param df2: Second DataFrame. If not provided, df1 is used.
   :type df2: pd.DataFrame, optional
   :param additional_vars_df1: List of columns from df1 that will be aggregated in the result.
   :type additional_vars_df1: list, optional
   :param additional_vars_df2: List of columns from df2 that will be aggregated in the result.
   :type additional_vars_df2: list, optional
   :param endpoint_url: SPARQL endpoint (used if df1 is None).
   :type endpoint_url: str, optional
   :param query: SPARQL query (used if df1 is None).
   :type query: str, optional
   :param grouping_var: Column name used for grouping (must exist in both DataFrames).
   :type grouping_var: str, optional
   :param label_var: Label conditions, one of:

      - str or (col, "identical") or (col, 100): require exact match on this column
      - (col, int < 100): require fuzzy match on this column with given threshold

   :type label_var: str or list, optional
   :param element_var: Column containing the string values to compare with fuzzy matching.
   :type element_var: str
   :param threshold: Fuzzy matching threshold for element_var (0–100). Default: 95.
   :type threshold: int, optional
   :param match_all: If True, only include groups where all matches exceed the threshold.
   :type match_all: bool, optional
   :param unique_rows: If True, suppress duplicate pairings (self-joins).
   :type unique_rows: bool, optional
   :param csv_filename: File path to save the aggregated results. If None, skip saving.
   :type csv_filename: str, optional
   :param verbose: If True, print debug info and head of results.
   :type verbose: bool, optional

   :returns: Aggregated match statistics between df1 and df2 (or within df1).
   :rtype: pd.DataFrame


.. py:function:: fuzzy_compare_legacy(df1=None, df2=None, additional_vars_df1=None, additional_vars_df2=None, endpoint_url=None, query=None, grouping_var=None, label_var=None, element_var=None, threshold=95, match_all=False, unique_rows=False, csv_filename='comparison.csv', verbose=True)

   Retained because the newer function has not yet been tested much.

   Fuzzy string matching between two DataFrames (or SPARQL query results) based on a common element column.

   Supports optional grouping, label filtering, and aggregation of match statistics.

   :param df1: First DataFrame.
   :type df1: pd.DataFrame, optional
   :param df2: Second DataFrame. If not provided, df1 is used.
   :type df2: pd.DataFrame, optional
   :param additional_vars_df1: List of columns that will be aggregated in the result (using first per group).
   :type additional_vars_df1: list, optional
   :param additional_vars_df2: List of columns that will be aggregated in the result (using first per group).
   :type additional_vars_df2: list, optional
   :param endpoint_url: SPARQL endpoint.
   :type endpoint_url: str, optional
   :param query: SPARQL query.
   :type query: str, optional
   :param grouping_var: Column name used for grouping.
   :type grouping_var: str, optional
   :param label_var: Optional label for filtering matches.
   :type label_var: str, optional
   :param element_var: Column containing the string values to compare.
   :type element_var: str
   :param threshold: Fuzzy matching threshold (0–100). Default: 95.
   :type threshold: int
   :param match_all: If True, only include groups where all scores exceed the threshold.
   :type match_all: bool
   :param unique_rows: If True, suppress duplicate pairings.
   :type unique_rows: bool
   :param csv_filename: File path to save the results.
   :type csv_filename: str
   :param verbose: Whether to print insights into the dataframe.
   :type verbose: bool, optional

   :returns: Aggregated match statistics between df1 and df2 (or within df1).
   :rtype: pd.DataFrame
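

As an illustration of the ``match_by`` and ``match_all`` semantics documented for ``fuzzy_compare``, the following is a minimal sketch and not dataria's implementation. It assumes ``difflib.SequenceMatcher`` as a stand-in scorer (the scorer dataria actually uses is not stated in this reference), works on plain dicts rather than DataFrames, and evaluates a single row pair rather than a whole group. The helper names ``fuzzy_score`` and ``pair_matches`` and the sample rows are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Return a 0-100 similarity score (difflib stand-in, an assumption)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def pair_matches(row1: dict, row2: dict, match_by: list, match_all: bool = True) -> bool:
    """Check one row pair against a list of (column, threshold) conditions."""
    results = []
    for column, threshold in match_by:
        left, right = row1[column], row2[column]
        if threshold >= 100:
            # Thresholds of 100 or more are treated as exact matches.
            ok = left == right
        else:
            ok = fuzzy_score(left, right) >= threshold
        results.append(ok)
    # match_all=True: every condition must hold (AND); otherwise any one suffices (OR).
    return all(results) if match_all else any(results)


# Hypothetical sample rows for illustration only.
row1 = {"title": "Symphony No. 5", "composer": "Beethoven"}
row2 = {"title": "Symphony No 5", "composer": "Beethoven"}
row3 = {"title": "Symphony No. 5", "composer": "Mahler"}

match_by = [("title", 90), ("composer", 100)]

print(pair_matches(row1, row2, match_by))                   # True: both conditions hold
print(pair_matches(row1, row3, match_by))                   # False: composer differs
print(pair_matches(row1, row3, match_by, match_all=False))  # True: title alone suffices
```

With ``match_all=True`` a single failing column rejects the pair, which is why the third row is only accepted under OR logic.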