Unique key detector tool #6900

simonaubertbd · 2024-09-21T07:05:24Z

What's your use case?
More often than not, you have to deal with a dataset without knowing what makes a row unique. This can lead to misinterpreting the data, Cartesian products at joins, and other funny stuff.

What's your proposed solution?
This is a feature I haven't seen in any data preparation/ETL tool. The core feature is to detect the unique key in a dataframe.

How do I imagine it?

Input: one dataframe, the ability to select fields or check all, and the ability to specify a maximum number of fields per combination (empty or 0 = no max).
Algo: it compares the distinct count of every combination of fields against the row count.

Result: one row per field combination that works. If there are no results: "No field combination is unique. Check for duplicates or the need for aggregation upstream."
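The core check can be sketched in a few lines of pandas. This is only an illustration of the test for a single combination, not the proposed tool itself; the helper name `is_unique_key` and the tiny dataframe are made up for the example.

```python
import pandas as pd


def is_unique_key(df: pd.DataFrame, fields: list) -> bool:
    """Return True if no two rows share the same values in `fields`.

    Hypothetical helper for illustration; it is equivalent to comparing
    the distinct count of the combination with the number of rows.
    """
    return not df.duplicated(subset=fields).any()


df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "x"]})
print(is_unique_key(df, ["a"]))       # False: the value 1 appears twice in a
print(is_unique_key(df, ["a", "b"]))  # True: every (a, b) pair is distinct
```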

Example:

order_id	line_id	amount	customer	site
1	1	100	A	U_250
1	2	12	A	U_250
1	3	45	A	U_250
2	1	75	A	U_250
2	2	12	A	U_250
3	1	15	B	U_250
4	1	45	B	U_251

The user selects every field except amount (they know that amount would make no sense in a key).

The algo will test the following candidate keys:
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields

against the number of rows (7),
and gives something like this:

choice	number of fields	field combination
very good	2	order_id, line_id
average	3	order_id, line_id, customer
average	3	order_id, line_id, site
bad	4	order_id, line_id, site, customer
…	…	….
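The sample table can be checked by hand with a short script. This is a sketch assuming pandas; it prints the distinct count of each candidate combination next to the row count, and the variable names (`orders`, `candidates`, `unique_combos`) are only for illustration.

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the example table.
orders = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 3, 4],
    "line_id": [1, 2, 3, 1, 2, 1, 1],
    "amount": [100, 12, 45, 75, 12, 15, 45],
    "customer": ["A", "A", "A", "A", "A", "B", "B"],
    "site": ["U_250"] * 6 + ["U_251"],
})

candidates = ["order_id", "line_id", "customer", "site"]  # amount excluded
n_rows = len(orders)

unique_combos = []
for size in range(1, len(candidates) + 1):
    for combo in combinations(candidates, size):
        # Distinct count of this combination of fields.
        n_distinct = len(orders.drop_duplicates(subset=list(combo)))
        if n_distinct == n_rows:
            unique_combos.append(combo)
        print(combo, n_distinct, "of", n_rows)

print(unique_combos[0])  # ('order_id', 'line_id'), the smallest unique key
```

On this data, only order_id, line_id and the combinations that contain both of them reach 7 distinct values, which matches the four rows listed in the result table.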

Are there any alternative solutions?
N/A

Best regards,

Simon

simonaubertbd · 2024-11-22T20:15:39Z

Hello @BlazZupan, here is an example of the code:

import pandas as pd
from itertools import combinations

def detect_unique_key(df, fields, max_combination=0):
    
    """Detect unique keys in a DataFrame based on field combinations.

    Parameters:
        df (pd.DataFrame): The DataFrame to analyze.
        fields (list): List of fields to test for uniqueness.
        max_combination (int): Maximum number of fields to combine (0 = no limit).

    Returns:
        pd.DataFrame: A DataFrame with columns ['choice', 'number_of_fields', 'field_combination'].
    """
    
    total_rows = len(df)
    max_combination = max_combination if max_combination > 0 else len(fields)

    results = []
    found_minimum = None

    for r in range(1, max_combination + 1):
        for combo in combinations(fields, r):
            combo = list(combo)
            unique_count = len(df.drop_duplicates(subset=combo))

            if unique_count == total_rows:
                if found_minimum is None:
                    found_minimum = r

                if r == found_minimum:
                    choice = "very good"
                elif r == found_minimum + 1:
                    choice = "average"
                else:
                    choice = "bad"

                results.append({
                    "choice": choice,
                    "number_of_fields": r,
                    "field_combination": combo
                })

    if not results:
        print("No field combination is unique. Check for duplicates or aggregation upstream.")
        return pd.DataFrame(columns=["choice", "number_of_fields", "field_combination"])

    return pd.DataFrame(results)

# Example usage
data = {
    'A': [1, 2, 3, 4],
    'B': [1, 2, 2, 4],
    'C': [5, 6, 7, 8]
}

df = pd.DataFrame(data)
fields = ['A', 'B', 'C']
result = detect_unique_key(df, fields)
print(result)
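As a side note on why the `max_combination` limit matters: the number of combinations tested grows quickly with the number of selected fields. The sketch below only counts them; `count_candidate_combinations` is a hypothetical helper, not part of the code above.

```python
from math import comb


def count_candidate_combinations(n_fields: int, max_size: int = 0) -> int:
    """Number of field combinations tested for `n_fields` fields.

    `max_size` <= 0 means no limit, mirroring `max_combination`.
    Hypothetical helper, for illustration only.
    """
    limit = max_size if max_size > 0 else n_fields
    limit = min(limit, n_fields)
    return sum(comb(n_fields, r) for r in range(1, limit + 1))


print(count_candidate_combinations(20))     # 1048575, i.e. 2**20 - 1
print(count_candidate_combinations(20, 3))  # 1350, i.e. 20 + 190 + 1140
```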

janezd · 2024-11-23T08:46:50Z

Please see #6932 (comment).

janezd assigned BlazZupan Oct 4, 2024

janezd closed this as completed Nov 23, 2024
