Unique key detector tool #6900
Comments
Hello @BlazZupan, here is an example of the code:

import pandas as pd
from itertools import combinations


def detect_unique_key(df, fields, max_combination=0):
    """Detect unique keys in a DataFrame based on field combinations.

    Parameters:
        df (pd.DataFrame): The DataFrame to analyze.
        fields (list): List of fields to test for uniqueness.
        max_combination (int): Maximum number of fields to combine (0 = no limit).

    Returns:
        pd.DataFrame: A DataFrame with columns
        ['choice', 'number_of_fields', 'field_combination'].
    """
    total_rows = len(df)
    max_combination = max_combination if max_combination > 0 else len(fields)
    results = []
    found_minimum = None  # size of the smallest unique combination found so far

    for r in range(1, max_combination + 1):
        for combo in combinations(fields, r):
            combo = list(combo)
            # A combination is a key when its distinct count equals the row count.
            unique_count = len(df.drop_duplicates(subset=combo))
            if unique_count == total_rows:
                if found_minimum is None:
                    found_minimum = r
                # Grade the key by how many fields it needs beyond the minimum.
                if r == found_minimum:
                    choice = "very good"
                elif r == found_minimum + 1:
                    choice = "average"
                else:
                    choice = "bad"
                results.append({
                    "choice": choice,
                    "number_of_fields": r,
                    "field_combination": combo
                })

    if not results:
        print("No field combination is unique. Check for duplicates or aggregation upstream.")
        return pd.DataFrame(columns=["choice", "number_of_fields", "field_combination"])
    return pd.DataFrame(results)


# Example usage
data = {
    'A': [1, 2, 3, 4],
    'B': [1, 2, 2, 4],
    'C': [5, 6, 7, 8]
}
df = pd.DataFrame(data)
fields = ['A', 'B', 'C']
result = detect_unique_key(df, fields)
print(result)
Please see #6932 (comment).
What's your use case?
More often than not, you have to deal with a dataset without knowing what makes a row unique. This can lead to misinterpreting the data, Cartesian products on joins, and other funny stuff.
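To illustrate the join problem, here is a minimal sketch, assuming pandas and two hypothetical tables (`orders`, `customers`) in which `customer_id` looks like a key but is repeated on both sides. The table and column names are made up for illustration only.

```python
import pandas as pd

# Hypothetical tables: customer_id is NOT unique in `customers`.
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [1, 1, 2], "segment": ["a", "b", "c"]})

joined = orders.merge(customers, on="customer_id", how="inner")

# Each order of customer 1 matches both customer rows, so rows multiply.
print(len(orders), len(joined))                      # 3 5
print(orders["amount"].sum(), joined["amount"].sum())  # 60 90
```

The inflated total (90 instead of 60) is the kind of silent misinterpretation a unique key check would catch before the join.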
What's your proposed solution?
This is a feature I haven't seen in any data preparation/ETL tool. The core feature is to detect the unique key in a dataframe.
How do I imagine it?
Input: one dataframe, the ability to select fields or check all of them, and the ability to specify a maximum number of fields per combination (empty or 0 = no max).
Algorithm: it compares the distinct count of every combination of fields against the row count.
Result: one row per field combination that works. If there is no result: "No field combination is unique. Check for duplicates or the need for aggregation upstream."
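The check described above could be sketched as follows, assuming pandas. This is a variation rather than the proposal itself: it skips any combination that contains an already found key, so only minimal keys are reported instead of every combination that works. The function names and the `sales` table are hypothetical.

```python
from itertools import combinations

import pandas as pd


def is_unique_key(df, cols):
    """True when no two rows share the same values across `cols`."""
    return not df.duplicated(subset=list(cols)).any()


def minimal_unique_keys(df, fields, max_combination=0):
    """Return unique field combinations, skipping supersets of keys already found."""
    limit = max_combination if max_combination > 0 else len(fields)
    found = []
    for r in range(1, limit + 1):
        for combo in combinations(fields, r):
            if any(set(key) <= set(combo) for key in found):
                continue  # a smaller key already makes these rows unique
            if is_unique_key(df, combo):
                found.append(combo)
    return found


# Hypothetical data: no single field is unique, but (region, month) is.
sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "month": [1, 2, 1, 2],
    "product": ["x", "x", "x", "y"],
})
print(minimal_unique_keys(sales, ["region", "month", "product"]))
# [('region', 'month')]
```

Skipping supersets reduces the number of duplicate checks, at the cost of not listing the larger combinations that would also work.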
Example:
The user selects every field except Amount (they know that Amount would make no sense in a key).
The algorithm will test the following keys:
- each separate field
- each combination of two fields
- each combination of three fields
- each combination of four fields
looking for those whose distinct count matches the number of rows (7).
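To get a feel for how many keys this means testing, here is a small sketch using only the standard library. It assumes exactly four fields were selected, and the field names are placeholders since the example's columns are not listed here.

```python
from itertools import combinations
from math import comb

# Placeholder field names, assuming four selected fields.
fields = ["f1", "f2", "f3", "f4"]

candidates = [c for r in range(1, len(fields) + 1) for c in combinations(fields, r)]
print(len(candidates))  # 15 = 4 + 6 + 4 + 1

# Without a cap, n fields give 2**n - 1 candidate keys.
for n in (4, 10, 20):
    print(n, sum(comb(n, r) for r in range(1, n + 1)))
# 4 15
# 10 1023
# 20 1048575

# With a maximum of 2 fields per combination, 20 fields give only 210 candidates.
print(sum(comb(20, r) for r in (1, 2)))  # 210
```

This growth is why the maximum number of fields per combination is worth exposing as an option.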
And it gives something like this:
Are there any alternative solutions?
N/A
Best regards,
Simon