Field grouping #6932

simonaubertbd · 2024-11-15T20:56:11Z

What's your use case?
Hello,

A lot of time, when you have a dataset, you want to know if there is a group of fields that works together. Knowing this helps you normalize (de-join) your data model for dataviz, fix performance issues, or simplify your analysis.

Example

order_id	item_id	label	model_id	length	color	amount
1	1	A	10	15	Blue	101
2	1	A	10	15	Blue	101
3	2	B	10	15	Blue	101
4	2	B	10	15	Blue	101
5	2	B	10	15	Blue	101
6	3	C	20	25	Red	101
7	3	C	20	25	Red	101
8	3	C	20	25	Red	101
9	4	D	20	25	Red	101
10	4	D	20	25	Red	101
11	4	D	20	25	Red	101

Here, we could split the table into three:
-order

order_id	item_id	model_id	amount
1	1	10	101,2
2	1	10	103
3	2	10	104,8
4	2	10	106,6
5	2	10	108,4
6	3	20	110,2
7	3	20	112
8	3	20	113,8
9	4	20	115,6
10	4	20	117,4
11	4	20	119,2

-model

model_id	length	color
10	15	Blue
20	25	Red

-item

item_id	label
1	A
2	B
3	C
4	D
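
A minimal sketch of this split, assuming the example is held as a plain Python dict of columns rather than an Orange table. `TABLE` and `project` are hypothetical names used only for illustration, and `amount` uses the values from the first table. Each sub-table is simply the distinct projection of the original onto its fields.

```python
# Column-oriented copy of the example table (assumption: plain Python lists).
TABLE = {
    "order_id": list(range(1, 12)),
    "item_id": [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "label": list("AABBBCCCDDD"),
    "model_id": [10] * 5 + [20] * 6,
    "length": [15] * 5 + [25] * 6,
    "color": ["Blue"] * 5 + ["Red"] * 6,
    "amount": [101] * 11,
}


def project(table, keep):
    """Return the distinct rows of `table` restricted to `keep`, in first-seen order."""
    rows = zip(*(table[name] for name in keep))
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(rows))


order = project(TABLE, ["order_id", "item_id", "model_id", "amount"])
model = project(TABLE, ["model_id", "length", "color"])
item = project(TABLE, ["item_id", "label"])

print(model)  # [(10, 15, 'Blue'), (20, 25, 'Red')]
print(item)   # [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D')]
```

The order table keeps one row per order, so it stays at 11 rows.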

What's your proposed solution?
The tool would take:
-input: a dataframe
-configuration: the ability to select fields
-output: a table recapping the groups

field group	field	remaining fields
1	item_id	False
1	label	False
2	model_id	False
2	length	False
2	color	False
3	order_id	True
3	link to group 1	True
3	link to group 2	True
3	amount	True

Very important: the non-selected fields (here, amount) still appear in the result, but all in the "remaining" group.

Algorithm steps:
1/ Pre-groups: compute the distinct count of each field. Goal: optimize the algorithm by avoiding a check of every pair.
Fields whose distinct count equals the number of rows are automatically excluded and sent to the remaining group.
Fields that have the same distinct count are put in the same pre-group.
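
A minimal sketch of step 1, assuming the data is a plain Python dict of columns. `TABLE` and `pre_groups` are hypothetical names, not part of Orange.

```python
from collections import defaultdict

# Column-oriented copy of the example table (assumption: plain Python lists).
TABLE = {
    "order_id": list(range(1, 12)),
    "item_id": [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "label": list("AABBBCCCDDD"),
    "model_id": [10] * 5 + [20] * 6,
    "length": [15] * 5 + [25] * 6,
    "color": ["Blue"] * 5 + ["Red"] * 6,
    "amount": [101] * 11,
}


def pre_groups(table, selected):
    """Bucket the selected fields by distinct count.

    Fields that are unique on every row cannot share a group with anything
    smaller than the table itself, so they go straight to `remaining`.
    """
    n_rows = len(next(iter(table.values())))
    buckets = defaultdict(list)
    remaining = []
    for name in selected:
        n_distinct = len(set(table[name]))
        if n_distinct == n_rows:
            remaining.append(name)
        else:
            buckets[n_distinct].append(name)
    return dict(buckets), remaining


selected = ["order_id", "item_id", "label", "model_id", "length", "color"]
buckets, remaining = pre_groups(TABLE, selected)
print(buckets)    # {4: ['item_id', 'label'], 2: ['model_id', 'length', 'color']}
print(remaining)  # ['order_id']
```

Only fields inside the same bucket need to be compared pairwise afterwards, which is the point of the optimization.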

2/ For each pre-group, for each pair of fields,
take the distinct values of the pair,
like here:

item_id	label
1	A
2	B
3	C
4	D

If, in this table, the distinct count of each field equals the number of rows, it's a "pair-group".

Here, for the model, you will have:
-model_id,length
-model_id,color
-length,color
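
A minimal sketch of the pair test in step 2, assuming each field is a plain Python list. `is_pair_group` is a hypothetical helper name.

```python
def is_pair_group(left, right):
    """True when the two fields map one-to-one onto each other.

    The distinct (left, right) pairs form the small table from step 2;
    the fields are a pair-group when each one is unique within that table.
    """
    pairs = set(zip(left, right))
    return len(set(left)) == len(pairs) == len(set(right))


# Columns copied from the example table.
item_id = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
label = list("AABBBCCCDDD")
model_id = [10] * 5 + [20] * 6
color = ["Blue"] * 5 + ["Red"] * 6

print(is_pair_group(item_id, label))     # True
print(is_pair_group(model_id, color))    # True
print(is_pair_group(item_id, model_id))  # False: several items share one model
```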

3/ Since a field can belong to only one group, model_id, length and color form the first (or second) group, and item_id and label form the other.

If a field does not belong to any group, it goes to the "remaining" group at the end.

In the remaining group, you can add a link to each of the other groups, since you don't know which field is the key.

field group	field	remaining fields
1	item_id	False
1	label	False
2	model_id	False
2	length	False
2	color	False
3	order_id	True
3	link to group 1	True
3	link to group 2	True
3	amount	True
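
The three steps can be sketched end to end as follows, assuming a plain Python dict of columns. `TABLE` and `field_groups` are hypothetical names, and the ordering inside the remaining group (ungrouped selected fields, then links, then non-selected fields) is an assumption chosen to match the table above.

```python
# Column-oriented copy of the example table (assumption: plain Python lists).
TABLE = {
    "order_id": list(range(1, 12)),
    "item_id": [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "label": list("AABBBCCCDDD"),
    "model_id": [10] * 5 + [20] * 6,
    "length": [15] * 5 + [25] * 6,
    "color": ["Blue"] * 5 + ["Red"] * 6,
    "amount": [101] * 11,
}


def field_groups(table, selected):
    """Return the recap as (field group, field, remaining) tuples."""
    n_rows = len(next(iter(table.values())))

    # Step 1: bucket selected fields by distinct count; skip row-unique fields.
    buckets = {}
    for name in selected:
        count = len(set(table[name]))
        if count < n_rows:
            buckets.setdefault(count, []).append(name)

    # Steps 2-3: inside a bucket, two fields are one-to-one when their distinct
    # pairs number exactly `count`. One-to-one is transitive, so comparing
    # against the first member of a group is enough.
    groups = []
    for count, names in buckets.items():
        found = []
        for name in names:
            for group in found:
                if len(set(zip(table[group[0]], table[name]))) == count:
                    group.append(name)
                    break
            else:
                found.append([name])
        groups.extend(g for g in found if len(g) > 1)

    grouped = {name for g in groups for name in g}
    recap = [(k, name, False) for k, g in enumerate(groups, 1) for name in g]

    # Everything else lands in the last, "remaining" group.
    last = len(groups) + 1
    leftovers = [n for n in selected if n not in grouped]
    links = [f"link to group {k}" for k in range(1, last)]
    unselected = [n for n in table if n not in selected]
    recap += [(last, name, True) for name in leftovers + links + unselected]
    return recap


selected = ["order_id", "item_id", "label", "model_id", "length", "color"]
recap = field_groups(TABLE, selected)
for row in recap:
    print(*row, sep="\t")
```

On the example data this prints the same nine rows as the table above.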

Best regards,

Simon

PS: I have in mind an evolution with links between non-remaining tables (here, for example, the model could optionally be linked to the item).

Are there any alternative solutions?

N/A

janezd · 2024-11-23T08:35:14Z

Dear @simonaubertbd, I couldn't help but notice that you've been opening dozens of issues all over GitHub for months and months. Just in August, you created 52, almost two per day. Were all of them well thought out? Well investigated? Worth the time that developers spent answering them?

Developers are happy to get user feedback, to get bug reports, to understand how their software is being used and how they could make it better. We're also happy to help new users who lack understanding.

But there's another side to it. When we find ourselves in a new environment - a new programming language, framework, software - we need to understand what it does, how it does it and, crucially, what it doesn't. Without sufficient experience, we tend to use a tool in the wrong way (e.g. a React programmer trying to use it as a jQuery replacement) or we tend to use a tool that we like for purposes that it wasn't designed for (e.g. using Orange as a replacement for Excel or SQL). Or we simply propose features that are already there.

This issue is a good example. The first sentence goes: "A lot of time, when you have a dataset, you want to know if there is a group of fields that works together." In data mining this is called frequent itemsets, and Orange has an add-on for it. Well, not exactly: you are talking about frequent itemsets with 100 % support. If anything like this occurs in the data, the analyst either already knows about it or (s)he doesn't know what the data is about. I can hardly imagine a situation where this would be useful.

I'm sorry for this dressing down. We are very happy to answer any questions from users, address their ideas and suggestions.

Be creative, create something. As for Orange: create an add-on and start coding the ideas that you proposed. I cannot speak for other GitHub repositories that you have flooded with your suggestions, but I believe that they would appreciate code, not just ideas. Ideas are cheap, we have boatloads of them.

simonaubertbd · 2024-11-23T10:35:57Z

Hello @janezd. First of all, thanks for taking the time.

"Just in August, you created 52, almost two per day. Were all of them well thought out? Well investigated? Worth the time that developers spent answering them?"

Most of them were on Amphi, a very young but promising tool with a very small team. So lots of bugs, tons of missing features. The Amphi team has already solved more than 40 of these issues, and I have a call with them every week precisely because they need ideas from people who have used several competitors for more than a decade, and also testers.
That is a very different context from Orange, where I think most of the "data science" features are already there (clearly more than I need), but the missing things are more in the UI and in "small but useful" things for data investigation.

"you are talking about frequent itemsets with 100 % support"

Oh, I read the help page and yes, you're right, that's exactly it. To be honest, I didn't manage to find it when I first searched. That's probably due to a different technical background: after 14 years in data, I had never heard of this term. So thanks for the information, it will help, and I learnt something.

"If anything like this occurs in the data, the analyst either already knows about it or (s)he doesn't know what the data is about"

Working as a consultant, I usually have no idea (or only an inkling) about the datasets my customers give me.

"I can hardly imagine a situation where this would be useful."

For me, it's mainly to create a nice star schema for dataviz, instead of one big table.

"Be creative, create something. As for Orange: create an add-on and start coding the ideas that you proposed"
I totally get what you mean; I'm trying to learn Python in my free time before doing that. You may have noticed that I at least tried to write some Python code for the unique key. I use this code in an Alteryx macro and it works pretty well. I also spent my evening trying to write the code for this exact field-grouping idea, and I wasn't that far off.

But yes, I wrote up the idea before doing it myself because I'm more confident in your skills than in mine.

"Ideas are cheap, we have boatloads of them."
Good ideas aren't that common. I have used a lot of tools (Tableau, Qlik view, Qlik Sense, Dataïku, PowerBi, Alteryx, Easymorph, Amphi, ... and of course Orange Data Mining) and I have noticed a lack of benchmarking in the industry. A lot of quick wins can help. And I'm sorry to point out that this was the case for Orange Data Mining with, for example, the search bar. Again, these are ideas, not requirements, and I publish them when I think of them. But I don't want to annoy you; this is your product and I won't post again if it's not welcome.

simonaubertbd · 2024-12-07T21:25:19Z

@janezd Hello. I have done some research on frequent itemsets, especially this one: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
Very interesting, and I have used it under another name in Alteryx (a tool for Market Basket Analysis). However, I have to correct myself: this is absolutely not what this post is about. There, you look at values of the same field that frequently occur together (like customers frequently buying sugar with flour). Here, I'm talking about fields, columns, that change together.

Best regards,

Simon

janezd assigned simonaubertbd and BlazZupan and unassigned simonaubertbd Nov 22, 2024

janezd closed this as completed Nov 23, 2024

This was referenced Nov 23, 2024

Comparison set of tools #6918

Closed

Unique key detector tool #6900

Closed

Cartesian product detection in join tool #6899

Closed
