Field grouping #6932

Closed
simonaubertbd opened this issue Nov 15, 2024 · 3 comments
@simonaubertbd

simonaubertbd commented Nov 15, 2024

What's your use case?
Hello,

A lot of time, when you have a dataset, you want to know if there is a group of fields that works together. This can help you normalize (i.e. de-join) your data model for dataviz or performance reasons, or simplify your analysis.

Example

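| order_id | item_id | label | model_id | length | color | amount |
|---|---|---|---|---|---|---|
| 1 | 1 | A | 10 | 15 | Blue | 101 |
| 2 | 1 | A | 10 | 15 | Blue | 101 |
| 3 | 2 | B | 10 | 15 | Blue | 101 |
| 4 | 2 | B | 10 | 15 | Blue | 101 |
| 5 | 2 | B | 10 | 15 | Blue | 101 |
| 6 | 3 | C | 20 | 25 | Red | 101 |
| 7 | 3 | C | 20 | 25 | Red | 101 |
| 8 | 3 | C | 20 | 25 | Red | 101 |
| 9 | 4 | D | 20 | 25 | Red | 101 |
| 10 | 4 | D | 20 | 25 | Red | 101 |
| 11 | 4 | D | 20 | 25 | Red | 101 |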

Here, we could split the table into three:
-order

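| order_id | item_id | model_id | amount |
|---|---|---|---|
| 1 | 1 | 10 | 101,2 |
| 2 | 1 | 10 | 103 |
| 3 | 2 | 10 | 104,8 |
| 4 | 2 | 10 | 106,6 |
| 5 | 2 | 10 | 108,4 |
| 6 | 3 | 20 | 110,2 |
| 7 | 3 | 20 | 112 |
| 8 | 3 | 20 | 113,8 |
| 9 | 4 | 20 | 115,6 |
| 10 | 4 | 20 | 117,4 |
| 11 | 4 | 20 | 119,2 |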

-model

| model_id | length | color |
|---|---|---|
| 10 | 15 | Blue |
| 20 | 25 | Red |

-item

| item_id | label |
|---|---|
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |
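
The split above amounts to projecting the table onto a subset of columns and dropping duplicate rows. A minimal sketch in plain Python follows; the helper name `project` is hypothetical and is not part of Orange or of the proposal, and the `amount` column is left out for brevity.

```python
# Columns and rows copied from the example table (amount omitted for brevity).
FIELDS = ("order_id", "item_id", "label", "model_id", "length", "color")
ROWS = [
    (1, 1, "A", 10, 15, "Blue"),
    (2, 1, "A", 10, 15, "Blue"),
    (3, 2, "B", 10, 15, "Blue"),
    (4, 2, "B", 10, 15, "Blue"),
    (5, 2, "B", 10, 15, "Blue"),
    (6, 3, "C", 20, 25, "Red"),
    (7, 3, "C", 20, 25, "Red"),
    (8, 3, "C", 20, 25, "Red"),
    (9, 4, "D", 20, 25, "Red"),
    (10, 4, "D", 20, 25, "Red"),
    (11, 4, "D", 20, 25, "Red"),
]


def project(rows, fields, keep):
    """Keep only the columns named in `keep` and drop duplicate rows (order preserved)."""
    idx = [fields.index(name) for name in keep]
    # dict.fromkeys de-duplicates while keeping first-seen order.
    return list(dict.fromkeys(tuple(row[i] for i in idx) for row in rows))


order = project(ROWS, FIELDS, ["order_id", "item_id", "model_id"])
model = project(ROWS, FIELDS, ["model_id", "length", "color"])
item = project(ROWS, FIELDS, ["item_id", "label"])

print(model)  # [(10, 15, 'Blue'), (20, 25, 'Red')]
print(item)   # [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D')]
```

The `order` projection keeps all 11 rows because `order_id` is unique, while `model` and `item` collapse to 2 and 4 rows.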

What's your proposed solution?
The tool would have:
- input: a dataframe
- configuration: the ability to select fields
- output: a table summarizing the groups

| field group | field | remaining fields |
|---|---|---|
| 1 | item_id | False |
| 1 | label | False |
| 2 | model_id | False |
| 2 | color | False |
| 3 | order_id | True |
| 3 | link to group 1 | True |
| 3 | link to group 2 | True |
| 3 | amount | True |

Very important: the non-selected fields (here, amount) still appear in the result, but all in the "remaining" group.

Algorithm steps:
1/ Pre-groups: compute the distinct count of each field. Goal: optimize the algorithm by not having to evaluate every pair of fields.
Fields whose distinct count equals the number of rows are automatically excluded and sent to the remaining group.
Fields that have the same distinct count are put in the same pre-group.
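
Step 1 can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `pre_groups` is hypothetical, and the data is the example table without `amount`.

```python
from collections import defaultdict

# Columns and rows copied from the example table (amount omitted for brevity).
FIELDS = ("order_id", "item_id", "label", "model_id", "length", "color")
ROWS = [
    (1, 1, "A", 10, 15, "Blue"),
    (2, 1, "A", 10, 15, "Blue"),
    (3, 2, "B", 10, 15, "Blue"),
    (4, 2, "B", 10, 15, "Blue"),
    (5, 2, "B", 10, 15, "Blue"),
    (6, 3, "C", 20, 25, "Red"),
    (7, 3, "C", 20, 25, "Red"),
    (8, 3, "C", 20, 25, "Red"),
    (9, 4, "D", 20, 25, "Red"),
    (10, 4, "D", 20, 25, "Red"),
    (11, 4, "D", 20, 25, "Red"),
]


def pre_groups(fields, rows):
    """Bucket fields by distinct count; fields unique on every row go to 'remaining'."""
    n_rows = len(rows)
    buckets = defaultdict(list)
    remaining = []
    for i, name in enumerate(fields):
        n_distinct = len({row[i] for row in rows})
        if n_distinct == n_rows:
            remaining.append(name)  # as many values as rows: excluded up front
        else:
            buckets[n_distinct].append(name)
    return dict(buckets), remaining


buckets, remaining = pre_groups(FIELDS, ROWS)
print(buckets)    # {4: ['item_id', 'label'], 2: ['model_id', 'length', 'color']}
print(remaining)  # ['order_id']
```

Only fields inside the same bucket need to be compared pairwise afterwards, which is where the saving comes from.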

2/ For each pre-group, for each pair of fields,
take the distinct values of the pair,
like here:

| item_id | label |
|---|---|
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |

If, in this table, the distinct count of each field equals the number of rows, the pair is a "pair-group".

Here, for the model, you will have:
- model_id, length
- model_id, color
- length, color
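
The pair test of step 2 can be illustrated as follows. The function names `is_pair_group` and `pair_groups` are hypothetical; the check simply builds the distinct table of the pair and compares its row count with the distinct count of each side, as described above.

```python
from itertools import combinations

# Columns and rows copied from the example table (amount omitted for brevity).
FIELDS = ("order_id", "item_id", "label", "model_id", "length", "color")
ROWS = [
    (1, 1, "A", 10, 15, "Blue"),
    (2, 1, "A", 10, 15, "Blue"),
    (3, 2, "B", 10, 15, "Blue"),
    (4, 2, "B", 10, 15, "Blue"),
    (5, 2, "B", 10, 15, "Blue"),
    (6, 3, "C", 20, 25, "Red"),
    (7, 3, "C", 20, 25, "Red"),
    (8, 3, "C", 20, 25, "Red"),
    (9, 4, "D", 20, 25, "Red"),
    (10, 4, "D", 20, 25, "Red"),
    (11, 4, "D", 20, 25, "Red"),
]


def is_pair_group(rows, i, j):
    """True if columns i and j map one-to-one onto each other."""
    pairs = {(row[i], row[j]) for row in rows}  # distinct table of the pair
    left = {a for a, _ in pairs}
    right = {b for _, b in pairs}
    return len(left) == len(pairs) == len(right)


def pair_groups(fields, rows, pre_group):
    """All pairs of fields in one pre-group that pass the one-to-one test."""
    pos = {name: k for k, name in enumerate(fields)}
    return [
        (a, b)
        for a, b in combinations(pre_group, 2)
        if is_pair_group(rows, pos[a], pos[b])
    ]


print(pair_groups(FIELDS, ROWS, ["model_id", "length", "color"]))
# [('model_id', 'length'), ('model_id', 'color'), ('length', 'color')]
```

A pair such as item_id and model_id fails the test: it has 4 distinct rows but model_id has only 2 distinct values.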

3/ Since a field can only belong to one group, this means model_id, length and color would form the first (or second) group, and item_id and label the other.

If a field does not belong to any group, it goes to the "remaining" group at the end.

In the remaining group, you can add a link to each of the other groups, since you don't know which field is the key.

| field group | field | remaining fields |
|---|---|---|
| 1 | item_id | False |
| 1 | label | False |
| 2 | model_id | False |
| 2 | length | False |
| 2 | color | False |
| 3 | order_id | True |
| 3 | link to group 1 | True |
| 3 | link to group 2 | True |
| 3 | amount | True |
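
Putting the three steps together, the recap table above could be produced roughly as follows. This is a sketch under assumptions, not a definitive implementation: `field_groups` and `distinct_count` are hypothetical names, the group numbering simply follows the order of the selected fields, and the `amount` values are taken from the first example table.

```python
from collections import defaultdict

FIELDS = ("order_id", "item_id", "label", "model_id", "length", "color", "amount")
ROWS = [
    (1, 1, "A", 10, 15, "Blue", 101),
    (2, 1, "A", 10, 15, "Blue", 101),
    (3, 2, "B", 10, 15, "Blue", 101),
    (4, 2, "B", 10, 15, "Blue", 101),
    (5, 2, "B", 10, 15, "Blue", 101),
    (6, 3, "C", 20, 25, "Red", 101),
    (7, 3, "C", 20, 25, "Red", 101),
    (8, 3, "C", 20, 25, "Red", 101),
    (9, 4, "D", 20, 25, "Red", 101),
    (10, 4, "D", 20, 25, "Red", 101),
    (11, 4, "D", 20, 25, "Red", 101),
]


def distinct_count(rows, *cols):
    """Number of distinct value combinations over the given column positions."""
    return len({tuple(row[c] for c in cols) for row in rows})


def field_groups(fields, rows, selected):
    """Return recap rows as (field group, field, is_remaining) tuples."""
    pos = {name: k for k, name in enumerate(fields)}
    n_rows = len(rows)

    # Step 1: pre-group selected fields by distinct count, skipping all-unique ones.
    pre = defaultdict(list)
    for name in selected:
        n = distinct_count(rows, pos[name])
        if n < n_rows:
            pre[n].append(name)

    # Steps 2 and 3: a field joins a group when the pair (group's first field, field)
    # has exactly n distinct rows, i.e. the two fields map one-to-one.
    groups = []
    for n, names in pre.items():
        local = []
        for name in names:
            for group in local:
                if distinct_count(rows, pos[group[0]], pos[name]) == n:
                    group.append(name)
                    break
            else:
                local.append([name])
        groups.extend(g for g in local if len(g) > 1)

    # Recap: real groups first, then one "remaining" group with links to the others.
    grouped = {name for g in groups for name in g}
    last = len(groups) + 1
    recap = [(k, name, False) for k, g in enumerate(groups, 1) for name in g]
    leftovers = [name for name in selected if name not in grouped]
    links = [f"link to group {k}" for k in range(1, last)]
    unselected = [name for name in fields if name not in selected]
    recap += [(last, name, True) for name in leftovers + links + unselected]
    return recap


selected = ["order_id", "item_id", "label", "model_id", "length", "color"]
for row in field_groups(FIELDS, ROWS, selected):
    print(row)
```

Comparing each candidate only with the first field of a group is enough here because a one-to-one mapping between fields is transitive: if model_id maps one-to-one to length and to color, then length and color also map one-to-one.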

Best regards,

Simon

PS: I have in mind a later evolution with links between non-remaining tables (here, for example, the model could optionally be linked to the item).

Are there any alternative solutions?

N/A

@janezd
Contributor

janezd commented Nov 23, 2024

Dear @simonaubertbd, I couldn't help but notice that you've been opening dozens of issues all over GitHub for months and months. Just in August, you have created 52, almost two per day. Were all of them well-thought-out? Well investigated? Worth of time that developers spent on answering them?

Developers are happy to get user feedback, to get bug reports, to understand how their software is being used and how they could make it better. We're also happy to help new users who lack understanding.

But there's another side to it. When we find ourselves in a new environment - a new programming language, framework, software - we need to understand what it does, how it does it and, crucially, what it doesn't. Without sufficient experience, we tend to use a tool in the wrong way (e.g. a React programmer trying to use it as a jQuery replacement) or we tend to use a tool that we like for purposes that it wasn't designed for (e.g. using Orange as a replacement for Excel or SQL). Or we simply propose features that are already there.

This issue is a good example. The first sentence goes: "A lot of time, when you have a dataset, you want to know if there is a group of fields that works together." In data mining this is called frequent itemsets, and Orange has an add-on for it. Well, not exactly: you are talking about frequent itemsets with 100 % support. If anything like this occurs in the data, the analyst either already knows about it or (s)he doesn't know what the data is about. I can hardly imagine a situation where this would be useful.

I'm sorry for this dressing down. We are very happy to answer any questions from users, address their ideas and suggestions.

Be creative, create something. As for Orange: create an add-on and start coding the ideas that you proposed. I cannot speak for other GitHub repositories that you have flooded with your suggestions, but I believe that they would appreciate code, not just ideas. Ideas are cheap, we have boatloads of them.

@simonaubertbd
Author

simonaubertbd commented Nov 23, 2024

Hello @janezd. First of all, thanks for taking the time.

" Just in August, you have created 52, almost two per day. Were all of them well-thought-out? Well investigated? Worth of time that developers spent on answering them?"

Most of them were on Amphi, a very young but promising tool with a very small team. So lots of bugs and tons of missing features. The Amphi team has already solved more than 40 of these issues, and I have a call with them every week precisely because they need ideas from people who have used several competitors for more than a decade, and also testers.
That is a very different context from Orange, where I think most of the "data science" features are already there (clearly more than I need), and what is missing is more in the UI and in "small but useful" things for data investigation.

"you are talking about frequent itemsets with 100 % support"

Oh, I read the help page and yes, you're right, that's exactly it. To be honest, I didn't manage to find it when I first searched. That's probably due to a different technical background: after 14 years in data, I had never heard of this term. So thanks for the information, it will help, and I learnt something.

" If anything like this occurs in the data, the analyst either already knows about it or (s)he doesn't know what the data is about"

Working as a consultant, I usually have no idea (or only an inkling) about the datasets my customers give me.

"I can hardly imagine a situation where this would be useful."

For me it's mainly to create a nice star schema for dataviz, instead of one big table.

"Be creative, create something. As for Orange: create an add-on and start coding the ideas that you proposed"
I totally get what you mean; I'm trying to learn Python in my free time before doing that. You may have noticed that I at least tried to write some Python code for the unique key. I use this code in an Alteryx macro and it works pretty well. I also spent my evening trying to write the code for this exact field-grouping idea, and I wasn't that far off.
But yes, I wrote up the idea before doing it myself because I'm more confident in your skills than in mine.

"Ideas are cheap, we have boatloads of them."
Good ideas aren't that common. I have used a lot of tools (Tableau, QlikView, Qlik Sense, Dataiku, Power BI, Alteryx, EasyMorph, Amphi, ... and of course Orange Data Mining) and I have to note that there is a lack of benchmarking in the industry. A lot of quick wins can help. And I'm sorry to point out that this was the case for Orange Data Mining with, for example, the search bar. Again, these are ideas, not requirements, and I publish them when I think of them. But I don't want to annoy you: this is your product, and I won't post again if it's not welcome.

@simonaubertbd
Author

@janezd Hello. I have done some research on frequent itemsets, especially this page: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
Very interesting, and I have used it under another name in Alteryx (a tool for Market Basket Analysis). However, I have to correct myself: this is absolutely not what this post is about. There, you look at values of the same field that frequently occur together (like customers frequently buying sugar with flour). Here, I'm talking about fields, i.e. columns, that change together.

Best regards,

Simon
