Field grouping #6932
Comments
Dear @simonaubertbd, I couldn't help but notice that you've been opening dozens of issues all over GitHub for months and months. Just in August, you have created 52, almost two per day. Were all of them well-thought-out? Well investigated? Worth the time that developers spent on answering them?
Developers are happy to get user feedback, to get bug reports, to understand how their software is being used and how they could make it better. We're also happy to help new users who lack understanding.
But there's another side to it. When we find ourselves in a new environment - a new programming language, framework, software - we need to understand what it does, how it does it and, crucially, what it doesn't. Without sufficient experience, we tend to use a tool in the wrong way (e.g. a React programmer trying to use it as a jQuery replacement) or we tend to use a tool that we like for purposes that it wasn't designed for (e.g. using Orange as a replacement for Excel or SQL). Or we simply propose features that are already there.
This issue is a good example. The first sentence goes: "A lot of time, when you have a dataset, you want to know if there is a group of fields that works together." In data mining this is called frequent itemsets, and Orange has an add-on for it. Well, not exactly: you are talking about frequent itemsets with 100 % support. If anything like this occurs in the data, the analyst either already knows about it or (s)he doesn't know what the data is about. I can hardly imagine a situation where this would be useful.
I'm sorry for this dressing down. We are very happy to answer any questions from users, address their ideas and suggestions. Be creative, create something. As for Orange: create an add-on and start coding the ideas that you proposed. I cannot speak for other GitHub repositories that you have flooded with your suggestions, but I believe that they would appreciate code, not just ideas. Ideas are cheap, we have boatloads of them.
Hello @janezd. First of all, thanks for the time you took.
"Just in August, you have created 52, almost two per day. Were all of them well-thought-out? Well investigated? Worth the time that developers spent on answering them?"
Most of them were on Amphi, a very young but promising tool with a very small team. So lots of bugs, tons of missing features. The Amphi team has already resolved more than 40 of these issues, and I have a call with them every week precisely because they need testers and ideas from people who have used several competitors for more than a decade.
"you are talking about frequent itemsets with 100 % support"
Oh, I read the help page and yes, you're right, that's exactly that. To be honest, I didn't manage to find it when I first searched. That's probably due to a different technical background: after 14 years in data, I had never heard of this term. So thanks for the information, it will help and I learnt something.
"If anything like this occurs in the data, the analyst either already knows about it or (s)he doesn't know what the data is about"
Working as a consultant, I usually have no idea (or only an inkling) about the datasets my customers give me.
"I can hardly imagine a situation where this would be useful."
For me it's mainly to create a nice star schema for dataviz, instead of one big table.
"Be creative, create something. As for Orange: create an add-on and start coding the ideas that you proposed"
"Ideas are cheap, we have boatloads of them."
@janezd Hello. I have done some research on frequent itemsets, especially this one: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/ Best regards, Simon
What's your use case?
Hello,
A lot of time, when you have a dataset, you want to know if there is a group of fields that works together. That can help you normalize (i.e. de-join) your data model for dataviz or performance reasons, or simplify your analysis.
Example
Here, we could split the table in three:
-order
-model
-item
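To make the split concrete, here is a minimal Python sketch of the de-joining step, assuming the field groups are already known. The sample rows, the `order_id` column, the `split_tables` helper and the choice to keep `model_id` and `item_id` in the order table as links are all hypothetical, for illustration only; the other column names are taken from the description further down.

```python
def split_tables(rows, groups):
    """Project the rows onto each field group, keeping distinct rows only.

    `rows` is a list of dicts (one dict per row of the flat table) and
    `groups` maps a table name to the list of fields it should hold.
    """
    tables = {}
    for name, fields in groups.items():
        seen = set()
        table = []
        for row in rows:
            key = tuple(row[f] for f in fields)
            if key not in seen:  # drop duplicated combinations
                seen.add(key)
                table.append(dict(zip(fields, key)))
        tables[name] = table
    return tables


# Hypothetical flat table: one row per order.
flat_rows = [
    {"order_id": 1, "amount": 10.0, "model_id": "M1", "length": 10,
     "color": "red", "item_id": "I1", "label": "bolt"},
    {"order_id": 2, "amount": 12.5, "model_id": "M1", "length": 10,
     "color": "red", "item_id": "I2", "label": "nut"},
    {"order_id": 3, "amount": 10.0, "model_id": "M2", "length": 20,
     "color": "blue", "item_id": "I1", "label": "bolt"},
]

# Assumption: the order table keeps model_id and item_id as links.
tables = split_tables(flat_rows, {
    "order": ["order_id", "amount", "model_id", "item_id"],
    "model": ["model_id", "length", "color"],
    "item": ["item_id", "label"],
})
```

The model and item tables end up with one row per distinct model or item, which is the star-schema shape mentioned in the comments.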
What's your proposed solution?
The tool would take:
-a dataframe as input
-configuration: ability to select fields
-output: a table summarizing the groups
Very important: the non-selected fields (like amount here) are still in the result, but all in the "remaining" group.
Algo steps:
1/ Pre-groups: count the distinct values of each field. Goal: optimize the algorithm by avoiding computing all pairs.
Fields whose distinct count equals the number of rows are automatically excluded and sent to the remaining group.
Fields that have the same distinct count are put in the same pre-group.
2/ For each pre-group, for each pair of fields,
take the distinct values of the pair
like here
if, in this table, the distinct count of each field is equal to the number of rows, it's a "pair-group"
here, for the model, you will have
-model_id,length
-model_id,color
-length,color
3/ Since a field can only belong to one group, it means model_id,length,color would be the first (or second) group, then item_id and label.
If a field does not belong to a group, it goes to the "remaining" group at the end.
In the remaining group, you can add a link to the other groups since you don't know which field is the key.
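The steps above could be sketched roughly as follows in plain Python. This is only one possible reading of the proposal, not an existing Orange widget: the `group_fields` function, the sample rows and the `order_id` column are hypothetical, and pair-groups are merged with a small union-find, which is my assumption about how step 3 would be carried out.

```python
from collections import defaultdict
from itertools import combinations


def group_fields(rows, selected=None):
    """Return (groups, remaining) for a list of dict rows."""
    if not rows:
        return [], []
    all_fields = list(rows[0])
    selected = all_fields if selected is None else list(selected)
    n_rows = len(rows)

    # Step 1: pre-group the selected fields by their distinct count.
    pre_groups = defaultdict(list)
    for field in selected:
        n_distinct = len({row[field] for row in rows})
        if n_distinct < n_rows:  # fields unique on every row are skipped
            pre_groups[n_distinct].append(field)

    # Union-find over the candidate fields, used to merge pair-groups.
    parent = {f: f for fields in pre_groups.values() for f in fields}

    def find(f):
        while parent[f] != f:
            f = parent[f]
        return f

    # Step 2: two fields of a pre-group form a pair-group when the number
    # of distinct value pairs equals their shared distinct count.
    for n_distinct, fields in pre_groups.items():
        for a, b in combinations(fields, 2):
            n_pairs = len({(row[a], row[b]) for row in rows})
            if n_pairs == n_distinct:
                parent[find(b)] = find(a)  # Step 3: merge into one group

    merged = defaultdict(list)
    for f in parent:
        merged[find(f)].append(f)
    groups = [g for g in merged.values() if len(g) > 1]

    # Everything else (including non-selected fields) is "remaining".
    grouped = {f for g in groups for f in g}
    remaining = [f for f in all_fields if f not in grouped]
    return groups, remaining


# Hypothetical sample data; `amount` is deliberately left unselected.
rows = [
    {"order_id": 1, "amount": 10.0, "model_id": "M1", "length": 10,
     "color": "red", "item_id": "I1", "label": "bolt"},
    {"order_id": 2, "amount": 12.5, "model_id": "M1", "length": 10,
     "color": "red", "item_id": "I2", "label": "nut"},
    {"order_id": 3, "amount": 10.0, "model_id": "M2", "length": 20,
     "color": "blue", "item_id": "I1", "label": "bolt"},
    {"order_id": 4, "amount": 7.0, "model_id": "M2", "length": 20,
     "color": "blue", "item_id": "I3", "label": "washer"},
    {"order_id": 5, "amount": 12.5, "model_id": "M1", "length": 10,
     "color": "red", "item_id": "I3", "label": "washer"},
]
selected = ["order_id", "model_id", "length", "color", "item_id", "label"]
groups, remaining = group_fields(rows, selected)
```

On this sample, the model fields and the item fields come out as two groups, while `order_id` (unique on every row) and the unselected `amount` fall into the remaining group.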
Best regards,
Simon
PS: I have in mind an evolution with links between non-remaining tables (like here, the model could be linked to the item as an option).
Are there any alternative solutions?
N/A