RegExp for matching two exact words in Statistics #1010

Open
abocin opened this issue Oct 2, 2023 · 1 comment
Labels
bug friday Friday's TODOs

Comments

abocin commented Oct 2, 2023

What's your use case?

In Statistics, when I use the Contains feature to search for a specific word, it returns one or more matches and shows which document contains the word. However, when I use the RegExp feature in Statistics with a simple search string such as \bword\b, the search returns 0 results.

This is not a major issue for single words, since I can use Contains to count how many times the word occurs. But when I want to find a group of two (key)words, Contains fails to return any results: I enter "word1", a space, and "word2" in the Contains box, and nothing is found even though the two words occur in exactly that order in my document. Since Contains does not seem suited to this kind of search, I tried RegExp, but that does not seem to work either. I tried many expressions, from the simple /^(apple|banana)$/ to (apple|banana), apple|banana, and \b(apple|banana)(?:\W+\w+){1,6}?\W+(apple|banana)\b.

My task is quite simple. I need to find keywords in documents, but sometimes a keyword is actually a group of two (or more) words that together define a concept. For example, I want to find all sentences in my documents that contain the word group "apple banana", preferably with only a space between the words, though matches where the words are up to six words apart would also be acceptable (see the last RegExp example above). I don't know exactly what input the RegExp field in Statistics, from the Text Mining add-on, expects.
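For reference, here is a minimal sketch of the two searches described above, run against plain text with Python's standard re module. The sample sentence and variable names are made up for illustration, and this only shows what the patterns match on raw text; it does not show how the Statistics widget evaluates them.

```python
import re

# Hypothetical sample text; not taken from any real corpus.
text = "She bought an apple banana smoothie. The apple was next to a ripe banana."

# The two words next to each other, separated only by whitespace.
adjacent = re.compile(r"\bapple\s+banana\b", re.IGNORECASE)

# "apple" followed by "banana" with at most six other words in between.
nearby = re.compile(r"\bapple\b(?:\W+\w+){0,6}?\W+banana\b", re.IGNORECASE)

adjacent_hits = [m.group(0) for m in adjacent.finditer(text)]
nearby_hits = [m.group(0) for m in nearby.finditer(text)]

print(adjacent_hits)
print(nearby_hits)
```

On raw text both patterns find the word group, so the expressions themselves are not the reason for getting 0 results.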

Maybe some examples would be useful. I have read the documentation where RegExp is mentioned for other widgets, such as Corpus Viewer or Preprocess Text, but there too I could not work out a RegExp for two or more words.

What's your proposed solution?

Can you please provide the exact format of the RegExp input, and say which RegExp flavor or syntax should be used, so that a search for a group of two or more words ("word1" space "word2" space "word3") returns valid results?

Are there any alternative solutions?

@VesnaT VesnaT transferred this issue from biolab/orange3 Oct 4, 2023
ajdapretnar (Collaborator) commented

The issue is that Regex searches in tokens, which by default are constructed as 1-grams, so a pattern that spans two words has no single token to match. Ideally, regex would search the text, not the tokens. We will think about a better solution for this.
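A rough sketch of the difference, assuming a simple whitespace split as a stand-in for 1-gram tokenization (this is not the widget's actual code, and the sample text and names are hypothetical):

```python
import re

# Hypothetical document text.
text = "i ate an apple banana smoothie"

# Assumption: whitespace splitting stands in for 1-gram tokenization.
tokens = text.split()

pattern = re.compile(r"\bapple\s+banana\b")

# Matching token by token: no single 1-gram contains both words.
token_matches = [t for t in tokens if pattern.search(t)]

# Matching against the full text: the two-word group is found.
text_matches = pattern.findall(text)

# Matching against 2-gram tokens: the pair appears as one unit.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
bigram_matches = [b for b in bigrams if pattern.search(b)]

print(token_matches)
print(text_matches)
print(bigram_matches)
```

The token-level search comes back empty, while the same pattern succeeds on the full text and on 2-grams. Whether building n-gram tokens changes the Statistics result in practice is not confirmed in this thread.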

@ajdapretnar ajdapretnar added bug friday Friday's TODOs labels Oct 23, 2023