Vocabulary extraction #280

atantos · 2024-01-10T13:32:18Z

Dear maintainers. (@rssdev10 )

One of the common challenges I faced with the StringDocument type was extracting a vocabulary in the form of an OrderedDict for use in the cooccurrence matrix. Previously, it was necessary to create custom functions for the OrderedDict. A ready-made vocabulary dictionary would be an extremely useful tool for various tasks involved in processing a corpus of texts or a single text.

I added the function vocab() with its docstring in the document.jl with two types of input (StringDocument and Vector{String}) that returns the OrderedDict of the vocabulary. I also added two tests in test.jl and updated the documentation in coom.jl for TextAnalysis.coo_matrix().

PS: Do not know why I get the UndefVarError: OrderedDict not defined error in the tests. I thought that the DataStructures were imported. Locally, it runs ok!

rssdev10 · 2024-01-10T17:59:13Z

Hi Alex, thanks for the PR.

Please don't mix style changes with functional changes. There are a lot of changes in your PR that are not really changes. Style fixes should be merged into separate commits with no other changes.
I have a concern about where the vocab method should be located. See https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/LM/vocab.jl . Maybe this is a better place.
The vocab name is confusing. Maybe ordered_vocab would be better for your case.
The error in the unit test occurs because document.jl doesn't contain using OrderedDict. Use the package test locally. You can activate it with using Pkg; Pkg.test() or run from a package manager with the ]test command.
Whenever you add documentation, use julia-repl keyword for the REPL examples. Or use
You can check documentation locally with include("docs/make.jl") from REPL.

src/document.jl

rssdev10 · 2024-01-10T18:07:00Z

src/document.jl

+    ordered_dict = OrderedDict{String,Int}()
+    sizehint!(ordered_dict, length(string_vector))
+
+    # reverse the order of the keys and values in the enumerate iterator to get an ordered dict.


The comment is unclear. The code doesn't contain an actual change of the order

I will remove the comment if it makes things unclear. I mean that we get the tuples (index, key) in that order and we reverse the order within the loop. However, it might be confusing and probably redundant as a comment. I would remove it entirely.

for (index, key) in enumerate(string_vector) ordered_dict[key] = index

atantos · 2024-01-10T18:12:42Z

Hi there!

Thanks for the comments!

Please don't mix style changes with functional changes. There are a lot of changes in your PR that are not really changes. Style fixes should be merged into separate commits with no other changes.

Sorry for teh style changes. They come from vscode editor automatically saving these kind of changes. I guess will have to undo these specific changes that it does (deleting tabs in exports, most of the times).

I have a concern about where the vocab method should be located. See https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/LM/vocab.jl . Maybe this is a better place.

Sounds like a good idea. On the other hand, this is the place related to language modeling. Am not sure either. I wll give it a thought and will come back soon.

The vocab name is confusing. Maybe ordered_vocab would be better for your case.

Indeed! ordered_vocad sounds find. (a more general question: naming conventions do not follow Julia's guidelines, if I am right?)

The error in the unit test occurs because document.jl doesn't contain using OrderedDict. Use the package test locally. You can activate it with using Pkg; Pkg.test() or run from a package manager with the ]test command.

So, I guess this means that in order to have it pass the tests, I simply need to use the using statement in the beginning of the file just above @testset, right? It seems that was the case and it worked.

Whenever you add documentation, use julia-repl keyword for the REPL examples. Or use

Thanks! I updated the comments.

You can check documentation locally with include("docs/make.jl") from REPL.

Nice!

…cab will be stored)

rssdev10 · 2024-01-11T06:18:45Z

Sorry for the style changes. They come from vscode editor automatically saving these kind of changes. I guess will have to undo these specific changes that it does (deleting tabs in exports, most of the times).

In the VS Code there is a tool for doing partial commits. https://www.youtube.com/watch?v=sYTwr1OSUlo

Regarding names conversation. Technically, there is https://docs.julialang.org/en/v1/manual/style-guide/ . But the strict requirement for function names is no longer present. In most scripting languages, the snake style of function names is considered more readable than the initial math style of names without any separators in Julia.

So, I guess this means that in order to have it pass the tests, I simply need to use the using statement in the beginning of the file just above @testset, right? It seems that was the case and it worked.

correct. Also, my personal preference is to always have independent test files and be able to run them separately. This makes it easy to develop and debug your own code when you are working on a few specific functions.

Sounds like a good idea. On the other hand, this is the place related to language modeling. Am not sure either. I will give it a thought and will come back soon.

It would be good to take a pause and review this again. We need to avoid spreading similar functionality around the codebase. But from another side, some changes could be made to the existing structure.

atantos · 2024-01-11T11:21:29Z

In the VS Code there is a tool for doing partial commits. https://www.youtube.com/watch?v=sYTwr1OSUlo

Thanks! I will check out the video and install it.

Regarding names conversation. Technically, there is https://docs.julialang.org/en/v1/manual/style-guide/ . But the strict requirement for function names is no longer present. In most scripting languages, the snake style of function names is considered more readable than the initial math style of names without any separators in Julia.

Snake style it is then. Thanks for giving me the background!

correct. Also, my personal preference is to always have independent test files and be able to run them separately. This makes it easy to develop and debug your own code when you are working on a few specific functions.

I think you are probably right. Having them separately is clearer.

It would be good to take a pause and review this again. We need to avoid spreading similar functionality around the codebase. But from another side, some changes could be made to the existing structure.

Yes, I agree with both points. I'm planning to conduct a survey comparing Python and R's analogous packages/libraries, and I'll return with suggestions. I believe Slack is a better platform for discussing package design.

However, I have two preliminary points:

My background is in corpus/language data analysis, and there's a large community in the humanities without prior programming experience who want to use programming tools for tasks like statistical tests and modeling. They prefer tools that simplify their work, avoiding the need to switch between multiple, distinct packages. So, their workflow needs user-friendliness. TextAnalysis.jl is excellent for those who understand some underlying mechanics, but most users need straightforward tools that perform exactly as needed. R, currently popular in the humanities, offers this simplicity with one-liners for specific tasks. Users don't delve into the underlying functions and still achieve their goals, even if it takes slightly longer. So, how do we encourage them to try new tools? In my opinion, providing easy-to-use tools for tasks they're accustomed to in R/Python, and then giving them the perspective of having so many other utilities and flexibility with types and their hierarchies, clear code with multiple dispatch etc. The first step is to offer them familiar one-liners from R/Python to ease the transition.
Building on point 1, regarding the ordered_vocab() function, having a user-friendly vocabulary function is crucial for various types of language data analysis, like measuring complexity and/or associations based on cooccurrence matrices in various fields. Until now, users of TextAnalysis.jl would have to create their own vocabularies, a daunting task for many in the humanities already overwhelmed by R's one-liners. Integrating ordered_vocab() (i suggest we decide for a temporaray solution wrt the location but still do the merge so that I can prepare a tutorial or two on using TextAnalysis.jl soon) and I would go one step further and suggest that even hidding the vocab argument from the user (creating the vocabulary by default) would simplify usage, eliminating the need for users to manually create and add it as an argument to functions such as TextAnalysis.coo_matrix().

I realize these suggestions involve significant changes and discussions. But, having worked in the humanities for years and conducted tutorials for various audiences (e.g. philologists, linguists, teachers, psychologists, historians, librarians), I believe my insights are valuable in attracting people from these fields.

atantos added 2 commits January 10, 2024 13:39

edited the .gitignore file

ee99e85

Added vocabulary extraction function.

d659864

rssdev10 reviewed Jan 10, 2024

View reviewed changes

src/document.jl Outdated Show resolved Hide resolved

rssdev10 reviewed Jan 10, 2024

View reviewed changes

atantos added 3 commits January 10, 2024 20:38

Integrated suggestions made by @rssdev10 (except for where ordered_vo…

8bbbbcc

…cab will be stored)

Added julia-repl on the ordered_vocab() documentation example.

29a2558

All tests are passed locally.

0e53fc8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vocabulary extraction #280

Vocabulary extraction #280

atantos commented Jan 10, 2024 •

edited

Loading

rssdev10 commented Jan 10, 2024

rssdev10 Jan 10, 2024

atantos Jan 10, 2024

atantos commented Jan 10, 2024 •

edited

Loading

rssdev10 commented Jan 11, 2024

atantos commented Jan 11, 2024 •

edited

Loading

Vocabulary extraction #280

Are you sure you want to change the base?

Vocabulary extraction #280

Conversation

atantos commented Jan 10, 2024 • edited Loading

rssdev10 commented Jan 10, 2024

rssdev10 Jan 10, 2024

Choose a reason for hiding this comment

atantos Jan 10, 2024

Choose a reason for hiding this comment

atantos commented Jan 10, 2024 • edited Loading

rssdev10 commented Jan 11, 2024

atantos commented Jan 11, 2024 • edited Loading

atantos commented Jan 10, 2024 •

edited

Loading

atantos commented Jan 10, 2024 •

edited

Loading

atantos commented Jan 11, 2024 •

edited

Loading