Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vocabulary extraction #280

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Vocabulary extraction #280

wants to merge 5 commits into from

Conversation

atantos
Copy link
Contributor

@atantos atantos commented Jan 10, 2024

Dear maintainers. (@rssdev10 )

One of the common challenges I faced with the StringDocument type was extracting a vocabulary in the form of an OrderedDict for use in the cooccurrence matrix. Previously, it was necessary to create custom functions for the OrderedDict. A ready-made vocabulary dictionary would be an extremely useful tool for various tasks involved in processing a corpus of texts or a single text.

I added the function vocab() with its docstring in the document.jl with two types of input (StringDocument and Vector{String}) that returns the OrderedDict of the vocabulary. I also added two tests in test.jl and updated the documentation in coom.jl for TextAnalysis.coo_matrix().

PS: Do not know why I get the UndefVarError: OrderedDict not defined error in the tests. I thought that the DataStructures were imported. Locally, it runs ok!

@rssdev10
Copy link
Collaborator

Hi Alex, thanks for the PR.

  • Please don't mix style changes with functional changes. There are a lot of changes in your PR that are not really changes. Style fixes should be merged into separate commits with no other changes.
  • I have a concern about where the vocab method should be located. See https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/LM/vocab.jl . Maybe this is a better place.
  • The vocab name is confusing. Maybe ordered_vocab would be better for your case.
  • The error in the unit test occurs because document.jl doesn't contain using OrderedDict. Use the package test locally. You can activate it with using Pkg; Pkg.test() or run from a package manager with the ]test command.
  • Whenever you add documentation, use julia-repl keyword for the REPL examples. Or use
  • You can check documentation locally with include("docs/make.jl") from REPL.

src/document.jl Outdated Show resolved Hide resolved
src/document.jl Outdated
ordered_dict = OrderedDict{String,Int}()
sizehint!(ordered_dict, length(string_vector))

# reverse the order of the keys and values in the enumerate iterator to get an ordered dict.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is unclear. The code doesn't contain an actual change of the order

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the comment if it makes things unclear. I mean that we get the tuples (index, key) in that order and we reverse the order within the loop. However, it might be confusing and probably redundant as a comment. I would remove it entirely.

for (index, key) in enumerate(string_vector)
        ordered_dict[key] = index

@atantos
Copy link
Contributor Author

atantos commented Jan 10, 2024

Hi there!

Thanks for the comments!

  • Please don't mix style changes with functional changes. There are a lot of changes in your PR that are not really changes. Style fixes should be merged into separate commits with no other changes.

Sorry for teh style changes. They come from vscode editor automatically saving these kind of changes. I guess will have to undo these specific changes that it does (deleting tabs in exports, most of the times).

Sounds like a good idea. On the other hand, this is the place related to language modeling. Am not sure either. I wll give it a thought and will come back soon.

  • The vocab name is confusing. Maybe ordered_vocab would be better for your case.

Indeed! ordered_vocad sounds find. (a more general question: naming conventions do not follow Julia's guidelines, if I am right?)

  • The error in the unit test occurs because document.jl doesn't contain using OrderedDict. Use the package test locally. You can activate it with using Pkg; Pkg.test() or run from a package manager with the ]test command.

So, I guess this means that in order to have it pass the tests, I simply need to use the using statement in the beginning of the file just above @testset, right? It seems that was the case and it worked.

  • Whenever you add documentation, use julia-repl keyword for the REPL examples. Or use

Thanks! I updated the comments.

  • You can check documentation locally with include("docs/make.jl") from REPL.

Nice!

@rssdev10
Copy link
Collaborator

Sorry for the style changes. They come from vscode editor automatically saving these kind of changes. I guess will have to undo these specific changes that it does (deleting tabs in exports, most of the times).

In the VS Code there is a tool for doing partial commits. https://www.youtube.com/watch?v=sYTwr1OSUlo

Regarding names conversation. Technically, there is https://docs.julialang.org/en/v1/manual/style-guide/ . But the strict requirement for function names is no longer present. In most scripting languages, the snake style of function names is considered more readable than the initial math style of names without any separators in Julia.

So, I guess this means that in order to have it pass the tests, I simply need to use the using statement in the beginning of the file just above @testset, right? It seems that was the case and it worked.

correct. Also, my personal preference is to always have independent test files and be able to run them separately. This makes it easy to develop and debug your own code when you are working on a few specific functions.

Sounds like a good idea. On the other hand, this is the place related to language modeling. Am not sure either. I will give it a thought and will come back soon.

It would be good to take a pause and review this again. We need to avoid spreading similar functionality around the codebase. But from another side, some changes could be made to the existing structure.

@atantos
Copy link
Contributor Author

atantos commented Jan 11, 2024

In the VS Code there is a tool for doing partial commits. https://www.youtube.com/watch?v=sYTwr1OSUlo

Thanks! I will check out the video and install it.

Regarding names conversation. Technically, there is https://docs.julialang.org/en/v1/manual/style-guide/ . But the strict requirement for function names is no longer present. In most scripting languages, the snake style of function names is considered more readable than the initial math style of names without any separators in Julia.

Snake style it is then. Thanks for giving me the background!

correct. Also, my personal preference is to always have independent test files and be able to run them separately. This makes it easy to develop and debug your own code when you are working on a few specific functions.

I think you are probably right. Having them separately is clearer.

It would be good to take a pause and review this again. We need to avoid spreading similar functionality around the codebase. But from another side, some changes could be made to the existing structure.

Yes, I agree with both points. I'm planning to conduct a survey comparing Python and R's analogous packages/libraries, and I'll return with suggestions. I believe Slack is a better platform for discussing package design.

However, I have two preliminary points:

  1. My background is in corpus/language data analysis, and there's a large community in the humanities without prior programming experience who want to use programming tools for tasks like statistical tests and modeling. They prefer tools that simplify their work, avoiding the need to switch between multiple, distinct packages. So, their workflow needs user-friendliness. TextAnalysis.jl is excellent for those who understand some underlying mechanics, but most users need straightforward tools that perform exactly as needed. R, currently popular in the humanities, offers this simplicity with one-liners for specific tasks. Users don't delve into the underlying functions and still achieve their goals, even if it takes slightly longer. So, how do we encourage them to try new tools? In my opinion, providing easy-to-use tools for tasks they're accustomed to in R/Python, and then giving them the perspective of having so many other utilities and flexibility with types and their hierarchies, clear code with multiple dispatch etc. The first step is to offer them familiar one-liners from R/Python to ease the transition.

  2. Building on point 1, regarding the ordered_vocab() function, having a user-friendly vocabulary function is crucial for various types of language data analysis, like measuring complexity and/or associations based on cooccurrence matrices in various fields. Until now, users of TextAnalysis.jl would have to create their own vocabularies, a daunting task for many in the humanities already overwhelmed by R's one-liners. Integrating ordered_vocab() (i suggest we decide for a temporaray solution wrt the location but still do the merge so that I can prepare a tutorial or two on using TextAnalysis.jl soon) and I would go one step further and suggest that even hidding the vocab argument from the user (creating the vocabulary by default) would simplify usage, eliminating the need for users to manually create and add it as an argument to functions such as TextAnalysis.coo_matrix().

I realize these suggestions involve significant changes and discussions. But, having worked in the humanities for years and conducted tutorials for various audiences (e.g. philologists, linguists, teachers, psychologists, historians, librarians), I believe my insights are valuable in attracting people from these fields.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants