Code-prose-composition tagger. #247
base: main
bed5fe8 to 9eae084
Add a tagger that adds attributes for code-prose-other composition of files based on line classifications.
Coverage for code-prose-composition tagger.
Improve error messages for spawn method checks.
Set multiprocessing start method in test setup.
Set multiprocessing in test case with error handling.
Add before suite hook to set mp start method. The default multiprocessing start method is "fork", which is not compatible with runtime assertions that it is set to "spawn". When running unit tests, it's possible to call an external library that sets the start method to "fork". Here we enforce the start method to be "spawn" for all tests before executing.
Linting.
Remove error log messages.
9eae084 to db8b744
Additionally update commentary and word choice.
Generated attribute line example:
I think you might be able to keep this shorter by cutting the key in line_label, maybe elsewhere
LGTM once you see if you can make the attribute keys shorter
from ..core.registry import TaggerRegistry


@TaggerRegistry.add("code-prose-composition")
Other taggers use underscores instead of dashes
Tagger for Code Prose Composition
Add a tagger that adds attributes for code-prose-other composition of files based on line classifications.
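As a rough illustration of what "composition based on line classifications" could mean, the sketch below turns a sequence of per-line labels into per-document fractions. The label set, the function name, and the choice to fold unknown labels into "other" are assumptions made for this example, not the tagger's actual implementation.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical label set, taken from the "code-prose-other" wording above.
LABELS = ("code", "prose", "other")


def composition(line_labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of lines carrying each label.

    Labels outside LABELS are counted as "other" (an assumption).
    """
    counts = Counter(
        label if label in LABELS else "other" for label in line_labels
    )
    total = sum(counts.values())
    if total == 0:
        # An empty document has no composition; report zeros.
        return {label: 0.0 for label in LABELS}
    return {label: counts.get(label, 0) / total for label in LABELS}
```

A document with two code lines, one prose line, and one unclassified line would yield fractions of 0.5, 0.25, and 0.25 respectively.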
Produces tags for
Recommended filter for mixed prose/code content based on these tags is:
The code entropy adjusts for the classifier's bias towards labeling short strings as code when they contain "code-y" characters such as (, ), [, ], and :, a bias that stems from a lack of good negative examples. Until there is time to build an improved classifier, filtering for high-confidence code predictions via mean entropy works sufficiently well.
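One way to picture a mean-entropy confidence filter is sketched below: compute the Shannon entropy of each line's predicted class distribution, average it over the lines predicted as code, and treat a low mean as high confidence. The function names, the probability format, and the 0.5-bit threshold are all hypothetical and are not taken from this PR.

```python
import math
from typing import Dict, List, Optional


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in bits) of one line's class probabilities."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0.0)


def mean_code_entropy(line_probs: List[Dict[str, float]]) -> Optional[float]:
    """Mean entropy over lines whose most likely label is "code".

    Returns None when no line is predicted as code.
    """
    code_lines = [p for p in line_probs if max(p, key=p.get) == "code"]
    if not code_lines:
        return None
    return sum(entropy(p) for p in code_lines) / len(code_lines)


def is_confident_code(
    line_probs: List[Dict[str, float]],
    max_mean_entropy: float = 0.5,  # hypothetical threshold, in bits
) -> bool:
    """True when the code predictions are, on average, low entropy."""
    mean = mean_code_entropy(line_probs)
    return mean is not None and mean <= max_mean_entropy
```

Under this sketch, a short string that is only marginally classified as code (say 55% code, 45% prose) has entropy close to 1 bit and is rejected, while a line predicted as code at 99% passes.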
One More Thing
Updated a pre-suite hook to set the multiprocessing start method to "spawn" to prevent a side effect where test case dependencies may set it to the default "fork", violating runtime assertions.
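A minimal sketch of the idea behind such a hook, using only the standard library, might look like the following. The function names are hypothetical, and how the hook is wired into the test runner (for example, a session-wide setup step) is left out because the PR text does not show it.

```python
import multiprocessing


def enforce_spawn_start_method() -> None:
    """Force the start method to "spawn" before any test runs.

    force=True overrides a start method that a dependency may
    already have set.
    """
    multiprocessing.set_start_method("spawn", force=True)


def assert_spawn() -> None:
    """Runtime check of the kind the hook is meant to keep satisfied."""
    current = multiprocessing.get_start_method(allow_none=True)
    if current != "spawn":
        raise RuntimeError(
            f"multiprocessing start method must be 'spawn', got {current!r}"
        )
```

Calling `enforce_spawn_start_method()` once before the suite starts should make a later `assert_spawn()` pass, unless something resets the start method in between.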