Code-prose-composition tagger. #247

Open
wants to merge 2 commits into
base: main
@no0p no0p commented Feb 28, 2025

Tagger for Code Prose Composition

Add a tagger that adds attributes for code-prose-other composition of files based on line classifications.

Produces tags for

  • code/prose/other composition as a percent of the document
  • code/prose/other mean entropy
  • code/prose/other line counts
  • code-prose boundary count
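
As a rough illustration of the tags listed above, here is a minimal sketch that derives them from per-line classifications. It is not the tagger's implementation. It assumes each line already carries a label (code, prose, or other) and an entropy value, that composition is measured as a fraction of lines, and that a boundary is a direct code/prose adjacency. The function name `summarize_lines` and the key `boundary_count` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

LABELS = ("code", "prose", "other")


def summarize_lines(lines: List[Tuple[str, float]]) -> Dict[str, float]:
    """Summarize (label, entropy) pairs, one pair per line of a document.

    Assumption: composition is the fraction of lines with each label.
    """
    counts = Counter(label for label, _ in lines)
    total = len(lines)
    stats: Dict[str, float] = {}
    for label in LABELS:
        entropies = [e for lab, e in lines if lab == label]
        # Share of the document carrying this label (0.0 for an empty document).
        stats[label] = counts[label] / total if total else 0.0
        stats[f"{label}_count"] = counts[label]
        stats[f"{label}_mean_entropy"] = (
            sum(entropies) / len(entropies) if entropies else 0.0
        )
    labels = [lab for lab, _ in lines]
    # Assumption: only directly adjacent code/prose lines count as a boundary.
    stats["boundary_count"] = sum(
        1 for a, b in zip(labels, labels[1:]) if {a, b} == {"code", "prose"}
    )
    return stats
```

The key suffixes `code`, `code_count`, and `code_mean_entropy` mirror the attribute names used in the recommended filter; everything else is illustrative.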

Recommended filter for mixed prose/code content based on these tags is:

exp__code_prose_composition__code > 0.05
exp__code_prose_composition__prose > 0.3
exp__code_prose_composition__code_count >= 8
exp__code_prose_composition__code_mean_entropy < 0.5

The code mean-entropy filter compensates for the classifier's bias toward labeling short strings as code when they contain "code-y" characters such as (, ), [, ], and :. The bias comes from a lack of good negative examples. Until there is time to build an improved classifier, filtering for high-confidence code predictions via mean entropy works sufficiently well.
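
The recommended filter can be sketched as a plain predicate. This assumes the four tag values have already been extracted into a flat mapping from attribute key to number, which may not match how the attributes are actually stored. The helper name `is_mixed_code_prose` is hypothetical, and the thresholds are the ones listed in the filter.

```python
from typing import Mapping

PREFIX = "exp__code_prose_composition__"


def is_mixed_code_prose(attrs: Mapping[str, float]) -> bool:
    """Return True when a document passes the recommended mixed-content filter.

    Assumption: attrs is a flat mapping of attribute key to numeric value.
    Missing keys are treated as failing the corresponding condition.
    """
    return (
        attrs.get(PREFIX + "code", 0.0) > 0.05
        and attrs.get(PREFIX + "prose", 0.0) > 0.3
        and attrs.get(PREFIX + "code_count", 0) >= 8
        # Low mean entropy means the code predictions are high confidence.
        and attrs.get(PREFIX + "code_mean_entropy", float("inf")) < 0.5
    )
```

Treating a missing key as a failed condition keeps untagged documents out of the mixed set rather than letting them through by default.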

One More Thing

Updated the pre-suite test hook to set the multiprocessing start method to spawn. This prevents a side effect in which test-case dependencies set it to the default, fork, which violates runtime assertions.
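
For readers unfamiliar with the start-method issue, a minimal sketch of what such a hook might do is shown here. It is not the code from this PR, and the function name `force_spawn_start_method` is hypothetical. It uses only the standard library `multiprocessing` module.

```python
import multiprocessing


def force_spawn_start_method() -> None:
    """Make "spawn" the start method, even if something already set another one."""
    # allow_none=True avoids implicitly fixing the default as a side effect.
    if multiprocessing.get_start_method(allow_none=True) != "spawn":
        # force=True overrides a start method chosen earlier, for example by
        # an imported dependency.
        multiprocessing.set_start_method("spawn", force=True)
```

Calling this once before the suite runs means later assertions on the start method see "spawn" regardless of import order.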

@no0p no0p force-pushed the code-prose-composition-b branch 2 times, most recently from bed5fe8 to 9eae084 Compare February 28, 2025 21:20
Add a tagger that adds attributes for code-prose-other
composition of files based on line classifications.

Coverage for code-prose-composition tagger.

Improve error messages for spawn method checks.

Set multiprocessing start method in test setup

Set multiprocessing in test case with error handling.

Add before suite hook to set mp start method.

The default multiprocessing start method is "fork", which is not compatible
with runtime assertions that it is set to "spawn". When running unit tests, it's
possible to call an external library that sets the start method to "fork".
Here we enforce the start method to be "spawn" for all tests before executing.

linting.

Remove error log messages.
@no0p no0p force-pushed the code-prose-composition-b branch from 9eae084 to db8b744 Compare February 28, 2025 21:50
Additionally update commentary and word choice.
@no0p no0p requested review from soldni and Whattabatt February 28, 2025 22:14

@Whattabatt Whattabatt left a comment

LGTM once you see if you can make the attribute keys shorter

from ..core.registry import TaggerRegistry


@TaggerRegistry.add("code-prose-composition")
Other taggers use underscores instead of dashes


2 participants