Semantic search in the U.S. Code #136
base: main
Conversation
def count_tokens(text):
    """Count how many tokens ``text`` encodes to in the model's encoding."""
    # TODO: Find a way to use less memory when counting tokens in long texts.
Check warning (Code scanning / pylint): TODO: Find a way to use less memory when counting tokens in long texts.
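The function body isn't shown in this fragment. As a minimal sketch of what it might do, assuming the project counts tokens with tiktoken (the model name below is an illustrative assumption):

import tiktoken

_encoding = tiktoken.encoding_for_model('text-embedding-ada-002')  # assumed model

def count_tokens(text):
    """Count how many tokens ``text`` encodes to in the model's encoding."""
    # Encoding the whole text at once is simple but allocates the full
    # token list, which is what the TODO about memory use is about.
    return len(_encoding.encode(text))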
Force-pushed from b1ab039 to 64c0c87 (Compare)
The standard-library ElementTree outputs modified XML with all tag names qualified; lxml does not.
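To illustrate the difference, a minimal sketch (the sample document is an illustrative assumption, not from the USC data):

import io
import xml.etree.ElementTree as ET
import lxml.etree

doc = b'<root xmlns="http://example.com/ns"><child/></root>'

# Standard library: tags are re-qualified with generated prefixes.
print(ET.tostring(ET.parse(io.BytesIO(doc)).getroot()).decode())
# <ns0:root xmlns:ns0="http://example.com/ns"><ns0:child /></ns0:root>

# lxml: the original default namespace is preserved as written.
print(lxml.etree.tostring(lxml.etree.fromstring(doc)).decode())
# <root xmlns="http://example.com/ns"><child/></root>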
This uses Polars, but I might switch to pandas, which is already an indirect dependency of the project, produces nicer-looking tables by default (albeit partly by omitting explicit type names), and is more willing to sum Decimal values. I worry, though, that pandas may do that summing by converting to float64, sometimes losing precision in a way that is undesirable for monetary values. The code here is not complete or altogether working, but I'm committing it before trying out pandas, in case I switch to pandas but we want to revisit the Polars code later.
I had thought I might switch to pandas, but Polars seems to be working fine for it, so long as I compute the totals before adding the cost columns, which I should possibly be doing anyway.
+ Improve reporting style for "text" text. It appears that no "text" or "tail" text occurs in any element that get_embeddable_elements would split up in Title 42. We should probably provide a keyword argument to make it an error if that does happen, though, if we're not going to add logic to preserve such text when breaking up an element that has it. Such a "strict" mode should also apply to a too-big leaf, another condition we don't currently encounter but that would be skipped. This commit does *not* itself add any such strict mode. For this reason, one of the related FIXMEs is retained.
This modifies get_embeddable_elements to add a strict mode, enabled by default, which, after computing the entire breakdown, checks whether *any* of the three situations in which content would be lost has occurred. If so, it raises ValueError, reporting how many times each of the three general situations occurred. (The idea behind using ValueError is that all these situations occur, or not, based on the value of the input, rather than its type or the result of attempting to look something up.) This also upgrades the lost "text" text and lost "tail" text situations from warning to error in the logging done immediately when the problems are discovered, so all three have "error" severity.
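A minimal sketch of that pattern, with all names illustrative rather than the project's actual code: log each content-losing situation at error severity as it is tallied, then raise ValueError with the counts once the whole breakdown is done.

import logging

logger = logging.getLogger(__name__)

def check_breakdown(lost_text, lost_tail, big_leaves, *, strict=True):
    """Report content-losing situations, raising ValueError in strict mode."""
    counts = {
        'lost "text" text': lost_text,
        'lost "tail" text': lost_tail,
        'skipped too-big leaf': big_leaves,
    }
    for situation, count in counts.items():
        if count:
            logger.error('%s occurred %d time(s)', situation, count)
    if strict and any(counts.values()):
        raise ValueError('content lost: ' + ', '.join(
            f'{count} x {situation}' for situation, count in counts.items()))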
print('\n'.join(textwrap.wrap(display_text, width=width)))
# FIXME: Avoid breaking up elements like <em> that are not, in a conceptual |
Check warning (Code scanning / pylint): FIXME: Avoid breaking up elements like <em> that are not, in a conceptual.
This updates the USC archive version, reflected in usc.USC_STEM, adds automatic downloading, and re-runs all the notebooks (removing the old usc_token_counts.csv file so that it is regenerated). A new dependency (pooch) is added for auto-downloading. For now the stem is fixed (it remains the USC_STEM constant) and the download is from the uscode.house.gov URL rather than a mirror; both of those may change in the future.
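A minimal sketch of what auto-downloading with pooch can look like; the stem value, URL layout, and cache path below are illustrative assumptions, not the project's actual values.

import pooch

USC_STEM = 'xml_uscall'  # hypothetical value of usc.USC_STEM

archive_path = pooch.retrieve(
    url=f'https://uscode.house.gov/download/{USC_STEM}.zip',  # hypothetical URL layout
    known_hash=None,  # or a 'sha256:...' digest, to verify the download
    fname=f'{USC_STEM}.zip',
    path='data',  # hypothetical local cache directory
)

pooch downloads to a temporary file named like tmp* and renames it on completion, which is why the next change makes sure such files aren't tracked.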
+ Make sure we don't track the partially completed download, since it is downloaded as a file named like tmp* and renamed afterwards.
This keeps much of the previous refactoring, but it puts the code of _require_usc_archive back in extract_usc. That helper function didn't really do a single clear job or self-contained group of jobs, and I think it made the code harder rather than easier to understand.
This was from before we were using lxml.
This checks the namelist for paths that look like they could be used to perform a directory traversal attack. It is not likely to be more robust than the checks zipfile.ZipFile.extractall is already equipped with, but it treats these as *errors* and refuses to use the archive, rather than using the archive with the paths modified accordingly. It does this without excessive complexity by relying, to some extent, on knowledge of what can reasonably go in a USC archive (that is, it can afford to reject some archives that would be reasonable in other contexts).

It also covers the unintuitive issue that some paths on Windows are not regarded as absolute (at least by pathlib.Path) even though they can access locations outside the current directory without anything like ".." ever appearing as a path component. I *think* the ZipFile class's extract and extractall methods do cover this, but I'm not sure it strictly follows from the documentation that all conforming Python implementations would necessarily do so.

Another reason for this change is to be in line with the strong encouragement in the documentation to inspect archives before extracting them. The stakes here are low, because USC archives are unlikely ever to be totally untrusted, but I think the increase in code length and complexity is not excessive.
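A minimal sketch of the kind of check described, with the exact rules assumed rather than copied from the project. Note how PureWindowsPath exposes paths like 'C:file', which are not absolute yet can still escape the current directory on Windows.

from pathlib import PureWindowsPath
import zipfile

def check_namelist(archive_path):
    """Raise ValueError if any member path could escape the target directory."""
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            path = PureWindowsPath(name)  # also treats '\' as a separator
            if name.startswith(('/', '\\')) or '..' in path.parts or path.drive:
                raise ValueError(f'suspicious member path in archive: {name!r}')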
The new get_schema_prefix function provides the behavior of the old code, but the string formatting is also refactored so that the code is easier to read and verify as correct.
This extracts repeated ET.iterwalk logic from usc_42.ipynb to usc.py, as the new walk_tag function, modified to find the schema prefix from the root (rather than relying on having it as a global, as was done in the notebook). This also extracts get_direct_sections, having it use walk_tag to get its iterwalk iterator.
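A minimal sketch of what these helpers might look like, assuming (per the description) that the schema prefix is derived from the root element's tag and that walk_tag wraps lxml's iterwalk; the bodies are illustrative, not the project's actual code.

import lxml.etree

def get_schema_prefix(root):
    """Return the '{namespace-uri}' prefix of the root element's tag."""
    tag = root.tag
    return tag[:tag.index('}') + 1] if tag.startswith('{') else ''

def walk_tag(root, local_name, events=('end',)):
    """Iterate (event, element) pairs for elements with the given local name."""
    return lxml.etree.iterwalk(
        root, events=events, tag=get_schema_prefix(root) + local_name)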
This deletes the cell in usc_42.ipynb that introduced notes_tag, because the code that used it instead uses facilities in usc.py that compute it as needed. In contrast, there is a small amount of use of section_tag in usc_42.ipynb, so schema_prefix and section_tag remain there.
One of the changes I made in #198, removing the notes_tag global variable from usc_42.ipynb, was incorrect, because there was still one place in that notebook that used it. I'm not sure how I missed that; maybe I failed to rerun the notebook after that change, since the following change was outside the notebook. In any case, while it will eventually be removable from the notebook, that was not (and is not) yet the case. This puts it back so the notebook works again. This reverts commit 2cf0a03.
Ordinarily I would have included this in the commit it tests, rather than having it as a separate commit. In this case, though, I want to make it especially easy to check the diffs, to avoid a repeat of the same kind of mistake I made in prematurely removing the notes_tag global variable before.
* Rename CONTEXT_LENGTH to TOKEN_LIMIT

In a completion model, the total number of tokens the model can work with is nearly always called its context length. But in an embedding model, the terminology is less uniform. Since the phrase "context length" may be unclear to some developers, and furthermore since it is less intuitive in the setting of text embeddings, where it is less often meaningful to regard the input (or some part of it) as "context," I've renamed embed.CONTEXT_LENGTH to embed.TOKEN_LIMIT. All uses in the project are updated. It's possible that a third, even clearer name will be found in the future; I do not claim that TOKEN_LIMIT is necessarily the best name for this.

This also expands the TOKEN_LIMIT docstring to explain specifically how it applies to calls to both kinds of our embedding functions.

* Re-run usc_42.ipynb to test rename

Unlike in #203, where the second commit should possibly be included in the target branch when merging, in this case this commit can definitely be squashed into the one before it with no disadvantage.
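A minimal sketch of the renamed constant with a docstring of the kind described; the value and wording are assumptions (8191 is the documented limit of OpenAI's text-embedding-ada-002, which may not be the model used here).

TOKEN_LIMIT = 8191  # assumed value, per text-embedding-ada-002
"""
The maximum number of tokens an input to be embedded may encode to.

For a function embedding a single text, this bounds that text; for a
function embedding many texts, it bounds each text individually (an
assumed reading of "both kinds of our embedding functions").
"""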
class TestStats(_bases.TestBase):
    """Tests for model dimension and token encoding facilities in ``embed``."""

# TODO: Decide if we should really have test_model_dimension_is_1536 and
Check warning (Code scanning / pylint): TODO: Decide if we should really have test_model_dimension_is_1536 and.
Whatever automated tests we make of the USC demo should probably do the same. Note that this does not change any defaults: code in the embed.demos.usc module behaves as before if data_dir is not passed. It only modifies the callers (currently all in notebooks) to specify the usc subdirectory of the repo's data directory.
Co-authored-by: Eliah Kagan <[email protected]>
Also run a simple semantic search query.

Co-authored-by: Eliah Kagan <[email protected]>
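A minimal sketch of a simple semantic search query like the one this commit mentions: embed the query, then rank sections by similarity. The embed_one function, the texts, and the embeddings matrix are assumed inputs, not the project's actual API.

import numpy as np

def search(query, section_texts, section_embeddings, embed_one, top_n=5):
    """Return the top_n (text, similarity) pairs for the query."""
    query_embedding = np.asarray(embed_one(query))
    # If embeddings are unit length (true of OpenAI's, an assumption
    # here), the dot product equals the cosine similarity.
    similarities = section_embeddings @ query_embedding
    best = np.argsort(similarities)[::-1][:top_n]
    return [(section_texts[i], float(similarities[i])) for i in best]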