
Semantic search in the U.S. Code #136

Draft · wants to merge 117 commits into main
Conversation

dmvassallo (Owner)

No description provided.


def count_tokens(text):
"""Count how many tokens ``text`` is, in the model's encoding."""
# TODO: Find a way to use less memory when counting tokens in long texts.

Check warning (Code scanning / pylint):
TODO: Find a way to use less memory when counting tokens in long texts.
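For reference, a minimal sketch of what count_tokens could look
like with tiktoken; the encoding name is an assumption (cl100k_base
is what OpenAI's ada-002 models use), not necessarily what embed
does, and the comment spells out where the memory goes:

import tiktoken

_ENCODING = tiktoken.get_encoding('cl100k_base')


def count_tokens(text):
    """Count how many tokens ``text`` is, in the model's encoding."""
    # encode() materializes the entire token list in memory just so
    # we can take its length, which is the cost the TODO refers to.
    return len(_ENCODING.encode(text))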
dmvassallo and others added 19 commits June 23, 2023 22:13
The standard-library ElementTree outputs modified XML with all tag
names qualified; lxml does not.
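A minimal illustration of the difference (assuming a document with a
default namespace; this is not code from the PR):

import io
import lxml.etree
import xml.etree.ElementTree

XML = b'<root xmlns="http://example.com/ns"><child/></root>'

# Standard library: re-serialized tags come back qualified with a
# generated prefix.
root = xml.etree.ElementTree.parse(io.BytesIO(XML)).getroot()
print(xml.etree.ElementTree.tostring(root))
# b'<ns0:root xmlns:ns0="http://example.com/ns"><ns0:child /></ns0:root>'

# lxml: the default namespace is preserved and tags stay unqualified.
root = lxml.etree.parse(io.BytesIO(XML)).getroot()
print(lxml.etree.tostring(root))
# b'<root xmlns="http://example.com/ns"><child/></root>'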
This uses Polars, but I might switch to pandas, which is already an
indirect dependency of the project. pandas produces nicer-looking
tables by default (albeit in part by omitting explicit type names)
and is more willing to sum Decimal values, though I worry it may do
so by converting to float64 (and thus sometimes losing precision in
a way that is undesirable for monetary values).

The code here is not complete or altogether working, but I'm
committing it before trying out pandas, in case I switch to pandas
but we want to revisit the Polars code later.
I had thought I might switch to pandas, but Polars seems to be
working fine for this, so long as I compute the totals before
adding the cost columns, which I should possibly be doing anyway
(see the sketch below).
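A hypothetical sketch of that ordering; the column names and price
are illustrative, not taken from the actual usc.py code:

import polars as pl

PRICE_PER_1K_TOKENS = 0.0001  # illustrative rate only

df = pl.DataFrame({'title': ['Title 1', 'Title 2'],
                   'tokens': [1200, 3400]})

# Total the plain integer token counts *before* adding cost columns.
total_tokens = df['tokens'].sum()

df = df.with_columns(
    (pl.col('tokens') * PRICE_PER_1K_TOKENS / 1000).alias('cost'))

print(df)
print('total tokens:', total_tokens)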
+ Improve reporting style for "text" text.

It appears that no "text" or "tail" text occurs in any element
get_embeddable_elements would split up, in Title 42.

If we're not going to add logic to preserve such text when breaking
up an element that has it, though, we should probably provide a
keyword argument to make it an error if it does happen. Such a
"strict" mode should also apply to a too-big leaf, another
condition we don't currently encounter but that would otherwise be
skipped.

This commit does *not* itself add any such strict mode. For this
reason, one of the related FIXMEs is retained.
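For anyone unfamiliar with the "text"/"tail" terminology, a quick
illustration (not code from this PR):

import lxml.etree

p = lxml.etree.fromstring('<p>intro <em>word</em> trailing</p>')
em = p[0]
print(repr(p.text))   # 'intro '     -- the "text" text of <p>
print(repr(em.tail))  # ' trailing'  -- the "tail" text of <em>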
This modifies get_embeddable_elements to add a strict mode, which
is enabled by default and which, after computing the entire
breakdown, checks if *any* of the three situations in which content
would be lost has occurred. If so, it raises ValueError, reporting
how many times each of the three general situations occurred. (The
rationale for ValueError is that these situations occur, or not,
based on the value of the input, not its type or the result of
attempting to look something up, etc.)

This also upgrades the lost "text" text and lost "tail" text
situations from warning to error in the logging done immediately
when the problems are discovered, so all three have "error"
severity.
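A minimal sketch of the shape of such a check; the counter names
are hypothetical, not the actual get_embeddable_elements code:

def _check_strict(lost_text, lost_tail, oversized_leaves):
    """Raise ValueError if any content-losing situation occurred."""
    if lost_text or lost_tail or oversized_leaves:
        raise ValueError(
            f'would lose content: {lost_text} "text" texts, '
            f'{lost_tail} "tail" texts, '
            f'{oversized_leaves} too-big leaves')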
print('\n'.join(textwrap.wrap(display_text, width=width)))


# FIXME: Avoid breaking up elements like <em> that are not, in a conceptual

Check warning (Code scanning / pylint):
FIXME: Avoid breaking up elements like <em> that are not, in a conceptual.
This updates the USC archive version, reflected in usc.USC_STEM,
adds automatic downloading, and re-runs all the notebooks (removing
the old usc_token_counts.csv file so that it is regenerated).

A new dependency (pooch) is added for auto-downloading. For now the
stem is fixed (it remains the USC_STEM constant) and the download
is from the uscode.house.gov URL rather than a mirror; those may
both change in the future.
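A hypothetical sketch of pooch-based downloading; the URL is
abbreviated, the stem placeholder stands in for the real usc.USC_STEM
value, and skipping hash pinning is a simplification, not necessarily
how usc.py does it:

import pooch

USC_STEM = '...'  # stands in for the real usc.USC_STEM value

def download_usc_archive(data_dir):
    """Fetch the USC zip archive into data_dir, skipping if cached."""
    return pooch.retrieve(
        url=f'https://uscode.house.gov/download/.../{USC_STEM}.zip',
        known_hash=None,  # a real call should pin a known hash
        fname=f'{USC_STEM}.zip',
        path=data_dir,
    )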
+ Make sure we don't track the partially completed download, since
  it is downloaded as a file named like tmp* and renamed
  afterwards.
This keeps much of the previous refactoring, but it puts the code
of _require_usc_archive back in extract_usc. That helper function
didn't really do a single clear job or self-contained group of
jobs, and I think it made the code harder rather than easier to
understand.
This was from before we were using lxml.
This checks the namelist for paths that look like they may be
usable to perform a directory traversal attack. It is not likely to
be more robust than the checks zipfile.ZipFile.extractall is
already equipped with, but it treats these as *errors* and refuses
to use the archive, rather than using the archive but modifying the
paths accordingly. It does this without excessive complexity by
relying, to some extent, on knowledge about what can reasonably
go in a USC archive (that is, it can afford to reject some
archives that would be reasonable in other contexts).

It also covers the unintuitive issue that some paths on Windows are
not regarded as absolute (at least by pathlib.Path) even though
they can reach outside the current directory without anything like
".." ever appearing as a path component. I *think* the ZipFile
class's extract and extractall methods do cover this, but I'm not
sure it strictly follows from the documentation that all conforming
Python implementations would necessarily do so.

Another reason for this change is to be in line with the strong
encouragement in the documentation to inspect archives before
extracting them. The stakes here are low, because USC archives are
unlikely ever to be totally untrusted, but I think the increase in
code length and complexity is not excessive.
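A hypothetical sketch of that kind of namelist inspection (the real
check in usc.py may differ):

import pathlib
import zipfile

def check_namelist(archive: zipfile.ZipFile) -> None:
    """Refuse archives whose member paths could escape the target dir."""
    for name in archive.namelist():
        # Interpret each name under Windows rules, the stricter case:
        # paths like 'C:foo' or '\\foo' have a drive or a root without
        # being "absolute" according to pathlib.
        path = pathlib.PureWindowsPath(name)
        if path.drive or path.root or '..' in path.parts:
            raise ValueError(f'suspicious member path: {name!r}')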
@EliahKagan mentioned this pull request on Jun 24, 2023.
EliahKagan and others added 10 commits June 24, 2023 16:18
The new get_schema_prefix function provides the behavior of the old
code, but it is refactored to format the string in a way that makes
the code easier to read and to verify as correct.
This extracts repeated ET.iterwalk logic from usc_42.ipynb to
usc.py, as the new walk_tag function, modified to find the schema
prefix from the root (rather than relying on having it as a global,
as was done in the notebook).

This also extracts get_direct_sections, having it use walk_tag to
get its iterwalk iterator.
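A sketch of what a walk_tag-style helper could look like; the
actual signature in usc.py may differ:

import lxml.etree

def walk_tag(root, local_name):
    """Yield elements matching local_name, qualified like the root."""
    # Derive the schema prefix from the root element itself rather
    # than relying on a global.
    namespace = lxml.etree.QName(root).namespace
    tag = f'{{{namespace}}}{local_name}' if namespace else local_name
    for _event, element in lxml.etree.iterwalk(root, tag=tag):
        yield element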
This deletes the cell in usc_42.ipynb that introduced notes_tag,
because the code that used it instead uses facilities in usc.py
that compute it as needed.

In contrast, there is a small amount of use of section_tag in
usc_42.ipynb, so schema_prefix and section_tag remain there.
One of the changes I made in #198, removing the notes_tag global
variable from usc_42.ipynb, was incorrect, because there was still
one place in that notebook that used it. I'm not sure how I missed
that; maybe I failed to rerun the notebook after that change, since
the following change was outside the notebook. In any case, while
it will eventually be removable from the notebook, that was not
(and is not) yet the case. This puts it back so the notebook works
again.

This reverts commit 2cf0a03.
Ordinarily I would have included this in the commit it tests,
rather than having it as a separate commit. In this case, though,
I want to make it especially easy to check the diffs, to avoid a
repeat of the same kind of mistake I made in prematurely removing
the notes_tag global variable before.
* Rename CONTEXT_LENGTH to TOKEN_LIMIT

In a completion model, the total number of tokens the model can
work with is nearly always called its context length. But in an
embedding model, the terminology is less uniform. Since the phrase
"context length" may be unclear to some developers, and furthermore
since it is less intuitive in the setting of text embeddings where
it is less often meaningful to regard the input (or some part of
it) as "context," I've renamed embed.CONTEXT_LENGTH to
embed.TOKEN_LIMIT.

All uses in the project are updated. It's possible that a third,
even clearer name will be found in the future; I do not claim that
TOKEN_LIMIT is necessarily the best name for this.

This also expands the TOKEN_LIMIT docstring to explain specifically
how it applies to calls to both kinds of our embedding functions; a
sketch of what that could look like appears after this message.

* Re-run usc_42.ipynb to test rename

Unlike in #203, where the second commit should possibly be included
in the target branch when merging, in this case this commit can
definitely be squashed into the one before it with no disadvantage.
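A hypothetical sketch of the renamed constant with an expanded
docstring; the value shown (8191, the documented limit for
text-embedding-ada-002) and the wording are illustrative, not
necessarily what embed contains:

TOKEN_LIMIT = 8191
"""
The maximum number of tokens per input text.

Each text passed to either kind of embedding function, whether
embedding one text or a batch of texts, must encode to at most this
many tokens in the model's encoding.
"""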
class TestStats(_bases.TestBase):
"""Tests for model dimension and token encoding facilites in ``embed``."""

# TODO: Decide if we should really have test_model_dimension_is_1536 and

Check warning (Code scanning / pylint):
TODO: Decide if we should really have test_model_dimension_is_1536 and.
EliahKagan and others added 6 commits June 29, 2023 18:59
Whatever automated tests we make of the USC demo should probably do
the same. Note that this does not change any defaults for what code
in the embed.demos.usc module does if data_dir is not passed. It
only modifies the callers (currently all in notebooks) to specify
the usc subdirectory of the repo's data directory.
Also run a simple semantic search query.

Co-authored-by: Eliah Kagan <[email protected]>
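A minimal sketch of such a query, under the assumption (true for
OpenAI embeddings) that the vectors are unit-length, so a plain dot
product is cosine similarity; the names are illustrative, not the
notebook's actual code:

import numpy as np

def search(query_embedding, section_embeddings, k=5):
    """Return indices of the k sections most similar to the query."""
    # section_embeddings is a 2-D array, one embedding row per section.
    similarities = section_embeddings @ query_embedding
    return np.argsort(similarities)[::-1][:k]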