Semantic search in the U.S. Code #136
base: main
Conversation
def count_tokens(text):
    """Count how many tokens ``text`` encodes to in the model's encoding."""
    # TODO: Find a way to use less memory when counting tokens in long texts.
Check warning (Code scanning / pylint): TODO: Find a way to use less memory when counting tokens in long texts.
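The function body isn't shown in this fragment. As a minimal sketch of what it might do, assuming the project counts tokens with tiktoken (the model name below is an illustrative assumption):

import tiktoken

_encoding = tiktoken.encoding_for_model('text-embedding-ada-002')  # assumed model

def count_tokens(text):
    """Count how many tokens ``text`` encodes to in the model's encoding."""
    # Encoding the whole text at once is simple but allocates the full
    # token list, which is what the TODO about memory use is about.
    return len(_encoding.encode(text))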
Force-pushed from b1ab039 to 64c0c87 (Compare)
The standard-library ElementTree outputs modified XML with all tag names qualified; lxml does not.
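To illustrate the difference, a minimal sketch (the sample document is an illustrative assumption, not from the USC data):

import io
import xml.etree.ElementTree as ET
import lxml.etree

doc = b'<root xmlns="http://example.com/ns"><child/></root>'

# Standard library: tags are re-qualified with generated prefixes.
print(ET.tostring(ET.parse(io.BytesIO(doc)).getroot()).decode())
# <ns0:root xmlns:ns0="http://example.com/ns"><ns0:child /></ns0:root>

# lxml: the original default namespace is preserved as written.
print(lxml.etree.tostring(lxml.etree.fromstring(doc)).decode())
# <root xmlns="http://example.com/ns"><child/></root>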
This uses Polars, but I might switch to pandas, which is already an indirect dependency of the project, produces nicer-looking tables by default (albeit partly by omitting explicit type names), and is more willing to sum Decimal values. I worry, though, that pandas may do that summing by converting to float64, sometimes losing precision in a way that is undesirable for monetary values. The code here is not complete or altogether working, but I'm committing it before trying out pandas, in case I switch to pandas but we want to revisit the Polars code later.
I had thought I might switch to pandas, but Polars seems to be working fine for it, so long as I compute the totals before adding the cost columns, which I should possibly be doing anyway.
+ Improve reporting style for "text" text. It appears that no "text" or "tail" text occurs in any element that get_embeddable_elements would split up in Title 42. We should probably provide a keyword argument to make it an error if that does happen, though, if we're not going to add logic to preserve such text when breaking up an element that has it. Such a "strict" mode should also apply to a too-big leaf, another condition we don't currently encounter but that would be skipped. This commit does *not* itself add any such strict mode. For this reason, one of the related FIXMEs is retained.
This modifies get_embeddable_elements to add a strict mode, enabled by default, which, after computing the entire breakdown, checks whether *any* of the three situations in which content would be lost has occurred. If so, it raises ValueError, reporting how many times each of the three general situations occurred. (The idea behind using ValueError is that all these situations occur, or not, based on the value of the input, rather than its type or the result of attempting to look something up.) This also upgrades the lost "text" text and lost "tail" text situations from warning to error in the logging done immediately when the problems are discovered, so all three have "error" severity.
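A minimal sketch of that pattern, with all names illustrative rather than the project's actual code: log each content-losing situation at error severity as it is tallied, then raise ValueError with the counts once the whole breakdown is done.

import logging

logger = logging.getLogger(__name__)

def check_breakdown(lost_text, lost_tail, big_leaves, *, strict=True):
    """Report content-losing situations, raising ValueError in strict mode."""
    counts = {
        'lost "text" text': lost_text,
        'lost "tail" text': lost_tail,
        'skipped too-big leaf': big_leaves,
    }
    for situation, count in counts.items():
        if count:
            logger.error('%s occurred %d time(s)', situation, count)
    if strict and any(counts.values()):
        raise ValueError('content lost: ' + ', '.join(
            f'{count} x {situation}' for situation, count in counts.items()))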
print('\n'.join(textwrap.wrap(display_text, width=width)))
# FIXME: Avoid breaking up elements like <em> that are not, in a conceptual |
Check warning (Code scanning / pylint): FIXME: Avoid breaking up elements like <em> that are not, in a conceptual.
This updates the USC archive version, reflected in usc.USC_STEM, adds automatic downloading, and re-runs all the notebooks (removing the old usc_token_counts.csv file so that it is regenerated). A new dependency (pooch) is added for auto-downloading. For now the stem is fixed (it remains the USC_STEM constant) and the download is from the uscode.house.gov URL rather than a mirror; both of those may change in the future.
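A minimal sketch of what auto-downloading with pooch can look like; the stem value, URL layout, and cache path below are illustrative assumptions, not the project's actual values.

import pooch

USC_STEM = 'xml_uscall'  # hypothetical value of usc.USC_STEM

archive_path = pooch.retrieve(
    url=f'https://uscode.house.gov/download/{USC_STEM}.zip',  # hypothetical URL layout
    known_hash=None,  # or a 'sha256:...' digest, to verify the download
    fname=f'{USC_STEM}.zip',
    path='data',  # hypothetical local cache directory
)

pooch downloads to a temporary file named like tmp* and renames it on completion, which is why the next change makes sure such files aren't tracked.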
+ Make sure we don't track the partially completed download, since it is downloaded as a file named like tmp* and renamed afterwards.
This keeps much of the previous refactoring, but it puts the code of _require_usc_archive back in extract_usc. That helper function didn't really do a single clear job or self-contained group of jobs, and I think it made the code harder rather than easier to understand.
This was from before we were using lxml.
This checks the namelist for paths that look like they could be used to perform a directory traversal attack. It is not likely to be more robust than the checks zipfile.ZipFile.extractall is already equipped with, but it treats these as *errors* and refuses to use the archive, rather than using the archive with the paths modified accordingly. It does this without excessive complexity by relying, to some extent, on knowledge of what can reasonably go in a USC archive (that is, it can afford to reject some archives that would be reasonable in other contexts).

It also covers the unintuitive issue that some paths on Windows are not regarded as absolute (at least by pathlib.Path) even though they can access locations outside the current directory without anything like ".." ever appearing as a path component. I *think* the ZipFile class's extract and extractall methods do cover this, but I'm not sure it strictly follows from the documentation that all conforming Python implementations would necessarily do so.

Another reason for this change is to be in line with the strong encouragement in the documentation to inspect archives before extracting them. The stakes here are low, because USC archives are unlikely ever to be totally untrusted, but I think the increase in code length and complexity is not excessive.
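A minimal sketch of the kind of check described, with the exact rules assumed rather than copied from the project. Note how PureWindowsPath exposes paths like 'C:file', which are not absolute yet can still escape the current directory on Windows.

from pathlib import PureWindowsPath
import zipfile

def check_namelist(archive_path):
    """Raise ValueError if any member path could escape the target directory."""
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            path = PureWindowsPath(name)  # also treats '\' as a separator
            if name.startswith(('/', '\\')) or '..' in path.parts or path.drive:
                raise ValueError(f'suspicious member path in archive: {name!r}')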
The new get_schema_prefix function provides the behavior of the old code, but the string formatting is also refactored so that the code is easier to read and verify as correct.
This extracts repeated ET.iterwalk logic from usc_42.ipynb to usc.py, as the new walk_tag function, modified to find the schema prefix from the root (rather than relying on having it as a global, as was done in the notebook). This also extracts get_direct_sections, having it use walk_tag to get its iterwalk iterator.
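A minimal sketch of what these helpers might look like, assuming (per the description) that the schema prefix is derived from the root element's tag and that walk_tag wraps lxml's iterwalk; the bodies are illustrative, not the project's actual code.

import lxml.etree

def get_schema_prefix(root):
    """Return the '{namespace-uri}' prefix of the root element's tag."""
    tag = root.tag
    return tag[:tag.index('}') + 1] if tag.startswith('{') else ''

def walk_tag(root, local_name, events=('end',)):
    """Iterate (event, element) pairs for elements with the given local name."""
    return lxml.etree.iterwalk(
        root, events=events, tag=get_schema_prefix(root) + local_name)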
This deletes the cell in usc_42.ipynb that introduced notes_tag, because the code that used it instead uses facilities in usc.py that compute it as needed. In contrast, there is a small amount of use of section_tag in usc_42.ipynb, so schema_prefix and section_tag remain there.
One of the changes I made in #198, removing the notes_tag global variable from usc_42.ipynb, was incorrect, because there was still one place in that notebook that used it. I'm not sure how I missed that; maybe I failed to rerun the notebook after that change, since the following change was outside the notebook. In any case, while it will eventually be removable from the notebook, that was not (and is not) yet the case. This puts it back so the notebook works again. This reverts commit 2cf0a03.
Ordinarily I would have included this in the commit it tests, rather than having it as a separate commit. In this case, though, I want to make it especially easy to check the diffs, to avoid a repeat of the same kind of mistake I made in prematurely removing the notes_tag global variable before.
* Rename CONTEXT_LENGTH to TOKEN_LIMIT

In a completion model, the total number of tokens the model can work with is nearly always called its context length. But in an embedding model, the terminology is less uniform. Since the phrase "context length" may be unclear to some developers, and furthermore since it is less intuitive in the setting of text embeddings, where it is less often meaningful to regard the input (or some part of it) as "context," I've renamed embed.CONTEXT_LENGTH to embed.TOKEN_LIMIT. All uses in the project are updated. It's possible that a third, even clearer name will be found in the future; I do not claim that TOKEN_LIMIT is necessarily the best name for this.

This also expands the TOKEN_LIMIT docstring to explain specifically how it applies to calls to both kinds of our embedding functions.

* Re-run usc_42.ipynb to test rename

Unlike in #203, where the second commit should possibly be included in the target branch when merging, in this case this commit can definitely be squashed into the one before it with no disadvantage.
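A minimal sketch of the renamed constant with a docstring of the kind described; the value and wording are assumptions (8191 is the documented limit of OpenAI's text-embedding-ada-002, which may not be the model used here).

TOKEN_LIMIT = 8191  # assumed value, per text-embedding-ada-002
"""
The maximum number of tokens an input to be embedded may encode to.

For a function embedding a single text, this bounds that text; for a
function embedding many texts, it bounds each text individually (an
assumed reading of "both kinds of our embedding functions").
"""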
class TestStats(_bases.TestBase):
    """Tests for model dimension and token encoding facilities in ``embed``."""

# TODO: Decide if we should really have test_model_dimension_is_1536 and
Check warning (Code scanning / pylint): TODO: Decide if we should really have test_model_dimension_is_1536 and.
Whatever automated tests we make of the USC demo should probably do the same. Note that this does not change any defaults: code in the embed.demos.usc module behaves as before if data_dir is not passed. It only modifies the callers (currently all in notebooks) to specify the usc subdirectory of the repo's data directory.
Co-authored-by: Eliah Kagan <[email protected]>
Also run a simple semantic search query.

Co-authored-by: Eliah Kagan <[email protected]>
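A minimal sketch of a simple semantic search query like the one this commit mentions: embed the query, then rank sections by similarity. The embed_one function, the texts, and the embeddings matrix are assumed inputs, not the project's actual API.

import numpy as np

def search(query, section_texts, section_embeddings, embed_one, top_n=5):
    """Return the top_n (text, similarity) pairs for the query."""
    query_embedding = np.asarray(embed_one(query))
    # If embeddings are unit length (true of OpenAI's, an assumption
    # here), the dot product equals the cosine similarity.
    similarities = section_embeddings @ query_embedding
    best = np.argsort(similarities)[::-1][:top_n]
    return [(section_texts[i], float(similarities[i])) for i in best]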