More clear labelling of chunks and paragraphs when providing as evidence to LLM #3250
To try to encourage more accurate and specific referents and citations.
Every chunk will now have a metadata line such as "ORAL HISTORY ID: OH0627".
Every paragraph will begin with "[OH0627|P32] SPEAKER:" -- additional labelling of OH id and paragraph number on every paragraph; and every paragraph has a speaker label if we can provide one, not just assumed from prior paragraph.
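As a rough illustration of the labelling described above, here is a minimal Python sketch. It is not the code in this PR: the `Paragraph` class and the `format_chunk` function are hypothetical names made up for the example, and the exact whitespace and layout of the real chunk text may differ. Only the "ORAL HISTORY ID: ..." metadata line and the "[OH0627|P32] SPEAKER:" prefix come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paragraph:
    """Hypothetical stand-in for one transcript paragraph."""
    number: int               # paragraph number within the oral history
    text: str
    speaker: Optional[str]    # None when no speaker label can be provided


def format_chunk(oh_id: str, paragraphs: List[Paragraph]) -> str:
    """Render a chunk as text to hand to the LLM as evidence.

    Assumed layout: one metadata line with the oral history ID, a blank
    line, then one line per paragraph prefixed with "[OH_ID|P<number>]"
    and, when known, the speaker label.
    """
    lines = [f"ORAL HISTORY ID: {oh_id}", ""]
    for para in paragraphs:
        label = f"[{oh_id}|P{para.number}]"
        if para.speaker:
            lines.append(f"{label} {para.speaker}: {para.text}")
        else:
            # No speaker available: still label the paragraph itself.
            lines.append(f"{label} {para.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    chunk = format_chunk(
        "OH0627",
        [
            Paragraph(32, "I started in the lab that summer.", "SMITH"),
            Paragraph(33, "And who hired you?", "INTERVIEWER"),
        ],
    )
    print(chunk)
```

Because every paragraph carries its own ID and number, the LLM can cite a specific paragraph without relying on its position in the chunk.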
After merge, need to re-create all chunks on staging (which will cost us maybe $20), to have properly labelled speaker paragraphs stored in chunk records.
Bah, nevermind -- right now, deleting/regenerating chunks will cause all bookmarked questions to disappear, and we don't want to do that without thinking it through and, ideally, having a solution.
So this PR will include new chunk generation code, to ensure all paragraphs have a speaker label, but the chunks on staging won't necessarily reflect it yet. Any new chunks we make in future will. Other improvements in this PR will be immediate.