@jrochkind jrochkind commented Jan 7, 2026

The goal is to encourage more accurate and specific referents and citations.

Every chunk will now have a metadata line such as "ORAL HISTORY ID: OH0627".

Every paragraph will begin with "[OH0627|P32] SPEAKER:" -- that is, the OH ID and paragraph number are added as a label on every paragraph. Every paragraph also gets a speaker label if we can provide one, rather than leaving the speaker to be assumed from the prior paragraph.
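As a rough illustration of the labelling format described above, here is a minimal sketch. The method names `metadata_line` and `labelled_paragraph` are hypothetical and are not taken from this PR's code; the sample text is made up.

```ruby
# Hypothetical helpers illustrating the label format only; the names and
# signatures are assumptions, not the PR's actual implementation.

# Builds the per-chunk metadata line, e.g. "ORAL HISTORY ID: OH0627".
def metadata_line(oh_id)
  "ORAL HISTORY ID: #{oh_id}"
end

# Builds one labelled paragraph, e.g. "[OH0627|P32] SPEAKER: text".
# The speaker label is included only when one is available.
def labelled_paragraph(oh_id:, paragraph_number:, text:, speaker: nil)
  label = "[#{oh_id}|P#{paragraph_number}]"
  label += " #{speaker}:" unless speaker.to_s.empty?
  "#{label} #{text}"
end

puts metadata_line("OH0627")
puts labelled_paragraph(oh_id: "OH0627", paragraph_number: 32,
                        speaker: "SPEAKER", text: "Example paragraph text.")
```

Repeating the OH ID on every paragraph is redundant with the metadata line, but it means each paragraph carries everything needed for a citation on its own.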

  • Move chunk formatting for the LLM into an ERB template
  • Add OH numbers to the chunk metadata provided to the LLM
  • In chunks sent to the LLM, add the OH ID and paragraph number before every paragraph, to try to improve the correctness and specificity of citations
  • When creating Chunk records, make sure every paragraph in the text begins with a speaker label, carrying it over from the previous paragraph if needed
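The changes listed above could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the PR's code: the `Paragraph` struct, `with_carried_speakers`, `CHUNK_TEMPLATE`, and `render_chunk` are all hypothetical names, and the real Chunk model's fields are not shown in this description. In the PR the speaker labels are filled in when Chunk records are created and the formatting happens later when chunks are sent to the LLM; the sketch runs both steps together only for brevity.

```ruby
require "erb"

# Hypothetical paragraph structure (an assumption for illustration).
Paragraph = Struct.new(:number, :speaker, :text, keyword_init: true)

# Carry the most recent speaker label forward onto paragraphs that lack one.
# Paragraphs before any known speaker keep a nil speaker.
def with_carried_speakers(paragraphs)
  last_speaker = nil
  paragraphs.map do |para|
    last_speaker = para.speaker if para.speaker
    Paragraph.new(number: para.number, speaker: last_speaker, text: para.text)
  end
end

# Single-quoted heredoc so Ruby does not interpolate "#{...}" before ERB runs.
CHUNK_TEMPLATE = ERB.new(<<~'TEMPLATE', trim_mode: "-")
  ORAL HISTORY ID: <%= oh_id %>

  <%- paragraphs.each do |para| -%>
  [<%= oh_id %>|P<%= para.number %>]<%= para.speaker ? " #{para.speaker}:" : "" %> <%= para.text %>
  <%- end -%>
TEMPLATE

# Render one chunk: a metadata line followed by labelled paragraphs.
def render_chunk(oh_id:, paragraphs:)
  CHUNK_TEMPLATE.result_with_hash(
    oh_id: oh_id,
    paragraphs: with_carried_speakers(paragraphs)
  )
end

# Made-up sample data; the second paragraph has no speaker of its own.
sample = [
  Paragraph.new(number: 1, speaker: "SMITH", text: "First paragraph."),
  Paragraph.new(number: 2, speaker: nil, text: "Second paragraph.")
]
puts render_chunk(oh_id: "OH0627", paragraphs: sample)
```

Keeping the layout in a template separates the prompt format from the code that assembles the data, so the format can be adjusted without touching chunk generation.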

After merge, we need to re-create all chunks on staging (which will cost us maybe $20), to have properly labelled speaker paragraphs stored in chunk records.

Bah, never mind -- right now, deleting/regenerating chunks will cause all bookmarked questions to disappear. We don't want to do that without thinking it through and, ideally, having a solution.

So this PR will include new chunk generation code, to ensure all paragraphs have a speaker label, but the chunks on staging won't necessarily reflect it yet. Any new chunks we make in future will. Other improvements in this PR will be immediate.

@jrochkind jrochkind marked this pull request as ready for review January 7, 2026 22:46

@jrochkind jrochkind merged commit 622e1b2 into master Jan 12, 2026
1 check passed
@jrochkind jrochkind deleted the better_chunk_provision branch January 12, 2026 19:51