Skip to content

fix(techmeme): emit dated, valid JSON from search#1440

Open
mvanhorn wants to merge 4 commits into
mainfrom
fix/techmeme-search-json-dates
Open

fix(techmeme): emit dated, valid JSON from search#1440
mvanhorn wants to merge 4 commits into
mainfrom
fix/techmeme-search-json-dates

Conversation

@mvanhorn

@mvanhorn mvanhorn commented Jul 4, 2026

Copy link
Copy Markdown
Owner

What

techmeme-pp-cli search scraped every anchor on Techmeme's archive-search results page and broke its own JSON contract. This PR makes the search surface honest and agent-safe:

  • Valid JSON always: --json/--agent emit a valid JSON array for every successful invocation, including [] on zero hits (previously: No results for "q" prose on stdout, exit 0 - a nil Go slice also marshals to null, so the empty slice is initialized explicitly).
  • Per-record dates: each record now carries date (ISO YYYY-MM-DD, "" when unparseable), parsed from the results page's per-item iinf blocks with the same multi-layout approach as parseRSSDate/parseRiverTimestamp. Human table gains a DATE column.
  • Noise filtering: structural anchor+date pairing bounded to the results canvas (cut at prevnext) - bare publication anchors ("Wall Street Journal") and post-results promos no longer become records.
  • --days N: client-side recency filter at UTC day granularity (undated records drop when active).
  • Markup-shift guard: stderr warning when anchors and iinf blocks fail to pair in either direction.
  • --agent actually works now: the implied --compact stripped every search field to {} via the shared keepFields allow-list; search now defaults an explicit field select when compacting (shared map untouched).
  • MCP reachability: search was hardcoded into the cobra-tree frameworkCommands skip-list on the premise that the typed techmeme-search_rss tool covers it - but that tool returns the raw HTML of the same endpoint. search is now registered as a shell-out MCP tool.

Recorded in .printing-press-patches/search-json-dates-contract.json as a reprint-guard; contributor attribution added on all three surfaces. No release-owned files touched.

Why (reproduction)

Observed live in a last30days engine run on 2026-07-04 ("Kanye West"): two subqueries died with JSON decode failed: Expecting value: line 1 column 1 (char 0) (zero-hit prose), and the one that returned data delivered Dec 2022 Parler-saga headlines indistinguishable from current news (no dates). Raw page: each result is publication anchor + headline anchor + iinf date; the old single-regex sweep captured all anchors and no dates.

Validation

Check Result
go build ./... / go vet ./... PASS
go test ./... PASS (25 in internal/cli, 104 module-wide)
govulncheck ./... No vulnerabilities found
verify_skill.py --dir library/productivity/techmeme/ PASS (also fixes a pre-existing export positional-arg failure in README)
Live: search "kanye west" --json 10 dated story records, entities decoded, no publication rows
Live: zero-hit --json exactly [], exit 0, pipes through python3 -m json.tool
Live: search "kanye west" --agent populated records (was [{},{},...])

Fixture-first: the trimmed live-capture fixture was committed and parse tests run against the old parser first (6 records incl. "Leaderboard"/"Wall Street Journal", zero dates) before the rewrite.

Known Residuals

  • P2 internal/cli/search.go:88 - --days composes with a single-page fetch (~10 archive hits per page); no signal when more in-window results exist. Pagination was deliberately deferred; a minimal follow-up is parsing the "Results 1 - 10 of about N" header and warning on stderr when truncated under --days.
  • The same zero-results-prose-before-JSON-check bug exists in techmeme's own author.go and at least 5 other published CLIs - filed as a generator-level issue (linked below) rather than widened into this PR.

Post-Deploy Monitoring & Validation

  • After the release ledger assigns the new version: go install .../techmeme/cmd/techmeme-pp-cli@latest, then search <topic> --json | python3 -m json.tool (valid array, dated records) and a zero-hit query (must print []).
  • Healthy signal: last30days engine runs show [Techmeme] found N records with no JSON decode failed lines.
  • Failure signal / rollback trigger: decode failures or empty-object records from --agent; revert this PR.
  • Window: first week of engine runs post-release. Owner: @mvanhorn.

🤖 Generated with Claude Code

https://claude.ai/code/session_01BWuSdMPdQLnAeh65wTeG3L

mvanhorn and others added 3 commits July 4, 2026 09:40
Techmeme's search command scraped every anchor on the results page and broke
its own JSON contract: zero hits printed prose ('No results for ...') to
stdout before the --json check, bare publication anchors ('Wall Street
Journal') leaked in as headlines, and the per-result dates on the page were
never parsed - so archive hits from 2022 were indistinguishable from today's
news (observed breaking the last30days engine on 2026-07-04).

search now pairs each headline anchor with its iinf date block in document
order (bounded to the results canvas, cut at prevnext), emits a date field
(ISO YYYY-MM-DD, empty when unparseable), always outputs a valid JSON array
in JSON mode (empty array on zero hits - nil slices marshal as null, so the
slice is initialized), warns on stderr when markup drift yields anchors but
no date blocks, and gains --days N for client-side recency filtering
(undated records drop when the filter is active). Human mode keeps the
friendly zero-results line and gains a DATE column.

Recorded as .printing-press-patches/search-json-dates-contract.json so a
reprint preserves the contract; contributor attribution added per the
three-surface convention.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BWuSdMPdQLnAeh65wTeG3L
… day-boundary, symmetric markup guard

Review pass (2 independent reviewers + live repro) caught: --agent's implied
--compact stripped every search field to {} via the shared keepFields
allow-list (fixed with a search-local default --select, shared map
untouched); --days compared midnight-UTC dates against a wall-clock instant,
dropping records dated exactly N days ago; the markup-shift guard only fired
for anchors-without-dates, leaving the reverse direction silent. Plus DATE
column test coverage and an anchor regex hardened for attributes before HREF.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BWuSdMPdQLnAeh65wTeG3L
The cobra-tree walker skipped search as 'covered by a typed MCP tool', but
techmeme-search_rss returns the raw archive-search HTML of the same endpoint
the CLI now parses into dated, valid JSON - so the shell-out is the strictly
better surface and MCP agents could not reach the fixed contract at all.
Removed search from frameworkCommands per the file's own when-in-doubt
principle and recorded the lesson in the patch entry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BWuSdMPdQLnAeh65wTeG3L
@greptile-apps

greptile-apps Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes three distinct bugs in techmeme-pp-cli search: zero-hit pages emitting prose instead of valid JSON ([]), records carrying no date field, and --agent/--compact stripping every field to {}. It also removes search from the MCP frameworkCommands skip-list so agent consumers can reach the parsed surface.

  • Parser rewrite (search.go): replaces a flat anchor sweep with a structural pairing algorithm that walks anchors and iinf date blocks in document order, bounded to the resultscanvas region, with explicit filtering of publication cites, image anchors, and Techmeme story-permalinks by href shape.
  • New --days N flag: client-side recency filter at UTC day granularity; undated records are dropped when active; the boundary-inclusive cutoff is correctly derived from midnight UTC to avoid time-of-day skew.
  • --compact/--agent fix: when compacting without a user-supplied --select, a local copy of flags defaults selectFields to all five search field names, leaving the shared struct unmutated.
  • MCP registration (classify.go): removes "search": true from frameworkCommands, registering the command as a shell-out MCP tool.
  • Tests (search_test.go): 15 focused tests covering the fixture, noise filtering, entity decoding, date layouts, filter boundary, zero-hit JSON contract, markup-shift guard (both directions), partial pairing, and the renamed-permalink edge case.

Confidence Score: 5/5

Safe to merge — the parser rewrite is well-scoped, all three bug fixes are covered by targeted tests, and the MCP registration change is a single-line removal with no behavioral side effects on other commands.

The structural anchor/iinf pairing algorithm is correct: document-order position comparison is consistent within the trimmed page string, the boundary-inclusive cutoff in filterSearchByDays is derived from midnight UTC to avoid time-of-day skew, the nil-slice guard correctly prevents json.Marshal emitting null, and the flags copy in the --compact path leaves the shared struct unmutated. The 15-test suite exercises all meaningful branches including the zone-skew boundary case, the renamed-permalink exclusion, and the partial-pairing markup-shift signal.

No files require special attention.

Important Files Changed

Filename Overview
library/productivity/techmeme/internal/cli/search.go Core parser rewrite: structural anchor/iinf pairing replaces flat anchor sweep; nil-slice to [] fix; --days filter with correct UTC-midnight cutoff; --compact field-select local copy; all logic well-tested.
library/productivity/techmeme/internal/cli/search_test.go 15 tests covering fixture parsing, noise filtering, entity decoding, date layouts, filter boundary including zone-skew edge case, zero-hit JSON contract, markup-shift guard, partial pairing, and renamed-permalink exclusion.
library/productivity/techmeme/internal/mcp/cobratree/classify.go Removes search from frameworkCommands skip-list; adds inline comment explaining why the typed techmeme-search_rss tool does not substitute for the parsed CLI surface.
library/productivity/techmeme/internal/cli/testdata/search_kanye_west.html Trimmed live-capture fixture with 4 representative story triples, prevnext pagination div, and a post-results /r2/ sponsor item — exercises all noise-filter branches.
library/productivity/techmeme/.printing-press-patches/search-json-dates-contract.json Reprint-guard patch record documenting the JSON-array contract, date field requirement, noise-filter bounds, empty-array zero-hit path, and MCP registration change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HTTP GET /search/d3results.jsp] --> B[parseSearchResults]
    B --> C{resultscanvas present?}
    C -- yes --> D[Trim to resultscanvas region]
    C -- no --> E[Use full page]
    D --> F[Trim at sponsorscanvas]
    E --> F
    F --> G[Trim at prevnext]
    G --> H[Collect candidate anchors via searchAnchorRE]
    H --> I[Filter /r2/ promos, permalink hrefs, context text]
    I --> J[Find iinf date blocks via searchIinfDateRE]
    J --> K{len dates == 0?}
    K -- yes, anchors > 0 --> L[Return empty, warn=true]
    K -- yes, anchors == 0 --> M[Return empty warn=false zero-hit page]
    K -- no --> N[Pair: last anchor before each iinf block]
    N --> O[parseSearchDate to ISO YYYY-MM-DD]
    O --> P{len results less than len dates?}
    P -- yes --> Q[Return results, warn=true partial pairing]
    P -- no --> R[Return results, warn=false]
    Q --> S{--days N?}
    R --> S
    S -- yes --> T[filterSearchByDays UTC midnight cutoff]
    S -- no --> U[renderSearchResults]
    T --> U
    U --> V{--json or --agent?}
    V -- yes nil results --> W[Replace nil with empty slice]
    W --> X{--compact no --select?}
    X -- yes --> Y[Local copy: selectFields=all fields]
    X -- no --> Z[printJSONFiltered to stdout]
    Y --> Z
    V -- no 0 results --> AA[Print No results prose]
    V -- no results present --> AB[Print DATE SOURCE HEADLINE table]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[HTTP GET /search/d3results.jsp] --> B[parseSearchResults]
    B --> C{resultscanvas present?}
    C -- yes --> D[Trim to resultscanvas region]
    C -- no --> E[Use full page]
    D --> F[Trim at sponsorscanvas]
    E --> F
    F --> G[Trim at prevnext]
    G --> H[Collect candidate anchors via searchAnchorRE]
    H --> I[Filter /r2/ promos, permalink hrefs, context text]
    I --> J[Find iinf date blocks via searchIinfDateRE]
    J --> K{len dates == 0?}
    K -- yes, anchors > 0 --> L[Return empty, warn=true]
    K -- yes, anchors == 0 --> M[Return empty warn=false zero-hit page]
    K -- no --> N[Pair: last anchor before each iinf block]
    N --> O[parseSearchDate to ISO YYYY-MM-DD]
    O --> P{len results less than len dates?}
    P -- yes --> Q[Return results, warn=true partial pairing]
    P -- no --> R[Return results, warn=false]
    Q --> S{--days N?}
    R --> S
    S -- yes --> T[filterSearchByDays UTC midnight cutoff]
    S -- no --> U[renderSearchResults]
    T --> U
    U --> V{--json or --agent?}
    V -- yes nil results --> W[Replace nil with empty slice]
    W --> X{--compact no --select?}
    X -- yes --> Y[Local copy: selectFields=all fields]
    X -- no --> Z[printJSONFiltered to stdout]
    Y --> Z
    V -- no 0 results --> AA[Print No results prose]
    V -- no results present --> AB[Print DATE SOURCE HEADLINE table]
Loading

Reviews (2): Last reviewed commit: "fix(techmeme): harden permalink exclusio..." | Re-trigger Greptile

Comment thread library/productivity/techmeme/internal/cli/search.go
Comment thread library/productivity/techmeme/internal/cli/search.go Outdated
… failures

Greptile review: the 'In context' filter keyed on anchor text, so a renamed
label would let a Techmeme story permalink pair with the NEXT story's date
block and misattribute its link - now excluded by href shape
(techmeme.com/YYMMDD/pNN). And the markup-shift guard fired only on total
pairing collapse; it now warns whenever any iinf block goes unmatched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BWuSdMPdQLnAeh65wTeG3L
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant