|
| 1 | +--- |
| 2 | +argument-hint: [corpus_slug] |
| 3 | +description: uses Playwright MCP and the `corpus:view` to parse page elements |
| 4 | +--- |
| 5 | + |
| 6 | +- using Playwright MCP, navigate to `http://localhost:3001/corpus/$1/gitcasso` |
| 7 | +- the page will have a div with id `gitcasso-comment-spots`, wait 500ms for it to settle |
| 8 | +- inside the `gitcasso-comment-spots` div you will see something like this: |
| 9 | + |
| 10 | +```json |
| 11 | +{ |
| 12 | + "url": "https://github.com/diffplug/selfie/issues/523", |
| 13 | + "allTextAreas": [ |
| 14 | + { |
| 15 | + "textarea": "id='feedback' name='feedback' className='form-control width-full mb-2'", |
| 16 | + "spot": "NO_SPOT" |
| 17 | + }, |
| 18 | + { |
| 19 | + "textarea": "id=':rn:' name='' className='prc-Textarea-TextArea-13q4j overtype-input'", |
| 20 | + "spot": { |
| 21 | + "domain": "github.com", |
| 22 | + "number": 523, |
| 23 | + "slug": "diffplug/selfie", |
| 24 | + "title": "TODO_TITLE", |
| 25 | + "type": "GH_ISSUE_ADD_COMMENT", |
| 26 | + "unique_key": "github.com:diffplug/selfie:523" |
| 27 | + } |
| 28 | + } |
| 29 | + ] |
| 30 | +} |
| 31 | +``` |
| 32 | + |
| 33 | +- this output means that this page is simulating the url `https://github.com/diffplug/selfie/issues/523` |
| 34 | +- every textarea on the page is represented |
| 35 | +- `NO_SPOT` means that the textarea was not enhanced |
| 36 | +- `type: GH_ISSUE_ADD_COMMENT` means that it was enhanced by whichever implementation of `CommentEnhancer` returns the spot type `GH_ISSUE_ADD_COMMENT` |
| 37 | +- if you search for that string in `src/lib/enhancers` you will find the correct one |
| 38 | +- the `tryToEnhance` method returned a `CommentSpot`, and all of its data is splatted out above |
| 39 | + |
| 40 | +If you make a change to the code of the enhancer, you can click the button with id `gitcasso-rebuild-btn`. It will trigger a rebuild of the browser extension, and then refresh the page. You'll be able to see the effects of your change in the `gitcasso-comment-spots` div described above. |
| 41 | + |
| 42 | +## Common extraction workflow |
| 43 | + |
| 44 | +If you see `"title": "TODO_TITLE"` or similar hardcoded `TODO` values in the JSON output, this indicates the enhancer needs some kind of extraction implemented: |
| 45 | + |
| 46 | +1. **Find the enhancer**: Search for the `type` value (e.g., `GH_ISSUE_ADD_COMMENT`) in `src/lib/enhancers/` |
| 47 | +2. **Implement extraction**: Replace the hardcoded title with DOM extraction: |
| 48 | + ```typescript |
| 49 | + const title = document.querySelector('main h1')!.textContent!.replace(/\s*#\d+$/, '').trim() |
| 50 | + ``` |
| 51 | +3. **Test with rebuild**: Click the 🔄 button to rebuild and verify the title appears correctly in the JSON |
| 52 | + |
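To make the first step of this workflow concrete, here is a minimal sketch of how the JSON from the `gitcasso-comment-spots` div could be scanned for leftover placeholders. The type names (`Spot`, `TextAreaEntry`, `SpotsDump`) and the helper `findTodoFields` are hypothetical; the shapes are inferred from the sample output above, not taken from the repository, so the real types may differ.

```typescript
// Shapes inferred from the sample output; assumptions, not the repo's real types.
type Spot = { type: string; [field: string]: unknown }

interface TextAreaEntry {
  textarea: string
  spot: 'NO_SPOT' | Spot
}

interface SpotsDump {
  url: string
  allTextAreas: TextAreaEntry[]
}

// Hypothetical helper: for every enhanced textarea, list the spot fields
// whose string value still starts with "TODO".
function findTodoFields(dump: SpotsDump): { type: string; todoFields: string[] }[] {
  const results: { type: string; todoFields: string[] }[] = []
  for (const entry of dump.allTextAreas) {
    if (entry.spot === 'NO_SPOT') continue // this textarea was not enhanced
    const spot = entry.spot
    const todoFields = Object.entries(spot)
      .filter(([, value]) => typeof value === 'string' && value.startsWith('TODO'))
      .map(([key]) => key)
    results.push({ type: spot.type, todoFields })
  }
  return results
}

// Trimmed copy of the sample output shown earlier.
const sample: SpotsDump = JSON.parse(`{
  "url": "https://github.com/diffplug/selfie/issues/523",
  "allTextAreas": [
    { "textarea": "id='feedback' name='feedback'", "spot": "NO_SPOT" },
    {
      "textarea": "id=':rn:' name=''",
      "spot": {
        "domain": "github.com",
        "number": 523,
        "slug": "diffplug/selfie",
        "title": "TODO_TITLE",
        "type": "GH_ISSUE_ADD_COMMENT",
        "unique_key": "github.com:diffplug/selfie:523"
      }
    }
  ]
}`)

const todoReport = findTodoFields(sample)
console.log(todoReport)
```

For the sample, the report contains one entry: type `GH_ISSUE_ADD_COMMENT` with `title` as its only `TODO` field. That `type` value is the string to search for in `src/lib/enhancers/`.
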
| 53 | +## Extraction code style |
| 54 | + |
| 55 | +- Don't hedge your bets by writing lots of fallback code or strings of `?.`. Have a specific piece of data you want to get, and use non-null `!` assertions where necessary to make clear exactly what you expect to find. |
| 56 | +- If a field is empty, represent it with an empty string. Don't use placeholders when extracting data. |
| 57 | +- The pages we are scraping are going to change over time, and it's easier to fix broken ones if we know exactly what used to work. If the code has lots of branching paths, it's harder to tell what it was doing. |
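As an illustration of this style, the sketch below wraps the title extraction from the workflow in a single-path function. The interfaces `TextNodeLike` and `QueryRoot`, the helper name `extractIssueTitle`, and the heading text `Example title` are all made up for the example; they exist only so the code runs outside a browser. In the extension, `document` would be passed as the root.

```typescript
// Minimal structural types so the sketch runs without a DOM (assumption:
// in the extension, `document` satisfies QueryRoot).
interface TextNodeLike {
  textContent: string | null
}

interface QueryRoot {
  querySelector(selector: string): TextNodeLike | null
}

// Hypothetical helper: one selector, one transformation, no fallbacks.
// The `!` assertions are erased at compile time; if the selector stops
// matching, the property access throws a TypeError at this exact line.
function extractIssueTitle(root: QueryRoot): string {
  return root.querySelector('main h1')!.textContent!.replace(/\s*#\d+$/, '').trim()
}

// Stand-in for a page whose heading ends with the issue number.
const fakePage: QueryRoot = {
  querySelector: (selector) =>
    selector === 'main h1' ? { textContent: '\n  Example title #523' } : null,
}

console.log(extractIssueTitle(fakePage))

// A page where the heading is gone fails immediately instead of
// silently producing a placeholder.
const brokenPage: QueryRoot = { querySelector: () => null }
try {
  extractIssueTitle(brokenPage)
} catch (err) {
  console.log('extraction failed:', err instanceof TypeError)
}
```

Note that the regex only strips the issue number when it is the last thing in the string, so the replacement has to run before `trim()` and relies on the heading having no trailing whitespace after the number.
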