diff --git a/.github/ISSUE_TEMPLATE/02-name-correction.yml b/.github/ISSUE_TEMPLATE/02-name-correction.yml
index 8790f202a4..b2273d9a69 100644
--- a/.github/ISSUE_TEMPLATE/02-name-correction.yml
+++ b/.github/ISSUE_TEMPLATE/02-name-correction.yml
@@ -34,7 +34,7 @@ body:
 
     validations:
       required: true
-  - type: textarea
+  - type: input
     id: author_orcid
     attributes:
       label: Author ORCID
@@ -43,7 +43,7 @@ body:
       placeholder: ex. https://orcid.org/my-orcid?orcid=0009-0003-8868-7504
     validations:
       required: true
-  - type: textarea
+  - type: input
     id: author_highest_degree_institution
     attributes:
       label: Institution of highest (anticipated) degree
@@ -54,7 +54,7 @@ body:
       placeholder: ex. Johns Hopkins University (https://www.jhu.edu/)
     validations:
       required: true
-  - type: textarea
+  - type: input
     id: author_name_script_variant
     attributes:
       label: Author Name (only if published in another script)
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
new file mode 100644
index 0000000000..3d5e3e32b9
--- /dev/null
+++ b/.github/copilot-instructions.md
@@ -0,0 +1,133 @@
+# ACL Anthology Copilot Instructions
+
+## Project Overview
+The ACL Anthology is a digital archive of NLP/CL research papers with both a static website generator and a Python package for metadata access. The project manages scholarly publication metadata through XML files and generates a Hugo-based website.
+
+## Architecture & Data Flow
+
+### Core Data Model
+- **Authoritative XML files** in `data/xml/` contain all paper metadata (schema: `data/xml/schema.rnc`)
+- **YAML configuration** in `data/yaml/` defines venues, SIGs, and name variants
+- **Hugo static site** generated from processed JSON data in `build/data/`
+- **Python package** (`python/acl_anthology/`) provides programmatic access to metadata
+
+### Build Process Pipeline
+1. **XML Processing**: `bin/create_hugo_data.py` converts XML → JSON for Hugo templates
+2. **Bibliography Generation**: `bin/create_extra_bib.py` creates BibTeX/MODS/Endnote exports
+3. **Hugo Site Generation**: Hugo processes JSON data → static HTML site
+4. **Asset Management**: PDF files, attachments managed separately with checksums
+
+Key build targets in `Makefile`:
+- `make all` - Full build (check + site)
+- `make hugo_data` - Generate JSON data files only
+- `make site` - Generate complete website
+- `make check` - Validate XML schema compliance
+
+## Critical ID System
+
+### Modern Format (2020 onward)
+- Format: `YEAR.VENUE-VOLUME.NUMBER` (e.g., `2020.acl-main.12`)
+- **VENUE**: lowercase alphanumeric venue identifier (no years!)
+- **VOLUME**: volume name (`main`, `short`, `1`, etc.)
+- **NUMBER**: paper number within volume
+
+### Legacy Format (pre-2020)
+- Various letter-based schemes (P19-1234, W19-5012, etc.)
+- Limited paper capacity, inflexible venue encoding
+
+## Development Workflows
+
+### XML Metadata Management
+- All paper metadata lives in `data/xml/{COLLECTION_ID}.xml` files
+- Use `bin/ingest_aclpub2.py` for bulk ingestion from conference data
+- Individual modifications via scripts like `bin/add_author_id.py`, `bin/fix_titles.py`
+- **Always validate with `make check`** after XML changes
+
+### Author Name Handling
+- Complex disambiguation system for author identity resolution
+- Name variants stored in `data/yaml/name_variants.yaml`
+- Scripts: `bin/find_name_variants.py`, `bin/auto_name_variants.py`
+- Person IDs assigned automatically but can be explicitly set
+
+### Testing Strategy
+```bash
+# Python package tests
+cd python && poetry run pytest
+
+# Full site build test
+make check site
+
+# Integration tests on full data
+pytest -m integration
+```
+
+## Project-Specific Patterns
+
+### XML Structure Philosophy
+- **Separation of content and presentation**: Raw metadata in XML, formatting via Hugo templates
+- **Hierarchical organization**: Collections → Volumes → Papers
+- **Checksum validation**: All file references include SHA-256 checksums (8-char prefix)
+
+### Script Naming Conventions
+- `add_*.py` - Add new metadata fields
+- `fix_*.py` - Correct existing data
+- `ingest_*.py` - Import data from external sources
+- `create_*.py` - Generate derived files
+
+### Hugo Data Export Pattern
+```python
+# All export scripts follow this pattern:
+def export_ENTITY(anthology, builddir, dryrun):
+    # Process anthology data
+    data = {...}
+    if not dryrun:
+        with open(f"{builddir}/data/{entity}.json", "wb") as f:
+            f.write(ENCODER.encode(data))
+```
+
+## Environment Setup
+
+### Dependencies
+- **Python 3.10+** with packages from `bin/requirements.txt`
+- **Hugo 0.126.0+** (extended version required)
+- **bibutils** for citation format conversion
+- **jing** for XML validation
+
+### Development Commands
+```bash
+# Setup environment
+python3 -m venv venv && source venv/bin/activate
+pip install -r bin/requirements.txt
+
+# Quick data regeneration (development)
+make NOBIB=true hugo_data hugo
+
+# Full production build
+make all
+```
+
+## Key Integration Points
+
+### External Data Sources
+- **ACLPub2**: Conference management system data ingestion
+- **Papers with Code**: Research code linking
+- **CrossRef**: DOI metadata synchronization
+- **Google Scholar**: Author profile integration
+
+### File Management
+- PDFs and attachments stored separately from metadata
+- Environment variables: `ANTHOLOGY_PREFIX`, `ANTHOLOGYFILES`
+- Symlinked as `anthology-files` in generated site
+
+## Common Pitfalls
+- **Never include years in venue identifiers** - venues are persistent entities
+- **XML changes require `make check`** - schema validation is mandatory
+- **Author name disambiguation is automatic** - manual overrides via explicit IDs only
+- **Hugo memory usage is ~18GB** - normal on large sites, may cause swapping
+- **Venue vs Event confusion** - venues are persistent, events are year-specific instances
+
+## File Locations for Common Tasks
+- **Add new venue**: `data/yaml/venues/{venue-id}.yaml`
+- **Fix paper metadata**: Edit `data/xml/{collection}.xml` directly
+- **Modify site templates**: `hugo/layouts/`
+- **Update build process**: `Makefile` and `bin/create_hugo_data.py`
diff --git a/.github/instructions/process-author-page.instructions.md b/.github/instructions/process-author-page.instructions.md
new file mode 100644
index 0000000000..e35bbd5fde
--- /dev/null
+++ b/.github/instructions/process-author-page.instructions.md
@@ -0,0 +1,365 @@
+---
+applyTo: 'data/xml/*.xml'
+---
+
+# Processing ACL Anthology Author Page Issues
+
+This guide provides instructions for processing GitHub issues requesting author page corrections in the ACL Anthology. There are two types of requests: **merging** and **splitting** author pages.
+
+## Prerequisites & Requirements
+
+All author page requests **must** include:
+- **GitHub issue number** (e.g., `#123`)
+- **The author ID** (e.g., `matt-post` or `matt-post-rochester`)
+- **Valid ORCID ID** (format: `0000-0000-0000-0000`)
+- **Institution** where highest (anticipated) degree was/will be obtained
+- **Requested action** (merge or split)
+- **Clear identification** of which papers belong to the author (in the case of a split)
+
+Ideally, this input will be in the form of a JSON object. Here is an example input for merging:
+
+```json
+{
+  "github_issue": "#123",
+  "canonical": "Post, Matt",
+  "variants": [
+    "Post, Matthew",
+    "Post, Matthew J"
+  ],
+  "author_id": "matt-post",
+  "orcid": "0000-0000-0000-0000",
+  "institution": "University of Rochester",
+  "action": "merge"
+}
+```
+
+and for splitting:
+
+```json
+{
+  "github_issue": "#123",
+  "author_id": "matt-post-rochester",
+  "orcid": "0000-0000-0000-0000",
+  "institution": "University of Rochester",
+  "action": "split",
+  "papers": [
+    "2024.acl-main.17",
+    "2018.wmt-1.67"
+  ]
+}
+```
+
+## Workflow Overview
+
+1. **Setup**: Ensure master branch is up to date, create working branch
+2. **Process**: Make required changes based on request type (merge or split)
+3. **Validate**: Run checks to ensure changes are correct
+4. **Submit**: Commit changes and create PR referencing the issue
+
+## Initial Setup
+
+### 1. Update and Create Branch
+
+```bash
+# Ensure master is up to date
+git checkout master
+git pull origin master
+
+# Create branch using the pattern: author-page-{author_id} where author_id is the unique identifier for the author
+git checkout -b author-page-{author_id}
+```
+
+**Branch naming examples**:
+- `author-page-matt-post`: often used when merging multiple pages under a single canonical name variant
+- `author-page-matt-post-rochester`: used when we need to split a page, disambiguating one author (using their institution) from others
+
+## Request Type 1: Merging Author Pages
+
+**Use case**: Author has published under multiple name variants and wants them consolidated under a canonical name.
+
+**Example**: "Matt Post" and "Matthew Post" should be merged under "Matt Post".
+
+### Steps:
+
+1. **Add entry to `data/yaml/name_variants.yaml`**:
+   ```yaml
+   - canonical: {first: Matt, last: Post}
+     orcid: 0000-0000-0000-0000
+     institution: Johns Hopkins University  # Include even though not currently used
+     variants:
+       - {first: Matthew, last: Post}
+   ```
+
+2. **Check out a working branch off master**:
+
+```bash
+# Ensure master is up to date
+git checkout master
+git pull origin master
+
+# Create branch using the pattern: author-page-{author_id}
+git checkout -b author-page-{author_id}
+```
+
+3. **Commit to the branch, noting the GitHub issue being closed**
+
+```bash
+git add data/yaml/name_variants.yaml
+git commit -m "Merging author pages for {author_name} (closes #{issue_number})"
+```
+
+**Important notes**:
+- Canonical name should be the author's preferred variant
+- Include all name variants found in the XML files
+- The `institution` field should be included for future use
+- Do not create an `id` field (this is only for splitting)
+- Do not list the canonical version under the variants list
+
+## Request Type 2: Splitting Author Pages
+
+**Use case**: Multiple authors published under the same name and need to be separated.
+
+**Example**: Papers under "Matt Post" are actually from different people - separate out the papers belonging to the requester.
+
+### Steps:
+
+#### 2.1 Create a base Author ID for all the names
+
+First, add a "generic" entry to `data/yaml/name_variants.yaml`. For example:
+
+```yaml
+- canonical: {first: Matt, last: Post}
+  id: matt-post
+  comment: "May refer to several people"
+```
+
+Insert this entry at its roughly sorted position in the YAML file. This helps avoid merge conflicts
+when multiple authors are processed independently at the same time.
+
+#### 2.3 Tag all authors with that name string using the generic ID
+
+Use the `bin/add_author_id.py` script to efficiently add the ID to all papers that have this author name.
+Continuing with our "matt-post" example:
+
+```bash
+# Add ID to all papers by the author's first and last name
+bin/add_author_id.py matt-post "Post, Matt"
+```
+
+This will add the `id` attribute to matching `<author>` tags. For example, this entry
+
+```xml
+<!-- Before -->
+<author><first>Matt</first><last>Post</last></author>
+
+<!-- will become this: -->
+
+<!-- After -->
+<author id="matt-post"><first>Matt</first><last>Post</last></author>
+```
+
+**Note**: The script automatically maintains proper XML formatting and preserves indentation.
+
+#### 2.4 Create an Author ID for the Requester
+
+Now that all names are tagged, we want to pick out the requester's papers and tag them with a new ID.
+
+First, add an entry to `data/yaml/name_variants.yaml`:
+```yaml
+- canonical: {first: Matt, last: Post}
+  id: matt-post-rochester  # Format: firstname-lastname-institution
+  orcid: 0000-0000-0000-0000
+  institution: University of Rochester
+```
+
+**ID format rules**:
+- Lowercase only
+- Hyphens replace spaces
+- Use recognizable institution abbreviation
+- Examples: `yang-liu-umich`, `john-smith-stanford`, `jane-doe-google`
+
+#### 2.5 Tag Author's Papers
+
+Use the `bin/add_author_id.py` script again, but this time with the `--paper-ids` flag.
+
+```bash
+# Add the requester's ID only on the listed papers
+bin/add_author_id.py matt-post-rochester "Post, Matt" --paper-ids [list of Anthology paper ids]
+```
+
+This will change the `id` attribute from the generic one to the specific one for the
+requesting author:
+
+```xml
+<!-- Before -->
+<author id="matt-post"><first>Matt</first><last>Post</last></author>
+
+<!-- After -->
+<author id="matt-post-rochester"><first>Matt</first><last>Post</last></author>
+```
+
+### Helper Tools
+
+- `bin/add_author_id.py author-id "Last name, first name"` - Bulk add ID to matching authors
+- `bin/add_author_id.py author-id "Last name, first name" --paper-ids ...` - Bulk add ID to matching authors on specific papers only (to prevent over-matching on the author name)
+
+## Validation & Testing
+
+### Required Checks
+
+```bash
+# Validate XML schema compliance
+make check
+```
+
+### Common Issues to Avoid
+
+- **Invalid ORCID format**: Must be exactly `0000-0000-0000-0000`
+- **XML formatting**: Don't break single-line `<author>` tags into multiple lines
+- **Duplicate IDs**: Ensure new author IDs are unique
+- **Missing canonical**: Canonical name must match one existing name variant
+
+## File Locations
+
+- **Name variants**: `data/yaml/name_variants.yaml`
+- **Paper metadata**: `data/xml/{year}.{venue}.xml` (e.g., `2020.acl.xml`)
+
+## Examples
+
+### Merge Example
+```yaml
+# Merging "John P. Smith" and "John Smith"
+- canonical: {first: John P., last: Smith}
+  orcid: 0000-0002-1234-5678
+  institution: Stanford University
+  variants:
+    - {first: John, last: Smith}
+    - {first: J. P., last: Smith}
+```
+
+### Split Example
+```yaml
+# Splitting "Yang Liu" - requester from University of Michigan
+- canonical: {first: Yang, last: Liu}
+  id: yang-liu-umich
+  orcid: 0000-0003-1234-5678
+  institution: University of Michigan
+
+# Generic entry for remaining papers
+- canonical: {first: Yang, last: Liu}
+  id: yang-liu
+  comment: "May refer to several people"
+```
+
+## Completion
+
+### 1. Commit Changes
+
+```bash
+# Add all modified files
+git add data/yaml/name_variants.yaml data/xml/*.xml
+
+# Commit with reference to issue number
+git commit -m "Process author page request for {Author Name} (closes #{issue_number})
+
+- {Brief description of changes made}
+"
+
+# Push branch
+git push origin author-page-{author_id}
+```
+
+### 2. Create Pull Request
+
+- **Title**: `Author page: {Author Name}`
+- **Body**: Reference the GitHub issue number and summarize changes
+- **Labels**: Add appropriate labels (`author-page`, `merge` or `split`)
+
+The PR will trigger automated builds and tests. Once merged, the changes will be reflected in the next site build.
+For each paper belonging to the disambiguated author, add an `id` attribute to the XML:
+
+**Example**: In `data/xml/2020.acl.xml`:
+```xml
+<paper>
+  <author id="yang-liu-umich"><first>Yang</first><last>Liu</last></author>
+  <!-- ... -->
+</paper>
+```
+
+**Formatting Requirements**:
+- Keep `<author>` tags on a single line (don't expand to multiple lines)
+- Preserve existing indentation and spacing patterns
+- Use existing XML formatting tools to maintain consistency
+
+**Tools available**:
+- `bin/add_author_id.py author-id "Last name, first name"` - Bulk add ID to author
+
+
+## ID Generation Rules
+
+### Author ID Format
+- **Structure**: `firstname-lastname-institution`
+- **Rules**:
+  - Lowercase only
+  - Hyphens replace spaces and special characters
+  - Institution should be a recognizable abbreviation
+  - Examples: `yang-liu-umich`, `john-smith-stanford`
+
+### Institution Abbreviations
+Common patterns:
+- Universities: `umich`, `stanford`, `cmu`, `mit`
+- Companies: `google`, `microsoft`, `facebook`
+- Use domain-based abbreviations when possible
+
+## Validation and Testing
+
+### Required Checks
+```bash
+# Validate XML schema compliance
+make check
+```
+
+### Formatting Consistency
+- **XML**: Preserve single-line formatting for `<author>` and `<editor>` tags
+- **YAML**: Follow existing indentation (2 spaces) and structure in `name_variants.yaml`
+- **Use project tools**: Scripts like `bin/add_author_id.py` maintain proper formatting automatically
+- **Indentation**: Use `anthology.utils.indent()` function for XML pretty-printing when needed
+
+### Common Issues
+- **Invalid ORCID format**: Must be `0000-0000-0000-0000`
+- **XML schema violations**: Missing required fields, invalid nesting
+- **Name mismatches**: Canonical name not matching any existing papers
+- **Duplicate IDs**: Ensure new author IDs are unique
+
+## File Locations
+
+- **Name variants**: `data/yaml/name_variants.yaml`
+- **XML metadata**: `data/xml/{collection}.xml` (e.g., `2020.acl.xml`)
+- **Validation script**: `make check`
+- **Author ID tools**: `bin/add_author_id.py`
+
+## Post-Processing
+
+### 1. Commit and Push Changes
+```bash
+# Add all changes
+git add data/yaml/name_variants.yaml data/xml/*.xml
+
+# Commit with descriptive message
+git commit -m "Author page correction: {author-name} ({merge|split})"
+
+# Push branch
+git push origin author-page-{author_id}
+```
+
+### 2. Create Pull Request
+- **Title**: `Author page correction: {Author Name} ({merge|split})`
+- **Body**: Include link to original GitHub issue and summary of changes
+- **Labels**: Add `correction`, `metadata` labels
+- **Assignees**: Add `anthology-assist`
+
+### 3. Post-PR Actions
+1. **Update GitHub issue**: Comment with link to PR and close original issue
+2. **Monitor build**: Ensure site builds successfully after merge
+3. **Verify author pages**: Check that author pages display correctly on staging/live site
+4. **Archive decision**: Document rationale for complex disambiguation cases
diff --git a/.github/instructions/process-author-page.prompt.md b/.github/instructions/process-author-page.prompt.md
new file mode 100644
index 0000000000..c89f2e148f
--- /dev/null
+++ b/.github/instructions/process-author-page.prompt.md
@@ -0,0 +1,162 @@
+# Prompt template: Process an author-page GitHub issue
+
+Purpose
+-------
+This prompt is for automating the `process-author-page` workflow. Give an LLM (or automation) a full GitHub issue (title, body, labels, comments) and it will extract the data needed to fill out the project's author-page instructions and produce a machine-friendly plan and artifacts (YAML snippet, XML edit hints, branch name, commands, PR text, and clarifying questions).
+
+How to use
+----------
+- Provide the full issue object as context: `issue.title`, `issue.body`, `issue.labels`, `issue.author`, `issue.comments` (list of {author, body, created_at}).
+- Expect a single JSON object output exactly matching the schema in the "Output schema" section.
+
+Prompt (give the following to the LLM as the user/system prompt):
+
+"Process an author-page GitHub issue and produce a complete actionable plan"
+
+Context you will receive (pass this as context):
+- issue.title (string)
+- issue.body (string)
+- issue.labels (list of strings)
+- issue.author (string)
+- issue.comments (list of {author, body, created_at})
+- optional: linked PR / linked commits
+
+Task for the LLM
+----------------
+1. Parse the issue and comments to extract:
+   - canonical_author_name: canonical first/middle/last parts.
+   - name_variants mentioned in issue/comments.
+   - requester_author_id (if suggested by user).
+   - requester_ORCID (if provided).
+   - requester_institution (if provided).
+   - primary_paper_ids: Anthology paper IDs the requester claims.
+   - other_paper_ids: other Anthology IDs referencing the same name.
+   - requested_action: one of ["create-id-and-assign","assign-existing-id","split","merge","other"], or "clarify" if ambiguous.
+   - whether the user requests a dummy id for other people sharing the name.
+   - urgency / labels like "author-page" / "high-priority".
+
+2. Validate and enrich:
+   - Validate ORCID format (pattern: 0000-0000-0000-0000).
+   - Validate Anthology ID patterns; if missing set papers_to_verify=true.
+   - If ambiguous or missing data, populate `clarifying_questions` with concise questions.
+
+3. Produce outputs using the exact JSON schema below. Be concise and machine-parseable. When generating branch names and ids, follow repository conventions described in guidelines.
+
+Output schema (RETURN EXACTLY this JSON object; do not return extra text)
+--------------------------------------------------------------------------------
+{
+  "metadata": {
+    "issue_title": string,
+    "issue_number": integer_or_null,
+    "issue_author": string,
+    "labels": [string]
+  },
+
+  "extracted": {
+    "canonical_name": { "first": string, "middle": string_or_null, "last": string },
+    "name_variants": [string],
+    "requester": {
+      "author_id_proposed": string_or_null,
+      "orcid": string_or_null,
+      "institution": string_or_null,
+      "claim_paper_ids": [string]
+    },
+    "other_paper_ids": [ { "id": string, "found_in_comment_or_body": string } ],
+    "requested_action": "create-id-and-assign" | "assign-existing-id" | "split" | "merge" | "other" | "clarify",
+    "wants_dummy_id": boolean,
+    "ambiguities": [string]
+  },
+
+  "plan": {
+    "branch_name": string,
+    "name_variants_yaml_snippet": string,
+    "xml_edits": [
+      { "paper_id": string, "file_hint": string_or_null, "author_xpath_hint": string, "action": "add_id" | "remove_id" | "none", "id_to_set": string }
+    ],
+    "commands": [ string ],
+    "git": {
+      "commit_message": string,
+      "pr_title": string,
+      "pr_body": string
+    },
+    "validation_commands": [ string ],
+    "files_to_edit": [string],
+    "notes": [string]
+  },
+
+  "edge_cases_and_questions": {
+    "clarifying_questions": [string],
+    "recommended_dummy_id_format": string,
+    "conflict_resolution_policy": string
+  }
+}
+
+Guidelines and conventions (apply when filling fields)
+------------------------------------------------------
+- Always use `data/yaml/name_variants.yaml` for new canonical id entries. The YAML snippet must follow existing project structure. Example:
+
+  - canonical: {first: Shashank, last: Gupta}
+    id: shashank-gupta-uiuc
+    orcid: 0000-0000-0000-0000
+    institution: University of Illinois at Urbana-Champaign
+    comment: "created from issue #NNN: author-confirmed"
+
+- Shell commands must be repository-root relative and follow this example order:
+  - `git checkout -b <branch_name>`
+  - `python3 bin/add_author_id.py <author_id> "Last, First" --paper-ids <paper_ids>`
+  - `git add <files_to_edit>`
+  - `git commit -m "<commit_message>"`
+  - `git push --set-upstream origin <branch_name>`
+
+- Include validation commands: `make check` and `make hugo_data` in `validation_commands`.
+- If ambiguous or missing paper IDs, set `requested_action` to "clarify" and include exact clarifying questions for the issue author.
+
+- ID & branch generation policy:
+  - Prefer supplied author id. If none supplied, generate `first-last` (lowercase, ascii, hyphenated).
+  - If that collides with an existing id, append an institution shortname (e.g., `-uiuc`) or year suffix (e.g., `-2025`).
+
+Edge cases to handle (list these briefly in `edge_cases_and_questions`):
+- Issue requests adding an id but provides no Anthology paper IDs.
+- Multiple people share the canonical name across years and venues.
+- ORCID present but invalid format.
+- A user requests merging two existing ids (detect and set `requested_action`="merge").
+- Concurrent edits: warn to re-open `data/yaml/name_variants.yaml` before editing to avoid overwrites.
+
+Minimal example (illustrative only; real output must derive from the issue):
+
+{
+  "metadata": { "issue_title": "Author page: Shashank Gupta", "issue_number": 3658, "issue_author": "shashank", "labels": ["author-page"] },
+  "extracted": {
+    "canonical_name": {"first":"Shashank","middle":null,"last":"Gupta"},
+    "name_variants": ["Gupta, Shashank"],
+    "requester": {"author_id_proposed":"shashank-gupta-uiuc","orcid":"0000-0002-3683-3739","institution":"University of Illinois at Urbana-Champaign","claim_paper_ids":["L18-1086"]},
+    "other_paper_ids": [{"id":"2020.semeval-1.56","found_in_comment_or_body":"comment by alice"}],
+    "requested_action":"create-id-and-assign",
+    "wants_dummy_id": true,
+    "ambiguities": []
+  },
+  "plan": {
+    "branch_name":"author-page-shashank-gupta-uiuc",
+    "name_variants_yaml_snippet":"- canonical: {first: Shashank, last: Gupta}\\n  id: shashank-gupta-uiuc\\n  orcid: 0000-0002-3683-3739\\n  institution: University of Illinois at Urbana-Champaign\\n  comment: \\\"from issue #3658\\\"",
+    "xml_edits": [{"paper_id":"L18-1086","file_hint":"data/xml/L18.xml","author_xpath_hint":"//paper[@id='L18-1086']//author[first='Shashank' and last='Gupta']","action":"add_id","id_to_set":"shashank-gupta-uiuc"}],
+    "commands":["git checkout -b author-page-shashank-gupta-uiuc","python3 bin/add_author_id.py shashank-gupta-uiuc \"Gupta, Shashank\" --paper-ids L18-1086","git add data/xml/L18.xml data/yaml/name_variants.yaml","git commit -m \"Author page: add shashank-gupta-uiuc and assign L18-1086\\n\\nCloses #3658\"","git push --set-upstream origin author-page-shashank-gupta-uiuc"],
+    "git": {"commit_message":"Author page: add shashank-gupta-uiuc and assign L18-1086\\n\\nCloses #3658","pr_title":"Author page: Shashank Gupta (shashank-gupta-uiuc)","pr_body":"This PR creates author id `shashank-gupta-uiuc` and assigns it to L18-1086. Also adds a `name_variants` YAML entry. See #3658."},
+    "validation_commands": ["make check","make hugo_data"],
+    "files_to_edit":["data/yaml/name_variants.yaml","data/xml/L18.xml"],
+    "notes":["Re-open `data/yaml/name_variants.yaml` before applying the YAML snippet to avoid overwriting manual edits."]
+  },
+  "edge_cases_and_questions": {
+    "clarifying_questions": [],
+    "recommended_dummy_id_format":"first-last",
+    "conflict_resolution_policy":"If generated id collides, append institution shortname; if still ambiguous, append year suffix and ask the issue author to confirm."
+  }
+}
+
+Usage notes
+-----------
+- Feed the full issue (title, body, comments) into this prompt and request exactly one JSON object output.
+- If `requested_action` == "clarify", post the `clarifying_questions` as a comment on the issue before making edits.
+- Always re-open `data/yaml/name_variants.yaml` to read current contents before applying the `name_variants_yaml_snippet`.
+- Run validation commands after edits.
+
+-- end of prompt template
diff --git a/bin/update_author_page_issues.py b/bin/update_author_page_issues.py
new file mode 100755
index 0000000000..ff8cd1208b
--- /dev/null
+++ b/bin/update_author_page_issues.py
@@ -0,0 +1,128 @@
+#!/usr/bin/env python3
+
+"""Usage: update_author_page_issues.py
+
+Updates all issues containing "Author Metadata:" or "Author Page:" in the title to follow the latest template.
+
+Set your OS environment variable "GITHUB_TOKEN" to your personal token or hardcode it in the code. Make sure not to reveal it to others!
+
+"""
+
+import os
+import textwrap
+import requests
+
+# Configuration
+GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")  # Can hardcode token here
+REPO_OWNER = 'acl-org'
+REPO_NAME = 'acl-anthology'
+
+# Base URL
+BASE_URL = f'https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}'
+
+HEADERS = {
+    'Authorization': f'token {GITHUB_TOKEN}',
+    'Accept': 'application/vnd.github.v3+json',
+}
+
+
+def get_issues_with_title(title):
+    issues_url = f'{BASE_URL}/issues'
+    params = {'state': 'open', 'per_page': 100}
+    issues = []
+
+    while issues_url:
+        response = requests.get(issues_url, headers=HEADERS, params=params)
+        response.raise_for_status()
+        data = response.json()
+
+        for issue in data:
+            if title in issue.get('title', '') and 'pull_request' not in issue:
+                issues.append(issue)
+
+        issues_url = response.links.get('next', {}).get('url')
+
+    return issues
+
+
+def add_comment_to_issue(issue_number, comment):
+    url = f'{BASE_URL}/issues/{issue_number}/comments'
+    payload = {'body': comment}
+    response = requests.post(url, headers=HEADERS, json=payload)
+    response.raise_for_status()
+    print(f'Comment added to issue #{issue_number}')
+
+
+def edit_body_of_issue(issue_number, new_body):
+    url = f'{BASE_URL}/issues/{issue_number}'
+    payload = {'body': new_body}
+    response = requests.patch(url, headers=HEADERS, json=payload)
+    response.raise_for_status()
+    print(f'Body of issue #{issue_number} updated.')
+
+
+def main(issue_ids):
+    print('🔎 Fetching issues...')
+    issues = get_issues_with_title("Author Metadata:") + get_issues_with_title(
+        "Author Page:"
+    )
+    print(f"Found {len(issues)} issues.")
+
+    for issue in issues:
+        number = issue["number"]
+        title = issue["title"]
+
+        if issue_ids and number not in issue_ids:
+            # print(f"Skipping issue #{number}: {issue['title']}")
+            continue
+
+        print(f'---\nProcessing issue #{number}: {title}')
+
+        issue_body = issue["body"] or ""  # the API returns null for an empty body
+        if "### Author ORCID" not in issue_body:
+            issue_body_list = issue_body.split("### Type of Author Metadata Correction")
+            issue_body_list.insert(
+                1,
+                textwrap.dedent(
+                    """
+                    ### Author ORCID
+
+                    -Add ORCID here-
+
+                    ### Institution of highest (anticipated) degree
+
+                    -Add institution here-
+
+                    ### Your papers (if required, see comment below)
+
+                    -Provide Anthology IDs or Anthology URLs here-
+
+                    ### Type of Author Metadata Correction
+                    """
+                ),
+            )
+            issue_body = "".join(issue_body_list)
+            edit_body_of_issue(number, issue_body)
+
+            add_comment_to_issue(
+                number,
+                textwrap.dedent(
+                    """
+                    Hello: we are attempting to close out a large backlog of author page requests. As part of these efforts, we are collecting additional information ([ORCID](https://orcid.org/) and degree institution) which will help us assign papers to the correct author in the future. Please modify the updated description above with the requested information.
+
+                    If you are requesting to split an author page (i.e., your page has some papers that are not yours), please also provide a list of your papers, in the form of Anthology IDs or URLs (e.g., 2023.wmt-1.13 or https://aclanthology.org/2023.wmt-1.13/). We are unable to match papers to Google or Semantic Scholar profiles.
+                    """
+                ),
+            )
+
+
+if __name__ == '__main__':
+    import argparse
+
+    parser = argparse.ArgumentParser(description='Update author page issues')
+    parser.add_argument(
+        'issue_ids', nargs='*', type=int, help='List of issue IDs to update'
+    )
+    args = parser.parse_args()
+
+    main(args.issue_ids)
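
Reviewer note: both instruction files ask for the ORCID to be validated against the pattern `0000-0000-0000-0000`, but nothing in this diff performs that check. The following is a minimal sketch of how it could be done, not part of the patch. The names `ORCID_PATTERN`, `orcid_check_digit`, and `is_valid_orcid` are hypothetical, and the checksum step assumes the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers (which is also why the final character may be `X`).

```python
import re

# Hypothetical helper, not part of this diff.
# Shape: four hyphen-separated groups of four; only the last character may be "X".
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string has the ORCID shape and a matching check character."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]
```

A pattern-only check would accept the all-zero placeholder used throughout the templates; with the checksum step, `is_valid_orcid("0000-0000-0000-0000")` is rejected, which may be useful for catching issues where the requester never replaced the placeholder. Full ORCID URLs (as in the issue template's placeholder) would need the bare identifier extracted first.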