A Chinese name detection and normalization library.
Sinonym is a Python library designed to accurately detect and normalize Chinese names across various romanization systems. It filters out non-Chinese names (such as Western, Korean, Vietnamese, and Japanese names).
This was mostly written with Claude Code with extensive oversight from me... Sorry if the actual code is too AI-ish. It's fast, well-tested, and works pretty well.
Not all the tests pass, and the test suite is intentionally skewed towards failing tests so I know what to work on next. It is more or less impossible to guess with 100% accuracy whether a romanized Chinese name is in `Given-Name Surname` or `Surname Given-Name` format; the best approach is to infer the most likely format from a batch of names that should all share the same format (such as all the authors of an academic paper, or all the names in a specific dataset). This kind of batch processing is described below.
```text
Raw Input
   ↓
TextPreprocessor (structural cleaning)
   ↓
NormalizationService (creates NormalizedInput with compound_metadata)
   ↓
CompoundDetector (generates metadata) → compound_metadata
   ↓
NameParsingService (uses compound_metadata)
   ↓
NameFormattingService (uses compound_metadata)
   ↓
Formatted Output
```
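The diagram above can be read as a simple chain of stage calls. Here is a minimal sketch of that flow; the stage names come from the diagram, but the function names, signatures, and data shapes are hypothetical, not Sinonym's internal API:

```python
# Illustrative only: how the stages above fit together. The stage names come
# from the diagram; the method names and signatures here are hypothetical,
# not Sinonym's internal API.
def run_pipeline(raw_name: str, stages) -> str:
    preprocessor, normalizer, compound_detector, parser, formatter = stages

    cleaned = preprocessor.clean(raw_name)            # structural cleaning
    normalized = normalizer.normalize(cleaned)        # NormalizedInput with compound_metadata
    metadata = compound_detector.detect(normalized)   # compound-surname metadata
    parsed = parser.parse(normalized, metadata)       # surname / given-name boundaries
    return formatter.format(parsed, metadata)         # "Given-Name Surname" output
```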
- **Name Order is `Given-Name Surname`**: The library's primary function is to standardize names into a `Given-Name Surname` format, regardless of the input order.
  - Input: `"Liu Dehua"` → Output: `"De-Hua Liu"`
  - Input: `"Wei, Yu-Zhong"` → Output: `"Yu-Zhong Wei"`
- **Capitalization is Title Case**: The output is consistently formatted in Title Case, with the first letter of the surname and each part of the given name capitalized.
  - Input: `"DAN CHEN"` → Output: `"Dan Chen"`
- **Given Names are Hyphenated**: Given names composed of multiple syllables are joined by a hyphen. This applies to standard names, names with initials, and reduplicated (repeated) names.
  - Input (Standard): `"Wang Li Ming"` → Output: `"Li-Ming Wang"`
  - Input (Initials): `"Y. Z. Wei"` → Output: `"Y-Z Wei"`
  - Input (Reduplicated): `"Chen Linlin"` → Output: `"Lin-Lin Chen"`
- **Compound Surname Formatting is Strictly Preserved**: The library identifies compound (two-character) surnames and preserves their original formatting (compact, spaced, hyphenated, or CamelCase).
  - Input (Compact): `"Duanmu Wenjie"` → Output: `"Wen-Jie Duanmu"`
  - Input (Spaced): `"Au Yeung Chun"` → Output: `"Chun Au Yeung"`
  - Input (Hyphenated): `"Au-Yeung Chun"` → Output: `"Chun Au-Yeung"`
  - Input (CamelCase): `"AuYeung Ka Ming"` → Output: `"Ka-Ming AuYeung"`
- **Unspaced Compound Given Names are Split and Hyphenated**: If a multi-syllable given name is provided as a single unspaced string, the library identifies the syllables and inserts hyphens.
  - Input: `"Wang Xueyin"` → Output: `"Xue-Yin Wang"`
- **Handles All-Chinese Character Names**: It correctly processes names written entirely in Chinese characters, applying surname-first convention with frequency-based disambiguation.
  - Input: `"巩俐"` → Output: `"Li Gong"` (李 is more frequent surname than 巩)
  - Input: `"李伟"` → Output: `"Wei Li"` (李 recognized as surname in first position)
- **Handles Mixed Chinese (Hanzi) and Roman Characters**: It correctly parses names containing both Chinese characters and Pinyin, using the Roman parts for the output.
  - Input: `"Xiaohong Li 张小红"` → Output: `"Xiao-Hong Li"`
- **Normalizes Diacritics, Accents, and Special Characters**: It converts pinyin with tone marks and special characters like `ü` into their basic Roman alphabet equivalents.
  - Input: `"Dèng Yǎjuān"` → Output: `"Ya-Juan Deng"`
- **Normalizes Full-Width Characters**: It processes full-width Latin characters (often from PDFs) into standard characters.
  - Input: `"Ｌｉ Ｘｉａｏｍｉｎｇ"` → Output: `"Xiao-Ming Li"`
- **Handles Messy Formatting (Commas, Dots, Spacing)**: The library correctly parses names despite common data entry or OCR errors.
  - Input (Bad Comma): `"Chen,Mei Ling"` → Output: `"Mei-Ling Chen"`
  - Input (Dot Separators): `"Li.Wei.Zhang"` → Output: `"Li-Wei Zhang"`
- **Splits Concatenated Names**: It can split names that have been concatenated without spaces, using CamelCase or mixed-case cues.
  - Input: `"ZhangWei"` → Output: `"Wei Zhang"`
- **Strips Parenthetical Western Names**: If a Western name is included in parentheses, it is stripped out, and the remaining Chinese name is parsed correctly.
  - Input: `"李(Peter)Chen"` → Output: `"Li Chen"`
- **Rejects Non-Chinese Names**: The library uses advanced heuristics and machine learning to reject names from other cultures to avoid false positives.
  - Western: Rejects `"John Smith"` and even `"Christian Wong"`.
  - Korean: Rejects `"Kim Min-jun"`.
  - Vietnamese: Rejects `"Nguyen Van Anh"`.
  - Japanese: Rejects `"Sato Taro"` and Japanese names in Chinese characters like `"山田太郎"` (Yamada Taro) using ML classification.
- **Supports Regional Romanizations (Cantonese, Wade-Giles)**: The library recognizes and preserves different English romanization systems.
  - Cantonese: Input `"Chan Tai Man"` becomes `"Tai-Man Chan"` (not `"Chen"`).
  - Wade-Giles: Input `"Ts'ao Ming"` becomes `"Ming Ts'ao"` (preserves apostrophe).
- **Corrects for Pinyin Library Inconsistencies**: It contains an internal mapping to fix cases where the underlying `pypinyin` library's output doesn't match the most common romanization for a surname.
  - Example: The character `曾` is romanized by `pypinyin` as `Ceng` by default, but this library corrects it to the expected surname romanization `Zeng`.
- **High-Performance with Caching**: The library is benchmarked to be very fast, capable of processing over 10,000 diverse names per second, and uses caching to significantly speed up the processing of repeated names.
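All of these behaviors go through the same `normalize_name` call documented in the usage section below. A short illustrative run, with expected outputs taken from the examples above:

```python
from sinonym.detector import ChineseNameDetector

detector = ChineseNameDetector()

# Inputs chosen to exercise a few of the behaviors listed above;
# expected outputs are the ones given in that list.
for name in ["Liu Dehua", "Au Yeung Chun", "Dèng Yǎjuān", "ZhangWei"]:
    result = detector.normalize_name(name)
    if result.success:
        print(f"{name} → {result.result}")
# Liu Dehua → De-Hua Liu
# Au Yeung Chun → Chun Au Yeung
# Dèng Yǎjuān → Ya-Juan Deng
# ZhangWei → Wei Zhang
```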
Sinonym processes names through a multi-stage pipeline designed for high accuracy and performance:
- Input Preprocessing: The input string is cleaned and normalized. This includes handling mixed scripts (e.g., "张 Wei") and standardizing different romanization variants.
- All-Chinese Detection: The system detects inputs written entirely in Chinese characters and applies Han-to-Pinyin conversion with surname-first ordering preferences.
- Ethnicity Classification: The name is analyzed to filter out non-Chinese names. This stage uses linguistic patterns and machine learning to identify and reject Western, Korean, Vietnamese, and Japanese names. For all-Chinese character inputs, a trained ML classifier (99.5% accuracy) determines whether names like "山田太郎" are Japanese or Chinese.
- Probabilistic Parsing: The system identifies potential surname and given name boundaries by leveraging frequency data, which helps in accurately distinguishing between a surname and a given name. For all-Chinese inputs, it applies a surname-first bonus while still considering frequency data.
- Compound Name Splitting: For names with fused given names (e.g., "Weiming"), a tiered confidence system is used to correctly split them into their constituent parts (e.g., "Wei-Ming").
- Output Formatting: The final output is standardized to a "Given-Name Surname" format (e.g., "Wei Zhang").
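To make the probabilistic-parsing step concrete, here is a toy sketch of frequency-based disambiguation. The frequencies and the scoring rule are invented for illustration; Sinonym's real parser uses its own frequency data and heuristics:

```python
# Toy illustration of frequency-based surname/given-name disambiguation.
# The numbers and scoring rule are made up for this sketch.
SURNAME_FREQ = {"zhang": 0.07, "wei": 0.001}   # chance a token is used as a surname
GIVEN_FREQ = {"zhang": 0.002, "wei": 0.04}     # chance a token is used as a given name


def score(surname: str, given: str) -> float:
    return SURNAME_FREQ.get(surname, 1e-6) * GIVEN_FREQ.get(given, 1e-6)


tokens = ["zhang", "wei"]
candidates = {
    "surname-first": score(surname=tokens[0], given=tokens[1]),
    "given-first": score(surname=tokens[1], given=tokens[0]),
}
best = max(candidates, key=candidates.get)
print(best)   # "surname-first": Zhang as surname + Wei as given name wins with these numbers
```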
To get started with Sinonym, clone the repository and install the necessary dependencies using `uv`:

```bash
git clone https://github.com/allenai/sinonym.git
cd sinonym
```

- From repo root:

```bash
# create the project venv (uv defaults to .venv if you don't give a name)
uv venv --python 3.11
```

- Activate the venv (choose one):

```bash
# macOS / Linux (bash / zsh)
source .venv/bin/activate

# Windows PowerShell
. .venv\Scripts\Activate.ps1

# Windows CMD
.venv\Scripts\activate.bat
```

- Install project dependencies (dev extras):

```bash
uv sync --active --all-extras --dev
```
Sinonym includes an ML-based Japanese-vs-Chinese name classifier for enhanced accuracy with all-Chinese character names.
Here's a simple example of how to use Sinonym to detect and normalize a Chinese name:
```python
from sinonym.detector import ChineseNameDetector

# Initialize the detector
detector = ChineseNameDetector()

# --- Example 1: A simple Chinese name ---
result = detector.normalize_name("Li Wei")
if result.success:
    print(f"Normalized Name: {result.result}")
    # Expected Output: Normalized Name: Wei Li

# --- Example 2: A compound given name ---
result = detector.normalize_name("Wang Weiming")
if result.success:
    print(f"Normalized Name: {result.result}")
    # Expected Output: Normalized Name: Wei-Ming Wang

# --- Example 3: An all-Chinese character name ---
result = detector.normalize_name("巩俐")
if result.success:
    print(f"Normalized Name: {result.result}")
    # Expected Output: Normalized Name: Li Gong

# --- Example 4: A non-Chinese name ---
result = detector.normalize_name("John Smith")
if not result.success:
    print(f"Error: {result.error_message}")
    # Expected Output: Error: name not recognised as Chinese

# --- Example 5: Japanese name in Chinese characters (ML-enhanced detection) ---
result = detector.normalize_name("山田太郎")
if not result.success:
    print(f"Error: {result.error_message}")
    # Expected Output: Error: Japanese name detected by ML classifier

# --- Example 6: Batch processing of academic author list ---
author_list = ["Zhang Wei", "Li Ming", "Wang Xiaoli", "Liu Jiaming", "Feng Cha"]
batch_result = detector.analyze_name_batch(author_list)
print(f"Format detected: {batch_result.format_pattern.dominant_format}")
print(f"Confidence: {batch_result.format_pattern.confidence:.1%}")
# Expected Output: Format detected: NameFormat.SURNAME_FIRST, Confidence: 94%

for i, result in enumerate(batch_result.results):
    if result.success:
        print(f"{author_list[i]} → {result.result}")
# Expected Output: Zhang Wei → Wei Zhang, Li Ming → Ming Li, etc.

# --- Example 7: Quick format detection for data validation ---
unknown_format_list = ["Wei Zhang", "Ming Li", "Xiaoli Wang"]
pattern = detector.detect_batch_format(unknown_format_list)
if pattern.threshold_met:
    print(f"Consistent {pattern.dominant_format} formatting detected")
    print(f"Safe to process as batch with {pattern.confidence:.1%} confidence")
else:
    print("Mixed formatting detected - process individually")

# --- Example 8: Simple batch processing for data cleanup ---
messy_names = ["Li, Wei", "Zhang.Ming", "Wang Xiaoli"]
clean_results = detector.process_name_batch(messy_names)
for original, clean in zip(messy_names, clean_results):
    if clean.success:
        print(f"Cleaned: '{original}' → '{clean.result}'")
# Expected Output: Li, Wei → Wei Li, Zhang.Ming → Ming Zhang, etc.
```
When you call `normalize_name`, you get a `ParseResult` with helpful structured fields:

- `success`: `True`/`False` indicating whether the name was recognised as Chinese
- `result`: the final formatted string in `Given-Name Surname` order
- `parsed`: a `ParsedName` with normalized components in output order
  - `surname`, `given_name`: component strings as in `result`
  - `surname_tokens`, `given_tokens`: normalized, capitalized tokens used to form the components
  - `middle_tokens`: trailing single-letter initials extracted from the given name, if present
  - `order`: component order descriptor, typically `["given", "middle", "surname"]`
- `parsed_original_order`: a `ParsedName` aligned to the input's original order. In this view, the labels are positional: `given` corresponds to the first component(s) in the original input, and `surname` to the last component(s), regardless of the semantic roles used for normalization.

Notes:

- The tokens in `parsed` and `parsed_original_order` are the same normalized tokens; only the conceptual ordering differs via the `order` list.
- `middle_tokens` are preserved in both structures and included between given and surname when present.
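For orientation, the result roughly has the following shape. This is a sketch based on the field list above, not the library's actual class definitions:

```python
# Sketch of the result's shape only; field names follow the list above,
# but these are not Sinonym's real class definitions.
from dataclasses import dataclass


@dataclass
class ParsedNameSketch:
    surname: str
    given_name: str
    surname_tokens: list[str]
    given_tokens: list[str]
    middle_tokens: list[str]
    order: list[str]                 # e.g. ["given", "middle", "surname"]


@dataclass
class ParseResultSketch:
    success: bool
    result: str                              # "Given-Name Surname" string when success is True
    parsed: ParsedNameSketch                 # components in output order
    parsed_original_order: ParsedNameSketch  # components labelled by input position
    error_message: str = ""                  # populated when success is False
```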
Examples:

```python
res = detector.normalize_name("Li Wei")
# res.result == "Wei Li"
# res.parsed.order == ["given", "middle", "surname"]
# res.parsed_original_order.order == ["surname", "given"]
# res.parsed_original_order.given_name == "Li"   # first in input
# res.parsed_original_order.surname == "Wei"     # last in input

res = detector.normalize_name("Chi-Ying F. Huang")
# res.result == "Chi-Ying F Huang"
# res.parsed.given_tokens == ["Chi", "Ying"]
# res.parsed.middle_tokens == ["F"]
# res.parsed.order == ["given", "middle", "surname"]
# res.parsed_original_order.order == ["given", "middle", "surname"]
# res.parsed_original_order.given_name == "Chi-Ying"
# res.parsed_original_order.surname == "Huang"
```
Sinonym includes advanced batch processing capabilities that significantly improve accuracy when processing lists of names that share consistent formatting patterns. This is particularly valuable for real-world datasets like academic author lists, company directories, or database migrations.
When processing multiple names together, Sinonym:
- Detects Format Patterns: Analyzes the entire batch to identify whether names follow a surname-first (e.g., "Zhang Wei") or given-first (e.g., "Wei Zhang") pattern
- Aggregates Evidence: Uses frequency statistics across all names to build confidence in the detected pattern
- Applies Consistent Formatting: When confidence exceeds 67%, applies the detected pattern to improve parsing of ambiguous individual names
- Tracks Improvements: Identifies which names benefit from batch context vs. individual processing
- Fixes Ambiguous Cases: Names like "Feng Cha" that are difficult to parse individually become clear in batch context
- Maintains Consistency: Ensures all names in a list follow the same formatting pattern
- High Accuracy: Achieves 90%+ success rate on previously problematic cases when proper format context is available
- Intelligent Fallback: Automatically falls back to individual processing when batch patterns are unclear
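The pattern-detection step can be pictured as a vote across the batch. Below is a toy sketch of that idea, not Sinonym's implementation; the 67% figure mirrors the threshold mentioned above:

```python
from collections import Counter

# Toy sketch of batch format detection: each name contributes a vote for the
# order it individually prefers, and the majority order is applied only when
# its share of the votes clears the threshold. Not Sinonym's actual code.
CONFIDENCE_THRESHOLD = 0.67


def detect_batch_order(individual_preferences: list[str]) -> tuple[str | None, float]:
    votes = Counter(individual_preferences)          # e.g. {"SURNAME_FIRST": 4, "GIVEN_FIRST": 1}
    order, count = votes.most_common(1)[0]
    confidence = count / len(individual_preferences)
    return (order if confidence >= CONFIDENCE_THRESHOLD else None), confidence


order, conf = detect_batch_order(
    ["SURNAME_FIRST", "SURNAME_FIRST", "SURNAME_FIRST", "SURNAME_FIRST", "GIVEN_FIRST"]
)
print(order, f"{conf:.0%}")   # SURNAME_FIRST 80%
```

The library's own batch API, shown next, handles this internally.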
```python
from sinonym.detector import ChineseNameDetector

detector = ChineseNameDetector()

# Full batch analysis with detailed results
result = detector.analyze_name_batch([
    "Zhang Wei", "Li Ming", "Wang Xiaoli", "Liu Jiaming"
])
print(f"Format detected: {result.format_pattern.dominant_format}")
print(f"Confidence: {result.format_pattern.confidence:.1%}")
print(f"Improved names: {len(result.improvements)}")

# Quick format detection without full processing
pattern = detector.detect_batch_format([
    "Zhang Wei", "Li Ming", "Wang Xiaoli"
])
if pattern.threshold_met:
    print(f"Strong {pattern.dominant_format} pattern detected")

# Simple batch processing (returns list of results)
results = detector.process_name_batch([
    "Zhang Wei", "Li Ming", "Wang Xiaoli"
])
for result in results:
    print(f"Processed: {result.result}")
```
- Academic Papers: Author lists typically follow consistent formatting
- Company Directories: Employee lists often use uniform formatting conventions
- Large Datasets: Processing 100+ names where format consistency is expected
Batch processing requires a minimum of 2 names and works best with 5+ names for reliable pattern detection.
Unambiguous Names: Some names have only one possible parsing format (e.g., compound given names like "Wei-Qi Wang"). Batch processing never forces such names into the detected pattern and never raises. These names keep their best individual parse while other Chinese names benefit from the jointly detected order.
Confidence (Advisory Only): Batch detection computes a dominant format and a confidence value, but there is no confidence threshold gating. Results are always returned. The confidence is informational (e.g., for logging/UX) and is not used to raise errors.
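If you want to gate on the detected pattern yourself, a small wrapper can fall back to per-name processing when the batch signal is weak. This is a usage sketch built on the batch API shown above, not built-in behavior:

```python
def normalize_with_fallback(detector, names):
    """Use batch context when the detected pattern is strong enough,
    otherwise process each name individually.

    A usage sketch layered on Sinonym's documented API, not part of the library.
    """
    if len(names) >= 2:                              # batch mode needs at least 2 names
        pattern = detector.detect_batch_format(names)
        if pattern.threshold_met:                    # strong, consistent format detected
            return detector.process_name_batch(names)
    return [detector.normalize_name(name) for name in names]
```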
Batch processing works seamlessly with mixed datasets containing both Chinese and non-Chinese names. Non-Chinese names are rejected during individual analysis but still appear in the batch output as failed results.
```python
# Mixed dataset: 2 Western names + 8 Chinese names
mixed_names = [
    "John Smith",     # Western - will be rejected
    "Mary Johnson",   # Western - will be rejected
    "Xin Liu",        # Chinese - GIVEN_FIRST preference
    "Yang Li",        # Chinese - GIVEN_FIRST preference
    "Wei Zhang",      # Chinese - GIVEN_FIRST preference
    "Ming Wang",      # Chinese - GIVEN_FIRST preference
    "Li Chen",        # Chinese - GIVEN_FIRST preference
    "Hui Zhou",       # Chinese - GIVEN_FIRST preference
    "Feng Zhao",      # Chinese - GIVEN_FIRST preference
    "Tong Zhang",     # Chinese - might prefer SURNAME_FIRST (ambiguous)
]

result = detector.analyze_name_batch(mixed_names)

# Format detection uses only the 8 Chinese names.
# If 7 prefer GIVEN_FIRST vs 1 SURNAME_FIRST = 87.5% confidence.
# The GIVEN_FIRST pattern is applied to Chinese names; non-Chinese names return clear failures.
print(f"Total results: {len(result.results)}")                      # 10 (same as input)
print(f"Format detected: {result.format_pattern.dominant_format}")  # GIVEN_FIRST
print(f"Confidence: {result.format_pattern.confidence:.1%}")        # 87.5%

# Check results by type
for name, result_obj in zip(mixed_names, result.results):
    if result_obj.success:
        print(f"✅ {name} → {result_obj.result}")
    else:
        print(f"❌ {name} → {result_obj.error_message}")

# Output:
# ❌ John Smith → name not recognised as Chinese
# ❌ Mary Johnson → name not recognised as Chinese
# ✅ Xin Liu → Xin Liu
# ✅ Yang Li → Yang Li
# ✅ Wei Zhang → Wei Zhang
# ... (all Chinese names processed successfully with consistent formatting)
```
Key Benefits:
- Maintains input-output correspondence: Results array matches input array length and order
- Robust format detection: Only valid Chinese names contribute to pattern detection
- Consistent formatting: All Chinese names get the same detected format applied
- Clear failure reporting: Non-Chinese names are clearly marked as failed with error messages
If you'd like to contribute to Sinonym, here’s how to set up your development environment.
First, clone the repository:
```bash
git clone https://github.com/yourusername/sinonym.git
cd sinonym
```

Then, install the development dependencies:

```bash
uv sync --extra dev
```

To run the test suite, use the following command:

```bash
uv run pytest
```

We use `ruff` for linting and formatting:

```bash
# Run linting and formatting
uv run ruff check . --fix
uv run ruff format .
```
Sinonym is licensed under the Apache 2.0 License. See the `LICENSE` file for more details.
We welcome contributions! If you'd like to contribute, please follow these steps:
- Fork the repository.
- Create a new feature branch.
- Make your changes and ensure all tests and quality checks pass.
- Submit a pull request.
The accuracy of Sinonym is enhanced by data derived from ORCID records, which provides valuable frequency information for Chinese surnames and given names.