Code used to run the evaluation on TerjamaBench using the BLEU, chrF, and TER metrics.
WARNING: the dataset is gated; visit https://huggingface.co/datasets/atlasia/TerjamaBench to request access.
# Get your HF token from your Hugging Face profile (Settings -> Access Tokens)
token = 'your_huggingface_token'

# Load the benchmark dataset
df = load_benchmark(token)

# Prepare references and predictions
references = df['Darija'].tolist()  # Target (reference) translations
predictions = [...]  # Replace with your model's predicted translations (list of str)

# Evaluate individual samples
scores = evaluate(references, predictions)

# Get average scores over the whole benchmark
avg_scores = evaluate_model(references, predictions)
print(f"Average scores: {avg_scores}")
Find our blog post here.