A set of resources to perform the evaluation reported in our "Automated interlinking of speech radio archives" paper.
- evaluation.py : script to run the TopN evaluation described in our paper (a sketch of one possible TopN measure follows this list).
- data/editorial-data : ground truth editorial data on a dataset of 132 items from BBC Programmes.
- data/automated-tags : a set of automated tags derived by the framework described in our paper.
- data/automated-transcripts : a set of automated transcripts, generated using CMU Sphinx, a HUB4 acoustic model and a Gigaword-derived language model.
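As an illustration only, here is a minimal sketch of one plausible TopN measure: precision over the N best-scoring automated tags against the editorial tags. The function name and the exact metric are assumptions; the authoritative definition is the one in our paper and in evaluation.py.

    # Hypothetical TopN measure (an assumption, not necessarily the exact
    # metric from the paper): the fraction of the N best-scoring automated
    # tags whose DBpedia link appears among the editorial tags.
    def top_n_accuracy(automated_tags, editorial_tags, n=10):
        # automated_tags: list of {"score": ..., "link": ...} dictionaries,
        # sorted by descending score; editorial_tags: set of DBpedia URIs.
        top = [tag["link"] for tag in automated_tags[:n]]
        if not top:
            return 0.0
        hits = sum(1 for link in top if link in editorial_tags)
        return hits / float(len(top))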
The editorial data was crawled from BBC Programmes on 16 May 2012. Each file in data/editorial-data is named according to the following pattern: barcode_pid.json, where the barcode is used as an identifier across our different datasets and the pid is the identifier of that programme on the BBC web site. For example, X0903717_p002h45s.json holds the editorial tags for the programme with barcode X0903717 and PID p002h45s. These editorial tags are what we evaluate against.
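As a side note, the barcode-to-PID pairing can be recovered directly from these file names. A small sketch (the function name is hypothetical; the directory path is the one above):

    import os

    # Build a barcode -> PID mapping from the editorial data file names,
    # which follow the barcode_pid.json pattern described above.
    def barcode_to_pid(directory="data/editorial-data"):
        mapping = {}
        for name in os.listdir(directory):
            if not name.endswith(".json"):
                continue
            barcode, pid = name[:-len(".json")].split("_", 1)
            mapping[barcode] = pid
        return mapping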
In data/automated-tags, each barcode.json holds the automatically derived tags for the programme identified by that barcode. The JSON has the following shape:
[ { "score": score, "link": DBpedia URI }, ... ]
The array is ordered by descending score.
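For example, the N best links for a programme could be read as follows (a sketch; top_links is a hypothetical helper, and the default path is the directory above):

    import json

    # Load the automatically derived tags for one programme and return the
    # n best-scoring DBpedia links; the array is already sorted by score.
    def top_links(barcode, n=10, directory="data/automated-tags"):
        with open("%s/%s.json" % (directory, barcode)) as f:
            tags = json.load(f)
        return [tag["link"] for tag in tags[:n]]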
In data/automated-transcripts, each sub-directory corresponds to a single programme and is named after its barcode. Within each sub-directory, there is one JSON file per two-minute chunk of the programme. For example, transcript-0.json holds the automated transcript for the first chunk and transcript-1.json holds the automated transcript for the second chunk.
The JSON has the following shape:
[ "full transcript", [ [ term, start, end, score 1, score 2 ], ... ]
Start and end are in seconds; score1 and score2 capture the acoustic model score and the language model score, respectively.
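A chunk could therefore be parsed like this (a sketch; read_chunk is a hypothetical helper built on the format above):

    import json

    # Read one two-minute transcript chunk for a given programme barcode.
    # Returns the full transcript string and the list of
    # [term, start, end, acoustic_score, language_model_score] entries.
    def read_chunk(barcode, index, directory="data/automated-transcripts"):
        path = "%s/%s/transcript-%d.json" % (directory, barcode, index)
        with open(path) as f:
            full_transcript, terms = json.load(f)
        return full_transcript, terms

Each entry in terms can then be unpacked as term, start, end, score1, score2.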
To run the evaluation with the results from the automated tagging described in our paper:
$ python evaluation.py
To evaluate your own automated tags:
- Fork this repository.
- Generate JSON files for your automated tags according to the format described above.
- Replace the content of the data/automated-tags directory with your tags.
- Run the evaluation script.
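For instance, your tagger's results could be serialised in the expected format like this (a sketch, assuming your results are available as (DBpedia URI, score) pairs; write_tags is a hypothetical helper):

    import json

    # Write automated tags for one programme in the expected format:
    # a JSON array of {"score": ..., "link": ...} objects, ordered by
    # descending score, saved as barcode.json.
    def write_tags(barcode, scored_links, directory="data/automated-tags"):
        tags = [{"score": score, "link": link}
                for link, score in sorted(scored_links,
                                          key=lambda pair: pair[1],
                                          reverse=True)]
        with open("%s/%s.json" % (directory, barcode), "w") as f:
            json.dump(tags, f)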
See the 'COPYING' and 'AUTHORS' files. The license in 'COPYING' applies only to the Python code and to the automated transcripts and tags. The editorial data is under the same non-commercial license as the BBC Programmes data.