BayesForDays/swc_parser

Turn the XML Spoken Wikipedia Corpus into dataframes and json

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This repository contains a simple script that turns the Spoken Wikipedia Corpus into dataframes and jsons. I created this repo because the corpus, as downloaded, comes in an XML format that is not easily human-readable. I might turn this into a package, since there are a couple of dependencies.

About the script

Dependencies

  • click
  • pandas
  • xmltodict

parser.py works from the top-level directory (i.e. PATH/TO/spoken_wikipedia_corpus/), traverses all the subdirectories, and creates jsons (in case you want to munge the data differently from how I did it here) as well as dataframes containing the columns "term", "start_time", "end_time", and "phonemes".

These jsons and dataframes are saved in the same folder as the aligned.swc file for each Wikipedia article, under the names aligned.swc.json and aligned.swc.df.
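A quick sketch of reading one of the JSON outputs back in (the article path is a placeholder; since the script relies on xmltodict, the JSON presumably mirrors the XML structure):

import json

# Load one article's JSON output; the path below is illustrative.
with open("PATH/TO/spoken_wikipedia_corpus/Some_Article/aligned.swc.json") as f:
    article = json.load(f)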

Use

To use the script, call python parser.py with the -fd parameter to specify the directory where your files are; it will automatically traverse all subdirectories.
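For example (with the same placeholder path as above):

python parser.py -fd PATH/TO/spoken_wikipedia_corpus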

If you have downloaded the repo from GitHub, run pip install -e . from within the folder. Then, from Python, you can do this:

from swc_parser import parser
parser.parse_files('PATH/TO/spoken_wikipedia_corpus')

and voilà!

About the dataframe schema

start_time and end_time are taken directly from the XML schema, as is the term being pronounced, which is stored in the schema as t (or possibly n). It's called term here for transparency.

The phonemes column contains a JSON blob of phone durations for each word, when those existed in the annotation file. Many words have word durations but not phone durations, and vice versa. Additionally, many words have no durations or timestamps at all, presumably because the forced aligner could not align them.
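As a rough sketch (not part of the script itself), here is one way you might unpack that column once a dataframe is loaded; the assumption that unannotated rows hold non-string missing values is mine:

import json

# Parse the per-word phone-duration blobs in the "phonemes" column.
# Assumes `df` is one of the saved dataframes, already loaded into pandas.
df["phonemes_parsed"] = df["phonemes"].apply(
    lambda blob: json.loads(blob) if isinstance(blob, str) else None
)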

Because it is easy to lose track of the documents themselves (and indeed, even the sentences), I have also uploaded my personal version of the data (swc_word_durations.csv), a tab-delimited text file with two columns beyond those in the per-folder dataframes: id, an index of the article the data come from, in alphabetical order, and duration, which is end_time minus start_time. Additionally, the index (which will be loaded as a column X in R, or as a regular index in pandas) keeps track of the location of each word or character in the document being read aloud.
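A minimal sketch for loading that file with pandas, assuming the positional index described above sits in the first column:

import pandas as pd

# swc_word_durations.csv is tab-delimited despite the .csv extension.
words = pd.read_csv("swc_word_durations.csv", sep="\t", index_col=0)
# Sanity check: duration should equal end_time minus start_time.
words["duration_check"] = words["end_time"] - words["start_time"]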

UPDATE 11/30/17

Arne Köhn of SWC wrote up a solid justification of why they used XML rather than JSON, and of how to parse the corpus with standard tools, namely XPath. Give it a read :)
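For comparison, a rough sketch of querying an aligned.swc file directly with XPath via lxml; the element and attribute names here (n, start, end, pronunciation) are my guesses based on the schema description above, so verify them against a real file:

from lxml import etree

# Pull word-level timings straight from the XML with an XPath query.
tree = etree.parse("aligned.swc")
for node in tree.xpath("//n[@start and @end]"):
    print(node.get("pronunciation"), node.get("start"), node.get("end"))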
