Merge branch 'master' into inf_build
Vineel Pratap committed Jan 5, 2021
2 parents 898e8f4 + 64e54f8 commit 12013d9
Showing 123 changed files with 1,289 additions and 29,322 deletions.
34 changes: 19 additions & 15 deletions README.md
@@ -3,38 +3,42 @@
[![CircleCI](https://circleci.com/gh/facebookresearch/wav2letter.svg?style=svg)](https://circleci.com/gh/facebookresearch/wav2letter)
[![Join the chat at https://gitter.im/wav2letter/community](https://badges.gitter.im/wav2letter/community.svg)](https://gitter.im/wav2letter/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

## Important Note:
### wav2letter has been moved and consolidated [into Flashlight](https://github.com/facebookresearch/flashlight) in the [ASR application](https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr).

Future wav2letter development will occur in Flashlight.

*To build the old, pre-consolidation version of wav2letter*, check out the [wav2letter v0.2](https://github.com/facebookresearch/wav2letter/releases/tag/v0.2) release, which depends on the old [Flashlight v0.2](https://github.com/facebookresearch/flashlight/releases/tag/v0.2) release. The [`wav2letter-lua`](https://github.com/facebookresearch/wav2letter/tree/wav2letter-lua) project can be found on the `wav2letter-lua` branch, accordingly.

For more information on wav2letter++, see or cite [this arXiv paper](https://arxiv.org/abs/1812.07625).

## Recipes
This repository includes recipes to reproduce the following research papers as well as **pre-trained** models:
- [Pratap et al. (2020): Scaling Online Speech Recognition Using ConvNets](recipes/streaming_convnets/)
- [Synnaeve et al. (2020): End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures](recipes/sota/2019)
- [Kahn et al. (2020): Self-Training for End-to-End Speech Recognition](recipes/self_training)
- [Likhomanenko et al. (2019): Who Needs Words? Lexicon-free Speech Recognition](recipes/lexicon_free/)
- [Hannun et al. (2019): Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions](recipes/seq2seq_tds/)

Data preparation for training and evaluation can be found in the [data](data) directory.

Previous iterations of wav2letter can be found on:
- the [wav2letter-v0.2](https://github.com/facebookresearch/wav2letter/tree/v0.2) branch (before the wav2letter and flashlight codebases were merged).
- the [`wav2letter-lua`](https://github.com/facebookresearch/wav2letter/tree/wav2letter-lua) branch (written in Lua).

### Building the Recipes

First, install [Flashlight](https://github.com/facebookresearch/flashlight) with the [ASR application](https://github.com/facebookresearch/flashlight/tree/master/flashlight/app/asr). Then, after cloning the project source:
```shell
mkdir build && cd build
cmake .. && make -j8
```

If Flashlight or ArrayFire are installed in nonstandard paths via a custom `CMAKE_INSTALL_PREFIX`, they can be found by passing
```shell
-Dflashlight_DIR=[PREFIX]/usr/share/flashlight/cmake/ -DArrayFire_DIR=[PREFIX]/usr/share/ArrayFire/cmake
```
when running `cmake`.
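For example, assuming Flashlight and ArrayFire were installed under a hypothetical prefix `$HOME/usr`, the full configure step might look like:
```shell
cmake .. \
  -Dflashlight_DIR=$HOME/usr/share/flashlight/cmake/ \
  -DArrayFire_DIR=$HOME/usr/share/ArrayFire/cmake
```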

## Join the wav2letter community
* Facebook page: https://www.facebook.com/groups/717232008481207/
* Google group: https://groups.google.com/forum/#!forum/wav2letter-users
* Contact: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.

## License
wav2letter++ is BSD-licensed, as found in the [LICENSE](LICENSE) file.
64 changes: 64 additions & 0 deletions data/ami/README.md
@@ -0,0 +1,64 @@
# A Recipe for the AMI Corpus

"The AMI Meeting Corpus consists of 100 hours of meeting recordings. The recordings use a range of signals synchronized to a common timeline. These include close-talking and far-field microphones, individual and room-view video cameras, and output from a slide projector and an electronic whiteboard. During the meetings, the participants also have unsynchronized pens available to them that record what is written. The meetings were recorded in English using three different rooms with different acoustic properties, and include mostly non-native speakers." See http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml for more details.

We use the individual headset microphone (IHM) setting for preparing the train, dev, and test sets. The recipe here is heavily inspired by the preprocessing scripts in Kaldi (https://github.com/kaldi-asr/kaldi/tree/master/egs/ami).

## Steps to download and prepare the audio and text data

Prepare the train, dev, and test sets as list files to be used for training with wav2letter. Replace `[...]` with the appropriate destination path:

```
python prepare.py -dst [...]
```

The above script downloads the AMI data and segments it into shorter `.flac` audio files based on word timestamps. Limited-supervision training sets for 10 min, 1 hr, and 10 hr are generated as well.

The following directory structure will be generated:
```
>tree -L 4
.
├── audio
│   ├── EN2001a
│   │   ├── EN2001a.Headset-0.wav
│   │   ├── ...
│   │   └── EN2001a.Headset-4.wav
│   ├── EN2001b
│   ├── ...
│   ├── IS1009d
│   │   ├── ...
│   │   └── IS1009d.Headset-3.wav
│   └── segments
│       ├── ES2005a
│       │   ├── ES2005a_H00_MEE018_0.75_1.61.flac
│       │   ├── ES2005a_H00_MEE018_13.19_16.05.flac
│       │   └── ...
│       ├── ...
│       └── IS1009d
│           └── ...
├── lists
│   ├── dev.lst
│   ├── test.lst
│   ├── train_10min_0.lst
│   ├── train_10min_1.lst
│   ├── train_10min_2.lst
│   ├── train_10min_3.lst
│   ├── train_10min_4.lst
│   ├── train_10min_5.lst
│   ├── train_9hr.lst
│   └── train.lst
└── text
    ├── ami_public_manual_1.6.1.zip
    └── annotations
        ├── 00README_MANUAL.txt
        ├── ...
        ├── transcripts0
        ├── transcripts1
        ├── transcripts2
        ├── words
        └── youUsages
```
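Each line of the generated list files follows wav2letter's list format: a sample ID, the path to the audio, the duration in milliseconds, and the transcription. A sketch of one line, with hypothetical transcription and `[...]` standing for the destination path:
```
ES2005a_H00_MEE018_0.75_1.61 [...]/audio/segments/ES2005a/ES2005a_H00_MEE018_0.75_1.61.flac 860.0 OKAY
```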
218 changes: 218 additions & 0 deletions data/ami/ami_split_segments.pl
@@ -0,0 +1,218 @@
#!/usr/bin/env perl

# Copyright 2014 University of Edinburgh (Author: Pawel Swietojanski)

# The script splits segments longer than a given number of words (input parameter),
# based on punctuation times, and produces a somewhat more normalised form of the
# transcripts, one segment per line:
# MeetID Channel Spkr stime etime transcripts
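# For illustration, an output line might look like (hypothetical values):
#   EN2001a H00 FEE005 12.34 15.06 OKAY LET US GET STARTED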

#use List::MoreUtils 'indexes';
use strict;
use warnings;

sub split_transcripts;
sub normalise_transcripts;

sub merge_hashes {
  my ($h1, $h2) = @_;
  my %hash1 = %$h1; my %hash2 = %$h2;
  foreach my $key2 ( keys %hash2 ) {
    if( exists $hash1{$key2} ) {
      warn "Key [$key2] is in both hashes!";
      next;
    } else {
      $hash1{$key2} = $hash2{$key2};
    }
  }
  return %hash1;
}

sub print_hash {
  my ($h) = @_;
  my %hash = %$h;
  foreach my $k (sort keys %hash) {
    print "$k : $hash{$k}\n";
  }
}

sub get_name {
  my $sname = sprintf("%07d_%07d", $_[0]*100, $_[1]*100) || die 'Input undefined!';
  return $sname;
}
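# e.g. get_name(0.75, 1.61) returns "0000075_0000161" (times scaled by 100)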

sub split_on_comma {
  my ($text, $comma_times, $btime, $etime, $max_words_per_seg) = @_;
  my %comma_hash = %$comma_times;

  print "Btime, Etime : $btime, $etime\n";

  my $stime = ($etime+$btime)/2; # split time
  my $skey = "";
  my $otime = $btime;
  foreach my $k (sort {$comma_hash{$a} cmp $comma_hash{$b} } keys %comma_hash) {
    print "Key : $k : $comma_hash{$k}\n";
    my $ktime = $comma_hash{$k};
    if ($ktime==$btime) { next; }
    if ($ktime==$etime) { last; }
    if (abs($stime-$ktime)/2 < abs($stime-$otime)/2) {
      $otime = $ktime;
      $skey = $k;
    }
  }

  my %transcripts = ();

  if (!($skey =~ /[\,][0-9]+/)) {
    print "Cannot split into less than $max_words_per_seg words! Leaving : $text\n";
    $transcripts{get_name($btime, $etime)} = $text;
    return %transcripts;
  }

  print "Splitting $text on $skey at time $otime (stime is $stime)\n";
  my @utts1 = split(/$skey\s+/, $text);
  for (my $i=0; $i<=$#utts1; $i++) {
    my $st = $btime;
    my $et = $comma_hash{$skey};
    if ($i>0) {
      $st = $comma_hash{$skey};
      $et = $etime;
    }
    my (@utts) = split (' ', $utts1[$i]);
    if ($#utts < $max_words_per_seg) {
      my $nm = get_name($st, $et);
      print "SplittedOnComma[$i]: $nm : $utts1[$i]\n";
      $transcripts{$nm} = $utts1[$i];
    } else {
      print 'Continue splitting!';
      my %transcripts2 = split_on_comma($utts1[$i], \%comma_hash, $st, $et, $max_words_per_seg);
      %transcripts = merge_hashes(\%transcripts, \%transcripts2);
    }
  }
  return %transcripts;
}

sub split_transcripts {
  @_ == 4 || die 'split_transcripts: transcript btime etime max_word_per_seg';

  my ($text, $btime, $etime, $max_words_per_seg) = @_;
  my (@transcript) = @$text;

  my (@punct_indices) = grep { $transcript[$_] =~ /^[\.,\?\!\:]$/ } 0..$#transcript;
  my (@time_indices) = grep { $transcript[$_] =~ /^[0-9]+\.[0-9]*/ } 0..$#transcript;
  my (@puncts_times) = delete @transcript[@time_indices];
  my (@puncts) = @transcript[@punct_indices];

  if ($#puncts_times != $#puncts) {
    print 'Ooops, different number of punctuation signs and timestamps! Skipping.';
    return ();
  }

  # first split on full stops
  my (@full_stop_indices) = grep { $puncts[$_] =~ /[\.\?]/ } 0..$#puncts;
  my (@full_stop_times) = @puncts_times[@full_stop_indices];

  unshift (@full_stop_times, $btime);
  push (@full_stop_times, $etime);

  my %comma_puncts = ();
  for (my $i=0, my $j=0; $i<=$#punct_indices; $i++) {
    my $lbl = "$transcript[$punct_indices[$i]]$j";
    if ($lbl =~ /[\.\?].+/) { next; }
    $transcript[$punct_indices[$i]] = $lbl;
    $comma_puncts{$lbl} = $puncts_times[$i];
    $j++;
  }

  #print_hash(\%comma_puncts);

  print "InpTrans : @transcript\n";
  print "Full stops: @full_stop_times\n";

  my @utts1 = split (/[\.\?]/, uc join(' ', @transcript));
  my %transcripts = ();
  for (my $i=0; $i<=$#utts1; $i++) {
    my (@utts) = split (' ', $utts1[$i]);
    if ($#utts < $max_words_per_seg) {
      print "ReadyTrans: $utts1[$i]\n";
      $transcripts{get_name($full_stop_times[$i], $full_stop_times[$i+1])} = $utts1[$i];
    } else {
      print "TransToSplit: $utts1[$i]\n";
      my %transcripts2 = split_on_comma($utts1[$i], \%comma_puncts, $full_stop_times[$i], $full_stop_times[$i+1], $max_words_per_seg);
      print "Hash TR2:\n"; print_hash(\%transcripts2);
      print "Hash TR:\n"; print_hash(\%transcripts);
      %transcripts = merge_hashes(\%transcripts, \%transcripts2);
      print "Hash TR_NEW : \n"; print_hash(\%transcripts);
    }
  }
  return %transcripts;
}

sub normalise_transcripts {
  my $text = $_[0];

  # Do some rough and obvious preliminary normalisation, as follows.
  # Remove the remaining punctuation labels, e.g. "some text ,0 some text ,1"
  $text =~ s/[\.\,\?\!\:][0-9]+//g;
  # There is some extra spurious punctuation without spaces, e.g. "UM,I"; replace with a space
  $text =~ s/[A-Z']+,[A-Z']+/ /g;
  # Split word combinations, i.e. ANTI-TRUST to ANTI TRUST (none of them appears in cmudict anyway)
  #$text =~ s/(.*)([A-Z])\s+(\-)(.*)/$1$2$3$4/g;
  $text =~ s/\-/ /g;
  # Substitute X_M_L with X. M. L., etc.
  $text =~ s/\_/. /g;
  # Normalise and trim spaces
  $text =~ s/^\s*//g;
  $text =~ s/\s*$//g;
  $text =~ s/\s+/ /g;
  # Some transcripts are empty with "-"; nullify (and ignore) them
  $text =~ s/^\-$//g;
  $text =~ s/\s+\-$//;
  # Apply a few exceptions for dashed phrases (Mm-Hmm, Uh-Huh, etc.); these are frequent in AMI
  # and will be added to the dictionary
  $text =~ s/MM HMM/MM\-HMM/g;
  $text =~ s/UH HUH/UH\-HUH/g;

  return $text;
}
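# For illustration, the rules above map a hypothetical input as follows:
#   "ANTI-TRUST ,0 X_M_L" -> "ANTI TRUST X. M. L"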

if (@ARGV != 2) {
  print STDERR "Usage: ami_split_segments.pl <meet-file> <out-file>\n";
  exit(1);
}

my $meet_file = shift @ARGV;
my $out_file = shift @ARGV;
my %transcripts = ();

open(W, ">$out_file") || die "opening output file $out_file";
open(S, "<$meet_file") || die "opening meeting file $meet_file";

while(<S>) {

  my @A = split(" ", $_);
  if (@A < 9) { print "Skipping line @A"; next; }

  my ($meet_id, $channel, $spk, $channel2, $trans_btime, $trans_etime, $aut_btime, $aut_etime) = @A[0..7];
  my @transcript = @A[8..$#A];
  my %transcript = split_transcripts(\@transcript, $aut_btime, $aut_etime, 30);

  for my $key (keys %transcript) {
    my $value = $transcript{$key};
    my $segment = normalise_transcripts($value);
    my @times = split(/\_/, $key);
    if ($times[0] >= $times[1]) {
      print "Warning, $meet_id, $spk, $times[0] > $times[1]. Skipping.\n"; next;
    }
    if (length($segment) > 0) {
      print W join " ", $meet_id, "H0${channel2}", $spk, $times[0]/100.0, $times[1]/100.0, $segment, "\n";
    }
  }
}

close(S);
close(W);

print STDERR "Finished.\n";
47 changes: 47 additions & 0 deletions data/ami/ami_xml2text.sh
@@ -0,0 +1,47 @@
#!/usr/bin/env bash

# Copyright, University of Edinburgh (Pawel Swietojanski and Jonathan Kilgour)

if [ $# -ne 1 ]; then
  echo "Usage: $0 <ami-dir>"
  exit 1;
fi

adir=$1
wdir=$1/annotations

[ ! -f $adir/annotations/AMI-metadata.xml ] && echo "$0: File $adir/annotations/AMI-metadata.xml not found." && exit 1;

mkdir -p $wdir/log

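# Extract the version digits from `java -version`, e.g. 'java version "1.8.0_292"' yields 18,
# so the "-ge 15" check below requires Java 1.5 or newer.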
JAVA_VER=$(java -version 2>&1 | sed 's/java version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q')

if [ "$JAVA_VER" -ge 15 ]; then
if [ ! -d $wdir/nxt ]; then
echo "Downloading NXT annotation tool..."
wget -O $wdir/nxt.zip http://sourceforge.net/projects/nite/files/nite/nxt_1.4.4/nxt_1.4.4.zip
[ ! -s $wdir/nxt.zip ] && echo "Downloading failed! ($wdir/nxt.zip)" && exit 1
unzip -d $wdir/nxt $wdir/nxt.zip &> /dev/null
fi

if [ ! -f $wdir/transcripts0 ]; then
echo "Parsing XML files (can take several minutes)..."
nxtlib=$wdir/nxt/lib
java -cp $nxtlib/nxt.jar:$nxtlib/xmlParserAPIs.jar:$nxtlib/xalan.jar:$nxtlib \
FunctionQuery -c $adir/annotations/AMI-metadata.xml -q '($s segment)(exists $w1 w):$s^$w1' -atts obs who \
'@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent,global_name, 0)'\
'@extract(($sp speaker)($m meeting):$m@observation=$s@obs && $m^$sp & $s@who==$sp@nxt_agent, channel, 0)' \
transcriber_start transcriber_end starttime endtime '$s' '@extract(($w w):$s^$w & $w@punc="true", starttime,0,0)' \
1> $wdir/transcripts0 2> $wdir/log/nxt_export.log
fi
else
echo "$0. Java not found. Will download exported version of transcripts."
annots=ami_manual_annotations_v1.6.1_export
wget -O $wdir/$annots.gzip http://groups.inf.ed.ac.uk/ami/AMICorpusAnnotations/$annots.gzip
gunzip -c $wdir/${annots}.gzip > $wdir/transcripts0
fi

# Remove NXT logs dumped to stdout
grep -e '^Found' -e '^Obs' -i -v $wdir/transcripts0 > $wdir/transcripts1

exit 0;