Ngram Dataset
The dataset backing NGRAMS is the Google Books Ngram Dataset v3, the largest publicly available source of ngram data. It contains word ngrams of length 1 to 5 extracted from books digitized by Google up to and including the year 2019. The dataset was released in February 2020.
At the moment NGRAMS indexes the English, German, Hebrew, and Russian corpora. Contact us if you need support for other languages.
The data model of the raw data is displayed in the diagram below. Each corpus is a set of ngrams, each ngram is a sequence of tokens, and each ngram has associated statistical data per year.
classDiagram
direction LR
Corpus "1" -- "1..*" Ngram : has
Ngram "1" -- "1..*" Stat : has
class Ngram {
    tokens : string[]
}
class Stat {
    year : int
    matchCount : int
    volumeCount : int
}
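As a rough illustration of the raw model, the sketch below parses a single record into `Ngram` and `Stat` objects. It assumes the tab-separated line layout `ngram<TAB>year,match_count,volume_count<TAB>...`, which is not described in this document; the function name `parse_raw_line` and the sample line are made up for the example. For convenience the sketch stores the stats directly on the ngram instead of modelling the association separately.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stat:
    year: int
    match_count: int
    volume_count: int


@dataclass(frozen=True)
class Ngram:
    tokens: List[str]
    stats: List[Stat]


def parse_raw_line(line: str) -> Ngram:
    """Parse one record, assuming 'ngram<TAB>year,matches,volumes<TAB>...'."""
    ngram_text, *stat_fields = line.rstrip("\n").split("\t")
    stats = []
    for field in stat_fields:
        year, match_count, volume_count = (int(v) for v in field.split(","))
        stats.append(Stat(year, match_count, volume_count))
    # Tokens of an ngram are separated by single spaces.
    return Ngram(tokens=ngram_text.split(" "), stats=stats)


# Hypothetical sample line, not taken from the dataset.
example = parse_raw_line("quick brown fox_NOUN\t1999,12,7\t2000,30,11")
```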
Based on the model above, NGRAMS employs a more advanced model which is displayed in the diagram below. The types are also used in our REST API.
classDiagram
direction LR
Ngram --|> NgramLite : extends
Corpus "1" -- "1..*" Ngram : has
Corpus "1" -- "1" CorpusInfo : has
class CorpusInfo {
    name : string
    label : string
    stats : CorpusStat[]
}
class CorpusStat {
    numNgrams : int64
    minYear : int
    maxYear : int
    minMatchCount : int64
    maxMatchCount : int64
    minTotalMatchCount : int64
    maxTotalMatchCount : int64
}
class Ngram {
    stats : NgramStat[]
}
class NgramLite {
    id : string
    abstract : bool
    absTotalMatchCount : int64
    relTotalMatchCount : double
    tokens : NgramToken[]
}
class NgramStat {
    year : int
    absMatchCount : int
    relMatchCount : double
}
class NgramToken {
    text : string
    type : NgramTokenType
    inserted : bool
    completed : bool
}
`raw.property` refers to a property in the raw data model.
- `NgramLite.id` is an ID generated by NGRAMS.
- `NgramLite.abstract` is a flag marking an ngram as abstract. An abstract ngram is an ngram that has been derived from other ngrams by applying a filter operation such as case-folding or collapsing. An abstract ngram has no one-to-one correspondence to any ngram from the raw dataset and hence has no associated statistical data.
- `NgramLite.absTotalMatchCount` is the sum of all `Ngram.stats[i].absMatchCount` values.
- `NgramLite.relTotalMatchCount = NgramLite.absTotalMatchCount / totalMatchCountAllYears(corpus, n)` where `totalMatchCountAllYears(corpus, n)` returns data from `total_counts` files, e.g.
  - `totalMatchCountAllYears(eng, 1)` returns data from eng/totalcounts-1
  - `totalMatchCountAllYears(eng, 2)` returns data from eng/totalcounts-2
  - and so on
- `NgramLite.tokens[i].text = raw.Ngram.tokens[i]`
- `NgramLite.tokens[i].type` is the token's type such as `TEXT` or `TAGGED_NOUN`.
- `NgramLite.tokens[i].inserted` is a flag marking the token as inserted after application of a wildcard operator. This property is computed dynamically at runtime while processing a user query.
- `NgramLite.tokens[i].completed` is a flag marking the token as completed after application of the completion operator. This property is computed dynamically at runtime while processing a user query.
- `Ngram.stats[i].year = raw.Stat[i].year`
- `Ngram.stats[i].absMatchCount = raw.Stat[i].matchCount`
- `Ngram.stats[i].relMatchCount = raw.Stat[i].matchCount / totalMatchCount(corpus, n, year)` where `totalMatchCount(corpus, n, year)` returns data from `total_counts` files, e.g.
  - `totalMatchCount(eng, 1, year)` returns data from eng/totalcounts-1
  - `totalMatchCount(eng, 2, year)` returns data from eng/totalcounts-2
  - and so on
- `CorpusInfo.name` is the name of a corpus such as "English".
- `CorpusInfo.label` is the short name of a corpus such as "eng".
- `CorpusInfo.stats` is statistical data derived from the set of indexed ngrams.
- `CorpusStat.numNgrams` is the number of indexed ngrams.
- `CorpusStat.minYear` is the minimum of all `Ngram.stats[i].year` values.
- `CorpusStat.maxYear` is the maximum of all `Ngram.stats[i].year` values.
- `CorpusStat.minMatchCount` is the minimum of all `Ngram.stats[i].absMatchCount` values.
- `CorpusStat.maxMatchCount` is the maximum of all `Ngram.stats[i].absMatchCount` values.
- `CorpusStat.minTotalMatchCount` is the minimum of all `NgramLite.absTotalMatchCount` values.
- `CorpusStat.maxTotalMatchCount` is the maximum of all `NgramLite.absTotalMatchCount` values.
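The derived properties listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NGRAMS implementation: it assumes that `totalMatchCountAllYears` is the sum of the per-year totals, that the relative total is the absolute total divided by that sum, and that the per-year totals are available as a plain dictionary. All function names and sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class NgramStat:
    year: int
    abs_match_count: int
    rel_match_count: float


def build_stats(raw_stats: List[Tuple[int, int]],
                total_by_year: Dict[int, int]) -> List[NgramStat]:
    """Turn raw (year, matchCount) pairs into stats with relative counts."""
    return [NgramStat(year, count, count / total_by_year[year])
            for year, count in raw_stats]


def abs_total_match_count(stats: List[NgramStat]) -> int:
    """Sum of all per-year absolute match counts."""
    return sum(s.abs_match_count for s in stats)


def rel_total_match_count(stats: List[NgramStat],
                          total_by_year: Dict[int, int]) -> float:
    """Absolute total divided by the total over all years (assumption)."""
    return abs_total_match_count(stats) / sum(total_by_year.values())


def year_range(all_stats: List[List[NgramStat]]) -> Tuple[int, int]:
    """minYear and maxYear over the stats of all indexed ngrams."""
    years = [s.year for stats in all_stats for s in stats]
    return min(years), max(years)


# Hypothetical numbers, not taken from the dataset.
totals = {1999: 1000, 2000: 2000}
stats = build_stats([(1999, 10), (2000, 30)], totals)
abs_total = abs_total_match_count(stats)
rel_total = rel_total_match_count(stats, totals)
```

The remaining `CorpusStat` fields follow the same pattern as `year_range`, taking the minimum and maximum over `abs_match_count` or over the per-ngram totals.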
There are three types of ngrams in the raw dataset.
1. With terms only, e.g. `the quick brown fox`
2. With part-of-speech tagged terms, e.g. `the quick brown fox_NOUN`
3. With standalone part-of-speech tags, e.g. `the quick brown _NOUN_`
Ngrams of type 2 can have multiple tagged terms, but because of the combinatorial explosion Google did not tag 4- and 5-grams this way. As a result, the 4-gram `the quick brown fox_NOUN` does not exist in the dataset, but the 3-gram `quick brown fox_NOUN` does.
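The three ngram types can be told apart by looking at the tokens alone. The following sketch shows one way to do it; the tag set in `POS_TAGS` is an assumption (this document only mentions `NOUN`), and the names `NgramType` and `classify` are invented for the example. Any other underscore-delimited markers that the raw data may contain are not handled here.

```python
from enum import Enum
from typing import List

# Assumed part-of-speech tag set; only NOUN is confirmed by the text above.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", ".", "X"}


class NgramType(Enum):
    TERMS_ONLY = 1
    TAGGED_TERMS = 2
    STANDALONE_TAGS = 3


def is_standalone_tag(token: str) -> bool:
    """True for tokens such as '_NOUN_'."""
    return (len(token) > 2 and token.startswith("_")
            and token.endswith("_") and token[1:-1] in POS_TAGS)


def is_tagged_term(token: str) -> bool:
    """True for tokens such as 'fox_NOUN' (a term followed by '_TAG')."""
    term, sep, tag = token.rpartition("_")
    return bool(term) and sep == "_" and tag in POS_TAGS


def classify(tokens: List[str]) -> NgramType:
    """Classify an ngram; a standalone tag takes precedence over tagged terms."""
    if any(is_standalone_tag(t) for t in tokens):
        return NgramType.STANDALONE_TAGS
    if any(is_tagged_term(t) for t in tokens):
        return NgramType.TAGGED_TERMS
    return NgramType.TERMS_ONLY
```

For example, `classify("quick brown fox_NOUN".split())` yields `NgramType.TAGGED_TERMS`, while `classify("the quick brown _NOUN_".split())` yields `NgramType.STANDALONE_TAGS`.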
NGRAMS has its own custom-made NoSQL system tailored for indexing and storing ngram data. Because the data is static, the system has been heavily optimized for rapid read-only access.
The index contains ngrams of type 1 and 2, see Ngram Types, with complete statistical data as shown in Data Model. It does not contain ngrams of type 3 because the goal of NGRAMS' query language is to replace wildcards with actual words and not standalone tags.
The following table gives an overview of the number of ngrams that have been indexed.
| Corpus | #1grams | #2grams | #3grams | #4grams | #5grams | total | 
|---|---|---|---|---|---|---|
| English | 76.9 M | 1.6 B | 11.8 B | 5.1 B | 5.0 B | 23.6 B | 
| German | 38.8 M | 686.9 M | 2.8 B | 699.1 M | 409.2 M | 4.7 B | 
| Hebrew | 2.8 M | 43.8 M | 66.4 M | 7.6 M | 3.5 M | 124 M | 
| Russian | 12.8 M | 313.0 M | 973.9 M | 181.0 M | 97.5 M | 1.6 B |