REST API
NGRAMS has been built following API-first principles. The goal is to make ngram data as easy to access as possible. The API sends and receives data in UTF-8 encoded JSON format.
There are endpoints which enable the following types of requests:
- Search Request — Send a wildcard query and receive matching ngrams.
- Batch Request — Send multiple raw queries at once and receive matching ngrams.
- Ngram Request — Send an ngram id and receive year-based match count information.
- CorpusInfo Request — Get static information about a corpus.
- TotalCounts Request — Get total match counts by ngram length and year.
The REST API is currently in beta status — expect things to change.
By using the API, you agree to our Terms of Service. In short, they read: NGRAMS can be used free of charge, for both commercial and non-commercial purposes. Use requires attribution.
https://api.ngrams.dev
We do not apply any rate limiting at the moment. You can send as many requests as you want. We will adjust this policy if necessary.
We will block clients based on IP address if we detect abnormal usage, e.g. repeatedly trying to request undocumented endpoints.
At the moment, the following corpora are available.
| Name | Label | #Ngrams |
|---|---|---|
| English | eng | 23.6 B |
| German | ger | 4.5 B |
| Russian | rus | 1.5 B |
TLDR Here is how to get started in FAQ style.
In NGRAMS every ngram has an absolute total match count and a relative total match count. The former is the sum of all year-based absolute match counts. The latter is the absolute total match count divided by the absolute total match count of all ngrams of the same length — you can call this the ngram's probability.
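The two definitions above can be sketched in Python as follows. The function names are hypothetical helpers, not part of the API, and the numbers in the usage note are made up for illustration.

```python
def abs_total_match_count(yearly_counts):
    """Sum the year-based absolute match counts of one ngram."""
    return sum(yearly_counts)


def rel_total_match_count(abs_total, abs_total_all_same_length):
    """Divide an ngram's absolute total by the absolute total of all
    ngrams of the same length (the ngram's probability)."""
    if abs_total_all_same_length <= 0:
        raise ValueError("the denominator must be positive")
    return abs_total / abs_total_all_same_length
```

For example, an ngram with yearly counts 2, 3, and 5 in a made-up corpus with 1000 matches of that ngram length has an absolute total of 10 and a relative total of 0.01.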
- Send a Search Request.
  GET {base_url}/{corpus}/search?query=my+awesome+ngram
- You will receive a list of matching ngrams because the search is not case-sensitive by default, i.e. the list contains my awesome ngram, MY AWESOME NGRAM, etc. You can add &flags=cs to make the search case-sensitive (only one or none match). Each ngram in the list has a relTotalMatchCount property.
I need one probability for all cases of this ngram.
- Send a Search Request with result set collapsing.
  GET {base_url}/{corpus}/search?query=my+awesome+ngram&flags=cr
- You will get one (or none) matching ngram whose absolute total match count is the sum of all cases, i.e. my awesome ngram, MY AWESOME NGRAM, etc. The relative total match count is derived as described above. The ngram's text will be in case-folded format. The ngram is also called abstract because it was derived from other ngrams and has no 1:1 correspondence with an ngram in the raw dataset.
This requires two requests but we will support this use case directly some time in the future. See this feature request.
- Send a Search Request.
  GET {base_url}/{corpus}/search?query=my+awesome+ngram
- You will receive a list of matching ngrams because the search is not case-sensitive by default, i.e. the list contains my awesome ngram, MY AWESOME NGRAM, etc. You can add &flags=cs to make the search case-sensitive (only one or none match). Each ngram in the list has an id property.
- Send an Ngram Request providing an ngram ID.
  GET {base_url}/{corpus}/{ngram_id}
- You will get the full ngram representation with year-based match counts.
I need the frequencies for all cases of this ngram.
This is not supported at the moment because it would require returning year-based match counts of an abstract ngram. See NgramLite for details. You can compute these stats yourself by fetching the full ngrams for all IDs returned in step 2 above.
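A minimal sketch of that do-it-yourself aggregation, assuming the full Ngram objects for all case variants have already been fetched and parsed into Python dicts. The helper name merge_yearly_counts is hypothetical; it only sums the absolute year-based counts.

```python
from collections import defaultdict


def merge_yearly_counts(ngrams):
    """Sum absMatchCount per year over several full Ngram objects (dicts)."""
    totals = defaultdict(int)
    for ngram in ngrams:
        for stat in ngram.get("stats", []):
            totals[stat["year"]] += stat["absMatchCount"]
    # Return the result in the same shape as a stats array, sorted by year.
    return [{"year": y, "absMatchCount": c} for y, c in sorted(totals.items())]
```

Years that are missing in one variant but present in another are handled naturally, because each year is accumulated independently.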
A search request allows you to send a single wildcard query and receive a set of matching ngrams. This is basically the same type of request issued when using the search interface on https://ngrams.dev. The returned ngrams are sorted by decreasing total match count. Large sets are sent in chunks making use of pagination.
GET /{corpus}/search
corpus string
The label of the corpus to search, see corpora.
query string
The percent-encoded query string.
flags string optional
Enable search flags by adding the respective character sequence to the string.
- cs: Search is case-sensitive.
- cr: Collapse the result set by case-folding and then merging equal ngrams.
- ep: Exclude ngrams from the result set where wildcards matched punctuation marks (Unicode category P).
- es: Exclude ngrams from the result set where wildcards matched sentence boundary tags, namely _START_ and _END_.
- rq: Raw query. Do not interpret query operators. No need for escape sequences.
limit number optional default: 100 max: 100
The maximum number of ngrams to return.
The limit is applied before ngram filtering or collapsing is performed.
This means that the actual number of ngrams may be smaller if any of the flags cs, cr, ep, or es are set.
start string optional
An opaque token to fetch the next chunk of a result set (pagination). A start token is included in a successful search result if there are possibly more matching ngrams.
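The parameters above can be assembled into a request URL as sketched below. This is an illustration, not an official client: build_search_url is a hypothetical helper, and percent-encoding is delegated to urlencode from the Python standard library.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.ngrams.dev"


def build_search_url(corpus, query, flags=None, limit=None, start=None):
    """Build a search URL; urlencode percent-encodes the query string."""
    params = {"query": query}
    if flags:
        params["flags"] = flags
    if limit is not None:
        params["limit"] = limit
    if start:
        params["start"] = start
    return f"{BASE_URL}/{corpus}/search?{urlencode(params)}"
```

Calling build_search_url("eng", "hello *", flags="cs", limit=3) yields https://api.ngrams.dev/eng/search?query=hello+%2A&flags=cs&limit=3.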
The HTTP status code tells whether a request was successful. A code other than 200 is considered failure. The response to a 400 bad request contains body data with error details.
| Code | Body | Description |
|---|---|---|
| 200 OK | SearchResult | The request was successful. |
| 400 Bad Request | ErrorResult | The request failed due to client error. |
| 404 Not Found | no | The corpus is unknown. |
| 500 Internal Server Error | no | Try again later. |
curl 'https://api.ngrams.dev/eng/search?query=hello+*&flags=cs&limit=3'
# OR
curl -G https://api.ngrams.dev/eng/search \
--data-urlencode query='hello *' \
-d flags=cs \
-d limit=3
200 OK
Response Body
// SearchResult object,
// 2 instead of 3 ngrams due to post-retrieval case-sensitive filtering.
{
"query": "hello *",
"queryTokens": [
{
"text": "hello",
"kind": "TERM"
},
{
"text": "*",
"kind": "STAR"
}
],
"ngrams": [
{
"id": "d975b1edafaf5aa521f6aee0d7efbe06",
"absTotalMatchCount": 608657,
"relTotalMatchCount": 2.899120077549673e-7,
"tokens": [
{
"text": "hello",
"kind": "TERM"
},
{
"text": ",",
"kind": "TERM",
"inserted": true
}
]
},
{
"id": "983f5221b490f979d836276d3d986ef2",
"absTotalMatchCount": 598094,
"relTotalMatchCount": 2.848807002403643e-7,
"tokens": [
{
"text": "hello",
"kind": "TERM"
},
{
"text": ".",
"kind": "TERM",
"inserted": true
}
]
}
],
"nextPageToken": "157c30ede3ed098320eadbaf1a807dd17228ef6880f959e08605f29fcbfc14a894ffea81eec4fdb609a3e331772e9bd5",
"nextPageLink": "https://api.ngrams.dev/eng/search?query=hello+%2A&flags=cs&limit=3&start=157c30ede3ed098320eadbaf1a807dd17228ef6880f959e08605f29fcbfc14a894ffea81eec4fdb609a3e331772e9bd5"
}
curl https://api.ngrams.dev/eng/search
400 Bad Request
Response Body
// ErrorResult object
{
"error": {
"code": "MISSING_PARAMETER.QUERY"
}
}
Wildcard queries can generate result sets that contain thousands of ngrams. The API sends these big result sets in
chunks called pages. The start of a page is controlled by the start parameter. The size of a page is controlled by the
limit parameter.
Every search response that contains a partial result, i.e. a page, includes a so-called page token. This page
token can be used in a follow-up request, as the value of the start parameter, to fetch the next page.
If a response has no page token, you have reached the end of the result set.
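The pagination loop described above can be sketched like this. The generator does not perform HTTP itself: fetch_page is a caller-supplied function (an assumption of this sketch) that takes a start token, or None for the first page, and returns a parsed SearchResult as a dict.

```python
def iter_all_ngrams(fetch_page):
    """Yield every ngram of a result set, following page tokens.

    fetch_page(start) must return a parsed SearchResult (dict); start is
    None for the first page and the previous nextPageToken afterwards.
    """
    start = None
    while True:
        result = fetch_page(start)
        yield from result.get("ngrams", [])
        start = result.get("nextPageToken")
        if not start:
            # No page token: the end of the result set has been reached.
            break
```

Because it is a generator, a caller can stop iterating early and no further pages are requested.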
A batch request allows you to send up to 100 raw queries at once, which saves a lot of HTTP round trip time compared to
single search requests. Queries in a batch request have the rq flag implicitly set, which means there is no
interpretation of query operators. This type of request is most appropriate in situations where the existence or
frequency of multiple ngrams needs to be checked quickly.
Batch requests have no means of pagination, because the list of matching ngrams per query is rather short as it only
reflects variants in casing. If, in addition, the cs or cr flag is enabled, at most one ngram is returned
per query.
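When more than 100 queries need to be checked, they have to be spread over several batch requests. The sketch below prepares the JSON request bodies; make_batch_bodies is a hypothetical helper, and sending the bodies is left to the caller.

```python
import json

MAX_BATCH_SIZE = 100  # upper bound on queries per batch request, as stated above


def make_batch_bodies(queries, flags=None, size=MAX_BATCH_SIZE):
    """Split queries into JSON request bodies of at most `size` queries."""
    if not 1 <= size <= MAX_BATCH_SIZE:
        raise ValueError("size must be between 1 and 100")
    bodies = []
    for i in range(0, len(queries), size):
        batch = {"queries": queries[i:i + size]}
        if flags:
            batch["flags"] = flags
        bodies.append(json.dumps(batch))
    return bodies
```

Each returned string can be sent as the body of one POST request with a Content-Type of application/json.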
POST /{corpus}/batch
corpus string
The label of the corpus to search, see corpora.
Batch object
The HTTP status code tells whether a request was successful. Invalid request body data causes a request to fail entirely. If the processing of individual queries fails, the batch response will contain corresponding error information for these queries.
| Code | Body | Description |
|---|---|---|
| 200 OK | BatchResult | The request was successful. |
| 400 Bad Request | ErrorResult | The request failed due to client error. |
| 404 Not Found | no | The corpus is unknown. |
| 500 Internal Server Error | no | Try again later. |
curl https://api.ngrams.dev/eng/batch \
-H 'Content-Type: application/json' \
-d '@path/to/batch.json'
path/to/batch.json
// Batch object
{
"flags": "cs",
"queries": [
"The quick brown",
"fox jumps over the lazy dog"
]
}
200 OK
Response Body
// BatchResult object
{
"results": [
{
"query": "The quick brown",
"queryTokens": [
{
"text": "The",
"kind": "TERM"
},
{
"text": "quick",
"kind": "TERM"
},
{
"text": "brown",
"kind": "TERM"
}
],
"ngrams": [
{
"id": "ecaf9b4576d82550a5661c85f515be24",
"absTotalMatchCount": 18248,
"relTotalMatchCount": 9.13534806330214e-9,
"tokens": [
{
"text": "The",
"kind": "TERM"
},
{
"text": "quick",
"kind": "TERM"
},
{
"text": "brown",
"kind": "TERM"
}
]
}
]
},
{
"error": {
"code": "INVALID_QUERY.TOO_MANY_TOKENS"
},
"query": "fox jumps over the lazy dog",
"queryTokens": [
{
"text": "fox",
"kind": "TERM"
},
{
"text": "jumps",
"kind": "TERM"
},
{
"text": "over",
"kind": "TERM"
},
{
"text": "the",
"kind": "TERM"
},
{
"text": "lazy",
"kind": "TERM"
},
{
"text": "dog",
"kind": "TERM"
}
]
}
]
}
An ngram request allows you to send an ngram ID and receive a full ngram object with year-based match count information. This type of request is used on https://ngrams.dev to fetch the data backing an ngram's histogram view.
GET /{corpus}/{ngram_id}
corpus string
The label of the corpus to search, see corpora.
ngram_id string
An ngram ID as returned from a search or batch request. Note that the ID of an abstract ngram is always considered unknown, because such ngrams have no year-based match count information.
The HTTP status code tells whether a request was successful. A code other than 200 is considered failure.
| Code | Body | Description |
|---|---|---|
| 200 OK | Ngram | The request was successful. |
| 404 Not Found | no | The corpus or ID is unknown. |
| 500 Internal Server Error | no | Try again later. |
curl https://api.ngrams.dev/eng/92c668bc012dc3e387ff0c7e791528db
200 OK
Response Body
// Ngram object
{
"id": "92c668bc012dc3e387ff0c7e791528db",
"absTotalMatchCount": 118987,
"relTotalMatchCount": 5.6675204699428895e-8,
"tokens": [
{
"text": "Hello",
"kind": "TERM"
},
{
"text": "World",
"kind": "TERM"
}
],
"stats": [
{
"year": 1880,
"absMatchCount": 52,
"relMatchCount": 1.2108055367130671e-8
},
// There might be gaps for years without any data.
{
"year": 1899,
"absMatchCount": 1,
"relMatchCount": 1.28983869973734e-10
},
{
"year": 1900,
"absMatchCount": 49,
"relMatchCount": 6.053137889244918e-9
},
// Items removed to keep it short.
{
"year": 2017,
"absMatchCount": 5107,
"relMatchCount": 1.765788720318268e-7
},
{
"year": 2018,
"absMatchCount": 4923,
"relMatchCount": 1.7802199706983458e-7
},
{
"year": 2019,
"absMatchCount": 3798,
"relMatchCount": 1.5816449035193755e-7
}
]
}
A corpus info request allows you to get static information about a corpus.
GET /{corpus}/info
corpus string
The label of the corpus to search, see corpora.
The HTTP status code tells whether a request was successful. A code other than 200 is considered failure.
| Code | Body | Description |
|---|---|---|
| 200 OK | CorpusInfo | The request was successful. |
| 404 Not Found | no | The corpus is unknown. |
| 500 Internal Server Error | no | Try again later. |
curl https://api.ngrams.dev/eng/info
200 OK
Response Body
// CorpusInfo object
{
"name": "English",
"label": "eng",
"stats": [
{
"numNgrams": 76862879,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 1922716631,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 115513165249
},
{
"numNgrams": 1604084580,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 1446928350,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 82544506739
},
{
"numNgrams": 11777289629,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 84854130,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 2907518961
},
{
"numNgrams": 5089891990,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 14391742,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 384260789
},
{
"numNgrams": 5020506742,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 7167265,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 226361873
}
]
}
Get the sum of ngram occurrences by ngram length and year. This data is useful for computing the relative frequencies of ngrams.
GET /{corpus}/total_counts
corpus string
The label of the corpus to search, see corpora.
The HTTP status code tells whether a request was successful. A code other than 200 is considered failure.
| Code | Body | Description |
|---|---|---|
| 200 OK | TotalCounts | The request was successful. |
| 404 Not Found | no | The corpus is unknown. |
| 500 Internal Server Error | no | Try again later. |
curl https://api.ngrams.dev/eng/total_counts
200 OK
Response Body
// TotalCounts object
{
"minYear": 1470,
"maxYear": 2019,
"matchCounts": [
[
984,
"…",
22826152232
],
[
1019,
"…",
24012975299
],
[
984,
"…",
22826152232
],
[
949,
"…",
21639329784
],
[
914,
"…",
20458636150
]
]
}
A complete list of types (schemas) used in this API.
A container for multiple queries and search options.
All queries have the rq flag implicitly set, i.e. wildcards are not interpreted.
queries string[]
An array of query strings.
flags string optional
Enable search flags by adding the respective character sequence to the string.
- cs: Search is case-sensitive.
- cr: Collapse the result set by case-folding and then merging equal ngrams.
limit number optional
The maximum number of ngrams to return per query.
The limit is applied before ngram filtering or collapsing is performed.
This means that the actual number of ngrams may be smaller if the cs or cr flag is set.
A limit of 0 is mapped to the default value (20).
A container for multiple search results.
results (SearchResult | ErrorResult)[]
An array of multiple types, aka union. results[i] is the outcome of Batch.queries[i]. If results[i].error
exists, the object is an instance of ErrorResult, otherwise it is an instance of
SearchResult.
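The union rule above translates into a simple check for the error property. The following sketch, with the hypothetical helper split_batch_results, separates a parsed BatchResult (a dict) into successful and failed queries while keeping the index that links each result back to its query.

```python
def split_batch_results(batch_result):
    """Separate a parsed BatchResult into SearchResult and ErrorResult items."""
    successes, failures = [], []
    for index, result in enumerate(batch_result["results"]):
        # Per the union rule above, an "error" property marks an ErrorResult.
        if "error" in result:
            failures.append((index, result["error"]["code"]))
        else:
            successes.append((index, result))
    return successes, failures
```

Keeping the index makes it possible to look up the original query string even for results that carry no query property.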
A container for static information about a single corpus.
name string
The name of the corpus — something like "English", see corpora.
label string
The label of the corpus — something like "eng", see corpora.
stats CorpusStat[5]
An array of CorpusStat objects sorted by ngram length. stats[0] refers to the subset of 1-grams,
stats[1] refers to the subset of 2-grams, and so on.
A container for statistical data about a corpus or sub-corpus.
numNgrams number
The number of indexed ngrams. See ngram dataset for details.
minYear number
The minimum year value associated with an ngram.
maxYear number
The maximum year value associated with an ngram.
minMatchCount number
The minimum value of an ngram's year-based match count.
maxMatchCount number
The maximum value of an ngram's year-based match count.
minTotalMatchCount number
The minimum value of an ngram's total match count.
maxTotalMatchCount number
The maximum value of an ngram's total match count.
A type containing information about a failed query or request.
code ErrorCode
A string indicating the type of error. The values are constants to be used for programmatic error handling.
context string | object optional
Provides error-specific context information to be used for advanced programmatic error handling. The exact format is currently work in progress.
An enum that describes the type of an error. Values are string constants.
| Value | Description |
|---|---|
| INVALID_BATCH_SIZE | Number of queries in a batch request was out-of-range |
| INVALID_PARAMETER.LIMIT | limit parameter not parsable or out-of-range |
| INVALID_PARAMETER.START | start parameter is invalid |
| INVALID_QUERY.BAD_ALTERNATION | Token to the left or right of the / operator is invalid |
| INVALID_QUERY.BAD_COMPLETION | The ~ operator has no prefix in front of it |
| INVALID_QUERY.BAD_TERM_GROUP | Opening quotation mark without closing quotation mark |
| INVALID_QUERY.NO_TERM | Query has no search term |
| INVALID_QUERY.TOO_EXPENSIVE | Query is too expensive to process and was rejected |
| INVALID_QUERY.TOO_MANY_TOKENS | Query has more than 5 tokens after tokenization |
| INVALID_REQUEST_BODY | Body of batch request not parsable or has wrong schema |
| INVALID_UTF8_ENCODING | Query string is not in UTF-8 format |
| MISSING_PARAMETER.QUERY | query parameter is missing |
A container for error information and related data.
error Error
An error object.
query string optional
The user query.
queryTokens QueryToken[] optional
A representation of the query after tokenization, which is an array of QueryToken objects. This
property is only available if query processing has actually taken place. It is not available if a request was rejected
at an earlier stage, e.g. due to missing required parameters.
A representation of an ngram with full year-based match count information. The properties listed below are in addition
to the properties of NgramLite, i.e. Ngram extends NgramLite.
stats NgramStat[]
An array of NgramStat objects.
A light-weight representation of an ngram with basic metadata.
id string
An ID that identifies an ngram uniquely within a corpus. The ID can be used to fetch the corresponding Ngram object
with full year-based match count information. See ngram request for details. Applications that need a
unique ngram ID for the whole dataset can do so by prefixing this ID with the label of the associated corpus, i.e.
{label}_{ngram_id}.
abstract boolean optional
Indicates whether the ngram is abstract as a result of applying a filter operation, e.g. result set collapsing. An abstract ngram does not represent an existing ngram from the dataset and hence has no associated year-based match count information. Its ID is nevertheless unique within a corpus. The property is only present if true — absence means false.
absTotalMatchCount number
The ngram's absolute total match count. See Data Model for details.
relTotalMatchCount number
The ngram's relative total match count. See Data Model for details.
tokens NgramToken[1..5]
An array of NgramToken objects of length 1 to 5.
A representation of an ngram's match count relating to a single year.
year number
The year the data belongs to. See Data Model for details.
absMatchCount number
The ngram's absolute match count. See Data Model for details.
relMatchCount number
The ngram's relative match count. See Data Model for details.
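As the example Ngram response notes, the stats array may have gaps for years without any data. If a dense series is needed, e.g. for plotting, the gaps can be filled as sketched below. Treating a missing year as zero matches is an assumption of this sketch, and fill_missing_years is a hypothetical helper.

```python
def fill_missing_years(stats, min_year, max_year):
    """Return one entry per year in [min_year, max_year], zero-filling gaps."""
    by_year = {stat["year"]: stat for stat in stats}
    empty = {"absMatchCount": 0, "relMatchCount": 0.0}
    return [
        {"year": year, **by_year.get(year, empty)}
        for year in range(min_year, max_year + 1)
    ]
```

The year range could be taken from the minYear and maxYear values of a CorpusInfo or TotalCounts response.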
A representation of a single token as part of an ngram. It contains basic information like text and type, as well as metadata about its relation to a query, e.g. if the token has been inserted as a result of wildcard application.
text string
The token's text in UTF-8 encoding. For tokens that have a part-of-speech suffix in the original raw data, e.g.
example_NOUN, this suffix has been removed. The POS tag information is available via the kind property.
kind NgramTokenKind
The token's kind makes it possible to distinguish programmatically between text-like tokens, part-of-speech (POS) tagged tokens, and sentence boundary tags. It can be used to append the original POS tag suffix to the text string or for syntax highlighting when displayed.
inserted boolean optional
Indicates whether the token was inserted as a result of applying a *, **, or *_ADJ and friends wildcard. The
property is only present if true — absence means false.
completed boolean optional
Indicates whether the token was completed as a result of applying the ~ operator. The property is only present if
true, absence means false.
An enum that describes the type of an ngram token. Values are string constants.
| Value | Description |
|---|---|
| TERM | The token is a regular term. |
| TAGGED_AS_ADJ | The token has a POS tag of ADJ. |
| TAGGED_AS_ADP | The token has a POS tag of ADP. |
| TAGGED_AS_ADV | The token has a POS tag of ADV. |
| TAGGED_AS_CONJ | The token has a POS tag of CONJ. |
| TAGGED_AS_DET | The token has a POS tag of DET. |
| TAGGED_AS_NOUN | The token has a POS tag of NOUN. |
| TAGGED_AS_NUM | The token has a POS tag of NUM. |
| TAGGED_AS_PRON | The token has a POS tag of PRON. |
| TAGGED_AS_PRT | The token has a POS tag of PRT. |
| TAGGED_AS_VERB | The token has a POS tag of VERB. |
| SENTENCE_START | The token is the _START_ token. |
| SENTENCE_END | The token is the _END_ token. |
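The kind values above can be used to restore the POS suffix that was removed from a token's text. The sketch below assumes the raw format is the text followed by an underscore and the tag, as in the example_NOUN example; raw_token_text is a hypothetical helper.

```python
TAG_PREFIX = "TAGGED_AS_"


def raw_token_text(token):
    """Re-append the POS suffix that was stripped from a token's text."""
    kind = token["kind"]
    if kind.startswith(TAG_PREFIX):
        return f'{token["text"]}_{kind[len(TAG_PREFIX):]}'
    # TERM, SENTENCE_START and SENTENCE_END tokens are returned unchanged.
    return token["text"]
```

The same prefix check could drive syntax highlighting, since it separates tagged tokens from regular terms and sentence boundary tags.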
A representation of a single token as part of a query string.
text string
The token's text in UTF-8 encoding.
kind QueryTokenKind
The token's kind tells whether the token has been recognized as a text-like token or as a query operator.
An enum that describes the type of a query token. Values are string constants.
| Value | Description |
|---|---|
| TERM | The token is a regular query term. |
| STAR | The token is the * wildcard. |
| STARSTAR | The token is the ** wildcard. |
| STAR_ADJ | The token is the *_ADJ wildcard. |
| STAR_ADP | The token is the *_ADP wildcard. |
| STAR_ADV | The token is the *_ADV wildcard. |
| STAR_CONJ | The token is the *_CONJ wildcard. |
| STAR_DET | The token is the *_DET wildcard. |
| STAR_NOUN | The token is the *_NOUN wildcard. |
| STAR_NUM | The token is the *_NUM wildcard. |
| STAR_PRON | The token is the *_PRON wildcard. |
| STAR_PRT | The token is the *_PRT wildcard. |
| STAR_VERB | The token is the *_VERB wildcard. |
| SENTENCE_START | The token is the _START_ token. |
| SENTENCE_END | The token is the _END_ token. |
| SLASH | The token is the / operator. |
| PREFIX | The token has the ~ operator. |
| TERM_GROUP | The token is a term group. |
A representation of the outcome of a successfully processed query.
query string
The user query.
queryTokens QueryToken[]
A representation of the query after tokenization, which is an array of QueryToken objects.
ngrams NgramLite[]
A representation of the result set, which is an array of NgramLite objects.
nextPageToken string optional
An opaque token to be used in a follow-up request to fetch the next chunk of the result set. See pagination for details.
nextPageLink string optional
An absolute URL to issue a follow-up request. See pagination for details.
Represents a lookup table for ngram total match counts by ngram length and year.
The table covers the full range from ngram length 1 to 5 combined with first year to last year of ngram occurrence. For combinations where no total match count is available the value is zero.
minYear int32
The first year of ngram occurrence.
maxYear int32
The last year of ngram occurrence.
matchCounts int64[][]
A matrix of ngram total match counts.
- The first subscript in the range [0, 4] denotes the ngram length, with 1-gram counts at index 0, 2-gram counts at index 1, and so on.
- The second subscript in the range [0, (maxYear - minYear)] denotes the year, with minYear at index 0, minYear + 1 at index 1, and so on.
Example: The total match count of 3-grams for the year 2000 in the English corpus is matchCounts[2][530] because
minYear = 1470 and 2000 - minYear = 530.
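The index arithmetic from the example can be wrapped in a small lookup function. This is a sketch operating on a parsed TotalCounts object (a dict); total_match_count is a hypothetical helper.

```python
def total_match_count(total_counts, ngram_length, year):
    """Look up the total match count for an ngram length (1 to 5) and a year."""
    min_year, max_year = total_counts["minYear"], total_counts["maxYear"]
    if not 1 <= ngram_length <= 5:
        raise ValueError("ngram_length must be between 1 and 5")
    if not min_year <= year <= max_year:
        raise ValueError("year is outside the covered range")
    # First subscript: ngram length minus 1. Second subscript: offset from minYear.
    return total_counts["matchCounts"][ngram_length - 1][year - min_year]
```

With the English corpus data, total_match_count(data, 3, 2000) reads matchCounts[2][530], matching the example above.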