-
Notifications
You must be signed in to change notification settings - Fork 82
Query Cloudberry
After Register Dataset to Cloudberry, the front-end application or the user can send queries to Cloudberry through either of the following two end-points.
Example URL
http://localhost:9000/berry
Example Query
Suppose you create a file query.json
with the following content.
{
"dataset": "twitter.ds_tweet",
"select" : {
"order" : ["-create_at"],
"limit": 1,
"offset": 0,
"field": ["*"]
}
}
This is a simple query to select only one record from twitter.ds_tweet
dataset.
Example Command
curl -X POST -H "Content-Type: application/json" -d @query.json http://localhost:9000/berry
Example Response
[[
{
"place.bounding_box":[
[-71.625705, 42.310982],
[-71.475702, 42.380659]
],
"favorite_count":0,
"geo_tag.countyID":25017,
"user.id":111986159,
"geo_tag.cityID":2538715,
"is_retweet":false,
"text":"Brewing with the back to be or tomorrow at the best to offer they are @ Deep County, CA [Portland Familes, San Maraporonic Makers -",
"retweet_count":0,
"in_reply_to_user":-1,
"id":1129340818383343622,
"coordinate":[-71.50138889, 42.36916667],
"in_reply_to_status":-1,
"geo_tag.stateID":25,
"create_at":"2019-05-17T11:00:07.000Z",
"lang":"en",
"user.profile_image_url":"http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png",
"user.name":"Hazel Dashawn"
}
]]
To minimize the overhead of communication, Cloudberry supports the Websocket
connection when front-end application needs to fluently send queries to and receive responses from Cloudberry, or needs responses pushed by Cloudberry automatically (e.g. when query slicing is applied).
Example URL
ws://localhost:9000/ws
Example code
Open a Websocket
connection to Cloudberry in Javascript as following:
var ws = new WebSocket("ws://localhost:9000/ws"");
Query format of Websocket
end-point is the same as HTTP POST
end-point, and is described as following.
A query is composed of the following parameters:
- Dataset : the dataset to query on.
-
Unnest : to flatten a record based on the nested
Bag
attribute to generate multiple records. (Only useful for semi-structured databases, e.g. AsterixDB.) - Filter : a set of selection predicates.
-
Group :
-
by : to specify the
group by
fields. -
aggregate: to specify the aggregation functions to apply, including
count
,sum
,min
,max
andavg
.
-
by : to specify the
-
Select:
- field : a set of fields that should be returned.
-
order : to specify the
order by
fields. - limit : to specify the number of records that will be returned at most.
-
offset : to specify the pagination combined with
limit
value.
Here are some examples as following.
- Get 100 latest tweets (time and id) that mention "hurricane".
{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}],
"select" : {
"order" : ["-create_at"],
"limit": 100,
"offset" : 0,
"field": ["create_at", "id"]
}
}
Expected results are as following:
[[
{"create_at":"2019-05-16T17:00:17.000Z","id":1129069066373799936},
{"create_at":"2019-05-14T21:34:07.000Z","id":1128413203938127873},
{"create_at":"2019-05-14T16:12:34.000Z","id":1128332285806354438},
...
]]
- Get the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018.
{
"dataset": "twitter.ds_tweet",
"filter": [
{
"field": "create_at",
"relation": "inRange",
"values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
},
{
"field": "text",
"relation": "contains",
"values": ["hurricane", "florence"]
}
],
"group": {
"by": [
{
"field": "geo_tag.stateID",
"as": "state"
},
{
"field": "create_at",
"apply": {
"name": "interval",
"args": {
"unit": "day"
}
},
"as": "day"
}
],
"aggregate": [
{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}
]
}
}
Expected results are as following:
[[
{"state":6,"day":"2018-04-05T00:00:00.000Z","count":1},
{"state":6,"day":"2018-09-11T00:00:00.000Z","count":1},
{"state":8,"day":"2018-03-28T00:00:00.000Z","count":1},
{"state":12,"day":"2018-03-01T00:00:00.000Z","count":1}
...
]]
- Get top-10 related hashtags for tweets that mention "hurricane".
{
"dataset": "twitter.ds_tweet",
"filter": [
{
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}
],
"unnest" : [{"hashtags": "tag"}],
"group": {
"by": [
{ "field": "tag" }
],
"aggregate": [
{
"field" : "*",
"apply" : {
"name": "count"
},
"as" : "count"
}
]
},
"select" : {
"order" : ["-count"],
"limit": 10,
"offset" : 0
}
}
Expected results are as following:
[[
{"tag":"Hurricane","count":102},
{"tag":"Florida","count":92},
{"tag":"hurricane","count":92},
...
]]
- Get the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018 as well as the per-state population.
{
"dataset": "twitter.ds_tweet",
"filter": [
{
"field": "create_at",
"relation": "inRange",
"values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
},
{
"field": "text",
"relation": "contains",
"values": ["hurricane", "florence"]
}
],
"group": {
"by": [
{
"field": "geo_tag.stateID",
"as": "state"
},
{
"field": "create_at",
"apply": {
"name": "interval",
"args": {
"unit": "day"
}
},
"as": "day"
}
],
"aggregate": [
{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}
],
"lookup": [
{
"joinKey": ["state"],
"dataset": "twitter.dsStatePopulation",
"lookupKey": ["stateID"],
"select": ["population"],
"as": ["population"]
}
]
}
}
Expected results are as following:
[[
{"state":9,"day":"2018-07-05T00:00:00.000Z","count":1,"population":3576452},
{"state":12,"day":"2018-09-08T00:00:00.000Z","count":1,"population":20612439},
{"state":12,"day":"2018-12-16T00:00:00.000Z","count":1,"population":20612439},
...
]]
Cloudberry supports automatic query-slicing on the timeField
. The front-end application or user can specify a response time limit (ms) for each "small query" to get the results progressively.
{
...
"option":{
"sliceMillis": 300
}
}
For example, the following query asks for the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018, with an option to accept an updated results every 300ms.
{
"dataset": "twitter.ds_tweet",
"filter": [
{
"field": "create_at",
"relation": "inRange",
"values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
},
{
"field": "text",
"relation": "contains",
"values": ["hurricane", "florence"]
}
],
"group": {
"by": [
{
"field": "geo_tag.stateID",
"as": "state"
},
{
"field": "create_at",
"apply": {
"name": "interval",
"args": {
"unit": "day"
}
},
"as": "day"
}
],
"aggregate": [
{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}
]
},
"option": {
"sliceMillis": 300
}
}
There will be a stream of results returned from Cloudberry as following:
{"value":[[{"state":34,"day":"2018-12-29T00:00:00.000Z","count":1}]],"timeInterval":{"start":1546041600000,"end":1546214400000}}
{"value":[[{"state":34,"day":"2018-12-29T00:00:00.000Z","count":1}]],"timeInterval":{"start":1545696000000,"end":1546214400000}}
{"value":[[{"state":48,"day":"2018-01-03T00:00:00.000Z","count":1},
{"state":36,"day":"2018-08-26T00:00:00.000Z","count":1},
{"state":36,"day":"2018-08-06T00:00:00.000Z","count":1}
...
]],"timeInterval":{"start":1514764800000,"end":1546214400000}}
Sometimes the front-end application wants to slice a set of queries simultaneously so that results of different queries are consistent in semantics.
In this case, it can wrap the queries inside the batch
field and specify only one option
field.
{
"batch" : [
{ query#1 },
{ query#2 }
],
"option" : {
"sliceMillis": 300
}
}
E.g., the following query shows a batch
example that asks for the per-state per-day count and the top-10 hashtags with keyword "hurricane" and these two queries should be sliced synchronously.
{
"batch": [{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "create_at",
"relation": "inRange",
"values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
}, {
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}],
"group": {
"by": [{
"field": "geo_tag.stateID",
"as": "state"
},
{
"field": "create_at",
"apply": {
"name": "interval",
"args": {
"unit": "day"
}
},
"as": "day"
}],
"aggregate": [{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}]
}
}, {
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}],
"unnest": [{
"hashtags": "tag"
}],
"group": {
"by": [{
"field": "tag"
}],
"aggregate": [{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}]
},
"select": {
"order": ["-count"],
"limit": 10,
"offset": 0
}
}],
"option": {
"sliceMillis": 300
}
}
The response is as following:
{"value":[[],[]],"timeInterval":{"start":1565324448022,"end":1565497248022}}
{"value":[[],[]],"timeInterval":{"start":1565086848022,"end":1565497248022}}
{"value":[
[{"state":15,"day":"2018-09-19T00:00:00.000Z","count":1},
{"state":48,"day":"2018-01-03T00:00:00.000Z","count":1},
...],
[{"tag":"Hurricane","count":102},
{"tag":"hurricane","count":92},
...]
],"timeInterval":{"start":1514764800000,"end":1565497248022}}
...
The front-end application can optionally add a "transform" operation in JSON query to define the post-processing operations.
For example, it can define a wrap
operation to wrap the whole response in a key-value pair JSON object in which the key
is pre-defined. The following query asks the Cloudberry to wrap the result in the value with the key of sample
:
{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}],
"select" : {
"order" : [ "-create_at"],
"limit": 100,
"offset" : 0,
"field": ["create_at", "id"]
},
"transform" : {
"wrap": {
"key": "sample"
}
}
}
The response is as below:
{
"key":"sample",
"value":[[
{"create_at":"2019-05-16T17:00:17.000Z","id":1129069066373799936},
{"create_at":"2019-05-14T21:34:07.000Z","id":1128413203938127873},
{"create_at":"2019-05-14T16:12:34.000Z","id":1128332285806354438},
...
]]
}
Note: wrap
transformation is only effective for WebSocket
end-point, since multiple queries' results are transferred within the same channel, and front-end application needs to differentiate them, while HTTP
end-point has no such problem.