Skip to content
QIUSHI BAI edited this page Aug 16, 2019 · 6 revisions

After Register Dataset to Cloudberry, the front-end application or the user can send queries to Cloudberry through either of the following two end-points.

POST /berry

Example URL

http://localhost:9000/berry

Example Query

Suppose you create a file query.json with the following content.

{
  "dataset": "twitter.ds_tweet",
  "select" : {
    "order" : ["-create_at"],
    "limit": 1,
    "offset": 0,
    "field": ["*"]
  }
}

This is a simple query to select only one record from twitter.ds_tweet dataset.

Example Command

curl -X POST -H "Content-Type: application/json" -d @query.json http://localhost:9000/berry

Example Response

[[
  {
      "place.bounding_box":[
         [-71.625705, 42.310982],
         [-71.475702, 42.380659]
      ],
      "favorite_count":0,
      "geo_tag.countyID":25017,
      "user.id":111986159,
      "geo_tag.cityID":2538715,
      "is_retweet":false,
      "text":"Brewing with the back to be or tomorrow at the best to offer they are @ Deep County, CA [Portland Familes, San Maraporonic Makers -",
      "retweet_count":0,
      "in_reply_to_user":-1,
      "id":1129340818383343622,
      "coordinate":[-71.50138889, 42.36916667],
      "in_reply_to_status":-1,
      "geo_tag.stateID":25,
      "create_at":"2019-05-17T11:00:07.000Z",
      "lang":"en",
      "user.profile_image_url":"http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png",
      "user.name":"Hazel Dashawn"
   }
]]

Websocket /ws

To minimize the overhead of communication, Cloudberry supports the Websocket connection when front-end application needs to fluently send queries to and receive responses from Cloudberry, or needs responses pushed by Cloudberry automatically (e.g. when query slicing is applied).

Example URL

ws://localhost:9000/ws

Example code

Open a Websocket connection to Cloudberry in Javascript as following:

var ws = new WebSocket("ws://localhost:9000/ws"");

Query format of Websocket end-point is the same as HTTP POST end-point, and is described as following.

Query Format

A query is composed of the following parameters:

  • Dataset : the dataset to query on.
  • Unnest : to flatten a record based on the nested Bag attribute to generate multiple records. (Only useful for semi-structured databases, e.g. AsterixDB.)
  • Filter : a set of selection predicates.
  • Group :
    • by : to specify the group by fields.
    • aggregate : to specify the aggregation functions to apply, including count, sum, min, max and avg.
    • lookup : (optional) to specify a lookup operation, similar to join operator in SQL, for some filed as joinKey equi-join a different dataset on lookupKey, and return select fields.
  • Select:
    • field : a set of fields that should be returned.
    • order : to specify the order by fields.
    • limit : to specify the number of records that will be returned at most.
    • offset : to specify the pagination combined with limit value.

Here are some examples as following.

Example 1 - Keyword Search

  • Get 100 latest tweets (time and id) that mention "hurricane".
{
  "dataset": "twitter.ds_tweet",
  "filter": [{
    "field": "text",
    "relation": "contains",
    "values": ["hurricane"]
  }],
  "select" : {
    "order" : ["-create_at"],
    "limit": 100,
    "offset" : 0,
    "field": ["create_at", "id"]
  }
}

Expected results are as following:

[[
 {"create_at":"2019-05-16T17:00:17.000Z","id":1129069066373799936},
 {"create_at":"2019-05-14T21:34:07.000Z","id":1128413203938127873},
 {"create_at":"2019-05-14T16:12:34.000Z","id":1128332285806354438},
 ...
]]

Example 2 - Group By + Count

  • Get the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018.
{
  "dataset": "twitter.ds_tweet",
  "filter": [
    {
      "field": "create_at",
      "relation": "inRange",
      "values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
    },
    {
      "field": "text",
      "relation": "contains",
      "values": ["hurricane", "florence"]
    }
  ],
  "group": {
    "by": [
        {
          "field": "geo_tag.stateID",
          "as": "state"
        },
        {
          "field": "create_at",
          "apply": {
            "name": "interval",
            "args": {
              "unit": "day"
            }
          },
          "as": "day"
        }
      ],
    "aggregate": [
      {
        "field": "*",
        "apply": {
          "name": "count"
        },
        "as": "count"
      }
      ]
  }
}

Expected results are as following:

[[
    {"state":6,"day":"2018-04-05T00:00:00.000Z","count":1},
    {"state":6,"day":"2018-09-11T00:00:00.000Z","count":1},
    {"state":8,"day":"2018-03-28T00:00:00.000Z","count":1},
    {"state":12,"day":"2018-03-01T00:00:00.000Z","count":1}
    ...
]]

Example 3 - Unnest Bag

  • Get top-10 related hashtags for tweets that mention "hurricane".
{
  "dataset": "twitter.ds_tweet",
  "filter": [
  {
    "field": "text",
    "relation": "contains",
    "values": ["hurricane"]
  }
  ],
  "unnest" : [{"hashtags": "tag"}],
  "group": {
    "by": [
      { "field": "tag" }
    ],
    "aggregate": [
      {
        "field" : "*",
        "apply" : {
          "name": "count"
        },
        "as" : "count"
      }
    ]
  },
  "select" : {
    "order" : ["-count"],
    "limit": 10,
    "offset" : 0
  }
}

Expected results are as following:

[[
  {"tag":"Hurricane","count":102},
  {"tag":"Florida","count":92},
  {"tag":"hurricane","count":92},
  ...
]]

Example 4 - Lookup (Join)

  • Get the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018 as well as the per-state population.
{
  "dataset": "twitter.ds_tweet",
  "filter": [
    {
      "field": "create_at",
      "relation": "inRange",
      "values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
    },
    {
      "field": "text",
      "relation": "contains",
      "values": ["hurricane", "florence"]
    }
  ],
  "group": {
    "by": [
        {
          "field": "geo_tag.stateID",
          "as": "state"
        },
        {
          "field": "create_at",
          "apply": {
            "name": "interval",
            "args": {
              "unit": "day"
            }
          },
          "as": "day"
        }
      ],
    "aggregate": [
      {
        "field": "*",
        "apply": {
          "name": "count"
        },
        "as": "count"
      }
      ],
    "lookup": [
      {
        "joinKey": ["state"],
        "dataset": "twitter.dsStatePopulation",
        "lookupKey": ["stateID"],
        "select": ["population"],
        "as": ["population"]
      }
      ]
  }
}

Expected results are as following:

[[
    {"state":9,"day":"2018-07-05T00:00:00.000Z","count":1,"population":3576452},
    {"state":12,"day":"2018-09-08T00:00:00.000Z","count":1,"population":20612439},
    {"state":12,"day":"2018-12-16T00:00:00.000Z","count":1,"population":20612439},
    ...
]]

Advanced Query Options

Query slicing

Cloudberry supports automatic query-slicing on the timeField. The front-end application or user can specify a response time limit (ms) for each "small query" to get the results progressively.

{
 ...
 "option":{
   "sliceMillis": 300
 }
}

For example, the following query asks for the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018, with an option to accept an updated results every 300ms.

{
  "dataset": "twitter.ds_tweet",
  "filter": [
    {
      "field": "create_at",
      "relation": "inRange",
      "values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
    },
    {
      "field": "text",
      "relation": "contains",
      "values": ["hurricane", "florence"]
    }
  ],
  "group": {
    "by": [
        {
          "field": "geo_tag.stateID",
          "as": "state"
        },
        {
          "field": "create_at",
          "apply": {
            "name": "interval",
            "args": {
              "unit": "day"
            }
          },
          "as": "day"
        }
      ],
    "aggregate": [
      {
        "field": "*",
        "apply": {
          "name": "count"
        },
        "as": "count"
      }
      ]
  },
  "option": {
    "sliceMillis": 300
  }
}

There will be a stream of results returned from Cloudberry as following:

{"value":[[{"state":34,"day":"2018-12-29T00:00:00.000Z","count":1}]],"timeInterval":{"start":1546041600000,"end":1546214400000}}
{"value":[[{"state":34,"day":"2018-12-29T00:00:00.000Z","count":1}]],"timeInterval":{"start":1545696000000,"end":1546214400000}}
{"value":[[{"state":48,"day":"2018-01-03T00:00:00.000Z","count":1},
           {"state":36,"day":"2018-08-26T00:00:00.000Z","count":1},
           {"state":36,"day":"2018-08-06T00:00:00.000Z","count":1}
            ...
          ]],"timeInterval":{"start":1514764800000,"end":1546214400000}}

Batch of Queries

Sometimes the front-end application wants to slice a set of queries simultaneously so that results of different queries are consistent in semantics.

In this case, it can wrap the queries inside the batch field and specify only one option field.

{
  "batch" : [
    { query#1 },
    { query#2 }
  ],
  "option" : {
    "sliceMillis": 300
  }
}

E.g., the following query shows a batch example that asks for the per-state per-day count and the top-10 hashtags with keyword "hurricane" and these two queries should be sliced synchronously.

{
    "batch": [{
        "dataset": "twitter.ds_tweet",
        "filter": [{
            "field": "create_at",
            "relation": "inRange",
            "values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
        }, {
            "field": "text",
            "relation": "contains",
            "values": ["hurricane"]
        }],
        "group": {
            "by": [{
                "field": "geo_tag.stateID",
                "as": "state"
            },
            {
                "field": "create_at",
                "apply": {
                    "name": "interval",
                    "args": {
                        "unit": "day"
                    }
                },
                "as": "day"
            }],
            "aggregate": [{
                "field": "*",
                "apply": {
                    "name": "count"
                },
                "as": "count"
            }]
        }
    }, {
        "dataset": "twitter.ds_tweet",
        "filter": [{
            "field": "text",
            "relation": "contains",
            "values": ["hurricane"]
        }],
        "unnest": [{
            "hashtags": "tag"
        }],
        "group": {
            "by": [{
                "field": "tag"
            }],
            "aggregate": [{
                "field": "*",
                "apply": {
                    "name": "count"
                },
                "as": "count"
            }]
        },
        "select": {
            "order": ["-count"],
            "limit": 10,
            "offset": 0
        }
    }],
    "option": {
        "sliceMillis": 300
    }
}

The response is as following:

{"value":[[],[]],"timeInterval":{"start":1565324448022,"end":1565497248022}}
{"value":[[],[]],"timeInterval":{"start":1565086848022,"end":1565497248022}}
{"value":[
           [{"state":15,"day":"2018-09-19T00:00:00.000Z","count":1},
            {"state":48,"day":"2018-01-03T00:00:00.000Z","count":1},
             ...],
           [{"tag":"Hurricane","count":102},
            {"tag":"hurricane","count":92},
            ...]
         ],"timeInterval":{"start":1514764800000,"end":1565497248022}}
...

Transform Response Format

The front-end application can optionally add a "transform" operation in JSON query to define the post-processing operations.

For example, it can define a wrap operation to wrap the whole response in a key-value pair JSON object in which the key is pre-defined. The following query asks the Cloudberry to wrap the result in the value with the key of sample:

{
  "dataset": "twitter.ds_tweet",
  "filter": [{
    "field": "text",
    "relation": "contains",
    "values": ["hurricane"]
  }],
  "select" : {
    "order" : [ "-create_at"],
    "limit": 100,
    "offset" : 0,
    "field": ["create_at", "id"]
  },
  "transform" : {
    "wrap": {
      "key": "sample"
    }
  }
}

The response is as below:

{
  "key":"sample",
  "value":[[
    {"create_at":"2019-05-16T17:00:17.000Z","id":1129069066373799936},
    {"create_at":"2019-05-14T21:34:07.000Z","id":1128413203938127873},
    {"create_at":"2019-05-14T16:12:34.000Z","id":1128332285806354438},
    ...
  ]]
}

Note: wrap transformation is only effective for WebSocket end-point, since multiple queries' results are transferred within the same channel, and front-end application needs to differentiate them, while HTTP end-point has no such problem.