Query Cloudberry
After registering a dataset (see Register Dataset to Cloudberry), the front-end application or the user can send queries to Cloudberry through either of the following two end-points: an HTTP POST end-point and a WebSocket end-point.
Example URL
http://localhost:9000/berry
Example Query
Suppose you create a file query.json with the following content.
{
"dataset": "twitter.ds_tweet",
"select" : {
"order" : ["-create_at"],
"limit": 1,
"offset": 0,
"field": ["*"]
}
}
This is a simple query that selects only one record from the twitter.ds_tweet dataset.
Example Command
curl -X POST -H "Content-Type: application/json" -d @query.json http://localhost:9000/berry
Example Response
[[
{
"place.bounding_box":[
[-71.625705, 42.310982],
[-71.475702, 42.380659]
],
"favorite_count":0,
"geo_tag.countyID":25017,
"user.id":111986159,
"geo_tag.cityID":2538715,
"is_retweet":false,
"text":"Brewing with the back to be or tomorrow at the best to offer they are @ Deep County, CA [Portland Familes, San Maraporonic Makers -",
"retweet_count":0,
"in_reply_to_user":-1,
"id":1129340818383343622,
"coordinate":[-71.50138889, 42.36916667],
"in_reply_to_status":-1,
"geo_tag.stateID":25,
"create_at":"2019-05-17T11:00:07.000Z",
"lang":"en",
"user.profile_image_url":"http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png",
"user.name":"Hazel Dashawn"
}
]]
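The same request can also be issued from JavaScript instead of curl. Below is a minimal sketch using the standard fetch API, assuming the default local deployment on port 9000 and reusing the query.json content above:
var query = {
  "dataset": "twitter.ds_tweet",
  "select": { "order": ["-create_at"], "limit": 1, "offset": 0, "field": ["*"] }
};

fetch("http://localhost:9000/berry", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(query)
}).then(function (response) {
  return response.json();
}).then(function (records) {
  // Same array-of-arrays shape as the curl response above.
  console.log(records);
});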
To minimize the communication overhead, Cloudberry also supports a WebSocket connection for cases where the front-end application needs to frequently send queries to and receive responses from Cloudberry, or needs responses pushed by Cloudberry automatically (e.g., when query slicing is applied).
Example URL
ws://localhost:9000/ws
Example code
Open a WebSocket connection to Cloudberry in JavaScript as follows:
var ws = new WebSocket("ws://localhost:9000/ws");
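Once the connection is open, the front-end can send a query as a JSON string and handle responses in the onmessage callback. A minimal sketch, reusing the query from the HTTP example above:
ws.onopen = function () {
  // Send the same JSON query used in the HTTP example above.
  ws.send(JSON.stringify({
    "dataset": "twitter.ds_tweet",
    "select": { "order": ["-create_at"], "limit": 1, "offset": 0, "field": ["*"] }
  }));
};

ws.onmessage = function (event) {
  // Each message carries one JSON response (or one slice when query slicing is enabled).
  var response = JSON.parse(event.data);
  console.log(response);
};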
The query format of the WebSocket end-point is the same as that of the HTTP POST end-point, and is described as follows.
A query is composed of the following parameters (a combined sketch of the overall query shape follows this list):
- Dataset: the dataset to query on.
- Unnest: to flatten a record based on a nested Bag attribute to generate multiple records. (Only useful for semi-structured databases, e.g., AsterixDB.)
- Filter: a set of selection predicates.
- Group:
  - by: to specify the group-by fields.
  - aggregate: to specify the aggregation functions to apply, including count, sum, min, max, and avg.
- Select:
  - field: a set of fields that should be returned.
  - order: to specify the order-by fields.
  - limit: to specify the maximum number of records that will be returned.
  - offset: to specify the pagination offset, used together with the limit value.
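Putting these parameters together, the overall shape of a query looks like the following sketch. The field names and functions follow the concrete examples below; sections that are not needed (e.g., unnest or group) can simply be omitted.
{
  "dataset": "twitter.ds_tweet",
  "unnest": [{ "hashtags": "tag" }],
  "filter": [{ "field": "...", "relation": "...", "values": ["..."] }],
  "group": {
    "by": [{ "field": "...", "as": "..." }],
    "aggregate": [{ "field": "*", "apply": { "name": "count" }, "as": "count" }]
  },
  "select": {
    "field": ["..."],
    "order": ["-..."],
    "limit": 10,
    "offset": 0
  }
}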
Here are some examples:
- Get 100 latest tweets (time and id) that mention "hurricane".
{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}],
"select" : {
"order" : ["-create_at"],
"limit": 100,
"offset" : 0,
"field": ["create_at", "id"]
}
}
The expected results are as follows:
[[
{"create_at":"2019-05-16T17:00:17.000Z","id":1129069066373799936},
{"create_at":"2019-05-14T21:34:07.000Z","id":1128413203938127873},
{"create_at":"2019-05-14T16:12:34.000Z","id":1128332285806354438},
...
]]
- Get the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018.
{
"dataset": "twitter.ds_tweet",
"filter": [
{
"field": "create_at",
"relation": "inRange",
"values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
},
{
"field": "text",
"relation": "contains",
"values": ["hurricane", "florence"]
}
],
"group": {
"by": [
{
"field": "geo_tag.stateID",
"as": "state"
},
{
"field": "create_at",
"apply": {
"name": "interval",
"args": {
"unit": "day"
}
},
"as": "day"
}
],
"aggregate": [
{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}
]
}
}
The expected results are as follows:
[[
{"state":6,"day":"2018-04-05T00:00:00.000Z","count":1},
{"state":6,"day":"2018-09-11T00:00:00.000Z","count":1},
{"state":8,"day":"2018-03-28T00:00:00.000Z","count":1},
{"state":12,"day":"2018-03-01T00:00:00.000Z","count":1}
...
]]
- Get top-10 related hashtags for tweets that mention "hurricane".
{
"dataset": "twitter.ds_tweet",
"filter": [
{
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}
],
"unnest" : [{"hashtags": "tag"}],
"group": {
"by": [
{ "field": "tag" }
],
"aggregate": [
{
"field" : "*",
"apply" : {
"name": "count"
},
"as" : "count"
}
]
},
"select" : {
"order" : ["-count"],
"limit": 10,
"offset" : 0
}
}
The expected results are as follows:
[[
{"tag":"Hurricane","count":102},
{"tag":"Florida","count":92},
{"tag":"hurricane","count":92},
...
]]
Cloudberry supports automatic query slicing on the timeField. The front-end application or user can specify a response time limit (in milliseconds) for each "small query" to get the results progressively.
{
...
"option":{
"sliceMillis": 300
}
}
For example, the following query asks for the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018, with an option to accept updated results every 300 ms.
{
"dataset": "twitter.ds_tweet",
"filter": [
{
"field": "create_at",
"relation": "inRange",
"values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
},
{
"field": "text",
"relation": "contains",
"values": ["hurricane", "florence"]
}
],
"group": {
"by": [
{
"field": "geo_tag.stateID",
"as": "state"
},
{
"field": "create_at",
"apply": {
"name": "interval",
"args": {
"unit": "day"
}
},
"as": "day"
}
],
"aggregate": [
{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}
]
},
"option": {
"sliceMillis": 300
}
}
Cloudberry will return a stream of results as follows:
{"value":[[{"state":34,"day":"2018-12-29T00:00:00.000Z","count":1}]],"timeInterval":{"start":1546041600000,"end":1546214400000}}
{"value":[[{"state":34,"day":"2018-12-29T00:00:00.000Z","count":1}]],"timeInterval":{"start":1545696000000,"end":1546214400000}}
{"value":[[{"state":48,"day":"2018-01-03T00:00:00.000Z","count":1},
{"state":36,"day":"2018-08-26T00:00:00.000Z","count":1},
{"state":36,"day":"2018-08-06T00:00:00.000Z","count":1}
...
]],"timeInterval":{"start":1514764800000,"end":1546214400000}}
Sometimes the front-end application wants to slice a set of queries simultaneously so that the results of the different queries are semantically consistent. In this case, it can wrap the queries inside a batch field and specify only one option field.
{
"batch" : [
{ query#1 },
{ query#2 }
],
"option" : {
"sliceMillis": 300
}
}
For example, the following batch query asks for both the per-state per-day count and the top-10 hashtags for tweets with the keyword "hurricane"; the two queries are sliced synchronously.
{
"batch": [{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "create_at",
"relation": "inRange",
"values": ["2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
}, {
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}],
"group": {
"by": [{
"field": "geo_tag.stateID",
"as": "state"
},
{
"field": "create_at",
"apply": {
"name": "interval",
"args": {
"unit": "day"
}
},
"as": "day"
}],
"aggregate": [{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}]
}
}, {
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}],
"unnest": [{
"hashtags": "tag"
}],
"group": {
"by": [{
"field": "tag"
}],
"aggregate": [{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}]
},
"select": {
"order": ["-count"],
"limit": 10,
"offset": 0
}
}],
"option": {
"sliceMillis": 300
}
}
The response is as follows:
{"value":[[],[]],"timeInterval":{"start":1565324448022,"end":1565497248022}}
{"value":[[],[]],"timeInterval":{"start":1565086848022,"end":1565497248022}}
{"value":[
[{"state":15,"day":"2018-09-19T00:00:00.000Z","count":1},
{"state":48,"day":"2018-01-03T00:00:00.000Z","count":1},
...],
[{"tag":"Hurricane","count":102},
{"tag":"hurricane","count":92},
...]
],"timeInterval":{"start":1514764800000,"end":1565497248022}}
...
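In a batch response, the inner arrays of value appear in the same order as the queries in the batch field, as the example above shows. A front-end sketch that splits them accordingly (variable names are illustrative):
ws.onmessage = function (event) {
  var response = JSON.parse(event.data);
  // value[0] corresponds to the first batch query (per-state per-day counts),
  // value[1] to the second one (top-10 hashtags), matching the example above.
  var countsByStateDay = response.value[0];
  var topHashtags = response.value[1];
  console.log(countsByStateDay.length, "count rows,", topHashtags.length, "hashtag rows");
};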
The front-end application can optionally add a transform operation in the JSON query to define post-processing operations. For example, it can define a wrap operation to wrap the whole response in a key-value JSON object in which the key is pre-defined. The following query asks Cloudberry to wrap the result in the value field with the key sample:
{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": ["hurricane"]
}],
"select" : {
"order" : [ "-create_at"],
"limit": 100,
"offset" : 0,
"field": ["create_at", "id"]
},
"transform" : {
"wrap": {
"key": "sample"
}
}
}
The response is as follows:
{
"key":"sample",
"value":[[
{"create_at":"2019-05-16T17:00:17.000Z","id":1129069066373799936},
{"create_at":"2019-05-14T21:34:07.000Z","id":1128413203938127873},
{"create_at":"2019-05-14T16:12:34.000Z","id":1128332285806354438},
...
]]
}
Note: the wrap transformation is only effective for the WebSocket end-point, since multiple queries' results are transferred within the same channel and the front-end application needs to differentiate them; the HTTP end-point has no such problem.
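A minimal sketch of how a front-end might use the key to route responses arriving on the shared WebSocket connection (the handler map and any key names other than sample are illustrative):
var handlers = {
  // One handler per pre-defined key; "sample" matches the wrap query above.
  "sample": function (value) {
    console.log("sample tweets:", value);
  }
};

ws.onmessage = function (event) {
  var response = JSON.parse(event.data);
  var handler = handlers[response.key];
  if (handler) {
    handler(response.value);
  }
};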