Merge remote-tracking branch 'origin/main' into publish
ttt-77 committed Nov 9, 2023
2 parents 1309979 + 5bcb1e2 commit 16d3b4a
Showing 6 changed files with 151 additions and 18 deletions.
104 changes: 104 additions & 0 deletions README.md
@@ -0,0 +1,104 @@
<h1 style="text-align: center;">AIDB</h1>

<p align="center"> Analyze unstructured data blazingly fast with machine learning. Connect your own ML models to your own data sources and query away! </p>

<p align="center">
<img src="assets/aidbuse.gif" style="width:550px;"/>
</p>

## Quick Start

To start using AIDB, install the requirements, specify a configuration, and query!
Setting up the environment is as simple as:
```bash
git clone https://github.com/ddkang/aidb.git
cd aidb
pip install -r requirements.txt

# Optional: download the sample data if you'd like to run the examples below
# (the URL is quoted so shells such as zsh don't treat `?` as a glob)
gdown "https://drive.google.com/uc?id=1SyHRaJNvVa7V08mw-4_Vqj7tCynRRA3x"
unzip data.zip -d tests/
```

### Text Example (in CSV)

We've set up an example of analyzing product reviews with HuggingFace. First set your HuggingFace API key, then run:
```bash
python launch.py --config=config.sentiment --setup-blob-table --setup-output-table
```

As an example query, you can run:
```sql
SELECT AVG(score)
FROM sentiment
WHERE label = '5 stars'
ERROR_TARGET 10%
CONFIDENCE 95%;
```

You can see the mappings [here](https://github.com/ddkang/aidb/blob/main/config/sentiment.py#L15). We use the HuggingFace API to generate sentiments from the reviews.


### Image Example (local directory)

We've also set up an example that detects whether user-generated content is adult content, so that it can be filtered.
To run this example:
```bash
python launch.py --config=config.nsfw_detect --setup-blob-table --setup-output-table
```

As an example query, you can run:
```sql
SELECT *
FROM nsfw
WHERE racy LIKE 'POSSIBLE';
```

You can see the mappings [here](https://github.com/ddkang/aidb/blob/main/config/nsfw_detect.py#L10). We use the Google Vision API to generate the safety labels.



## Key Features

AIDB focuses on keeping cost down and interoperability high.

We reduce costs with our optimizations:
- First-class support for approximate queries, reducing the cost of aggregations by up to **350x**.
- Caching, which speeds up multiple queries over the same data.

We keep interoperability high by allowing you to bring your own data source, ML models, and vector databases!
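
To illustrate why caching helps when several queries touch the same data, here is a minimal sketch. It is not AIDB's actual implementation: `CachedInference` and `expensive_model` are hypothetical names standing in for a cache keyed by blob id and a paid ML API call.

```python
from typing import Callable, Dict


class CachedInference:
    """Wraps an inference function and remembers results per blob id (hypothetical helper)."""

    def __init__(self, model: Callable[[int], str]):
        self._model = model
        self._cache: Dict[int, str] = {}
        self.calls = 0  # how many times the underlying model was actually invoked

    def infer(self, blob_id: int) -> str:
        # Only call the model the first time a blob id is seen.
        if blob_id not in self._cache:
            self.calls += 1
            self._cache[blob_id] = self._model(blob_id)
        return self._cache[blob_id]


def expensive_model(blob_id: int) -> str:
    # Stand-in for a paid ML API call; the labels are made up for illustration.
    return 'positive' if blob_id % 2 == 0 else 'negative'


cached = CachedInference(expensive_model)
first_query = [cached.infer(i) for i in range(5)]   # triggers 5 model calls
second_query = [cached.infer(i) for i in range(5)]  # served entirely from the cache
print(cached.calls)  # prints 5
```

The second pass over the same blobs costs no additional model calls, which is the effect the caching bullet describes.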


## Approximate Querying

One key feature of AIDB is first-class support for approximate queries.
Currently, we support approximate `AVG`, `COUNT`, and `SUM`.
We don't currently support `GROUP BY` or `JOIN` for approximate aggregations, but they're on our roadmap.
Please reach out if you'd like us to support your queries!
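
The restrictions above can be sketched as a small pre-flight check. This is a regex-based illustration only, not how AIDB parses queries; `check_approx_query` and `SUPPORTED_AGGREGATES` are hypothetical names, and a real parser would handle far more SQL than this does.

```python
import re

# Aggregates the README lists as supported for approximate queries.
SUPPORTED_AGGREGATES = {'AVG', 'COUNT', 'SUM'}


def check_approx_query(sql: str) -> list:
    """Return a list of problems; an empty list means the query looks supported."""
    problems = []
    upper = sql.upper()
    # Look at the first function call after SELECT.
    match = re.search(r'SELECT\s+([A-Z_]+)\s*\(', upper)
    if match is None or match.group(1) not in SUPPORTED_AGGREGATES:
        problems.append('aggregate must be one of AVG, COUNT, SUM')
    if re.search(r'\bGROUP\s+BY\b', upper):
        problems.append('GROUP BY is not supported for approximate aggregations')
    if re.search(r'\bJOIN\b', upper):
        problems.append('JOIN is not supported for approximate aggregations')
    return problems


print(check_approx_query('SELECT AVG(bar) FROM foo ERROR_TARGET 1% CONFIDENCE 95%;'))
print(check_approx_query('SELECT MAX(bar) FROM foo ERROR_TARGET 1% CONFIDENCE 95%;'))
```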

To execute an approximate aggregation query, append `ERROR_TARGET <error percent>% CONFIDENCE <confidence>%` to your normal aggregation query.
As a full example, you can compute an approximate count by doing:
```sql
SELECT COUNT(xmin)
FROM objects
ERROR_TARGET 5%
CONFIDENCE 95%;
```
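
If you build queries programmatically, the clause can be appended with a small helper. The sketch below is plain string handling under our own assumptions: `make_approximate` is a hypothetical name, not part of AIDB, and the range checks are our choice rather than documented limits. Note that both values carry a trailing `%`.

```python
def make_approximate(sql: str, error_target_pct: float, confidence_pct: float) -> str:
    """Append the ERROR_TARGET / CONFIDENCE clause to an exact aggregation query."""
    # Assumed sanity checks; AIDB's own accepted ranges may differ.
    if not 0 < error_target_pct < 100:
        raise ValueError('error target must be between 0 and 100 percent')
    if not 0 < confidence_pct < 100:
        raise ValueError('confidence must be between 0 and 100 percent')
    # Drop any trailing semicolon so the clause lands inside the statement.
    base = sql.strip().rstrip(';').rstrip()
    return f'{base} ERROR_TARGET {error_target_pct:g}% CONFIDENCE {confidence_pct:g}%;'


print(make_approximate('SELECT COUNT(xmin) FROM objects;', 5, 95))
# prints: SELECT COUNT(xmin) FROM objects ERROR_TARGET 5% CONFIDENCE 95%;
```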

`ERROR_TARGET` specifies the percent error _compared to running the query exactly_, and `CONFIDENCE` specifies how often that target is met.
For example, with `ERROR_TARGET 5%` and `CONFIDENCE 95%`, if the true answer is 100, you will get answers between 95 and 105 (95% of the time).
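
The arithmetic behind that example can be written out as a short sketch. The function names `error_bounds` and `within_error_target` are hypothetical and only illustrate how an approximate answer could be compared against the exact one.

```python
def error_bounds(exact: float, error_target_pct: float) -> tuple:
    """Return the (low, high) range allowed around the exact answer."""
    margin = abs(exact) * error_target_pct / 100
    return exact - margin, exact + margin


def within_error_target(approx: float, exact: float, error_target_pct: float) -> bool:
    """True if the approximate answer is inside the allowed relative error."""
    low, high = error_bounds(exact, error_target_pct)
    return low <= approx <= high


print(error_bounds(100, 5))                # prints (95.0, 105.0)
print(within_error_target(103.2, 100, 5))  # prints True
print(within_error_target(94.0, 100, 5))   # prints False
```

With `CONFIDENCE 95%`, a check like `within_error_target` would be expected to pass for roughly 95% of runs, not for every run.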

## Useful Links
- [How to connect ML APIs](https://github.com/ddkang/aidb/blob/main/aidb/inference/examples/README.md)
- [How to define a configuration file](https://github.com/ddkang/aidb/tree/main/config)
- [Connecting to Data Store](https://github.com/ddkang/aidb/tree/main/aidb_utilities/blob_store)

## Contribute

We have many improvements we'd like to implement. Please help us! For the time being, please [email](mailto:[email protected]) us if you'd like to contribute.


## Contact Us

Need help setting up AIDB for your specific dataset, or want a new feature? Please fill out [this form](https://forms.gle/YyAXWxqzZPVBrvBR7).
29 changes: 29 additions & 0 deletions aidb_utilities/blob_store/README.md
@@ -0,0 +1,29 @@
# Connecting to Data Stores

We provide utilities to connect to different forms of data stores.
You can also implement your own.


## Images stored in local storage

In our first example, we show how to access images stored in local storage:

```python
# List the images under data_dir as blobs
local_image_store = LocalImageBlobStore(data_dir)
image_blobs = local_image_store.get_blobs()
# Insert the blob metadata into the 'blob00' table
base_table_setup = BaseTablesSetup(DB_URL)
base_table_setup.insert_blob_meta_data('blob00', image_blobs, ['blob_id'])
```
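
If you want to implement your own store, the general shape might look like the sketch below. Everything here is an assumption for illustration: `ImageBlob` and `DirectoryImageStore` are hypothetical names, and AIDB's real base class and blob schema may differ, so check the existing blob store sources before relying on it.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List

IMAGE_SUFFIXES = {'.jpg', '.jpeg', '.png'}


@dataclass
class ImageBlob:
    # Hypothetical record: one row of blob metadata.
    blob_id: int
    image_path: str


class DirectoryImageStore:
    """Hypothetical blob store that lists image files in a local directory."""

    def __init__(self, data_dir: str):
        self.data_dir = Path(data_dir)

    def get_blobs(self) -> List[ImageBlob]:
        # Sort so that blob ids are stable across runs.
        files = sorted(p for p in self.data_dir.iterdir()
                       if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)
        return [ImageBlob(blob_id=i, image_path=str(p)) for i, p in enumerate(files)]


# Demo on a throwaway directory with two images and one non-image file.
with tempfile.TemporaryDirectory() as data_dir:
    for name in ('b.png', 'a.jpg', 'notes.txt'):
        (Path(data_dir) / name).touch()
    blobs = DirectoryImageStore(data_dir).get_blobs()
    print([Path(b.image_path).name for b in blobs])  # prints ['a.jpg', 'b.png']
```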



## Documents stored in AWS S3

We also show how to access documents stored in S3:

```python
aws_doc_store = AwsS3DocumentBlobStore('<your-bucket-name>', '<your-aws-access-key>', '<your-aws-secret-key>')
doc_blobs = aws_doc_store.get_blobs()
base_table_setup = BaseTablesSetup(DB_URL)
base_table_setup.insert_blob_meta_data('blob00', doc_blobs, ['blob_id'])
```
Binary file added assets/aidbuse.gif
2 changes: 1 addition & 1 deletion requirements.txt
@@ -37,7 +37,7 @@ six==1.16.0
sniffio==1.3.0
SQLAlchemy==1.4.39
SQLAlchemy-Utils==0.41.1
- sqlglot-aidb==0.0.4
+ sqlglot-aidb==0.0.5
starlette==0.27.0
statsmodels==0.14.0
sympy==1.11.1
22 changes: 11 additions & 11 deletions tests/test_aggregation.py
100755 → 100644
@@ -16,52 +16,52 @@
queries = [
(
'approx_aggregate',
- '''SELECT SUM(x_min) FROM objects00 WHERE x_min > 1000 ERROR_TARGET 10% CONFIDENCE 95;''',
+ '''SELECT SUM(x_min) FROM objects00 WHERE x_min > 1000 ERROR_TARGET 10% CONFIDENCE 95%;''',
'''SELECT SUM(x_min) FROM objects00 WHERE x_min > 1000;'''
),
(
'approx_aggregate',
- '''SELECT COUNT(x_min) FROM objects00 WHERE x_min > 1000 ERROR_TARGET 10% CONFIDENCE 95;''',
+ '''SELECT COUNT(x_min) FROM objects00 WHERE x_min > 1000 ERROR_TARGET 10% CONFIDENCE 95%;''',
'''SELECT COUNT(x_min) FROM objects00 WHERE x_min > 1000;'''
),
(
'approx_aggregate',
- '''SELECT SUM(x_min) FROM objects00 ERROR_TARGET 10% CONFIDENCE 95;''',
+ '''SELECT SUM(x_min) FROM objects00 ERROR_TARGET 10% CONFIDENCE 95%;''',
'''SELECT SUM(x_min) FROM objects00;'''
),
(
'approx_aggregate',
- '''SELECT SUM(y_min) FROM objects00 ERROR_TARGET 10% CONFIDENCE 95;''',
+ '''SELECT SUM(y_min) FROM objects00 ERROR_TARGET 10% CONFIDENCE 95%;''',
'''SELECT SUM(y_min) FROM objects00;'''
),
(
'approx_aggregate',
- '''SELECT COUNT(x_min) FROM objects00 ERROR_TARGET 10% CONFIDENCE 95;''',
+ '''SELECT COUNT(x_min) FROM objects00 ERROR_TARGET 10% CONFIDENCE 95%;''',
'''SELECT COUNT(x_min) FROM objects00;'''
),
(
'approx_aggregate',
- '''SELECT AVG(x_min) FROM objects00 ERROR_TARGET 5% CONFIDENCE 95;''',
+ '''SELECT AVG(x_min) FROM objects00 ERROR_TARGET 5% CONFIDENCE 95%;''',
'''SELECT AVG(x_min) FROM objects00;'''
),
(
'approx_aggregate',
- '''SELECT AVG(x_max) FROM objects00 ERROR_TARGET 5% CONFIDENCE 95;''',
+ '''SELECT AVG(x_max) FROM objects00 ERROR_TARGET 5% CONFIDENCE 95%;''',
'''SELECT AVG(x_max) FROM objects00;'''
),
(
'approx_aggregate',
- '''SELECT AVG(x_min) FROM objects00 WHERE x_min > 1000 ERROR_TARGET 5% CONFIDENCE 95;''',
+ '''SELECT AVG(x_min) FROM objects00 WHERE x_min > 1000 ERROR_TARGET 5% CONFIDENCE 95%;''',
'''SELECT AVG(x_min) FROM objects00 WHERE x_min > 1000;'''
),
(
'approx_aggregate',
- '''SELECT AVG(x_min) FROM objects00 WHERE y_max < 900 ERROR_TARGET 5% CONFIDENCE 95;''',
+ '''SELECT AVG(x_min) FROM objects00 WHERE y_max < 900 ERROR_TARGET 5% CONFIDENCE 95%;''',
'''SELECT AVG(x_min) FROM objects00 WHERE y_max < 900;'''
),
(
'approx_aggregate',
- '''SELECT AVG(x_min) FROM objects00 WHERE x_min < 700 ERROR_TARGET 5% CONFIDENCE 95;''',
+ '''SELECT AVG(x_min) FROM objects00 WHERE x_min < 700 ERROR_TARGET 5% CONFIDENCE 95%;''',
'''SELECT AVG(x_min) FROM objects00 WHERE x_min < 700;'''
),
]
@@ -110,4 +110,4 @@ async def test_agg_query(self):


if __name__ == '__main__':
unittest.main()
unittest.main()
12 changes: 6 additions & 6 deletions tests/test_queries.py
@@ -6,35 +6,35 @@
SELECT AVG(bar)
FROM foo
ERROR_TARGET 1%
- CONFIDENCE 95;
+ CONFIDENCE 95%;
'''

valid_count_sql = '''
SELECT COUNT(bar)
FROM foo
ERROR_TARGET 1%
- CONFIDENCE 95;
+ CONFIDENCE 95%;
'''

valid_sum_sql = '''
SELECT SUM(bar)
FROM foo
ERROR_TARGET 1%
- CONFIDENCE 95;
+ CONFIDENCE 95%;
'''

# Invalid
unsupported_agg_aqp_sql = '''
SELECT MAX(bar)
FROM foo
ERROR_TARGET 1%
- CONFIDENCE 95;
+ CONFIDENCE 95%;
'''

agg_no_et_sql = '''
SELECT AVG(bar)
FROM foo
- CONFIDENCE 95;
+ CONFIDENCE 95%;
'''

agg_no_conf_sql = '''
@@ -123,4 +123,4 @@ def test_invalid_approximate_agg_join_sql(self):


if __name__ == '__main__':
unittest.main()
unittest.main()
