Tasting data like a pro; easy and powerful data sampling with Apache Spark

bsc-dd/sommelier

Sommelier

We are thrilled to present a new approach to analyzing Big Data while accessing as little data as possible: Sommelier Sampling.

With the rise of Big Data and analytics, the demand for optimizing how data is handled grows every day. Executing complex analytics queries on ever-increasing datasets costs time and money. Access to data will become more expensive and will run into memory walls, which makes sampling techniques a natural solution.

Sampling is a powerful but also feared technique for approximating query answers. The main issue with sampling on large data platforms is that it is hard to obtain sizable savings with only a small effect on answer quality, and error estimation remains challenging.
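To make the trade-off between savings and answer quality concrete, here is a minimal sketch in plain Python (not Spark, and not code from this repository). It estimates a mean from a 1% uniform sample and attaches an approximate 95% error bound based on the standard error. The function name `approximate_mean`, the synthetic population and the chosen fraction are all assumptions made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, fraction, seed=0):
    """Estimate the mean of `values` from a uniform sample.

    Returns the estimate and the half-width of an approximate 95%
    confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    n = max(2, int(len(values) * fraction))
    sample = rng.sample(values, n)  # uniform sample without replacement
    estimate = statistics.fmean(sample)
    # Standard error of the mean; 1.96 is the normal quantile for ~95%.
    half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
    return estimate, half_width


# Hypothetical population: 100,000 normally distributed values.
rng = random.Random(42)
population = [rng.gauss(100.0, 15.0) for _ in range(100_000)]

exact = statistics.fmean(population)
estimate, half_width = approximate_mean(population, fraction=0.01)
print(f"exact={exact:.2f} estimate={estimate:.2f} +/- {half_width:.2f}")
```

The sample reads only 1% of the rows, and the half-width shrinks roughly with the square root of the sample size. This simple bound assumes a uniform sample and a reasonably well-behaved distribution; it is not expected to hold as-is for heavily skewed data or for queries with joins and group-bys.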

Some experts have published reports demonstrating its benefits, along with rules and calculations for estimating the error. Building on that work, we propose to study those results and implement an open-source version of them.

Doc

In the spec.ipynb notebook we present a few examples to better understand the different types of samples that exist and when they should be used. We create synthetic datasets to show how normally distributed data and skewed data affect query results when sampling techniques are used. We will see how the cardinality of the data is key.
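As a rough illustration of why skew matters (a sketch in plain Python, not taken from spec.ipynb), the example below compares a uniform sample with a stratified sample on a synthetic dataset where one group is very rare. The helper names `uniform_sample` and `stratified_sample`, the `min_rows` parameter and the group sizes are assumptions chosen for the example.

```python
import random
from collections import defaultdict


def uniform_sample(rows, fraction, seed=0):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def stratified_sample(rows, key, fraction, min_rows=5, seed=0):
    """Sample each group separately, keeping at least `min_rows` per group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # Never ask for more rows than the group has.
        k = min(len(members), max(min_rows, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical skewed dataset: one dominant group and one rare group.
rows = [("common", i) for i in range(99_990)] + [("rare", i) for i in range(10)]

uniform = uniform_sample(rows, 0.01)
stratified = stratified_sample(rows, key=lambda r: r[0], fraction=0.01)

print("groups in uniform sample:   ", {g for g, _ in uniform})
print("groups in stratified sample:", {g for g, _ in stratified})
```

With only 10 rare rows and a 1% fraction, a uniform sample expects about 0.1 rare rows, so a group-by query on it is likely to miss that group entirely. The stratified sample always retains the group, but its rows are no longer equally likely, so aggregates computed from it need per-group reweighting.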

Discussion

We look forward to your feedback! If you have any suggestions about the project structure, or any doubts, make sure to let us know. The issues are open.

Also, if you want to contribute, don't hesitate to contact us!
