deequ.NET is a port of awslabs/deequ, a library built on top of Apache Spark for defining "unit tests for data" that measure data quality in large datasets. deequ.NET runs on dotnet/spark.
deequ.NET runs on Apache Spark and depends on dotnet/spark, so their dependencies must be installed locally.
You must also install Microsoft.Spark.Worker on your local machine and add its location to the PATH environment variable.
For detailed instructions, see dotnet/spark - Getting started.
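As a minimal sketch of the PATH configuration described above, the following shell snippet prepends a worker directory to PATH for the current session and reports whether the worker executable can be resolved. The directory `$HOME/bin/Microsoft.Spark.Worker` is a hypothetical install location, and the executable name `Microsoft.Spark.Worker` is assumed from the package name; adjust both to match your installation.

```shell
# Hypothetical location where the Microsoft.Spark.Worker release was extracted.
# Override it by setting WORKER_DIR before running this snippet.
WORKER_DIR="${WORKER_DIR:-$HOME/bin/Microsoft.Spark.Worker}"

# Prepend the directory to PATH for the current shell session.
export PATH="$WORKER_DIR:$PATH"

# Report whether the worker executable can now be resolved.
if command -v Microsoft.Spark.Worker >/dev/null 2>&1; then
  echo "Microsoft.Spark.Worker found"
else
  echo "Microsoft.Spark.Worker not found in $WORKER_DIR"
fi
```

To make the change permanent, the `export` line would typically go into your shell profile.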
The following example implements a set of checks on some records and submits the execution using the spark-submit command.
- Use the dotnet CLI to create a console application:

      dotnet new console -o DeequExample
- Install the Microsoft.Spark and deequ NuGet packages into the project:

      cd DeequExample
      dotnet add package Microsoft.Spark
      dotnet add package deequ
- Replace the contents of the Program.cs file with the following code:

      using deequ;
      using deequ.Checks;
      using deequ.Extensions;
      using Microsoft.Spark.Sql;

      namespace DeequExample
      {
          class Program
          {
              static void Main(string[] args)
              {
                  SparkSession spark = SparkSession.Builder().GetOrCreate();
                  DataFrame data = spark.Read().Json("inventory.json");

                  data.Show();

                  VerificationResult verificationResult = new VerificationSuite()
                      .OnData(data)
                      .AddCheck(
                          new Check(CheckLevel.Error, "integrity checks")
                              .HasSize(value => value == 5)
                              .IsComplete("id")
                              .IsUnique("id")
                              .IsComplete("productName")
                              .IsContainedIn("priority", new[] { "high", "low" })
                              .IsNonNegative("numViews")
                      )
                      .AddCheck(
                          new Check(CheckLevel.Warning, "distribution checks")
                              .ContainsURL("description", value => value >= .5)
                      )
                      .Run();

                  verificationResult.Debug();
              }
          }
      }
- Use the dotnet CLI to build the application:

      dotnet build
- Open your terminal and navigate into your app folder:

      cd <your-app-output-directory>
- Create inventory.json with the following content:

      {"id":1, "productName":"Thingy A", "description":"awesome thing. http://thingb.com", "priority":"high", "numViews":0}
      {"id":2, "productName":"Thingy B", "description":"available at http://thingb.com","priority":null, "numViews":0}
      {"id":3, "productName":"Thingy C", "description": null, "priority":"low", "numViews":5}
      {"id":4, "productName":"Thingy D", "description": "checkout https://thingd.ca", "priority":"low","numViews": 10}
      {"id":5, "productName":"Thingy E", "description":null, "priority":"high","numViews": 12}
- Run your app:

      spark-submit \
        --class org.apache.spark.deploy.dotnet.DotnetRunner \
        --master local \
        microsoft-spark-2.4.x-<version>.jar \
        dotnet DeequExample.dll

  Note: This command requires Apache Spark in your PATH environment variable to be able to use spark-submit. For detailed instructions, you can see Building .NET for Apache Spark from Source on Ubuntu.

- The output of the application should look similar to the output below:

           _                         _   _ ______ _______
          | |                       | \ | |  ____|__   __|
        __| | ___  ___  __ _ _   _  |  \| | |__     | |
       / _` |/ _ \/ _ \/ _` | | | | | . ` |  __|    | |
      | (_| |  __/  __/ (_| | |_| |_| |\  | |____   | |
       \__,_|\___|\___|\__, |\__,_(_)_| \_|______|  |_|
                          | |
                          |_|

      Success
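To see why the "distribution checks" in the example pass on the sample inventory.json, the arithmetic behind `HasSize` and `ContainsURL` can be approximated outside Spark. The sketch below is not deequ.NET: it recreates the five records in a scratch file and counts them with standard shell tools. The URL pattern is a simplified stand-in, and counting records with a null description as non-matching is an assumption.

```shell
# Recreate the sample records from the walkthrough in a scratch file.
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
{"id":1, "productName":"Thingy A", "description":"awesome thing. http://thingb.com", "priority":"high", "numViews":0}
{"id":2, "productName":"Thingy B", "description":"available at http://thingb.com","priority":null, "numViews":0}
{"id":3, "productName":"Thingy C", "description": null, "priority":"low", "numViews":5}
{"id":4, "productName":"Thingy D", "description": "checkout https://thingd.ca", "priority":"low","numViews": 10}
{"id":5, "productName":"Thingy E", "description":null, "priority":"high","numViews": 12}
EOF

# HasSize: number of records (one JSON object per line).
total=$(grep -c '' "$tmp")

# ContainsURL: records whose description holds an http(s) URL.
# This simple pattern is a stand-in, not the pattern deequ.NET uses.
with_url=$(grep -c -E '"description": *"[^"]*https?://' "$tmp")

# Fraction of matching records, to compare with the 0.5 threshold.
fraction=$(awk -v m="$with_url" -v n="$total" 'BEGIN { printf "%.1f", m / n }')

echo "rows=$total with_url=$with_url fraction=$fraction"
rm -f "$tmp"
```

Three of the five descriptions contain a URL, so the fraction is 0.6, which satisfies `value >= .5`, and the record count of 5 satisfies `value == 5`.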
The following list shows more examples/showcases of the deequ.NET API:
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (August 2018), 1781-1794.