Commit 70917d2

Merge pull request #177 from ing-bank/develop: v0.5.0

2 parents: 729d61a + 92eee2b


52 files changed (+2027 / −1359 lines)
(new file — a GitHub Actions CodeQL workflow)

Lines changed: 60 additions & 0 deletions

@@ -0,0 +1,60 @@
+name: "CodeQL"
+
+on:
+  push:
+    branches: [ master, develop ]
+  pull_request:
+    # The branches below must be a subset of the branches above
+    branches: [ master ]
+  schedule:
+    - cron: '22 11 * * 2'
+
+jobs:
+  analyze:
+    name: Analyze
+    runs-on: ubuntu-latest
+    permissions:
+      actions: read
+      contents: read
+      security-events: write
+
+    strategy:
+      fail-fast: false
+      matrix:
+        language: [ 'python' ]
+        # CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python' ]
+        # Learn more:
+        # https://docs.github.com/en/free-pro-team@latest/github/finding-security-vulnerabilities-and-errors-in-your-code/configuring-code-scanning#changing-the-languages-that-are-analyzed
+
+    steps:
+    - name: Checkout repository
+      uses: actions/checkout@v2
+
+    # Initializes the CodeQL tools for scanning.
+    - name: Initialize CodeQL
+      uses: github/codeql-action/init@v1
+      with:
+        languages: ${{ matrix.language }}
+        # If you wish to specify custom queries, you can do so here or in a config file.
+        # By default, queries listed here will override any specified in a config file.
+        # Prefix the list here with "+" to use these queries and those in the config file.
+        # queries: ./path/to/local/query, your-org/your-repo/queries@main
+
+    # Autobuild attempts to build any compiled languages (C/C++, C#, or Java).
+    # If this step fails, then you should remove it and run the build manually (see below)
+    - name: Autobuild
+      uses: github/codeql-action/autobuild@v1
+
+    # ℹ️ Command-line programs to run using the OS shell.
+    # 📚 https://git.io/JvXDl
+
+    # ✏️ If the Autobuild fails above, remove it and uncomment the following three lines
+    #    and modify them (or add more) to build your code if your project
+    #    uses a compiled language
+
+    #- run: |
+    #   make bootstrap
+    #   make release
+
+    - name: Perform CodeQL Analysis
+      uses: github/codeql-action/analyze@v1

.pre-commit-config.yaml

Lines changed: 9 additions & 4 deletions

@@ -1,10 +1,10 @@
 repos:
 - repo: https://github.com/psf/black
-  rev: 21.9b0
+  rev: 21.11b1
   hooks:
   - id: black
 - repo: https://github.com/pycqa/isort
-  rev: 5.9.3
+  rev: 5.10.1
   hooks:
   - id: isort
     files: '.*'
@@ -15,9 +15,14 @@ repos:
   - id: flake8
     additional_dependencies:
     - flake8-comprehensions
-    args: [ "--select=E9,F63,F7,F82,C4"]
+    - tryceratops
+    args: [ "--select=E9,F63,F7,F82,C4,F401,TR004,TC200,TC201,TC202"]
 - repo: https://github.com/asottile/pyupgrade
-  rev: v2.29.0
+  rev: v2.29.1
   hooks:
   - id: pyupgrade
     args: ['--py36-plus','--exit-zero-even-if-changed']
+- repo: https://github.com/asottile/blacken-docs
+  rev: v1.12.0
+  hooks:
+  - id: blacken-docs

.readthedocs.yml

Lines changed: 2 additions & 1 deletion

@@ -7,4 +7,5 @@ build:
 python:
   version: 3.8
   setup_py_install: true
-
+  install:
+    - requirements: docs/requirements.txt

CHANGELOG.md

Lines changed: 1 addition & 1 deletion

@@ -66,4 +66,4 @@
 
 ## v0.4.0 and before
 
-The release notes for preceding versions are available `here <https://github.com/ing-bank/popmon/blob/master/CHANGES.rst>`_
+The release notes for preceding versions are available [here](https://github.com/ing-bank/popmon/blob/master/CHANGES.rst>).

README.rst

Lines changed: 39 additions & 13 deletions

@@ -29,7 +29,10 @@ With Spark 3.0, based on Scala 2.12, make sure to pick up the correct `histogram
 
 .. code-block:: python
 
-    spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20").getOrCreate()
+    spark = SparkSession.builder.config(
+        "spark.jars.packages",
+        "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20",
+    ).getOrCreate()
 
 For Spark 2.X compiled against scala 2.11, in the string above simply replace 2.12 with 2.11.
 
@@ -101,12 +104,12 @@ As a quick example, you can do:
     from popmon import resources
 
     # open synthetic data
-    df = pd.read_csv(resources.data('test.csv.gz'), parse_dates=['date'])
+    df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
     df.head()
 
     # generate stability report using automatic binning of all encountered features
    # (importing popmon automatically adds this functionality to a dataframe)
-    report = df.pm_stability_report(time_axis='date', features=['date:age', 'date:gender'])
+    report = df.pm_stability_report(time_axis="date", features=["date:age", "date:gender"])
 
     # to show the output of the report in a Jupyter notebook you can simply run:
     report
@@ -119,23 +122,32 @@ To specify your own binning specifications and features you want to report on, y
 .. code-block:: python
 
     # time-axis specifications alone; all other features are auto-binned.
-    report = df.pm_stability_report(time_axis='date', time_width='1w', time_offset='2020-1-6')
+    report = df.pm_stability_report(
+        time_axis="date", time_width="1w", time_offset="2020-1-6"
+    )
 
     # histogram selections. Here 'date' is the first axis of each histogram.
-    features=[
-        'date:isActive', 'date:age', 'date:eyeColor', 'date:gender',
-        'date:latitude', 'date:longitude', 'date:isActive:age'
+    features = [
+        "date:isActive",
+        "date:age",
+        "date:eyeColor",
+        "date:gender",
+        "date:latitude",
+        "date:longitude",
+        "date:isActive:age",
     ]
 
     # Specify your own binning specifications for individual features or combinations thereof.
    # This bin specification uses open-ended ("sparse") histograms; unspecified features get
    # auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
-    bin_specs={
-        'longitude': {'bin_width': 5.0, 'bin_offset': 0.0},
-        'latitude': {'bin_width': 5.0, 'bin_offset': 0.0},
-        'age': {'bin_width': 10.0, 'bin_offset': 0.0},
-        'date': {'bin_width': pd.Timedelta('4w').value,
-                 'bin_offset': pd.Timestamp('2015-1-1').value}
+    bin_specs = {
+        "longitude": {"bin_width": 5.0, "bin_offset": 0.0},
+        "latitude": {"bin_width": 5.0, "bin_offset": 0.0},
+        "age": {"bin_width": 10.0, "bin_offset": 0.0},
+        "date": {
+            "bin_width": pd.Timedelta("4w").value,
+            "bin_offset": pd.Timestamp("2015-1-1").value,
+        },
     }
 
     # generate stability report
@@ -145,6 +157,17 @@ These examples also work with spark dataframes.
 You can see the output of such example notebook code `here <https://crclz.com/popmon/reports/test_data_report.html>`_.
 For all available examples, please see the `tutorials <https://popmon.readthedocs.io/en/latest/tutorials.html>`_ at read-the-docs.
 
+Pipelines for monitoring dataset shift
+======================================
+Advanced users can leverage popmon's modular data pipeline to customize their workflow.
+Visualization of the pipeline can be useful when debugging, or for didactic purposes.
+There is a `script <https://github.com/ing-bank/popmon/tree/master/tools/>`_ included with the package that you can use.
+The plotting is configurable, and depending on the options you will obtain a result that can be used for understanding the data flow, the high-level components and the (re)use of datasets.
+
+|pipeline|
+
+*Example pipeline visualization (click to enlarge)*
+
 Resources
 =========
 
@@ -202,6 +225,9 @@ Copyright ING WBAA. `popmon` is completely free, open-source and licensed under
     :target: https://github.com/ing-bank/popmon
 .. |example| image:: https://raw.githubusercontent.com/ing-bank/popmon/master/docs/source/assets/traffic_light_overview.png
     :alt: Traffic Light Overview
+.. |pipeline| image:: https://raw.githubusercontent.com/ing-bank/popmon/master/docs/source/assets/pipeline.png
+    :alt: Pipeline Visualization
+    :target: https://github.com/ing-bank/popmon/files/7417124/pipeline_amazingpipeline_subgraphs_unversioned.pdf
 .. |build| image:: https://github.com/ing-bank/popmon/workflows/build/badge.svg
     :alt: Build status
 .. |docs| image:: https://readthedocs.org/projects/popmon/badge/?version=latest

bump.py

Lines changed: 2 additions & 2 deletions

@@ -2,8 +2,8 @@
 from pathlib import Path
 
 MAJOR = 0
-REVISION = 4
-PATCH = 4
+REVISION = 5
+PATCH = 0
 VERSION = f"{MAJOR}.{REVISION}.{PATCH}"
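For context, the bumped constants compose the package version string like so (a standalone sketch of the `VERSION` line above; the rest of `bump.py` is not shown):

```python
# Standalone sketch: how bump.py's constants compose the version string.
MAJOR = 0
REVISION = 5
PATCH = 0
VERSION = f"{MAJOR}.{REVISION}.{PATCH}"
print(VERSION)  # → 0.5.0
```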

docs/requirements.txt

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+sphinx_rtd_theme
+myst_parser

docs/source/assets/pipeline.png

26.6 KB

docs/source/configuration.rst

Lines changed: 42 additions & 22 deletions

@@ -55,7 +55,9 @@ To specify the time-axis binning alone, do:
 
 .. code-block:: python
 
-    report = df.pm_stability_report(time_axis='date', time_width='1w', time_offset='2020-1-6')
+    report = df.pm_stability_report(
+        time_axis="date", time_width="1w", time_offset="2020-1-6"
+    )
 
 The ``time_axis`` argument should be the name of a column that is of type **numeric (e.g. batch id, time in ns) or date(time)**.
 The default time width is 30 days ('30d'), with time offset 2010-1-4 (a Monday).
@@ -72,11 +74,15 @@ An example bin_specs dictionary is:
 
 .. code-block:: python
 
-    bin_specs = {'x': {'bin_width': 1, 'bin_offset': 0},
-                 'y': {'num': 10, 'low': 0.0, 'high': 2.0},
-                 'x:y': [{}, {'num': 5, 'low': 0.0, 'high': 1.0}],
-                 'date': {'bin_width': pd.Timedelta('4w').value,
-                          'bin_offset': pd.Timestamp('2015-1-1').value}}
+    bin_specs = {
+        "x": {"bin_width": 1, "bin_offset": 0},
+        "y": {"num": 10, "low": 0.0, "high": 2.0},
+        "x:y": [{}, {"num": 5, "low": 0.0, "high": 1.0}],
+        "date": {
+            "bin_width": pd.Timedelta("4w").value,
+            "bin_offset": pd.Timestamp("2015-1-1").value,
+        },
+    }
 
 In the bin specs for 'x:y', 'x' is not provided (here) and reverts to the 1-dim setting.
 Any time-axis, when specified here ('date'), needs to be specified in nanoseconds. This takes precedence over
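Since the nanosecond requirement is easy to get wrong, here is a quick sanity check (a hedged, standard-library-only sketch; it assumes that `pd.Timedelta("4w").value` returns the interval as an integer number of nanoseconds):

```python
from datetime import timedelta

# 4 weeks expressed in nanoseconds — the value that pd.Timedelta("4w").value
# would yield for the 'date' bin_width above (assumption: .value is in ns).
four_weeks_ns = int(timedelta(weeks=4).total_seconds()) * 10**9
print(four_weeks_ns)  # → 2419200000000000
```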
@@ -112,9 +118,11 @@ When not provided, the default setting is:
 
 .. code-block:: python
 
-    monitoring_rules = {"*_pull": [7, 4, -4, -7],
-                        "*_zscore": [7, 4, -4, -7],
-                        "[!p]*_unknown_labels": [0.5, 0.5, 0, 0]}
+    monitoring_rules = {
+        "*_pull": [7, 4, -4, -7],
+        "*_zscore": [7, 4, -4, -7],
+        "[!p]*_unknown_labels": [0.5, 0.5, 0, 0],
+    }
 
 Note that the (filename based) wildcards such as * apply to all statistic names matching that pattern.
 For example, ``"*_pull"`` applies for all features to all statistics ending on "_pull". Same for ``"*_zscore"``.
@@ -132,11 +140,13 @@ feature name in front. This also works for a combinations of two features. E.g.
 
 .. code-block:: python
 
-    monitoring_rules = {"featureA:*_pull": [5, 3, -3, -5],
-                        "featureA:featureB:*_pull": [6, 3, -3, -6],
-                        "featureA:nan": [4, 1, 0, 0],
-                        "*_pull": [7, 4, -4, -7],
-                        "nan": [8, 1, 0, 0]}
+    monitoring_rules = {
+        "featureA:*_pull": [5, 3, -3, -5],
+        "featureA:featureB:*_pull": [6, 3, -3, -6],
+        "featureA:nan": [4, 1, 0, 0],
+        "*_pull": [7, 4, -4, -7],
+        "nan": [8, 1, 0, 0],
+    }
 
 In the case where multiple rules could apply for a feature's statistic, the most specific one gets applied.
 So in case of the statistic "nan": "featureA:nan" is used for "featureA", and the other "nan" rule
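The "most specific rule wins" behaviour described in the diff context can be illustrated with filename-style pattern matching. This is a hedged sketch, not popmon's actual resolution code: it assumes the rule keys are fnmatch-style wildcards and models specificity simply by ordering the dict from most to least specific:

```python
from fnmatch import fnmatch

# Rule patterns ordered from most to least specific (ordering is this sketch's
# stand-in for popmon's specificity resolution, which may differ internally).
monitoring_rules = {
    "featureA:nan": [4, 1, 0, 0],
    "*_pull": [7, 4, -4, -7],
    "nan": [8, 1, 0, 0],
}

def resolve(statistic_key, rules):
    """Return the traffic-light bounds of the first matching pattern."""
    for pattern, bounds in rules.items():
        if fnmatch(statistic_key, pattern):
            return bounds
    return None

print(resolve("featureA:nan", monitoring_rules))  # → [4, 1, 0, 0]
print(resolve("age_pull", monitoring_rules))      # → [7, 4, -4, -7]
print(resolve("nan", monitoring_rules))           # → [8, 1, 0, 0]
```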
@@ -204,13 +214,16 @@ Spark usage
     from pyspark.sql import SparkSession
 
     # downloads histogrammar jar files if not already installed, used for histogramming of spark dataframe
-    spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20").getOrCreate()
+    spark = SparkSession.builder.config(
+        "spark.jars.packages",
+        "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20",
+    ).getOrCreate()
 
     # load a dataframe
-    spark_df = spark.read.format('csv').options(header='true').load('file.csv')
+    spark_df = spark.read.format("csv").options(header="true").load("file.csv")
 
     # generate the report
-    report = spark_df.pm_stability_report(time_axis='timestamp')
+    report = spark_df.pm_stability_report(time_axis="timestamp")
 
 
 Spark example on Google Colab
@@ -231,16 +244,23 @@ Now that spark is installed, restart the runtime.
 .. code-block:: python
 
     import os
+
     os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
     os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"
 
     import findspark
+
     findspark.init()
 
     from pyspark.sql import SparkSession
 
-    spark = SparkSession.builder.master("local[*]") \
-        .config("spark.jars", "/content/jars/histogrammar_2.12-1.0.20.jar,/content/jars/histogrammar-sparksql_2.12-1.0.20.jar") \
-        .config("spark.sql.execution.arrow.enabled", "false") \
-        .config("spark.sql.session.timeZone", "GMT") \
-        .getOrCreate()
+    spark = (
+        SparkSession.builder.master("local[*]")
+        .config(
+            "spark.jars",
+            "/content/jars/histogrammar_2.12-1.0.20.jar,/content/jars/histogrammar-sparksql_2.12-1.0.20.jar",
+        )
+        .config("spark.sql.execution.arrow.enabled", "false")
+        .config("spark.sql.session.timeZone", "GMT")
+        .getOrCreate()
+    )

examples/flight_delays.py

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 import pandas as pd
 
-import popmon
+import popmon  # noqa
 from popmon import resources
 
 # open synthetic data
