
Commit 6bab922: Merge pull request #116 from ing-bank/develop (v0.4.0)
2 parents e661970 + c12783b


54 files changed: +805 -3083 lines

.github/workflows/build.yml

Lines changed: 3 additions & 0 deletions

@@ -37,12 +37,15 @@ jobs:
         SPARK_VERSION: "2.4.7"
         HADOOP_VERSION: "2.7"
         SPARK_HOME: "/home/runner/work/spark/" #${{ github.workspace }}/spark/
+        SPARK_LOCAL_IP: "localhost"
       run: |
         sudo apt-get -y install openjdk-8-jdk
         curl https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz --output ${BUILD_DIR}/spark.tgz
         tar -xvzf ${BUILD_DIR}/spark.tgz && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} ${SPARK_HOME}
         pip install pytest-spark>=0.6.0 pyarrow>=0.8.0 pyspark==2.4.7
     - name: Test with pytest (spark-specific)
+      env:
+        SPARK_LOCAL_IP: "localhost"
       run: |
         pytest -m spark
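
The new SPARK_LOCAL_IP entries pin the Spark driver to localhost, which avoids hostname-resolution failures on CI runners. For context, `pytest -m spark` runs only tests carrying the spark marker; a minimal sketch of such a test, assuming the marker is registered in the project's pytest settings and using the spark_session fixture provided by the pytest-spark plugin installed above:

    import pytest


    @pytest.mark.spark
    def test_spark_roundtrip(spark_session):
        # spark_session comes from pytest-spark, which locates Spark via SPARK_HOME
        df = spark_session.createDataFrame([(1,), (2,), (3,)], ["x"])
        assert df.count() == 3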

.github/workflows/commit.yml

Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+name: Lint Commit Messages
+on: [pull_request]
+
+jobs:
+  commitlint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+      - uses: wagoid/commitlint-github-action@v3
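
This workflow validates every commit message on a pull request (fetch-depth: 0 pulls the full history so all commits can be checked). By default wagoid/commitlint-github-action applies the conventional-commits ruleset; a repo-level commitlint config, not shown in this diff, would take precedence. Illustrative messages under the default rules:

    build: update pyupgrade to v2.12.0    (accepted: "type: subject" form)
    updated some dependencies             (rejected: missing type prefix)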

.pre-commit-config.yaml

Lines changed: 3 additions & 3 deletions

@@ -4,18 +4,18 @@ repos:
     hooks:
       - id: black
   - repo: https://github.com/pycqa/isort
-    rev: 5.7.0
+    rev: 5.8.0
     hooks:
       - id: isort
         files: '.*'
         args: [ --profile=black, --project=popmon, --thirdparty histogrammar, --thirdparty pybase64 ]
   - repo: https://gitlab.com/pycqa/flake8
-    rev: "3.8.4"
+    rev: "3.9.0"
     hooks:
       - id: flake8
         args: [ "--select=E9,F63,F7,F82"]
   - repo: https://github.com/asottile/pyupgrade
-    rev: v2.10.0
+    rev: v2.12.0
     hooks:
       - id: pyupgrade
         args: ['--py36-plus','--exit-zero-even-if-changed']
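
Of these bumps, pyupgrade is the hook that rewrites code: with --py36-plus it modernizes pre-3.6 idioms in place, and --exit-zero-even-if-changed keeps the hook from failing the commit when it does. Two representative rewrites, illustrative rather than taken from this repo:

    version = "0.4.0"

    # before pyupgrade --py36-plus
    msg = "popmon {}".format(version)
    tags = set([1, 2, 3])

    # after: rewritten to an f-string and a set literal
    msg = f"popmon {version}"
    tags = {1, 2, 3}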

CHANGES.rst

Lines changed: 14 additions & 0 deletions

@@ -2,6 +2,20 @@
 Release notes
 =============

+Version 0.4.0, (16-04-2021)
+---------------------------
+Documentation:
+
+* **docs**: include BDTWS presentation
+* **docs**: clarify that ``time_axis`` should be date or numeric
+* **docs**: initialize spark with both histogrammar jar files
+
+Build system
+
+* **build**: Migrate to version 1.0.25 of ``histogrammar``.
+* **build**: update ``pyupgrade`` to v2.12.0
+* **build**: update ``isort`` to 5.8.0
+* **build**: update ``flake8`` to 3.9.0

 Version 0.3.14, Feb 2021
 ------------------------

README.rst

Lines changed: 21 additions & 21 deletions

@@ -29,24 +29,22 @@ With Spark 3.0, based on Scala 2.12, make sure to pick up the correct `histogram

 .. code-block:: python

-    spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar_2.12:1.0.11,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11").getOrCreate()
+    spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20").getOrCreate()

 For Spark 2.X compiled against scala 2.11, in the string above simply replace 2.12 with 2.11.

-`January 29, 2021`
+Examples
+========
+
+- `Flight Delays and Cancellations Kaggle data <https://crclz.com/popmon/reports/flight_delays_report.html>`_
+- `Synthetic data (code example below) <https://crclz.com/popmon/reports/test_data_report.html>`_

 Documentation
 =============

 The entire `popmon` documentation including tutorials can be found at `read-the-docs <https://popmon.readthedocs.io>`_.


-Examples
-========
-
-- `Flight Delays and Cancellations Kaggle data <https://crclz.com/popmon/reports/flight_delays_report.html>`_
-- `Synthetic data (code example below) <https://crclz.com/popmon/reports/test_data_report.html>`_
-
 Notebooks
 =========

@@ -151,19 +149,21 @@ Resources
 Presentations
 -------------

-+------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+------------------+-------------------------+
-| Title                                                                                            | Host                                                                                             | Date             | Speaker                 |
-+------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+------------------+-------------------------+
-| Popmon - population monitoring made easy                                                        | `Data Lunch @ Eneco <https://www.eneco.nl/>`_                                                    | October 29, 2020 | Max Baak, Simon Brugman |
-+------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+------------------+-------------------------+
-| Popmon - population monitoring made easy                                                        | `Data Science Summit 2020 <https://dssconf.pl/en/>`_                                             | October 16, 2020 | Max Baak                |
-+------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+------------------+-------------------------+
-| `Population Shift Monitoring Made Easy: the popmon package <https://youtu.be/PgaQpxzT_0g>`_     | `Online Data Science Meetup @ ING WBAA <https://www.meetup.com/nl-NL/Tech-Meetups-ING/events/>`_ | July 8 2020      | Tomas Sostak            |
-+------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+------------------+-------------------------+
-| `Popmon: Population Shift Monitoring Made Easy <https://www.youtube.com/watch?v=HE-3YeVYqPY>`_  | `PyData Fest Amsterdam 2020 <https://amsterdam.pydata.org/>`_                                    | June 16, 2020    | Tomas Sostak            |
-+------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+------------------+-------------------------+
-| Popmon: Population Shift Monitoring Made Easy                                                    | `Amundsen Community Meetup <https://github.com/amundsen-io/amundsen>`_                           | June 4, 2020     | Max Baak                |
-+------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+------------------+-------------------------+
++------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------+-------------------------+
+| Title                                                                                            | Host                                                                                             | Date              | Speaker                 |
++------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------+-------------------------+
+| Popmon - population monitoring made easy                                                        | `Big Data Technology Warsaw Summit 2021 <https://bigdatatechwarsaw.eu/>`_                        | February 25, 2021 | Simon Brugman           |
++------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------+-------------------------+
+| Popmon - population monitoring made easy                                                        | `Data Lunch @ Eneco <https://www.eneco.nl/>`_                                                    | October 29, 2020  | Max Baak, Simon Brugman |
++------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------+-------------------------+
+| Popmon - population monitoring made easy                                                        | `Data Science Summit 2020 <https://dssconf.pl/en/>`_                                             | October 16, 2020  | Max Baak                |
++------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------+-------------------------+
+| `Population Shift Monitoring Made Easy: the popmon package <https://youtu.be/PgaQpxzT_0g>`_     | `Online Data Science Meetup @ ING WBAA <https://www.meetup.com/nl-NL/Tech-Meetups-ING/events/>`_ | July 8 2020       | Tomas Sostak            |
++------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------+-------------------------+
+| `Popmon: Population Shift Monitoring Made Easy <https://www.youtube.com/watch?v=HE-3YeVYqPY>`_  | `PyData Fest Amsterdam 2020 <https://amsterdam.pydata.org/>`_                                    | June 16, 2020     | Tomas Sostak            |
++------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------+-------------------------+
+| Popmon: Population Shift Monitoring Made Easy                                                    | `Amundsen Community Meetup <https://github.com/amundsen-io/amundsen>`_                           | June 4, 2020      | Max Baak                |
++------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+-------------------+-------------------------+


 Articles

docs/source/configuration.rst

Lines changed: 5 additions & 4 deletions

@@ -57,6 +57,7 @@ To specify the time-axis binning alone, do:

     report = df.pm_stability_report(time_axis='date', time_width='1w', time_offset='2020-1-6')

+The ``time_axis`` argument should be the name of a column that is of type **numeric (e.g. batch id, time in ns) or date(time)**.
 The default time width is 30 days ('30d'), with time offset 2010-1-4 (a Monday).
 All other features (except for 'date') are auto-binned in this example.
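
A minimal end-to-end sketch of that requirement, using a pandas frame with a datetime column (importing popmon registers the pm_stability_report accessor; column names are illustrative):

    import pandas as pd
    import popmon  # noqa: registers the pm_stability_report accessor

    df = pd.DataFrame({
        "date": pd.date_range("2021-01-04", periods=90, freq="D"),  # date(time) axis
        "amount": range(90),
    })
    report = df.pm_stability_report(time_axis="date", time_width="1w", time_offset="2020-1-6")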

@@ -203,7 +204,7 @@ Spark usage

     from pyspark.sql import SparkSession

     # downloads histogrammar jar files if not already installed, used for histogramming of spark dataframe
-    spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar_2.12:1.0.11,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11").getOrCreate()
+    spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20").getOrCreate()

     # load a dataframe
     spark_df = spark.read.format('csv').options(header='true').load('file.csv')

@@ -221,8 +222,8 @@ This snippet contains the instructions for setting up a minimal environment for

     !apt-get install openjdk-8-jdk-headless -qq > /dev/null
     !wget -q https://www-us.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
     !tar xf spark-2.4.7-bin-hadoop2.7.tgz
-    !wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/io/github/histogrammar/histogrammar-sparksql_2.12/1.0.11/histogrammar-sparksql_2.12-1.0.11.jar
-    !wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/io/github/histogrammar/histogrammar_2.12/1.0.11/histogrammar_2.12-1.0.11.jar
+    !wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/io/github/histogrammar/histogrammar-sparksql_2.12/1.0.20/histogrammar-sparksql_2.12-1.0.20.jar
+    !wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/io/github/histogrammar/histogrammar_2.12/1.0.20/histogrammar_2.12-1.0.20.jar
     !pip install -q findspark popmon

 Now that spark is installed, restart the runtime.

@@ -239,7 +240,7 @@ Now that spark is installed, restart the runtime.

     from pyspark.sql import SparkSession

     spark = SparkSession.builder.master("local[*]") \
-        .config("spark.jars", "/content/jars/histogrammar_2.12-1.0.11.jar,/content/jars/histogrammar-sparksql_2.12-1.0.11.jar") \
+        .config("spark.jars", "/content/jars/histogrammar_2.12-1.0.20.jar,/content/jars/histogrammar-sparksql_2.12-1.0.20.jar") \
         .config("spark.sql.execution.arrow.enabled", "false") \
         .config("spark.sql.session.timeZone", "GMT") \
         .getOrCreate()
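
With the session above in place, a Spark frame is profiled the same way as a pandas one; a sketch combining the snippets in this file (to_file writes the HTML report, as in the pandas examples elsewhere in the docs):

    import popmon  # noqa: registers pm_stability_report on the dataframe

    # spark_df as loaded earlier in this section; time_axis must again be a
    # date(time) or numeric column
    report = spark_df.pm_stability_report(time_axis="date")
    report.to_file("spark_report.html")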

popmon/__init__.py

Lines changed: 7 additions & 3 deletions

@@ -18,12 +18,16 @@
 # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


-# flake8: noqa
+# histogram and report functions
+from histogrammar.dfinterface.make_histograms import (
+    get_bin_specs,
+    get_time_axes,
+    make_histograms,
+)
+

 # pandas/spark dataframe decorators
 from popmon import decorators
-# histogram and report functions
-from .hist.filling import get_bin_specs, get_time_axes, make_histograms
 from .pipeline.metrics import df_stability_metrics, stability_metrics
 from .pipeline.report import df_stability_report, stability_report
 from .stitching import stitch_histograms
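
The public names are unchanged by the move to histogrammar's dfinterface; a sketch of the re-exported histogramming entry points (column names are illustrative, and get_time_axes is assumed to return the detected datetime columns):

    import pandas as pd

    from popmon import get_time_axes, make_histograms

    df = pd.DataFrame({
        "date": pd.date_range("2021-01-04", periods=10, freq="D"),
        "x": range(10),
    })
    print(get_time_axes(df))  # assumed: ['date']
    hists = make_histograms(df, time_axis="date")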

popmon/alerting/compute_tl_bounds.py

Lines changed: 1 addition & 1 deletion

@@ -329,7 +329,7 @@ def df_single_op_pull_bounds(
     :param list cols: list of cols to calculate bounds of (optional)
     """
     if len(df.index) == 0:
-        raise RuntimeError("input df has zero length")
+        raise ValueError("input df has zero length")
     row = df.iloc[0]
     return pull_bounds(
         row, red_high, yellow_high, yellow_low, red_low, suffix_mean, suffix_std, cols
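
ValueError is the idiomatic signal for an invalid argument, so callers can catch it specifically rather than trapping a blanket RuntimeError. A minimal sketch of the new contract (check_not_empty is illustrative, not a popmon function):

    import pandas as pd


    def check_not_empty(df: pd.DataFrame) -> None:
        # mirrors the guard at the top of df_single_op_pull_bounds
        if len(df.index) == 0:
            raise ValueError("input df has zero length")


    try:
        check_not_empty(pd.DataFrame())
    except ValueError as err:
        print(err)  # input df has zero length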

popmon/analysis/comparison/hist_comparer.py

Lines changed: 14 additions & 14 deletions

@@ -39,7 +39,7 @@
     get_consistent_numpy_entries,
 )
 from ...base import Pipeline
-from ...hist.histogram import HistogramContainer
+from ...hist.hist_utils import COMMON_HIST_TYPES, is_numeric
 from ...stats.numpy import googl_test, ks_prob, ks_test, uu_chi2


@@ -78,21 +78,21 @@ def hist_compare(row, hist_name1="", hist_name2="", max_res_bound=7.0):
         hist_name1 = cols[0]
         hist_name2 = cols[1]
         if not all([name in cols for name in [hist_name1, hist_name2]]):
-            raise RuntimeError("Need to provide two histogram column names.")
+            raise ValueError("Need to provide two histogram column names.")

     # basic histogram checks
-    hc1 = row[hist_name1]
-    hc2 = row[hist_name2]
-    if not all([isinstance(hc, HistogramContainer) for hc in [hc1, hc2]]):
+    hist1 = row[hist_name1]
+    hist2 = row[hist_name2]
+    if not all([isinstance(hist, COMMON_HIST_TYPES) for hist in [hist1, hist2]]):
         return x
-    if not check_similar_hists([hc1, hc2]):
+    if not check_similar_hists([hist1, hist2]):
         return x

     # compare
-    is_num = hc1.is_num
-    if hc1.n_dim == 1:
+    is_num = is_numeric(hist1)
+    if hist1.n_dim == 1:
         if is_num:
-            numpy_1dhists = get_consistent_numpy_1dhists([hc1, hc2])
+            numpy_1dhists = get_consistent_numpy_1dhists([hist1, hist2])
             entries_list = [nphist[0] for nphist in numpy_1dhists]
             # KS-test only properly defined for (ordered) 1D interval variables
             ks_testscore = ks_test(*entries_list)

@@ -101,14 +101,14 @@ def hist_compare(row, hist_name1="", hist_name2="", max_res_bound=7.0):
             x["ks_pvalue"] = ks_pvalue
             x["ks_zscore"] = -norm.ppf(ks_pvalue)
         else:  # categorical
-            entries_list = get_consistent_numpy_entries([hc1, hc2])
+            entries_list = get_consistent_numpy_entries([hist1, hist2])
             # check consistency of bin_labels
-            labels1 = hc1.hist.bin_labels()
-            labels2 = hc2.hist.bin_labels()
+            labels1 = hist1.bin_labels()
+            labels2 = hist2.bin_labels()
             subset = set(labels1) <= set(labels2)
             unknown_labels = int(not subset)
-    elif hc1.n_dim == 2:
-        numpy_2dgrids = get_consistent_numpy_2dgrids([hc1, hc2])
+    elif hist1.n_dim == 2:
+        numpy_2dgrids = get_consistent_numpy_2dgrids([hist1, hist2])
         entries_list = [entry.flatten() for entry in numpy_2dgrids]

         # calculate pearson coefficient
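
With HistogramContainer gone, the comparer receives histogrammar primitives directly and derives numeric-versus-categorical from the histogram type itself. A sketch, assuming histogrammar's Bin and Categorize primitives accept their default quantities:

    import histogrammar as hg

    from popmon.hist.hist_utils import is_numeric

    h_num = hg.Bin(num=10, low=0.0, high=1.0)  # regularly binned numeric histogram
    h_cat = hg.Categorize()                    # labelled categorical histogram
    print(is_numeric(h_num), is_numeric(h_cat))  # True False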
