Releases: DistrictDataLabs/baleen

Version 0.3.3

18 Apr 19:28

We extended the Baleen export functionality to dump either an HTML or JSON corpus to disk in a format suitable for NLP analysis, particularly with NLTK. The new exporter is still single-process, but it takes steps to reduce both the time an export takes and the memory it requires. Additionally, we improved the visual interface of the web application, making status messages more noticeable as we monitor continued data ingestion.

The app can be found online at http://baleen.districtdatalabs.com.

Deployed: Monday, April 18, 2016
Contributors: Benjamin Bengfort, Sasan Bahadaran

Changes

  • Updated the exporter to use no_dereferencing and no_cache
  • Updated the exporter to write out a JSON meta file of feeds
  • Exporter can now export in either JSON (default) or HTML
  • Exporter is now memory and query optimized as far as we can take it
  • No HTML sanitization occurs in the exporter anymore
  • Added a bit of colorization to the web app status page for quick duration identification
  • Added iconography to the feeds and status page for better visualization
  • Better datetime formatting that shows the timezone and is easier to read
  • Inclusion of the humanize package for timesince and intcomma readability
  • Made the status page responsive

Version 0.3.2

13 Apr 14:39

Some changes to the web application to attempt to solve SEGFAULT errors and to make the status and the logs more readable. This is just a quick hotfix to make sure we have decent monitoring in the app.

The app can be found online at http://baleen.districtdatalabs.com.

Deployed: Wednesday, April 13, 2016
Contributors: Benjamin Bengfort

Changes

  • Bootstrapified the status page
  • Added a job history listing to status page
  • Added a duration computation to the Job model
  • Created a mongoengine Log model
  • Added in a helper for flask.ext.MongoEngine to make db connections better
  • Removed log file reading and now read from the database
  • Added in Flask humanize for better visibility in the status page
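
As an illustration of the duration computation mentioned above, here is a minimal sketch. It uses a plain Python class rather than the actual mongoengine Job model, and the `started` and `finished` attribute names and the `duration` method are assumptions, not Baleen's real fields.

```python
from datetime import datetime, timedelta, timezone


class Job:
    """Hypothetical stand-in for the Job model; field names are assumed."""

    def __init__(self, started, finished=None):
        self.started = started
        self.finished = finished

    def duration(self, now=None):
        # A finished job measures start to finish; a running job
        # measures start to the current time (or an injected `now`).
        end = self.finished or now or datetime.now(timezone.utc)
        return end - self.started


start = datetime(2016, 4, 13, 14, 0, tzinfo=timezone.utc)
done = Job(start, start + timedelta(minutes=7, seconds=30))
running = Job(start)
```

Accepting an optional `now` keeps the computation easy to test, since a status page can report the elapsed time of a job that is still running.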

Version 0.3.1

07 Apr 16:54

Very happy to have had @lauralorenz and @bahadasx contribute to Baleen by building a web admin app. The app is a very simple Flask app that reads from the database and reports on the status, including the list of available feeds. It also reports information from the log file.

The app can be found online now at http://baleen.districtdatalabs.com.

Deployed: Thursday, April 7, 2016
Contributors: Sasan Bahadaran, Laura Lorenz, and Benjamin Bengfort

Changes

  • Created a Docker configuration and setup for easier development
  • Improved the export functionality for a quick corpus
  • Created a Flask web application for Baleen administration
  • Added a feeds listing page to quickly see what feeds are being ingested
  • Added a job status page that reports on the current Baleen status
  • Added a log file reader to inspect what's going on in the log file
  • Added Bootstrap and Baleen integration
  • Created a serve sub command for Baleen for easy management
  • Created a deployment method with uWSGI + Nginx

Version 0.3

03 Mar 19:31

Releases one day after another! The reason is that Baleen needs to be running in production to gather a large enough corpus for PyCon. Version 0.3 is a big release that implements the revised component architecture. It should be more stable, give more visibility into what's going on, be easier to update and fix, and have a few more features. Features include tracking ingestion jobs in the Mongo database (so we can add a web application) and decoupling the synchronization of feeds from the wrangling of posts. We also added Commis for easier console utility management, along with some other tools and tests.

Deployed: Thursday, March 3, 2016
Contributors: Benjamin Bengfort

Changes

  • Added the Commis library for our new console utility, which gives us more flexibility in the application
  • Added a feed synchronization utility that decouples the feedparser interaction from anything but a feed object
  • Added a new decorators library inspired by previous libraries
  • Added a reraise decorator that wraps exceptions and makes them Baleen exceptions
  • Added plenty of tests for various modules
  • Added a post wrangling method that decouples the post interaction and web fetch from anything but a post object
  • Created a better info command with more information about the app
  • Modified the ingest and run commands to be a bit more stable
  • Created a Job model for saving information about each ingestion run for application views
  • Did I say more tests?
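
To illustrate the reraise decorator listed above, here is a minimal sketch of the general pattern. The `BaleenError` class name, the decorator's parameters, and the `parse_count` example function are all hypothetical, and the sketch uses Python 3 exception chaining, which may differ from the actual implementation.

```python
import functools


class BaleenError(Exception):
    """Hypothetical base exception; the real class name is an assumption."""


def reraise(klass=BaleenError, message=None):
    """Re-raise any exception from the wrapped function as `klass`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except klass:
                # Already the target exception type: pass it through.
                raise
            except Exception as exc:
                # Wrap everything else, keeping the original as __cause__.
                raise klass(message or str(exc)) from exc
        return wrapper
    return decorator


@reraise(klass=BaleenError)
def parse_count(value):
    # Illustrative function only; int() raises ValueError on bad input.
    return int(value)
```

With this pattern, callers only need to catch one exception family, while the original error stays available for logging through `__cause__`.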

Hotfix 0.2.1

03 Mar 16:47

Hotfix for an error that caused unicode strings to kill the ingestion in a try/except block (as it was being written to the logger)! This error was so serious it needed to be fixed right away, even in the middle of Version 0.3 updates.

Deployed: Wednesday, March 2, 2016
Contributors: Benjamin Bengfort

Changes

  • Eliminated the traceback capture from the baleen console utility
  • Fixed the unicode decode exception in the error logging try/except block
  • Added some stability measures

Version 0.2

01 Mar 22:21

This update was a push to get Baleen running on EC2 on an hourly basis in preparation for PyCon. We updated all of Baleen's dependencies to their latest versions, added tests and other important fixtures, and organized the code a bit better. New functionality includes the ability to fetch the post webpage from the link, export the corpus to disk using the command line utility, and run in the background using the schedule library.

Deployed: Tuesday, March 1, 2016
Contributors: Benjamin Bengfort

Changes

  • Refactoring of the code to a more organized structure
  • Added some tests for safety on a number of modules
  • Updated all the dependencies from 2014
  • Added an export command to the CLI
  • Uses requests.py to fetch the full webpage from the link
  • Slightly better logging configuration
  • Use schedule to run every hour
  • Created an Upstart configuration to run in the background on Ubuntu

Version 0.1

19 Feb 01:38

This was the initial version of Baleen before the revamp occurred thanks to the PyCon tutorial. Baleen in this form was a command line utility that fetched RSS feeds on demand and stored them in a Mongo database. The input to Baleen is an OPML file that contains a listing of RSS feeds as well as their topics.

Baleen was originally used to produce a corpus for the District Data Labs NLP with NLTK course. The corpus was then adapted for use in the Statistics.com online course of the same name. The problem is that because Baleen had to be run manually, it was difficult to get a high-quality corpus on demand.

Release: Tuesday, September 23, 2014
Deployed: Thursday, February 18, 2016
Contributor: Benjamin Bengfort

Changes

  • CLI Program to import OPML files and kick off ingestion
  • OPML parser to read RSS feeds
  • Ingestion module to download and parse RSS using feedburner
  • Logging module for better information about ingestion
  • Mongo database integration