Releases: factorhouse/factorhouse-local

Factor House Local v2.1

10 Sep 06:36
671a9f6

Factor House Local v2.1: Introducing Full-Stack Observability & Data Lineage

We are excited to announce the release of Factor House Local v2.1, a significant update that brings comprehensive observability and data lineage capabilities to our suite of pre-configured Docker Compose environments. This release introduces a brand new, out-of-the-box Observability Stack featuring Marquez (the reference implementation of OpenLineage) alongside the industry-standard Prometheus, Grafana, and Alertmanager.

This release goes beyond simply running data platforms: it lets you inspect both system health and data provenance.

Release Highlights

🚀 Introducing the Centralized Observability & Data Lineage Stack

The cornerstone of this release is a new, pre-configured stack designed to provide a unified view of your entire data ecosystem. It seamlessly integrates two critical aspects of modern data platform management: data lineage and systems monitoring.

  • Benefit: Holistic Visibility. This new stack provides a single pane of glass to understand both the "what" and the "how" of your data pipelines. By combining data lineage with performance metrics, you get a complete picture that accelerates development and simplifies debugging.
    • Automated Data Lineage: Powered by Marquez and OpenLineage, the stack automatically captures metadata from Flink and Spark jobs. This allows you to answer critical questions like "What downstream services will this change affect?" or "Where did this bad data originate?" by visually tracing data's journey.
    • Centralized System Monitoring: The industry-standard Prometheus, Grafana, and Alertmanager suite provides robust, real-time monitoring. You can track resource utilization, monitor application performance, visualize trends on pre-built dashboards, and set up proactive alerts to identify issues before they become critical.
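
To give a feel for the metadata that Marquez stores, here is a minimal sketch of an OpenLineage run event built by hand. In the stack itself the Flink and Spark integrations emit these events automatically, so this is purely illustrative. The namespace, job and dataset names, producer URI, and the Marquez URL (its default API port) are assumptions, not values taken from this release.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumption: Marquez listens on its default API port; adjust for your setup.
MARQUEZ_LINEAGE_URL = "http://localhost:5000/api/v1/lineage"


def build_run_event(event_type, run_id, job_name, inputs, outputs,
                    namespace="fh-local-demo"):
    """Build an OpenLineage RunEvent as a plain dict (hypothetical names)."""
    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},  # same runId ties START and COMPLETE together
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        "producer": "https://example.com/fh-local-demo",  # placeholder URI
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent",
    }


run_id = str(uuid.uuid4())
start = build_run_event("START", run_id, "orders_enrichment", ["orders_raw"], [])
complete = build_run_event("COMPLETE", run_id, "orders_enrichment",
                           ["orders_raw"], ["orders_enriched"])
payload = json.dumps(complete, indent=2)
print(payload)
```

Sending `payload` as an HTTP POST with `Content-Type: application/json` to the lineage endpoint should make the job and its input and output datasets appear in the Marquez lineage graph.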

Core Environments

This release includes the following updated and refined local development stacks:

  • Kafka Development & Monitoring with Kpow: A robust, 3-node Apache Kafka environment including Schema Registry, Kafka Connect, and the Kpow UI/API for enterprise-grade observability and management.
  • Unified Analytics Platform with Flex, Flink, Spark, Iceberg & HMS: A comprehensive lakehouse environment featuring Flink, Spark, Iceberg, Hive Metastore, PostgreSQL (CDC-ready), and MinIO (S3), managed with the Flex UI.
  • Apache Pinot Real-Time OLAP Cluster: A real-time distributed OLAP datastore designed for ultra-low-latency, user-facing analytics and dashboards.
  • NEW! Centralized Observability & Data Lineage: A complete monitoring and data provenance solution featuring Marquez, Prometheus, Grafana, and Alertmanager to provide a unified view of your entire data platform's health and data flows.

Factor House Local v2.0

09 Jul 01:11
bb3f2d7

Factor House Local v2.0: A Unified Platform with Enhanced Persistence

We are thrilled to announce a significant update to Factor House Local, our suite of pre-configured Docker Compose environments. This release features a major architectural enhancement, merging our previously separate streaming and batch analytics environments into a single, cohesive platform. We also add Apache Hive Metastore to provide a robust, centralized, and persistent catalog for the entire data ecosystem.

Release Highlights

🚀 Consolidation into a Unified Analytics Platform

We have merged the previously separate Flink/Flex and Spark/Analytics environments into a single Unified Analytics Platform. This architecture bridges the gap between real-time stream processing (Apache Flink) and large-scale batch processing (Apache Spark), allowing both engines to operate seamlessly on a shared Apache Iceberg data lakehouse.

  • Benefit: This consolidation provides a more powerful and streamlined development experience. By merging the two stacks, users gain:
    • Simplified Management: A single, integrated platform reduces the complexity and operational overhead of managing, configuring, and running separate environments.
    • A Single Source of Truth: The unified architecture eliminates data silos, allowing both streaming and batch jobs to work on the same data. This enables everything from low-latency event streaming to complex historical analysis on a consistent dataset.
    • Faster, More Realistic Prototyping: Developers can rapidly build and test end-to-end pipelines that more accurately model modern, production-grade data platforms.

🧠 Hive Metastore: The New Unified Catalog

The platform now utilizes Apache Hive Metastore as its central catalog. Backed by a durable PostgreSQL database, the Hive Metastore provides robust, persistent metadata management for the entire ecosystem.

  • Benefit: The Metastore serves as the persistent memory for the analytics platform. It goes far beyond cataloging tables by storing the analytical logic itself, including permanent SQL views that encapsulate complex logic and custom user-defined functions (UDFs) that extend native capabilities. This creates a truly stateful and collaborative environment where analytical assets defined in one session are immediately available to other Flink and Spark jobs, dramatically improving reusability and simplifying development.
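
As a rough illustration of how a Spark session might be pointed at a shared Hive Metastore for Iceberg tables, the sketch below assembles the relevant Spark properties. The catalog name, metastore URI, and warehouse path are hypothetical; the platform ships its own configuration, which may differ.

```python
def iceberg_hive_catalog_conf(catalog, metastore_uri, warehouse):
    """Return Spark properties for an Iceberg catalog backed by Hive Metastore."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",          # use Hive Metastore as the catalog backend
        f"{prefix}.uri": metastore_uri,    # Thrift endpoint of the metastore
        f"{prefix}.warehouse": warehouse,  # root location for table data
    }


# Hypothetical values for a local setup.
conf = iceberg_hive_catalog_conf(
    "lakehouse", "thrift://hive-metastore:9083", "s3a://warehouse/"
)
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The printed flags can be passed to `spark-submit` or `spark-sql`. Because the metadata lives in the metastore rather than in the session, tables created this way remain visible to later sessions that use the same catalog settings.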

💾 Enhanced Flink Reliability with Persistent State

The Flink configuration has been upgraded to ensure better resilience and easier management of streaming jobs. Flink checkpoints and savepoints are now configured to persist directly to MinIO (S3-compatible object storage).

  • Benefit: This ensures robust state management and reliable recovery for Flink applications. Jobs can recover cleanly from failures, and operational management of long-running streaming processes is significantly simplified.
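
The following sketch shows the kind of Flink settings that direct checkpoints and savepoints to an S3-compatible store such as MinIO. The configuration keys are standard Flink options; the bucket name, endpoint, and credentials are placeholders and do not reflect the values shipped with this release.

```python
def flink_state_config(bucket, endpoint, access_key, secret_key):
    """Render Flink configuration lines for S3-backed checkpoints and savepoints."""
    settings = {
        "state.checkpoints.dir": f"s3://{bucket}/checkpoints",
        "state.savepoints.dir": f"s3://{bucket}/savepoints",
        "s3.endpoint": endpoint,
        "s3.path.style.access": "true",  # MinIO is typically addressed path-style
        "s3.access-key": access_key,
        "s3.secret-key": secret_key,
    }
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


# Placeholder values; substitute those of your own environment.
config_text = flink_state_config(
    "flink-state", "http://minio:9000", "<access-key>", "<secret-key>"
)
print(config_text)
```

With settings like these in place, a job restarted after a failure can restore its state from the latest completed checkpoint in the bucket instead of starting from scratch.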

🌊 CDC-Ready Transactional Hub

The PostgreSQL instance serves a dual role: backing the Hive Metastore and acting as a transactional database ready for Change Data Capture (CDC). It is pre-configured with wal_level=logical, enabling real-time streaming of database changes directly into the lakehouse.

  • Benefit: Users can easily prototype near-real-time synchronization between operational databases and the analytics lakehouse.
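
To make the role of `wal_level=logical` concrete, this sketch generates the PostgreSQL statements one would typically run to verify the setting and prepare a table for logical decoding. The table, publication, and slot names are hypothetical, and a CDC connector may create the publication and slot on its own.

```python
import re

# Conservative identifier check so names can be interpolated into SQL safely.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def _check(name):
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def logical_replication_sql(table, publication, slot):
    """Return SQL statements that prepare a table for logical replication."""
    return [
        # Should report 'logical' on an instance configured for CDC.
        "SHOW wal_level;",
        # Publish changes for the chosen table.
        f"CREATE PUBLICATION {_check(publication)} FOR TABLE {_check(table)};",
        # Create a slot using PostgreSQL's built-in pgoutput plugin.
        f"SELECT pg_create_logical_replication_slot('{_check(slot)}', 'pgoutput');",
    ]


statements = logical_replication_sql("orders", "orders_pub", "orders_slot")
for statement in statements:
    print(statement)
```

The statements can be run with `psql` or any PostgreSQL client. Logical decoding only works when `wal_level` is `logical`, which is why the pre-configured setting matters.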

Core Environments

This release includes the following updated and refined local development stacks:

  • Kafka Development & Monitoring with Kpow: A robust, 3-node Apache Kafka environment including Schema Registry, Kafka Connect, and the Kpow UI/API for enterprise-grade observability and management.
  • Unified Analytics Platform with Flex, Flink, Spark, Iceberg & HMS: A comprehensive lakehouse environment featuring Flink, Spark, Iceberg, Hive Metastore, PostgreSQL (CDC-ready), and MinIO (S3), managed with the Flex UI.
  • Apache Pinot Real-Time OLAP Cluster: A real-time distributed OLAP datastore designed for ultra-low-latency, user-facing analytics and dashboards.

Factor House Local v1.0

03 Jul 07:20

Factor House Local v1.0: A Suite of Modern Data Platforms

We are excited to announce the latest updates to Factor House Local, a collection of pre-configured Docker Compose environments designed to showcase modern data platform setups. This release enhances our local development stacks, providing robust, ready-to-use environments for a variety of data-intensive applications.

Highlights

  • Comprehensive Kafka Development Stack with Kpow
    This environment provides a complete Apache Kafka ecosystem for development and monitoring. It is built with a high-availability 3-node Kafka cluster, ZooKeeper for coordination, Confluent Schema Registry for data governance, and Kafka Connect for data integration. The stack is enhanced with Kpow, an enterprise-grade UI for comprehensive monitoring, data inspection, and management of your Kafka resources, making it ideal for developing and testing event-driven architectures and microservices.

  • Real-Time Stream Analytics with Flink & Flex
    Centered on Apache Flink, this stack delivers a high-performance solution for streaming analytics. It is tailored for low-latency processing, complex event enrichment, and SQL-driven operations. The environment includes a Flink JobManager, multiple TaskManagers, and a SQL Gateway for interactive queries. It is managed by Flex, an enterprise-grade tool for Flink that provides robust RBAC, a data-oriented UI, and simplified management, perfect for operational intelligence, advanced fraud detection, and real-time metric pipelines.

  • Modern Analytics & Lakehouse with Spark and Iceberg
    This stack provides a self-contained environment for building and querying data lakehouses. It combines Apache Spark with Apache Iceberg, an open table format, for transactional data management. Data is stored in MinIO, an S3-compatible object storage layer, and a PostgreSQL database is included for relational data workloads. This architecture is ideal for batch ETL/ELT pipelines, interactive data exploration via the included Jupyter Notebook server, and building reliable, scalable analytics pipelines with ACID transactions, schema evolution, and time-travel capabilities.

  • Apache Pinot Real-Time OLAP Cluster
    Deploy a distributed Apache Pinot cluster, a real-time OLAP (Online Analytical Processing) datastore designed for ultra-low-latency analytics at scale. The stack includes the core Pinot components (Controller, Broker, and Server), providing the foundation to ingest data from streaming sources like Kafka and make it available for analytical queries with millisecond response times. It is optimized for user-facing analytics, real-time dashboards, and anomaly detection.

  • New Custom Flink Docker Image
    We are introducing factorhouse/flink, a custom, multi-architecture (amd64, arm64) Docker image based on the Apache Flink LTS release. This image is optimized for running Flink SQL and PyFlink jobs and comes with out-of-the-box support for S3, Hadoop, Apache Iceberg, and Parquet. It features a unique custom dependency loading mechanism that simplifies using the Flink SQL Client and Gateway by automatically adding pre-packaged JARs to the classpath, significantly streamlining the development of complex data pipelines.
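
For the Pinot cluster described above, analytical queries go through the Broker. The sketch below builds, without sending, an HTTP request for the Broker's SQL query endpoint. The broker address (Pinot's default broker port) and the `orders` table are assumptions for illustration only.

```python
import json
import urllib.request


def pinot_query_request(sql, broker_url="http://localhost:8099"):
    """Build a POST request for the Pinot broker's SQL query endpoint."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{broker_url}/query/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical table name; replace with a table that exists in your cluster.
request = pinot_query_request("SELECT COUNT(*) FROM orders")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` against a running cluster should return a JSON document whose `resultTable` field holds the column schema and rows.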