BigData Ecosystem Architecture

Internal working of Big Data and its ecosystem components, such as:

  • The background processes of resource allocation and database connection.
  • How data is distributed across the nodes.
  • The execution life-cycle when a job is submitted.

**Note:** Refer to the links mentioned below under each ecosystem for a detailed explanation.

1. HDFS 🐘

The various underlying processes that take place when a file is stored in HDFS, such as (a minimal API sketch follows the list below):

  • Type of scheduler

  • Block & Rack information

  • File size

  • File location

  • Replication information about the file (over-replicated blocks, under-replicated blocks, ...)

  • Health status of the file
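
As an illustration, here is a minimal sketch of inspecting these details programmatically through the Hadoop `FileSystem` API; the path `/user/data/sample.txt` is a hypothetical example, and the cluster address is taken from `core-site.xml`:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsFileInfo {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from core-site.xml on the classpath
    val fs = FileSystem.get(new Configuration())

    // Hypothetical file path, used only for illustration
    val path   = new Path("/user/data/sample.txt")
    val status = fs.getFileStatus(path)

    println(s"File size   : ${status.getLen} bytes")
    println(s"Block size  : ${status.getBlockSize} bytes")
    println(s"Replication : ${status.getReplication}")

    // Block locations show which datanodes hold each block of the file
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
      println(s"Block @ offset ${block.getOffset}: ${block.getHosts.mkString(", ")}")
    }

    fs.close()
  }
}
```

Health and replication status (including under- and over-replicated blocks) can also be checked from the command line with `hdfs fsck <path> -files -blocks -locations`.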

Please click the link below to learn about the execution and flow process.

🔗 HDFS Architecture in Depth

2. SQOOP :octocat:

Sqoop is used to perform two main operations:

  • Sqoop Import:

    • Ingests data from a source such as a traditional database into the Hadoop file system (HDFS).
  • Sqoop Export:

    • Exports data from the Hadoop file system (HDFS) to a traditional database.

Internally, a CodeGen tool is used to support the above two operations (a minimal sketch of typical invocations follows below).

  • Sqoop CodeGen:

    • Compiles table metadata and other related information into a Java class file and creates a JAR.
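
As an illustration, here is a minimal sketch of the three operations driven from Scala via `scala.sys.process`; the JDBC URL, table names, and HDFS paths are hypothetical, and the `sqoop` binary is assumed to be on the `PATH`:

```scala
import scala.sys.process._

object SqoopJobs {
  // Hypothetical connection details, used only for illustration
  val connect  = "jdbc:mysql://dbhost:3306/sales"
  val username = "etl_user"

  def main(args: Array[String]): Unit = {
    // Import: pull the `orders` table from the RDBMS into HDFS
    // (-P prompts interactively for the database password)
    Seq("sqoop", "import",
      "--connect", connect, "--username", username, "-P",
      "--table", "orders",
      "--target-dir", "/user/data/orders",
      "--num-mappers", "4").!

    // Export: push processed results from HDFS back to the RDBMS
    Seq("sqoop", "export",
      "--connect", connect, "--username", username, "-P",
      "--table", "orders_summary",
      "--export-dir", "/user/data/orders_summary").!

    // CodeGen: generate the Java class and JAR that Sqoop's map tasks
    // use to serialize and deserialize table rows
    Seq("sqoop", "codegen",
      "--connect", connect, "--username", username, "-P",
      "--table", "orders",
      "--outdir", "/tmp/sqoop-src",
      "--bindir", "/tmp/sqoop-bin").!
  }
}
```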

Please click the link below to learn about the execution and flow process.

🔗 SQOOP Architecture in Depth

3. HIVE 🐝

Hive has four main components (a minimal client sketch follows the list below):

  • Hadoop core components (HDFS, MapReduce)

  • Metastore

  • Driver

  • Hive Clients
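
As an illustration, here is a minimal sketch of a Hive client connecting over JDBC: the client submits HiveQL to the Driver via HiveServer2, the Driver compiles it against table metadata from the Metastore, and execution runs on the Hadoop core components. The host `hive-host`, user `hive_user`, and `orders` table are hypothetical:

```scala
import java.sql.DriverManager

object HiveClientSketch {
  def main(args: Array[String]): Unit = {
    // Register the HiveServer2 JDBC driver
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Hypothetical HiveServer2 endpoint (10000 is the default port)
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hive-host:10000/default", "hive_user", "")

    val stmt = conn.createStatement()
    val rs   = stmt.executeQuery("SELECT COUNT(*) FROM orders")
    while (rs.next()) println(s"Row count: ${rs.getLong(1)}")

    rs.close(); stmt.close(); conn.close()
  }
}
```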

Please click the link below to learn about the execution and flow process.

🔗 HIVE Architecture in Depth

4. SPARK 💥

The various phases involved before and during the execution of a Spark job (a minimal job sketch follows the list below):

  • Spark Context

    • It is the heart of a Spark application.
  • YARN Resource Manager, Application Master & launching of executors (containers).

  • Setting up environment variables, job resources.

  • CoarseGrainedExecutorBackend & Netty-based RPC.

  • SparkListeners.

    • LiveListenerBus
    • StatsReportListener
    • EventLoggingListener
  • Execution of a job

    • Logical Plan (Lineage)
    • Physical Plan (DAG)
  • Spark-WebUI.
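
To make these phases concrete, here is a minimal job sketch: transformations only build the lineage (logical plan), and the first action hands the resulting DAG (physical plan) to the scheduler, which ships tasks to the executors. The input path is a hypothetical example:

```scala
import org.apache.spark.sql.SparkSession

object SparkJobSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession wraps the SparkContext, the entry point that negotiates
    // resources with the YARN Resource Manager and launches executors
    val spark = SparkSession.builder()
      .appName("job-lifecycle-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations are lazy: they only extend the lineage graph
    val counts = sc.textFile("/user/data/input.txt") // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The action triggers the DAG scheduler: stages split at the shuffle
    // boundary introduced by reduceByKey, and tasks run on the executors
    counts.take(10).foreach(println)

    spark.stop()
  }
}
```

The running job, its stages, and the event log can then be inspected in the Spark Web UI.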

Please click the link below to learn about the execution and flow process.

🔗 SPARK Architecture in Depth

4.1 SPARK Abstraction Layers & Internal Optimization Techniques 💥

Spark exposes three abstraction layers, each with its own optimization machinery (compared in the sketch after this list):

  • RDD (Resilient Distributed Datasets)

    • Lineage Graph
    • DAG Scheduler
  • DataFrames

    • Catalyst Optimizer
    • Tungsten Engine
    • Default source or Base relation
  • Datasets

    • Optimized Tungsten Engine - V2
    • Whole Stage Code Generation
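
As an illustration, here is a minimal sketch contrasting the three layers on the same data; the `Order` case class and its values are hypothetical. Calling `explain()` prints the Catalyst-optimized physical plan, in which whole-stage code generation appears as starred (`*`) operators:

```scala
import org.apache.spark.sql.SparkSession

object AbstractionLayers {
  // A case class gives Datasets a compile-time schema (hypothetical example)
  case class Order(id: Int, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("abstraction-layers-sketch")
      .getOrCreate()
    import spark.implicits._

    // 1. RDD: optimized only through the lineage graph and DAG scheduler;
    //    Catalyst and Tungsten are not involved
    val rdd = spark.sparkContext.parallelize(Seq(Order(1, 10.0), Order(2, 99.5)))
    println(rdd.filter(_.amount > 50).count())

    // 2. DataFrame: rows plus schema; Catalyst optimizes the plan and the
    //    Tungsten engine handles memory layout and code generation
    val df = rdd.toDF()
    df.filter($"amount" > 50).explain()

    // 3. Dataset: the typed API on the same Catalyst/Tungsten pipeline,
    //    with whole-stage code generation fusing operators into one function
    val ds = df.as[Order]
    ds.filter(_.amount > 50).explain()

    spark.stop()
  }
}
```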

5. HBASE 🐋