- Lots of cheap hardware
- HDFS
- Replication and Fault Tolerance
- YARN
- Distributed Computing
- MapReduce
- On GCP, Google Cloud Storage (GCS) is used instead of HDFS.
- HDFS would require paying for always-on Compute Engine VMs to host the cluster.
- Suited for batch processing.
- Data access has high throughput rather than low latency.
- Name Node
- 1 master node
- The master node also hosts the YARN Resource Manager
- Manages overall file system
- Stores
- The directory structure
- Metadata on the files
- Data Nodes
- Physically stores the data in the files.
- Break data into blocks of equal size
- Different length files are treated the same way
- Storage is simplified
- Unit for replication and fault tolerance
- Blocks are 128 MB by default
- Larger -> Reduces parallelism
- Smaller -> Increases overhead (more metadata)
- Stores the blocks across the data nodes
- Each node contains a partition or a split of data
- How do we know where the splits of a particular file are?
- The Name Node stores the mapping: file -> block -> data node (File 1 | Block 1 | Data Node)
- Can have multiple name nodes.
- Kept in sync with ZooKeeper
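
As a concrete illustration of blocks and the Name Node's block map, here is a minimal sketch using the HDFS Java API. The cluster URI `hdfs://namenode:8020`, the path `/data/demo.txt`, and the 128 MB / 3-replica settings are placeholders, not values from these notes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsBlocksDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    // Write a file with an explicit replication factor (3) and block size (128 MB).
    Path path = new Path("/data/demo.txt");
    long blockSize = 128L * 1024 * 1024;
    try (FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, blockSize)) {
      out.writeUTF("hello hdfs");
    }

    // Ask the Name Node where the blocks of this file live (file -> block -> data nodes).
    FileStatus status = fs.getFileStatus(path);
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset " + loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
    }
    fs.close();
  }
}
```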
- Maximize Redundancy
- 1st replica placed on the writer's node if it is a data node, otherwise on a random node
- 2nd has to be on a different rack (if possible)
- 3rd will be on same rack as the second, but on a different node.
- Reduces inter-rack traffic and improves write performance.
- Read operations are sent to the rack closest to the client.
- Minimize Write Bandwidth
- Data is forwarded from first data node to the next replica location.
- Forwarded further to the next replica location.
- Forwarding still consumes cluster bandwidth, which increases the cost of writes.
- The pipeline keeps this cost down: the client sends the data only once.
- Map: an operation performed in parallel on small portions of the dataset.
- Outputs key-value (KV) pairs.
- Reduce: combines the mapper outputs into one final output.
- SQL interface over MapReduce = Hive
- Data analysts understand SQL but not Java code.
- What {key, value} pairs should be emitted in the map step?
- How should values with the same key be combined?
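
A hedged sketch of the classic word-count job answers both questions above: the mapper emits `{word, 1}` pairs, and the reducer combines all values sharing a key into one count. Input and output paths come from the command line; everything else follows the standard `org.apache.hadoop.mapreduce` API.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: runs in parallel on splits of the input; emits {word, 1} pairs.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);            // emit {word, 1}
      }
    }
  }

  // Reduce: values with the same key (word) are combined into one count.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```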
- Coordinate tasks running on the cluster.
- Assign new nodes in case of failure.
- Resource Manager: runs on a single master node
- Schedules tasks across nodes
- Starts Application Master within containers.
- Node Managers: run on all other nodes
- Manages tasks on the individual node.
- Can have multiple containers.
- Application Master: can request containers for mappers and reducers.
- If additional resources are required, Application Master makes the request.
- 1 instance per application.
- The client communicates directly with the Application Master for status and progress updates, via an application-specific protocol.
- Container: every process runs within a container managed by a Node Manager.
- A package of resources (RAM, CPU, network, disk, etc.) on a single node.
- Executes the application code.
- Can communicate directly with its Application Master.
- Assign a process to the same node where the data to be processed lives.
- If CPU/Memory not available, WAIT!
- FIFO Scheduler
- Queue
- Capacity Scheduler
- Priority Queue
- Fair Scheduler
- Jobs assigned equal share of all resources
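
A small sketch, assuming a cluster running the Capacity Scheduler with a hypothetical `analytics` queue, of how a job is pointed at a queue; the scheduler itself is chosen cluster-wide in `yarn-site.xml`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmitDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With the Capacity Scheduler, jobs are submitted to named queues;
    // "analytics" is a hypothetical queue defined by the cluster admin in
    // capacity-scheduler.xml. The scheduler itself (FIFO / Capacity / Fair)
    // is selected cluster-wide via yarn.resourcemanager.scheduler.class.
    conf.set("mapreduce.job.queuename", "analytics");

    Job job = Job.getInstance(conf, "queued job");
    // ... configure mapper, reducer and input/output paths as in the
    // word-count sketch earlier, then submit with job.waitForCompletion(true);
    System.out.println("Submitting to queue: "
        + job.getConfiguration().get("mapreduce.job.queuename"));
  }
}
```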
- HBase: a database management system on top of Hadoop.
- Integrates with your application just like a traditional database.
- Advantages
- Sparse Tables
- No wastage of space when storing data.
- Dynamic Attributes
- Update attributes dynamically without changing storage structure.
- Do not need to change schema.
- Sparse Tables
- Column names repeat across rows.
- Normalization reduces data duplication => optimizes storage.
- But storage is cheap in a distributed file system.
- Optimize the number of disk seeks instead, by denormalizing.
- Don’t have to join tables.
- Read a single record to get all details about an employee in one read operation.
- No comparisons/sorting/inequality checks across multiple rows
- No joins
- No group by
- No order by
- No operations involving multiple tables
- No indexes on tables
- No constraints
- Updates to a single row are atomic
- All columns are updated, or none are.
- Updates to multiple rows are not atomic
- Even if update is on the same column in multiple rows.
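
A minimal sketch with the HBase Java client, assuming a hypothetical `employees` table with an `info` column family already exists and `hbase-site.xml` is on the classpath. It shows sparse/dynamic columns (cells are added per row, no schema change) and the single-row atomicity noted above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("employees"))) {

      // Columns are added per row: rows with different attributes simply store
      // different cells (sparse), and no schema change is needed for a new attribute.
      Put put = new Put(Bytes.toBytes("emp-1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("team"), Bytes.toBytes("storage"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("parking_spot"), Bytes.toBytes("B12"));
      table.put(put);   // all cells in this single-row Put are applied atomically

      // One read returns the whole (denormalized) record -- no joins needed.
      Result row = table.get(new Get(Bytes.toBytes("emp-1001")));
      System.out.println(Bytes.toString(
          row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}
```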
- Provides a SQL interface to Hadoop.
- Bridge to Hadoop for people without OOP exposure.
- Not suitable for very low latency apps due to HDFS.
- HiveQL ~= SQL
- Wrapper on top of MapReduce
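
A minimal sketch of querying Hive over JDBC (HiveServer2); the URL, credentials, and `customers` table are placeholders, and the Hive JDBC driver must be on the classpath. The HiveQL here is compiled into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port, database and the customers table are placeholders.
    String url = "jdbc:hive2://hiveserver:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "analyst", "");
         Statement stmt = conn.createStatement();
         // HiveQL looks like SQL but runs as MapReduce under the hood.
         ResultSet rs = stmt.executeQuery(
             "SELECT state, COUNT(*) AS cnt FROM customers GROUP BY state")) {
      while (rs.next()) {
        System.out.println(rs.getString("state") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```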
- HCatalog
- Bridge between HDFS and Hive
- Stores metadata for all tables in Hive
- Maps the files and directories in Hive to tables
- Holds the definitions and the schema for each table
- Any database with a JDBC driver can be used as a metastore.
- Development
- Use built-in Derby database
- Embedded metastore
- Only one session can connect.
- Production
- Local metastores
- Allow multiple sessions to connect to Hive
- DB is a separate process and can be on separate host.
- Remote metastores
- Separate processes for Hive and the metastore
- Metastore runs in its own JVM process.
- Processes communicate with Metastore using Thrift network API (hive.metastore.uris property)
- Does not require admin to share JDBC login info for the metastore db with each Hive user.
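
A hedged sketch of a client talking to a remote metastore over Thrift via `hive.metastore.uris`; the host name is a placeholder, 9083 is the conventional default metastore port, and the exact metastore client API varies across Hive versions.

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class MetastoreProbe {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // Point the client at a remote metastore over Thrift; no JDBC credentials
    // for the metastore database are needed on the client side.
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083");

    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      // List the tables the metastore knows about in the default database.
      for (String table : client.getAllTables("default")) {
        System.out.println(table);
      }
    } finally {
      client.close();
    }
  }
}
```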
- Hive vs. RDBMS
- Large vs. Small datasets
- Parallel vs. serial computations
- High vs. low latency
- Read vs. Read/write operations
- Not ACID compliant vs. ACID compliant
- HiveQL vs. SQL
- High latency
- Records not indexed.
- Fetching even a single row runs a MapReduce job, which may take minutes.
- Not owner of the data.
- It exists in HDFS
- Schema-on-read
- Not ACID compliant
- Data can be dumped into Hive tables from any source
- Row-level updates and deletes only as a special case
- Many more built in functions
- Only equi-joins allowed
- OLAP in Hive
- Partitioning
- e.g. partition by state: state-specific queries will only read the data in that state's directory.
- Splits NOT of the same size.
- Bucketing
- Size of each split should be the same.
- Hash of a column value
- Each bucket is a separate file
- Makes sampling and joining data more efficient
- Reduces search space
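
A sketch of a table that is both partitioned and bucketed, assuming a `Statement` obtained from a HiveServer2 JDBC connection as in the earlier Hive sketch; the `customers_part` table and its columns are hypothetical.

```java
import java.sql.SQLException;
import java.sql.Statement;

public class PartitionBucketDemo {
  // Assumes a Statement from a HiveServer2 JDBC connection (see the earlier Hive sketch).
  static void createPartitionedBucketedTable(Statement stmt) throws SQLException {
    // Partition by state: each state's rows land in their own HDFS directory,
    // so state-specific queries scan only that directory (partitions vary in size).
    // Bucketing hashes the customer id into 32 roughly equal files per partition.
    stmt.execute(
        "CREATE TABLE IF NOT EXISTS customers_part (id BIGINT, name STRING, revenue DOUBLE) " +
        "PARTITIONED BY (state STRING) " +
        "CLUSTERED BY (id) INTO 32 BUCKETS " +
        "STORED AS ORC");
  }
}
```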
- Join Optimizations
- Join operations are Map Reduce jobs under the hood
- Optimize joins by reducing the amount of data held in memory
- Reducing data held in memory
- On a join, one table is held in memory while the other is read from disk
- Hold smaller in memory
- Structuring Joins as Map-Only Operation
- Filter queries (only these rows)
- The mapper emits the joined rows with a null key, so no shuffle/reduce phase is needed.
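
A sketch of steering Hive toward a map-side join, again assuming a `Statement` from a HiveServer2 JDBC connection; the `employees` and `departments` tables are hypothetical, while `hive.auto.convert.join` and the `MAPJOIN` hint are standard Hive controls.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class MapJoinDemo {
  // Assumes a Statement from a HiveServer2 JDBC connection (see the earlier Hive sketch).
  static void runMapSideJoin(Statement stmt) throws SQLException {
    // Let Hive convert joins to map-only joins when one side fits in memory.
    stmt.execute("SET hive.auto.convert.join=true");

    // The MAPJOIN hint asks for the small departments table (alias d) to be held
    // in memory on each mapper while the larger employees table streams from disk.
    try (ResultSet rs = stmt.executeQuery(
        "SELECT /*+ MAPJOIN(d) */ e.name, d.dept_name " +
        "FROM employees e JOIN departments d ON e.dept_id = d.dept_id")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getString(2));
      }
    }
  }
}
```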
- Windowing in Hive
- A suite of functions which are syntactic sugar for complex queries.
- e.g. What revenue percentile did this supplier fall into this quarter?
- Window = 1 quarter
- Operation = Percentile on revenue
- Partitioning (here, by quarter) defines the groups over which the window function runs.
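
A sketch of the supplier-percentile example as a window function, assuming a `Statement` from a HiveServer2 JDBC connection and a hypothetical `supplier_revenue` table.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RevenuePercentileDemo {
  // Assumes a Statement from a HiveServer2 JDBC connection (see the earlier Hive sketch).
  static void printQuarterlyPercentiles(Statement stmt) throws SQLException {
    // Window = one quarter (PARTITION BY quarter); operation = percentile of revenue.
    try (ResultSet rs = stmt.executeQuery(
        "SELECT supplier_id, quarter, revenue, " +
        "       PERCENT_RANK() OVER (PARTITION BY quarter ORDER BY revenue) AS revenue_pctile " +
        "FROM supplier_revenue")) {
      while (rs.next()) {
        System.out.println(rs.getString("supplier_id") + "  " +
                           rs.getString("quarter") + "  " +
                           rs.getDouble("revenue_pctile"));
      }
    }
  }
}
```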
- ETL
- A data manipulation language
- Transforms unstructured data into a structured format
- Query this structured data using interfaces like Hive.
- Raw Data -> Pig -> Warehouse -> HiveQL -> Analytics
- Pig Latin
- A procedural, data flow language to extract, transform and load.
- Procedural
- Uses a series of well-defined steps to perform operations.
- No if statements or for loops.
- Specifies exactly how data is to be modified at each step.
- Data Flow
- Focused on transformations applied to the data.
- Written with a series of data operations in mind.
- Nodes in a DAG
- Data from one or more sources can be read, processed and stored in parallel.
- Cleans data, precomputes common aggregates before storing in a data warehouse.
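
A hedged sketch of an ETL-style Pig pipeline driven from Java through `PigServer` (the Pig Latin is the embedded strings); the input path, field layout, and output path are placeholders.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class LogCleaningPipeline {
  public static void main(String[] args) throws Exception {
    // Run Pig Latin statements on the cluster via MapReduce.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Each statement is one step of the data flow (a node in the DAG):
    // load raw logs, keep valid rows, aggregate per user, store the result.
    pig.registerQuery("raw = LOAD '/data/raw_logs' USING PigStorage(',') "
        + "AS (user:chararray, url:chararray, bytes:long);");
    pig.registerQuery("clean = FILTER raw BY user IS NOT NULL;");
    pig.registerQuery("by_user = GROUP clean BY user;");
    pig.registerQuery(
        "totals = FOREACH by_user GENERATE group AS user, SUM(clean.bytes) AS total_bytes;");

    // Storing an alias triggers execution of the whole pipeline.
    pig.store("totals", "/warehouse/user_totals");
    pig.shutdown();
  }
}
```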
- Pig on Hadoop
- Optimizes the operations before MapReduce jobs are run, to speed them up.
- Can also run on the Apache Tez or Spark execution engines for better performance.
- Oozie: a tool used to schedule workflows across all the Hadoop ecosystem technologies.