diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 0000000000000..ca8e58920596d
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,280 @@
+# Apache Spark Architecture
+
+This document provides an overview of the Apache Spark architecture and its key components.
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Core Components](#core-components)
+- [Execution Model](#execution-model)
+- [Key Subsystems](#key-subsystems)
+- [Data Flow](#data-flow)
+- [Module Structure](#module-structure)
+
+## Overview
+
+Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis.
+
+### Design Principles
+
+1. **Unified Engine**: Single system for batch processing, streaming, machine learning, and graph processing
+2. **In-Memory Computing**: Leverages RAM for fast iterative algorithms and interactive queries
+3. **Lazy Evaluation**: Operations are not executed until an action is called
+4. **Fault Tolerance**: Resilient Distributed Datasets (RDDs) provide automatic fault recovery
+5. **Scalability**: Scales from a single machine to thousands of nodes
+
+## Core Components
+
+### 1. Spark Core
+
+The foundation of the Spark platform, providing:
+
+- **Task scheduling and dispatch**
+- **Memory management**
+- **Fault recovery**
+- **Interaction with storage systems**
+- **RDD API** - The fundamental data abstraction
+
+Location: `core/` directory
+
+Key classes:
+- `SparkContext`: Main entry point for Spark functionality
+- `RDD`: Resilient Distributed Dataset, the fundamental data structure
+- `DAGScheduler`: Schedules stages based on DAG of operations
+- `TaskScheduler`: Launches tasks on executors
+
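+To make these roles concrete, here is a minimal, illustrative sketch (local master assumed); the `DAGScheduler` and `TaskScheduler` do their work behind the scenes when the action runs:
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+// Local-mode sketch; in a real deployment the master URL points at a cluster manager.
+val spark = SparkSession.builder().appName("core-sketch").master("local[*]").getOrCreate()
+val sc = spark.sparkContext              // SparkContext: the Core entry point
+
+val rdd = sc.parallelize(1 to 100)       // RDD: the fundamental data abstraction
+val evenSum = rdd.filter(_ % 2 == 0).map(_.toLong).reduce(_ + _)
+// reduce() is an action: the DAGScheduler builds stages and the TaskScheduler
+// launches one task per partition on the executors.
+println(evenSum)
+
+spark.stop()
+```
+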
+### 2. Spark SQL
+
+Module for structured data processing with:
+
+- **DataFrame and Dataset APIs**
+- **SQL query engine**
+- **Data source connectors** (Parquet, JSON, JDBC, etc.)
+- **Catalyst optimizer** for query optimization
+
+Location: `sql/` directory
+
+Key components:
+- Query parsing and analysis
+- Logical and physical query planning
+- Code generation for efficient execution
+- Catalog management
+
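+As a hedged illustration of these pieces working together, the following sketch runs the same query through the DataFrame API and SQL; both front ends go through the Catalyst optimizer:
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()
+import spark.implicits._
+
+// A tiny in-memory DataFrame; real jobs would use a data source connector instead.
+val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
+
+// DataFrame API and SQL are two front ends over the same Catalyst-optimized plans.
+val viaApi = people.filter($"age" > 40).select("name")
+
+people.createOrReplaceTempView("people")
+val viaSql = spark.sql("SELECT name FROM people WHERE age > 40")
+
+viaApi.show()
+viaSql.show()
+```
+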
+### 3. Spark Streaming
+
+Framework for scalable, high-throughput, fault-tolerant stream processing:
+
+- **DStreams** (Discretized Streams) - Legacy API
+- **Structured Streaming** - Modern streaming API built on Spark SQL (implemented under `sql/`); see the sketch after this list
+
+Location: `streaming/` directory (DStreams)
+
+Key features:
+- Micro-batch processing model
+- Exactly-once semantics (in Structured Streaming)
+- Integration with Kafka, Kinesis, and other sources
+
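+A minimal Structured Streaming sketch using the built-in `rate` source (illustrative only; a production job would typically read from Kafka or files):
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder().appName("streaming-sketch").master("local[*]").getOrCreate()
+
+// The built-in "rate" source generates rows continuously, which is handy for experiments.
+val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
+
+// Each micro-batch is appended to the console until the query is stopped.
+val query = stream.writeStream.format("console").outputMode("append").start()
+query.awaitTermination()
+```
+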
+### 4. MLlib (Machine Learning Library)
+
+Scalable machine learning library providing:
+
+- **Classification and regression**
+- **Clustering**
+- **Collaborative filtering**
+- **Dimensionality reduction**
+- **Feature extraction and transformation**
+- **ML Pipelines** for building workflows
+
+Location: `mllib/` and `mllib-local/` directories
+
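+A short ML Pipeline sketch (the `training` DataFrame with `text` and `label` columns is an assumed input):
+
+```scala
+import org.apache.spark.ml.Pipeline
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
+
+// Tokenize text, hash tokens into feature vectors, then fit a classifier.
+val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
+val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
+val lr = new LogisticRegression().setMaxIter(10)
+
+val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
+val model = pipeline.fit(training)       // `training` is the assumed input DataFrame
+model.transform(training).select("label", "prediction").show()
+```
+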
+### 5. GraphX
+
+Graph processing framework with:
+
+- **Graph abstraction** built on top of RDDs
+- **Graph algorithms** (PageRank, connected components, triangle counting, etc.)
+- **Pregel-like API** for iterative graph computations
+
+Location: `graphx/` directory
+
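+A toy GraphX sketch (assumes an existing `SparkContext` named `sc`; the graph data is invented for illustration):
+
+```scala
+import org.apache.spark.graphx.{Edge, Graph}
+
+// A toy property graph: vertex attributes are names, edge attributes are weights.
+val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
+val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
+
+val graph = Graph(vertices, edges)
+val ranks = graph.pageRank(0.001).vertices   // approximate PageRank for each vertex
+ranks.collect().foreach(println)
+```
+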
+## Execution Model
+
+### Spark Application Lifecycle
+
+1. **Initialization**: User creates a `SparkContext` or `SparkSession`
+2. **Job Submission**: Actions trigger job submission to the DAG scheduler
+3. **Stage Creation**: DAG scheduler breaks jobs into stages based on shuffle boundaries
+4. **Task Scheduling**: Task scheduler assigns tasks to executors
+5. **Execution**: Executors run tasks and return results
+6. **Result Collection**: Results are collected back to the driver or written to storage
+
+### Driver and Executors
+
+- **Driver Program**: Runs the main() function and creates SparkContext
+ - Converts user program into tasks
+ - Schedules tasks on executors
+ - Maintains metadata about the application
+
+- **Executors**: Processes that run on worker nodes
+ - Run tasks assigned by the driver
+ - Store data in memory or disk
+ - Return results to the driver
+
+### Cluster Managers
+
+Spark supports multiple cluster managers:
+
+- **Standalone**: Built-in cluster manager
+- **Apache YARN**: Hadoop's resource manager
+- **Apache Mesos**: General-purpose cluster manager
+- **Kubernetes**: Container orchestration platform
+
+Location: `resource-managers/` directory
+
+## Key Subsystems
+
+### Memory Management
+
+Spark manages memory in several regions:
+
+1. **Execution Memory**: For shuffles, joins, sorts, and aggregations
+2. **Storage Memory**: For caching and broadcasting data
+3. **User Memory**: For user data structures and metadata
+4. **Reserved Memory**: System reserved memory
+
+Configuration: Unified memory management allows dynamic allocation between execution and storage.
+
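+The boundaries between these regions are configurable; a hedged sketch of the most common settings (the values shown are defaults or examples, not tuning advice):
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder()
+  .appName("memory-sketch")
+  // Fraction of (heap - reserved memory) shared by execution and storage.
+  .config("spark.memory.fraction", "0.6")
+  // Portion of the unified region that storage is never evicted below.
+  .config("spark.memory.storageFraction", "0.5")
+  // Optional off-heap region used by Tungsten and off-heap caching.
+  .config("spark.memory.offHeap.enabled", "true")
+  .config("spark.memory.offHeap.size", "1g")
+  .getOrCreate()
+```
+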
+### Shuffle Subsystem
+
+Handles data redistribution across partitions:
+
+- **Shuffle Write**: Map tasks write data to local disk
+- **Shuffle Read**: Reduce tasks fetch data from map outputs
+- **Shuffle Service**: External shuffle service for improved reliability
+
+Location: `core/src/main/scala/org/apache/spark/shuffle/`
+
+### Storage Subsystem
+
+Manages cached data and intermediate results:
+
+- **Block Manager**: Manages storage of data blocks
+- **Memory Store**: In-memory cache
+- **Disk Store**: Disk-based storage
+- **Off-Heap Storage**: Direct memory storage
+
+Location: `core/src/main/scala/org/apache/spark/storage/`
+
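+These stores back the RDD/Dataset persistence levels; a brief sketch (assumes an existing `SparkContext` named `sc`):
+
+```scala
+import org.apache.spark.storage.StorageLevel
+
+// An RDD's storage level can only be set once.
+val rdd = sc.parallelize(1 to 1000000)
+
+// Alternatives include StorageLevel.MEMORY_ONLY (MemoryStore only, recompute on eviction)
+// and StorageLevel.OFF_HEAP (direct memory, requires spark.memory.offHeap.* to be set).
+rdd.persist(StorageLevel.MEMORY_AND_DISK)   // spill blocks to the DiskStore when memory fills
+rdd.count()                                  // the first action materializes the cached blocks
+rdd.unpersist()                              // drop the blocks from the BlockManager
+```
+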
+### Serialization
+
+Efficient serialization is critical for performance:
+
+- **Java Serialization**: Default, but slower
+- **Kryo Serialization**: Faster and more compact (recommended)
+- **Custom Serializers**: For specific data types
+
+Location: `core/src/main/scala/org/apache/spark/serializer/`
+
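+Switching to Kryo and registering application classes is done on the `SparkConf`; a sketch in which `MyRecord` is a placeholder class:
+
+```scala
+import org.apache.spark.SparkConf
+
+// MyRecord stands in for an application class that is shuffled, cached, or broadcast.
+case class MyRecord(id: Long, name: String)
+
+val conf = new SparkConf()
+  .setAppName("kryo-sketch")
+  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+  // Registration avoids writing full class names into every serialized record.
+  .registerKryoClasses(Array(classOf[MyRecord]))
+```
+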
+## Data Flow
+
+### Transformation and Action Pipeline
+
+1. **Transformations**: Lazy operations that define a new RDD/DataFrame
+ - Examples: `map`, `filter`, `join`, `groupBy`
+ - Build up a DAG of operations
+
+2. **Actions**: Operations that trigger computation
+ - Examples: `count`, `collect`, `save`, `reduce`
+ - Cause DAG execution
+
+3. **Stages**: Groups of tasks that can be executed together
+ - Separated by shuffle operations
+ - Pipeline operations within a stage
+
+4. **Tasks**: Unit of work sent to executors
+ - One task per partition
+ - Execute transformations and return results
+
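+Putting the four concepts above together, a hedged word-count sketch (assumes a `SparkContext` named `sc`; the input path is illustrative):
+
+```scala
+// Transformations are lazy: nothing runs while the DAG is being described.
+val words = sc.textFile("hdfs:///data/input.txt")   // illustrative input path
+  .flatMap(_.split("\\s+"))
+  .map(word => (word, 1))
+
+// reduceByKey introduces a shuffle, so the job is split into two stages.
+val counts = words.reduceByKey(_ + _)
+
+// collect() is an action: it triggers execution, one task per partition in each stage.
+val result = counts.collect()
+```
+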
+## Module Structure
+
+### Project Organization
+
+```
+spark/
+├── assembly/ # Builds the final Spark assembly JAR
+├── bin/ # User-facing command-line scripts
+├── build/ # Build-related scripts
+├── common/ # Common utilities shared across modules
+├── conf/ # Configuration file templates
+├── connector/ # External data source connectors
+├── core/ # Spark Core engine
+├── data/ # Sample data for examples
+├── dev/ # Development scripts and tools
+├── docs/ # Documentation source files
+├── examples/ # Example programs
+├── graphx/ # Graph processing library
+├── hadoop-cloud/ # Cloud storage integration
+├── launcher/ # Application launcher
+├── mllib/ # Machine learning library (RDD-based)
+├── mllib-local/ # Local ML algorithms
+├── python/ # PySpark - Python API
+├── R/ # SparkR - R API
+├── repl/ # Interactive Scala shell
+├── resource-managers/ # Cluster manager integrations
+├── sbin/ # Admin scripts for cluster management
+├── sql/ # Spark SQL and DataFrames
+├── streaming/ # Streaming processing
+└── tools/ # Various utility tools
+```
+
+### Module Dependencies
+
+- **Core**: Foundation for all other modules
+- **SQL**: Depends on Core, used by Streaming, MLlib
+- **Streaming**: Depends on Core and SQL
+- **MLlib**: Depends on Core and SQL
+- **GraphX**: Depends on Core
+- **Python/R**: Language bindings to Core APIs
+
+## Building and Testing
+
+For detailed build instructions, see [building-spark.md](docs/building-spark.md).
+
+Quick start:
+```bash
+# Build Spark
+./build/mvn -DskipTests clean package
+
+# Run tests
+./dev/run-tests
+
+# Run specific module tests
+./build/mvn test -pl core
+```
+
+## Performance Tuning
+
+Key areas for optimization:
+
+1. **Memory Configuration**: Adjust executor memory and memory fractions
+2. **Parallelism**: Set appropriate partition counts
+3. **Serialization**: Use Kryo for better performance
+4. **Caching**: Cache frequently accessed data
+5. **Broadcast Variables**: Efficiently distribute large read-only data
+6. **Data Locality**: Ensure tasks run close to their data
+
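+As a hedged illustration of items 4 and 5 above, caching and broadcast variables look like this in practice (paths and data are placeholders, and `sc` is an existing `SparkContext`):
+
+```scala
+// 4. Caching: keep a frequently reused dataset in memory across multiple actions.
+val errors = sc.textFile("hdfs:///data/logs")        // illustrative input path
+  .filter(_.contains("ERROR"))
+  .cache()
+println(errors.count())                               // first action populates the cache
+println(errors.filter(_.contains("timeout")).count()) // served from cached blocks
+
+// 5. Broadcast variables: ship a read-only lookup table to every executor once.
+val codes = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))
+val resolved = errors.map(line => codes.value.getOrElse(line.take(2), "unknown"))
+```
+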
+See [tuning.md](docs/tuning.md) for detailed tuning guidelines.
+
+## Contributing
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) and the [contributing guide](https://spark.apache.org/contributing.html) for information on how to contribute to Apache Spark.
+
+## Further Reading
+
+- [Programming Guide](docs/programming-guide.md)
+- [SQL Programming Guide](docs/sql-programming-guide.md)
+- [Structured Streaming Guide](docs/structured-streaming-programming-guide.md)
+- [MLlib Guide](docs/ml-guide.md)
+- [GraphX Guide](docs/graphx-programming-guide.md)
+- [Cluster Overview](docs/cluster-overview.md)
+- [Configuration](docs/configuration.md)
diff --git a/CODE_DOCUMENTATION_GUIDE.md b/CODE_DOCUMENTATION_GUIDE.md
new file mode 100644
index 0000000000000..1229bc447140b
--- /dev/null
+++ b/CODE_DOCUMENTATION_GUIDE.md
@@ -0,0 +1,612 @@
+# Code Documentation Guide
+
+This guide describes documentation standards for Apache Spark source code.
+
+## Overview
+
+Good documentation helps developers understand and maintain code. Spark follows industry-standard documentation practices for each language it supports.
+
+## Scala Documentation (Scaladoc)
+
+Scala code uses Scaladoc for API documentation.
+
+### Basic Format
+
+```scala
+/**
+ * Brief one-line description.
+ *
+ * Detailed description that can span multiple lines.
+ * Explain what this class/method does, important behavior,
+ * and any constraints or assumptions.
+ *
+ * @param paramName description of parameter
+ * @param anotherParam description of another parameter
+ * @return description of return value
+ * @throws ExceptionType when this exception is thrown
+ * @since 3.5.0
+ * @note Important note about usage or behavior
+ */
+def methodName(paramName: String, anotherParam: Int): ReturnType = {
+ // Implementation
+}
+```
+
+### Class Documentation
+
+```scala
+/**
+ * Brief description of the class purpose.
+ *
+ * Detailed explanation of the class functionality, usage patterns,
+ * and important considerations.
+ *
+ * Example usage:
+ * {{{
+ * val example = new MyClass(param1, param2)
+ * example.doSomething()
+ * }}}
+ *
+ * @constructor Creates a new instance with the given parameters
+ * @param config Configuration object
+ * @param isLocal Whether running in local mode
+ * @since 3.0.0
+ */
+class MyClass(config: SparkConf, isLocal: Boolean) extends Logging {
+ // Class implementation
+}
+```
+
+### Code Examples
+
+Use triple braces for code examples:
+
+```scala
+/**
+ * Transforms the RDD by applying a function to each element.
+ *
+ * Example:
+ * {{{
+ * val rdd = sc.parallelize(1 to 10)
+ * val doubled = rdd.map(_ * 2)
+ * doubled.collect() // Array(2, 4, 6, ..., 20)
+ * }}}
+ *
+ * @param f function to apply to each element
+ * @return transformed RDD
+ */
+def map[U: ClassTag](f: T => U): RDD[U]
+```
+
+### Annotations
+
+Use Spark annotations for API stability:
+
+```scala
+/**
+ * :: Experimental ::
+ * This feature is experimental and may change in future releases.
+ */
+@Experimental
+class ExperimentalFeature
+
+/**
+ * :: DeveloperApi ::
+ * This is a developer API and may change between minor versions.
+ */
+@DeveloperApi
+class DeveloperFeature
+
+/**
+ * :: Unstable ::
+ * This API is unstable and may change in patch releases.
+ */
+@Unstable
+class UnstableFeature
+```
+
+### Internal APIs
+
+Mark internal classes and methods:
+
+```scala
+/**
+ * Internal utility class for XYZ.
+ *
+ * @note This is an internal API and may change without notice.
+ */
+private[spark] class InternalUtil
+
+/**
+ * Internal method used by scheduler.
+ */
+private[scheduler] def internalMethod(): Unit
+```
+
+## Java Documentation (Javadoc)
+
+Java code uses Javadoc for API documentation.
+
+### Basic Format
+
+```java
+/**
+ * Brief one-line description.
+ *
+ * Detailed description that can span multiple paragraphs.
+ * Explain what this class/method does and important behavior.
+ *
+ * @param paramName description of parameter
+ * @param anotherParam description of another parameter
+ * @return description of return value
+ * @throws ExceptionType when this exception is thrown
+ * @since 3.5.0
+ */
+public ReturnType methodName(String paramName, int anotherParam)
+ throws ExceptionType {
+ // Implementation
+}
+```
+
+### Class Documentation
+
+```java
+/**
+ * Brief description of the class purpose.
+ *
+ * Detailed explanation of functionality, usage patterns,
+ * and important considerations.
+ *
+ * <p>Example usage:
+ * <pre>{@code
+ * MyClass example = new MyClass(param1, param2);
+ * example.doSomething();
+ * }</pre>
+ *
+ * @param <T> type parameter description
+ * @since 3.0.0
+ */
+public class MyClass<T> implements Serializable {
+ // Class implementation
+}
+```
+
+### Interface Documentation
+
+```java
+/**
+ * Interface for shuffle block resolution.
+ *
+ * Implementations of this interface are responsible for
+ * resolving shuffle block locations and reading shuffle data.
+ *
+ * @since 2.3.0
+ */
+public interface ShuffleBlockResolver {
+ /**
+ * Gets the data for a shuffle block.
+ *
+ * @param blockId the block identifier
+ * @return managed buffer containing the block data
+ */
+ ManagedBuffer getBlockData(BlockId blockId);
+}
+```
+
+## Python Documentation (Docstrings)
+
+Python code uses docstrings following PEP 257 and Google style.
+
+### Function Documentation
+
+```python
+def function_name(param1: str, param2: int) -> bool:
+ """
+ Brief one-line description.
+
+ Detailed description that can span multiple lines.
+ Explain what this function does, important behavior,
+ and any constraints.
+
+ Parameters
+ ----------
+ param1 : str
+ Description of param1
+ param2 : int
+ Description of param2
+
+ Returns
+ -------
+ bool
+ Description of return value
+
+ Raises
+ ------
+ ValueError
+ When input is invalid
+
+ Examples
+ --------
+ >>> result = function_name("test", 42)
+ >>> print(result)
+ True
+
+ Notes
+ -----
+ Important notes about usage or behavior.
+
+ .. versionadded:: 3.5.0
+ """
+ # Implementation
+ pass
+```
+
+### Class Documentation
+
+```python
+class MyClass:
+ """
+ Brief description of the class.
+
+ Detailed explanation of the class functionality,
+ usage patterns, and important considerations.
+
+ Parameters
+ ----------
+ config : dict
+ Configuration dictionary
+ is_local : bool, optional
+ Whether running in local mode (default is False)
+
+ Attributes
+ ----------
+ config : dict
+ Stored configuration
+ state : str
+ Current state of the object
+
+ Examples
+ --------
+ >>> obj = MyClass({'key': 'value'}, is_local=True)
+ >>> obj.do_something()
+
+ Notes
+ -----
+ This class is thread-safe.
+
+ .. versionadded:: 3.0.0
+ """
+
+ def __init__(self, config: dict, is_local: bool = False):
+ self.config = config
+ self.is_local = is_local
+ self.state = "initialized"
+```
+
+### Type Hints
+
+Use type hints consistently:
+
+```python
+from typing import List, Optional, Dict, Any, Union
+from pyspark.sql import DataFrame
+
+def process_data(
+ df: DataFrame,
+ columns: List[str],
+ options: Optional[Dict[str, Any]] = None
+) -> Union[DataFrame, None]:
+ """
+ Process DataFrame with specified columns.
+
+ Parameters
+ ----------
+ df : DataFrame
+ Input DataFrame to process
+ columns : list of str
+ Column names to include
+ options : dict, optional
+ Processing options
+
+ Returns
+ -------
+ DataFrame or None
+ Processed DataFrame, or None if processing fails
+ """
+ pass
+```
+
+## R Documentation (Roxygen2)
+
+R code uses Roxygen2-style documentation.
+
+### Function Documentation
+
+```r
+#' Brief one-line description
+#'
+#' Detailed description that can span multiple lines.
+#' Explain what this function does and important behavior.
+#'
+#' @param param1 description of param1
+#' @param param2 description of param2
+#' @return description of return value
+#' @examples
+#' \dontrun{
+#' result <- myFunction(param1 = "test", param2 = 42)
+#' print(result)
+#' }
+#' @note Important note about usage
+#' @rdname function-name
+#' @since 3.0.0
+#' @export
+myFunction <- function(param1, param2) {
+ # Implementation
+}
+```
+
+### Class Documentation
+
+```r
+#' MyClass: A class for doing XYZ
+#'
+#' Detailed description of the class functionality
+#' and usage patterns.
+#'
+#' @slot field1 description of field1
+#' @slot field2 description of field2
+#' @export
+#' @since 3.0.0
+setClass("MyClass",
+ slots = c(
+ field1 = "character",
+ field2 = "numeric"
+ )
+)
+```
+
+## Documentation Best Practices
+
+### 1. Write Clear, Concise Descriptions
+
+**Good:**
+```scala
+/**
+ * Computes the mean of values in the RDD.
+ *
+ * @return the arithmetic mean, or NaN if the RDD is empty
+ */
+def mean(): Double
+```
+
+**Bad:**
+```scala
+/**
+ * This method calculates and returns the mean.
+ */
+def mean(): Double
+```
+
+### 2. Document Edge Cases
+
+```scala
+/**
+ * Divides two integers.
+ *
+ * @param a numerator
+ * @param b denominator
+ * @return result of the integer division a / b, truncated toward zero
+ * @throws ArithmeticException if b is zero
+ * @note For Double arguments this would not throw: dividing a positive value
+ *       by 0.0 yields Double.PositiveInfinity instead.
+ */
+def divide(a: Int, b: Int): Int
+```
+
+### 3. Provide Examples
+
+Always include examples for public APIs:
+
+```scala
+/**
+ * Filters elements using the given predicate.
+ *
+ * Example:
+ * {{{
+ * val rdd = sc.parallelize(1 to 10)
+ * val evens = rdd.filter(_ % 2 == 0)
+ * evens.collect() // Array(2, 4, 6, 8, 10)
+ * }}}
+ */
+def filter(f: T => Boolean): RDD[T]
+```
+
+### 4. Document Thread Safety
+
+```scala
+/**
+ * Thread-safe cache implementation.
+ *
+ * @note This class uses internal synchronization and is safe
+ * for concurrent access from multiple threads.
+ */
+class ConcurrentCache[K, V] extends Cache[K, V]
+```
+
+### 5. Document Performance Characteristics
+
+```scala
+/**
+ * Sorts the RDD by key.
+ *
+ * @note This operation triggers a shuffle and is expensive.
+ * The time complexity is O(n log n) where n is the
+ * number of elements.
+ */
+def sortByKey(): RDD[(K, V)]
+```
+
+### 6. Link to Related APIs
+
+```scala
+/**
+ * Maps elements to key-value pairs.
+ *
+ * @see [[groupByKey]] for grouping by keys
+ * @see [[reduceByKey]] for aggregating by keys
+ */
+def keyBy[K](f: T => K): RDD[(K, T)]
+```
+
+### 7. Version Information
+
+```scala
+/**
+ * New feature introduced in 3.5.0.
+ *
+ * @since 3.5.0
+ */
+def newMethod(): Unit
+
+/**
+ * Deprecated method, use [[newMethod]] instead.
+ *
+ * @deprecated Use newMethod() instead, since 3.5.0
+ */
+@deprecated("Use newMethod() instead", "3.5.0")
+def oldMethod(): Unit
+```
+
+## Internal Documentation
+
+### Code Comments
+
+Use comments for complex logic:
+
+```scala
+// Sort by key and value to ensure deterministic output
+// This is critical for testing and reproducing results
+val sorted = data.sortBy(x => (x._1, x._2))
+
+// TODO: Optimize this for large datasets
+// Current implementation loads all data into memory
+val result = computeExpensiveOperation()
+
+// FIXME: This breaks when input size exceeds Int.MaxValue
+val size = data.size.toInt
+```
+
+### Architecture Comments
+
+Document architectural decisions:
+
+```scala
+/**
+ * Internal scheduler implementation.
+ *
+ * Architecture:
+ * 1. Jobs are submitted to DAGScheduler
+ * 2. DAGScheduler creates stages based on shuffle boundaries
+ * 3. Each stage is submitted as a TaskSet to TaskScheduler
+ * 4. TaskScheduler assigns tasks to executors
+ * 5. Task results are returned to the driver
+ *
+ * Thread Safety:
+ * - DAGScheduler runs in a single thread (event loop)
+ * - TaskScheduler methods are thread-safe
+ * - Results are collected with appropriate synchronization
+ */
+private[spark] class SchedulerImpl
+```
+
+## Generating Documentation
+
+### Scaladoc
+
+```bash
+# Generate Scaladoc
+./build/mvn scala:doc
+
+# Output in target/site/scaladocs/
+```
+
+### Javadoc
+
+```bash
+# Generate Javadoc
+./build/mvn javadoc:javadoc
+
+# Output in target/site/apidocs/
+```
+
+### Python Documentation
+
+```bash
+# Generate Sphinx documentation
+cd python/docs
+make html
+
+# Output in _build/html/
+```
+
+### R Documentation
+
+```bash
+# Generate R documentation
+cd R/pkg
+R CMD Rd2pdf .
+```
+
+## Documentation Review Checklist
+
+When reviewing documentation:
+
+- [ ] Is the description clear and accurate?
+- [ ] Are all parameters documented?
+- [ ] Is the return value documented?
+- [ ] Are exceptions/errors documented?
+- [ ] Are examples provided for public APIs?
+- [ ] Is thread safety documented if relevant?
+- [ ] Are performance characteristics noted?
+- [ ] Is version information included?
+- [ ] Are deprecated APIs marked?
+- [ ] Are there links to related APIs?
+- [ ] Is internal vs. public API clearly marked?
+
+## Tools
+
+### IDE Support
+
+- **IntelliJ IDEA**: Auto-generates documentation templates
+- **VS Code**: Extensions for Scaladoc/Javadoc
+- **Eclipse**: Built-in Javadoc support
+
+### Linters
+
+- **Scalastyle**: Checks for missing Scaladoc
+- **Checkstyle**: Validates Javadoc
+- **Pylint**: Checks Python docstrings
+- **roxygen2**: Validates R documentation
+
+## Resources
+
+- [Scaladoc Style Guide](https://docs.scala-lang.org/style/scaladoc.html)
+- [Oracle Javadoc Guide](https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html)
+- [PEP 257 - Docstring Conventions](https://www.python.org/dev/peps/pep-0257/)
+- [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)
+- [Roxygen2 Documentation](https://roxygen2.r-lib.org/)
+
+## Contributing
+
+When contributing code to Spark:
+
+1. Follow the documentation style for your language
+2. Document all public APIs
+3. Include examples for new features
+4. Update existing documentation when changing behavior
+5. Run documentation generators to verify formatting
+
+For more information, see [CONTRIBUTING.md](CONTRIBUTING.md).
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
new file mode 100644
index 0000000000000..2e5baeb6e0d36
--- /dev/null
+++ b/DEVELOPMENT.md
@@ -0,0 +1,462 @@
+# Spark Development Guide
+
+This guide provides information for developers working on Apache Spark.
+
+## Table of Contents
+
+- [Getting Started](#getting-started)
+- [Development Environment](#development-environment)
+- [Building Spark](#building-spark)
+- [Testing](#testing)
+- [Code Style](#code-style)
+- [IDE Setup](#ide-setup)
+- [Debugging](#debugging)
+- [Working with Git](#working-with-git)
+- [Common Development Tasks](#common-development-tasks)
+
+## Getting Started
+
+### Prerequisites
+
+- Java 17 or Java 21 (for Spark 4.x)
+- Maven 3.9.9 or later
+- Python 3.9+ (for PySpark development)
+- R 4.0+ (for SparkR development)
+- Git
+
+### Initial Setup
+
+1. **Clone the repository:**
+ ```bash
+ git clone https://github.com/apache/spark.git
+ cd spark
+ ```
+
+2. **Build Spark:**
+ ```bash
+ ./build/mvn -DskipTests clean package
+ ```
+
+3. **Verify the build:**
+ ```bash
+ ./bin/spark-shell
+ ```
+
+## Development Environment
+
+### Directory Structure
+
+```
+spark/
+├── assembly/ # Final assembly JAR creation
+├── bin/ # User command scripts (spark-submit, spark-shell, etc.)
+├── build/ # Build scripts and Maven wrapper
+├── common/ # Common utilities and modules
+├── conf/ # Configuration templates
+├── core/ # Spark Core
+├── dev/ # Development tools (run-tests, lint, etc.)
+├── docs/ # Documentation (Jekyll-based)
+├── examples/ # Example programs
+├── python/ # PySpark implementation
+├── R/ # SparkR implementation
+├── sbin/ # Admin scripts (start-all.sh, stop-all.sh, etc.)
+├── sql/ # Spark SQL
+└── [other modules]
+```
+
+### Key Development Directories
+
+- `dev/`: Contains scripts for testing, linting, and releasing
+- `dev/run-tests`: Main test runner
+- `dev/lint-*`: Various linting tools
+- `build/mvn`: Maven wrapper script
+
+## Building Spark
+
+### Full Build
+
+```bash
+# Build all modules, skip tests
+./build/mvn -DskipTests clean package
+
+# Build with specific Hadoop version
+./build/mvn -Phadoop-3.4 -DskipTests clean package
+
+# Build with Hive support
+./build/mvn -Phive -Phive-thriftserver -DskipTests package
+```
+
+### Module-Specific Builds
+
+```bash
+# Build only core module
+./build/mvn -pl core -DskipTests package
+
+# Build core and its dependencies
+./build/mvn -pl core -am -DskipTests package
+
+# Build SQL module
+./build/mvn -pl sql/core -am -DskipTests package
+```
+
+### Build Profiles
+
+Common Maven profiles:
+
+- `-Phadoop-3.4`: Build with Hadoop 3.4
+- `-Pyarn`: Include YARN support
+- `-Pkubernetes`: Include Kubernetes support
+- `-Phive`: Include Hive support
+- `-Phive-thriftserver`: Include Hive Thrift Server
+- `-Pscala-2.13`: Build with Scala 2.13
+
+### Fast Development Builds
+
+For faster iteration during development:
+
+```bash
+# Skip Scala and Java style checks
+./build/mvn -DskipTests -Dcheckstyle.skip package
+
+# Build specific module quickly
+./build/mvn -pl sql/core -am -DskipTests -Dcheckstyle.skip package
+```
+
+## Testing
+
+### Running All Tests
+
+```bash
+# Run all tests (takes several hours)
+./dev/run-tests
+
+# Run tests for specific modules
+./dev/run-tests --modules sql
+```
+
+### Running Specific Test Suites
+
+#### Scala/Java Tests
+
+```bash
+# Run all tests in a module
+./build/mvn test -pl core
+
+# Run a specific ScalaTest suite
+./build/mvn test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.SparkContextSuite
+
+# Run a specific Java (JUnit) test class
+./build/mvn test -pl core -DwildcardSuites=none -Dtest=JavaAPISuite
+```
+
+#### Python Tests
+
+```bash
+# Run all PySpark tests
+cd python && python run-tests.py
+
+# Run specific test file
+cd python && python -m pytest pyspark/tests/test_context.py
+
+# Run specific test method
+cd python && python -m pytest pyspark/tests/test_context.py::SparkContextTests::test_stop
+```
+
+#### R Tests
+
+```bash
+# Run SparkR tests from the repository root
+./R/run-tests.sh
+```
+
+### Test Coverage
+
+```bash
+# Generate coverage report
+./build/mvn clean install -DskipTests
+./dev/run-tests --coverage
+```
+
+## Code Style
+
+### Scala Code Style
+
+Spark uses Scalastyle for Scala code checking:
+
+```bash
+# Check Scala style
+./dev/lint-scala
+
+# Auto-format changed Scala files with scalafmt (where configured)
+./dev/scalafmt
+```
+
+Key style guidelines:
+- 2-space indentation
+- Max line length: 100 characters
+- Follow [Scala style guide](https://docs.scala-lang.org/style/)
+
+### Java Code Style
+
+Java code follows Google Java Style:
+
+```bash
+# Check Java style
+./dev/lint-java
+```
+
+Key guidelines:
+- 2-space indentation
+- Max line length: 100 characters
+- Use Java 17+ features appropriately
+
+### Python Code Style
+
+PySpark follows PEP 8:
+
+```bash
+# Check Python style
+./dev/lint-python
+
+# Auto-format with black (if available)
+cd python && black pyspark/
+```
+
+Key guidelines:
+- 4-space indentation
+- Max line length: 100 characters
+- Type hints encouraged for new code
+
+## IDE Setup
+
+### IntelliJ IDEA
+
+1. **Import Project:**
+ - File → Open → Select `pom.xml`
+ - Choose "Open as Project"
+ - Import Maven projects automatically
+
+2. **Configure JDK:**
+ - File → Project Structure → Project SDK → Select Java 17 or 21
+
+3. **Recommended Plugins:**
+ - Scala plugin
+ - Python plugin
+ - Maven plugin
+
+4. **Code Style:**
+ - Use the Scalastyle configuration at the repository root (`scalastyle-config.xml`) as the style reference
+
+### Visual Studio Code
+
+1. **Recommended Extensions:**
+ - Scala (Metals)
+ - Python
+ - Maven for Java
+
+2. **Workspace Settings:**
+ ```json
+ {
+ "java.configuration.maven.userSettings": ".mvn/settings.xml",
+ "python.linting.enabled": true,
+ "python.linting.pylintEnabled": true
+ }
+ ```
+
+### Eclipse
+
+1. **Import Project:**
+ - File → Import → Maven → Existing Maven Projects
+
+2. **Install Plugins:**
+ - Scala IDE
+ - Maven Integration
+
+## Debugging
+
+### Debugging Scala/Java Code
+
+#### Using IDE Debugger
+
+1. Run tests with debugging enabled in your IDE
+2. Set breakpoints in source code
+3. Run test in debug mode
+
+#### Command Line Debugging
+
+```bash
+# Enable remote debugging of the driver JVM
+./bin/spark-shell --driver-java-options \
+  "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
+```
+
+Then attach your IDE debugger to port 5005.
+
+### Debugging PySpark
+
+```bash
+# Enable Python debugging
+export PYSPARK_PYTHON=python
+export PYSPARK_DRIVER_PYTHON=python
+
+# Run with pdb
+python -m pdb your_spark_script.py
+```
+
+### Logging
+
+Adjust log levels in `conf/log4j2.properties`:
+
+```properties
+# Set root logger level
+rootLogger.level = info
+
+# Set specific logger
+logger.spark.name = org.apache.spark
+logger.spark.level = debug
+```
+
+## Working with Git
+
+### Branch Naming
+
+- Feature branches: `feature/description`
+- Bug fixes: `fix/issue-number-description`
+- Documentation: `docs/description`
+
+### Commit Messages
+
+Follow the Spark commit message convention:
+
+```
+[SPARK-XXXXX] Brief description (max 72 chars)
+
+Detailed description of the change, motivation, and impact.
+
+- Bullet points for specific changes
+- Reference related issues
+
+Closes #XXXXX
+```
+
+### Creating Pull Requests
+
+1. **Fork the repository** on GitHub
+2. **Create a feature branch** from master
+3. **Make your changes** with clear commits
+4. **Push to your fork**
+5. **Open a Pull Request** with:
+ - Clear title and description
+ - Link to JIRA issue if applicable
+ - Unit tests for new functionality
+ - Documentation updates if needed
+
+### Code Review
+
+- Address review comments promptly
+- Keep discussions professional and constructive
+- Be open to suggestions and improvements
+
+## Common Development Tasks
+
+### Adding a New Configuration
+
+1. Define config in appropriate config file (e.g., `sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala`)
+2. Document the configuration
+3. Add tests
+4. Update documentation in `docs/configuration.md`
+
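+For step 1, SQL configuration entries follow the config-builder pattern used throughout `SQLConf.scala`; a hypothetical sketch of such an entry as it would appear inside that file (the name, doc text, and version are invented for illustration):
+
+```scala
+// Hypothetical entry, following the existing pattern inside SQLConf.scala.
+val MY_FEATURE_ENABLED = buildConf("spark.sql.myFeature.enabled")
+  .doc("When true, enables the (hypothetical) my-feature code path.")
+  .version("4.1.0")
+  .booleanConf
+  .createWithDefault(false)
+
+// Exposed to the rest of the module through an accessor on SQLConf:
+// def myFeatureEnabled: Boolean = getConf(MY_FEATURE_ENABLED)
+```
+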
+### Adding a New API
+
+1. Implement the API with proper documentation
+2. Add comprehensive unit tests
+3. Update relevant documentation
+4. Consider backward compatibility
+5. Add deprecation notices if replacing old APIs
+
+### Adding a New Data Source
+
+1. Implement `DataSourceV2` interface
+2. Add read/write support
+3. Include integration tests
+4. Document usage in `docs/sql-data-sources-*.md`
+
+### Performance Optimization
+
+1. Identify bottleneck with profiling
+2. Create benchmark to measure improvement
+3. Implement optimization
+4. Verify performance gain
+5. Ensure no functionality regression
+
+### Updating Dependencies
+
+1. Check for security vulnerabilities
+2. Test compatibility
+3. Update version in `pom.xml`
+4. Update `LICENSE` and `NOTICE` files if needed
+5. Run full test suite
+
+## Useful Commands
+
+```bash
+# Clean build artifacts
+./build/mvn clean
+
+# Skip Scalastyle checks
+./build/mvn -Dscalastyle.skip package
+
+# Generate API documentation
+./build/mvn scala:doc
+
+# Check for dependency updates
+./build/mvn versions:display-dependency-updates
+
+# Speed up a full build by running module builds in parallel
+./build/mvn -T 1C -DskipTests clean package
+
+# Run Spark locally with different memory
+./bin/spark-shell --driver-memory 4g --executor-memory 4g
+```
+
+## Troubleshooting
+
+### Build Issues
+
+- **Out of Memory**: Increase Maven memory with `export MAVEN_OPTS="-Xmx4g"`
+- **Compilation errors**: Clean build with `./build/mvn clean`
+- **Version conflicts**: Update local Maven repo: `./build/mvn -U package`
+
+### Test Failures
+
+- Run single test to isolate issue
+- Check for environment-specific problems
+- Review logs in `target/` directories
+- Enable debug logging for more detail
+
+### IDE Issues
+
+- Reimport Maven project
+- Invalidate caches and restart
+- Check SDK and language level settings
+
+## Resources
+
+- [Apache Spark Website](https://spark.apache.org/)
+- [Spark Developer Tools](https://spark.apache.org/developer-tools.html)
+- [Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK)
+- [Spark Mailing Lists](https://spark.apache.org/community.html#mailing-lists)
+- [Spark JIRA](https://issues.apache.org/jira/projects/SPARK)
+
+## Getting Help
+
+- Ask questions on [user@spark.apache.org](mailto:user@spark.apache.org)
+- Report bugs on [JIRA](https://issues.apache.org/jira/projects/SPARK)
+- Discuss on [dev@spark.apache.org](mailto:dev@spark.apache.org)
+- Chat on the [Spark Slack](https://spark.apache.org/community.html)
+
+## Contributing Back
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed contribution guidelines.
+
+Remember: Quality over quantity. Well-tested, documented changes are more valuable than large, poorly understood patches.
diff --git a/DOCUMENTATION_INDEX.md b/DOCUMENTATION_INDEX.md
new file mode 100644
index 0000000000000..cd4227c35df67
--- /dev/null
+++ b/DOCUMENTATION_INDEX.md
@@ -0,0 +1,345 @@
+# Apache Spark Documentation Index
+
+This document provides a complete index of all documentation available in the Apache Spark repository.
+
+## Quick Start
+
+- **[README.md](README.md)** - Main project README with quick start guide
+- **[docs/quick-start.md](docs/quick-start.md)** - Interactive tutorial for getting started
+- **[CONTRIBUTING.md](CONTRIBUTING.md)** - How to contribute to the project
+
+## Architecture and Development
+
+### Core Documentation
+- **[ARCHITECTURE.md](ARCHITECTURE.md)** - Complete Spark architecture overview
+ - Core components and their responsibilities
+ - Execution model and data flow
+ - Module structure and dependencies
+ - Key subsystems (memory, shuffle, storage, networking)
+
+- **[DEVELOPMENT.md](DEVELOPMENT.md)** - Developer guide
+ - Setting up development environment
+ - Building and testing instructions
+ - IDE configuration
+ - Code style guidelines
+ - Debugging techniques
+ - Common development tasks
+
+- **[CODE_DOCUMENTATION_GUIDE.md](CODE_DOCUMENTATION_GUIDE.md)** - Code documentation standards
+ - Scaladoc guidelines
+ - Javadoc guidelines
+ - Python docstring conventions
+ - R documentation standards
+ - Best practices and examples
+
+## Module Documentation
+
+### Core Modules
+
+#### Spark Core
+- **[core/README.md](core/README.md)** - Spark Core documentation
+ - RDD API and operations
+ - SparkContext and configuration
+ - Task scheduling (DAGScheduler, TaskScheduler)
+ - Memory management
+ - Shuffle system
+ - Storage system
+ - Serialization
+
+#### Spark SQL
+- **[sql/README.md](sql/README.md)** - Spark SQL documentation (if present)
+- **[docs/sql-programming-guide.md](docs/sql-programming-guide.md)** - SQL programming guide
+- **[docs/sql-data-sources.md](docs/sql-data-sources.md)** - Data source integration
+- **[docs/sql-performance-tuning.md](docs/sql-performance-tuning.md)** - Performance tuning
+
+#### Streaming
+- **[streaming/README.md](streaming/README.md)** - Spark Streaming documentation
+ - DStreams API (legacy)
+ - Structured Streaming (recommended)
+ - Input sources and output sinks
+ - Windowing and stateful operations
+ - Performance tuning
+
+#### MLlib
+- **[mllib/README.md](mllib/README.md)** - MLlib documentation
+ - ML Pipeline API (spark.ml)
+ - RDD-based API (spark.mllib - maintenance mode)
+ - Classification and regression algorithms
+ - Clustering algorithms
+ - Feature engineering
+ - Model selection and tuning
+
+#### GraphX
+- **[graphx/README.md](graphx/README.md)** - GraphX documentation
+ - Property graphs
+ - Graph operators
+ - Graph algorithms (PageRank, Connected Components, etc.)
+ - Pregel API
+ - Performance optimization
+
+### Common Modules
+- **[common/README.md](common/README.md)** - Common utilities documentation
+ - Network communication (network-common, network-shuffle)
+ - Key-value store
+ - Sketching algorithms
+ - Unsafe operations
+
+### Tools and Utilities
+
+#### User-Facing Tools
+- **[bin/README.md](bin/README.md)** - User scripts documentation
+ - spark-submit: Application submission
+ - spark-shell: Interactive Scala shell
+ - pyspark: Interactive Python shell
+ - sparkR: Interactive R shell
+ - spark-sql: SQL query shell
+ - run-example: Example runner
+
+#### Administration Tools
+- **[sbin/README.md](sbin/README.md)** - Admin scripts documentation
+ - Cluster management scripts
+ - start-all.sh / stop-all.sh
+ - Master and worker daemon management
+ - History server setup
+ - Standalone cluster configuration
+
+#### Programmatic API
+- **[launcher/README.md](launcher/README.md)** - Launcher API documentation
+ - SparkLauncher for programmatic application launching
+ - SparkAppHandle for monitoring
+ - Integration patterns
+
+#### Resource Managers
+- **[resource-managers/README.md](resource-managers/README.md)** - Resource manager integrations
+ - YARN integration
+ - Kubernetes integration
+ - Mesos integration
+ - Comparison and configuration
+
+### Examples
+- **[examples/README.md](examples/README.md)** - Example programs
+ - Core examples (RDD operations)
+ - SQL examples (DataFrames)
+ - Streaming examples
+ - MLlib examples
+ - GraphX examples
+ - Running examples
+
+## Official Documentation
+
+### Programming Guides
+- **[docs/programming-guide.md](docs/programming-guide.md)** - General programming guide
+- **[docs/rdd-programming-guide.md](docs/rdd-programming-guide.md)** - RDD programming
+- **[docs/sql-programming-guide.md](docs/sql-programming-guide.md)** - SQL programming
+- **[docs/structured-streaming-programming-guide.md](docs/structured-streaming-programming-guide.md)** - Structured Streaming
+- **[docs/streaming-programming-guide.md](docs/streaming-programming-guide.md)** - DStreams (legacy)
+- **[docs/ml-guide.md](docs/ml-guide.md)** - Machine learning guide
+- **[docs/graphx-programming-guide.md](docs/graphx-programming-guide.md)** - Graph processing
+
+### Deployment
+- **[docs/cluster-overview.md](docs/cluster-overview.md)** - Cluster mode overview
+- **[docs/submitting-applications.md](docs/submitting-applications.md)** - Application submission
+- **[docs/spark-standalone.md](docs/spark-standalone.md)** - Standalone cluster mode
+- **[docs/running-on-yarn.md](docs/running-on-yarn.md)** - Running on YARN
+- **[docs/running-on-kubernetes.md](docs/running-on-kubernetes.md)** - Running on Kubernetes
+
+### Configuration and Tuning
+- **[docs/configuration.md](docs/configuration.md)** - Configuration reference
+- **[docs/tuning.md](docs/tuning.md)** - Performance tuning guide
+- **[docs/hardware-provisioning.md](docs/hardware-provisioning.md)** - Hardware recommendations
+- **[docs/job-scheduling.md](docs/job-scheduling.md)** - Job scheduling
+- **[docs/monitoring.md](docs/monitoring.md)** - Monitoring and instrumentation
+
+### Advanced Topics
+- **[docs/security.md](docs/security.md)** - Security guide
+- **[docs/cloud-integration.md](docs/cloud-integration.md)** - Cloud storage integration
+- **[docs/building-spark.md](docs/building-spark.md)** - Building from source
+
+### Migration Guides
+- **[docs/core-migration-guide.md](docs/core-migration-guide.md)** - Core API migration
+- **[docs/sql-migration-guide.md](docs/sql-migration-guide.md)** - SQL migration
+- **[docs/ml-migration-guide.md](docs/ml-migration-guide.md)** - MLlib migration
+- **[docs/pyspark-migration-guide.md](docs/pyspark-migration-guide.md)** - PySpark migration
+- **[docs/ss-migration-guide.md](docs/ss-migration-guide.md)** - Structured Streaming migration
+
+### API References
+- **[docs/sql-ref.md](docs/sql-ref.md)** - SQL reference
+- **[docs/sql-ref-functions.md](docs/sql-ref-functions.md)** - SQL functions
+- **[docs/sql-ref-datatypes.md](docs/sql-ref-datatypes.md)** - SQL data types
+- **[docs/sql-ref-syntax.md](docs/sql-ref-syntax.md)** - SQL syntax
+
+## Language-Specific Documentation
+
+### Python (PySpark)
+- **[python/README.md](python/README.md)** - PySpark overview
+- **[python/docs/](python/docs/)** - PySpark documentation source
+- **[docs/api/python/](docs/api/python/)** - Python API docs (generated)
+
+### R (SparkR)
+- **[R/README.md](R/README.md)** - SparkR overview
+- **[docs/sparkr.md](docs/sparkr.md)** - SparkR guide
+- **[R/pkg/README.md](R/pkg/README.md)** - R package documentation
+
+### Scala
+- **[docs/api/scala/](docs/api/scala/)** - Scala API docs (generated)
+
+### Java
+- **[docs/api/java/](docs/api/java/)** - Java API docs (generated)
+
+## Data Sources
+
+### Built-in Sources
+- **[docs/sql-data-sources-load-save-functions.md](docs/sql-data-sources-load-save-functions.md)**
+- **[docs/sql-data-sources-parquet.md](docs/sql-data-sources-parquet.md)**
+- **[docs/sql-data-sources-json.md](docs/sql-data-sources-json.md)**
+- **[docs/sql-data-sources-csv.md](docs/sql-data-sources-csv.md)**
+- **[docs/sql-data-sources-jdbc.md](docs/sql-data-sources-jdbc.md)**
+- **[docs/sql-data-sources-avro.md](docs/sql-data-sources-avro.md)**
+- **[docs/sql-data-sources-orc.md](docs/sql-data-sources-orc.md)**
+
+### External Integrations
+- **[docs/streaming-kafka-integration.md](docs/streaming-kafka-integration.md)** - Kafka integration
+- **[docs/streaming-kinesis-integration.md](docs/streaming-kinesis-integration.md)** - Kinesis integration
+- **[docs/structured-streaming-kafka-integration.md](docs/structured-streaming-kafka-integration.md)** - Structured Streaming with Kafka
+
+## Special Topics
+
+### Machine Learning
+- **[docs/ml-pipeline.md](docs/ml-pipeline.md)** - ML Pipelines
+- **[docs/ml-features.md](docs/ml-features.md)** - Feature transformers
+- **[docs/ml-classification-regression.md](docs/ml-classification-regression.md)** - Classification/Regression
+- **[docs/ml-clustering.md](docs/ml-clustering.md)** - Clustering
+- **[docs/ml-collaborative-filtering.md](docs/ml-collaborative-filtering.md)** - Recommendation
+- **[docs/ml-tuning.md](docs/ml-tuning.md)** - Hyperparameter tuning
+
+### Streaming
+- **[docs/structured-streaming-programming-guide.md](docs/structured-streaming-programming-guide.md)** - Structured Streaming guide
+
+### Graph Processing
+- **[docs/graphx-programming-guide.md](docs/graphx-programming-guide.md)** - GraphX guide
+
+## Additional Resources
+
+### Community
+- **[Apache Spark Website](https://spark.apache.org/)** - Official website
+- **[Spark Documentation](https://spark.apache.org/documentation.html)** - Online docs
+- **[Developer Tools](https://spark.apache.org/developer-tools.html)** - Developer resources
+- **[Community](https://spark.apache.org/community.html)** - Mailing lists and chat
+
+### External Links
+- **[Spark JIRA](https://issues.apache.org/jira/projects/SPARK)** - Issue tracker
+- **[GitHub Repository](https://github.com/apache/spark)** - Source code
+- **[Stack Overflow](https://stackoverflow.com/questions/tagged/apache-spark)** - Q&A
+
+## Document Organization
+
+### By Audience
+
+**For Users:**
+- Quick Start Guide
+- Programming Guides (SQL, Streaming, MLlib, GraphX)
+- Configuration Guide
+- Deployment Guides (YARN, Kubernetes)
+- Examples
+
+**For Developers:**
+- ARCHITECTURE.md
+- DEVELOPMENT.md
+- CODE_DOCUMENTATION_GUIDE.md
+- Module READMEs
+- Building Guide
+
+**For Administrators:**
+- Cluster Overview
+- Standalone Mode Guide
+- Monitoring Guide
+- Security Guide
+- Admin Scripts (sbin/)
+
+### By Topic
+
+**Getting Started:**
+1. README.md
+2. docs/quick-start.md
+3. docs/programming-guide.md
+
+**Core Concepts:**
+1. ARCHITECTURE.md
+2. core/README.md
+3. docs/rdd-programming-guide.md
+
+**Data Processing:**
+1. docs/sql-programming-guide.md
+2. docs/structured-streaming-programming-guide.md
+3. docs/ml-guide.md
+
+**Deployment:**
+1. docs/cluster-overview.md
+2. docs/submitting-applications.md
+3. docs/running-on-yarn.md or docs/running-on-kubernetes.md
+
+**Optimization:**
+1. docs/tuning.md
+2. docs/sql-performance-tuning.md
+3. docs/hardware-provisioning.md
+
+## Documentation Standards
+
+All documentation follows these principles:
+
+1. **Clarity**: Clear, concise explanations
+2. **Completeness**: Comprehensive coverage of topics
+3. **Examples**: Code examples for all concepts
+4. **Structure**: Consistent formatting and organization
+5. **Accuracy**: Up-to-date and technically correct
+6. **Accessibility**: Easy to find and navigate
+
+## Contributing to Documentation
+
+To contribute to Spark documentation:
+
+1. Follow the style guides in CODE_DOCUMENTATION_GUIDE.md
+2. Update relevant documentation when changing code
+3. Add examples for new features
+4. Test documentation builds locally
+5. Submit pull requests with documentation updates
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) for details.
+
+## Building Documentation
+
+### Building User Documentation
+```bash
+cd docs
+bundle install
+bundle exec jekyll serve
+# View at http://localhost:4000
+```
+
+### Building API Documentation
+```bash
+# Scala API docs
+./build/mvn scala:doc
+
+# Java API docs
+./build/mvn javadoc:javadoc
+
+# Python API docs
+cd python/docs
+make html
+```
+
+## Getting Help
+
+If you can't find what you're looking for:
+
+1. Check the [Documentation Index](https://spark.apache.org/documentation.html)
+2. Search [Stack Overflow](https://stackoverflow.com/questions/tagged/apache-spark)
+3. Ask on the [user mailing list](mailto:user@spark.apache.org)
+4. Check [Spark JIRA](https://issues.apache.org/jira/projects/SPARK) for known issues
+
+## Last Updated
+
+This index was last updated: 2025-10-19
+
+For the most up-to-date documentation, visit [spark.apache.org/docs/latest](https://spark.apache.org/docs/latest/).
diff --git a/README.md b/README.md
index 65dfd67ac520e..0dd1f7f173bea 100644
--- a/README.md
+++ b/README.md
@@ -15,11 +15,33 @@ and Structured Streaming for stream processing.
[](https://pypi.org/project/pyspark/)
-## Online Documentation
+## Documentation
You can find the latest Spark documentation, including a programming
guide, on the [project web page](https://spark.apache.org/documentation.html).
-This README file only contains basic setup instructions.
+
+### Repository Documentation
+
+- **[ARCHITECTURE.md](ARCHITECTURE.md)** - Spark architecture overview and component descriptions
+- **[DEVELOPMENT.md](DEVELOPMENT.md)** - Developer guide with build instructions, testing, and IDE setup
+- **[CONTRIBUTING.md](CONTRIBUTING.md)** - How to contribute to Apache Spark
+
+### Module Documentation
+
+- **[core/](core/README.md)** - Spark Core: RDDs, scheduling, memory management, storage
+- **[sql/](sql/README.md)** - Spark SQL: DataFrames, Datasets, SQL engine, data sources
+- **[streaming/](streaming/README.md)** - Spark Streaming: DStreams and Structured Streaming
+- **[mllib/](mllib/README.md)** - MLlib: Machine learning library with algorithms and pipelines
+- **[graphx/](graphx/README.md)** - GraphX: Graph processing framework and algorithms
+- **[examples/](examples/README.md)** - Example programs in Scala, Java, Python, and R
+
+### Tools and Utilities
+
+- **[bin/](bin/README.md)** - User-facing scripts (spark-submit, spark-shell, pyspark, etc.)
+- **[sbin/](sbin/README.md)** - Admin scripts for managing Spark standalone clusters
+- **[launcher/](launcher/README.md)** - Programmatic API for launching Spark applications
+- **[resource-managers/](resource-managers/README.md)** - Integrations with YARN, Kubernetes, and Mesos
+- **[common/](common/README.md)** - Common utilities and libraries shared across modules
## Build Pipeline Status
diff --git a/bin/README.md b/bin/README.md
new file mode 100644
index 0000000000000..e83fbf583746c
--- /dev/null
+++ b/bin/README.md
@@ -0,0 +1,453 @@
+# Spark Binary Scripts
+
+This directory contains user-facing command-line scripts for running Spark applications and interactive shells.
+
+## Overview
+
+These scripts provide convenient entry points for:
+- Running Spark applications
+- Starting interactive shells (Scala, Python, R, SQL)
+- Managing Spark clusters
+- Utility operations
+
+## Main Scripts
+
+### spark-submit
+
+Submit Spark applications to a cluster.
+
+**Usage:**
+```bash
+./bin/spark-submit \
+  --class <main-class> \
+  --master <master-url> \
+  --deploy-mode <deploy-mode> \
+  --conf <key>=<value> \
+  ... # other options
+  <application-jar> \
+ [application-arguments]
+```
+
+**Examples:**
+```bash
+# Run on local mode with 4 cores
+./bin/spark-submit --class org.example.App --master local[4] app.jar
+
+# Run on YARN cluster
+./bin/spark-submit --class org.example.App --master yarn --deploy-mode cluster app.jar
+
+# Run Python application
+./bin/spark-submit --master local[2] script.py
+
+# Run with specific memory and executor settings
+./bin/spark-submit \
+ --master spark://master:7077 \
+ --executor-memory 4G \
+ --total-executor-cores 8 \
+ --class org.example.App \
+ app.jar
+```
+
+**Key Options:**
+- `--master`: Master URL (local, spark://, yarn, k8s://, mesos://)
+- `--deploy-mode`: client or cluster
+- `--class`: Application main class (for Java/Scala)
+- `--name`: Application name
+- `--jars`: Additional JARs to include
+- `--packages`: Maven coordinates of packages
+- `--conf`: Spark configuration property
+- `--driver-memory`: Driver memory (e.g., 1g, 2g)
+- `--executor-memory`: Executor memory
+- `--executor-cores`: Cores per executor
+- `--num-executors`: Number of executors (YARN only)
+
+See [submitting-applications.md](../docs/submitting-applications.md) for complete documentation.
+
+### spark-shell
+
+Interactive Scala shell with Spark support.
+
+**Usage:**
+```bash
+./bin/spark-shell [options]
+```
+
+**Examples:**
+```bash
+# Start local shell
+./bin/spark-shell
+
+# Connect to remote cluster
+./bin/spark-shell --master spark://master:7077
+
+# With specific memory
+./bin/spark-shell --driver-memory 4g
+
+# With additional packages
+./bin/spark-shell --packages org.apache.spark:spark-avro_2.13:3.5.0
+```
+
+**In the shell:**
+```scala
+scala> val data = spark.range(1000)
+scala> data.count()
+res0: Long = 1000
+
+scala> spark.read.json("data.json").show()
+```
+
+### pyspark
+
+Interactive Python shell with PySpark support.
+
+**Usage:**
+```bash
+./bin/pyspark [options]
+```
+
+**Examples:**
+```bash
+# Start local shell
+./bin/pyspark
+
+# Connect to remote cluster
+./bin/pyspark --master spark://master:7077
+
+# With specific Python version
+PYSPARK_PYTHON=python3.11 ./bin/pyspark
+```
+
+**In the shell:**
+```python
+>>> df = spark.range(1000)
+>>> df.count()
+1000
+
+>>> spark.read.json("data.json").show()
+```
+
+### sparkR
+
+Interactive R shell with SparkR support.
+
+**Usage:**
+```bash
+./bin/sparkR [options]
+```
+
+**Examples:**
+```bash
+# Start local shell
+./bin/sparkR
+
+# Connect to remote cluster
+./bin/sparkR --master spark://master:7077
+```
+
+**In the shell:**
+```r
+> df <- createDataFrame(iris)
+> head(df)
+> count(df)
+```
+
+### spark-sql
+
+Interactive SQL shell for running SQL queries.
+
+**Usage:**
+```bash
+./bin/spark-sql [options]
+```
+
+**Examples:**
+```bash
+# Start SQL shell
+./bin/spark-sql
+
+# Point at a specific warehouse directory
+./bin/spark-sql --conf spark.sql.warehouse.dir=/path/to/warehouse
+
+# Run SQL file
+./bin/spark-sql -f query.sql
+
+# Execute inline query
+./bin/spark-sql -e "SELECT * FROM table"
+```
+
+**In the shell:**
+```sql
+spark-sql> CREATE TABLE test (id INT, name STRING);
+spark-sql> INSERT INTO test VALUES (1, 'Alice'), (2, 'Bob');
+spark-sql> SELECT * FROM test;
+```
+
+### run-example
+
+Run Spark example programs.
+
+**Usage:**
+```bash
+./bin/run-example <example-class> [params]
+```
+
+**Examples:**
+```bash
+# Run SparkPi example
+./bin/run-example SparkPi 100
+
+# Run with specific master
+MASTER=spark://master:7077 ./bin/run-example SparkPi
+
+# Run SQL example
+./bin/run-example sql.SparkSQLExample
+```
+
+## Utility Scripts
+
+### spark-class
+
+Internal script to run Spark classes. Usually not called directly by users.
+
+**Usage:**
+```bash
+./bin/spark-class <class> [options]
+```
+
+### load-spark-env.sh
+
+Loads Spark environment variables from conf/spark-env.sh. Sourced by other scripts.
+
+## Configuration
+
+Scripts read configuration from:
+
+1. **Environment variables**: Set in shell or `conf/spark-env.sh`
+2. **Command-line options**: Passed via `--conf` or specific flags
+3. **Configuration files**: `conf/spark-defaults.conf`
+
+### Common Environment Variables
+
+```bash
+# Java
+export JAVA_HOME=/path/to/java
+
+# Spark
+export SPARK_HOME=/path/to/spark
+export SPARK_MASTER_HOST=master-hostname
+export SPARK_MASTER_PORT=7077
+
+# Python
+export PYSPARK_PYTHON=python3
+export PYSPARK_DRIVER_PYTHON=python3
+
+# Memory
+export SPARK_DRIVER_MEMORY=2g
+export SPARK_EXECUTOR_MEMORY=4g
+
+# Logging
+export SPARK_LOG_DIR=/var/log/spark
+```
+
+Set these in `conf/spark-env.sh` for persistence.
+
+## Master URLs
+
+Scripts accept various master URL formats:
+
+- **local**: Run locally with one worker thread
+- **local[K]**: Run locally with K worker threads
+- **local[*]**: Run locally with as many worker threads as cores
+- **spark://HOST:PORT**: Connect to Spark standalone cluster
+- **yarn**: Connect to YARN cluster
+- **k8s://HOST:PORT**: Connect to Kubernetes cluster
+- **mesos://HOST:PORT**: Connect to Mesos cluster
+
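+The same URLs can also be set programmatically when building a session inside a self-contained application or test; with `spark-submit`, prefer passing `--master` on the command line instead. A sketch:
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+// "local[*]" could equally be "spark://host:7077", "yarn", or "k8s://https://host:6443".
+val spark = SparkSession.builder()
+  .appName("master-url-sketch")
+  .master("local[*]")
+  .getOrCreate()
+```
+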
+## Advanced Usage
+
+### Configuring Logging
+
+Create `conf/log4j2.properties`:
+```properties
+rootLogger.level = info
+logger.spark.name = org.apache.spark
+logger.spark.level = warn
+```
+
+### Using with Jupyter Notebook
+
+```bash
+# Set environment variables
+export PYSPARK_DRIVER_PYTHON=jupyter
+export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
+
+# Start PySpark (opens Jupyter)
+./bin/pyspark
+```
+
+### Connecting to Remote Clusters
+
+```bash
+# Standalone cluster
+./bin/spark-submit --master spark://master:7077 app.jar
+
+# YARN
+./bin/spark-submit --master yarn --deploy-mode cluster app.jar
+
+# Kubernetes
+./bin/spark-submit --master k8s://https://k8s-api:6443 \
+ --deploy-mode cluster \
+ --conf spark.kubernetes.container.image=spark:3.5.0 \
+ app.jar
+```
+
+### Dynamic Resource Allocation
+
+```bash
+./bin/spark-submit \
+ --conf spark.dynamicAllocation.enabled=true \
+ --conf spark.dynamicAllocation.minExecutors=1 \
+ --conf spark.dynamicAllocation.maxExecutors=10 \
+ app.jar
+```
+
+## Debugging
+
+### Enable Verbose Output
+
+```bash
+./bin/spark-submit --verbose ...
+```
+
+### Check Spark Configuration
+
+```bash
+./bin/spark-submit --class org.example.App app.jar 2>&1 | grep -i "spark\."
+```
+
+### Remote Debugging
+
+```bash
+# Driver debugging
+./bin/spark-submit \
+ --conf spark.driver.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
+ app.jar
+
+# Executor debugging
+./bin/spark-submit \
+ --conf spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006" \
+ app.jar
+```
+
+## Security
+
+### Kerberos Authentication
+
+```bash
+./bin/spark-submit \
+ --principal user@REALM \
+ --keytab /path/to/user.keytab \
+ --master yarn \
+ app.jar
+```
+
+### SSL Configuration
+
+```bash
+./bin/spark-submit \
+ --conf spark.ssl.enabled=true \
+ --conf spark.ssl.keyStore=/path/to/keystore \
+ --conf spark.ssl.keyStorePassword=password \
+ app.jar
+```
+
+## Performance Tuning
+
+### Memory Configuration
+
+```bash
+./bin/spark-submit \
+ --driver-memory 4g \
+ --executor-memory 8g \
+ --conf spark.memory.fraction=0.8 \
+ app.jar
+```
+
+### Parallelism
+
+```bash
+./bin/spark-submit \
+ --conf spark.default.parallelism=100 \
+ --conf spark.sql.shuffle.partitions=200 \
+ app.jar
+```
+
+### Serialization
+
+```bash
+./bin/spark-submit \
+ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+ app.jar
+```
+
+## Troubleshooting
+
+### Common Issues
+
+**Java not found:**
+```bash
+export JAVA_HOME=/path/to/java
+```
+
+**Class not found:**
+```bash
+# Add dependencies
+./bin/spark-submit --jars dependency.jar app.jar
+```
+
+**Out of memory:**
+```bash
+# Increase memory
+./bin/spark-submit --driver-memory 8g --executor-memory 16g app.jar
+```
+
+**Connection refused:**
+```bash
+# Check master URL and firewall settings
+# Verify master is running with: jps | grep Master
+```
+
+## Script Internals
+
+### Script Hierarchy
+
+```
+spark-submit
+├── spark-class
+│ └── load-spark-env.sh
+└── Actual Java/Python execution
+```
+
+### How spark-submit Works
+
+1. Parse command-line arguments
+2. Load configuration from `spark-defaults.conf`
+3. Set up classpath and Java options
+4. Call `spark-class` with appropriate arguments
+5. Launch JVM with Spark application
+
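+To launch applications from another JVM process instead of a shell, the Launcher API (see [../launcher/README.md](../launcher/README.md)) wraps the same machinery; a sketch in which the jar path, class name, and master URL are placeholders:
+
+```scala
+import org.apache.spark.launcher.SparkLauncher
+
+// The jar path, main class, and master URL below are placeholders.
+val handle = new SparkLauncher()
+  .setAppResource("/path/to/app.jar")
+  .setMainClass("org.example.App")
+  .setMaster("local[*]")
+  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
+  .startApplication()
+
+// The handle reports state transitions (SUBMITTED, RUNNING, FINISHED, ...).
+println(handle.getState)
+```
+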
+## Related Scripts
+
+For cluster management scripts, see [../sbin/README.md](../sbin/README.md).
+
+## Further Reading
+
+- [Submitting Applications](../docs/submitting-applications.md)
+- [Spark Configuration](../docs/configuration.md)
+- [Cluster Mode Overview](../docs/cluster-overview.md)
+- [Running on YARN](../docs/running-on-yarn.md)
+- [Running on Kubernetes](../docs/running-on-kubernetes.md)
+
+## Examples
+
+More examples in [../examples/](../examples/).
diff --git a/common/README.md b/common/README.md
new file mode 100644
index 0000000000000..1d2890b14e6c2
--- /dev/null
+++ b/common/README.md
@@ -0,0 +1,472 @@
+# Spark Common Modules
+
+This directory contains common utilities and libraries shared across all Spark modules.
+
+## Overview
+
+The common modules provide foundational functionality used throughout Spark:
+
+- Network communication
+- Memory management utilities
+- Serialization helpers
+- Configuration management
+- Logging infrastructure
+- Testing utilities
+
+These modules have no dependencies on Spark Core, allowing them to be used by any Spark component.
+
+## Modules
+
+### common/kvstore
+
+Key-value store abstraction for metadata storage.
+
+**Purpose:**
+- Store application metadata
+- Track job and stage information
+- Persist UI data
+
+**Location**: `kvstore/`
+
+**Key classes:**
+- `KVStore`: Interface for key-value storage
+- `LevelDB`: LevelDB-based implementation
+- `InMemoryStore`: In-memory implementation for testing
+
+**Usage:**
+```scala
+// `path` is a java.io.File; the value's @KVIndex-annotated field is its natural key
+val store = new LevelDB(path)
+store.write(value)                            // write() takes just the value
+val data = store.read(classOf[ValueType], id) // read() looks up by natural key
+store.close()
+```
+
+### common/network-common
+
+Core networking abstractions and utilities.
+
+**Purpose:**
+- RPC framework
+- Block transfer protocol
+- Network servers and clients
+
+**Location**: `network-common/`
+
+**Key components:**
+- `TransportContext`: Network communication setup
+- `TransportClient`: Network client
+- `TransportServer`: Network server
+- `MessageHandler`: Message processing
+- `StreamManager`: Stream data management
+
+**Features:**
+- Netty-based implementation
+- Zero-copy transfers
+- SSL/TLS support
+- Flow control
+
+### common/network-shuffle
+
+Network shuffle service for serving shuffle data.
+
+**Purpose:**
+- External shuffle service
+- Serves shuffle blocks to executors
+- Improves executor reliability
+
+**Location**: `network-shuffle/`
+
+**Key classes:**
+- `ExternalShuffleService`: Standalone shuffle service
+- `ExternalShuffleClient`: Client for fetching shuffle data
+- `ShuffleBlockResolver`: Resolves shuffle block locations
+
+**Benefits:**
+- Executors can be killed without losing shuffle data
+- Better resource utilization
+- Improved fault tolerance
+
+**Configuration:**
+```properties
+spark.shuffle.service.enabled=true
+spark.shuffle.service.port=7337
+```
+
+### common/network-yarn
+
+YARN-specific network integration.
+
+**Purpose:**
+- Integration with YARN shuffle service
+- YARN auxiliary service implementation
+
+**Location**: `network-yarn/`
+
+**Usage:** Automatically used when running on YARN with shuffle service enabled.
+
+### common/sketch
+
+Data sketching and approximate algorithms.
+
+**Purpose:**
+- Memory-efficient approximate computations
+- Probabilistic data structures
+
+**Location**: `sketch/`
+
+**Algorithms:**
+- Count-Min Sketch: Frequency estimation
+- Bloom Filter: Set membership testing
+- HyperLogLog: Cardinality estimation
+
+**Usage:**
+```scala
+import org.apache.spark.util.sketch._
+
+// Create bloom filter
+val bf = BloomFilter.create(expectedItems, falsePositiveRate)
+bf.put("item1")
+bf.mightContain("item1") // true
+
+// Create count-min sketch
+val cms = CountMinSketch.create(depth, width, seed)
+cms.add("item", count)
+val estimate = cms.estimateCount("item")
+```
+
+### common/tags
+
+Test tags for categorizing tests.
+
+**Purpose:**
+- Tag tests for selective execution
+- Categorize slow/flaky tests
+- Enable/disable test groups
+
+**Location**: `tags/`
+
+**Example tags:**
+- `@SlowTest`: Long-running tests
+- `@ExtendedTest`: Extended test suite
+- `@DockerTest`: Tests requiring Docker
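+
+As a rough sketch of how a tag is applied (the suite name and the `test.exclude.tags` build property below are illustrative; `DockerTest` is the annotation listed above), a suite is simply annotated and can then be included or excluded when running the build:
+
+```scala
+import org.apache.spark.tags.DockerTest
+import org.scalatest.funsuite.AnyFunSuite
+
+// Excluded in builds that pass -Dtest.exclude.tags=org.apache.spark.tags.DockerTest
+@DockerTest
+class DockerBackedSuite extends AnyFunSuite {
+  test("runs only when Docker-tagged tests are enabled") {
+    assert(1 + 1 == 2)
+  }
+}
+```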
+
+### common/unsafe
+
+Unsafe operations for performance-critical code.
+
+**Purpose:**
+- Direct memory access
+- Serialization without reflection
+- Performance optimizations
+
+**Location**: `unsafe/`
+
+**Key classes:**
+- `Platform`: Platform-specific operations
+- `UnsafeAlignedOffset`: Aligned memory access
+- Memory utilities for sorting and hashing
+
+**Warning:** These APIs are internal and subject to change.
+
+## Architecture
+
+### Layering
+
+```
+Spark Core / SQL / Streaming / MLlib
+ ↓
+ Common Modules (network, kvstore, etc.)
+ ↓
+ JVM / Netty / OS
+```
+
+### Design Principles
+
+1. **No Spark Core dependencies**: Can be used independently
+2. **Minimal external dependencies**: Reduce classpath conflicts
+3. **High performance**: Optimized for throughput and latency
+4. **Reusability**: Shared across all Spark components
+
+## Networking Architecture
+
+### Transport Layer
+
+The network-common module provides the foundation for all network communication in Spark.
+
+**Components:**
+
+1. **TransportContext**: Sets up network infrastructure
+2. **TransportClient**: Sends requests and receives responses
+3. **TransportServer**: Accepts connections and handles requests
+4. **MessageHandler**: Processes incoming messages
+
+**Flow:**
+```
+Client                              Server
+  |                                   |
+  |-------- Request Message --------->|
+  |                                   |  (processed by MessageHandler)
+  |<------- Response Message ---------|
+  |                                   |
+```
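+
+A rough sketch of how these pieces fit together (internal API, heavily simplified; `transportConf`, `rpcHandler` and `port` are assumed to be defined elsewhere):
+
+```scala
+import java.util.Collections
+
+import org.apache.spark.network.TransportContext
+import org.apache.spark.network.client.{TransportClient, TransportClientFactory}
+import org.apache.spark.network.server.{TransportServer, TransportServerBootstrap}
+
+// Server side: bind a TransportServer whose requests are dispatched to rpcHandler
+val context = new TransportContext(transportConf, rpcHandler)
+val server: TransportServer =
+  context.createServer(port, Collections.emptyList[TransportServerBootstrap]())
+
+// Client side: the factory caches and reuses connections per (host, port)
+val factory: TransportClientFactory = context.createClientFactory()
+val client: TransportClient = factory.createClient("remote-host", port)
+```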
+
+### RPC Framework
+
+Built on top of the transport layer:
+
+```scala
+// Server side (internal API, simplified; MyEndpoint is a user-defined RpcEndpoint)
+val rpcEnv = RpcEnv.create("name", host, port, conf, new SecurityManager(conf))
+val endpoint = new MyEndpoint(rpcEnv)
+rpcEnv.setupEndpoint("my-endpoint", endpoint)
+
+// Client side
+val ref = rpcEnv.setupEndpointRef(RpcAddress(host, port), "my-endpoint")
+val response = ref.askSync[Response](request)
+```
+
+### Block Transfer
+
+Optimized for transferring large data blocks:
+
+```scala
+// Internal API, simplified: the real constructor also takes a security manager,
+// bind address, host name, port and core count. fetchBlocks is asynchronous and
+// reports fetched blocks (or failures) to the listener.
+val blockTransferService = new NettyBlockTransferService(conf)
+blockTransferService.fetchBlocks(
+  host, port, execId, blockIds,
+  blockFetchingListener, tempFileManager)
+```
+
+## Building and Testing
+
+### Build Common Modules
+
+```bash
+# Build all common modules
+./build/mvn -pl 'common/*' -am package
+
+# Build specific module
+./build/mvn -pl common/network-common -am package
+```
+
+### Run Tests
+
+```bash
+# Run all common tests
+./build/mvn test -pl 'common/*'
+
+# Run specific module tests
+./build/mvn test -pl common/network-common
+
+# Run specific test
+./build/mvn test -pl common/network-common -Dtest=TransportClientSuite
+```
+
+## Module Dependencies
+
+```
+common/unsafe (no dependencies)
+ ↓
+common/network-common
+ ↓
+common/network-shuffle
+ ↓
+common/network-yarn
+
+common/sketch (independent)
+common/tags (independent)
+common/kvstore (independent)
+```
+
+## Source Code Organization
+
+```
+common/
+├── kvstore/ # Key-value store
+│ └── src/main/java/org/apache/spark/util/kvstore/
+├── network-common/ # Core networking
+│ └── src/main/java/org/apache/spark/network/
+│ ├── client/ # Client implementation
+│ ├── server/ # Server implementation
+│ ├── buffer/ # Buffer management
+│ ├── crypto/ # Encryption
+│ ├── protocol/ # Protocol messages
+│ └── util/ # Utilities
+├── network-shuffle/ # Shuffle service
+│ └── src/main/java/org/apache/spark/network/shuffle/
+├── network-yarn/ # YARN integration
+│ └── src/main/java/org/apache/spark/network/yarn/
+├── sketch/ # Sketching algorithms
+│ └── src/main/java/org/apache/spark/util/sketch/
+├── tags/ # Test tags
+│ └── src/main/java/org/apache/spark/tags/
+└── unsafe/ # Unsafe operations
+ └── src/main/java/org/apache/spark/unsafe/
+```
+
+## Performance Considerations
+
+### Zero-Copy Transfer
+
+Network modules use zero-copy techniques:
+- FileRegion for file-based transfers
+- Direct buffers to avoid copying
+- Netty's native transport when available
+
+### Memory Management
+
+```java
+import io.netty.buffer.ByteBuf;
+import io.netty.buffer.ByteBufAllocator;
+import io.netty.buffer.PooledByteBufAllocator;
+
+// Use pooled, direct (off-heap) buffers and always release them
+ByteBufAllocator allocator = PooledByteBufAllocator.DEFAULT;
+ByteBuf buffer = allocator.directBuffer(size);
+try {
+    // Use buffer
+} finally {
+    buffer.release();
+}
+```
+
+### Connection Pooling
+
+Clients reuse connections:
+```java
+TransportClientFactory factory = context.createClientFactory();
+TransportClient client = factory.createClient(host, port);
+// Client is cached and reused
+```
+
+## Security
+
+### SSL/TLS Support
+
+Enable encryption in network communication:
+
+```properties
+spark.ssl.enabled=true
+spark.ssl.protocol=TLSv1.2
+spark.ssl.keyStore=/path/to/keystore
+spark.ssl.keyStorePassword=password
+spark.ssl.trustStore=/path/to/truststore
+spark.ssl.trustStorePassword=password
+```
+
+### SASL Authentication
+
+Support for SASL-based authentication:
+
+```properties
+spark.authenticate=true
+spark.authenticate.secret=shared-secret
+```
+
+## Monitoring
+
+### Network Metrics
+
+Key metrics tracked:
+- Active connections
+- Bytes sent/received
+- Request latency
+- Connection failures
+
+**Access via the Spark UI**: `http://<driver-host>:4040/metrics/json/`
+
+### Logging
+
+Enable detailed network logging:
+
+```properties
+# log4j2 properties syntax (conf/log4j2.properties)
+logger.sparknetwork.name = org.apache.spark.network
+logger.sparknetwork.level = debug
+logger.netty.name = io.netty
+logger.netty.level = debug
+```
+
+## Configuration
+
+### Network Settings
+
+```properties
+# Connection timeout
+spark.network.timeout=120s
+
+# Connections per peer (shuffle transport)
+spark.shuffle.io.numConnectionsPerPeer=1
+
+# Prefer direct (off-heap) buffers
+spark.shuffle.io.preferDirectBufs=true
+
+# Maximum I/O retries on failed fetches
+spark.shuffle.io.maxRetries=3
+
+# RPC retries
+spark.rpc.numRetries=3
+spark.rpc.retry.wait=3s
+```
+
+### Shuffle Service
+
+```properties
+spark.shuffle.service.enabled=true
+spark.shuffle.service.port=7337
+spark.shuffle.service.index.cache.size=100m
+```
+
+## Best Practices
+
+1. **Reuse connections**: Don't create new clients unnecessarily
+2. **Release buffers**: Always release ByteBuf instances
+3. **Handle backpressure**: Implement flow control in handlers
+4. **Enable encryption**: Use SSL for sensitive data
+5. **Monitor metrics**: Track network performance
+6. **Configure timeouts**: Set appropriate timeout values
+7. **Use external shuffle service**: For production deployments
+
+## Troubleshooting
+
+### Connection Issues
+
+**Problem**: Connection refused or timeout
+
+**Solutions:**
+- Check firewall settings
+- Verify host and port
+- Increase timeout values
+- Check network connectivity
+
+### Memory Leaks
+
+**Problem**: Growing memory usage in network layer
+
+**Solutions:**
+- Ensure ByteBuf.release() is called
+- Check for unclosed connections
+- Monitor Netty buffer pool metrics
+
+### Slow Performance
+
+**Problem**: High network latency
+
+**Solutions:**
+- Enable native transport
+- Increase I/O threads
+- Adjust buffer sizes
+- Check network bandwidth
+
+## Internal APIs
+
+**Note**: All classes in common modules are internal APIs and may change between versions. They are not part of the public Spark API.
+
+## Further Reading
+
+- [Cluster Mode Overview](../docs/cluster-overview.md)
+- [Configuration Guide](../docs/configuration.md)
+- [Security Guide](../docs/security.md)
+
+## Contributing
+
+For contributing to common modules, see [CONTRIBUTING.md](../CONTRIBUTING.md).
+
+When adding functionality:
+- Keep dependencies minimal
+- Write comprehensive tests
+- Document public methods
+- Consider performance implications
+- Maintain backward compatibility where possible
diff --git a/core/README.md b/core/README.md
new file mode 100644
index 0000000000000..4a5be68b0342e
--- /dev/null
+++ b/core/README.md
@@ -0,0 +1,360 @@
+# Spark Core
+
+Spark Core is the foundation of the Apache Spark platform. It provides the basic functionality for distributed task dispatching, scheduling, and I/O operations.
+
+## Overview
+
+Spark Core contains the fundamental abstractions and components that all other Spark modules build upon:
+
+- **Resilient Distributed Datasets (RDDs)**: The fundamental data abstraction in Spark
+- **SparkContext**: The main entry point for Spark functionality
+- **Task Scheduling**: DAG scheduler and task scheduler for distributed execution
+- **Memory Management**: Unified memory management for execution and storage
+- **Shuffle System**: Data redistribution across partitions
+- **Storage System**: In-memory and disk-based storage for cached data
+- **Network Communication**: RPC and data transfer between driver and executors
+
+## Key Components
+
+### RDD (Resilient Distributed Dataset)
+
+The core abstraction in Spark - an immutable, distributed collection of objects that can be processed in parallel.
+
+**Key characteristics:**
+- **Resilient**: Fault-tolerant through lineage information
+- **Distributed**: Data is partitioned across cluster nodes
+- **Immutable**: Cannot be changed once created
+
+**Location**: `src/main/scala/org/apache/spark/rdd/`
+
+**Main classes:**
+- `RDD.scala`: Base RDD class with transformations and actions
+- `HadoopRDD.scala`: RDD for reading from Hadoop
+- `ParallelCollectionRDD.scala`: RDD created from a local collection
+- `MapPartitionsRDD.scala`: Result of map-like transformations
+
+### SparkContext
+
+The main entry point for Spark functionality. Creates RDDs, accumulators, and broadcast variables.
+
+**Location**: `src/main/scala/org/apache/spark/SparkContext.scala`
+
+**Key responsibilities:**
+- Connects to cluster manager
+- Acquires executors
+- Sends application code to executors
+- Creates and manages RDDs
+- Schedules and executes jobs
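+
+A minimal sketch of creating a `SparkContext` directly (most modern applications go through `SparkSession`, which manages a `SparkContext` internally):
+
+```scala
+import org.apache.spark.{SparkConf, SparkContext}
+
+val conf = new SparkConf()
+  .setAppName("core-readme-example")
+  .setMaster("local[2]")        // run locally with 2 threads; use a cluster URL in production
+
+val sc = new SparkContext(conf)
+try {
+  val rdd = sc.parallelize(1 to 10)
+  println(rdd.sum())            // 55.0
+} finally {
+  sc.stop()
+}
+```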
+
+### Scheduling
+
+#### DAGScheduler
+
+Computes a DAG of stages for each job and submits them to the TaskScheduler.
+
+**Location**: `src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala`
+
+**Responsibilities:**
+- Determines preferred locations for tasks based on cache status
+- Handles task failures and stage retries
+- Identifies shuffle boundaries to split stages
+- Manages job completion and failure
+
+#### TaskScheduler
+
+Submits task sets to the cluster, manages task execution, and retries failed tasks.
+
+**Location**: `src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala`
+
+**Implementations:**
+- `TaskSchedulerImpl`: Default implementation
+- `YarnScheduler`: YARN-specific implementation
+- Cluster manager-specific schedulers
+
+### Memory Management
+
+Unified memory management system that dynamically allocates memory between execution and storage.
+
+**Location**: `src/main/scala/org/apache/spark/memory/`
+
+**Components:**
+- `MemoryManager`: Base memory management interface
+- `UnifiedMemoryManager`: Dynamic allocation between execution and storage
+- `StorageMemoryPool`: Memory pool for caching
+- `ExecutionMemoryPool`: Memory pool for shuffles and joins
+
+**Memory regions:**
+1. **Execution Memory**: Shuffles, joins, sorts, aggregations
+2. **Storage Memory**: Caching and broadcasting
+3. **User Memory**: User data structures
+4. **Reserved Memory**: System overhead
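+
+A short sketch of how these regions are sized through configuration (the values are illustrative, not recommendations):
+
+```scala
+import org.apache.spark.SparkConf
+
+val conf = new SparkConf()
+  .set("spark.executor.memory", "4g")           // total JVM heap per executor
+  .set("spark.memory.fraction", "0.6")          // share of heap for execution + storage
+  .set("spark.memory.storageFraction", "0.5")   // storage's protected share of that region
+  .set("spark.memory.offHeap.enabled", "true")  // optionally move data off-heap
+  .set("spark.memory.offHeap.size", "2g")
+```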
+
+### Shuffle System
+
+Handles data redistribution between stages.
+
+**Location**: `src/main/scala/org/apache/spark/shuffle/`
+
+**Key classes:**
+- `ShuffleManager`: Interface for shuffle implementations
+- `SortShuffleManager`: Default shuffle implementation
+- `ShuffleWriter`: Writes shuffle data
+- `ShuffleReader`: Reads shuffle data
+
+**Shuffle process:**
+1. **Shuffle Write**: Map tasks write partitioned data to disk
+2. **Shuffle Fetch**: Reduce tasks fetch data from map outputs
+3. **Shuffle Service**: External service for serving shuffle data
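+
+For example, a simple word count shows the boundary: everything up to `reduceByKey` runs in the map stage, the shuffle write/fetch happens at `reduceByKey`, and the output is produced in the reduce stage (the paths are placeholders):
+
+```scala
+val counts = sc.textFile("hdfs://path/to/input")
+  .flatMap(_.split("\\s+"))
+  .map(word => (word, 1))
+  .reduceByKey(_ + _)          // shuffle boundary: map-side write, reduce-side fetch
+
+counts.saveAsTextFile("hdfs://path/to/output")
+```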
+
+### Storage System
+
+Block-based storage abstraction for cached data and shuffle outputs.
+
+**Location**: `src/main/scala/org/apache/spark/storage/`
+
+**Components:**
+- `BlockManager`: Manages data blocks in memory and disk
+- `MemoryStore`: In-memory block storage
+- `DiskStore`: Disk-based block storage
+- `BlockManagerMaster`: Master for coordinating block managers
+
+**Storage levels:**
+- `MEMORY_ONLY`: Store in memory only
+- `MEMORY_AND_DISK`: Spill to disk if memory is full
+- `DISK_ONLY`: Store on disk only
+- `OFF_HEAP`: Store in off-heap memory
+
+### Network Layer
+
+Communication infrastructure for driver-executor and executor-executor communication.
+
+**Location**: `src/main/scala/org/apache/spark/network/` and `common/network-*/`
+
+**Components:**
+- `NettyRpcEnv`: Netty-based RPC implementation
+- `TransportContext`: Network communication setup
+- `BlockTransferService`: Block data transfer
+
+### Serialization
+
+Efficient serialization for data and closures.
+
+**Location**: `src/main/scala/org/apache/spark/serializer/`
+
+**Serializers:**
+- `JavaSerializer`: Default Java serialization (slower)
+- `KryoSerializer`: Faster, more compact serialization (recommended)
+
+**Configuration:**
+```scala
+conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+```
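+
+Custom classes can also be registered with Kryo for more compact output (the class below is a placeholder for an application type):
+
+```scala
+import org.apache.spark.SparkConf
+
+case class ClickEvent(userId: Long, url: String)   // placeholder application class
+
+val conf = new SparkConf()
+  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+  .registerKryoClasses(Array(classOf[ClickEvent]))
+```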
+
+## API Overview
+
+### Creating RDDs
+
+```scala
+// From a local collection
+val data = Array(1, 2, 3, 4, 5)
+val rdd = sc.parallelize(data)
+
+// From external storage
+val textFile = sc.textFile("hdfs://path/to/file")
+
+// From another RDD
+val mapped = rdd.map(_ * 2)
+```
+
+### Transformations
+
+Lazy operations that define a new RDD:
+
+```scala
+val mapped = rdd.map(x => x * 2)
+val filtered = rdd.filter(x => x > 10)
+val flatMapped = rdd.flatMap(x => x.toString.split(" "))
+```
+
+### Actions
+
+Operations that trigger computation:
+
+```scala
+val count = rdd.count()
+val collected = rdd.collect()
+val reduced = rdd.reduce(_ + _)
+rdd.saveAsTextFile("hdfs://path/to/output")
+```
+
+### Caching
+
+```scala
+// Cache in memory
+rdd.cache()
+
+// Cache with specific storage level
+rdd.persist(StorageLevel.MEMORY_AND_DISK)
+
+// Remove from cache
+rdd.unpersist()
+```
+
+## Configuration
+
+Key configuration parameters (set via `SparkConf`):
+
+### Memory
+- `spark.executor.memory`: Executor memory (default: 1g)
+- `spark.memory.fraction`: Fraction for execution and storage (default: 0.6)
+- `spark.memory.storageFraction`: Fraction of spark.memory.fraction for storage (default: 0.5)
+
+### Parallelism
+- `spark.default.parallelism`: Default number of partitions (default: number of cores)
+- `spark.sql.shuffle.partitions`: Partitions for shuffle operations (default: 200)
+
+### Scheduling
+- `spark.scheduler.mode`: FIFO or FAIR (default: FIFO)
+- `spark.locality.wait`: Wait time for data-local tasks (default: 3s)
+
+### Shuffle
+- `spark.shuffle.compress`: Compress shuffle output (default: true)
+- `spark.shuffle.spill.compress`: Compress shuffle spills (default: true)
+
+See [configuration.md](../docs/configuration.md) for complete list.
+
+## Architecture
+
+### Job Execution Flow
+
+1. **Action called** → Triggers job submission
+2. **DAG construction** → DAGScheduler creates stages
+3. **Task creation** → Each stage becomes a task set
+4. **Task scheduling** → TaskScheduler assigns tasks to executors
+5. **Task execution** → Executors run tasks
+6. **Result collection** → Results returned to driver
+
+### Fault Tolerance
+
+Spark achieves fault tolerance through:
+
+1. **RDD Lineage**: Each RDD knows how to recompute from its parent RDDs
+2. **Task Retry**: Failed tasks are automatically retried
+3. **Stage Retry**: Failed stages are re-executed
+4. **Checkpoint**: Optionally save RDD to stable storage
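+
+A small sketch of checkpointing, which truncates a long lineage by saving the RDD to stable storage (the directory is a placeholder):
+
+```scala
+sc.setCheckpointDir("hdfs://path/to/checkpoints")
+
+val rdd = sc.parallelize(1 to 1000).map(_ * 2)
+rdd.checkpoint()   // materialized to the checkpoint directory on the next action
+rdd.count()        // triggers both the computation and the checkpoint
+```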
+
+## Building and Testing
+
+### Build Core Module
+
+```bash
+# Build core only
+./build/mvn -pl core -DskipTests package
+
+# Build core with dependencies
+./build/mvn -pl core -am -DskipTests package
+```
+
+### Run Tests
+
+```bash
+# Run all core tests
+./build/mvn test -pl core
+
+# Run specific test suite
+./build/mvn test -pl core -Dtest=SparkContextSuite
+
+# Run specific test
+./build/mvn test -pl core -Dtest=SparkContextSuite#testJobCancellation
+```
+
+## Source Code Organization
+
+```
+core/src/main/
+├── java/ # Java sources
+│ └── org/apache/spark/
+│ ├── api/ # Java API
+│ ├── shuffle/ # Shuffle implementation
+│ └── unsafe/ # Unsafe operations
+├── scala/ # Scala sources
+│ └── org/apache/spark/
+│ ├── rdd/ # RDD implementations
+│ ├── scheduler/ # Scheduling components
+│ ├── storage/ # Storage system
+│ ├── memory/ # Memory management
+│ ├── shuffle/ # Shuffle system
+│ ├── broadcast/ # Broadcast variables
+│ ├── deploy/ # Deployment components
+│ ├── executor/ # Executor implementation
+│ ├── io/ # I/O utilities
+│ ├── network/ # Network layer
+│ ├── serializer/ # Serialization
+│ └── util/ # Utilities
+└── resources/ # Resource files
+```
+
+## Performance Tuning
+
+### Memory Optimization
+
+1. Adjust memory fractions based on workload
+2. Use off-heap memory for large datasets
+3. Choose appropriate storage levels
+4. Avoid excessive caching
+
+### Shuffle Optimization
+
+1. Minimize shuffle operations
+2. Use `reduceByKey` instead of `groupByKey` (see the sketch after this list)
+3. Increase shuffle parallelism
+4. Enable compression
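+
+A sketch of tip 2: `reduceByKey` combines values on the map side before shuffling, while `groupByKey` ships every value across the network:
+
+```scala
+val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
+
+// Preferred: partial aggregation happens before the shuffle
+val summed = pairs.reduceByKey(_ + _)
+
+// Avoid for simple aggregations: all values for a key are shuffled, then combined
+val grouped = pairs.groupByKey().mapValues(_.sum)
+```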
+
+### Serialization Optimization
+
+1. Use Kryo serialization
+2. Register custom classes with Kryo
+3. Avoid closures with large objects
+
+### Data Locality
+
+1. Ensure data and compute are co-located
+2. Increase `spark.locality.wait` if needed
+3. Use appropriate storage levels
+
+## Common Issues and Solutions
+
+### OutOfMemoryError
+
+- Increase executor memory
+- Reduce parallelism
+- Use disk-based storage levels
+- Enable off-heap memory
+
+### Shuffle Failures
+
+- Increase shuffle memory
+- Increase shuffle parallelism
+- Enable external shuffle service
+
+### Slow Performance
+
+- Check data skew
+- Optimize shuffle operations
+- Increase parallelism
+- Enable speculation
+
+## Further Reading
+
+- [RDD Programming Guide](../docs/rdd-programming-guide.md)
+- [Cluster Mode Overview](../docs/cluster-overview.md)
+- [Tuning Guide](../docs/tuning.md)
+- [Job Scheduling](../docs/job-scheduling.md)
+- [Hardware Provisioning](../docs/hardware-provisioning.md)
+
+## Related Modules
+
+- [common/](../common/) - Common utilities shared across modules
+- [launcher/](../launcher/) - Application launcher
+- [sql/](../sql/) - Spark SQL and DataFrames
+- [streaming/](../streaming/) - Spark Streaming
diff --git a/dev/requirements.txt b/dev/requirements.txt
index 40e7fa46cf14b..8dec699878933 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -57,7 +57,7 @@ jira>=3.5.2
PyGithub
# pandas API on Spark Code formatter.
-black==23.12.1
+black==24.3.0
py
# Spark Connect (required)
diff --git a/examples/README.md b/examples/README.md
new file mode 100644
index 0000000000000..964dfaf3393c3
--- /dev/null
+++ b/examples/README.md
@@ -0,0 +1,432 @@
+# Spark Examples
+
+This directory contains example programs for Apache Spark in Scala, Java, Python, and R.
+
+## Overview
+
+The examples demonstrate various Spark features and APIs:
+
+- **Core Examples**: Basic RDD operations and transformations
+- **SQL Examples**: DataFrame and SQL operations
+- **Streaming Examples**: Stream processing with DStreams and Structured Streaming
+- **MLlib Examples**: Machine learning algorithms and pipelines
+- **GraphX Examples**: Graph processing algorithms
+
+## Running Examples
+
+### Using spark-submit
+
+The recommended way to run examples:
+
+```bash
+# Run a Scala/Java example
+./bin/run-example <example-class> [params]
+
+# Example: Run SparkPi
+./bin/run-example SparkPi 100
+
+# Example: Run with specific master
+MASTER=spark://host:7077 ./bin/run-example SparkPi 100
+```
+
+### Direct spark-submit
+
+```bash
+# Scala/Java examples
+./bin/spark-submit \
+ --class org.apache.spark.examples.SparkPi \
+ --master local[4] \
+ examples/target/scala-2.13/jars/spark-examples*.jar \
+ 100
+
+# Python examples
+./bin/spark-submit examples/src/main/python/pi.py 100
+
+# R examples
+./bin/spark-submit examples/src/main/r/dataframe.R
+```
+
+### Interactive Shells
+
+```bash
+# Scala shell with examples on classpath
+./bin/spark-shell --jars examples/target/scala-2.13/jars/spark-examples*.jar
+
+# Python shell
+./bin/pyspark
+# Then run: exec(open('examples/src/main/python/pi.py').read())
+
+# R shell
+./bin/sparkR
+# Then: source('examples/src/main/r/dataframe.R')
+```
+
+## Example Categories
+
+### Core Examples
+
+**Basic RDD Operations**
+
+- `SparkPi`: Estimates π using Monte Carlo method
+- `SparkLR`: Logistic regression using gradient descent
+- `SparkKMeans`: K-means clustering
+- `SparkPageRank`: PageRank algorithm implementation
+- `GroupByTest`: Tests groupBy performance
+
+**Locations:**
+- Scala: `src/main/scala/org/apache/spark/examples/`
+- Java: `src/main/java/org/apache/spark/examples/`
+- Python: `src/main/python/`
+- R: `src/main/r/`
+
+### SQL Examples
+
+**DataFrame and SQL Operations**
+
+- `SparkSQLExample`: Basic DataFrame operations
+- `SQLDataSourceExample`: Working with various data sources
+- `RDDRelation`: Converting between RDDs and DataFrames
+- `UserDefinedFunction`: Creating and using UDFs
+- `CsvDataSource`: Reading and writing CSV files
+
+**Running:**
+```bash
+# Scala
+./bin/run-example sql.SparkSQLExample
+
+# Python
+./bin/spark-submit examples/src/main/python/sql/basic.py
+
+# R
+./bin/spark-submit examples/src/main/r/RSparkSQLExample.R
+```
+
+### Streaming Examples
+
+**DStream Examples (Legacy)**
+
+- `NetworkWordCount`: Count words from network stream
+- `StatefulNetworkWordCount`: Stateful word count
+- `RecoverableNetworkWordCount`: Checkpoint and recovery
+- `KafkaWordCount`: Read from Apache Kafka
+- `QueueStream`: Create DStream from queue
+
+**Structured Streaming Examples**
+
+- `StructuredNetworkWordCount`: Word count using Structured Streaming
+- `StructuredKafkaWordCount`: Kafka integration
+- `StructuredSessionization`: Session window operations
+
+**Running:**
+```bash
+# DStream example
+./bin/run-example streaming.NetworkWordCount localhost 9999
+
+# Structured Streaming
+./bin/run-example sql.streaming.StructuredNetworkWordCount localhost 9999
+
+# Python Structured Streaming
+./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999
+```
+
+### MLlib Examples
+
+**Classification**
+- `LogisticRegressionExample`: Binary and multiclass classification
+- `DecisionTreeClassificationExample`: Decision tree classifier
+- `RandomForestClassificationExample`: Random forest classifier
+- `GradientBoostedTreeClassifierExample`: GBT classifier
+- `NaiveBayesExample`: Naive Bayes classifier
+
+**Regression**
+- `LinearRegressionExample`: Linear regression
+- `DecisionTreeRegressionExample`: Decision tree regressor
+- `RandomForestRegressionExample`: Random forest regressor
+- `AFTSurvivalRegressionExample`: Survival regression
+
+**Clustering**
+- `KMeansExample`: K-means clustering
+- `BisectingKMeansExample`: Bisecting K-means
+- `GaussianMixtureExample`: Gaussian mixture model
+- `LDAExample`: Latent Dirichlet Allocation
+
+**Pipelines**
+- `PipelineExample`: ML Pipeline with multiple stages
+- `CrossValidatorExample`: Model selection with cross-validation
+- `TrainValidationSplitExample`: Model selection with train/validation split
+
+**Running:**
+```bash
+# Scala
+./bin/run-example ml.LogisticRegressionExample
+
+# Java
+./bin/run-example ml.JavaLogisticRegressionExample
+
+# Python
+./bin/spark-submit examples/src/main/python/ml/logistic_regression.py
+```
+
+### GraphX Examples
+
+**Graph Algorithms**
+
+- `PageRankExample`: PageRank algorithm
+- `ConnectedComponentsExample`: Finding connected components
+- `TriangleCountExample`: Counting triangles
+- `SocialNetworkExample`: Social network analysis
+
+**Running:**
+```bash
+./bin/run-example graphx.PageRankExample
+```
+
+## Example Datasets
+
+Many examples use sample data from the `data/` directory:
+
+- `data/mllib/`: MLlib sample datasets
+ - `sample_libsvm_data.txt`: LibSVM format data
+ - `sample_binary_classification_data.txt`: Binary classification
+ - `sample_multiclass_classification_data.txt`: Multiclass classification
+
+- `data/graphx/`: GraphX sample data
+ - `followers.txt`: Social network follower data
+ - `users.txt`: User information
+
+## Building Examples
+
+### Build All Examples
+
+```bash
+# Build examples module
+./build/mvn -pl examples -am package
+
+# Skip tests
+./build/mvn -pl examples -am -DskipTests package
+```
+
+### Build Specific Language Examples
+
+The examples are compiled together, but you can run them separately by language.
+
+## Creating Your Own Examples
+
+### Scala Example Template
+
+```scala
+package org.apache.spark.examples
+
+import org.apache.spark.sql.SparkSession
+
+object MyExample {
+  def main(args: Array[String]): Unit = {
+    val spark = SparkSession
+      .builder()
+      .appName("My Example")
+      .getOrCreate()
+
+    try {
+      // Your Spark code here
+      import spark.implicits._
+      val df = spark.range(100).toDF("number")
+      df.show()
+    } finally {
+      spark.stop()
+    }
+  }
+}
+```
+
+### Python Example Template
+
+```python
+from pyspark.sql import SparkSession
+
+def main():
+    spark = SparkSession \
+        .builder \
+        .appName("My Example") \
+        .getOrCreate()
+
+    try:
+        # Your Spark code here
+        df = spark.range(100)
+        df.show()
+    finally:
+        spark.stop()
+
+if __name__ == "__main__":
+    main()
+```
+
+### Java Example Template
+
+```java
+package org.apache.spark.examples;
+
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+public class MyExample {
+    public static void main(String[] args) {
+        SparkSession spark = SparkSession
+            .builder()
+            .appName("My Example")
+            .getOrCreate();
+
+        try {
+            // Your Spark code here
+            Dataset<Long> df = spark.range(100);
+            df.show();
+        } finally {
+            spark.stop();
+        }
+    }
+}
+```
+
+### R Example Template
+
+```r
+library(SparkR)
+
+sparkR.session(appName = "My Example")
+
+# Your Spark code here
+df <- createDataFrame(data.frame(number = 1:100))
+head(df)
+
+sparkR.session.stop()
+```
+
+## Example Directory Structure
+
+```
+examples/src/main/
+├── java/org/apache/spark/examples/ # Java examples
+│ ├── JavaSparkPi.java
+│ ├── JavaWordCount.java
+│ ├── ml/ # ML examples
+│ ├── sql/ # SQL examples
+│ └── streaming/ # Streaming examples
+├── python/ # Python examples
+│ ├── pi.py
+│ ├── wordcount.py
+│ ├── ml/ # ML examples
+│ ├── sql/ # SQL examples
+│ └── streaming/ # Streaming examples
+├── r/ # R examples
+│ ├── RSparkSQLExample.R
+│ ├── ml.R
+│ └── dataframe.R
+└── scala/org/apache/spark/examples/ # Scala examples
+ ├── SparkPi.scala
+ ├── SparkLR.scala
+ ├── ml/ # ML examples
+ ├── sql/ # SQL examples
+ ├── streaming/ # Streaming examples
+ └── graphx/ # GraphX examples
+```
+
+## Common Patterns
+
+### Reading Data
+
+```scala
+// Text file
+val textData = spark.read.textFile("path/to/file.txt")
+
+// CSV
+val csvData = spark.read.option("header", "true").csv("path/to/file.csv")
+
+// JSON
+val jsonData = spark.read.json("path/to/file.json")
+
+// Parquet
+val parquetData = spark.read.parquet("path/to/file.parquet")
+```
+
+### Writing Data
+
+```scala
+// Save as text
+df.write.text("output/path")
+
+// Save as CSV
+df.write.option("header", "true").csv("output/path")
+
+// Save as Parquet
+df.write.parquet("output/path")
+
+// Save as JSON
+df.write.json("output/path")
+```
+
+### Working with Partitions
+
+```scala
+// Repartition for more parallelism
+val repartitioned = df.repartition(10)
+
+// Coalesce to reduce partitions
+val coalesced = df.coalesce(2)
+
+// Partition by column when writing
+df.write.partitionBy("year", "month").parquet("output/path")
+```
+
+## Performance Tips for Examples
+
+1. **Use Local Mode for Testing**: Start with `local[*]` for development
+2. **Adjust Partitions**: Use appropriate partition counts for your data size
+3. **Cache When Reusing**: Cache DataFrames/RDDs that are accessed multiple times (see the sketch after this list)
+4. **Monitor Jobs**: Use Spark UI at http://localhost:4040 to monitor execution
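+
+A sketch of tip 3 (the file path and column name are placeholders):
+
+```scala
+// Cache a DataFrame that several actions will reuse
+val df = spark.read.option("header", "true").csv("path/to/file.csv").cache()
+
+df.count()                        // first action materializes the cache
+df.select("someColumn").show()    // subsequent actions read from the cache
+df.unpersist()                    // release the cached data when finished
+```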
+
+## Troubleshooting
+
+### Common Issues
+
+**OutOfMemoryError**
+```bash
+# Increase driver memory
+./bin/spark-submit --driver-memory 4g examples/...
+
+# Increase executor memory
+./bin/spark-submit --executor-memory 4g examples/...
+```
+
+**Class Not Found**
+```bash
+# Make sure examples JAR is built
+./build/mvn -pl examples -am package
+```
+
+**File Not Found**
+```bash
+# Use absolute paths or ensure working directory is spark root
+./bin/run-example SparkPi # Run from spark root directory
+```
+
+## Additional Resources
+
+- [Quick Start Guide](../docs/quick-start.md)
+- [Programming Guide](../docs/programming-guide.md)
+- [SQL Programming Guide](../docs/sql-programming-guide.md)
+- [MLlib Guide](../docs/ml-guide.md)
+- [Structured Streaming Guide](../docs/structured-streaming-programming-guide.md)
+- [GraphX Guide](../docs/graphx-programming-guide.md)
+
+## Contributing Examples
+
+When adding new examples:
+
+1. Follow existing code style and structure
+2. Include clear comments explaining the example
+3. Add appropriate documentation
+4. Test the example with various inputs
+5. Add to the appropriate category
+6. Update this README
+
+For more information, see [CONTRIBUTING.md](../CONTRIBUTING.md).
diff --git a/graphx/README.md b/graphx/README.md
new file mode 100644
index 0000000000000..08c841b6c04d5
--- /dev/null
+++ b/graphx/README.md
@@ -0,0 +1,549 @@
+# GraphX
+
+GraphX is Apache Spark's API for graphs and graph-parallel computation.
+
+## Overview
+
+GraphX unifies ETL (Extract, Transform, and Load), exploratory analysis, and iterative graph computation within a single system. It provides:
+
+- **Graph Abstraction**: Efficient representation of property graphs
+- **Graph Algorithms**: PageRank, Connected Components, Triangle Counting, and more
+- **Pregel API**: For iterative graph computations
+- **Graph Builders**: Tools to construct graphs from RDDs or files
+- **Graph Operators**: Transformations and structural operations
+
+## Key Concepts
+
+### Property Graph
+
+A directed multigraph with properties attached to each vertex and edge.
+
+**Components:**
+- **Vertices**: Nodes with unique IDs and properties
+- **Edges**: Directed connections between vertices with properties
+- **Triplets**: A view joining vertices and edges
+
+```scala
+import org.apache.spark.graphx._
+import org.apache.spark.rdd.RDD
+
+// Create vertices RDD
+val vertices: RDD[(VertexId, String)] = sc.parallelize(Array(
+ (1L, "Alice"),
+ (2L, "Bob"),
+ (3L, "Charlie")
+))
+
+// Create edges RDD
+val edges: RDD[Edge[String]] = sc.parallelize(Array(
+ Edge(1L, 2L, "friend"),
+ Edge(2L, 3L, "follow")
+))
+
+// Build the graph
+val graph: Graph[String, String] = Graph(vertices, edges)
+```
+
+### Graph Structure
+
+```
+Graph[VD, ED]
+ - vertices: VertexRDD[VD] // Vertices with properties of type VD
+ - edges: EdgeRDD[ED] // Edges with properties of type ED
+ - triplets: RDD[EdgeTriplet[VD, ED]] // Combined view
+```
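+
+Continuing the example above, the three views can be inspected directly:
+
+```scala
+graph.vertices.collect().foreach(println)   // e.g. (1,Alice), (2,Bob), (3,Charlie)
+graph.edges.collect().foreach(println)      // e.g. Edge(1,2,friend), Edge(2,3,follow)
+graph.triplets
+  .map(t => s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
+  .collect()
+  .foreach(println)                         // e.g. Alice -[friend]-> Bob
+```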
+
+## Core Components
+
+### Graph Class
+
+The main graph abstraction.
+
+**Location**: `src/main/scala/org/apache/spark/graphx/Graph.scala`
+
+**Key methods:**
+- `vertices: VertexRDD[VD]`: Access vertices
+- `edges: EdgeRDD[ED]`: Access edges
+- `triplets: RDD[EdgeTriplet[VD, ED]]`: Access triplets
+- `mapVertices[VD2](map: (VertexId, VD) => VD2)`: Transform vertex properties
+- `mapEdges[ED2](map: Edge[ED] => ED2)`: Transform edge properties
+- `subgraph(epred, vpred)`: Create subgraph based on predicates
+
+### VertexRDD
+
+Optimized RDD for vertex data.
+
+**Location**: `src/main/scala/org/apache/spark/graphx/VertexRDD.scala`
+
+**Features:**
+- Fast lookups by vertex ID
+- Efficient joins with edge data
+- Reuse of vertex indices
+
+### EdgeRDD
+
+Optimized RDD for edge data.
+
+**Location**: `src/main/scala/org/apache/spark/graphx/EdgeRDD.scala`
+
+**Features:**
+- Compact edge storage
+- Fast filtering and mapping
+- Efficient partitioning
+
+### EdgeTriplet
+
+Represents an edge together with its source and destination vertex properties.
+
+**Structure:**
+```scala
+class EdgeTriplet[VD, ED] extends Edge[ED] {
+ var srcAttr: VD // Source vertex property
+ var dstAttr: VD // Destination vertex property
+ var attr: ED // Edge property
+}
+```
+
+## Graph Operators
+
+### Property Operators
+
+```scala
+// Map vertex properties
+val upperCased = graph.mapVertices((id, attr) => attr.toUpperCase)
+
+// Map edge properties
+val relabeled = graph.mapEdges(e => e.attr + " relationship")
+
+// Map triplets (access to src and dst properties)
+val tripletView = graph.mapTriplets(triplet =>
+  (triplet.srcAttr, triplet.attr, triplet.dstAttr)
+)
+```
+
+### Structural Operators
+
+```scala
+// Reverse edge directions
+val reversedGraph = graph.reverse
+
+// Create subgraph
+val subgraph = graph.subgraph(
+ epred = e => e.srcId != e.dstId, // No self-loops
+ vpred = (id, attr) => attr.length > 0 // Non-empty names
+)
+
+// Mask graph (keep only edges/vertices in another graph)
+val maskedGraph = graph.mask(subgraph)
+
+// Group edges
+val groupedGraph = graph.groupEdges((e1, e2) => e1 + e2)
+```
+
+### Join Operators
+
+```scala
+// Join vertices with external data (joinVertices keeps the original vertex type)
+val newData: RDD[(VertexId, String)] = ...
+val joined = graph.joinVertices(newData) {
+  (id, oldAttr, newAttr) => oldAttr + ":" + newAttr
+}
+
+// Outer join vertices (the new attribute is an Option; the vertex type may change)
+val outerJoined = graph.outerJoinVertices(newData) {
+  (id, oldAttr, newAttrOpt) => newAttrOpt.getOrElse(oldAttr)
+}
+```
+
+## Graph Algorithms
+
+GraphX includes several common graph algorithms.
+
+**Location**: `src/main/scala/org/apache/spark/graphx/lib/`
+
+### PageRank
+
+Measures the importance of each vertex based on link structure.
+
+```scala
+import org.apache.spark.graphx.lib.PageRank
+
+// Static PageRank (fixed number of iterations)
+val staticRanks = graph.staticPageRank(numIter = 10)
+
+// Dynamic PageRank (runs until the per-vertex change drops below `tol`)
+val ranks = graph.pageRank(tol = 0.001)
+
+// Get top ranked vertices
+val topRanked = ranks.vertices.top(10)(Ordering.by(_._2))
+```
+
+**File**: `src/main/scala/org/apache/spark/graphx/lib/PageRank.scala`
+
+### Connected Components
+
+Finds connected components in the graph.
+
+```scala
+import org.apache.spark.graphx.lib.ConnectedComponents
+
+// Find connected components
+val cc = graph.connectedComponents()
+
+// Count vertices in each component
+val componentCounts = cc.vertices
+ .map { case (id, component) => (component, 1) }
+ .reduceByKey(_ + _)
+```
+
+**File**: `src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala`
+
+### Triangle Counting
+
+Counts triangles (3-cliques) in the graph.
+
+```scala
+import org.apache.spark.graphx.lib.TriangleCount
+
+// Count triangles
+val triCounts = graph.triangleCount()
+
+// Get vertices with most triangles
+val topTriangles = triCounts.vertices.top(10)(Ordering.by(_._2))
+```
+
+**File**: `src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala`
+
+### Label Propagation
+
+Community detection algorithm.
+
+```scala
+import org.apache.spark.graphx.lib.LabelPropagation
+
+// Run label propagation
+val communities = LabelPropagation.run(graph, maxSteps = 5)
+
+// Group vertices by community
+val communityGroups = communities.vertices
+ .map { case (id, label) => (label, Set(id)) }
+ .reduceByKey(_ ++ _)
+```
+
+**File**: `src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala`
+
+### Strongly Connected Components
+
+Finds strongly connected components in a directed graph.
+
+```scala
+import org.apache.spark.graphx.lib.StronglyConnectedComponents
+
+// Find strongly connected components
+val scc = graph.stronglyConnectedComponents(numIter = 10)
+```
+
+**File**: `src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala`
+
+### Shortest Paths
+
+Computes shortest paths from source vertices to all reachable vertices.
+
+```scala
+import org.apache.spark.graphx.lib.ShortestPaths
+
+// Compute shortest paths from vertices 1 and 2
+val landmarks = Seq(1L, 2L)
+val results = ShortestPaths.run(graph, landmarks)
+
+// Results contain distance to each landmark
+results.vertices.foreach { case (id, distances) =>
+ println(s"Vertex $id: $distances")
+}
+```
+
+**File**: `src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala`
+
+## Pregel API
+
+Bulk-synchronous parallel messaging abstraction for iterative graph algorithms.
+
+```scala
+def pregel[A: ClassTag](
+ initialMsg: A,
+ maxIterations: Int = Int.MaxValue,
+ activeDirection: EdgeDirection = EdgeDirection.Either
+)(
+ vprog: (VertexId, VD, A) => VD,
+ sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
+ mergeMsg: (A, A) => A
+): Graph[VD, ED]
+```
+
+**Example: Single-Source Shortest Path**
+
+```scala
+val sourceId: VertexId = 1L
+
+// Initialize distances
+val initialGraph = graph.mapVertices((id, _) =>
+ if (id == sourceId) 0.0 else Double.PositiveInfinity
+)
+
+// Run Pregel
+val sssp = initialGraph.pregel(Double.PositiveInfinity)(
+ // Vertex program: update vertex value with minimum distance
+ (id, dist, newDist) => math.min(dist, newDist),
+
+ // Send message: send distance + edge weight to neighbors
+ triplet => {
+ if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
+ Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
+ } else {
+ Iterator.empty
+ }
+ },
+
+ // Merge messages: take minimum distance
+ (a, b) => math.min(a, b)
+)
+```
+
+**File**: `src/main/scala/org/apache/spark/graphx/Pregel.scala`
+
+## Graph Builders
+
+### From Edge List
+
+```scala
+// Load edge list from file
+val graph = GraphLoader.edgeListFile(sc, "path/to/edges.txt")
+
+// Edge file format: source destination
+// Example:
+// 1 2
+// 2 3
+// 3 1
+```
+
+### From RDDs
+
+```scala
+val vertices: RDD[(VertexId, VD)] = ...
+val edges: RDD[Edge[ED]] = ...
+
+val graph = Graph(vertices, edges)
+
+// With default vertex property
+val graph = Graph.fromEdges(edges, defaultValue = "Unknown")
+
+// From edge tuples
+val edgeTuples: RDD[(VertexId, VertexId)] = ...
+val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1)
+```
+
+## Partitioning Strategies
+
+Efficient graph partitioning is crucial for performance.
+
+**Available strategies:**
+- `EdgePartition1D`: Partition edges by source vertex
+- `EdgePartition2D`: 2D matrix partitioning
+- `RandomVertexCut`: Hash-partitions edges by source and destination vertex IDs
+- `CanonicalRandomVertexCut`: Similar to RandomVertexCut but canonical
+
+```scala
+import org.apache.spark.graphx.PartitionStrategy
+
+val graph = Graph(vertices, edges)
+ .partitionBy(PartitionStrategy.EdgePartition2D)
+```
+
+**Location**: `src/main/scala/org/apache/spark/graphx/PartitionStrategy.scala`
+
+## Performance Optimization
+
+### Caching
+
+```scala
+// Cache graph in memory
+graph.cache()
+
+// Or persist with storage level
+graph.persist(StorageLevel.MEMORY_AND_DISK)
+
+// Unpersist when done
+graph.unpersist()
+```
+
+### Partitioning
+
+```scala
+// Repartition for better balance
+val partitionedGraph = graph
+ .partitionBy(PartitionStrategy.EdgePartition2D, numPartitions = 100)
+ .cache()
+```
+
+### Checkpointing
+
+For iterative algorithms, checkpoint periodically:
+
+```scala
+sc.setCheckpointDir("hdfs://checkpoint")
+
+var graph = initialGraph
+for (i <- 1 to maxIterations) {
+ // Perform iteration
+ graph = performIteration(graph)
+
+ // Checkpoint every 10 iterations
+ if (i % 10 == 0) {
+ graph.checkpoint()
+ }
+}
+```
+
+## Building and Testing
+
+### Build GraphX Module
+
+```bash
+# Build graphx module
+./build/mvn -pl graphx -am package
+
+# Skip tests
+./build/mvn -pl graphx -am -DskipTests package
+```
+
+### Run Tests
+
+```bash
+# Run all graphx tests
+./build/mvn test -pl graphx
+
+# Run specific test suite
+./build/mvn test -pl graphx -Dtest=GraphSuite
+```
+
+## Source Code Organization
+
+```
+graphx/src/main/
+├── scala/org/apache/spark/graphx/
+│ ├── Graph.scala # Main graph class
+│ ├── GraphOps.scala # Graph operations
+│ ├── VertexRDD.scala # Vertex RDD
+│ ├── EdgeRDD.scala # Edge RDD
+│ ├── Edge.scala # Edge class
+│ ├── EdgeTriplet.scala # Edge triplet
+│ ├── Pregel.scala # Pregel API
+│ ├── GraphLoader.scala # Graph loading utilities
+│ ├── PartitionStrategy.scala # Partitioning strategies
+│ ├── impl/ # Implementation details
+│ │ ├── GraphImpl.scala # Graph implementation
+│ │ ├── VertexRDDImpl.scala # VertexRDD implementation
+│ │ ├── EdgeRDDImpl.scala # EdgeRDD implementation
+│ │ └── ReplicatedVertexView.scala # Vertex replication
+│ ├── lib/ # Graph algorithms
+│ │ ├── PageRank.scala
+│ │ ├── ConnectedComponents.scala
+│ │ ├── TriangleCount.scala
+│ │ ├── LabelPropagation.scala
+│ │ ├── StronglyConnectedComponents.scala
+│ │ └── ShortestPaths.scala
+│ └── util/ # Utilities
+│ ├── BytecodeUtils.scala
+│ └── GraphGenerators.scala # Test graph generation
+└── resources/
+```
+
+## Examples
+
+See [examples/src/main/scala/org/apache/spark/examples/graphx/](../examples/src/main/scala/org/apache/spark/examples/graphx/) for complete examples.
+
+**Key examples:**
+- `PageRankExample.scala`: PageRank on social network
+- `ConnectedComponentsExample.scala`: Finding connected components
+- `SocialNetworkExample.scala`: Complete social network analysis
+
+## Common Use Cases
+
+### Social Network Analysis
+
+```scala
+// Load social network
+val users: RDD[(VertexId, String)] = sc.textFile("users.txt")
+ .map(line => (line.split(",")(0).toLong, line.split(",")(1)))
+
+val relationships: RDD[Edge[String]] = sc.textFile("relationships.txt")
+ .map { line =>
+ val fields = line.split(",")
+ Edge(fields(0).toLong, fields(1).toLong, fields(2))
+ }
+
+val graph = Graph(users, relationships)
+
+// Find influential users (PageRank)
+val ranks = graph.pageRank(0.001).vertices
+
+// Find communities
+val communities = org.apache.spark.graphx.lib.LabelPropagation.run(graph, 5)
+
+// Count mutual friends (triangles)
+val triangles = graph.triangleCount()
+```
+
+### Web Graph Analysis
+
+```scala
+// Load web graph
+val graph = GraphLoader.edgeListFile(sc, "web-graph.txt")
+
+// Compute PageRank
+val ranks = graph.pageRank(0.001)
+
+// Find authoritative pages
+val topPages = ranks.vertices.top(100)(Ordering.by(_._2))
+```
+
+### Road Network Analysis
+
+```scala
+// Vertices are intersections, edges are roads
+val roadNetwork: Graph[String, Double] = ...
+
+// Find shortest paths from landmarks
+val landmarks = Seq(1L, 2L, 3L)
+val distances = org.apache.spark.graphx.lib.ShortestPaths.run(roadNetwork, landmarks)
+
+// Find highly connected intersections
+val degrees = roadNetwork.degrees
+val busyIntersections = degrees.top(10)(Ordering.by(_._2))
+```
+
+## Best Practices
+
+1. **Partition carefully**: Use appropriate partitioning strategy for your workload
+2. **Cache graphs**: Cache graphs that are accessed multiple times
+3. **Avoid unnecessary materialization**: GraphX uses lazy evaluation
+4. **Use GraphLoader**: For simple edge lists, use GraphLoader
+5. **Monitor memory**: Graph algorithms can be memory-intensive
+6. **Checkpoint long lineages**: Checkpoint periodically in iterative algorithms
+7. **Consider edge direction**: Many operations respect edge direction
+
+## Limitations and Considerations
+
+- **No mutable graphs**: Graphs are immutable; modifications create new graphs
+- **Memory overhead**: Vertex replication can increase memory usage
+- **Edge direction**: Operations may behave differently on directed vs undirected graphs
+- **Single-machine graphs**: For small graphs (< 1M edges), NetworkX or igraph may be faster
+
+## Further Reading
+
+- [GraphX Programming Guide](../docs/graphx-programming-guide.md)
+- [GraphX Paper](http://www.vldb.org/pvldb/vol7/p1673-xin.pdf)
+- [Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf)
+
+## Contributing
+
+For contributing to GraphX, see [CONTRIBUTING.md](../CONTRIBUTING.md).
diff --git a/launcher/README.md b/launcher/README.md
new file mode 100644
index 0000000000000..7b49bcba4cfae
--- /dev/null
+++ b/launcher/README.md
@@ -0,0 +1,475 @@
+# Spark Launcher
+
+The Spark Launcher library provides a programmatic interface for launching Spark applications.
+
+## Overview
+
+The Launcher module allows you to:
+- Launch Spark applications programmatically from Java/Scala code
+- Monitor application state and output
+- Manage Spark processes
+- Build command-line arguments programmatically
+
+This is an alternative to invoking `spark-submit` via shell commands.
+
+## Key Components
+
+### SparkLauncher
+
+The main class for launching Spark applications.
+
+**Location**: `src/main/java/org/apache/spark/launcher/SparkLauncher.java`
+
+**Basic Usage:**
+```java
+import org.apache.spark.launcher.SparkLauncher;
+
+SparkLauncher launcher = new SparkLauncher()
+ .setAppResource("/path/to/app.jar")
+ .setMainClass("com.example.MyApp")
+ .setMaster("spark://master:7077")
+ .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
+ .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g")
+ .addAppArgs("arg1", "arg2");
+
+Process spark = launcher.launch();
+spark.waitFor();
+```
+
+### SparkAppHandle
+
+Interface for monitoring launched applications.
+
+**Location**: `src/main/java/org/apache/spark/launcher/SparkAppHandle.java`
+
+**Usage:**
+```java
+import org.apache.spark.launcher.SparkAppHandle;
+
+SparkAppHandle handle = launcher.startApplication();
+
+// Add listener for state changes
+handle.addListener(new SparkAppHandle.Listener() {
+ @Override
+ public void stateChanged(SparkAppHandle handle) {
+ System.out.println("State: " + handle.getState());
+ }
+
+ @Override
+ public void infoChanged(SparkAppHandle handle) {
+ System.out.println("App ID: " + handle.getAppId());
+ }
+});
+
+// Wait for completion
+while (!handle.getState().isFinal()) {
+ Thread.sleep(1000);
+}
+```
+
+## API Reference
+
+### Configuration Methods
+
+```java
+SparkLauncher launcher = new SparkLauncher();
+
+// Application settings
+launcher.setAppResource("/path/to/app.jar");
+launcher.setMainClass("com.example.MainClass");
+launcher.setAppName("MyApplication");
+
+// Cluster settings
+launcher.setMaster("spark://master:7077");
+launcher.setDeployMode("cluster");
+
+// Resource settings
+launcher.setConf(SparkLauncher.DRIVER_MEMORY, "2g");
+launcher.setConf(SparkLauncher.EXECUTOR_MEMORY, "4g");
+launcher.setConf(SparkLauncher.EXECUTOR_CORES, "2");
+
+// Additional configurations
+launcher.setConf("spark.executor.instances", "5");
+launcher.setConf("spark.sql.shuffle.partitions", "200");
+
+// Dependencies
+launcher.addJar("/path/to/dependency.jar");
+launcher.addFile("/path/to/file.txt");
+launcher.addPyFile("/path/to/module.py");
+
+// Application arguments
+launcher.addAppArgs("arg1", "arg2", "arg3");
+
+// Environment
+launcher.setSparkHome("/path/to/spark");
+launcher.setPropertiesFile("/path/to/spark-defaults.conf");
+launcher.setVerbose(true);
+```
+
+### Launch Methods
+
+```java
+// Launch and return Process handle
+Process process = launcher.launch();
+
+// Launch and return SparkAppHandle for monitoring
+SparkAppHandle handle = launcher.startApplication();
+
+// Start the application and register listeners in the same call
+SparkAppHandle handleWithListener = launcher.startApplication(
+    new SparkAppHandle.Listener() {
+        // Listener implementation
+    }
+);
+```
+
+### Constants
+
+Common configuration keys are available as constants:
+
+```java
+SparkLauncher.SPARK_MASTER // "spark.master"
+SparkLauncher.APP_RESOURCE // "spark.app.resource"
+SparkLauncher.APP_NAME // "spark.app.name"
+SparkLauncher.DRIVER_MEMORY // "spark.driver.memory"
+SparkLauncher.DRIVER_EXTRA_CLASSPATH // "spark.driver.extraClassPath"
+SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS // "spark.driver.extraJavaOptions"
+SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH // "spark.driver.extraLibraryPath"
+SparkLauncher.EXECUTOR_MEMORY // "spark.executor.memory"
+SparkLauncher.EXECUTOR_CORES // "spark.executor.cores"
+SparkLauncher.EXECUTOR_EXTRA_CLASSPATH // "spark.executor.extraClassPath"
+SparkLauncher.EXECUTOR_EXTRA_JAVA_OPTIONS // "spark.executor.extraJavaOptions"
+SparkLauncher.EXECUTOR_EXTRA_LIBRARY_PATH // "spark.executor.extraLibraryPath"
+```
+
+## Application States
+
+The `SparkAppHandle.State` enum represents application lifecycle states:
+
+- `UNKNOWN`: Initial state
+- `CONNECTED`: Connected to Spark
+- `SUBMITTED`: Application submitted
+- `RUNNING`: Application running
+- `FINISHED`: Completed successfully
+- `FAILED`: Failed with error
+- `KILLED`: Killed by user
+- `LOST`: Connection lost
+
+**Check if final:**
+```java
+if (handle.getState().isFinal()) {
+ // Application has completed
+}
+```
+
+## Examples
+
+### Launch Scala Application
+
+```java
+import org.apache.spark.launcher.SparkLauncher;
+
+public class LaunchSparkApp {
+ public static void main(String[] args) throws Exception {
+ Process spark = new SparkLauncher()
+ .setAppResource("/path/to/app.jar")
+ .setMainClass("com.example.SparkApp")
+ .setMaster("local[2]")
+ .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
+ .launch();
+
+ spark.waitFor();
+ System.exit(spark.exitValue());
+ }
+}
+```
+
+### Launch Python Application
+
+```java
+SparkLauncher launcher = new SparkLauncher()
+ .setAppResource("/path/to/app.py")
+ .setMaster("yarn")
+ .setDeployMode("cluster")
+ .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g")
+ .addPyFile("/path/to/dependency.py")
+ .addAppArgs("--input", "/data/input", "--output", "/data/output");
+
+SparkAppHandle handle = launcher.startApplication();
+```
+
+### Monitor Application with Listener
+
+```java
+import org.apache.spark.launcher.SparkAppHandle;
+
+class MyListener implements SparkAppHandle.Listener {
+ @Override
+ public void stateChanged(SparkAppHandle handle) {
+ SparkAppHandle.State state = handle.getState();
+ System.out.println("Application state changed to: " + state);
+
+ if (state.isFinal()) {
+ if (state == SparkAppHandle.State.FINISHED) {
+ System.out.println("Application completed successfully");
+ } else {
+ System.out.println("Application failed: " + state);
+ }
+ }
+ }
+
+ @Override
+ public void infoChanged(SparkAppHandle handle) {
+ System.out.println("Application ID: " + handle.getAppId());
+ }
+}
+
+// Use the listener
+SparkAppHandle handle = new SparkLauncher()
+ .setAppResource("/path/to/app.jar")
+ .setMainClass("com.example.App")
+ .setMaster("spark://master:7077")
+ .startApplication(new MyListener());
+```
+
+### Capture Output
+
+```java
+import java.io.*;
+
+Process spark = new SparkLauncher()
+ .setAppResource("/path/to/app.jar")
+ .setMainClass("com.example.App")
+ .setMaster("local")
+ .redirectOutput(ProcessBuilder.Redirect.PIPE)
+ .redirectError(ProcessBuilder.Redirect.PIPE)
+ .launch();
+
+// Read output
+BufferedReader reader = new BufferedReader(
+ new InputStreamReader(spark.getInputStream())
+);
+String line;
+while ((line = reader.readLine()) != null) {
+ System.out.println(line);
+}
+
+spark.waitFor();
+```
+
+### Kill Running Application
+
+```java
+SparkAppHandle handle = launcher.startApplication();
+
+// Later, kill the application
+handle.kill();
+
+// Or stop gracefully
+handle.stop();
+```
+
+## In-Process Launcher
+
+For testing or special cases, launch Spark in the same JVM:
+
+```java
+import org.apache.spark.launcher.InProcessLauncher;
+
+InProcessLauncher launcher = new InProcessLauncher();
+// Configure launcher...
+SparkAppHandle handle = launcher.startApplication();
+```
+
+**Note**: This is primarily for testing. Production code should use `SparkLauncher`.
+
+## Building and Testing
+
+### Build Launcher Module
+
+```bash
+# Build launcher module
+./build/mvn -pl launcher -am package
+
+# Skip tests
+./build/mvn -pl launcher -am -DskipTests package
+```
+
+### Run Tests
+
+```bash
+# Run all launcher tests
+./build/mvn test -pl launcher
+
+# Run specific test
+./build/mvn test -pl launcher -Dtest=SparkLauncherSuite
+```
+
+## Source Code Organization
+
+```
+launcher/src/main/java/org/apache/spark/launcher/
+├── SparkLauncher.java # Main launcher class
+├── SparkAppHandle.java # Application handle interface
+├── AbstractLauncher.java # Base launcher implementation
+├── InProcessLauncher.java # In-process launcher (testing)
+├── Main.java # Entry point for spark-submit
+├── SparkSubmitCommandBuilder.java # Builds spark-submit commands
+├── CommandBuilderUtils.java # Command building utilities
+└── LauncherBackend.java # Backend communication
+```
+
+## Integration with spark-submit
+
+The Launcher library is used internally by `spark-submit`:
+
+```
+spark-submit script
+ ↓
+Main.main()
+ ↓
+SparkSubmitCommandBuilder
+ ↓
+Launch JVM with SparkSubmit
+```
+
+## Configuration Priority
+
+Configuration values are resolved in this order (highest priority first):
+
+1. Values set via `setConf()` or specific setters
+2. Properties file specified with `setPropertiesFile()`
+3. `conf/spark-defaults.conf` in `SPARK_HOME`
+4. Environment variables
+
+## Environment Variables
+
+The launcher respects these environment variables:
+
+- `SPARK_HOME`: Spark installation directory
+- `JAVA_HOME`: Java installation directory
+- `SPARK_CONF_DIR`: Configuration directory
+- `HADOOP_CONF_DIR`: Hadoop configuration directory
+- `YARN_CONF_DIR`: YARN configuration directory
+
+## Security Considerations
+
+When launching applications programmatically:
+
+1. **Validate inputs**: Sanitize application arguments
+2. **Secure credentials**: Don't hardcode secrets
+3. **Limit permissions**: Run with minimal required privileges
+4. **Monitor processes**: Track launched applications
+5. **Clean up resources**: Always close handles and processes
+
+## Common Use Cases
+
+### Workflow Orchestration
+
+Launch Spark jobs as part of data pipelines:
+
+```java
+public class DataPipeline {
+ public void runStage(String stageName, String mainClass) throws Exception {
+ SparkAppHandle handle = new SparkLauncher()
+ .setAppResource("/path/to/pipeline.jar")
+ .setMainClass(mainClass)
+ .setMaster("yarn")
+ .setAppName("Pipeline-" + stageName)
+ .startApplication();
+
+ // Wait for completion
+ while (!handle.getState().isFinal()) {
+ Thread.sleep(1000);
+ }
+
+ if (handle.getState() != SparkAppHandle.State.FINISHED) {
+ throw new RuntimeException("Stage " + stageName + " failed");
+ }
+ }
+}
+```
+
+### Testing
+
+Launch Spark applications in integration tests:
+
+```java
+@Test
+public void testSparkApp() throws Exception {
+ SparkAppHandle handle = new SparkLauncher()
+ .setAppResource("target/test-app.jar")
+ .setMainClass("com.example.TestApp")
+ .setMaster("local[2]")
+ .startApplication();
+
+  // Wait for completion by polling the handle state (60 second timeout)
+  long deadline = System.currentTimeMillis() + 60_000;
+  while (!handle.getState().isFinal() && System.currentTimeMillis() < deadline) {
+    Thread.sleep(500);
+  }
+
+ assertEquals(SparkAppHandle.State.FINISHED, handle.getState());
+}
+```
+
+### Resource Management
+
+Launch applications with dynamic resource allocation:
+
+```java
+int executors = calculateRequiredExecutors(dataSize);
+String memory = calculateMemory(dataSize);
+
+SparkLauncher launcher = new SparkLauncher()
+ .setAppResource("/path/to/app.jar")
+ .setMainClass("com.example.App")
+ .setMaster("yarn")
+ .setConf("spark.executor.instances", String.valueOf(executors))
+ .setConf(SparkLauncher.EXECUTOR_MEMORY, memory)
+ .setConf("spark.dynamicAllocation.enabled", "true");
+```
+
+## Best Practices
+
+1. **Use SparkAppHandle**: Monitor application state
+2. **Add listeners**: Track state changes and failures
+3. **Set timeouts**: Don't wait indefinitely
+4. **Handle errors**: Check exit codes and states
+5. **Clean up**: Stop handles and processes
+6. **Log everything**: Record launches and outcomes
+7. **Use constants**: Use SparkLauncher constants for config keys
+
+## Troubleshooting
+
+### Application Not Starting
+
+**Check:**
+- SPARK_HOME is set correctly
+- Application JAR path is correct
+- Master URL is valid
+- Required resources are available
+
+### Process Hangs
+
+**Solutions:**
+- Add a timeout: poll `handle.getState()` until it `isFinal()` with a deadline, or use `process.waitFor(timeout, unit)` when using `launch()`
+- Check for deadlocks in application
+- Verify cluster has capacity
+- Check logs for issues
+
+### Cannot Monitor Application
+
+**Solutions:**
+- Use `startApplication()` instead of `launch()`
+- Add listener before starting
+- Check for connection issues
+- Verify cluster is accessible
+
+## Further Reading
+
+- [Submitting Applications](../docs/submitting-applications.md)
+- [Cluster Mode Overview](../docs/cluster-overview.md)
+- [Configuration Guide](../docs/configuration.md)
+
+## API Documentation
+
+Full JavaDoc available in the built JAR or online at:
+https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/package-summary.html
diff --git a/mllib/README.md b/mllib/README.md
new file mode 100644
index 0000000000000..dd62159f84fef
--- /dev/null
+++ b/mllib/README.md
@@ -0,0 +1,514 @@
+# MLlib - Machine Learning Library
+
+MLlib is Apache Spark's scalable machine learning library.
+
+## Overview
+
+MLlib provides:
+
+- **ML Algorithms**: Classification, regression, clustering, collaborative filtering
+- **Featurization**: Feature extraction, transformation, dimensionality reduction, selection
+- **Pipelines**: Tools for constructing, evaluating, and tuning ML workflows
+- **Utilities**: Linear algebra, statistics, data handling
+
+## Important Note
+
+MLlib includes two packages:
+
+1. **`spark.ml`** (DataFrame-based API) - **Primary API** (Recommended)
+2. **`spark.mllib`** (RDD-based API) - **Maintenance mode only**
+
+The RDD-based API (`spark.mllib`) is in maintenance mode. The DataFrame-based API (`spark.ml`) is the primary API and is recommended for all new applications.
+
+## Package Structure
+
+### spark.ml (Primary API)
+
+**Location**: `src/main/scala/org/apache/spark/ml/` (inside this `mllib` module)
+
+DataFrame-based API with:
+- **ML Pipeline API**: For building ML workflows
+- **Transformers**: Feature transformers
+- **Estimators**: Learning algorithms
+- **Models**: Fitted models
+
+```scala
+import org.apache.spark.ml.Pipeline
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.ml.feature.VectorAssembler
+
+// Create pipeline
+val assembler = new VectorAssembler()
+ .setInputCols(Array("feature1", "feature2"))
+ .setOutputCol("features")
+
+val lr = new LogisticRegression()
+ .setMaxIter(10)
+
+val pipeline = new Pipeline().setStages(Array(assembler, lr))
+
+// Fit model
+val model = pipeline.fit(trainingData)
+
+// Make predictions
+val predictions = model.transform(testData)
+```
+
+### spark.mllib (RDD-based API - Maintenance Mode)
+
+**Location**: `src/main/scala/org/apache/spark/mllib/`
+
+RDD-based API with:
+- Classic algorithms using RDDs
+- Maintained for backward compatibility
+- No new features added
+
+```scala
+import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+// Train model (old API)
+val data: RDD[LabeledPoint] = ...
+val model = new LogisticRegressionWithLBFGS().run(data)
+
+// Make predictions
+val predictions = data.map { point => model.predict(point.features) }
+```
+
+## Key Concepts
+
+### Pipeline API (spark.ml)
+
+Machine learning pipelines provide:
+
+1. **DataFrame**: Unified data representation
+2. **Transformer**: Algorithms that transform DataFrames
+3. **Estimator**: Algorithms that fit on DataFrames to produce Transformers
+4. **Pipeline**: Chains multiple Transformers and Estimators
+5. **Parameter**: Common API for specifying parameters
+
+**Example Pipeline:**
+```scala
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
+
+// Configure pipeline stages
+val tokenizer = new Tokenizer()
+ .setInputCol("text")
+ .setOutputCol("words")
+
+val hashingTF = new HashingTF()
+ .setInputCol("words")
+ .setOutputCol("features")
+
+val lr = new LogisticRegression()
+ .setMaxIter(10)
+
+val pipeline = new Pipeline()
+ .setStages(Array(tokenizer, hashingTF, lr))
+
+// Fit the pipeline
+val model = pipeline.fit(trainingData)
+
+// Make predictions
+model.transform(testData)
+```
+
+### Transformers
+
+Algorithms that transform one DataFrame into another.
+
+**Examples:**
+- `Tokenizer`: Splits text into words
+- `HashingTF`: Maps word sequences to feature vectors
+- `StandardScaler`: Normalizes features
+- `VectorAssembler`: Combines multiple columns into a vector
+- `PCA`: Dimensionality reduction
+
+### Estimators
+
+Algorithms that fit on a DataFrame to produce a Transformer.
+
+**Examples:**
+- `LogisticRegression`: Produces LogisticRegressionModel
+- `DecisionTreeClassifier`: Produces DecisionTreeClassificationModel
+- `KMeans`: Produces KMeansModel
+- `StringIndexer`: Produces StringIndexerModel
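+
+A quick sketch of the relationship (assuming a DataFrame `df` with a string `category` column): fitting an Estimator produces a Transformer.
+
+```scala
+import org.apache.spark.ml.feature.StringIndexer
+
+val indexer = new StringIndexer()   // Estimator
+  .setInputCol("category")
+  .setOutputCol("categoryIndex")
+
+val indexerModel = indexer.fit(df)  // produces a Transformer (StringIndexerModel)
+val indexed = indexerModel.transform(df)
+```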
+
+## ML Algorithms
+
+### Classification
+
+**Binary and Multiclass:**
+- Logistic Regression
+- Decision Tree Classifier
+- Random Forest Classifier
+- Gradient-Boosted Tree Classifier
+- Naive Bayes
+- Linear Support Vector Machine
+
+**Multilabel:**
+- OneVsRest
+
+**Example:**
+```scala
+import org.apache.spark.ml.classification.LogisticRegression
+
+val lr = new LogisticRegression()
+ .setMaxIter(10)
+ .setRegParam(0.3)
+ .setElasticNetParam(0.8)
+
+val model = lr.fit(trainingData)
+val predictions = model.transform(testData)
+```
+
+**Location**: `src/main/scala/org/apache/spark/ml/classification/`
+
+### Regression
+
+- Linear Regression
+- Generalized Linear Regression
+- Decision Tree Regression
+- Random Forest Regression
+- Gradient-Boosted Tree Regression
+- Survival Regression (AFT)
+- Isotonic Regression
+
+**Example:**
+```scala
+import org.apache.spark.ml.regression.LinearRegression
+
+val lr = new LinearRegression()
+ .setMaxIter(10)
+ .setRegParam(0.3)
+ .setElasticNetParam(0.8)
+
+val model = lr.fit(trainingData)
+```
+
+**Location**: `src/main/scala/org/apache/spark/ml/regression/`
+
+### Clustering
+
+- K-means
+- Latent Dirichlet Allocation (LDA)
+- Bisecting K-means
+- Gaussian Mixture Model (GMM)
+
+**Example:**
+```scala
+import org.apache.spark.ml.clustering.KMeans
+
+val kmeans = new KMeans()
+ .setK(3)
+ .setSeed(1L)
+
+val model = kmeans.fit(dataset)
+val predictions = model.transform(dataset)
+```
+
+**Location**: `src/main/scala/org/apache/spark/ml/clustering/`
+
+### Collaborative Filtering
+
+Alternating Least Squares (ALS) for recommendation systems.
+
+**Example:**
+```scala
+import org.apache.spark.ml.recommendation.ALS
+
+val als = new ALS()
+ .setMaxIter(10)
+ .setRegParam(0.01)
+ .setUserCol("userId")
+ .setItemCol("movieId")
+ .setRatingCol("rating")
+
+val model = als.fit(ratings)
+val predictions = model.transform(testData)
+```
+
+**Location**: `src/main/scala/org/apache/spark/ml/recommendation/`
+
+## Feature Engineering
+
+### Feature Extractors
+
+- `TF-IDF`: Text feature extraction
+- `Word2Vec`: Word embeddings
+- `CountVectorizer`: Converts text to vectors of token counts
+
+### Feature Transformers
+
+- `Tokenizer`: Text tokenization
+- `StopWordsRemover`: Removes stop words
+- `StringIndexer`: Encodes string labels to indices
+- `IndexToString`: Converts indices back to strings
+- `OneHotEncoder`: One-hot encoding
+- `VectorAssembler`: Combines columns into feature vector
+- `StandardScaler`: Standardizes features
+- `MinMaxScaler`: Scales features to a range
+- `Normalizer`: Normalizes vectors to unit norm
+- `Binarizer`: Binarizes based on threshold
+
+### Feature Selectors
+
+- `VectorSlicer`: Extracts subset of features
+- `RFormula`: R model formula for feature specification
+- `ChiSqSelector`: Chi-square feature selection
+
+**Location**: `src/main/scala/org/apache/spark/ml/feature/`
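+
+A short sketch combining a few of these stages (column names are illustrative; assumes a DataFrame `df` with a `text` column):
+
+```scala
+import org.apache.spark.ml.Pipeline
+import org.apache.spark.ml.feature.{CountVectorizer, StopWordsRemover, Tokenizer}
+
+val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
+val remover = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
+val vectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features")
+
+// Fit the feature pipeline and transform the input data
+val featurized = new Pipeline()
+  .setStages(Array(tokenizer, remover, vectorizer))
+  .fit(df)
+  .transform(df)
+```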
+
+## Model Selection and Tuning
+
+### Cross-Validation
+
+```scala
+import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
+import org.apache.spark.ml.evaluation.RegressionEvaluator
+
+val paramGrid = new ParamGridBuilder()
+ .addGrid(lr.regParam, Array(0.1, 0.01))
+ .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
+ .build()
+
+val cv = new CrossValidator()
+ .setEstimator(lr)
+ .setEvaluator(new RegressionEvaluator())
+ .setEstimatorParamMaps(paramGrid)
+ .setNumFolds(3)
+
+val cvModel = cv.fit(trainingData)
+```
+
+### Train-Validation Split
+
+```scala
+import org.apache.spark.ml.tuning.TrainValidationSplit
+
+val trainValidationSplit = new TrainValidationSplit()
+ .setEstimator(lr)
+ .setEvaluator(new RegressionEvaluator())
+ .setEstimatorParamMaps(paramGrid)
+ .setTrainRatio(0.8)
+
+val model = trainValidationSplit.fit(trainingData)
+```
+
+**Location**: `src/main/scala/org/apache/spark/ml/tuning/`
+
+## Evaluation Metrics
+
+### Classification
+
+```scala
+import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
+
+val evaluator = new MulticlassClassificationEvaluator()
+ .setLabelCol("label")
+ .setPredictionCol("prediction")
+ .setMetricName("accuracy")
+
+val accuracy = evaluator.evaluate(predictions)
+```
+
+### Regression
+
+```scala
+import org.apache.spark.ml.evaluation.RegressionEvaluator
+
+val evaluator = new RegressionEvaluator()
+ .setLabelCol("label")
+ .setPredictionCol("prediction")
+ .setMetricName("rmse")
+
+val rmse = evaluator.evaluate(predictions)
+```
+
+**Location**: `src/main/scala/org/apache/spark/ml/evaluation/`
+
+## Linear Algebra
+
+MLlib provides local and distributed linear algebra types; Breeze is used internally for many local operations.
+
+**Location**: `src/main/scala/org/apache/spark/mllib/linalg/` (RDD-based API) and `../mllib-local/src/main/scala/org/apache/spark/ml/linalg/` (DataFrame-based API used below)
+
+**Local vectors and matrices:**
+```scala
+import org.apache.spark.ml.linalg.{Vector, Vectors, Matrix, Matrices}
+
+// Dense vector
+val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
+
+// Sparse vector
+val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
+
+// Dense matrix
+val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
+```
+
+**Distributed matrices:**
+- `RowMatrix`: Distributed row-oriented matrix
+- `IndexedRowMatrix`: Indexed rows
+- `CoordinateMatrix`: Coordinate list format
+- `BlockMatrix`: Block-partitioned matrix
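+
+For instance, a `RowMatrix` can be built from an RDD of local vectors (a minimal sketch, assuming an existing `SparkContext` named `sc`):
+
+```scala
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
+val rows = sc.parallelize(Seq(
+  Vectors.dense(1.0, 2.0, 3.0),
+  Vectors.dense(4.0, 5.0, 6.0)
+))
+
+val mat = new RowMatrix(rows)
+println(s"dimensions: ${mat.numRows()} x ${mat.numCols()}")
+```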
+
+## Statistics
+
+Basic statistics and hypothesis testing.
+
+**Location**: `src/main/scala/org/apache/spark/mllib/stat/`
+
+**Examples:**
+- Summary statistics
+- Correlations
+- Stratified sampling
+- Hypothesis testing
+- Random data generation
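+
+For example, column summary statistics over an RDD of vectors (a sketch assuming a `SparkContext` named `sc`):
+
+```scala
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.stat.Statistics
+
+val observations = sc.parallelize(Seq(
+  Vectors.dense(1.0, 10.0, 100.0),
+  Vectors.dense(2.0, 20.0, 200.0),
+  Vectors.dense(3.0, 30.0, 300.0)
+))
+
+val summary = Statistics.colStats(observations)
+println(summary.mean)     // per-column means
+println(summary.variance) // per-column variances
+```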
+
+## Building and Testing
+
+### Build MLlib Module
+
+```bash
+# Build the mllib module (contains both spark.ml and spark.mllib)
+./build/mvn -pl mllib -am package
+
+# Local linear algebra types live in the mllib-local module
+./build/mvn -pl mllib-local -am package
+```
+
+### Run Tests
+
+```bash
+# Run mllib tests
+./build/mvn test -pl mllib
+
+# Run specific test
+./build/mvn test -pl mllib -Dtest=LinearRegressionSuite
+```
+
+## Source Code Organization
+
+```
+mllib/src/main/
+├── scala/org/apache/spark/mllib/
+│ ├── classification/ # Classification algorithms (RDD-based)
+│ ├── clustering/ # Clustering algorithms (RDD-based)
+│ ├── evaluation/ # Evaluation metrics (RDD-based)
+│ ├── feature/ # Feature engineering (RDD-based)
+│ ├── fpm/ # Frequent pattern mining
+│ ├── linalg/ # Linear algebra
+│ ├── optimization/ # Optimization algorithms
+│ ├── recommendation/ # Collaborative filtering (RDD-based)
+│ ├── regression/ # Regression algorithms (RDD-based)
+│ ├── stat/ # Statistics
+│ ├── tree/ # Decision trees (RDD-based)
+│ └── util/ # Utilities
+└── resources/
+```
+
+## Performance Considerations
+
+### Caching
+
+Cache datasets that are used multiple times:
+```scala
+val trainingData = data.cache()
+```
+
+### Parallelism
+
+Tuning utilities such as `CrossValidator`, `TrainValidationSplit`, and `OneVsRest` can fit multiple candidate models in parallel:
+```scala
+import org.apache.spark.ml.evaluation.RegressionEvaluator
+import org.apache.spark.ml.tuning.CrossValidator
+
+val cv = new CrossValidator()
+  .setEstimator(lr)
+  .setEvaluator(new RegressionEvaluator())
+  .setEstimatorParamMaps(paramGrid)
+  .setParallelism(4) // Fit up to 4 candidate models in parallel
+```
+
+### Data Format
+
+Use Parquet format for efficient storage and reading:
+```scala
+df.write.parquet("training_data.parquet")
+val data = spark.read.parquet("training_data.parquet")
+```
+
+### Feature Scaling
+
+Normalize features for better convergence:
+```scala
+import org.apache.spark.ml.feature.StandardScaler
+
+val scaler = new StandardScaler()
+ .setInputCol("features")
+ .setOutputCol("scaledFeatures")
+ .setWithStd(true)
+ .setWithMean(false)
+```
+
+## Best Practices
+
+1. **Use spark.ml**: Prefer DataFrame-based API over RDD-based API
+2. **Build pipelines**: Use Pipeline API for reproducible workflows
+3. **Cache data**: Cache datasets used in iterative algorithms
+4. **Scale features**: Normalize features for better performance
+5. **Cross-validate**: Use cross-validation for model selection
+6. **Monitor convergence**: Check convergence for iterative algorithms
+7. **Save models**: Persist trained models for reuse
+8. **Use appropriate algorithms**: Choose algorithms based on data characteristics
+
+## Model Persistence
+
+Save and load models:
+
+```scala
+import org.apache.spark.ml.PipelineModel
+
+// Save model
+model.write.overwrite().save("path/to/model")
+
+// Load model
+val loadedModel = PipelineModel.load("path/to/model")
+```
+
+## Migration Guide
+
+### From RDD-based API to DataFrame-based API
+
+**Old (RDD-based):**
+```scala
+import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.rdd.RDD
+
+val data: RDD[LabeledPoint] = ...
+val model = new LogisticRegressionWithLBFGS().run(data)
+```
+
+**New (DataFrame-based):**
+```scala
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.sql.DataFrame
+
+val data: DataFrame = ...
+val lr = new LogisticRegression()
+val model = lr.fit(data)
+```
+
+## Examples
+
+See [examples/src/main/scala/org/apache/spark/examples/ml/](../examples/src/main/scala/org/apache/spark/examples/ml/) for complete examples.
+
+## Further Reading
+
+- [ML Programming Guide](../docs/ml-guide.md) (DataFrame-based API)
+- [MLlib Programming Guide](../docs/mllib-guide.md) (RDD-based API - legacy)
+- [ML Pipelines](../docs/ml-pipeline.md)
+- [ML Tuning](../docs/ml-tuning.md)
+- [Feature Extraction](../docs/ml-features.md)
+
+## Contributing
+
+For contributing to MLlib, see [CONTRIBUTING.md](../CONTRIBUTING.md).
+
+New features should use the DataFrame-based API (`spark.ml`).
diff --git a/resource-managers/README.md b/resource-managers/README.md
new file mode 100644
index 0000000000000..f87ed0f06ad98
--- /dev/null
+++ b/resource-managers/README.md
@@ -0,0 +1,514 @@
+# Spark Resource Managers
+
+This directory contains integrations with various cluster resource managers.
+
+## Overview
+
+Spark can run on different cluster managers:
+- **YARN** (Hadoop YARN)
+- **Kubernetes** (Container orchestration)
+- **Mesos** (General-purpose cluster manager)
+- **Standalone** (Spark's built-in cluster manager)
+
+Each integration provides Spark-specific implementation for:
+- Resource allocation
+- Task scheduling
+- Application lifecycle management
+- Security integration
+
+## Modules
+
+### kubernetes/
+
+Integration with Kubernetes for container-based deployments.
+
+**Location**: `kubernetes/`
+
+**Key Features:**
+- Native Kubernetes resource management
+- Dynamic executor allocation
+- Volume mounting support
+- Kerberos integration
+- Custom resource definitions
+
+**Running on Kubernetes:**
+```bash
+./bin/spark-submit \
+  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
+ --deploy-mode cluster \
+ --name spark-pi \
+ --class org.apache.spark.examples.SparkPi \
+ --conf spark.executor.instances=2 \
+ --conf spark.kubernetes.container.image=spark:3.5.0 \
+ local:///opt/spark/examples/jars/spark-examples.jar
+```
+
+**Documentation**: See [running-on-kubernetes.md](../docs/running-on-kubernetes.md)
+
+### mesos/
+
+Integration with the Apache Mesos cluster manager (deprecated since Spark 3.2).
+
+**Location**: `mesos/`
+
+**Key Features:**
+- Coarse-grained mode (dedicated executors); fine-grained mode was removed in Spark 3.0
+- Dynamic allocation
+- Mesos frameworks integration
+
+**Running on Mesos:**
+```bash
+./bin/spark-submit \
+ --master mesos://mesos-master:5050 \
+ --deploy-mode cluster \
+ --class org.apache.spark.examples.SparkPi \
+ spark-examples.jar
+```
+
+**Documentation**: Check Apache Mesos documentation
+
+### yarn/
+
+Integration with Hadoop YARN (Yet Another Resource Negotiator).
+
+**Location**: `yarn/`
+
+**Key Features:**
+- Client and cluster deploy modes
+- Dynamic resource allocation
+- YARN container management
+- Security integration (Kerberos)
+- External shuffle service
+- Application timeline service integration
+
+**Running on YARN:**
+```bash
+# Client mode (driver runs locally)
+./bin/spark-submit \
+ --master yarn \
+ --deploy-mode client \
+ --class org.apache.spark.examples.SparkPi \
+ spark-examples.jar
+
+# Cluster mode (driver runs on YARN)
+./bin/spark-submit \
+ --master yarn \
+ --deploy-mode cluster \
+ --class org.apache.spark.examples.SparkPi \
+ spark-examples.jar
+```
+
+**Documentation**: See [running-on-yarn.md](../docs/running-on-yarn.md)
+
+## Comparison
+
+### YARN
+
+**Best for:**
+- Existing Hadoop deployments
+- Enterprise environments with Hadoop ecosystem
+- Multi-tenancy with resource queues
+- Organizations standardized on YARN
+
+**Pros:**
+- Mature and stable
+- Rich security features
+- Queue-based resource management
+- Good tooling and monitoring
+
+**Cons:**
+- Requires Hadoop installation
+- More complex setup
+- Higher overhead
+
+### Kubernetes
+
+**Best for:**
+- Cloud-native deployments
+- Containerized applications
+- Modern microservices architectures
+- Multi-cloud environments
+
+**Pros:**
+- Container isolation
+- Modern orchestration features
+- Cloud provider integration
+- Active development community
+
+**Cons:**
+- Newer integration (less mature)
+- Requires Kubernetes cluster
+- Learning curve for K8s
+
+### Mesos
+
+**Best for:**
+- General-purpose cluster management
+- Mixed workload environments (not just Spark)
+- Large-scale deployments
+
+**Pros:**
+- Fine-grained resource allocation
+- Flexible framework support
+- Good for mixed workloads
+
+**Cons:**
+- Less common than YARN/K8s
+- Setup complexity
+- Smaller community
+
+### Standalone
+
+**Best for:**
+- Quick start and development
+- Small clusters
+- Dedicated Spark clusters
+
+**Pros:**
+- Simple setup
+- No dependencies
+- Fast deployment
+
+**Cons:**
+- Limited resource management
+- No multi-tenancy
+- Basic scheduling
+
+## Architecture
+
+### Resource Manager Integration
+
+```
+Spark Application
+ ↓
+SparkContext
+ ↓
+Cluster Manager Client
+ ↓
+Resource Manager (YARN/K8s/Mesos)
+ ↓
+Container/Pod/Task Launch
+ ↓
+Executor Processes
+```
+
+### Common Components
+
+Each integration implements:
+
+1. **SchedulerBackend**: Launches executors and schedules tasks
+2. **ApplicationMaster/Driver**: Manages application lifecycle
+3. **ExecutorBackend**: Runs tasks on executors
+4. **Resource Allocation**: Requests and manages resources
+5. **Security Integration**: Authentication and authorization
+
+## Building
+
+### Build All Resource Manager Modules
+
+```bash
+# Build all resource manager integrations (their Maven profiles must be enabled)
+./build/mvn -Pyarn -Pkubernetes -Pmesos package
+```
+
+### Build Specific Modules
+
+```bash
+# YARN only (the -Pyarn profile adds the module to the build)
+./build/mvn -Pyarn -pl resource-managers/yarn -am package
+
+# Kubernetes only
+./build/mvn -Pkubernetes -pl resource-managers/kubernetes/core -am package
+
+# Mesos only
+./build/mvn -Pmesos -pl resource-managers/mesos -am package
+```
+
+### Build with Specific Profiles
+
+```bash
+# Build with Kubernetes support
+./build/mvn -Pkubernetes package
+
+# Build with YARN support
+./build/mvn -Pyarn package
+
+# Build with Mesos support (requires Mesos libraries)
+./build/mvn -Pmesos package
+```
+
+## Configuration
+
+### YARN Configuration
+
+**Key settings:**
+```properties
+# Resource allocation
+spark.executor.instances=10
+spark.executor.memory=4g
+spark.executor.cores=2
+
+# YARN specific
+spark.yarn.am.memory=1g
+spark.yarn.am.cores=1
+spark.yarn.queue=default
+spark.yarn.jars=hdfs:///spark-jars/*
+
+# Dynamic allocation
+spark.dynamicAllocation.enabled=true
+spark.dynamicAllocation.minExecutors=1
+spark.dynamicAllocation.maxExecutors=100
+spark.shuffle.service.enabled=true
+```
+
+### Kubernetes Configuration
+
+**Key settings:**
+```properties
+# Container image
+spark.kubernetes.container.image=my-spark:latest
+spark.kubernetes.container.image.pullPolicy=Always
+
+# Resource allocation
+spark.executor.instances=5
+spark.kubernetes.executor.request.cores=1
+spark.kubernetes.executor.limit.cores=2
+spark.executor.memory=4g
+
+# Namespace and service account
+spark.kubernetes.namespace=spark
+spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa
+
+# Volumes
+spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-pvc
+spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data
+```
+
+### Mesos Configuration
+
+**Key settings:**
+```properties
+# Mesos master
+spark.mesos.coarse=true
+spark.executor.uri=hdfs://path/to/spark.tgz
+
+# Resource allocation
+spark.executor.memory=4g
+spark.cores.max=20
+
+# Mesos specific
+spark.mesos.role=spark
+spark.mesos.constraints=rack:us-east
+```
+
+## Source Code Organization
+
+```
+resource-managers/
+├── kubernetes/
+│ ├── core/ # Core K8s integration
+│ │ └── src/main/scala/org/apache/spark/
+│ │ ├── deploy/k8s/ # Deployment logic
+│ │ ├── scheduler/ # K8s scheduler backend
+│ │ └── executor/ # K8s executor backend
+│ └── integration-tests/ # K8s integration tests
+├── mesos/
+│ └── src/main/scala/org/apache/spark/
+│ ├── scheduler/ # Mesos scheduler
+│ └── executor/ # Mesos executor
+└── yarn/
+ └── src/main/scala/org/apache/spark/
+ ├── deploy/yarn/ # YARN deployment
+ ├── scheduler/ # YARN scheduler
+ └── executor/ # YARN executor
+```
+
+## Development
+
+### Testing Resource Manager Integrations
+
+```bash
+# Run YARN tests
+./build/mvn -Pyarn test -pl resource-managers/yarn
+
+# Run Kubernetes tests
+./build/mvn -Pkubernetes test -pl resource-managers/kubernetes/core
+
+# Run Mesos tests
+./build/mvn -Pmesos test -pl resource-managers/mesos
+```
+
+### Integration Tests
+
+**Kubernetes:**
+```bash
+cd resource-managers/kubernetes/integration-tests
+./dev/dev-run-integration-tests.sh
+```
+
+See `kubernetes/integration-tests/README.md` for details.
+
+## Security
+
+### YARN Security
+
+**Kerberos authentication:**
+```bash
+./bin/spark-submit \
+ --master yarn \
+ --principal user@REALM \
+ --keytab /path/to/user.keytab \
+ --class org.apache.spark.examples.SparkPi \
+ spark-examples.jar
+```
+
+**Token renewal:**
+```properties
+spark.yarn.principal=user@REALM
+spark.yarn.keytab=/path/to/keytab
+spark.yarn.token.renewal.interval=86400
+```
+
+### Kubernetes Security
+
+**Service account:**
+```properties
+spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa
+spark.kubernetes.authenticate.executor.serviceAccountName=spark-sa
+```
+
+**Secrets:**
+```bash
+kubectl create secret generic spark-secret --from-literal=password=mypassword
+```
+
+```properties
+spark.kubernetes.driver.secrets.spark-secret=/etc/secrets
+```
+
+### Mesos Security
+
+**Authentication:**
+```properties
+spark.mesos.principal=spark-user
+spark.mesos.secret=spark-secret
+```
+
+## Migration Guide
+
+### Moving from Standalone to YARN
+
+1. Set up Hadoop cluster
+2. Configure YARN resource manager
+3. Enable external shuffle service
+4. Update spark-submit commands to use `--master yarn`
+5. Test dynamic allocation
+
+### Moving from YARN to Kubernetes
+
+1. Build Docker image with Spark
+2. Push image to container registry
+3. Create Kubernetes namespace and service account
+4. Update spark-submit to use `--master k8s://`
+5. Configure volume mounts for data access
+
+## Troubleshooting
+
+### YARN Issues
+
+**Application stuck in ACCEPTED state:**
+- Check YARN capacity
+- Verify queue settings
+- Check resource availability
+
+**Container allocation failures:**
+- Increase memory overhead
+- Check node resources
+- Verify memory/core requests
+
+### Kubernetes Issues
+
+**Image pull failures:**
+- Verify image name and tag
+- Check image pull secrets
+- Ensure registry is accessible
+
+**Pod failures:**
+- Check pod logs: `kubectl logs <pod-name>`
+- Verify service account permissions
+- Check resource limits
+
+### Mesos Issues
+
+**Framework registration failures:**
+- Verify Mesos master URL
+- Check authentication settings
+- Ensure proper role configuration
+
+## Best Practices
+
+1. **Choose the right manager**: Based on infrastructure and requirements
+2. **Enable dynamic allocation**: For better resource utilization
+3. **Use external shuffle service**: For executor failure tolerance
+4. **Configure memory overhead**: Account for non-heap memory
+5. **Monitor resource usage**: Track executor and driver metrics
+6. **Use appropriate deploy mode**: Client for interactive, cluster for production
+7. **Implement security**: Enable authentication and encryption
+8. **Test failure scenarios**: Verify fault tolerance
+
+## Performance Tuning
+
+### YARN Performance
+
+```properties
+# Memory overhead (spark.yarn.executor.memoryOverhead is deprecated)
+spark.executor.memoryOverhead=512m
+
+# Locality wait
+spark.locality.wait=3s
+
+# Threads used to launch executor containers
+spark.yarn.containerLauncherMaxThreads=10
+```
+
+### Kubernetes Performance
+
+```properties
+# Resource limits
+spark.kubernetes.executor.limit.cores=2
+
+# Volume performance (memory-backed emptyDir for a volume named "cache")
+spark.kubernetes.driver.volumes.emptyDir.cache.options.medium=Memory
+
+# Predictable executor pod names
+spark.kubernetes.executor.podNamePrefix=spark-exec
+```
+
+### Mesos Performance
+
+```properties
+# Coarse-grained mode (fine-grained mode was removed in Spark 3.0)
+spark.mesos.coarse=true
+
+# Always pull the executor Docker image
+spark.mesos.executor.docker.forcePullImage=true
+```
+
+## Further Reading
+
+- [Running on YARN](../docs/running-on-yarn.md)
+- [Running on Kubernetes](../docs/running-on-kubernetes.md)
+- [Cluster Mode Overview](../docs/cluster-overview.md)
+- [Configuration Guide](../docs/configuration.md)
+- [Security Guide](../docs/security.md)
+
+## Contributing
+
+For contributing to resource manager integrations, see [CONTRIBUTING.md](../CONTRIBUTING.md).
+
+When adding features:
+- Ensure cross-compatibility
+- Add comprehensive tests
+- Update documentation
+- Consider security implications
diff --git a/sbin/README.md b/sbin/README.md
new file mode 100644
index 0000000000000..dbe86cc1a8aa4
--- /dev/null
+++ b/sbin/README.md
@@ -0,0 +1,514 @@
+# Spark Admin Scripts
+
+This directory contains administrative scripts for managing Spark standalone clusters.
+
+## Overview
+
+The `sbin/` scripts are used by cluster administrators to:
+- Start and stop Spark standalone clusters
+- Start and stop individual daemons (master, workers, history server)
+- Manage cluster lifecycle
+- Configure cluster nodes
+
+**Note**: These scripts are for **Spark Standalone** cluster mode only. For YARN, Kubernetes, or Mesos, use their respective cluster management tools.
+
+## Cluster Management Scripts
+
+### start-all.sh / stop-all.sh
+
+Start or stop all Spark daemons on the cluster.
+
+**Usage:**
+```bash
+# Start master and all workers
+./sbin/start-all.sh
+
+# Stop all daemons
+./sbin/stop-all.sh
+```
+
+**What they do:**
+- `start-all.sh`: Starts master on the current machine and workers on machines listed in `conf/workers`
+- `stop-all.sh`: Stops all master and worker daemons
+
+**Prerequisites:**
+- SSH key-based authentication configured
+- `conf/workers` file with worker hostnames
+- Spark installed at same location on all machines
+
+**Configuration files:**
+- `conf/workers`: List of worker hostnames (one per line)
+- `conf/spark-env.sh`: Environment variables
+
+### start-master.sh / stop-master.sh
+
+Start or stop the Spark master daemon on the current machine.
+
+**Usage:**
+```bash
+# Start master
+./sbin/start-master.sh
+
+# Stop master
+./sbin/stop-master.sh
+```
+
+**Master Web UI**: Access at `http://<master-host>:8080/`
+
+**Configuration:**
+```bash
+# In conf/spark-env.sh
+export SPARK_MASTER_HOST=master-hostname
+export SPARK_MASTER_PORT=7077
+export SPARK_MASTER_WEBUI_PORT=8080
+```
+
+### start-worker.sh / stop-worker.sh
+
+Start or stop a Spark worker daemon on the current machine.
+
+**Usage:**
+```bash
+# Start worker connecting to master
+./sbin/start-worker.sh spark://master:7077
+
+# Stop worker
+./sbin/stop-worker.sh
+```
+
+**Worker Web UI**: Access at `http://<worker-host>:8081/`
+
+**Configuration:**
+```bash
+# In conf/spark-env.sh
+export SPARK_WORKER_CORES=8 # Number of cores to use
+export SPARK_WORKER_MEMORY=16g # Memory to allocate
+export SPARK_WORKER_PORT=7078 # Worker port
+export SPARK_WORKER_WEBUI_PORT=8081
+export SPARK_WORKER_DIR=/var/spark/work # Work directory
+```
+
+### start-workers.sh / stop-workers.sh
+
+Start or stop workers on all machines listed in `conf/workers`.
+
+**Usage:**
+```bash
+# Start all workers
+./sbin/start-workers.sh spark://master:7077
+
+# Stop all workers
+./sbin/stop-workers.sh
+```
+
+**Requirements:**
+- `conf/workers` file configured
+- SSH access to all worker machines
+- Master URL (for starting)
+
+## History Server Scripts
+
+### start-history-server.sh / stop-history-server.sh
+
+Start or stop the Spark History Server for viewing completed application logs.
+
+**Usage:**
+```bash
+# Start history server
+./sbin/start-history-server.sh
+
+# Stop history server
+./sbin/stop-history-server.sh
+```
+
+**History Server UI**: Access at `http://<history-server-host>:18080/`
+
+**Configuration:**
+```properties
+# In conf/spark-defaults.conf
+spark.history.fs.logDirectory=hdfs://namenode/spark-logs
+spark.history.ui.port=18080
+spark.eventLog.enabled=true
+spark.eventLog.dir=hdfs://namenode/spark-logs
+```
+
+**Requirements:**
+- Applications must have event logging enabled
+- Log directory must be accessible
+
+## Shuffle Service Scripts
+
+### start-shuffle-service.sh / stop-shuffle-service.sh
+
+Start or stop the standalone external shuffle service on the current machine.
+
+**Usage:**
+```bash
+# Start shuffle service
+./sbin/start-shuffle-service.sh
+
+# Stop shuffle service
+./sbin/stop-shuffle-service.sh
+```
+
+**Note**: Typically used with standalone or Mesos deployments that enable dynamic allocation; on YARN the shuffle service normally runs as a NodeManager auxiliary service instead.
+
+## Configuration Files
+
+### conf/workers
+
+Lists worker hostnames, one per line.
+
+**Example:**
+```
+worker1.example.com
+worker2.example.com
+worker3.example.com
+```
+
+**Usage:**
+- Used by `start-all.sh` and `start-workers.sh`
+- Each line should contain a hostname or IP address
+- Blank lines and lines starting with `#` are ignored
+
+### conf/spark-env.sh
+
+Environment variables for Spark daemons.
+
+**Example:**
+```bash
+#!/usr/bin/env bash
+
+# Java
+export JAVA_HOME=/usr/lib/jvm/java-17
+
+# Master settings
+export SPARK_MASTER_HOST=master.example.com
+export SPARK_MASTER_PORT=7077
+export SPARK_MASTER_WEBUI_PORT=8080
+
+# Worker settings
+export SPARK_WORKER_CORES=8
+export SPARK_WORKER_MEMORY=16g
+export SPARK_WORKER_PORT=7078
+export SPARK_WORKER_WEBUI_PORT=8081
+export SPARK_WORKER_DIR=/var/spark/work
+
+# Directories
+export SPARK_LOG_DIR=/var/log/spark
+export SPARK_PID_DIR=/var/run/spark
+
+# History Server
+export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode/spark-logs"
+
+# Additional Java options
+export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
+```
+
+**Key Variables:**
+
+**Master:**
+- `SPARK_MASTER_HOST`: Master hostname
+- `SPARK_MASTER_PORT`: Master port (default: 7077)
+- `SPARK_MASTER_WEBUI_PORT`: Web UI port (default: 8080)
+
+**Worker:**
+- `SPARK_WORKER_CORES`: Number of cores per worker
+- `SPARK_WORKER_MEMORY`: Memory per worker (e.g., 16g)
+- `SPARK_WORKER_PORT`: Worker communication port
+- `SPARK_WORKER_WEBUI_PORT`: Worker web UI port (default: 8081)
+- `SPARK_WORKER_DIR`: Directory for scratch space and logs
+- `SPARK_WORKER_INSTANCES`: Number of worker instances per machine
+
+**General:**
+- `SPARK_LOG_DIR`: Directory for daemon logs
+- `SPARK_PID_DIR`: Directory for PID files
+- `SPARK_IDENT_STRING`: Identifier for daemons (default: username)
+- `SPARK_NICENESS`: Nice value for daemons
+- `SPARK_DAEMON_MEMORY`: Memory for daemon processes
+
+## Setting Up a Standalone Cluster
+
+### Step 1: Install Spark on All Nodes
+
+```bash
+# Download and extract Spark on each machine
+tar xzf spark-X.Y.Z-bin-hadoopX.tgz
+cd spark-X.Y.Z-bin-hadoopX
+```
+
+### Step 2: Configure spark-env.sh
+
+Create `conf/spark-env.sh` from template:
+```bash
+cp conf/spark-env.sh.template conf/spark-env.sh
+# Edit conf/spark-env.sh with appropriate settings
+```
+
+### Step 3: Configure Workers File
+
+Create `conf/workers`:
+```bash
+cp conf/workers.template conf/workers
+# Add worker hostnames, one per line
+```
+
+### Step 4: Configure SSH Access
+
+Set up password-less SSH from master to all workers:
+```bash
+ssh-keygen -t rsa
+ssh-copy-id user@worker1
+ssh-copy-id user@worker2
+# ... for each worker
+```
+
+### Step 5: Synchronize Configuration
+
+Copy configuration to all workers:
+```bash
+for host in $(cat conf/workers); do
+ rsync -av conf/ user@$host:spark/conf/
+done
+```
+
+### Step 6: Start the Cluster
+
+```bash
+./sbin/start-all.sh
+```
+
+### Step 7: Verify
+
+- Check master UI: `http://master:8080`
+- Check worker UIs: `http://worker1:8081`, etc.
+- Look for workers registered with master
+
+## High Availability
+
+For production deployments, configure high availability with ZooKeeper.
+
+### ZooKeeper-based HA Configuration
+
+**In conf/spark-env.sh:**
+```bash
+export SPARK_DAEMON_JAVA_OPTS="
+ -Dspark.deploy.recoveryMode=ZOOKEEPER
+ -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
+ -Dspark.deploy.zookeeper.dir=/spark
+"
+```
+
+### Start Multiple Masters
+
+```bash
+# On master1
+./sbin/start-master.sh
+
+# On master2
+./sbin/start-master.sh
+
+# On master3
+./sbin/start-master.sh
+```
+
+### Connect Workers to All Masters
+
+```bash
+./sbin/start-worker.sh spark://master1:7077,master2:7077,master3:7077
+```
+
+**Automatic failover:** If active master fails, standby masters detect the failure and one becomes active.
+
+## Monitoring and Logs
+
+### Log Files
+
+Daemon logs are written to `$SPARK_LOG_DIR` (default: `logs/`):
+
+```bash
+# Master log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.master.Master-*.out
+
+# Worker log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.worker.Worker-*.out
+
+# History Server log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.history.HistoryServer-*.out
+```
+
+### View Logs
+
+```bash
+# Tail master log
+tail -f logs/spark-*Master*.out
+
+# Tail worker log
+tail -f logs/spark-*Worker*.out
+
+# Search for errors
+grep ERROR logs/spark-*Master*.out
+```
+
+### Web UIs
+
+- **Master UI**: `http://<master-host>:8080` - Cluster status, workers, applications
+- **Worker UI**: `http://<worker-host>:8081` - Worker status, running executors
+- **Application UI**: `http://<driver-host>:4040` - Running application metrics
+- **History Server**: `http://<history-server-host>:18080` - Completed applications
+
+## Advanced Configuration
+
+### Memory Overhead
+
+Reserve memory for system processes:
+```bash
+export SPARK_DAEMON_MEMORY=2g
+```
+
+### Multiple Workers per Machine
+
+Run multiple worker instances on a single machine:
+```bash
+export SPARK_WORKER_INSTANCES=2
+export SPARK_WORKER_CORES=4 # Cores per instance
+export SPARK_WORKER_MEMORY=8g # Memory per instance
+```
+
+### Work Directory
+
+Change worker scratch space:
+```bash
+export SPARK_WORKER_DIR=/mnt/fast-disk/spark-work
+```
+
+### Port Configuration
+
+Use non-default ports:
+```bash
+export SPARK_MASTER_PORT=9077
+export SPARK_MASTER_WEBUI_PORT=9080
+export SPARK_WORKER_PORT=9078
+export SPARK_WORKER_WEBUI_PORT=9081
+```
+
+## Security
+
+### Enable Authentication
+
+```bash
+export SPARK_DAEMON_JAVA_OPTS="
+ -Dspark.authenticate=true
+ -Dspark.authenticate.secret=your-secret-key
+"
+```
+
+### Enable SSL
+
+```bash
+export SPARK_DAEMON_JAVA_OPTS="
+ -Dspark.ssl.enabled=true
+ -Dspark.ssl.keyStore=/path/to/keystore
+ -Dspark.ssl.keyStorePassword=password
+ -Dspark.ssl.trustStore=/path/to/truststore
+ -Dspark.ssl.trustStorePassword=password
+"
+```
+
+## Troubleshooting
+
+### Master Won't Start
+
+**Check:**
+1. Port 7077 is available: `netstat -an | grep 7077`
+2. Hostname is resolvable: `ping $SPARK_MASTER_HOST`
+3. Logs for errors: `cat logs/spark-*-master-*.out`
+
+### Workers Not Connecting
+
+**Check:**
+1. Master URL is correct
+2. Network connectivity: `telnet master 7077`
+3. Firewall allows connections
+4. Worker logs: `cat logs/spark-*-worker-*.out`
+
+### SSH Connection Issues
+
+**Solutions:**
+1. Verify SSH key: `ssh worker1 echo test`
+2. Check SSH config: `~/.ssh/config`
+3. Use SSH agent: `eval $(ssh-agent); ssh-add`
+
+### Insufficient Resources
+
+**Check:**
+- Worker has enough memory: `free -h`
+- Enough cores available: `nproc`
+- Disk space: `df -h`
+
+## Cluster Shutdown
+
+### Graceful Shutdown
+
+```bash
+# Stop all workers first
+./sbin/stop-workers.sh
+
+# Stop master
+./sbin/stop-master.sh
+
+# Or stop everything
+./sbin/stop-all.sh
+```
+
+### Check All Stopped
+
+```bash
+# Check for running Java processes
+jps | grep -E "(Master|Worker)"
+```
+
+### Force Kill if Needed
+
+```bash
+# Kill any remaining Spark processes
+pkill -f org.apache.spark.deploy
+```
+
+## Best Practices
+
+1. **Use HA in production**: Configure ZooKeeper-based HA
+2. **Monitor resources**: Watch CPU, memory, disk usage
+3. **Separate log directories**: Use dedicated disk for logs
+4. **Regular maintenance**: Clean old logs and application data
+5. **Automate startup**: Use systemd or init scripts
+6. **Configure limits**: Set file descriptor and process limits
+7. **Use external shuffle service**: For better fault tolerance
+8. **Back up metadata**: Regularly back up ZooKeeper data
+
+## Scripts Reference
+
+| Script | Purpose |
+|--------|---------|
+| `start-all.sh` | Start master and all workers |
+| `stop-all.sh` | Stop master and all workers |
+| `start-master.sh` | Start master on current machine |
+| `stop-master.sh` | Stop master |
+| `start-worker.sh` | Start worker on current machine |
+| `stop-worker.sh` | Stop worker |
+| `start-workers.sh` | Start workers on all machines in `conf/workers` |
+| `stop-workers.sh` | Stop all workers |
+| `start-history-server.sh` | Start history server |
+| `stop-history-server.sh` | Stop history server |
+
+## Further Reading
+
+- [Spark Standalone Mode](../docs/spark-standalone.md)
+- [Cluster Mode Overview](../docs/cluster-overview.md)
+- [Configuration Guide](../docs/configuration.md)
+- [Security Guide](../docs/security.md)
+- [Monitoring Guide](../docs/monitoring.md)
+
+## User-Facing Scripts
+
+For user-facing scripts (spark-submit, spark-shell, etc.), see [../bin/README.md](../bin/README.md).
diff --git a/streaming/README.md b/streaming/README.md
new file mode 100644
index 0000000000000..4e16b8f12b11e
--- /dev/null
+++ b/streaming/README.md
@@ -0,0 +1,430 @@
+# Spark Streaming
+
+Spark Streaming provides scalable, high-throughput, fault-tolerant stream processing of live data streams.
+
+## Overview
+
+Spark Streaming supports two APIs:
+
+1. **DStreams (Discretized Streams)** - Legacy API (Deprecated as of Spark 3.4)
+2. **Structured Streaming** - Modern API built on Spark SQL (Recommended)
+
+**Note**: DStreams are deprecated. For new applications, use **Structured Streaming** which is located in the `sql/core` module.
+
+## DStreams (Legacy API)
+
+### What are DStreams?
+
+DStreams represent a continuous stream of data, internally represented as a sequence of RDDs.
+
+**Key characteristics:**
+- Micro-batch processing model
+- Integration with Kafka, Flume, Kinesis, TCP sockets, and more
+- Windowing operations for time-based aggregations
+- Stateful transformations with updateStateByKey
+- Fault tolerance through checkpointing
+
+### Location
+
+- Scala/Java: `src/main/scala/org/apache/spark/streaming/`
+- Python: `../python/pyspark/streaming/`
+
+### Basic Example
+
+```scala
+import org.apache.spark.streaming._
+import org.apache.spark.SparkConf
+
+val conf = new SparkConf().setAppName("NetworkWordCount")
+val ssc = new StreamingContext(conf, Seconds(1))
+
+// Create DStream from TCP source
+val lines = ssc.socketTextStream("localhost", 9999)
+
+// Process the stream
+val words = lines.flatMap(_.split(" "))
+val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
+
+// Print results
+wordCounts.print()
+
+// Start the computation
+ssc.start()
+ssc.awaitTermination()
+```
+
+### Key Components
+
+#### StreamingContext
+
+The main entry point for streaming functionality.
+
+**File**: `src/main/scala/org/apache/spark/streaming/StreamingContext.scala`
+
+**Usage:**
+```scala
+val ssc = new StreamingContext(sparkContext, Seconds(batchInterval))
+// or
+val ssc = new StreamingContext(conf, Seconds(batchInterval))
+```
+
+#### DStream
+
+The fundamental abstraction for a continuous data stream.
+
+**File**: `src/main/scala/org/apache/spark/streaming/dstream/DStream.scala`
+
+**Operations:**
+- **Transformations**: map, flatMap, filter, reduce, join, window
+- **Output Operations**: print, saveAsTextFiles, foreachRDD
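+
+For example, `foreachRDD` is the usual way to push each micro-batch to an external system (an illustrative sketch using the `wordCounts` stream from the example above):
+
+```scala
+wordCounts.foreachRDD { (rdd, time) =>
+  rdd.foreachPartition { partition =>
+    // Typically: open one connection per partition, write the records, then close it.
+    partition.foreach(record => println(s"[$time] $record"))
+  }
+}
+```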
+
+#### Input Sources
+
+**Built-in sources:**
+- `socketTextStream`: TCP socket source
+- `textFileStream`: File system monitoring
+- `queueStream`: Queue-based testing source
+
+**Advanced sources** (require external libraries):
+- Kafka: `KafkaUtils.createDirectStream` (spark-streaming-kafka-0-10)
+- Kinesis: `KinesisInputDStream.builder` (spark-streaming-kinesis-asl)
+- Flume: `FlumeUtils.createStream` (no longer bundled with Spark)
+
+**Location**: `src/main/scala/org/apache/spark/streaming/dstream/`
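+
+For example, `textFileStream` turns new files appearing in a directory into a DStream (the HDFS path here is hypothetical):
+
+```scala
+// Each new file in the monitored directory becomes part of the next batch
+val fileLines = ssc.textFileStream("hdfs://namenode/data/incoming")
+fileLines.count().print()
+```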
+
+### Windowing Operations
+
+Process data over sliding windows:
+
+```scala
+val windowedStream = lines
+ .window(Seconds(30), Seconds(10)) // 30s window, 10s slide
+
+val windowedWordCounts = words
+ .map(x => (x, 1))
+ .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
+```
+
+### Stateful Operations
+
+Maintain state across batches:
+
+```scala
+def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
+ val newCount = runningCount.getOrElse(0) + newValues.sum
+ Some(newCount)
+}
+
+val runningCounts = pairs.updateStateByKey(updateFunction)
+```
+
+### Checkpointing
+
+Essential for stateful operations and fault tolerance:
+
+```scala
+ssc.checkpoint("hdfs://checkpoint/directory")
+```
+
+**What gets checkpointed:**
+- Configuration
+- DStream operations
+- Incomplete batches
+- State data (for stateful operations)
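+
+Driver recovery relies on `StreamingContext.getOrCreate`, which rebuilds the context from checkpoint data if it exists (a sketch reusing `conf` from the basic example; the checkpoint path is hypothetical):
+
+```scala
+def createContext(): StreamingContext = {
+  val ssc = new StreamingContext(conf, Seconds(1))
+  ssc.checkpoint("hdfs://namenode/spark/checkpoints")
+  // define the DStream operations here
+  ssc
+}
+
+// Recover from checkpoint data if present, otherwise create a fresh context
+val context = StreamingContext.getOrCreate("hdfs://namenode/spark/checkpoints", createContext _)
+context.start()
+context.awaitTermination()
+```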
+
+### Performance Tuning
+
+**Batch Interval**
+- Set based on processing time and latency requirements
+- Too small: overhead increases
+- Too large: latency increases
+
+**Parallelism**
+```scala
+// Increase receiver parallelism
+val numStreams = 5
+val streams = (1 to numStreams).map(_ => ssc.socketTextStream(...))
+val unifiedStream = ssc.union(streams)
+
+// Repartition for processing
+val repartitioned = dstream.repartition(10)
+```
+
+**Rate Limiting**
+```scala
+conf.set("spark.streaming.receiver.maxRate", "10000")
+conf.set("spark.streaming.kafka.maxRatePerPartition", "1000")
+```
+
+## Structured Streaming (Recommended)
+
+For new applications, use Structured Streaming instead of DStreams.
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/sql/streaming/`
+
+**Example:**
+```scala
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.streaming._
+
+val spark = SparkSession.builder()
+ .appName("StructuredNetworkWordCount")
+ .getOrCreate()
+
+import spark.implicits._
+
+// Create DataFrame from stream source
+val lines = spark
+ .readStream
+ .format("socket")
+ .option("host", "localhost")
+ .option("port", 9999)
+ .load()
+
+// Process the stream
+val words = lines.as[String].flatMap(_.split(" "))
+val wordCounts = words.groupBy("value").count()
+
+// Output the stream
+val query = wordCounts
+ .writeStream
+ .outputMode("complete")
+ .format("console")
+ .start()
+
+query.awaitTermination()
+```
+
+**Advantages over DStreams:**
+- Unified API with batch processing
+- Better performance with Catalyst optimizer
+- Exactly-once semantics
+- Event time processing
+- Watermarking for late data
+- Easier to reason about
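+
+Event time and watermarking look like this in Structured Streaming (an illustrative sketch assuming a streaming DataFrame `events` with `timestamp` and `word` columns):
+
+```scala
+import org.apache.spark.sql.functions.{col, window}
+
+// Count words in 5-minute event-time windows, dropping data more than 10 minutes late
+val windowedCounts = events
+  .withWatermark("timestamp", "10 minutes")
+  .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
+  .count()
+```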
+
+See [Structured Streaming Guide](../docs/structured-streaming-programming-guide.md) for details.
+
+## Building and Testing
+
+### Build Streaming Module
+
+```bash
+# Build streaming module
+./build/mvn -pl streaming -am package
+
+# Skip tests
+./build/mvn -pl streaming -am -DskipTests package
+```
+
+### Run Tests
+
+```bash
+# Run all streaming tests
+./build/mvn test -pl streaming
+
+# Run specific test suite
+./build/mvn test -pl streaming -Dtest=BasicOperationsSuite
+```
+
+## Source Code Organization
+
+```
+streaming/src/main/
+├── scala/org/apache/spark/streaming/
+│ ├── StreamingContext.scala # Main entry point
+│ ├── Time.scala # Time utilities
+│ ├── Checkpoint.scala # Checkpointing
+│ ├── dstream/
+│ │ ├── DStream.scala # Base DStream
+│ │ ├── InputDStream.scala # Input sources
+│ │ ├── ReceiverInputDStream.scala # Receiver-based input
+│ │ ├── WindowedDStream.scala # Windowing operations
+│ │ ├── StateDStream.scala # Stateful operations
+│ │ └── PairDStreamFunctions.scala # Key-value operations
+│ ├── receiver/
+│ │ ├── Receiver.scala # Base receiver class
+│ │ ├── ReceiverSupervisor.scala # Receiver management
+│ │ └── BlockGenerator.scala # Block generation
+│ ├── scheduler/
+│ │ ├── JobScheduler.scala # Job scheduling
+│ │ ├── JobGenerator.scala # Job generation
+│ │ └── ReceiverTracker.scala # Receiver tracking
+│ └── ui/
+│ └── StreamingTab.scala # Web UI
+└── resources/
+```
+
+## Integration with External Systems
+
+### Apache Kafka
+
+**Deprecated DStreams approach:**
+```scala
+import org.apache.spark.streaming.kafka010._
+
+val kafkaParams = Map[String, Object](
+ "bootstrap.servers" -> "localhost:9092",
+ "key.deserializer" -> classOf[StringDeserializer],
+ "value.deserializer" -> classOf[StringDeserializer],
+ "group.id" -> "test-group"
+)
+
+val stream = KafkaUtils.createDirectStream[String, String](
+ ssc,
+ PreferConsistent,
+ Subscribe[String, String](topics, kafkaParams)
+)
+```
+
+**Recommended Structured Streaming approach:**
+```scala
+val df = spark
+ .readStream
+ .format("kafka")
+ .option("kafka.bootstrap.servers", "localhost:9092")
+ .option("subscribe", "topic1")
+ .load()
+```
+
+See [Kafka Integration Guide](../docs/streaming-kafka-integration.md).
+
+### Amazon Kinesis
+
+```scala
+import org.apache.spark.streaming.kinesis._
+
+val stream = KinesisInputDStream.builder
+  .streamingContext(ssc)
+  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
+  .regionName("us-east-1")
+  .streamName("myStream")
+  .checkpointAppName("myKinesisApp") // required: names the DynamoDB checkpoint table
+  .build()
+```
+
+See [Kinesis Integration Guide](../docs/streaming-kinesis-integration.md).
+
+## Monitoring and Debugging
+
+### Streaming UI
+
+Access at: `http://<driver-host>:4040/streaming/`
+
+**Metrics:**
+- Batch processing times
+- Input rates
+- Scheduling delays
+- Active batches
+
+### Logs
+
+Enable detailed logging:
+```properties
+# log4j2.properties (Spark 3.3+)
+logger.streaming.name = org.apache.spark.streaming
+logger.streaming.level = debug
+```
+
+### Metrics
+
+Key metrics to monitor:
+- **Batch Processing Time**: Should be < batch interval
+- **Scheduling Delay**: Should be minimal
+- **Total Delay**: End-to-end delay
+- **Input Rate**: Records per second
+
+## Common Issues
+
+### Batch Processing Time > Batch Interval
+
+**Symptoms**: Scheduling delay increases over time
+
+**Solutions:**
+- Increase parallelism
+- Optimize transformations
+- Increase resources (executors, memory)
+- Reduce batch interval data volume
+
+### Out of Memory Errors
+
+**Solutions:**
+- Increase executor memory
+- Enable compression
+- Reduce window/batch size
+- Persist less data
+
+### Receiver Failures
+
+**Solutions:**
+- Enable WAL (Write-Ahead Logs)
+- Increase receiver memory
+- Add multiple receivers
+- Use Structured Streaming with better fault tolerance
+
+## Migration from DStreams to Structured Streaming
+
+**Why migrate:**
+- DStreams are deprecated
+- Better performance and semantics
+- Unified API with batch processing
+- Active development and support
+
+**Key differences:**
+- DataFrame/Dataset API instead of RDDs
+- Declarative operations
+- Built-in support for event time
+- Exactly-once semantics by default
+
+**Migration guide**: See [Structured Streaming Migration Guide](../docs/ss-migration-guide.md)
+
+## Examples
+
+See [examples/src/main/scala/org/apache/spark/examples/streaming/](../examples/src/main/scala/org/apache/spark/examples/streaming/) for more examples.
+
+**Key examples:**
+- `NetworkWordCount.scala`: Basic word count
+- `StatefulNetworkWordCount.scala`: Stateful processing
+- `WindowedNetworkWordCount.scala`: Window operations
+- `KafkaWordCount.scala`: Kafka integration
+
+## Configuration
+
+Key configuration parameters:
+
+```properties
+# Batch interval (set in code)
+# StreamingContext(conf, Seconds(batchInterval))
+
+# Backpressure (rate limiting)
+spark.streaming.backpressure.enabled=true
+
+# Receiver rate limit (records per second per receiver)
+spark.streaming.receiver.maxRate=10000
+
+# Checkpoint interval is set in code, e.g. dstream.checkpoint(Seconds(10))
+
+# Graceful shutdown
+spark.streaming.stopGracefullyOnShutdown=true
+```
+
+## Best Practices
+
+1. **Use Structured Streaming for new applications**
+2. **Set appropriate batch intervals** based on latency requirements
+3. **Enable checkpointing** for stateful operations
+4. **Monitor batch processing times** to ensure they're less than batch interval
+5. **Use backpressure** to handle variable input rates
+6. **Test failure scenarios** with checkpointing
+7. **Consider using Kafka** for reliable message delivery
+
+## Further Reading
+
+- [Structured Streaming Programming Guide](../docs/structured-streaming-programming-guide.md) (Recommended)
+- [DStreams Programming Guide](../docs/streaming-programming-guide.md) (Legacy)
+- [Kafka Integration](../docs/streaming-kafka-integration.md)
+- [Kinesis Integration](../docs/streaming-kinesis-integration.md)
+
+## Contributing
+
+For contributing to Spark Streaming, see [CONTRIBUTING.md](../CONTRIBUTING.md).
+
+Note: New features should focus on Structured Streaming rather than DStreams.