# Overview
An [[in-memory]], distributed data processing engine. It improves on [[MapReduce]] by writing to disk only at input and output (as opposed to between each step), and its operators are more flexible than mappers and reducers. Intermediate state is stored in [[Resilient Distributed Datasets (RDDs)]], which live in memory; the trade-off is higher memory usage.
# Key Considerations
Spark addresses fault tolerance by classifying each operator as having either a:
- Narrow dependency - all computation between two steps of a Spark job stays on a single node (e.g., counting the characters of each message)
- Wide dependency - computation relies on data from other nodes

For narrow dependencies, if a node fails, the lost partitions are recomputed in parallel on the remaining online nodes. For wide dependencies, the intermediate data is written to disk to allow for recovery.
# Implementation Details
# Useful Links
# Related Topics
## Reference
#### Working Notes
#### Sources
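The narrow vs wide dependency distinction under Key Considerations can be sketched in plain Python (this is a conceptual model, not the actual Spark API; the partition layout, function names, and hash-by-first-letter scheme are all illustrative assumptions):

```python
# Conceptual sketch: model partitions as lists and show why lineage-based
# recovery differs for narrow vs wide dependencies.

# Source data split across three hypothetical partitions.
partitions = [
    ["hi", "hello there"],
    ["spark", "rdd"],
    ["fault", "tolerance!"],
]

# Narrow dependency: each output partition depends on exactly one input
# partition, so "count characters of each message" runs per partition
# with no data movement between nodes.
def count_chars(partition):
    return [len(msg) for msg in partition]

narrow_result = [count_chars(p) for p in partitions]

# If output partition 1 is lost, only its single parent partition must be
# recomputed; the other partitions (and nodes) are untouched.
recovered = count_chars(partitions[1])
assert recovered == narrow_result[1]

# Wide dependency: each output partition may depend on ALL input partitions.
# Grouping messages by first letter shuffles data across every partition,
# so recovering one lost output partition means re-reading every parent --
# which is why Spark persists this intermediate shuffle data to disk.
def shuffle_by_first_letter(parts, num_out=2):
    out = [dict() for _ in range(num_out)]
    for part in parts:
        for msg in part:
            dest = out[ord(msg[0]) % num_out]  # deterministic toy partitioner
            dest.setdefault(msg[0], []).append(msg)
    return out

wide_result = shuffle_by_first_letter(partitions)
```

Losing a narrow-dependency partition costs one parent recomputation; losing a wide-dependency partition would cost a full re-shuffle, which is the motivation for writing that intermediate data to disk.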