# Overview

A programming model for processing large amounts of data in bulk across many machines. The processing is broken down into two functions: a *mapper* and a *reducer*. Because these two simple functions capture the whole job, the framework can distribute the work across many machines and run it in parallel.

## Key Terms Used in MapReduce #flashcard

- **Mapper** - called once for every input record; its job is to extract the key and value from that record. For each input it may generate any number of key-value pairs (including none). It keeps no state from one record to the next, so each record is handled independently.
- **Reducer** - the framework collects the key-value pairs produced by the mappers, groups together all the values belonging to the same key, and calls the reducer with an iterator over that collection of values. The reducer can produce output records.
<!--ID: 1751507777824-->

![[2024-11-22_MapReduce.png]]

## Use Cases

- Building search indexes

# Key Considerations

## Strengths of MapReduce #flashcard

- Provides fault tolerance: individual tasks can be retried after a node failure without restarting the whole job.
<!--ID: 1751507777827-->

## Weaknesses of MapReduce #flashcard

- A single slow node (straggler) drags down the whole job, because results are only available once every task has completed.
- Chained jobs don't know about one another, so a downstream job cannot start until all of its upstream jobs have finished, which means lots of waiting.
- Intermediate state between chained jobs is fully materialized (typically written out to the distributed filesystem) rather than streamed to the next stage, which is inefficient.
- Every job requires a mapper and a reducer, and the sort between them runs even when it is unnecessary.
- Heavy disk usage, since inputs, outputs, and intermediate results all live on disk.
<!--ID: 1751507777829-->

# Implementation Details
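To make the two roles concrete, here is a minimal single-process sketch of the model in Python, using word counting as the job. The function names and the in-memory shuffle step are illustrative assumptions, not the API of any real framework; Hadoop and similar systems run these same two functions distributed across many machines, with the shuffle moving data over the network.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def mapper(record: str) -> Iterator[tuple[str, int]]:
    """Called once per input record; emits any number of key-value pairs."""
    for word in record.split():
        yield (word.lower(), 1)


def reducer(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Called once per distinct key, with an iterator over all of its values."""
    return (key, sum(values))


def map_reduce(records, mapper, reducer):
    # Map + shuffle phases: run the stateless mapper over every record and
    # group each emitted value under its key (done in memory here; a real
    # framework sorts and moves this data between machines).
    groups: defaultdict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one reducer call per key.
    return [reducer(key, iter(values)) for key, values in sorted(groups.items())]


lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(map_reduce(lines, mapper, reducer))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

Because the mapper keeps no state between records, the input can be split into arbitrary partitions and mapped on different machines; only the shuffle needs to bring all values for the same key together on one machine before the reducer runs.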
# Useful Links

# Related Topics

## Reference

#### Working Notes

#### Sources