# Overview

An approach for dividing data among separate portions of a node or across different nodes. It is often used to improve a system's [[Scalability]] by spreading data evenly across the partitions. This reduces the amount of data that must be searched in any one place and allows data to be accessed in parallel across separate partitions.

# Key Considerations

## Partitioning Strategies

A partition physically isolates a subset of the overall dataset. To design an effective partition, it is key to use a strategy that allows data that should be processed together to end up in the same partition. This way, when [[Horizontal Scaling]] is used in a system, a machine can access all necessary data from a single partition without expensive operations that reach into additional partitions. An inefficient partitioning strategy can leave too much data in a single partition, which is known as *skewing* the data.

For example, if you are designing a system involving geoproximity, like [[Design a Rideshare Service]], your data model may include a [[Geohash]]. That column would be a strong candidate for partitioning, because a user would want all the data in a certain region.

Below are some general partitioning strategies:

### Partitioning by Key Range

Assign a continuous range of keys to each partition. This is simple, but can result in an uneven distribution of data (hot spots). It works well for:

- Time-series data (logs, events, transactions)
- Numerical ranges (price bands, age groups)
- Alphabetical ranges (customer names)

### Partitioning by Hash of Key

Use a good hash function so that skewed keys end up uniformly distributed across partitions.

## Rebalancing Strategies

As the data volume or the number of nodes grows and shrinks over time, the partitions will need to be rebalanced to keep the data uniformly distributed. This can be done automatically or manually.
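As a minimal sketch combining the ideas above: hash each key into a fixed number of partitions that is larger than the node count, so rebalancing only remaps partitions to nodes while partition boundaries stay put. All names and the partition count here are hypothetical, not from any particular database:

```python
import hashlib

NUM_PARTITIONS = 64  # hypothetical: many more partitions than nodes

def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash.

    Uses hashlib rather than Python's built-in hash(), which is
    randomized per process and therefore unsuitable for partitioning.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def assign_partitions(nodes: list[str]) -> dict[int, str]:
    """Spread the fixed set of partitions round-robin across current nodes.

    Rebalancing (adding or removing a node) only changes this mapping;
    the partition a given key hashes to never changes.
    """
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

# The same key always lands in the same partition...
p = partition_for("user:alice")
# ...but the node owning that partition can change as the cluster grows.
before = assign_partitions(["node-a", "node-b"])
after = assign_partitions(["node-a", "node-b", "node-c"])
```

A real system would use consistent hashing or per-partition assignment tables instead of round-robin, so that adding a node moves only a fraction of the partitions; the sketch just illustrates the separation between key-to-partition and partition-to-node mapping.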
It is recommended to keep a human in the loop regardless, as moving data can impact many parts of the system (e.g., routing requests).

### Fixed Number of Partitions

Create many more partitions than there are nodes in the database. Each partition is assigned to a node, and a node can hold many partitions. To rebalance (e.g., when a node dies), you can send that node's partitions to the remaining nodes. The partition boundaries remain the same; only the node holding each partition changes.

### Dynamic Partitioning

When a partition grows to exceed a configured size, it is split into two partitions so that approximately half of the data ends up on each side of the split. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition. This strategy adapts to the total data volume.

### Partitioning Based on Nodes

Make the number of partitions proportional to the number of nodes, in other words, keep a fixed number of partitions per node. In this case, the size of each partition grows proportionally with the dataset while the number of nodes remains unchanged, but when you increase the number of nodes, the partitions become smaller again. Since a larger data volume generally requires more nodes to store it, this approach also keeps the size of each partition fairly stable.

# Implementation Details

# Useful Links

# Related Topics

## Reference

#### Working Notes

#### Sources