Log Structured Merge (LSM) Trees

## Data Structure Name: LSM Trees ### Description An LSM tree is a data structure designed to efficiently store key-value pairs of data often used for storing data in databases. It is designed to leverage in-memory storage (i.e., [[RAM]]) with [[MemTable]], disk storage with [[SSTable]], and a [[Write-ahead Log (WAL)]] to enable fast writes, while also providing strong durability. ### Variations ### Diagram ![[Log Structured Merge (LSM) Trees 2024-10-03 12.54.14.excalidraw.svg]] ### General Use ## Key Considerations ### Properties ### Strengths - Fast writes to in-memory ### Weaknesses - Slow reads due to having to search multiple different places for the value of a given key ### Implementation Details | | [[Big O - Time Complexity]] | [[Big O - Space Complexity]] | | ------------- | --------------------------- | ---------------------------- | | Access | | | | Add | O(log(n)) | | | Delete | O(N) | | | Read / Search | O(log(n)) | | #### The 3 Amigos: Write-ahead Log, MemTables, and SSTables When data is written to an LSM Tree, the data is first written to the write-ahead log to ensure durability. Next, it is written to the [[MemTable]]. After the MemTable reaches a certain size or time threshold it is flushed to an [[SSTable]] for persistence. When the MemTable is flushed, the write-ahead log also flushes data associated with that MemTable to save space. For reads, an LSM Tree first checks the MemTable. If the data is not found, it uses a [[Bloom Filter]] to determine which SSTables may have the data. The SSTables are read from newest to oldest to ensure the latest version of the data is found first. This is also done is combination with an SSTable Index that tracks the byte offset for where SSTable files are stored on-disk. #### Compaction As data enters the LSM Tree, the size and number of SSTables needed to hold the data increases. Eventually, this can become an issue due to excessive storage costs and reduce read performance to search across the many SSTables. Fortunately, these SSTables are full of old data that has since been updated so it can be removed. Moreover, since the SSTables are sorted this can be done efficiently and SSTables can be merged to improve performance. This is done through a process called [[Compaction]]. #### Potential Optimizations - [[Bloom Filter]] - [[Sparse Index]] as an SSTable Index ### Implementation Tips ### Special Properties ## Specific Use Cases ## Reference #### Working Notes #### Sources