![[From Storage to Presentation - Performance Considerations in a Data Lakehouse Architecture 2025-01-10 14.02.22.excalidraw.svg]]
*Note: this is the last of a four-part series on self-service analytics. For the other parts, please check out [my substack](https://substack.com/@ryanlynchlog).*
While a [[Data Lakehouse]] centralizes data storage, the journey from source to end user still involves multiple layers and technologies. Understanding the role of each component can be challenging (although, I try my best to summarize them here: [[The Self-Service Analytics Tech Stack - Finding your Sweet Spot]]), and even more difficult is understanding how their implementation affects your overall data lakehouse architecture, as each new component can introduce potential bottlenecks.
![[open table format 2025-01-10 09.15.35.excalidraw.svg]]
At the end of the day, one of the key success criteria of any analytical architecture will be [[performance]]. So, I want to take some time to explore how each component of a data lakehouse architecture impacts your analytical workloads and where you can apply performance improvements.
To best understand performance, I find it helpful to drill down until I can picture what's happening at a physical level. Imagining someone running around a library looking for data (books) makes performance concepts more concrete. After all, data is subject to the same rules of physical reality. For each layer, we'll examine two primary aspects of accessing data quickly:
1. What are the impacts on how the data is physically stored (i.e., how are things organized)?[^1]
2. What are the impacts on how the data is retrieved (i.e., how can we quickly find the right data)?
## The Data Journey in a Data Lakehouse
![[Understanding Performance in a Data Lakehouse Architecture 2025-01-14 12.34.35.excalidraw.svg]]
### Data Sources
Source systems design their data based on their primary purpose, which typically isn't analytics. That's okay; we can optimize the data as it moves through our analytical architecture.
While we don't want to change source data, this is a good time to discuss an important performance factor that affects all layers: [[schemas]].
#### Schemas
Data can be structured, semi-structured, or unstructured. The primary difference between the first two and the latter is the presence of a schema, which is essentially a blueprint of the data.
A *logical schema* is the focus of [[data modeling]]. It provides a logical structure around the data (tables, columns, relationships, etc.). The logical schema influences how the data will be physically stored, which is the *physical schema*. The logical schema is essentially your main opportunity as a developer to dictate answers to our two questions posed earlier.
**1. What are the impacts on how the data is physically stored (i.e., how are things organized)?**
The physical schema is largely decided by the specific technology storing the data (see [[Databases#Database Storage Engines]]), but there are attributes in a logical schema that directly impact the physical schema and therefore have a significant impact on performance. These include partitions (which dictate what data is located physically close together) and other attributes that dictate how the data is physically ordered (i.e., sorted). Think of it like organizing a library: should the fantasy section be next to non-fiction? Should fantasy books be sorted by author name or title?
![[From Source to Semantics - Performance Considerations in a Data Lakehouse Architecture 2025-01-13 14.23.42.excalidraw.svg]]
Certain types of logical schemas are better for analytical use cases than others, such as a [[star schema]] or [[One Big Table (OBT)]] model. These can be paired with storage types that store column data close to each other physically (as opposed to row data close to each other, more on this later) to allow for super fast calculations across columns.
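To make this concrete, here is a minimal sketch of a star schema using DuckDB in Python, purely as an illustration (the table and column names are hypothetical, and DuckDB is just a convenient stand-in). The analytical query at the end touches only a couple of columns across many rows, which is exactly the access pattern columnar storage serves well.

```python
import duckdb

con = duckdb.connect()

# A tiny star schema: a central fact table surrounded by dimension tables
con.execute("CREATE TABLE dim_book (book_id INTEGER, title TEXT, genre TEXT)")
con.execute("CREATE TABLE dim_date (date_id INTEGER, year INTEGER, month INTEGER)")
con.execute("CREATE TABLE fact_checkout (book_id INTEGER, date_id INTEGER, copies INTEGER)")

# Analytical queries typically scan a few columns across many rows,
# which pairs well with columnar physical storage
con.execute("""
    SELECT b.genre, SUM(f.copies) AS total_checkouts
    FROM fact_checkout f
    JOIN dim_book b USING (book_id)
    GROUP BY b.genre
""").fetchall()
```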
**2. What are the impacts on how the data is retrieved (i.e., how can I quickly run to the right spot for this data)?**
Additionally, schemas can define indexes, which are additional pieces of data stored that act as maps to find your data quickly. Rather than running across the entire library searching across everything, you can go straight to the right shelf[^2].
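As a quick illustration of the index concept (not specific to any lakehouse technology), here is a sketch using Python's built-in sqlite3; the table and index names are made up. The query plan output should show the engine going straight to the index instead of scanning every row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, genre TEXT)")
con.executemany(
    "INSERT INTO books (title, genre) VALUES (?, ?)",
    [("Book A", "fantasy"), ("Book B", "non-fiction")] * 1000,
)

# The index is the map: it points straight to the rows for a given genre
con.execute("CREATE INDEX idx_books_genre ON books (genre)")

# The plan should show a SEARCH using the index rather than a full table SCAN
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM books WHERE genre = 'fantasy'"
).fetchall())
```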
With the importance of schemas well established, you can probably deduce that unstructured data with no schema can negatively impact performance. There are workarounds to improve performance, but a well-designed logical schema will always perform better.
We'll continue to mention a schema's role throughout the following sections. Its role in providing the right blueprint for data to be efficiently stored and retrieved is always pivotal.
#### Data Replication
During [[data replication]], the source data is copied and moved into the data lake using whichever tools meet your use case (e.g., Airbyte, Fivetran). It is considered a best practice for the data to land in a 'raw' zone where it remains in the same state as in the source system. This means the source schema stays intact.
From there, the data is further processed via the data pipeline. The most common approach is a [[Medallion (or Multi-Hop) Architecture]]. We won't get into the specifics of how the data changes between each layer (and the right approach will differ based on the use case). But the key takeaway is that at some point during this pipeline your data is transformed and new schema(s) are applied for your analytical needs.
Your source data will be reformatted using the new schema (e.g., star schema) to allow for faster analytical queries. To further improve performance, it will be physically stored in an open file format that best serves your use cases.
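As a hypothetical sketch of that hop (the paths, column names, and the use of DuckDB here are illustrative assumptions, not a prescribed toolchain): raw source data lands untouched, and a cleaned, typed copy is written out in an open file format for the next zone of the pipeline.

```python
import duckdb

con = duckdb.connect()

# Read from the untouched 'raw' zone, apply light cleanup and typing,
# and persist the result as Parquet in the next zone of the pipeline
con.execute("""
    COPY (
        SELECT
            order_id,
            customer_id,
            CAST(order_ts AS TIMESTAMP) AS order_ts,
            CAST(amount AS DECIMAL(10, 2)) AS amount
        FROM read_csv_auto('raw/orders.csv')
    ) TO 'silver/orders.parquet' (FORMAT PARQUET)
""")
```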
### Object Storage
![[Understanding Performance in a Data Lakehouse Architecture 2025-01-14 12.35.29.excalidraw.svg]]
The unique performance benefits[^3] of storing data in object storage aren't from the storage itself but rather are a result of object storage byproducts:
- **Decoupled Storage and Compute** - using object storage allows for compute to be scaled separately from storage, so you have the ability to easily add more processing power as needed.
- **Integration with Columnar Open File Formats and Open Table Formats** - both these technologies have been built to allow for efficiently using object storage for analytics! They are the driving force behind the legitimacy of the data lakehouse approach.
Of course, the cheap, infinite scaling is also a nice perk of this storage, but we'll continue to focus specifically on performance. The answers to our two key questions are best covered in the discussion of columnar open file formats and open table formats.
#### Columnar Open File Formats
[[Columnar Open File Formats]] are a subset of open file formats that specialize in storing data in a columnar fashion. This makes them great candidates for storing data that uses the analytical schemas discussed earlier! We'll use [[Apache Parquet]] as our example, since it is the most popular for analytics at the time of writing.
**1. What are the impacts on how the data is physically stored (i.e., how are things organized)?**
Your original data will be physically shifted around to fit the new schema[^4] applied during the data pipeline. Even a perfect blueprint, though, doesn't matter unless we have the right materials to execute it. This is where something like Parquet comes into play.
In our library analogy, imagine we have a plan that meets all of our needs, but the only shelves we have look like this:
![[From Source to Semantics - Performance Considerations in a Data Lakehouse Architecture 2025-01-13 11.57.10.excalidraw.svg]]
A good plan needs the right tools. As an example, Parquet stores data in a columnar format that allows for quick scans across an entire column. As mentioned before, this pairs nicely with our analytical schemas.
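Here is a minimal sketch of that idea using PyArrow (the dataset path and column names are hypothetical). Sorting controls how rows are physically ordered inside the files, and the partition column controls which rows land in the same directory of files.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small, made-up slice of our library data
table = pa.table({
    "genre": ["fantasy", "fantasy", "non-fiction", "fantasy"],
    "title": ["B", "A", "C", "D"],
    "pages": [350, 420, 180, 95],
})

# Physically order rows by title, then write one directory of files per genre.
# Related data ends up stored close together on disk.
table = table.sort_by([("title", "ascending")])
pq.write_to_dataset(table, root_path="books_dataset", partition_cols=["genre"])
```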
Also, file size management is crucial: you need to strike a balance between files that are too large (requiring more data to be read) and files that are too small (requiring too many file reads). [[Compaction]] helps maintain this balance.
![[From Source to Semantics - Performance Considerations in a Data Lakehouse Architecture 2025-01-13 12.23.26.excalidraw.svg]]
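A rough sketch of compaction with PyArrow's dataset API, reusing the hypothetical dataset from above (the paths and row-count targets are illustrative assumptions): read the many small files as one logical dataset and rewrite them into fewer, larger files.

```python
import pyarrow.dataset as ds

# Treat all the small Parquet files as one logical dataset...
small_files = ds.dataset("books_dataset", format="parquet", partitioning="hive")

# ...and rewrite them, aiming for fewer, larger files
ds.write_dataset(
    small_files,
    "books_dataset_compacted",
    format="parquet",
    min_rows_per_group=100_000,
    max_rows_per_file=1_000_000,
)
```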
**2. What are the impacts on how the data is retrieved (i.e., how can I quickly run to the right spot for this data)?**
Additionally, we should use tools that help us search more quickly across all of our data. Parquet offers this through encoding and compression to reduce the file size.
Instead of walking past 20 copies of *Twilight*, there is only one copy and a note that says "we have 20 of these." You don't need to walk all the way to the next book you need; it is right within arm's reach!
![[From Source to Semantics - Performance Considerations in a Data Lakehouse Architecture 2025-01-13 12.12.22.excalidraw.svg]]
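A small sketch of that effect with PyArrow (the file names and the repeated-title data are made up): writing the same column with dictionary encoding and compression turned on versus off typically produces a dramatically smaller file, which means far less data to scan.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# A column with heavy repetition: many "copies" of the same title
table = pa.table({"title": ["Twilight"] * 20_000 + ["Dune"] * 5_000})

pq.write_table(table, "titles_plain.parquet", use_dictionary=False, compression="none")
pq.write_table(table, "titles_encoded.parquet", use_dictionary=True, compression="zstd")

# The encoded/compressed file is typically a small fraction of the plain one
print(os.path.getsize("titles_plain.parquet"), os.path.getsize("titles_encoded.parquet"))
```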
### Open Table Formats
![[Understanding Performance in a Data Lakehouse Architecture 2025-01-14 12.36.26.excalidraw.svg]]
You made your plan (i.e., logical schema) and executed it into a physical schema stored in columnar open file formats. You'd love to live in this moment of serendipity forever, but you have to handle some complicated situations due to the data being distributed across many individual files and ever-changing requirements:
- You are not always looking for a single piece of data. Sometimes you need to find all data of a certain category or parameter. How do you know which files hold all the data you need?
- You need to update the data. How do you do so without impacting in-progress queries?
- You need to update your schema, or need to go back to old versions of your data. How can this be done without needing to update every individual file?
In the database world, there are many approaches for managing these requirements through your schemas, [[database transactions]], [[isolation]], and other core concepts. In our data lakehouse, we have the [[open table format]].
Broadly, an open table format is a way of capturing the underlying organized nature of files/objects within [[object storage]]. In more concrete terms, it uses metadata to describe which files belong to the same "table", which files are in the same "partition" of a table, and many other facets of the data (different open table formats capture different information that enables advanced features like data "time travel" and schema management). This allows us to say "get all of my book titles", and the open table format will tell us which files have that data.
Hm, an open table format is like a blueprint of how all our files are related and stored? Yes... sound familiar? It uses *another* schema! You can think of the Parquet schema as describing the data *within* a file, and the open table schema as describing the data *across* files. This allows for creating a unified view of data. Even if two Parquet files use different schemas, they can be unified in the open table format's schema. But it is important to make sure the open table format schema and open file format schema work well together.
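As one hedged illustration, the deltalake Python package (the delta-rs bindings for [[Delta Lake]]) exposes this metadata directly; the path and columns below are hypothetical. The table format tracks which Parquet files make up the table, the table-level schema, and a transaction log.

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

table = pa.table({"genre": ["fantasy", "sci-fi"], "title": ["A", "B"]})
write_deltalake("books_delta", table, partition_by=["genre"])

dt = DeltaTable("books_delta")
print(dt.files())    # the Parquet files that make up the current table version
print(dt.schema())   # the table-level schema shared across those files
print(dt.history())  # transaction log entries, which enable time travel
```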
**1. What are the impacts on how the data is physically stored (i.e., how are things organized)?**
Since the data is already stored within files, there is not much impact on how the data is physically stored. That said, open table formats do offer "logical" partitioning that can improve performance without actually moving the physical data.
Certain open table formats also offer approaches for changing the way data is stored on-disk for the purposes of performance. [[Delta Lake]] offers [[z-ordering]] and [[Apache Iceberg]] offers clustering. Both approaches make it so data from frequently queried columns are located closer together.
Also, we briefly mentioned the impact of file size and the process of compaction when discussing open file formats. Open table formats assist with this process because they know which files belong to which tables and can track file sizes across the entire table.
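Continuing the hedged deltalake example from above (the table path and the optimizer API are assumptions based on that package; other open table formats expose similar operations through their own tooling):

```python
from deltalake import DeltaTable

dt = DeltaTable("books_delta")

# Compaction: rewrite many small files into fewer, larger ones
dt.optimize.compact()

# Z-ordering: co-locate rows that are frequently filtered together
dt.optimize.z_order(["genre", "title"])
```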
**2. What are the impacts on how the data is retrieved (i.e., how can I quickly run to the right spot for this data)?**
The real performance benefits come in this second area. Thanks to all the metadata stored about the underlying files, there are myriad options that allow queries to take "shortcuts" when retrieving data:
- **File locations** - by knowing where every file lives, a query can avoid scanning across the entire storage location
- **Partition pruning** - the logical partitions mentioned earlier can be used to skip over irrelevant files. If you are searching for books in a specific genre, only the relevant files will be searched.
- **File Pruning and Skipping** - open table formats include summary statistics about each file (e.g., counts, max/min values). If you are looking for books with fewer than 100 pages, the query can check whether a file's minimum page count is less than 100. If it isn't, the entire file can be skipped (see the sketch after this list).
- **Incremental Queries** - through open table format transaction logs, queries can run only on new data added since the last time the query ran.
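Here is the sketch referenced above, using PyArrow's dataset reader against the hypothetical partitioned dataset from earlier (open table formats and query engines apply the same ideas at a larger scale): the genre filter prunes whole partitions, and the page filter lets the reader skip files and row groups whose min/max statistics rule them out.

```python
import pyarrow.dataset as ds

dataset = ds.dataset("books_dataset", format="parquet", partitioning="hive")

# Partition pruning: only genre=fantasy directories are touched.
# File/row-group skipping: min/max statistics rule out chunks with no short books.
fantasy_short_books = dataset.to_table(
    filter=(ds.field("genre") == "fantasy") & (ds.field("pages") < 100)
)
```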
Aside from these performance benefits, open table formats also allow for schema management (updating schemas or moving to new versions of a schema), schema enforcement, time travel, and other capabilities. I'm sure real libraries wish they had an open table format equivalent.
![[From Source to Semantics - Performance Considerations in a Data Lakehouse Architecture 2025-01-13 14.29.41.excalidraw.svg]]
### Query Engine
![[Understanding Performance in a Data Lakehouse Architecture 2025-01-14 12.37.34.excalidraw.svg]]
Fortunately, query engines are here to help decide which of the aforementioned performance improvements will help our queries most. There are various types of query engines that integrate with open table formats. This list is not comprehensive, but these are two of the most popular types:
- **Massively Parallel Processing (MPP) Query Engines** - specialized systems designed to handle large-scale data processing by dividing queries into smaller tasks and executing them in parallel across multiple nodes within a tightly integrated system
- **Distributed SQL Query Engines** - designed to process and execute SQL queries across distributed data sources, such as object storage (e.g., AWS S3), HDFS, or even multiple databases, without tightly coupling the query engine with the storage
**1. What are the impacts on how the data is physically stored (i.e., how are things organized)?**
For the most part, data is not stored in a query engine (other than reference data the engine uses to make decisions in the query optimizer). The exception is caching: the query engine may store the results of popular queries so they don't need to be re-run in the future.
**2. What are the impacts on how the data is retrieved (i.e., how can I quickly run to the right spot for this data)?**
Each type of query engine uses the open table format metadata to plan the most efficient approach for completing a query. These approaches include (a brief sketch follows the list):
- **"Pre-filtering" (Predicate Pushdown)** - query engines use the logical schema and Parquet metadata to filter rows before reading the data. For example, if a query filters on `genre = 'fantasy'`, only the relevant Parquet column and chunks are read.
- **Parallelism** - queries are broken into tasks that can run simultaneously on different parts of the dataset, leveraging the distributed nature of object storage.
- **Vectorized Processing** - processes data in batches rather than row-by-row, speeding up computations.
- **Lazy Evaluation** - executes only the parts of the query that are absolutely necessary, avoiding unnecessary work.
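To see a few of these in miniature, here is a hedged sketch using Polars' lazy API over the hypothetical Parquet files from earlier (a single-node stand-in, not an MPP or distributed engine): nothing is read until `collect()`, the filter and column selection are pushed down into the scan, and execution is vectorized.

```python
import polars as pl

# Build a query plan; no data is read yet (lazy evaluation)
lazy = (
    pl.scan_parquet("books_dataset/**/*.parquet")
    .filter(pl.col("pages") < 100)   # predicate pushdown
    .select(["title", "pages"])      # projection pushdown (column pruning)
)

print(lazy.explain())    # inspect the optimized plan
result = lazy.collect()  # vectorized execution happens only here
```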
![[Understanding Performance in a Data Lakehouse Architecture 2025-01-14 09.20.33.excalidraw.svg]]
### Semantic Layer
![[Understanding Performance in a Data Lakehouse Architecture 2025-01-14 12.38.20.excalidraw.svg]]
In our penultimate layer, we put the data into a format ready for business or application consumption. We lightly covered this topic during the discussion of the [[Medallion (or Multi-Hop) Architecture]]. The layers of cleaned and curated data are the start of a semantic layer. Those views of the data alone may even be sufficient for most organizations, but defining the right capabilities for a semantic layer is a discussion for another time.
In the semantic layer, you are essentially making two key decisions:
1. What pre-computed and pre-aggregated views of the data do you want available?
2. Do you want to persist any of the pre-computed and pre-aggregated views of the data (i.e., create a materialized view)?
![[Understanding Performance in a Data Lakehouse Architecture 2025-01-14 09.58.50.excalidraw.svg]]
The right answers will depend on your enterprise and analytical workloads.
**1. What are the impacts on how the data is physically stored (i.e., how are things organized)?**
As you can guess, your answers to the key semantic layer decisions will dictate the impact here. If you are not persisting the views, then the physical data will remain untouched. Every time you access a particular view, its query will run against the physical data, subject to the factors discussed earlier.
If you do create materialized views, then new data and files will be physically stored within your data lake. The query used to create the view will no longer execute when you access the materialized view. All of the computations and aggregations will already be done, which is the benefit of pre-computed views. It will be a fast, straightforward retrieval of data stored together.
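A hedged sketch of the difference using DuckDB (which does not have native materialized views, so a table created from the view's query stands in for one; the paths and names are hypothetical):

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# A plain view: the aggregation re-runs against the Parquet files on every access
con.execute("""
    CREATE OR REPLACE VIEW books_by_genre AS
    SELECT genre, COUNT(*) AS n_books
    FROM read_parquet('books_dataset/**/*.parquet', hive_partitioning = true)
    GROUP BY genre
""")

# A "materialized" version: persist the pre-aggregated result so reads are simple lookups
con.execute("CREATE OR REPLACE TABLE books_by_genre_mat AS SELECT * FROM books_by_genre")
```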
Also, similar to the query engines, results can be cached in the semantic layer to speed up repetitive queries.
**2. What are the impacts on how the data is retrieved (i.e., how can I quickly run to the right spot for this data)?**
When queries run in your semantic layer, they will use all of the schemas and tools to improve performance as discussed earlier.
Some tools used for building semantic layers offer specialized capabilities that bundle characteristics from earlier layers to improve performance (e.g., [[Dremio]]'s Reflections and [[Starburst]]'s Warp Speed). I have not done a deep-dive on these, but they are likely just extensions of the approaches we've already discussed: caching, pre-computing and materializing certain data, and schema improvements.
One unique aspect of a semantic layer's performance that is worth mentioning is the concept of [[data virtualization]]. As defined by Google's AI results: *a data management technique that allows users to access and manipulate data without needing to know its technical details. It creates a single virtual data layer that combines data from different sources, locations, and formats.*
From a user's perspective, you may not know that you are querying unstructured data or data with a poor schema, which may be the cause of poor performance. This is why it is pivotal to understand how the data is physically stored so you know where to fix performance issues.
![[Understanding Performance in a Data Lakehouse Architecture 2025-01-14 10.34.12.excalidraw.svg]]
### BI and Analytical Tools
Finally, we are at the end of our data's journey. Users access the data through whichever tools your organization supports. Each tool likely has some strategies to store data closer to the user (within the tool itself), or other performance enhancements. This varies so much from product to product that we won't discuss it in detail here.
It is a good time to mention how the queries themselves can impact performance. This applies both to user queries and to the queries used in earlier layers to create views and materialized views. The query itself is what the query engine considers when planning an execution approach. Two queries can return the same results yet be written in ways that perform very differently.
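A small, hedged illustration (DuckDB and pandas here, with hypothetical paths and columns): both versions produce the same list of fantasy titles, but the second expresses the filter and column list in the query itself, so the engine can prune partitions, files, and columns before any data is read.

```python
import duckdb

SRC = "read_parquet('books_dataset/**/*.parquet', hive_partitioning = true)"

# Version 1: pull everything back, then filter in application code
all_rows = duckdb.sql(f"SELECT * FROM {SRC}").df()
fantasy_titles_v1 = all_rows.loc[all_rows["genre"] == "fantasy", "title"]

# Version 2: same result, but the engine can skip irrelevant partitions,
# files, and columns because the query says exactly what is needed
fantasy_titles_v2 = duckdb.sql(f"SELECT title FROM {SRC} WHERE genre = 'fantasy'").df()["title"]
```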
Writing a performance-optimized query is another topic that needs a book to cover, but I wanted to make sure to mention it to put a bow on our "Understanding Performance in a Data Lakehouse Architecture" topic.
# Recap
Performance doesn't have a one-size-fits-all solution. And in an architecture with so many layers, there isn't even a one-place-solves-all approach. A schema that works well in one layer could cause issues when there is more processing downstream.
To best understand these relationships, always remember that the data needs to adhere to the physical world around us. There is no real "virtual" world that bends the laws of physics. Performance will always come down to how the data is physically stored and the tricks for finding that data as fast as possible.
All of the various options we discussed are at your disposal to win the performance race.
![[open table format 2025-01-10 11.21.42.excalidraw.svg]]
# Footnotes
[^1]: The more technical phrasing would be 'data [[locality]]'
[^2]: I am embarrassed to admit that it took me many years to realize that database indexes are really just the same concept as the index at the back of a book for finding a topic right away. Who would have thunk?
[^3]: I say unique to delineate from the pros outside of parallelization and partitioning, which many storage technologies offer
[^4]: Quick reminder that as your data moves physically in storage, you may not *see* a difference. This is typically abstracted away from the user interface, but the impacts will still be felt on query performance.