# Overview

A data lake is an approach for storing and maintaining enterprise data. It uses file storage to hold heterogeneous data (i.e., data of different types) in a centralized repository. This allows for cheaper storage, since file storage is typically less expensive than other types of storage, and decouples storage from compute so they can be scaled independently.

# Key Considerations

## Pros

- Provides flexible data storage for structured, semi-structured, and unstructured data
- Data can become quickly available via [[Data Stream]]s since there is no need to pre-process the data
- Provides cost-efficient storage for large amounts of data

## Cons

- The convenience of a flexible schema and quickly available data comes with a tradeoff in transactional support, data reliability, and data governance, which can lead to [[Data Swamps]]
- While the data is more quickly available, analysis can take longer to run on such large data sets

## Key Characteristics

Data lakes typically use open file formats, such as [[Apache Avro]], [[Apache Parquet]], or [[ORC]].

# Use Cases

- [[Machine Learning (ML) and AI]]

# Related Topics