Databases - Ryan Lynch's Hub

# Overview The data used in systems needs a location to be [[persisted]], so that it can be accessed and used when needed. A database provides this capability (and much more) by storing data on hardware, like a [[Solid State Drive (SSD)]] or [[Hard Disk Drive (HDD)]] There are many types of different databases each with their own strengths and weakness. Even databases of generally the same type (i.e., [[Relational Databases]]) can have wildly different implementations under the hood. The first step in selecting the correct database is to first fully understand a system's [[Functional Requirements]] and [[Nonfunctional Requirements]]. This will help determine the access patterns of the data, thus informing what needs to be considered in how the database needs to be used. # Key Considerations ## Types of Databases - [[Relational Databases]] - [[NoSQL Database]] - [[Blob Storage]] ## OLTP vs. OLAP Once a certain database is selected, the way in which it is designed can have significant impact. Some databases are designed to be On-line Transaction Processing (OLTP) databases used to take and complete updates on the data based on some actions / events / transactions from some upstream processes. The updates are often against individual rows in a database (or a small set of rows), thus many OLTP database using [[Relational Databases]]. Others are considered On-line Analytical Processing databases used to do computational functions across a large set of data, such as counts, sums, aggregations, etc. and deliver these results to downstream systems. This type of action often goes against entire columns of a database, thus many OLAP databases using [[Columnar]] store or other types of [[NoSQL Database]]. Under the hood, the difference between an OLTP and OLAP database is how the data is stored and accessed via the [[Databases#Database Storage Engines]]. ![[2024-11-18_Databases.png]] ## Database Selection At a deeper level database selection comes down to the tradeoffs across the following areas. ### Consistency vs. Availability As summarized by the [[CAP Theorem]], there is a tradeoff between a databases' ability to meet [[Consistency]] vs. [[Availability]] needs. #### Availability To improve availability and [[Fault Tolerance]] a database's data can be replicated to protect against data corruption or outages. It can also be replicated to bring data closer to users to improve performance. The [[Database Replication]] process must be finely tuned to meet the needs of a system. #### Consistency As data is replicated, it becomes more and more difficult to ensure the data is consistent in all of its locations. Different databases offer different levels of [[Consistency]] (e.g., [[Strongly Consistent]] vs. [[Weakly Consistent]]). Moreover, even within a specific type of consistency, there can be many different approaches for implementing consistency through [[Consensus]]. # Implementation Details ## Database Storage Engines #flashcard A database's storage engine is responsible for *how* data will be stored, which then impacts the ability to read and write data. The two primary families of storage engines are [[Log-structured Storage Engines]] and [[Page-Oriented Storage Engines]].  | Topic | Pros | Cons | | ---------------------------------- | ---------------------------------------- | ---------------------------------------- | | [[Log-structured Storage Engines]] | ![[Log-structured Storage Engines#Pros]] | ![[Log-structured Storage Engines#Cons]] | | [[Page-Oriented Storage Engines]] | ![[Page-Oriented Storage Engines#Pros]] | ![[Page-Oriented Storage Engines#Cons]] | ## Database Query Engines ## Performance and Scalability The database storage engine is a driving factor for performance that a user may not be able to refine or change. There are many, many other implementation details that can be considered to improve a system's performance that is part of the design implemented by the database users. Note, the reoccurring theme of different databases handling the specifics for these areas still holds. Under the hood, the implementations of things like concurrency and indexing are specific to a [[Relational DB Products]] and [[NoSQL DB Products]]. ![[Back of the Envelope Estimation#Databases]] ### Indexing Remember, although it is easy to think about data and system existing in the ether between our physical world and the world of the tech gods, there is still a physical manifestation of every piece of technologies. This physical piece is what fundamentally causes slow performance on a database. It takes longer to access data that is stored far apart from each other. A [[Database Index]] is a data structure a database will create to more efficiently find the data it needs. Rather than constantly moving randomly across the physical storage of the data, the indexing data structure will provide a means of quickly finding the data (the method used to find the data differs depending on the *type* index). An index is typically applied on a certain column or attribute of the data, which is selected based on how the data is most frequently accessed (i.e., are you accessing users by their name or by their location?). The indexing strategy used differs from database to database, so it should be considered in database selection. ## Additional Advanced Database Topics - [[Database Transactions]] - [[Database Locking]] # Useful Links - [[Data Query and Processing Languages]] # Related Topics ## Reference #### Working Notes #### Sources #### Related Topics - [[00_Data Movement]]