# Overview
Data serialization (also called encoding or marshaling) is the process of converting in-memory data into a sequence of bytes. It is needed whenever data held in memory has to leave the process, for example when it is sent over the network or written to a database. Several formats can be used for this, each with its own strengths and weaknesses. These include JSON, XML, Avro, and others.
## Why Must This Be Done?
![[Data Serialization 2024-11-07 14.07.24.excalidraw.svg]]
To best understand this concept, consider the different physical media where data may live at any given moment and how data is stored on each one. When data is on your computer or in a database, it is stored directly on the hardware. A data structure such as an array can be stored and accessed by index. The storage is designed to optimize for reading and writing the data.
When you send the data elsewhere, it must travel over the network. Unfortunately, there isn't a really skinny CPU inside that wire. There are actual wires that can only carry data as a byte sequence in [[Binary]] (a series of 0s and 1s). This representation is designed to optimize for moving data quickly. Thus, the data you want to send must change its format via data serialization.
Finally, this process must be reversible, so the data can be returned to its original format when it is on another computer.
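
As a minimal sketch of that round trip, the Python snippet below serializes an in-memory object to bytes and restores it. JSON is used only as a convenient example format, and the record and its field names are hypothetical.

```python
import json

# Hypothetical in-memory record; the field names are illustrative only.
record = {"user_id": 42, "name": "Ada", "scores": [98, 87, 91]}

# Serialize: in-memory object -> JSON text -> bytes that can go over the wire.
payload = json.dumps(record).encode("utf-8")

# Show the first four bytes as the 0s and 1s that actually travel.
bits = " ".join(f"{byte:08b}" for byte in payload[:4])

# Deserialize on the receiving side: bytes -> JSON text -> in-memory object.
restored = json.loads(payload.decode("utf-8"))

print(type(payload).__name__, len(payload))
print(bits)
print(restored == record)  # the process is reversible
```

The receiver ends up with an object equal to the original, even though only raw bytes crossed the network.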
# Key Considerations
## Categories of Data Serialization Formats
- **Language-Specific Formats** - a serialization package / library tied to a specific language.
    - Pros
        - Convenient to use: little code is needed and it integrates easily with the language you are already using
    - Cons
        - Encoded data usually cannot be decoded by a system written in a different language
        - Can introduce security problems, because the decoding process needs to be able to instantiate arbitrary classes
        - Versioning, performance, and CPU usage are typically not priorities
- **Textual Encoding Formats** - widely accepted formats that are human-readable
    - Pros
        - Can be used across organizations, since both sides can encode / decode the same format
    - Cons
        - Not the best choice when speed or minimal memory usage matters
            - Some textual encodings have binary variants, which help address this issue
- **Binary Encoding Formats** - formats that represent data in a compact binary representation
    - Pros
        - Typically faster to transfer and use less memory
        - They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.
        - The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date (whereas manually maintained documentation may easily diverge from reality).
        - Keeping a database of schemas allows you to check forward and backward compatibility of schema changes before anything is deployed.
        - For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
    - Cons
        - Not human-readable: the encoded bytes must be decoded, usually with the schema in hand, before they can be inspected
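
To make the trade-offs between the three categories concrete, here is a rough sketch that encodes one record three ways using only the Python standard library. The record, the field order, and the `struct` layout are assumptions made for illustration. `struct` stands in for a schema-based binary format; it is not how Avro, Thrift, or Protocol Buffers actually encode data.

```python
import json
import pickle
import struct

# Hypothetical record; the field names and values are illustrative only.
record = {"user_id": 42, "temperature": 21.5, "active": True}

# Language-specific: pickle output is only meant to be read back by Python.
pickled = pickle.dumps(record)

# Textual: readable by any language, but field names are repeated in the data.
as_json = json.dumps(record).encode("utf-8")

# Binary with a schema agreed on in advance (assumed layout: little-endian
# unsigned 32-bit int, 64-bit float, 1-byte bool). Field names are omitted
# from the encoded data because both sides already know the field order.
SCHEMA = struct.Struct("<Id?")
FIELDS = ("user_id", "temperature", "active")
packed = SCHEMA.pack(*(record[name] for name in FIELDS))

# Decoding the binary form requires the same schema and field order.
decoded = dict(zip(FIELDS, SCHEMA.unpack(packed)))

print(len(pickled), len(as_json), len(packed))
print(decoded == record)
```

The packed form is the smallest because it carries values only, but without `SCHEMA` and `FIELDS` those bytes are meaningless, which is the readability cost of binary encodings.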
## Types of Formats
| Category | Format | Pros | Cons |
| ---------------------------------- | -------------------- | -------------------------- | -------------------------- |
| Language-Specific | [[Pickle]] | ![[Pickle#Pros]] | ![[Pickle#Cons]] |
| Language-Specific | [[Marshal]] | ![[Marshal#Pros]] | ![[Marshal#Cons]] |
| Textual Encoding | [[JSON]] | ![[JSON#Pros]] | ![[JSON#Cons]] |
| Textual Encoding | [[XML]] | ![[XML#Pros]] | ![[XML#Cons]] |
| Textual Encoding | [[csv]] | ![[csv#Pros]] | ![[csv#Cons]] |
| Binary Variant of Textual Encoding | [[BSON]] | ![[BSON#Pros]] | ![[BSON#Cons]] |
| Binary Encoding | [[Apache Avro]] | ![[Apache Avro#Pros]] | ![[Apache Avro#Cons]] |
| Binary Encoding | [[Apache Thrift]] | ![[Apache Thrift#Pros]] | ![[Apache Thrift#Cons]] |
| Binary Encoding | [[Protocol Buffers]] | ![[Protocol Buffers#Pros]] | ![[Protocol Buffers#Cons]] |
- [[Orc]]
- [[YAML]]
- [[Parquet]]
# Implementation Details
# Useful Links
# Related Topics
## Reference
#### Working Notes
#### Sources
#### Related Topics
- [[Compression]]