
In part 1 (From Buzz to Building - Introduction to GenAI for Developers - Part 1 - Key Concepts), we sifted through the buzz words to determine what is created when using GenAI, but didn’t discuss how that is done. Understanding the “how” can be even more challenging, as you try to decipher catchy product names to determine what they actually do.
Just as we don’t write machine code when developing software, we likely won’t build an LLM from scratch. Instead, we’ll leverage various components that abstract away complexity while working together to create the final product. This collection of technical components forms our technical stack.
In this post, we’ll discuss the GenAI Technical Stack as of January 2025. There are two caveats to keep in mind:
- This stack primarily focuses on search, the most common GenAI use case today. Search applies to both chatbots and general retrieval. While we’ll cover the essential components for search below, other use cases like agents, image generation, or data extraction may require additional components not discussed here.
- Given the rapid evolution of GenAI, this stack might be outdated by publication! Though documenting the current state of something moving at breakneck speed might be a fool’s errand, let’s forge ahead.
The GenAI Technical Stack
Before diving in, let’s clarify our search use case. The process requires:
- Populating underlying data for model training and/or Retrieval Augmented Generation (RAG)
- Instantiating or training a model for product use
- Receiving user input
- Using the model and data to generate accurate responses
- Providing infrastructure to manage, experiment, evaluate, monitor, and optimize the system
Within my proposed technical stack, some components may be overkill for a simple search application, but I want to cover them here so you can make that determination on your own. You may just need a simple stack like I used in Dr. Spin - a positive spin on life using AI. Typically, I would talk through this diagram left-to-right, then up-and-down. To best learn how the stack achieves our search, let’s instead highlight the areas involved in each of the five steps of search1.
For each area, there are a lot of implementation-level details that would impact how each of these steps are designed, such as what type of data you are ingesting or what specific RAG techniques you plan to use. I will try to keep the concepts in this post abstract to prevent turning this into a book, but will provide links for you to do deeper exploration.
1. Populating underlying data…
Of course, the first step is ingesting and populating the data that will be used to search against or train / fine-tune a model. Unsurprisingly, this step occurs entirely within our data layer.
The mantra of “garbage in, garbage out” continues to ring true in all data-driven work. It is pivotal to get the right data into the right format during this process to have an effective architecture. The actual tools you’ll use in this process will differ based on your types of data and what pre-processing you’d like to apply to your data, but the end goal is the same no matter what: ingest and process data into an effectively scoped and sized numerical representation usable by LLMs.
Data Layer
Data Ingestion
First, select an appropriate data ingestion tool for your sources. For data within your technical ecosystem, consider data replication tools like Fivetran, Stitch, or Airbyte. These tools handle the extract and load phases of extract, load, transform (ELT), preserving raw data for later transformation. You can also use these tools to ingest directly from external data sources.
One common use case is using web pages as data for your GenAI application. In these cases, you can use specialized web scraping tools, such as Firecrawl or Jina AI Reader, to ingest data in LLM-ready formats. Or, you can build your own web scraper.
There are many other specialized tools depending on what data you need. There is Gitingest for git repositories and Unstructured.io for all types of unstructured data. For more options, explore this article: Top 5 Open Source Ingestion and Scraping Tools.
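If you go the build-your-own route, a few lines of Python get you surprisingly far. Here’s a minimal sketch using requests and BeautifulSoup (the URL is a placeholder); dedicated tools add things like JavaScript rendering, rate limiting, and LLM-ready markdown output on top of this.

```python
# Minimal web-scraping sketch: fetch a page and strip it down to plain text.
# The URL is a placeholder; a real scraper also needs politeness (robots.txt,
# rate limits) and often JavaScript rendering.
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> str:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Drop obvious non-content elements before extracting text.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    return soup.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    print(scrape_page("https://example.com")[:500])
```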
Data Storage
Let’s keep this short and sweet - you store the raw data you ingested. Woo!
Examples: MongoDB for documents, PostgreSQL and MySQL for relational, S3 for file storage
Data Pipelines
In our next step, we prepare the data so that it is not the aforementioned “garbage”. This may use the same tool as data ingestion or a separate one. The goals of this step are two-fold:
- Transform data into a format suitable for numerical representation
- Achieve the “effectively scoped and sized” portion of the overarching goal mentioned at the start of this step. This means:
- Scoped: the context of your numerical representation should include sufficient detail to obtain relevant results. This may include adding specific metadata to the chunks of data.
- Sized: there is a balance between too small and too large when it comes to the size of context. A context too large2 results in excessive costs and potentially too much information for the model to hold in-memory, thus confusing itself. A context too small strips away surrounding context, so retrieval may surface chunks that look relevant in isolation (e.g., a single matching sentence) but come from documents that are not relevant to the query as a whole.
Example html document - before:
<p>AI is transforming industries! At OpenAI, we develop cutting-edge technologies like GPT-4. Terms and Conditions apply. Visit us at https://openai.com.</p>
<!-- Page Header --> Artificial Intelligence (AI) is the simulation of human intelligence by machines. AI applications include Natural Language Processing (NLP), robotics, and predictive analytics. Page 1 of 10. Contact: [email protected]
Once again, the exact tool you use may depend on the type of data you are transforming. You can often reuse the ingestion tools mentioned above; other document parsing tools include LlamaParse and pypdf.
We don’t want to delve too much into Retrieval Augmented Generation (RAG) techniques and design in this discussion, but some high-level descriptions of data transformations you may apply during this process include:
- Parsing - Extracting structured data from unstructured or semi-structured formats like PDFs, web pages, or JSON files. For example:
- Extract plain text from a PDF document.
- Parse HTML content into clean text using tags.
- Extract relevant sections from JSON.
- Data Cleaning - Removing irrelevant, duplicate, or noisy data to ensure the input to the embedding model is clean and accurate. For example:
- Remove special characters, HTML tags, and unnecessary whitespace.
- Standardize text formats (e.g., lowercase everything, normalize unicode).
- Remove unrelated content like headers, footers, or boilerplate text.
- Tokenization - Pre-tokenizing text to fit within the token limits of your embedding model or segmenting based on logical boundaries. For example:
- Tokenize sentences into words or subwords.
- Split overly long chunks intelligently by sentences or paragraphs, keeping context intact.
- Chunking - Breaking large blocks of text into smaller, coherent pieces suitable for embedding models, which often have input size limits. For example:
- Chunk a book chapter into sections with a 500-word limit.
- Split an article into paragraphs or sentences.
- Break conversational transcripts into segments based on dialogue turns.
- Apply Metadata - Tagging each chunk of data with additional context to make retrieval more precise. For example:
- Add document source metadata.
- Annotate with topic, section headings, or key words.
- Include embedding-specific detail like the model or preprocessing steps used.
Example html document - after:
```json
[
  {
    "chunk": "AI is transforming industries! At OpenAI, we develop cutting-edge technologies like GPT-4.",
    "metadata": { "source": "webpage", "author": "OpenAI", "topic": "AI Overview", "page": 1, "date": "2025-01-25" }
  },
  {
    "chunk": "Artificial Intelligence (AI) is the simulation of human intelligence by machines. AI applications include Natural Language Processing (NLP), robotics, and predictive analytics.",
    "metadata": { "source": "webpage", "author": "OpenAI", "topic": "AI Applications", "page": 1, "date": "2025-01-25" }
  }
]
```
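To make the transformation above concrete, here’s a minimal sketch of a cleaning-and-chunking pass. The regex-based cleaning, sentence splitting, and word-count limit are simplified stand-ins; a real pipeline would typically use a proper HTML parser and token-aware splitters.

```python
# Simplified cleaning + chunking sketch: strip HTML, drop boilerplate,
# group sentences into size-limited chunks, and attach metadata to each chunk.
import re

BOILERPLATE_PATTERNS = [
    r"Terms and Conditions apply\.?",
    r"Page \d+ of \d+\.?",
    r"Contact: \S+",
    r"Visit us at \S+",
]

def clean(html: str) -> str:
    text = re.sub(r"<!--.*?-->", " ", html, flags=re.DOTALL)  # HTML comments
    text = re.sub(r"<[^>]+>", " ", text)                      # HTML tags
    for pattern in BOILERPLATE_PATTERNS:                      # known boilerplate
        text = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", text).strip()                  # normalize whitespace

def chunk(text: str, max_words: int = 60, metadata: dict | None = None) -> list[dict]:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return [{"chunk": c, "metadata": dict(metadata or {})} for c in chunks]

raw = "<p>AI is transforming industries! Terms and Conditions apply.</p>"
print(chunk(clean(raw), metadata={"source": "webpage", "page": 1}))
```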
Embedding Models
Now we turn our human text into the language of computers and LLMs - numbers! You can dive deeper into this end-to-end process by reading A Deep Dive into NLP Tokenization and Encoding with Word and Sentence Embeddings – Data Jenius, or get a quick overview through this Reddit comment.
Fortunately, many pre-built embedding models can handle this conversion, similar to how gpt-xx-xx and claude-3.5-sonnet offer pre-trained LLM capabilities. Your choice depends on your search and RAG strategy. Consider these options:
- Dense Vectors - continuous, fixed-dimensional embeddings where each dimension contains information about the input data. Dense vectors are compact and encode semantic relationships between inputs.
- Strengths:
- Great for semantic search and understanding.
- Effective at handling synonyms and context-aware queries.
- Limitations:
- Can be computationally expensive to train and query.
- May not work well with highly sparse, keyword-heavy, or domain-specific data without fine-tuning.
- Examples: OpenAI’s text-embedding-ada-002, Sentence Transformers (e.g., SBERT)
- Sparse Vectors - high-dimensional vectors where most values are zeros. They represent data in a keyword or feature-based manner, making them explicitly interpretable and directly tied to input tokens or features.
- Strengths:
- Excellent for exact matches and keyword-heavy domains (e.g., legal, scientific texts).
- Easy to interpret and debug.
- Low computational cost for indexing.
- Limitations:
- Lack semantic understanding; struggle with synonyms and paraphrased queries.
- High-dimensional nature may require careful storage optimizations.
- Examples: BM25, Term Frequency-Inverse Document Frequency (TF-IDF), Apache Lucene
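To see the difference in practice, the sketch below computes both kinds of vectors for the same two sentences (assuming the sentence-transformers and scikit-learn packages are installed; the model name is just a commonly used default).

```python
# Dense vs. sparse representations of the same two sentences.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Refunds are available within 30 days of purchase.",
    "You can return an item for your money back within a month.",
]

# Dense: a fixed-size vector (384 dimensions here) that captures semantics.
dense_model = SentenceTransformer("all-MiniLM-L6-v2")
dense_vectors = dense_model.encode(docs)
print(dense_vectors.shape)   # (2, 384) - every dimension is populated

# Sparse: one dimension per vocabulary term, mostly zeros, keyword-driven.
sparse_vectors = TfidfVectorizer().fit_transform(docs)
print(sparse_vectors.shape)  # (2, vocabulary_size), stored as a sparse matrix

# The dense vectors end up similar (same meaning, different words);
# the sparse vectors only overlap where the exact tokens match.
```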
Vector Databases
Finally, the numerical embeddings get stored for later use. As of today, this is most often done in a vector database. There are two general options to consider, which we discuss below. In the end, you’ll want to use a product that meets your application’s needs in terms of size, cost, performance, consistency, and other factors important to your use case.
- Purpose Built Vector Databases - these are databases created for the sole purpose of storing and working with vector data. The most popular options today are Pinecone and Weaviate. Other options include Milvus, Qdrant, Facebook AI Similarity Search (FAISS), ChromaDB, and LanceDB.
- General Relational Databases with Vector Add-ons - some common relational databases offer plug-ins / add-ons that provide functionality similar to a vector database. The most common is PostgreSQL with the pgvector extension. You can also consider Elasticsearch or Vector Search on MongoDB.
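To ground what “storing and querying vectors” actually looks like, here’s a small sketch using Chroma’s Python client (based on the chromadb package’s add-then-query pattern; pgvector, Pinecone, and the others follow a very similar flow). The toy 3-dimensional embeddings stand in for real embedding-model output.

```python
# Store pre-computed embeddings with metadata, then run a similarity query.
import chromadb

client = chromadb.Client()  # in-memory; persistent clients are also available
collection = client.create_collection(name="docs")

collection.add(
    ids=["chunk-1", "chunk-2"],
    embeddings=[[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]],  # toy vectors for illustration
    documents=["Refund policy text...", "Shipping policy text..."],
    metadatas=[{"source": "policy.pdf"}, {"source": "faq.html"}],
)

results = collection.query(
    query_embeddings=[[0.1, 0.2, 0.25]],  # embedding of the user's question
    n_results=1,
)
print(results["documents"], results["metadatas"])
```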
2. Instantiate and/or train a model that can be used within our product
The brain of our search lives in the Large Language Models (LLMs) layer. In this step, there are two major decisions to make. The first decision has two parts that are closely related, so I like to view it as one decision.
- What model do you want to use, and how do you want to host the model?
- Do you want to fine-tune the model?
Let’s walk through what portions of the technical stack are involved.
Large Language Models (LLMs) Layer
Your approach to model selection and hosting typically follows one of three paths:
A.) Use a “Model Provider”
This route leverages closed-source models hosted entirely by providers, accessible only through APIs. While offering the lowest barrier to entry, this approach may incur higher long-term costs. Though you can’t access the model directly, robust APIs enable fine-tuning and customization.
Also, many of these models are considered state-of-the-art with arguably better capabilities than open-source models, but the battle for “dominant” model is always in flux.
Examples: OpenAI’s GPT-4, GPT-4o, and GPT-3; Anthropic’s Claude Sonnet, Claude Haiku, and Claude Opus; and Google’s Gemini
B.) Use a cloud service with pre-hosted models
Alternatively, you can use pre-hosted versions of models available from cloud services. In this case, the model may be closed source (e.g., Claude is available on AWS Bedrock) or open source.
In this scenario, you get more control over the model for better customizability, scalability, and flexibility. Of course, this comes at the cost of taking on more responsibility for managing the model.
Examples: A variety of models hosted on cloud platforms such as AWS Bedrock, Google AI, and Azure Cognitive Services
C.) Select an open-source model and host it yourself
With open-source models, you get visibility into what makes the model ‘tick’. You can review the specific weights and understand the architecture and training data / process. You can use the open-source model as your starting point via one of the available checkpoints (i.e., a shared snapshot of a pre-trained model).
To run and train these models, though, you need access to specialized hardware capable of completing the advanced mathematical operations across billions of parameters (e.g., Graphics Processing Units (GPUs)). Since these are open source, you can do this using cloud providers that offer shared access to their GPUs, or locally on your own specialized hardware. If you choose to host locally, there is software available to make this easier.
Open Source Model Examples: LLaMA, DeepSeek, BERT, T5, Mistral
Cloud Hosting Examples: AWS Bedrock, RunPod, Hugging Face Inference API, Triton, vLLM
Local Hosting Examples: vLLM, Ollama
The pace of progress for both closed source and open source models is blazing. You’ll want to thoroughly research what model works best for your use case, and determine what you’d like to host.
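Whichever path you choose, the calling code ends up looking similar: send messages, get a completion back. Here’s a sketch of path A using OpenAI’s Python SDK (the model name and prompts are placeholders, and an OPENAI_API_KEY environment variable is assumed).

```python
# Path A sketch: call a closed-source model through the provider's API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; pick the model that fits your use case
    messages=[
        {"role": "system", "content": "You answer questions about our product docs."},
        {"role": "user", "content": "What is your refund policy?"},
    ],
)
print(response.choices[0].message.content)
```

Paths B and C expose similar chat-style interfaces (for example, Bedrock’s invoke/converse APIs or Ollama’s local server), so swapping models later is largely a matter of changing this one call site.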
Optional: Fine-tuning a Model
Fine-tuning a model for Large Language Models (LLMs) refers to the process of taking a pre-trained model and adapting it to perform better on a specific task or dataset. Pre-trained models like GPT or Claude are trained on massive, general-purpose datasets, which gives them a broad understanding of language.
However, these models may not always perform optimally for specialized tasks—such as legal document summarization, customer service chatbots, or sentiment analysis—because their original training data is not tailored to those specific domains. Fine-tuning involves continuing the training process on a smaller, domain-specific dataset to adjust the model’s weights, making it more attuned to the particular nuances, vocabulary, and requirements of that task.
Our vector database is full of data that can be used for this fine-tuning process. Some model providers offer APIs you can use to fine-tune their models on your data. Fine-tuning is a bit complicated, though, and can hurt your model’s performance if not implemented correctly. Many consider RAG a ‘safer’ option for improving performance. Unless your use case is very specialized, I recommend using RAG first and only fine-tuning if necessary.
Examples: Hugging Face Transformers, PyTorch, DeepSpeed, Replicate, OpenAI
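If you do go down the fine-tuning path, a minimal open-source version of the loop looks roughly like the sketch below. The model name, tiny corpus, and hyperparameters are placeholders; real fine-tuning usually adds parameter-efficient methods (e.g., LoRA), a held-out evaluation set, and far more data.

```python
# Minimal causal-LM fine-tuning sketch with Hugging Face Transformers.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"  # small model so the sketch runs on modest hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 family has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny corpus standing in for your curated, domain-specific fine-tuning data.
corpus = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
]
dataset = Dataset.from_dict({"text": corpus})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("ft-out")
```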
3. Receive input from a user
While frontend interfaces and backend request handling are crucial (with various technology options available), let’s focus on the Orchestration Layer, which manages the journey from user input to final response.
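That said, the entry point itself can be tiny. Here’s a sketch using FastAPI (one of the backend options listed at the end of this post), with answer_question acting as a placeholder for the orchestration logic described next.

```python
# Minimal request-handling sketch with FastAPI. `answer_question` is a stub
# for the orchestration logic covered in the next section.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    query: str

def answer_question(query: str) -> str:
    return f"(stub) You asked: {query}"  # replace with your GenAI pipeline

@app.post("/chat")
def chat(request: ChatRequest) -> dict:
    return {"answer": answer_question(request.query)}

# Run with: uvicorn app:app --reload  (assuming this file is saved as app.py)
```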
Orchestration Layer
LLM Orchestration Frameworks
Your final application may have a very straightforward path for handling user inputs: receive the input in the UI, send it to a model provider’s API, then return that response to the user.
A production implementation of a GenAI application, though, usually involves many complicated steps to ensure consistent quality and the most relevant responses. The input enters the system; a cache is checked to see if the answer is already available (it’s not); the request is sent to an embedding model; the numerical version is then used for a similarity search in the vector database to add relevant data to the request; the package of original request and relevant docs is sent to an LLM trained for answering these questions; the LLM realizes it needs data from a relational data store, so it uses a function call to another LLM that is trained for creating SQL queries; that LLM queries the database and finally sends the result back to the user. Then, this process needs to happen for thousands of requests. PHEW!
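To make that flow a little more concrete, here’s a framework-free sketch of the happy path. Every helper is a stub standing in for a real component from this stack, and the SQL-generation hop and observability hooks are omitted for brevity.

```python
# Hand-rolled orchestration sketch: cache -> embed -> retrieve -> generate.
cache: dict[str, str] = {}
documents = ["Refunds are accepted within 30 days.", "Support hours are 9-5 ET."]

def embed(text: str) -> list[float]:
    # Stub: a real system calls the embedding model here.
    return [float(len(text)), float(text.count(" "))]

def vector_search(query_vector: list[float], k: int = 1) -> list[str]:
    # Stub: a real system queries the vector database for the nearest chunks.
    return documents[:k]

def call_llm(prompt: str) -> str:
    # Stub: a real system calls the hosted or self-hosted model here.
    return f"(model answer based on a prompt of {len(prompt)} characters)"

def answer(query: str) -> str:
    if query in cache:                      # 1. check the cache first
        return cache[query]
    query_vector = embed(query)             # 2. embedding model
    context = vector_search(query_vector)   # 3. similarity search in the vector DB
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    response = call_llm(prompt)             # 4. primary LLM call
    cache[query] = response                 # 5. cache for identical future requests
    return response

print(answer("What is the refund policy?"))
```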
An LLM orchestrator helps you track and manage these complicated workflows through simple API calls and integration with LLM observability tooling. You’ll need to pick one that integrates well with your backend services.
There is still debate whether the current out-of-the-box orchestrators can scale to production workloads, but you’ll need to use one or develop your own orchestrator-like functionality.
Examples: LangChain, LlamaIndex, Haystack, temporal.io, Microsoft Semantic Kernel, Embedchain
LLM Agent Orchestrators
We’re only covering the technical stack for search in this post, but agents are a hot topic. There are also orchestrators that specialize in managing agentic workflows. In our previous example, the workflow was somewhat pre-determined based on our design.
In agentic workflows, the models decide what to do with the request. Clearly, this adds even more complexity and variability in handling inputs, thus making some sort of orchestrator a pivotal piece of the stack.
Examples: Vertex AI Agent Builder
4. Generate a response…
Finally, we need to generate the response! We touched on this briefly above; one of the reasons for the complexity we just discussed is to enable Retrieval Augmented Generation (RAG).
RAG is one of the most common approaches for GenAI applications today. It improves the accuracy and relevance of Large Language Models (LLMs) by using an additional, specific set of documents / text / data (e.g., a company’s policy manual or website). After the user sends a prompt / query, it is augmented with the relevant portions of this additional data before being sent to the LLM. The full package of the prompt and additional data is sent to the LLM, so the LLM has that extra context when answering the prompt.
It is worth mentioning that lately some are starting to question if RAG is still necessary. The context size (i.e., amount of information you can have an LLM consider/remember) is increasing. Some argue that this means you can just feed large portions of data to the LLM without strategically selecting relevant information. I believe, even in larger contexts, there is value to RAG to help prevent hallucinations.
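As a concrete (if tiny) illustration of the augmentation step, the sketch below ranks chunks by cosine similarity and builds the combined prompt. The vectors are hard-coded stand-ins; in a real system they would come from your embedding model and vector database.

```python
# Toy RAG augmentation: rank chunks by similarity, then build the final prompt.
import numpy as np

chunks = {
    "Refunds are accepted within 30 days of purchase.": np.array([0.9, 0.1, 0.0]),
    "Our headquarters is located in Denver, Colorado.": np.array([0.1, 0.8, 0.2]),
}
query = "How long do I have to return an item?"
query_vec = np.array([0.85, 0.15, 0.05])  # pretend output of the embedding model

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Keep only the most relevant chunk(s) rather than sending everything.
ranked = sorted(chunks, key=lambda text: cosine(chunks[text], query_vec), reverse=True)
context = "\n".join(ranked[:1])

augmented_prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
print(augmented_prompt)  # this combined package is what gets sent to the LLM
```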
Data Layer
Retrieval Augmented Generation (RAG)
The components highlighted in green are responsible for our RAG process, as described in the previous section.
There are many, many approaches to RAG. You can make decisions related to the user input, embedding model, vector database, and overall routing logic that will significantly impact how RAG works in your system.
If you’re looking for a crash course in how to understand what techniques are available, check out Learn RAG From Scratch – Python AI Tutorial from a LangChain Engineer - YouTube for a great start.
5. All while providing the supporting infrastructure to manage, experiment, evaluate, monitor, and improve (especially performance) within the system
We’ve already got our response generated, so what else is there? Well, you need to have some supporting services to make sure everything runs smoothly. Of course, there are the “must-haves” like CI/CD and authentication, but there are also some LLM-specific considerations.
With LLMs, we enter a whole new world of development. No longer are we primarily working with deterministic functions that should produce a “right” value. Instead, we have nondeterministic output that can change every single time. As a result, we need to ensure we thoroughly test possible outcomes through an evaluation layer and add appropriate safeguards through the operational layer.
Evaluation Layer
ML Experimentation and Evaluation
The line between experimentation and evaluation from a product standpoint is rather thin, so we’ll cover both here.
Experimentation is the tooling that allows you to easily test and track new models or innovations. You can use a different model or fine-tune on some new data without interfering with your primary model.
Evaluation is tooling that provides a systematic approach for more broadly testing your successful experiments. You can test to ensure that the appropriate response formats are being used, that you are properly controlling your context usage, and run checks to ensure minimal hallucinations.
Examples: MLflow, Comet.ml, Optimizely, Split.io, Athina, Opik, DeepChecks, Evidently AI, RAGAS, TruLens, Velvet, DeepEval, Guardrails AI, FastChat
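Even without a dedicated platform, you can start with a handful of hand-written checks. The sketch below is a bare-bones evaluation loop; the test cases and checks are illustrative, and answer_fn stands in for your pipeline’s entry point.

```python
# Bare-bones evaluation loop: run known questions through the pipeline and
# apply simple checks (expected keywords, a length guard against rambling).
test_cases = [
    {"query": "What is the refund window?", "must_contain": ["30 days"]},
    {"query": "What are the support hours?", "must_contain": ["9", "5"]},
]

def evaluate(answer_fn) -> float:
    passed = 0
    for case in test_cases:
        response = answer_fn(case["query"])
        keyword_ok = all(kw.lower() in response.lower() for kw in case["must_contain"])
        length_ok = len(response) < 2000
        if keyword_ok and length_ok:
            passed += 1
        else:
            print(f"FAILED: {case['query']!r} -> {response[:80]!r}")
    return passed / len(test_cases)

# Example run with a stub pipeline in place of the real one:
def stub_pipeline(query: str) -> str:
    return "Refunds are available within 30 days; support hours are 9 to 5 ET."

print(f"pass rate: {evaluate(stub_pipeline):.0%}")
```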
Operational Layer
LLM Observability
![[2025-02-09_Part 2 - From Buzz to Building - Introduction to GenAI for Developers - The Technical Stack-1.png | Source: langsmith.com]]
Your LLM observability tooling should integrate well with your approach for LLM orchestration. These tools provide a means of seeing what is going on as your production LLM application executes. Performance can be monitored and you can see the intermediate responses that led to your final result.
Examples: Langfuse, LangSmith, Datadog, Arize AI, Weights and Biases, Helicone
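If you are not ready for a dedicated tool, the core idea is simply recording each step of a request. A minimal hand-rolled trace might look like the sketch below (step names and sleeps are placeholders); hosted tools layer dashboards, token and cost tracking, and sampling on top of exactly this kind of data.

```python
# Minimal hand-rolled tracing: time each step of a request and record the
# intermediate details that led to the final answer.
import json
import time
from contextlib import contextmanager

trace: list[dict] = []

@contextmanager
def span(step: str, **detail):
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.append({"step": step,
                      "seconds": round(time.perf_counter() - start, 4),
                      **detail})

with span("embed_query", query="What is the refund policy?"):
    time.sleep(0.01)  # stand-in for the embedding call
with span("vector_search", top_k=3):
    time.sleep(0.02)  # stand-in for the vector DB query
with span("llm_call", model="placeholder-model"):
    time.sleep(0.05)  # stand-in for the generation call

print(json.dumps(trace, indent=2))  # ship this to your logging/observability backend
```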
Conclusion
As we’ve explored the components of a modern GenAI technical stack, a few key takeaways emerge:
- Modularity Matters: The GenAI stack isn’t monolithic—it’s a collection of specialized components working in harmony. This modularity allows you to start simple and scale up as needed, whether you’re building a basic search function or a complex enterprise solution.
- Data Foundation is Critical: The quality of your GenAI application heavily depends on your data pipeline. From ingestion to embedding, each step in data processing shapes your application’s capabilities and performance.
- Flexibility is Key: With rapid advancement in the field, your technical stack should accommodate change. Whether switching embedding models or upgrading your LLM, a well-designed architecture makes adaptation easier.
- Infrastructure Deserves Attention: While the LLM might be the star, supporting infrastructure—orchestration, monitoring, evaluation—often determines your application’s real-world success.
Remember, you don’t need every component we’ve discussed to build a functional GenAI application. Start with the essentials for your use case, then expand based on your specific needs and challenges. The stack outlined here serves as a reference architecture—adapt it to your requirements, constraints, and goals.
As we move through 2025, expect this stack to evolve. New components will emerge, others will consolidate, and best practices will shift. Stay informed, but don’t let the rapid pace of change paralyze you. The fundamental principles—quality data processing, thoughtful model selection, robust orchestration, and solid infrastructure—will likely remain relevant even as specific tools change.
What we didn’t cover is how to actually pick which technical stack may work best for your use case! That is a conversation for another time, but I’d recommend this post in the meantime: An Expert’s Guide to Picking Your LLM Tech Stack - AIMon Labs. Also, if you’re looking for a resource with a constantly updated list of the most popular products used in LLM development, check out: a16z-infra/llm-app-stack
What’s your next step? Perhaps start small with a basic RAG implementation, or experiment with different embedding models. The GenAI technical stack might seem daunting, but remember: every production system started with a simple proof of concept.
Feel free to reach out with questions or share your experiences building with this stack. The GenAI community grows stronger through shared learning and collaboration.
Areas Not Covered (Non-LLM-Specific Components)
Web Application Layer
Frontend Frameworks
Examples: React, Vue.js, Angular
Backend Frameworks
Examples: Node.js, Express, FastAPI, Flask, Django
Combination Frameworks
Examples: TypeScript, Streamlit, Gradio
Orchestration Layer
LLM Caching
Examples: GPTCache, Redis, SQLite, Memcache
APIs and Plugins
Examples: Wolfram Alpha, Zapier AI Plugin, Instructor (for structured outputs), AI Gateways (e.g., Portkey)
Managed Workflow Systems
Examples: Apache Airflow, Prefect, Dagster
Operational Layer
Logging and Monitoring
Examples: Datadog, New Relic, Prometheus + Grafana, AWS CloudWatch
Error Handling
Examples: Sentry
Authentication
Examples: Stytch, Auth0, Okta, Keycloak
DevOps and Infrastructure
Cloud Computing Provider
Examples: AWS, GCP, Azure
Application Hosting
Examples: Vercel, Modal, Replicate
Compute
Examples: AWS EC2, GCP Compute Engine, Azure VM
Container Orchestration
Examples: Kubernetes, Docker Swarm, AWS ECS
CI/CD
Examples: Jenkins, GitLab CI
#blog-post #technical-deep-dive
Footnotes
1. This will result in us not talking about non-LLM-specific components of the technical stack. The areas skipped are listed above under “Areas Not Covered”, in case you’d like to do some additional research on your own. ↩
2. As models get more advanced and capable of “remembering” larger contexts, there is discussion about whether a larger context is better than RAG, but this is still an area being explored. ↩