Note: this is the third of a four-part series on self-service analytics. For the other parts, please check out my Substack. The other articles provide the context needed to make appropriate decisions about the technical stack we’ll discuss!

So, you want to democratize data through a federated approach that gives a 360-degree holistic view of the data? Well, all you need is a cloud-native solution using an AI-powered data platform with no-code/low-code features and a supporting semantic layer! Blah. The use of obtuse language for marketecture is my favorite. The buzzwords surrounding a new technical paradigm make it difficult to determine what you actually need to build, and self-service analytics is no different.

In this post, we’ll sift through the buzzwords in each layer and discuss what actually needs to be built to find your self-service analytics “sweet spot”.

The Self-Service Analytics Reference Architecture

[Figure: The Self-Service Analytics Reference Architecture]

Once you understand your enterprise’s realistic vision for self-service analytics, you can begin defining your technical stack. I developed the reference architecture above to illustrate all the key pieces and differentiate between physical components you’ll actually build and the logical layers/groupings that often become buzzwords. The latter are typically just collections of pre-existing physical components bundled together in a product. While this bundling makes implementation and operation easier, it can obscure what you’re actually purchasing and implementing.

Additionally, the reference architecture highlights the components crucial to self-service analytics. We’ll cover each layer and most components1 within the architecture throughout this post, but only dive deep on the components in blue. The goal is to provide a clear understanding of each component’s role in a self-service analytics architecture. However, you’ll need to conduct deeper research to determine which specific products meet your requirements.

Self-service sweet spot alert!

Watch for these boxes when we identify potential self-service sweet spots. Remember, just because something doesn’t fit the ideal description of ‘self-service’ doesn’t mean it can’t help enable your people to serve themselves. That’s ultimately the whole point!

Data Sources

Data sources feed your analytical data architecture. An enterprise’s technical landscape typically contains many systems that support transactional processing, real-time events, business operations (like payroll and finance), and other data-producing activities.

Examples: OLTP Databases via Change Data Capture (CDC), Enterprise Resource Planning (ERP) Systems, Event Aggregators, Logs, 3rd Party APIs, File Storage, Object Storage

Data Ingestion

The data from your sources must be moved into your analytical data architecture. Since source data often resides in business-critical systems that act as sources of truth, we can’t operate directly on it without risking performance issues. Instead, we replicate the data into our analytical system.

Data Replication for Analytical Architectures

The heading is a bit verbose to avoid confusion with database replication. Here, we’re replicating data outside a source system so it can be processed without interfering with business-critical applications. It’s essentially just the E and L from extract, load, transform (ELT), where we postpone the T to preserve a version of the “raw” data and handle transformation further down in the architecture after storing the data.

Examples: Fivetran, Stitch, Airbyte
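
To make the E and L concrete, here’s a minimal sketch in Python—using SQLite as a stand-in for a production source system, with a hypothetical orders table and landing path—that extracts raw rows and lands them untransformed:

```python
import sqlite3
import pandas as pd

# Hypothetical source system; in production this is your OLTP database,
# reached via a proper driver or fed by a CDC stream.
source = sqlite3.connect("source_system.db")

# Extract: pull the rows exactly as they exist in the source.
df = pd.read_sql("SELECT * FROM orders", source)

# Load: land the untouched "raw" data in analytical storage. In practice
# this path points at object storage (e.g., an s3:// URI) rather than
# local disk, and the T in ELT happens later, downstream.
df.to_parquet("raw/orders.parquet")
```

Tools like Fivetran and Airbyte productize exactly this loop—plus connectors, scheduling, and schema-drift handling—so you rarely hand-roll it at scale.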

Data Stream

A data stream is a real-time, continuous flow of data that can be captured, processed, and analyzed as it’s generated. These specialized pieces of software handle use cases where real-time data is needed, such as event sourcing systems.

Examples: Confluent, Apache Kafka, AWS Kinesis
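
As a rough sketch of the producing side—assuming a local broker and a hypothetical checkout-events topic—publishing an event with the confluent-kafka client looks like this:

```python
from confluent_kafka import Producer

# Hypothetical broker address; point this at your Kafka/Confluent cluster.
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Publish each business event as it happens; downstream consumers can
# capture and process it in near real time.
producer.produce("checkout-events", key="order-123", value=b'{"total": 42.50}')
producer.flush()  # block until delivery is confirmed
```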

Orchestration

Managed Workflow Systems

By now, you can tell that our data’s journey across this architecture is complex. We’re just two layers in and already dealing with multiple data sources and both batch and real-time data ingestion.

A workflow manager is a tool that orchestrates, monitors, and manages complex workflows in an analytical data architecture. It automates the scheduling and execution of data pipelines, ensuring tasks complete in the correct order, dependencies are respected, and errors are handled efficiently.

Think of it as the puppeteer at the top of our architecture, directing our Data-occhio as it dances along.

Examples: Airflow, Prefect, Dagster
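
Here’s a minimal sketch using Airflow’s TaskFlow API (the task bodies are hypothetical stand-ins); the workflow manager handles the scheduling, ordering, and error handling:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def nightly_ingest():
    @task
    def extract() -> list:
        # Stand-in for pulling from a source system
        return [{"order_id": 1, "total": 42.50}]

    @task
    def load(rows: list) -> None:
        # Stand-in for writing to analytical storage
        print(f"loading {len(rows)} rows")

    load(extract())  # Airflow infers the dependency: extract -> load

nightly_ingest()
```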

Data Storage and Processing / Query Engines

This section covers where data is stored and how it’s processed—yes, it’s as important as it sounds. This foundation defines every subsequent aspect of the data architecture and impacts what features you can achieve with your self-service analytics.

In an analytical architecture, you likely won’t build data storage and processing/query engines from scratch. Instead, you’ll use products that bundle the physical components in these layers to abstract away complexity and (sometimes) let external vendors handle most infrastructure management while you focus on your data and logic.

Regardless of the product you choose, the underlying physical components remain the same as depicted, though specific implementations may have their own pros and cons to consider (e.g., different query engines may work better on different workloads, different open file or table formats have their own tradeoffs, etc.). Many of these products can be decoupled, allowing you to combine different aspects to achieve your desired functionality. For example, using Databricks with S3 for object storage doesn’t mean you must use their query engine (Databricks Photon).

Let’s examine these logical groupings that span multiple architecture layers and understand their underlying physical components. We need a closer look:

[Figure: From Buzz to Building - Self-Service Analytics Tech Stack (storage and processing product groupings)]

Data Warehouses

A data warehouse2 serves as a centralized repository for storing data for analytical queries. It contains data from various disparate sources in a format ready for analytics.

Think of the storage as a database designed specifically for analytical purposes, like a columnar database with a star schema. The data must be transformed to properly adhere to the selected schema. As a result, data warehouses offer high data integrity but require significant maintenance effort and don’t handle unstructured data well.

For processing/queries, the underlying engines vary between products. Many data warehouses use specialized Massively Parallel Processing (MPP) engines to improve performance by splitting queries among many machines. While this enhances processing speed, data warehouses can still grow to a point where processing lags behind business needs.

When is a data warehouse in your self-service sweet spot?

“Woah - a data warehouse can’t be self-service analytics!”, says Don Quixote. Ignore him. If enabling business owners to query a data warehouse alleviates your IT team’s burden and speeds up business insights, you’ve hit a self-service sweet spot!

Already have that capability but need more business-friendly attributes and better performance? Creating a data mart in your data warehouse might suffice. A data mart can act as a simple semantic layer over your data and improve performance by including only domain-specific data.
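
As a sketch of the idea—using DuckDB as a stand-in for your warehouse, with hypothetical star-schema table and column names—a data mart can be as simple as a schema of curated views:

```python
import duckdb

# DuckDB stands in here for a real warehouse (Redshift, BigQuery);
# all table and column names are hypothetical.
con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE SCHEMA IF NOT EXISTS sales_mart")
con.execute("""
    CREATE OR REPLACE VIEW sales_mart.weekly_sales AS
    SELECT d.week_start,
           p.product_name,
           SUM(f.sale_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.week_start, p.product_name
""")
```

Business users query sales_mart.weekly_sales directly, without ever touching the fact and dimension tables underneath.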

These approaches might work for your enterprise if you found the following answers when applying your people, process, & technology (PPT) mindset in part 2 of the article series. In this case, you’re looking at less of a technology investment and more of a people investment to achieve self-service analytics.

People:

  • There isn’t a strong appetite for shifting the current data management paradigm within the enterprise, and business organizations aren’t highly engaged
  • Your business people are willing to learn technical skills and take SQL classes
  • Your small data team stays busy maintaining the current data warehouse

Process:

  • The business’s analytical needs aren’t highly complex at present
  • The data environment isn’t complex; the enterprise focuses on 1-2 domains of data that can be easily managed in a data warehouse

Examples: Amazon Redshift, Google BigQuery

Data Lakehouse

A data lakehouse addresses the scaling and flexibility limitations of data warehouses. For data storage, rigid schemas are replaced with Data Lakes, an approach for storing and maintaining enterprise data. It uses object storage (AWS S3, GCS, ABS, HDFS) to store heterogeneous data (data of different types) in a centralized repository. This allows for cheaper storage, since object storage costs less than other types, and it decouples storage and compute so they can scale independently. The data typically uses columnar open file formats such as Apache Parquet and ORC (the row-based Apache Avro is also common).

While a data lake works great for storage, it leaves a gap between data stored as raw files and data usable for analytics. How do you track which data exists in which file, and how it’s related, without a prescribed schema dictating where the data is stored? This is where the open table format comes into play!

An open table format captures the organized nature of objects within object storage. More specifically, it uses metadata to describe which files belong to the same “table,” which files are in the same “partition” of a table, and many other data characteristics. Different formats capture different information, enabling advanced features like “time travel” and schema management. The result is a “thin” interface sitting atop unstructured storage that creates the facade of structure for applications using the open table format. This illusion of structure, and the abstraction of differing characteristics across heterogeneous data types, is often called data virtualization (more on this later).

With the open table format bridging the gap between underlying physical files and the query/processing engines (distributed SQL engines like Trino, Apache Hive, Apache Drill, and Databricks Photon, and Spark Platforms like Apache Spark and Amazon EMR), we have a data lakehouse.
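
For a feel of what this looks like in practice, here’s a minimal PySpark + Delta Lake sketch (the bucket paths are hypothetical); the open table format’s metadata is what makes versioned, table-like reads possible over plain object storage:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lakehouse-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Land heterogeneous raw data as a Delta table in object storage.
events = spark.read.json("s3://raw-bucket/events/")
events.write.format("delta").mode("append").save("s3://lake/events")

# "Time travel": the table format's metadata lets us read an older version.
events_v0 = (spark.read.format("delta")
             .option("versionAsOf", 0)
             .load("s3://lake/events"))
```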

Self-service Analytics Performance Impacts

The relationship between the underlying data schema, open file format, open table format, and semantic layer (via data virtualization) determines your analytical queries’ performance. This is crucial when selecting your final product set. For more details, visit: Understanding Performance in a Data Lakehouse Architecture.

Today, most data lakehouse products have evolved into full data platforms (e.g., Databricks, Snowflake, Dremio, Starburst), offering extensive functionality across the stack. From early on, these products have recommended using a medallion (or multi-hop) architecture for storing and processing data. This creates different “layers” of data that become increasingly curated and cleaned.
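
Continuing the PySpark sketch above (paths, columns, and cleaning rules are hypothetical), a medallion flow moves data through progressively cleaner layers:

```python
# Bronze: raw, as-ingested data.
bronze = spark.read.format("delta").load("s3://lake/bronze/orders")

# Silver: de-duplicated, validated, conformed records.
silver = bronze.dropDuplicates(["order_id"]).filter("order_total IS NOT NULL")
silver.write.format("delta").mode("overwrite").save("s3://lake/silver/orders")

# Gold: business-level aggregates, ready for reporting.
gold = silver.groupBy("region").sum("order_total")
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/sales_by_region")
```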

These cleaned layers of data provide a basic version of our next architectural layer—the semantic layer.

When is a data lakehouse your self-service sweet spot?

If your enterprise handles large quantities of data from many different sources and wants to centrally implement and manage analytics, a data lakehouse might be right for you (especially if you haven’t already invested in proprietary data warehouses). Once you make this leap, you’ve completed much of the heavy technology investment needed for self-service analytics.

A data lakehouse already includes the beginnings of a semantic layer if you use a Medallion (or Multi-Hop) Architecture. The real decision at this point is how much your enterprise wants to formalize its semantic layer development, so keep reading.

Semantic Layer

The open table format provides the technology for a unified data view. However, two issues remain when using this directly for self-service:

  • Technical Implementation Details - The format (schema, attribute names) likely doesn’t align with what business users need for self-service analytics
  • Integration with the Enterprise - The metadata about the data remains siloed inside the data lakehouse

The semantic layer addresses these challenges by abstracting away technical details and integrating with key enterprise data management components.

Personally, I think “semantic layer” is a buzzword that markets well but creates confusion when you want to understand what’s actually being built. A semantic layer is not a new piece of technology. You don’t concretely “build” a semantic layer the same way you build a database or write application code.

So what is it? I’d define it as a “methodology for defining the interface between data and its consumers based on language and logic.” More concretely, it’s a way of thinking about your data that prioritizes formatting it for effective direct user interaction. Several capabilities can achieve this result. The combination of these capabilities forms your “semantic layer”.

Don’t let this explanation (and minor criticism of the term) undersell the semantic layer’s value. This paradigm of thinking about data is extremely valuable and may be the right path for maximizing your data’s value. Ultimately, your semantic layer’s strength is what creates your technical strength in offering self-service analytics.

What capabilities should you consider? Your semantic layer investment should directly reflect your enterprise’s commitment to the self-service analytics journey. This was the primary topic in parts 1 and 2 of this series, but it bears repeating: don’t heavily invest in a semantic layer without considering the supporting people and processes.

[Figure: From Buzz to Building - Self-Service Analytics Tech Stack (the semantic layer capability hierarchy)]

I like to picture the potential capabilities in a manner similar to Maslow’s Hierarchy of Needs. Don’t consider higher capabilities until you satisfy the more basic ones. Each tier represents a “self-service sweet spot” based on your enterprise’s investment and alignment. Let’s examine each tier’s capabilities and the technical components that implement them.

Tier 1 Semantic Layer Capabilities

You’re probably already implementing some of these within your enterprise, though you may not have formalized them as a “semantic layer.”

[Figure: Tier 1 semantic layer capabilities]

Application APIs and Caching

Your semantic layer should consider both humans and systems interacting with your data. If you expose APIs, you’ve already thought about designing data to meet consumer needs. As you build on this concept in your new “semantic” mindset, extend this thinking to human data consumers.

Also, always consider caching in the layer that interfaces with users to improve performance.

Examples: Application APIs (AWS API Gateway, Apigee, Kong, Tyk), caching (Redis, memcached)
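
A common pattern here is cache-aside: check the cache before hitting the query engine at all. A minimal sketch with Redis (run_query is a hypothetical call into your engine):

```python
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)  # hypothetical cache instance

def cached_query(sql: str, ttl_seconds: int = 300):
    """Cache-aside: serve repeated dashboard queries from the cache."""
    key = "query:" + hashlib.sha256(sql.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)   # cache hit: skip the query engine entirely
    rows = run_query(sql)        # hypothetical call into the query engine
    r.setex(key, ttl_seconds, json.dumps(rows))
    return rows
```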

Universal Data Virtualization

A semantic layer’s fundamental purpose is abstracting technical details when working with data. Data virtualization helps achieve this by presenting a unified view where end users don’t need to know about different formats (e.g., csv vs. json vs. parquet) or schemas.

In a data lakehouse, open table formats provide this capability. These formats capture how files/objects are organized within object storage, using metadata to describe file relationships, partitions, and other characteristics. The result is a unified view of any file type in object storage.

I use “universal” to emphasize the semantic layer’s centralized nature. Modern semantic layers ensure analytical data remains consistent across the enterprise.

Examples: Apache Iceberg, Delta Lake, Apache Hudi

Business Semantics

This is just a fancy way of saying “using terms that align with the business, rather than the technical implementation”. For example, your source data model might have a field called customer_id because it’s the primary key in the database, and data modelers typically use id for such fields. However, in reports, it’s called “Membership Number” because that’s what the business has always used. This is business semantics in action.

You’re likely already doing this. If you have a data warehouse, your final reports or data marts use business-relevant language. With a data lakehouse using a medallion (or multi-hop) architecture, field names update in later layers to align with business terminology. Even with a standard database replica, you probably have views acting as your miniature semantic layer!

As you shift to considering this part of a “semantic layer,” you’ll become more intentional with terminology changes and aim to centralize naming conventions so all data aligns with standardized business semantics.
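
In practice, this can be as small as a view that renames technical fields into business terms—a sketch with DuckDB and hypothetical table names:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")  # hypothetical warehouse stand-in
con.execute("""
    CREATE OR REPLACE VIEW customer AS
    SELECT customer_id AS membership_number,  -- the term the business actually uses
           signup_date AS member_since
    FROM raw_customer
""")
```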

When are tier 1 semantic layer capabilities your self-service sweet spot?

If the business isn’t willing to engage at any level, then this is where I suggest you focus. Make the most of your current architecture by centralizing and exposing your data with a virtualization layer. Be intentional about the names in your views, and interested business parties can dabble in the world of writing queries.

People:

  • The business teams are not meaningfully interested in serving their own queries.

Process:

  • Only IT processes are candidates to be updated. Take full advantage of what you can control.

Tier 2 Semantic Layer Capabilities

You might think of tier 1 capabilities as “common sense” rather than a semantic layer. I highlight them to demonstrate the fundamental ideas behind this layer. In tier 2, we consider capabilities that make data more “usable” for self-service analytics.

[Figure: Tier 2 semantic layer capabilities]

Business Abstractions with Pre-computed Aggregates

Or as technologists have called it for decades - data modeling! Similar to how business semantics updates names, business abstractions update data structure into something familiar to the business.

For example, source data containing customer purchases might feed both a “Weekly Sales Report” for the Sales Department and a “Low Inventory Report” for the Logistics Department. These are different ways of looking at the same data that make sense to different departments. The data is abstracted into the relevant business context. Each situation requires different reference data, aggregations, and dimensions.

Self-service Analytics Performance Impacts

In a well-designed semantic layer, you determine which business abstractions are worth pre-aggregating regularly (materialized view) vs. computing ad-hoc (virtual view). The choice will have major performance implications, which are covered here: Understanding Performance in a Data Lakehouse Architecture.

These concepts likely exist in your current architecture through data marts, views, and data layers—anywhere data modeling occurs. As you shift to the semantic layer paradigm, consider not just business abstractions that fit certain reports, but designs that enable users to build their own reports.
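
To illustrate the materialized-vs-virtual choice (DuckDB again, hypothetical tables): the view recomputes its abstraction on every query, while the pre-computed table trades freshness for much faster reads:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Virtual view: the "Low Inventory Report" abstraction is computed
# fresh on every query.
con.execute("""
    CREATE OR REPLACE VIEW low_inventory AS
    SELECT product_name, units_on_hand
    FROM inventory
    WHERE units_on_hand < reorder_threshold
""")

# Pre-computed aggregate: the "Weekly Sales Report" is materialized as
# a table on a schedule, trading freshness and storage for read speed.
con.execute("""
    CREATE OR REPLACE TABLE weekly_sales_agg AS
    SELECT week_start, SUM(sale_amount) AS total_sales
    FROM fact_sales
    GROUP BY week_start
""")
```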

Examples: Any tools where you handle data transformations and modeling, such as Data Warehouses and Data Lakehouses. Specialized tools include dbt and LookML.

Natural Language Queries

This emerging area shows promise for early adoption in the self-service analytics journey. Recent GenAI innovations have enabled data platforms, BI tools, and specialized tools to accept queries written in English rather than SQL (e.g., simply typing “what was the average cost of items sold in the fishing department last month?”).

For enterprises reluctant to invest in technical training for business departments, natural language queries offer a solid middle ground to leverage centralized data and business abstractions.

Examples: Data Platforms (Dremio), BI Tools (Thoughtspot)

When are tier 2 semantic layer capabilities your self-service sweet spot?

If business departments are ready to “put their money where their mouth is”, you can use those funds to start exploring these areas. We’re assuming there isn’t full investment at this point, so prioritize the capabilities that can deliver quick self-service analytics wins.

People:

  • The business is “analytics-curious” and willing to dedicate some time and resources…

Process:

  • …but they are not ready for major organizational changes or in-depth training.

Tier 3 Semantic Layer Capabilities

At this level, your enterprise must be willing to restructure its operating model to fully leverage the semantic layer investment. You should have centralized data teams, data stewards within business departments, and eager team members excited to define metrics and create reports! Tier 3 capabilities capitalize on your semantic layer’s centralized nature.

[Figure: Tier 3 semantic layer capabilities]

Centralized Data Governance, Data Catalog, Entitlements, etc.

As a semantic layer becomes formalized and grows richer in metadata, it needs to integrate with your enterprise’s other foundational data capabilities. You don’t want to maintain metadata in multiple locations.

The communication flows both ways. The semantic layer should share with your other tools and vice versa. For example, when creating a new field in the semantic layer, your data catalog tool should discover it for users to find. Conversely, new enterprise standards introduced in the data governance tool should cascade into the semantic layer.

While we won’t detail each area here, I highly recommend understanding how these components integrate with your semantic layer. Centralizing these capabilities is one of a semantic layer’s biggest advantages, especially when managing entitlements and access across many departments and tools.

Examples: Data Discovery (e.g., AWS Glue, Amundsen, DataHub), Data Catalog (e.g., Informatica EDC, Databricks Unity Catalog), Data Governance (e.g., Collibra), Data Observability (e.g., Monte Carlo, Bigeye), Entitlement and Security (e.g., Immuta)

Metrics Store and Metric Catalog

A metrics store offers a novel solution made possible by data centralization. Many organizations struggle with both defining consistent metrics and calculating identical results—even when definitions match! The metrics store enables creating agreed-upon definitions that all run against the same dataset. When two departments need the “average number of users in Norway from 12PM to 1PM,” they get the same result.

Beyond metrics, dimensions can be standardized too. Everyone uses the same date ranges for “Quarter 1,” and weeks can consistently start Monday and end Sunday. Both aspects require careful design and close collaboration with business teams. Any inaccuracies can erode trust in the metrics store.

Examples: Supergrain, DataJunction
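
Metrics stores differ in their definition languages, but the core idea is a single shared definition compiled into queries against a single dataset. A toy sketch, with all names hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    expression: str  # the one agreed-upon aggregation
    table: str       # the one dataset everyone computes against

AVG_USERS = Metric(name="avg_users",
                   expression="AVG(active_users)",
                   table="gold.hourly_active_users")

def compile_metric(metric: Metric, where_clause: str) -> str:
    """Every department's request compiles to SQL against the same source."""
    return (f"SELECT {metric.expression} AS {metric.name} "
            f"FROM {metric.table} WHERE {where_clause}")

# Two departments asking the same question get the same query—and result.
print(compile_metric(AVG_USERS, "country = 'NO' AND hour BETWEEN 12 AND 13"))
```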

When are tier 3 semantic layer capabilities your self-service sweet spot?

Congratulations! You’re part of an enterprise that truly wants to change its data culture.

People:

  • Business departments want to be a part of the data journey. They designate data stewards so they have a seat at the table for all things data.
  • Their other team members are trained on the tools within the semantic layer so they can use them to their full potential.

Process:

  • The centralized data team has the resources and commitment to integrate into enterprise-level technologies.

Tier 4 Semantic Layer Capabilities

Ah, you seek semantic layer nirvana? To be honest, tier 4 is less of a capability and more of a result of the full commitment to being a data-driven organization. Data is no longer viewed as the fuel to make your products go, it is the product itself.

[Figure: Tier 4 semantic layer capabilities]

Data Products

In enterprises viewing data-as-a-service, semantic layer outputs become Data Products. Datasets receive the same careful design and curation as customer-facing products, complete with dedicated teams and product owners.

Creating Data Products this way represents the ultimate goal for those heavily invested in self-service analytics. It requires all previous semantic layer capabilities to succeed.

When are tier 4 semantic layer capabilities your self-service sweet spot?

At this pinnacle, the mindset of data as a product is ingrained from leadership down to all enterprise members. You maintain long-term commitment and readily adopt new capabilities to enhance data-driven operations.

People:

  • Business departments drive their own processes through data and have a data-first mentality.
  • There are dedicated product owners for data products.

Process:

  • All of the infrastructure that typically surrounds a customer product is applied to data products, from requirements to sprint planning to releases.

Up to now, we’ve only defined the semantic layer’s components… next comes actual design! Will you choose an Analytics-based, Warehouse-based, or Pipeline-based architecture? What supporting schema will you use? Which views should be materialized? These are conversations for another time.

The journey to a fully robust semantic layer can take multiple years for large organizations—hence my repeated emphasis on buy-in. But that doesn’t mean you can’t reach your self-service sweet spot with a more basic implementation!

Analysis / Output

We’re here for self-service analytics, so this layer clearly matters—it’s how users serve themselves. Every previous topic laid the foundation for success here. Yet your entire solution will ultimately be judged by this layer.

While data remains centralized, the interface doesn’t need to be. Based on your users and use cases, you’ll need to determine which tools meet your needs. Regardless, ensure the tool integrates well with your semantic layer. Avoid the anti-pattern where each tool maintains its own version of a semantic layer.

Generally, end users judge this layer entirely on usability. While many design considerations fall outside this technical stack discussion, I’ll emphasize one point: ensure the right level of abstraction addresses your users’ pain points. Are your data scientists spending too much time configuring environments instead of applying machine learning? Do your business analysts have trouble doing complicated joins? Ease those burdens.

[Figure: Analysis/output tools in the self-service stack]

From a technical perspective, consider the inbound query protocols (e.g., SQL, MDX, DAX, Python, and RESTful) and standard protocols (ODBC, JDBC, HTTP(S) and XMLA) that must interface with your semantic layer.
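
For instance, a Python client (or a BI tool) might reach the semantic layer’s SQL endpoint over ODBC—a sketch with pyodbc and a hypothetical DSN:

```python
import pyodbc

# Hypothetical DSN configured to point at the semantic layer's endpoint;
# BI tools typically connect to the same endpoint over ODBC/JDBC.
conn = pyodbc.connect("DSN=semantic_layer;UID=analyst;PWD=...")
cursor = conn.cursor()
cursor.execute("SELECT membership_number FROM customer LIMIT 10")
for row in cursor.fetchall():
    print(row)
```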

Some of the potential options are shown in the figure above.

Closing Thoughts

If you started here, read the earlier posts! As a technologist, you might be tempted to skip the people and process considerations, but I promise it’s worth your time.

If you take one thing away from this post, I hope it’s the understanding that you get to define what successful self-service analytics looks like for your enterprise—your sweet spot. While it’s great to aim high with capabilities (and I hope you’re in an enterprise that enables you to do so), remember: the best product is one that gets used.

#blog-post #technical-deep-dive

Footnotes

  1. Here are the components I didn’t discuss with some links in case you’d like to dig further:

  2. Many consider a Data Warehouse a logical layer too, but I think warehouses and their supporting products are so tightly coupled that it isn’t worth delineating (or discussing) the distinction here.