From Data Lake to Data Products: Operationalising Analytics at Scale

Introduction

The rise of enterprise data lakes in the 2010s promised consolidated storage for any data at scale. However, while flexible and scalable, they often resulted in so-called "data swamps": repositories of inaccessible, unmanaged, or low-quality data with fragmented ownership. This article traces the shift toward decentralised, domain-aligned structures, as championed by Data Mesh, and the emergence of productised data, showing how leading tech companies operationalise analytics at scale.


Shift from Monolithic Storage to Decentralised Data Ownership

The Centralisation Trap

Traditional data lakes and warehouses funnel data ingestion and transformation through a single central data team. While efficient at first, this model creates bottlenecks and slows domain-level innovation.

Enter Data Mesh

Zhamak Dehghani introduced Data Mesh in 2019, advocating:

  • Domain-oriented, decentralised ownership
  • Treating data as a product
  • A self-serve infrastructure platform
  • Federated governance 

Under Data Mesh, individual domain teams own their datasets end to end, improving quality, reducing bottlenecks, and increasing scalability.

Why Big Tech Embraced It

For companies such as Amazon and Netflix, autonomy at scale is paramount: production systems demand low-latency, highly accurate data, typically exposed as service-oriented APIs (e.g., recommendation APIs and microservice pipelines). Data Mesh fits this evolutionary architecture naturally.


Defining and Managing Data Products

What is a Data Product?

A data product is more than just a dataset: it is a self-contained, user-centric asset combining data, metadata, contracts, governance, and API interfaces. It is discoverable, reliable, and maintained by a domain team.

DJ Patil described data products as "facilitating an end goal through the use of data", a principle later refined by Dehghani's Data Mesh approach.

Core Attributes of Data Products

Per Wikipedia and industry sources, data products should be:

  • Discoverable: They can be found in catalogues using rich metadata
  • Addressable: They can be accessed via clearly defined, versioned endpoints or APIs
  • Trustworthy: They deliver correct, high-quality data, on time
  • Interoperable & Self-describing: They use standardised schemas and common metadata conventions such as the FAIR principles
  • Governed: They are backed by data contracts, access controls, and SLA monitoring
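
To make these attributes concrete, below is a minimal sketch in Python of the descriptor a domain team might publish to a catalogue. The class name, fields, and endpoint URL are illustrative assumptions, not part of any standard.

from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    """Illustrative metadata record a domain team might publish to a catalogue."""
    name: str                   # discoverable: searchable identifier
    owner: str                  # accountable domain team
    endpoint: str               # addressable: versioned API or table URI
    schema_version: str         # interoperable: declared, versioned schema
    freshness_slo_minutes: int  # trustworthy: freshness guarantee
    tags: list[str] = field(default_factory=list)  # discoverable: rich metadata

orders = DataProductDescriptor(
    name="orders.daily_summary",
    owner="orders-domain-team",
    endpoint="https://data.example.com/orders/v2/daily-summary",  # hypothetical URL
    schema_version="2.1.0",
    freshness_slo_minutes=60,
    tags=["orders", "finance", "daily"],
)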

Lifecycle Management

Following Dawiso’s framework, the lifecycle includes:

  • Define: Business goals, governance, data contracts.
  • Engineer: Pipelines, APIs, data contracts, metadata.
  • Test: Validate timeliness, schema, quality.
  • Deploy & Maintain: Usage monitoring, SLA tracking, logs, and support.
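
As an illustration of the Test stage, the sketch below validates schema, quality, and timeliness for a batch before publication. It assumes pandas, and the column names and thresholds are invented for the example.

import pandas as pd
from datetime import datetime, timezone, timedelta

EXPECTED_COLUMNS = {"order_id": "int64", "amount": "float64",
                    "updated_at": "datetime64[ns, UTC]"}
MAX_STALENESS = timedelta(hours=1)  # freshness requirement from the contract

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of violations; an empty list means the batch passes."""
    errors = []
    # Schema check: every contracted column must exist with the right dtype.
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"wrong dtype for {col}: {df[col].dtype}")
    # Quality check: keys must be non-null.
    if "order_id" in df.columns and df["order_id"].isna().any():
        errors.append("null order_id values")
    # Timeliness check: the newest record must be within the freshness window.
    if "updated_at" in df.columns and not df.empty:
        age = datetime.now(timezone.utc) - df["updated_at"].max()
        if age > MAX_STALENESS:
            errors.append(f"stale data: {age} old")
    return errors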

SLAs, SLOs & Contracts

Service-level agreements (SLAs) and service-level objectives (SLOs) are fundamental for data products. SLAs define availability, latency, freshness, failure rates, and remediation strategies.

Data contracts, usually defined in YAML, specify schema restrictions, change policies, freshness guarantees, and quality requirements, catching breaking changes before they impact downstream consumers.
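
As a minimal sketch of such a contract, the YAML below (its shape is an assumption, not a formal standard) is parsed with PyYAML and checked against an observed schema:

import yaml  # pip install pyyaml

CONTRACT_YAML = """
product: orders.daily_summary
version: 2.1.0
freshness_minutes: 60
schema:
  order_id: int
  amount: float
  updated_at: timestamp
"""

def check_schema(actual: dict) -> list[str]:
    """Compare an observed schema against the contract; return violations."""
    contract = yaml.safe_load(CONTRACT_YAML)
    expected = contract["schema"]
    missing = set(expected) - set(actual)
    changed = {c for c in expected if c in actual and actual[c] != expected[c]}
    return ([f"missing: {m}" for m in sorted(missing)] +
            [f"type changed: {c}" for c in sorted(changed)])

# e.g. a producer dropped 'amount' and retyped 'order_id'
print(check_schema({"order_id": "string", "updated_at": "timestamp"}))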

Discoverability & API Access

Metadata (catalogue descriptions, tags, schemas) is essential for discoverability. APIs serve as access interfaces for analytics, BI tools, microservices, and partner integration, fostering ease of consumption and reuse.
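
As a sketch of an addressable access interface, this snippet exposes a data product over HTTP using Flask; the route, versioning scheme, and payload are assumptions for illustration.

from flask import Flask, jsonify  # pip install flask

app = Flask(__name__)

@app.route("/data-products/orders/v2/daily-summary")
def daily_summary():
    # In practice this would query the governed, domain-owned table;
    # a static payload stands in for the real query here.
    return jsonify({
        "product": "orders.daily_summary",
        "schema_version": "2.1.0",
        "rows": [{"order_id": 1, "amount": 42.0}],
    })

if __name__ == "__main__":
    app.run(port=8080)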


Toolchains & Platforms for Enablement

Achieving this vision requires a toolchain that accelerates development, governance, and operation:

dbt (data build tool)

dbt enables teams to develop SQL pipelines as version-controlled transformations, with standardised metrics, documentation, and testing processes across domains. These are necessary for reliable and consistent data products.
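
dbt itself is configured through SQL models and YAML files; as a sketch of how a domain pipeline might gate a release on dbt's tests, the snippet below shells out to the dbt CLI. The project path and model selector are assumptions.

import subprocess
import sys

def run_dbt_checks(project_dir: str = "analytics/orders") -> None:
    """Build and test the domain's dbt models; fail the pipeline on errors."""
    # 'dbt build' runs models, tests, snapshots, and seeds in DAG order;
    # the selector picks a model and everything downstream of it.
    result = subprocess.run(
        ["dbt", "build", "--select", "orders_daily_summary+"],
        cwd=project_dir,  # hypothetical project location
    )
    if result.returncode != 0:
        sys.exit("dbt build failed: data product not released")

if __name__ == "__main__":
    run_dbt_checks()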

Apache Iceberg & Delta Lake

Open table formats provide ACID transactions and scalability:

  • Apache Iceberg supports schema evolution, time travel, and partitioning.
  • Delta Lake (built on Parquet) supports ACID ingestion, deduplication, and optimised storage.

Both are reliable warehouse backends for domain-owned tables.
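
As a sketch of what schema evolution and time travel look like with Iceberg, the PySpark snippet below uses plain Spark SQL. It presumes a Spark session already configured with an Iceberg catalogue, and the catalogue, table, and snapshot ID are illustrative.

from pyspark.sql import SparkSession

# Assumes a session already configured with an Iceberg catalogue named 'demo'.
spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# Schema evolution: add a column without rewriting the table.
spark.sql("ALTER TABLE demo.orders.daily_summary ADD COLUMN discount double")

# Time travel: query the table as of an earlier snapshot (ID is illustrative).
spark.sql("""
    SELECT * FROM demo.orders.daily_summary
    VERSION AS OF 4348014442984337848
""").show()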

Data Catalogues and Metadata

Enterprise catalogues (Atlan, Alation, Collibra) ingest metadata from pipelines, dbt models, Iceberg tables, and more. This enables search, lineage, tagging, and schema documentation, all crucial for discoverability and compliance.

Data Contract Automation & Governance

Standards like the Open Data Product Standard (ODPS) and contractual YAML definitions let domain teams specify their contracts. These contracts are then enforced through CI/CD pipelines and governance layers, preserving each domain's autonomy while keeping policies consistent across the mesh.
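
As a sketch of that enforcement step, the check below flags breaking changes between two contract versions (removed or retyped fields) so CI can block the merge; the schemas mirror the YAML shape assumed earlier.

def breaking_changes(old_schema: dict, new_schema: dict) -> list[str]:
    """Removed or retyped fields break consumers; additions are backwards-compatible."""
    issues = []
    for name, ftype in old_schema.items():
        if name not in new_schema:
            issues.append(f"removed field: {name}")
        elif new_schema[name] != ftype:
            issues.append(f"retyped field: {name} ({ftype} -> {new_schema[name]})")
    return issues

old = {"order_id": "int", "amount": "float"}
new = {"order_id": "string", "amount": "float", "discount": "float"}

problems = breaking_changes(old, new)
if problems:
    # In CI this would block the merge and require a new major version.
    raise SystemExit("breaking contract change: " + "; ".join(problems))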

Monitoring & Observability

To keep products reliable and federated governance effective, platform layers must capture pipeline health, SLA metrics, governance logs, lineage, audit trails, and error notifications.
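
As a sketch of SLA monitoring, the loop below polls a product's last publish time and raises an alert when the freshness SLO is breached; get_last_updated() and alert() are hypothetical hooks into the storage layer and the paging system.

import time
from datetime import datetime, timezone, timedelta

FRESHNESS_SLO = timedelta(minutes=60)

def get_last_updated() -> datetime:
    """Hypothetical hook: read the product's last successful publish time.
    Stubbed here to simulate a 90-minute-old publish."""
    return datetime.now(timezone.utc) - timedelta(minutes=90)

def alert(message: str) -> None:
    """Hypothetical hook: page the owning domain team."""
    print(f"ALERT: {message}")

def monitor(poll_seconds: int = 300) -> None:
    while True:
        lag = datetime.now(timezone.utc) - get_last_updated()
        if lag > FRESHNESS_SLO:
            alert(f"orders.daily_summary is {lag} behind its {FRESHNESS_SLO} SLO")
        time.sleep(poll_seconds)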


Industry Examples: Amazon, Netflix & Beyond

Netflix

Famous for its microservices and domain-driven ownership in compute, Netflix applies the same architecture to data, creating domain-owned streams, recommendation APIs, and analytics data products, each with decentralised SLAs and tracking systems.

Amazon

Amazon emphasises single-writer schemas and data-as-a-service APIs for product catalogue, orders, and recommendations. Each domain owns its contracts, quality, and SLAs: a pure Data Mesh approach.

Emerging Leaders

Organisations like PayPal and Zalando leverage federated learning across domains within the mesh, showing how privacy-safe, cross-domain model development can work.


Why This Matters

Scalability

Data Mesh encourages decentralised ownership of data products, avoiding bottlenecks and promoting parallel development. Each domain can build and reuse data products simultaneously, so delivery capacity scales with the number of domains rather than being capped by a central team.

Quality & Trust

By giving domains ownership of quality, schema, and freshness, organisations reduce errors and increase trust in data products across the mesh and the wider organisation.

Agility

Product-based pipelines allow faster iteration and model evolution. Versioned schemas and contracts enable safe downstream changes.

Compliance

Federated governance, supported by SLAs, contracts, metadata, and access control, provides assurance that policies are being followed and makes compliance less of a burden.

Alignment with Modern Architecture

Although decentralised and federated, Data Mesh mirrors modern microservice architectures, integrating data services for analytics, operational data, machine learning, and sharing with external partners.


Implementation Blueprint

Below is a condensed implementation framework:

  • Assess maturity: Choose domains ready for ownership and gauge the culture of autonomy.
  • Pilot Data Product: Define goals, consumers, format, SLAs, transformation pipelines, catalogue integration, and an API endpoint.
  • Build Platform Tools: Provide dbt starter repo, Iceberg templates, catalogue integration, SLA enforcement pipelines.
  • Governance & Contracts: Establish contract schemas, SLA metrics, federated policy reviews.
  • Roll Out: Expand to more domains, showcase early wins, encourage reuse.
  • Measure & Evolve: Monitor adoption, compliance, usage, and continue platform improvements.


Challenges & Mitigations

Common challenges and their mitigations:

  • Governance drift: automated policy checks; federated governance groups
  • Domain inertia: governance onboarding, education, tooling support
  • Contract versioning: semantic versioning; downstream compatibility testing
  • Platform fatigue: continuous enhancement driven by domain feedback

Conclusion

The shift from monolithic data lakes to domain-specific data products reflects a fundamental evolution in analytics infrastructure. Driven by Data Mesh principles, productised data, complete with SLAs, discoverability, and APIs, enables organisations to scale analytics with autonomy, reliability, and velocity. Tech giants like Amazon and Netflix have paved the way; adopting this model is now essential for any data-driven organisation aiming to operationalise analytics at scale.
