Introduction
The rise of enterprise data lakes in the 2010s promised consolidated storage for any data at scale. However, while flexible and scalable, they often resulted in so-called “data swamps”: repositories of inaccessible, unmanaged, or low-quality data with fragmented ownership. This article highlights the shift toward decentralised, domain-aligned structures such as those advocated by Data Mesh, and the emergence of productised data, showing how leading tech companies operationalise analytics at scale.
Shift from Monolithic Storage to Decentralised Data Ownership
The Centralisation Trap
Traditional data lakes and warehouses funnel data ingestion and transformation through a central data team. While efficient at first, this model creates bottlenecks and slows domain-level innovation as the organisation grows.
Enter Data Mesh
Zhamak Dehghani introduced Data Mesh in 2019, advocating:
- Domain-oriented, decentralised ownership
- Treating data as a product
- A self-serve infrastructure platform
- Federated governance
Under Data Mesh, individual domain teams own their datasets end to end, improving quality, reducing bottlenecks, and increasing scalability.
Why Big Tech Embraced It
For companies such as Amazon and Netflix, autonomy at scale, low latency, and accurate production data are paramount, which is why they expose data through service-oriented APIs (e.g. recommendation APIs and microservice pipelines). Data Mesh fits this evolutionary architecture.
Defining and Managing Data Products
What is a Data Product?
A data product is more than just a dataset: it is a self-contained, user-centric asset combining data, metadata, contracts, governance, and API interfaces. It is discoverable, reliable, and maintained by a domain team.
DJ Patil described data products as “facilitating an end goal through the use of data”, a principle refined by Dehghani’s Data Mesh approach.
Core Attributes of Data Products
Per Wikipedia and industry sources, data products should be:
- Discoverable: They can be found in catalogues using rich metadata
- Addressable: They can be accessed using clearly defined versioned endpoints or APIs
- Trustworthy: They deliver correct, high-quality data on time
- Interoperable & Self-describing: They use standardised schemas and established conventions such as the FAIR principles
- Governed: They are backed by data contracts, access restrictions, and SLA monitoring
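As a rough illustration of how these attributes can be made concrete, the sketch below models a catalogue entry for a hypothetical `orders_daily` product in Python; the field names and values are assumptions, not a formal standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataProductDescriptor:
    """Illustrative metadata record a catalogue might hold for one data product."""
    name: str                      # discoverable: searchable product name
    domain: str                    # owning domain team
    endpoint: str                  # addressable: versioned API or table location
    version: str                   # addressable: semantic version of the contract
    schema: Dict[str, str]         # self-describing: column name -> type
    freshness_slo_minutes: int     # trustworthy: maximum acceptable data age
    tags: List[str] = field(default_factory=list)  # discoverable: catalogue tags
    owner_email: str = ""          # governed: accountable contact

# Example entry a domain team might publish to the catalogue
orders = DataProductDescriptor(
    name="orders_daily",
    domain="commerce",
    endpoint="https://data.example.com/commerce/orders/v2",
    version="2.1.0",
    schema={"order_id": "string", "order_ts": "timestamp", "amount": "decimal(10,2)"},
    freshness_slo_minutes=60,
    tags=["orders", "finance", "tier-1"],
    owner_email="commerce-data@example.com",
)
```

In practice, a catalogue would derive most of these fields automatically from pipeline and schema metadata rather than relying on hand-written entries.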
Lifecycle Management
Following Dawiso’s framework, the lifecycle includes:
- Define: Business goals, governance, data contracts.
- Engineer: Pipelines, APIs, data contracts, metadata.
- Test: Validate timeliness, schema, quality.
- Deploy & Maintain: Usage monitoring, SLA monitoring, logs, support.
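A minimal sketch of the Test step, assuming a pandas-based pipeline and pytest as the runner; the sample batch, column names, and freshness limit are placeholders.

```python
import pandas as pd

EXPECTED_COLUMNS = {"order_id", "order_ts", "amount"}
FRESHNESS_LIMIT = pd.Timedelta(hours=1)

def load_latest_batch() -> pd.DataFrame:
    # Stub: a real test would read the most recent batch produced by the pipeline.
    return pd.DataFrame({
        "order_id": ["a1", "a2"],
        "order_ts": [pd.Timestamp.now(), pd.Timestamp.now()],
        "amount": [10.0, 20.5],
    })

def test_schema():
    # Schema check: the batch exposes exactly the contracted columns.
    assert set(load_latest_batch().columns) == EXPECTED_COLUMNS

def test_no_null_keys():
    # Quality check: primary keys are never null.
    assert load_latest_batch()["order_id"].notna().all()

def test_timeliness():
    # Timeliness check: the newest record is within the freshness limit.
    lag = pd.Timestamp.now() - load_latest_batch()["order_ts"].max()
    assert lag <= FRESHNESS_LIMIT
```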
SLAs, SLOs & Contracts
SLAs (Service Level Agreements) and SLOs (Service Level Objectives) are fundamental for data products. SLAs define availability, latency, freshness, failure rates, and remediation strategies.
Data contracts, usually defined in YAML, specify schema constraints, change policies, freshness guarantees, and quality requirements, ultimately catching breaking changes before they impact downstream consumers.
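For illustration, here is what such a contract might look like, parsed with PyYAML in Python; the product name, fields, and thresholds are hypothetical and do not follow any particular contract standard.

```python
import yaml  # PyYAML; assumed available

CONTRACT_YAML = """
product: orders_daily
version: 2.1.0
owner: commerce-data@example.com
schema:
  - {name: order_id, type: string, nullable: false}
  - {name: order_ts, type: timestamp, nullable: false}
  - {name: amount, type: "decimal(10,2)", nullable: false}
freshness:
  max_delay_minutes: 60          # freshness guarantee (SLO)
quality:
  max_null_rate: 0.0             # no nulls allowed in contracted columns
change_policy: breaking-changes-require-major-version
"""

contract = yaml.safe_load(CONTRACT_YAML)
print(contract["product"], contract["version"])
print("Freshness SLO (minutes):", contract["freshness"]["max_delay_minutes"])
print("Contracted columns:", [c["name"] for c in contract["schema"]])
```

Keeping the contract in version control alongside the pipeline lets producers and consumers review schema or freshness changes the same way they review code.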
Discoverability & API Access
Metadata (catalogue descriptions, tags, schemas) is essential for discoverability. APIs serve as access interfaces for analytics, BI tools, microservices, and partner integration. This fosters ease of consumption and reuse.
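A minimal consumption sketch in Python, assuming a hypothetical versioned REST endpoint and a JSON response containing a `records` array; authentication and payload shape will differ per platform.

```python
import requests
import pandas as pd

# Hypothetical versioned endpoint published in the catalogue for the orders product.
ENDPOINT = "https://data.example.com/commerce/orders/v2/records"

resp = requests.get(
    ENDPOINT,
    params={"since": "2024-01-01", "limit": 1000},
    headers={"Authorization": "Bearer <token>"},  # access governed by the owning domain
    timeout=30,
)
resp.raise_for_status()

# The same interface can feed BI tools, notebooks, or other services.
df = pd.DataFrame(resp.json()["records"])
print(df.head())
```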
Toolchains & Platforms for Enablement
Achieving this vision necessitates a toolchain that accelerates development, governance, and operation:
dbt (data build tool)
dbt lets teams develop SQL pipelines as version-controlled transformations, with standardised metrics, documentation, and testing processes across domains. These are necessary for reliable and consistent data products.
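In a CI pipeline, those version-controlled transformations and tests can be exercised with standard dbt CLI commands; the sketch below assumes an already configured dbt project and uses an illustrative model name.

```python
import subprocess

# Each command is a standard dbt CLI call; project and profile setup are assumed.
steps = [
    ["dbt", "deps"],                                 # install package dependencies
    ["dbt", "build", "--select", "orders_daily+"],   # run and test the model and its children
    ["dbt", "docs", "generate"],                     # refresh documentation/metadata artifacts
]

for cmd in steps:
    subprocess.run(cmd, check=True)                  # fail the CI job on any non-zero exit
```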
Apache Iceberg & Delta Lake
For open table formats that support ACID transactions at scale:
- Apache Iceberg allows schema evolution, time travel, partitioning.
- Delta Lake (a comparable format built on Parquet rather than Iceberg) supports ACID ingestion, deduplication, and optimised storage.
Both are reliable warehouse backends for domain-owned tables.
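A brief PySpark sketch of the Iceberg features mentioned above (schema evolution, time travel, partitioning); the catalog name, warehouse path, and table are illustrative, and the Iceberg Spark runtime jar is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Local Hadoop catalog for demonstration; production setups use a shared catalog service.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Domain-owned table with a partition transform on the event timestamp.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.commerce.orders_daily (
        order_id string, order_ts timestamp, amount decimal(10,2)
    ) USING iceberg PARTITIONED BY (days(order_ts))
""")

# Schema evolution without rewriting existing data files.
spark.sql("ALTER TABLE local.commerce.orders_daily ADD COLUMNS (discount decimal(10,2))")

# Write a row so the table has at least one snapshot.
spark.sql("INSERT INTO local.commerce.orders_daily VALUES ('a1', current_timestamp(), 10.00, NULL)")

# Time travel: pick a snapshot from the metadata table and read the table as of it.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM local.commerce.orders_daily.snapshots ORDER BY committed_at"
).first()["snapshot_id"]
spark.sql(
    f"SELECT * FROM local.commerce.orders_daily VERSION AS OF {first_snapshot}"
).show()
```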
Data Catalogues and Metadata
Enterprise catalogues (Atlan, Alation, Collibra) ingest metadata from pipelines, dbt models, Iceberg tables, etc. This enables search, lineage, tagging, and schema documentation, which is crucial for discoverability and compliance.
Data Contract Automation & Governance
Tools such as the Open Data Product Standard (ODPS) and contractual YAML definitions allow domain teams to publish their contract specifications. Contracts are then enforced through CI/CD pipelines and federated governance layers, preserving each domain's agility to operate independently.
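A toy example of the enforcement step: a CI check that compares a proposed schema against the published contract and fails the pipeline on breaking changes. The schemas below are placeholders; a real check would read them from the contract file and the producer's change set.

```python
# Published schema from the contract, and the schema a producer proposes to ship.
published = {"order_id": "string", "order_ts": "timestamp", "amount": "decimal(10,2)"}
proposed  = {"order_id": "string", "order_ts": "timestamp"}  # `amount` was dropped

def breaking_changes(old: dict, new: dict) -> list:
    """List contract-breaking differences: removed columns and changed types."""
    issues = [f"removed column: {c}" for c in old if c not in new]
    issues += [f"type change on {c}: {old[c]} -> {new[c]}"
               for c in old if c in new and old[c] != new[c]]
    return issues

issues = breaking_changes(published, proposed)
if issues:
    raise SystemExit("Contract violation, failing the pipeline:\n" + "\n".join(issues))
```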
Monitoring & Observability
Platform layers must capture pipeline health, SLA metrics, governance logs, lineage, audits, and error notifications if data products are to remain reliable and if federated governance processes are to be effective.
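A minimal freshness-SLO check in Python as an example of such an observability hook; in practice the result would be pushed to a metrics or alerting system rather than logged locally, and the threshold is an assumed value taken from the earlier contract sketch.

```python
import datetime as dt
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders_daily.slo")

FRESHNESS_SLO = dt.timedelta(minutes=60)  # assumed SLO from the product's contract

def check_freshness(last_loaded_at: dt.datetime) -> None:
    """Log whether the latest load meets the freshness SLO."""
    lag = dt.datetime.now(dt.timezone.utc) - last_loaded_at
    if lag > FRESHNESS_SLO:
        log.error("SLO breach: data is %s old (limit %s)", lag, FRESHNESS_SLO)
    else:
        log.info("SLO met: data is %s old", lag)

# Example: the last successful load finished 25 minutes ago.
check_freshness(dt.datetime.now(dt.timezone.utc) - dt.timedelta(minutes=25))
```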
Industry Examples: Amazon, Netflix & Beyond
Netflix
Famous for its microservices and domain-driven ownership in compute, Netflix applies the same architecture to data, creating domain-owned streams, recommendation APIs, and analytics data products with decentralised SLAs and tracking systems.
Amazon
Amazon emphasises single-writer schemas, data-as-a-service APIs like product catalogue, order, and recommendations. Each domain owns contracts, quality, and SLAs – a pure Data Mesh approach.
Emerging Leaders
Organisations like PayPal and Zalando leverage federated learning across domains within the mesh, showing how privacy-safe, cross-domain model development can work.
Why This Matters
Scalability
Data Mesh encourages decentralised ownership of data products, avoiding bottlenecks and promoting parallel development. Domains can build and reuse data products simultaneously, so analytics capacity scales with the number of domains rather than being capped by a central team.
Quality & Trust
By giving domains ownership of quality, schema, and freshness, organisations reduce errors and increase trust in data products across the mesh and the wider organisation.
Agility
Product-based pipelines allow faster iteration and model evolution. Versioned schemas and contracts enable safe downstream changes.
Compliance
Federated governance, supported by SLAs, contracts, metadata, and access control, provides assurance that policies are being followed and makes compliance less of a burden.
Alignment with Modern Architecture
Although decentralised and federated, Data Mesh mirrors modern microservice architectures, integrating data as a service for analytics, operational data, machine learning, and sharing with external partners.
Implementation Blueprint
Below is a condensed implementation framework:
- Assess maturity: Choose domains ready for ownership and assess culture of autonomy.
- Pilot Data Product: Define goals, consumers, format, SLAs, transform pipelines, catalogue integration, API endpoint.
- Build Platform Tools: Provide dbt starter repo, Iceberg templates, catalogue integration, SLA enforcement pipelines.
- Governance & Contracts: Establish contract schemas, SLA metrics, federated policy reviews.
- Roll Out: Expand to more domains, showcase early wins, encourage reuse.
- Measure & Evolve: Monitor adoption, compliance, usage, and continue platform improvements.
Challenges & Mitigations
| Challenge | Mitigation |
|---|---|
| Governance drift | Automated policy checks; federated governance groups |
| Domain inertia | Governance onboarding, education, tooling support |
| Contract versioning | Semantic versioning, downstream compatibility testing |
| Platform fatigue | Continuous enhancement driven by domain feedback |
Conclusion
The shift from monolithic data lakes to domain-specific data products reflects a fundamental evolution in analytics infrastructure. Driven by Data Mesh principles, product-based data structures, complete with SLAs, discoverability, and APIs enable organisations to scale analytics with autonomy, reliability, and velocity. Tech giants like Amazon and Netflix have paved the way; adopting this model is now essential for any data-driven organisation aiming to operationalise analytics at scale.
