Polyglot Persistence

Polyglot Persistence #

Polyglot Persistence

Unbind your data. Use multiple specialized databases.

Known as: Polyglot Persistence.

Aspects: those of the databases involved.

Variants:

Independent storage:

  • Specialized Databases,
  • Private and Shared Databases,
  • Data File / Content Delivery Network (CDN).

Derived storage:

Structure: A layer of data services used by higher-level components.

Type: Extension, derived from Shared Repository.

Benefits Drawbacks
Performance is fine-tuned for various data types and use cases The peculiarities of each database need to be learned
Less load on each database Muсh more work for the DevOps team
The databases may satisfy conflicting forces More points of failure in the system
Consistency is hard or slow to achieve

References: The original and closely related CQRS articles from Martin Fowler, chapter 7 of [MP], chapter 11 of [DDIA] and much information dispersed all over the Web.

You can choose a dedicated technology for each kind of data or pattern of data access in your system. That improves performance (as each database engine is optimized for a few use cases), distributes load between the databases, and may solve conflicts between forces (like when you need both low latency and large storage). However, you’ll likely have to hire several experts to get the best use of and to support the multiple databases. Moreover, having your data spread over multiple databases makes it the application’s responsibility to keep the data in sync (by implementing some kind of distributed transactions or making sure that the clients don’t get stale data).

Performance #

Polyglot Persistence aims at improving performance through the following means:

Read-write separation introduces a replication lag which is a pain when data consistency is important for the system’s clients.

Dependencies #

In general, each service depends on all of the databases which it uses. There may also be an additional dependency between the databases if they share a dataset (one or more databases are derived).

PolyglotPersistence

Applicability #

Polyglot Persistence helps:

  • High load and low latency projects. Specialized Databases shine when given fitting tasks. Caching and Read-Only Replicas take the load off the main database. External Search Indices save the day.
  • Event sourcing. Materialized Views maintain the current states of the system’s components.
  • Conflicting forces. An instance of a stateless service inherits many of the qualities of the database which it accesses for any given request it is processing. When there are several databases, the qualities of a service instance may vary from request to request,depending on which database is involved.

Polyglot Persistence may harm:

  • Small projects. Properly setting up and maintaining multiple databases is not that easy.
  • High availability. Each database which your system uses will tend to fail in its own crazy way.
  • User experience. For systems with read-write database separation the replication lag between the databases will make you choose between reading changes from the leader (write database), adding synchronization code to your application to wait for the read database to be updated, and risking returning outdated results to the users.

Relations #

Polyglot Persistence

Polyglot Persistence:

Variants with independent storage #

Many cases of Polyglot Persistence use multiple datastores just because there is no single technology that matches all the application’s needs. The databases used are filled with different subsets of the system’s data:

Specialized Databases #

PP - Specialized

Databases vary in their optimal use cases. You can employ several different databases to achieve the best performance for each kind of data that you persist.

Private and Shared Databases #

PP - Private and Shared

If several services or shards become coupled through a subset of the system’s data, that subset can be put into a separate database which is accessible to all the participants. All the other data remains private to the shards or services.

Data File, Content Delivery Network (CDN) #

PP - File Storage

Some data is happy to stay in files. Web frameworks load web page templates from OS files and store images and videos in a Content Delivery Network (CDN) which replicates the data all over the world so that each user downloads the content from the nearest server (which is faster and cheaper).

Variants with derived storage #

In other cases there is a single writable database (system of record [DDIA]) which is the main source of truth from which the other databases are derived. The primary reason to use several databases is to relieve the main database of read requests and maybe support some additional qualities: special kinds of queries, aggregation for materialized and CQRS views, full text search for text indices, huge dataset size for historical data or low latency for an in-memory cache.

The updates to the derived databases may come from:

  • the main database as Change Data Capture (CDC) [DDIA] (a log of changes),
  • the application after it changes the main database (see caching strategies below),
  • another service as event stream [DDIA, MP],
  • a dedicated indexer that periodically crawls the main database or web site.

PP - Derived Storage

Read-Only Replica #

Read-only Replica

Multiple instances of the database are deployed and one of them is the leader [DDIA] instance which processes all writes to the system’s data. The changes are then replicated to the other instances (via Change Data Capture (CDC)) which are used for read requests. Distributing workload over multiple instances increases maximum read throughput which the system is capable of, as the database is usually the system’s bottleneck. Having several running replicas greatly improves reliability and allows for nearly instant recovery of database failures as any replica may quickly be promoted to the leader role to serve write traffic.

Reporting Database, CQRS View Database, Event-Sourced View, Source-Aligned (Native) Data Product Quantum (DPQ) of Data Mesh #

Reporting DB and CQRS View

It is common wisdom that a database is good for either OLTP (transactions) or OLAP (queries). Here we have two databases: one optimized for commands (write traffic protected with transactions) and another one for complex analytical queries. The databases differ at least in schema (OLAP schema is optimized for queries) and often vary in type (e.g. SQL vs NoSQL).

A Reporting Database (or Source-Aligned (Native) Data Product Quantum of Data Mesh [SAHP]) derives its data from a write-enabled database in the same subsystem (service) while a CQRS View [MP] or Event-Sourced View [DEDS] is fed a stream of events from another service from which it filters the data relevant to its owner. This way a CQRS View lets its owner service query (its replica of) the data that originally belonged to other services.

Memory Image, Materialized View #

Memory Image

Event sourcing (of Event-Driven Architecture or Microservices) is all about changes. A service persists only changes to its data instead of the current data. As a result, the service needs to aggregate its history into a Memory Image (Materialized View [DDIA]) by loading a snapshot and replaying any further events to rebuild its current state (which other architectural styles store in databases) and start operating.

Query Service, Front Controller, Data Warehouse, Data Lake, Aggregate Data Product Quantum (DPQ) of Data Mesh #

Query Service

A Query Service [MP] (or Aggregate Data Product Quantum of Data Mesh [SAHP]) subscribes to events from several full-featured services and aggregates them into its database, making it a CQRS View of several services or even the whole system. If any other service or a data analyst needs to process data which belongs to multiple services, it retrieves it from the Query Service which has already joined the data streams and represents the join in a convenient way.

A Front Controller [SAHP but not PEAA] is a Query Service embedded in the first (user-facing) service of a Pipeline. It collects status updates from downstream components of the Pipeline to track the state of every request being processed by the Pipeline.

Data Warehouse [SAHP] and Data Lake [SAHP] are analytical databases that connect directly to and import all the data from the operational (main) databases of all the system’s services. A Data Warehouse translates the imported data into its own unified schema while a Data Lake stores the imported data in its original formats.

External Search Index #

Search Index

Some domains require a kind of search which is not naturally supported by ordinary database engines. Full text search, especially NLP-enabled, is one such case. Geospatial data may be another. If you are comfortable with your main database(s), you can set up an External Search Index by deploying a product dedicated to the special kind of search that you need and feeding it updates from your main database.

Historical Data, Data Archiving #

Historical Data

It is common to store the history of sales in a database. However, once a month or two has passed, it is very unlikely that the historical records will ever be edited. And though they are queried on very rare occasions, like audits, they still slow down your database. Some businesses offload any data older than a couple of months to a cheaper archive storage which does not allow changes to the data and has limited query capabilities in order to keep the main datasets small and fast.

Database Cache, Cache-Aside #

Cache-Aside

Database queries are resource-heavy while databases scale only to a limited extent. That means that a highly loaded system benefits from bypassing its main database with as many queries as possible, that is usually achieved by storing recent queries and their results in an in-memory database (Cache-Aside). Each incoming query is first looked for in the fast cache, and if it is found then you are lucky to get the result immediately without having to consult the main database.

Keeping the cache consistent with the main database is the hard part. There are quite a few strategies (some of them treat the cache as a Proxy for the database): write-through, write-behind, write-around and refresh-ahead.

Evolutions #

Polyglot Persistence with derived storage can often be made subject to CQRS:

Polyglor Persistence - 1

Summary #

Polyglot Persistence employs several specialized databases to improve performance, often at the cost of eventual data consistency or implementing transactions in the application.

CC BY Denys Poltorak. Editor: Lars Noodén. Download from Leanpub or GitHub. Powered by odt2wiki and Hugo Book.