Шардирование vs репликация: масштабируем БД
Представим себе БД услуг барбершопа в Южном Бутово. Данные лежат в одной табличке в PostgreSQL и используются внутренней системой для составления расписания мастеров, расчета премий и т. д.
Бородачей в Бутово оказалось сильно больше, чем предполагалось в изначальном бизнес-плане.
Владельцам барбершопа пришлось выкупить соседний спортзал и нанять еще 100 барберов.
Бизнес процветает. Роскошные бороды заполонили Бутово. Однако, БД начала подводить:• из-за выхода из строя сервера БД 400 людей остались не побритыми• в самое загруженное время, чтобы узнать расписание мастера,администратор вынужден ждать до 60 секунд ответа БД
Такие потери недопустимы для бизнеса. Что ж, давай разбираться. Технически есть 2 проблемы:• нет отказоустойчивости• слишком много чтения из БД
Для решения этих проблем можно использовать репликацию.
Репликация — полное копирование БД на такой же сервер.
Основной сервер назовем «Мастер», а дополнительные — «Реплика». Любой из этих серверов будем называть «Нода» (node).
Писать мы сможем только в мастер. А реплики будут вытягивать из мастера все изменения (синхронизироваться).
Принципы шардинга реляционных баз данных
![]()
Когда ваша база данных небольшая (10 ГБ), вы можете легко добавить больше ресурсов и таким образом масштабировать ее. Однако, поскольку таблицы растут, нужно подумать и о других способах масштабирования базы данных.
С одной стороны шардинг — лучший способ масштабирования. Он позволяет линейно масштабировать ресурсы базы данных, памяти и диска, дробя базу данных на более мелкие части. С другой стороны целесообразность использования шаринга — спорная тема. Интернет полон советов по шардингу, от «масштабирования инфраструктуры базы данных» до «почему вы никогда не используете шардинг». Итак, вопрос в том, какую сторону принять.
Всегда, когда возникал вопрос шардинга, ответ был «раз на раз не приходится». Теория шардинга проста: выберите один ключ (столбец), который равномерно распределяет данные. Убедитесь, что большинство запросов могут быть решены с помощью этого ключа. Эта теория проста, но только до того момента, пока вы не приступите к практике.
В Citus мы помогли сотням команд, когда они обращались к шардингу баз данных. С получением опыта мы обнаружили, что имеются ключевые шаблоны.
В этой статье мы сначала рассмотрим ключевые параметры, которые влияют на успех шардинга, а затем раскроем основную причину, по которой мнения о шардинге столь разные. Когда дело доходит до шардинга базы данных, на успех в большей степени влияет тип приложения, которое вы создаете.
3 ключевых параметра успешности шардинга
На успех шардинга базы данных влияют 3 ключевых параметра. На диаграмме они показаны на трех осях, а также приведены примеры известных компаний.
Ось X на диаграмме показывает тип рабочей нагрузки. Эта ось начинается с транзакционных нагрузок слева и продолжается организацией хранилищ данных. Изменения этой оси более заметны при шардинге.
Ось Z демонстрирует еще один важный параметр — нахождение в жизненном цикле приложения. Сколько таблиц у вас есть в базе данных (10, 100, 1000) или как долго приложение находится в производстве? Приложение, запущенное на PostgreSQL в течение нескольких месяцев, будет легче шардироваться, чем приложение, которое было в производстве в течение многих лет.
В Citus мы обнаружили, что большинство пользователей имеют достаточно развитые приложения. Когда приложение развито, ось У становится критической. К сожалению, изменения этой оси не так заметны, как изменения остальных осей. Фактически большинство статей, которые противоречат выводам о фрагментации, предоставляют свои рекомендации в контексте одного типа приложения.
Что имеет наибольшее значение в шардинге: тип приложения (B2B или B2C)
Ось У на диаграмме показывает наиболее важный параметр при шардинге баз данных — тип приложения. В верхней части этой оси находятся приложения B2B, модели данных которых более удобны для фрагментации. В нижней части этой оси — приложения B2C, такие как Amazon и Facebook, которые требуют больше работы. Далее мы расскажем о различиях трех известных компаний.
B2B Example: Salesforce
Хорошим примером приложения для B2B является программное обеспечение CRM. Когда вы создаете CRM-приложение, такое как Salesforce, ваше приложение будет обслуживать других клиентов. Например, компания GE Aviation будет одним из ваших клиентов, использующих Salesforce.
В GE Aviation есть пользователи, которые входят в свою панель мониторинга компании. GE также фиксирует:
потенциальных клиентов, с которыми они могут вести бизнес,
контакты/людей, которые уже известны и с которыми установлены деловые отношения,
счета, которые представляют бизнес-единицы и у которых есть работающие на них контакты,
возможности, которые являются событиями продаж, связанными с учетной записью и одного или нескольких контактов.
Сопоставление этих сложных соотношений выглядит следующим образом:
График выглядит сложным. Но изучив график, можно заметить, что большинство таблиц происходит из таблицы клиентов. Графы можно преобразовать, добавив столбец customer_id ко всем таблицам.
С помощью этого простого преобразования у базы данных теперь есть хороший ключ оглавления: customer_id. Он равномерно распределяет данные, и большинство запросов к базе данных будут включать ключ клиента. Кроме того, вы можете размещать таблицы в client_id и продолжать использовать ключевые функции реляционной базы данных, такие как транзакция, объединение таблиц и ограничение внешнего ключа.
Другими словами, если у вас есть приложение B2B, характер ваших данных дает вам фундаментальное преимущество при шардинге.
Пример B2C: Amazon.com
Amazon.com — хороший пример расширенного приложения B2C. Если бы вы строили сайт Amazon.com сегодня, у вас было бы несколько концепций для рассмотрения. Во-первых, пользователь приходит на ваш сайт и начинает смотреть продукты: книги, электронику. Когда пользователь посещает страницу продукта, скажем, Harry Potter 7, он видит информацию каталога, связанную с этим продуктом. Пример информации о каталоге включает автора, цену, обложку и другие изображения.
Когда пользователь регистрируется на веб-сайте, он получает доступ к данным, связанным с пользователем. Пользователь должен быть аутентифицирован, может писать отзывы о любимых продуктах и добавлять элементы в корзину покупок. В какой-то момент пользователь решает сделать покупку и размещает заказ. Заказ обрабатывается, забирается со склада и отправляется.
При сопоставлении отношений в реляционнной базе данных вы обнаружите, что они отличаются от примера Salesforce одной важной чертой. У вас нет единого измерения, которое является центром всех отношений, а есть как минимум три: каталог, пользователь и данные заказа.
При фрагментации данных типа B2C один из вариантов заключается в преобразовании приложения в микросервисы. Например, есть связанные службы каталога, которые владеют каталогом и предлагают данные, а также связанные с пользователем службы, которые владеют данными корзины проверки подлинности и покупок. API-интерфейсы между службами определяют границы доступа к базам данным.
При создании такого разделения между данными можно шардировать данные, которые предоставляют каждую услугу или группу услуг отдельно. Фактически Amazon.com использовал аналогичный подход к шардингу, когда перешел на сервис-ориентированную архитектуру.
Такой подход к очертаниям имеет более выгодное соотношение затрат и стоимости, чем шардинг приложения B2B. Что касается преимуществ, при разделении данных на группы таким образом можно полагаться на базу данных для объединения данных из разных источников или обеспечения транзакций и ограничений для групп данных. Со стороны затрат теперь нужно очертить не одну, а несколько групп данных.
Пример B2C2C: Instacart
Подкатегория, которая находится между B2B и B2C, включает такие приложения, как Postmates, Instacart или Lyft. Например, Instacart доставляет продукты пользователям из местных магазинов. В некотором смысле Instacart похож на пример Amazon.com. Instacart имеет три основных габаритных поля: местные магазины (предлагают продукты), пользователи (заказывают продукты) и водители (доставляют продукты). Таким образом, трудно выбрать один ключ, на котором можно очертить базу данных.
Если у вас есть расширенные приложения B2C2C, такие как Instacart, вы можете следовать другой стратегии. Большинство таблиц базы данных имеют другое измерение: география. В этом случае вы можете выбрать город или местоположение в качестве своего ключа и очертить таблицы по ключу географии.
В общем, шардинг приложений B2B2C / B2C2C находится в середине спектра. Шардинг для B2B2C имеет тенденцию к более высокому соотношению выгод и затрат, чем шардинг приложений B2C, и более низкое, чем приложения B2B.
Мнения о шардинге
Интернет полон мнений о шардинге. Мы обнаружили, что большинство этих мнений формируются с учетом одного типа приложения. Фактически тип приложения (B2B или B2C) влияет на успех более всего. В частности, если у вас приложение для B2B, то вам будет легче шардировать реляционную базу данных.
При планировании масштабирования базы данных нужно иметь полное представление об этом процесcе и оценить все параметры с учетом требований проекта.
Что такое шардирование бд
Learn about Oracle Sharding capabilities and benefits in this high level conceptual discussion.
What is Sharding
Sharding is a method of partitioning data to distribute the computational and storage workload, which helps in achieving hyperscale computing.
Hyperscale computing is a computing architecture that can scale up or down quickly to meet increased demand on the system. This architecture innovation was originally driven by internet giants that run distributed sites and has been adopted by large-scale cloud providers.
Companies often achieve hyperscale computing using a technology called database sharding, in which they distribute segments of a data set—a shard—across lots of databases on lots of different computers.
Sharding uses a shared-nothing architecture in which shards share no hardware or software. All of the shards together make up a single logical database, called a sharded database .
From the perspective of the application, a sharded database looks like a single database: the number of shards, and the distribution of data across those shards, are completely transparent to database applications. From the perspective of a database administrator, a sharded database consists of multiple databases that can be managed collectively.
Figure 1-1 Distribution of a Table Across Database Shards

Description of «Figure 1-1 Distribution of a Table Across Database Shards»
About Oracle Sharding
Oracle Sharding is a feature of Oracle Database that lets you automatically distribute and replicate data across a pool of Oracle databases that share no hardware or software. Oracle Sharding provides the best features and capabilities of mature RDBMS and NoSQL databases, as described here.
SQL language used for object creation, strict data consistency, complex joins, ACID transaction properties, distributed transactions, relational data store, security, encryption, robust performance optimizer, backup and recovery, and patching with Oracle Database
Oracle innovations and enterprise-level features, including Advanced Security, Automatic Storage Management (ASM), Advanced Compression, partitioning, high-performance storage engine, SMP scalability, Oracle RAC, Exadata, in-memory columnar, online redefinition, JSON document store, and so on
Sharding-aware Oracle Database tools, such as SQL Developer, Enterprise Manager Cloud Control, Recovery Manager (RMAN), and Data Pump, for sharded database application development and management
Programmatic interfaces, such as Java Database Connectivity (JDBC), Oracle Call Interface (OCI), Universal Connection Pool (UCP), Oracle Data Provider for .NET (ODP.NET), and PL/SQL, including extensions for sharded application development
Extreme availability with Oracle Data Guard and Active Data Guard.
Oracle GoldenGate replication support for Oracle Sharding High Availability is deprecated in Oracle Database 21c.
Support for multi-model data like relational, text, and JSON
Existing life-cycle management and operational processes can be kept, leveraging in-house and world-wide Oracle database administrator skill sets
Extreme scalability and availability of NoSQL databases
Oracle Sharding as Distributed Partitioning
Sharding is a database scaling technique based on horizontal partitioning of data across multiple independent physical databases. Each physical database in such a configuration is called a shard.
From the perspective of an application, a sharded database in Oracle Sharding looks like a single database; the number of shards, and the distribution of data across those shards, are completely transparent to the application.
Even though a sharded database looks like a single database to applications and application developers, from the perspective of a database administrator, a sharded database consists of a set of discrete Oracle databases, each of which is a single shard, that can be managed collectively.
A sharded table is partitioned across all shards of the sharded database. Table partitions on each shard are not different from partitions that could be used in an Oracle database that is not sharded.
The following figure shows the difference between partitioning on a single logical database and partitions distributed across multiple shards.
Figure 1-2 Sharding as Distributed Partitioning

Description of «Figure 1-2 Sharding as Distributed Partitioning»
Oracle Sharding automatically distributes the partitions across shards when you issue the CREATE SHARDED TABLE statement, and the distribution of partitions is transparent to applications. The figure above shows the logical view of a sharded table and its physical implementation.
Benefits of Oracle Sharding
Oracle Sharding provides linear scalability, complete fault isolation, and global data distribution for the most demanding applications.
Key benefits of Oracle Sharding include:
The Oracle Sharding shared–nothing architecture eliminates performance bottlenecks and provides unlimited scalability. Oracle Sharding supports scaling up to 1000 shards.
Extreme Availability and Fault Isolation
Single points of failure are eliminated because shards do not share resources such as software, CPU, memory, or storage devices. The failure or slow-down of one shard does not affect the performance or availability of other shards.
Shards are protected by Oracle MAA best practice solutions, such as Oracle Data Guard and Oracle RAC.
An unplanned outage or planned maintenance of a shard impacts only the availability of the data on that shard, so only the users of that small portion of the data are affected, for example, during a failover brownout.
Geographical Distribution of Data
Sharding enables Global Database where a single logical database could be distributed over multiple geographies. This makes it possible to satisfy data privacy regulatory requirements (Data Sovereignty) as well as allows to store particular data close to its consumers (Data Proximity).
Example Applications using Database Sharding
Oracle Sharding provides benefits for a variety of use cases.
Real time OLTP applications have a very high transaction processing throughput, a large user population, huge amounts of data, and require strict data consistency and management at scale. Some examples include internet-facing consumer applications, financial applications such as mobile payments, large scale SaaS applications such as billing and medical applications. The benefits of using Oracle Sharding for such applications include:
- Linear scalability of transactions per second, with response time staying constant as new shards are added to support larger data volume
- Better application SLAs, because planned and unplanned outages on any given shard does not impact the data stored and available on other shards
- Strict data consistency for transactional applications
- Transactions spanning multiple shards
- Support for complex joins, triggers, and stored procedures
- Simplified manageability at scale
Many enterprise applications are global in nature, where the same application serves customers in multiple geographic locations. Such applications typically use a single logical global database which is shared across multiple geographical regions. The benefits of a shared global database include:
- Strict enforcement of data sovereignty, where data privacy regulations require data to stay in a certain geographic location, region, country, or even state.
- Reduction of data replication across locations
- Better application SLAs, because planned and unplanned outages in one region do not impact other regions
Internet of Things and Data Streaming Applications
Typically such applications collect large amounts of data and stream it at a very high speed. Oracle Sharding has optimized data stream libraries which use Oracle Database’s direct path I/O technology to load data into the sharded database with extremely high speed. Data load requirements for these applications can be in to 100s of millions of records per second. Once the data is loaded directly into the database, it is available for immediate processing with advanced query processing and analytic capabilities.
Many machine learning applications require training and scoring of models in real time. Model training and scoring for many applications using algorithms like anomaly detection, and clustering is specific to a given entity (for example, a given user’s financial transaction patterns or specific device metrics at a certain time of the day). This kind of data can easily be shared by using a sharding key specific to the user or devices. Additionally, Oracle Database Machine Learning algorithms can be applied directly in the database obviating the need for a separate data pipeline and machine learning processing infrastructure.
Big Data Analytics
When you have terabytes of data, sharding means you don’t have to warehouse data to do analytics on it. With up to 1000 shards in capacity, Oracle Sharding can turn a relational database into a warehouse-sized data store. With the Federated Sharding solution, multiple database installations in different locations that run the same application can be converted into a federated sharded database so that you can run data analytics without moving the data.
NoSQL solutions lack major RDBMS features, such as relational schema, SQL, complex data types, online schema changes, multi-core scalability, security, ACID properties, CR for single-shard operations, and so on. With Oracle Sharding you get the nearly limitless scaling and sharding you had with NoSQL and all of the features and benefits of Oracle Database.
Flexible Deployment Models
The shared-nothing architecture of Oracle Sharding lets you keep your data on-premises, in the cloud, or on a hybrid of cloud and on-premises systems. Because the database shards do not share any resources, the shards can exist anywhere on a variety of on-premises and cloud systems.
You can choose to deploy all of the shards on-premises, have them all in the cloud, or you can split them up between cloud and on-premises systems to suit your needs.
Shards can be deployed on all database deployment models such as single instance, Exadata, and Oracle RAC.
High Availability in Oracle Sharding
Oracle Sharding is tightly integrated with Oracle Data Guard to provide high availability and disaster recovery. Replication is automatically configured and deployed when the sharded database is created.
Oracle Data Guard replication maintains one or more synchronized copies (standbys) of a shard (the primary) for high availability and data protection. Standbys can be deployed locally or remotely, and when using Oracle Active Data Guard can also be open for read-only access. Use this option when application needs strict data consistency and zero data loss.
Oracle GoldenGate is used for fine-grained active-active replication. Though applications must be able to deal with conflicts and data loss upon potential failover.
Oracle GoldenGate replication support for Oracle Sharding High Availability is deprecated in Oracle Database 21c.
Optionally, you can use Oracle RAC for shard-level high availability, complemented by replication, to maintain shard-level data availability in the event of a cluster outage. Each shard can be deployed on an Oracle RAC cluster to give it instant protection from node failure. For example, each shard could be a two node Oracle RAC cluster.
Sharding Methods
Because Oracle Sharding is based on table partitioning, all of the sub-partitioning methods provided by Oracle Database are also supported by Oracle Sharding. A data sharding method controls the placement of the data on the shards. Oracle Sharding supports system-managed, user defined, or composite sharding methods.
System-managed sharding does not require you to map data to shards. The data is automatically distributed across shards using partitioning by consistent hash. The partitioning algorithm uniformly and randomly distributes data across shards.
User-defined sharding lets you explicitly specify the mapping of data to individual shards. It is used when, because of performance, regulatory, or other reasons, certain data needs to be stored on a particular shard, and the administrator needs to have full control over moving data between shards.
Composite sharding allows you to use two levels of sharding. First the data is sharded by range or list and then it is sharded further by consistent hash.
In many use cases, especially for data sovereignty and data proximity requirements, the composite sharding method offers the best of both system-managed and user-defined sharding methods, giving you the automation you want and the control over data placement you need.
Client Request Routing
Oracle Sharding supports direct, key-based routing from an application to a shard, routing by proxy with the shard catalog, and routing to middle tiers, such as application containers, web containers, and so on, which are affinitized with shards. Oracle Database client drivers and connection pools are sharding aware.
Key-based routing. Oracle client-side drivers (JDBC, OCI, UCP, ODP.NET) can recognize sharding keys specified in the connection string for high performance data dependent routing. A shard routing cache in the connection layer is used to route database requests directly to the shard where the data resides.
Routing by proxy. Oracle Sharding supports routing for queries that do not specify a sharding key, giving any database application the flexibility to run SQL statements, without specifying the shards on which the query should be processed. Proxy routing can handle single-shard queries and multi-shard queries.
Middle-tier routing. In addition to sharding the data tier, you can shard the web tier and application tier, distributing the shards of those middle tiers to service a particular set of database shards, creating a pattern known as a swim lane . A smart router can route client requests based on specific sharding keys to the appropriate swim lane, which in turn establishes connections on its subset of shards.
Query Processing
No changes to query and DML statements are required to support Oracle Sharding. Most existing DDL statements will work the same way on a sharded database with the same syntax and semantics as they do on a non-sharded Oracle Database.
In the same way that DDL statements can be processed on all shards in a configuration, so too can certain Oracle-provided PL/SQL procedures.
Oracle Sharding also has its own keywords in the SQL DDL statements, which can only be run against a sharded database.
High Speed Data Ingest
SQL*Loader enables direct data loading into the database shards for a high speed data ingest.
SQL*Loader is a bulk loader utility used for moving data from external files into the Oracle database. Its syntax is similar to that of the DB2 load utility, but comes with more options. SQL*Loader supports various load formats, selective loading, and multi-table loads. Other benefits include:
- Streaming capability lets you receive data from a large group of clients without blocking
- Group records according to Oracle RAC shard affinity using native UCP
- Optimize CPU allocation while decoupling record processing from I/O
- Fastest insert method for the Oracle Database through Direct Path Insert, bypassing SQL and writing directly in the database files
Deployment Automation
Sharded database deployment is highly automated with Terraform, Kubernetes, and Ansible scripts.
The deployment scripts take a simple input file describing your desired deployment topology, and run from a single host to deploy shards to all of the sharded database hosts. Pause, resume, and cleanup operations are included in the scripts in case of errors.
Data Migration
The Sharding Advisor tool helps with sharded database schema design for migration from a non-sharded to sharded database. Oracle Data Pump is sharding aware and is used to migrate data from a non-sharded Oracle database to a sharded Oracle database.
The Sharding Advisor is a tool provided with Oracle Sharding which can help you design an optimal sharded database configuration by analyzing your current database schema and workload, and recommending Oracle Sharding topology configurations and database schema designs. The Sharding Advisor bases recommendations on key goals such as parallelism (distributing query processing evenly among shards), minimizing cross-shard join operations, and minimizing duplicated data.
Oracle Data Pump
You can load data directly into the shards by running Oracle Data Pump on each shard. This method is very fast because the entire data loading operation can complete within the period of time needed to load the shard with the maximum subset of the entire data set.
Lifecycle Management of Shards
The Oracle Sharding command-line interface and Oracle Enterprise Manager help you manage your sharded database.
Using the tools provided you can:
- Provision new sharded databases with scripts
- Scale out as needed by adding more shards online and take advantage of automatic rebalancing
- Scale in by moving data and consolidating hardware when loads are low
- Monitor performance statistics using Enterprise Manager
- Back up for disaster recovery using Cloud Backup Service, RMAN, and Zero Data Loss Recovery Appliance
- Patches and Upgrades automated with oPatchAuto in rolling mode
Federated Sharding
Unify multiple existing databases into one sharded database architecture.
Global businesses might have multiple instances of same applications deployed for multiple departments in multiple regions. Federated sharding allows mapping of databases of such applications in to a single federated database and provides the following benefits.
- Queries can be seamlessly processed against a single federated database using multi-shard query coordinator
- Removes the need to replicate data for reporting and analytics purposes
- Tolerance for differences in schema and database versions
What’s New in Oracle Sharding 21c
The following are major new features for Oracle Sharding in Oracle Database 21c.
Sharding Advisor is a tool provided with Oracle Sharding which can help you design an optimal sharded database configuration by analyzing your current database schema and workload, and recommending Oracle Sharding topology configurations and database schema designs. The Sharding Advisor bases recommendations on key goals such as parallelism (distributing query execution evenly among shards), minimizing cross-shard join operations, and minimizing duplicated data.
See Using the Sharding Advisor for information about using Sharding Advisor.
Federated Sharding lets you unify multiple existing databases into one sharded database architecture. Oracle Sharding, in a federated sharding configuration, treats each independent database as a shard, and as such can issue multi-shard queries on those shards.
See Combine Existing Non-Sharded Databases into a Federated Sharded Database for information about created a federated sharded database.
Centralized Backup and Restore provides an automated and centralized management and monitoring infrastructure for sharded database backup and restore operations, including logging those operations using Oracle MAA best practices.
See Backing Up and Recovering a Sharded Database for information about configuring centralized backup and restore operations.
Where To Go From Here
Planning and deploying a sharded database configuration that best fits your requirements can be a daunting task. The following roadmap can guide you through the process, from initial planning to life cycle management of a sharded database.
Database Sharding: Concepts and Examples
Your application is growing. It has more active users, more features, and generates more data every day. Your database is now becoming a bottleneck for the rest of your application. Database sharding could be the solution to your problems, but many do not have a clear understanding of what it is and, especially, when to use it. In this article, we’ll cover the basics of database sharding, its best use cases, and the different ways you can implement it.
Jump to:
What is database sharding?
Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows for larger datasets to be split into smaller chunks and stored in multiple data nodes, increasing the total storage capacity of the system. See more on the basics of sharding here.
Similarly, by distributing the data across multiple machines, a sharded database can handle more requests than a single machine can.
Sharding is a form of scaling known as horizontal scaling or scale-out, as additional nodes are brought on to share the load. Horizontal scaling allows for near-limitless scalability to handle big data and intense workloads. In contrast, vertical scaling refers to increasing the power of a single machine or single server through a more powerful CPU, increased RAM, or increased storage capacity.

Do you need database sharding?
Database sharding, as with any distributed architecture, does not come for free. There is overhead and complexity in setting up shards, maintaining the data on each shard, and properly routing requests across those shards. Before you begin sharding, consider if one of the following alternative solutions will work for you.
Vertical scaling
By simply upgrading your machine, you can scale vertically without the complexity of sharding. Adding RAM, upgrading your computer (CPU), or increasing the storage available to your database are simple solutions that do not require you to change the design of either your database architecture or your application. 
Specialized services or databases
Depending on your use case, it may make more sense to simply shift a subset of the burden onto other providers or even a separate database. For example, blob or file storage can be moved directly to a cloud provider such as Amazon S3. Analytics or full-text search can be handled by specialized services or a data warehouse. Offloading this particular functionality can make more sense than trying to shard your entire database.
Replication
If your data workload is primarily read-focused, replication increases availability and read performance while avoiding some of the complexity of database sharding. By simply spinning up additional copies of the database, read performance can be increased either through load balancing or through geo-located query routing. However, replication introduces complexity on write-focused workloads, as each write must be copied to every replicated node. 
On the other hand, if your core application database contains large amounts of data, requires high read and high write volume, and/or you have specific availability requirements, a sharded database may be the way forward. Let’s look at the advantages and disadvantages of sharding.
Advantages of sharding
Sharding allows you to scale your database to handle increased load to a nearly unlimited degree by providing increased read/write throughput, storage capacity, and high availability. Let’s look at each of those in a little more detail.
- Increased read/write throughput — By distributing the dataset across multiple shards, both read and write operation capacity is increased as long as read and write operations are confined to a single shard.
- Increased storage capacity — Similarly, by increasing the number of shards, you can also increase overall total storage capacity, allowing near-infinite scalability.
- High availability — Finally, shards provide high availability in two ways. First, since each shard is a replica set, every piece of data is replicated. Second, even if an entire shard becomes unavailable since the data is distributed, the database as a whole still remains partially functional, with part of the schema on different shards.
Disadvantages of sharding
Sharding does come with several drawbacks, namely overhead in query result compilation, complexity of administration, and increased infrastructure costs.
- Query overhead — Each sharded database must have a separate machine or service which understands how to route a querying operation to the appropriate shard. This introduces additional latency on every operation. Furthermore, if the data required for the query is horizontally partitioned across multiple shards, the router must then query each shard and merge the result together. This can make an otherwise simple operation quite expensive and slow down response times.
- Complexity of administration — With a single unsharded database, only the database server itself requires upkeep and maintenance. With every sharded database, on top of managing the shards themselves, there are additional service nodes to maintain. Plus, in cases where replication is being used, any data updates must be mirrored across each replicated node. Overall, a sharded database is a more complex system which requires more administration.
- Increased infrastructure costs — Sharding by its nature requires additional machines and compute power over a single database server. While this allows your database to grow beyond the limits of a single machine, each additional shard comes with higher costs. The cost of a distributed database system, especially if it is missing the proper optimization, can be significant.
Having considered the pros and cons, let’s move forward and discuss implementation.
How does sharding work?
In order to shard a database, we must answer several fundamental questions. The answers will determine your implementation.
First, how will the data be distributed across shards? This is the fundamental question behind any sharded database. The answer to this question will have effects on both performance and maintenance. More detail on this can be found in the “Sharding Architectures and Types” section.
Second, what types of queries will be routed across shards? If the workload is primarily read operations, replicating data will be highly effective at increasing performance, and you may not need sharding at all. In contrast, a mixed read-write workload or even a primarily write-based workload will require a different architecture.
Finally, how will these shards be maintained? Once you have sharded a database, over time, data will need to be redistributed among the various shards, and new shards may need to be created. Depending on the distribution of data, this can be an expensive process and should be considered ahead of time.
With these questions in mind, let’s consider some sharding architectures.
Sharding architectures and types
While there are many different sharding methods, we will consider four main kinds: ranged/dynamic sharding, algorithmic/hashed sharding, entity/relationship-based sharding, and geography-based sharding.
Ranged/dynamic sharding
Ranged sharding, or dynamic sharding, takes a field on the record as an input and, based on a predefined range, allocates that record to the appropriate shard. Ranged sharding requires there to be a lookup table or service available for all queries or writes. For example, consider a set of data with IDs that range from 0-50. A simple lookup table might look like the following:
| Range | Shard ID |
| [0, 20) | A |
| [20, 40) | B |
| [40, 50] | C |
The field on which the range is based is also known as the shard key. Naturally, the choice of shard key, as well as the ranges, are critical in making range-based sharding effective. A poor choice of shard key will lead to unbalanced shards, which leads to decreased performance. An effective shard key will allow for queries to be targeted to a minimum number of shards. In our example above, if we query for all records with IDs 10-30, then only shards A and B will need to be queried.
Two key attributes of an effective shard key are high cardinality and well-distributed frequency. Cardinality refers to the number of possible values of that key. If a shard key only has three possible values, then there can only be a maximum of three shards. Frequency refers to the distribution of the data along the possible values. If 95% of records occur with a single shard key value then, due to this hotspot, 95% of the records will be allocated to a single shard. Consider both of these attributes when selecting a shard key.
Range-based sharding is an easy-to-understand method of horizontal partitioning, but the effectiveness of it will depend heavily on the availability of a suitable shard key and the selection of appropriate ranges. Additionally, the lookup service can become a bottleneck, although the amount of data is small enough that this typically is not an issue.
Algorithmic/hashed sharding
Algorithmic sharding or hashed sharding, takes a record as an input and applies a hash function or algorithm to it which generates an output or hash value. This output is then used to allocate each record to the appropriate shard.
The function can take any subset of values on the record as inputs. Perhaps the simplest example of a hash function is to use the modulus operator with the number of shards, as follows:
Hash Value=ID % Number of Shards
This is similar to range-based sharding — a set of fields determines the allocation of the record to a given shard. Hashing the inputs allows more even distribution across shards even when there is not a suitable shard key, and no lookup table needs to be maintained. However, there are a few drawbacks.
First, query operations for multiple records are more likely to get distributed across multiple shards. Whereas ranged sharding reflects the natural structure of the data across shards, hashed sharding typically disregards the meaning of the data. This is reflected in increased broadcast operation occurrence.
Second, resharding can be expensive. Any update to the number of shards likely requires rebalancing all shards to moving around records. It will be difficult to do this while avoiding a system outage.
Entity-/relationship-based sharding
Entity-based sharding keeps related data together on a single physical shard. In a relational database (such as PostgreSQL, MySQL, or SQL Server), related data is often spread across several different tables.
For instance, consider the case of a shopping database with users and payment methods. Each user has a set of payment methods that is tied tightly with that user. As such, keeping related data together on the same shard can reduce the need for broadcast operations, increasing performance.
Geography-based sharding
Geography-based sharding, or geosharding, also keeps related data together on a single shard, but in this case, the data is related by geography. This is essentially ranged sharding where the shard key contains geographic information and the shards themselves are geo-located.
For example, consider a dataset where each record contains a “country” field. In this case, we can both increase overall performance and decrease system latency by creating a shard for each country or region, and storing the appropriate data on that shard. This is a simple example, and there are many other ways to allocate your geoshards which are beyond the scope of this article.
Summary
We’ve defined what sharding is, discussed when to use it, and explored different sharding architectures. Sharding is a great solution for applications with large data requirements and high-volume read/write workloads, but it does come with additional complexity. Consider whether the benefits outweigh the costs or whether there is a simpler solution before you begin implementation.