Real-Time Analytics Databases and Considerations

In this blog, we share what we learned while researching real-time analytics databases, walk through our use cases, and explain which database we chose for real-time analytics.
OLTP, OLAP, and HTAP
To better understand real-time analytics, let's first get familiar with three concepts that are at the core of our use case: OLTP, OLAP, and HTAP:
- OLTP: Online Transaction Processing (OLTP) systems are designed for transaction-oriented applications, ensuring data integrity through ACID (Atomicity, Consistency, Isolation, Durability) properties.
- OLAP: Online Analytical Processing (OLAP) systems enable high-speed, multidimensional analysis of large data volumes, aiding in data-driven decision-making.
- HTAP: Hybrid Transaction/Analytical Processing (HTAP) combines OLTP and OLAP functionalities, permitting real-time analytics on transactional data. The short sketch below contrasts the two query patterns.
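To make the distinction concrete, here is a minimal SQL sketch against a hypothetical orders table (the table and columns are illustrative, not from our schema). The first statements are typical OLTP work; the last is a typical OLAP aggregation. An HTAP system aims to serve both patterns on the same, current data.

-- OLTP: point lookup and update of a single row, protected by a transaction
BEGIN;
SELECT * FROM orders WHERE order_id = 1001 FOR UPDATE;
UPDATE orders SET status = 'SHIPPED' WHERE order_id = 1001;
COMMIT;

-- OLAP: scan and aggregate months of data for reporting
SELECT region, DATE_FORMAT(created_at, '%Y-%m') AS month, SUM(amount) AS revenue
FROM orders
WHERE created_at >= '2023-01-01'
GROUP BY region, month;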

Initial Implementation and the Challenges
We initially considered solving our use cases with data warehouse tools like Redshift and Snowflake. However, neither of these solutions worked for us, given our design pattern and the requirement for real-time data ingestion and merge.
Amazon Aurora
Amazon Aurora is a relational database service offered by Amazon Web Services (AWS). It is a fully managed, highly available, and scalable database engine that is compatible with MySQL and PostgreSQL. Amazon Aurora was designed to provide the performance and availability of commercial-grade databases at a fraction of the cost.

Key features and characteristics of Amazon Aurora include:
- Compatibility: Amazon Aurora is compatible with MySQL and PostgreSQL, which means you can use familiar database engines and tools with Aurora.
- High Performance: Aurora is designed for high performance and low latency. It uses a distributed, log-structured storage system that provides fast read and write operations.
- Scalability: Aurora can automatically scale both read and write operations to meet the demands of your applications. This makes it suitable for applications with fluctuating workloads.
- High Availability: Aurora replicates data across multiple Availability Zones (AZs) in a region, providing high availability and data durability. It can automatically fail over to a standby instance in case of hardware or software failures.
- Backups and Snapshots: Aurora automatically takes continuous backups of your database and allows you to create snapshots for point-in-time recovery.
- Security: It supports encryption at rest and in transit, as well as IAM-based authentication for database access control.
- Replication: Aurora supports read replicas, which can be used to offload read traffic from the primary database and improve read scalability.
- Global Databases: With Amazon Aurora Global Databases, you can create a global, distributed database that spans multiple AWS regions for low-latency global access.
- Serverless Aurora: AWS offers a serverless version of Aurora, which automatically adjusts capacity based on your actual usage, eliminating the need to provision and manage database servers.
- Aurora Data API: This API allows you to easily access your Aurora databases from serverless applications without managing database connections.
- Aurora Serverless v2: An updated version of the serverless Aurora, it offers even more performance and flexibility, with the ability to pause and resume database capacity.
Challenges Faced with Amazon Aurora
The system initially performed well, until it had to handle a throughput beyond 3K messages per second. This marked the onset of several challenges:
- Scalability limitation: As we exceeded a throughput of 3K messages per second, Aurora's Input/Output Operations Per Second (IOPS) limits became a bottleneck. This scalability constraint started impacting our database operations.
- Bloating issue: Every record update created a new record version and a dead tuple (the previous version of the record). When dead tuples were produced faster than the cleanup process could remove them, the tables bloated. Since VACUUM FULL was not able to reclaim the storage, disk usage increased continuously: for roughly 5 TB of data, Aurora was using 30+ TB of storage. (A query for spotting this kind of dead-tuple buildup is sketched after this list.)
- Maintenance burden: The bloating issue is directly tied to our maintenance challenges. With over 70 pipelines and a total write QPS surpassing 5K messages per second, we found that PostgreSQL's auto cleanup process, autovacuum, failed to keep pace with the rate of dead tuple generation. Therefore, manually running VACUUM or VACUUM FULL was required to recover the database. Our attempts with PostgreSQL tools like pg_repack and pgcompacttable also proved unsuccessful. Consequently, maintenance became increasingly complex and time-consuming.
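A minimal sketch of such a check, using PostgreSQL's standard pg_stat_user_tables statistics view (this is illustrative, not the exact tooling we ran):

-- List the tables with the most dead tuples and their last (auto)vacuum times
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;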
Amazon Aurora is also very costly to operate: a simple cluster setup would cost roughly 5x what we were paying for RDS.
The Search for Database Alternatives
To resolve Amazon Aurora’s limitations, we set out to find a better alternative that met the following requirements:
- Scalable with high write QPS: The database should support at least 10k+ write QPS and be horizontally scalable.
- Real-time analytics: The database should be able to provide high-speed or real-time OLAP capabilities.
- Fully distributed: The database should be distributed across multiple sites to provide high availability and fault tolerance.
- Strong consistency: The database should maintain strong consistency, ensuring that all users see the same data.
Considering all the above requirements, we initially explored many RDBMS alternatives including Google Spanner and YugabyteDB because we wanted to keep our change management minimal.
Google Spanner
Google Spanner is a distributed SQL database management and storage service offered by Google, fully managed on the Google Cloud Platform (GCP). However, Google Spanner turned out not to be a good fit for us because we did not find the proper tools to load our historical data. We explored Harbourbridge, an open-source tool for Spanner evaluation and migration, but it had a limitation of around 100 GB for data loading.
YugabyteDB
YugabyteDB is a high-performance transactional distributed SQL database for cloud-native applications, developed by Yugabyte. This database came very close to our use case because it is fully PostgreSQL compatible, horizontally scalable, and fully distributed. Unfortunately, it didn't work out either because of its scalability limitations: our success criteria demanded 7K+ transactions per second, but YugabyteDB was only able to scale up to 5K.
We also looked into other possible candidates like Google BigQuery, but none of them served our requirements well.
Choosing TiDB
After evaluating these RDBMS alternatives, we decided to add HTAP to our requirements, which led us to explore TiDB. It supports out-of-the-box scalability, consistency, availability, multi-site deployment topologies, and many other features. As a distributed SQL database, TiDB is made up of multiple components that communicate with each other and form a complete TiDB system.

- TiDB: The stateless SQL processing component that provides the client-facing endpoint. It consults PD to locate the TiKV node that holds the requested data.
- TiKV: A distributed transactional key-value store that keeps data in left-closed, right-open key ranges. Data is kept in shards with multiple replicas, and TiKV uses the Raft protocol for replication.
- PD: The Placement Driver (PD) keeps the cluster metadata, such as the locations of shard replicas. It is also responsible for scheduling shards across TiKV nodes. The PD leader handles these tasks, while the other PD nodes maintain high availability.
- TiFlash: The columnar storage extension that uses the Multi-Raft Learner protocol to replicate data from TiKV in real time, ensuring consistent data between the row-based TiKV storage engine and the columnar TiFlash storage engine.
The following features of TiDB addressed our key challenges and met our operational requirements:
- Horizontal scaling: TiDB’s architecture separates compute from storage, letting you scale out or in the compute or storage capacity online as needed. The scaling process is transparent to application operations and maintenance staff.
- Fully ACID compliant: TiDB is ACID-compliant and supports transactions out of the box. It supports both optimistic and pessimistic transactions, which sets it apart from many other distributed databases.
- High availability: TiKV stores data in multiple replicas and uses the Multi-Raft protocol to obtain the transaction log. A transaction can only be committed when the data has been successfully written into the majority of replicas. This guarantees strong consistency and high availability when a minority of replicas go down.
- Real-time HTAP: TiDB combines both row storage (TiKV) and columnar storage (TiFlash) in the same architecture, forming a streamlined tech stack that makes it easier to produce real-time analytics on operational data. A minimal sketch of enabling a TiFlash replica follows this list.
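As a minimal sketch of how the HTAP side is wired up (using a hypothetical orders table), a columnar TiFlash replica is added per table, and its replication status can be checked via information_schema:

-- Create one TiFlash (columnar) replica of the table
ALTER TABLE orders SET TIFLASH REPLICA 1;

-- Check replication status and progress of the TiFlash replica
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_NAME = 'orders';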
Our TiDB Infrastructure
Our TiDB infrastructure is deployed in a separate namespace in our staging and production Kubernetes clusters on Devtron. We use TiDB Operator (https://artifacthub.io/packages/olm/community-operators/tidb-operator) to manage the cluster and all administrative operations. Our cluster is deployed in a single AZ for now, but we use a replication count of 3 for data redundancy.
Our cluster configurations are as follows:
- PD: The PD layer has 3 nodes in a separate node group. The PD leader handles cluster metadata and scheduling tasks, while the other nodes maintain high availability.
- TiDB: The TiDB layer has 3 nodes of the m5.xlarge family. These nodes were chosen based on memory requirements, with 64 GB RAM & 8-core CPUs allocated for each TiDB node.
- TiKV: The TiKV layer has 3 nodes of the m5.xlarge family, each with 128 GB RAM & 16 vCPUs.
By deploying our TiDB cluster in our EKS cluster and carefully selecting node types to meet our processing and memory needs, we’ve created a robust, highly available infrastructure capable of handling our high data throughput requirements.
Tuning TiDB for Our Case
To make it work for our use case, we worked closely with the PingCAP team to tune the database. Here are some of the critical adjustments we made:
- Index Optimization:
Set the following parameters before starting the index creation:
SET @@global.tidb_ddl_reorg_worker_cnt = 16;
SET @@global.tidb_ddl_reorg_batch_size = 4096;
- Reset them to the default values after index creation:
SET @@global.tidb_ddl_reorg_worker_cnt = 4;
SET @@global.tidb_ddl_reorg_batch_size = 256;
- Partition Pruning: This is mainly important for partitioned tables. The optimizer analyzes the filter conditions in query statements and eliminates (prunes) partitions that do not contain any required data; a small query sketch is shown below.
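A minimal sketch, assuming a hypothetical events table that is range-partitioned by day on created_at: when the filter constrains the partitioning column, only the matching partitions are scanned, and EXPLAIN can be used to verify which partitions are accessed.

-- Only the partitions covering 2023-09-01 are scanned; the rest are pruned
EXPLAIN
SELECT COUNT(*)
FROM events
WHERE created_at >= '2023-09-01' AND created_at < '2023-09-02';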
- Tuning Analyze: Sometimes the auto analyzer in TiDB fails when a high volume of data is ingested. In that case, all the queries might use the wrong execution plan and end up scanning the full table. To avoid such a situation, we made the following changes in the TiDB configuration:
SET GLOBAL tidb_max_auto_analyze_time = 86400;
SET GLOBAL tidb_enable_pseudo_for_outdated_stats = OFF;
SET GLOBAL tidb_sysproc_scan_concurrency = 15;
If you are working with partitioned tables, we suggest running ANALYZE TABLE manually, one partition at a time, to avoid analyze failures; for example:
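Assuming the same hypothetical day-partitioned events table:

-- Refresh statistics for a single partition instead of the whole table
ANALYZE TABLE events PARTITION p20230901;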
Through adjustments like these, we were able to streamline our use of TiDB and achieve optimal performance for our real-time data source.
Other Database Considerations
Apache Druid
Apache Druid is a distributed, column-oriented database designed for real-time analytics. It is highly optimized for OLAP (Online Analytical Processing) queries and can ingest and query large volumes of data in real-time. Druid uses a columnar data storage format that enables fast query response times, even for complex queries.

Druid is designed to work with a variety of data types, including structured, semi-structured, and unstructured data. It can ingest data from various sources such as Kafka, Amazon S3, HDFS, and more. Druid is also highly scalable and fault-tolerant, making it suitable for organizations with large volumes of data.
How does Apache Druid work?
Druid is designed to be highly distributed and scalable. It consists of several components, including a Coordinator, Broker, Historical, and Real-time nodes. The Coordinator manages metadata and coordinates data ingestion, while the Broker routes queries to the appropriate Historical and Real-time nodes.
The Historical nodes store and serve historical data, while the Real-time nodes ingest and serve real-time data. Druid uses a segment-based architecture, where data is stored in segments that are optimized for query performance. Druid also supports data partitioning, which allows data to be split into smaller, manageable partitions for improved query performance.
Getting started with Apache Druid
To get started with Apache Druid, you will need to set up a Druid cluster. The Druid documentation provides detailed instructions on how to set up a cluster on different platforms such as AWS, GCP, or a local machine. Once you have set up the cluster, you can start ingesting data into Druid.
Ingesting data into Apache Druid
Druid supports various ingestion methods, including batch ingestion and real-time ingestion. Batch ingestion is used to ingest large volumes of data in bulk, while real-time ingestion is used to ingest data in real-time. Druid supports data ingestion from various sources such as Kafka, Amazon S3, HDFS, and more.
To ingest data into Druid, you will need to create a data source and specify the data format, ingestion method, and ingestion configuration. The Druid documentation provides detailed instructions on how to set up a data source and ingest data into Druid.
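As an illustrative sketch only (assuming a recent Druid release with the SQL-based ingestion / multi-stage query engine enabled, plus a hypothetical local JSON file and events datasource), a batch load can be expressed directly in SQL. Streaming sources such as Kafka are configured through separate supervisor specs rather than SQL.

-- SQL-based batch ingestion: load a JSON file into the "events" datasource
INSERT INTO events
SELECT
  TIME_PARSE("timestamp") AS __time,
  "user_id",
  "action"
FROM TABLE(
  EXTERN(
    '{"type": "local", "files": ["/tmp/events.json"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "user_id", "type": "string"}, {"name": "action", "type": "string"}]'
  )
)
PARTITIONED BY DAY;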
Querying data in Apache Druid
Once you have ingested data into Druid, you can start querying it in real-time. Druid supports a SQL-like query language called Druid SQL, which allows you to query data using SQL-like syntax. Druid SQL supports a variety of query types, including aggregation queries, filter queries, and join queries.
Druid also provides a REST API that allows you to query data programmatically. The API supports various query types and allows you to retrieve query results in different formats such as JSON, CSV, or TSV.
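As a small example, a Druid SQL aggregation over a hypothetical events datasource typically filters on the built-in __time column, which lets Druid prune segments before aggregating:

-- Count events per action over the last hour
SELECT "action", COUNT(*) AS event_count
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY "action"
ORDER BY event_count DESC;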
Druid is a powerful, open-source, columnar storage database designed specifically for fast OLAP (Online Analytical Processing) queries on large volumes of data. It's an excellent choice for analytics workloads, and many organizations use it successfully for this purpose. However, whether or not to use Druid for analytics depends on several factors, and there are situations where it might not be the best fit. Here are some considerations:
- Complexity: Druid is a specialized database designed for analytics, and it has a learning curve. Setting up and managing Druid clusters can be complex compared to more traditional relational databases or cloud-based analytical databases.
- Use Case: Druid is ideal for scenarios where you need sub-second query response times on large datasets. If your analytics requirements are not latency-sensitive or your dataset is relatively small, using a simpler and more general-purpose analytics database or data warehousing solution may be more cost-effective and straightforward.
- Data Ingestion: Druid is optimized for handling large-scale, high-velocity data streams. If your data ingestion patterns are not streaming or if you don't have high data velocity requirements, other databases like Amazon Redshift, Google BigQuery, Snowflake, or traditional data warehouses may be easier to work with.
- Ecosystem Integration: Consider your existing technology stack. If you're heavily invested in a particular cloud ecosystem (e.g., AWS, Google Cloud, or Azure), you might find it more convenient to use the native analytics solutions provided by those cloud providers.
- Maintenance and Operations: Druid requires ongoing maintenance and monitoring to ensure optimal performance. If your organization lacks the resources or expertise to manage Druid clusters, it could be challenging to maintain its reliability.
- Cost: While Druid is open source, the costs associated with running and maintaining a Druid cluster can vary depending on factors like data volume, query complexity, and hardware requirements. You should carefully evaluate the total cost of ownership when considering Druid for analytics.
- Query Flexibility: Druid excels at certain types of queries (e.g., time-series analytics), but it may not be as versatile as traditional relational databases for ad-hoc, complex querying.
In summary, Druid is an excellent choice for organizations with specific requirements for high-speed, low-latency analytics on large datasets, particularly in scenarios where traditional databases struggle to provide the required performance. However, it's essential to assess your organization's specific needs, expertise, and resources before choosing Druid or any other analytics database. Consider alternatives like cloud-based data warehouses or analytical databases, which may provide a more straightforward and cost-effective solution for your analytics workloads.
MongoDB
Since MongoDB is a NoSQL database, we would not need any data definition language (DDL) operations, such as creating schemas or tables.
As a standalone database, MongoDB also performed better than MySQL, so it could save us a great deal of maintenance cost.

MongoDB offered various options for data consistency, which, combined with multi-version concurrency control (MVCC) at the application layer, could implement simple transactions.
MongoDB is a NoSQL database designed for flexible and scalable document storage. While MongoDB can be used for certain types of analytics workloads, it's not typically the first choice for traditional analytics databases due to several factors:
- Schema Flexibility: MongoDB's flexible schema is well-suited for applications where data structures can change frequently. However, for analytics, especially in cases where you need to perform complex queries and aggregations, having a well-defined schema can be more efficient and make query optimization easier. Traditional relational databases or analytical databases often provide a better fit for this purpose.
- Aggregation Performance: MongoDB does offer aggregation framework features that allow you to perform complex data transformations and calculations. Still, it might not perform as well as specialized analytical databases (e.g., Amazon Redshift, Google BigQuery, Snowflake) or data warehousing solutions when it comes to handling large-scale, complex analytics queries on extensive datasets.
- Indexes: While MongoDB supports indexing, optimizing MongoDB for analytical workloads might require a different indexing strategy than what's needed for typical transactional workloads. Analytical databases are often optimized for analytical query patterns and can handle complex queries more efficiently.
- Data Modeling: In MongoDB, data is typically denormalized to support fast query access. This can lead to data duplication and make it challenging to maintain data consistency, especially in analytics scenarios where data evolves or when you need to aggregate data from multiple sources.
- Performance at Scale: MongoDB can scale horizontally (sharding) to handle large volumes of data, but optimizing it for analytics at scale can be complex and might not be as efficient as using dedicated analytical databases.
- Tooling and Ecosystem: Analytical databases often come with specialized tools and integrations for data visualization, reporting, and business intelligence. These tools might not be as readily available or well-integrated with MongoDB.
Requirements for Real-Time Data Analytics
We need ACID transactions. ACID is short for atomicity, consistency, isolation, and durability.
We need big data capacity. It is hard enough to do OLAP workloads in a traditional OLTP database like MongoDB, and our huge data volumes only added to the difficulty. To make matters worse, almost the entire big data ecosystem was built on top of MySQL. If we tried to integrate MongoDB into that ecosystem, it would be almost like reinventing the wheel.
We need more data constraints. MongoDB has almost no data constraints, which meant the data schema could be out of control. With more engineers involved in the development cycle, it might become a disaster. Data constraints are now a higher priority.
After analyzing our usage scenarios, we realized that we need a new database that provides:
- High availability
- High throughput
- ACID transactions
- Big data ecosystem
- Horizontal scalability, without intrusion into the application
Viewing these requirements, we knew that NoSQL wouldn’t be an option. To address massive amounts of data, NoSQL is designed to be “basically available, soft state, eventual consistency (BASE)”—just the opposite of ACID. The lack of an ACID transactional capability limits the use scenarios of MongoDB, as well as other NoSQL databases. Besides, NoSQL doesn’t support SQL, so we can’t reuse the rich resources accumulated by other RDBMSs in the past few decades.
We need an open-source, distributed SQL database that is MySQL compatible and features horizontal scalability, strong consistency, and high availability.
Conclusion
As a NewSQL database, TiDB is compatible with the MySQL protocol and supports horizontal scalability, high availability, ACID transactions, and real-time analytics on large amounts of data. That's why we chose this database over MongoDB and MySQL for our analytics use case.