Citus – Exploration and Evaluation
The Fellowship
One of the cool things we do here at Enova is the “Fellowship.” It’s a chance for people to pitch ideas on exploring new technologies that could benefit Enova and our customers. If the pitch is successful, the goal of the Fellowship is to deliver a proof-of-concept at the end of four weeks. During my Fellowship, I investigated whether Citus could provide performance and scalability benefits for Real-time Reporting Analytics at Enova.
What is Citus?
- PostgreSQL extension (requires 9.5 or later)
- Turns a single database into a distributed database
- Enables sharding by distributing data across separate physical database servers
- Supports parallel queries
How does Citus work?
Citus uses a master/worker configuration: the master node stores metadata and runs its own distributed query planner, while the worker nodes store the shards. The application is unaware of this setup, and all queries are directed transparently at the master.
Citus parallelizes incoming queries by breaking each one into multiple fragment queries, which run in parallel against the worker shards. This allows Citus to use the processing power of all the nodes in the cluster, and of the individual cores on each node, for every query. Because performance scales with the combined computing power of all the cores in the cluster, this parallelization can lead to a dramatic reduction in query times compared to traditional PostgreSQL on a single server.
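As a concrete sketch of what this looks like (using the v5.0 API; the table and column names here are hypothetical, and newer Citus versions use create_distributed_table() instead), distributing a table takes two commands on the master:

    -- Declare how 'loans' is distributed: hash-partitioned on 'customer_id'.
    SELECT master_create_distributed_table('loans', 'customer_id', 'hash');
    -- Physically create 16 shards across the workers, each stored on 2 nodes.
    SELECT master_create_worker_shards('loans', 16, 2);

From then on, a query such as SELECT count(*) FROM loans issued at the master is split into one fragment query per shard and executed on the workers in parallel.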
Benefits
- Easy Setup: download the open-source extension from PGXN and install it (see the sketch after this list)
- No Cost: the Community edition is free
- Simple interface with predefined commands
- Runs Anywhere: cloud (I tested this on an AWS multi-node setup) or on-premise
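To give a sense of how light the setup is, here is a sketch of the SQL side, assuming the v5.0 packaging (in that version, workers are listed in the master’s pg_worker_list.conf file rather than registered via SQL):

    -- After installing the extension on every node and adding 'citus' to
    -- shared_preload_libraries in postgresql.conf, enable it per database:
    CREATE EXTENSION citus;
    -- On the master, confirm the workers are visible:
    SELECT * FROM master_get_active_worker_nodes();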
Limitations
There are substantial issues with Citus (v5.0) that cannot be overlooked if we were to deploy it in production. Work-arounds do exist for some of these, but they’re not ideal for our use cases. Note: Citus 6.0 has since been released, but it requires an upgrade to PostgreSQL 9.6.
Here are some of the drawbacks of v5.0:
- No support for these standard PostgreSQL features:
- Serial Primary Keys
- Triggers (work-around: create an updatable view on the table)
- Sub-queries in the WHERE clause (IN/NOT IN, EXISTS/NOT EXISTS)
- SELECT DISTINCT
- Foreign Keys (fix released in Citus 6.0, but compatible only with PostgreSQL 9.6)
- CTEs (work-around: TEMP tables; see the sketch after this list)
- Window Functions
- Set Operations
- For INSERTs:
- INSERT INTO .. SELECT .. is not supported
- Must specify the distribution column explicitly (most of our tables have serial Primary Keys)
- Distribution Key issues:
- Primary Key must be the Distribution Key for each table.
- No support for partitioning on multiple columns (work-around: create a composite type and partition on it)
- Join Issues:
- Joins are not possible between a regular table and a distributed table (work-around: replicate the regular table to a single shard on every worker and push the join query down to the workers; the table must be defined as a ‘reference table’ and citus.shard_replication_factor set to the current number of worker nodes; see the sketch after this list)
- Outer joins must join on the same partition key, and the underlying tables must have the same number of shards
- Miscellaneous Issues:
- Update/Delete must target one shard
- Re-balancing shards feature is available only on Citus Enterprise
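To make a couple of the work-arounds above concrete, here is a hedged sketch using the same v5.0 API (table and column names are hypothetical):

    -- CTE work-around: materialize what would have been the CTE into a TEMP
    -- table on the master, then reference it from the main query.
    CREATE TEMP TABLE recent_loans AS
        SELECT customer_id, amount
        FROM loans
        WHERE issued_at > now() - interval '30 days';

    -- Join work-around: replicate a small regular table as a single shard on
    -- every worker so the join can be pushed down. The replication factor must
    -- equal the current number of workers (3 in our test cluster).
    SET citus.shard_replication_factor = 3;
    SELECT master_create_distributed_table('loan_statuses', 'status_id', 'hash');
    SELECT master_create_worker_shards('loan_statuses', 1, 3);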
Performance Evaluation
Setup: A test Citus cluster on AWS was set up by Grant Evans, with one master and three worker nodes, each running PostgreSQL 9.5 on Linux 3.13.0-92-generic x86_64 (Ubuntu 4.8.4) with 1 CPU and 1 GB RAM.
Case 1:
After reviewing some existing functionality, the first experiment involved a long-running query (runtime: ~9.75 min) on our existing PostgreSQL database. The large tables involved (19M, 2.6B, and 3.5M rows) were sharded appropriately in Citus so that any performance gain could be measured.
Result: Unfortunately, even after re-writing this query to satisfy Citus’s join restrictions, it timed out after 24 hours on Citus without finishing. It is possible, however, that improved hardware would address this.
Case 2:
The second experiment involved another long-running query (runtime: ~30 min) over tables of 11.3M and 121.6M rows, again to measure any performance gain.
Result: Unfortunately, this query does not run on Citus at all; its distributed query planner fails to execute it.
Summary
Although Citus theoretically solves the problems of large datasets by automatically distributing a database across physical servers, in both of our real-world cases there was no performance gain. Additionally, implementing the test cases revealed several underlying limitations of the technology. But that’s why we do these proofs-of-concept.
Conclusion
Citus is definitely a promising concept and could be worth revisiting once it matures, but owing to its current limitations and its performance on existing Enova-specific queries, it does not seem to be an ideal fit for us. There are some instances where it would clearly provide a performance gain; for example, for large-dataset archival, where attachments can be moved to AWS and accessed directly by the application when required, Citus would be substantially faster than traditional PostgreSQL. The Citus documentation contains similar examples where, for certain test cases, the performance gain is stark. Based on my research, where the technology stands today, its limited feature support makes moving entire Enova datasets to Citus impossible, and the optimization it does provide does not translate into measurable performance gains for us.