Performance deficits in Apache Cassandra

Recently I was conducting an evaluation of several different databases for a messaging workload. While benchmarking Apache Cassandra, I noticed unusual patterns in performance metrics. I followed these clues and eventually found some major thread-pool design questions and a potential 18x performance gain realizable on Windows.

I had been investigating a range of different databases as potential backends to re-engineer a messaging product. While I had some candidates in mind already, I wanted to be able to show a robust exploration of database options.

Why Cassandra?

While PostgreSQL was a strong contender among SQL databases and had given good results in early experiments, I was looking for the ideal NoSQL candidate.

Apache Cassandra was a little older than the latest NewSQL & key-value databases, but it seemed in many ways architecturally ideal:

  • Log-Structured Merge Tree, ideal for compacting transient work
  • durable storage to disk
  • partitioning for scalability
  • high write performance

The paradigm I was aiming for was to use the database almost entirely as a ‘write-only datastore’. Messaging has two major requirements: durable recording of work done (messages received & sent) to allow crash recovery, and recording a longer-term searchable log.
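
To make the 'write-only datastore' idea concrete, here is a minimal sketch of the kind of schema I had in mind, using the DataStax Java driver. The keyspace, table and column names (msg, inflight, message_log) are hypothetical, purely for illustration, and not the product's actual data model.

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class MessageStoreSchema {
    public static void main(String[] args) {
        // Connects to a local node using the driver's defaults (127.0.0.1:9042).
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS msg WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

            // In-flight work: a durable record of messages received but not yet fully
            // processed; rows are deleted on completion and cleared out by compaction.
            session.execute("CREATE TABLE IF NOT EXISTS msg.inflight ("
                    + " route text, msg_id timeuuid, state text, payload blob,"
                    + " PRIMARY KEY ((route), msg_id))");

            // Longer-term searchable log, partitioned by route and day.
            session.execute("CREATE TABLE IF NOT EXISTS msg.message_log ("
                    + " route text, day date, msg_id timeuuid, summary text,"
                    + " PRIMARY KEY ((route, day), msg_id))");
        }
    }
}
```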

Cassandra had already been identified by our engineers as a preferred database, and should theoretically have aligned very well with these requirements. With highly scalable partitioned write performance and an LSMT data structure able to automatically compact completed work out of the table, it seemed like a clear leader.

Benchmarking Cassandra

Key workloads I was assessing in the messaging system involved receiving messages, capturing properties, transforming the messages, and sending them on.

In many customer systems there would be one major route processing a large majority of messages in sequential order. The requirement to durably record receipt of these messages made this essentially a single-threaded use case, so this was a major benchmark. There were also multi-threaded benchmarks to characterize performance for customers running a more even distribution of work across multiple routes.
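
As a rough illustration of what the single-threaded benchmark exercises, here is a minimal sketch of a sequential write loop. It assumes the hypothetical msg.inflight table from the sketch above; the real benchmark was more involved, but the defining property is the same: each write must be acknowledged before the next begins.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.uuid.Uuids;
import java.nio.ByteBuffer;

public class SingleThreadWriteBench {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement insert = session.prepare(
                "INSERT INTO msg.inflight (route, msg_id, state, payload) VALUES (?, ?, ?, ?)");
            ByteBuffer payload = ByteBuffer.wrap(new byte[256]);
            int ops = 10_000;
            long start = System.nanoTime();
            for (int i = 0; i < ops; i++) {
                // One synchronous write per iteration: the next write cannot start
                // until the previous one has been durably acknowledged.
                session.execute(insert.bind("main-route", Uuids.timeBased(),
                        "RECEIVED", payload.duplicate()));
            }
            double minutes = (System.nanoTime() - start) / 60e9;
            System.out.printf("%.0f transactions/minute%n", ops / minutes);
        }
    }
}
```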

PostgreSQL had given excellent results across single- and multi-threaded benchmarks, but I had hopes that Cassandra, with its simpler model and extreme performance focus, would be able to deliver even more.

However, I was surprised & disappointed by the results.

Single-threaded benchmark:

  • PostgreSQL — 200k transactions/minute
  • Cassandra — 37k transactions/minute

In the multi-threaded benchmark, Cassandra performed far better:

  • Performance with 10 threads was 27 times better.
  • Latencies in the 10-thread case were much lower than the single-threaded case.

However, these results suggested some concerning anomalies in single-threaded write performance. Several concerns stood out:

  • The single-threaded Cassandra benchmark showed abysmally poor performance on the hardware, compared to what PostgreSQL had shown was physically possible.
  • Latencies were inverted from the pattern queueing theory predicts: in a multi-threaded system, more threads should give higher throughput at the cost of increased latency, yet Cassandra showed lower latency with more threads.
  • Throughput scaled beyond the number of threads: the fact that 10x the threads gave far more than 10x the throughput was also suspicious.

Taken together, these symptoms made me suspect an inefficiency in how Cassandra handles single-threaded workloads.

Tracing the Problem

To an engineer unfamiliar with the codebase, understanding Cassandra’s internal execution posed significant complexity. It is highly concurrent, with database operations processed across multiple threads via a number of worker pools.

Given this complexity, instrumenting & tracing the request processing seemed the only plausible route to understanding the problem. So I fetched the source, built Cassandra & started adding custom logging to instrument the problem.

My logging focused on recording the start & end times of the overall request and of the component tasks handed off to worker pools. The aim was to be able to track, at microsecond resolution, when tasks were actually executing, in order to look for delays.
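
The instrumentation itself was conceptually simple. As a sketch of the approach (not the actual logging patch), a task can be wrapped so that its enqueue, start and end times are captured with System.nanoTime() and reported at microsecond resolution:

```java
import java.util.concurrent.TimeUnit;

// Wraps any Runnable submitted to a worker pool, logging queue delay vs. execution time.
final class TimedTask implements Runnable {
    private final Runnable delegate;
    private final String name;
    private final long enqueuedNanos = System.nanoTime();

    TimedTask(String name, Runnable delegate) {
        this.name = name;
        this.delegate = delegate;
    }

    @Override
    public void run() {
        long startNanos = System.nanoTime();
        try {
            delegate.run();
        } finally {
            long endNanos = System.nanoTime();
            System.out.printf("%s queued=%dus exec=%dus%n", name,
                    TimeUnit.NANOSECONDS.toMicros(startNanos - enqueuedNanos),
                    TimeUnit.NANOSECONDS.toMicros(endNanos - startNanos));
        }
    }
}
```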

My initial findings:

  • Tracing showed an average delay of 1.52 ms between StorageProxy.performLocally() being called, and the LocalMutationRunnable actually executing.
  • Total operation time averaged 2.06 ms (measured at Message.Dispatcher.processRequest()). This suggested ~72% of the total operation time was being lost waiting for thread scheduling in SEPExecutor.

Given the delays I found, the SEPExecutor thread pool became a focus of investigation. This is a Cassandra-specific custom thread pool with significant internal complexity.

I tried a number of clumsy interventions in SEPExecutor, none of which were fully successful:

  • I tried hacking SEPExecutor.takeWorkPermit() to make tasks execute immediately on the same thread (see the caller-runs sketch after this list).
  • I tried hacking SEPExecutor/SEPWorker so that one worker stayed awake at all times.
  • I started asking questions about why SEPWorker.assign() doesn’t unpark threads when transitioning them from the SPINNING state to a working state.
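
The first intervention is conceptually similar to java.util.concurrent's CallerRunsPolicy, where a task that cannot be handed to a free worker simply executes on the submitting thread. A standalone illustration of that idea (standard JDK classes, not Cassandra code):

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsDemo {
    public static void main(String[] args) {
        // One worker, no queue capacity: when the worker is busy, CallerRunsPolicy
        // executes the task immediately on the submitting thread instead of queueing it.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 3; i++) {
            final int n = i;
            pool.execute(() -> {
                System.out.println("task " + n + " ran on " + Thread.currentThread().getName());
                try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
    }
}
```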

Having documented my investigation & results so far, I raised an issue on the Cassandra JIRA bug-tracker. Initially it was met with some fair questions and, to be honest, a certain amount of skepticism. (This is very understandable for any project, given the limited resources projects have and the number of stupid questions & false positives they likely receive.)

Finding the Cause

I continued investigating, digging deeper into SEPExecutor.

In the meantime I was finding & discussing further clues on the JIRA issue. One of the Cassandra leads had joined the issue and was discussing typical environments & use cases. I was open to hearing an insight or explanation, but I strongly believed there was likely something to be found here.

And then I found it.

A detailed trace of task scheduling behavior and the resulting scheduling delays showed that delays occurred when a worker was already available but parked in the ‘SPINNING’ state.

In this case, Cassandra’s SEPExecutor thread pool neither started an extra worker, nor did it wake (unpark) the sleeping one.

Under single-threaded conditions, workers would go to sleep immediately after each request, and then take up to 1.5 milliseconds to wake up for the next one!
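
To make the failure mode concrete, here is a small standalone model of the behavior (a toy, not Cassandra's actual SEPExecutor/SEPWorker code): a worker that parks itself when it finds no work only notices a newly queued task when its park timeout expires, unless the submitter explicitly unparks it.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

public class SpinningWorkerDemo {
    static final ConcurrentLinkedQueue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    static volatile boolean unparkOnSubmit = false;   // flip to true to compare behaviors
    static volatile Thread worker;

    public static void main(String[] args) throws InterruptedException {
        worker = new Thread(SpinningWorkerDemo::workerLoop, "worker");
        worker.setDaemon(true);
        worker.start();

        Thread.sleep(100);                            // let the worker park itself
        long submitted = System.nanoTime();
        tasks.add(() -> System.out.printf("task ran %.2f ms after submit%n",
                (System.nanoTime() - submitted) / 1e6));
        if (unparkOnSubmit) {
            LockSupport.unpark(worker);               // the "fix": wake the parked worker
        }
        Thread.sleep(200);
    }

    static void workerLoop() {
        while (true) {
            Runnable task = tasks.poll();
            if (task != null) {
                task.run();
            } else {
                // The "spinning" worker parks when it finds no work; without an
                // explicit unpark it only notices new work when the timeout expires.
                LockSupport.parkNanos(1_500_000);     // ~1.5 ms, mirroring the observed delay
            }
        }
    }
}
```

With unparkOnSubmit left false, the task waits up to ~1.5 ms for the park timeout; with it set to true, the task runs almost immediately.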

I quickly produced an experimental patch to improve this scheduling behavior. Results on Windows:

Version/fix                          | 1-thread op/s | 1-thread mean latency | 10-thread op/s | 10-thread mean latency
3.11.10 baseline                     | 362 op/s      | 2.6 ms                | 5,696 op/s     | 1.7 ms
With MaybeStartSpinning unpark fix   | 6,809 op/s    | 0.1 ms                | 24,639 op/s    | 0.4 ms
Improvement                          | 18.8x faster  |                       | 4.3x faster    |

An 18 times performance increase — unbelievable!

While this seemed like strong evidence to me, the Cassandra community noted that the thread-pool is designed for Linux and that (as of the new version 4) they had dropped support for Windows as a platform.

So, I went and got an EC2 instance and undertook some Linux testing. I was able to find a +30.9% performance improvement on Linux in the single-thread case, with improvements varying from small to marginal over the 10-, 50- and 200-thread cases.

Further Questions

At this point I felt that I’d reported a clear bug (CASSANDRA-16499), provided a patch, and shown good evidence of performance improvement and non-regression.

However, architectural questions were raised: that this was the intended design of the executor, in which worker threads self-organize with limited interaction between producers and consumers, and that adding a proactive wakeup behavior would undermine that design.

Rather than accepting the patch, the Cassandra lead asked for a comprehensive architectural reassessment of possible thread-pooling options.

While this was a valuable exploration, my benchmarking & assessment had also identified other drawbacks of Cassandra:

  • apparent performance deficits in Batch Operations
  • possible inefficiency in parallel commit/fsync: PostgreSQL is able to commit multiple waiting transactions in a single fsync(), and Cassandra may have limitations here (see the group-commit sketch after this list)
  • narrow platform support
  • Cassandra still apparently being at the “bleeding” part of bleeding edge
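
For context on the commit/fsync point above: PostgreSQL's group commit lets many concurrently waiting transactions be made durable by a single fsync(). The sketch below is a toy illustration of that general technique, not PostgreSQL's or Cassandra's actual commit-log code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Toy group-commit log: many writers, one fsync per batch. */
public class GroupCommitLog {
    private final FileChannel channel;
    private final List<Pending> pending = new ArrayList<>();

    private record Pending(ByteBuffer data, CompletableFuture<Void> done) {}

    public GroupCommitLog(Path path) throws IOException {
        channel = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        Thread flusher = new Thread(this::flushLoop, "flusher");
        flusher.setDaemon(true);
        flusher.start();
    }

    /** Called by many writer threads; the future completes when the record is durable. */
    public CompletableFuture<Void> append(ByteBuffer record) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        synchronized (pending) {
            pending.add(new Pending(record, done));
            pending.notify();
        }
        return done;
    }

    private void flushLoop() {
        try {
            while (true) {
                List<Pending> batch;
                synchronized (pending) {
                    while (pending.isEmpty()) pending.wait();
                    batch = new ArrayList<>(pending);
                    pending.clear();
                }
                for (Pending p : batch) channel.write(p.data());
                channel.force(false);                 // one fsync covers the whole batch
                batch.forEach(p -> p.done().complete(null));
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Each transaction thread calls append() and waits on the returned future; every record written between two flushes shares a single fsync(), so the durability cost is amortized across concurrent writers.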

We discussed our database options, and the potential of investing further in Cassandra to find & fix these other likely deficits. It’s a great platform, and we were interested, but it was hard for us to make a business case.

So thanks for reading, and I hope you found this interesting! Should anyone wish to continue this work, I’d be very happy to discuss.
