Ceph Performance Fine Tuning

What would you think if I told you that every little change can make a big difference? This is not the usual catchphrase: it is, quite concretely, how we increased the performance of our Ceph storage.

According to many, including our own Cloud Architects, optimising Ceph can be a difficult challenge. This article does not intend to disprove that assumption, quite the contrary: between Ceph, RocksDB, and the Linux kernel, there are thousands of options that can be modified to improve the efficiency and performance of a Ceph storage cluster.

We have studied, worked and tested a great deal to bring our Ceph storage to a satisfactory level of performance. Let's see together how we did it.

Prerequisites

For those already familiar with Ceph, this will sound trivial, but it is only fair to give a brief introduction.

Ceph is a fully open-source Software Defined Storage built for large scale (e.g., cloud providers) and focused on data resilience. The data it stores can be exposed in three modes: block, file and object.

Data exposure modes of a Software Defined Storage

When Ceph is installed, unless it is modified, it adopts standard configurations and features that are by default suited to generic workloads. Such a baseline results in performance that is equally generic: not excellent, and not optimised for any specific workload.
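
Before touching anything, it helps to know exactly which defaults the cluster is actually running. Below is a minimal sketch, assuming a working "ceph" CLI with admin access; the option names are real Ceph options, but the selection shown is only an example.

    # Sketch: read the effective OSD defaults before changing anything.
    # Assumes the "ceph" CLI is installed and an admin keyring is available.
    import subprocess

    OPTIONS = [
        "osd_memory_target",             # RAM budget per OSD daemon
        "bluestore_rocksdb_options",     # RocksDB option string used by BlueStore
        "bluestore_min_alloc_size_ssd",  # smallest allocation unit on SSD
    ]

    for opt in OPTIONS:
        value = subprocess.run(
            ["ceph", "config", "get", "osd", opt],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(f"{opt} = {value}")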

To achieve better performance, changes need to be made. What we focused on is how RADOS saves its objects to disk.

A leap into the past and one into the present

Over the past decade, Ceph has relied on two different implementations of its low-level object store. When we talk about low-level object storage, we are not talking about the well-known object storage protocol that Ceph exposes (S3). We are talking about how the RADOS layer handles all the input/output of objects that are physically read from and written to the disks.

The first object store implementation, now deprecated, was FileStore. This version of Ceph used an XFS filesystem to store object data and metadata on disk. Specifically, FileStore split each object into two parts:

  • Raw data - the raw, unprocessed data;
  • Metadata - the information accompanying the data, used to identify and look it up.
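
To make the split concrete, here is a purely illustrative model, not Ceph's actual data structures: every object carries a payload plus the metadata needed to describe and find it.

    # Illustrative only: an object seen as raw payload plus metadata.
    from dataclasses import dataclass, field

    @dataclass
    class StoredObject:
        name: str                                     # object identifier
        payload: bytes                                # the raw, unprocessed data
        metadata: dict = field(default_factory=dict)  # info used to identify/look it up

    obj = StoredObject(
        name="rbd_data.12ab34.0000000000000001",      # hypothetical object name
        payload=b"\x00" * 4096,
        metadata={"pool": "vms", "size": 4096, "version": 7},
    )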

How Ceph FileStore works

Let's go back to FileStore. While the raw part was written directly to XFS, the metadata was saved in a database, LevelDB, a Google project. In turn, LevelDB saved its data to disk through the XFS filesystem which, being a journaled filesystem itself, wrote additional metadata. The whole process therefore involved three writes: one for the raw data, one for LevelDB and one for XFS, each of which introduced more latency.

This triple write was quite problematic in terms of latency, both on older hard disks and on modern SSDs. To mitigate the problem, we adopted the best practice of separating, at the FileStore level, the raw data writes (on HDD or SSD) from the metadata writes, which typically went to NVMe.

Ceph, however, revolutionises this process with a second implementation: BlueStore.

How Ceph BlueStore works

This new design avoids the triple write by handling data and metadata in a different way.

The raw data is in fact written directly to the disk, without going through the XFS filesystem. This alone improves latencies and brings the writes closer to the disk.

Metadata, instead, moves to a different database: RocksDB, a Facebook project better suited to the way Ceph uses metadata. To make RocksDB's reads and writes to disk as efficient as possible, a dedicated ad hoc filesystem, BlueFS, was additionally developed.

Thanks to BlueStore, data write and read latencies within storage are generally reduced.
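
In practice, this layout is what you set up when creating an OSD: the raw data stays on the main drive, while RocksDB and its write-ahead log live on faster flash through BlueFS. A minimal sketch follows; the device paths are hypothetical and this is not our exact deployment procedure.

    # Sketch: create a BlueStore OSD with RocksDB/WAL on a separate NVMe device.
    # Device paths are hypothetical; adapt them to the actual host.
    import subprocess

    DATA_DEV = "/dev/sdb"        # HDD/SSD holding the raw object data
    DB_DEV = "/dev/nvme0n1p1"    # NVMe partition for RocksDB + WAL (via BlueFS)

    subprocess.run(
        ["ceph-volume", "lvm", "create",
         "--bluestore",
         "--data", DATA_DEV,
         "--block.db", DB_DEV],
        check=True,
    )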

From theory to practice

Having appreciated its increased performance, we have now switched entirely to BlueStore. That said, we did not simply go along with the change and adopt the technology proposed by Ceph; we did more than that. This is where CloudFire decided to make a difference: more is needed to achieve sufficient performance for enterprise-level workloads.

Analysis

Analyzing real workloads and running benchmarks on an out-of-the-box cluster, we observed a significant gap between raw disk performance and that of the Ceph cluster; in particular, the metadata overhead in RocksDB has a huge impact on performance. This is especially true for small random writes, where the raw part of an object can be almost as small as its metadata.

All of this leads to a problem: while for a large object the latency of retrieving its metadata is negligible, for an object roughly as small as its own metadata that latency becomes considerable.

On virtual machines hosting databases, which perform very small block reads and writes, this latency is concretely felt.

Resolution

Workload analysis uncovered a major limitation of software defined storage technology: in exchange for greater data resilience and reliability, the performance delivered is only a fraction of the physical limit of the storage nodes. The challenge for our Cloud Engineers was to question every architectural choice, in both hardware and software, with the goal of meeting our customers' expectations.

Hardware architecture review

The first step was to review the hardware architecture (CPU, RAM, disks, and network) to check that no component had been undersized with respect to Ceph's recommendations. Sticking to the best practices of open-source projects is always a must. However, the rapid evolution of the software can sometimes change the rules of the game.

✅ The design we implemented has proven to be more than sufficient for the workload. Each release of Ceph brings better performance whilst maintaining the same hardware architecture.

Ceph setting review

With the hardware architecture confirmed as suitable, the next step was to work on Ceph's settings and understand the logical bottleneck behind the latencies.

The "Community First" approach led us to share the issue with all members, and we found that the same question was particularly common among Ceph users. However, as often happens with this kind of issue, the answer is always the same: "It depends."

And it really did!

The key to solving the issue was fine-tuning Ceph's settings. Modifying RocksDB settings resulted in a dramatic range of performance, from poor to exceptional.

Indeed, considering how the algorithm described earlier works, one can think of Ceph as a container of objects whose metadata is stored in a database. Writing this metadata is the algorithm's priority and must therefore happen as quickly as possible: slowing it down makes the queue of subsequent operations grow very quickly, and latencies grow with it.

✅ Properly sizing RocksDB against the hardware architecture (CPU, RAM, and NVMe) finally allowed us to reach the maximum performance of the Ceph storage, further reducing latency while maintaining the same resilience of a triple data copy across three different data centers.
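
To give an idea of what such a change looks like in practice (without disclosing our exact values), BlueStore passes a single option string straight to RocksDB. The option names below are standard RocksDB/Ceph ones, but the numbers are purely illustrative and must be sized against your own CPU, RAM and NVMe; OSDs need to be restarted to pick the change up.

    # Sketch only: the numbers are illustrative, not our production values.
    import subprocess

    rocksdb_opts = ",".join([
        "compression=kNoCompression",          # spend CPU on I/O, not compression
        "max_write_buffer_number=8",           # more memtables before write stalls
        "min_write_buffer_number_to_merge=2",  # merge memtables before flushing
        "write_buffer_size=268435456",         # 256 MiB per memtable
        "max_background_compactions=4",        # parallel compaction threads
    ])

    subprocess.run(
        ["ceph", "config", "set", "osd", "bluestore_rocksdb_options", rocksdb_opts],
        check=True,
    )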

The results

In order to analyse the results and make a comparison, a benchmark is needed. The 4-kilobyte random write test is generally used to measure the performance of a storage system: as one of the smallest block sizes typically written, it is also the most sensitive to latency.

Taking a 4 KiB write as a reference, we measure how many input/output operations complete in one second and express this value in IOPS. In other words, if our storage can write 1,000 blocks of 4 kilobytes each to disk in one second, we can safely say that it achieves 1,000 IOPS.
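
One simple way to collect this figure (a sketch with a hypothetical pool name) is Ceph's built-in rados bench; fio with its rbd engine is the other common choice for random-I/O tests against an actual RBD image.

    # Sketch: 60-second 4 KiB write test against a dedicated benchmark pool.
    # The pool name is hypothetical; do not benchmark against production pools.
    import subprocess

    subprocess.run(
        ["rados", "bench",
         "-p", "bench-test",   # hypothetical pool used only for benchmarking
         "60", "write",        # write test, 60 seconds
         "-b", "4096",         # 4 KiB block size
         "-t", "16"],          # 16 concurrent operations in flight
        check=True,
    )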

We started by measuring the performance of the individual disks. Multiplied by the number of disks in our storage, this figure gives an idea of the maximum theoretical raw performance achievable. For example, with 100 disks of 1,000 IOPS each, the theoretical maximum is 100,000 IOPS (without taking replicas into account).
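
The same back-of-the-envelope arithmetic, with the three-way replication mentioned earlier folded in (every client write becomes three disk writes, ignoring all other overhead):

    # Back-of-the-envelope ceiling for the example above.
    disks = 100
    iops_per_disk = 1_000
    replica_count = 3            # triple copy, as described earlier

    raw_ceiling = disks * iops_per_disk           # 100,000 IOPS, no replicas
    with_replicas = raw_ceiling // replica_count  # ~33,333 client-visible IOPS

    print(f"theoretical raw ceiling : {raw_ceiling:,} IOPS")
    print(f"ceiling with 3x replicas: {with_replicas:,} IOPS")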

After that, we measured the performance of the storage installed with default values: it reached roughly 30% of the estimated theoretical performance, which was obviously not satisfactory to us. Through the changes described above, we improved the performance of our Ceph storage and achieved results close to the theoretical limit.

Benchmark results - 4K write, 3x replica: raw SSD drive vs. Ceph defaults vs. CloudFire tuning

It is an accomplishment that took a great deal of hard work and study, and it makes us proud of what we are building to offer an increasingly high-performing, high-quality service.

I'd like to conclude by thanking you for reading this article to the end, and by inviting you to keep following us as we continue to improve the performance of our infrastructure.
