A little while ago, we announced that we would be looking at optimizations for datanommer/datagrepper.

The main issue is that it has grown over the course of its life (it has been running for 9 years now!) and the 180 million messages are all stored in a single table, making some of the application's queries slow. Queries can even be so slow in some situations that the web server ends up raising a 504 Gateway Time-out error.

So we looked at a few ways to improve performance.

Default delta

We found out that most of the queries that time out are requests that do not include a delta in their arguments. This is because, if no delta is specified, datagrepper counts all the messages that fit the given criteria since the beginning of the database, and in some cases that can mean querying all 180 million messages.

It turns out that datagrepper has a configuration option to specify a default delta if none is provided by the user. This alone helps datagrepper quite a bit and avoids running into time-out errors while just browsing its UI.
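To make the difference concrete, here is roughly the shape of the underlying queries (a sketch; datagrepper builds its SQL through SQLAlchemy, and the column names follow datanommer's messages table):

```sql
-- Without a delta, counting the matching messages scans the whole table:
SELECT count(*) FROM messages WHERE category = 'git';

-- With a 3-day default delta, the same count is bounded in time:
SELECT count(*) FROM messages
WHERE category = 'git'
  AND "timestamp" >= now() - interval '3 days';
```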



Once we figured out how to improve the UI, we started looking at how to improve the database side. Our first try was to manually partition the messages table, which contains all the messages.

Manually partitioning the database

Our first attempt was to partition the messages table by year, with one partition per year. However, the way PostgreSQL partitioning works, uniqueness can only be enforced within each partition, so every unique constraint (including the primary key) must contain the partitioning field; as a consequence, that field must also be part of all foreign keys referencing the table.

We stopped our first attempt there as we didn't want to adjust all the foreign key constraints.
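For the curious, here is a minimal sketch of what this first attempt runs into (declarative partitioning as in PostgreSQL 11+; columns abridged):

```sql
-- Partitioning the messages table by year, on the timestamp column:
CREATE TABLE messages (
    id          integer     NOT NULL,
    msg_id      text,
    topic       text,
    "timestamp" timestamptz NOT NULL,
    msg         text,
    PRIMARY KEY (id)   -- rejected!
) PARTITION BY RANGE ("timestamp");
-- PostgreSQL refuses the primary key above: every unique constraint on a
-- partitioned table must contain the partition column, so it would have
-- to become PRIMARY KEY (id, "timestamp"), and every foreign key
-- referencing messages would have to be adjusted to match.
```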

Our second attempt was to partition the table by id, which is the primary key of the messages table; this solves the foreign key question since those constraints already rely on that field. We partitioned the table in chunks of 10 million records, so we created 19 partitions and loaded the data into them.
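In SQL, the second attempt looks roughly like this (partition names and bounds are illustrative):

```sql
CREATE TABLE messages (
    id          integer     NOT NULL,
    msg_id      text,
    topic       text,
    "timestamp" timestamptz NOT NULL,
    msg         text,
    PRIMARY KEY (id)   -- allowed: id is the partition key
) PARTITION BY RANGE (id);

-- One partition per chunk of 10 million records, 19 in total:
CREATE TABLE messages_part_01 PARTITION OF messages
    FOR VALUES FROM (1) TO (10000001);
CREATE TABLE messages_part_02 PARTITION OF messages
    FOR VALUES FROM (10000001) TO (20000001);
-- ... and so on, then load the existing data, e.g.:
-- INSERT INTO messages SELECT * FROM old_messages;
```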



With this partitioning and the default delta, we started seeing some good results, but we also wanted to test the timescaledb postgresql extension.

The timescaledb postgresql extension

That extension is designed to improve the performance of databases that store time-related data. Datagrepper's main use case is storing messages and retrieving them based on time information (and potentially other criteria), so timescaledb sounds like a good candidate for our use case.

Timescaledb gets set up on timestamp fields, so we set it up on the timestamp field of the messages table. Once it was set up and the data imported, we realized that timescaledb also does table partitioning, meaning the foreign key issue we ran into earlier with the per-year partitioning appeared again (this time on the timestamp field). This time, however, we decided to adjust the tables linked to the messages table to include the timestamp field in their foreign key constraints.
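The setup itself is pleasantly small: the existing table is turned into a hypertable (a sketch, assuming the timescaledb extension is available on the server):

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Turn the existing table into a hypertable, chunked on the timestamp
-- column; migrate_data is needed because the table already holds rows.
-- As with manual partitioning, unique constraints must include that column.
SELECT create_hypertable('messages', 'timestamp', migrate_data => true);
```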

This led to some good gains in performance, with one little issue: we found some duplicated messages in the database. They have the same msg_id but different timestamps. We consider this to be an artifact of using fedmsg, and we expect that moving to fedora-messaging will solve this issue, as using RabbitMQ ensures that messages are only processed by one consumer at a time. It could also be caused by the bridge between fedora-messaging and fedmsg, in which case datanommer may have to be adjusted to check whether a msg_id already exists in the database before inserting a new message.
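Spotting these duplicates comes down to a grouping query of roughly this shape:

```sql
-- msg_ids recorded more than once, along with the timestamps they carry:
SELECT msg_id, count(*) AS copies, array_agg("timestamp") AS seen_at
FROM messages
GROUP BY msg_id
HAVING count(*) > 1;
```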

timescaledb without external tables

We thought that simplifying the database model might be a way to optimize some of the queries further. So we changed the current database schema: datanommer_db.jpeg

We moved the user and package information into the messages table itself, using postgresql arrays, in the hope that combining this with a Generalized Inverted Index (GIN) would buy us some performance.
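A sketch of that denormalized layout (column and index names are illustrative):

```sql
-- Users and packages move from their own tables into array columns:
ALTER TABLE messages ADD COLUMN users    text[];
ALTER TABLE messages ADD COLUMN packages text[];

-- GIN indexes make array-containment filters cheap:
CREATE INDEX messages_users_gin    ON messages USING GIN (users);
CREATE INDEX messages_packages_gin ON messages USING GIN (packages);
```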

However, when testing the queries in postgresql directly, we saw that as soon as a query involved ordering by timestamp as well as filtering on other criteria (such as a package's name or a user's name), performance dropped. We therefore never adjusted datanommer and datagrepper to work with this setup (which is why you will not find it in the results section below).
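The query shape that defeated this approach looks roughly like this: the GIN index can answer the containment filter, but it cannot return rows in timestamp order, so postgresql still has to fetch and sort every matching row before applying the LIMIT.

```sql
SELECT * FROM messages
WHERE packages @> ARRAY['kernel']   -- served by the GIN index
ORDER BY "timestamp" DESC           -- but GIN provides no ordering...
LIMIT 25;                           -- ...so every match is sorted first
```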



We have, of course, tried to measure our different experiments, so let's see how they look.

Results

Environments

We used four different environments for our tests:

  • prod/openshift

This is an OpenShift deployment of datagrepper which hits the production postgresql database and is configured just like the actual (VM-based) production instance.

  • prod/aws

Since all our experiments were done in AWS, we needed a production-like instance there; otherwise we would be comparing the production instance, which carries all the production traffic and load, to single instances on AWS that have none. So this instance is just like the production instance with one difference: it has a default delta specified in its configuration (with a value of 3 days, so if no delta is specified, it returns 3 days' worth of messages).

  • partition/aws

This is the AWS instance running datanommer and datagrepper with the messages table partitioned by id. Datagrepper is also configured with a 3-day default delta.

  • timescaledb/aws

This is the AWS instance running datanommer and datagrepper with the messages table managed (and thus partitioned) by timescaledb. Datagrepper is also configured with a 3-day default delta.

Requests

Our test script launches 10 threads, each of them making 30 requests (so 300 requests in total), and we ran it against six different requests:

  • filter_by_topic: /raw?topic=org.fedoraproject.prod.copr.chroot.start
  • plain_raw: /raw
  • filter_by_category: /raw?category=git
  • filter_by_username: /raw?user=pingou
  • filter_by_package: /raw?package=kernel
  • get_by_id: /id?id=2019-cc9e2d43-6b17-4125-a460-9257b0e52d84

Graphs

datanommer_percent_sucess.jpg

As you can see, timescaledb is the only environment in which all requests returned successfully! It is also important to keep in mind the poor results of prod/openshift when looking at the other graphs. The partition/aws environment performed the worst on the get_by_id request. This can be explained by postgresql having to scan all the partitions in parallel to find out which one contains that specific msg_id. Timescaledb performed better there, potentially thanks to optimizations timescaledb has that we missed when doing the partitioning manually.

datanommer_req_per_sec.jpg

timescaledb pretty much outperforms all the other environments for all queries here.

datanommer_mean_per_req.jpg

Most of the prod/openshift results are missing here, and once again timescaledb has the lowest mean request time for each request.

datanommer_max_per_req.jpg

Here again, timescaledb outperforms all other environments for each request.

Conclusions

Seeing these graphs, you can probably already guess what our recommendations are:

  • Set a default delta in datagrepper's configuration file. Even though this will break the API, not specifying a delta today results in 504 errors more often than not.
  • Port the database to timescaledb. We could probably replicate some of the gains by partitioning manually on the timestamp field, but timescaledb takes care of all of this for us.



