datanommer/datagrepper investigations
By Pierre-Yves on Thursday, February 25 2021, 09:31 - Général - Permalink
A few members of the CPE team have investigated how to improve the performance of datanommer/datagrepper.
A little while ago, we announced that we would be looking at optimizations for datanommer/datagrepper.
The main issue is that the application has grown over the course of its life (it has been running for 9 years now!) and its 180 million messages are all stored in a single table, making some of its queries slow. In some situations, queries can even be so slow that the web server ends up raising a 504 Gateway Timeout error.
So we looked at a few ways to improve performances.
Default delta
We found out that most of the queries that time out are requests that do not include a delta in their arguments. This is because, when no delta is specified, datagrepper counts all the messages that fit the given criteria since the beginning of the database, which in some cases means querying all 180 million messages.
It turns out that datagrepper has a configuration option to specify a default delta if none is provided by the user. This alone helps datagrepper quite a bit and makes it possible to browse its UI without running into timeout errors.
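For illustration, here is a minimal sketch of what this could look like in datagrepper's Flask-style configuration file. The option name below is an assumption made for this sketch, so check datagrepper's documentation for the exact key:

```python
# /etc/datagrepper/datagrepper.cfg -- datagrepper reads a Flask-style
# Python configuration file.
# NOTE: the option name is an assumption for this sketch; check
# datagrepper's documentation for the exact key.
DEFAULT_QUERY_DELTA = 3 * 24 * 60 * 60  # default to 3 days worth of messages, in seconds
```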
Once we figured out how to improve the UI, we started looking at how to improve the database side. Our first try was to manually partition the messages table, which contains all the messages.
Manually partitioning the database
Our first attempt was to partition the messages by year, with one partition per year. However, the way partitioning works in postgresql, the partition key must be part of every unique constraint on the table (including the primary key); as a consequence, any foreign key referencing the table must include that field as well.
We stopped our first attempt there, as we didn't want to adjust all the foreign key constraints.
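For reference, a minimal sketch (with a simplified schema) of the constraint that stopped us: on a partitioned table, the primary key must include the partition key, which in turn forces every referencing foreign key to change:

```python
import psycopg2

# Placeholder connection string; adjust for your setup.
conn = psycopg2.connect("dbname=datanommer")
cur = conn.cursor()

# On a partitioned table, every unique constraint -- including the primary
# key -- must contain the partition key. PRIMARY KEY (id) alone would be
# rejected here, so the key becomes (id, "timestamp"), and every foreign
# key referencing this table would have to carry both columns.
cur.execute("""
    CREATE TABLE messages_by_year (
        id integer NOT NULL,
        "timestamp" timestamptz NOT NULL,
        msg text,
        PRIMARY KEY (id, "timestamp")
    ) PARTITION BY RANGE ("timestamp");
""")
conn.commit()
```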
Our second attempt was to partition the database by id, which is the primary key of the messages table. This solves the foreign key question, since those constraints already rely on that field.
We partitioned the table in chunks of 10 million records, so we created 19 partitions and loaded the data into them.
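As a sketch of the approach (table names are ours for illustration, and swapping the new table in for the old one is left out):

```python
import psycopg2

# Placeholder connection string; adjust for your setup.
conn = psycopg2.connect("dbname=datanommer")
cur = conn.cursor()

# Partitioning on id keeps the primary key as-is, so the existing foreign
# keys pointing at messages.id keep working.
cur.execute("""
    CREATE TABLE messages_partitioned (
        LIKE messages INCLUDING DEFAULTS,
        PRIMARY KEY (id)
    ) PARTITION BY RANGE (id);
""")

CHUNK = 10_000_000
for i in range(19):  # 19 chunks of 10 million rows cover the ~180 million messages
    cur.execute(
        f"CREATE TABLE messages_part_{i:02d} PARTITION OF messages_partitioned "
        f"FOR VALUES FROM ({i * CHUNK}) TO ({(i + 1) * CHUNK});"
    )

# Copy the existing data over (swapping the tables afterwards is not shown).
cur.execute("INSERT INTO messages_partitioned SELECT * FROM messages;")
conn.commit()
```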
With this partitioning and the default delta we started seeing some good results, but we also wanted to test the timescaledb postgresql plugin.
The timescaledb postgresql plugin
That plugin is designed to improve the performance of databases that store time-related data. Datagrepper's main use case is to store messages and retrieve them based on time information (and potentially other criteria), so timescaledb sounds like a good candidate for our use case.
Timescaledb is set up on timestamp fields, so we set it up on the timestamp field of the messages table. Once it was set up and the data imported, we realized that timescaledb also does table partitioning, meaning the issue we had earlier with the foreign key constraints and the year appeared again (but this time on the timestamp field).
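Setting this up essentially boils down to one call to timescaledb's create_hypertable() function. A sketch, assuming the extension is available on the server:

```python
import psycopg2

# Placeholder connection string; adjust for your setup.
conn = psycopg2.connect("dbname=datanommer")
conn.autocommit = True
cur = conn.cursor()

# The timescaledb extension must be installed on the server.
cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")

# Turn the messages table into a hypertable chunked on "timestamp".
# migrate_data => true moves the rows already in the table into chunks.
# This is also where the partitioning constraint resurfaces: unique
# constraints (the primary key included) must contain "timestamp".
cur.execute(
    "SELECT create_hypertable('messages', 'timestamp', migrate_data => true);"
)
```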
This time, however, we decided to adjust the tables linked to the messages table to include the timestamp field in their foreign key constraints.
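A sketch of the shape of that change, assuming datanommer links users to messages through an association table (called user_messages here, with a msg column referencing messages.id; both names are assumptions):

```python
import psycopg2

# Placeholder connection string; adjust for your setup.
conn = psycopg2.connect("dbname=datanommer")
cur = conn.cursor()

# "user_messages" and its "msg" column are assumptions about datanommer's
# association table between users and messages. The table gains a copy of
# the message timestamp so the foreign key can include the hypertable's
# partition column. Note that support for foreign keys referencing a
# hypertable depends on the timescaledb version.
cur.execute("ALTER TABLE user_messages ADD COLUMN msg_timestamp timestamptz;")
cur.execute("""
    ALTER TABLE user_messages
        ADD CONSTRAINT user_messages_msg_fkey
        FOREIGN KEY (msg, msg_timestamp)
        REFERENCES messages (id, "timestamp");
""")
conn.commit()
```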
This led to some good performance gains, with one little issue: we found some duplicated messages in the database. They have the same msg_id but different timestamps. We consider this an artifact of using fedmsg and expect that moving to fedora-messaging will solve it, as using rabbitmq ensures that a message is only processed by one consumer at a time. It could also be caused by the bridge between fedora-messaging and fedmsg, in which case datanommer may have to be adjusted to check whether a msg_id already exists in the database before inserting a new message.
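A sketch of what such a datanommer-side guard could look like (the function and query are illustrative, not datanommer's actual code):

```python
import psycopg2

def message_exists(cur, msg_id):
    """Return True if a message with this msg_id is already in the database."""
    cur.execute("SELECT 1 FROM messages WHERE msg_id = %s LIMIT 1", (msg_id,))
    return cur.fetchone() is not None

# Placeholder connection string; adjust for your setup.
conn = psycopg2.connect("dbname=datanommer")
cur = conn.cursor()

# A plain unique index on msg_id alone is not possible on the hypertable
# (it would have to include "timestamp"), hence an explicit check rather
# than an INSERT ... ON CONFLICT.
if not message_exists(cur, "2019-cc9e2d43-6b17-4125-a460-9257b0e52d84"):
    pass  # insert the message as datanommer normally would
```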
timescaledb without external tables
We thought that simplifying the database model might be a way to optimize some of the queries further, so we changed the current database schema:
We moved the user and package information into the messages table, using postgresql arrays, in the hope that combining this with a Generalized Inverted Index (GIN) would give us some optimization.
However, when testing the queries in postgresql directly, we saw that as soon as a query involved ordering by timestamp as well as filtering on other criteria (such as a package's name or a user's name), performance dropped. We therefore never adjusted datanommer and datagrepper to work with this setup (which is why you will not find it in the results section below).
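To make the experiment concrete, here is a sketch (with illustrative column names) of the array columns, the GIN indexes, and the kind of query where we saw the performance drop:

```python
import psycopg2

# Placeholder connection string; adjust for your setup.
conn = psycopg2.connect("dbname=datanommer")
cur = conn.cursor()

# Fold the association tables into array columns (column names illustrative).
cur.execute("ALTER TABLE messages ADD COLUMN users text[], ADD COLUMN packages text[];")
cur.execute("CREATE INDEX messages_users_gin ON messages USING GIN (users);")
cur.execute("CREATE INDEX messages_packages_gin ON messages USING GIN (packages);")
conn.commit()

# A containment filter on its own can use the GIN index...
cur.execute("SELECT count(*) FROM messages WHERE users @> ARRAY['pingou'];")

# ...but this is the combination where we saw performance drop: the GIN
# index cannot return rows in timestamp order, so filtering and sorting
# work against each other.
cur.execute("""
    SELECT * FROM messages
    WHERE users @> ARRAY['pingou']
    ORDER BY "timestamp" DESC
    LIMIT 25;
""")
```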
We have of course tried to measure our different experiments, so let's see how the results look.
Results
Environments
We used four different environments for our tests:
- prod/openshift
This is an openshift deployment of datagrepper which hits the production postgresql database and is configured just like the actual (VM-based) production instance.
- prod/aws
All our experiments were done in AWS, so we needed a production-like instance there to avoid comparing the production instance, which carries all the production traffic and load, to AWS instances that have none. This instance is just like the production instance with one difference: it has a default delta specified in its configuration (with a value of 3 days, so if no delta is specified, it returns 3 days' worth of messages).
- partition/aws
This is the AWS instance running datanommer and datagrepper with the messages table partitioned by id. Datagrepper is also configured with a 3-day default delta.
- timescaledb/aws
This is the AWS instance running datanommer and datagrepper with the messages table configured (and thus partitioned) by timescaledb. Datagrepper is also configured with a 3-day default delta.
Requests
Our test script launches 10 threads, each of them making 30 requests (so 300 requests are made in total), and we ran it against six different requests (a sketch of the script follows the list):
filter_by_topic: /raw?topic=org.fedoraproject.prod.copr.chroot.start
plain_raw: /raw
filter_by_category: /raw?category=git
filter_by_username: /raw?user=pingou
filter_by_package: /raw?package=kernel
get_by_id: /id?id=2019-cc9e2d43-6b17-4125-a460-9257b0e52d84
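A minimal reconstruction of such a test script (the base URL and the measured request are placeholders):

```python
import threading
import time

import requests

# Placeholder base URL; point it at the environment under test.
BASE_URL = "http://localhost:8080"
PATH = "/raw?category=git"  # one of the six requests listed above

THREADS = 10
REQUESTS_PER_THREAD = 30  # 10 * 30 = 300 requests in total

lock = threading.Lock()
results = []

def worker():
    for _ in range(REQUESTS_PER_THREAD):
        start = time.monotonic()
        response = requests.get(BASE_URL + PATH, timeout=120)
        elapsed = time.monotonic() - start
        with lock:
            results.append((response.status_code, elapsed))

threads = [threading.Thread(target=worker) for _ in range(THREADS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

durations = [elapsed for status, elapsed in results if status == 200]
if durations:
    print(f"{len(durations)}/{len(results)} requests succeeded, "
          f"mean request time {sum(durations) / len(durations):.2f}s")
else:
    print(f"0/{len(results)} requests succeeded")
```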
Graphs
As you can see, timescaledb is the only environment in which all requests returned successfully! It is also important to keep the poor performance of prod/openshift in mind when looking at the other results.
The partition/aws environment performed the least well on the get_by_id request. This can be explained by postgresql having to scan all the partitions in parallel to find out which one contains that specific msg_id. Timescaledb performed better there, potentially thanks to optimizations timescaledb has that we missed when doing the partitioning manually.
Timescaledb pretty much outperforms all the other environments for all queries here.
We lacked actual results for most of the prod/openshift requests there, and here as well timescaledb has the lowest mean request time for each request.
Here again, timescaledb outperforms all other environments for each request.
Conclusions
Seeing these graphs, you can probably already guess what our recommendations are:
- Set a default delta in datagrepper's configuration file. This will technically break the API, but not specifying a delta today results in 504 errors more often than not.
- Port the database to timescaledb. We could probably replicate some of the gains by partitioning manually on the timestamp field, but timescaledb takes care of all of this for us.
References:
- Report of the investigations
- SQL script to migrate the database (probably needs some polishing but provides a basis)
- Datanommer patch to support the new database schema