<h2>Le blog de pingou - Tag - Database</h2>
<p>Pingou's weblog, his Fedora news, his RPMs, his tests, his Linux... :-)</p>
<h3>datanommer/datagrepper investigations</h3>
<p>2021-02-25, by Pierre-Yves · Tags: Général, Database, datagrepper, Fedora, Fedora-Infra, PostgreSQL</p>
<p>A few team members of the <a href="https://docs.fedoraproject.org/en-US/cpe/">CPE</a> team have investigated how to improve the performance of datanommer/<a href="https://apps.fedoraproject.org/datagrepper/">datagrepper</a>.</p> <p>A little while ago, we <a href="https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedoraproject.org/message/6NRUH7EP6ERTBUEVTTXYLA25QUSHTKBE/">announced</a> that we would be looking at optimizations for datanommer/datagrepper.</p>
<p>The main issue is that it has grown over the course of its life (it has been running for 9 years now!) and its 180 million messages are all stored in a single table, making some of the application's queries slow. Queries can even be so slow in some situations that the web server ends up raising a 504 gateway time-out error.</p>
<p>So we looked at a few ways to improve performances.</p>
<h4>Default delta</h4>
<p>We found out that most of the queries that time out are requests that do not include a <code>delta</code> in their arguments. Turns out this is because, if no delta is specified, datagrepper counts all the messages that fit the given criteria since the beginning of the database, and in some cases that can mean querying all 180 million messages.</p>
<p>Turns out that datagrepper has a configuration option to specify a default delta if none is provided by the user. This alone helps datagrepper quite a bit and avoids running into time-out errors while simply browsing its UI.</p>
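<p>A rough sketch of what such a fallback looks like (illustrative Python, not datagrepper's actual code; the names are made up, and the 3-day default matches the value we used in our AWS tests):</p>

```python
from datetime import datetime, timedelta, timezone

# Illustrative value; datagrepper's actual option and code differ.
DEFAULT_DELTA = timedelta(days=3)

def resolve_time_window(delta=None, start=None, end=None,
                        now=None, default_delta=DEFAULT_DELTA):
    """Return the (start, end) window a query should cover."""
    now = now or datetime.now(timezone.utc)
    end = end or now
    if delta is not None:
        return end - delta, end
    if start is not None:
        return start, end
    # No bounds given: fall back to the configured default instead of
    # counting messages since the beginning of the database.
    return end - default_delta, end
```

<p>With no arguments at all, the query only covers the last 3 days instead of the whole table.</p>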
<p><br />
<br /></p>
<p>Once we figured out how to improve the UI, we started looking at how to improve the database side. Our first try was to manually partition the <code>messages</code> table which contains all the messages.</p>
<h4>Manually partitioning the database</h4>
<p>Our first attempt was to partition the messages by year, with one partition per year. However, the way partitioning works means the partitioning field must be part of every unique constraint on the table, and as a consequence it must be part of all the foreign keys as well.</p>
<p>We stopped our first attempt there as we didn't want to adjust all the foreign key constraints.</p>
<p>Our second attempt was to partition the database by <code>id</code>, the primary key of the <code>messages</code> table, which solves the foreign key question since those constraints rely on that field.
We partitioned the table in chunks of 10 million records, creating 19 partitions, and loaded the data into them.</p>
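<p>The id ranges involved can be sketched as follows (illustrative Python; the real partitioning was done in SQL):</p>

```python
# 19 partitions of 10 million ids each cover the ~180 million messages.
CHUNK = 10_000_000

def partition_for(msg_id):
    """Index of the partition that holds a given message id."""
    return msg_id // CHUNK

def bounds(partition):
    """Half-open [low, high) id range covered by a partition."""
    return partition * CHUNK, (partition + 1) * CHUNK
```
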
<p><br />
<br /></p>
<p>With this partitioning and the default delta, we started seeing some good results but we also wanted to test the <a href="https://www.timescale.com/">timescaledb</a> postgresql plugin.</p>
<h4>The timescaledb postgresql plugin</h4>
<p>That plugin is designed to improve the performance of databases that store time-related data. Datagrepper's main use case is to store messages and retrieve them based on time information (and potentially other criteria), so timescaledb sounds like a good candidate for our use-case.</p>
<p>Timescaledb is set up on <code>timestamp</code> fields, so we set it up on the <code>timestamp</code> field of the <code>messages</code> table. Once it was set up and the data imported, we realized that timescaledb also does table partitioning, meaning the issue we had earlier with the foreign key constraints appeared again (but this time on the <code>timestamp</code> field).
This time, however, we decided to adjust the tables linked to the <code>messages</code> table to include the <code>timestamp</code> field in the foreign key constraints.</p>
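<p>Conceptually, timescaledb routes each row to a time-based chunk, along these lines (a pure-Python illustration; the 30-day chunk interval is an assumption for the example, not necessarily what timescaledb chose):</p>

```python
from datetime import datetime, timezone

# Assumed 30-day chunk interval, purely for illustration.
CHUNK_SECONDS = 30 * 24 * 3600

def chunk_for(ts):
    """Index of the time chunk a message's timestamp falls into."""
    return int(ts.timestamp()) // CHUNK_SECONDS
```

<p>A time-bounded query then only has to touch the few chunks overlapping its window, instead of the whole table.</p>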
<p>This led to a good gain in performance, with one little issue: we found some duplicated messages in the database. They have the same <code>msg_id</code> but different <code>timestamp</code>s. We consider this an artifact of using fedmsg and expect that moving to fedora-messaging will solve it, as using rabbitmq ensures that messages are only processed by one consumer at a time. It could also be caused by the bridge between fedora-messaging and fedmsg, in which case datanommer may have to be adjusted to check whether a <code>msg_id</code> already exists in the database before inserting a new message.</p>
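<p>A minimal sketch of that check-before-insert idea (a plain dict stands in for the messages table; this is not datanommer's actual code):</p>

```python
def add_message(store, msg_id, message):
    """Insert a message unless its msg_id is already present.

    Returns True when the message was inserted, False when it was a
    duplicate. `store` is a dict standing in for the messages table.
    """
    if msg_id in store:
        return False
    store[msg_id] = message
    return True
```
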
<h4>timescaledb without external tables</h4>
<p>We thought that simplifying the database model may be a way to optimize some of the queries further. So we changed the current database schema:
<a href="https://blog.pingoured.fr/public/datanommer_db.jpeg" title="datanommer_db.jpeg"><img src="https://blog.pingoured.fr/public/.datanommer_db_m.jpg" alt="datanommer_db.jpeg" style="display:table; margin:0 auto;" title="datanommer_db.jpeg, Feb 2021" /></a></p>
<p>We moved the user and package information into the messages table, using <a href="https://www.postgresql.org/docs/current/arrays.html">postgresql arrays</a>, in the hope that combining this with a <a href="https://www.postgresql.org/docs/current/gin-intro.html">Generalized Inverted Index (GIN)</a> would yield some optimization.</p>
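<p>In that model each message row carries its users and packages as arrays, and filtering becomes an array-membership test, which is what a GIN index accelerates. A pure-Python illustration of the shape of the data (sample values are made up):</p>

```python
# Each row carries `users` and `packages` arrays instead of pointing to
# separate user/package tables.
messages = [
    {"msg_id": "m1", "users": ["pingou"], "packages": ["kernel"]},
    {"msg_id": "m2", "users": ["alice"], "packages": ["firefox"]},
]

def filter_by_package(rows, package):
    """Array-membership filter, the operation a GIN index speeds up."""
    return [row for row in rows if package in row["packages"]]
```
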
<p>However, when testing the queries in postgresql directly, we saw that as soon as a query involved ordering by timestamp as well as filtering on other criteria (such as a package's name or a user's name), performance dropped. We thus never adjusted datanommer and datagrepper to work with this setup (which is why you will not find it in the results section below).</p>
<p><br />
<br /></p>
<p>We have of course tried to measure our different experiments. So let's see how they look.</p>
<h4>Results</h4>
<h5>Environments</h5>
<p>We used four different environments for our tests:</p>
<ul>
<li>prod/openshift</li>
</ul>
<p>This is an OpenShift deployment of datagrepper which hits the production postgresql database and is configured just like the actual (VM-based) production instance.</p>
<ul>
<li>prod/aws</li>
</ul>
<p>All our experiments were done in AWS, so we needed a production-like instance there; otherwise we would have been comparing the production instance, with all its traffic and load, to AWS instances with no load or traffic at all.
This instance is thus just like the production instance with one difference: it has a default delta specified in its configuration (with a value of 3 days, so if no delta is specified, it returns 3 days' worth of messages).</p>
<ul>
<li>partition/aws</li>
</ul>
<p>This is the AWS instance that is running datanommer and datagrepper with the <code>messages</code> table partitioned by <code>id</code>. Datagrepper is also configured with a 3-day default delta.</p>
<ul>
<li>timescaledb/aws</li>
</ul>
<p>This is the AWS instance that is running datanommer and datagrepper with the <code>messages</code> table configured (and thus partitioned) by timescaledb. Datagrepper is also configured with a 3-day default delta.</p>
<h5>Requests</h5>
<p>Our test script launches 10 threads, each of them doing 30 requests (so 300 requests are made in total) and we ran it against six different requests:</p>
<ul>
<li><code>filter_by_topic</code>: <code>/raw?topic=org.fedoraproject.prod.copr.chroot.start</code></li>
<li><code>plain_raw</code>: <code>/raw</code></li>
<li><code>filter_by_category</code>: <code>/raw?category=git</code></li>
<li><code>filter_by_username</code>: <code>/raw?user=pingou</code></li>
<li><code>filter_by_package</code>: <code>/raw?package=kernel</code></li>
<li><code>get_by_id</code>: <code>/id?id=2019-cc9e2d43-6b17-4125-a460-9257b0e52d84</code></li>
</ul>
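<p>The overall shape of that driver can be sketched as follows (illustrative Python; <code>fetch</code> is a stand-in for the real HTTP call, and the actual script also recorded status codes and timings):</p>

```python
from concurrent.futures import ThreadPoolExecutor

THREADS = 10
REQUESTS_PER_THREAD = 30  # 10 * 30 = 300 requests in total

def run_benchmark(fetch, path):
    """Call `fetch(path)` 30 times in each of 10 threads; return all results."""
    def worker(_):
        return [fetch(path) for _ in range(REQUESTS_PER_THREAD)]
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        batches = pool.map(worker, range(THREADS))
        return [result for batch in batches for result in batch]
```

<p>With a stub like <code>lambda path: 200</code> this yields 300 results; the real runs pointed <code>fetch</code> at each environment's datagrepper URL.</p>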
<h5>Graphs</h5>
<p><a href="https://blog.pingoured.fr/public/datanommer_percent_sucess.jpg" title="datanommer_percent_sucess.jpg"><img src="https://blog.pingoured.fr/public/.datanommer_percent_sucess_m.jpg" alt="datanommer_percent_sucess.jpg" style="display:table; margin:0 auto;" title="datanommer_percent_sucess.jpg, Feb 2021" /></a></p>
<p>As you can see, timescaledb is the only environment in which all requests returned successfully! It is also important to keep in mind the poor performance of <code>prod/openshift</code> when looking at the other results.
The <code>aws/partition</code> environment performed worst on the <code>get_by_id</code> request. This can be explained by postgresql having to scan all partitions in parallel to find out which partition contains that specific <code>msg_id</code>. Timescaledb performed better there, potentially thanks to optimizations timescaledb has that we missed when doing the partitioning manually.</p>
<p><a href="https://blog.pingoured.fr/public/datanommer_req_per_sec.jpg" title="datanommer_req_per_sec.jpg"><img src="https://blog.pingoured.fr/public/.datanommer_req_per_sec_m.jpg" alt="datanommer_req_per_sec.jpg" style="display:table; margin:0 auto;" title="datanommer_req_per_sec.jpg, Feb 2021" /></a></p>
<p>timescaledb pretty much outperforms all the other environments for all queries here.</p>
<p><a href="https://blog.pingoured.fr/public/datanommer_mean_per_req.jpg" title="datanommer_mean_per_req.jpg"><img src="https://blog.pingoured.fr/public/.datanommer_mean_per_req_m.jpg" alt="datanommer_mean_per_req.jpg" style="display:table; margin:0 auto;" title="datanommer_mean_per_req.jpg, Feb 2021" /></a></p>
<p>We lacked actual results for most of the <code>prod/openshift</code> requests, and here as well timescaledb has the lowest mean request time for each request.</p>
<p><a href="https://blog.pingoured.fr/public/datanommer_max_per_req.jpg" title="datanommer_max_per_req.jpg"><img src="https://blog.pingoured.fr/public/.datanommer_max_per_req_m.jpg" alt="datanommer_max_per_req.jpg" style="display:table; margin:0 auto;" title="datanommer_max_per_req.jpg, Feb 2021" /></a></p>
<p>Here again, timescaledb outperforms all other environments for each request.</p>
<h4>Conclusions</h4>
<p>Seeing these graphs, you can probably already guess what our recommendations are:</p>
<ul>
<li>Set a default delta in datagrepper's configuration file. Even though this changes the API's behavior, not specifying a delta today results in 504 errors more often than not.</li>
<li>Port the database to use timescaledb. We could probably replicate some of the gain by doing the partitioning manually on the <code>timestamp</code> field, but timescaledb takes care of all of this for us.</li>
</ul>
<p><br />
<br />
<br /></p>
<h5>References:</h5>
<ul>
<li><a href="https://fedora-arc.readthedocs.io/en/latest/datanommer_datagrepper/index.html">Report of the investigations</a></li>
<li><a href="https://pagure.io/fedora-infra/arc/blob/main/f/scripts/migration_timescaledb.sql">SQL script to migrate the database</a> (probably needs some polishing but provides a basis)</li>
<li><a href="https://fedora-arc.readthedocs.io/en/latest/datanommer_datagrepper/pg_timescaledb.html#patch">Datanommer patch to support the new database schema</a></li>
</ul>
<h3>Faitout changes home</h3>
<p>2015-08-05, by Pierre-Yves · Tags: Général, Database, faitout, Fedora, Fedora-planet, jenkins, PostgreSQL, Python, Unit-tests</p>
<p><a href="http://faitout.fedorainfracloud.org/">Faitout</a> is an application giving you full access to a postgresql database for 30 minutes.</p>
<p>This is really handy to run tests against.</p>
<p>For example, for some of my applications, I run the tests locally against an in-memory sqlite database (very fast) and when I push, the tests are run on jenkins, this time using faitout (a little slower, but much closer to the production environment). This setup allows me to catch, early, potential errors in the code that sqlite does not trigger.</p>
<p>Faitout is running in the Fedora infrastructure cloud and, since this cloud has just been rebuilt, we had to move it.
While doing so, faitout got a nice new address:</p>
<p><a href="http://faitout.fedorainfracloud.org/">http://faitout.fedorainfracloud.org/</a></p>
<p>So if you are using it, don't forget to update your URL ;-)</p>
<p><br />
<br />
See also: <a href="http://blog.pingoured.fr/index.php?tag/faitout">Previous blog posts about faitout</a></p>
<h3>Faitout, 1000 sessions</h3>
<p>2014-06-26, by Pierre-Yves · Tags: Général, Database, faitout, Fedora, Fedora-planet, jenkins, PostgreSQL, Python, Unit-tests</p>
<p><a href="http://blog.pingoured.fr/index.php?post/2013/10/28/Faitout-test-against-a-real-database">A while back</a>, I introduced <a href="http://209.132.184.152/faitout/">faitout</a> on this blog.</p>
<p>Since then I have been using it to test most if not all of the projects I work on.
I basically use the following set-up:</p>
<pre>
import requests

# Fall back to an in-memory sqlite database when faitout is unreachable
DB_PATH = 'sqlite:///:memory:'
FAITOUT_URL = 'http://209.132.184.152/faitout/'
try:
    req = requests.get('%s/new' % FAITOUT_URL, timeout=5)
    if req.status_code == 200:
        DB_PATH = req.text
        print('Using faitout at: %s' % DB_PATH)
except requests.RequestException:
    pass
</pre>
<p>This way, if I have network access, the tests run with faitout and thus against a
real postgresql database, while if I do not, they run against an in-memory
sqlite database.</p>
<p>This set-up allows me to work offline and still be easily able to run all the
unit-tests as I change the code.</p>
<p>The point of this blog post was actually more to announce that, despite
its limited spread (only 25 different IP addresses have requested sessions),
the tool is used and has already reached 1,000 sessions created (and
dropped) in less than a year.</p>
<p><br />
<br /></p>
<p>If you're not using it, I invite you to have a look at it; I find it
marvelous in combination with Jenkins and it does help find bugs in your
code.</p>
<p>If you are using it, congrats and keep up the good work!!</p>
<h3>Faitout - test against a real database</h3>
<p>2013-10-28, by Pierre-Yves · Tags: Général, Database, faitout, Fedora, Fedora-planet, PostgreSQL, Python, Unit-tests</p>
<ul>
<li>Do you do unit-tests?</li>
<li>Do you do continuous integration?</li>
<li>Do you use sqlite for your tests while deploying against postgresql?</li>
<li>Do you hate using sqlite for your tests?</li>
</ul>
<p><br /></p>
<p>If you answered 'yes' to any of those four questions, the following post is for you.</p>
<p>Otherwise, well, stay, it might still be interesting ;-)</p>
<p>When doing unit-tests you want something fast which allows you to quickly see if your last changes affect other parts of your code.</p>
<p><a href="http://www.sqlite.org/">sqlite</a> is great for that. You can easily create an in-memory database, no file IO, it all goes fast and smooth.</p>
<p>That is, until you push your application to production where it is deployed against a real database system such as <a href="https://blog.pingoured.fr/index.php?post/2013/10/28/">PostgreSQL</a>. Then suddenly, queries which ran fine under sqlite start breaking under PostgreSQL:
sqlite and PostgreSQL implement some things differently, and this leads to this kind of situation.</p>
<p>The solution for this is of course to run your tests in an environment as close as possible to the production one, i.e. run your tests on the same database system as the one you use in production.</p>
<p>But this can also become complex: it means setting up a new database server, creating a new database, cleaning the database after the tests, handling permissions...</p>
<p>With this in mind, project such as <a href="http://www.postgression.com/">postgression</a> appeared.</p>
<p>The idea is simple: easily get access to postgresql databases which are thrown away after a certain time.</p>
<p>The problem is that postgression is not FOSS, so when, a couple of weeks ago, there was no way to get a database, there was also no way to set up our own postgression server that could be used by a restricted number of people.</p>
<p>So after discussing it with <a href="https://fedoraproject.org/wiki/User:Abompard">Aurélien</a>, somewhere between lunch and dessert, faitout appeared.</p>
<p>The idea was simple: have a small web application that creates on the fly a user and a database made available to the one who asks, and after 30 minutes (via a cron job for the moment) destroys the database and the user.</p>
<p>The API is pretty simple and all is documented on the front page of the application.</p>
<p>So feel free to have a look at it, test it, break it (but let us know how you did that ;-)) at the test instance we have:</p>
<p><a href="http://209.132.184.152/faitout/">http://209.132.184.152/faitout/</a></p>