PostgreSQL vs MongoDB
By Pierre-Yves on Sunday, May 20 2012, 11:15 - General - Permalink
A comparative test of PostgreSQL vs MongoDB
As you may know, I have spent some time recently working on the HyperKitty program. The idea is to offer a new interface to the archives in Mailman 3 (which has never been closer to a release).
HyperKitty aims at implementing a number of the ideas developed by Máirín Duffy in her blog posts. The main one is to try to unify mailing lists and forums (i.e. providing a web interface to mailing lists).
In this quest, some time ago I started looking at MongoDB. It is a NoSQL database which is becoming quite popular (probably helped by its integration into OpenShift). The results were satisfying, but then a big question came up:
- Do we really want to impose the burden of two different database systems on our sysadmins for a Mailman archives interface?
Of course, if we can avoid it, we would prefer to.
But MongoDB was performing really nicely with regard to searching the archives, so testing was needed.
Hardware
The machine on which I ran the test has:
- 16 GB of RAM
- 4 cores
- 2×1 TB in RAID1
All of this running RHEL 6.2 (Santiago).
The databases
- PostgreSQL version 8.4.9
- MongoDB version 1.8.2
The data
I used the archives of the devel mailing list, covering everything since its creation in 2002.
- PostgreSQL loaded 166672 emails
- MongoDB loaded 166642 emails
So there is a difference of 30 emails, which I considered negligible for the tests.
The data structure
PostgreSQL
PostgreSQL has one table per list and each table has the same (following) structure:
id serial NOT NULL,
sender character varying(100) NOT NULL,
email character varying(75) NOT NULL,
subject text NOT NULL,
"content" text NOT NULL,
date timestamp without time zone,
message_id character varying(150) NOT NULL,
stable_url_id character varying(250) NOT NULL,
thread_id character varying(150) NOT NULL,
"references" text
Primary key: id
Indexes: date, message_id, stable_url_id, subject, thread_id
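For illustration, this is roughly what that schema looks like as a SQLAlchemy model; the class name and table name below are hypothetical, only the column definitions follow the structure above:

    from sqlalchemy import Column, DateTime, Integer, String, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Email(Base):
        """One row per archived email; one such table exists per list."""
        __tablename__ = 'devel'  # hypothetical list name

        id = Column(Integer, primary_key=True)
        sender = Column(String(100), nullable=False)
        email = Column(String(75), nullable=False)
        subject = Column(Text, nullable=False, index=True)
        content = Column(Text, nullable=False)
        date = Column(DateTime, index=True)
        message_id = Column(String(150), nullable=False, index=True)
        stable_url_id = Column(String(250), nullable=False, index=True)
        thread_id = Column(String(150), nullable=False, index=True)
        references = Column(Text)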
MongoDB
MongoDB, being a NoSQL database that stores dictionaries (documents), has a much more flexible structure. For the test, the same information was stored in a document with the following fields:
- Content
- InReplyTo
- From
- Subject
- ThreadID
- Date
- References
- _id (internal id defined by MongoDB)
- MessageID
UPDATE: the fields which are searched are indexed, so the following indexes are generated (see the sketch after this list):
- Content
- From
- Subject
- ThreadID
- Date
- References
- MessageID
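As a rough illustration with the pymongo API of that era (the connection, database and collection names are assumptions):

    import datetime
    import pymongo

    # Hypothetical connection, database and collection names.
    connection = pymongo.Connection('localhost', 27017)
    collection = connection['mailman']['devel']

    # Each email is stored as a plain dictionary; MongoDB adds _id itself.
    collection.insert({
        'Content': 'Hello world',
        'InReplyTo': None,
        'From': 'someone@example.com',
        'Subject': 'Test email',
        'ThreadID': '<thread-id>',
        'Date': datetime.datetime(2002, 1, 1),
        'References': None,
        'MessageID': '<message-id>',
    })

    # Index every searched field (ensure_index is a no-op if it exists).
    for field in ('Content', 'From', 'Subject', 'ThreadID',
                  'Date', 'References', 'MessageID'):
        collection.ensure_index(field)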
The queries
11 different types of queries were run:
- get_thread_length: return how many emails are part of a thread
- get_thread_participants: return the list of all the participants in a thread
- get_email: return a given email based on its message_id
- first_email_in_archives_range: for a time range, return the first email (ordered by date)
- get_archives_range: return all the emails in a time range
- get_archives_length: return the years and months since the creation of the list
- get_list_size: return how many emails are archived for this list
- search_subject: performs a search on the subject of the emails
- search_content: performs a search on the content of the emails
- search_content_subject: performs a search on the subject and the content of the emails
- search_sender: performs a search on the sender of the emails (name and email)
Each query was run 30 times, thus allowing caching to take place.
The four search queries were run in both case-sensitive and case-insensitive variants.
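To make the case-sensitivity distinction concrete, here is roughly how such a subject search differs between the two systems; the actual benchmark queries live in kittystore, so the names below (and the DSN) are assumptions reusing the earlier sketches:

    import re
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    # `collection` and the `Email` model are as in the earlier sketches.
    engine = create_engine('postgresql:///mailman')  # hypothetical DSN
    session = sessionmaker(bind=engine)()

    keyword = 'rawhide'

    # MongoDB: case sensitivity is a property of the regular expression.
    mg_ci = collection.find({'Subject': re.compile(keyword, re.IGNORECASE)})
    mg_cs = collection.find({'Subject': re.compile(keyword)})

    # PostgreSQL via SQLAlchemy: ilike() vs like().
    pg_ci = session.query(Email).filter(
        Email.subject.ilike('%' + keyword + '%')).all()
    pg_cs = session.query(Email).filter(
        Email.subject.like('%' + keyword + '%')).all()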
A few additions for PostgreSQL:
- the search_sender and search_content_subject queries were run twice: once using a union of two queries and once using a simple OR statement (see the sketch below).
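The shape of those two variants, again as a sketch against the hypothetical model (the real queries are in kittystore):

    from sqlalchemy import or_

    pattern = '%rawhide%'  # hypothetical search pattern

    # Variant 1: two separate queries whose results are concatenated, so
    # an email matching on both columns shows up twice.
    mails_union = (
        session.query(Email).filter(Email.subject.like(pattern)).all() +
        session.query(Email).filter(Email.content.like(pattern)).all())
    # The duplicates could then be removed with list(set(mails_union)).

    # Variant 2: a single query with an OR statement; no duplicates.
    mails_or = session.query(Email).filter(
        or_(Email.subject.like(pattern),
            Email.content.like(pattern))).all()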
Results
Output
The first thing we want to know is:
- Did the queries return the same results?
The following queries returned the same result:
- get_email
- get_archives_range
- first_email_in_archives_range
- get_thread_length
- get_thread_participants
- get_archives_length
- search_subject
- search_subject_cs
- search_content
- search_content_cs
The other ones returned different results:
- search_content_subject
** Results differs MG: 28762 PG: 34110 PG-OR: 28762
- search_content_subject_cs
** Results differs MG-CS: 24886 PG-CS: 28578 PG-OR-CS: 24886
- search_sender
** Results differs MG: 1883 PG: 3325 PG-OR: 1883
- search_sender_cs
** Results differs MG-CS: 1882 PG-CS: 1883 PG-OR-CS: 1882
- get_list_size
** Results differs MG: 166642 PG: 166672
For this last query, the difference in the results was expected; nothing new there.
But what about the four search queries returning different results? Well, as mentioned, the PG{,-CS} queries use a union of two queries to retrieve the information, so basically what we have are some duplicates. We could easily remove the duplicates in Python using a set, but that would only hurt the performance even more.
When we compare the results returned by MongoDB with the results returned by PostgreSQL using an OR statement, they match. So we can conclude that the different queries are returning the same information.
Let's look at their performances now.
Performance
The results are presented as boxplots.
Legend:
- CS in the title stands for Case Sensitive query (as opposed to case insensitive)
- MG: results for MongoDB
- PG: results for PostgreSQL
- PG-OR: results for PostgreSQL using an OR statement in the query (as opposed to a union)
This is a simplified boxplot (the outliers are not shown in the picture), but the full boxplot (with the outliers shown) is also available.
Additional results:
I also looked at the influence of case sensitive vs case insensitive queries within a database system.
MongoDB
So apparently there is a 'case' effect, but looking at the y scale we can see that this difference is ~0.2 seconds, which I believe is negligible.
PostgreSQL
In this case the 'case' effect is much stronger and can reach ~4 seconds, which is much more significant for us.
Conclusions
Looking at the box plots:
- PostgreSQL performs worse than MongoDB on the retrieval queries, but the time it takes is negligible (we are well under 1 second).
- PostgreSQL's case-sensitive queries perform better than MongoDB's, but its case-insensitive queries perform worse.
- Case sensitivity in the queries has a greater influence on the performance of PostgreSQL than on that of MongoDB.
- An OR statement should be preferred to a union of queries (nothing really surprising there).
So, do we want to perform case-insensitive queries and have two database systems to maintain, or do we want an option to turn on case-insensitive search and have one database system? Or do we consider 7-8 seconds to be fine when searching the content of 166000 emails?
On a final note, everything I used to make the tests, and the output of said tests, is available on my Git repository and mirrored on my GitHub.
Comments
About case-insensitive search and PostgreSQL: you should build a case-insensitive index to get better performance. There is such an example in the CREATE INDEX documentation of PostgreSQL.
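For reference, the example in that documentation is an expression index on lower(); a minimal sketch against the hypothetical schema from above, with `engine` as created in the search sketch earlier. Note that such an index helps exact and prefix matches; a '%keyword%' pattern still cannot use it:

    # Expression index, as in the CREATE INDEX documentation.
    engine.execute('CREATE INDEX devel_subject_lower_idx '
                   'ON devel (lower(subject))')

    # Queries must apply lower() on both sides to use the index.
    engine.execute("SELECT id FROM devel "
                   "WHERE lower(subject) = lower('Some Subject')")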
Thanks for the info, I will look at this (and more precisely, see if SQLAlchemy supports this).
Thanks for the report, it is something I was very interested in. It's good to point out that the conclusions should not be drawn by looking only at single-query performance; there is also an important feature of Mongo which should be considered when deciding: its scaling capabilities. At the same time there is a downside: it forces you to have some DB logic in the web app, which will probably also move some load over to the web server, compared to PostgreSQL. It would be great to take this analysis further.
@Giulivo,
I am not sure I follow you with regard to "it forces you to have some db logic in the web app", could you expand a little bit?
For example, at the moment I have split my DB engine logic out of the web app [1], so that I can actually use either PostgreSQL or MongoDB without changing the web app code.
1: https://github.com/pypingou/kittyst... this is the project which implements the interface between the web app and the DB engine.
you might want to test another time with Postgres 9.1, or even 9.2 which is in development at the moment; it should perform even better. Be sure to add appropriate indexes in that case as well.
Hi there, thanks for publishing my comment and adding your thoughts.
I might not be mastering Mongo, but while on some occasions I managed to produce useful queries in JavaScript [1], on some other occasions I had to add some code in the app [2] because I couldn't make Mongo do those operations for me.
1. https://github.com/giulivo/opinoid/...
2. https://github.com/giulivo/opinoid/...
I don't know if there are better solutions, but even if there are not, the switch could still be worthwhile when taking the specific "scalability on commodity hardware" feature into account.
@tobias,
I agree, I should try to re-run the tests with newer versions of MongoDB, PostgreSQL and SQLAlchemy.
@giulivo,
Your links appear to be broken, but from your description I think I see what you mean.
I took a different approach where I use MongoDB just like another database system (as I would PostgreSQL or MySQL), so I have no AJAX or the like calling the DB from the HTML.
I'd also be curious whether you tuned your Postgres server to use more RAM than its ridiculously conservative defaults. See http://wiki.postgresql.org/wiki/Tun... for some ideas.
Also, from a quick read I did not see any mention of indexes on the Mongo DB. If you are doing content-specific searches, i.e. on the subject, you should index the subject. Obviously you want to index for the most important queries. In this case, creating indexes on subject and content may significantly improve your times in those areas, but then again it could have zero effect :).
@charlie,
I did not change anything to the configuration of my PostgreSQL, so I use the defaults provided by RHEL.
@pjwarner,
Good point, I forgot to mention it; there are indexes in MongoDB as well (I will update the post):
- Date
- Email
- Subject
- From
- Content
- ThreadID
- MessageID
- References
That's basically all the fields which are searched, and the indexes are created, if needed, just before the search.
To give you an idea, you can see the queries there: https://github.com/pypingou/kittyst...
And before the question is asked: I cannot create an index on the content of the emails in PostgreSQL, as some emails are too big/long for its taste. There might be an option to tune this but I did not look.
When I last used MongoDB (<1.8) you needed very specialized indexes. E.g. MongoDB was not able to combine the single indexes you create in https://github.com/pypingou/kittybe... and use them in the query.
This seems not to have changed:
http://www.mongodb.org/display/DOCS...
You can use explain() on the query to see if it does a table scan, and hint() to force a not-perfectly-fitting index to be used if the query analyzer doesn't choose it.
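With pymongo, those calls look roughly like this (reusing the hypothetical collection from the earlier sketch):

    import pymongo

    # explain() shows the plan: a 'BasicCursor' means a full collection
    # scan, 'BtreeCursor ...' means an index was used.
    plan = collection.find({'Subject': 'Test email'}).explain()
    print(plan['cursor'])

    # hint() forces the query to use the index on Subject.
    cursor = collection.find({'Subject': 'Test email'}).hint(
        [('Subject', pymongo.ASCENDING)])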
The Indexing FAQ has more important info, like:
* Performance drops if your indexes don't fit in RAM. If you need many of them it becomes a problem.
* Compound indexes should use the right options and order if you use them for sorting.
All search_(content|subject)*() benchmarks will do a table scan in MongoDB anyway because of the `^.*` regular expression (http://www.mongodb.org/display/DOCS...), just like Postgres for '%...' queries.
Postgres has full-text indexes that tokenize the text. It may save you from LIKE '%...' queries, which are expensive. If you tokenize 'This is textual data' you can search for 'text%' and do not need to search for '%text%'. You can implement tokenizing in MongoDB within your application logic (http://www.mongodb.org/display/DOCS...) but I don't think that scales for big amounts of data. If you want to offer advanced full-text search you can use Solr/ElasticSearch/Whoosh for that, but then you again have a second complex component in your stack, even if they give you, besides full-text search, faceting, similarity searches, stemming, result highlighting and a lot more. (Side note: I can recommend the sunburnt library to access Solr.)
Another thing: you might also look at the patterns for doing pagination in MongoDB and RDBMSes. If you might get 27000 mails in a result you don't want to deserialize all of them to show 20; instead, do a count() and fetch only the 20 you'd like to display. In SQLAlchemy it's a combination of `offset()` and `limit()`. What makes up the benchmark times of 7-8 seconds for search_sender|content is mostly the cost of transferring the data for 20000+ mails, not the search. The webhelpers package has a support class to do pagination with SQLAlchemy.
On the Postgres/SQLAlchemy side you can use `desc()` to sort descending.
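A sketch of that pagination pattern, with the hypothetical model and session from the earlier sketches:

    from sqlalchemy import desc

    page, per_page = 0, 20  # hypothetical paging parameters

    query = session.query(Email).filter(
        Email.content.like('%rawhide%')).order_by(desc(Email.date))

    total = query.count()  # one COUNT(*) instead of fetching everything
    mails = query.offset(page * per_page).limit(per_page).all()  # 20 rows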
The `KittySAStore.search_content_subject_or*()` methods should not need a `list(set(mails))` at the end.
Is the data you used available to play with? Or maybe I missed the code in the GitHub repo to generate it.
Cheers,
..Carsten
@Carsten
Thanks for all the feedback.
- Regarding the one-index-per-query limitation, it says the exception is $or; lucky me :)
But then, I guess the index on Date is not necessary.
- I agree that the pagination should be done at the DB level; that would save quite some time.
- True for the `list(set(mails))`, that's a leftover from the time it used a union.
- Regarding the use of desc(), I tried it but was getting weird results (as in, the most recent emails were not at the top).
- To regenerate the data, the information is within the "official" kittystore: https://github.com/pypingou/kittyst... Patches always welcome of course ;-)
At the very least, change the shared_buffers parameter in your Postgres config. The default is meant to be universally compatible, not to provide performance.
@Derek Arnold,
Apparently my configuration is at:
shared_buffers = 32MB
I have doubled it and restarted; let's see how different the results are :-)
I don't query Mongo from the HTML. Here are the two corrected links pointing to the DB util. The JavaScript is executed inside Mongo:
1. https://github.com/giulivo/opinoid/...
2. https://github.com/giulivo/opinoid/...
@giulivo,
Ok I see what you mean now, thanks :)
@Derek Arnold,
With 64MB I don't see any change in the results of the searches; the order remains the same, and so does the difference between MongoDB and PostgreSQL.
Is the RHEL default config at all different from the Postgres default config?
If so, this seems like an awfully specific comparison: a recent (but not the latest) patch version of a 3-year-old release, using the config changes made by one Linux distro.
Usually shared_buffers is about 1/4 of system memory, so many gigabytes in your case. However, I wonder if you are CPU-bound (can you check?), in which case using more specialized data types and indexes would help more. You might also want to post the plans of the queries you are issuing (via EXPLAIN) if you want to be sure they are using the indexes.
I think your work would be interesting to the pgsql-performance mailing lists, although the version is pretty old.
Another thing to worry about (if you are doing this) is SQL compilation time in SQLAlchemy. I like SQLAlchemy, but my recollection is that it is really quite slow at query formation, a cost of its generality.
@Alex,
I do not know precisely. The test might be specific, but it basically uses the packages with which we would like to see the application running.
Everything is there to duplicate the work on another platform, let me know if you need help to do so ;-)
@fdr
I can set shared_buffers to 4G, but IIUC the documentation mentions some kernel parameters to tweak, and I am not sure I want to do this. Plus, is it realistic to dedicate 4 GB of RAM to PostgreSQL for a Mailman server? I am not convinced.
If I re-run the tests on a Fedora machine, I'll post the new results, allowing us to have a better view of the performance of the newer versions.
Good point regarding SQLAlchemy, but I would prefer to keep using an ORM, and I started with the one I know best.
I know this is a specific test, but some things to consider:
The test dataset is negligible for such a powerful machine.
The test ran with a recent MongoDB and an outdated Postgres.
MongoDB uses all available memory; Postgres needs to be manually tuned in the config files.
@Arthur,
- The dataset is negligible for such a machine, that's true, but I did not tune the database systems to adjust them to the machine.
- For the versions I agree, and it has been mentioned before, but they are the versions present in the repos, and thus the versions run on the servers (unless specific work/configuration is done).
I will try to replicate this test on a Fedora 17 with "only" 8 GB of RAM to see how it goes then.
Looking over this, I find many critical flaws in this post.
1. Can you post your configuration files? Looking at this, I don't see anything to indicate you have configured PostgreSQL in a way where it can actually perform.
2. Are you also aware how the MVCC model differs from MongoDB's model, and that your results may therefore be worth very little?
3. I would suggest that your results are flawed due to 1) and that you should update to 9.1 and have a better-configured server (e.g. allocating at least 70% or more of memory to the effective cache).
@damm,
I believe your questions have already been answered in the previous comments.
And I never said it would be the fairest test possible; I wanted to test two systems for my needs using my hardware, so I did, and these are the results.
@pingou
with regard to the kernel parameters: you definitely want to tune those. All they do is allow more SysV shared memory to be allocated. Oracle has a similar SysV-backed system, with nearly identical settings required. I think there is a high likelihood that these kernel tweaks can and will be made optional in 9.3 (which is the next-next release). The gory details, if you are wondering why the difficulty: POSIX shared memory lacks a safety feature that SysV shared memory has, and mmap is not portable to all systems.
However, as I said, these shared buffer settings may not matter because your data set is very tiny. They would probably boost it by some percentage, because it avoids copying out of the kernel buffers and there is some reduced lock contention, but you don't have very many processors and it's not clear to me how many Postgres clients you are running with.
I'd like to express a lot of thanks for doing the work to make this benchmark possible, including the code so it can be iterated upon.
You are comparing "2 technologies". Try comparing PostgreSQL's NoSQL-style hstore vs MongoDB. That should be interesting.
If you are considering MongoDB then you probably don't care about ACID compliance. So turn off fsync in the PostgreSQL config and test again; you'll be surprised.
Also, SQLAlchemy does not support expression indexes; you need to create one "manually" via DDL. But attach that to the table's post-create event in your app and it remains defined by your SQLAlchemy layer (see the sketch below).
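A minimal sketch of that pattern, attaching the lower() expression index from earlier to the table's after-create event (the `Email` model is the hypothetical one from above):

    from sqlalchemy import DDL, event

    # The "manual" DDL stays declared next to the SQLAlchemy model and is
    # emitted automatically whenever the table is created.
    event.listen(
        Email.__table__,
        'after_create',
        DDL('CREATE INDEX devel_subject_lower_idx '
            'ON devel (lower(subject))'))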
@V,
Thanks for the advice, I just tested it, but the difference between MongoDB and PostgreSQL remains about the same.
Right, because your tests are read-only... eh, silly me. ;)
I've experimented with the data, using a Postgres full-text index to do the search_content() query.
You can find the changes on github: https://github.com/csenger/kittyben... (the sensitive visualizations are broken in the fork).
It takes ~0.08 seconds to get the count for 'rawhid*' (~17000 rows), 1 second to get the plain data with engine.execute() (without using the ORM), and ~1.5 seconds to get full objects. The index on 'content' is 165 MB big. As said, it can only do postfix wildcard searches. Fetching 30 emails from somewhere in the middle of the sorted result set also takes ~0.08 seconds.
Besides the pure speed improvement, it will also scale much better than table scans because it needs to load less data into memory.
There is one pitfall in doing the search_(content|subject)* queries with *big* result sets in Postgres. Postgres might use an index but needs to store the result set in memory to process it further; otherwise it stores information about which parts of the table results were found in, and scans those parts later for hits. The setting that specifies the max amount of memory is `work_mem` and it defaults to 1MB. For the full-text index searches the result needs to fit into `work_mem` or performance will suffer badly.
http://archives.postgresql.org/pgsq... is a good explanation of the consequences of `work_mem`.
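The kind of full-text setup described above looks roughly like this; the details of the fork may differ, and the names follow the hypothetical schema from earlier, with `engine` as in the search sketch:

    # GIN index over the tsvector of the content column.
    engine.execute(
        "CREATE INDEX devel_content_fts_idx "
        "ON devel USING gin(to_tsvector('english', content))")

    # Prefix search: 'rawhid:*' matches any word starting with "rawhid".
    result = engine.execute(
        "SELECT count(*) FROM devel "
        "WHERE to_tsvector('english', content) @@ to_tsquery('rawhid:*')")
    print(result.scalar())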
@ Carsten,
Thanks a lot for all this work, I am looking forward to running it here, but it already looks pretty awesome :)
Your conclusion that the performance delta is negligible is totally wrong. On a single-transaction basis it might appear that way, but when you throw tens of thousands of transactions at the database(s) it will turn into something quite significant. Also, you are really comparing a CAP implementation against an ACID implementation. They solve different problems, and therefore you have apples and oranges.
@Richard,
That's kind of the idea: I have a basket, I want to put fruits in it, and I wonder which is/are the most appropriate.