English version

As you may know I have spent some time recently working on the hyperkitty program. The idea being to offer a new interface to the archives in mailman 3 (which has never been closer to a release).

Hyperkitty aims at implementing a numbers of the ideas developed by Máirín Duffy in her blog posts. The main one being to try to unify mailing lists and forum (ie: providing a web-interface to mailing lists).

In this quest, I have started to look some time ago to MongoDB. It is a NoSQL database which is becoming quite popular (probably helped with its integration into openshift). The results were satisfying but then a big question came up:

  • Do we really want to impose the burden of 2 different database systems to our sysadmin for a mailman archives interface ?

Of course, if we can avoid it, we would prefer to do it.

But MongoDB was performing really nicely with regards to searching the archives. So testing was needed.

Hardware

The machine on which I ran the test has:

  • 16G of ram
  • 4 cores
  • 2x1To in RAID1

All this operated by RHEL 6.2 (Santiago).

The databases

  • PostgreSQL version 8.4.9
  • MongoDB version 1.8.2

The data

I used the archives from the devel mailing list, since its creation in 2002.

  • PostgreSQL loaded 166672 emails
  • MongoDB loaded 166642 emails

So there is a difference of 30 emails which I considered to be negligible for the tests.

The data structure

PostgreSQL

PostgreSQL has one table per list and each table as the same (following) structure:

 id serial NOT NULL,
sender character varying(100) NOT NULL,
email character varying(75) NOT NULL,
subject text NOT NULL,
"content" text NOT NULL,
date timestamp without time zone,
message_id character varying(150) NOT NULL,
stable_url_id character varying(250) NOT NULL,
thread_id character varying(150) NOT NULL,
"references" text,

Primary key: id

Index: date, message_id, stable_url_id, subject, thread_id

MongoDB

MongoDB being a NoSQL database using dictionary has a much more flexible structure. For the test the same information was stored in a dictionary structure:

  • Content
  • InReplyTo
  • From
  • Subject
  • ThreadID
  • Date
  • References
  • _id #internal id defined by MongoDB
  • MessageID
  • Email

UPDATE: The fields which are searched are being indexed. So the following indexes are generated:

  • Content
  • From
  • Subject
  • ThreadID
  • Date
  • References
  • MessageID
  • Email

The queries

11 different types of queries were ran:

  • get_thread_length: return how many emails are part of a thread
  • get_thread_participants: return the list of all the participants in a thread
  • get_email: return a given email based on its message_id
  • first_email_in_archive_range: for a time range, return the first email (ordered by date)
  • get_archives_range: return all the email in a time range
  • get_archives_length: return information of years and month since the creation of the list
  • get_list_size: return how many emails are archived for this list
  • search_subject: performs a search on the subject of the emails
  • search_content: performs a search on the content of the emails
  • search_content_subject: performs a search on the subject and the content of the emails
  • search_sender: performs a search on the sender of the emails (name and email)

Each queries was run 30 times, thus allowing caching to take place.

The four search queries were ran using both case insensitive and case sensitive queries

Few addition for PostgreSQL:

  • the search_sender and search_content_subject queries were ran twice in using a union of queries of simply a or statement

Results

Output

The first we want to know is

  • Did the queries return the same results ?

The following queries returned the same result:

  • get_email
  • get_archives_range
  • first_email_in_archives_range
  • get_thread_length
  • get_thread_participants
  • get_archives_length
  • search_subject
  • search_subject_cs
  • search_content
  • search_content_cs

The other ones returned different results:

  • search_content_subject
** Results differs
MG: 28762
PG: 34110
PG-OR: 28762
  • search_content_subject_cs
** Results differs
MG-CS: 24886
PG-CS: 28578
PG-OR-CS: 24886
  • search_sender
** Results differs
MG: 1883
PG: 3325
PG-OR: 1883
  • search_sender_cs
** Results differs
MG-CS: 1882
PG-CS: 1883
PG-OR-CS: 1882
  • get_list_size
** Results differs
MG: 166642
PG: 166672

For this last query, the difference in the result was expected nothing new there.

But what about the four queries returning different results, well as mentioned the PG{,-CS} queries are using a union of two queries to retrieve the information, so basically what we have is some duplicates. We could remove the duplicate in python easily using a set but as that will only impact the performance even more.

When we compare the results returned by MongoDB with the results returned by PostgreSQL using a or statement, they are similar. So we can conclude that the different queries are returning the same information.

Let's look at their performances now.

Performance

Results are presenting as a boxplot.

Legend:

  • CS in the title stands for Case Sensitive query (as opposed to case insensitive)
  • MG: results for MongoDB
  • PG: results for PostgreSQL
  • PG-OR: results for PostgreSQL using a or statement in the query (as opposed to a union)

Overview of the benchmark

This is a simplified boxplot (as in the outliers are not presented in the picture) but the full boxplot (with the outliers presented) is also available

Additionnal results:

I also looked at the influence of case sensitive vs case insensitive queries within a database system.

MongoDB

MongoDB sensitivity to case in queries

So apparently there is a 'case' effect, but looking at the y scale we can see that this difference is ~0.2 seconds. I believe this is negligible.

PostgreSQL

PostgreSQL sensitivity to case in queries

In this case the 'case' effect is much stronger and can go up to ~4 seconds which is much more significant for us.

Conclusions

Looking at the box plot

  • PostgreSQL performs worse than MongoDB in the retrieval queries but the time it takes is negligible (we are way under 1 second)
  • For PostgreSQL case sensitive queries perform better than MongoDB but worse for case-insensitive queries.
  • Case sensitivity in the queries has a greater influence in the performances of PostgreSQL than of MongoDB.
  • Or statement should be preferred to union of queries (nothing really surprising there).



So, do we want to perform case-insensitive queries and have 2 database systems to maintain or do we want an option to turn on case-insensitive and have one database system ? Or do we consider 7-8 seconds to be fine while searching in the content of 166000 emails ?


On a final note, everything I used to make the tests and the output of the said tests is available on my git and mirrored on my github.