<h1>Le blog de pingou - Tag - mongoDB</h1>
<p>Pingou's weblog, his Fedora news, his RPMs, his tests, his Linux... :-)</p>
<h2>PostgreSQL vs MongoDB</h2>
<p>By Pierre-Yves, published 2012-05-20. Category: General. Tags: Benchmark, Fedora, Fedora-planet, mongoDB, PostgreSQL, Python.</p>
<p><img src="https://blog.pingoured.fr/public/source.png" alt="source.png" /></p>
<p>A comparative test of PostgreSQL vs MongoDB</p> <p><strong><em>English version</em></strong></p>
<p>As you may know I have spent some time recently working on the <a href="https://fedorahosted.org/hyperkitty/">hyperkitty</a> program. The idea being to offer a new interface to the archives in mailman 3 (which has never been closer to a release).</p>
<p>Hyperkitty aims at implementing a number of the ideas developed by Máirín Duffy in <a href="http://blog.linuxgrrl.com/2012/02/29/7750-pixels-of-mailing-list-thread/">her</a> <a href="http://blog.linuxgrrl.com/2012/03/13/mailman-brainstorm/">blog</a> <a href="http://blog.linuxgrrl.com/2012/03/14/mailman-brainstorm-2/">posts</a>. The main one is to try to unify mailing lists and forums (i.e. providing a web interface to mailing lists).</p>
<p>In this quest, I started looking at <a href="http://www.mongodb.org/">MongoDB</a> some time ago. It is a NoSQL database which is becoming quite popular (probably helped by its integration into <a href="https://openshift.redhat.com/">openshift</a>).
The results were satisfying but then a big question came up:</p>
<ul>
<li>Do we really want to impose the burden of 2 different database systems on our sysadmins for a mailman archives interface?</li>
</ul>
<p>Of course, if we can avoid it, we would prefer to.</p>
<p>But MongoDB was performing really nicely with regard to searching the archives. So testing was needed.</p>
<h3>Hardware</h3>
<p>The machine on which I ran the test has:</p>
<ul>
<li>16 GB of RAM</li>
<li>4 cores</li>
<li>2 x 1 TB disks in RAID 1</li>
</ul>
<p>All of this runs RHEL 6.2 (Santiago).</p>
<h3>The databases</h3>
<ul>
<li>PostgreSQL version 8.4.9</li>
<li>MongoDB version 1.8.2</li>
</ul>
<h3>The data</h3>
<p>I used the archives from the <a href="https://lists.fedoraproject.org/mailman/listinfo/devel">devel</a> mailing list, since its creation in 2002.</p>
<ul>
<li>PostgreSQL loaded 166672 emails</li>
<li>MongoDB loaded 166642 emails</li>
</ul>
<p>So there is a difference of 30 emails which I considered to be negligible for the tests.</p>
<h3>The data structure</h3>
<h4>PostgreSQL</h4>
<p>PostgreSQL has one table per list and each table has the same (following) structure:</p>
<pre> id serial NOT NULL,
sender character varying(100) NOT NULL,
email character varying(75) NOT NULL,
subject text NOT NULL,
"content" text NOT NULL,
date timestamp without time zone,
message_id character varying(150) NOT NULL,
stable_url_id character varying(250) NOT NULL,
thread_id character varying(150) NOT NULL,
"references" text,</pre>
<p>Primary key: id</p>
<p>Index: date, message_id, stable_url_id, subject, thread_id</p>
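<p>A minimal sketch of how such a per-list table and its indexes could be created. The column definitions and the indexed columns are the ones listed above; the table name <code>devel</code>, the helper <code>build_schema</code> and the index names are assumptions made for illustration only.</p>

```python
# Columns exactly as listed in the post.
COLUMNS = [
    ('id', 'serial NOT NULL'),
    ('sender', 'character varying(100) NOT NULL'),
    ('email', 'character varying(75) NOT NULL'),
    ('subject', 'text NOT NULL'),
    ('"content"', 'text NOT NULL'),
    ('date', 'timestamp without time zone'),
    ('message_id', 'character varying(150) NOT NULL'),
    ('stable_url_id', 'character varying(250) NOT NULL'),
    ('thread_id', 'character varying(150) NOT NULL'),
    ('"references"', 'text'),
]
# Indexed columns as listed in the post.
INDEXED = ['date', 'message_id', 'stable_url_id', 'subject', 'thread_id']


def build_schema(table):
    """Return the SQL statements creating one list table and its indexes."""
    cols = ',\n'.join('    %s %s' % (name, ctype) for name, ctype in COLUMNS)
    statements = [
        'CREATE TABLE %s (\n%s,\n    PRIMARY KEY (id)\n);' % (table, cols)
    ]
    for column in INDEXED:
        # Index names such as devel_date_idx are hypothetical.
        statements.append(
            'CREATE INDEX %s_%s_idx ON %s (%s);' % (table, column, table, column)
        )
    return statements


if __name__ == '__main__':
    # 'devel' is an assumed table name: one table per mailing list.
    for statement in build_schema('devel'):
        print(statement)
```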
<h4>MongoDB</h4>
<p>MongoDB, being a NoSQL database that stores dictionary-like documents, has a much more flexible structure.
For the test the same information was stored in a dictionary structure:</p>
<ul>
<li>Content</li>
<li>InReplyTo</li>
<li>From</li>
<li>Subject</li>
<li>ThreadID</li>
<li>Date</li>
<li>References</li>
<li>_id (internal id defined by MongoDB)</li>
<li>MessageID</li>
<li>Email</li>
</ul>
<p>UPDATE: The fields which are searched are indexed, so the following indexes are generated:</p>
<ul>
<li>Content</li>
<li>From</li>
<li>Subject</li>
<li>ThreadID</li>
<li>Date</li>
<li>References</li>
<li>MessageID</li>
<li>Email</li>
</ul>
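<p>To make the dictionary structure concrete, here is a small sketch that turns a raw email into a document with the field names listed above, using only the Python standard library. The sample message, the helper <code>email_to_document</code>, the mapping of <code>From</code> to the sender's name and <code>Email</code> to the address, and the way the thread id is passed in are all assumptions; the loader actually used for the benchmark may differ.</p>

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# A made-up message used only for illustration.
RAW = """\
From: Jane Doe <jane@example.com>
Subject: Test message
Date: Sun, 20 May 2012 11:15:00 +0100
Message-ID: <abc123@example.com>
In-Reply-To: <parent@example.com>
References: <parent@example.com>

Hello list!
"""

# Fields that the post says are indexed.
INDEXED_FIELDS = ['Content', 'From', 'Subject', 'ThreadID', 'Date',
                  'References', 'MessageID', 'Email']


def email_to_document(raw, thread_id):
    """Turn a raw email into a dictionary using the field names listed above."""
    msg = message_from_string(raw)
    name, address = parseaddr(msg['From'])
    return {
        'Content': msg.get_payload(),
        'InReplyTo': msg['In-Reply-To'],
        'From': name,
        'Subject': msg['Subject'],
        'ThreadID': thread_id,  # how threads are identified is an assumption
        'Date': parsedate_to_datetime(msg['Date']),
        'References': msg['References'],
        'MessageID': msg['Message-ID'],
        'Email': address,
    }


if __name__ == '__main__':
    print(email_to_document(RAW, 'thread-1'))
```

<p>With a MongoDB driver, such a dictionary would be inserted into a collection and one index created per entry of <code>INDEXED_FIELDS</code>; those calls are left out so the sketch runs without a server.</p>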
<h3>The queries</h3>
<p>11 different types of queries were run:</p>
<ul>
<li>get_thread_length: return how many emails are part of a thread</li>
<li>get_thread_participants: return the list of all the participants in a thread</li>
<li>get_email: return a given email based on its message_id</li>
<li>first_email_in_archive_range: for a time range, return the first email (ordered by date)</li>
<li>get_archives_range: return all the emails in a time range</li>
<li>get_archives_length: return the years and months covered since the creation of the list</li>
<li>get_list_size: return how many emails are archived for this list</li>
<li>search_subject: performs a search on the subject of the emails</li>
<li>search_content: performs a search on the content of the emails</li>
<li>search_content_subject: performs a search on the subject and the content of the emails</li>
<li>search_sender: performs a search on the sender of the emails (name and email)</li>
</ul>
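<p>A few of the retrieval queries above can be sketched as follows. This is only an illustration: it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere, with a reduced set of columns, an assumed table name <code>devel</code> and made-up rows. The function names are the ones from the list, but their bodies are my guess at what such queries look like, not the benchmark code itself.</p>

```python
import sqlite3


def make_demo_db():
    """Create an in-memory stand-in database with a few made-up emails."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE devel (id INTEGER PRIMARY KEY, sender TEXT, email TEXT, '
        'subject TEXT, date TEXT, message_id TEXT, thread_id TEXT)'
    )
    rows = [
        ('Alice', 'alice@example.com', 'Hello', '2012-01-05 10:00:00', 'm1', 't1'),
        ('Bob', 'bob@example.com', 'Re: Hello', '2012-01-06 09:30:00', 'm2', 't1'),
        ('Alice', 'alice@example.com', 'Re: Hello', '2012-02-01 08:00:00', 'm3', 't1'),
        ('Carol', 'carol@example.com', 'Other topic', '2012-02-02 12:00:00', 'm4', 't2'),
    ]
    conn.executemany(
        'INSERT INTO devel (sender, email, subject, date, message_id, thread_id) '
        'VALUES (?, ?, ?, ?, ?, ?)', rows)
    return conn


def get_thread_length(conn, thread_id):
    """Return how many emails are part of a thread."""
    cur = conn.execute(
        'SELECT COUNT(*) FROM devel WHERE thread_id = ?', (thread_id,))
    return cur.fetchone()[0]


def get_thread_participants(conn, thread_id):
    """Return the distinct senders of a thread, sorted by name."""
    cur = conn.execute(
        'SELECT DISTINCT sender FROM devel WHERE thread_id = ? ORDER BY sender',
        (thread_id,))
    return [row[0] for row in cur]


def get_email(conn, message_id):
    """Return (sender, subject) of the email with the given message_id."""
    cur = conn.execute(
        'SELECT sender, subject FROM devel WHERE message_id = ?', (message_id,))
    return cur.fetchone()


def get_archives_range(conn, start, end):
    """Return the message ids of all emails with start <= date < end."""
    cur = conn.execute(
        'SELECT message_id FROM devel WHERE date >= ? AND date < ? ORDER BY date',
        (start, end))
    return [row[0] for row in cur]
```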
<p>Each query was run 30 times, thus allowing caching to take place.</p>
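<p>Repeating a query and recording each run can be sketched like this. The helper <code>time_query</code> is a hypothetical name and the timing method (<code>time.perf_counter</code>) is an assumption; the benchmark scripts may measure time differently.</p>

```python
import time


def time_query(func, repeats=30):
    """Call func `repeats` times and return the list of elapsed seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == '__main__':
    # Any callable can be timed; a trivial one is used here as a placeholder.
    runs = time_query(lambda: sum(range(100000)))
    print('first run: %.6fs, fastest run: %.6fs' % (runs[0], min(runs)))
```

<p>Keeping every individual timing, rather than only the mean, is what makes it possible to draw a boxplot of the runs afterwards.</p>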
<p>The four search queries were run using both case-insensitive and case-sensitive queries.</p>
<p>One addition for PostgreSQL:</p>
<ul>
<li>the search_sender and search_content_subject queries were run twice: once using a union of two queries and once using a single query with an <strong>or</strong> statement</li>
</ul>
<h3>Results</h3>
<h4>Output</h4>
<p>The first thing we want to know is:</p>
<ul>
<li>Did the queries return the same results?</li>
</ul>
<p>The following queries returned the same result:</p>
<ul>
<li>get_email</li>
<li>get_archives_range</li>
<li>first_email_in_archives_range</li>
<li>get_thread_length</li>
<li>get_thread_participants</li>
<li>get_archives_length</li>
<li>search_subject</li>
<li>search_subject_cs</li>
<li>search_content</li>
<li>search_content_cs</li>
</ul>
<p>The other ones returned different results:</p>
<ul>
<li>search_content_subject</li>
</ul>
<pre>** Results differs
MG: 28762
PG: 34110
PG-OR: 28762</pre>
<ul>
<li>search_content_subject_cs</li>
</ul>
<pre>** Results differs
MG-CS: 24886
PG-CS: 28578
PG-OR-CS: 24886</pre>
<ul>
<li>search_sender</li>
</ul>
<pre>** Results differs
MG: 1883
PG: 3325
PG-OR: 1883</pre>
<ul>
<li>search_sender_cs</li>
</ul>
<pre>** Results differs
MG-CS: 1882
PG-CS: 1883
PG-OR-CS: 1882</pre>
<ul>
<li>get_list_size</li>
</ul>
<pre>** Results differs
MG: 166642
PG: 166672</pre>
<p>For this last query, the difference in the result was expected, so nothing new there.</p>
<p>But what about the four search queries returning different results? As mentioned, the PG{,-CS} queries use a union of two queries to retrieve the information, so basically what we have is some duplicates. We could easily remove the duplicates in Python using a <code>set</code>, but that would only hurt the performance even more.</p>
<p>When we compare the results returned by MongoDB with the results returned by PostgreSQL using an <strong>or</strong> statement, the counts are identical.
So we can conclude that the different queries are returning the same information.</p>
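<p>The duplicate effect can be reproduced with a small sketch. It again uses an in-memory SQLite database as a stand-in for PostgreSQL, with made-up rows, and it assumes that the two result lists are simply concatenated; how the benchmark actually combined the two queries is not shown in this post.</p>

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE devel (id INTEGER PRIMARY KEY, sender TEXT, email TEXT)')
conn.executemany('INSERT INTO devel (sender, email) VALUES (?, ?)', [
    ('pingou', 'pingou@example.org'),       # matches on both name and address
    ('Pierre-Yves', 'pingou@example.org'),  # matches on the address only
    ('Someone Else', 'else@example.org'),   # no match
])


def search_sender_two_queries(conn, term):
    """Run one query per column and concatenate the two result lists."""
    pattern = '%' + term + '%'
    by_name = conn.execute(
        'SELECT id FROM devel WHERE sender LIKE ?', (pattern,)).fetchall()
    by_email = conn.execute(
        'SELECT id FROM devel WHERE email LIKE ?', (pattern,)).fetchall()
    return by_name + by_email


def search_sender_or(conn, term):
    """Run a single query combining both conditions with OR."""
    pattern = '%' + term + '%'
    return conn.execute(
        'SELECT id FROM devel WHERE sender LIKE ? OR email LIKE ?',
        (pattern, pattern)).fetchall()


if __name__ == '__main__':
    combined = search_sender_two_queries(conn, 'pingou')
    print(len(combined), len(set(combined)), len(search_sender_or(conn, 'pingou')))
```

<p>A row matching on both columns shows up once per query in the concatenated list, which is why that count is higher, while the <strong>or</strong> version returns each row once.</p>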
<p>Let's look at their performance now.</p>
<h4>Performance</h4>
<p>Results are presented as a boxplot.</p>
<p>Legend:</p>
<ul>
<li><strong>CS </strong>in the title stands for <strong>C</strong>ase <strong>S</strong>ensitive query (as opposed to case insensitive)</li>
<li>MG: results for MongoDB</li>
<li>PG: results for PostgreSQL</li>
<li>PG-OR: results for PostgreSQL using an <strong>or</strong> statement in the query (as opposed to a union)</li>
</ul>
<p><a href="http://pingoured.fr/public/kittybenchmark/overview_simplified.png" target="_new"><img src="http://pingoured.fr/public/kittybenchmark/overview_simplified.png" alt="Overview of the benchmark" width="500" /></a></p>
<p>This is a simplified boxplot (the outliers are not shown in the picture), but the full boxplot (with the outliers) is also <a href="http://pingoured.fr/public/kittybenchmark/overview.png" target="_new">available</a>.</p>
<h4>Additional results:</h4>
<p>I also looked at the influence of case sensitive vs case insensitive queries within a database system.</p>
<h5>MongoDB</h5>
<p><a href="http://pingoured.fr/public/kittybenchmark/mg_sensitivity.png" target="_new"><img src="http://pingoured.fr/public/kittybenchmark/mg_sensitivity.png" alt="MongoDB sensitivity to case in queries" width="500" /></a></p>
<p>So apparently there is a 'case' effect, but looking at the y scale we can see that this difference is ~0.2 seconds. I believe this is negligible.</p>
<h5>PostgreSQL</h5>
<p><a href="http://pingoured.fr/public/kittybenchmark/pg_sensitivity.png" target="_new"><img src="http://pingoured.fr/public/kittybenchmark/pg_sensitivity.png" alt="PostgreSQL sensitivity to case in queries" width="500" /></a></p>
<p>In this case the 'case' effect is much stronger and can go up to ~4 seconds which is much more significant for us.</p>
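<p>To make the case-sensitive versus case-insensitive distinction concrete, here is a sketch of how each kind of search could be expressed. MongoDB's <code>$regex</code> operator with the <code>i</code> option and PostgreSQL's <code>LIKE</code>/<code>ILIKE</code> operators do exist, but whether the benchmark used exactly these forms is an assumption, and both helper names are hypothetical.</p>

```python
import re


def mongo_search_filter(field, term, case_sensitive=False):
    """Build a MongoDB filter document matching `term` anywhere in `field`."""
    # re.escape follows Python's regex rules, which is assumed to be close
    # enough to MongoDB's regex dialect for plain search terms.
    regex = {'$regex': re.escape(term)}
    if not case_sensitive:
        regex['$options'] = 'i'
    return {field: regex}


def pg_search_clause(column, case_sensitive=False):
    """Build a WHERE clause fragment with a psycopg2-style %s placeholder."""
    operator = 'LIKE' if case_sensitive else 'ILIKE'
    return '%s %s %%s' % (column, operator)


if __name__ == '__main__':
    print(mongo_search_filter('Subject', 'systemd'))
    print(pg_search_clause('subject'))
    print(pg_search_clause('subject', case_sensitive=True))
```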
<h3>Conclusions</h3>
<p>Looking at the boxplot:</p>
<ul>
<li>PostgreSQL performs worse than MongoDB in the retrieval queries, but the time it takes is negligible (we are way under 1 second).</li>
<li>For case-sensitive searches PostgreSQL performs better than MongoDB, but for case-insensitive searches it performs worse.</li>
<li>Case sensitivity in the queries has a greater influence on the performance of PostgreSQL than on that of MongoDB.</li>
<li>An <strong>or</strong> statement should be preferred to a union of queries (nothing really surprising there).</li>
</ul>
<p><br /><br />So, do we want to perform case-insensitive queries and have 2 database systems to maintain, or do we want an option to turn on case-insensitive search and have one database system? Or do we consider 7-8 seconds to be fine while searching in the content of 166000 emails?</p>
<p><br />On a final note, everything I used to make the tests and the output of the said tests is available on <a href="http://ambre.pingoured.fr/cgit/mm_benchmark.git/">my git</a> and mirrored on <a href="https://github.com/pypingou/kittybenchmark">my github</a>.</p>
<h2>Mailman archives and mongoDB</h2>
<p>By Pierre-Yves, published 2012-03-16. Category: General. Tags: Fedora, Fedora-planet, mongoDB, Python, Semantic Web, virtuoso.</p>
<p><img src="https://blog.pingoured.fr/public/source.png" alt="source.png" /></p>
<p>The results of my fooling around with mailman archives and mongodb</p> <p><strong><em>English version</em></strong></p>
<p><a href="http://www.mongodb.org/">mongoDB</a> is a NoSQL database, meaning it is a document-oriented database allowing a very flexible and extensible database schema. I have been playing with <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/">Virtuoso (Open-Source Edition)</a> at work while working on the <a href="http://en.wikipedia.org/wiki/Semantic_Web">semantic web</a>, but I had not had a chance to look at mongoDB.</p>
<p>As a dataset I decided to look at mailman's archives. The idea was to load a whole lot of emails into the mongo database and see how I could use it and how well it would perform. In the back of my mind I have the idea that <a href="http://blog.linuxgrrl.com/">Máirín</a> developed in her <a href="http://blog.linuxgrrl.com/category/fedora/mailing-list-improvements/">posts about a new archiver/interface for mailman's archive</a> (if you haven't read them yet, go do this now!!).</p>
<p>I created a small project on my <a href="http://github.com/pypingou">github</a> called <a href="https://github.com/pypingou/mongomail">mongomail</a>. In it I retrieve all the archives from 2010, 2011 and 2012, load them into mongoDB and then run certain queries on the data.</p>
<p>As I found the output to be interesting, here is what I have looked at:</p>
<p>Since January 1st 2010 on the development list of the Fedora project:</p>
<ul>
<li>35182 emails were sent</li>
</ul>
<p>(actually more were sent but some emails could not be loaded into the database)</p>
<ul>
<li>1345 people emailed the list</li>
<li>6367 different subjects were discussed</li>
<li>Adam Williamson is the most prolific writer with 1691 emails</li>
<li>Kevin Kofler is the second most prolific writer with 1474 emails</li>
<li>FamilleCollet.com sent 77 emails</li>
<li>pingoured.fr sent 78 emails</li>
<li>redhat.com sent 9733 emails</li>
</ul>
<p>(this gives an idea of the ratio between the number of emails sent by @redhat.com and by members of the community)</p>
<ul>
<li>The thread that gathered the most people is: "[HEADS-UP] systemd for F14 - the next steps" with 302 emails</li>
<li>The thread that gathered the second most people is: "fedora mission (was Re: systemd and changes)" with 193 emails</li>
</ul>
<p>All this information is gathered in about 4 seconds, quite nice :-)</p>
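<p>The kind of statistics listed above can be sketched in plain Python with <code>collections.Counter</code>. The sample data is made up, the field names follow the dictionary structure described in the benchmark post, and <code>archive_stats</code> is a hypothetical helper: mongomail itself runs its queries against mongoDB rather than on an in-memory list.</p>

```python
from collections import Counter

# Made-up emails used only for illustration.
EMAILS = [
    {'From': 'Alice', 'Email': 'alice@redhat.com', 'Subject': 'systemd', 'ThreadID': 't1'},
    {'From': 'Bob', 'Email': 'bob@example.org', 'Subject': 'Re: systemd', 'ThreadID': 't1'},
    {'From': 'Alice', 'Email': 'alice@redhat.com', 'Subject': 'Re: systemd', 'ThreadID': 't1'},
    {'From': 'Carol', 'Email': 'carol@example.org', 'Subject': 'Re: systemd', 'ThreadID': 't1'},
    {'From': 'Alice', 'Email': 'alice@redhat.com', 'Subject': 'mission', 'ThreadID': 't2'},
    {'From': 'Bob', 'Email': 'bob@example.org', 'Subject': 'Re: mission', 'ThreadID': 't2'},
]


def archive_stats(emails):
    """Compute a few summary statistics over a list of email dictionaries."""
    senders = Counter(e['From'] for e in emails)
    domains = Counter(e['Email'].split('@')[-1] for e in emails)
    threads = Counter(e['ThreadID'] for e in emails)
    return {
        'emails': len(emails),
        'people': len(senders),
        'subjects': len({e['Subject'] for e in emails}),
        'top_senders': senders.most_common(2),
        'domains': domains,
        'largest_thread': threads.most_common(1)[0],
    }


if __name__ == '__main__':
    for key, value in archive_stats(EMAILS).items():
        print(key, value)
```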