MongoDB is a NoSQL database; more precisely, it is a document-oriented database, which allows a very flexible and extensible database schema. I have been playing with Virtuoso (Open-Source Edition) at work while working on the semantic web, but I had not yet had a chance to look at MongoDB.

As a dataset, I decided to look at mailman's archives. The idea was to load a whole lot of emails into a MongoDB database and see how I could use it and how well it would perform. In the back of my mind I have the idea that Máirín developed in her posts about a new archiver/interface for mailman's archives (if you haven't read them yet, go do this now!!).

I created a small project on my github called mongomail. In it, I retrieve all the archives from 2010, 2011 and 2012, load them into MongoDB and then run a few queries on the data.
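To give an idea of what the loading step can look like, here is a minimal sketch using only Python's standard library. It is not the actual mongomail code: the function names and the document fields (`from_email`, `domain`, `subject`, ...) are my own assumptions, and `collection` is assumed to be a pymongo collection (anything with an `insert_one` method works). It also assumes normal `user@domain` addresses; public mailman archives may obfuscate them (for example as `user at domain`), which would need extra handling.

```python
import mailbox
from email.utils import parseaddr, parsedate_to_datetime


def message_to_document(msg):
    """Turn one email message into a plain dict that MongoDB can store.

    The field names are hypothetical, chosen for this example only.
    """
    name, address = parseaddr(msg.get("From", ""))
    try:
        date = parsedate_to_datetime(msg.get("Date", ""))
    except (TypeError, ValueError):
        date = None  # missing or malformed Date header
    return {
        "message_id": msg.get("Message-ID"),
        "from_name": name,
        "from_email": address.lower(),
        "domain": address.rpartition("@")[2].lower(),
        "subject": msg.get("Subject", ""),
        "date": date,
    }


def load_mbox(path, collection):
    """Insert every message of an mbox file; return (loaded, skipped)."""
    loaded = skipped = 0
    for msg in mailbox.mbox(path):
        try:
            collection.insert_one(message_to_document(msg))
            loaded += 1
        except Exception:
            # Some emails cannot be stored; count them and move on.
            skipped += 1
    return loaded, skipped
```

With pymongo installed and a MongoDB server running, something like `load_mbox("2010-January.txt", MongoClient()["mongomail"]["devel"])` would fill the collection; the file, database and collection names here are only examples.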

As I found the output interesting, here is what I looked at:

Since January 1st 2010 on the development list of the Fedora project:

  • 35182 emails were sent

(actually more were sent but some emails could not be loaded into the database)

  • 1345 people emailed the list
  • 6367 different subjects were discussed
  • Adam Williamson is the most prolific writer with 1691 emails
  • Kevin Kofler is the second most prolific writer with 1474 emails
  • FamilleCollet.com sent 77 emails
  • pingoured.fr sent 78 emails
  • redhat.com sent 9733 emails

(this gives an idea of the ratio between the number of emails sent from @redhat.com addresses and by members of the community)

  • The thread that gathered the most people is: "[HEADS-UP] systemd for F14 - the next steps" with 302 emails
  • The second thread that gathered the most people is: "fedora mission (was Re: systemd and changes)" with 193 emails

All this information is gathered in about 4 seconds, quite nice :-)
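Numbers like the ones above can be obtained with a handful of queries. The following is only a sketch of how that might be done with pymongo, not the queries mongomail actually runs. It assumes each email is stored as a document with the hypothetical fields `from_email`, `domain` and `subject`, and that `collection` is a pymongo collection.

```python
def top_values_pipeline(field, limit=2):
    """Aggregation pipeline: count documents per value of `field`,
    largest count first, keeping only the first `limit` results."""
    return [
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


def summarize(collection):
    """Collect a few list statistics from a pymongo-style collection.

    The field names used here are assumptions made for this example.
    """
    return {
        # Total number of emails stored.
        "emails": collection.count_documents({}),
        # Number of distinct senders and subjects.
        "senders": len(collection.distinct("from_email")),
        "subjects": len(collection.distinct("subject")),
        # The most prolific writers.
        "top_writers": list(
            collection.aggregate(top_values_pipeline("from_email"))
        ),
        # Emails sent from one given domain.
        "redhat": collection.count_documents({"domain": "redhat.com"}),
    }
```

The same `top_values_pipeline` helper could be pointed at `subject` to get a rough idea of the busiest threads, although a real thread count would probably need the subjects to be normalized first (stripping "Re:" prefixes and the like).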