Mailman archives and mongoDB
By Pierre-Yves on Friday, March 16 2012, 14:31 - Général
The results of my fooling around with mailman archives and mongodb
English version
mongoDB is a NoSQL database, more precisely a document-oriented database, which allows a very flexible and extensible database schema. I have been playing with Virtuoso (Open Source Edition) at work while working on the semantic web, but I had not yet had a chance to look at mongoDB.
As a dataset I decided to look at mailman's archives. The idea was to load a whole lot of emails into the mongo database and see how I could use it and how well it would perform. In the back of my mind I have the idea that Máirín developed in her posts about a new archiver/interface for mailman's archives (if you haven't read them yet, go do so now!!).
I created a small project on my github called mongomail. In it I retrieve all the archives from 2010, 2011 and 2012, load them into mongoDB and then run a few queries on the data.
As I found the output interesting, here is what I have looked at:
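To give an idea of what the loading step can look like, here is a rough sketch. It is not the actual mongomail code: the field names (`email`, `domain`, `subject`, ...) and the two helper functions are made up for illustration, `collection` is assumed to be a pymongo collection (anything with an `insert_one` method will do), and the archives are assumed to be plain mbox files with non-obfuscated addresses.

```python
import mailbox
from email.utils import parseaddr, parsedate_to_datetime


def message_to_document(msg):
    """Turn one email message into a plain dict that mongoDB can store."""
    name, address = parseaddr(msg.get("From", ""))
    address = address.lower()
    try:
        date = parsedate_to_datetime(msg.get("Date", ""))
    except (TypeError, ValueError):
        date = None  # missing or malformed Date header
    payload = msg.get_payload(decode=True)  # None for multipart messages
    return {
        "message_id": msg.get("Message-ID"),
        "name": name,
        "email": address,
        "domain": address.rpartition("@")[2],
        "subject": msg.get("Subject", ""),
        "date": date,
        "body": payload.decode("utf-8", errors="replace") if payload else "",
    }


def load_mbox(path, collection):
    """Insert every message of an mbox file; return (loaded, skipped)."""
    loaded = skipped = 0
    for msg in mailbox.mbox(path):
        try:
            collection.insert_one(message_to_document(msg))
            loaded += 1
        except Exception:
            # Some messages have headers that cannot be parsed or stored;
            # this sketch counts them and moves on.
            skipped += 1
    return loaded, skipped
```

Storing the domain as its own field is a choice made here so that counting emails per domain later is a simple equality match.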
Since January 1st 2010 on the development list of the Fedora project:
- 35182 emails were sent
(actually more were sent but some emails could not be loaded into the database)
- 1345 people emailed the list
- 6367 different subjects were discussed
- Adam Williamson is the most prolific writer with 1691 emails
- Kevin Kofler is the second most prolific writer with 1474 emails
- FamilleCollet.com sent 77 emails
- pingoured.fr sent 78 emails
- redhat.com sent 9733 emails
(this gives an idea of the ratio between the number of emails sent by @redhat.com and by members of the community: about 28% of the emails loaded come from @redhat.com)
- The thread that gathered the most people is: "[HEADS-UP] systemd for F14 - the next steps" with 302 emails
- The second thread that gathered the most people is: "fedora mission (was Re: systemd and changes)" with 193 emails
All of this information is gathered in about 4 seconds, quite nice :-)
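For the curious, numbers like these can be expressed as fairly short mongoDB queries. The sketch below is one possible way to write them with pymongo and is not necessarily how mongomail does it. It assumes each stored document has `email`, `domain` and `thread` fields (hypothetical names), where `thread` is the subject with its reply markers stripped by the `thread_key` helper.

```python
import re

# Matches one or more leading "Re:" / "Fw:" / "Fwd:" markers.
_REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def thread_key(subject):
    """Strip leading reply/forward markers so replies share one key."""
    return _REPLY_PREFIX.sub("", subject or "").strip()


def count_by_pipeline(field, limit):
    """Aggregation pipeline: count emails per value of `field`, biggest first."""
    return [
        {"$group": {"_id": "$" + field, "emails": {"$sum": 1}}},
        {"$sort": {"emails": -1}},
        {"$limit": limit},
    ]


# With a live pymongo collection named `emails` (hypothetical), the figures
# above would map to calls such as:
#   emails.count_documents({})                         # emails sent
#   len(emails.distinct("email"))                      # people who wrote
#   emails.count_documents({"domain": "redhat.com"})   # emails from a domain
#   list(emails.aggregate(count_by_pipeline("email", 2)))    # top writers
#   list(emails.aggregate(count_by_pipeline("thread", 2)))   # biggest threads
```

Grouping on a normalised subject is only an approximation of a thread; following the `In-Reply-To` and `References` headers would be more accurate.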
Comments
What do I win?!