As I noted in the last post, my next goal is to use the mailman .txt archive to extract the required information.

The following is an extract of the .txt file. All messages are stored in this one .txt file.

From bckurera at fedoraproject.org Sun Apr 1 15:23:17 2012
From: bckurera at fedoraproject.org (Buddhike Kurera)
Date: Sun, 1 Apr 2012 20:53:17 +0530
Subject: [Important Notice] Regarding students' application submission
Message-ID:


Dear Students,

We believe that your are enjoying the Fedora project and getting ready
to GSoC with Fedora.

The only way to identify where a mail begins is that it starts with a From line: the separator line (From, then the address and the date), immediately followed by the From: header. The next challenge is getting the URL related to each mail, since no URL is recorded in the .txt file. Therefore my plan is to parse the .txt file and extract the required info, then parse the archive HTML page and extract the info from it, and finally link the two sets of extracted info. In that way it is possible to generate a summary of the list.
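The splitting step could be sketched roughly as follows in Python. This is only a minimal sketch, assuming every mail starts with a separator line shaped like the one in the extract above (From, address, weekday, month, day, time, year) and that line endings are plain newlines. The names split_messages and extract_info are my own hypothetical helpers, not part of mailman.

```python
import re
from email import message_from_string

# Assumption: a separator line looks like the one in the extract, e.g.
# "From bckurera at fedoraproject.org Sun Apr 1 15:23:17 2012".
SEPARATOR = re.compile(
    r"^From \S+(?: at \S+)? +\w{3} +\w{3} +\d{1,2} +\d{2}:\d{2}:\d{2} +\d{4}$",
    re.MULTILINE,
)

def split_messages(archive_text):
    """Cut the archive text into one chunk per mail (hypothetical helper)."""
    starts = [m.start() for m in SEPARATOR.finditer(archive_text)]
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(archive_text)
        chunks.append(archive_text[start:end])
    return chunks

def extract_info(chunk):
    """Return the headers of interest from one chunk (hypothetical helper)."""
    # Drop the separator line; the rest is ordinary headers plus body.
    _, _, rest = chunk.partition("\n")
    msg = message_from_string(rest)
    return {
        "from": msg["From"],
        "date": msg["Date"],
        "subject": msg["Subject"],
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],
        "references": msg["References"],
    }
```

Calling extract_info on every chunk returned by split_messages gives one small record per mail. Headers that are absent come back as None. A body line that happens to look exactly like a separator would be split wrongly, so the pattern is deliberately strict.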

However, if the concern is limited to the subject and the sender, it is easier to use the HTML archive page. If you also want to consider the date, you of course need to refer to the .txt file.

The interesting thing is that the link between mails is maintained using the Message-ID. Most of the time a new mail has no In-Reply-To: or References: header. A reply carries these headers, and they connect the reply to the original mail. Using this connection we can build the relationship between mails.
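Building that relationship could look something like the sketch below. It assumes each mail has already been reduced to a dictionary with the keys message_id, in_reply_to and references (hypothetical names chosen for this illustration), and that a mail whose parent is not in the archive is treated as the start of a thread.

```python
from collections import defaultdict

def build_threads(messages):
    """Group mails into threads by Message-ID (hypothetical helper).

    messages: list of dicts with 'message_id', 'in_reply_to', 'references'.
    Returns (roots, children): the IDs that start a thread, and a mapping
    from a parent ID to the IDs of its direct replies.
    """
    known = {m["message_id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("in_reply_to")
        # Fall back to the last ID listed in References, if any.
        if not parent and m.get("references"):
            parent = m["references"].split()[-1]
        if parent and parent in known:
            children[parent].append(m["message_id"])
        else:
            # New mail, or a reply whose original is not in this archive.
            roots.append(m["message_id"])
    return roots, dict(children)
```

Walking from each root through children then gives the full conversation for that mail.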
