Reading in Files with Meaningful Whitespace (Python)

Sean Steinle :

I'm trying to read in files from the released Enron dataset for a data science project, and my problem lies in how I read them. The first 15 or so lines of every email are information about the email itself: to, from, subject, etc. So you would think you could just read in the first 15 lines and store them in an array. The problem is that I'm trying to use leading whitespace in my algorithm to spot wrapped lines, and a field can wrap many times: sometimes the "To" field alone takes up something like 50 lines.

Example of a (slightly truncated) troublesome email:

Message-ID: <29403111.1075855665483.JavaMail.evans@thyme>
Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)
From: [email protected]
To: [email protected], [email protected], [email protected], 
    [email protected], [email protected], 
    [email protected], [email protected], 
    [email protected], collee [email protected], 
    [email protected], [email protected]
Subject: Final Filed Version -- SDG&E Comments

My code:

def readEmailHead(username, emailNum):
    text = ""
    file = open(corpus_root + username + '/all_documents/' + emailNum)
    for line in file:
        text += line
    file.close()
    email = text.split('\n')
    count = 0
    for line in email:
        mem = []
        if line == '':
            pass
        else:
            if line[0].isspace():
                print(line,count)
                email[count-1] += line
                del email[count]
        count += 1
    return [email[:20]]

Right now it can handle emails with one extra line in the Subject/To/From/etc. field, but no more than that. Any ideas?
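
For context: the wrapped lines follow the header-folding convention of RFC 5322, where any line that starts with a space or tab continues the field above it. The loop above deletes from `email` while iterating over it, so the line right after each deleted one is skipped, which would explain why a second continuation line is missed. A minimal hand-rolled sketch that builds a new list instead might look like the following; the function name `unfold_header_lines` and the sample lines are made up for illustration.

```python
def unfold_header_lines(lines):
    """Join continuation lines (starting with a space or tab) onto the previous line."""
    unfolded = []
    for line in lines:
        if line == '':
            break  # a blank line ends the header block
        if line[0] in ' \t' and unfolded:
            # Continuation line: append it to the field it belongs to.
            unfolded[-1] += ' ' + line.strip()
        else:
            unfolded.append(line)
    return unfolded


# Hypothetical sample, not taken from the Enron corpus.
sample = [
    'From: alice@example.com',
    'To: bob@example.com,',
    '    carol@example.com,',
    '    dave@example.com',
    'Subject: Test',
    '',
    'Body text',
]
print(unfold_header_lines(sample))
```

Because nothing is removed from the list being looped over, any number of consecutive continuation lines ends up on the same field.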

FredrikHedman :

No need to reinvent the wheel. The email.parser module can be your friend. To parse just the header, you can use the built-in parser and write a function like the one below, which also builds the file name in a more portable way:

import email.parser
import os.path


def read_email_header(username, email_number, corpus_root='~/tmp/data/enron'):
    corpus_root = os.path.expanduser(corpus_root)
    fname = os.path.join(corpus_root, username, 'all_documents', email_number)
    with open(fname, 'rb') as fd:
        header = email.parser.BytesHeaderParser().parse(fd)
    return header


mm = read_email_header('dasovich-j', '13078.')

print(mm.keys())
print(mm['Date'])
print(mm['From'])
print(mm['To'].split())
print(mm['Subject'])

Running this gives:

['Message-ID', 'Date', 'From', 'To', 'Subject', 'Mime-Version', 'Content-Type', 'Content-Transfer-Encoding', 'X-From', 'X-To', 'X-cc', 'X-bcc', 'X-Folder', 'X-Origin', 'X-FileName']
Fri, 25 May 2001 02:50:00 -0700 (PDT)
[email protected]
['[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected]']
Reuters -- FERC told Calif natgas to reach limit this summer
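
Note that `mm['To'].split()` leaves a trailing comma on most addresses, as the output shows. One possible way to get clean addresses is `email.utils.getaddresses` from the standard library, which parses address lists and also copes with folded fields. The sketch below is self-contained: the sample message and the `recipients` helper are made up for illustration and are not part of the Enron corpus.

```python
import email.parser
import email.utils

# Hypothetical sample message with a folded (multi-line) To field.
RAW = (
    b"From: alice@example.com\n"
    b"To: bob@example.com, carol@example.com,\n"
    b"\tdave@example.com\n"
    b"Subject: Test\n"
    b"\n"
    b"Body text\n"
)


def recipients(header):
    """Return the bare addresses found in all To fields of a parsed header."""
    values = header.get_all('To', [])
    # getaddresses yields (real name, address) pairs; keep only the address.
    return [addr for _name, addr in email.utils.getaddresses(values)]


header = email.parser.BytesHeaderParser().parsebytes(RAW)
print(recipients(header))
```

The same helper should work on the `mm` object returned by `read_email_header`, since both are message objects produced by the same parser.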

