How to use regular expressions?

Part of a data scientist's mission is to manipulate large amounts of data, and sometimes that data includes large text corpora. We could read through such text manually, but we can also harness the power of Python. After all, code exists to automate tasks.

Even so, writing a script from scratch takes a lot of time and effort. This is where regular expressions come in. Regular expressions, also known as RE, regex, or regex patterns, are a compact language that lets us quickly screen and analyze text. Regular expressions date back to 1956, when Stephen Cole Kleene created the notation to describe the McCulloch-Pitts model of neural networks. In the 1960s, Ken Thompson built regular expression matching into the QED text editor, and since then regular expressions have gone from strength to strength.

A key feature of regular expressions is how economical they make your code; you can think of them as a shortcut. Without them, we would have to write much more code to achieve the same functionality. Regular expressions are also handy when writing web crawlers. For this tutorial you need a code editor, such as Visual Studio Code, PyCharm or Atom (the Anaconda distribution is recommended). It also helps to know a little about the basics of pandas so you don't get lost as we decipher each line of code. If you need to learn pandas, you can refer to this tutorial: pandas tutorial

  • Introducing our dataset
    We will use the Fraudulent Email Corpus from Kaggle. It contains thousands of phishing emails sent between 1998 and 2007. These emails are interesting to read. We'll first learn basic regex commands using a single email, and then we'll work with the entire corpus.

  • Introducing Python's Regular Expressions Module
    First, prepare the dataset: open the text file in read mode and read its contents. We assign the result to the variable fh (for "file handle"); after .read(), fh holds the file's contents as a single string.

fh = open(r"test_emails.txt", "r").read()

Now, suppose we want to know the sender of these emails. We can try to do it with just plain Python:

for line in fh.split("\n"):
    if "From:" in line:
        print(line)

Regular expressions can also be used:

import re

for line in re.findall("From:.*", fh):
    print(line)

The result is the same.
Let's decipher this code. We first imported Python's re module, then wrote the line that does the work. In this simple example the code is only one line shorter than the plain Python version, but as tasks grow more complex, regular expressions keep your scripts simple and economical.

re.findall() returns a list of all the substrings in a string that match its pattern. It is one of the most commonly used functions in Python's built-in re module. Let's break it down: the function has the form re.findall(pattern, string) and takes two arguments, where pattern is the pattern we want to find and string is the text we search in. The string can span many lines.
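For instance, on a made-up string (not from our corpus), every match comes back in a list:

print(re.findall("cat", "cat, catalog, concatenate"))
# ['cat', 'cat', 'cat']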

.* is built from regex shorthand characters, which we will explain in detail shortly. For now, just know that together they match the name and email address in the From: field.

  • Common Regular Expression Patterns
    The pattern we used in re.findall() above contained a fully spelled-out string, From:. This is useful when we know exactly which letters, and which case, we are looking for. When we don't know the exact form of the string we want, we would be stuck. Fortunately, regular expressions provide basic patterns for these situations. Let's look at the ones used in this tutorial:

  • \w matches word characters, i.e. a-z, A-Z, 0-9, and the underscore _ (but not the hyphen -)

  • \d matches digits, i.e. 0-9
  • \s matches whitespace characters, including tabs, newlines, carriage returns, and spaces
  • \S matches non-whitespace characters
  • . matches any character except newline \n

    Armed with these regular expression patterns, you'll find it easy to follow as we continue to explain the code. A quick demonstration of the patterns follows.
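    Here is a quick, made-up demonstration of these patterns:

import re

print(re.findall(r"\w", "a_b-c"))            # word characters: ['a', '_', 'b', 'c']
print(re.findall(r"\d", "Ref 42, item 7"))   # digits: ['4', '2', '7']
print(re.findall(r"\s", "a b\tc"))           # whitespace: [' ', '\t']
print(re.findall(r"\S", "a b"))              # non-whitespace: ['a', 'b']
print(re.findall(r".", "a\nb"))              # any character except newline: ['a', 'b']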

  • Using regular expression patterns
    * matches 0 or more instances of the pattern to its left, which makes it useful for matching repeated patterns. By default this kind of repetition is "greedy": it matches as much text as possible. A "non-greedy" or "lazy" match, written by adding ? after the repetition, matches as little as possible instead; we'll see the difference in a small example shortly.
for line in re.findall("From:.*", fh):
    print(line)

Because * matches 0 or more instances of the pattern to its left, and . is to its left, we get every character in the From: field up to the end of the line. That prints each entire line with concise code.
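To see the difference between greedy and lazy matching, here is a made-up example: the greedy <.*> grabs as much as it can, while the lazy <.*?> (with ? added) stops at the first closing bracket.

import re

text = "<a> and <b>"

print(re.findall(r"<.*>", text))     # greedy: ['<a> and <b>']
print(re.findall(r"<.*?>", text))    # lazy: ['<a>', '<b>']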

We can even go a step further and extract just the names.

match = re.findall("From:.*", fh)

for line in match:
    print(re.findall("\".*\"", line))

After the first quote is matched, .* gets all the characters on the line before the next quote. Of course, the next quote in the pattern is also escaped. This allows us to get the name in quotes. The output of each name is shown in square brackets because re.findall returns the matching results as a list.

What if we want to get an email address?

match = re.findall("From:.*", fh)

for line in match:
    print(re.findall("\w\S*@.*\w", line))

Email addresses always contain an @ sign, so let's start with that. The part of the email address before the @ sign may contain alphanumeric characters, which means \w is required. However, since some email addresses contain periods or hyphens, this is not enough. We added \S to find non-whitespace characters. But \w\S only gets two characters, so add * to repeat the search. So the pattern before the @ sign is \w\S*@. Next look at the part after the @ symbol.
Email addresses end with an alphanumeric character, so we end the pattern with \w. So the part after the @ sign is .*\w: a run of characters of any type that ends in an alphanumeric character. This rules out the trailing >. The full email address pattern is therefore \w\S*@.*\w.

  • Common Regular Expression Functions
    re.findall() is undoubtedly very useful, and the re module provides other equally convenient functions, including:
    • re.search()
      re.findall() matches all instances of a pattern in a string and returns them as a list, while re.search() matches only the first instance of a pattern in a string and returns it as a re match object.
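      As a small sketch of the difference (using the fh string we read earlier):

match = re.search("From:.*", fh)

print(type(match))      # a re match object, e.g. <class 're.Match'>
print(match.group())    # the text of the first From: line only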
    • re.split()
      Suppose we need a fast way to get the domain name of an email address. We can do it with three regular expression operations, as follows:
address = re.findall("From:.*", fh)
for item in address:
    for line in re.findall("\w\S*@.*\w", item):
        username, domain_name = re.split("@", line)
        print("{}, {}".format(username, domain_name))

The first line is familiar: we get back a list of strings, each containing the From: field, and assign it to a variable. Next we traverse the list looking for email addresses, and for each address we use the re module's split() function to cut it in two, with the @ symbol as the delimiter. Finally, we print the result.

  • re.sub()
    re.sub() is another useful re function. As the name suggests, it replaces part of a string. For example:
sender = re.search("From:.*", fh)
address = sender.group()
email = re.sub("From", "Email", address)
print(address)
print(email)

We have seen the tasks in the first and second lines before. On the third line we apply re.sub() to address, which holds the full From: field from the email header.

re.sub() takes three arguments: the pattern to search for, the string to replace it with, and the string to operate on.
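In other words, the call has the form re.sub(pattern, replacement, string). A toy example on a made-up string:

print(re.sub("blue", "red", "one blue car and one blue bike"))
# one red car and one red bike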

  • Regular Expressions in pandas
    Now that we have the basics of regular expressions, we can try out some more advanced features. However, we need to combine regular expressions with the pandas Python data analysis library. Pandas is very useful in organizing data into neat tables (also known as dataframes), and it also allows us to understand data from different perspectives. Combined with the economical and concise code of regular expressions, it's like cutting butter with a sharp knife - simple and neat.
  • Organizing emails using regular expressions and pandas
    Our corpus is a single text file containing thousands of emails. We will use regular expressions and pandas to organize parts of each email into appropriate categories to make reading and analysis of this corpus easier.

    • sender_name (sender name)
    • sender_address (sender address)
    • recipient_address (recipient address)
    • recipient_name (recipient name)
    • date_sent (sent time)
    • subject
    • email_body (email body)
      where each category becomes a column in our pandas dataframe or table. This can be useful because it allows us to manipulate each column itself. For example, this allows us to write code to find out which domains those emails are from without first writing code to separate the email address from the rest. Essentially, categorizing the important parts of our dataset allows us to obtain fine-grained information later with much cleaner code. Concise code, in turn, reduces the number of operations our machines have to perform, which can speed up our analysis process, especially when working with large datasets.
      First, at the top of the script, we import re and pandas following standard convention. We also import Python's email package, which we need specifically for processing the email bodies. Handling the body with regular expressions alone would be quite complicated, and might even require a separate tutorial to explain. So we save time by using the well-developed email package and keep our focus on learning regular expressions.

Next we create an empty list, emails, to store dictionaries. Each dictionary will contain the details of one email.

We often print the results of our code to the screen to see whether the code is right or wrong. However, since there are thousands of emails in the dataset, doing so would print thousands of lines and bloat this tutorial, and we definitely don't want to keep scrolling through thousands of rows of results. So, as we did at the beginning of this tutorial, we open and read a shortened version of the corpus that we prepared manually for this tutorial. You can use the actual dataset for your own practice; just be aware that whenever you run print(), thousands of lines of results will scroll past within seconds.

Now, start using regular expressions.
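The walkthrough below looks at one email at a time. As a minimal setup sketch (the same steps appear in the full script at the end of this tutorial), we split the corpus into individual emails, pick one as item, and search it for its Date: header:

contents = re.split(r"From r", fh)
contents.pop(0)                          # drop any text before the first email

item = contents[0]                       # one email, for illustration
date_field = re.search(r"Date:.*", item)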

if date_field is not None:
    date = re.search(r"\d+\s\w+\s\d+", date_field.group())
    print(date_field.group())    # show the full Date: line we are extracting from
else:
    date = None

Dates start with a number, so we use \d to represent it. However, the DD part of the date may be one or two digits, which is why the + sign matters: in regular expressions, + matches 1 or more instances of the pattern to its left. So \d+ matches the DD part whether it has one digit or two.

After that, there is a space. Use \s to represent whitespace characters. The month consists of three letters, so use \w+. Then another space \s. Years consist of numbers, so again use \d+.

The full pattern \d+\s\w+\s\d+ works because it describes the date precisely, with each component bounded by whitespace.
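For instance, applied to a made-up Date: header line, the two searches look like this:

import re

line = "Date: Thu, 31 Oct 2002 22:17:55 +0100"    # a made-up header line
date_field = re.search(r"Date:.*", line)
date = re.search(r"\d+\s\w+\s\d+", date_field.group())
print(date.group())
# 31 Oct 2002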
Next, we check for the None value as before.

full_email = email.message_from_string(item)
body = full_email.get_payload()
emails_dict["email_body"] = body

Separating message headers from the body is a complex task, especially because headers vary from email to email; consistency is rare in raw, untrimmed data. Fortunately, the work has already been done: Python's email package is perfect for this task. We imported the package earlier. Now we apply message_from_string() to item, turning the entire email into an email message object. A message object consists of a header and a payload, corresponding to the header and body of the email, respectively.

Next, we apply the get_payload() function on this message object. This function can separate out the body of the email. We assign it to the variable body and insert it into our emails_dict dictionary under "email_body".
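A minimal sketch of these two steps on a tiny, made-up raw email string:

import email

raw = "From: a@example.com\nTo: b@example.com\nSubject: Hello\n\nJust a short test body."

msg = email.message_from_string(raw)    # headers plus payload as a message object
print(msg["Subject"])                   # Hello
print(msg.get_payload())                # Just a short test body.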

Full code:

import re
import pandas as pd
import email

emails = []

fh = open(r"./fraudulent-email-corpus/fradulent_emails.txt","r",errors='ignore').read()

contents = re.split(r"From r",fh)
contents.pop(0)

for item in contents:
    emails_dict = {}
    sender = re.search(r"From:.*", item)
    if sender is not None:
        s_email = re.search(r"\w\S*@.*\w", sender.group())
        s_name = re.search(r":.*<", sender.group())
    else:
        s_email = None
        s_name = None
    if s_email is not None:
        sender_email = s_email.group()
    else:
        sender_email = None
    emails_dict["sender_email"] = sender_email
    if s_name is not None:
        sender_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", s_name.group()))
    else:
        sender_name = None
    emails_dict["sender_name"] = sender_name
    recipient = re.search(r"To:.*", item)
    if recipient is not None:
        r_email = re.search(r"\w\S*@.*\w", recipient.group())
        r_name = re.search(r":.*<", recipient.group())
    else:
        r_email = None
        r_name = None
    if r_email is not None:
        recipient_email = r_email.group()
    else:
        recipient_email = None
    emails_dict["recipient_email"] = recipient_email
    if r_name is not None:
        recipient_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", r_name.group()))
    else:
        recipient_name = None
    emails_dict["recipient_name"] = recipient_name
    date_field = re.search(r"Date:.*", item)
    if date_field is not None:
        date = re.search(r"\d+\s\w+\s\d+", date_field.group())
    else:
        date = None
    if date is not None:
        date_sent = date.group()
    else:
        date_sent = None
    emails_dict["date_sent"] = date_sent
    subject_field = re.search(r"Subject: .*", item)
    if subject_field is not None:
        subject = re.sub(r"Subject: ", "", subject_field.group())
    else:
        subject = None
    emails_dict["subject"] = subject
# "item" substituted with "email content here" so full email not displayed.
    full_email = email.message_from_string(item)
    body = full_email.get_payload()
    emails_dict["email_body"] = "email body here"

    emails.append(emails_dict)

# Print number of dictionaries, and hence, emails, in the list.
print("Number of emails: " + str(len(emails_dict)))

print("\n")

# Print first item in the emails list to see how it looks.
for key, value in emails[0].items():
    print(str(key) + ": " + str(value))

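The query below assumes the emails list has first been loaded into a pandas dataframe, which takes one line because pd.DataFrame accepts a list of dictionaries directly:

emails_df = pd.DataFrame(emails)

With the dataframe built, we can combine pandas string methods with regular expressions to filter rows. For example, the line below keeps every email whose sender address contains either "maktoob" or "spinfinder":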

emails_df[emails_df["sender_email"].str.contains("maktoob|spinfinder")]

The pipe symbol | matches the pattern on either side of it; e.g. a|b matches a or b.

| may look similar to a character class [ ], but they are not the same. If we were looking for crab, lobster, or isopod, crab|lobster|isopod is far more sensible than [crablobsterisopod]: the former matches any one of the whole words, while the latter matches any single character listed between the brackets.
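A quick made-up check of the difference:

import re

text = "I caught a lobster and a crab"

print(re.findall(r"crab|lobster|isopod", text))
# whole words: ['lobster', 'crab']

print(re.findall(r"[crablobsterisopod]", text))
# single characters only: ['c', 'a', 't', ...]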
