Extracting Content from Text (Strings) in R: Regular Expressions (2)

In scientific research, the data we collect is sometimes too messy to analyze right away; SEER data is one example. Anyone who has used it knows that the data must be cleaned, and the parts we need extracted, before analysis can begin. Regular expressions are a very practical tool for extracting data from strings. The previous chapter, "R Language Extracting Content from Text (Strings) - Regular Expressions (1)", introduced some common regular-expression functions. Today we go a step further.
Don't worry if it doesn't all make sense yet; you are making progress little by little. This skill builds up slowly with practice.
Let's import the data first. Suppose we have a batch of messy data:

bc <- readLines("E:/r/test/messages.txt")
bc

Now we want to tidy it up. We need to pick out the records about fruits so they are easy to tabulate, but a computer does not know what a fruit is. What it can recognize is a pattern, and regular expressions provide a set of symbols for describing patterns.
The fruit records can be described by the pattern ^\w+:\s\d+$,
where metacharacters each stand for a class of characters. Here is a brief introduction; see the references for details.
• ^: the beginning of a line.
• \w: a word character (a letter, a digit, or an underscore).
• \s: a whitespace character (such as a space or a tab).
• \d: a digit.
• $: the end of a line.

"\w+" means one or more word characters, ":" is the literal colon we expect after the word, "\s" is the single whitespace character that follows it, and "\d+" means one or more digits. This pattern covers all the cases we need and excludes all the ones we don't.
We start with the matching itself, again using the grep function from last time. Fruit records such as apple: 20 consist of letters, a colon (:), a space, and digits. The number of letters and digits varies, so we simply tell the computer that letters plus a colon plus digits means a fruit.
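Before applying the rule to real data, it can help to check what the pattern accepts and rejects. The following is a minimal sketch using base R's grepl; the sample strings are made up for illustration and are not the contents of the data file.

```r
# Hypothetical sample strings (assumed for illustration only)
x <- c("apple: 20",      # word, colon, one space, digits
       "apple:20",       # no space after the colon
       "apple: twenty",  # letters where digits are required
       " apple: 20",     # leading space before the word
       "apple_2: 7")     # \w also accepts digits and underscores

# grepl() returns one TRUE/FALSE per element
is_match <- grepl("^\\w+:\\s\\d+$", x)
is_match
```

Only the first and last strings satisfy the pattern, which shows how strictly ^, \s, and $ constrain the match.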
Let's first write the matching rule:

matches <- grep("^\\w+:\\s\\d+$",bc)

Let me explain the code above. ^ marks the beginning of the line and $ marks the end. \w matches a word character; because the backslash is itself an escape character inside an R string, it must be doubled, so \w is written as \\w. \w+ means one or more word characters. The colon (:) that follows means a colon must come right after the word. \s means there is a whitespace character after the colon, and it is likewise written as \\s. Note that the space occupies a position of its own: failing to account for it is a common cause of failed matches. \d+ is analogous to \w+ and means one or more digits; it is also escaped, so it is written as \\d+. Take a moment to work through this.
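The double backslash is easy to verify for yourself. This small sketch, using only base R, prints the pattern the way the regular-expression engine receives it; the variable name pattern is chosen here for illustration.

```r
# In R source code each backslash is typed twice
pattern <- "^\\w+:\\s\\d+$"

# writeLines() shows the actual characters in the string: ^\w+:\s\d+$
writeLines(pattern)

# The string holds 11 characters, not the 14 typed between the quotes
n_chars <- nchar(pattern)
n_chars
```

What you type in the script and what the engine sees differ by exactly those doubled backslashes.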
Following this rule, the computer selects elements 1, 3, 5, and 6 as fruits, and we can extract them directly:

bc[matches]
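For readers without the data file, here is a self-contained sketch of the same two steps. The vector bc_demo is hypothetical sample data, constructed so that elements 1, 3, 5, and 6 match. It also shows the value = TRUE argument of grep, which returns the matching strings directly instead of their positions.

```r
# Hypothetical stand-in for the contents of the data file
bc_demo <- c("apple: 20", "orange: missing", "banana: 30",
             "pear: sent to Jerry", "melon: 2", "grape: 12",
             "cherry: sent to Kate")

# Step 1: positions of the elements that match the rule
idx <- grep("^\\w+:\\s\\d+$", bc_demo)
idx

# Step 2: subset by position
bc_demo[idx]

# Both steps in one call: value = TRUE returns the strings themselves
vals <- grep("^\\w+:\\s\\d+$", bc_demo, value = TRUE)
vals
```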

This makes it easy to extract the fruit entries. The stringr package is more powerful still: its str_match function returns the matches as a matrix.

library(stringr)
matches <- str_match(bc,"^(\\w+):\\s(\\d+)$")
matches

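The str_match function of the stringr package works slightly differently from grep. Each part of the pattern that we want returned separately is wrapped in parentheses (), forming a capture group. The result is a character matrix: the first column holds the full match, and each following column holds one captured group. Elements that do not match produce a row of NA. It extracts each component separately, which I prefer.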
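If stringr is not available, a similar matrix can be assembled in base R. This is only a sketch of the same idea using regexec and regmatches, not how str_match is implemented; the sample vector x is made up for illustration.

```r
# Hypothetical sample data
x <- c("apple: 20", "orange: missing", "banana: 30")
pattern <- "^(\\w+):\\s(\\d+)$"

# regexec() records the full match and each capture group;
# regmatches() turns those positions into substrings
m <- regmatches(x, regexec(pattern, x))

# Non-matching elements come back as character(0);
# pad them with NA so every row has three columns
rows <- lapply(m, function(r) if (length(r) == 0) rep(NA_character_, 3) else r)
mat <- do.call(rbind, rows)
mat
```

Column 1 is the full match, column 2 the fruit name, and column 3 the number, which mirrors the layout described above.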

Let's look at another example. First, import the data:

be<-readLines("E:/r/test/messages.txt")
be

Each record is a single string with all the fields run together. What we want is each record split into separate fields.

This is a data-cleaning task, and with a large amount of data it cannot be done by hand. Let's analyze one row first. It begins with the date, followed by the time, then two names (the sender and the receiver), and finally the message text. The fields are separated by commas, while the words inside the message are separated by spaces, so the structure is quite regular:
2014-02-01,09:20:29,Ken,James,Hey, how are you?
Continuing with the str_match function used above,
the date can be extracted with

(\\d+-\\d+-\\d+)

The time can be extracted with

(\\d+:\\d+:\\d+)

The two comma-separated names (sender and receiver) can be extracted with

(\\w+),(\\w+)

Finally, extract the message text. \s stands for a whitespace character; note that uppercase \S and lowercase \s are different (\S matches any non-whitespace character). \s* means whitespace appears zero or more times. In (.+), the dot (.) matches any character, so (.+) matches one or more characters of any kind, and $ anchors the match to the end of the line.

\\s*(.+)$
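The fragments can also be tried one at a time on the sample row. The sketch below uses base R's regexpr and regmatches; the helper extract_first is a hypothetical name introduced here. It also shows that the comma fragment, used on its own, latches onto digits first, because \w matches digits as well as letters.

```r
line <- "2014-02-01,09:20:29,Ken,James,Hey, how are you?"

# Hypothetical helper: return the first substring that matches the pattern
extract_first <- function(pattern, text) {
  regmatches(text, regexpr(pattern, text))
}

date_part <- extract_first("\\d+-\\d+-\\d+", line)
time_part <- extract_first("\\d+:\\d+:\\d+", line)

# On its own, this fragment finds the leftmost "word,word" pair,
# which is part of the date and time, not the two names
names_guess <- extract_first("\\w+,\\w+", line)

date_part
time_part
names_guess
```

The fragments only pick out the intended fields reliably once they are placed in order inside a single pattern anchored with ^ and $.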

Now let's join the pieces into a single pattern:

pattern <- "^(\\d+-\\d+-\\d+),(\\d+:\\d+:\\d+),(\\w+),(\\w+),\\s*(.+)$"
matches <- str_match(be,pattern)

Each field has now been extracted into its own column. With a little tidying (dropping the first column, which holds the full match, and naming the rest), we have the data we need:

df <- data.frame(matches[, -1])
colnames(df) <- c("Date", "Time", "Sender", "Receiver", "Message")
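As an alternative sketch that needs no extra package, base R's strcapture (in the utils package, R 3.4.0 or later) applies the same pattern and fills a data frame in one step. The two message strings in be_demo are made-up sample data, and converting the Date column with as.Date is an optional extra, not part of the original workflow.

```r
# Hypothetical sample records in the same layout as the data file
be_demo <- c("2014-02-01,09:20:29,Ken,James,Hey, how are you?",
             "2014-02-01,09:20:35,James,Ken,I'm fine. And you?")

pattern <- "^(\\d+-\\d+-\\d+),(\\d+:\\d+:\\d+),(\\w+),(\\w+),\\s*(.+)$"

# The prototype gives the name and type of each captured column
proto <- data.frame(Date = character(), Time = character(),
                    Sender = character(), Receiver = character(),
                    Message = character(), stringsAsFactors = FALSE)

df_demo <- strcapture(pattern, be_demo, proto)

# Optional: store the date as a real Date rather than text
df_demo$Date <- as.Date(df_demo$Date)
df_demo
```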

Regular expressions may look complicated at first, but they are very useful once you understand them. Mastery comes with practice.

Here are several good tutorials to study; see the references.

References

  1. https://github.com/ziishaned/learn-regex/blob/master/translations/README-cn.md (section 4, zero-width assertions: lookahead and lookbehind)
  2. http://www.regexlab.com/zh/regref.htm
  3. R Programming Guide
  4. The Art of R Programming
  5. R for Data Science


Origin blog.csdn.net/dege857/article/details/134431756