Spark Machine Learning in Action (3): Movie Rating Data Processing and Feature Extraction


This part covers the data processing that is needed after the data visualization step, since the raw data is not complete, and then shows how to extract the features we need from the data. The dataset is still the MovieLens 100k dataset, and the platform is Spark with Python.

This article lists only the key code. For the complete code, see my GitHub repository; the code for this article is in chapter03/movielens_feature.py.

Step 1: Data processing and conversion

When data is missing or abnormal, common treatments are:

  • Filter out or delete records with malformed or missing data
  • Fill in malformed or missing values
  • Use methods that are robust to outliers
  • Apply transformations to potential outliers

Since the dataset we are using has very little missing data, thanks to how it was collected, this part needs no special treatment.
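
For reference, though, here is a minimal sketch of the filtering approach, using the u.user file (five '|'-separated fields: user id, age, gender, occupation, zip code). The variable names are illustrative, not from the original code.

raw_user_data = sc.textFile("%s/ml-100k/u.user" % PATH)
# Keep only records that have all five fields and no empty field
clean_user_data = raw_user_data.map(lambda line: line.split('|')) \
    .filter(lambda fields: len(fields) == 5 and all(f != '' for f in fields))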

Step 2: Feature extraction

The main types of features are the following three:

  • Numerical features: such as age; these can be used directly as a dimension of the data
  • Categorical features: one value out of several possible categories; a feature with k categories is generally encoded using k dimensions
  • Text features: such as movie reviews

Numerical features

Numerical features may also need conversion, because not every numerical value is meaningful as it stands.

For example, age is a good numerical feature that can be used directly without processing, because an increase or decrease in age has a direct relationship with the target. However, features such as the latitude and longitude of a location are sometimes not suitable for direct use and need some processing; they can even be converted into categorical features.
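
As an illustration of such a conversion, here is a minimal sketch that buckets a numerical feature into bins, which can then be one-hot encoded like any other category. The bin edges and sample values are made up for the example.

import numpy as np

age_bins = [0, 18, 35, 50, 65, 100]       # assumed bin edges
sample_ages = np.array([23, 41, 17, 68])  # assumed sample values
bin_indices = np.digitize(sample_ages, age_bins)
print(bin_indices)  # [2 3 1 5] -- each age mapped to a bin index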

Categorical features

A categorical feature with k classes needs to be converted into a k-dimensional binary vector (one-hot encoding).

Let's process the occupation field of the MovieLens users and convert it into a categorical feature.

import numpy as np

# occupation_data is an RDD of occupation strings, e.g. built from u.user
# (fields: user id|age|gender|occupation|zip code):
# occupation_data = sc.textFile("%s/ml-100k/u.user" % PATH) \
#                     .map(lambda line: line.split('|')[3])
all_occupations = occupation_data.distinct().collect()
all_occupations.sort()
occupation_dict = {}
for i, occu in enumerate(all_occupations):
    occupation_dict[occu] = i
user_tom_occupation = 'programmer'
tom_occupation_feature = np.zeros(len(all_occupations))
tom_occupation_feature[occupation_dict[user_tom_occupation]] = 1
print("Binary feature of tom's occupation (programmer) is:")
print(tom_occupation_feature)

The results are:

Binary feature of tom's occupation (programmer) is:
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.]

Derived features

Derived features are features obtained by processing the raw data, such as the total number of ratings per user or the age of a movie, which we computed earlier.
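
As a minimal sketch of one such derived feature, the number of ratings per user can be computed from u.data (tab-separated fields: user id, item id, rating, timestamp); the variable names here are illustrative.

rating_data = sc.textFile("%s/ml-100k/u.data" % PATH)
# Map each rating record to (user id, 1), then sum the counts per user
ratings_per_user = rating_data \
    .map(lambda line: (int(line.split('\t')[0]), 1)) \
    .reduceByKey(lambda a, b: a + b)
print(ratings_per_user.take(5))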

The following example converts the timestamp field of u.data into a categorical feature indicating at what time of day the rating was given.

from datetime import datetime

rating_data = sc.textFile("%s/ml-100k/u.data" % PATH)
rating_fields = rating_data.map(lambda line: line.split('\t'))
timestamps = rating_fields.map(lambda fields: int(fields[3]))
hour_of_day = timestamps.map(lambda ts: datetime.fromtimestamp(ts).hour)
times_of_day_dict = {}
for hour in range(24):
    if hour in range(7, 12):
        times_of_day_dict[hour] = "morning"
    elif hour in range(12, 14):
        times_of_day_dict[hour] = "lunch"
    elif hour in range(14, 18):
        times_of_day_dict[hour] = "afternoon"
    elif hour in range(18, 23):
        times_of_day_dict[hour] = "evening"
    else:
        times_of_day_dict[hour] = "night"
time_of_day = hour_of_day.map(lambda hour: times_of_day_dict[hour])
print(hour_of_day.take(5))
print(time_of_day.take(5))

Running this code gives:

[23, 3, 15, 13, 13]
['night', 'night', 'afternoon', 'lunch', 'lunch']

As you can see, the timestamps are first converted to the hour of the day, then to a period of the day, which can in turn be transformed into a categorical feature.
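
Following the same pattern used for occupations above, the period-of-day strings can then be one-hot encoded. A minimal sketch, reusing numpy and the time_of_day RDD:

all_times = time_of_day.distinct().collect()
all_times.sort()
time_dict = dict((t, i) for i, t in enumerate(all_times))

def time_to_vector(t):
    # One-hot vector with a 1 at the index of the given period
    vec = np.zeros(len(all_times))
    vec[time_dict[t]] = 1
    return vec

time_features = time_of_day.map(time_to_vector)
print(time_features.take(2))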

Text features

In theory, text could be treated as a categorical feature, but text rarely repeats exactly, so the result would be far from ideal.

Below we use the bag-of-words method, which is common in natural language processing (NLP). In short, bag-of-words collects all the words that appear in the dataset into a dictionary, say of K words, and then represents each text as a K-dimensional vector: an entry is 1 if the corresponding word occurs in the text and 0 otherwise. Since most words do not occur in any given text, a sparse matrix is a good fit.

First, we use a regular expression to remove the year information in parentheses from the movie titles, and then split each title into a list of words.

import re

def extract_title(raw):
    # Match the year in parentheses, e.g. "(1995)", and strip it off
    grps = re.search(r"\((\w+)\)", raw)
    if grps:
        return raw[:grps.start()].strip()
    else:
        return raw

movie_data = sc.textFile("%s/ml-100k/u.item" % PATH)
movie_fields = movie_data.map(lambda line: line.split('|'))
raw_titles = movie_fields.map(lambda fields: fields[1])
print("Remove year information in '()'")
for raw_title in raw_titles.take(5):
    print(extract_title(raw_title))
movie_titles = raw_titles.map(extract_title)
title_terms = movie_titles.map(lambda line: line.split(' '))
print("Split words.")
print(title_terms.take(5))

The output is:

Remove year information in '()'
Toy Story
GoldenEye
Four Rooms
Get Shorty
Copycat

Split words.
[[u'Toy', u'Story'], [u'GoldenEye'], [u'Four', u'Rooms'], [u'Get', u'Shorty'], [u'Copycat']]

Next, we use the RDD flatMap operation to collect all the distinct words that appear, and build a word dictionary in the form of (word, index).

all_terms = title_terms.flatMap(lambda x: x).distinct().collect()
all_terms_dict = {}
for i, term in enumerate(all_terms):
    all_terms_dict[term] = i
print("Total number of terms: %d" % len(all_terms_dict))

Finally, each title is mapped to a high-dimensional sparse vector, with a 1 at the position of every word that occurs in it. Note that we broadcast the all_terms_dict dictionary: this variable can be very large, so distributing it to each compute node in advance works better.

from scipy import sparse as sp

def create_vector(terms, term_dict):
    # Build a 1 x num_terms sparse row with a 1 for each known term
    num_terms = len(term_dict)
    x = sp.csc_matrix((1, num_terms))
    for t in terms:
        if t in term_dict:
            idx = term_dict[t]
            x[0, idx] = 1
    return x

all_terms_bcast = sc.broadcast(all_terms_dict)
term_vectors = title_terms.map(
    lambda terms: create_vector(terms, all_terms_bcast.value))
print("The first five titles converted to sparse vectors:")
print(term_vectors.take(5))

The output is:

[<1x2645 sparse matrix of type '<type 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Column format>, 
..., <1x2645 sparse matrix of type '<type 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Column format>]

Feature normalization

We usually need to normalize the features we obtain. Normalization falls into two categories:

  • The first kind normalizes an individual feature, such as age, across the whole dataset, so that its mean is 0 and its variance is 1 (a sketch of this kind using MLlib follows this list)

  • The second kind normalizes feature vectors: each sample's feature vector is scaled so that its norm is 1 (typically the L2 norm, i.e., the square root of the sum of the squared entries)
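
For the first kind, here is a minimal sketch using MLlib's StandardScaler; the input RDD of vectors is made up for illustration.

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

# Illustrative input: an RDD of dense feature vectors
vectors = sc.parallelize([Vectors.dense([20.0, 1.0]),
                          Vectors.dense([30.0, 3.0]),
                          Vectors.dense([40.0, 5.0])])
# Fit, then transform so each column has mean 0 and variance 1
scaler = StandardScaler(withMean=True, withStd=True).fit(vectors)
print(scaler.transform(vectors).collect())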

Here is an example of the second kind, feature vector normalization. The first approach uses numpy functions:

np.random.seed(42)
x = np.random.randn(4)
norm_x = np.linalg.norm(x)
normalized_x = x / norm_x
print("x: %s" % x)
print("2-norm of x: %.4f" % norm_x)
print("normalized x: %s" % normalized_x)

The output is:

x: [ 0.49671415 -0.1382643   0.64768854  1.52302986]
2-norm of x: 1.7335
normalized x: [ 0.28654116 -0.07976099  0.37363426  0.87859535]

The second approach uses MLlib's Normalizer to normalize the feature vector:

from pyspark.mllib.feature import Normalizer

# Normalizer defaults to the L2 norm
normalizer = Normalizer()
vector = sc.parallelize([x])
normalized_x_mllib = normalizer.transform(vector).first().toArray()
print("MLlib normalized x: %s" % normalized_x_mllib)

The result is naturally the same; of course, using MLlib's built-in function is the better choice.

That's all for this article.
