The Road of Big Data [Part 13]: Data Mining --- Chinese Word Segmentation

First, Data Mining --- Chinese Word Segmentation

• Chinese text is not just its literal characters; meaning also depends on how the text is segmented and understood.
• For example, "阿三炒饭店" can be segmented two ways:
    - Asan / fried rice / shop (阿三 / 炒饭 / 店)
    - Asan / fries / hotel (阿三 / 炒 / 饭店)
• Unlike English, Chinese has no spaces between words, so a Chinese search engine needs one more step than an English one: word segmentation.
• Without Chinese word segmentation:
    - a query such as "within reach" can surface irrelevant results about "Zidane", because the raw character strings happen to overlap

• Can the accuracy problem of Chinese word segmentation be solved once and for all by a free, general-purpose segmentation program?
   - like many problems in natural language processing, word segmentation is difficult to solve completely
   - each industry and business has a different focus, so segmentation tools differ in their design strategies

Second, Segmentation Schemes

 

 

• A segmentation can be encoded as a bit vector over the characters: a bit is 1 where a word starts and 0 otherwise. For "有 / 意见 / 分歧" ("have / opinions / divergence", five characters), the bit vector is 11010

• A segmentation can also be represented as a sequence of word-boundary nodes: for the same phrase "有 / 意见 / 分歧", the node sequence is {0, 1, 3, 5} (see the sketch below)
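Both representations can be derived mechanically from a list of words. A minimal sketch (the ASCII tokens below stand in for the Chinese characters of the example):

# -*- coding: utf-8 -*-
# Derive the bit-vector and node-sequence representations of a segmentation.

def seg_repr(words):
    bits = []        # one bit per character: 1 = a word starts here
    nodes = [0]      # cumulative word-boundary positions
    pos = 0
    for w in words:
        bits.append('1')
        bits.extend('0' * (len(w) - 1))
        pos += len(w)
        nodes.append(pos)
    return ''.join(bits), nodes

# A 1/2/2-character segmentation, mirroring the example above
print seg_repr(['a', 'bc', 'de'])   # ('11010', [0, 1, 3, 5])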

Third, the Most Common Method

• The most common method is dictionary-based word matching
      - maximum-length matching (scanning forward or backward; see the sketch at the end of this section)

 

• Data structures
      - to improve lookup efficiency, do not match dictionary words one at a time
      - dictionary lookup can account for roughly one third of total segmentation time, so a good lookup method is needed to keep segmentation fast
      - a Trie tree is often used to speed up the dictionary lookup
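A minimal sketch of forward maximum matching over a plain Python set (the toy dictionary below is invented for illustration; a real system would load a large word list):

# Forward maximum matching: at each position take the longest
# dictionary word that matches, falling back to a single character.
def forward_max_match(text, dictionary, max_word_len=5):
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary or size == 1:
                words.append(text[i:i + size])
                i += size
                break
    return words

# Toy example; with real data the entries would be Chinese words
print forward_max_match('abcde', set(['ab', 'abc', 'de']))   # ['abc', 'de']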

Fourth, the Trie Tree

 

(Figure: word segmentation graph)
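A minimal Trie sketch (an illustrative structure, not code from the original post); matches() returns every dictionary word that begins at a given position in a single left-to-right scan, which is what makes maximum-length lookup fast:

class Trie(object):
    def __init__(self):
        self.children = {}     # char -> child node
        self.is_word = False   # True if the path to this node spells a word

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def matches(self, text, start):
        # collect every dictionary word that begins at `start`
        node, result = self, []
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                result.append(text[start:i + 1])
        return result

trie = Trie()
for w in ['ab', 'abc', 'de']:
    trie.insert(w)
print trie.matches('abcde', 0)   # ['ab', 'abc']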

Fifth, the Probabilistic Language Model

• Suppose every word to be segmented out already appears in the corpus and the vocabulary. The simplest approach is then to compute probabilities word by word rather than character by character.


• From a statistical point of view: the input is a character string C = c1, c2, ..., cn, and the output is a word sequence S = w1, w2, ..., wm, where m <= n. A given string C corresponds to several candidate segmentations S; the segmentation task is to find the S that maximizes P(S|C).


• P(S|C) is the probability of generating the segmentation S from the string C; the answer we want is the most likely word sequence for the input string

 

Example:

• For the input string C = "Nanjing Yangtze River Bridge" (南京市长江大桥), two segmentations are possible:
      - S1: Nanjing City / Yangtze River / Bridge (南京市 / 长江 / 大桥)
      - S2: Nanjing / mayor / Jiang Bridge (南京 / 市长 / 江大桥)
• Call these two segmentations S1 and S2. Compute the conditional probabilities P(S1|C) and P(S2|C), then choose S1 or S2 according to which value is larger.


• P(C) is the probability of the string occurring in the corpus. For example, if the corpus contains 10,000 sentences and exactly one of them is "Nanjing Yangtze River Bridge", then P(C) = P("Nanjing Yangtze River Bridge") = 1/10,000.


• Since P(C ∩ S) = P(S|C) * P(C) = P(C|S) * P(S), Bayes' formula gives:

      P(S|C) = P(C|S) * P(S) / P(C)

• P(C) is a fixed value for a given input and serves only as a normalizer


• Moreover, there is exactly one way to recover the character string from a word sequence (concatenate the words), so P(C|S) = 1.


• Therefore, comparing P(S1|C) with P(S2|C) reduces to comparing P(S1) with P(S2):

      P(S|C) = P(C|S) * P(S) / P(C) = P(S) / P(C) ∝ P(S)

• Since P(S1) = P(Nanjing City, Yangtze River, Bridge) = P(Nanjing City) * P(Yangtze River) * P(Bridge) > P(S2) = P(Nanjing) * P(mayor) * P(Jiang Bridge), segmentation S1 is selected

 

 

 


• For ease of implementation, assume the words are context-free (each word's probability is independent of the words around it); then:

      P(S) = P(w1, w2, ..., wm) = P(w1) * P(w2) * ... * P(wm)

      argmax P(S) ∝ argmax Σ log P(wi)

• Here P(w) is the probability that the word w appears in the corpus. Since y = log(x) is a monotonically increasing function, maximizing the product of probabilities is equivalent to maximizing the sum of their logs (∝ is the proportionality symbol). Because each word's probability is less than 1, its log is negative.


• In practice we compute Σ log P(w). Taking logs prevents floating-point underflow: a very small product such as 0.000000000000000000000000000001 could otherwise underflow to zero.

• If the log P(w) values are computed in advance, each candidate's score is obtained by addition rather than multiplication, and addition is faster (see the sketch below).
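A small sketch of the comparison; the unigram probabilities below are invented for illustration, where real values would come from corpus counts:

import math

# Made-up unigram probabilities standing in for corpus statistics
prob = {'Nanjing City': 0.0010, 'Yangtze River': 0.0020, 'Bridge': 0.0015,
        'Nanjing': 0.0030, 'mayor': 0.0008, 'Jiang Bridge': 0.0000001}

def log_score(words):
    # sum of log probabilities == log of the product, but underflow-safe
    return sum(math.log(prob[w]) for w in words)

s1 = ['Nanjing City', 'Yangtze River', 'Bridge']
s2 = ['Nanjing', 'mayor', 'Jiang Bridge']
best = s1 if log_score(s1) > log_score(s2) else s2
print ' / '.join(best)   # Nanjing City / Yangtze River / Bridge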

 

 

 

 

 

 

Sixth, Solving with Dynamic Programming --- LCS

• String X has length m, indexed from 1;

• String Y has length n, indexed from 1;

• Xi = <x1, ..., xi> is the sequence of the first i characters of X (1 <= i <= m) (Xi is called "the i-th prefix of X")

• Yj = <y1, ..., yj> is the sequence of the first j characters of Y (1 <= j <= n) (Yj is called "the j-th prefix of Y")

• LCS(X, Y) is a longest common subsequence of X and Y, written Z = <z1, ..., zk>

• If xm = yn (the last characters are the same), then the last character zk of the LCS Z of X and Y must be xm (= yn):

• zk = xm = yn

Seventh, LCS Summary and Analysis

 

• LCS is a dynamic programming problem!

 

 

Eighth, Data Structure --- a Two-Dimensional Array

• Use a two-dimensional array C[m, n]
• C[i, j] records the length of the longest common subsequence of the prefixes Xi and Yj
- when i = 0 or j = 0, the LCS of Xi and Yj is empty, so C[i, j] = 0
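The remaining cases complete the standard LCS recurrence that this table implements:

      C[i, j] = C[i-1, j-1] + 1              if i, j > 0 and xi = yj
      C[i, j] = max(C[i, j-1], C[i-1, j])    if i, j > 0 and xi != yj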

 

 

Example:

• X = <A, B, C, B, D, A, B>
• Y = <B, D, C, A, B, A>
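As a quick check (a sketch, not the post's code), filling the C table bottom-up for these two sequences gives an LCS length of 4, e.g. <B, C, B, A>:

def lcs_len(X, Y):
    m, n = len(X), len(Y)
    # C[i][j] = LCS length of the prefixes X_i and Y_j
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                C[i][j] = C[i - 1][j - 1] + 1
            else:
                C[i][j] = max(C[i - 1][j], C[i][j - 1])
    return C[m][n]

X = ['A', 'B', 'C', 'B', 'D', 'A', 'B']
Y = ['B', 'D', 'C', 'A', 'B', 'A']
print lcs_len(X, Y)   # 4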

 

 

 

## mr_lcs MapReduce

##map.py

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

def cal_lcs_sim(first_str, second_str):
    # DP table for LCS lengths; build each row separately so the
    # rows are independent lists rather than aliases of one row
    len_vv = [[0] * 50 for _ in range(50)]

    first_str = unicode(first_str, "utf-8", errors='ignore')
    second_str = unicode(second_str, "utf-8", errors='ignore')

    len_1 = len(first_str.strip())
    len_2 = len(second_str.strip())

    # fill the LCS length table bottom-up
    for i in range(1, len_1 + 1):
        for j in range(1, len_2 + 1):
            if first_str[i - 1] == second_str[j - 1]:
                len_vv[i][j] = 1 + len_vv[i - 1][j - 1]
            else:
                len_vv[i][j] = max(len_vv[i - 1][j], len_vv[i][j - 1])

    # Dice-style similarity: 2 * LCS length / (total length of both strings)
    return float(len_vv[len_1][len_2] * 2) / float(len_1 + len_2)

for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) != 2:
        continue
    first_str = ss[0].strip()
    second_str = ss[1].strip()

    sim_score = cal_lcs_sim(first_str, second_str)
    print '\t'.join([first_str, second_str, str(sim_score)])
##run.sh

HADOOP_CMD="/usr/local/src/hadoop-1.2.1/bin/hadoop"
STREAM_JAR_PATH="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"

INPUT_FILE_PATH_1="/lcs_input.data"
OUTPUT_PATH="/lcs_output"

$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH

# Step 1.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python map.py" \
    -jobconf "mapred.reduce.tasks=0" \
    -jobconf "mapred.job.name=mr_lcs" \
    -file ./map.py
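As implied by map.py's parsing, each line of lcs_input.data should contain exactly two tab-separated strings; lines with any other shape are skipped.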

 

## mr_tfidf MapReduce
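The mapper for this job is not included in the post. A minimal sketch consistent with what red.py expects (one "word <tab> 1" pair per distinct word per document); the input layout and tokenization are assumptions:

#!/usr/bin/python
# Hypothetical mapper sketch (not from the original post): emit each
# distinct word of a document once with count 1, so the reducer can
# sum document frequencies and compute IDF.
import sys

for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) != 2:
        continue
    doc_id, content = ss
    # assumption: the document text is already space-tokenized
    for word in set(content.strip().split(' ')):
        print '\t'.join([word, '1'])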

 

 

##red.py
#!/usr/bin/python

import sys
import math

current_word = None
count_pool = []
count_sum = 0   # running document-frequency total (renamed from `sum` to avoid shadowing the built-in)

# total number of documents in the corpus
docs_cnt = 508

for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) != 2:
        continue
    word, val = ss

    if current_word is None:
        current_word = word

    if current_word != word:
        # flush the previous word: total its counts and compute IDF
        for count in count_pool:
            count_sum += count
        idf_score = math.log(float(docs_cnt) / (float(count_sum) + 1))
        print "%s\t%s" % (current_word, idf_score)

        current_word = word
        count_pool = []
        count_sum = 0

    count_pool.append(int(val))

# flush the last word
for count in count_pool:
    count_sum += count
idf_score = math.log(float(docs_cnt) / (float(count_sum) + 1))
print "%s\t%s" % (current_word, idf_score)

 

##run.sh

HADOOP_CMD="/usr/local/src/hadoop-1.2.1/bin/hadoop"
STREAM_JAR_PATH="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"

INPUT_FILE_PATH_1="/tfidf_input.data"
OUTPUT_PATH="/tfidf_output"

$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH

# Step 1.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python map.py" \
    -reducer "python red.py" \
    -file ./map.py \
    -file ./red.py

 
