Programming Collective Intelligence Reading Notes - Chapter 1

1. What is collective intelligence

      Concept: Collective Intelligence (CI) is a shared or group intelligence. Before the Internet era, collective intelligence was already an active topic in biology, sociology, computer science, and the study of crowd behavior. With the rise of Web 2.0 and the spread of social software, collective intelligence is now widely applied in social networking services, crowdsourcing, sharing, reviewing, and recommendation. Typical examples include Baidu Encyclopedia, Zhubajie.com, Task China, Threadless, InnoCentive, Digg, iStockphoto, and Mechanical Turk. More and more traditional companies and organizations are also starting to use collective-intelligence platforms and tools to solve complex problems with outside expertise.

2. What is machine learning

1. Introduction to Machine Learning

  • Machine Learning (ML) is an interdisciplinary subject spanning probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other fields. It studies how computers can simulate or implement human learning in order to acquire new knowledge or skills, and how to reorganize existing knowledge structures to keep improving performance.
  • It is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied across every field of AI. It relies mainly on induction and synthesis rather than deduction.
  • Programming technique is only one part of machine learning; mathematics and statistics are the key parts.

2. Limitations of Machine Learning

  • A machine learning algorithm is constrained by its model: its ability to learn and to analyze on its own is limited by the machine itself.
  • A concrete symptom is over-generalization: conclusions drawn from a small amount of data are often inaccurate.
  • Many algorithms behave differently in practice than theory predicts.
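A toy illustration of over-generalization (invented for these notes, not from the book): a polynomial made to pass exactly through four slightly noisy points reproduces its training data perfectly, yet is wildly wrong only a short distance outside it.

```python
def lagrange_fit(points):
    """Return a function that interpolates exactly through the given (x, y) points."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

# The underlying trend is y = x, observed with small noise at four points.
points = [(0, 0.0), (1, 1.1), (2, 1.9), (3, 3.2)]
f = lagrange_fit(points)

for x, y in points:
    print(x, round(f(x), 2))  # reproduces every training point exactly
print(round(f(10), 1))        # 93.5 -- nowhere near the trend's value of 10
```

The fitted curve drives its training error to zero by contorting itself through the noise; that is exactly the over-generalization from small data that the bullet above warns about.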

3. Other Uses of Learning Algorithms

  • Biotechnology
  • Financial fraud detection
  • Machine vision
  • Product marketing
  • Supply chain optimization
  • Stock market analysis
  • National security

4. Some algorithm principles

Collaborative filtering

Concept: search a large group of people and find the smaller set whose tastes are similar to yours; the algorithm then examines the other things those people like and combines them into a ranked list of recommendations.
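A minimal sketch of that idea (the user names, items, and the crude similarity measure are invented for illustration): find the user most like the target, then suggest that user's items which the target has not yet seen.

```python
# Ratings: user -> {item: score}
prefs = {
    'A': {'m1': 5.0, 'm2': 3.0},
    'B': {'m1': 4.5, 'm2': 3.5, 'm3': 4.0},
    'C': {'m1': 1.0, 'm2': 5.0, 'm3': 2.0},
}

def similarity(u, v):
    # Crude similarity: negative mean absolute rating difference over shared items
    shared = [i for i in prefs[u] if i in prefs[v]]
    if not shared:
        return float('-inf')
    return -sum(abs(prefs[u][i] - prefs[v][i]) for i in shared) / len(shared)

def recommend(user):
    # Take the most similar other user and suggest their unseen items, best-rated first
    best = max((v for v in prefs if v != user), key=lambda v: similarity(user, v))
    unseen = {i: s for i, s in prefs[best].items() if i not in prefs[user]}
    return sorted(unseen, key=unseen.get, reverse=True)

print(recommend('A'))  # ['m3'] -- borrowed from B, the user closest to A
```

The two measures discussed below, Euclidean distance and Pearson correlation, are principled replacements for the crude similarity used here.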

Data collection

     When the data set is small, Python's built-in data structures (dictionaries and lists) are enough; for large amounts of data, a database is needed.

Methods for finding similar users

  • Euclidean distance
  • Pearson correlation

 Euclidean distance



Introduction

  • One of the easiest ways to calculate similarity scores is to use Euclidean distance.
  • Definition: the distance between two points in a multi-dimensional space. In a two-dimensional plane it is easy to picture: the distance between two points is the square root of the squared difference of the x-coordinates plus the squared difference of the y-coordinates. In higher dimensions the same idea extends by summing the squared differences over every coordinate.

Formula

    d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

The similarity score used in this chapter is 1 / (1 + d), so identical users score 1 and very distant users approach 0.

Precautions

  • Because the calculation uses the raw value of each dimension, Euclidean distance requires every dimension to be on the same scale. For example, computing a Euclidean distance over height (cm) and weight (kg), two indicators with different units, can make the result meaningless.
  • Euclidean distance is an intuitive measure, but it handles subjective rating data poorly. For example, suppose U1 gives Item1 2 points and Item2 4 points, while U2 gives them 4 and 8 points. Both users clearly prefer Item2 to Item1; U1 simply rates conservatively while U2 rates generously, so logically their tastes are highly similar. Euclidean distance, however, reports them as far apart. In short, when a rater's scores deviate strongly from the average level, Euclidean distance cannot reveal the true similarity.
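The U1/U2 example can be checked numerically. This sketch applies the 1/(1 + distance) similarity described above, then the Pearson correlation formula introduced later in these notes, to the same two users:

```python
from math import sqrt

u1 = {'Item1': 2.0, 'Item2': 4.0}
u2 = {'Item1': 4.0, 'Item2': 8.0}

# Euclidean similarity 1/(1+d): the users look quite dissimilar
d = sqrt(sum((u1[i] - u2[i]) ** 2 for i in u1))
euclid_sim = 1 / (1 + d)
print(round(euclid_sim, 3))  # 0.183

# Pearson correlation on the same ratings: a perfect 1.0,
# because U2's scores are exactly U1's scores doubled
n = len(u1)
x = [u1[i] for i in u1]
y = [u2[i] for i in u1]
num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n) *
           (sum(b * b for b in y) - sum(y) ** 2 / n))
print(num / den)  # 1.0
```

Euclidean similarity sees two distant users; Pearson correlation recognizes that their relative preferences agree perfectly.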

Code example:

'''Programming outline
1. Collect the items both people have rated:
def sim_distance(prefs, person1, person2):
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
2. Euclidean distance: if there are no common items, return 0
    if len(si) == 0:
        return 0
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2) for item in si])
    return 1/(1 + sqrt(sum_of_squares))
'''

#!/usr/bin/python
# -*- coding: utf-8 -*-
from math import sqrt
BJ={'小明':{'唐人街探案':4.9,'湄公河行动':7.8,'红海行动':10},
    '小红':{'唐人街探案':4.9,'湄公河行动':7.8,'红海行动':10},
    '小将':{'唐人街探案':9.2,'湄公河行动':6.8,'红海行动':6,},
    'jace':{'唐人街探案':6.0,'湄公河行动':4.7,'红海行动':8},
    'jack':{'唐人街探案':4.9,'湄公河行动':7.8,'红海行动':6},
    'davi':{'唐人街探案':9.2,'湄公河行动':6.8,'红海行动':5,},
    }

def sim_distance(prefs, person1, person2):
    # Collect the items both people have rated
    commonItemName = [item for item in prefs[person1] if item in prefs[person2]]
    if len(commonItemName) == 0: return 0
    # Euclidean distance over the shared items, converted to a 0-1 similarity
    distance = sqrt(sum([pow(prefs[person1][item] - prefs[person2][item], 2) for item in commonItemName]))
    return 1/(1 + distance)



# Output (interactive session)
>>> from ojld_distance import *
>>> sim_distance(BJ,'小将','小红')
0.14373291978667996


#!/usr/bin/python
# -*- coding: utf-8 -*-
from math import sqrt
BJ={'小明':{'唐人街探案':4.9,'湄公河行动':7.8,'红海行动':10},
    '小红':{'唐人街探案':4.9,'湄公河行动':7.8,'红海行动':10},
    '小将':{'唐人街探案':9.2,'湄公河行动':6.8,'红海行动':6,},
    'jace':{'唐人街探案':6.0,'湄公河行动':4.7,'红海行动':8},
    'jack':{'唐人街探案':4.9,'湄公河行动':7.8,'红海行动':6},
    'davi':{'唐人街探案':9.2,'湄公河行动':6.8,'红海行动':5,},
    }

def distance(p, person1, person2):
    # Collect the items both people have rated
    s1 = {}
    for item in p[person1]:
        if item in p[person2]:
            s1[item] = 1
    if len(s1) == 0: return 0
    # Sum of squared differences over the shared items
    sum_of_squares = sum([pow(p[person1][item] - p[person2][item], 2) for item in s1])
    return 1/(1 + sqrt(sum_of_squares))

>>> from sum_distance1 import *
>>> distance(BJ,'小明','小明')
1.0

Pearson correlation



Introduction

  • Usage: the Pearson correlation coefficient measures how well two sets of data fit a straight line. It gives better results than Euclidean distance when the data is not normalized (for example, when some users systematically exaggerate their reviews).
  • Best-fit line: the line that comes as close as possible to all the points. If two people rated every item identically, the best-fit line is the diagonal and the correlation is a perfect 1.
  • Formula (the form implemented in the code below):

        r = (Σxy − (Σx · Σy)/n) / sqrt((Σx² − (Σx)²/n) · (Σy² − (Σy)²/n))

Code example

#!/usr/bin/python
# -*- coding: utf-8 -*-
critics = {
    'Lisa':{'Lady':2.5,'Snak':3.5,'Just':3.0,'Superman':3.5,'Dupree':2.5,'Night':3.0},
    'Gene':{'Lady':3.0,'Snak':3.5,'Just':1.5,'Superman':5.0,'Dupree':3.5,'Night':3.0},
    'Michael':{'Lady':2.5,'Snak':3.0,'Superman':3.5,'Night':4.0},
    'Claudia':{'Snak':3.5,'Just':3.0,'Superman':4.0,'Dupree':2.5,'Night':4.5},
    'Mick':{'Lady':3.0,'Snak':4.0,'Just':2.0,'Superman':3.0,'Dupree':2.0,'Night':3.0},
    'Jack':{'Lady':3.0,'Snak':4.0,'Just':3.0,'Superman':5.0,'Dupree':3.5,'Night':3.0},
    'Toby':{'Snak':4.5,'Superman':4.0,'Dupree':1.0}
}
from math import sqrt
def sim_pearson(prefs,p1,p2):
    # Collect the items both people have rated
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1

    n = len(si)

    # With no items in common there is no basis for comparison
    if n == 0: return 0

    # Sums and sums of squares of the ratings
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    sum1Sq = sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it],2) for it in si])

    # Sum of the products
    pSum = sum([prefs[p1][it]*prefs[p2][it] for it in si])

    # Pearson score
    num = pSum - (sum1*sum2/n)
    den = sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq - pow(sum2,2)/n))
    if den == 0: return 0
    return num/den


print (sim_pearson(critics,'Lisa','Gene'))


# Output

0.396059017191