Data types and scales

  Data is the result of facts or observations, a logical induction about objective things, and the unprocessed raw material used to represent them ...... In a computer system, data is represented in binary form as the information units 0 and 1. (Baidu Encyclopedia)

  The last part is readable; the first sentence is best forgotten.

  Simply put, anything that is a result is data. Note: the result, not the process. A process is an action, a behavior that drives toward a result.

  Even more simply, anything recorded on some medium is data: the text in a book, the information on a CD-ROM, and of course what programmers probably think of first, the data in a database.

  Suppose a customer sends a WeChat message:

  We usually call this "information". So what is the difference between information and data?

  Suppose you look through your chat history with someone on another phone. What is that called? "Historical data." So data is a collection of information. A collection of data of one type is usually called a data set, such as an image data set or a chat-record data set. There is really no need to distinguish these terms so finely. The concepts are intuitive and hard to get wrong, and even if you do get them wrong it does not matter: calling a data set an "information set" does not hurt understanding.

  People are very good at classification and will sort anything into categories. The recently popular example is garbage sorting:

  The 23 design patterns are likewise divided into three categories of 5, 7, and 11 patterns:

  Those who deal with data every day also end up classifying it. Broadly, data can be divided into structured data and unstructured data, and each column of structured data can be further classified by type and by scale.

Structured and unstructured

  Structured data is data stored in rows and columns, with strictly defined columns. A scientist's experimental data and the records in a relational database table are structured data.

  Its counterpart is unstructured data, such as the log messages a system generates, a picture, a video, or some WeChat chat records ...... Most of the data in the world is unstructured.

  Structured data is clearly easier to analyze and process. In fact, most statistical models and machine learning models can only be applied to structured data, so unstructured data often has to be converted into structured data first.

  For unstructured data, the first piece of information that can be extracted is its size. Of course, how size is measured may differ from one data set to another.

  Look at the Meituan reviews of the Songhelou restaurant in Suzhou:

  The first review has a lot of text, while most of the others are short. That matches common sense; after all, most people are lazy.

  "I ate here with Mom and Dad. Rather than have them order a la carte, I bought this group deal, which is better value than ordering dishes individually. The food was very good and Mom and Dad both liked it; Dad, a picky chef, gave the food 8 points. The points were deducted for service. The lobby staff had a very good attitude: although we had no reservation, they actively arranged a table for us, and the second-floor supervisor was also very good. The fly in the ointment was the table waiter, whose attitude was not very positive and who did not introduce the dishes when serving them, so we ate in complete confusion and in the end worked out which dish was which by elimination and guessing. Overall, recommended. #whitebait shield soup# #Pork# #Qing Liuhe shrimp# #Brasenia silver Yugeng#"

  This review is 194 characters long (including punctuation). Songhelou has 274 reviews in total, averaging 21 characters each. Provided it does not contain a lot of repeated statements, this one can almost certainly be identified as a high-quality review.
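  The size measurement described above can be sketched in a few lines. This is only an illustration: the four reviews are made-up stand-ins for the real 274 reviews, and the "twice the average" threshold is an arbitrary assumption, not something the analysis above relies on.

```python
from statistics import mean

# Hypothetical stand-in reviews; the real data set has 274 reviews.
reviews = [
    "Ate here with Mom and Dad; the set meal was good value and the food was great.",
    "Tasty.",
    "Good service.",
    "Nice place, a bit pricey.",
]

# Size of each review, measured in characters (punctuation included).
lengths = [len(text) for text in reviews]
average_length = mean(lengths)

# Flag reviews much longer than average as candidates for closer reading.
# The factor 2 is an arbitrary illustrative threshold.
long_reviews = [text for text in reviews if len(text) > 2 * average_length]

print(lengths)
print(average_length)
print(long_reviews)
```

  Character count is only one possible measure of size; for images or video the natural measure would be something else, such as resolution or duration.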

  Next, segment the review into words and analyze it:

import pandas as pd
from jieba.analyse import ChineseAnalyzer

# Review text, shown here in English translation; the analysis was run on the Chinese original
content = 'I ate here with Mom and Dad. Rather than have them order a la carte, I bought this group deal, which is better value than ordering dishes individually. The food was very good and Mom and Dad both liked it; Dad, a picky chef, gave the food 8 points. The points were deducted for service. The lobby staff had a very good attitude: although we had no reservation, they actively arranged a table for us, and the second-floor supervisor was also very good. The fly in the ointment was the table waiter, whose attitude was not very positive and who did not introduce the dishes when serving them, so we ate in complete confusion and in the end worked out which dish was which by elimination and guessing. Overall, recommended. ' \
          '#whitebait shield soup# #Pork# #Qing Liuhe shrimp# #Brasenia silver Yugeng#'
length = len(content)
print('length =', length)

segments = []
analyzer = ChineseAnalyzer()
# Chinese word segmentation: each token exposes the segmented word as .text
for word in analyzer(content):
    segments.append({'word': word.text, 'count': 1})

df = pd.DataFrame(segments)
# Word frequency statistics
word_freq = df.groupby('word')['count'].sum()
# Sort by count in descending order and take the 30 most frequent words
word_freq_n = word_freq.sort_values(ascending=False)[:30]
print(word_freq_n)

  Using 'ChineseAnalyzer' may raise: ImportError: cannot import name 'ChineseAnalyzer'. Installing whoosh fixes it: pip install whoosh

  The code segments the review into words and then counts the 30 most frequent words, with the following results:

  Positive words such as "good", "positive", "cost-effective", and "like" appear five times in total, while "insufficient" appears only once, which shows the customer was quite satisfied with the meal. The review contains "we", "Dad", and "Mom", indicating that more than one person was dining. "Attitude", "food", and "serving" all appear. Generally speaking, if the attitude and the food had been poor, the customer would not have used a word as mild as "insufficient", so the shortcoming most likely lies in "serving".
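  The tally of positive and negative words can be sketched with the standard library alone. The token list and both word sets below are hypothetical English stand-ins for the segmentation output, chosen only to illustrate the counting step; they are not the real word frequencies.

```python
from collections import Counter

# Hypothetical, already-segmented tokens standing in for the jieba output.
tokens = ["good", "food", "good", "attitude", "positive", "cost-effective",
          "like", "insufficient", "serving", "we", "dad", "mom", "attitude"]

# Hand-picked sentiment vocabularies (illustrative, not a real lexicon).
positive_words = {"good", "positive", "cost-effective", "like"}
negative_words = {"insufficient"}

counts = Counter(tokens)
positive_total = sum(counts[w] for w in positive_words)
negative_total = sum(counts[w] for w in negative_words)

print(counts.most_common(3))
print("positive:", positive_total, "negative:", negative_total)
```

  A Counter returns 0 for words it has never seen, so the vocabularies can contain words that do not occur in a given review.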

  Based on this analysis, the review can be put into a structured format:

  For the last three fields, 2 simply means good, 1 means average, and 0 means poor.
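  One way to picture such a structured record is sketched below. The field names and the individual scores are assumptions made for illustration (one plausible reading of the review); only the 2/1/0 encoding and the length of 194 come from the text.

```python
from dataclasses import dataclass, asdict

# Score encoding taken from the text: 2 = good, 1 = average, 0 = poor.
SCORE = {"good": 2, "average": 1, "poor": 0}


@dataclass
class ReviewRecord:
    # All field names are hypothetical.
    length: int        # review size in characters
    group_meal: bool   # True if more than one person dined
    attitude: int      # encoded with SCORE
    food: int          # encoded with SCORE
    serving: int       # encoded with SCORE


# Scores below are an illustrative reading of the review, not extracted values.
record = ReviewRecord(
    length=194,
    group_meal=True,
    attitude=SCORE["good"],
    food=SCORE["good"],
    serving=SCORE["average"],
)

print(asdict(record))
```

  Once every review is reduced to a row like this, the rows can be loaded into a table and handled with ordinary statistical tools.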

  With similar methods we can convert unstructured data into structured data and mine more information from it.

Qualitative and quantitative

  For structured data, the type of a column is either quantitative or qualitative. If the values can meaningfully take part in addition and subtraction, the data is quantitative; otherwise it is qualitative.

  It looks very simple. Take a company's employee records:

  A text field such as the name is certainly qualitative data. Ages can be subtracted to give a meaningful age difference, so age is quantitative data. ID number, gender, and telephone number may also be written as digits, but subtracting them makes no sense, so they are qualitative data. For quantitative data we can compute the mean, maximum, minimum, and other statistics of the column.
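  A small sketch of the distinction, using made-up employee records (every name and value here is hypothetical). The quantitative column supports the usual summary statistics, while the qualitative columns are stored as strings so that arithmetic on them is not even possible.

```python
from statistics import mean

# Hypothetical employee records; every value is made up for illustration.
employees = [
    {"id": "1001", "name": "Li", "gender": "M", "age": 45, "phone": "555-0101"},
    {"id": "1002", "name": "Wang", "gender": "F", "age": 35, "phone": "555-0102"},
    {"id": "1003", "name": "Zhao", "gender": "F", "age": 28, "phone": "555-0103"},
]

# Age is quantitative: differences and averages are meaningful.
ages = [e["age"] for e in employees]
age_summary = {
    "mean": mean(ages),
    "max": max(ages),
    "min": min(ages),
    "gap": max(ages) - min(ages),
}
print(age_summary)

# ID and phone are qualitative: the only sensible question is whether two are equal.
same_phone = employees[0]["phone"] == employees[1]["phone"]
print(same_phone)
```

  Keeping identifiers as strings rather than integers is a simple guard against accidentally averaging or summing them.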

The four scales

  Going a step further than qualitative versus quantitative, each column of structured data can be assigned to one of four scales according to the mathematical operations it supports: the nominal scale, the ordinal scale, the interval scale, and the ratio scale.

  Each scale has a measure of center, a value that describes the central tendency of the data, also called the balance point of the data. The mean is one commonly used measure of center.

Nominal scale

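  The nominal scale mainly covers text and category data, such as names, order numbers, product categories, and shipping addresses. Such data is usually stored as strings and cannot take part in arithmetic such as addition and subtraction.

  Two mathematical operations may suit the nominal scale: equality and membership. For example, we can check whether several orders have the same shipping address, or whether a product belongs to some broader category.

  Some data can be written as numbers yet still belongs to the nominal scale. Phone numbers are an example: adding, subtracting, multiplying, or dividing them, or comparing them in any way other than for equality, is meaningless.

  Clearly the mean and the median cannot be used on nominal data, but the mode can be computed by counting, so the measure of center for the nominal scale is the mode.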
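  The three operations that make sense on nominal data (equality, membership, and the mode) can be sketched as follows. The cities and the product catalogue are hypothetical examples.

```python
from statistics import mode

# Hypothetical shipping cities for five orders.
cities = ["Suzhou", "Shanghai", "Suzhou", "Beijing", "Suzhou"]

# Equality: do the first and third orders ship to the same place?
same_address = cities[0] == cities[2]

# Membership: does a product belong to a broader category?
# The catalogue is a made-up mapping from category to products.
catalogue = {"fruit": {"apple", "pear"}, "drinks": {"tea", "cola"}}
is_fruit = "apple" in catalogue["fruit"]

# Measure of center for nominal data: the most frequent value.
most_common_city = mode(cities)

print(same_address, is_fruit, most_common_city)
```

  Calling a mean or median function on these strings would either fail or return something meaningless, which is exactly why the mode is the measure of center here.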

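  (For the median and the mode, see "About averages".)

Ordinal scale

  Nominal data cannot be ordered by any natural property, whereas ordinal data supports greater-than and less-than comparison and can therefore be sorted. Sorting here means ordering in cases where comparing the values is meaningful, not merely asc and desc in a program.

  The ordinal scale does not support multiplication or division, which is easy to understand. But many sources say that ordinal data does not support addition or subtraction either (subtraction and addition are the same thing, since a-b is a+(-b)) and use this as the test for the ordinal scale. That is harder to understand, so we need a test that is easier to apply.

  We often see charts of the education levels of a company's staff:

  The chart above shows the education levels of staff at an internet company, divided into four levels: junior college, bachelor's, master's, and doctorate, which can be numbered 1, 2, 3, and 4. Ordering education levels is meaningful, but what about subtracting them? Perhaps that is meaningful too: 3-1=2 and 4-2=2, and both 2s express a gap in education level. Whether that gap is useful is debatable, though. Can you immediately think of a place where this difference is needed? So we say that one test for the ordinal scale is this: the values are not necessarily impossible to subtract, but the difference has little or no clear use. Another test is that the ordinal scale usually takes the median rather than the mean as its measure of center. The median of the chart above is 2, which corresponds to bachelor's, the level held by most staff; the mean might be 2.1, which does not correspond to any definite category. That is why HR will say "the average education level at our company is a bachelor's degree" rather than "the average education level at our company is a tiny bit above a bachelor's degree."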
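  The contrast between the median and the mean on ordinal data can be sketched like this. The staff list is hypothetical, and median_low is used so that the result is always one of the actual levels rather than a value halfway between two of them.

```python
from statistics import mean, median_low

# Education levels in their natural order, numbered 1 to 4.
LEVELS = ["junior college", "bachelor's", "master's", "doctorate"]
RANK = {name: i + 1 for i, name in enumerate(LEVELS)}

# Hypothetical staff list, made up for illustration.
staff = ["bachelor's", "junior college", "bachelor's", "master's", "bachelor's",
         "doctorate", "bachelor's", "master's", "junior college", "bachelor's"]

ranks = sorted(RANK[level] for level in staff)

# The median maps back to a real category.
median_rank = median_low(ranks)
median_level = LEVELS[median_rank - 1]

# The mean generally does not correspond to any category.
mean_rank = mean(ranks)

print(median_level)
print(mean_rank)
```

  Here the median lands on "bachelor's", while the mean of 2.2 sits between two levels and names neither.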

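Interval scale

  Besides having the properties of the ordinal scale, the interval scale supports meaningful addition and subtraction.

  The average November temperature in Shanghai over the last 20 years and the ages of a company's employees are both interval data. The difference between two interval values is meaningful and commonly used: last November's average temperature was 2℃ higher than this year's; Lao Li is 10 years older than Xiao Wang. Clearly, interval data can use the mean as its measure of center.

  For a given data set we often want to know how much the data fluctuates, and this is where the standard deviation comes in:

  $$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - r)^2}$$

  where r is the mean, N is the number of data points, and x_i is the i-th value.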

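  The table below shows the scores of two shooters after 5 rounds:

  The mean is 9.5 for both, so the two seem evenly matched. But a look at the data shows that A is a streaky shooter whose scores fluctuate widely: A can shoot a "super ring" but also a far-below-par "low ring". B, by contrast, is steady and always close to the average score.

  Every shot's score fluctuates. Subtracting the average score from each shot's score gives that shot's deviation, which yields the following data:

  Now the overall fluctuation of the two shooters can be computed:

  B's fluctuation is far smaller than A's, which shows that B is more consistent.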
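  The comparison can be reproduced with a short sketch. The scores below are hypothetical (chosen only so that both shooters average 9.5); the function follows the population standard deviation, that is, the square root of the mean squared deviation from the mean r over N values.

```python
from math import sqrt
from statistics import pstdev


def std_dev(values):
    """Population standard deviation: sqrt of the mean squared deviation."""
    n = len(values)
    r = sum(values) / n                      # the mean
    return sqrt(sum((x - r) ** 2 for x in values) / n)


# Hypothetical scores for 5 rounds; both shooters average 9.5.
shooter_a = [10.9, 8.1, 10.5, 8.0, 10.0]
shooter_b = [9.5, 9.4, 9.6, 9.5, 9.5]

mean_a = sum(shooter_a) / len(shooter_a)
mean_b = sum(shooter_b) / len(shooter_b)

# Per-shot deviation from the mean, as described in the text.
deviations_a = [x - mean_a for x in shooter_a]
deviations_b = [x - mean_b for x in shooter_b]

std_a = std_dev(shooter_a)
std_b = std_dev(shooter_b)

print(round(mean_a, 2), round(mean_b, 2))
print(round(std_a, 3), round(std_b, 3))

# The standard library gives the same result.
print(round(pstdev(shooter_a), 3), round(pstdev(shooter_b), 3))
```

  The raw deviations always sum to zero, which is why they are squared before averaging; the square root then brings the result back to the original unit (rings).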

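  For more on standard deviation and data fluctuation, see: "Variance, standard deviation, and covariance"

  Whether the standard deviation can be used is also one of the reference tests for telling the ordinal scale from the interval scale. For human IQ, the average is usually given by the median, and computing the fluctuation of IQ is meaningless, so IQ belongs to the ordinal scale. Of course, IQ may fluctuate, for instance dropping 70% at the sight of a beautiful woman, but that is a matter for metaphysics.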

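Ratio scale

  Ratio data is the most powerful kind. Besides having the properties of the interval scale, it supports multiplication and division, and it has an absolute or natural zero point, that is, a common starting point or base for comparison.

  Income and savings are typical ratio data; we often say that someone's income is twice our own.

  The line between the ratio scale and the interval scale is also subtle; the key is still whether multiplication and division have a clear meaning. Take exam scores: 0 can serve as a natural starting point, but we usually just say that A scored 30 points higher than B, not that A's score is twice B's, so scores are used to determine the distance between two people, not the ratio. My home is 100 square meters and my classmate's is 200 square meters; my classmate's home is twice the size of mine. Area is ratio data.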
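  One informal way to see the difference is to change units and check what survives. In the sketch below (an illustration, not a formal test), temperature differences stay the same between Celsius and Kelvin but the ratio does not, whereas the ratio of two areas is unchanged by a unit conversion. The temperatures are made-up values and the conversion factor is approximate.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


SQM_TO_SQFT = 10.7639  # approximate square feet per square meter

# Interval data: two hypothetical temperatures in Celsius.
warm_c, cool_c = 20.0, 10.0
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)
ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Ratio data: the two home areas from the text, in square meters.
mine_sqm, classmate_sqm = 100.0, 200.0
area_ratio_sqm = classmate_sqm / mine_sqm
area_ratio_sqft = (classmate_sqm * SQM_TO_SQFT) / (mine_sqm * SQM_TO_SQFT)

print(diff_c, round(diff_k, 6))          # the difference survives the unit change
print(ratio_c, round(ratio_k, 3))        # the ratio does not
print(area_ratio_sqm, round(area_ratio_sqft, 6))  # the area ratio survives
```

  The Celsius zero is a convention rather than a natural starting point, so "20℃ is twice 10℃" says nothing physical, while "200 square meters is twice 100 square meters" holds in any unit.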

  


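  Author: 我是8位的

  Source: https://mp.weixin.qq.com/s/XJROL6iAFZ5XuFNq4WT86g

  This article is mainly for learning, research, and sharing. To reprint it, please contact the author and credit the author and the source. Non-commercial use only!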


Origin www.cnblogs.com/bigmonkey/p/11820614.html