Data analysis from scratch for actual combat | Basics (6)

This article is long (3k+ words) and highly practical. I recommend bookmarking it first, sharing it to Moments, and then reading it slowly over a meal or a break. Read it repeatedly, and memorize and practice repeatedly.

0. Preface

The previous articles covered creating a virtual environment for data analysis and using pandas to read and write CSV, TSV, JSON, Excel, and XML data, read HTML pages, and work with databases. Today we continue exploring.


Data analysis from scratch

Data analysis from scratch for actual combat | Basics (1)

Data analysis from scratch for actual combat | Basics (2)

Data analysis from scratch | Basics (3)

Data analysis from scratch for actual combat | Basics (4)

Data analysis from scratch | Basics (5)



Reference book for this series of study notes: "Practical Data Analysis Cookbook" (《数据分析实战》) by Tomasz Drabas

1. Summary of Basic Knowledge

1. Introduction to data conversion tool OpenRefine
2. Data conversion tool OpenRefine installation
3. Data conversion tool OpenRefine basic use
4. Data conversion tool OpenRefine advanced use

2. Start using your brain

1. Introduction to data conversion tool OpenRefine

OpenRefine is an interactive data transformation tool (IDT), open source software released by Metaweb in 2009. Google acquired Metaweb in 2010 and renamed the project from Freebase Gridworks to Google Refine. Later, Google stopped supporting the project, and the community renamed it OpenRefine.

It lets you manipulate and process data visually. It looks much like traditional Excel, but it works more like a database, because it deals with columns and fields rather than individual cells. This means that OpenRefine is not well suited to adding new content, but it is powerful for exploring, cleaning, and integrating data. It is mainly used to quickly filter data, clean data, remove duplicates, and analyze distributions and trends over time.

2. Data conversion tool OpenRefine installation

(1) Download: http://openrefine.org/download.html
The OpenRefine home page describes it as "A free, open source, powerful tool for working with messy data".


I downloaded OpenRefine 3.2 beta because it was relatively new and I wanted to try it.
My computer runs Windows, so I downloaded the Windows kit. Choose the download that matches your own operating system. The package is fairly large at 95.9 MB.



(2) After downloading, unzip the package and then click the openrefine.exe file to start the service.




(3) From step (2) we can see that the service address is http://127.0.0.1:3333/. Open this address in a browser to use OpenRefine. If your English is as weak as Laobiao's (the author's own nickname), I suggest opening it in Google Chrome, which can translate the page content automatically with fairly high accuracy.
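As a quick sanity check, the address above can also be probed from Python. This is only a minimal sketch using the standard library. The function name refine_is_running is made up for illustration, and the default URL is simply the address quoted in step (3).

```python
import urllib.request


def refine_is_running(url="http://127.0.0.1:3333/", timeout=2.0):
    """Return True if an HTTP server answers at `url` with status 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts
        # are all subclasses of OSError.
        return False


if __name__ == "__main__":
    print("OpenRefine reachable:", refine_is_running())
```

If this prints False, the service from step (2) is probably not running yet.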



3. Basic use of data conversion tool OpenRefine

(1) After opening OpenRefine following the steps above, the first step is to import a file. The sample file given in the book is realEstate_trans_dirty.csv. Click to choose a file, select this file, and click Open.



(2) After selecting the file, click Next to load the data. OpenRefine can read many file formats, such as CSV / TSV / delimiter-based files, line-based text files, fixed-width field text files, PC-Axis text files, JSON files, MARC files, JSON-LD files, RDF/N3 files, RDF/N-Triples files, and Excel files.
Note also that imported data is treated as text, so before any later analysis the columns that hold numbers must be converted to a numeric type (for example, the beds and baths columns).
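To see why the text format matters, here is a small Python sketch using the standard csv module, which also reads every field as text. The two sample rows are made up for illustration. Only the column names beds and baths come from the sample file.

```python
import csv
import io

# Made-up sample rows; only the column names come from the article.
SAMPLE = "beds,baths\n2,1\n3,2\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

first = rows[0]
# Every field arrives as a string, so "+" concatenates instead of adding.
as_text = first["beds"] + first["baths"]
# After an explicit conversion the same operation is arithmetic.
as_number = int(first["beds"]) + int(first["baths"])

print(type(first["beds"]).__name__, as_text, as_number)
```

The text version yields "21" while the numeric version yields 3, which is the kind of silent mistake the conversion step avoids.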



With the data loaded in the previous step, click Create Project in the upper right corner to create the project. You can then start the preliminary processing of the data.



(3) Data format conversion: direct conversion (for example, the beds and baths columns)
Example: convert the data in the beds column to a numeric type
a. Click the small triangle to the left of beds
b. Click Edit cells
c. Click Common transforms
d. Select To number (converts to a numeric type).
The same menu also offers other conversions, such as To date (date type), To text (text type), To null (empty value), and To uppercase (uppercase).



The conversion was successful. To prepare for later analysis, we convert the baths, sq__ft, price, latitude, and longitude columns to numeric types one by one in the same way.
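For readers who prefer code, the same column-by-column conversion could be sketched in plain Python as follows. This is not what OpenRefine does internally. The helper names to_number and convert_columns and the two sample rows are made up for illustration, and only the column names come from the article.

```python
import csv
import io

NUMERIC_COLUMNS = ["beds", "baths", "sq__ft", "price", "latitude", "longitude"]

# Made-up rows for illustration; the second row has a blank price.
SAMPLE = (
    "street,beds,baths,sq__ft,price,latitude,longitude\n"
    "1 EXAMPLE ST,2,1,836,59222,38.63,-121.43\n"
    "2 EXAMPLE AVE,3,2,1167,,38.47,-121.43\n"
)


def to_number(text):
    """Convert text to int or float; return None for blank or non-numeric text."""
    text = text.strip()
    if not text:
        return None
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return None


def convert_columns(rows, columns):
    """Return new rows with the named columns converted to numbers."""
    return [
        {key: to_number(val) if key in columns else val for key, val in row.items()}
        for row in rows
    ]


rows = convert_columns(csv.DictReader(io.StringIO(SAMPLE)), NUMERIC_COLUMNS)
print(rows[0]["price"], rows[1]["price"], rows[0]["latitude"])
```

Blank cells become None instead of raising an error, so the missing prices remain visible for later cleaning.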



(4) Data format conversion: data that must be processed before it can be converted (for example, the sale_date column)
In the sale_date column the data looks like Wed May 21 00:00:00 EDT 2008. We want this data to be easier to read and to have a suitable data type. It clearly should not be a string, so we convert it to a date type. This takes a little more skill and cannot be done with Common transforms.
Example: convert the data in the sale_date column to a date type
a. Click the small triangle to the left of sale_date
b. Click Edit cells
c. Click Transform...



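d. Select GREL (General Refine Expression Language, formerly Google Refine Expression Language) and enter an expression to convert the date.



# Original data
Wed May 21 00:00:00 EDT 2008
# Converted data
2008-05-21T00:00:00Z

# GREL expression
(substring(value, 4, 10) + ',' + substring(value, 24, 29)).toDate()
# Explanation
'''
value is the cell content, i.e. Wed May 21 00:00:00 EDT 2008
substring is the function that extracts part of a string
The first argument is the string to slice, i.e. Wed May 21 00:00:00 EDT 2008
The second argument is the start index; 4 is the index of M
The third argument is the end index; 10 is the index of the space after 21

After the pieces are cut out and joined, toDate() converts the result to the date type.
'''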
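The same slicing idea can be tried out in Python, which makes it easy to check the indices. This is only a sketch of the logic, not of how OpenRefine evaluates GREL. The function name parse_sale_date is made up, and the %b format code assumes English month names (the default C locale).

```python
from datetime import datetime


def parse_sale_date(value):
    """Slice out 'May 21' and '2008', join them with a comma, and parse."""
    # Same indices as the GREL expression: [4, 10) and [24, 29).
    # Python slicing clamps the end index, so 29 is fine on a 28-character string.
    text = value[4:10] + "," + value[24:29]
    # %b expects an abbreviated English month name under the default locale.
    return datetime.strptime(text, "%b %d,%Y")


parsed = parse_sale_date("Wed May 21 00:00:00 EDT 2008")
print(parsed.isoformat())  # 2008-05-21T00:00:00
```

Note that this sketch drops the time of day and the EDT time zone, just as the sliced text does.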



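At this point, we have completed a rough first pass over the data.

4. Advanced use of data conversion tool OpenRefine

Understanding your data is a prerequisite for building a successful model.     ---- from the reference book for this series

(1) OpenRefine facets: text facet

First, "facet" literally means a face or surface. Here we can think of it as a filter that lets you quickly select certain rows or explore the data directly.

A text facet quickly gives you a feel for the distribution of a text column in the data set, that is, information about the text data along some dimension.
Example: find which city appears most often in city_state_zip (city, state, and ZIP code)
a. Click the small inverted triangle to the left of city_state_zip
b. Click Facet -> Text facet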



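Looking closely at the result, we find that many entries are actually the same city. Only the ZIP code differs, so the count treats them as different cities. We therefore need to process the data before counting.

This time, after clicking Facet we choose Custom text facet.

A single GREL expression processes the data and extracts the city name from city_state_zip.



'''Expression explained'''
value.match("(.*?) CA.*?")[0]
'''
value is the cell content, e.g. SACRAMENTO CA 95823
match is the regular-expression extraction function
Its argument is the regex pattern string, which here means: take the string before " CA", i.e. the city name
'''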
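To experiment with the pattern outside OpenRefine, the same regular expression can be run with Python's re module. This is a sketch, with extract_city as a made-up helper name. It uses fullmatch because GREL's match tests the pattern against the entire string.

```python
import re

# Same pattern as the GREL expression above.
CITY_PATTERN = re.compile(r"(.*?) CA.*?")


def extract_city(value):
    """Return the text before ' CA', or None when the pattern does not match."""
    found = CITY_PATTERN.fullmatch(value)
    return found.group(1) if found else None


print(extract_city("SACRAMENTO CA 95823"))  # SACRAMENTO
print(extract_city("ELK GROVE CA 95758"))   # ELK GROVE
```

The pattern is hard-coded to CA, so it only works for data sets where every row is in California, as in this sample file.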






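The original data records transactions between 2008.5.15 and 2008.5.21. From this result it is clear that SACRAMENTO had the most transactions in this period, followed by ELK GROVE. This is much more convenient than writing Python code to process and count the data, provided, of course, that you are reasonably proficient with OpenRefine.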
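For comparison, the Python counting that this paragraph refers to might look roughly like the sketch below. The five values are made up in the city_state_zip format, and city_of is a hypothetical helper that applies the same pattern as the GREL expression.

```python
import re
from collections import Counter

# Made-up values in the city_state_zip format, for illustration only.
values = [
    "SACRAMENTO CA 95823",
    "SACRAMENTO CA 95838",
    "ELK GROVE CA 95758",
    "SACRAMENTO CA 95815",
    "ELK GROVE CA 95624",
]


def city_of(value):
    """Return the city name before ' CA', or the raw value if nothing matches."""
    found = re.fullmatch(r"(.*?) CA.*?", value)
    return found.group(1) if found else value


raw_counts = Counter(values)                       # counts city + ZIP together
city_counts = Counter(city_of(v) for v in values)  # counts cities only

print(len(raw_counts))            # 5 distinct raw values
print(city_counts.most_common())  # [('SACRAMENTO', 3), ('ELK GROVE', 2)]
```

Counting the raw values reproduces the problem seen in the plain text facet, while counting the extracted names corresponds to the custom text facet.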


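

(2) OpenRefine facets: numeric facet

Example: view the distribution of price
a. Click the inverted triangle to the left of price
b. Click Facet -> Numeric facet
We find that 108 rows in the original data have a blank price and 1067 rows have a value. Prices range from 0 to 890000, with most of the data toward the left. To pin down where the data is concentrated, drag the sliders at either end. Prices turn out to be concentrated between 60000 and 400000.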
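The numbers a numeric facet reports (blank count, value count, range, and the share inside a chosen window) can be reproduced with a few lines of Python. The price strings below are made up for illustration and are not taken from the sample file. Only the 60000 to 400000 window comes from the text.

```python
# Made-up price strings for illustration; "" stands for a blank cell.
prices_text = ["59222", "", "68212", "120000", "", "250000", "890000", "0"]

blank = [p for p in prices_text if not p.strip()]
numeric = [int(p) for p in prices_text if p.strip()]

low, high = 60000, 400000  # the window found by dragging the sliders
in_range = [p for p in numeric if low <= p <= high]

print(len(blank), len(numeric))      # 2 6
print(min(numeric), max(numeric))    # 0 890000
print(len(in_range) / len(numeric))  # 0.5
```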


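

(3) OpenRefine also has timeline facets and scatterplot facets

Timeline facet: shows how much data falls at different points in time.
Scatterplot facet: lets you analyze the interaction between numeric variables in the data set.

They are used in the same way as the text and numeric facets above. They let you look at the data from different angles so that it is presented more clearly.


(4) OpenRefine data deduplication

Here we process the street column. The same house will not be sold twice within one week, so identical street values indicate duplicate data.
a. Click the inverted triangle to the left of street
b. Click Edit cells -> Blank down
Blank down: blanks out a cell whose value repeats the cell above it (used to remove duplicate data);
Fill down: if a cell is blank, fills it with the value from the row above (used to fill in missing data).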
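The two operations are easy to mimic on a plain list, which helps make their behavior concrete. This is a sketch of the idea rather than OpenRefine's actual implementation. The function names and the sample street values are made up for illustration.

```python
def blank_down(values):
    """Replace a value with "" when it equals the value directly above it."""
    result = []
    previous = None
    for value in values:
        result.append("" if value == previous else value)
        previous = value
    return result


def fill_down(values):
    """Replace each blank value with the nearest non-blank value above it."""
    result = []
    last_seen = ""
    for value in values:
        if value == "":
            result.append(last_seen)
        else:
            result.append(value)
            last_seen = value
    return result


# Made-up street values for illustration.
streets = ["1 EXAMPLE ST", "1 EXAMPLE ST", "2 EXAMPLE AVE", "1 EXAMPLE ST"]
blanked = blank_down(streets)
print(blanked)             # ['1 EXAMPLE ST', '', '2 EXAMPLE AVE', '1 EXAMPLE ST']
print(fill_down(blanked))  # the original list again
```

In this sketch only adjacent repeats are blanked, so the last row survives even though it repeats the first one. Sorting the column first brings identical values next to each other.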



(5) OpenRefine quickly removes blank and missing data

How do we remove the blank rows scattered through the data?
We can create a blank-value filter.
a. Click the small inverted triangle to the left of street
b. Click Facet -> Customized facets -> Facet by blank
This lets you filter out all rows whose street value is blank.



c. Click the small inverted triangle to the left of All
d. Click Edit rows -> Remove all matching rows to delete all the blank rows.
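The filter-then-remove steps correspond to a simple list filter in Python. The rows below are made up for illustration and is_blank is a hypothetical helper. It treats only None and the empty string as blank, which is an assumption about what counts as blank here.

```python
# Made-up rows for illustration; two of them have no street value.
rows = [
    {"street": "1 EXAMPLE ST", "price": "59222"},
    {"street": "", "price": "59222"},
    {"street": "2 EXAMPLE AVE", "price": "68212"},
    {"street": None, "price": "68212"},
]


def is_blank(value):
    """Assumption: only None and the empty string count as blank."""
    return value is None or value == ""


# Keep every row whose street is not blank (the opposite of the matching rows).
kept = [row for row in rows if not is_blank(row["street"])]
print(len(rows), len(kept))  # 4 2
```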



In addition, GREL is important in OpenRefine and is an expression language in its own right. For the specific syntax, see the GREL-Functions page on GitHub: https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions


Origin blog.51cto.com/15069482/2578585