[Statistics Notes 3] Organizing and Displaying Data

Data preprocessing

Data preprocessing is a necessary step before the data are grouped or classified. It includes data review, filtering, sorting, and so on.

Data review: checking the data for errors. This includes a completeness review and an accuracy review.
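As a rough sketch of what such a review can look like in Python with pandas: the table, its column names, and the plausible age range of 0 to 120 are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical survey records; names and values are invented for illustration.
records = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "age": [34, None, 29, 210],
    "income": [5200.0, 4100.0, None, 3900.0],
})

# Completeness review: count the missing values in each column.
missing_per_column = records.isna().sum()

# Accuracy review: flag ages outside an assumed plausible range of 0 to 120.
implausible_age = records[(records["age"] < 0) | (records["age"] > 120)]

print(missing_per_column)
print(implausible_age)
```

The flagged rows still need a human decision: an implausible value may be a typing error to correct or a record to drop.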

Data filtering: picking out the data that meet specific criteria, as required.
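Filtering can be sketched with a boolean mask in pandas. The sales table and the two conditions below are hypothetical examples, not data from these notes.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "amount": [120, 80, 45, 200],
})

# Keep only the rows that satisfy both conditions:
# the region is "North" and the amount is at least 100.
north_large = sales[(sales["region"] == "North") & (sales["amount"] >= 100)]

print(north_large)
```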

Data sorting: arranging the data in a certain order so that researchers can spot obvious features or trends by browsing the data and find clues for solving the problem.
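A minimal sorting sketch, again with pandas. The student names and scores are made up for illustration.

```python
import pandas as pd

# Hypothetical exam scores, invented for illustration.
scores = pd.DataFrame({
    "student": ["Ann", "Bob", "Cid", "Dee"],
    "score": [72, 95, 88, 60],
})

# Sort from the highest score to the lowest and renumber the rows.
ranked = scores.sort_values("score", ascending=False).reset_index(drop=True)

print(ranked)
```

With the rows in order, the top and bottom of the table can be read off directly.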

Excel, which most people are familiar with, can perform all of the above functions. At work, Excel is probably the most commonly used data preprocessing tool.

Cross-Industry Standard Process for Data Mining (CRISP-DM): currently the standard process model for data mining projects. According to this process, the first phase of data mining is business understanding, also called research understanding. In this phase the company and the researchers first clarify the project objectives, then translate those objectives into a data mining problem definition, and finally develop a preliminary strategy for achieving them. Before data mining can be carried out, the data need to be preprocessed, which takes two forms: data cleaning and data transformation.

Why do we need to clean the data?

Because the data are collected from various sources, they may have problems such as:

Obsolete or redundant data
Missing values
Outliers
Data in a form that is not suitable for data mining models
Data that are inconsistent with policy or common sense
and so on
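A few of the problems listed above can be handled as in the sketch below. It is only one possible approach. The table is invented, filling missing values with the median is an assumed choice, and the 1.5 × IQR rule is a common convention for flagging outliers rather than something these notes prescribe.

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing value, and an extreme value.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5],
    "value": [10.0, 12.0, 12.0, None, 11.0, 95.0],
})

# Redundant data: drop exact duplicate rows.
deduped = raw.drop_duplicates().reset_index(drop=True)

# Missing values: fill with the column median (one choice among many).
median_value = deduped["value"].median()
filled = deduped.assign(value=deduped["value"].fillna(median_value))

# Outliers: flag values more than 1.5 * IQR beyond the quartiles.
q1 = filled["value"].quantile(0.25)
q3 = filled["value"].quantile(0.75)
iqr = q3 - q1
is_outlier = (filled["value"] < q1 - 1.5 * iqr) | (filled["value"] > q3 + 1.5 * iqr)

print(filled[is_outlier])
```

Whether a flagged outlier is removed, corrected, or kept depends on the business context, so the sketch only marks it.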

For these reasons, the data must be preprocessed before data analysis and data mining.

Common graphical representations of data

illustration

You can easily create a variety of charts with Python, either on its own or through the Anaconda distribution.

My favorite tool is Anaconda. It already comes with the following libraries: NumPy, SciPy, Matplotlib, Pandas and Scikit-Learn.

Download Anaconda: https://www.anaconda.com/distribution/

The following figures were made with Anaconda, and of course the development language is Python.
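As an illustration of how such a chart can be produced, here is a minimal bar chart sketch with Matplotlib. The categories, counts, and output file name are assumptions made up for the example, and the Agg backend is chosen only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show() instead
import matplotlib.pyplot as plt

# Hypothetical category counts, invented for illustration.
categories = ["A", "B", "C", "D"]
counts = [23, 17, 35, 29]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(categories, counts)  # one bar per category
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Counts by category")
fig.tight_layout()

# Save the chart to an image file (the file name is arbitrary).
fig.savefig("bar_chart.png")
```

In a Jupyter notebook or the Spyder IDE shipped with Anaconda, the figure would normally be displayed inline instead of being saved.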

Python Data Mining Related Extensions

Extension	Brief introduction
NumPy	Provides array support and the corresponding efficient processing functions
SciPy	Provides matrix support and modules for matrix-related numerical computation
Matplotlib	Powerful data visualization tool and plotting library
Pandas	Powerful and flexible data analysis and exploration tool
StatsModels	Statistical modeling and econometrics, including descriptive statistics and statistical model estimation and inference
Scikit-Learn	Powerful machine learning library supporting regression, classification, clustering, and more
Keras	Deep learning library for building neural networks and deep learning models
Gensim	Library for building text topic models; may be useful for text mining