[Statistics notes 3] Organizing and displaying data


Data preprocessing

Before data are grouped or classified, they need to be preprocessed. Preprocessing includes reviewing, filtering, and sorting the data, among other steps.

Data review: check whether the data contain errors. This covers two things: a completeness review and a correctness review.

Data filtering: pick out, as needed, the data that meet specific criteria.

Data sorting: arrange the data in a chosen order, so that browsing them reveals obvious features or trends and offers clues for solving the problem.

Anyone familiar with Excel can use it to carry out the steps above. In everyday work, Excel is probably the most commonly used data preprocessing tool.
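The same three steps can also be sketched in pandas. This is only a minimal illustration: the table, its column names, and the assumed plausible age range of 0 to 120 are all made up for the example.

```python
import pandas as pd

# Hypothetical survey records; names, columns, and values are invented.
records = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cid", "Dee"],
        "age": [34, None, 27, 230],
        "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],
        "income": [5200, 6100, 4800, 7300],
    }
)

# Data review (completeness): count the missing values in each column.
missing_per_column = records.isna().sum()

# Data review (correctness): flag recorded ages outside an assumed range.
age_present = records["age"].notna()
age_plausible = records["age"].between(0, 120)
implausible_age = records[age_present & ~age_plausible]

# Data filtering: keep only the rows that meet a criterion.
beijing = records[records["city"] == "Beijing"]

# Data sorting: order the rows by income, highest first.
by_income = records.sort_values("income", ascending=False)

print(missing_per_column)
print(implausible_age)
print(beijing)
print(by_income)
```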

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely used standard process model for data mining. According to this model, the first phase of a data mining project is business understanding, also called research understanding: the company and the researchers first clarify the project objectives, then translate those objectives into a data mining problem definition, and finally draw up a preliminary strategy for achieving them. Before mining begins, the data must be preprocessed, which takes two forms: data cleaning and data transformation.
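The note does not say which transformations it has in mind. As an assumption, two common ones are min-max normalization and z-score standardization, sketched here with pandas on made-up values.

```python
import pandas as pd

# Hypothetical measurements; the values are invented for illustration.
heights_cm = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])

# Min-max normalization: rescale the values linearly into [0, 1].
value_range = heights_cm.max() - heights_cm.min()
min_max = (heights_cm - heights_cm.min()) / value_range

# Z-score standardization: subtract the mean and divide by the
# population standard deviation (ddof=0).
z_score = (heights_cm - heights_cm.mean()) / heights_cm.std(ddof=0)

print(min_max.tolist())
print(z_score.round(3).tolist())
```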

Why do we need to clean the data?

Because the data are collected from various sources, they may have problems such as:

  • Obsolete or redundant data
  • Missing Values
  • Outliers
  • Data in a form unsuitable for data mining models
  • Values inconsistent with policy or common sense
  • And so on

The data therefore need to be preprocessed before any analysis or mining.
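A minimal pandas sketch of three of these problems follows: redundant rows, missing values, and outliers. The order table is invented. Filling gaps with the median and flagging outliers with the 1.5 × IQR rule are common conventions chosen here as assumptions, not methods prescribed by this note.

```python
import pandas as pd

# Hypothetical order data with a duplicate row, a gap, and an extreme value.
raw = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 3, 4, 5, 6],
        "amount": [20.0, 22.0, 22.0, None, 19.0, 21.0, 500.0],
    }
)

# Redundant data: drop rows that are exact duplicates.
deduped = raw.drop_duplicates()

# Missing values: fill gaps with the median of the observed amounts.
median_amount = deduped["amount"].median()
filled = deduped.assign(amount=deduped["amount"].fillna(median_amount))

# Outliers: flag values more than 1.5 * IQR outside the quartiles.
q1 = filled["amount"].quantile(0.25)
q3 = filled["amount"].quantile(0.75)
iqr = q3 - q1
is_outlier = (filled["amount"] < q1 - 1.5 * iqr) | (
    filled["amount"] > q3 + 1.5 * iqr
)
cleaned = filled[~is_outlier]

print(cleaned)
```

Whether an outlier should be dropped, corrected, or kept depends on the project; the sketch drops it only to show the mechanics.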


Common charts for visualizing data


illustration

You can easily create a wide variety of charts with Python, either directly or through the Anaconda distribution.

My favorite tool is Anaconda. It already ships with the following libraries: NumPy, SciPy, Matplotlib, pandas, and scikit-learn.

Download Anaconda: https://www.anaconda.com/distribution/

The charts in this note were made with Anaconda, and the development language is, of course, Python.
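As a rough sketch of what such a chart script can look like, the following draws a bar chart and a pie chart with Matplotlib. The categories, counts, and output file name are assumptions made up for the example.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical category counts; the values are invented for illustration.
categories = ["A", "B", "C", "D"]
counts = [12, 30, 18, 7]

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(8, 3.5))

# Bar chart: one bar per category, with labeled axes.
bars = ax_bar.bar(categories, counts)
ax_bar.set_title("Bar chart")
ax_bar.set_xlabel("Category")
ax_bar.set_ylabel("Count")

# Pie chart: each category's share of the total.
ax_pie.pie(counts, labels=categories, autopct="%1.0f%%")
ax_pie.set_title("Pie chart")

fig.tight_layout()
fig.savefig("charts.png")  # hypothetical output file name
```

In an interactive session such as a Jupyter notebook, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` used in place of `savefig`.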

 

Python Data Mining Related Extensions

Extension       Brief introduction
NumPy           Provides array support and efficient functions for processing arrays
SciPy           Provides matrix support and modules for matrix-related numerical computation
Matplotlib      Powerful data visualization tool and plotting library
pandas          Powerful and flexible data analysis and exploration tool
StatsModels     Statistical modeling and econometrics, including descriptive statistics, statistical model estimation, and inference
scikit-learn    Powerful machine learning library supporting regression, classification, and clustering
Keras           Deep learning library for building neural networks and deep learning models
Gensim          Library for building text topic models; may be useful for text mining

 


Analyzing data characteristics

Once the quality of the data has been checked, the next step is to analyze their characteristics by drawing charts, computing summary statistics, and similar means.
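The note does not list which summary statistics it means. As an illustration only, a few common ones (center, spread, and shape) can be computed with pandas on made-up scores:

```python
import pandas as pd

# Hypothetical exam scores; the values are invented for illustration.
scores = pd.Series([55, 62, 62, 70, 71, 75, 80, 95])

summary = {
    "mean": scores.mean(),                 # arithmetic mean
    "median": scores.median(),             # middle value
    "mode": scores.mode().tolist(),        # most frequent value(s)
    "range": scores.max() - scores.min(),  # spread between the extremes
    "std": scores.std(),                   # sample standard deviation (ddof=1)
    "skew": scores.skew(),                 # asymmetry of the distribution
}

for name, value in summary.items():
    print(name, value)
```

`scores.describe()` returns the count, mean, standard deviation, minimum, quartiles, and maximum in a single call.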


Using charts sensibly

A well-designed chart is an effective tool for displaying data.

A good graph should have the following basic features:

(1) It displays the data.

(2) It makes readers focus on the content of the chart rather than on how the chart was produced.

(3) It avoids distortion.

(4) It emphasizes comparisons between the data.

(5) It serves a clear purpose.

(6) It comes with a statistical description and explanatory text.
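Several of these points can be made concrete in a small Matplotlib sketch. The sales figures are invented, and the mapping of code lines to the numbered features is my own reading of the list, not something the note spells out.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical quarterly sales for two years; the numbers are invented.
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales_2019 = [102, 110, 108, 115]
sales_2020 = [98, 104, 112, 120]

positions = list(range(len(quarters)))
width = 0.4

fig, ax = plt.subplots(figsize=(6, 3.5))

# (4) Side-by-side bars make the year-to-year comparison direct.
ax.bar([p - width / 2 for p in positions], sales_2019, width, label="2019")
ax.bar([p + width / 2 for p in positions], sales_2020, width, label="2020")

# (3) Starting the axis at zero keeps bar heights proportional to the values.
ax.set_ylim(bottom=0)

# (6) A title, axis labels, and a legend describe what is shown.
ax.set_xticks(positions)
ax.set_xticklabels(quarters)
ax.set_title("Quarterly sales, 2019 vs 2020")
ax.set_ylabel("Sales (units)")
ax.legend()

fig.tight_layout()
```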

 


Appendix: CRISP-DM resources

CRISP-DM methodology (Cross-Industry Standard Process for Data Mining)

Link: https://wiki.mbalib.com/wiki/CRISP-DM

 


Origin blog.csdn.net/seagal890/article/details/105014336