Python data analysis, data mining, and machine learning: environment configuration

One. What is data analysis

1. A common definition found online:

Data analysis is the process of analyzing a large amount of collected data with appropriate statistical methods, extracting useful information, forming conclusions, and studying and summarizing the data in detail.

2. Development and composition of data analysis

The mathematical foundations of data analysis were established in the early 20th century, but only the advent of computers made them practical to apply and allowed data analysis to spread. Data analysis is therefore a product of combining mathematics and computer science.
A common analysis tool is Excel.
The main activities of data analysis are identifying information needs, collecting data, analyzing data, and evaluating and improving the effectiveness of the analysis.

3. Features

Data analysis is multi-dimensional and descriptive. It is generally used together with data visualization tools.



Two. The Python data analysis environment and commonly used analysis packages

1. Types of data processed

The data is mainly structured: tabular data, multidimensional arrays (matrices), and multiple tables related by key columns, as in a database.
When necessary, other data can be converted into a structured form that is better suited to analysis and modeling.
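
As a rough illustration of these data types, the sketch below builds a small matrix with NumPy and two related tables with pandas, then joins the tables on their shared key. The table contents and the column names (order_id, customer_id, amount, name) are invented for the example.

```python
import numpy as np
import pandas as pd

# A multidimensional array (matrix): 2 rows x 3 columns.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Two tables related by a key column (hypothetical data).
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "amount": [25.0, 40.0, 15.5],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "name": ["Ann", "Bob"],
})

# Join the tables on the shared key, then total the amounts per customer.
merged = orders.merge(customers, on="customer_id", how="left")
totals = merged.groupby("name")["amount"].sum()

print(matrix.shape)
print(merged)
print(totals)
```

The join turns two separate tables into one data set that is easier to analyze, which is the kind of conversion mentioned above.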

2. Why Python

Python has many mature libraries, makes it easy to integrate code written in C, C++, Fortran, and other languages, and offers good algorithms for data manipulation.




Python also has shortcomings, which are set aside here. Next we introduce some of Python's important data analysis libraries.

Three. Installing the Python data analysis environment

1. IPython

(1) Introduction

IPython is an interactive computing system: a more interactive Python interpreter. It does not itself provide any computation or data analysis tools. It mainly provides an environment that is much easier to use than the default Python shell. It supports tab completion of variable names, automatic indentation, and bash shell commands, and it has many useful built-in features and functions.
IPython can be started from cmd by typing ipython, but most of the time it is started from Anaconda.

Its workflow is execute-explore. It is not limited to Python: kernels for other languages have been implemented for Jupyter, so many languages can be used in Jupyter.
Jupyter itself is introduced in the next section.

(2) Installation

Install directly with pip:
pip install ipython

2. Jupyter

(1) Introduction

Jupyter Notebook is an interactive notebook that supports running more than 40 programming languages.
In essence, it is a web application for creating and sharing literate programming documents, with support for live code, mathematical equations, visualizations, and markdown.
Applications include data analysis, data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and so on.
== In a Jupyter Notebook, code can produce images, video, LaTeX, and JavaScript output in real time. ==
Jupyter Notebook has become one of the tools most commonly used by data scientists.

(2) Installation

The official site has a detailed tutorial: https://jupyter.org/install
Install Jupyter with pip, or install it through Anaconda.
To open Jupyter, run jupyter notebook from the command line.
The browser then opens the notebook page automatically, and you can start editing.
You can also change to a specific folder first and then start jupyter notebook from there; some files will then be generated.

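3. Anaconda installer

(1) Introduction

Anaconda is an open-source Python distribution that contains conda, Python, and more than 180 scientific packages together with their dependencies.
At its core is conda, an open-source package and environment manager. It can install different versions of packages and their dependencies on the same machine, which makes it easy to switch between versions (of Python itself and of individual libraries) and between environments.
Anaconda includes conda, Python, and a large set of preinstalled packages such as numpy and pandas.
It is a Python tool suited to enterprise-level big data analysis. It covers more than 720 open-source packages related to data science, spanning data visualization, machine learning, deep learning, and other areas. It can be used not only for data analysis but also in big data and artificial intelligence work.
== Installing Anaconda is equivalent to installing Python, IPython, the Spyder integrated development environment, a number of packages, and so on. ==
A Python environment can be understood as one interpreter plus one collection of packages.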
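
To make the "interpreter plus packages" idea concrete, here is a minimal sketch that asks the running Python which interpreter it is and which packages its environment contains. It uses only the standard library (Python 3.8 or later is assumed for importlib.metadata); the variable names are illustrative.

```python
import sys
from importlib import metadata

# The interpreter half of the environment.
interpreter_path = sys.executable      # path to the running Python binary
environment_root = sys.prefix          # root directory of the active environment
python_version = ".".join(str(part) for part in sys.version_info[:3])

# The package half: every distribution visible to this interpreter.
names = set()
for dist in metadata.distributions():
    name = dist.metadata["Name"]
    if name:                           # skip entries with broken metadata
        names.add(name)
installed_packages = sorted(names, key=str.lower)

print("Interpreter:", interpreter_path)
print("Environment:", environment_root)
print("Python version:", python_version)
print("Installed packages:", len(installed_packages))
```

Running this from two different environments should print different paths and, usually, different package counts.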

(2) Installation

Go to the official site https://www.anaconda.com/
After installation, the following applications are available:

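  • Anaconda Navigator: a graphical user interface for managing packages and environments. Many of the management commands mentioned later can also be carried out by hand in Navigator.
  • Jupyter notebook: a web-based interactive computing environment for writing human-readable documents that present the data analysis process.
  • qtconsole: a terminal-like graphical program that runs IPython. Compared with the Python shell, qtconsole can display figures generated by code directly, supports entering and running multi-line code, and has many useful built-in features and functions.
  • spyder: a cross-platform integrated development environment for scientific computing in Python.

If something does not work, the environment path may be the cause.
On Windows, go to Computer -> right-click and choose Properties -> Advanced system settings -> Environment Variables -> System variables -> Path, and add the Anaconda installation directory to Path.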

After installation, entering conda --version in cmd prints the installed version.
Alternatively, open the Anaconda Prompt terminal directly.
Use conda list to list all installed libraries.
Creating different environments will be covered in the next tutorial.

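4. Jupyter, integrated development environments, and text editors

In general, interactive work is done in Jupyter, large-scale data processing in an integrated development environment (IDE), and simple tasks in a text editor.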



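Four. Commonly used data analysis packages

1. NumPy

NumPy is the fundamental package for scientific computing with Python. It contains:

  • A powerful N-dimensional array object
  • Sophisticated (broadcasting) functions
  • Tools for integrating C/C++ and Fortran code
  • Useful linear algebra, Fourier transform, and random number capabilities

Purpose: NumPy can store and process large matrices far more efficiently than Python's own nested list structure.
NumPy is even more convenient when used together with SciPy, which adds, among other things, sparse matrix operations.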
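
The features listed above can be sketched in a few lines. The example below is only an illustration with made-up numbers: it shows an N-dimensional array, broadcasting a row across a matrix (with the nested-list equivalent for comparison), and one call each from the linear algebra, random number, and Fourier transform modules.

```python
import numpy as np

# N-dimensional array: a 2 x 3 matrix holding the values 0..5.
matrix = np.arange(6).reshape(2, 3)

# Broadcasting: the 1-D row is added to every row of the matrix.
row = np.array([10, 20, 30])
shifted = matrix + row

# The same operation on nested lists needs explicit loops.
nested = [[0, 1, 2], [3, 4, 5]]
offsets = [10, 20, 30]
shifted_lists = [[value + offset for value, offset in zip(line, offsets)]
                 for line in nested]

# Linear algebra: solve 2x + y = 3 and x + 3y = 5.
coefficients = np.array([[2.0, 1.0], [1.0, 3.0]])
constants = np.array([3.0, 5.0])
solution = np.linalg.solve(coefficients, constants)

# Random numbers and Fourier transform.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=8)
spectrum = np.fft.fft(samples)

print(shifted)
print(solution)
print(spectrum.shape)
```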

2. pandas

pandas is a Python data analysis package built on NumPy and created to solve data analysis tasks.
pandas provides a large number of functions and methods for handling data quickly and easily.
Its data structures are as follows:

  • Series: a one-dimensional array, similar to a one-dimensional NumPy array and to Python's basic list structure. A Series can hold different data types: strings, boolean values, numbers, and so on can all be stored in a Series.
  • Time Series: a Series indexed by time.
  • DataFrame: a two-dimensional tabular data structure. Many of its features resemble R's data.frame. A DataFrame can be understood as a container of Series.
  • Panel: a three-dimensional array that can be understood as a container of DataFrames. Panel was removed in pandas 0.25; a DataFrame with a MultiIndex is the usual replacement.
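
A short sketch of the structures still in current pandas follows. The labels, dates, and values are invented for illustration.

```python
import pandas as pd

# Series: a labeled one-dimensional array.
prices = pd.Series([1.5, 2.0, 3.5], index=["a", "b", "c"])

# A Series can hold mixed types; pandas then stores them with dtype object.
mixed = pd.Series(["text", True, 3])

# Time series: a Series whose index is made of timestamps.
daily = pd.Series(
    [10, 12, 9],
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

# DataFrame: a two-dimensional table; each column is a Series.
table = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "score": [82, 91, 77],
})
score_column = table["score"]

print(prices["b"])
print(mixed.dtype)
print(daily.index)
print(type(score_column).__name__)
```

Selecting one column from the DataFrame returns a Series, which is what "container of Series" means in practice.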

3. matplotlib

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and in interactive environments across platforms.
It can generate line plots, histograms, power spectra, bar charts, error charts, scatter plots, and more.
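
As a minimal sketch of two of these chart types, the example below draws a histogram and a scatter plot side by side and saves them to a PNG file. The data is randomly generated, the output file name is arbitrary, and the non-interactive Agg backend is chosen only so that the script also runs without a display.

```python
import os
import random
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to files; no window is opened
import matplotlib.pyplot as plt

# Made-up data: 500 normally distributed values and a simple curve.
random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(500)]
xs = list(range(10))
ys = [x * x for x in xs]

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

counts, bin_edges, _ = ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

ax_scatter.scatter(xs, ys)
ax_scatter.set_title("Scatter")

# Save the figure as a hardcopy (PNG) in a temporary directory.
output_path = os.path.join(tempfile.gettempdir(), "example_plots.png")
fig.savefig(output_path, dpi=100)
plt.close(fig)

print("Saved figure to", output_path)
```

In a Jupyter notebook the figure would normally be displayed inline instead of being written to disk.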

4. SciPy

SciPy is a convenient, easy-to-use Python toolkit designed for science and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ordinary differential equation solvers, and more.
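
To give a feel for three of these modules, here is a small sketch that integrates a function numerically, minimizes a function of one variable, and solves a simple ordinary differential equation. The functions are toy examples picked because their exact answers are known.

```python
import math

from scipy import integrate, optimize

# Integration: the integral of x^2 from 0 to 3 is exactly 9.
area, error_estimate = integrate.quad(lambda x: x ** 2, 0.0, 3.0)

# Optimization: (x - 2)^2 has its minimum at x = 2.
minimum = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

# ODE solver: dy/dt = -y with y(0) = 1 has the solution y(t) = exp(-t).
trajectory = integrate.solve_ivp(lambda t, y: -y, (0.0, 1.0), [1.0])
y_at_one = trajectory.y[0, -1]

print("Integral:", area)
print("Minimizer:", minimum.x)
print("y(1):", y_at_one, "exact:", math.exp(-1.0))
```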

5. scikit-learn

scikit-learn is a machine learning toolkit. It will be introduced later.

6. statsmodels

Statsmodels is a Python toolkit for statistical modeling and econometrics. It covers descriptive statistics as well as the estimation of statistical models and inference on them.

All of these libraries are included in Anaconda, so installing Anaconda installs them as well.
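
One way to check this on a given machine is to ask Python whether each library can be imported. The sketch below uses only the standard library. The mapping from display names to import names is written out by hand, and note that scikit-learn is imported under the name sklearn.

```python
from importlib import util

# Display name -> module name used in an import statement.
import_names = {
    "IPython": "IPython",
    "NumPy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "SciPy": "scipy",
    "scikit-learn": "sklearn",
    "statsmodels": "statsmodels",
}

# find_spec returns None when a top-level module cannot be found.
availability = {
    label: util.find_spec(module) is not None
    for label, module in import_names.items()
}

for label, is_installed in availability.items():
    status = "installed" if is_installed else "missing"
    print(f"{label:<14}{status}")
```

Inside an Anaconda environment every line is expected to read "installed"; in a bare Python installation some will probably be missing.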

Origin www.cnblogs.com/ITXiaoAng/p/11875787.html