Python: How to Teach Yourself Python

Python is a general-purpose programming language that has been widely used in data science over the past decade. In fact, Python is second only to R as the most popular programming language in the field of data science.

The main purpose of this article is to show how easy it is to do data science with Python. You might think that you have to become an advanced Python programmer before you can perform the complex tasks typically associated with data science, but that is not the case. Python comes with many useful libraries that provide powerful support in the background. You don't even need to know what those programs are doing, and you don't have to care. The only thing you really need to know is the specific task you want to perform, and Python makes such tasks quite simple.

So, let's start now.

Setting up a Python environment for data science

Whether the computer you're using is a Mac or a Windows machine, I suggest you download a free Python distribution that gives you easy access to as many useful modules as possible.

I have tried a number of Python distributions, and the one I recommend here is Anaconda, from Continuum Analytics. This distribution contains over 200 Python libraries. To understand the differences between Python packages, modules, and libraries, please refer to this article.

When you download Anaconda, you need to choose between the Python 2 and the Python 3 version. I strongly recommend that you use Python 2.7.12. As of the end of 2016, the vast majority of non-computer-science Python users were using this version. It handles data science tasks brilliantly, it is easier to learn than Python 3, and sites like GitHub host millions of Python 2 scripts and code snippets for your reference, so your life will be easier.

Anaconda also comes with the IPython programming environment, which I recommend you use. After installing Anaconda, just navigate to the Jupyter Notebook program and open it, and IPython will open in a web browser. The Jupyter Notebook application starts in the web browser automatically.

 



You can refer to this article to learn how to change the path of an IPython notebook.

Learning the basics

Before exploring Python's data science libraries, you first need to learn some Python basics. Python is an object-oriented programming language. In Python, objects can be assigned to variables and passed as arguments to functions. The following are all objects in Python: numbers, strings, lists, tuples, sets, dictionaries, functions, and classes.

A function in Python works essentially like an ordinary mathematical function: it receives input data, processes it, and outputs a result. The output of a function depends entirely on how it is designed. A Python class, on the other hand, is a prototype designed to produce other objects.

If your goal is to write fast, reusable, easy-to-modify Python code, then you must use functions and classes. Using functions and classes helps keep your code clean and efficient.
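To make this concrete, here is a minimal sketch (not from the original article) of a function and a class in Python:

def fahrenheit_to_celsius(temp_f):
    # a function: takes input, processes it, returns a result
    return (temp_f - 32) * 5.0 / 9.0

class Measurement(object):
    # a class: a prototype for producing measurement objects
    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

print(fahrenheit_to_celsius(98.6))  # 37.0
m = Measurement(37.0, "C")          # an object produced by the class
print(m.unit)                       # C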

Now, let's see what libraries Python offers for data science.

Scientific Computing: Numpy and Scipy

Numpy is the main Python package for working with n-dimensional array objects, and Scipy provides a collection of mathematical algorithms and sophisticated functions that extend Numpy's capabilities. The Scipy library adds specialized scientific functions to Python for specific data science tasks.

To use Numpy (or any other Python library) in Python, you must first import the corresponding library.

 


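The code screenshot that originally appeared here did not survive; based on the explanation below, it likely looked something like this (the list contents are made up):

import numpy as np

scores = [88, 92, 79, 93, 85]    # an ordinary Python list
scores_array = np.array(scores)  # convert the list into a Numpy array
print(scores_array.mean())       # Numpy arrays come with fast vectorized methods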

np.array(scores) converts a list into an array.

When you use a plain Python program, one without any external extensions such as libraries, you are limited to storing your data in one-dimensional lists. If you extend Python with the Numpy library, however, you can work with n-dimensional arrays directly. (In case you're wondering, an n-dimensional array is an array with one or more dimensions.)

Most people start by learning Numpy, because Numpy is essential for scientific computing with Python. A deep understanding of Numpy will also help you use libraries such as Pandas and Scipy effectively.

Data wrangling: Pandas

Pandas is the most widely used tool for data wrangling. It contains high-level data structures and manipulation tools designed to make data analysis faster and easier. Users of the R language for statistical computing will certainly not find the name DataFrame strange.

Pandas is one of the key factors behind Python's growth into a powerful and efficient data analysis platform.

Next, I'll show you how to use Pandas to work with a small data set.

 


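The original code screenshot is missing; here is a hedged sketch, with made-up data, of how a small DataFrame can be built in Pandas:

import pandas as pd

data = {"name": ["Alice", "Bob", "Carol"],
        "age": [25, 32, 47],
        "city": ["Boston", "Chicago", "Denver"]}
df = pd.DataFrame(data)  # build a DataFrame from a dictionary of columns
print(df)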

A DataFrame is a spreadsheet-like structure that contains an ordered collection of columns. Each column can hold a different variable type. A DataFrame has both a row index and a column index.

 


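The screenshot here is also missing; as a hedged illustration of the row and column indexes just described (with made-up data):

import pandas as pd

df = pd.DataFrame({"age": [25, 32, 47],
                   "name": ["Alice", "Bob", "Carol"]})
print(df["age"])      # select a single column by its column label
print(df.loc[0])      # select a single row by its row label
print(df.iloc[0, 0])  # select a single value by integer position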

Visualization: Matplotlib + Seaborn + Bokeh

Matplotlib is a Python module for data visualization. Matplotlib makes it easy to draw line graphs, pie charts, histograms, and other professional-looking charts.

With Matplotlib, you can customize every detail of a chart. When you use Matplotlib inside IPython, you get interactive features such as zooming and panning. Matplotlib supports different GUI backends on all operating systems, and it can export charts to several common image formats, such as PDF, SVG, JPG, PNG, BMP, and GIF.

 


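The chart image did not survive the page; here is a minimal sketch of the kind of line plot Matplotlib produces, with made-up data:

import matplotlib.pyplot as plt

x = range(10)
y = [value ** 2 for value in x]

plt.plot(x, y)                # draw a simple line graph
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("A simple Matplotlib line plot")
plt.savefig("line_plot.png")  # export the chart to a common image format
plt.show()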

Seaborn is a data visualization library built on Matplotlib for creating attractive and informative charts in Python. The main feature of Seaborn is that it can create complex chart types from Pandas data using relatively simple commands. I used Seaborn to draw a chart of this kind:

 


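The picture itself is gone; as a hedged sketch of a one-command Seaborn chart of this kind, using the tips sample dataset that Seaborn can load:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                 # load a small sample dataset
sns.lmplot(x="total_bill", y="tip", data=tips)  # scatter plot plus fitted regression line
plt.show()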

Machine Learning: Scikit-learn

Machine learning aims to teach a machine (a piece of software) to perform a task by providing it with examples of how that task should be carried out.

There are many machine learning libraries in Python, but Scikit-learn is the most popular one. Scikit-learn is built on top of the Numpy, Scipy, and Matplotlib libraries. With Scikit-learn you can implement almost every machine learning algorithm, such as regression, clustering, classification, and so on. So if you plan to learn machine learning with Python, I suggest you start with Scikit-learn.

The K-nearest neighbors algorithm can be used for classification or regression. The following code shows how to use a KNN model to make predictions on the iris data set.

 


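The two code screenshots are missing; here is a minimal, hedged sketch of fitting a KNN classifier to the iris data set with Scikit-learn:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()                         # 150 labeled iris flowers, 4 features each
knn = KNeighborsClassifier(n_neighbors=5)  # classify by the 5 nearest neighbors
knn.fit(iris.data, iris.target)            # learn from the labeled examples

# predict the species of a new, unseen flower (the measurements are made up)
print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))  # e.g. [0], i.e. setosa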

There are a number of other machine learning libraries as well.

Statistics: Statsmodels and Scipy.stats

Statsmodels and Scipy.stats are two popular statistical modules in Python. Scipy.stats is mainly used for probability distributions. Statsmodels, by contrast, provides a framework for statistical models with R-like formulas. It includes descriptive statistics, statistical tests, plotting functions, and result statistics, applicable to different types of data and to each estimator.
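For instance, here is a hedged sketch of Statsmodels' R-like formula interface, with made-up data (the column names x and y are hypothetical):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8]})
model = smf.ols("y ~ x", data=df).fit()  # ordinary least squares, R-style formula
print(model.summary())                   # descriptive statistics and test results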

The following code shows how to use the Scipy.stats module to work with a normal distribution.

 


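Both screenshots are missing; here is a hedged sketch of calling the normal distribution through Scipy.stats:

from scipy import stats

# a normal distribution with mean 0 and standard deviation 1
print(stats.norm.pdf(0, loc=0, scale=1))     # density at x = 0, about 0.399
print(stats.norm.cdf(1.96, loc=0, scale=1))  # P(X <= 1.96), about 0.975

samples = stats.norm.rvs(loc=0, scale=1, size=1000)  # draw 1000 random samples
print(samples.mean())                                # close to 0
print(samples.std())                                 # close to 1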

The normal distribution is a continuous distribution that accepts any real value as input. It is parameterized by two parameters: the mean μ and the variance σ² of the distribution.

Web scraping: Requests, Scrapy, and BeautifulSoup

Web scraping is the process of retrieving unstructured data from the web (typically in HTML format) and converting it into a structured format that is convenient for analysis.

Popular libraries for web scraping are:

  • Scrapy
  • urllib
  • BeautifulSoup
  • Requests

To scrape data from a website, you need some basic knowledge of HTML.

Here is an example of web scraping using the BeautifulSoup library:

import urllib2
import bs4

 


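# The rest of this example was a screenshot that did not survive; the lines
# below are a hedged reconstruction based on the explanation that follows.
url = "http://www.bigdataexaminer.com"
beautiful = urllib2.urlopen(url).read()          # download the page's entire HTML text
soup = bs4.BeautifulSoup(beautiful, "html.parser")  # parse the HTML so it is easy to search
print(soup.title)                                # for example, the page's <title> tag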

The line beautiful = urllib2.urlopen(url).read() goes to bigdataexaminer.com, reads the entire HTML text of the site, and stores that text in the variable beautiful.

I used urllib2 to fetch the page at http://www.bigdataexaminer.com/ ; you could also use Requests to do the same thing. There are articles that can help you understand the differences between urllib2 and Requests.

Scrapy and BeautifulSoup are similar. Back-end engineer Prasanna Venkadesh explains the difference between the two libraries on Quora:

"Scrapy is a web crawler, or rather a framework for web crawling. You give Scrapy a root URL to start crawling from, and then you can specify constraints, such as how many URLs to crawl, and so on. It is a complete framework for web crawling or scraping.

BeautifulSoup, on the other hand, is a parsing library. It also does an excellent job of fetching a page's content, and it lets you easily parse specific parts of the page. But BeautifulSoup only fetches the content of the URL you give it; it will not crawl other pages unless you manually add their URLs in some kind of loop.

In simple terms, with BeautifulSoup you can build something like Scrapy. But BeautifulSoup is a Python library, while Scrapy is a complete framework."

Conclusion

Now that you know the basics of Python and some of its tools and libraries, it's time to apply what you have learned to specific data analysis problems. You can start with structured data sets, and then move on to the more complex problems of unstructured data analysis.



Source: blog.csdn.net/guoyunfei123/article/details/82353020