How to learn Python data analysis?

The whole process of learning the basics of Python data analysis is as follows (suitable for beginners, career changers, and people with no programming background; taught directly, no extra links needed).

1. Learning

For data analysis you do not need to learn all of Python (the same is true for SQL); that is, you don't need to learn as much as a programmer. You should also know that what really drives the application is statistical knowledge: you need to understand regression before you can implement it in Python.

Most of the content can be completed with additional searching (Baidu) and self-study, and it is basically free.

The main learning frameworks include:

1. Programming-language basics (input and output, loops, etc.)

2. Use of data-analysis libraries (NumPy, pandas, Matplotlib)

3. Application of statistical theory and analysis of actual cases

In the workflow, Python is a tool. Given existing data, its main job is to import the data into the software, write logic to get analysis results, and finally visualize those results as charts and output conclusions.

To be clear: Python is just the vehicle for the logic. Learning it does not mean learning the logic itself. It is like having a sword without swordsmanship: the sword and the swordsmanship are both indispensable.

2. Basics of programming language

Programming-language basics are a compulsory course for freshmen, but students not majoring in science and engineering (especially accounting and business) may never have been exposed to them. The important thing is to learn programmatic thinking: treat the computer as a rigorous executor, tool, and worker. Express your thoughts in logical statements and let it output them through the interactive interface.

1. Input and output

The most basic operation: when you call print in the interactive interpreter, the system outputs what you pass in.

If you say a word, it will say a word; whatever you say, it will repeat.

>>>print(1) 

1    

>>> print("Hello World") 

Hello World

>>> a = 1  

>>> b = 'runoob'  

>>> print(a,b) 

1 runoob

If you have a lot to say, for example a file that stores a lot of data, then prepare an Excel, CSV, or TXT file and use a function such as read_csv or read_table (from pandas) to import it.

import pandas as pd
data = pd.read_csv('file path/result.csv', sep=',')

>>>print(data) 

Your file's contents are now in Python (loaded as a DataFrame).

2. Data types

For the computer, every input must have a type. Numbers are not just numbers: they are integers or floats; Chinese characters and English text are strings; and going one step further, there are list, dict (map), and tuple types.

Different types serve different needs and support different operations.

Summary of data types:

Numbers: int() integers, float() decimals, bool() Booleans, complex() complex numbers

String (str): includes text, symbols, letters, special characters, etc.

Single quotes, double quotes, triple single quotes, and triple double quotes all create strings

Multi-line printing : newlines when outputting elements

Single line printing : output on the same line

List : formatted with square brackets, comma-separated elements, elements can be numbers, characters, or other data types

For example: [(1, 2), 'three', 'd', {'y': 0}]

Tuple : round-bracket (parentheses) format; elements can be numbers, characters, or other data types

For example: ([1, 2], 'three', 'd', {'y': 0})

Dictionary : curly-bracket format with paired elements used for lookup, {key: value}. The key must be an immutable type, while the value can be any data type (integer, string, list, tuple, dictionary), including iterable objects. Immutable types include strings and tuples.

For example: {'a': 1, 'b': 2}

Different types have different rules: just as 1 can be added to 2 but not to 'Weihe', only compatible types can operate with each other, as the sketch below shows.
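A minimal sketch (the variable names are just examples), using type() to check types and showing what happens when incompatible types meet:

a = 1                      # int
b = 2.5                    # float
s = 'Weihe'                # str
print(type(a), type(s))    # <class 'int'> <class 'str'>
print(a + b)               # 3.5 -- different numeric types can mix
# print(a + s)             # TypeError: int and str cannot be added
print(str(a) + s)          # '1Weihe' -- convert first, then concatenate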

3. Define variables

At this stage, you can define some variables yourself (note the data type)

>>>name = 'Weihe'

>>>print(name)

>>>print('type', type(name))

>>>print('value', name)

Output result: Weihe

type <class 'str'>

value Weihe

Many variables need to be defined at work, for example a = 0 or a = [] (an empty list), used for calculation, loops, and storing various numbers and results.

4. Arithmetic operators

This is the most basic logic. By now you can already input, define, and output variables; the pipeline works. The next step is to add logic on top of it.

For example, the business side gives you a requirement: I want today's orders + yesterday's orders.

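A minimal sketch of how the arithmetic operators handle that request (the variable names and numbers are invented for illustration):

today_orders = 120
yesterday_orders = 95
print(today_orders + yesterday_orders)    # addition: 215
print(today_orders - yesterday_orders)    # subtraction: 25
print(today_orders * 2)                   # multiplication: 240
print(today_orders / yesterday_orders)    # division: 1.2631...
print(today_orders // yesterday_orders)   # floor division: 1
print(today_orders % yesterday_orders)    # remainder: 25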

5. Functions

You learned the word "function" in junior high school: input an x, get a y. The print we used first and the arithmetic operators we just covered all have functions behind them in the source code; calling a function means invoking the logic behind it.

Take Python's built-in len() function, which directly returns the length of a string. Imagine: if there were no len() function, how would you get the length of a string?

n = 0
for c in "http://www.nowcoder.com/link/pc_kol_bzwh":
    n = n + 1
print(n)    # output: 40

The essence of a function is a reusable piece of code with a specific purpose. The code is written in advance and given a readable name; later, whenever you need the same functionality, you call it by that name.

The following shows how to wrap our hand-rolled length counter into a function:

# Custom len() function
def my_len(str):
    length = 0
    for c in str:
        length = length + 1
    return length

#Call the custom my_len() function

length = my_len("http://www.nowcoder.com/link/pc_kol_bzwh")

print(length)

#Call the my_len() function again

length = my_len("http://www.nowcoder.com/link/pc_kol_bzwh")

print(length)

Implement the addition, subtraction, multiplication, and division inside the function (for example, comparing the orders of the previous two days or the previous two years), and then you only need to call the function each day.
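A minimal sketch of that idea (the function name and numbers are invented for illustration):

def order_summary(today, yesterday):
    # return the total and the day-over-day change
    return today + yesterday, today - yesterday

total, change = order_summary(120, 95)
print(total, change)    # 215 25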

6. Loops

Loops are a relatively abstract topic. Imagine the computer running the same logic over and over; each pass can increment, decrement, or perform other operations, which lets you implement things like cumulative summation.

There are two types of loop statements in Python, while loop and for loop

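A minimal while-loop sketch (the numbers are arbitrary), accumulating 1 through 5:

total = 0
i = 1
while i <= 5:
    total = total + i
    i = i + 1
print(total)    # 15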

add = "http://www.nowcoder.com/link/pc_kol_bzwh"

#for loop, traversing the add string

for ch in add:
    print(ch, end="")

The result of the operation is:

http://www.nowcoder.com/link/pc_kol_bzwh

When I was teaching myself, this was actually more painful than C. The name "ch" looked arbitrary; in C one usually writes i, j, k, which is easier to follow. In fact, ch here plays the same role as i/j/k in C: think of a scanning robot that walks through the string add, records each character it scans into ch, and then prints it out.

Loops can do many things: test different inputs (turn the inputs into a sequence and let the program traverse them), run a function over each one, and output the results (for example, the fitting error for each candidate variable, so you can directly judge which variable works best).

Everything above can be considered the basics of the programming language. There is still a lot of hands-on practice needed to consolidate these points; the most basic exercises (such as cumulative summation, the if statement, and so on) are easy to understand and should be completed by yourself.
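For instance, a minimal practice sketch combining a loop, an if statement, and cumulative summation (the numbers are arbitrary):

total = 0
for x in [3, -1, 4, -2, 5]:
    if x > 0:               # only accumulate the positive values
        total = total + x
print(total)    # 12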

From here on, we start practicing with the data-analysis packages in the Python ecosystem (which, in fact, mainly means calling functions).

3. Data analysis: Python library practice

1. Commonly used libraries

A library can be seen as a collection of functions, like a dictionary, and import library_name is like telling the computer to open that dictionary.

Three libraries are used most often; counting the broader data-science stack there are around ten.

They are:

pandas, NumPy (data cleaning, analysis, exploration, array processing); Scikit-learn, TensorFlow, Keras (machine learning), Gradio (machine-learning deployment); SciPy, Statsmodels (statistics); Matplotlib, Seaborn (visualization)

Generally speaking, learning Pandas, Numpy, and Plotly is enough

2. Numpy library

NumPy's highlight is its array-processing capability; you can think of an array as an Excel table, with data stored in each cell.

Combined with the basics above, when a piece of data comes in you need to prepare containers to put it in. The functions involved in these steps include:

(1) Array creation


For example:

import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array(['点赞', '分享', '求关注'])
print(a)

Arrays support operations (think linear algebra): arrays can be added, subtracted, multiplied, and divided with other arrays.

Element selection, basic indexing and slicing, transposition, and trigonometric functions are all easy to look up; as long as you know how to use them, any mathematical operation on arrays can be done.
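A minimal sketch of those operations (the array is an arbitrary example):

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a + 10)       # element-wise addition
print(a * a)        # element-wise multiplication
print(a[0, 1:])     # indexing and slicing: [2 3]
print(a.T)          # transpose, shape (3, 2)
print(np.sin(a))    # trigonometric function applied element-wise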

3. Pandas library

The advantages of pandas are: aligning data from various sources, built-in time-series functionality, flexible handling of missing data, and the merge/relational operations found in databases.

To put it bluntly, it is more flexible than NumPy, and sometimes only pandas can meet the need.

The learning idea is:

(1) Get familiar with the two data types, Series and DataFrame

(2) Commonly used indexing methods


(3) Indexing, selection, calculation, and filtering, learned with the same approach as NumPy

(4) Summary statistics that can be computed on a DataFrame, such as the corr and cov methods

(5) Handling missing data, including functions such as dropna / fillna / isnull / notnull (see the sketch below)
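A minimal sketch covering points (1) through (5), assuming pandas and NumPy are installed (the data is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'data1': [1.0, np.nan, 3.0, 4.0],
                   'data2': [2.0, 5.0, np.nan, 8.0]})
print(df['data1'])           # a single column is a Series
print(df.loc[0])             # index a row by label
print(df[df['data2'] > 3])   # boolean filtering
print(df.corr())             # correlation between columns
print(df.isnull())           # locate missing values
print(df.fillna(0))          # fill missing values
print(df.dropna())           # or drop the rows that contain them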

Don't memorize the functions of these two libraries by rote; it is better to remember what they can do: process arrays, operate on them, slice and index them, fill missing values, sort, and so on.

The main strength of these two libraries shows up in data preprocessing. From here you should realize you are getting closer and closer to the point where statistics is needed: once the data has been preprocessed, the analysis itself can begin (covered later).

4. Matplotlib

To be clear, 80% of charting needs can be met with Excel. Using Python is not impossible: the advantages are a higher degree of freedom and charts closer to a scientific-research style; the disadvantage is that it does not fit most work scenarios, because you cannot always write the code on the spot. Usually the data goes into PPT or Excel and is charted there directly, which is more efficient.


Without further ado, let's start

(1) Create an empty chart

figure() and add_subplot() create a chart object

import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)

From here, you can issue drawing commands, that is, fill in the data in the horizontal and vertical coordinates

(2) Draw the image

import numpy as np
from numpy.random import randn
plt.plot(randn(50).cumsum(), 'k--')
_ = ax1.hist(randn(100), bins=20, color='k', alpha=0.3)
ax2 = fig.add_subplot(2, 2, 2)    # ax2 must be created before it is used
ax2.scatter(np.arange(30), np.arange(30) + 3 * randn(30))


You need to know that all parameters of the image are controllable, including colors, markers, linetypes, axes, spacing, scales, legends, annotations, you name it.

So why is it better suited to scientific research? Because a paper may only need a few figures, while at work you may need several charts a day and you will not have time to write code that fast. That is all you really need to understand about visualization.

(3) Understand the types of graphs

Line graph:

s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
s.plot()

Bar chart:

data.plot(kind='bar', ax=axes[0], color='k', alpha=0.7)

Histogram: data.hist(bins=50)

Pie chart: plt.pie()
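Putting the fragments above together, a minimal runnable sketch (the data is made up; it assumes pandas and Matplotlib are installed):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)
s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
s.plot(ax=axes[0, 0])                                        # line graph
bars = pd.Series([3, 7, 5], index=['a', 'b', 'c'])
bars.plot(kind='bar', ax=axes[0, 1], color='k', alpha=0.7)   # bar chart
axes[1, 0].hist(np.random.randn(100), bins=50)               # histogram
axes[1, 1].pie([30, 45, 25], labels=['A', 'B', 'C'])         # pie chart
plt.show()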

Special Topic: Data Aggregation and Grouping Operations

SQL has a GROUP BY function, and Python can do it too. In fact, most basic data processing boils down to grouping: for example, orders per day per city means grouping by date and city, and the results of different groupings are the breakdown of an indicator along its dimensions.

In fact, Python's grouping capability is stronger than SQL's, but it is weaker in raw processing efficiency. How far your grouping needs go determines whether it is worth exporting the data to process it with Python.

The most basic grouping function is expressed as:

Suppose the data has two indicator columns, data1 and data2, and two dimension columns, key1 and key2.

group = df['data1'].groupby(df['key1'])

The resulting group is a GroupBy object; think of it as a prepared grouper that must be combined with an aggregation function to produce results.

For example:

group.mean()

It can also be written in one line:

mean =df['data1'].groupby([df['key1'],df['key2']]).mean()

Some commonly used aggregate functions are as follows:

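A few common ones, applied to the grouper defined above (a minimal sketch):

group.mean()       # mean of each group
group.sum()        # sum of each group
group.count()      # number of non-missing values per group
group.min()        # minimum per group
group.max()        # maximum per group
group.describe()   # several summary statistics at once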

Special Topic: Time Series

Time series in Python can be summarized as: time functions plus time-formatted data, fully describing a distribution along the time axis, with various operations on top. The main modules are datetime and calendar.

For example, the current time (after from datetime import datetime): now = datetime.now()

datetime stores time down to the microsecond; subtracting two of them gives a timedelta: delta = datetime(2022,12,14) - datetime(2022,12,1)

datetime objects can also be converted to and from strings.

For example: value = '2022-12-24'

datetime.strptime(value,'%Y-%m-%d')

You can also use parser.parse in the dateutil package to parse dates, for example:

from dateutil.parser import parse
parse('2022-12-24')

Output: datetime.datetime(2022,12,24,0,0)

Format codes for datetime:

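The most common codes (a partial list of the standard strftime/strptime directives): %Y four-digit year, %m two-digit month, %d two-digit day, %H hour (24-hour clock), %M minute, %S second. For example:

datetime(2022, 12, 24).strftime('%Y/%m/%d %H:%M')    # '2022/12/24 00:00'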

A time-series variable is really a series of timestamps plus indicator values, for example Dec 1 → 1, Dec 2 → 2.

It can be defined as follows:

dates = [datetime(2022,12,1), datetime(2022,12,2), datetime(2022,12,3), datetime(2022,12,4), datetime(2022,12,5), datetime(2022,12,6), datetime(2022,12,7), datetime(2022,12,8)]
ts = pd.Series(np.random.randn(8), index=dates)

This gives a Series whose index is the time column and whose values are the indicator column.

Python can: generate a fixed date range (pd.date_range), shift by a fixed date offset (for example nowday = datetime(2022,12,1) + 3 * Day(), using the Day offset from pandas.tseries.offsets), shift date-indexed data, and so on; all of this belongs to data preprocessing.
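A minimal sketch of those two operations (the dates are arbitrary):

import pandas as pd
from datetime import datetime
from pandas.tseries.offsets import Day

rng = pd.date_range('2022-12-01', periods=8)    # a fixed date range
print(rng)
nowday = datetime(2022, 12, 1) + 3 * Day()      # offset by three days
print(nowday)                                   # 2022-12-04 00:00:00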

That covers most of the day-to-day data-processing work Python is used for; many of the details can be filled in by writing code and searching. Finally, in actual data-analysis work, the data-processing and presentation parts above support whatever analysis you need to do (such as breaking indicators down by different groupings, or visualizing a metric to find where a drop or fluctuation happens).
