Preprocessing, analysis, and modeling for data analysis

1. Introduction to data analysis

1.1 Core idea

1. What is business data analysis?
It is the process of analyzing large amounts of collected data with appropriate methods in order to extract useful information and draw conclusions.

1.2 Course Orientation

[Figure: course orientation, image not available]

1.3 Three types of data analysis

[Figure: the three types of data analysis, image not available]

1.4 Data measurement scale

(1) Nominal measurement

  • Measures the category or attribute that something belongs to
  • Computable: counts, relative frequencies

e.g., gender, city, occupation

(2) Ordinal measurement

  • Measures differences in rank or order between things
  • Computable: counts, relative frequencies, ordering

e.g., education level, grade

(3) Interval measurement

  • Measures the distance between categories or ranks, usually in natural or physical units
  • Computable: counts, relative frequencies, ordering, addition and subtraction

e.g., temperature

(4) Ratio measurement

  • Measures the ratio between two measured values
  • Has a fixed absolute zero point: "0" means "none"
  • Computable: counts, relative frequencies, ordering, addition and subtraction, multiplication and division

e.g., age, weight

Summary:
1. Nominal and ordinal data: categorical, discrete, qualitative data
2. Interval and ratio data: numerical, continuous, quantitative data
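
The four scales can be illustrated with pandas. The following is a minimal sketch, assuming a small made-up data set; the column names and values are hypothetical and only show which operations are meaningful on each scale.

```python
import pandas as pd

# Hypothetical sample data; names and values are for illustration only.
df = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],        # nominal
    "education": ["bachelor", "master", "high school", "master"],  # ordinal
    "temperature_c": [21.5, 25.0, 19.0, 30.5],                     # interval
    "age": [23, 35, 41, 29],                                       # ratio
})

# Nominal: only counts and relative frequencies are meaningful.
city_counts = df["city"].value_counts()
city_freq = df["city"].value_counts(normalize=True)

# Ordinal: an ordered categorical also supports sorting and comparison.
levels = ["high school", "bachelor", "master"]
df["education"] = pd.Categorical(df["education"], categories=levels, ordered=True)
highest_education = df["education"].max()
at_least_bachelor = int((df["education"] >= "bachelor").sum())

# Interval: differences are meaningful, ratios are not
# (0 degrees Celsius does not mean "no temperature").
temperature_range = df["temperature_c"].max() - df["temperature_c"].min()

# Ratio: a true zero point makes ratios meaningful.
age_ratio = df["age"].max() / df["age"].min()

print(city_counts, city_freq, highest_education, at_least_bachelor,
      temperature_range, age_ratio, sep="\n")
```

Declaring the ordinal column as an ordered categorical is one possible design choice: pandas then allows comparisons and sorting on it but still refuses arithmetic, which matches the scale.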

1.5 Use different statistical charts for different data attributes

Number of variables | Variable type                      | Suggested charts
Univariate          | Discrete                           | Column chart, bar chart, pie chart, donut chart
Univariate          | Continuous                         | Histogram, line chart, box plot
Bivariate           | Discrete + discrete                | Stacked column chart
Bivariate           | Discrete (categories) + continuous | Line chart (two groups), grouped box plot
Bivariate           | Continuous + continuous            | Scatter plot
Multivariate        | Discrete + multiple continuous     | Scatter plot with multiple series
Multivariate        | Three continuous                   | Bubble chart
Multivariate        | Multiple continuous                | Radar chart, line chart of multiple time series

Plain-language explanation:

  • Univariate discrete: count the records in one attribute column for each distinct value of that attribute
  • Univariate continuous: draw a histogram to count how many values fall in each interval; draw a line chart to show how a value changes over a period [for example, the trend of income over a time series]
  • Bivariate discrete + discrete: stacked column chart [is a movie's category related to its restriction rating?]
  • Bivariate discrete + continuous: to show how much of a numerical value y is produced under each value of a categorical column x, draw a grouped box plot [for example, the effect of category on box office revenue]; use a line chart to compare two groups over time [for example, income and expenditure over a time series]

The diagram is as follows:

[Figure: chart selection by variable type, image not available]
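
Each chart type above is backed by a simple summary of the data. The following is a minimal pandas sketch, assuming a hypothetical movie data set (the column names and values are invented), that computes the numbers each chart would display.

```python
import pandas as pd

# Hypothetical movie data, loosely following the examples in the text.
movies = pd.DataFrame({
    "genre": ["action", "comedy", "action", "drama", "comedy", "action"],
    "rating": ["PG-13", "PG", "R", "R", "PG-13", "PG-13"],
    "budget": [120.0, 35.0, 90.0, 20.0, 40.0, 150.0],
    "revenue": [310.0, 80.0, 150.0, 45.0, 95.0, 420.0],
})

# Univariate discrete: counts per category (bar or pie chart).
genre_counts = movies["genre"].value_counts()

# Univariate continuous: counts per interval (histogram).
budget_bins = pd.cut(movies["budget"], bins=[0, 50, 100, 200])
budget_hist = budget_bins.value_counts().sort_index()

# Bivariate discrete + discrete: contingency table (stacked column chart).
genre_by_rating = pd.crosstab(movies["genre"], movies["rating"])

# Bivariate discrete + continuous: per-group distribution (grouped box plot).
revenue_by_genre = movies.groupby("genre")["revenue"].describe()

# Bivariate continuous + continuous: strength of the linear relationship
# that a scatter plot would show.
budget_revenue_corr = movies["budget"].corr(movies["revenue"])

print(genre_counts, budget_hist, genre_by_rating, revenue_by_genre,
      budget_revenue_corr, sep="\n\n")
```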

1.6 Application fields

Data analysis is now used across many industries, including the Internet, e-commerce, finance and insurance, online education, manufacturing, biomedicine, transportation and logistics, food delivery, energy, urban management, and sports and entertainment.

2. Data source

External sources: data purchase, data scraping, free and open source data, etc.
Internal sources: sales data, financial data, social communication data, etc.

Data source sites:
China Internet Network Information Center (CNNIC)
Analysys
National Data
National Bureau of Statistics
UCI Machine Learning Repository
Collection of open data platform websites

3. Data preprocessing

[Figure: data preprocessing overview, image not available]

4. Data preprocessing operations (continuously updated)

1. Handling NaN values in the rows or columns of a data frame with dropna:
Parameters of dropna(axis=0, how='any', thresh=None, subset=None, inplace=False):
axis: default 0 (or 'index'), which drops rows; 1 (or 'columns') drops columns
how: 'any' drops rows/columns that contain at least one NaN; 'all' drops only rows/columns in which every value is NaN
thresh: int. Keep only rows/columns that have at least this many non-NaN values; in recent pandas versions it cannot be combined with how
subset: list of labels on the other axis to check; for example, when dropping rows, the columns in which to look for NaN
inplace: default False, which returns a new object; True modifies the data frame in place and returns None
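
The parameters above can be compared side by side. The following is a minimal sketch, assuming a small made-up data frame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in different places.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [np.nan, np.nan, 6.0, 8.0, np.nan],
    "c": [9.0, np.nan, 11.0, 12.0, np.nan],
})

rows_any = df.dropna()                 # drop rows containing any NaN
rows_all = df.dropna(how="all")        # drop rows where every value is NaN
rows_thresh = df.dropna(thresh=2)      # keep rows with at least 2 non-NaN values
rows_subset = df.dropna(subset=["a"])  # only check column "a" for NaN
cols_any = df.dropna(axis=1)           # drop columns containing any NaN

print(rows_any, rows_all, rows_thresh, rows_subset, cols_any, sep="\n\n")
```

None of these calls modifies df, because inplace is left at its default of False.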

2. Split a string on a delimiter character:

each = each[0].split(',')

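3. Remove whitespace from both ends of a value: each.strip()
4. Remove unwanted characters from both ends of a value: item.append(each.strip('["{","/","}"]')). The argument is treated as a set of characters, so this call is equivalent to each.strip('[]{}/",'); characters in the middle of the string are not touched.
5. Replace specific values, such as Chinese text or placeholder strings:

# Can also be used to remove invalid characters; np is NumPy (import numpy as np)
df_1 = df_1.replace('--', np.nan)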
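
Steps 2 to 5 are often chained when cleaning a raw text field. The following is a minimal sketch, assuming an invented raw record and a "--" placeholder for missing budgets; none of the values come from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw record; the format is invented for illustration.
raw = ['{ Action, Adventure / , Sci-Fi }']

# Step 2: split the string on commas.
parts = raw[0].split(',')

# Steps 3 and 4: strip() with an argument removes any of the listed
# characters (here space, braces, and slash) from both ends only.
cleaned = [p.strip(' {}/') for p in parts]

# Step 5: replace the placeholder "--" with NaN, then convert to numbers.
df = pd.DataFrame({"genre": cleaned, "budget": ["100", "--", "250"]})
df = df.replace("--", np.nan)
df["budget"] = pd.to_numeric(df["budget"])

print(cleaned)
print(df)
```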

6. Use the .rename method to rename columns or index labels
7. Select specific rows and columns of the data frame by integer position:

df_1 = df.iloc[1:, [1, 2, 3, 4, 5, 7]]
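
Steps 6 and 7 can be tried on a small frame. The following is a minimal sketch, assuming hypothetical column names and values.

```python
import pandas as pd

# Hypothetical frame; names and values are for illustration only.
df = pd.DataFrame({
    "Film": ["A", "B", "C"],
    "Genre": ["action", "drama", "comedy"],
    "Budget ($m)": [120, 20, 35],
    "Notes": ["", "", ""],
})

# rename returns a new frame; labels missing from the mapping stay unchanged.
renamed = df.rename(columns={"Budget ($m)": "Budget"}, index={0: "first"})

# iloc selects by integer position: rows from position 1 onward,
# and the columns at positions 0 and 2.
subset = renamed.iloc[1:, [0, 2]]

print(renamed)
print(subset)
```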

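8. Replace values in one data frame with the matching values from another (here df2 is updated from df1, matched on col1):

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df2 = pd.DataFrame({'col1': [2, 3, 4], 'col2': [7, 8, 9]})

# Inner join on col1; the two col2 columns get the suffixes _df1 and _df2
merged_df = pd.merge(df1, df2, on='col1', how='inner', suffixes=('_df1', '_df2'))

# For every matching key, overwrite col2 in df2 with the value from df1
# (.loc accepts a boolean mask; .at only accepts a single label)
for _, row in merged_df.iterrows():
    df2.loc[df2['col1'] == row['col1'], 'col2'] = row['col2_df1']

print(df2)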

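9. Fill missing values with the mean of each group:

# transform returns a result aligned with the original index, which avoids
# the index mismatch that groupby(...).apply can produce in recent pandas versions
df_1['Budget'] = df_1['Budget'].fillna(df_1.groupby('Genre')['Budget'].transform('mean'))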
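
A small worked example makes the group-mean fill easier to check by hand. The following is a minimal sketch, assuming made-up Genre and Budget values, and using transform so that the group means stay aligned with the original rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing budget in each genre.
df = pd.DataFrame({
    "Genre": ["action", "action", "action", "drama", "drama"],
    "Budget": [100.0, np.nan, 140.0, 20.0, np.nan],
})

# Mean budget of each row's genre, aligned with the original index.
group_mean = df.groupby("Genre")["Budget"].transform("mean")

# Fill only the missing entries; existing values are left unchanged.
df["Budget"] = df["Budget"].fillna(group_mean)

print(df)
```

The missing action budget becomes 120.0 (the mean of 100 and 140) and the missing drama budget becomes 20.0.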

10. Check for null values (the column name "时间戳" means "timestamp"):

print(pd.isnull(data["时间戳"]).value_counts())
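
The same check can be run on every column at once. The following is a minimal sketch, assuming a hypothetical data frame; only the "时间戳" column name comes from the text, and the "value" column is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "时间戳" means "timestamp".
data = pd.DataFrame({
    "时间戳": ["2023-03-01", None, "2023-03-03", None],
    "value": [1.0, 2.0, np.nan, 4.0],
})

# True/False counts for a single column, as in the snippet above.
null_flags = pd.isnull(data["时间戳"]).value_counts()

# Number of missing entries in that column.
missing_timestamps = int(pd.isnull(data["时间戳"]).sum())

# Number of missing entries in every column.
nulls_per_column = data.isna().sum()

print(null_flags)
print(missing_timestamps)
print(nulls_per_column)
```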

5. Common data analysis models and methods

Common data analysis models:
comparative analysis, funnel analysis, retention analysis, A/B testing, user behavior path analysis, user segmentation, user persona analysis, etc.
Common data analysis methods:
descriptive statistics, hypothesis testing, reliability analysis, correlation analysis, analysis of variance, regression analysis, cluster analysis, discriminant analysis, principal component analysis, factor analysis, time series analysis, etc.

6. Data visualization

  1. Data visualization: the use of visual representations to explore, understand, and communicate data. It turns data that is invisible or hard to display into perceivable graphics, symbols, colors, and so on, which makes the data easier to recognize and conveys information effectively.
  2. The role of data visualization:
  • Information recording: abstract things and information are recorded as graphics. For example, ancient Chinese astronomers recorded their observations as star charts in order to calculate the calendar.
  • Supporting reasoning and analysis: visualization greatly reduces the effort needed to understand data and improves the efficiency of information cognition, which helps people analyze and infer useful information faster.
  • Information dissemination and collaboration
  3. Visual analysis: a dynamic, iterative process in which you quickly build different views to explore what is happening in the data and why. Visual analysis helps you explore, find answers, and build stories from your data. It also goes beyond the initial insight, because everyone who sees the visualization can ask further questions and make unexpected discoveries. In short, visual analysis is a method of exploring data visually in real time.

Main tools used: the Python libraries Matplotlib and Seaborn, two commonly used data visualization libraries.
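
As an illustration of the kind of chart these libraries produce, the following is a minimal Matplotlib sketch, assuming invented genre counts and budget/revenue figures. It uses the non-interactive Agg backend so it runs without a display; Seaborn is not used here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display is required
import matplotlib.pyplot as plt

# Hypothetical data for illustration only.
genres = ["action", "comedy", "drama"]
counts = [3, 2, 1]
budget = [120, 35, 90, 20, 40, 150]
revenue = [310, 80, 150, 45, 95, 420]

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(9, 4))

# Univariate discrete: bar chart of counts per category.
ax_bar.bar(genres, counts)
ax_bar.set_title("Films per genre")
ax_bar.set_xlabel("Genre")
ax_bar.set_ylabel("Count")

# Bivariate continuous + continuous: scatter plot.
ax_scatter.scatter(budget, revenue)
ax_scatter.set_title("Budget vs. revenue")
ax_scatter.set_xlabel("Budget")
ax_scatter.set_ylabel("Revenue")

fig.tight_layout()
```

Calling fig.savefig("charts.png") afterwards writes the figure to a file; in an interactive session, omit the matplotlib.use("Agg") line and call plt.show() instead.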

Common data visualization charts:

[Figure: common data visualization charts, image not available]

Origin blog.csdn.net/qq_54015136/article/details/129595080