Python study notes (eight)

First of all, be clear about the following:

A Python data-cleansing solution is not big data!

Big data is typically cleaned with Spark or MapReduce!

The main sources of big data are business data and web logs!

(Sqoop / Flume / NiFi / Kafka)


Data collection >> Data entry >> Data cleansing >> Data processing >> Data integration >> Data governance >> Data analysis >> Data services

  1. Understand data sources and the basic methods of data acquisition
  2. Understand the general process and methods of data cleaning
  3. Master how to implement data cleaning in Python
  4. Master how to implement data validation in Python
  5. Learn about metadata and understand its important role in big data environments
  6. Gain a basic understanding of data storage, processing, integration, analysis, and services

Data collection

Data collection is concerned with the validity of the collected data: determine which data needs to be collected, determine the fields to collect, design the data collection methods, and verify the validity of the collected data.

Data Sources

Data source | Extraction method | Purpose
Business data (RDB) | File export / Sqoop import | Data integration
Web logs | Flume / NiFi / Kafka collection (focus) |
Partner data | Data integration / service |
Social networking / public data | Data crawling | Data integration
News, bulletin boards, email / conference data | Special data extraction methods | Data integration
IoT device data | NiFi / special data extraction methods | Data integration
Other | Special data extraction methods |

Data Quality

  • Data quality is the most important concern at the data collection stage

  • Common data quality issues

    • Duplicate data
    • Missing data
    • Missing data associations
    • Invalid data
    • Incorrectly filled-in fields
    • Incorrect data formats
  • Principles for judging data quality: accurate, complete, comprehensive, valid, consistent, uniformly formatted, and free of duplicates; a short pandas check for several of these issues is sketched below
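
As a rough illustration of how several of these issues (duplicates, missing values, bad formats, invalid values) can be detected, here is a minimal pandas sketch; the DataFrame and its columns (user_id, email, age) are hypothetical.

import pandas as pd

# Hypothetical sample data used only to illustrate the quality checks above
df = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
    "age": [25, -3, 31, 40],
})

# Duplicate data: rows repeating the same identifier
print(df.duplicated(subset=["user_id"]).sum())

# Missing data: NaN/None counts per column
print(df.isna().sum())

# Incorrect data format: emails that do not match a simple pattern
valid_email = df["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+", na=False)
print(df[~valid_email])

# Invalid values: ages outside a plausible range
print(df[(df["age"] < 0) | (df["age"] > 120)])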

 

Data validation

  • Data validation verifies the validity of a data set

    • Data type checking
    • Data format checking
  • Prerequisites for data validation

    • Understand business needs
    • Understand the composition, structure, and relationships of the data
  • Methods for checking data

    • Schema / metadata / rules
    • Data validation tools, e.g. SAS
    • Writing a validation program (see the sketch after this list)
  • Several ways in which data checks can be implemented
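
As a rough sketch of a hand-written validation program, the example below checks data types and formats; the record layout and the rules (integer id, YYYY-MM-DD date, non-negative amount) are hypothetical.

import re
from datetime import datetime

# Hypothetical rules: id must be an int, date must be YYYY-MM-DD,
# amount must be a non-negative number.
def validate_record(record):
    errors = []
    # Data type checking
    if not isinstance(record.get("id"), int):
        errors.append("id must be an integer")
    # Data format checking
    try:
        datetime.strptime(str(record.get("date", "")), "%Y-%m-%d")
    except ValueError:
        errors.append("date must be in YYYY-MM-DD format")
    if not re.match(r"^\d+(\.\d+)?$", str(record.get("amount", ""))):
        errors.append("amount must be a non-negative number")
    return errors

print(validate_record({"id": 1, "date": "2019-08-15", "amount": "12.5"}))   # []
print(validate_record({"id": "1", "date": "2019/08/15", "amount": "-3"}))   # three errors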

 

Data validation tool: voluptuous

pip install voluptuous
Use a Schema to validate data validity
Use the various modes of fillna to fill NaN values
Use the interpolate() interpolator to fill NaN values, filling evenly by value according to date or time
Use dropna to drop records that contain missing values (all sketched below)
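
A minimal sketch of these points follows; the record schema, field names, and the small time-indexed Series are hypothetical examples.

import pandas as pd
from voluptuous import All, MultipleInvalid, Range, Required, Schema

# Validate data validity with a voluptuous Schema (hypothetical record layout)
schema = Schema({
    Required("id"): int,
    Required("rating"): All(float, Range(min=0, max=10)),
})
try:
    schema({"id": 1, "rating": 11.0})
except MultipleInvalid as e:
    print("validation failed:", e)

# A small time-indexed Series containing missing values
s = pd.Series([1.0, None, None, 4.0],
              index=pd.date_range("2019-08-01", periods=4, freq="D"))

# fillna in several ways: a constant, forward fill, or the mean
print(s.fillna(0))
print(s.ffill())
print(s.fillna(s.mean()))

# interpolate() fills NaN values evenly, here according to the time index
print(s.interpolate(method="time"))

# dropna removes records that contain missing values
print(s.dropna())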

 

  • Outliers are values that are legitimate but far away from most of the other data

  • Determining outliers: compute an outlier range from the standard deviation (values whose distance from the mean exceeds T standard deviations are treated as outliers), or use the frequency distribution (values that fall outside the range covering 90% of the distribution are treated as outliers)

  • Reducing the impact of outliers: Winsorizing

    • Boundary = mean + T (e.g. 1.95) * standard deviation
    • Outliers are corrected to the upper (+) or lower (-) boundary value; values inside the boundaries are left unchanged (a sketch follows this list)
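
A minimal sketch of the standard-deviation rule and Winsorizing described above; the series and the choice of T are hypothetical.

import pandas as pd

# Hypothetical data containing a couple of extreme values
s = pd.Series([10, 12, 11, 13, 12, 95, 11, -40], dtype=float)

T = 1.95
upper = s.mean() + T * s.std()
lower = s.mean() - T * s.std()

# Standard-deviation rule: values outside [lower, upper] are treated as outliers
print(s[(s < lower) | (s > upper)])

# Winsorizing: pull outliers back to the boundary values,
# leaving values inside the boundaries unchanged
print(s.clip(lower=lower, upper=upper))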

9. Integrating Python with Spark

  • Install Anaconda on Linux and configure its environment variables
  • Install Spark on Linux and configure the environment variables SPARK_HOME and SPARK_CONF_DIR
  • Then perform the following steps:
ipython
from notebook.auth import passwd
passwd()
# enter a password
# the sha1 value is printed; copy it
#rw
#sha1:0cc7d44db1b9:1ce93f146c1e0faaebf73740ca9db8ba90c7adde

cd ~
jupyter notebook --generate-config
vi ./.jupyter/jupyter_notebook_config.py
# add the following content
c.NotebookApp.allow_root = True
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = 'sha1:<paste the value copied in the previous step>'
c.NotebookApp.port = 7070

cd ~
vi /etc/profile
# add the following content
export PYSPARK_PYTHON=$ANACONDA_HOME/bin/python3
export PYSPARK_DRIVER_PYTHON=$ANACONDA_HOME/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
ipython_opts="notebook -pylab inline"

source /etc/profile

cd ~
vi .jupyter/jupyter_notebook_config.py
# add the following content
c.NotebookApp.notebook_dir = '<your working directory>'

 

 

 

Developing with Spark in a notebook

  • Run pyspark from the command line
  • Connect a browser to Jupyter on port 7070
  • This gives a Notebook + Spark environment

Spark can be used directly in pyspark. The syntax is similar to Scala, with the following main differences:

1. Anonymous functions: Scala writes them inline directly (e.g. with _), while Python uses lambda expressions.

2. The symbols used to index iterables (lists, columns): Scala uses () or [], and Python may use the opposite, as in the sketch below.
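
A small pyspark sketch of these differences (it assumes the SparkSession named spark that the pyspark shell provides):

# Anonymous function: a lambda in Python; the Scala equivalent would be
# written directly, e.g. rdd.map(_ * 2)
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2)

# Indexing an iterable: Python uses [], e.g. result[0];
# Scala collections are typically indexed with (), e.g. result(0)
result = doubled.collect()
print(result[0])  # 2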

 

Using pyspark to parse a complex field

from pyspark.sql.functions import *
from pyspark.sql.types import *
df = spark.read.option("header", "true").csv("file:///root/example/movies_metadata.csv")
# Define the schema for the movie category data field
genres = ArrayType(StructType([StructField("id", IntegerType(), False), StructField("name", StringType(), False)]))

# Pair each movie category with the original movie id
df_MovieCategory = df.withColumn("movie_category", from_json(col("genres"), genres)) \
  .select(col("id"), col("movie_category")).select(col("id"), explode(col("movie_category"))) \
  .select(col("id"), col("col.name"))
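
Note that explode() produces one output row per element of the array and, by default, names the generated struct column col, which is why the final select refers to col.name. Calling df_MovieCategory.show() (assuming the CSV exists at the path above) displays the resulting id and name pairs.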

 


Origin www.cnblogs.com/whoyoung/p/11424212.html