First, to be clear:
Data cleansing with Python is not a big data solution!
In big data, cleansing is typically done with Spark or MapReduce (MR)!
Big data sources are mainly business data and web logs!
(Sqoop/Flume/NiFi/Kafka)
Data collection >> data entry / data cleansing >> data processing / data integration >> data standardization / data analysis >> data services
- Understand data sources and the basic methods of data acquisition
- Understand the basic processes and methods of data cleaning
- Master implementing data cleaning in Python
- Master implementing data validation in Python
- Learn about metadata and understand its important role in big data environments
- Gain a basic understanding of data storage, processing, integration, analysis, and services
Data Collection
Focus on the validity of the collected data:
- Determine which data needs to be collected
- Determine which fields to collect
- Define the data collection methods
- Verify the validity of the collected data
Data Sources
Data source | Extraction method | Purpose |
---|---|---|
Business data (RDB) | File export; data import with Sqoop | Data integration |
Web logs | Flume / NiFi / Kafka for real-time collection (focus) | |
Partner data | Data integration / service | |
Social networks / public data | Data crawling | Data integration |
News, bulletin boards, email / meeting data | Special extraction methods | Data integration |
IoT device data | NiFi / special extraction methods | Data integration |
Other | Special extraction methods | |
Data Quality
- Data quality is the most important concern during the data collection stage
- Common data quality issues
  - Duplicate data
  - Missing data
  - Missing relationships between data
  - Illegal data
  - Fields filled in incorrectly
  - Incorrect data format
- Data quality principles: data should be accurate, complete, comprehensive, valid, consistently formatted, and free of duplicates
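As a rough illustration of how some of the issues listed above could be detected, here is a minimal sketch using pandas. The sample records, the column names, and the rule that a negative amount counts as illegal are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical sample records; the column names are illustrative only.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "customer": ["ann", "bob", "bob", None, "dan"],
    "amount": [10.0, 25.5, 25.5, 7.0, -3.0],
})

# Duplicate data: rows that repeat an earlier row exactly.
duplicate_rows = int(df.duplicated().sum())

# Missing data: number of missing values in each column.
missing_per_column = df.isna().sum().to_dict()

# Illegal data: a negative amount is assumed to be invalid here.
illegal_rows = int((df["amount"] < 0).sum())

print(duplicate_rows, missing_per_column, illegal_rows)
```

Counting problems before fixing them gives a quick picture of how much cleaning a data set needs.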
Data validation
- Data validation: verify that a data set is valid
  - Data type checking
  - Data format checking
- Prerequisites for data validation
  - Understand the business requirements
  - Understand the composition, structure, and relationships of the data
- Methods of validating data
  - Schema / metadata / rules
  - Data validation tools, such as SAS
  - Writing a validation program
- Data validation is carried out multiple times
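A hand-written validation program can be quite small. The sketch below performs the two checks named above (data type and data format) using only the standard library. The field names, the rules, and the `validate_record` helper are hypothetical and exist only for illustration.

```python
import re

# Hypothetical rules: field name -> (expected type, optional format pattern).
RULES = {
    "id": (int, None),
    "email": (str, re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    "created": (str, re.compile(r"^\d{4}-\d{2}-\d{2}$")),
}

def validate_record(record):
    """Return a list of error messages; an empty list means the record passed."""
    errors = []
    for field, (expected_type, pattern) in RULES.items():
        if field not in record:
            errors.append(f"{field}: missing")
            continue
        value = record[field]
        # Data type check
        if not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
            continue
        # Data format check
        if pattern is not None and not pattern.match(value):
            errors.append(f"{field}: bad format")
    return errors

good = {"id": 1, "email": "a@example.com", "created": "2020-01-31"}
bad = {"id": "1", "email": "not-an-email"}
print(validate_record(good))
print(validate_record(bad))
```

Returning a list of messages rather than raising on the first failure makes it possible to report every problem in a record at once.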
Data validation tool: voluptuous
pip install voluptuous
Use Schema to validate data
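Use the various options of fillna to fill NaN values
Use interpolate() to fill NaN values, filling in evenly spaced values according to date or time
Use dropna to delete records that contain missing values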
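The three approaches above can be sketched with pandas as follows. The dates and readings are made-up sample data. In recent pandas versions `ffill()` is the preferred spelling of forward filling, which older code wrote as `fillna(method="ffill")`.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps; dates and values are illustrative.
dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-05"])
s = pd.Series([1.0, np.nan, np.nan, 5.0], index=dates)

# fillna: replace NaN with a constant.
filled_const = s.fillna(0.0)

# Forward fill: carry the last valid value forward.
filled_ffill = s.ffill()

# interpolate: equal steps between neighbouring values, ignoring the index.
filled_linear = s.interpolate()

# interpolate by time: steps proportional to the elapsed time in the index.
filled_time = s.interpolate(method="time")

# dropna: remove the records that contain missing values.
dropped = s.dropna()

print(filled_time)
```

Time-based interpolation requires a DatetimeIndex, and it differs from the default linear method whenever the timestamps are unevenly spaced, as they are here.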
- Outliers: values that are legitimate but lie far from most of the data
- Identifying outliers
  - By standard deviation: the T value determines the outlier range, and a value is an outlier when the absolute value of its T (its distance from the mean in standard deviations) exceeds the chosen threshold
  - By frequency distribution: a value is an outlier when it falls outside the range that covers 90% of the distribution
- Correcting the effect of outliers: Winsorizing
  - Boundary: T (such as 1.95) * StandardDeviation + Mean
  - Outliers are corrected to the boundary value +1 or -1, not to the boundary itself
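The standard deviation approach can be sketched with the standard library alone. The `winsorize` helper and the sample data are hypothetical. As a simplifying assumption, this version clips outliers to the boundary itself and applies the same T on both sides of the mean, which may differ from the +1 / -1 correction described above.

```python
from statistics import mean, stdev

def winsorize(values, t=1.95):
    """Clip values lying more than t standard deviations from the mean."""
    m = mean(values)
    sd = stdev(values)  # sample standard deviation
    lower = m - t * sd
    upper = m + t * sd
    # Values inside [lower, upper] are kept; values outside are moved to the boundary.
    clipped = [min(max(v, lower), upper) for v in values]
    return clipped, lower, upper

# Hypothetical measurements with one value far from the rest.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100]
clipped, lower, upper = winsorize(data)
print(clipped[-1], upper)
```

Because the outlier itself inflates the mean and the standard deviation, the boundary ends up fairly wide. On small samples a percentile-based boundary may behave better.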
9. Integrating Python with Spark
- Install Anaconda on Linux and configure the environment variables
- Install Spark on Linux and configure the environment variables:
SPARK_HOME and SPARK_CONF_DIR
- Perform the following steps
ipython
from notebook.auth import passwd
passwd()
# Enter a password
# Copy the resulting sha1 value
#rw
#sha1:0cc7d44db1b9:1ce93f146c1e0faaebf73740ca9db8ba90c7adde
cd ~
jupyter notebook --generate-config
vi ./.jupyter/jupyter_notebook_config.py
# Add the following content
c.NotebookApp.allow_root = True
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = 'sha1:<paste the value copied in the previous step>'
c.NotebookApp.port = 7070
cd ~
vi /etc/profile
# Add the following content
export PYSPARK_PYTHON=$ANACONDA_HOME/bin/python3
export PYSPARK_DRIVER_PYTHON=$ANACONDA_HOME/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
ipython_opts="notebook -pylab inline"
source /etc/profile
cd ~
vi .jupyter/jupyter_notebook_config.py
# Add the following content
c.NotebookApp.notebook_dir='<your own working directory>'
Developing Spark applications in the notebook
- Run the command: pyspark
- Open Jupyter in a browser on port 7070
- You are now in the Notebook + Spark environment
The spark object can be used directly in pyspark. The syntax is similar to Scala, with the following main differences:
1. Anonymous functions: Scala writes them directly, while Python uses a lambda expression
2. Accessing a value in an iterable (list, column): Scala uses (), while Python uses []
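The two differences can be illustrated in plain Python, with the Scala equivalents shown in comments. The variable names are made up for this example.

```python
# Anonymous function: Scala would write `x => x * 2`; Python uses a lambda.
double = lambda x: x * 2

numbers = [1, 2, 3]

# Element access: Scala would write `numbers(0)`; Python uses square brackets.
first = numbers[0]

# Passing an anonymous function to a higher-order function,
# in the same way a lambda is passed to map on a Spark RDD.
doubled = list(map(double, numbers))

print(first, doubled)
```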
Using pyspark to parse complex fields
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
# Read the CSV file, treating the first row as the header
df = spark.read.option("header", "true").csv("file:///root/example/movies_metadata.csv")
# Define the schema for the movie category data field: an array of {id, name} structs
genres = ArrayType(StructType([StructField("id", IntegerType(), False), StructField("name", StringType(), False)]))
# Pair each movie category with the original movie id:
# parse the JSON string, explode the array into one row per category, then keep the category name
df_MovieCategory = df.withColumn("movie_category", from_json(col("genres"), genres)) \
.select(col("id"), col("movie_category")).select(col("id"), explode(col("movie_category"))) \
.select(col("id"), col("col.name"))