Machine Learning Practice 10: Combined Application of Spark Big Data Technology and Machine Learning

Hello everyone, I am Weixue AI. Today I will introduce Machine Learning Practice 10: the combined application of Spark big data technology and machine learning. Spark is a fast, general-purpose big data processing framework developed by AMPLab at UC Berkeley. It provides a distributed computing platform that can process large-scale datasets in parallel across a cluster.

Contents
1. Introduction to big data technology
2. Characteristics of Spark
3. Why use Spark
4. The difference between Spark and Pandas
5. Use Python and Spark to develop big data applications
6. Spark-based machine learning training code

1. Introduction to big data technology

Big data technology refers to a series of technologies and tools developed to process and analyze large-scale data. With the development of the Internet, the Internet of Things and various sensor technologies, we can collect more and more data. These data are usually large in scale, complex and diverse, and characterized by high growth rates. Big data technology is dedicated to solving the problem of how to efficiently store, process and analyze these massive data.

Here are a few common big data technologies:

1. Distributed storage systems: Storing large-scale data requires distributed storage systems that provide high capacity, high reliability, and high scalability. Examples include the Hadoop Distributed File System (HDFS) and distributed databases such as Apache Cassandra.

2. Distributed computing frameworks: Big data processing relies on distributed computing to process and analyze data efficiently. Hadoop MapReduce was the earliest widely used distributed computing framework, while Apache Spark is currently a popular fast, general-purpose big data processing framework (a minimal word-count sketch in PySpark follows this list).

3. Data management and governance tools: Large-scale data management and governance is a complex task. Data management tools help organize and manage data, including data collection, cleaning, transformation, and integration. Data governance tools focus on aspects such as data quality, security, and compliance.

4. Data warehouses and data lakes: A data warehouse is a system for storing and managing structured data that provides flexible query and analysis capabilities. A data lake is a central repository for storing data of many types, which can be processed and analyzed when needed.

5. Data mining and machine learning: Big data technology can be used in data mining and machine learning to help discover valuable information and patterns from large-scale data. Common tools and algorithms include Apache Hadoop, Apache Spark's machine learning library (MLlib), TensorFlow, etc.

6. Data visualization and reporting tools: Data visualization tools help transform data into charts and dashboards, making it easier to understand and analyze. Reporting tools generate reports and presentations of data analysis results.
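To illustrate the distributed computing model mentioned in item 2, here is a minimal word-count sketch in PySpark. The file name words.txt is a hypothetical placeholder; the same code runs unchanged on a single machine or on a cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()

# Read the text file as an RDD of lines (words.txt is a placeholder path)
lines = spark.sparkContext.textFile("words.txt")

# MapReduce-style pipeline: split lines into words, map each word to (word, 1), then reduce by key
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))

spark.stop()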

The application of big data technology is very extensive, covering various industries and fields. For example, in the financial field, big data technology can be used for risk management and fraud detection; in the medical field, it can be used for medical image analysis and disease prediction; in the social media field, it can be used for user behavior analysis and personalized recommendations.

2. Characteristics of Spark

1. Fast performance: Spark uses in-memory computing technology to store data in the memory of the cluster, thereby accelerating data processing. In addition, Spark uses the abstraction of the RDD (Resilient Distributed Dataset) to achieve efficient parallel computing through in-memory data sharing and data partitioning.

2. Support for multiple data processing models: Spark supports batch processing, interactive queries, stream processing, and machine learning. You can use Spark's APIs (such as DataFrame and SQL) for data processing and analysis, and you can also combine other libraries (such as MLlib and GraphX) for machine learning and graph processing (see the sketch after this list).

3. Ease of use: Spark provides easy-to-use APIs for programming languages such as Java, Scala, Python, and R, so developers can write Spark applications and scripts in the language they are most familiar with.

4. Scalability: Spark can run on clusters of various sizes, from a small laptop to a large-scale cluster. It integrates seamlessly with other big data tools and frameworks (such as Hadoop, Hive, and HBase), providing a flexible and scalable big data processing solution.
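As a small, hedged sketch of the DataFrame and SQL interfaces mentioned in item 2, the code below builds a tiny DataFrame in memory (no external file is required) and queries it with both the DataFrame API and SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkFeaturesSketch").getOrCreate()

# Build a small DataFrame directly in memory
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"])

# DataFrame API: filter rows with a condition
df.filter(df.age > 30).show()

# SQL interface: register a temporary view and query it
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()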

3. Why use Spark

Big data processing usually involves the storage, processing, and analysis of massive amounts of data. As a fast and general big data processing framework, Spark is a popular choice for several important reasons:

1. High performance: Spark uses in-memory computing to cache data in the memory of the cluster, thereby accelerating data processing. Compared with traditional disk read/write approaches, in-memory computing can significantly improve processing speed (see the caching sketch after this list). In addition, Spark uses the RDD abstraction to achieve data sharing and parallel computing, further improving performance.

2. Support for multiple data processing models: Spark supports multiple data processing models such as batch processing, interactive query, stream processing, and machine learning. This means that various types of big data processing tasks can be performed under the same framework, eliminating the need to use different tools and systems, thus simplifying the complexity of development and deployment.

3. Ease of use and flexibility: Spark provides an easy-to-use API, including interfaces for programming languages such as Java, Scala, Python, and R, so developers can write Spark applications and scripts in their own familiar language. At the same time, Spark integrates seamlessly with other big data tools and frameworks (such as Hadoop and Hive), so flexible big data processing pipelines can be built.
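Here is a minimal caching sketch illustrating the in-memory computing point in item 1; the file name sales.csv is a hypothetical placeholder used only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingSketch").getOrCreate()

# sales.csv is a placeholder file name
df = spark.read.option("header", "true").option("inferSchema", "true").csv("sales.csv")

# Cache the DataFrame in cluster memory; later actions reuse the cached copy
df.cache()

print(df.count())   # the first action materializes the cache
print(df.count())   # the second action reads from memory instead of recomputing

spark.stop()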

4. The difference between Spark and Pandas

How does reading a CSV file with Spark differ from reading it with pandas' pd.read_csv? The main differences are listed below (a side-by-side sketch follows this list):

1. Distributed computing: Spark is a distributed computing framework that can handle large-scale datasets. It processes data in parallel across multiple nodes in a cluster, thereby increasing processing speed. In contrast, pandas runs on a single machine and can be limited by available memory for large datasets.

2. Data processing capabilities: Spark provides a rich set of data processing functions, including data cleaning, transformation, and feature engineering. Through Spark's DataFrame API, you can use SQL-like syntax for operations such as filtering, aggregation, and sorting. pandas provides similar functionality, but Spark can apply these operations to datasets that exceed a single machine's memory.

3. Multi-language support: Spark supports multiple programming languages, including Scala, Java, Python, and R, so you can use the language you are most familiar with for data processing and machine learning. pandas is a Python library and can only be used from Python.

4. Scalability: Spark can be seamlessly integrated with other big data tools and frameworks (such as Hadoop, Hive, HBase, etc.), which facilitates the construction of end-to-end big data processing and machine learning pipelines. In addition, Spark also provides rich machine learning libraries (such as MLlib) and graph processing libraries (such as GraphX) to facilitate complex machine learning tasks.

5. Partitioned data storage: Spark divides data into multiple partitions and stores them on different nodes in the cluster. This partitioned storage helps improve the performance of parallel reading and processing. pandas, by contrast, loads the entire dataset into memory, which can lead to out-of-memory errors or performance degradation on large datasets.
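To make the comparison concrete, here is a hedged side-by-side sketch of reading the same CSV file with pandas and with Spark; the file name data.csv is a hypothetical placeholder:

import pandas as pd
from pyspark.sql import SparkSession

# pandas: loads the whole file into the memory of a single machine
pdf = pd.read_csv("data.csv")
print(pdf.head())

# Spark: builds a distributed DataFrame whose partitions can live on many nodes
spark = SparkSession.builder.appName("CsvComparison").getOrCreate()
sdf = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")
sdf.show(5)

spark.stop()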

5. Use Python and Spark to develop big data applications

In this article, we will explore how to develop a big data application using Python and Spark. We will load a CSV file, run a machine learning algorithm, and then make predictions on the data. For beginners, this is a good example to get started with, and experienced developers may also find new ideas and perspectives in it.

Environment preparation

Before we begin, we need to install and configure the following tools and libraries:

Python3
Apache Spark
PySpark
Jupyter Notebook

In this tutorial, we will use Jupyter Notebook as our development environment because it makes it convenient to show and explain the code (a quick installation-check sketch follows below).
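After installing PySpark (for example with pip install pyspark, assuming a local setup; installation details may differ on your system), a short sketch like the one below can be run in a notebook cell to verify that Spark starts correctly:

from pyspark.sql import SparkSession

# Start a local SparkSession and print the version to confirm PySpark is working
spark = SparkSession.builder.master("local[*]").appName("EnvCheck").getOrCreate()
print(spark.version)
spark.stop()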

Load the CSV file

First, we need to load a CSV file. In this example, we'll use a simple dataset with some simulated user data.

6. Spark-based machine learning training code

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark CSV").getOrCreate()

# Use the inferSchema option to infer column types
df = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').load('GDM.csv')

df.show()

from pyspark.sql.functions import col

# Drop rows containing null values
df = df.dropna()

# Convert data types
df = df.withColumn("AGE", col("AGE").cast("integer"))

df.show()

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Select the feature columns used for prediction
assembler = VectorAssembler(
    inputCols=["GW", "APB", "APA1", "CRE", "CHOL", "UA","ALP", "TG", "GLU"],
    outputCol="features")

# Assemble the dataset into a format suitable for machine learning
output_data = assembler.transform(df)

# Build the logistic regression model
lr = LogisticRegression(featuresCol='features', labelCol='GDM')

# Split the dataset into training and test sets
train_data, test_data = output_data.randomSplit([0.7, 0.3])


# Train the model
lr_model = lr.fit(train_data)

# Test the model
predictions = lr_model.transform(test_data)

# Show the prediction results
#predictions.show()

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Build the evaluator
evaluator = BinaryClassificationEvaluator()

predictions = predictions.withColumnRenamed("GDM", "label")
# Compute the AUC
auc = evaluator.evaluate(predictions)

print(f"AUC: {
      
      auc}")

This article walked through the basic steps of creating a big data application with Python and Spark: loading data, preprocessing it, training a model, testing the model, and evaluating it. I hope you can learn something useful from this example.

Origin blog.csdn.net/weixin_42878111/article/details/131805216