Using KNIME to Build a Spark Machine Learning Model 2: Titanic Survival Prediction

This article uses KNIME's Spark decision tree nodes to train a survival model on the Titanic training dataset, which contains characteristic attributes of the passengers, and then tests the model on the test dataset.

1. Download the training dataset and test dataset from the Kaggle website

2. Create a new Workflow in KNIME, named TitanicKNIMESpark

[Screenshot: the new TitanicKNIMESpark workflow]

3. Read the training dataset

KNIME supports reading data from a Hadoop cluster; to keep things simple, this article reads the dataset from the local file system.

Type CSV Reader in the search box of the Node Repository, find the CSV Reader node, and drag it onto the canvas.

[Screenshot: CSV Reader node on the canvas]

Double-click or right-click the CSV Reader node to open its configuration dialog and set the path of the dataset file.

[Screenshot: CSV Reader configuration dialog]

Right-click the node and click Execute, then right-click it again and click File Table to view the result.

[Screenshot: CSV Reader output table]
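For reference, the read performed by this node amounts to a one-line table load. A minimal Python sketch, assuming the training file was saved as train.csv in the working directory (pandas stands in here for KNIME's local data table):

```python
import pandas as pd

# The CSV Reader node yields a local KNIME table; pandas plays that role here.
train = pd.read_csv("train.csv")
print(train.head())
```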


4. Use Missing Value node to handle missing values

As in step 3, find the Missing Value node and drag it onto the canvas (the remaining steps in this article add nodes the same way, so this will not be repeated), and set its options as needed. Here, missing values are simply replaced with the column mean. Connect the CSV Reader node to the Missing Value node.

[Screenshot: Missing Value node configuration]

Right-click the node and click Execute, then right-click it again and click Output Table to view the result.

[Screenshot: Missing Value output table]
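Continuing the sketch from step 3, mean imputation of the numeric columns (Age in particular) can be expressed as follows; the exact settings in the KNIME dialog may differ:

```python
# Replace missing numeric values with the column mean,
# mirroring the Missing Value node configured above.
numeric_cols = train.select_dtypes("number").columns
train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].mean())
```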


5. Add the Create Spark Context node and set the Spark Context

[Screenshot: Create Spark Context node]

[Screenshot: Create Spark Context configuration]
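Under the hood, this node establishes a connection to a Spark cluster. A rough local-mode stand-in in PySpark, with the application name purely an assumption:

```python
from pyspark.sql import SparkSession

# Local stand-in for the Spark context the KNIME node would create on a cluster.
spark = (SparkSession.builder
         .appName("TitanicKNIMESpark")
         .master("local[*]")
         .getOrCreate())
```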


6. Add the Table to Spark node, which converts a KNIME data table into a Spark DataFrame/RDD. Configure the node, then connect both the Missing Value node and the Create Spark Context node to the Table to Spark node.

The default configuration is used here.
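In code, this step corresponds to handing the local table over to Spark. A sketch building on the earlier snippets; the column subset is an assumption made to keep the types simple:

```python
# Push the imputed local table into a Spark DataFrame (Table to Spark equivalent).
cols = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
train_sdf = spark.createDataFrame(train[cols])
train_sdf.printSchema()
```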


7. Add the Spark Normalizer node to convert the Survived column from a numeric type to a string type, so that it can serve as the class label for classification. Configure the node and connect the Table to Spark node to the Spark Normalizer node.

[Screenshot: Spark Normalizer configuration]

Right-click the node, click Execute, then right-click the node and click Normalized Spark DataFrame/RDD to see the results.

[Screenshot: normalized Spark DataFrame/RDD]
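The equivalent conversion in PySpark, continuing the sketch, is a single column cast:

```python
from pyspark.sql.functions import col

# Cast the label to string so it is treated as a categorical class label.
train_sdf = train_sdf.withColumn("Survived", col("Survived").cast("string"))
```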


8. Add the Spark Decision Tree Learner node, configure the decision tree algorithm parameters, and establish a connection between the Spark Normalizer node and the Spark Decision Tree Learner node.

[Screenshot: Spark Decision Tree Learner configuration]

Right-click the node, click Execute, then right-click the node and click Decision Tree Model to see the results.

[Screenshot: decision tree model view]
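KNIME's Spark Decision Tree Learner wraps Spark MLlib's decision tree. A hedged PySpark sketch of the same training step, continuing from the snippets above; the feature columns and the maxDepth value are assumptions, not values taken from the screenshots:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Index the string label on the training data only (the test set has no Survived column).
label_indexer = StringIndexer(inputCol="Survived", outputCol="label")
train_labeled = label_indexer.fit(train_sdf).transform(train_sdf)

# Encode the categorical Sex column, assemble the features, and fit the tree.
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="Sex", outputCol="SexIdx"),
    VectorAssembler(inputCols=["Pclass", "SexIdx", "Age", "SibSp", "Parch", "Fare"],
                    outputCol="features"),
    DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5),
])
model = pipeline.fit(train_labeled)
print(model.stages[-1].toDebugString)  # textual view of the learned tree
```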


9. Use the test dataset and the Spark Predictor node to test the model.

Copy the CSV Reader, Missing Value, and Table to Spark nodes and configure them as in steps 3, 4, and 6 to read and preprocess the test dataset. Add a Spark Predictor node, configure it, and connect both the newly added Table to Spark node and the Spark Decision Tree Learner node to the Spark Predictor node.

Configure the test dataset path in the CSV Reader.

[Screenshot: CSV Reader configuration for the test dataset]

Configure the prediction column in the Spark Predictor node.

[Screenshot: Spark Predictor configuration]

Right-click the node, click Execute, then right-click the node and click Labeled Data to view the results.

[Screenshot: labeled prediction results]
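The Spark Predictor node applies the learned model to new data, which in MLlib terms is a transform call. A sketch continuing from the snippets above; the test.csv path and the kept columns are assumptions:

```python
import pandas as pd

# Read and impute the test set the same way as the training set.
test = pd.read_csv("test.csv")
num_cols = test.select_dtypes("number").columns
test[num_cols] = test[num_cols].fillna(test[num_cols].mean())
test_sdf = spark.createDataFrame(
    test[["PassengerId", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]])

# Score the test data with the trained pipeline (Spark Predictor equivalent).
predictions = model.transform(test_sdf)
predictions.select("PassengerId", "prediction").show(5)
```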


10. Other nodes can be added for further processing of the results. Here, only a Spark Column Filter node is added to filter out the unneeded columns.

Add a Spark Column Filter node and configure it.

[Screenshot: Spark Column Filter configuration]

Right-click the node, click Execute, then right-click the node and click Filtered Spark DataFrame/RDD to see the results.

[Screenshot: filtered results]
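In PySpark terms, the Spark Column Filter node is a plain select. A minimal sketch keeping only the passenger identifier and the predicted label (the kept columns are an assumption):

```python
# Keep only the columns of interest (Spark Column Filter equivalent).
result = predictions.select("PassengerId", "prediction")
result.show(5)
```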

The final workflow is shown in the following figure.

[Screenshot: final workflow]
