SageMaker Basic Operations Guide

Introduction

Amazon SageMaker is a managed machine learning service from Amazon Web Services (AWS), designed to simplify and accelerate the entire machine learning development life cycle. It gives machine learning engineers and data scientists a complete set of tools for building, training, tuning, and deploying models. This article walks through a simple example to introduce the basic use of SageMaker and complete a simple machine learning task.

Create a Jupyter Notebook instance

The code in the official example calls SageMaker-specific SDKs, so it needs to run inside a SageMaker Jupyter Notebook instance. The instance is created as follows:

Enter the Amazon Console and select Amazon SageMaker.

Then, in the left sidebar, select Notebook -> Notebook instances.

Click Create Notebook Instance.

The status of the newly created notebook instance is Pending at first; after a short wait it changes to InService.

Then click Open Jupyter to enter the Jupyter Notebook instance.
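If you prefer scripting to clicking through the console, the notebook instance can also be created with boto3. The snippet below is only a sketch: the instance name, instance type, and IAM role ARN are placeholder values you would replace with your own.

import boto3

sm = boto3.client("sagemaker")

# Placeholder name, instance type, and role ARN -- replace with your own values
sm.create_notebook_instance(
    NotebookInstanceName="demo-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::111122223333:role/MySageMakerRole",
)

# The instance starts in Pending and switches to InService after a while
status = sm.describe_notebook_instance(NotebookInstanceName="demo-notebook")
print(status["NotebookInstanceStatus"])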

Reference link

Amazon's related video on creating a Jupyter notebook instance.

Training and deployment

The following is the official GitHub repository for Amazon SageMaker examples:

https://github.com/aws/amazon-sagemaker-examples

Here we choose the SageMaker customer churn prediction example and walk through its training and deployment steps.

After cloning the repository locally, the corresponding files are in the following directory:

amazon-sagemaker-examples\introduction_to_applying_machine_learning\xgboost_customer_churn

Enter the Jupyter Notebook instance, click Upload, and upload xgboost_customer_churn.ipynb to the instance.

Click the ipynb file to open it and follow the steps below.

Key code analysis

All of the example's code is in the xgboost_customer_churn.ipynb file. The first half of the example (imports, data exploration, and cleaning) is not covered in detail here; we start directly from the point where the data has been cleaned.

The data is split into training, validation, and test sets, and the training and validation sets are saved to train.csv and validation.csv:

# Shuffle the data and split it 70% / 20% / 10% into train, validation, and test sets
train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))],
)
train_data.to_csv("train.csv", header=False, index=False)
validation_data.to_csv("validation.csv", header=False, index=False)

Then upload these two files to S3:

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train/train.csv")
).upload_file("train.csv")
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "validation/validation.csv")
).upload_file("validation.csv")

Then retrieve the built-in XGBoost container image URI:

container = sagemaker.image_uris.retrieve("xgboost", sess.boto_region_name, "1.7-1")
display(container)

Then point the training and validation inputs at the two CSV files uploaded earlier:

from sagemaker.inputs import TrainingInput

s3_input_train = TrainingInput(
    s3_data="s3://{}/{}/train".format(bucket, prefix), content_type="csv"
)
s3_input_validation = TrainingInput(
    s3_data="s3://{}/{}/validation/".format(bucket, prefix), content_type="csv"
)

Then create the estimator, set the training hyperparameters, and launch the training job:

sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=sess,
)
xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    objective="binary:logistic",
    num_round=100,
)

xgb.fit({"train": s3_input_train, "validation": s3_input_validation})

After the training job completes, you can deploy the model to a real-time endpoint and get a predictor:

from sagemaker.serializers import CSVSerializer

xgb_predictor = xgb.deploy(
    initial_instance_count=1, instance_type="ml.m4.xlarge", serializer=CSVSerializer()
)

After the deployment completes, you can call the predict interface on the test set that was set aside at the beginning:

def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        predictions = "".join([predictions, xgb_predictor.predict(array).decode("utf-8")])

    return predictions.split("\n")[:-1]


predictions = predict(test_data.to_numpy()[:, 1:])
# The endpoint returns newline-separated strings; convert them to floats before rounding
predictions = np.array([float(num) for num in predictions])
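As an aside, the deployed endpoint can also be invoked from any client through the low-level boto3 runtime API, not just through the predictor object. A sketch, assuming an SDK version where the predictor exposes endpoint_name:

import boto3

runtime = boto3.client("sagemaker-runtime")

# Sketch: send a single record (without the label column) as a CSV string
payload = ",".join(str(v) for v in test_data.to_numpy()[0, 1:])
response = runtime.invoke_endpoint(
    EndpointName=xgb_predictor.endpoint_name,
    ContentType="text/csv",
    Body=payload,
)
print(response["Body"].read().decode("utf-8"))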

Finally, the predictions are compared against the actual labels:

pd.crosstab(
    index=test_data.iloc[:, 0],
    columns=np.round(predictions),
    rownames=["actual"],
    colnames=["predictions"],
)

The output is a confusion matrix of actual versus predicted churn; the diagonal entries are correct predictions and the off-diagonal entries are incorrect ones. The overall accuracy is about 94.6%.
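If you want to verify the accuracy figure yourself rather than read it off the table, a quick sketch (assuming predictions has been converted to a float array as above, and that the first column of test_data is the actual label):

# Sketch: accuracy = fraction of rounded predictions that match the actual labels
accuracy = (np.round(predictions) == test_data.iloc[:, 0].to_numpy()).mean()
print("accuracy: {:.1%}".format(accuracy))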

Attached is the official tutorial, which shows the full output of every cell. (Some of the code there may raise errors in a real environment; in practice you should still use the code from the GitHub repository.)

https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn_outputs.html
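Finally, when you are done experimenting, delete the endpoint so the hosting instance does not keep billing; the official notebook ends with a similar clean-up step:

# Delete the hosted endpoint to stop incurring charges
xgb_predictor.delete_endpoint()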

Writing this up was not easy; if you found this article helpful, a like is appreciated ღ( ´・ᴗ・` )
