Deep Learning for Medical Prognosis - Course 2, Week 4, Assignments 1-4

Assignment name: C2_W4_lecture.ipynb

Assignment location:
github --> bharathikannann/AI-for-Medicine-Specialization-deeplearning.ai --> AI for Medical Prognosis --> Week 4

One-hot encode categorical variables

First, let's look at which features are categorical.

import pandas as pd
df = pd.DataFrame({'ascites': [0,1,0,1],
                   'edema': [0.5,0,1,0.5],
                   'stage': [3,4,3,4],
                   'cholesterol': [200.5,180.2,190.5,210.3]
                  })
df


In this small sample dataset, "ascites", "edema", and "stage" are categorical variables:

  • ascites: the value is 0 or 1
  • edema: the value is 0, 0.5, or 1
  • stage: the value is 3 or 4

"cholesterol" is a continuous variable because it can take any decimal value greater than zero.

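One quick way to spot candidate categorical features is to count the distinct values in each column. The sketch below rebuilds the sample DataFrame and applies a cutoff of 3 distinct values; that cutoff (`max_levels`) is an assumption chosen for this toy data, not a general rule.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]})

# Number of distinct values in each column
n_unique = df.nunique()

# Assumption: columns with at most 3 distinct values are treated as categorical.
# The cutoff is arbitrary and only makes sense for this tiny example.
max_levels = 3
categorical_cols = [col for col in df.columns if n_unique[col] <= max_levels]
continuous_cols = [col for col in df.columns if n_unique[col] > max_levels]

print(categorical_cols)  # ['ascites', 'edema', 'stage']
print(continuous_cols)   # ['cholesterol']
```

On real data a count like this is only a hint; whether a feature is categorical still depends on what the values mean.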
Which categorical variables should be one-hot encoded?

Which of the categorical variables should be one-hot encoded (turned into a dummy variable)?

  • ascites: already 0 or 1, so there is no need for one-hot encoding.
    • We could one-hot encode ascites, but that is unnecessary when the only two possible values are 0 and 1.
    • When the value is 0 or 1, 1 means the disease is present and 0 means normal (no disease).
  • edema: Edema is swelling in any part of the body. The "edema" feature of this dataset has 3 categories, so we will one-hot encode it so that there is a feature column for each of the three possible values.
    • 0: no edema
    • 0.5: the patient has edema but is not receiving diuretics (which are used to treat edema)
    • 1: the patient has edema and is receiving diuretics (so the condition may be more severe)
  • stage: takes the values 3 and 4 in this sample. We want to one-hot encode it because these are not 0 or 1 values.
    • The "stage" of cancer is 0, 1, 2, 3, or 4.
    • Stage 0 means no cancer.
    • Stage 1 is cancer confined to a small area of the body, also known as "early cancer".
    • Stage 2 is cancer that has spread to nearby tissues.
    • Stage 3 is cancer that has spread to nearby tissues, but is more advanced than stage 2.
    • Stage 4 is cancer that has spread to distant places in the body, also known as "metastatic cancer".
    • To train the model, we could simply convert stage 3 to 0 and stage 4 to 1, but that could confuse anyone who reviews our code and data. We will one-hot encode "stage" instead. In practice we still end up with 0 for stage 3 and 1 for stage 4 (see the next section).
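As a minimal sketch of the edema encoding described above, the snippet below applies `pd.get_dummies` to the "edema" column only. The names `df_edema`, `edema_cols`, and `row_sums` are introduced here for illustration.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]})

# One-hot encode only the three-level 'edema' feature
df_edema = pd.get_dummies(data=df, columns=['edema'])

# New column names combine the original name with each category value
edema_cols = [c for c in df_edema.columns if c.startswith('edema_')]
print(edema_cols)  # ['edema_0.0', 'edema_0.5', 'edema_1.0']

# Every patient falls into exactly one edema category,
# so the three indicator columns sum to 1 in each row
row_sums = df_edema[edema_cols].sum(axis=1)
print(row_sums.tolist())  # [1, 1, 1, 1]
```

The original "edema" column is replaced by the three indicator columns, while "ascites", "stage", and "cholesterol" are left unchanged.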

Multicollinearity of one-hot features

Let's see what happens when we one-hot encode the "stage" feature.

df_stage = pd.get_dummies(data=df,
               columns=['stage']
              )
df_stage[['stage_3','stage_4']]


Did you notice the relationship between the "stage_3" and "stage_4" features?

Given that stage has only two possible values here, 3 and 4, if you know that patient 0 (row 0) has a value of 1 for stage_3, what can you say about that patient's stage_4 value?

  • When stage_3 is 1, stage_4 must be 0.
  • When stage_3 is 0, stage_4 must be 1.

This means that one of the feature columns is redundant. We should remove one of them to avoid multicollinearity (where one feature can be predicted from another).

One way is to drop the column explicitly:

df_stage_drop_first = df_stage.drop(columns='stage_3')

Alternatively, pass the drop_first parameter directly in the earlier get_dummies call:

df_stage = pd.get_dummies(data=df,
               columns=['stage'], drop_first=True,
              )
df_stage
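The redundancy between the two stage columns, and the equivalence of the two ways of removing it, can be checked directly. This is a small sketch on the same sample data; the variable names `manual`, `auto`, `is_redundant`, and `same` are introduced here for illustration.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]})

df_stage = pd.get_dummies(data=df, columns=['stage'])

# Cast to int so the check works whether get_dummies
# returns boolean or integer indicator columns
s3 = df_stage['stage_3'].astype(int)
s4 = df_stage['stage_4'].astype(int)

# stage_3 is fully determined by stage_4: stage_3 = 1 - stage_4
is_redundant = bool((s3 == 1 - s4).all())
print(is_redundant)  # True

# Dropping stage_3 by hand and passing drop_first=True give the same frame
manual = df_stage.drop(columns='stage_3')
auto = pd.get_dummies(data=df, columns=['stage'], drop_first=True)
same = manual.equals(auto)
print(same)  # True
```

Because `drop_first=True` removes the first category (stage 3 here), the remaining `stage_4` column is exactly the 0-for-stage-3, 1-for-stage-4 indicator mentioned earlier.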

The pd.get_dummies function has several parameters. The commonly used ones are listed below:

  • data: the DataFrame or Series to encode;
  • columns: the column name(s) to encode, given as a list of column names;
  • prefix: the prefix added to the generated dummy variable column names;
  • prefix_sep: the separator string placed between the prefix and the category value;
  • dummy_na: whether to add a column that indicates missing values (NaN); the default is False;
  • drop_first: whether to drop the first generated dummy column, so that k categories produce k-1 columns and multicollinearity is avoided; the default is False;
  • dtype: the data type of the generated dummy variable columns.

These parameters can be set as needed for different data analysis and machine learning tasks.
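To see several of these parameters together, here is a small sketch. The DataFrame `df_demo`, with a deliberately missing value, is hypothetical and exists only to show the effect of `dummy_na`; the prefix `ed` and the separator `-` are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, used only to illustrate dummy_na
df_demo = pd.DataFrame({'edema': [0.5, 0.0, 1.0, np.nan]})

encoded = pd.get_dummies(data=df_demo,
                         columns=['edema'],
                         prefix='ed',     # used in place of the column name in new headers
                         prefix_sep='-',  # separator between prefix and category value
                         dummy_na=True,   # adds an extra indicator column for NaN
                         dtype=int)       # integer 0/1 instead of the default dtype

# Three category columns plus one missing-value indicator column
print(list(encoded.columns))
print(encoded)
```

With `dummy_na=True`, the row whose value is missing gets a 1 in the extra indicator column instead of being all zeros.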

Hazard function

The formula of the hazard function is:

$$\lambda(t, x) = \lambda_0(t)\, e^{\theta^T X_i}$$

So we have the coefficients $\theta$ for the features in $X_i$.
If we have a new patient, we can predict their hazard function $\lambda(t, x)$.
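As a rough illustration of that prediction, the sketch below evaluates the formula with NumPy. All numbers are made up: `theta`, `x_new`, and the baseline hazard value `lambda_0_t` are assumptions for illustration, not fitted values from the course notebook.

```python
import numpy as np

# Assumption: made-up coefficients for three features, for illustration only
theta = np.array([0.5, -0.2, 0.1])

# Hypothetical new patient's feature vector, in the same feature order as theta
x_new = np.array([1.0, 0.0, 2.0])

# Assumption: the baseline hazard lambda_0(t) at some fixed time t is known
lambda_0_t = 0.01

# Risk factor e^{theta^T x}: how much the patient's features scale the baseline
risk_factor = np.exp(theta @ x_new)

# Hazard for this patient at time t: lambda_0(t) * e^{theta^T x}
hazard = lambda_0_t * risk_factor
print(risk_factor, hazard)
```

Since $\lambda_0(t)$ is shared by all patients, comparing two patients only requires their risk factors; the baseline hazard cancels in the ratio.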



I'm Tina, see you in the next blog~


Origin blog.csdn.net/u014264373/article/details/130736937