Use Python+PyTorch to predict wildfires!

One of the main obstacles facing the United Nations in achieving its Sustainable Development Goals is natural disasters, and among the disasters that cause the greatest damage are wildfires.

As a data scientist, I hope to bring positive change to society. After reading about the recent wildfires in Colorado, USA and across Australia, I began to wonder whether there is a way to use my skills to predict these disasters in advance so that protective measures can be taken.

I found a dataset on Kaggle that contains 1.88 million US wildfires. I started building a model with PyTorch Lightning. Read on to see how I used this dataset to build a model that can predict the intensity of a wildfire.

Climate change and wildfires

The United Nations Sustainable Development Goals focus on achieving specific targets by 2030, covering issues such as education, poverty, climate change and marine life. The United Nations Environment Programme is responsible for coordinating the UN's environmental activities and for assisting developing countries in formulating environmentally friendly policies and practices.

We have ten years left to achieve the goals and targets of the UN's sustainable development agenda.

A healthy environment directly affects all 17 Sustainable Development Goals set by the United Nations, so protecting the environment from natural and man-made disasters is one of the UN's main agendas. According to various UN reports, wildfires and climate change feed each other: as global temperatures rise, wildfires become more likely, and the fires in turn have a major impact on the climate.

In 2019, Australia suffered one of the worst fire seasons in its history: an extremely hot season led to fires that burned approximately 18 million hectares of land (roughly half the area of California). Rising temperatures have made forests and grasslands highly flammable. Similarly, fires in the Amazon rainforest in 2019 burned 2.24 million acres, and in 2020 wildfires in California burned 4,359,517 acres.

Wildfires release large amounts of carbon dioxide and other greenhouse gases, which in turn raise global temperatures. Fire particles are carried over long distances, causing air pollution, and when these particles settle on snow they increase the absorption of sunlight. This phenomenon is called a climate feedback loop, and it gradually worsens climate conditions.

Use PyTorch Lightning to forecast wildfires

Organizations in different industries have made a variety of efforts to use historical wildfire data, and to examine its dependence on alternative data sources such as weather and tourism, to build models that predict the occurrence and intensity of fires. In this article, I use PyTorch Lightning to build a machine learning model. Predicting fire intensity makes it possible to counter the effects of a fire by taking the right remedial measures in advance. [3]

Step 1: Connect to the data

After downloading the data from Kaggle, I loaded it into my Python environment and connected to it from a Jupyter notebook.

import sqlite3
import pandas as pd

conn = sqlite3.connect("FPA_FOD_20170508.sqlite")
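To confirm the connection works, you can list the tables in the SQLite file (a small sketch I added for convenience; the Fires table is the one queried throughout this post):

# List all tables in the database; the result should include 'Fires'
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", conn)
print(tables)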

After reading the data, I need to perform some exploratory analysis to understand the characteristics of the data and its distribution, as shown below.

Step 2: Some exploratory data analysis

Let's look at how wildfires are spread across the different US states. The query below sums the burned area by state; the first five rows of the result are shown.

df = pd.read_sql_query("SELECT SUM(FIRE_SIZE) AS SUM_FIRE_SIZE, State FROM Fires GROUP BY State;", conn)
df = df.set_index("STATE")
df[:5]

Visualizing the wildfire statistics also helps us understand the extent of wildfires in each state. We can plot the data and look at the most affected states.

The scale of wildfires in different states

df["SUM_FIRE_SIZE"].sort_values(ascending=False)[:15].plot(kind="bar")

From this we can see that the total burned area in Alaska (AK) is more than double that of any other severely affected state.

We can also analyze how the scale of wildfires has changed over the years.

The scale of wildfires in different years

df = pd.read_sql_query("SELECT SUM(FIRE_SIZE) AS SUM_FIRE_SIZE, FIRE_YEAR FROM Fires GROUP BY FIRE_YEAR;", conn)
df.set_index("FIRE_YEAR").plot.bar()

We can see a significant increase in the scale of wildfires over the past decade.

After understanding the content of the data, we filter out outliers in fire size. Outliers could instead be processed or scaled so that all the data is modeled, but for simplicity we restrict the analysis to a single range of fire sizes.

Step 3: Data processing and train-test split

The dataset is filtered according to fire size.

We only consider wildfires whose size is between 2,000 and 10,000 acres (FIRE_SIZE in this dataset is recorded in acres).

analyze_df = analyze_df[analyze_df.FIRE_SIZE > 2000]
analyze_df = analyze_df[analyze_df.FIRE_SIZE < 10000]

Prepare and split the data set.

First, we transform the numerical and categorical variables: the numerical features are scaled, while the categorical features are label encoded. Then the two feature subsets are merged.

from sklearn.preprocessing import StandardScaler, LabelEncoder

X = analyze_df.drop(columns=["FIRE_SIZE"])
y = analyze_df["FIRE_SIZE"]

# Numerical features: impute missing values with the column mean, then scale
X_numerical = X.select_dtypes(include=["float", "int"])
fill_nan = lambda col: col.fillna(col.mean())
X_numerical = X_numerical.apply(fill_nan)
sc = StandardScaler()
num_cols = X_numerical.columns
X_numerical = pd.DataFrame(sc.fit_transform(X_numerical), columns=num_cols)

# Categorical features: label encode each column
# (astype(str) guards against missing values in object columns)
X_categorical = X.select_dtypes(include=["object"])
for col in X_categorical.columns:
    le = LabelEncoder()
    X_categorical[col] = le.fit_transform(X_categorical[col].astype(str))

# Merge the two feature subsets
X_numerical.reset_index(drop=True, inplace=True)
X_categorical.reset_index(drop=True, inplace=True)

X = pd.concat([X_numerical, X_categorical], axis=1)

Train/validation/test split

Now we divide the data into 80% train+validation and 20% test. We then further split the train+validation portion into 80% training and 20% validation, giving an overall split of 64% train, 16% validation and 20% test.

from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.2, random_state=42)
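As a quick sanity check (my addition, not part of the original post), you can print the split sizes to confirm the proportions:

# Confirm the 64% / 16% / 20% proportions
for name, part in [("train", X_train), ("val", X_val), ("test", X_test)]:
    print(name, part.shape, round(len(part) / len(X), 2))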

Step 4: Use PyTorch Lightning to build the model (only a fragment is shown here; follow the link below for the complete code)

In this section, we build a simple regression neural network model with PyTorch Lightning. [2]

I started with a simple architecture that takes 17 input features. The first hidden layer has 64 neurons, the second 32, the third 8, and the final node is a single regression output. The PyTorch Lightning code is organized into parts: the model, the data loader, the optimizer, and the training/validation/test steps.

import torch
from torch import nn, optim
from torch.nn.functional import mse_loss
import pytorch_lightning as pl

l_rate = 0.01  # learning rate (assumed value; not shown in the original fragment)

# Lists to collect test predictions (assumed globals, used in test_step below)
predictions_pred = []
predictions_actual = []

class Regression(pl.LightningModule):

    ### Model ###
    def __init__(self):
        super(Regression, self).__init__()
        self.fc1 = nn.Linear(17, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 8)
        self.fc4 = nn.Linear(8, 1)
        self.dropout = torch.nn.Dropout(0.2)
        self.training_losses = []

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = self.dropout(x)
        x = torch.sigmoid(self.fc2(x))
        x = self.dropout(x)
        x = torch.sigmoid(self.fc3(x))
        x = self.dropout(x)
        x = self.fc4(x)
        return x

    ### Training ###
    # Question: What should the training step look like?
    # Define the training step
    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        train_loss = mse_loss(logits, y)
        self.training_losses.append(train_loss)
        return {'loss': train_loss}

    ### Validation ###
    # Question: What should the validation step look like?
    # Define the validation step
    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = mse_loss(logits, y)
        return {'val_loss': loss}

    # Define validation_epoch_end: average the per-batch validation losses
    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    ### Test ###
    # Question: What should the test step look like?
    # Define the test step
    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = mse_loss(logits, y)
        predictions_pred.append(logits)
        predictions_actual.append(y)
        return {'test_loss': loss, 'logits': logits}

    # Define test_epoch_end: average the per-batch test losses
    def test_epoch_end(self, outputs):
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        logs = {'test_loss': avg_loss}
        return {'avg_test_loss': avg_loss, 'log': logs, 'progress_bar': logs}

    ### The Optimizer ###
    # Question: Which optimizer should I use?
    # Define the optimizer: here we use stochastic gradient descent
    def configure_optimizers(self):
        return optim.SGD(self.parameters(), lr=l_rate)

If your data has a different shape, or you want to create a different model architecture, you only need to change the model parameters in the functions above. The optimizer used by the model can be changed in the configure_optimizers function.
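For example, here is a minimal sketch (my addition; the original post uses SGD) of swapping in the Adam optimizer by replacing configure_optimizers in the Regression class:

def configure_optimizers(self):
    # Adam adapts the learning rate per parameter and often converges faster than plain SGD
    return optim.Adam(self.parameters(), lr=l_rate)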

### Data Loader ###

from torch.utils.data import TensorDataset, DataLoader

class WildfireDataLoader(pl.LightningDataModule):
    # Question: How do you want to load the data into the model?
    # Define the data loading functions: training/validation/testing
    def train_dataloader(self):
        # Targets are reshaped to (N, 1) to match the model's output shape
        train_dataset = TensorDataset(torch.tensor(X_train.values).float(),
                                      torch.tensor(y_train.values).float().unsqueeze(1))
        return DataLoader(dataset=train_dataset, batch_size=512)

    def val_dataloader(self):
        validation_dataset = TensorDataset(torch.tensor(X_val.values).float(),
                                           torch.tensor(y_val.values).float().unsqueeze(1))
        return DataLoader(dataset=validation_dataset, batch_size=512)

    def test_dataloader(self):
        test_dataset = TensorDataset(torch.tensor(X_test.values).float(),
                                     torch.tensor(y_test.values).float().unsqueeze(1))
        return DataLoader(dataset=test_dataset, batch_size=512)

Step 5: Run the model

Here, we instantiate the data loading module, run the Lightning model, and fit it to the data.

from pytorch_lightning import Trainer

data_loader_module = WildfireDataLoader()
model = Regression()
trainer = Trainer(max_epochs=200)
trainer.fit(model, datamodule=data_loader_module)
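Once training finishes, the held-out test loss can be computed with the same data module (a small addition for completeness; trainer.test is part of the standard Lightning API):

# Runs test_step/test_epoch_end on the test_dataloader defined above
trainer.test(model, datamodule=data_loader_module)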

Step 6: The final result of the model

We found that the mean squared error loss of the model is 0.2048, which is slightly better than logistic regression (code not included in this post).

The raw fire data is heavily skewed: most data points have a fire size of less than 10 acres, while others reach 50,000 acres or more. This skew makes the regression problem difficult to solve. In addition, the dataset contains very limited variables; merging in more feature variables and other data sources (such as weather data) could improve the model's performance.
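One quick way to see this skew (a sketch I added; not from the original post) is to look at the distribution of FIRE_SIZE in the raw table:

raw = pd.read_sql_query("SELECT FIRE_SIZE FROM Fires;", conn)
print(raw["FIRE_SIZE"].describe())               # large gap between median and max
raw["FIRE_SIZE"].plot.hist(bins=50, logy=True)   # log scale exposes the long tail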

The purpose of this article is to demonstrate how to build a machine learning model with the Lightning framework, so the result does not represent the best achievable performance; rather, it proposes a data-driven method for predicting wildfires.


Origin: blog.csdn.net/weixin_43881394/article/details/112275842