Understanding How a Genetic Algorithm Works (with a Python Implementation)

Source: AnalyticsVidhya

Translation: Yan Qi, Huang Xiaotian

Recently, AnalyticsVidhya published an article titled "Introduction to Genetic Algorithm & their application in data science" by Shubham Jain. It gives a comprehensive yet concise overview of genetic algorithms in easy-to-understand language, lists their practical applications across many fields, and focuses on their use in data science. Machine Heart has compiled this article; the link to the original is at the end.

Introduction

A few days ago, I set out to solve a practical problem: predicting sales in a hypermarket. After doing some feature engineering with a few simple models, I was ranked 219th on the leaderboard.

Although the result was decent, I wanted to do better, so I started researching optimization methods that could improve the score. I found one: the genetic algorithm. After applying it to the supermarket sales problem, my score jumped to the top of the leaderboard.

That's right, I jumped from 219th to 15th with the genetic algorithm alone. I believe that after reading this article, you will also be able to apply genetic algorithms freely, and you will find that they can greatly improve results on the problems you are dealing with.

Contents

1. The origin of genetic algorithm theory

2. Inspired by biology

3. Definition of Genetic Algorithm

4. The specific steps of genetic algorithm

  • Initialization

  • Fitness function

  • Selection

  • Crossover

  • Mutation

5. Application of Genetic Algorithm

  • Feature selection

  • Implementation with the TPOT library

6. Practical application

7. Conclusion

1. The origin of genetic algorithm theory

Let's start with a famous quote from Charles Darwin:

It is not the strongest or the most intelligent that survives, but the one most adaptable to change.

You may be wondering what this quote has to do with genetic algorithms. In fact, the entire concept of genetic algorithms is built on it.

Let's explain with a basic example:

Let's start by assuming a scenario: you are the king of a country, and to save your country from disaster, you implement the following policy:

  • You select all the good people and ask them to expand the country's population by having children.

  • This process continues for several generations.

  • You will find that you now have a whole population of good people.

This example is unrealistic, but it helps convey the concept: we change the input values (here, the people) to get a better output value (here, a better country). Now that you have a general sense of the idea, and since the name suggests genetic algorithms are related to biology, let's quickly review a few biological concepts so we can connect them to the algorithm.

2. Inspired by biology

I believe you still remember the sentence: "Cells are the building blocks of all living things." It follows that every cell of an organism carries the same set of chromosomes, and a chromosome is essentially a string of DNA.

Conventionally, these chromosomes can be represented as strings of the digits 0 and 1.

A chromosome is made up of genes, which are the basic building blocks of DNA. Each gene encodes a particular trait, such as hair color or eye color. Keep these biological concepts in mind as you read on. With that covered, let's now see what a genetic algorithm actually is.

3. Definition of Genetic Algorithm

First, let's go back to the example discussed earlier and summarize what we did.

  1. First, we set the initial population size of the nation.

  2. Then, we defined a function to distinguish good people from bad people.

  3. Next, we selected the good people and let them produce offspring.

  4. Finally, these offspring replaced some of the bad people in the population, and the process repeated.

This is essentially how a genetic algorithm works: it simulates the process of evolution, to some extent.

So, to define a genetic algorithm formally, we can think of it as an optimization method that tries to find the input values for which we get the best output values or results. The way a genetic algorithm works is drawn from biology, as the following figure shows:

So now let's understand the whole process step by step.

4. The specific steps of genetic algorithm

To make the explanation easier, let's first look at the famous combinatorial optimization problem known as the "knapsack problem". If you're not familiar with it, here is my version of the explanation.

Suppose you are going into the wild for a month, but you can only carry a backpack with a weight limit of 30 kg. You have a number of candidate items, each with its own "survival points" (as given in the table below). Your goal is to maximize your total survival points within the backpack's weight limit.

4.1 Initialization

Here we use a genetic algorithm to solve this knapsack problem. The first step is to define our population. The population consists of individuals, each with its own set of chromosomes.

We know that chromosomes can be represented as binary strings. In this problem, a 1 at a given position means the corresponding item is put into the backpack, and a 0 means it is left out. (Translator's note: the author maps the knapsack problem onto chromosomes and genes; each gene position corresponds to an item in the table above. For example, if Sleeping Bag is the first item, then the first gene of a chromosome indicates whether the sleeping bag is packed.)

Now, let's consider the 4 chromosomes in the figure as our initial population.
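As a quick sketch, generating such an initial population in Python might look like this (the population size and chromosome length match the figure; both are just parameters):

```python
import random

N_GENES = 6    # one gene per item, matching the 6-bit chromosomes above
POP_SIZE = 4   # the figure's initial population has 4 chromosomes

def init_population(pop_size=POP_SIZE, n_genes=N_GENES):
    # each chromosome is a random binary string: 1 = item packed, 0 = item left out
    return [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]

population = init_population()
```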

4.2 Fitness function

Next, let's calculate the fitness scores of the first two chromosomes. For chromosome A1 [100110], we have:

Similarly, for chromosome A2 [001110], we have:

For this problem, we say that a chromosome with more survival points is fitter.

Therefore, as the figure shows, chromosome 1 is fitter than chromosome 2.
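Since the article's item table is only given as a figure, here is a sketch of the fitness function with assumed weights and survival points (chosen so that, as in the figure, A1 [100110] scores higher than A2 [001110]); the real values would come from the table:

```python
# assumed item data: weights in kg and survival points for the 6 items
WEIGHTS = [15, 3, 2, 5, 9, 20]
POINTS = [15, 7, 10, 5, 8, 17]
WEIGHT_LIMIT = 30  # the 30 kg backpack limit

def fitness(chromosome):
    weight = sum(w for gene, w in zip(chromosome, WEIGHTS) if gene)
    points = sum(p for gene, p in zip(chromosome, POINTS) if gene)
    # a pack over the weight limit is invalid, so it gets zero fitness
    return points if weight <= WEIGHT_LIMIT else 0

a1 = [1, 0, 0, 1, 1, 0]  # chromosome A1
a2 = [0, 0, 1, 1, 1, 0]  # chromosome A2
```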

4.3 Selection

Now we can select fit chromosomes from the population and let them "mate" to produce offspring. That is the general idea of the selection operation, but always picking only the fittest would make the chromosomes very similar to each other after a few generations and destroy diversity. Therefore, we generally use the "roulette wheel selection" method.

Imagine a roulette wheel divided into m parts, where m is the number of chromosomes in our population. The area each chromosome occupies on the wheel is proportional to its fitness score.

Based on the values in the image above, we create the following roulette wheel.

Now the wheel spins, and the region pointed to by the fixed pointer in the picture is chosen as the first parent. For the second parent, we do the same. Sometimes we also mark two fixed pointers on the wheel, as shown below:

In this way, we can get both parents in one spin. We call this variant the "stochastic universal sampling" method.
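A minimal sketch of roulette wheel selection with a single pointer (the chance of picking a chromosome is proportional to its fitness):

```python
import random

def roulette_select(population, fitnesses):
    # spin the wheel: pick a point on [0, total fitness) and walk the slices
    total = sum(fitnesses)
    pick = random.uniform(0, total)
    running = 0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return chromosome
    return population[-1]  # guard against floating-point edge cases

parent = roulette_select([[1, 0], [0, 1], [1, 1]], [28, 23, 12])
```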

4.4 Crossover

In the previous step, we selected the parent chromosomes that will produce offspring. In biological terms, "crossover" essentially refers to reproduction. Now let's cross over chromosomes 1 and 4 (selected in the previous step), as shown in the image below:

This is the most basic form of crossover, called "single-point crossover". Here we randomly pick a crossover point and swap the genes of the two chromosomes after that point, producing new offspring.

If you use two crossover points instead, the method is called "multi-point crossover", as shown in the figure below:
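Both variants are easy to express in code; here is a sketch of single-point crossover, with the crossover point chosen at random as described above:

```python
import random

def single_point_crossover(parent1, parent2):
    # choose a random cut point strictly inside the chromosome,
    # then swap the tails of the two parents
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

child1, child2 = single_point_crossover([1, 0, 0, 1, 1, 0], [0, 0, 1, 1, 1, 0])
```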

4.5 Mutation

Looking at this from a biological point of view, we can ask: do the offspring produced by the above process have exactly the same traits as their parents? No. As offspring develop, their genes change, making them different from their parents. This process is called "mutation". It can be defined as a random change in a chromosome, and it is mutation that maintains diversity in a population.

The following figure shows a simple example of mutation:
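In code, a mutation step of this kind can be sketched as flipping each gene independently with a small probability (the mutation rate here is an arbitrary choice):

```python
import random

def mutate(chromosome, rate=0.1):
    # flip each gene (0 -> 1 or 1 -> 0) with probability `rate`
    return [1 - gene if random.random() < rate else gene for gene in chromosome]

mutated = mutate([1, 0, 0, 1, 1, 0])
```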

After mutation we get new individuals, and one round of evolution is complete. The whole process looks like this:

After a round of selection, crossover, and mutation, we evaluate the new offspring with the fitness function. If they are fit enough, they replace the least fit chromosomes in the population. This raises a question: what criterion should we finally use to decide that the offspring have reached the best fitness level?

Generally speaking, there are the following termination conditions:

  1. After X iterations, the population no longer changes much overall.

  2. We define a number of generations for the algorithm in advance.

  3. Our fitness function reaches a pre-defined value.
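Putting the pieces together, a minimal end-to-end sketch of this loop for the knapsack example might look as follows (the item weights and points are assumed values, and a fixed generation count serves as termination condition 2):

```python
import random

WEIGHTS = [15, 3, 2, 5, 9, 20]   # assumed item weights in kg
POINTS = [15, 7, 10, 5, 8, 17]   # assumed survival points
LIMIT = 30                       # backpack weight limit

def fitness(c):
    weight = sum(w for g, w in zip(c, WEIGHTS) if g)
    return sum(p for g, p in zip(c, POINTS) if g) if weight <= LIMIT else 0

def select(pop, fits):
    # roulette wheel selection
    pick = random.uniform(0, sum(fits) or 1)
    running = 0
    for c, f in zip(pop, fits):
        running += f
        if running >= pick:
            return c
    return pop[-1]

def evolve(pop_size=20, generations=50, mutation_rate=0.05):
    # initialization: random binary chromosomes
    pop = [[random.randint(0, 1) for _ in WEIGHTS] for _ in range(pop_size)]
    for _ in range(generations):  # termination: fixed number of generations
        fits = [fitness(c) for c in pop]
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = select(pop, fits), select(pop, fits)
            cut = random.randint(1, len(p1) - 1)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # mutation: flip each gene with a small probability
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

best = evolve()
```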

Okay, now that you have a basic understanding of how genetic algorithms work, let's use one in a data science scenario.

5. Application of Genetic Algorithm

5.1 Feature selection

Think about it: whenever you enter a data science competition, how do you pick the features that matter for predicting your target variable? You often look at the feature importances from your model, manually set a threshold, and select the features whose importance exceeds it.

Is there a better way to handle this? In fact, one of the most advanced techniques for feature selection is the genetic algorithm.

Our approach to the knapsack problem carries over directly. We again start by establishing a population of "chromosomes", still binary strings, where a 1 means the model includes the feature and a 0 means it excludes it.

One difference, though, is that the fitness function needs to change. The fitness function here should be the competition's accuracy metric: the more accurate a chromosome's (i.e. feature subset's) predictions, the fitter it is.
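Before turning to TPOT, here is a sketch of what such a fitness function could look like. The dataset (scikit-learn's built-in breast cancer data) and model (logistic regression) are illustrative stand-ins, not from the original article; the point is only that a chromosome is a feature mask and its fitness is cross-validated accuracy:

```python
import random

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def fitness(mask):
    # a chromosome is a 0/1 mask over the feature columns;
    # its fitness is the cross-validated accuracy of a simple model
    columns = [i for i, gene in enumerate(mask) if gene]
    if not columns:
        return 0.0  # a model with no features is useless
    model = LogisticRegression(max_iter=5000)
    return cross_val_score(model, X[:, columns], y, cv=3).mean()

mask = [random.randint(0, 1) for _ in range(X.shape[1])]
score = fitness(mask)
```

From here, the selection, crossover, and mutation operators from the knapsack example could be reused unchanged to evolve better feature masks.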

Now you should have a rough idea of the method. Rather than implementing a solution from scratch right away, let's first apply it with the TPOT library.

5.2 Implementation with TPOT library

This part is probably what you ultimately came to this article for: the implementation. So first, a quick look at the TPOT library (Tree-based Pipeline Optimization Tool), which is built on scikit-learn. The figure below shows a basic pipeline structure.

The gray area in the figure is handled automatically by the TPOT library, and a genetic algorithm is what makes this automation possible.

We won't go into depth here; let's apply it directly. To use the TPOT library, you first need to install a few Python libraries it is built on. Install them quickly as follows:

# installing DEAP, update_checker and tqdm

pip install deap update_checker tqdm

# installing TPOT

pip install tpot

Here, I use the Big Mart Sales dataset (dataset address: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/). To prepare for the implementation, quickly download the training and test files; the Python code follows:

# import basic libraries

import numpy as np 

import pandas as pd 

import matplotlib.pyplot as plt

%matplotlib inline 

from sklearn import preprocessing

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error

## preprocessing

# the original article omits these steps: load the downloaded files
# (your file names may differ) and create the label encoder used below
train = pd.read_csv('Train.csv')

test = pd.read_csv('Test.csv')

number = preprocessing.LabelEncoder()

### mean imputations

train['Item_Weight'].fillna(train['Item_Weight'].mean(), inplace=True)

test['Item_Weight'].fillna(test['Item_Weight'].mean(), inplace=True)

### reducing fat content to only two categories

train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(['low fat', 'LF'], ['Low Fat', 'Low Fat'])

train['Item_Fat_Content'] = train['Item_Fat_Content'].replace(['reg'], ['Regular'])

test['Item_Fat_Content'] = test['Item_Fat_Content'].replace(['low fat', 'LF'], ['Low Fat', 'Low Fat'])

test['Item_Fat_Content'] = test['Item_Fat_Content'].replace(['reg'], ['Regular'])

train['Outlet_Establishment_Year'] = 2013 - train['Outlet_Establishment_Year']

test['Outlet_Establishment_Year'] = 2013 - test['Outlet_Establishment_Year']

train['Outlet_Size'].fillna('Small', inplace=True)

test['Outlet_Size'].fillna('Small', inplace=True)

train['Item_Visibility'] = np.sqrt(train['Item_Visibility'])

test['Item_Visibility'] = np.sqrt(test['Item_Visibility'])

col = ['Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Fat_Content']

test['Item_Outlet_Sales'] = 0

combi = train.append(test)

for i in col:
    combi[i] = number.fit_transform(combi[i].astype('str'))
    combi[i] = combi[i].astype('object')

train = combi[:train.shape[0]]

test = combi[train.shape[0]:]

test.drop('Item_Outlet_Sales', axis=1, inplace=True)

## removing id variables

tpot_train = train.drop(['Outlet_Identifier', 'Item_Type', 'Item_Identifier'], axis=1)

tpot_test = test.drop(['Outlet_Identifier', 'Item_Type', 'Item_Identifier'], axis=1)

target = tpot_train['Item_Outlet_Sales']

tpot_train.drop('Item_Outlet_Sales', axis=1, inplace=True)

# finally building model using tpot library

from tpot import TPOTRegressor

X_train, X_test, y_train, y_test = train_test_split(tpot_train, target, train_size=0.75, test_size=0.25)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)

tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))

tpot.export('tpot_boston_pipeline.py')

Once the run finishes, tpot_boston_pipeline.py (the file name passed to export above) will contain the Python code for the optimized pipeline. In our case, TPOT found that ExtraTreeRegressor best solves this problem.

## predicting using tpot optimised pipeline

tpot_pred = tpot.predict(tpot_test)

sub1 = pd.DataFrame(data=tpot_pred)

#sub1.index = np.arange(0, len(test)+1)

sub1 = sub1.rename(columns={0: 'Item_Outlet_Sales'})  # the prediction column's label is the integer 0

sub1['Item_Identifier'] = test['Item_Identifier']

sub1['Outlet_Identifier'] = test['Outlet_Identifier']

sub1.columns = ['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier']

sub1 = sub1[['Item_Identifier', 'Outlet_Identifier', 'Item_Outlet_Sales']]

sub1.to_csv('tpot.csv', index=False)

If you submit this csv, you'll find that what I promised at the beginning has not yet been fully delivered. So was I lying to you? Of course not. TPOT follows a simple rule: if you don't let it run long enough, it won't find the best possible pipeline for your problem.

So, increase the number of generations, grab a cup of coffee, go for a walk, and leave the rest to TPOT. You can also use this library for classification problems. For more details, refer to the documentation: http://rhiever.github.io/tpot/. Beyond competitions, there are also many scenarios in real life where genetic algorithms can be used.

6. Practical application

Genetic algorithms have many real-world applications. Here are some interesting scenarios; for reasons of space, I won't introduce them in detail.

6.1 Engineering Design

Engineering design relies heavily on computer modeling and simulation to keep the design cycle fast and economical. Genetic algorithms can optimize within this process and deliver good results.

Related resources:

  • Paper: Engineering design using genetic algorithms

  • Address: http://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=16942&context=rtd

6.2 Traffic and Shipping Routes (Travelling Salesman Problem)

This is a very well-known problem, and many shipping companies have used solutions to it to make delivery more time- and cost-efficient. Genetic algorithms are among the methods used to solve it.

6.3 Robots

Genetic algorithms are widely used in robotics. In fact, they are currently being used to create self-learning robots that can act like humans, performing tasks such as cooking and doing laundry.

Related resources:

  • Paper: Genetic Algorithms for Auto-tuning Mobile Robot Motion Control

  • Address: https://pdfs.semanticscholar.org/7c8c/faa78795bcba8e72cd56f8b8e3b95c0df20c.pdf

7. Conclusion

Hopefully this article has given you a good enough understanding of genetic algorithms, along with the ability to apply them using the TPOT library. But this knowledge is of limited value if you don't practice it yourself.

So, dear readers, please try implementing it yourself, whether in a data science competition or in everyday life.

Original link: https://www.analyticsvidhya.com/blog/2017/07/introduction-to-genetic-algorithm/

This article was compiled by Machine Heart. Please contact this official account for authorization to reprint.
