Gesture Recognition Technology Based on Computer Vision

An unknown college student, known in the community as Caigou
Original author: Jacky Li
Email : [email protected]

Time of completion: 2023.5.2
Last edited: 2023.5.2

Sign language is a form of communication primarily used by people who are hard of hearing or deaf. This gesture-based language allows people to express thoughts and ideas easily, overcoming barriers posed by hearing problems.

A major problem with this convenient form of communication is that the vast majority of people across the globe do not know the language. Like any other language, sign language takes a great deal of time and effort to learn, which discourages many people from ever picking it up.

However, an obvious solution to this problem already exists in the fields of machine learning and image detection. Predictive models that automatically classify sign language signs could be used to generate real-time captions for virtual meetings such as Zoom calls.

This would greatly increase access to such services for people with hearing impairments, since the sign-language captions would be synchronized with voice-based subtitles, creating a two-way online communication system.


Large training datasets for many sign languages are available on Kaggle, a popular data science resource. The one used in this model is called "Sign Language MNIST", a public-domain, freely available dataset containing pixel information for roughly 1,000 images of each of 24 ASL letters; J and Z are excluded because signing them requires motion.

Sign Language MNIST | Kaggle (Drop-In Replacement for MNIST for Hand Gesture Recognition Tasks): https://www.kaggle.com/datasets/datamunge/sign-language-mnist

The first step in preparing data for training is to convert and reshape all the pixel data in the dataset into images so that the algorithm can read them.

import matplotlib.pyplot as plt
import seaborn as sns
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPool2D, Flatten, Dropout, BatchNormalization
from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

train_df = pd.read_csv("sign_mnist_train.csv")
test_df = pd.read_csv("sign_mnist_test.csv")

y_train = train_df['label']
y_test = test_df['label']
del train_df['label']
del test_df['label']

from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
y_train = label_binarizer.fit_transform(y_train)
y_test = label_binarizer.transform(y_test)  # reuse the binarizer fitted on the training labels

x_train = train_df.values
x_test = test_df.values

x_train = x_train / 255
x_test = x_test / 255

x_train = x_train.reshape(-1,28,28,1)
x_test = x_test.reshape(-1,28,28,1)

The above code reshapes all of the MNIST-style pixel rows into 28x28 single-channel images so that the model can read them as image input. In addition, the LabelBinarizer converts the class labels into one-hot binary vectors, which is the format expected by the categorical cross-entropy loss used later.
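As a quick sanity check (a minimal sketch that assumes the variables defined above), printing the array shapes confirms that the reshape and one-hot encoding worked as expected:

# Assumes the preprocessing code above has been run.
# Images should be (num_samples, 28, 28, 1); labels should be one-hot vectors of length 24.
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)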

The next step is to create a data generator that randomly applies transformations to the data, increasing the effective number of training examples and making the model more robust by showing it rotated, shifted, and zoomed versions of the same images.

datagen = ImageDataGenerator(
        featurewise_center=False,
        samplewise_center=False, 
        featurewise_std_normalization=False,
        samplewise_std_normalization=False,
        zca_whitening=False,
        rotation_range=10,
        zoom_range = 0.1, 
        width_shift_range=0.1,
        height_shift_range=0.1,
        horizontal_flip=False,
        vertical_flip=False)

datagen.fit(x_train)
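To see what the augmentation actually does, a few augmented samples can be drawn from the generator and displayed. This is only a sketch, assuming the datagen, x_train, and y_train defined above:

# Display a few augmented training images (labels omitted for brevity).
augmented, _ = next(datagen.flow(x_train, y_train, batch_size=9))
fig, axes = plt.subplots(3, 3)
for img, ax in zip(augmented, axes.ravel()):
    ax.imshow(img.reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()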

After processing the images, the CNN model has to be built to recognize all of the categories used in the data, i.e. the 24 different letter classes. Batch normalization layers are also added to stabilize and speed up training.

model = Sequential()
model.add(Conv2D(75 , (3,3) , strides = 1 , padding = 'same' , activation = 'relu' , input_shape = (28,28,1)))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))
model.add(Conv2D(50 , (3,3) , strides = 1 , padding = 'same' , activation = 'relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))
model.add(Conv2D(25 , (3,3) , strides = 1 , padding = 'same' , activation = 'relu'))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2) , strides = 2 , padding = 'same'))
model.add(Flatten())
model.add(Dense(units = 512 , activation = 'relu'))
model.add(Dropout(0.3))
model.add(Dense(units = 24 , activation = 'softmax'))

Note that the network is built by stacking layers (Conv2D, pooling, and so on) and ends in a 24-unit softmax layer, one unit per letter class. Batch normalization also helps the CNN process the data more efficiently during training.

Finally, define the loss function and metrics, and fit the model to the data:

model.compile(optimizer = 'adam' , loss = 'categorical_crossentropy' , metrics = ['accuracy'])
model.summary()

history = model.fit(datagen.flow(x_train,y_train, batch_size = 128) ,epochs = 20 , validation_data = (x_test, y_test))

model.save('smnist.h5')

This code has a lot to unpack. Let's break it down into sections.

The model.compile call:

model.compile accepts many parameters, three of which are shown here. The optimizer (Adam) and the loss function (categorical cross-entropy) work together with the number of epochs set in the fit call below: over successive epochs, the optimizer incrementally adjusts the model's weights to reduce the loss measured on the training data.

Besides that, accuracy is tracked as a metric, which lets us monitor how close the model gets to its best achievable accuracy over the set number of epochs.

The model.fit call:

This call fits the designed model to the augmented image data produced by the generator defined in the first block of code. It also sets the number of epochs, i.e. how many complete passes over the training data the model makes while improving its image-detection accuracy. The validation set is passed in here as well to introduce the testing aspect: after each epoch the model evaluates its accuracy on this held-out data.

The model.save call:

Of all the statements in this block, model.save is arguably the most important, because it stores the trained weights on disk and saves hours of retraining whenever the model is used later.

The developed model accurately detects and classifies sign language symbols with about 95% training accuracy.
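Because matplotlib is already imported in the training script, one way to check this figure is to plot the accuracy curves recorded in the history object returned by model.fit. This is only a sketch that assumes the training code above has just been run:

# Plot training vs. validation accuracy from the history object returned by model.fit.
plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()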


Now, using two popular real-time video processing libraries, Mediapipe and OpenCV, we can take webcam input and run our previously developed model on the live video stream.

First, we need to import the packages required by the program.

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
import cv2
import mediapipe as mp
from keras.models import load_model
import numpy as np
import pandas as pd  # needed later to build the 784-column pixel DataFrame
import time

 

The os.environ line at the start simply prevents the TensorFlow library, which backs the Keras model, from emitting unnecessary log messages. This keeps the program's later output clearer and easier to understand.

Before we start the main while loop of the code, we need to first define some variables, such as the saved model and information on the OpenCV camera.

model = load_model('smnist.h5')

mphands = mp.solutions.hands
hands = mphands.Hands()
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
_, frame = cap.read()
h, w, c = frame.shape

analysisframe = ''
letterpred = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y']

Each variable set here falls into one of four categories. The model loaded at the beginning is the one we trained in the first part of this article.

The second and third sections of code define the variables needed to start MediaPipe and OpenCV. The final category holds the frame to be analyzed and the list of letter classes that is later cross-referenced against the data returned by the image model.

The next part of the program is the main while True loop where most of the program runs.

while True:
    _, frame = cap.read()

    k = cv2.waitKey(1)
    if k%256 == 27:
        # ESC pressed
        print("Escape hit, closing...")
        break

    framergb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    result = hands.process(framergb)
    hand_landmarks = result.multi_hand_landmarks
    if hand_landmarks:
        for handLMs in hand_landmarks:
            x_max = 0
            y_max = 0
            x_min = w
            y_min = h
            for lm in handLMs.landmark:
                x, y = int(lm.x * w), int(lm.y * h)
                if x > x_max:
                    x_max = x
                if x < x_min:
                    x_min = x
                if y > y_max:
                    y_max = y
                if y < y_min:
                    y_min = y
            y_min -= 20
            y_max += 20
            x_min -= 20
            x_max += 20
            cv2.rectangle(frame, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
            mp_drawing.draw_landmarks(frame, handLMs, mphands.HAND_CONNECTIONS)
    cv2.imshow("Frame", frame)

cap.release()
cv2.destroyAllWindows()

This part of the program takes input from your camera and uses OpenCV to display the device's video feed in a new window. On top of that, the MediaPipe library detects the main landmarks of the hand, such as the fingers and palm, which we use to draw a bounding box around the hand.

The concept of a bounding box is a key component of virtually all forms of image classification and analysis. The box lets the model focus directly on the part of the image that matters for the task. Without it, the algorithm may pick up patterns in irrelevant parts of the frame, which can lead to wrong results.

For example, during training, missing bounding boxes may cause the model to associate irrelevant objects in images, such as clocks or chairs, with the labels. The program might then notice a clock in the frame and decide which sign language character to output based solely on the fact that the clock is present.
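To make the bounding-box logic easier to reuse, the landmark loop above can be factored into a small helper. This is only a sketch of the same computation, not part of the original script; the 20-pixel margin and the frame width w and height h come from the code above, and the result is clamped to the frame so the crop indices stay valid:

def hand_bounding_box(hand_landmarks, w, h, margin=20):
    # Return (x_min, y_min, x_max, y_max) in pixels for one MediaPipe hand,
    # expanded by `margin` on every side and clamped to the frame.
    xs = [int(lm.x * w) for lm in hand_landmarks.landmark]
    ys = [int(lm.y * h) for lm in hand_landmarks.landmark]
    return (max(min(xs) - margin, 0), max(min(ys) - margin, 0),
            min(max(xs) + margin, w), min(max(ys) + margin, h))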

Almost finished! The penultimate part of the program captures a single frame when prompted and crops it to the dimensions of the bounding box.

while True:
    _, frame = cap.read()
    
    k = cv2.waitKey(1)
    if k%256 == 27:
        # ESC pressed
        print("Escape hit, closing...")
        break
    elif k%256 == 32:
        # SPACE pressed
        analysisframe = frame
        showframe = analysisframe
        cv2.imshow("Frame", showframe)
        framergbanalysis = cv2.cvtColor(analysisframe, cv2.COLOR_BGR2RGB)
        resultanalysis = hands.process(framergbanalysis)
        hand_landmarksanalysis = resultanalysis.multi_hand_landmarks
        if hand_landmarksanalysis:
            for handLMsanalysis in hand_landmarksanalysis:
                x_max = 0
                y_max = 0
                x_min = w
                y_min = h
                for lmanalysis in handLMsanalysis.landmark:
                    x, y = int(lmanalysis.x * w), int(lmanalysis.y * h)
                    if x > x_max:
                        x_max = x
                    if x < x_min:
                        x_min = x
                    if y > y_max:
                        y_max = y
                    if y < y_min:
                        y_min = y
                y_min -= 20
                y_max += 20
                x_min -= 20
                x_max += 20 

        analysisframe = cv2.cvtColor(analysisframe, cv2.COLOR_BGR2GRAY)
        analysisframe = analysisframe[y_min:y_max, x_min:x_max]
        analysisframe = cv2.resize(analysisframe,(28,28))


        nlist = []
        rows,cols = analysisframe.shape
        for i in range(rows):
            for j in range(cols):
                k = analysisframe[i,j]
                nlist.append(k)
        
        datan = pd.DataFrame(nlist).T
        colname = []
        for val in range(784):
            colname.append(val)
        datan.columns = colname

        pixeldata = datan.values
        pixeldata = pixeldata / 255
        pixeldata = pixeldata.reshape(-1,28,28,1)

This code looks very similar to the last part of the program. This is mainly because the process involved in generating bounding boxes is the same in both parts.

However, in this analysis part of the code, we use OpenCV's resize function to scale the cropped bounding-box region to the 28x28 dimensions the model expects, instead of just drawing a visual box around it.

In addition, we use NumPy and OpenCV to convert the image to grayscale and normalize it so that it has the same characteristics as the images the model was trained on.

We also use pandas to create a dataframe from the pixel data of the saved frame, so we can normalize the data in the same way we did when creating the model.
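The crop, resize, and normalize steps can also be written more compactly with NumPy alone. The pandas DataFrame round trip above mirrors the 784-column CSV format of the training data, but the following sketch (a hypothetical helper, assuming the bounding-box variables from the loop above) produces the same (1, 28, 28, 1) array:

def preprocess_hand(frame_bgr, x_min, y_min, x_max, y_max):
    # Crop the hand region, convert to grayscale, resize to 28x28 and
    # scale to [0, 1], matching the format the model was trained on.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    crop = gray[y_min:y_max, x_min:x_max]
    crop = cv2.resize(crop, (28, 28))
    return (crop / 255.0).reshape(1, 28, 28, 1)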

Finally, we need to run the trained model on the processed images and process the information output.

prediction = model.predict(pixeldata)
predarray = np.array(prediction[0])
letter_prediction_dict = {letterpred[i]: predarray[i] for i in range(len(letterpred))}
predarrayordered = sorted(predarray, reverse=True)
high1 = predarrayordered[0]
high2 = predarrayordered[1]
high3 = predarrayordered[2]
for key,value in letter_prediction_dict.items():
    if value==high1:
        print("Predicted Character 1: ", key)
        print('Confidence 1: ', 100*value)
    elif value==high2:
        print("Predicted Character 2: ", key)
        print('Confidence 2: ', 100*value)
    elif value==high3:
        print("Predicted Character 3: ", key)
        print('Confidence 3: ', 100*value)
time.sleep(5)

 

There is a lot of information packed into this part of the code, so let's dissect it piece by piece.

The first two lines compute the predicted probability that the hand image belongs to each of the 24 classes. model.predict returns a two-dimensional array with one row of probabilities per input image; since we pass in a single frame, prediction[0] is a one-dimensional probability vector. Wrapping it in a NumPy array lets us parse the information with NumPy's array operations in a more Pythonic way.

From here, we use the previously created list of classes stored in letterpred to build a dictionary that maps each letter (the key) to its predicted probability (the value).

After this step, we use the built-in sorted function to order the probabilities from highest to lowest. This way, we can take the first few items in the sorted list and identify the three characters closest to the sign language image shown.

Finally, we iterate through all the key:value pairs in the dictionary with a for loop, match the three highest values with their corresponding keys, and print the probability for each character.
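An equivalent, slightly more compact way to get the top three predictions is NumPy's argsort, which avoids matching floating-point values back through the dictionary. This is a sketch assuming predarray and letterpred from the code above:

# Indices of the three highest probabilities, best first.
top3 = np.argsort(predarray)[::-1][:3]
for rank, idx in enumerate(top3, start=1):
    print("Predicted Character", rank, ":", letterpred[idx])
    print("Confidence", rank, ":", 100 * predarray[idx])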

As shown, the model accurately predicts the character shown to the camera. In addition to the predicted characters, the program also reports the confidence of the Keras CNN's classification.


The developed model can be applied in various ways, the main use case being a captioning tool for video calls such as FaceTime. To create such an application, the model would have to run frame by frame, predicting the signs being displayed.
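A rough sketch of that idea, reusing the hand_bounding_box and preprocess_hand helpers sketched earlier (both illustrative, not part of the original script), is to classify every Nth frame inside the main capture loop instead of waiting for a key press:

# Illustrative only: classify roughly every 15th frame instead of on a key press.
frame_count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_count += 1
    if frame_count % 15 == 0:
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            x_min, y_min, x_max, y_max = hand_bounding_box(
                result.multi_hand_landmarks[0], w, h)
            pixeldata = preprocess_hand(frame, x_min, y_min, x_max, y_max)
            prediction = model.predict(pixeldata)
            best = int(np.argmax(prediction[0]))
            print(letterpred[best], float(prediction[0][best]))
    cv2.imshow("Frame", frame)
    if cv2.waitKey(1) % 256 == 27:  # ESC to quit
        break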

The program enables simple, easy communication from sign language to English using the Keras image-classification model.

The author has something to say

If you need the code, please message the blogger privately; the blogger will reply when they see it.
If you found what the blogger wrote useful, please click to show your support, and posts on similar topics will continue to be updated...

Original post: blog.csdn.net/weixin_62075168/article/details/130461699