[CCF BDCI 2023] Speaker recognition in multi-modal multi-party dialogue scenarios: Baseline 0.71, Solver part

Overview

Technology is advancing quickly, and the development of Artificial Intelligence is rapidly changing the way we live and work, especially in fields such as Natural Language Processing and Computer Vision.

Traditional multimodal dialogue research mainly focuses on the interaction between a single user and the system and ignores the complexity of multi-user scenarios. Visual information is often marginalized and used only as auxiliary information rather than as a core part of the dialogue. In practical applications, the algorithm needs to "observe" and interact with multiple users, any of whom may or may not be the current speaker.

The [CCF BDCI 2023] task is speaker recognition in multi-modal multi-party dialogue scenarios. The core idea is to identify the speaker of each dialogue turn, given the text of several consecutive turns, the video frame corresponding to each turn, and the face bounding boxes (bbox) with name labels in those frames. The output is one speaker per turn.

The Baseline code is divided into three files: convex_optimization.py, dialog_roberta-constrasive.py, and finetune_cnn-multiturn.py. Below I explain the Solver part in detail.

The role of Solver in multimodal speaker recognition

The role of the Solver in multi-modal speaker recognition is to integrate and analyze data from different modalities, such as visual information in video frames and textual information in scripts. The Solver's task is to extract useful information from these complex data and ultimately identify the speaker of each sentence in the conversation.

The Importance of Solver in Multimodal Speaker Recognition

Three main points of the solver:

  1. Integrate multi-modal data: The Solver processes data from different sources, in this case the outputs of the CNN and the NLP model, which provide different perspectives for recognition.
  2. Optimize the decision-making process: By applying mathematical and statistical methods, the Solver finds the optimal assignment among the candidate speakers.
  3. Improve accuracy: Compared with single-modality analysis (CNN, NLP, or audio alone), the Solver considers several sources of information together to improve speaker recognition accuracy (loosely comparable to ensemble methods such as Bagging).

How Solver works

The Solver uses numerical optimization techniques, in this case Quadratic Programming, to solve the speaker identification problem. First, the outputs of the two models are converted into the parameters of a mathematical problem. Then an optimization algorithm finds the values of the decision variables that optimize the objective, and these values determine the speaker.

Quadratic programming

First of all, I would like to state that I am a novice who has only graduated from primary school, and my mathematics is still at the level of 2x + 1 = 5. The following is based on online material plus my own superficial understanding. If there are any mistakes, please correct me.

Quadratic Programming (QP) is a special type of mathematical optimization problem involving a quadratic objective function and a set of linear constraints.

The basic form of quadratic programming

The quadratic programming problem can be formalized as the following mathematical model:

  • Objective function: $f(x) = \frac{1}{2} x^T Q x + c^T x$
  • Constraints: $Ax \leq b$

Here, $x$ is the vector of variables to be optimized, $Q$ is a symmetric matrix, $c$ is a vector, $A$ is a matrix, and $b$ is a vector. The goal is to find the value of $x$ that maximizes or minimizes the objective function $f(x)$ while satisfying all linear constraints.
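To make the notation concrete, here is a minimal sketch that evaluates the objective and checks the constraints for one candidate point. The helper names `qp_objective` and `is_feasible` and all numbers are made up for illustration; they are not part of the Baseline.

```python
import numpy as np

def qp_objective(x, Q, c):
    """Return 0.5 * x^T Q x + c^T x for a 1-D vector x."""
    return 0.5 * x @ Q @ x + c @ x

def is_feasible(x, A, b, tol=1e-9):
    """Check the linear constraints A x <= b element-wise."""
    return bool(np.all(A @ x <= b + tol))

# Toy problem data (illustrative values only).
Q = np.array([[2.0, 0.0],
              [0.0, 2.0]])
c = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])
b = np.array([2.0])

x = np.array([0.5, 1.5])          # one candidate point
value = qp_objective(x, Q, c)     # quadratic part 2.5 plus linear part -7.0
feasible = is_feasible(x, A, b)   # 0.5 + 1.5 <= 2.0
```

A solver does the same evaluation implicitly, but searches over all feasible `x` instead of testing a single point.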

Characteristics of Quadratic Programming

Quadratic programming has three main characteristics:

  1. Quadratic nature of the objective function: The core feature of quadratic programming is that its objective function is a quadratic function of the variables. Whether the objective has a unique minimum or maximum depends on the properties of the matrix $Q$.
  2. Linear constraints: Although the objective function is quadratic, all constraints are linear. These constraints define a feasible region within which the optimization must take place.
  3. Convexity: If the matrix $Q$ is positive semi-definite, the minimization problem is convex. In a convex quadratic program, any local optimum is also a global optimum.
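The convexity condition in point 3 can be checked numerically. The sketch below tests whether a matrix is positive semi-definite by looking at its smallest eigenvalue; the function name `is_positive_semidefinite` and the tolerance are my own choices for illustration.

```python
import numpy as np

def is_positive_semidefinite(Q, tol=1e-9):
    """Return True if the symmetric part of Q has no eigenvalue below -tol."""
    Q_sym = 0.5 * (Q + Q.T)                  # x^T Q x only depends on the symmetric part
    eigenvalues = np.linalg.eigvalsh(Q_sym)  # eigvalsh is meant for symmetric matrices
    return bool(eigenvalues.min() >= -tol)

convex_Q = np.array([[2.0, 0.0],
                     [0.0, 1.0]])
indefinite_Q = np.array([[1.0, 0.0],
                         [0.0, -1.0]])

convex_case = is_positive_semidefinite(convex_Q)          # eigenvalues 1 and 2
indefinite_case = is_positive_semidefinite(indefinite_Q)  # eigenvalues -1 and 1
```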

Application of quadratic programming in multimodal speaker identification (my understanding)

In the context of Multimodal Speaker Identification (MMSI), Quadratic Programming is used to integrate information from the two models (CNN and RoBERTa) and find the optimal speaker assignment.

The objective function here consists of two parts:

  1. A term based on visual information: the CNN output
  2. A term based on textual information: the RoBERTa output

By optimizing this combined objective function, the Solver assigns the most likely speaker to each sentence while satisfying the constraints.
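For very small dialogues, the same idea can be sketched without a solver by scoring every possible assignment. This is only an illustration: the 1/2 weighting follows the `solve` function of the Baseline, while `combined_score`, `brute_force_assign`, and the toy scores are hypothetical.

```python
import itertools
import numpy as np

def combined_score(assignment, cnn_scores, roberta_scores):
    """Score one candidate assignment (one speaker index per sentence)."""
    n = len(assignment)
    # Visual term: CNN score of the chosen speaker for each sentence.
    visual = sum(cnn_scores[i, assignment[i]] for i in range(n))
    # Textual term: add roberta_scores[i, j] when sentences i and j share a speaker.
    textual = 0.5 * sum(
        roberta_scores[i, j]
        for i in range(n) for j in range(n)
        if assignment[i] == assignment[j]
    )
    return visual + textual

def brute_force_assign(cnn_scores, roberta_scores):
    """Try all m**n assignments; only practical for tiny n and m."""
    n, m = cnn_scores.shape
    candidates = itertools.product(range(m), repeat=n)
    return max(candidates, key=lambda a: combined_score(a, cnn_scores, roberta_scores))

# Toy scores for 3 sentences and 2 speakers (made-up values).
cnn = np.array([[0.9, 0.1],
                [0.5, 0.5],
                [0.2, 0.8]])
roberta = np.array([[0.0, 0.6, -0.6],
                    [0.6, 0.0, -0.6],
                    [-0.6, -0.6, 0.0]])

best = brute_force_assign(cnn, roberta)
best_score = combined_score(best, cnn, roberta)
```

In this toy case the CNN is undecided about the second sentence (0.5 vs 0.5), and the textual term breaks the tie by pairing it with the first sentence. The number of candidates grows as m**n, which is why a real optimization solver is used instead.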

Detailed code explanation

The main steps:

  1. Data transformation: Transform the model outputs of the CNN and RoBERTa / DeBERTa into the format required by the optimization problem
  2. Setting up the objective function and constraints
  3. Solving the optimization problem: Solve it with the Gurobi solver
  4. Optimization process: Make sure exactly one speaker is selected for each sentence
  5. Getting the results

Data transformation

import numpy as np

def convert_cnn_preds_to_matrix(frame_names_list, cnn_scores, label='pred'):
    """Convert per-frame CNN scores into one [num_frames, num_speakers] matrix per dialogue."""
    matrix_list, mappings_list = [], []
    for frame_names in frame_names_list:
        # Frames without CNN scores fall back to an empty dict (an all-zero row).
        scores = [cnn_scores.get(frame_name, {}) for frame_name in frame_names]
        # Collect every speaker that appears in any frame of this dialogue.
        speakers = {speaker for score_dict in scores for speaker in score_dict}
        speaker_id_mappings = {speaker: i for i, speaker in enumerate(speakers)}
        matrix = np.zeros((len(frame_names), len(speakers)))
        for i, score_dict in enumerate(scores):
            for speaker, score in score_dict.items():
                matrix[i][speaker_id_mappings[speaker]] = score[label]
        matrix_list.append(matrix)
        mappings_list.append(speaker_id_mappings)
    return matrix_list, mappings_list
  • Input: frame_names_list (the frame names of each dialogue) and cnn_scores (the CNN prediction scores)
  • Mapping: assign an index to every speaker appearing in the frames of a dialogue and store it in speaker_id_mappings
  • Construct a matrix: For each dialogue, initialize an all-zero matrix whose number of rows equals the number of frames and whose number of columns equals the number of speakers. Traverse the scores of each frame and put each speaker's score into the corresponding position of the matrix according to the mapping
  • Output:
    • Matrix list: the converted matrix for each dialogue
    • Mapping list: the speaker-to-index mapping for each dialogue
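The following usage sketch shows what the conversion produces. The function is copied from above so the block runs on its own; the frame names, speaker names, and scores are invented, and I am assuming that cnn_scores maps frame name to speaker to a dict holding the 'pred' score, as the indexing score[label] suggests.

```python
import numpy as np

def convert_cnn_preds_to_matrix(frame_names_list, cnn_scores, label='pred'):
    matrix_list, mappings_list = [], []
    for frame_names in frame_names_list:
        scores = [cnn_scores.get(frame_name, {}) for frame_name in frame_names]
        speakers = set(sum([list(i.keys()) for i in scores], list()))
        speaker_id_mappings = {speaker: i for i, speaker in enumerate(speakers)}
        matrix = np.zeros((len(frame_names), len(speakers)))
        for i, score_dict in enumerate(scores):
            for speaker, score in score_dict.items():
                matrix[i][speaker_id_mappings[speaker]] = score[label]
        matrix_list.append(matrix)
        mappings_list.append(speaker_id_mappings)
    return matrix_list, mappings_list

# Hypothetical CNN output: frame name -> speaker name -> {'pred': score}.
cnn_scores = {
    'frame_001': {'alice': {'pred': 0.9}, 'bob': {'pred': 0.1}},
    'frame_002': {'bob': {'pred': 0.7}},
}
# One dialogue with three turns; 'frame_003' has no detected faces.
frame_names_list = [['frame_001', 'frame_002', 'frame_003']]

matrices, mappings = convert_cnn_preds_to_matrix(frame_names_list, cnn_scores)
matrix, mapping = matrices[0], mappings[0]

# Column order depends on set iteration order, so always index through the mapping.
alice_col, bob_col = mapping['alice'], mapping['bob']
```

Speakers that do not appear in a frame keep a score of 0 in that row, and a frame with no scores at all becomes an all-zero row.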

Objective function and constraint setting

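import cvxpy as cp
import numpy as np

def solve(cnn_scores, roberta_scores):
    '''
    cnn_scores: matrix of shape [n, m]
    roberta_scores: matrix of shape [n, n]
    where m = num_speakers, n = num_sents
    '''

    n, m = cnn_scores.shape
    # One Boolean variable per (sentence, speaker) pair, flattened row by row.
    x = cp.Variable(np.prod(cnn_scores.shape), boolean=True)
    # Each sentence must be assigned exactly one speaker.
    constraints = []
    for i in range(n):
        constraints.append(cp.sum(x[i*m: i*m+m]) == 1)

    # Linear term: CNN score of the selected speaker for each sentence.
    cnn_objective = cnn_scores.reshape(-1) @ x
    # Expand roberta_scores to [n*m, n*m] so that sentences i and j interact
    # only when both are assigned the same speaker k.
    new_roberta_scores = np.zeros((n*m, n*m))
    for i in range(n):
        for j in range(n):
            for k in range(m):
                new_roberta_scores[i*m+k, j*m+k] += roberta_scores[i, j]

    # Quadratic term: reward for giving related sentences the same speaker.
    roberta_objective = cp.quad_form(x, new_roberta_scores) * (1/2)
    objective = cnn_objective + roberta_objective

    problem = cp.Problem(cp.Maximize(objective), constraints)
    problem.solve(solver='GUROBI')
    return x.value.reshape(n, m), problem.status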
  • Variable definition: Define the decision variable x, an array of Boolean variables whose length equals the total number of elements of the cnn_scores matrix. The entry x[i*m+k] is 1 if speaker k is chosen for sentence i and 0 otherwise
  • CNN objective function: cnn_objective is obtained by flattening the cnn_scores matrix and taking its inner product with the decision variable x. This part represents the reward for selecting a specific speaker for each sentence
  • Roberta objective function:
    • new_roberta_scores: a matrix of shape $nm \times nm$. It is an expansion of roberta_scores in which the entry at row i*m+k, column j*m+k holds roberta_scores[i, j], so the score between sentences i and j only counts when both are assigned the same speaker k
    • roberta_objective is obtained by applying the quadratic form to the decision variable x, representing the reward for choosing the same speaker for sentences that the text model links together
  • Total objective function: The final objective is the sum of cnn_objective and roberta_objective. Subject to the constraints, we want to maximize this total reward
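To see why the expanded matrix does what the bullets describe, the sketch below rebuilds it with plain NumPy and compares the quadratic form against a direct sum over sentence pairs that share a speaker. The helper names `expand_roberta_scores` and `one_hot_flat` and the toy scores are mine, not the Baseline's.

```python
import numpy as np

def expand_roberta_scores(roberta_scores, m):
    """Expand an [n, n] score matrix to [n*m, n*m], as in the solve function."""
    n = roberta_scores.shape[0]
    expanded = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(n):
            for k in range(m):
                expanded[i * m + k, j * m + k] += roberta_scores[i, j]
    return expanded

def one_hot_flat(assignment, m):
    """Flattened 0/1 vector with x[i*m + k] = 1 when sentence i gets speaker k."""
    x = np.zeros(len(assignment) * m)
    for i, k in enumerate(assignment):
        x[i * m + k] = 1.0
    return x

# Toy scores for 3 sentences (made-up values) and 2 speakers.
roberta = np.array([[0.0, 0.6, -0.6],
                    [0.6, 0.0, -0.6],
                    [-0.6, -0.6, 0.0]])
m = 2
assignment = [0, 0, 1]  # sentences 0 and 1 share speaker 0

expanded = expand_roberta_scores(roberta, m)
x = one_hot_flat(assignment, m)

quad = x @ expanded @ x
direct = sum(roberta[i, j]
             for i in range(3) for j in range(3)
             if assignment[i] == assignment[j])

# The triple loop builds the Kronecker product of roberta with an identity matrix.
same_as_kron = np.allclose(expanded, np.kron(roberta, np.eye(m)))
```

Both `quad` and `direct` count the pair (0, 1) twice, once in each order, which is one way to read the factor 1/2 in the objective.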

Setting of constraints

constraints = []
for i in range(n):
    constraints.append(cp.sum(x[i*m: i*m+m]) == 1)

Constraint definition: For each of the n sentences, add a constraint ensuring that exactly one speaker is selected for that sentence. The implementation is:

  1. Initialize the constraint list: Create an empty list, constraints, to store all constraints
  2. Loop to add constraints: For each sentence (for i in range(n)), perform the following operation:
    • x[i*m: i*m+m]: This is a slice of the decision variable x that selects the part related to the i-th sentence. Since x is a one-dimensional array, the slice selects the m entries (one per speaker) belonging to sentence i. Requiring their sum to equal 1 means that exactly one speaker is selected for each sentence
  3. Add to the constraint list: Append the constraint for each sentence to the constraints list
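The effect of these constraints can be checked on a plain 0/1 vector without any solver. In this sketch the function name `satisfies_one_speaker_constraint` and the example vectors are hypothetical.

```python
import numpy as np

def satisfies_one_speaker_constraint(x, n, m):
    """Check that a flattened 0/1 vector picks exactly one speaker per sentence."""
    x = np.asarray(x)
    is_binary = np.all((x == 0) | (x == 1))
    # Row i of the reshaped array corresponds to the slice x[i*m : i*m+m].
    row_sums = x.reshape(n, m).sum(axis=1)
    return bool(is_binary and np.all(row_sums == 1))

n, m = 3, 2
valid = [1, 0,  0, 1,  1, 0]    # one speaker per sentence
invalid = [1, 1,  0, 0,  1, 0]  # sentence 0 has two speakers, sentence 1 has none

valid_ok = satisfies_one_speaker_constraint(valid, n, m)
invalid_ok = satisfies_one_speaker_constraint(invalid, n, m)
```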

CVXPY and Gurobi solution

CVXPY

CVXPY is a Python library for constructing and solving convex optimization problems. In the solver for multimodal speaker identification, CVXPY is used to define the quadratic programming problem. This includes constructing the objective function (CNN output + RoBERTa output) and defining the constraints (exactly one speaker per sentence).

Gurobi

Gurobi is a powerful mathematical optimization solver that is widely used in industry and academia. It can efficiently solve various types of optimization problems, including linear programming, integer programming, and quadratic programming. In our Solver, Gurobi is used as the backend solver of CVXPY. Once the optimization problem is defined in CVXPY, Gurobi is used to actually solve it and find the optimal solution.

CVXPY and Gurobi solve optimization problems

Define decision variables:

x = cp.Variable(np.prod(cnn_scores.shape), boolean=True)

Construct objective function and constraints:

  • As mentioned above

Create an optimization problem:

problem = cp.Problem(cp.Maximize(objective), constraints)

Solve the optimization problem:

problem.solve(solver='GUROBI')

Process the solution results:

return x.value.reshape(n, m), problem.status
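The returned [n, m] matrix still has to be turned back into speaker names. A minimal sketch, assuming the speaker-to-index mapping produced by convert_cnn_preds_to_matrix is available; `decode_speakers` and the example values are hypothetical, and the Baseline may do this step differently.

```python
import numpy as np

def decode_speakers(solution, speaker_id_mapping):
    """Map each row of the [n, m] solution matrix back to a speaker name."""
    index_to_speaker = {idx: name for name, idx in speaker_id_mapping.items()}
    # argmax also copes with values such as 0.9999 instead of an exact 1.
    chosen = np.argmax(solution, axis=1)
    return [index_to_speaker[int(k)] for k in chosen]

# Example solver output for 3 sentences and 2 speakers (made-up values).
solution = np.array([[1.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0]])
mapping = {'alice': 0, 'bob': 1}

speakers = decode_speakers(solution, mapping)
```

It is also worth checking the returned problem.status before decoding, since x.value is None when the solver does not find a solution.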
