[CCF BDCI 2023] Speaker recognition in multi-modal multi-party dialogue scenarios: Baseline 0.71, Solver part
Overview
Technology is changing with each passing day, and the development of Artificial Intelligence is rapidly changing the way we live and work, especially in fields such as Natural Language Processing and Computer Vision.
Traditional multimodal dialogue research mainly focuses on the interaction between a single user and the system, while ignoring the complexity of multi-user scenarios. Visual information is often marginalized and used only as auxiliary information rather than as a core part of the dialogue. In practical applications, the algorithm needs to "observe" and interact with multiple users, some of whom may not be the current speaker.
[CCF BDCI 2023] Speaker recognition in multi-modal multi-party dialogue scenarios. The core idea is to identify the speaker of each dialogue turn, given the content of multiple consecutive turns, the frame corresponding to each turn, and the face bboxes and name labels in those frames.
The code of the Baseline is divided into three files, namely convex_optimization.py, dialog_roberta-constrasive.py, and finetune_cnn-multiturn.py. Below I will explain the Solver part in detail.
The role of Solver in multimodal speaker recognition
The role of the solver (Solver) in multi-modal speaker recognition is to integrate and analyze data from different modalities, such as visual information from video frames and textual information from scripts. The Solver's task is to extract useful information from these complex data and ultimately identify the speaker of each dialogue turn.
The Importance of Solver in Multimodal Speaker Recognition
Three main points of the solver:
- Integrate multi-modal data: the Solver can process data from different sources, in this case the outputs of the CNN and the NLP model, which provide different perspectives for recognition
- Optimize the decision-making process: by applying mathematical and statistical methods, the Solver can find the optimal solution among many potential speakers
- Improve accuracy: compared with single-modality analysis (CNN or NLP or audio alone), the Solver can consider several kinds of information together to improve the accuracy of speaker recognition (this can be understood as loosely similar to an ensemble method such as Bagging)
How Solver works
The Solver usually uses numerical optimization techniques, such as the Quadratic Programming used this time, to solve the speaker identification problem. First, the outputs of the two models are converted into the parameters of a mathematical problem, and then an optimization algorithm is used to find the optimal combination of decision variables, which determines the speaker.
Quadratic programming
First of all, I would like to state that I am a novice with only a primary-school education, and my mathematics is still at the level of 2x + 1 = 5. The following content is based on online material plus my own superficial understanding. If there are any mistakes, please correct me.
Quadratic Programming (QP) is a special type of mathematical optimization problem involving a quadratic objective function and a series of linear constraints.
The basic form of quadratic programming
The quadratic programming problem can be formalized as the following mathematical model:
- Objective function: $f(x) = \frac{1}{2} x^T Q x + c^T x$
- Constraints: $Ax \leq b$
Here, $x$ is the variable vector to be optimized, $Q$ is a symmetric matrix, $c$ is a vector, $A$ is a matrix, and $b$ is a vector. The goal is to find the value of $x$ that maximizes or minimizes the objective function $f(x)$ while satisfying all linear constraints.
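To make the notation concrete, here is a minimal sketch that evaluates the objective and checks the constraints for one candidate point. The values of Q, c, A, and b are toy numbers chosen only for illustration, and the helper names qp_objective and is_feasible are my own, not part of the Baseline.

```python
import numpy as np

# Toy data, chosen only for illustration (not taken from the Baseline).
Q = np.array([[2.0, 0.0],
              [0.0, 2.0]])   # symmetric matrix of the quadratic term
c = np.array([-2.0, -5.0])   # vector of the linear term
A = np.array([[1.0, 1.0]])   # one linear constraint: x1 + x2 <= 3
b = np.array([3.0])


def qp_objective(x):
    """Evaluate f(x) = 1/2 * x^T Q x + c^T x."""
    return 0.5 * x @ Q @ x + c @ x


def is_feasible(x, tol=1e-9):
    """Return True if every row of A x <= b holds (up to a small tolerance)."""
    return bool(np.all(A @ x <= b + tol))


x = np.array([1.0, 2.0])
value = qp_objective(x)      # 0.5 * 10 + (-12) = -7
feasible = is_feasible(x)    # 1 + 2 <= 3 holds
print(value, feasible)
```

This only evaluates a given point. Actually finding the best feasible point is the job of the optimization solver.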
Characteristics of quadratic programming
Characteristics of quadratic programming:
- Quadratic nature of the objective function: the core feature of quadratic programming is that its objective function is a quadratic function of the variables. Whether the objective has a unique minimum or maximum depends on the properties of the matrix $Q$ (for example, a positive definite $Q$ gives a unique minimizer whenever the feasible region is non-empty).
- Linear constraints: although the objective function is quadratic, all constraints are linear. These constraints define a feasible region within which the optimization must take place.
- Convexity: if the matrix $Q$ is positive semi-definite (and the objective is minimized), the quadratic programming problem is convex. In a convex quadratic program, any local optimum is also a global optimum.
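The convexity condition can be checked numerically. The sketch below is only an illustration under my own assumptions: the function name is_positive_semidefinite and the two example matrices are hypothetical, and the tolerance is an arbitrary small number.

```python
import numpy as np


def is_positive_semidefinite(Q, tol=1e-9):
    """Check whether x^T Q x >= 0 for all x.

    Only the symmetric part of Q contributes to x^T Q x, so we
    symmetrize first and then look at the eigenvalues.
    """
    sym = (Q + Q.T) / 2
    return bool(np.all(np.linalg.eigvalsh(sym) >= -tol))


# Hypothetical example matrices.
Q_convex = np.array([[2.0, 0.0],
                     [0.0, 1.0]])      # eigenvalues 2 and 1
Q_nonconvex = np.array([[1.0, 0.0],
                        [0.0, -1.0]])  # eigenvalues 1 and -1

print(is_positive_semidefinite(Q_convex))
print(is_positive_semidefinite(Q_nonconvex))
```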
Application of quadratic programming in multimodal speaker identification (my understanding)
In the context of Multimodal Speaker Identification (MMSI), Quadratic Programming is used to integrate information from different models (CNN & Roberta) and find the optimal speaker assignment.
The objective function here consists of two parts:
- A term based on visual information: the CNN output
- A term based on text information: the Roberta output
By optimizing this combined objective function, the Solver can assign the most likely speaker to each sentence while satisfying certain constraints.
Detailed code explanation
The main steps:
- Data transformation: Used to transform the model output of CNN and Roberta / Deberta into the format required by the optimization problem
- Setting of objective functions and constraints
- Solving the optimization problem: Solving with the Gurobi solver
- Optimization process: Make sure only one speaker is selected
- Get results
Data transformation
import numpy as np

def convert_cnn_preds_to_matrix(frame_names_list, cnn_scores, label='pred'):
    matrix_list, mappings_list = [], []
    for frame_names in frame_names_list:  # one list of frames per dialogue
        # Per-frame dict {speaker: scores}; frames without predictions give {}
        scores = [cnn_scores.get(frame_name, {}) for frame_name in frame_names]
        # All speakers that appear in any frame of this dialogue
        speakers = set(sum([list(i.keys()) for i in scores], list()))
        speaker_id_mappings = {speaker: i for i, speaker in enumerate(speakers)}
        # Rows are frames, columns are speakers; missing entries stay 0
        matrix = np.zeros((len(frame_names), len(speakers)))
        for i, score_dict in enumerate(scores):
            for speaker, score in score_dict.items():
                matrix[i][speaker_id_mappings[speaker]] = score[label]
        matrix_list.append(matrix)
        mappings_list.append(speaker_id_mappings)
    return matrix_list, mappings_list
- Input: frame_names_list (the frame names of each dialogue) and cnn_scores (CNN prediction scores)
- Mapping: assign an index to each speaker appearing in the frames of a dialogue, stored in speaker_id_mappings
- Construct a matrix: for each dialogue, initialize a matrix of all 0s, with rows equal to the number of frames and columns equal to the number of speakers. Traverse the scores of each frame and put the score of each speaker into the corresponding position of the matrix according to the mapping
- Output:
  - Matrix list: the score matrix of each dialogue
  - Mapping list: the speaker-to-index mapping of each dialogue
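A small usage sketch may help here. The function is repeated so that the snippet runs on its own. The input layout (frame name, then speaker name, then a dict holding a 'pred' score) is my assumption, inferred from how the function indexes score[label], and the frame and speaker names are made up.

```python
import numpy as np


def convert_cnn_preds_to_matrix(frame_names_list, cnn_scores, label='pred'):
    # Same logic as the Baseline function, repeated to keep this snippet self-contained.
    matrix_list, mappings_list = [], []
    for frame_names in frame_names_list:
        scores = [cnn_scores.get(frame_name, {}) for frame_name in frame_names]
        speakers = set(sum([list(i.keys()) for i in scores], list()))
        speaker_id_mappings = {speaker: i for i, speaker in enumerate(speakers)}
        matrix = np.zeros((len(frame_names), len(speakers)))
        for i, score_dict in enumerate(scores):
            for speaker, score in score_dict.items():
                matrix[i][speaker_id_mappings[speaker]] = score[label]
        matrix_list.append(matrix)
        mappings_list.append(speaker_id_mappings)
    return matrix_list, mappings_list


# Hypothetical toy input: one dialogue with three frames.
cnn_scores = {
    "frame_0": {"alice": {"pred": 0.9}, "bob": {"pred": 0.1}},
    "frame_1": {"bob": {"pred": 0.7}},
    # "frame_2" has no CNN prediction at all, so its row stays all zeros.
}
frame_names_list = [["frame_0", "frame_1", "frame_2"]]

matrices, mappings = convert_cnn_preds_to_matrix(frame_names_list, cnn_scores)
matrix, mapping = matrices[0], mappings[0]
print(mapping)
print(matrix)
```

Because the speakers are collected in a set, the column order can differ between runs, so columns should always be looked up through the mapping rather than by a fixed position.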
Objective function and constraint setting
import cvxpy as cp
import numpy as np

def solve(cnn_scores, roberta_scores):
    '''
    cnn_scores: matrix of shape [n, m]
    roberta_scores: matrix of shape [n, n]
    where m = num_speakers, n = num_sents
    '''
    n, m = cnn_scores.shape
    # One Boolean variable per (sentence, speaker) pair, flattened row by row
    x = cp.Variable(np.prod(cnn_scores.shape), boolean=True)
    # Exactly one speaker per sentence
    constraints = []
    for i in range(n):
        constraints.append(cp.sum(x[i*m: i*m+m]) == 1)
    # Linear term: reward from the CNN scores
    cnn_objective = cnn_scores.reshape(-1).T @ x.T
    # Expand roberta_scores to [n*m, n*m]; only entries with the same speaker k are coupled
    new_roberta_scores = np.zeros((n*m, n*m))
    for i in range(n):
        for j in range(n):
            for k in range(m):
                new_roberta_scores[i*m+k, j*m+k] += roberta_scores[i, j]
    # Quadratic term: reward from the Roberta scores
    roberta_objective = cp.quad_form(x, new_roberta_scores) * (1/2)
    objective = cnn_objective + roberta_objective
    problem = cp.Problem(cp.Maximize(objective), constraints)
    problem.solve(solver='GUROBI')
    return x.value.reshape(n, m), problem.status
- Variable definition: define the decision variable x, an array of Boolean variables whose length equals the total number of elements of the cnn_scores matrix. Each entry represents the decision of whether a given speaker is chosen for a given sentence
- CNN objective function: cnn_objective is obtained by flattening the cnn_scores matrix and taking its dot product with the decision variable x. This part represents the reward for selecting a specific speaker
- Roberta objective function:
  - new_roberta_scores is an $nm \times nm$ matrix. It is an expansion of roberta_scores and is used to express the similarity between different sentences
  - roberta_objective is obtained by applying the quadratic form to the decision variable x, representing the reward for choosing the same speaker for two sentences
- Total objective function: the final objective function, objective, is the sum of cnn_objective and roberta_objective. It expresses that, subject to the constraints, we want to maximize the total reward
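To see what the expanded Roberta matrix does, the sketch below rebuilds it with the same triple loop and compares it with np.kron(roberta_scores, np.eye(m)), which, as far as I can tell, produces the same matrix. It also checks that for a one-hot assignment the quadratic form equals the sum of roberta_scores[i, j] over all sentence pairs that share a speaker. The random scores, the assignment, and the helper names are assumptions for illustration only.

```python
import numpy as np


def expand_with_loops(roberta_scores, m):
    """Same triple loop as in solve(): couple (i, k) with (j, k) for every speaker k."""
    n = roberta_scores.shape[0]
    expanded = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(n):
            for k in range(m):
                expanded[i * m + k, j * m + k] += roberta_scores[i, j]
    return expanded


def expand_with_kron(roberta_scores, m):
    """Equivalent construction using a Kronecker product with the identity."""
    return np.kron(roberta_scores, np.eye(m))


rng = np.random.default_rng(0)
roberta_scores = rng.random((3, 3))   # toy scores for n = 3 sentences
m = 2                                 # toy number of speakers
expanded = expand_with_loops(roberta_scores, m)

# A one-hot assignment: sentence 0 -> speaker 0, sentence 1 -> speaker 1, sentence 2 -> speaker 0.
assignment = [0, 1, 0]
x = np.eye(m)[assignment].reshape(-1)

quadratic_form = x @ expanded @ x
same_speaker_sum = sum(
    roberta_scores[i, j]
    for i in range(3)
    for j in range(3)
    if assignment[i] == assignment[j]
)
print(np.allclose(expanded, expand_with_kron(roberta_scores, m)))
print(np.isclose(quadratic_form, same_speaker_sum))
```

In other words, the quadratic term only rewards (or penalizes) pairs of sentences that are assigned to the same speaker.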
Setting of constraints
constraints = []
for i in range(n):
    constraints.append(cp.sum(x[i*m: i*m+m]) == 1)
Constraint definition: for each of the n sentences, add a constraint to ensure that exactly one speaker is selected. The specific implementation is:
- Initialize the constraint list: create an empty list, constraints, used to store all constraints
- Loop to add constraints: for each sentence (for i in range(n)), perform the following operations:
  - x[i*m: i*m+m] is a slicing operation on the decision variable x, selecting the part related to the i-th sentence. Since x is a one-dimensional array, this slice actually selects the m entries (one per speaker) that belong to the i-th sentence
  - cp.sum(...) == 1 requires these Boolean entries to sum to 1. That is, only one speaker can be selected for each sentence
- Add to the constraint list: append the constraint for each sentence to the constraints list
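For very small inputs, the same problem can be solved without any solver by trying every assignment that satisfies the one-speaker-per-sentence constraint. The sketch below is a brute-force stand-in that I wrote for checking intuition, not the Baseline's method. The function name brute_force_solve and the toy scores are hypothetical, and the running time grows as m to the power n, so it is only usable for tiny dialogues.

```python
import itertools

import numpy as np


def brute_force_solve(cnn_scores, roberta_scores):
    """Enumerate every one-speaker-per-sentence assignment and keep the best one.

    The objective mirrors the one described above:
    sum of chosen CNN scores + 1/2 * sum of roberta_scores[i, j]
    over all sentence pairs (i, j) assigned to the same speaker.
    """
    n, m = cnn_scores.shape
    best_value, best_assignment = -np.inf, None
    for assignment in itertools.product(range(m), repeat=n):
        s = np.array(assignment)
        cnn_term = cnn_scores[np.arange(n), s].sum()
        same_speaker = s[:, None] == s[None, :]
        roberta_term = 0.5 * (roberta_scores * same_speaker).sum()
        value = cnn_term + roberta_term
        if value > best_value:
            best_value, best_assignment = value, assignment
    # Convert the best assignment into the same [n, m] 0/1 layout that solve() returns.
    x = np.zeros((n, m))
    x[np.arange(n), list(best_assignment)] = 1
    return x, best_value


# Toy scores: sentences 0 and 1 look alike, sentence 2 looks different from both.
cnn_scores = np.array([[0.9, 0.1],
                       [0.5, 0.5],
                       [0.2, 0.8]])
roberta_scores = np.array([[0.0, 1.0, -1.0],
                           [1.0, 0.0, -1.0],
                           [-1.0, -1.0, 0.0]])

x, best_value = brute_force_solve(cnn_scores, roberta_scores)
print(x)
print(best_value)
```

In this toy case the CNN is undecided about sentence 1, and the text similarity with sentence 0 tips it toward the first speaker.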
CVXPY and Gurobi solution
CVXPY
CVXPY is a Python library used to construct and solve convex optimization problems. In the solver for multimodal speaker identification, CVXPY is used to define the quadratic programming problem, including constructing the objective function (CNN output + Roberta output) and defining the constraints (exactly one speaker per sentence).
Gurobi
Gurobi is a powerful mathematical optimization solver that is widely used in industry and academia. Gurobi can efficiently solve various types of optimization problems, including linear programming, integer programming, and quadratic programming. In our Solver, Gurobi is used as the backend solver of CVXPY. Once the optimization problem is defined in CVXPY, Gurobi is used to actually solve it and find the optimal solution.
CVXPY and Gurobi solve optimization problems
Define decision variables:
x = cp.Variable(np.prod(cnn_scores.shape), boolean=True)
Construct objective function and constraints:
- As mentioned above
Create an optimization problem:
problem = cp.Problem(cp.Maximize(objective), constraints)
Solve the optimization problem:
problem.solve(solver='GUROBI')
Process the solution results:
return x.value.reshape(n, m), problem.status
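The returned [n, m] matrix still has to be turned back into speaker names. A minimal sketch of that last step is shown below, assuming the speaker-to-index mapping produced during data transformation is available. The helper decode_assignment is hypothetical, and treating only the status string "optimal" as success is my own simplification.

```python
import numpy as np


def decode_assignment(x_value, speaker_id_mappings, status):
    """Map each row of the 0/1 solution matrix to a speaker name.

    Returns None when the solver did not report an optimal solution.
    """
    if status != "optimal":
        return None
    id_to_speaker = {idx: name for name, idx in speaker_id_mappings.items()}
    # argmax is robust to tiny numerical noise around 0 and 1.
    chosen = np.argmax(x_value, axis=1)
    return [id_to_speaker[int(k)] for k in chosen]


# Hypothetical solver output for three sentences and two speakers.
x_value = np.array([[1.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 1.0]])
speaker_id_mappings = {"alice": 0, "bob": 1}

speakers = decode_assignment(x_value, speaker_id_mappings, "optimal")
print(speakers)
```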
Citation
Beasley, J. E. (1998). Heuristic algorithms for the unconstrained binary quadratic programming problem. ResearchGate. Retrieved from https://www.researchgate.net/publication/2661228
Vanderbei, R. J. (1999). LOQO: An interior point code for quadratic programming. Optimization Methods and Software, 11(1-4), 451-484. https://doi.org/10.1080/10556789908805759
Axehill, D. (2008). Integer quadratic programming for control and communication. DIVA. Retrieved from https://www.diva-portal.org/smash/record.jsf?pid=diva2:17358