Paper Reading - FedACK: Federated Adversarial Contrastive Knowledge Distillation for Cross-Lingual and Cross-Model Social Bot Detection

Paper: FedACK: Federated Adversarial Contrastive Knowledge Distillation for Cross-Lingual and Cross-Model Social Bot Detection

Paper link: https://arxiv.org/pdf/2303.07113.pdf

Code: GitHub - 846468230/FedACK: The code for the paper "Cross Platforms Linguals and Models Social Bot Detection via Federated Adversarial Contrastive Knowledge Distillation." on The Web Conference 2023 (WWW).

Summary

        Social bot detection is critical to the resilience and security of online social platforms. State-of-the-art detection models are isolated and largely ignore the diverse data features available from multiple cross-lingual platforms. Meanwhile, the heterogeneity of data distributions and model structures makes it very complicated to design an effective cross-platform and cross-model detection framework.

        In this paper, we propose a new federated adversarial contrastive knowledge extraction framework, FedACK, for social bot detection.

        We design a GAN-based federated knowledge distillation mechanism for efficiently transferring knowledge of data distributions between clients. In particular, a global generator is used to extract knowledge of the global data distribution and distill it into each client's local model.

        We leverage local discriminators to enable custom model design and local generators for data augmentation on hard-to-discriminate samples.

        Local training is performed as multi-stage adversarial and contrastive learning to achieve a consistent feature space across clients and constrain the optimization direction of local models, reducing the discrepancy between local and global models. Experiments show that FedACK outperforms state-of-the-art methods in terms of accuracy, communication efficiency, and feature space consistency.

1 Introduction

        Social bots imitate human behavior on social networks such as Twitter, Facebook, and Instagram [43]. Millions of bots, often controlled by automated programs or platform APIs [1], disguise themselves as real users to infiltrate platforms in pursuit of malicious goals, such as election interference [11, 17], misinformation spreading [8], and privacy attacks [37]. Bots are also involved in spreading extremist ideologies [3, 18], posing a threat to online communities. The compromised user experience and adverse social influence they cause make effective bot detection a necessity.

        There is a new and understudied problem in bot detection: bot societies are often active on multiple social platforms and appear as collaborative groups. Existing bot detection solutions largely rely on user attribute features extracted from metadata [9, 41], textual features from data such as tweets [15, 39], or graph-based techniques that explore neighborhood information [14, 42, 46]. While such models can reveal camouflage behavior, they are isolated and constrained by the amount, shape, and quality of platform-specific data. To this end, federated learning (FL) has become a major driver for model training across heterogeneous platforms without exposing local private datasets. Some studies [32, 44, 45, 49] enhance FL in a data-free manner via generative adversarial networks (GANs) and knowledge distillation (KD) to preserve privacy. However, they have the following limitations:

        i) Restriction to homogeneous model architectures. Most FL methods assume a homogeneous model architecture across clients: participants are strictly required to adhere to the same model architecture managed by a central server. It is therefore imperative to enable each individual platform to tailor a heterogeneous model to its unique data characteristics;

        ii) Inconsistent feature learning space. State-of-the-art federated KD methods are mainly based on image samples and assume a consistent feature space. However, the difference between global and local data distributions often leads to non-negligible model drift and an inconsistent feature learning space, which in turn causes performance loss. It is highly desirable to align feature spaces across different clients to improve global model performance;

        iii) Sensitivity to content language. To date, anomaly detection methods based on text data are sensitive to the language the models are trained on. Existing solutions for cross-lingual content detection in online social networks either substantially increase the computational cost [10, 13, 50] or require labor-intensive feature engineering to identify features that are invariant across languages [7, 12, 36]. Arguably, how to incorporate various customized models with heterogeneous data in different languages into a collaborative model with a consistent feature learning space remains underexplored.

        In this paper, we propose FedACK, a novel bot detection framework that combines federated adversarial learning, contrastive learning, and knowledge distillation. FedACK enables personalization of local models within a consistent feature space across languages (see Figure 1).

Figure 1: combining multiple social platforms with heterogeneous languages, context spaces and model architectures.

         We propose a new federated GAN-based knowledge distillation architecture: a global generator extracts knowledge of the global data distribution, which is then distilled into each client's local model.

        We carefully design two discriminators, one globally shared and one local, to enable custom model design, and use a local generator for data augmentation on hard-to-discriminate samples.

        Specifically, the local training of each client is regarded as a multi-stage adversarial learning process to efficiently transfer data distribution knowledge to each client and learn a consistent feature space and decision boundary.

        We further utilize contrastive learning to constrain the optimization direction of the local model and reduce the difference between the local model and the global model.

        To replicate non-IID data distributions across multiple platforms, we use two real-world Twitter datasets partitioned by a Dirichlet distribution.
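The Dirichlet-based partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' exact script; `alpha` controls the degree of heterogeneity (smaller `alpha` gives more skewed, non-IID client splits):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients using a Dirichlet(alpha)
    prior per class; smaller alpha -> more heterogeneous splits."""
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    client_idx = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Proportion of class-c samples assigned to each client.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return client_idx

labels = np.array([0] * 500 + [1] * 500)  # toy benign/bot labels
parts = dirichlet_partition(labels, n_clients=5, alpha=0.1)
```

With `alpha=0.1` some clients end up holding almost only one class, mimicking the high-heterogeneity setting evaluated in the paper.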

        Experiments show that FedACK outperforms state-of-the-art methods in terms of accuracy and achieves competitive communication efficiency and a consistent feature space.

Contributions

        To our knowledge, FedACK is the first social bot detection solution based on federated knowledge distillation, which envisions cross-lingual and cross-model bot detection.

        We devise contrastive and adversarial learning mechanisms that achieve a consistent feature space for better knowledge transfer and representation when dealing with non-IID data and data scarcity across clients.

        FedACK outperforms other FL-based methods, improving accuracy by 15.19% in high-heterogeneity scenarios and converging up to 4.5× faster than the second-fastest method.

2 Preliminaries

2.1 Background

Federated Learning (FL).

        FL is a distributed learning paradigm that allows clients to perform local training before aggregation without sharing clients' private data [4, 22, 27, 28, 30].
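As a minimal, generic illustration of this paradigm (the canonical FedAvg aggregation step, not FedACK-specific), the server averages client parameters with weights proportional to each client's local sample count:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: weighted average of flattened client parameter vectors,
    weighted by each client's number of local samples."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()                 # aggregation weights
    stacked = np.stack(client_weights)           # (n_clients, n_params)
    return (coeffs[:, None] * stacked).sum(axis=0)

# Client 2 holds 3x as much data, so its parameters dominate.
w = fedavg([np.array([1.0, 2.0]), np.array([3.0, 4.0])], [1, 3])
# -> [2.5, 3.5]
```

Each client trains locally, ships only parameters (never raw data) to the server, and receives the aggregate back.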

        Although promising, the performance of FL may be poor, especially when the training data on local devices is not independent and identically distributed (non-IID) [25, 47], which can bias the model towards local optima [20].

        Most existing works mainly fall into two categories.

        The first is to introduce additional data or use data augmentation to address the model drift caused by non-IID data. FedGAN [32] trains GANs to handle non-IID data in a communication-efficient manner, but bias is inevitable. FedGen [49] and FedDTG [45] utilize generators to model global data distributions to improve performance.

        The second category mainly focuses on local regularization. FedProx [25] adds an optimization term in local training, and SCAFFOLD [20] uses control variates to correct client-side drift in local updates while guaranteeing faster convergence. FedDyn [2] and MOON [24] constrain the direction of local model updates by comparing the similarity between model representations, aligning local and global optimization objectives. However, these methods either directly aggregate local models into a global model [35], which leads to non-negligible performance degradation, or ignore the effect of data heterogeneity, which may lose knowledge of local data distributions during model aggregation.
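MOON's model-contrastive regularizer, the idea FedACK's contrastive constraint relates to, can be sketched as follows: the current local representation `z` is pulled toward the global model's representation `z_glob` and pushed away from the previous local model's `z_prev`. A minimal numpy sketch, not the paper's exact loss:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def model_contrastive_loss(z, z_glob, z_prev, tau=0.5):
    """-log( exp(sim(z,z_glob)/tau) /
             (exp(sim(z,z_glob)/tau) + exp(sim(z,z_prev)/tau)) )
    Low when z agrees with the global model, high when it drifts
    back toward the stale local model."""
    pos = np.exp(cosine(z, z_glob) / tau)
    neg = np.exp(cosine(z, z_prev) / tau)
    return -np.log(pos / (pos + neg))

z = np.array([1.0, 0.0])
loss_aligned  = model_contrastive_loss(z, z_glob=np.array([1.0, 0.1]),
                                       z_prev=np.array([0.0, 1.0]))
loss_drifting = model_contrastive_loss(z, z_glob=np.array([0.0, 1.0]),
                                       z_prev=np.array([1.0, 0.1]))
```

`loss_aligned < loss_drifting`: the penalty grows as the local representation strays from the global one, which is exactly the "constrain the update direction" effect described above.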

Federated Knowledge Distillation (KD).

        KD was first introduced to let compact models approximate the features learned by larger models [5]. Knowledge is typically represented as softened logits; in standard KD the student model absorbs and imitates the knowledge of the teacher model [19]. KD is inherently beneficial to FL because it requires little or no data for a model to convey the data distribution. FedDistill [33] distills the logits of user data obtained through model forward propagation and forms a global knowledge distillation to reduce global model drift. FedDF [26] proposes ensemble distillation for model fusion and trains the global model by averaging the logits of the local models. FedGen [49] uses the averaged logits of the local models as a teacher in KD to train a global generator. FedFTG [44] uses the logits of each local model as a teacher to train a global generator and distills knowledge to fine-tune the global model using the fake data the generator produces. However, none of them focus on achieving a consistent feature space, which leads to ineffective knowledge propagation. So far, FL and KD have been largely ignored in social bot detection, which has been investigated in an isolated fashion [8]. FedACK fills this gap by augmenting adversarial learning with shared and exclusive discriminators to support customized cross-model bot detection.
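The softened-logits distillation that these federated KD methods build on can be sketched generically as follows (Hinton-style KD with temperature `T`; not FedACK's exact formulation):

```python
import numpy as np

def softmax(x, T=1.0):
    e = np.exp((x - x.max()) / T)   # temperature-softened, stable softmax
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=3.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in Hinton et al."""
    p = softmax(teacher_logits, T)  # teacher's softened "knowledge"
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * np.log(p / q)))

teacher = np.array([2.0, 0.5, -1.0])
close = kd_loss(np.array([1.9, 0.6, -1.1]), teacher)  # near teacher
far   = kd_loss(np.array([-1.0, 0.5, 2.0]), teacher)  # far from teacher
```

The temperature `T > 1` spreads probability mass over non-target classes, so the student also learns the teacher's inter-class similarity structure rather than just the hard label.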

Cross-lingual content detection in social networks.

        Posting false or misleading content on social networks in different languages via social bots has become the norm rather than the exception. [7, 12] explored the possibility of cross-lingual content detection by seeking features that are invariant across languages. There is also a large body of research [10, 13, 29, 31, 50] on cross-lingual text embeddings and model representations for detecting hate speech, fake news, or unusual events. These works often require enormous effort to find cross-lingual invariants in the data and are thus computationally inefficient. Although InfoXLM [6] could be applied to the cross-lingual module in FedACK, it would involve additional overhead for only a few mainstream languages on social platforms. FedACK instead achieves text embedding by mapping cross-lingual texts into the same context space.

2.2 Scope of the problem

        We consider a federated social bot detection setup consisting of a central server and K clients holding private datasets {D1, . . . , DK}. These private datasets contain benign accounts and bots of different generations. Different clients may have different model architectures or parameters. FedACK focuses on metadata and textual data rather than multimodal data. Rather than collecting raw client data, the server handles the heterogeneous data distributions across clients and aggregates model parameters for a shared network. The goal is to minimize the overall error among all clients:

where L is the loss function used to evaluate the predictive model on the data samples of client k.
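The objective itself was rendered as an image in the original post; assuming the standard federated formulation implied by the surrounding text, it can be reconstructed as:

```latex
\min_{\{\theta_k\}} \; \sum_{k=1}^{K} \frac{|D_k|}{\sum_{j=1}^{K} |D_j|}
\; \mathbb{E}_{(x, y) \sim D_k}\!\left[ \mathcal{L}\!\left(f_k(x; \theta_k), y\right) \right]
```

where D_k is client k's private dataset and f_k its (possibly heterogeneous) local model with parameters θ_k.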

3 Methodology

        As shown in Figure 2, FedACK consists of a cross-lingual mapping module, a backbone model, and a federated adversarial contrastive KD module.


Origin blog.csdn.net/qq_40671063/article/details/130658984