Paper Reading: Simplistic Collection and Labeling Practices Limit the Utility of Benchmark Datasets for Twitter Bot Detection

Paper link: https://dl.acm.org/doi/pdf/10.1145/3543507.3583214

Summary

        Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for studying the influence of bots on elections, the spread of misinformation, and financial market manipulation.

        Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Therefore, the public must rely on third-party bot detection.

        These tools employ machine learning and often achieve near-perfect classification performance on existing datasets, suggesting that bot detection is accurate, reliable, and suitable for downstream applications.

        We present evidence that this is not the case, and show that the high performance is attributable to limitations in how the datasets were collected and labeled rather than to the sophistication of the tools.

        Specifically, we show that simple decision rules, i.e., shallow decision trees trained on a small number of features, achieve near-state-of-the-art performance on most available bot detection datasets. Moreover, even when datasets are combined, the resulting models do not generalize well to out-of-sample datasets.

        Our results show that predictions are highly dependent on each dataset's collection and labeling procedures rather than on fundamental differences between bots and humans.

        These results have important implications for the transparency of sampling and labeling procedures and for potential bias in studies that use existing bot detection tools as a preprocessing step.

Introduction

        With the rise of online social media as an important means of connecting with others and sharing information, the impact of bots, or automated accounts, has become an important topic of social concern. Some bots are benign, serving up interesting content or directly enhancing a site's accessibility (e.g., adding captions to videos that lack them), but many others are engaged in influence operations, misinformation, and harassment: fake followers inflate users' apparent popularity; spammers advertise political candidates or products on the site; and malicious automated accounts undermine the credibility of elections or exacerbate polarization. Bots have been reported to have influenced the 2016 US presidential election [4,36], the Brexit referendum [3,36], the spread of misinformation about COVID-19 [25], and financial markets [11,52]. The ability (or inability) to accurately label these accounts can have very real implications for elections, public health, and public trust in institutions.

        Platforms delete large numbers of accounts they deem inauthentic, but they keep these deletion systems secret and may be incentivized to misrepresent the impact or prevalence of bots. In fact, bot detection was at the heart of Elon Musk's negotiations to acquire Twitter: Twitter claimed that less than 5% of its monetizable users are bots [66], while Musk claimed the number is much higher [51]. Because in-house bot detection techniques are generally not publicly available, researchers, journalists, and the general public rely on tools developed by researchers to distinguish bots from real human users and to understand bots' impact on social phenomena.

        Developing bot detection tools for Twitter and other online social media platforms is an active area of research. Over the past decade, large user datasets have been collected to enable third-party bot detection. High (sometimes near-perfect) performance is achieved on these datasets using expressive machine learning techniques such as ensembles of random forests and deep neural networks, along with hundreds of features derived from profile metadata, engagement patterns, network structure, tweet content, and sentiment.

        Crucially, researchers often use bot detection as a preprocessing step when studying social phenomena, separating human users from bots in order to study phenomena associated with either or both. This includes subject areas such as the spread of misinformation or disinformation [6,40,53,61–63,67], elections [2,4,24,41,54,64], and echo chambers [7], with studies published in major scientific venues including Science [67], Nature [53], and PNAS [64]. For example, Broniatowski et al. [6] observed that bots erode trust in vaccination, González-Bailón et al. [35] concluded that bots share a disproportionate amount of content during political protests, and Vosoughi et al. [67] concluded that humans and bots spread fake news differently. The robustness and validity of these results depend on accurate and reliable bot detection.

        Third-party bot detection tools are also readily available and widely used by the public: the latest version of Botometer [60] reportedly receives hundreds of thousands of queries per day through its public API [74], and BotSentinel [5] provides a browser extension and a convenient way to block accounts classified as bots.

        Is bot detection a solved problem? On the face of it, bot detection research appears to be a success story for machine learning: researchers have collected a variety of datasets for a well-defined classification task, and random forests, neural networks, and other expressive machine learning models achieve near-perfect performance on the data. Furthermore, these methods are widely adopted in both the academic literature and public applications. Bot detection tools are often trained on combinations of datasets, and researchers argue that existing methods can easily be adapted to address the shortcomings of existing classifiers or the evolution of more human-like bots.

        Even so, there are signs that bot detection tools are far from perfect. They may be inconsistent with one another [47], have proven unreliable over time [56], and rely on dubious labels [26, 27]. Here, we attempt to reconcile these observations and systematically explain why apparently successful Twitter bot detection in fact has significant limitations.

        Evaluating third-party bot detection datasets and tools is inherently challenging: the public does not know or have access to the "ground truth" that Twitter does, and our only window into bots on Twitter is through the datasets themselves. However, this does not mean that evaluation cannot be performed. By carefully analyzing these datasets and the relationships between them, we can still better understand what they are telling us.

        Take the dataset published by Cresci et al. [10] (cresci-2017) as an example; it is the most widely used in the academic literature. The dataset consists of a pool of real human users, a collection of fake followers, and several types of "spambots", each a collection of accounts from a different domain. The state-of-the-art model is a deep neural network using text data, which achieves essentially perfect performance on this dataset [43]. However, a closer look reveals something surprising: we can achieve near-state-of-the-art performance with a classifier that asks only a couple of yes/no questions about the data. In fact, there are at least two distinct yes/no questions that almost completely separate humans from bots. These classifiers are shown as the left and center decision trees in Figure 1.

         Figure 1: 0.98 accuracy for two shallow decision trees (left, center) on cresci-2017 and 0.91 accuracy for one shallow decision tree (right) on caverlee-2011.

        As we discuss later, we argue that the tree on the left is a product of convenience sampling by Cresci et al. [16], whose data were originally collected to study the social perception of natural disasters on Twitter. On the right side of Figure 1, we show a high-performance classifier for another popular dataset: caverlee-2011, published in [44]. Likewise, a small number of yes/no questions distinguishes humans from bots with high accuracy. These examples are not unique. As we will show, almost every benchmark dataset we analyze admits high performance with very simple classifiers.
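        To make the idea of a "yes/no question" classifier concrete, here is a minimal Python sketch (not code from the paper): the feature name and threshold below are hypothetical placeholders, not the actual splits shown in Figure 1.

```python
# A one-question "decision rule" classifier written as a plain Python function.
# NOTE: the feature ("favourites_count") and threshold (10) are hypothetical
# placeholders, not the actual splits learned in Figure 1 of the paper.

def one_question_classifier(account: dict) -> str:
    """Label an account as 'bot' or 'human' with a single yes/no question."""
    if account.get("favourites_count", 0) <= 10:
        return "bot"
    return "human"

print(one_question_classifier({"favourites_count": 3}))    # -> bot
print(one_question_classifier({"favourites_count": 250}))  # -> human
```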

        How should we reconcile these results with our intuition that bot detection is a hard problem? On the one hand, bot detection may be simpler than expected, and simple decision rules may suffice. On the other hand, perhaps the datasets themselves fail to capture the true complexity of bot detection. If so, then simple decision rules will perform significantly worse when deployed, despite performing well in sample. We provide evidence in support of the latter hypothesis through an extensive analysis of Twitter bot detection datasets.

        Our contributions. In this work, we scrutinize widely used Twitter bot detection datasets and explore their limitations. First, we show that simple decision rules perform nearly as well as state-of-the-art models on benchmark datasets; therefore, each dataset provides only a predictive signal of limited complexity. Because simple decision rules allow us to transparently examine the reasons for a classifier's high performance, we find that the predictive signal in each dataset likely reflects its specific collection and labeling process, i.e., how accounts were collected and how each account was assigned a human or bot label.

        Next, we examine combinations of datasets. Many bot detection tools combine datasets (see [17, 37, 75]) and, implicitly or explicitly, do so to cover the distribution of bots appearing on Twitter. Building on previous work [18,60], we show that expressive machine learning models trained on one dataset perform poorly when tested on other datasets, and that models trained on all datasets but one perform poorly on the held-out dataset. The information provided by one dataset does not generalize to others, suggesting that the datasets are drawn from very different distributions and, in turn, produced by different sampling (i.e., collection and labeling) procedures.
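        The leave-one-dataset-out check described above can be sketched roughly as follows (a minimal illustration, not the authors' code): the `load_dataset` helper, the feature list, and the synthetic data are placeholders for the real benchmark data and feature pipeline.

```python
# Leave-one-dataset-out sketch: train an expressive model on all datasets but one,
# then evaluate on the held-out dataset. All data here is synthetic placeholder
# data; load_dataset() would be replaced by real data loading in practice.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

FEATURES = ["statuses_count", "followers_count", "friends_count", "favourites_count"]
DATASETS = ["cresci-2017", "caverlee-2011", "yang-2013", "pan-2019"]  # illustrative subset

def load_dataset(name: str, n: int = 500) -> pd.DataFrame:
    """Placeholder loader: returns random profile features and 0/1 'is_bot' labels."""
    rng = np.random.default_rng(sum(map(ord, name)))
    df = pd.DataFrame(rng.integers(0, 10_000, size=(n, len(FEATURES))), columns=FEATURES)
    df["is_bot"] = rng.integers(0, 2, size=n)
    return df

for held_out in DATASETS:
    train = pd.concat([load_dataset(d) for d in DATASETS if d != held_out])
    test = load_dataset(held_out)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(train[FEATURES], train["is_bot"])
    preds = clf.predict(test[FEATURES])

    print(f"held out {held_out}: bal. acc. = "
          f"{balanced_accuracy_score(test['is_bot'], preds):.2f}")
```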

        Finally, we consider whether imposing a structural assumption on the data, namely that each dataset contains bots of one of a few types (e.g., spam bots or fake followers), can yield better generalization, as in the methods of Sayyadiharikandeh et al. [60] and Dimitriadis et al. [17]. We find that simple decision rules can accurately distinguish each type of bot from humans, so each sample of a given bot type has inherently low information complexity. We also show that, among accounts of a specific bot type, simple decision rules can identify which dataset a given bot came from. Thus, datasets for a given bot type are drawn from very different distributions, again suggesting different data collection processes. Taken together, these results suggest that each individual dataset contains little information and that the predictive signal in one dataset does not contribute to predictions on other datasets, even among datasets representing the same type of bot. Therefore, existing datasets are unlikely to provide a representative or comprehensive sample of bots, and classifiers trained on these data are unlikely to perform well when deployed.
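        The within-type check can be sketched in the same spirit: restrict attention to bots of a single type and ask a shallow tree to predict which dataset each bot came from. This is only an illustration; the dataset names used as labels, the features, and the synthetic data below are placeholders.

```python
# Sketch: within one bot type (e.g., spam bots), train a shallow multi-class tree
# to predict the dataset of origin. High accuracy would mean the datasets are easy
# to tell apart, i.e., drawn from very different distributions. All data below is
# a synthetic placeholder for the real bot accounts.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["statuses_count", "followers_count", "friends_count", "favourites_count"]
SPAMBOT_DATASETS = ["social-spambots-1", "social-spambots-3", "traditional-spambots-yang"]

frames = []
for name in SPAMBOT_DATASETS:
    rng = np.random.default_rng(sum(map(ord, name)))  # placeholder for loading real bots
    df = pd.DataFrame(rng.integers(0, 10_000, size=(300, len(FEATURES))), columns=FEATURES)
    df["origin"] = name
    frames.append(df)
bots = pd.concat(frames, ignore_index=True)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(tree, bots[FEATURES], bots["origin"], cv=5)
print(f"dataset-of-origin accuracy: {scores.mean():.2f}")
```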

        In addition to bot detection, our approach of examining simple decision rules on datasets and measuring performance across datasets may help detect simplistic data sampling and labeling processes in a range of machine learning applications: if a dataset admits highly accurate simple decision rules, the dataset itself has low information complexity. Furthermore, if an expressive machine learning model trained on some datasets does not generalize to others, and the underlying system is not simple, then the datasets are unlikely to provide insight into the entire problem domain.

        We also believe these findings have immediate implications for future bot detection research on Twitter and beyond: creators of bot detection datasets should transparently report and justify their sampling and labeling procedures; researchers developing bot detection techniques should train and analyze simple, accurate, interpretable models alongside more expressive ones; and researchers who use bot detection as a preprocessing step should consider how it affects their results.

Background

        Bot detection techniques. To improve classification performance, researchers have applied a range of cutting-edge machine learning techniques to different types of data. One approach applies random forests [32,72] and ensembles of random forests that combine the predictions of classifiers trained on subsets of the data. Another popular approach leverages text data, applying large pre-trained language models [38] or models trained by the researchers themselves [28, 39, 43, 46, 48]. A third approach uses network data to train graph neural networks [1, 20, 23] or tries to detect botnets from abnormal network structure [70]. Finally, a fourth approach draws on insights from other disciplines, using behavioral [30,34] or biologically inspired techniques [13–15,58]. In addition to new predictive models, considerable effort has gone into deriving or exploring profile, text, or network features that might inform bot detection [39, 49]. All of the papers cited above rely on the benchmark datasets analyzed in our work.

        Limitations of bot detection tools. Several papers have explored the limitations of bot detection techniques, but few provide evidence explaining these limitations. To the best of our knowledge, our work is the first to trace the limitations of bot detection to simplistic sampling and labeling strategies. Martini et al. [47] compared three public bot detection tools and found significant differences in their predictions. Relatedly, Rauchfleisch and Kaiser [56] found that a single tool could produce different results over time due to changes in account activity, and Torusdaul et al. [65] created bots that can reliably evade existing bot detection frameworks. Elmas et al. [19] found that qualitative observations from previous work, such as bot accounts often being recently created or flagged by high levels of activity, did not apply to the data collected for their paper, and concluded that prevalent classifiers may not generalize. Gallwitz and Kreil [26, 27] manually identified individual accounts mislabeled as "bots" in popular datasets, noting a high prevalence of false positives and arguing that there may be errors in labels often treated as ground truth.

Data and Methods

        In this section, we discuss the datasets we analyzed and the criteria for including each dataset in the analysis. Most benchmark datasets in the literature are aggregations of data collected across various contexts, and the benchmark datasets we study are listed in Table 1.

Data Collection

        To gather a list of benchmark datasets, we searched Google Scholar for peer-reviewed papers related to bot detection, as well as the references of the papers we found. We found a total of 58 papers using at least one of the datasets included in our analysis, 22 of which had at least 50 citations on Google Scholar at the time of writing (several had at least 500 citations), and 26 of which were published after 2020. We included in our analysis only datasets used in peer-reviewed bot detection papers that reported the accuracy and F1 scores found in our searches; nearly all of these datasets were used in two or more papers. Several datasets were accessed through the Botometer Bot Repository. For the remaining datasets, we contacted the authors of related papers to request access to the original data (twibot-2020 and yang-2013) or found publicly accessible data online (caverlee-2011 and pan-2019).

        We also received from the authors the augmented data for gilani-2017 used in the original work [30–32], although only a reduced feature set is available in the Bot Repository. For gilani-2017 and caverlee-2011, the original data provided by the authors [32, 44] contain at least 35% more users than the versions in the Bot Repository; we use the larger original datasets in our results. For the astroturf and varol-2017 datasets published in the Bot Repository, the data are available only as lists of user identifiers. Because of the long time that has passed since these lists were generated, we did not rehydrate this data or use it in our analysis.

        Features. All datasets contain profile features, typically the screen name, number of tweets, number of followers, number of accounts followed, number of favorites, language, location, time zone, and the number of Twitter lists containing the user. Additionally, some datasets contain a corpus of tweets for each user. Online relationships and the associated following/follower behavior are occasionally recorded.

        Annotation methods. Determining "ground truth" labels for bot detection is a challenging task. In most datasets, humans (whether the paper authors or hired crowdworkers) manually assigned a "bot" or "human" label to each account. Previous work found that human annotators have high agreement with one another [32], and accounts on which annotators disagree are sometimes excluded from the dataset [22]. Other datasets assign labels using heuristics or by relying on external sources, such as celebrity accounts (celebrity-2019) or accounts posting links that appear on public blacklists (yang-2013). The quality of hand-labeled and heuristically labeled datasets rests largely on the implicit assumption that humans are very good at this classification task, and neither the datasets themselves nor the wider literature provide strong evidence that this is the case. On the contrary, recent evidence suggests that human annotators are systematically biased toward labeling accounts they disagree with as bots [69, 71]. Likewise, there are accounts for which neither the bot nor the human label is suitable, such as semi-automated accounts or accounts representing institutional entities like companies or universities [8]. However, since other work assumes that the labels in the data are genuine, and since no better annotation method is available, we make the same assumption.

Dataset Descriptions

        The datasets we consider fall into two categories: component datasets, consisting of accounts of a single class (human or bot), and composite datasets, consisting of combinations of component datasets. Each of the 28 datasets is briefly described below. Unless otherwise stated, datasets were manually labeled by the authors of the associated papers.

        social-spambots-1 [10] consists of spam accounts used to promote a specific candidate during the 2014 mayoral election in Rome. social-spambots-2 [10] are spammers who promoted the Talnts application using the hashtag #TALNTS. social-spambots-3 [10] contains accounts that spam links to products on Amazon, including both genuine product links and malicious URLs. traditional-spambots-yang [72] are accounts known to have posted malicious links in spam, collected by crawling the Twitter network. true-accounts-yang [72] are accounts that did not post malicious links on Twitter, taken from the same crawl as traditional-spambots-yang. traditional-spambots-2 [10] includes accounts that share malicious URLs and accounts repeatedly flagged for sharing such content. traditional-spambots-3 [10] and traditional-spambots-4 [10] are accounts that spam job listings. pronbots-2019 [73] are Twitter bots that infrequently post links to pornographic sites. elezioni-2015 [12] consists of manually labeled Italian-language accounts that used the hashtag #elezioni2013. political-bots-2019 [73] was collected and identified by Josh Russell (@josh_emerson) as automated accounts, run by an individual, designed to amplify right-wing influence in the 2018 US midterm elections. Another dataset, published in [75], includes accounts that used relevant hashtags such as #2018midterms during the 2018 US elections. true-accounts-cresci [10] purports to be a random sample of human Twitter users whose authenticity was confirmed by their responses to a natural-language question; these are the accounts tweeting about earthquakes mentioned in Section 1 and discussed in Section 4. twibot-2020 [22] was collected by crawling the Twitter network using well-known users as seeds, and the accounts were manually labeled by hired crowdworkers.

        rtbust-2019 [49] contains manually labeled accounts subsampled from all accounts that retweeted Italian tweets during the data collection period. fake-followers-2015 [10] and vendor-purchased-2019 [73] are fake follower accounts purchased from different online marketplaces. caverlee-2011 [44] was collected via honeypot Twitter accounts, and the researchers labeled bot and human accounts using an automated process. celebrity-2019 [73] is a manually collected set of verified celebrity accounts. the-fake-project-2015 [12] consists of accounts that followed @TheFakeProject and successfully completed a CAPTCHA. botwiki-2019 [75] is a list of self-identified benign Twitter bots, such as automated accounts posting generative art or tweeting world holidays. feedback-2019 [73] is a collection of approximately 500 accounts that Botometer users flagged as mislabeled by the tool.

        Several datasets we study are combinations of the above components. cresci-2015 [12] includes the-fake-project-2015, elezioni-2015, and fake-followers-2015. cresci-2017 [10] consists of fake-followers-2015, true-accounts-cresci, the three social spambot datasets, and the four traditional spambot datasets. yang-2013 [72] has bots from traditional-spambots-yang and humans from true-accounts-yang. pan-2019 [55] includes all components of cresci-2015, cresci-2017, and varol-2017, plus caverlee-2011 and an additional collection of manually annotated bots and humans not found in any of the others. This dataset also includes tweet data not present in the original components.

Methods

        Simple decision rules. While sophisticated machine learning models are capable of learning complex relationships between input data and labels, their flexibility often comes at the cost of transparency and interpretability.

        We chose to instantiate "simple decision rules" as shallow decision trees because their transparency allows us to easily check why each data point was assigned a label. A similar analysis is much more difficult or infeasible for the complex, opaque models predominantly used in bot detection. Researchers have applied now-standard interpretable machine learning tools such as LIME [57] and SHAP [45] to bot detection models [12,42,75]. However, none of this work shows, as we do, that the underlying datasets admit simple, high-performance classifiers relying on a small number of features. Other simple machine learning models, such as linear regression, k-means, or nearest-neighbor classifiers, could provide interpretability similar to shallow decision trees, but the choice of specific method is not critical for our analysis.

        We use scikit-learn's binary decision tree implementation, which trains a tree on numerical data by selecting the feature-threshold pair (represented by a node) that best splits the data into two groups and then recursing on each group. After a fixed recursion depth (corresponding to the tree depth), the classifier outputs the label of the majority of examples in each group; these groups are the leaves of the tree. We only consider trees of depth four or less, to ensure that the trees can be easily inspected and to avoid overfitting. See Figure 1 for several examples of shallow decision trees trained on benchmark datasets.
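        As an illustration of this setup (not the authors' exact code), a depth-capped scikit-learn tree can be trained and printed for inspection as in the sketch below; the features and labels are synthetic placeholders for the benchmark data.

```python
# Minimal sketch of the shallow-decision-tree setup: a scikit-learn tree capped at
# depth 4, printed as text so that every split can be inspected. The features and
# labels are synthetic placeholders for the benchmark data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["statuses_count", "followers_count", "friends_count", "favourites_count"]
rng = np.random.default_rng(0)
X = rng.integers(0, 10_000, size=(1_000, len(feature_names)))
y = rng.integers(0, 2, size=1_000)  # 1 = bot, 0 = human (synthetic labels)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Printing the tree exposes every feature-threshold question it asks, which is
# what makes the classifier easy to audit.
print(export_text(tree, feature_names=feature_names))
```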

        Performance metrics. The most commonly reported metrics in the literature are accuracy and the F1 score. Accuracy is the proportion of correctly labeled examples. It can be misleading when the dataset is imbalanced between classes, since a naive model can achieve high accuracy by always predicting the majority class. The F1 score in binary classification is the harmonic mean of precision and recall. In our context, a low F1 score indicates that the classifier either failed to detect a large proportion of bots or labeled many humans as bots. The F1 score does not account for true negatives, i.e., humans correctly labeled as human, which can be misleading when bots outnumber humans.

        Although the two metrics are complementary, both depend on the proportions of humans and bots in the data. For this reason, it is difficult to compare accuracy and F1 results across models and datasets with different ratios of bots to humans. To provide additional clarity and comparability, we also report the classifier's balanced accuracy (bal. acc.), the arithmetic mean of the true positive and true negative rates. Balanced accuracy is less useful when one knows a priori the relative proportions of bots and humans in the environment where the classifier will be deployed.
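        For concreteness, the three metrics can be computed with scikit-learn as in the short sketch below (toy labels, not data from the paper).

```python
# Toy example of the three metrics on an imbalanced label set (4 bots, 2 humans).
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = [1, 1, 1, 1, 0, 0]  # 1 = bot, 0 = human
y_pred = [1, 1, 1, 1, 1, 0]  # every bot is caught, but one human is labeled a bot

print(f"accuracy:          {accuracy_score(y_true, y_pred):.2f}")           # 0.83
print(f"F1 score:          {f1_score(y_true, y_pred):.2f}")                 # 0.89 (ignores the true negative)
print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.2f}")  # 0.75 = mean of TPR (1.0) and TNR (0.5)
```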
