What value can Shumei Technology's smart text review bring to the social industry? | Shumei Artificial Intelligence Research Institute

The mobile Internet has broken down the communication barriers of traditional portals, and social software has built a barrier-free bridge for exchanging information.

According to the "2019 Social Industry Research Report", there are currently more than 6,000 social apps on the market, spanning content-based, tool-based, and scenario-based social networking.

Today, social software is far more than ordinary dating software. It is essentially a medium for transmitting information, and it is extremely inclusive, complex, and far-reaching. Thousands of social apps are scattered across every corner of the online world, and the changes they bring are enough to reshape how information is exchanged throughout the Internet era.

Social software allows and encourages users from all regions of the world to register and log in, and every user can speak and create freely. With information sharing at its core, it supports private messaging, commenting and reposting, live streaming, making new friends, knowledge creation, and more, which makes it an important practical form of digital information dissemination. But it is a double-edged sword, and it brings many headaches for security and risk control.

Escalating challenges for the social industry

As the digitization of the industrial Internet continues to advance, the margin for error for social software keeps shrinking.

The 6,000+ social apps built around information dissemination form a tree-like taxonomy with three main branches: social sharing of knowledge and content (Zhihu, CSDN, Weibo, Douyin, Kuaishou, etc.), instant-messaging social chat (WeChat, Tantan, Momo, etc.), and social communication in vertical industry scenarios (Maimai, Mafengwo, etc.).

Social software classification

Across these multi-scenario, multi-channel social forms, wherever there is communication there is text content, and social platforms share some common risk-control problems. Violent, political, vulgar, abusive, and other illegal content appears frequently, as does illegal information such as advertisements that divert traffic. This not only disrupts the order of a safe and healthy network, but also gives users a poor impression and experience, driving away legitimate users.

As for the root cause, only a small share of violations comes from ordinary users' own behavior. Most arise because criminals treat social software as their own gold mine, and their methods are endless: game bonus schemes, "pig-butchering" scams, abuse of marketing promotions (so-called "wool-pulling"), scalpers reselling air tickets, train tickets, concert tickets...

Under the guidance of national regulatory authorities, social platforms have also adopted a series of penalties.

In August 2020, Weibo administrators closed 109 accounts for illegal traffic diversion. Douyu closed 525 illegal live-streaming rooms and banned 571 accounts. In Wuhan, authorities guided a live-streaming platform to close 525 illegal live-streaming rooms in accordance with laws and regulations, ban 571 violating user accounts, and remove 136 clickbait items.

As of September 2020, the national cyberspace administration system and the telecommunications authorities had penalized 6,907 illegal websites, and the relevant website platforms had shut down more than 860,000 illegal and non-compliant groups according to law. The country's content supervision requirements for social software are therefore becoming more and more stringent.

The persistent appearance of illegal content on social platforms and the varied methods of underground-industry ("black market") gangs have escalated the challenge of content review and put social software under great pressure to survive.

The battle against the underground industry is becoming more and more fierce. To address these problems, Shumei Artificial Intelligence Research Institute has carried out in-depth research and development on intelligent text recognition technology grounded in the industry's needs, and uses its self-developed Tianjing intelligent content filtering engine to meet the challenge.

Precise content filtering for social software: Shumei Smart Text Review

Shumei Artificial Intelligence Research Institute found that text review for social software mainly focuses on six areas: bullet comments on live video, forum posts and filler replies, product reviews and messages, avatars, nicknames and signatures, mass-posted spam advertisements, and in-game channel chat.

Different application scenarios place extremely high demands on semantic recognition accuracy, breadth of recognition, and multilingual support. To meet them, Shumei Technology's intelligent text filtering builds a comprehensive user-portrait system and intelligent semantic analysis features, combines them with multi-scenario, multi-dimensional judgments, and supports identification of risks such as political violations, vulgar and obscene content, and advertising diversion.

Smart text review technology framework diagram

For different social scenarios, Shumei Smart Text Filter combines semantic analysis, a variety of text recognition models and strategies, and text processing techniques: a list service based on sensitive lexicons, deep-learning NLP models, behavior analysis over user portraits, a real-time distributed rule engine, a statistics engine, and so on. Trained on massive text data, it can accurately recognize semantics and make risk judgments.

Identification of political violations

It synchronizes in real time with the regulatory requirements of relevant departments such as the Cyberspace Administration of China and cybersecurity authorities, and continuously updates a sensitive vocabulary of hundreds of thousands of entries. Flexible list matching (whitelist, blacklist, ignore list, variant list, etc.) combined with an intelligent NLP model accurately and effectively identifies the risk of political violations in text.

Coverage includes leader names, sensitive events, banned books, banned films, cults and superstition, government agencies, reactionary and separatist content, contraband, violence and terrorism, heroes and martyrs, hot events, and more. It also supports custom sensitive words for specific business scenarios, recognition of variants (homophones, similar characters, pinyin, inserted noise characters, innuendo, etc.), and a variety of flexible matching methods.
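
The list matching and variant recognition described above can be illustrated with a minimal Python sketch. It is not Shumei's implementation: the lists, the variant map, and the function names are all hypothetical, and a real system would load its lists from a list service and pair them with an NLP model.

```python
import re
import unicodedata

# Hypothetical example lists, stored in normalized form (lowercase, no separators).
BLACKLIST = {"gamble"}
WHITELIST = {"gamblersfallacy"}      # phrases exempted before blacklist matching
VARIANT_MAP = {"g4mble": "gamble"}   # known disguised spellings -> canonical form
# "Ignore list": characters commonly inserted to break up a sensitive word.
IGNORE_CHARS = re.compile(r"[\s*._\-|']+")


def normalize(text):
    """Fold full-width forms and case, drop ignored characters, map variants."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = IGNORE_CHARS.sub("", text)
    for variant, canonical in VARIANT_MAP.items():
        text = text.replace(variant, canonical)
    return text


def find_hits(text):
    """Return the sorted blacklist entries found after whitelist exemptions."""
    norm = normalize(text)
    for phrase in WHITELIST:
        norm = norm.replace(phrase, "")
    return sorted(word for word in BLACKLIST if word in norm)
```

Under these assumptions, find_hits("Let's G-a-m-b-l-e tonight") reports the blacklisted word despite the inserted hyphens, while find_hits("The gambler's fallacy") reports nothing because the whitelisted phrase is removed first.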

Vulgar Violation Identification

Drawing on a large accumulated industry corpus, vulgarity and abuse models are trained with NLP technology and combined with a sensitive vocabulary of vulgar terms to accurately identify non-compliant vulgar and obscene content in text. The content is graded into multiple levels so that review can adapt flexibly to the standards of different applications, scenarios, and roles.

The intelligent NLP model is combined with pornography-related sensitive words to intercept content from multiple angles, and custom sensitive word lists are supported. Intelligent semantic recognition produces a different verdict for the same word depending on its context.

Advertising diversion recognition

This mainly targets the large volume of spam and fraud advertisements that underground-industry groups post to divert traffic on social software. Intelligent text-variant recognition accurately identifies fraudulent and traffic-diversion ads, supports advertising-law compliance checks, and reduces the risk of violations. It draws on a variant feature library covering tens of thousands of mainstream contact methods (WeChat, QQ, mobile phone numbers, websites, official accounts, Baidu search, Weibo, advertising-law compliance, etc.).
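
As a rough illustration of variant recognition for one contact method, the Python sketch below recovers disguised mobile phone numbers. It is not the feature library described above: the names are hypothetical, and the pattern assumes mainland China mobile numbers have 11 digits, start with 1, and have a second digit from 3 to 9.

```python
import re
import unicodedata

# Map Chinese numerals to ASCII digits. Crude on purpose: this also rewrites
# numerals used as ordinary words, which a production system would avoid.
CN_DIGITS = str.maketrans("零一二三四五六七八九〇", "01234567890")
# Separators inserted between digits to evade naive matching.
SEPARATORS = re.compile(r"(?<=\d)[\s\-_.·]+(?=\d)")
# Assumed shape of a mobile number: 11 digits, not embedded in a longer digit run.
MOBILE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def extract_mobile_numbers(text):
    """Normalize common digit disguises, then return candidate mobile numbers."""
    text = unicodedata.normalize("NFKC", text).translate(CN_DIGITS)
    text = SEPARATORS.sub("", text)
    return MOBILE.findall(text)
```

A call such as extract_mobile_numbers("contact: 1 3 8-0013-8000") joins the separated digits before matching. The same normalize-then-match idea could be extended to other contact methods with their own patterns.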

Smart text review risk trend DEMO

The Chinese language is broad and profound, and the same word can mean very different things in different contexts. Traditional sensitive-word matching struggles to reach the accuracy that precise, efficient review requires. Shumei's intelligent text filtering achieves recognition accuracy of up to 99%, processes text quickly, greatly reduces the false-positive rate (legitimate content blocked by mistake), lowers the cost of manual review, and effectively eliminates online risks.

In terms of technical indicators, the Shumei Smart Text Filtering API has an average response time under 50 ms, a maximum response time of 500 ms, a timeout rate below 0.1%, and throughput above 100 QPS, and it can be scaled out according to demand. It supports multilingual text in UTF-8 encoding, and text content is limited to no more than 1 MB and 20,000 words.
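
A client might check the stated size limits before sending text for review. The Python sketch below is only an illustration with hypothetical names. It assumes that 1 MB means 1,048,576 bytes of UTF-8 and that the 20,000-word limit counts characters, neither of which the figures above specify.

```python
MAX_BYTES = 1024 * 1024   # assumed meaning of "1 MB"
MAX_CHARS = 20_000        # assumed meaning of "20,000 words"


def check_text_limits(text):
    """Return a list of limit violations; an empty list means the text fits."""
    problems = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        problems.append("text exceeds 1 MB when encoded as UTF-8")
    if len(text) > MAX_CHARS:
        problems.append("text exceeds 20,000 characters")
    return problems
```

Checking both limits matters because multi-byte characters can exceed the byte limit well before, or well after, the character limit is reached.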

Shumei's core technical advantage: NLP models for text classification

Shumei Smart Text Filter uses Word2Vec word vectors, fastText text classification, and other technologies to train NLP models on a massive text corpus.

Word2Vec is a model that learns semantic knowledge from a large text corpus in an unsupervised way, and it is widely used in natural language processing (NLP). By learning from text, it represents the semantic information of each word as a word vector, placing semantically similar words close together in an embedding space.

An embedding is simply a mapping: each word is mapped from its original space into a new multidimensional space, so that the original word space is embedded in the new one.
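
The idea that similar words lie close together in the embedding space can be shown with cosine similarity. In the Python sketch below, the three-dimensional vectors are made up by hand for illustration and are not trained Word2Vec output.

```python
import math

# Hypothetical toy embeddings; real word vectors have hundreds of dimensions.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the other vocabulary word whose vector is closest to `word`."""
    target = EMBEDDINGS[word]
    others = (w for w in EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine(target, EMBEDDINGS[w]))
```

With these toy vectors, "cat" is far closer to "dog" than to "car", which is the behavior a trained embedding is meant to produce for semantically related words.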

Word2Vec has two main model variants, Skip-Gram and CBOW. Intuitively, Skip-Gram takes an input word and predicts its context, while CBOW takes a context and predicts the input word.
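
The difference between the two variants shows up in how training examples are built from a sentence. The following Python sketch generates (input, target) pairs only. It does not train a model, and the function names and window size are illustrative choices.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window=1):
    """Skip-Gram: each center word predicts each of its context words."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]


def cbow_pairs(tokens, window=1):
    """CBOW: the whole context window predicts the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        context = context_of(tokens, i, window)
        if context:
            pairs.append((context, center))
    return pairs
```

For the tokens ["a", "b", "c"] with a window of 1, Skip-Gram yields one pair per (center, neighbor) combination, while CBOW yields one example per position with the neighbors grouped as the input.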


The fastText library helps build scalable solutions for text representation and classification. fastText combines some of the most successful concepts in natural language processing and machine learning: representing sentences with a bag of words and a bag of n-grams, using subword information, and sharing information across categories through a hidden representation.
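
The n-gram and subword features mentioned above can be sketched in a few lines of Python. This is a simplified illustration rather than fastText's own code: it extracts n-grams of a single length, whereas fastText typically uses a range of lengths and hashes them into buckets. The "<" and ">" boundary markers follow the convention fastText uses for subwords.

```python
def word_ngrams(tokens, n=2):
    """Bag-of-n-grams features: every run of n consecutive words."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, n=3):
    """Subword features: character n-grams of the word wrapped in < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For example, char_ngrams("where") gives ["<wh", "whe", "her", "ere", "re>"]. Because misspelled or disguised words share most of their character n-grams with the original, subword features help a classifier generalize to variants it has not seen.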

In addition, Shumei Artificial Intelligence Research Institute uses a hierarchical softmax (which takes advantage of the uneven distribution of categories) to speed up computation. These concepts serve two different tasks: efficient text classification and learning word vector representations. Deep neural networks have recently become popular in text processing, but they are very slow to train and test, which limits their use on large data sets. fastText addresses this problem directly.
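
One common way to build a hierarchical softmax is to arrange the categories in a Huffman tree so that frequent categories sit near the root and need fewer binary decisions. The Python sketch below computes only the resulting path lengths. The category names and counts are hypothetical, and it does not show how Shumei or fastText construct their trees in practice.

```python
import heapq
from itertools import count


def huffman_depths(freqs):
    """Return {label: depth} for a Huffman tree built from label frequencies."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {label: 0}) for label, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every label in them one level deeper.
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


# Hypothetical class counts with the uneven distribution described above.
depths = huffman_depths({"normal": 90, "ad": 6, "vulgar": 3, "political": 1})
```

With these counts the dominant "normal" class ends up one step from the root while the rare classes sit deeper, so the average number of decisions per example is lower than in a balanced tree over the same categories.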

fastText focuses on text classification, which allows it to be trained quickly on very large data sets. On a standard multi-core CPU, training a model on more than 1 billion words took 10 minutes. fastText can also classify 500,000 sentences into more than 300,000 categories in five minutes.

Shumei Artificial Intelligence Research Institute has long been deeply engaged in training and developing NLP models for intelligent text recognition. It continually fights underground-industry fraud groups and applies AI across the dimensions of content, behavior, and user portraits to identify illegal content accurately and effectively, forming a one-stop intelligent risk-control engine. As a professional AI risk-control solution provider, Shumei Technology will continue to safeguard the online business of thousands of social industry customers worldwide.


Origin blog.csdn.net/SHUMEITECH/article/details/108731940