Why do I recommend that businesses buy third-party content security services outright?

Introduction: Someone asked this question on V2EX: "Do fellow V2EX users have any good solutions for filtering sensitive words in comments?" A user named "TimePPT" gave a detailed reply covering scale, business needs, and strategy. The answer is both professional and of high quality.

With the authorization of "TimePPT", NetEase Yunyidun has re-edited the content, hoping it helps everyone avoid detours in content security.

The following is the text:


I worked on anti-spam (content security) products for a while as part of my job.

Before going into detail, one thing must be clear: filtering pornographic and politically sensitive content is work that requires continuous investment in both technology and operations.

First, look at scale

If the volume of content is small, almost anything works. Search online or ask around and you can find a reasonably up-to-date vocabulary of tens or hundreds of thousands of sensitive words. Load it into memory and start a server that checks and filters content directly. It is crude but effective, though the false-positive and false-negative rates will certainly not be low.
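A minimal sketch of that in-memory approach, in Python. The function names (`build_blocklist`, `find_hits`) and the sample words are hypothetical and not from the original answer; a real vocabulary would be loaded from a file and would be far larger.

```python
def build_blocklist(words):
    """Store the vocabulary in a set so each lookup is a hash check."""
    return {w.strip().lower() for w in words if w.strip()}


def find_hits(text, blocklist):
    """Return every blocklisted term found in text, in order of appearance.

    Scans all substrings up to the length of the longest blocklisted term.
    This is the "simple and crude" version: no tokenization, no context.
    """
    lowered = text.lower()
    max_len = max((len(w) for w in blocklist), default=0)
    hits = []
    for start in range(len(lowered)):
        last_end = min(len(lowered), start + max_len)
        for end in range(start + 1, last_end + 1):
            piece = lowered[start:end]
            if piece in blocklist:
                hits.append(piece)
    return hits


# Hypothetical sample vocabulary, for illustration only.
blocklist = build_blocklist(["casino", "free money"])
hits = find_hits("Win FREE MONEY at our Casino", blocklist)
missed = find_hits("c a s i n o", blocklist)  # spaced-out variant slips through
```

The second call shows a false negative: a trivially spaced-out variant is not caught. Plain substring matching also produces false positives whenever a blocklisted term happens to appear inside an innocent word.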

However, this method breaks down once it meets word variants or larger scale: false positives and false negatives climb, and if you keep adding rules by hand you will eventually go "crazy".
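To make the "manually adding rules" problem concrete, here is a sketch of the kind of hand-written normalization rule people typically bolt on before matching. The `LOOKALIKES` table and the `normalize` function are assumptions for illustration, not something the original answer describes.

```python
import unicodedata

# Hypothetical hand-maintained substitution table. Every new evasion trick
# needs another entry, which is where the maintenance burden comes from.
LOOKALIKES = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s",
})


def normalize(text):
    """Fold common obfuscations before matching against a word list."""
    # NFKC folds full-width and other compatibility forms to plain characters.
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(LOOKALIKES)
    # Drop separators such as spaces, hyphens, and dots.
    return "".join(ch for ch in text if ch.isalnum())


variants = [normalize(v) for v in ("c@s1n0", "C-a-s-i-n-o", "ＣＡＳＩＮＯ")]
collateral = normalize("call 555-0100")  # a legitimate phone number gets mangled
```

Each rule catches a few more variants, but it also rewrites harmless text (the phone number above), so false positives rise along with coverage. That trade-off is what the paragraph above warns about.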

Once the volume is large enough, you must plan a long-term adversarial strategy, and Bayesian filtering, regression, clustering, and machine learning all have to be brought in.
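As an illustration of the simplest item on that list, here is a sketch of a naive Bayes text filter with add-one smoothing, using only the Python standard library. The class name, the whitespace tokenizer, and the four training sentences are all assumptions made for the example; a production system would need far more data and a proper tokenizer (especially for Chinese text).

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        """Return P(spam | text). Needs at least one example of each label."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Subtract the max before exponentiating for numerical stability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


# Hypothetical toy training data.
nb = NaiveBayesFilter()
nb.train("win free money now", "spam")
nb.train("free casino bonus click now", "spam")
nb.train("see you at lunch tomorrow", "ham")
nb.train("thanks for the great article", "ham")
```

Unlike a fixed word list, the model's output is a score, so the decision threshold can be tuned to the tolerance for false positives and false negatives. Its quality depends entirely on how much labeled data is available.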

Next, look at the business needs:

Does the business involve only comments, or long-form article content as well? How strict is the real-time requirement for filtering? What is the tolerance for false positives and false negatives? These directly affect the product and technology strategy.

There is also the question of rich media. If comments carry pictures and videos, the problem is no longer just keyword filtering but also image recognition. In addition, at the business level, do you need to leave some leeway? For example, because of KPIs, a certain amount of borderline content may be allowed to exist. As the saying goes, water that is too clear has no fish... The operations team probably doesn't want you to wipe everything out!

Let's talk strategy

Broadly, the strategy for UGC content is either review first and then publish, or publish first and then review. The two product strategies differ, and the choice may need to change according to the requirements of the supervising authorities, so the product design should leave room for both. In addition, because no machine algorithm achieves extremely high accuracy and coverage, there will always be false positives and false negatives.
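One way to "leave room" in the design is to make the review order a configuration value rather than hard-coding it. The sketch below is an assumption about how that could look; `ModerationMode` and `initial_state` are hypothetical names.

```python
from enum import Enum


class ModerationMode(Enum):
    REVIEW_THEN_PUBLISH = "review_then_publish"
    PUBLISH_THEN_REVIEW = "publish_then_review"


def initial_state(mode):
    """Decide how a newly submitted item is treated under the given mode."""
    if mode is ModerationMode.REVIEW_THEN_PUBLISH:
        # Hidden until a reviewer (machine or human) approves it.
        return {"visible": False, "in_review_queue": True}
    # Shown immediately, but still queued so it can be taken down later.
    return {"visible": True, "in_review_queue": True}


pre = initial_state(ModerationMode.REVIEW_THEN_PUBLISH)
post = initial_state(ModerationMode.PUBLISH_THEN_REVIEW)
```

Because the mode is data, it can be switched per board, per content type, or during sensitive periods without changing the submission code.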

At present, the vast majority of large-scale products rely on machine pre-screening plus manual secondary review for pornography detection, especially for pictures and videos, which are much harder to screen by machine than text.

In addition, the product strategy can include a report button, so that users help with front-line self-policing of prohibited content.

The above is just some of my experience.
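The "machine first, human second" split is often implemented as two thresholds on a model score. This is a sketch under that assumption; the `triage` function and the threshold values 0.9 and 0.2 are hypothetical and would be tuned per product.

```python
def triage(score, block_at=0.9, pass_below=0.2):
    """Route content by a machine risk score in [0, 1].

    High-confidence cases are handled automatically; everything in the
    uncertain middle band goes to human reviewers.
    """
    if score >= block_at:
        return "block"
    if score < pass_below:
        return "publish"
    return "manual_review"


# Hypothetical scores for three pieces of content.
decisions = [triage(s) for s in (0.95, 0.05, 0.5)]
```

Widening the middle band lowers both error rates at the cost of more manual work, so the thresholds become a direct knob for the business tolerance discussed earlier.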
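A report button usually feeds a counter of distinct reporters per item, with a threshold that sends the item back for review. The sketch below assumes that design; `ReportTracker` and the default threshold of 3 are hypothetical.

```python
from collections import defaultdict


class ReportTracker:
    """Count distinct users who reported each piece of content."""

    def __init__(self, review_after=3):
        self.review_after = review_after
        self._reporters = defaultdict(set)

    def report(self, content_id, user_id):
        """Record a report; return True once the item needs review."""
        # A set ignores repeat reports from the same user.
        self._reporters[content_id].add(user_id)
        return self.needs_review(content_id)

    def needs_review(self, content_id):
        reporters = self._reporters.get(content_id, set())
        return len(reporters) >= self.review_after


tracker = ReportTracker(review_after=2)
first = tracker.report("comment-1", "user-a")
repeat = tracker.report("comment-1", "user-a")  # same user, not counted twice
second = tracker.report("comment-1", "user-b")
```

Counting distinct users rather than raw clicks keeps one person from forcing a takedown alone, though coordinated abuse of the button still needs separate handling.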

Finally: why I don't recommend building your own content security system

First, I think this kind of filtering is generally tied to censorship, and first-hand information about some sensitive words is only available to companies that are close to the regulators or that are very large (such as BAT, the four major portals, and search engines). For everyone else, maintenance of the vocabulary list is lagging and after the fact. Many companies only realize there is a problem once they have already crossed the line. As a result, they are summoned by the relevant regulatory authorities for a "talk", and in serious cases the service is taken offline or even shut down.

Second, even if you invest a lot of manpower and resources to build it yourself, you have to consider whether you can collect enough spam samples to train an effective model.

Third, the important role of this filtering at the operational level is to prevent spam from interfering with normal operations. However, for many reasons, most operational requirements in this area are vague, for example the KPI orientation I mentioned above... So you have to leave leeway here too, otherwise you will be stuck in an awkward position whatever you do.

This filtering work is serious and complex, which is why I recommend that ordinary companies directly purchase a stable third-party content security service. The cost of continuous investment is actually very high, and this work is sometimes undervalued inside the company: the effort is thankless. If nothing goes wrong, there is no credit; if something goes wrong, you get the blame (for example, filtering too much hurts the KPIs, while missing something means stepping on a red line...).

Author introduction: TimePPT, an Internet product manager for 8 years, is now active in the field of AI.

Source: https://www.v2ex.com/t/378618

