The Road to Learning Data Analysis (10) - Special Analysis: How to Mine the Hidden Information in 40,000 Articles

Introduction

        As the Internet keeps growing, more and more topic-focused websites and forums have ridden that tailwind and developed rapidly. Internet users now live in an era of "information explosion", and finding high-quality content that interests them has become their main concern. For website operators, the core goal is to keep output, quality, user numbers, and activity high so as to bring traffic and revenue to the site, and pursuing that goal is a long-term process of continuously analyzing, optimizing, and iterating on the site's data. The website "Everyone is a Product Manager" grew rapidly against this background and has developed into a comprehensive learning platform for Internet products and related fields. This article analyzes all the articles published on this website, mainly in the following directions:

  • Analyze the website's overall content operation and user activity, such as preferred article categories, author activity, and trends in how users read and comment on articles;
  • Analyze the operational themes from the content of all articles, extracting keywords to understand the focus of the website's operation, and analyze the development of the Internet industry in large and medium-sized cities based on the cities mentioned in the articles;
  • Quantify how much users like each article through dimensions such as reads, likes, and comments, and divide the articles into several user-preference types to guide article publishing and content operation.

        This article draws on an article by the columnist Scottish Fold-eared Meow. Building on it, I worked out my own ideas and completed this analysis report independently.

Data Processing

Data collection

        I used a Python crawler to collect the information for every article published on "Everyone is a Product Manager" between 2012.05.17 and 2018.04.05, more than 41,000 articles in total. The fields include article id, article title, body link, body content, release date, author name, author role, article category, reads, likes, favorites, and comments.

                    

Data preprocessing

        The collected data is messy, and some fields are not needed for now, so I did the following:

  • Fields such as article id, body link, author role, and body content matter little to the overall analysis, so they are filtered out;
  • The crawled article category field is wrapped in <a></a> tags, so a regular expression is needed to extract the actual category;
  • Reads and other indicators in the tens of thousands or millions are stored as strings such as 1.2w and 2.2m, which need to be converted into numbers.
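
        The last two cleaning steps can be sketched as follows. This is only a minimal illustration, not the code actually used for this analysis: the function names are hypothetical, and it assumes that the suffix "w" means ten thousand and "m" means one million.

```python
import re

# Assumed suffix multipliers: "w" = 10,000 and "m" = 1,000,000.
SUFFIX_MULTIPLIERS = {"w": 10_000, "m": 1_000_000}

# Matches an <a ...>...</a> pair and captures the text between the tags.
ANCHOR_TEXT = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def extract_category(raw: str) -> str:
    """Return the text inside the first <a>...</a> pair, or the stripped input."""
    match = ANCHOR_TEXT.search(raw)
    return (match.group(1) if match else raw).strip()


def parse_count(raw: str) -> int:
    """Convert strings such as '1.2w', '2.2m', or '856' to an integer."""
    text = raw.strip().lower()
    if text and text[-1] in SUFFIX_MULTIPLIERS:
        return round(float(text[:-1]) * SUFFIX_MULTIPLIERS[text[-1]])
    return int(text)
```

        Keeping the multipliers in one dictionary makes it easy to add another suffix if the crawled data turns out to use one.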

        Processed data:

                        

Overall Operational Analysis

Overall analysis

        I grouped all articles by category and counted the articles in each, producing the bar chart of counts below. The pie chart shows each category's share of the total number of articles.

                    

                        

        The figure above shows the total number and proportion of articles in categories such as industry trends, product design, and product operation. The four categories of industry trends, product design, product operation, and product managers are the mainstream of the website's content, together accounting for nearly 75% of all articles across the 14 categories, which is in line with the website's focus on product learning. In terms of quantity, each of these four categories has more than 5,000 articles, far ahead of the fifth-ranked category, interactive experience (2,751 articles), and the first-ranked category, industry trends, has as many as 11,770, which suggests that high-quality content is more likely to come from these categories. Consistent with the 80/20 rule, a few fields contribute most of the content. This also shows that the website's operation focuses more on industry information and product guidance, and less on entrepreneurial guidance, data analysis, and AI.
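
        A figure such as "the top four categories account for nearly 75%" is a cumulative share over categories sorted by size. A minimal sketch of that calculation, using a hypothetical function name and assuming the input is simply a list with one category label per article:

```python
from collections import Counter


def category_shares(categories):
    """Return (name, count, share, cumulative_share) rows, largest category first."""
    counts = Counter(categories)
    total = sum(counts.values())
    rows = []
    running = 0
    for name, count in counts.most_common():
        running += count
        rows.append((name, count, count / total, running / total))
    return rows
```

        Reading the cumulative share in the fourth row of the result gives the combined weight of the four largest categories.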

Trend analysis

        The historical trend in the number of published articles reveals how the website is operating overall, while the numbers of reads, likes, and comments reflect how active users are and are also important indicators of the website's development.

                        

        First, note two abnormal nodes, one at the start and one at the end. The second quarter of 2012 is when the website began publishing articles, and the number of articles published in that quarter is huge. A possible reason is that the website had just launched and brought a large number of articles onto its own platform to draw traffic, attract users' attention, and let them know the platform existed. The second quarter of 2018 is easy to explain: the data ends on 2018.04.05, so that quarter is incomplete. By a rough estimate, the number of articles for this quarter will be roughly flat.

        In general, the number of articles on the platform keeps growing, especially in the three years from 2012 to 2014, a stage of steady growth in which the website was operating well.
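
        The quarterly counts behind a trend chart like this can be produced by mapping each publication date to a quarter label and counting. The following is only a sketch with hypothetical function names, assuming the release dates have already been parsed into datetime.date objects:

```python
from collections import Counter
from datetime import date


def quarter_label(day: date) -> str:
    """Map a date to a label such as '2012Q2'."""
    return f"{day.year}Q{(day.month - 1) // 3 + 1}"


def articles_per_quarter(publish_dates):
    """Count articles per quarter, returned in chronological order."""
    counts = Counter(quarter_label(d) for d in publish_dates)
    return dict(sorted(counts.items()))
```

        The same grouping works for reads, likes, or comments if the counter is replaced by a per-quarter sum.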

                            

        Setting aside the two abnormal nodes discussed above, the numbers of reads and of likes/comments grew quickly in the initial stage, peaked in the second quarter of 2015, and then stabilized, indicating that the website has gained a group of loyal users who have grown along with it. The trend suggests that as the website developed, some non-target users were presumably filtered out, leaving the loyal core users who support the operation and development of the platform.

Author Analysis

        Analyzing the authors separately shows how much each of them has contributed to the development of the website. The attention users pay to an author's articles also reflects how much traffic that author brings to the platform.

                    

        The picture above shows the total number of articles published per author. I selected only authors who have published more than 300 articles. Lao Cao, the webmaster, has published far more than any other author: his nearly 5,000 articles exceed the combined total of the authors ranked 2nd to 5th. The author "Everyone is a Product Manager", the website's official operating account, also contributed a large share of the articles. So to attract more users, Lao Cao and "Everyone is a Product Manager", the two "insiders", have put in a great deal of effort, which deserves praise!

                             

        From the point of view of data operation (blue) and data analysis (orange), the overall picture is consistent with the analysis above. Most authors tend to write product-related articles, and these are also the main authors, such as Lao Cao, Nairo, and Mi Ke, who provide most of the platform's content. It is worth mentioning that the data analysis articles are mainly provided by 36 Big Data and Qin Lu. Qin Lu is an expert in the field of data analysis and a well-known verified user ("big V") on Zhihu and Tianshan.

                        

        This graph ranks the authors with the most likes and favorites. First, it shows that the main authors are more likely to win users' favor, which has a lot to do with the high-quality content they publish and reflects their professionalism in these fields. Comparing likes with favorites, likes are fewer than favorites, which means users do not hand out a "like" easily and are more willing to add an article to their favorites. The ratio of likes to favorites for authors such as Lao Cao is much higher than for others, which also reflects greater recognition of their articles. Finally, one author deserves a mention: Scottish Fold-eared Meow, whose article this piece draws on. Although he has published few articles, they earn many likes and favorites, which is very impressive.

Comparative analysis

        Analyzing likes and comments by article category shows how much attention user groups pay to each type of article, so that content can be optimized according to users' interests.

                                

        The picture above shows the number of likes and comments received by each article category. The darker the color, the more likes; the bigger the circle, the more comments. It is easy to see that articles on product managers, product operation, product design, and industry trends receive more attention from users. There is also a certain correlation between likes and comments: as the number of likes increases, so does the number of comments, indicating that high-quality articles do have their value recognized on the platform.
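
        The relationship between likes and comments can be checked numerically with a correlation coefficient. The sketch below computes the Pearson coefficient by hand; the function name is hypothetical, and the choice of Pearson correlation is an assumption, since the chart only supports the observation visually.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)
```

        A value close to 1 for the per-article likes and comments would support treating the two as largely redundant signals.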

                            

        I selected three types of articles that interest me, data analysis, AI, and blockchain, and studied their average reads. Generally speaking, users paid more attention at the beginning, and the average reads then gradually declined. Possible reasons are:

  • The platform mainly covers Internet products, and its articles on topics such as AI are less specialized and less readable than those on other platforms;
  • Precisely because of the platform's user profile, users interested in non-core fields are not very sticky and are easily lost.

        I suggest that, to expand the platform, the website choose one non-core field to study and develop high-quality content in it to attract new user groups. As for the sudden peak in the average reads of blockchain articles in November 2017, it is most likely the result of Bitcoin's continued rise during that period: the price broke $8,000 and then $10,000 in quick succession, news platforms competed to report it, and Internet users paid close attention.

Cycle Analysis

        Taking the articles published over a period of time and comparing their publication dates shows that publication follows a clear periodic pattern.

                           

        The figure shows the number of articles published per day in the first quarter of 2018. The light-colored bars represent Saturdays and Sundays, and the stretch of low counts in the middle corresponds to the Spring Festival holiday. It is easy to see that the vast majority of articles are published on weekdays.
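
        The weekday/weekend split can also be counted directly instead of being read off the chart. A minimal sketch, with a hypothetical function name and assuming the publication dates are datetime.date objects:

```python
from datetime import date


def weekday_weekend_counts(publish_dates):
    """Count how many dates fall on weekdays and how many on Saturday or Sunday."""
    dates = list(publish_dates)
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    weekend = sum(1 for d in dates if d.weekday() >= 5)
    return {"weekday": len(dates) - weekend, "weekend": weekend}
```

        This does not account for public holidays such as the Spring Festival, which would need a separate list of dates.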

Operational Content Analysis

To be completed.

Dig deep into the value of data

Potential value of data

        More than 40,000 articles were analyzed this time. Each article has its own numbers of reads and comments, which are quantifiable and valuable data. An article's reads indicate the size of its audience, and its likes and favorites reflect its popularity as well as its quality. Analyzing the authors through their articles then shows which authors can produce high-quality content, which helps guide the website's operation.

        This analysis digs into these numerical data and applies data mining methods to a real scenario so that the articles can be segmented. On that basis, the website can play to its strengths and compensate for its weaknesses, focusing on discovering high-value articles and weeding out low-value ones.

Analysis approach

        I use a clustering algorithm to group the more than 40,000 articles into several clusters, and label each cluster according to how its articles perform on reads and the other metrics, so that the label indicates how much users like those articles. I then analyze how article categories are distributed across the preference levels, and explore which authors are better at producing high-quality content that users love.

        The valuable, quantifiable data for each article are its reads, likes, favorites, and comments, so these four features, or four dimensions, are candidates for clustering. In practice, however, I found that with all four features the clustering result was always unsatisfactory, because most of the data is concentrated in a narrow range. The earlier analysis noted a correlation between likes and comments, so I kept only one of the two and clustered on three features. After many rounds of parameter tuning and inspecting the results, I finally grouped all the articles into 4 clusters.
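
        The text above does not name the clustering algorithm, so the following is only an illustrative sketch that assumes k-means on three features per article. The log scaling step is also an assumption, added because heavily concentrated count data often clusters poorly on the raw scale, and all function names are hypothetical. In practice a library implementation would normally be used instead of this hand-written loop.

```python
import math
import random


def log_scale(rows):
    """Compress heavy-tailed counts with log(1 + x); an assumed preprocessing step."""
    return [[math.log1p(value) for value in row] for row in rows]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means: returns (labels, centroids) for a list of numeric points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the result has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

        With one row of three metric values per article, `kmeans(log_scale(rows), k=4)` would return one cluster label per article, and the centroids can then be inspected to decide what each cluster means.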

Article classification

        The 4 user-preference categories for articles are:

  • Liked very much: high reads, high likes, high favorites
  • Generally liked: low reads, medium likes, medium favorites
  • Disliked: low reads, low likes, high favorites
  • Disliked: medium reads, low likes, low favorites

        To arrive at these labels, I first looked over all the data and graded each metric by its values, treating values within certain ranges as high, medium, or low, which is why levels such as medium likes and medium favorites appear.
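
        The exact ranges used for grading are not given here, so the sketch below simply assumes tercile cut points computed from the data. Both the cut-point rule and the function names are hypothetical.

```python
import statistics


def tercile_cut_points(values):
    """Return the two cut points that split the values into three equal-sized groups."""
    low_cut, high_cut = statistics.quantiles(values, n=3)
    return low_cut, high_cut


def grade(value, low_cut, high_cut):
    """Label a value as 'low', 'medium', or 'high' relative to two cut points."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"
```

        Applying `grade` to each metric of a cluster's center gives a description such as "low reads, medium likes, medium favorites".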

Value Analysis

        Each article now carries a label for how much users like it. Analyzing how well users receive the articles makes it possible to evaluate the results of the website's operation.

                        

            

        Overall, since the website went into operation, users have paid little attention to or disliked most of its articles, while about 1/4 of the articles were liked very much. Combined with the historical trend, the picture is clear at a glance. When the website first started operating, almost all articles fell into the disliked category. A likely reason is that the website had only just begun publishing and had not yet accumulated many users, so some "high-quality" early articles may have been overlooked. The results slowly began to show from the third quarter of 2015: after that point, the number of articles liked by users grew very quickly, their share kept rising, and the trend continues. I therefore think the website's operation has produced very significant results, and there is still more room for development in the future.

                

        This figure shows how users receive articles in each field; light blue is disliked and dark blue is liked. Product managers, product operation, product design, and similar fields have the largest number of articles, while analysis and evaluation, prototyping, data analysis, and user research have a larger proportion of articles that users like, indicating that these non-core fields can produce high-quality content. The platform can keep in touch with the authors of these articles and commission more articles from them and, as mentioned before, then choose one of these fields in which to expand the platform.

               

        The figure above shows, for the articles that users like very much, the distribution by field and by author (authors with more than 40 such articles). As the chart on the left shows, close to 60% of marketing promotion articles are liked very much, indicating that the quality of articles in this field is particularly high. Similarly, in analysis and evaluation, prototyping, data analysis, and user research, more than 40% of the articles are liked very much, which also reflects the high value of that content. One way to think about it: marketing promotion, data analysis, and user research are not the website's focus areas, but as new technology and knowledge keep developing, future operation could lean toward these fields. Publishing more and better articles in them can attract more users to the platform and form a virtuous circle in which high-quality content attracts more stable, loyal, high-quality users.
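
        Percentages like these are the share of "liked very much" articles within each field. A minimal sketch of that calculation, assuming each article is represented as a (category, preference label) pair; the function name and the label string are hypothetical:

```python
from collections import defaultdict


def liked_share_by_category(articles, liked_label="liked very much"):
    """Return, per category, the fraction of articles carrying the liked label."""
    totals = defaultdict(int)
    liked = defaultdict(int)
    for category, label in articles:
        totals[category] += 1
        if label == liked_label:
            liked[category] += 1
    return {category: liked[category] / totals[category] for category in totals}
```

        The same function gives the share of disliked articles per field if a different label is passed as `liked_label`.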

        The key to achieving this goal is the authors, because articles are written by people, so finding and retaining authors who can bring high traffic is crucial. The picture tells us that authors such as NetEase UEDC, Operation Helicopter, and Qin Lu have published many articles that users like and that are of high value to the website, and it is their articles that have brought the website part of its popularity and traffic. In its future operation, the website must therefore find ways to interact more with these authors, try to retain them, and encourage them to produce more good content, which requires the website's operators to run some online and offline incentive activities.

        Next, consider the distribution of the articles that users pay little attention to or dislike. In categories such as big-name videos, lecture salons, and Renren columns, more than 90% of the articles are disliked, which can be explained in two ways:

  • The content in these fields is of relatively poor quality and really is not helpful to users;
  • These fields are very niche, and only a small group of users cares about them.

           

        Based on the earlier comparison of fields, I am more inclined to think the second point is the main reason. In that case, I think it is unnecessary to spend too many resources on categories that are unlikely to produce results. However, for product managers, prototyping, and marketing promotion, the overall share of disliked articles is about 50% (marketing promotion performs better), so these fields deserve continued effort to make the content even better. Product design, product operation, and other fields are also main areas of the website, but their performance is average, with an overall dislike rate of about 70%, and measures are needed to bring that proportion down.

        For the articles that users do not like, I selected the authors who have published more than 150 of them. Lao Cao may be a special case (as analyzed earlier, he published a large number of articles in the website's early days), but other authors such as DT, Jiexi, Ouyang Junjie, and Lijiang have each published a very large number of articles, more than 300, without winning much recognition from users. Their articles may not say anything useful, or their fields may be narrow, with few readers or readers who are not interested. It may be worth suggesting that these authors publish less often and put more thought into improving the style and quality of their articles. If a columnist's articles stay lukewarm for long, the author can easily lose any sense of accomplishment and eventually leave for another platform, possibly taking a group of regular users along, which would ultimately hurt the platform's traffic and operation.

        Presenting the potential value of the data in numbers and charts therefore makes it easier for the website's operators to see where their platform's advantages and disadvantages lie, so that they can build on strengths, avoid weaknesses, and broaden their operational thinking.

