Must-read: how to verify data accuracy and coverage after Taobao data collection

If you need data collected, you have probably looked into data companies large and small online. What you evaluate comes down to data accuracy, data coverage, price, timeliness, and whether the work can be customized to your requirements. Price and timeliness are easy to see and negotiable, but accuracy and coverage matter most for the data itself, and they are also the hardest to verify. Often all you have to go on is the data company's own description of its collection capabilities. Hard to verify does not mean impossible to verify. Below we share antuodata's professional data verification experience and show you how to verify collected data.

Verification 1: Data coverage. In my view, if coverage falls short of requirements, and especially if URLs with high sales or review counts are largely missing, then industry reports based on analysis of that data will be inaccurate. Coverage is therefore the first thing to verify. The steps below use Tmall e-commerce data as an example.

Step 1. Spot-check product URLs by category. Randomly pick a product A from the home appliance data on hand and note how many URLs the table holds for it. Then search for product A by keyword on the Taobao platform, filter to "Tmall", and randomly select 10-20 links under different sort orders (sales volume, comprehensive ranking, price, and so on) to see whether these links are in the table;

Step 2. Spot-check product brand URLs. Search for several home appliances on the site, click on the top-ranked brand, then randomly select 10-20 links and check whether they exist in the table;

Step 3. Spot-check URLs by category and product model. Search for several hot-selling home appliance models on the site, randomly select 10-20 links, and check whether they exist in the table;

Step 4. Spot-check product brands by category. Search for several home appliances at random, especially large appliances, and check whether the top 10 brands shown on the results page are included in the table.
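Once the 10-20 links from a spot check are copied into a list, the lookup against the table can be scripted. The following is only a minimal sketch, not antuodata's actual tooling. It assumes that each product link carries the item identifier in an `id` query parameter (as in `https://detail.tmall.com/item.htm?id=...`), so that tracking parameters do not cause false misses. The function names `item_id` and `spot_check_coverage` and the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def item_id(url):
    """Return the value of the `id` query parameter, or None if it is absent."""
    ids = parse_qs(urlparse(url).query).get("id")
    return ids[0] if ids else None


def spot_check_coverage(sampled_urls, collected_urls):
    """Return (share of sampled links found in the table, list of missing links)."""
    # Index the collected table by item id so extra URL parameters do not matter.
    collected_ids = {item_id(u) for u in collected_urls} - {None}
    missing = [u for u in sampled_urls if item_id(u) not in collected_ids]
    covered = len(sampled_urls) - len(missing)
    ratio = covered / len(sampled_urls) if sampled_urls else 0.0
    return ratio, missing


# Illustrative data only: two links in the table, two links sampled from the page.
collected = [
    "https://detail.tmall.com/item.htm?id=1001&skuId=5",
    "https://detail.tmall.com/item.htm?id=1002",
]
sampled = [
    "https://detail.tmall.com/item.htm?spm=a1&id=1001",
    "https://detail.tmall.com/item.htm?id=1003",
]
ratio, missing = spot_check_coverage(sampled, collected)
print(ratio, missing)
```

The list of missing links is worth keeping: if the misses cluster around high-sales items or one brand, that points to a systematic gap rather than random loss.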

For JD.com, you can also compare the total number of URLs the search page reports for a given product with the total number of URLs in hand to see whether there is a large gap.
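The size of that gap is easier to judge as a ratio than as two raw numbers. A small sketch, where the function name `url_count_gap` and the sample figures are hypothetical; the article does not state what size of gap is acceptable, so that decision is left to you.

```python
def url_count_gap(page_total, collected_total):
    """Relative shortfall of collected URLs versus the total shown on the page."""
    if page_total <= 0:
        raise ValueError("page_total must be positive")
    return (page_total - collected_total) / page_total


# Illustrative figures: the page reports 1200 results, the table holds 1080.
gap = url_count_gap(1200, 1080)
print(f"{gap:.1%} of the URLs shown on the page are not in the table")
```

A negative result means the table holds more URLs than the page reports, which can happen when listings have been taken down since collection.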

After the above multi-dimensional spot checks, you can get a rough idea of the coverage of the data in your hands.

Verification 2: Data accuracy. Accuracy here does not include coverage; it simply means comparing the information on the page with the information on hand. You can verify along the following dimensions. Price: the selling price and the price after coupons and full-reduction discounts; check that the selling price matches the page and that the discounted price is calculated correctly. Product information: whether the collected model, brand, style, color, promotional activities, and so on match the page. Sales and reviews: whether the collected sales volume and review count match the page. Store information: whether the store name, Wangwang name, store ID, store level, and so on match the page. In short, check whether every collected field matches the page. This is the basic requirement of collection: if the information is inaccurate, the data is meaningless.
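If you note down what the page shows for a sampled product, the field-by-field comparison can be scripted as well. The sketch below rests on several assumptions: the field names (`selling_price`, `brand`, `sales_volume`, `store_name`) are hypothetical, the sample values are invented, and `price_after_full_reduction` uses a deliberately simplified rule (subtract the reduction once when the price reaches the threshold). Real promotions may stack or repeat, so adjust the rule to the promotion actually shown on the page.

```python
from decimal import Decimal


def compare_fields(collected, page, fields):
    """Return {field: (collected value, page value)} for every field that differs."""
    mismatches = {}
    for name in fields:
        got, expected = collected.get(name), page.get(name)
        if isinstance(got, str) and isinstance(expected, str):
            # Ignore leading and trailing whitespace picked up during scraping.
            same = got.strip() == expected.strip()
        else:
            same = got == expected
        if not same:
            mismatches[name] = (got, expected)
    return mismatches


def price_after_full_reduction(price, threshold, reduction):
    """Simplified rule (an assumption): subtract `reduction` once if price >= threshold."""
    price, threshold, reduction = Decimal(price), Decimal(threshold), Decimal(reduction)
    return price - reduction if price >= threshold else price


# Invented sample record: what was collected versus what the page shows.
collected = {"selling_price": "2999.00", "brand": "BrandA",
             "sales_volume": 1520, "store_name": "BrandA Flagship Store"}
page = {"selling_price": "2999.00", "brand": "BrandA",
        "sales_volume": 1534, "store_name": "BrandA Flagship Store "}

print(compare_fields(collected, page, list(page)))
print(price_after_full_reduction("2999.00", "2000", "200"))
```

Sales volume and review counts change over time on a live page, so a small difference in those fields may reflect the time between collection and checking rather than a collection error.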

Data verification is a repetitive and extremely tedious process that requires attention and patience. I hope the experience above helps you check the quality of your data.

