In recent years, the "second-hand economy" has grown increasingly popular, and the second-hand car market is expanding rapidly along with it.
For the same model, a second-hand car is much more affordable than a new one. Take the Mercedes-Benz GLC class in the picture below: a second-hand one can be 50,000-200,000 cheaper than a new one. As a result, more and more people consider second-hand cars when buying a vehicle.
But as we all know, the second-hand market is murky, and it is easy to overpay out of ignorance if you are not careful. Therefore, it is essential to understand the market before buying a second-hand car.
Today I bring you a hands-on project based on a second-hand car website, using Python to analyze the second-hand car market.
One, clarify the requirements
1. Crawl information about the Mercedes-Benz GLC class from a used car website (title, purchase year, mileage, price)
2. Analyze how the value retention rate of second-hand cars varies with years of use and mileage
Two, crawl data
Before we start crawling, we first decide which tool, that is, which library, to use. At present, there are several ways to write crawlers in Python.
After selecting the tools according to your needs, you can start crawling data.
First, the crawler downloads the webpage data according to our instructions. Then we use XPath expressions to extract the content we need from that data, that is, the title, year, mileage, price and other information of each used car. (Remember to write a loop based on the number of used car listings on the page!)
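The extraction step can be sketched roughly as follows. This is only an illustration under stated assumptions: the page snippet, its tag and class names, and the helper extract_listings are all made up, not the real site's markup. To stay dependency-free it uses the limited XPath support in the standard library's xml.etree.ElementTree, which only parses well-formed markup; a real page would normally need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a downloaded listing page.
# The tag and class names are hypothetical, not the real site's markup.
SAMPLE_PAGE = """
<ul>
  <li class="car">
    <h3 class="title"> Example GLC listing A </h3>
    <p class="subtitle">2018年|3.5万公里</p>
    <span class="price">30.5</span>
  </li>
  <li class="car">
    <h3 class="title"> Example GLC listing B </h3>
    <p class="subtitle">2016年|6.2万公里</p>
    <span class="price">24.8</span>
  </li>
</ul>
"""


def extract_listings(page_source):
    """Return one dict of raw strings per used-car entry on the page."""
    root = ET.fromstring(page_source)
    listings = []
    # Loop once per listing node, so the loop count follows the page content.
    for car in root.findall(".//li[@class='car']"):
        listings.append({
            "title": car.findtext("./h3[@class='title']"),
            "subtitle": car.findtext("./p[@class='subtitle']"),
            "price": car.findtext("./span[@class='price']"),
        })
    return listings


raw_listings = extract_listings(SAMPLE_PAGE)
```

At this point every field is still a raw string, including the spaces around the title and the "|" inside the subtitle.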
Three, data cleaning
What is data cleaning? Data cleaning is the process of re-examining and verifying data. Its purpose is to remove duplicate information, correct existing errors, and ensure data consistency.
In our example, the crawled titles contain spaces and the subtitles contain "|". We need to split the different fields apart and delete the word "year" after the year and "10,000 kilometers" after the mileage, because computers can only calculate with pure numeric data.
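A minimal sketch of that cleaning step, assuming the raw subtitle looks like "2018年|3.5万公里" (year and mileage joined by "|", with the original-language suffixes for "year" and "10,000 kilometers"). The function name clean_listing, the field names, and the sample record are hypothetical.

```python
def clean_listing(raw):
    """Turn one raw listing (all strings) into typed, suffix-free fields."""
    # Assumed subtitle layout: "<year>年|<mileage>万公里".
    year_text, mileage_text = raw["subtitle"].split("|")
    return {
        # Drop the stray spaces around the title.
        "title": raw["title"].strip(),
        # Delete the "year" suffix so only the number remains.
        "year": int(year_text.replace("年", "").strip()),
        # Delete the "10,000 kilometers" suffix; the value stays in that unit.
        "mileage_10k_km": float(mileage_text.replace("万公里", "").strip()),
        "price": float(raw["price"]),
    }


# Made-up sample record, for illustration only.
raw = {
    "title": " Example GLC listing A ",
    "subtitle": "2018年|3.5万公里",
    "price": "30.5",
}
cleaned = clean_listing(raw)
```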
Finally, use the Pandas library to export the data as a CSV file.
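The export might look something like this. The rows and column names are invented for illustration, and the file name in the comment is a placeholder. Calling to_csv without a path returns the CSV text, which makes the result easy to inspect.

```python
import pandas as pd

# Made-up, already-cleaned rows; the last one duplicates the first on purpose.
rows = [
    {"title": "Example GLC listing A", "year": 2018, "mileage_10k_km": 3.5, "price": 30.5},
    {"title": "Example GLC listing B", "year": 2016, "mileage_10k_km": 6.2, "price": 24.8},
    {"title": "Example GLC listing A", "year": 2018, "mileage_10k_km": 3.5, "price": 30.5},
]

# Remove exact duplicate listings before exporting.
df = pd.DataFrame(rows).drop_duplicates()

# Without a path, to_csv returns the CSV content as a string.
csv_text = df.to_csv(index=False)

# To write a file instead (placeholder name):
# df.to_csv("used_glc.csv", index=False)
```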
Isn't data in this form much easier on the eyes?
Four, data visualization
After obtaining standardized data in CSV format, we can analyze it visually and discover its trends and characteristics.
As shown in the figure, the scatter plot on the left clearly shows that the earlier the purchase year of the car, the more the prices cluster in the lower range; the plot on the right shows that mileage and price are negatively correlated.
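Two side-by-side scatter plots of this kind could be drawn with matplotlib roughly as below. The numbers are invented purely to make the sketch runnable and are not the crawled results; the variable names and axis labels are likewise assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt

# Invented sample values, NOT the article's data.
years = [2015, 2016, 2017, 2018, 2019]
mileage = [8.1, 6.2, 5.0, 3.5, 1.9]  # in units of 10,000 km
prices = [21.0, 24.8, 27.5, 30.5, 34.0]

fig, (ax_year, ax_mileage) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: price against purchase year.
ax_year.scatter(years, prices)
ax_year.set_xlabel("Purchase year")
ax_year.set_ylabel("Price")

# Right panel: price against mileage.
ax_mileage.scatter(mileage, prices)
ax_mileage.set_xlabel("Mileage (10,000 km)")
ax_mileage.set_ylabel("Price")

fig.tight_layout()
# fig.savefig("glc_price_scatter.png")  # placeholder file name
```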
Five, process summary