Is it difficult to get started with web crawlers? Not really: a few cases to think about

Foreword


I won't dwell on how popular crawlers are right now. Let me first talk about what this technology can do, mainly in the following three areas:

1. Crawl data for market research and business analysis. Crawl high-quality topic content from sites such as Zhihu and Douban; grab listing information from real-estate sites to analyze housing-price trends across regions; crawl job postings from recruitment sites to analyze talent demand and salary levels in various industries.

2. Gather raw data for machine learning and data mining. For example, if you want to build a recommendation system, you can crawl data with more dimensions to train a better model.

3. Crawl high-quality resources: pictures, text, and video. For instance, crawl exquisite in-game artwork, or collect image resources and comment text.

With the right method, it is actually quite easy to learn to crawl data from mainstream websites in a short time. But it is best to have a specific goal from the start: driven by a goal, your learning will be more focused and efficient.

Here is a smooth quick-start learning path for complete beginners:

  1. Understand how crawlers work

  2. Implement simple data scraping

  3. Handle anti-crawler measures on special websites

  4. Advance to Scrapy and distributed crawling

01 Understand how crawlers work

Most crawlers follow the flow of "send a request, get the page, parse the page, extract and store the content", which simulates how a browser obtains web-page information. Simply put, after we send a request to the server, we get back a page; after parsing the page, we extract the parts we want and store them in a file or database. For this part, a basic grasp of the HTTP protocol and web fundamentals such as GET/POST, HTML, CSS, and JS is enough; you do not need to study them systematically.
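
As a minimal sketch of that flow (the URL and output file are placeholders, not a real target), here is what "send a request, get the page, parse, extract, store" looks like in Python:

```python
import re

import requests

# 1. Send a request (placeholder URL; swap in the site you actually target)
url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()

# 2. Get the page: the server returns HTML as text
html = response.text

# 3. Parse the page: a simple regex pulls out the <title> tag
match = re.search(r"<title>(.*?)</title>", html, re.S)
title = match.group(1).strip() if match else ""

# 4. Extract and store: append the result to a local file
with open("titles.txt", "a", encoding="utf-8") as f:
    f.write(title + "\n")

print("Saved title:", title)
```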

02 Implement simple data scraping

There are many crawler-related packages in Python: urllib, requests, bs4, scrapy, pyspider, and so on. It is recommended that you start with requests + XPath: requests connects to the website and fetches the page, and XPath parses the page to make data extraction easy. If you have used BeautifulSoup, you will find that XPath saves a lot of trouble; the work of inspecting the element code layer by layer is all but eliminated. A sketch of this combination follows the list below.

Once you master it, you will find that the basic routines of crawlers are all similar, and ordinary static websites are no problem at all: public information on sites such as Zhihu and Douban can all be crawled. If you need to crawl an asynchronously loaded website, you can learn to capture and analyze the real requests in the browser, or learn Selenium to automate the crawling; with that, dynamic websites such as Zhihu, Mtime, and TripAdvisor are basically no problem either.

You will also need some Python basics, such as:

  file read/write operations: to read parameters and save the crawled content

  list and dict: to structure the crawled data

  conditionals (if/else): to decide whether a crawler step should run

  loops and iteration (for/while): to repeat crawler steps
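
Here is a minimal requests + XPath sketch using lxml; the URL and the XPath expressions are placeholder assumptions and would need to match the real page structure:

```python
import requests
from lxml import etree

# Fetch the page (placeholder URL)
url = "https://example.com/list"
headers = {"User-Agent": "Mozilla/5.0"}  # many sites reject the default UA
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

# Parse the HTML into a tree that supports XPath queries
tree = etree.HTML(response.text)

# Placeholder XPath: grab the text of every <h2 class="title"> element
titles = tree.xpath('//h2[@class="title"]/text()')

for title in titles:
    print(title.strip())
```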

03 Anti-crawler measures on special websites

In the process of crawling, you may also run into some despair-inducing obstacles: being blocked by the website, strange CAPTCHAs of every kind, User-Agent access restrictions, various forms of dynamic loading, and so on.

When you encounter these anti-crawler measures, some advanced techniques are needed to deal with them, such as access-frequency control, proxy IP pools, packet capture, and OCR for CAPTCHAs. A sketch of the first two follows.
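
As a hedged sketch (the proxy addresses and target URL are placeholders, not real services), frequency control plus a small proxy pool and rotating User-Agents might look like this:

```python
import random
import time

import requests

# Placeholder proxy pool; in practice these come from a paid or self-built service
PROXIES = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:8080",
]

# A few common User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def polite_get(url):
    """Fetch a URL with a random proxy, a random User-Agent, and a delay."""
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    time.sleep(random.uniform(1, 3))  # frequency control: 1-3 s between requests
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )

response = polite_get("https://example.com/page/1")  # placeholder URL
print(response.status_code)
```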

For example, we often find that the URL of some websites does not change when you turn the page; this is usually asynchronous loading. Use the browser's developer tools to analyze the network requests the page makes, and you can often find the real data interface, which yields unexpected gains. A sketch of hitting such an interface directly follows.
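
For instance, suppose the developer tools reveal that paging is served by a JSON endpoint; the endpoint, parameters, and field names below are all hypothetical:

```python
import requests

# Hypothetical JSON API discovered in the browser's Network panel
api = "https://example.com/api/items"

for page in range(1, 4):
    # The real request often takes the page number as a query parameter
    resp = requests.get(api, params={"page": page, "size": 20}, timeout=10)
    resp.raise_for_status()
    data = resp.json()  # asynchronous pages usually return JSON, not HTML

    # Field names depend entirely on the site; "items" and "name" are assumptions
    for item in data.get("items", []):
        print(page, item.get("name"))
```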

Between efficient development and anti-crawler defenses, websites usually favor the former, which leaves room for crawlers. Once you master these counter-techniques, most websites will no longer be difficult.

04 Scrapy and distributed crawling

requests + XPath plus packet-capture analysis can indeed handle many websites, but things get harder when the data volume is large or when the crawl needs to be split into modules. That is where the powerful Scrapy framework comes in: it makes it easy to construct Requests, ships with a powerful Selector for parsing Responses, and, most impressively, delivers very high performance while letting you engineer and modularize your crawlers. After learning Scrapy, you can try building a simple crawler framework yourself, and you can start thinking about large-scale crawling in a structured, engineering-oriented way. From there you will gradually come into contact with distributed crawlers. The term sounds intimidating, but in essence it applies the same idea as multi-threading: letting multiple crawlers work at the same time for higher efficiency.
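
As a minimal Scrapy sketch, here is a self-contained spider against quotes.toscrape.com, a public practice site built for exactly this kind of exercise; the XPath expressions match that site's markup and would need adapting elsewhere:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """A minimal spider: crawl a listing page and yield structured items."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Scrapy's Selector supports XPath directly on the response
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                "text": quote.xpath('./span[@class="text"]/text()').get(),
                "author": quote.xpath('.//small[@class="author"]/text()').get(),
            }

        # Follow pagination; Scrapy schedules the new Request for us
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` and Scrapy handles the scheduling, deduplication, and export for you.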

In fact, once you have learned this much, you can basically call yourself a seasoned crawler developer. It looks hard to outsiders, but it is not that complicated, because crawling does not require you to systematically master a whole language, nor does it demand advanced database skills. The efficient approach is to pick up these scattered knowledge points through real projects, which guarantees that each time you learn exactly the part you need most.

Of course, the one real difficulty is this: for a specific problem, how do you find exactly the learning resources you need, and how do you screen and evaluate them? That is a big problem for many beginners. But don't worry: we have prepared a very systematic crawler course. Besides a clear learning path, it offers carefully selected, practical learning resources and a large library of mainstream crawler cases. After a short period of study, you will be able to crawl the data you want with ease.

1. Introduction to Python

The following content is the foundational knowledge required for every application direction of Python. Whether you want to do crawling, data analysis, or artificial intelligence, you must learn it first. Everything advanced is built on basic foundations; with a solid foundation, the road ahead will be steadier. All materials are free at the end of the article!

Including:

Computer Basics


Python basics


Introductory Python videos (600 episodes):

Watching videos designed for complete beginners is the fastest and most effective way to learn. Follow the teacher's train of thought in the videos and it is easy to go from the basics to deeper material.

2. Python crawler

Crawling is a popular direction: it is a good choice whether as a side job or as an auxiliary skill that improves your efficiency at work.

Crawler technology can be used to collect relevant content, which is then analyzed and filtered to extract the information we really need.

This work of collecting, analyzing, and integrating information applies to a very wide range of fields. Whether it is lifestyle services, travel, financial investment, or the product-market demands of various manufacturing industries, crawler technology can be used to obtain more accurate and effective information.


Python crawler video material


3. Data analysis

According to the report "Digital Transformation of China's Economy: Talents and Employment" released by the School of Economics and Management of Tsinghua University, the talent gap in data analysis is expected to reach 2.3 million by 2025.

With such a large talent gap, data analysis is a vast blue ocean! A starting salary of 10K is commonplace.


4. Database and ETL data warehouse

Enterprises need to regularly move cold data out of the business database and into a warehouse dedicated to storing historical data, so that each department can receive unified data services tailored to its own business. That warehouse is the data warehouse.

The traditional data-warehouse integration architecture is ETL, built on the capabilities of an ETL platform: E = extract data from the source database; T = transform it, cleaning data that does not conform to the rules and computing tables of different dimensions and granularities according to business rules; L = load the processed tables into the data warehouse, incrementally, in full, or on different schedules.
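
As an illustrative sketch only (the table names, cleaning rules, and database files are all assumptions), a tiny ETL job in Python might look like this:

```python
import sqlite3

import pandas as pd

# E: extract cold data from the source business database (placeholder file/table)
source = sqlite3.connect("business.db")
df = pd.read_sql("SELECT * FROM orders WHERE order_date < '2023-01-01'", source)

# T: transform: drop rows that break the rules, then aggregate by month
df = df.dropna(subset=["amount"])   # clean non-conforming records
df = df[df["amount"] > 0]
df["month"] = pd.to_datetime(df["order_date"]).dt.to_period("M").astype(str)
monthly = df.groupby("month", as_index=False)["amount"].sum()

# L: load the processed table into the warehouse (here just another SQLite file)
warehouse = sqlite3.connect("warehouse.db")
monthly.to_sql("orders_monthly", warehouse, if_exists="append", index=False)
```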


5. Machine Learning

Machine learning means letting a computer learn from part of the data and then make predictions and judgments about the rest.

At its core, machine learning is "using algorithms to parse data, learn from it, and then make decisions or predictions about new data". In other words, the computer uses the data it has to derive a model, then uses that model to make predictions. The process is somewhat like human learning: once a person has gained some experience, they can make predictions about new problems.
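
A minimal scikit-learn sketch of that learn-then-predict loop (the dataset is a built-in toy set, chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Learn from part of the data...
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# ...then predict and judge the rest
print("Accuracy on unseen data:", model.score(X_test, y_test))
```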


Machine Learning Materials:


6. Advanced Python

From basic syntax to a large number of in-depth advanced topics, all the way to programming-language design: after finishing this part, you will basically have covered every knowledge point from Python beginner to advanced.


At this point, you can basically meet companies' employment requirements. If you still don't know where to find interview materials and resume templates, I have compiled a set for you as well. It can truly be called a nanny-level, systematic learning route.

But learning programming is not achieved overnight; it requires long-term persistence and training. In organizing this learning route, I hope to make progress together with everyone and to review some technical points myself. Whether you are a programming novice or an experienced programmer looking to advance, I believe everyone can gain something from it.


Getting the materials

This complete set of Python learning materials has been uploaded to CSDN officially. If you need it, you can get it for free through the official CSDN channel below. [100% free]


Recommended reading

Understand Python's career prospects: https://blog.csdn.net/SpringJavaMyBatis/article/details/127194835

Learn about part-time side jobs with Python: https://blog.csdn.net/SpringJavaMyBatis/article/details/127196603
