Python crawler in action: inflating a blog site's traffic (repost)

Introduction:

The Python concurrency articles are still a grind to write, which gets a bit boring, so I wrote a little crawler script to play with. It reminded me of something I'd seen on a blog site: one person's throwaway articles each pulled in nearly 20K views within half an hour of posting, several articles in a row, yet with no comments or likes at all. Compare that with Hongyang's blog, which many people actually read: one of his articles was up for 3 months and only reached 13K views. A moment's thought made it obvious the numbers were being inflated by a crawler. It felt so shameless that I couldn't help reporting it to their customer service, but I never got any reply.


Recently I happened to click on the ranking page and saw something like this???


Nobody cares about the ones selling coupons... fine, that can be tolerated, but is this fair to the big names and the newbies who work so hard to write real technical articles?


Anyone can write a traffic-inflating script, and this blog site counts a view without even requiring login. Write an infinite loop, keep switching IPs, and visit the blog. The specific process:

  • 1. Grab all the article URLs of your blog and save them;
  • 2. Prepare a pile of proxy IPs (many of the free ones aren't usable; you can buy 20K high-anonymity proxy IPs online for 5 yuan);
  • 3. while True: switch IP and randomly visit one of your own articles.

Yes, it's that simple, with no anti-crawler measures or penalties whatsoever; the person mentioned earlier had brazenly racked up more than 50K views per article. So I chose to move over to the Nuggets (Juejin)...

1. Write your own common module

When writing a simple crawler, there are some code snippets you use all the time, such as issuing a request and getting the Response, or downloading pictures. You can pull these out into a module of your own; when you need them, just import your module and call them. For example, here is a simple module written by Piggy (the author):

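Part of the code was shown as a screenshot in the original. Here is a minimal sketch of what such a common module might look like; the function names and defaults are my own illustrations, not the author's originals:

# c_utils.py -- a minimal sketch of a reusable crawler module
import os
import urllib.request

DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def get_resp(url, headers=None, timeout=10):
    """Issue a GET request and return the HTTPResponse object."""
    req = urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)
    return urllib.request.urlopen(req, timeout=timeout)

def fetch_html(url):
    """Fetch a page and decode it as UTF-8 text."""
    return get_resp(url).read().decode('utf-8')

def download_img(url, save_dir='pic/'):
    """Download a picture into save_dir, keeping its original file name."""
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, url.split('/')[-1])
    with open(path, 'wb') as f:
        f.write(get_resp(url).read())
    return path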

These can all be customized to your own needs, which makes writing crawlers much more convenient.

In addition, Piggy crawls little things for practice whenever there's nothing else to do; beginners can give them a try too. The relevant scripts are all thrown up on my GitHub, free for the taking:

github.com/coder-pig/R…


2. Write the traffic-inflating script

Step 1: Get all the article links of the blog

Open blog.csdn.net/coder_pig?v... and scroll to the bottom, where you can find the pagination bar. All you have to do is get the URL of each list page, process the page to extract all the article links, and save them. Click open the second page and you'll find the URL becomes blog.csdn.net/zpj77987844... So we just need to find out how many pages there are in total, and then splice the URLs ourselves:


Press F12 in the browser to open Elements, then Ctrl+F to search for the "last page" link, which locates it directly:


Then search for this papelist id globally; it turns out to be unique as well. The rest is easy: process the href and extract the final page number:

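The original shows this step as a screenshot. A sketch of what the extraction might look like, assuming BeautifulSoup and the old CSDN markup described above; the blog URL and selectors are assumptions, so adapt them to what you actually see in Elements:

# sketch: read the last-page link out of the papelist div
import re
import urllib.request

from bs4 import BeautifulSoup

blog_url = 'http://blog.csdn.net/coder_pig'  # hypothetical; use your own blog URL

def get_page_count():
    html = urllib.request.urlopen(blog_url).read().decode('utf-8')
    soup = BeautifulSoup(html, 'html.parser')
    papelist = soup.find('div', id='papelist')
    # the href of the last link ends with the page number, e.g. .../article/list/5
    last_href = papelist.find_all('a')[-1]['href']
    return int(re.search(r'(\d+)$', last_href).group(1))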

The next step is to look at the page structure of each list page, extract all the article URLs, and write them to a file.

Take the second page we just opened, open Elements again, and search for one of the article titles, say "Xiaozhu talks about Android screen adaptation"; that locates it:


Then click on that node, scroll up a bit, and it's not hard to find:


Similarly, search for article_list, which is also unique. Get this div, then the spans with class='link_title', and then the a tag inside each:

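Again the code is a screenshot in the original; here is a sketch of the idea, reusing get_page_count from the previous snippet. The list-page URL pattern and the site-relative hrefs are assumptions:

# sketch: collect every article URL into csdn_articles_file.txt
import urllib.request

from bs4 import BeautifulSoup

articles_file = 'csdn_articles_file.txt'
list_url = 'http://blog.csdn.net/coder_pig/article/list/'  # hypothetical pattern

def fetch_article_urls():
    with open(articles_file, 'w', encoding='utf-8') as f:
        for page in range(1, get_page_count() + 1):
            html = urllib.request.urlopen(list_url + str(page)).read().decode('utf-8')
            soup = BeautifulSoup(html, 'html.parser')
            article_list = soup.find('div', id='article_list')
            for span in article_list.find_all('span', class_='link_title'):
                # hrefs are site-relative, e.g. /coder_pig/article/details/xxx
                f.write('http://blog.csdn.net' + span.a['href'] + '\n')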

Run this method and you'll see csdn_articles_file.txt generated in the working directory; open it and there are the URLs of all our articles:


The first step is complete~

Step 2: Visit the webpage

This part is very simple: switch the IP, then issue a request for the article; you don't even need to call the read() method. You can also add a counter here: a return code of 200 means the visit succeeded, so increment the count by 1. The code is very simple:

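A sketch of what this might look like with urllib's ProxyHandler; the proxy list is a placeholder, to be filled with your own purchased IPs:

# sketch: switch to a random proxy, request an article, count the 200s
import random
import urllib.request

proxy_ips = ['http://111.111.111.111:8888', 'http://222.222.222.222:8888']  # placeholders

visit_count = 0

def visit(url):
    global visit_count
    proxy = {'http': random.choice(proxy_ips)}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    try:
        resp = opener.open(url, timeout=10)
        if resp.getcode() == 200:  # 200 means the visit was counted
            visit_count += 1
            print('visit ok:', visit_count, url)
    except Exception as e:
        print('visit failed:', url, e)  # dead proxies land here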

Step 3: Execute the code

This part is simple too: first check whether the article-list file exists; if not, crawl and build it. Then load all the lines in the file into a list, start a while True infinite loop, and on each pass use random to pick one of the articles and visit it!

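A sketch of the driver, reusing articles_file, fetch_article_urls and visit from the snippets above; the pause between hits is my own addition:

# sketch: load the saved URLs and hit a random one forever
import os
import random
import time

def main():
    if not os.path.exists(articles_file):
        fetch_article_urls()  # build the article list first if it's missing
    with open(articles_file, 'r', encoding='utf-8') as f:
        urls = [line.strip() for line in f if line.strip()]
    while True:
        visit(random.choice(urls))
        time.sleep(random.uniform(1, 3))  # brief pause between hits

if __name__ == '__main__':
    main()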

After it runs, the counter starts climbing, which means the visits are succeeding. Open your own blog page and refresh it after a while to see whether the traffic has increased:


3. Throw the script onto a server to run

It's impossible to keep your own computer on 24 hours a day, right? It costs electricity; I, for example, shut mine down after work. If you want the script running around the clock, throw it onto a server. An ordinary one costs around a hundred yuan; if you're interested, search Baidu for Aliyun, Tencent Cloud virtual hosts and the like.

We usually connect to a remote host's terminal through the ssh command:

ssh root@host_ip

then enter the host's password to log in.


Then you can upload your script files to the server with an FTP tool of your choice, and run python3 xxx.py in the ssh terminal.


But there's a problem: if you press Ctrl+C or close the ssh terminal, your script stops! So you need to run your Python script as a daemon, which you can do with the nohup command. Type a command like this:

nohup python3 -u xxx.py > xxx.out 2>&1 &

A quick explanation:

  • nohup together with the trailing & makes the command run in the background; for example, you can simply write nohup python3 xxx.py &.
  • > xxx.out redirects the program's output into the xxx.out log file.
  • 2>&1 redirects stderr to stdout, so error messages also land in the log file. (0 is stdin, 1 is stdout, 2 is stderr.)

After this execution, a pid (process id) will be returned:


Then you can follow the log output with the tail command:

tail -f xxx.out


If you feel it has run long enough and want to stop the program, just execute the following command to kill the process, e.g. kill -9 19267:

kill -9 pid

It's fine if you've forgotten the pid; you can look it up with the following command:

ps -ef | grep python3


Then kill it off. By the way, the process shown above with a start time of 8:28 is the script I left running before going to bed last night; tail-ing the log file showed:


It had furiously racked up 310K views overnight. But let's not dwell on that...

4. Python3 ssl module not found

After throwing the script onto the server, running it with python3 failed because the ssl module couldn't be found, which was baffling; pip3 install ssl just errors out as well. After digging around online, the fix starts with these two commands to install the dependencies:

apt-get install openssl
apt-get install libssl-dev

Even after installing them it still didn't work. It turns out you have to change the build configuration in the Python 3 source folder and re-make: cd to the following path and edit the Setup file with vim:

cd ../../usr/lib/python/Python-3.6.4/Modules 
vim Setup

Change the corresponding part as shown below, then press Esc and type :wq to save.

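The edit itself was a screenshot in the original. From memory of the Python 3.6 source tree, the ssl section of Modules/Setup is normally commented out and looks roughly like this once uncommented; the SSL= path may differ on your system:

# uncommented ssl section of Modules/Setup (adjust SSL= to your openssl prefix)
SSL=/usr/local/ssl
_ssl _ssl.c \
    -DUSE_SSL -I$(SSL)/include -I$(SSL)/include/openssl \
    -L$(SSL)/lib -lssl -lcrypto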

Then execute the following commands one by one (if the last command complains about insufficient permissions, you can add -H, i.e. sudo -H make install):

cd ..
sudo ./configure
sudo make
sudo make install

After the build finishes, type python3 on the command line to enter the Python 3 interpreter; if import ssl raises no error, the installation succeeded!


Summary

In this section, you learned how to write your own common module, write a traffic-inflating script, and throw the script onto a server to run as a background program.


In the end, though, what is the point of inflating visit counts? The original intention of blogging is to share and to record one's own learning. At some point we began chasing the so-called view counts, like counts, and comment counts, and then came the clickbait titles and chicken soup... These days most people rush toward whatever gives quick, immediate results and shy away from things that need long, slow accumulation. Perhaps that is just impetuousness.

Download the source code of this section:

github.com/coder-pig/R…


Reposted from: coder-pig
Link: https://juejin.im/post/5a6bfb5b6fb9a01c9332e7f6
Source: Juejin (the Nuggets)
