Implementation of crawler proxy service based on Squid

proxy

How the proxy server works

1. Client A sends the proxy server a request to access the Internet.
2. On receiving the request, the proxy server first checks it against the rules in the access control list. If the rules allow it, the proxy searches its cache for the requested resource.
3. If the resource is in the cache, the proxy returns it to client A; if not, the proxy requests it from the Internet on the client's behalf.
4. The host on the Internet sends the requested information to the proxy server, which stores it in the cache.
5. The proxy server passes the host's response on to client A.
6. Later, client B requests the same information.
7. The proxy server again accepts the request and checks it against the rules in the access control list.
8. If the rules allow it, the proxy server returns the cached copy directly to client B.
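
The flow above can be sketched as a toy model in Python. This is only an illustration of the steps, not how Squid is implemented; the class name CachingProxy, the fake_origin function, and the IP addresses are all made up for the example.

```python
from typing import Callable, Dict, Set


class CachingProxy:
    """Toy model of the request flow described above (not Squid's real code)."""

    def __init__(self, allowed_clients: Set[str], fetch: Callable[[str], str]) -> None:
        self.allowed_clients = allowed_clients  # stands in for the access control list
        self.fetch = fetch                      # stands in for a request to the Internet
        self.cache: Dict[str, str] = {}
        self.origin_requests = 0                # how many times the origin was contacted

    def handle(self, client_ip: str, url: str) -> str:
        # Step 2: check the request against the access rules first.
        if client_ip not in self.allowed_clients:
            return "403 Forbidden"
        # Step 3: serve from the cache when the resource is already there.
        if url in self.cache:
            return self.cache[url]
        # Steps 3-5: otherwise fetch on the client's behalf, store, and return.
        self.origin_requests += 1
        body = self.fetch(url)
        self.cache[url] = body
        return body


def fake_origin(url: str) -> str:
    """Hypothetical origin server used only for this example."""
    return "content of " + url


proxy = CachingProxy({"192.168.100.128", "192.168.100.129"}, fake_origin)
first = proxy.handle("192.168.100.128", "http://example.com/")   # client A: cache miss
second = proxy.handle("192.168.100.129", "http://example.com/")  # client B: cache hit
denied = proxy.handle("10.0.0.1", "http://example.com/")         # not in the ACL
```

Client B's request is answered from the cache, so the origin is contacted only once.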

Proxy classification

  • Forward proxy (controls intranet access to the Internet)

  • Reverse proxy (controls external network access to the internal network)

  • Transparent proxy (a forward proxy that requires no client-side configuration)

Forward proxy

A forward proxy accesses the Internet on behalf of internal hosts. It provides shared Internet access, caching, control over users' Internet access behavior, and similar functions. Clients must be configured with the proxy server's IP and port.

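Forward proxy diagram:
		Internet
		 |
		modem
		 |
		router (DHCP, SNAT shared Internet access, access control, rate limiting, etc.)
		 |
		 |
	 Squid forward proxy (shared Internet access, static page caching, layer 4/7 access control for intranet users, rate limiting, etc.)
		 |
		 |	
	|----------------------|
 Internet user 1	    Internet user 2

            Public network
			 |
			 |	
			br0	172.16.13.250
			Squid server 
			virbr1	192.168.100.1		   
			 |
			 |
			 |	
			Intranet user VM1          	   	
			eth0(virbr1)			
			192.168.100.128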

Reverse proxy

A reverse proxy lets clients on the external network reach internal servers, the opposite direction of a forward proxy. It is mainly used for cache acceleration or as the CDN layer of a website architecture.

            client
			  |
			  |
			reverse proxy (cache acceleration, layer 7 routing, load balancing, session persistence, etc.)
			  |
			  |	
			  web

Transparent proxy

A transparent proxy does exactly what a forward proxy does, except that the client does not need to set the proxy server's IP and port; the proxy is invisible to the user.

References

https://www.cnblogs.com/yanjieli/p/7507456.html

Squid

concept

Squid is caching proxy server software that is widely used in the load-balancing architecture of websites. Other common cache servers include Varnish and ATS.

A forward proxy fits the case where only one server on the intranet can reach the Internet yet every machine on the intranet needs Internet access. It can also serve as a proxy for crawlers. In practice I use Squid as the crawler's proxy server so that requests can switch between multiple IPs.

installation

yum install -y squid

Configuration instructions

Configure authentication

yum install httpd

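# Then run the following command to generate a username and password; this example creates an account named hello
# After running the command, enter the password when prompted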
htpasswd -c /etc/squid/passwd hello
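
A client presents such an account to a proxy using HTTP Basic authentication in a Proxy-Authorization header. The sketch below shows how a crawler could build that header. It assumes Squid has been set up to check /etc/squid/passwd, which the configuration in this article does not itself show, and the password "secret" is a placeholder.

```python
import base64


def proxy_basic_auth_header(username: str, password: str) -> dict:
    """Build the Proxy-Authorization header for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Proxy-Authorization": "Basic " + token}


# "hello" is the account created above; "secret" is a made-up password.
headers = proxy_basic_auth_header("hello", "secret")
```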

Configuration file

(/etc/squid/squid.conf)

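acl all src 0.0.0.0/0.0.0.0     # matches every source IP
acl manager proto http        # the manager URL uses the http protocol
acl localhost src 127.0.0.1/255.255.255.255 # matches the local host IP
acl to_localhost dst 127.0.0.1         # matches requests whose destination is the local host
acl CONNECT method CONNECT     # matches the CONNECT request method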

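#http_access allow all         # allow everyone to use this proxy

#http_reply_access allow all         # allow replies to every client

acl Safe_ports port 80     # treat port 80 as a safe port
acl Safe_ports port 443    # treat port 443 as a safe port
acl localnet src 10.195.249.225   # client IP that may use the proxy
acl localnet src 10.195.236.141   # client IP that may use the proxy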


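http_access allow localnet      # allow the localnet clients defined above
http_access deny !Safe_ports      # deny requests to ports not listed in Safe_ports

acl OverConnLimit maxconn 16    # limit each IP to at most 16 connections, to guard against attacks

http_access deny OverConnLimit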

 
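icp_access deny all             # do not exchange ICP requests with neighbor caches
miss_access allow all         # allow requests that miss the cache to be forwarded
ident_lookup_access deny all                 # disable ident lookups
http_port 8080 transparent                 # port on which Squid listens for client requests

hierarchy_stoplist cgi-bin ?         # URLs containing cgi-bin or ? are fetched directly; together with the
                                     # QUERY rules below they are not cached, mainly for safety

acl QUERY urlpath_regex cgi-bin \?

cache deny QUERY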

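cache_mem 1 GB     # tuning option: a larger value helps caching. Rule of thumb:
                   # with n MB of system memory, set this to about n/3 MB; this host has 3 GB, so 1 GB is used

fqdncache_size 1024    # FQDN cache size (number of entries)

maximum_object_size_in_memory 2 MB     # largest object that may be held in memory

memory_replacement_policy heap LFUDA  # evict the least frequently used objects (with dynamic aging) from the memory cache

cache_replacement_policy heap LFUDA     # evict the least frequently used objects (with dynamic aging) from the disk cache

cache_dir ufs /home/cache 5000 32 512 # cache directory: ufs storage, at most 5000 MB,
                                      # 32 first-level directories and 512 second-level directories

max_open_disk_fds 0                 # maximum number of open disk files; 0 means unlimited

minimum_object_size 1 KB             # smallest object that will be stored in the cache

maximum_object_size 20 MB         # largest object that will be stored in the cache

cache_swap_low 90              # low-water mark: start evicting objects when the cache is 90% full

cache_swap_high 95              # high-water mark: evict aggressively when the cache is 95% full

ipcache_size 2048                # IP address cache size (2048 entries)
ipcache_low 90                # low-water mark for the IP cache: 90%
ipcache_high 95                 # high-water mark for the IP cache: 95%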


access_log /var/log/squid/access.log squid     # where the access log is written
cache_log /var/log/squid/cache.log squid
cache_store_log none             # disable the store log

emulate_httpd_log on     # make Squid write access records in web server (httpd) format;
                         # set this if you want to use a web log analysis tool

refresh_pattern . 0 20% 4320 override-expire override-lastmod reload-into-ims ignore-reload  # cache refresh rule


acl buggy_server url_regex ^http://.... http://      # matches http requests only
broken_posts allow buggy_server

acl apache rep_header Server ^Apache         # matches responses served by Apache

broken_vary_encoding allow apache

request_entities off                     # reject nonstandard HTTP requests (a body on GET/HEAD), to guard against attacks
header_access header allow all             # allow all HTTP headers
relaxed_header_parser on                 # parse HTTP headers leniently
client_lifetime 120 minute                 # maximum client connection time: 120 minutes
cache_mgr [email protected]             # address that receives alerts when the cache has problems
cache_effective_user squid             # run the Squid server as user squid
cache_effective_group squid

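icp_port 0            # port on which Squid exchanges ICP requests with neighbor caches.
                      # It is set to 0 (disabled) because Squid is configured here as an accelerator
                      # for an internal web server, so neighbor caches are not needed.

# cache_peer sets the host the cache is allowed to fetch from; it is the local machine here, hence 127.0.0.1

cache_peer 127.0.0.1 parent 80 0 no-query default multicast-responder no-netdb-exchange
cache_peer_domain 127.0.0.1
hostname_aliases 127.0.0.1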

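error_directory /usr/share/squid/errors/Simplify_Chinese     # directory of the error pages

always_direct allow all         # on a cache miss, forward every request straight to the origin server
ignore_unknown_nameservers on     # ignore DNS replies that come from unknown name servers
coredump_dir  /var/log/squid         # directory for core dumps
max_filedesc 2048        # maximum number of open file descriptors

half_closed_clients off     # close the client connection as soon as read returns no more data.
                            # Sometimes read returns no data because the client has shut down the sending
                            # side of TCP while still receiving; Squid cannot tell a half-closed
                            # connection from a fully closed one.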

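When Squid serves as a crawler proxy, the crawler only needs to talk to Squid, and Squid forwards each request to one of several upstream proxies in rotation. How do we make Squid forward requests round-robin automatically?

Add a line like the following for each upstream proxy. (Squid only uses these peers when it is not told to go direct, so the always_direct allow all line in the configuration above has to be removed or replaced with never_direct allow all.)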

cache_peer 120.xx.xx.32 parent 80 0 no-query weighted-round-robin weight=2 connect-fail-limit=2 allow-miss max-conn=5 name=proxy-90

Note that when several peers share the same IP (120.xx.xx.32 here) but use different ports, each one must be given a different name. Otherwise Squid reports the error: cache_peer 120.xx.xx.32 specified twice.
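
With many upstream proxies it is easier to generate these lines than to write them by hand. The sketch below renders one cache_peer line per (ip, port) pair and gives each a unique name. The proxy-N naming scheme and the 192.0.2.10 documentation address are assumptions made for the example; the options are copied from the line above.

```python
from typing import Iterable, List, Tuple


def cache_peer_lines(upstreams: Iterable[Tuple[str, int]]) -> List[str]:
    """Render one cache_peer line per upstream proxy, each with a unique name."""
    lines = []
    for index, (ip, port) in enumerate(upstreams):
        # name=proxy-<index> keeps names distinct even when two peers share an IP.
        lines.append(
            f"cache_peer {ip} parent {port} 0 no-query weighted-round-robin "
            f"weight=2 connect-fail-limit=2 allow-miss max-conn=5 name=proxy-{index}"
        )
    return lines


# Two hypothetical upstream proxies on the same IP but different ports.
peer_lines = cache_peer_lines([("192.0.2.10", 8080), ("192.0.2.10", 8081)])
for line in peer_lines:
    print(line)
```

The printed lines can be appended to squid.conf; because the index is part of the name, the "specified twice" error does not occur.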

Meaning of configuration keywords

The syntax is: cache_peer <server address> <server type> <HTTP port> <ICP port> [options]. The options include:

  • proxy-only: data fetched from this peer is not cached locally. By default, Squid does cache it;
  • weight=n: used when you have multiple peers. If more than one peer has the data you requested, Squid ranks them using each peer's ICP response time and its weight, and prefers the peer with the largest weight. In other words, the larger the weight, the higher the priority. You can also set the weight manually;
  • no-query: do not send ICP requests to this peer. Use this option when the peer does not answer ICP;
  • default: a bit like the default route in a routing table, this peer is used as a last resort. When you have only one parent proxy server and it does not support the ICP protocol, use default together with no-query so that all requests are sent to that parent;
  • login=user:password: use this option when the parent proxy server requires user authentication.

After the update is complete, save the file and restart Squid; the proxy should then be available.
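
Once Squid is running, a crawler only has to send its requests through it. Here is a minimal sketch using Python's standard urllib. It assumes Squid listens on 127.0.0.1:8080 (the http_port used in this article; the host is a guess), and the function names are made up for the example.

```python
import urllib.request


def squid_opener(host: str, port: int) -> urllib.request.OpenerDirector:
    """Return an opener that sends http and https requests through the proxy."""
    proxy_url = f"http://{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(opener: urllib.request.OpenerDirector, url: str, timeout: float = 10.0) -> bytes:
    """Fetch a URL through the given opener and return the raw response body."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()


opener = squid_opener("127.0.0.1", 8080)
# Example call (needs a running Squid, so it is left commented out):
# print(fetch(opener, "http://httpbin.org/ip"))
```

Fetching http://httpbin.org/ip several times is a simple way to check whether the outgoing IP changes between requests.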

Access control

Squid controls access with ACLs (access control lists). Some common rules are listed below.

acl denyip src  192.168.100.128/32 	# deny Internet access to the intranet host 192.168.100.128/32
http_access deny denyip

acl denyip src 192.168.100.128-192.168.100.132/255.255.255.255
http_access deny denyip

acl vip  arp  00:0C:29:79:0C:1A 
http_access allow  vip 

acl  baddsturl2  dst   220.11.22.33  # block the site at this external IP
http_access deny baddsturl2

acl  baddsturl  dstdomain -i  www.163.com  # block www.163.com and WWW.163.COM (-i makes the match case-insensitive); war.163.com and sports.163.com remain reachable
http_access deny baddsturl

acl  baddsturl  dstdom_regex -i  163	# block every domain containing 163; access by bare IP address still works
http_access deny   baddsturl

acl  baddsturl  dstdom_regex "/etc/squid/baddsturl"  # with many sites, list them in a file, one site to block per line
http_access deny baddsturl

acl baddsturl3  url_regex  -i  baidu   # deny any URL containing the keyword baidu
http_access deny baddsturl3

acl badfile  urlpath_regex -i \.mp3$ \.rmvb$ \.exe$ \.zip$ \.mp4$ \.avi$  \.rar$
http_access deny badfile	# block downloads of files with the listed extensions

acl badipclient2  src 192.168.100.0/255.255.255.0
acl worktime time  MTWHF 9:00-17:00
http_access deny badipclient2 worktime  # deny the 192.168.100.0 network Internet access during working hours

acl badipclient3  src 192.168.100.128
acl conn5  maxconn  5
http_access deny badipclient3 conn5	# limit this client to at most 5 connections
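
When several of these rules are combined, their order matters. Squid checks http_access lines from top to bottom, a line matches only if every ACL on it matches, and the first matching line decides. The sketch below models that evaluation in Python. It is a simplified illustration, not Squid's code; each ACL is reduced to a plain function and the request is a hypothetical dict.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]
Acl = Callable[[Request], bool]
Rule = Tuple[str, List[str]]  # ("allow" or "deny", ACL names; a leading "!" negates)


def http_access(rules: List[Rule], acls: Dict[str, Acl], request: Request) -> str:
    """Return "allow" or "deny" using first-match-wins evaluation."""
    for action, names in rules:
        # All ACLs named on one line must match (logical AND).
        if all(
            (not acls[name[1:]](request)) if name.startswith("!") else acls[name](request)
            for name in names
        ):
            return action
    if not rules:
        return "deny"
    # No line matched: the opposite of the last line's action applies.
    return "deny" if rules[-1][0] == "allow" else "allow"


acls: Dict[str, Acl] = {
    "denyip": lambda r: r["src"] == "192.168.100.128",
    "baddsturl": lambda r: r["host"].lower() == "www.163.com",
    "all": lambda r: True,
}
rules: List[Rule] = [("deny", ["denyip"]), ("deny", ["baddsturl"]), ("allow", ["all"])]

blocked_client = http_access(rules, acls, {"src": "192.168.100.128", "host": "example.com"})
blocked_site = http_access(rules, acls, {"src": "192.168.100.130", "host": "WWW.163.COM"})
allowed = http_access(rules, acls, {"src": "192.168.100.130", "host": "sports.163.com"})
```

Because the first match wins, an allow line placed above a deny line takes precedence for every request it matches.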

https://www.cnblogs.com/wangxiaoqiangs/p/5796597.html

initialization

After editing the configuration file, save it, then run the following command to initialize Squid:

squid -z

problem

TCP_MISS/503

Found the following content in the log

1587003941.248      0 172.25.0.1 TCP_MISS/503 4362 GET http://gtj.hangzhou.gov.cn/col/col1363087/index.html - HIER_NONE/- text/html
1587003942.505      0 172.25.0.1 TCP_MISS/503 4362 GET http://gtj.hangzhou.gov.cn/col/col1363087/index.html - HIER_NONE/- text/html
1587003943.779    301 172.25.0.1 TCP_MISS/200 388 GET http://httpbin.org/ip - HIER_DIRECT/34.230.193.231 application/json
1587003943.899      0 172.25.0.1 TCP_MISS/503 4357 GET http://gtj.hangzhou.gov.cn/col/col1363087/index.html - HIER_NONE/- text/html
1587003945.333      0 172.25.0.1 TCP_MISS/503 4362 GET http://gtj.hangzhou.gov.cn/col/col1363087/index.html - HIER_NONE/- text/html
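
Entries like these are easier to filter once they are split into fields. The sketch below parses one line of the log shown above, assuming the ten whitespace-separated fields seen in this sample; the field names are labels chosen for the example.

```python
from typing import NamedTuple


class AccessLogEntry(NamedTuple):
    timestamp: float   # seconds since the epoch
    elapsed_ms: int    # time spent handling the request
    client: str
    result_code: str   # e.g. TCP_MISS
    http_status: int   # e.g. 503
    size: int          # bytes sent to the client
    method: str
    url: str
    user: str
    hierarchy: str     # e.g. HIER_DIRECT or HIER_NONE
    peer: str          # upstream address, or "-" if none
    content_type: str


def parse_access_log_line(line: str) -> AccessLogEntry:
    """Split one access log line into named fields."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    result_code, status = fields[3].split("/", 1)
    hierarchy, peer = fields[8].split("/", 1)
    return AccessLogEntry(
        float(fields[0]), int(fields[1]), fields[2], result_code, int(status),
        int(fields[4]), fields[5], fields[6], fields[7], hierarchy, peer, fields[9],
    )


sample = ("1587003941.248      0 172.25.0.1 TCP_MISS/503 4362 GET "
          "http://gtj.hangzhou.gov.cn/col/col1363087/index.html - HIER_NONE/- text/html")
entry = parse_access_log_line(sample)
```

In the failing entries the hierarchy field is HIER_NONE with no peer address and the elapsed time is 0 ms, which suggests that Squid answered 503 without reaching an upstream server.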

The key string here is TCP_MISS/503.

A Google search turned up this thread: https://forums.freebsd.org/threads/34184/

Solution:

It turns out that IPv6 is not supported. Following the advice in that thread, add dns_v4_first on to /etc/squid/squid.conf

Then try again.

If it still does not work, change the system configuration directly.

Edit /etc/sysconfig/network:
set NETWORKING_IPV6=no

(Preferably reboot afterwards)

References

http://cn.linux.vbird.org/linux_server/0420squid.php#server_default

Proxy pool

https://github.com/AaronJny/open_proxy_pool

Configuration file updater

https://github.com/xNathan/squid_proxy_pool

Documentation for the above projects

https://xnathan.com/2017/03/01/squid-anony-proxy/

https://xnathan.com/2017/02/28/squid-proxy/

https://xnathan.com/2017/03/02/squid-proxy-pool/

Squid Official Manual

http://zyan.cc/book/squid/index.html

Reference example

https://rookiefly.cn/detail/192

Origin blog.csdn.net/jobbofhe/article/details/105561452