1. HTTP Traffic Resilience
1.1 Chaos Engineering and Fault Injection
In complex distributed systems, failures occur with far greater randomness and unpredictability.
As service-oriented architectures, microservices, and continuous integration become widespread, the barrier to rapid iteration keeps falling, while the effort required to keep such complex systems stable grows many times over.
- Distributed systems inherently involve a large number of interactions and dependencies, so points of failure keep appearing
- Human effort cannot change this. What we can do is identify, before such faults are triggered, as many as possible of the fragile links and components that could cause them, and harden those specifically, preventing failures and building a more resilient system. This is one of the reasons chaos engineering was born.
Chaos engineering is a method of understanding system behavior through empirical exploration: a methodology of running experiments against a system's infrastructure to proactively discover its weak points.
- Chaos engineering is the discipline of experimenting on distributed systems, aiming to improve fault tolerance and to build confidence in the system's ability to withstand unpredictable problems in production.
- Its value is that it surfaces the chaos and instability deeply rooted in complex systems, letting engineers understand these systemic phenomena more fully, engineer distributed systems better, and continuously improve resilience.
1.2 Examples of Fault Injection Inputs
- High CPU load
- High disk load: frequent disk reads and writes
- Disk space exhaustion
- Graceful application shutdown: stop the application smoothly with its stop script
- Forced shutdown by killing the process, which may cause data inconsistency
- Network corruption: randomly alter some packet data so the content is incorrect
- Network latency: delay packets by a value within a given range
- Network packet loss: construct a loss rate at which TCP does not fail completely
- Network black hole: ignore packets from a given IP
- External service unreachable: point the external service's domain name at the loopback address, or drop OUTPUT packets destined for the external service's port
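Several of the failure modes above (aborted calls, added latency) can be prototyped at the application level before touching any infrastructure tooling. The sketch below is an illustrative Python helper, not part of any chaos tool: it wraps a function so that calls are aborted or delayed with configurable probabilities.

```python
import random
import time


class InjectedFault(Exception):
    """Raised when a call is aborted by the fault injector."""


def with_faults(func, abort_rate=0.0, delay_rate=0.0, delay_s=0.1, rng=random):
    """Wrap func so calls are aborted or delayed with the given probabilities."""
    def wrapper(*args, **kwargs):
        if rng.random() < abort_rate:
            raise InjectedFault("call aborted by fault injection")
        if rng.random() < delay_rate:
            time.sleep(delay_s)  # simulated network latency
        return func(*args, **kwargs)
    return wrapper


# abort_rate=1.0 always aborts; the default abort_rate=0.0 never does
flaky = with_faults(lambda: "ok", abort_rate=1.0)
stable = with_faults(lambda: "ok")
```

Setting the rates to 0.0 or 1.0 makes the behavior deterministic, which is useful for testing the calling code's error handling.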
1.3 The HTTP Fault Injection Filter
Fault injection in Envoy is implemented much like redirects, rewrites, and retries: by modifying the content of HTTP requests or responses.
- It is implemented by a dedicated fault-injection filter (envoy.filters.http.fault), used to test the resilience of microservices against different forms of failure
- It injects delays and request aborts with user-specified error codes, simulating failure scenarios at different stages
- The faults are limited to what an application communicating over the network can observe; simulating CPU or disk failures on the local host is not supported
- Injecting delays
  {
    "fixed_delay": "{...}",   # duration; adds a fixed delay before forwarding the request to the upstream host
    "header_delay": "{...}",  # fault delays controlled via HTTP headers
    "percentage": "{...}"     # percentage of operations/connections/requests on which the delay is injected
  }
- Injecting request aborts
  {
    "http_status": "...",     # HTTP status code used to abort the request; exactly one of http_status, grpc_status and header_abort is required
    "grpc_status": "...",     # status code used to abort gRPC requests
    "header_abort": "{...}",  # aborts controlled via HTTP headers
    "percentage": "{...}"     # percentage of requests/operations/connections aborted with the given error code
  }
- Rate-limiting the response body
  {
    "fixed_limit": "{'limit_kbps': ...}",  # fixed rate limit in KiB/s
    "header_limit": "{...}",               # rate limit specified via an HTTP header
    "percentage": "{...}"                  # percentage of operations/connections/requests on which the rate limit is injected
  }
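All three fault types share the percentage field (Envoy's type.v3.FractionalPercent), whose denominator is one of HUNDRED, TEN_THOUSAND, or MILLION. A minimal sketch of the gating decision, assuming uniform random sampling (Envoy's actual implementation can additionally read the fraction from runtime keys):

```python
import random

# Denominators defined by Envoy's FractionalPercent type
DENOMINATORS = {"HUNDRED": 100, "TEN_THOUSAND": 10_000, "MILLION": 1_000_000}


def should_inject(numerator, denominator="HUNDRED", rng=random):
    """Return True if this request falls inside the configured fraction."""
    total = DENOMINATORS[denominator]
    return rng.randrange(total) < numerator
```

With numerator 10 and denominator HUNDRED, roughly one request in ten is selected, matching the 10% configurations used in the experiments below.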
2. Fault Injection Experiments
2.1 docker-compose
Four services:
- envoy: the front proxy, at 172.31.62.10
- three backend services:
  - service_blue: maps to the blue_abort cluster in Envoy and carries the abort fault-injection configuration
  - service_red: maps to the red_delay cluster in Envoy and carries the delay fault-injection configuration
  - service_green: maps to the green cluster in Envoy
version: '3.3'

services:
  envoy:
    image: envoyproxy/envoy-alpine:v1.21.5
    environment:
      - ENVOY_UID=0
      - ENVOY_GID=0
    volumes:
      - ./front-envoy.yaml:/etc/envoy/envoy.yaml
    networks:
      envoymesh:
        ipv4_address: 172.31.62.10
        aliases:
          - front-proxy
    expose:
      # Expose ports 80 (for general traffic) and 9901 (for the admin server)
      - "80"
      - "9901"

  service_blue:
    image: ikubernetes/servicemesh-app:latest
    volumes:
      - ./service-envoy-fault-injection-abort.yaml:/etc/envoy/envoy.yaml
    networks:
      envoymesh:
        aliases:
          - service_blue
          - colored
    environment:
      - SERVICE_NAME=blue
    expose:
      - "80"

  service_green:
    image: ikubernetes/servicemesh-app:latest
    networks:
      envoymesh:
        aliases:
          - service_green
          - colored
    environment:
      - SERVICE_NAME=green
    expose:
      - "80"

  service_red:
    image: ikubernetes/servicemesh-app:latest
    volumes:
      - ./service-envoy-fault-injection-delay.yaml:/etc/envoy/envoy.yaml
    networks:
      envoymesh:
        aliases:
          - service_red
          - colored
    environment:
      - SERVICE_NAME=red
    expose:
      - "80"

networks:
  envoymesh:
    driver: bridge
    ipam:
      config:
        - subnet: 172.31.62.0/24
2.2 envoy.yaml
The listener defines four routes, one per URL prefix:
admin:
  profile_path: /tmp/envoy.prof
  access_log_path: /tmp/admin_access.log
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901

layered_runtime:
  layers:
    - name: admin
      admin_layer: {}

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 80 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend
                      domains:
                        - "*"
                      routes:
                        - match:
                            prefix: "/service/blue"
                          route:
                            cluster: blue_abort
                        - match:
                            prefix: "/service/red"
                          route:
                            cluster: red_delay
                        - match:
                            prefix: "/service/green"
                          route:
                            cluster: green
                        - match:
                            prefix: "/service/colors"
                          route:
                            cluster: mycluster
                http_filters:
                  - name: envoy.filters.http.router

  clusters:
    - name: red_delay
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: red_delay
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: service_red
                      port_value: 80
    - name: blue_abort
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: blue_abort
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: service_blue
                      port_value: 80
    - name: green
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: green
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: service_green
                      port_value: 80
    - name: mycluster
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: mycluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: colored
                      port_value: 80
2.3 abort configuration
Requests are aborted with a 503 at 10% probability:
http_filters:
  - name: envoy.filters.http.fault
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault
      max_active_faults: 100
      abort:
        http_status: 503
        percentage:
          numerator: 10
          denominator: HUNDRED
2.4 delay configuration
A 10s delay is injected with 10% probability:
http_filters:
  - name: envoy.filters.http.fault
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault
      max_active_faults: 100
      delay:
        fixed_delay: 10s
        percentage:
          numerator: 10
          denominator: HUNDRED
2.5 Testing
2.5.1 Testing delay
When a request falls into the 10% bucket, it is delayed by 10s:
# curl -w "@curl_format.txt" -o /dev/null -s "http://172.31.62.10/service/red"
time_namelookup: 0.000020
time_connect: 0.000162
time_appconnect: 0.000000
time_pretransfer: 0.000196
time_redirect: 0.000000
time_starttransfer: 0.002587
----------
time_total: 0.002623
# curl -w "@curl_format.txt" -o /dev/null -s "http://172.31.62.10/service/red"
time_namelookup: 0.000022
time_connect: 0.000179
time_appconnect: 0.000000
time_pretransfer: 0.000213
time_redirect: 0.000000
time_starttransfer: 10.009525
----------
time_total: 10.009569
2.5.2 Testing abort
Requests receive a 503 with 10% probability:
# curl -o /dev/null -w '%{http_code}\n' -s "http://172.31.62.10/service/blue"
200
# curl -o /dev/null -w '%{http_code}\n' -s "http://172.31.62.10/service/blue"
503
# curl -o /dev/null -w '%{http_code}\n' -s "http://172.31.62.10/service/blue"
200
# The backend itself returned 200 for every request
service_blue_1 | 127.0.0.1 - - [08/Oct/2022 06:59:34] "GET /service/blue HTTP/1.1" 200 -
service_blue_1 | 127.0.0.1 - - [08/Oct/2022 06:59:36] "GET /service/blue HTTP/1.1" 200 -
service_blue_1 | 127.0.0.1 - - [08/Oct/2022 06:59:37] "GET /service/blue HTTP/1.1" 200 -
2.5.3 Testing green
Since green has no fault injection configured, every request succeeds:
# curl -o /dev/null -w '%{http_code}\n' -s "http://172.31.62.10/service/green"
200
# curl -o /dev/null -w '%{http_code}\n' -s "http://172.31.62.10/service/green"
200
# curl -o /dev/null -w '%{http_code}\n' -s "http://172.31.62.10/service/green"
200
3. Handling Partial Failures
- retry: in a distributed environment, calls to remote resources and services can fail because of transient faults (faults that resolve themselves within a short time). A retry mechanism usually handles this class of problem.
  - Common transient faults include slow network connections, timeouts, and resources being over-committed or temporarily unavailable
  - Three strategies for handling transient faults:
    - retry immediately
    - retry after a delay
    - cancel
- timeout: some failures are caused by unexpected events and may take much longer to recover from
  - The severity of such failures ranges from partial loss of connectivity to complete service failure
  - Continuous retries and long waits make little sense in these scenarios
  - The application should quickly accept that the operation has failed and deal with that failure proactively
  - The operation calling the service can be configured with a timeout: if the service does not respond within that period, the call is answered with a failure message.
- circuit breaker: when a service is very busy, a failure in one part of the system may trigger cascading failures
  - A naive timeout policy can leave many concurrent requests for the same operation blocked until the timeout expires
  - These blocked requests may hold critical system resources
  - Exhausting those resources can break other, possibly unrelated, parts of the system that depend on them
  - In such cases it is better to fail the operation immediately, and only attempt to call the service again when it is likely to succeed
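The circuit-breaker behavior described above can be sketched in a few lines of Python. This is a toy model with only closed and open states; production implementations (and Envoy's own circuit breaking, which bounds connections and requests rather than counting errors) are more involved:

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds before allowing a new attempt
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpen("failing fast; not calling the service")
            # window elapsed: close the breaker and allow a fresh attempt
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Once the failure threshold is reached, callers get an immediate CircuitOpen instead of blocking on a service that is known to be failing.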
3.1 Considerations for Request Retries
- The retry policy should match the application's business requirements and the nature of the fault. For some non-critical operations, failing fast is better than retrying several times and hurting application throughput.
- If a request still fails after many retries, it is better to prevent further requests from reaching the same service and report the failure immediately
- Whether the operation is idempotent must also be considered
- Requests can fail for many reasons, each of which may raise a different exception; the retry policy should adjust the interval between retries according to the exception type
- Make sure all retry code is thoroughly tested against a variety of failure conditions: that it does not seriously degrade application performance or reliability, does not put excessive load on services and resources, and does not introduce race conditions or bottlenecks.
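These considerations can be reflected in a small, generic retry helper: it retries only a whitelisted class of transient errors, bounds the number of attempts, and backs off exponentially between them (an illustrative sketch, not Envoy code):

```python
import time


def retry(func, attempts=3, base_delay=0.025, retry_on=(ConnectionError,)):
    """Call func, retrying only on transient errors, with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))  # 25ms, 50ms, 100ms, ...
```

Restricting retry_on to known-transient exception types avoids retrying non-idempotent or permanently failing operations.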
3.2 HTTP Request Retries (route.RetryPolicy)
retry_policy: {
  "retry_on": "...",                           # conditions that trigger a retry; same function as the x-envoy-retry-on and x-envoy-retry-grpc-on headers
  "num_retries": "{...}",                      # number of retries, default 1; same function as the x-envoy-max-retries header, with the larger of the two values taking effect
  "per_try_timeout": "{...}",                  # timeout applied to each individual attempt against the upstream endpoint
  "retry_priority": "{...}",                   # retry priority policy, used to distribute load across priorities during retries
  "retry_host_predicate": [],                  # list of host predicates consulted when selecting a retry host; if any predicate rejects the host, another host is selected
  "retry_options_predicates": [],              # retry options predicates
  "host_selection_retry_max_attempts": "...",  # maximum number of times host selection may be retried, default 1
  "retriable_status_codes": [],                # HTTP status codes that trigger a retry, in addition to the conditions given in retry_on
  "retry_back_off": "{...}",                   # parameters controlling the backoff algorithm; the default base interval is 25ms; given base interval B and retry number N, the backoff range is [0, (2^N - 1)B), and the max interval defaults to 10 times the base interval (250ms)
  "rate_limited_retry_back_off": "{...}",      # parameters controlling the backoff policy used when the upstream signals rate limiting
  "retriable_headers": [],                     # list of HTTP response headers that trigger a retry when the upstream response matches any of them
  "retriable_request_headers": []              # list of HTTP headers that must be present in a request for retries to apply
}
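The retry_back_off defaults can be made concrete. With base interval B = 25ms, the Nth retry waits a random duration drawn from [0, (2^N - 1)B), and the upper edge is capped by the max interval (10×B = 250ms by default). A sketch of the window per retry, under those defaults:

```python
def backoff_window_ms(retry_number, base_ms=25, max_ms=250):
    """Upper bound (exclusive) of the backoff range for the Nth retry."""
    return min((2 ** retry_number - 1) * base_ms, max_ms)


for n in range(1, 6):
    print(n, backoff_window_ms(n))  # windows: 25, 75, 175, 250, 250
```

The cap kicks in from the fourth retry onward, keeping worst-case added latency bounded.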
3.3 HTTP Retry Conditions (route.RetryPolicy)
- Retry conditions (same as the x-envoy-retry-on header)
  - 5xx: the upstream host returns a 5xx response code, or does not respond at all (disconnect/reset/read timeout)
  - gateway-error: like the 5xx policy, but retries only on 502, 503 and 504 responses
  - connection-failure: retry when establishing the TCP connection to the upstream service fails
  - retriable-4xx: retry when the upstream server returns a retriable 4xx response code
  - refused-stream: retry when the upstream server resets the stream with a REFUSED_STREAM error code
  - retriable-status-codes: retry when the upstream response code matches one defined in the retry policy or in the x-envoy-retriable-status-codes header
  - reset: Envoy retries when the upstream host does not respond at all (disconnect/reset/read timeout)
  - retriable-headers: Envoy attempts a retry if the upstream response matches any header listed in the retry policy or in the x-envoy-retriable-header-names header
  - envoy-ratelimited: retry when the x-envoy-ratelimited header is present
- Retry conditions 2 (same as the x-envoy-retry-grpc-on header)
  - cancelled: retry when the status code in the gRPC response headers is "cancelled"
  - deadline-exceeded: retry when the status code in the gRPC response headers is "deadline-exceeded"
  - internal: retry when the status code in the gRPC response headers is "internal"
  - resource-exhausted: retry when the status code in the gRPC response headers is "resource-exhausted"
  - unavailable: retry when the status code in the gRPC response headers is "unavailable"
- By default, Envoy performs no retries of any kind unless they are explicitly configured
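A simplified model of how a few of these conditions map a response status code to a retry decision (illustrative only; Envoy evaluates the full condition set internally):

```python
def should_retry(status, retry_on, retriable_status_codes=()):
    """Decide whether a response status triggers a retry for one condition."""
    if retry_on == "5xx":
        return 500 <= status <= 599
    if retry_on == "gateway-error":
        return status in (502, 503, 504)
    if retry_on == "retriable-status-codes":
        return status in retriable_status_codes
    return False
```

Note the asymmetry the list above describes: a 500 is retried under 5xx but not under gateway-error.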
4. Request Retry Experiments
4.1 docker-compose
Four services:
- envoy: the front proxy, at 172.31.65.10
- three backend services:
  - service_blue: maps to the blue_abort cluster in Envoy, with the abort fault-injection configuration, at 172.31.65.5
  - service_red: maps to the red_delay cluster in Envoy, with the delay fault-injection configuration, at 172.31.65.7
  - service_green: maps to the green cluster in Envoy, at 172.31.65.6
version: '3.3'

services:
  envoy:
    image: envoyproxy/envoy-alpine:v1.21.5
    environment:
      - ENVOY_UID=0
      - ENVOY_GID=0
    volumes:
      - ./front-envoy.yaml:/etc/envoy/envoy.yaml
    networks:
      envoymesh:
        ipv4_address: 172.31.65.10
        aliases:
          - front-proxy
    expose:
      # Expose ports 80 (for general traffic) and 9901 (for the admin server)
      - "80"
      - "9901"

  service_blue:
    image: ikubernetes/servicemesh-app:latest
    volumes:
      - ./service-envoy-fault-injection-abort.yaml:/etc/envoy/envoy.yaml
    networks:
      envoymesh:
        ipv4_address: 172.31.65.5
        aliases:
          - service_blue
          - colored
    environment:
      - SERVICE_NAME=blue
    expose:
      - "80"

  service_green:
    image: ikubernetes/servicemesh-app:latest
    networks:
      envoymesh:
        ipv4_address: 172.31.65.6
        aliases:
          - service_green
          - colored
    environment:
      - SERVICE_NAME=green
    expose:
      - "80"

  service_red:
    image: ikubernetes/servicemesh-app:latest
    volumes:
      - ./service-envoy-fault-injection-delay.yaml:/etc/envoy/envoy.yaml
    networks:
      envoymesh:
        ipv4_address: 172.31.65.7
        aliases:
          - service_red
          - colored
    environment:
      - SERVICE_NAME=red
    expose:
      - "80"

networks:
  envoymesh:
    driver: bridge
    ipam:
      config:
        - subnet: 172.31.65.0/24
4.2 envoy.yaml
/service/blue: retry up to 3 times on 5xx responses
/service/red: 1s route timeout
/service/colors: both a 5xx retry policy and a 1s timeout
admin:
  profile_path: /tmp/envoy.prof
  access_log_path: /tmp/admin_access.log
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901

layered_runtime:
  layers:
    - name: admin
      admin_layer: {}

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 80 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend
                      domains:
                        - "*"
                      routes:
                        - match:
                            prefix: "/service/blue"
                          route:
                            cluster: blue_abort
                            retry_policy:
                              retry_on: "5xx"
                              num_retries: 3
                        - match:
                            prefix: "/service/red"
                          route:
                            cluster: red_delay
                            timeout: 1s
                        - match:
                            prefix: "/service/green"
                          route:
                            cluster: green
                        - match:
                            prefix: "/service/colors"
                          route:
                            cluster: mycluster
                            retry_policy:
                              retry_on: "5xx"
                              num_retries: 3
                            timeout: 1s
                http_filters:
                  - name: envoy.filters.http.router

  clusters:
    - name: red_delay
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: red_delay
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: service_red
                      port_value: 80
    - name: blue_abort
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: blue_abort
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: service_blue
                      port_value: 80
    - name: green
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: green
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: service_green
                      port_value: 80
    - name: mycluster
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: mycluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: colored
                      port_value: 80
4.3 abort configuration
Requests receive a 503 with 50% probability:
- name: envoy.filters.http.fault
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault
    max_active_faults: 100
    abort:
      http_status: 503
      percentage:
        numerator: 50
        denominator: HUNDRED
4.4 delay configuration
A 10s delay is injected with 50% probability:
http_filters:
  - name: envoy.filters.http.fault
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault
      max_active_faults: 100
      delay:
        fixed_delay: 10s
        percentage:
          numerator: 50
          denominator: HUNDRED
  - name: envoy.filters.http.router
4.5 Testing
The 1s route timeout prevents long blocking: a request that hits the injected 10s delay is cut off after about one second instead:
# curl -w"@curl_format.txt" -o /dev/null -s "http://172.31.65.10/service/red"
time_namelookup: 0.000019
time_connect: 0.000198
time_appconnect: 0.000000
time_pretransfer: 0.000231
time_redirect: 0.000000
time_starttransfer: 0.002555
----------
time_total: 0.002587
# curl -w"@curl_format.txt" -o /dev/null -s "http://172.31.65.10/service/red"
time_namelookup: 0.000022
time_connect: 0.000278
time_appconnect: 0.000000
time_pretransfer: 0.000658
time_redirect: 0.000000
time_starttransfer: 1.001931
----------
time_total: 1.001986
The retries greatly reduce the 50% rate of 503s seen by the client.
With the initial attempt plus 3 retries, each aborted independently with 50% probability, the client-visible failure rate drops in theory to 0.5^4 ≈ 6.25%, which matches the sparse 503s in the output below.
# ./send-requests.sh 172.31.65.10/service/blue 100
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
503
200
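The drop can be checked with a little arithmetic: num_retries: 3 allows up to 3 retries in addition to the initial attempt, so a client-visible 503 requires four independent 50% aborts in a row.

```python
def visible_failure_rate(p_fault, num_retries):
    """Probability that the initial attempt and every retry all hit the abort."""
    return p_fault ** (num_retries + 1)


print(visible_failure_rate(0.5, 3))  # 0.0625
```

With no retries the client would see the raw 50% injection rate; three retries reduce it to 6.25% in expectation.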
Requests to /service/colors are distributed across all three backends (via the colored alias), so injected aborts and timeouts can still surface:
# ./send-requests.sh 172.31.65.10/service/colors 100
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
504   # gateway timeout: the red backend's injected delay exceeded the 1s route timeout
200
200