Exploring and Practicing Link Monitoring at a Travel Website, Part 2: Preliminary Trials and Mine Detection

  Original article; please credit the source when reposting. My knowledge is limited, so errors and omissions are inevitable, and corrections are welcome. Because there is a lot of material, the series is split into four parts. This is the second.

  After the comparison and analysis in the previous part, we chose SkyWalking (hereinafter SW) for link monitoring. SW is an open-source system, and in our experience open-source systems at this level usually contain some mines. If you are not careful and step on one, every associated system may be affected, and in serious cases you will regret having adopted the system at all. With a distributed tracing system the blast is worse, because it is not deployed independently: the agent is embedded in the container of every system instance, so when something goes wrong it directly drags down the other services in that container. Following the principle of careful verification, before formally putting SW into use we deployed it briefly in a test environment and tried connecting a first system. This helps us estimate data volume, cluster configuration, and manpower, and it also lets us probe for mines and find problems early, which makes risk control easier in the verification and follow-up work. The whole deployment is demonstrated on macOS Mojave; there may be subtle differences on Linux.

First, some background
  • Machine: Because resources are tight and approval is strict, we got only one virtual machine.
    • Hardware [VM]: 4C / 8GB / 50GB [HDD] / 100Mbps [LAN]
    • Software [64-bit]: CentOS [7.4] / JDK [1.8.0_91] / Docker [18.09.7] / Docker Compose [1.18.0]
  • CD tool: the company's in-house system, operated entirely through a GUI; we call it tard here. It provides container management, Docker image configuration, environment variable configuration, packaging, bringing services online and offline, and other functions.
  • SkyWalking: the official Apache Docker images, version 6.3.0. Tip: we recommend 6.3.0 or later, because lower versions contain a mine that we discuss in detail later.

  Given the need for a simple, quick deployment, our first thought was Docker. SW provides official Docker images, so we can use them with confidence and save the time of building our own. This time we mainly use Docker and Docker Compose to deploy SW, covering service orchestration and container release and management. Readers need a basic understanding of Docker commands. If you have not used Docker, we strongly recommend learning it: it is easy to use, hard to do without once you have used it, and it broadens how you think about deployment.

Deploying the SkyWalking services
  1. We have only one machine with only 8 GB of memory, so at first glance only a standalone deployment seems possible, but we still wanted to deploy SW as a cluster. A cluster simulates the production environment most realistically and is more likely than a standalone setup to expose problems, and nothing says a single machine cannot host a cluster. So rather than agonizing over it, we first counted how many instances the cluster and standalone setups each require, to see whether running a cluster was feasible.

    Configuration        skywalking-ui  skywalking-oap-server  ZooKeeper  Elasticsearch  Total instances
    Minimum cluster      2              2                      3          2              9
    Minimum standalone   1              1                      0          1              3
  2. Combining the instance counts with the approximate memory each instance needs, the gap between cluster and standalone is large. Even if the cluster started correctly, performance would be poor, and under heavy read/write pressure the probability of running out of memory would be high, so the standalone version looks like the better fit. There is a twist, however. Looking more closely, among these components only Elasticsearch (hereinafter ES) has relatively high memory requirements and consumes the most memory. We have already verified the read/write performance and stability of ES clusters in day-to-day use, so we can skip testing ES here and focus on verifying the SW cluster. That lets us mix and match: an SW cluster plus a standalone ES, which reduces memory pressure on the virtual machine while still giving us a cluster deployment. Later testing showed that this hybrid approach is feasible and meets our needs. The final deployment configuration is as follows:

    Configuration                 skywalking-ui  skywalking-oap-server  ZooKeeper  Elasticsearch  Total instances
    Cluster + standalone hybrid   2              2                      3          1              8
  3. Deploying from scratch involves many steps. We omit the virtual machine's network configuration, firewall configuration, operating-system-level file descriptor (FD) limits, and similar operations, as well as installing the JDK, Docker, and Docker Compose; please install those yourself. We go straight to the crucial part.

    • Download the image files in advance (this step can be skipped)

      # We use SW version 6.3.0; Figure 1 shows the download.
      macos$ docker pull apache/skywalking-ui:6.3.0
      macos$ docker pull apache/skywalking-oap-server:6.3.0

      Figure 1: Downloading the images with docker

      # Which ZooKeeper version should we use?
      # The skywalking-oap-server:6.3.0 image contains an oap-lib directory that holds all of its dependencies.
      # The ZooKeeper client version it uses is 3.4.10, as shown in Figure 2.
      # So the ZooKeeper service should preferably run 3.4.10 as well.
      macos$ docker pull zookeeper:3.4.10

      Figure 2: ZooKeeper version that skywalking-oap-server:6.3.0 depends on

      # SW 6.3.0 is only compatible with ES 6.3.2 or later, as shown in Figure 3.
      # Our operations team runs 6.4.2, so it is best to match them.
      macos$ docker pull elasticsearch:6.4.2

      Figure 3: ES version requirements on the SkyWalking official website

      # After the downloads finish, you should see the four images shown in Figure 4.
      macos$ docker images

      Figure 4: Viewing the local images

    • Orchestrate the services with Docker Compose
      • The docker-compose.yml configuration file we use

        version: "3"
        services:
          elasticsearch:
            image: elasticsearch:6.4.2
            container_name: elasticsearch
            restart: always
            ports:
              - 9200:9200
              - 9300:9300
            ulimits:
              memlock:
                soft: -1
                hard: -1
            environment:
              - discovery.type=single-node
              - bootstrap.memory_lock=true
              - "ES_JAVA_OPTS=-Xms1536m -Xmx1536m -Xmn512m -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=256m"
              - TZ=Asia/Shanghai
          oap1:
            image: apache/skywalking-oap-server:6.3.0
            container_name: oap1
            depends_on:
              - elasticsearch
              - zk1
              - zk2
              - zk3
            links:
              - elasticsearch
              - zk1
              - zk2
              - zk3
            restart: always
            ports:
              - 11801:11800
              - 12801:12800
            environment:
              SW_CLUSTER: zookeeper
              SW_CLUSTER_ZK_HOST_PORT: zk1:2181,zk2:2181,zk3:2181
              SW_STORAGE: elasticsearch
              SW_STORAGE_ES_CLUSTER_NODES: elasticsearch:9200
              SW_STORAGE_ES_INDEX_SHARDS_NUMBER: 1
              SW_STORAGE_ES_INDEX_REPLICAS_NUMBER: 0
              TZ: Asia/Shanghai
          oap2:
            image: apache/skywalking-oap-server:6.3.0
            container_name: oap2
            depends_on:
              - elasticsearch
              - zk1
              - zk2
              - zk3
            links:
              - elasticsearch
              - zk1
              - zk2
              - zk3
            restart: always
            ports:
              - 11802:11800
              - 12802:12800
            environment:
              SW_CLUSTER: zookeeper
              SW_CLUSTER_ZK_HOST_PORT: zk1:2181,zk2:2181,zk3:2181
              SW_STORAGE: elasticsearch
              SW_STORAGE_ES_CLUSTER_NODES: elasticsearch:9200
              SW_STORAGE_ES_INDEX_SHARDS_NUMBER: 1
              SW_STORAGE_ES_INDEX_REPLICAS_NUMBER: 0
              TZ: Asia/Shanghai
          ui1:
            image: apache/skywalking-ui:6.3.0
            container_name: ui1
            depends_on:
              - oap1
              - oap2
            links:
              - oap1
              - oap2
            restart: always
            ports:
              - 8081:8080
            environment:
              SW_OAP_ADDRESS: oap1:12800,oap2:12800
              TZ: Asia/Shanghai
          ui2:
            image: apache/skywalking-ui:6.3.0
            container_name: ui2
            depends_on:
              - oap1
              - oap2
            links:
              - oap1
              - oap2
            restart: always
            ports:
              - 8082:8080
            environment:
              SW_OAP_ADDRESS: oap1:12800,oap2:12800
              TZ: Asia/Shanghai
          zk1:
            image: zookeeper:3.4.10
            restart: always
            container_name: zk1
            ports:
              - 2181:2181
            environment:
              ZOO_MY_ID: 1
              ZOO_SERVERS: server.1=zk1:2888:3888 server.2=zk2:2888:3888 server.3=zk3:2888:3888
          zk2:
            image: zookeeper:3.4.10
            restart: always
            container_name: zk2
            ports:
              - 2182:2181
            environment:
              ZOO_MY_ID: 2
              ZOO_SERVERS: server.1=zk1:2888:3888 server.2=zk2:2888:3888 server.3=zk3:2888:3888
          zk3:
            image: zookeeper:3.4.10
            restart: always
            container_name: zk3
            ports:
              - 2183:2181
            environment:
              ZOO_MY_ID: 3
              ZOO_SERVERS: server.1=zk1:2888:3888 server.2=zk2:2888:3888 server.3=zk3:2888:3888
        

      • Create the containers and start the services

        # Specify the path to docker-compose.yml and start all services in the background.
        # As shown in Figure 5, the container creation results are displayed.
        macos$ docker-compose -f ./docker-compose.yml up -d

        Figure 5: Starting all services

        # Check whether the containers started successfully, as shown in Figure 6.
        macos$ docker ps

        Figure 6: Viewing the container status

    • At this point the SW cluster is deployed. You may notice that the oap1 and oap2 containers restart repeatedly while the cluster is starting up. Do not panic; you did nothing wrong. This is normal, and they stabilize after a short wait.
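
Because oap1 and oap2 may restart a few times before settling, it can help to check the `docker ps` output mechanically rather than by eye. The following is only a minimal sketch and not part of the original setup: the helper name `not_running` is hypothetical, the container names are the ones from the docker-compose.yml above, and it assumes that the status of a running container starts with "Up".

```shell
#!/bin/sh
# Container names defined in the docker-compose.yml above.
EXPECTED="elasticsearch oap1 oap2 ui1 ui2 zk1 zk2 zk3"

# not_running (hypothetical helper): print every expected container that is
# not reported as "Up".
# $1: lines of "<name> <status>", such as the output of
#     docker ps -a --format '{{.Names}} {{.Status}}'
not_running() {
  for name in $EXPECTED; do
    printf '%s\n' "$1" | grep -q "^$name Up " || echo "$name"
  done
}

# Typical use on the Docker host (commented out so the sketch has no side effects):
#   pending="$(not_running "$(docker ps -a --format '{{.Names}} {{.Status}}')")"
#   [ -z "$pending" ] && echo "all containers are up" || echo "not up yet: $pending"
```

Running the commented-out lines every few seconds until nothing is pending is one simple way to wait for the cluster to stabilize.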

Deploying the SkyWalking Agent
  1. We recommend downloading the Agent installation package from the SkyWalking official website; the download speed is good. Download the package for your operating system, as shown in Figure 7.
    Figure 7: SW installation packages; choose one of the two

  2. The package we downloaded is a complete all-in-one package. After unpacking it we see three components: the Agent, the OAP Server, and the Web UI. Since we only need the Agent, we extract just the agent folder, as shown in Figure 8.
    Figure 8: The agent folder extracted from the installation package

  3. With the Agent in hand, the next step is to deploy it into the same container as our system. Day to day we use some CI tools together with the CD tool tard to deploy systems to a private cloud. This time we again use tard to demonstrate the Agent deployment process, rather than native Docker.
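    • At this point some readers may be puzzled, especially those who rarely use Docker and know little about deployment processes. They may wonder: "Readers want to follow the code and commands in the article to try a deployment themselves. tard is your company's internal system and we cannot use it. Wouldn't plain Docker be better?"

    • Before answering, a little background. Like many companies, we have a standard packaging and release process, and tard has its own usage rules, which I once went over briefly with the operations team. For security reasons, tard requires every system to use the company's unified base images, or images built on top of them, and it also prohibits many docker commands. tard also has some functions that are still incomplete. For example, when creating a Docker image, the base image only supports unpacking War packages and no other archive formats, so tard only supports adding War packages; and the only command that can run at container startup is launching Tomcat. (PS: tard leaves an opening for some of these functions. If a project team has special needs, it can go through the operations department's technical support process, and the operations team will look at whether the request can be handled. Some restrictions can be lifted and some cannot for now; I have not looked into the details.)

    • So as far as the release process goes, we must release through tard. Demonstrating with tard gives colleagues in the company a reference, and it also reproduces the problems we ran into and how we thought about solving them, which I think is worth sharing. Deploying the Agent under tard's current restrictions is somewhat difficult, so you could call this a mine too, but problems like these push us to think, which is what I find fun about it; perhaps that is just my own odd taste. tard is also very similar to Docker, so a little Docker background is all you need. That is why we chose tard for the demonstration. The main point here is to understand what needs to be done; the specific tool used to do it can be anything.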
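  4. To reach the goal stated earlier, where onboarding involves no code intrusion and requires only a simple change to the image configuration, we need to build the Agent into the base image. Using the Agent then only requires changing the base image that the released system depends on. The usual approach is to compress the agent folder into a tar or zip package, upload it, and unpack it inside the image. But tard only supports adding War packages. What now? This is a problem we ran into; please think about it before reading on.

  5. What to do when tard only supports adding War packages? Simple: package it as a War. Many people will find this awkward and unconventional, especially those who are fastidious about technical cleanliness. My advice is not to overthink it: whatever solves the problem is a good approach. The rules are set, and since we cannot break them, let us think about how to use them to solve the problem. Since War packages are familiar to developers, we can simply follow the regular development workflow.
    • Since we will most likely build the base image only once, we keep it as simple as possible. To avoid the approval process for creating a new code repository, it is best to use an existing web system and reuse its packaging configuration. So we created a branch named sw-agent from an existing system's git repository.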

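    • Keep the Maven configuration that packages with the maven-assembly-plugin plugin and delete all other unrelated code. Add the agent folder from the SW installation package to the project, then commit and push to the repository, as shown in Figure 9.
      Figure 9: Adding the SW agent folder to the branch

    • Create a packaging configuration for the new branch, specifying the branch to compile and package and the path where the War package is generated. Combined with tard's GUI operations, this lets us compile, package, and upload with one click.
      Figure 10: Branch packaging configuration and War package configuration
  6. Once we have sw-agent.war, we can start building the base image.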

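    # Dockerfile
    FROM <the company's unified base image, JDK 8 + Tomcat 8 version>
    ADD sw-agent.war /opt
    WORKDIR /opt
    # After unpacking, delete the War package and the extra META-INF files; the result is the same as using a tar or zip package
    RUN jar xf sw-agent.war && rm -rf sw-agent.war META-INF
    
    # For now we name the resulting image skywalking-agent-6.3:1.0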
  7. Change the base image of the released system, then add the agent to Tomcat's runtime parameters.

    # Change the base image to skywalking-agent-6.3:1.0
    FROM skywalking-agent-6.3:1.0
    
    (omitted) ......
    # Modify Tomcat's environment configuration script setenv.sh
    (omitted) ......
    
    # Add this script snippet
    AGENT_BOOTSTRAP="/opt/skywalking/agent/skywalking-agent.jar"
    if [ -f "$AGENT_BOOTSTRAP" ]; then
        # SW_AGENT_NAME: name of the system we want to trace.
        # SW_AGENT_COLLECTOR_BACKEND_SERVICES: addresses and ports of the SW OAP services.
        CATALINA_OPTS="$CATALINA_OPTS -DSW_AGENT_NAME=${SW_AGENT_NAME} -DSW_AGENT_COLLECTOR_BACKEND_SERVICES=${SW_AGENT_COLLECTOR_BACKEND_SERVICES} -javaagent:$AGENT_BOOTSTRAP"
    fi
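  8. Now we can start the system and verify that onboarding succeeded.
    • Before the system starts, we need to assign values to the two runtime variables left open above. I personally prefer configuring systems through environment variables: it is simple and easy to change. The configuration is as follows.

      # Name of the tracked system; must be unique.
      SW_AGENT_NAME=ATS-WZ
      
      # The Agent must report the data it produces to the service instances we deployed, oap1 and oap2.
      # We found a mine here during use; noting it for later.
      SW_AGENT_COLLECTOR_BACKEND_SERVICES=127.0.0.1:11801,127.0.0.1:11802
    • After the instance starts successfully, open the SW home page [ http://127.0.0.1:8081, http://127.0.0.1:8082 ]. The UI text is in English by default; you can switch languages with the Chinese/English toggle in the lower right corner. If the system name we configured, ATS-WZ, appears under the current services, onboarding succeeded, as shown in Figure 11. With that, the SkyWalking Agent deployment is complete.
      Figure 11: SW can identify the tracked system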
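
Since setenv.sh passes both variables straight into CATALINA_OPTS, an unset variable would silently produce an empty `-D` option. One way to catch that before startup is sketched below. It is an illustration, not something tard or SW provides: the helper name `missing_agent_env` is hypothetical, and only the two variable names and the sample values come from this article.

```shell
#!/bin/sh
# missing_agent_env (hypothetical helper): print the name of each required
# variable that is unset or empty. The two names are the ones used in setenv.sh.
missing_agent_env() {
  for var in SW_AGENT_NAME SW_AGENT_COLLECTOR_BACKEND_SERVICES; do
    # Indirect lookup: read the value of the variable whose name is in $var.
    eval "value=\${$var:-}"
    [ -n "$value" ] || echo "$var"
  done
}

# Example values from this article.
SW_AGENT_NAME=ATS-WZ
SW_AGENT_COLLECTOR_BACKEND_SERVICES=127.0.0.1:11801,127.0.0.1:11802

missing="$(missing_agent_env)"
if [ -n "$missing" ]; then
  echo "refusing to start, unset variables: $missing" >&2
else
  echo "agent environment looks complete"
fi
```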

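A brief look at the mines we found

   During onboarding, use, and testing we found several mines, including the one mentioned above that can appear in lower SW versions. We marked and collected each of them and plan to dismantle them one by one in a later summary. This part is only about trials and mine detection; mine clearance comes later. However, some mines can cause onboarding to fail, so we describe them briefly here so that they do not get in your way.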

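  1. Symptom: the SW home page has no data; current services, endpoints, and instances are all empty. In effect, onboarding failed.
    • Most likely the OAP service is down; check it.

    • On versions below 6.3, there may still be no data after the OAP service restarts. Lower SW versions require every agent to restart and reconnect after OAP restarts; otherwise no data arrives. This is a nasty pitfall, and we suggest that the SW official website document it. The consequence is that if the OAP service goes down and restarts, every connected team has to be told to restart its own system, which is both troublesome and potentially disruptive to our services. Fortunately, version 6.3 upgraded the relevant functionality and fixed this, so we strongly recommend using or upgrading to 6.3 or later.
  2. Symptom: the SW home page has no data, and the agent log shows the error in Figure 12.
    Figure 12: Error in the Agent log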

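    • Very likely the environment variable SW_AGENT_COLLECTOR_BACKEND_SERVICES is misconfigured. Some people think thoroughly, take load balancing into account, and use the domain name of the OAP cluster. That is well considered, but SW does not support it. Reading the SW source shows that SW does not support domain names and only supports the format ip1:port1,ip2:port2, as shown in Figure 13.
      Figure 13: The OAP service address format can be inferred from the source code

    • As for load balancing across the OAP cluster, there is no need to build a reverse-proxy cluster such as Nginx + OAP. The source shows that the Agent uses client-side soft load balancing and by default picks the OAP service instance to connect to at random.
  3. Symptom: many areas of the SW page have no data or respond very slowly. Chrome's developer tools show that many API requests sent by the SW page time out and fail, with a timeout of about 10 seconds. The logs of ui1 and ui2 show the error in Figure 14.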

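    Figure 14: Error in the UI log

    • This is usually a connection or request timeout. The underlying cause is insufficient performance of the OAP service or ES, which leads to connection timeouts, response timeouts, or no response at all. The fundamental fix is generally to tune the performance of the OAP service or ES.

    • There is also a temporary workaround: if the UI runs out of time, give it more. The UI timeout defaults to 10 seconds and can be raised as appropriate in both test and production environments. We extended ReadTimeout to 30 seconds and added a ConnectTimeout setting, also 30 seconds, as shown in Figure 15.
      Figure 15: Extending the timeouts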

  4. Symptom: on the trace page we can see the call chains of all tracked interfaces. Going a step further and opening the details of one call chain, we would like to see the interface's input and output parameters. Unfortunately, most of the time they are not available, as shown in Figure 16.
    Figure 16: Interface details without the interface's input and return parameters

    • By default SW does not record this data in many cases; capturing it requires custom development. We will discuss how later.
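
For the second symptom above, the address format and the random selection can be illustrated with a small sketch. It is not the Agent's own code, which is Java; both helper names are hypothetical. It assumes the comma-separated host:port form that the article reports, checks shape only (it cannot tell an IP address from a hostname), and only mimics the idea of picking one backend at random.

```shell
#!/bin/sh
# valid_backends (hypothetical helper): succeed only if $1 is a comma-separated
# list of host:port entries, the format the article reports SW accepting.
# It checks shape only; it does not tell IP addresses from hostnames.
valid_backends() {
  printf '%s\n' "$1" | grep -Eq '^[^:,]+:[0-9]+(,[^:,]+:[0-9]+)*$'
}

# pick_backend (hypothetical helper): choose one entry at random, which is the
# idea behind client-side load balancing. This is not the Agent's own code.
pick_backend() {
  valid_backends "$1" || return 1
  printf '%s\n' "$1" | tr ',' '\n' |
    awk 'BEGIN { srand() } { entry[NR] = $0 } END { print entry[int(rand() * NR) + 1] }'
}

backends="127.0.0.1:11801,127.0.0.1:11802"
if valid_backends "$backends"; then
  echo "would connect to: $(pick_backend "$backends")"
fi
```

A value with no port or with a URL scheme, such as a bare cluster domain name, fails the check, which matches the misconfiguration described above.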
Summary

  Throughout installation, use, and testing we ran into some problems, but by consulting the official documentation and reading the SW source we found that they are controllable and solvable. This also gave us a taste of the benefits of using a technology we know well, especially when time is tight: it makes the project more controllable and gives us more confidence. After verifying SW's suitability, read/write performance, and stability, we will certainly customize and extend SW to meet our business needs. We will cover that in later parts.

References

tard user documentation
SkyWalking official documentation

Origin www.cnblogs.com/whslowly/p/11511141.html