FileBeat & Kafka & ClickHouse Replace ELK, Part One

Background

SaaS services will increasingly face data security and compliance requirements, so the company needs to build up a private-deployment capability to keep the business competitive in its industry. To improve the platform's capabilities, we also need a data system for analyzing the effect of operations activities. However, directly deploying a full big-data stack would impose a heavy server cost on users. We therefore chose a compromise solution to improve our data analysis capability.

Elasticsearch vs. ClickHouse

ClickHouse is a high-performance columnar distributed database management system. In our testing, it showed the following advantages:

High write throughput. A single server can ingest 50 MB to 200 MB of logs per second, which is more than 600,000 records per second, over 5x the rate of ES. The write rejections that are common in ES, causing data loss and write delays, rarely occur in ClickHouse.

Fast queries. Officially, when the data is in the page cache, a single server scans at roughly 2-30 GB/s; without the page cache, query speed depends on the disk read rate and the data compression ratio. In our tests, ClickHouse queries were 5-30x faster than ES.

Lower server cost than ES. On the one hand, ClickHouse compresses data better: the same data occupies only 1/3 to 1/30 of the disk space ES needs, which saves disk space and reduces disk I/O (one of the reasons ClickHouse queries are faster). On the other hand, ClickHouse uses less memory and CPU than ES. We estimate that processing logs with ClickHouse can cut server costs in half.


| Feature | Elasticsearch | ClickHouse |
| --- | --- | --- |
| Implementation language | Java | C++ |
| Storage model | Document store | Columnar database |
| Distributed support | Sharding and replicas both supported | Sharding and replicas both supported |
| Extensibility | High | Low |
| Write speed | Slow | Fast |
| CPU/memory usage | High | Low |
| Storage used (54 GB of logs imported) | High: 94 GB (174%) | Low: 23 GB (42.6%) |
| Exact-match query speed | Average | Fast |
| Fuzzy-match query speed | Fast | Slow |
| Permission management | Supported | Supported |
| Query difficulty | Low | High |
| Adoption | Widespread | Ctrip, among others |
| Maintenance difficulty | Low | High |
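The storage percentages in the table above follow directly from the raw sizes (54 GB of source logs); a quick check:

```python
# Verify the storage percentages from the comparison table:
# 54 GB of raw logs -> 94 GB in Elasticsearch, 23 GB in ClickHouse.
raw_gb = 54
es_gb = 94
ch_gb = 23

es_pct = es_gb / raw_gb * 100   # expansion: ES stores more than the raw data
ch_pct = ch_gb / raw_gb * 100   # compression: ClickHouse stores far less

print(f"Elasticsearch: {es_pct:.0f}% of raw size")   # ~174%
print(f"ClickHouse:    {ch_pct:.1f}% of raw size")   # ~42.6%
print(f"ClickHouse uses {es_gb / ch_gb:.1f}x less disk than ES")
```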

Cost analysis

Note: prices are based on Alibaba Cloud list prices, without any discounts.

| Cost item | Specification | Unit cost | Notes | Total cost |
| --- | --- | --- | --- | --- |
| ZooKeeper cluster | 2-core 4 GB shared compute n4, 50 GB SSD cloud disk | 222/month | 3 nodes for high availability | 666/month |
| Kafka cluster | 4-core 8 GB shared standard s6, 50 GB SSD cloud disk + 300 GB data disk | 590/month | 3 nodes for high availability | 1770/month |
| Filebeat | Co-located with the applications; adds some memory and disk overhead, with some impact on application availability | - | - | - |
| ClickHouse | 16-core 32 GB shared compute n4, 50 GB SSD cloud disk + 1000 GB data disk | 2652/month | 2 nodes for high availability | 5304/month |
| **Total** | | | | 7740/month |
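The monthly total is simply the sum of the per-cluster costs:

```python
# Monthly cost breakdown (CNY/month, Alibaba Cloud list prices from the table above)
costs = {
    "zookeeper (3 x 222)": 3 * 222,
    "kafka (3 x 590)": 3 * 590,
    "clickhouse (2 x 2652)": 2 * 2652,
}
for item, cost in costs.items():
    print(f"{item}: {cost}/month")
print(f"Total: {sum(costs.values())}/month")  # 7740/month
```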

Environment deployment

ZooKeeper cluster deployment



# Install JDK 8
yum install java-1.8.0-openjdk-devel.x86_64
# Configure the environment variables in /etc/profile (see the exports below)
# Sync the system time
yum install ntpdate
ntpdate asia.pool.ntp.org

# Create the data and log directories (must match dataDir/dataLogDir in zoo.cfg)
mkdir -p /usr/zookeeper/data
mkdir -p /usr/zookeeper/logs

wget  --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.7.1/apache-zookeeper-3.7.1-bin.tar.gz
tar -zvxf apache-zookeeper-3.7.1-bin.tar.gz -C /usr/zookeeper

export ZOOKEEPER_HOME=/usr/zookeeper/apache-zookeeper-3.7.1-bin
export PATH=$ZOOKEEPER_HOME/bin:$PATH

# Enter the ZooKeeper configuration directory
cd $ZOOKEEPER_HOME/conf

# Create a new configuration file
vi zoo.cfg

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/zookeeper/data
dataLogDir=/usr/zookeeper/logs
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
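The timing settings in zoo.cfg are expressed in ticks; this small sketch interprets them in seconds:

```python
# Interpret the timing settings in zoo.cfg above
tick_ms = 2000          # tickTime: ZooKeeper's basic time unit in ms
init_limit_ticks = 10   # initLimit: ticks a follower may take to connect and sync
sync_limit_ticks = 5    # syncLimit: max ticks a follower may lag behind the leader

init_s = tick_ms * init_limit_ticks // 1000
sync_s = tick_ms * sync_limit_ticks // 1000
print(f"Followers may take up to {init_s}s to connect and sync")  # 20s
print(f"Followers may lag the leader by at most {sync_s}s")       # 10s
```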

# On each server, create the myid file matching its server.N entry in zoo.cfg
# (run exactly one of these lines per node):
echo "1" > /usr/zookeeper/data/myid   # on zk1
echo "2" > /usr/zookeeper/data/myid   # on zk2
echo "3" > /usr/zookeeper/data/myid   # on zk3

# Enter the ZooKeeper bin directory and start the service
cd $ZOOKEEPER_HOME/bin
sh zkServer.sh start

Kafka cluster deployment

mkdir -p /usr/kafka
chmod 777 -R /usr/kafka
wget  --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/kafka/3.2.0/kafka_2.12-3.2.0.tgz
tar -zvxf kafka_2.12-3.2.0.tgz -C /usr/kafka


# Set a unique broker.id on each node, e.g. 1, 2, 3
broker.id=1
listeners=PLAINTEXT://ip:9092
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dir=/usr/kafka/logs
num.partitions=5
num.recovery.threads.per.data.dir=3
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=3
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=30000
group.initial.rebalance.delay.ms=0
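The size and time values in server.properties are easier to read in human units; a quick interpretation:

```python
# Interpret the retention and segment settings from server.properties above
log_retention_hours = 168
log_segment_bytes = 1073741824
log_retention_check_interval_ms = 300000

print(f"Retention: {log_retention_hours // 24} days")                            # 7 days
print(f"Segment size: {log_segment_bytes // 1024**3} GiB")                       # 1 GiB
print(f"Retention check every {log_retention_check_interval_ms // 60000} min")   # 5 min
```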

# Start Kafka as a background daemon
nohup /usr/kafka/kafka_2.12-3.2.0/bin/kafka-server-start.sh /usr/kafka/kafka_2.12-3.2.0/config/server.properties > /usr/kafka/logs/kafka.log 2>&1 &

# Stop Kafka
/usr/kafka/kafka_2.12-3.2.0/bin/kafka-server-stop.sh

# List topics
$KAFKA_HOME/bin/kafka-topics.sh --list --bootstrap-server ip:9092

# Consume a topic from the beginning (useful for verification)
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server ip:9092 --topic test --from-beginning

# Create a topic with 3 partitions and a replication factor of 2
$KAFKA_HOME/bin/kafka-topics.sh --create --bootstrap-server ip:9092 --replication-factor 2 --partitions 3 --topic xxx_data

FileBeat deployment

sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch

Create a file with a .repo extension (for example, elastic.repo) in the /etc/yum.repos.d/ directory and add the following lines:

[elastic-8.x]
name=Elastic repository for 8.x packages
baseurl=https://artifacts.elastic.co/packages/8.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

yum install filebeat
systemctl enable filebeat
chkconfig --add filebeat

Notes on the Filebeat configuration file. Pitfall 1: you need to set keys_under_root: true. If you do not, all log fields end up nested inside the Kafka message field:

(screenshot: all log fields nested inside the Kafka message field)

File path: /etc/filebeat/filebeat.yml

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /root/logs/xxx/inner/*.log
  # Without this setting, all data is nested inside the message field;
  # with it, the JSON keys are flattened to the top level.
  json:
    keys_under_root: true
output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  topic: 'xxx_data_clickhouse'
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
# Drop Filebeat fields that are useless downstream
processors:
  - drop_fields:
      fields: ["input", "agent", "ecs", "log", "metadata", "timestamp"]
      ignore_missing: false
        
# Run Filebeat in the background, writing its output to filebeat.log for easier troubleshooting
nohup ./filebeat -e -c /etc/filebeat/filebeat.yml > /usr/filebeat/filebeat.log 2>&1 &
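To see what keys_under_root: true changes, here is a minimal Python sketch of the two event shapes (illustrative only, not Filebeat's actual code; the log line is a made-up example):

```python
import json

# One JSON log line as written by the application (hypothetical example)
log_line = '{"level": "info", "trace_id": "abc123", "msg": "user login"}'

# Without keys_under_root: the whole line stays nested under "message",
# so the consumer must parse JSON back out of one string field.
event_nested = {"message": log_line}

# With keys_under_root: true, the JSON keys are promoted to the top level.
event_flat = json.loads(log_line)

print(event_nested)  # {'message': '{"level": ...}'}
print(event_flat)    # {'level': 'info', 'trace_id': 'abc123', 'msg': 'user login'}
```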

ClickHouse deployment


# Check whether the CPU supports SSE 4.2; if not, ClickHouse must be built from source
grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

# Create the data directory on a path mounted on a large-capacity disk
mkdir -p /data/clickhouse

Add the ClickHouse nodes to /etc/hosts, for example:

10.190.85.92 bigdata-clickhouse-01
10.190.85.93 bigdata-clickhouse-02

Server performance tuning:

# Pin the CPU frequency governor to performance: run all cores at the highest
# supported frequency instead of scaling dynamically, which gives the best performance
echo 'performance' | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Memory: do not disable overcommit
echo 0 | tee /proc/sys/vm/overcommit_memory

# Always disable transparent huge pages; they interfere with the memory
# allocator and cause a significant performance drop
echo 'never' | tee /sys/kernel/mm/transparent_hugepage/enabled

First, add the official repository:

yum install yum-utils
rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/stable/x86_64

# List the installable ClickHouse versions
yum list | grep clickhouse
# Install the server and client
yum -y install clickhouse-server clickhouse-client

Edit /etc/clickhouse-server/config.xml and change the log level from the default trace to information:

<level>information</level>

Log file locations:

# Normal log
/var/log/clickhouse-server/clickhouse-server.log
# Error log
/var/log/clickhouse-server/clickhouse-server.err.log

# Check the installed ClickHouse version
clickhouse-server --version
# Connect with the client (prompts for the password)
clickhouse-client --password

# Stop / restart / start the server
sudo clickhouse stop
sudo clickhouse restart
sudo clickhouse start


Summary

The whole deployment process involved stepping into quite a few pits, especially the Filebeat YAML parameters and the ClickHouse configuration; I will publish another article covering those pits in detail.

I haven't updated my blog for a long time, and I often see the question of what engineers should do after turning 35. Honestly, I haven't fully figured out my own future either. The core is continuous learning and output: keep building your own moat, whether as a technical expert, a business-domain expert, or in architecture or management. Personally, I suggest staying on the front line writing code if you can; a management track is tied entirely to one company, unless it is a well-known big firm, in which case things look different. If your current company lacks broad industry influence, staying hands-on keeps your options open for the next job, where industry influence, business sense, and technical architecture skills matter more. I am 35 now, and I face every day calmly.


Origin juejin.im/post/7120880190003085320