Tencent Cloud FinOps Crane Developer Training Camp - How Cloud Native Can Help Enterprises Achieve Cost Optimization

Introduction:

With the spread of Docker, more and more enterprises have moved to cloud computing. The cloud-native industry is developing rapidly, and cloud-native infrastructure takes up a visible share of investment. With large-scale cluster investment, problems in deployment and maintenance gradually emerge, and enterprises keep raising their requirements for cloud native. Cloud-native technology has clearly made operations and maintenance more efficient, but the cost of testing has begun to rise as adoption deepens.

Today, cloud-native technologies and concepts are being extended and enriched, and more and more enterprises are building technical architectures and green sustainable development models that adapt to rapid business development based on cloud-native technologies. In this context, Tencent Cloud launched the first open source platform in China based on cloud-native technology to reduce costs and increase efficiency - FinOps Crane.

FinOps defines a set of cloud financial management rules and best practices and emphasizes data-driven cost decisions, so that organizations get the most value from their cloud spending. Following FinOps, Tencent Cloud built Crane, a cloud resource analysis and cost optimization platform. It supports multi-cloud cost observation, estimates of expected optimization results, a waste dashboard, and resource optimization, which help cloud customers on distributed cloud architectures reduce costs.

To help cloud-native users cut costs as far as possible without compromising business stability, Tencent launched Crane (Cloud Resource Analytics and Economics), the first open source cost optimization project in China based on cloud-native technology.

The current main contributors to the Crane project include industry experts from well-known companies such as Tencent, Xiaohongshu, Google, eBay, Microsoft, and Tesla.


1. Crane:

Tencent Cloud's cloud-native cost optimization platform FinOps Crane won the "Cloud Computing Center Science and Technology Award Excellence Award" and became the first cloud-native tool in China to win this national award.

Crane is the first cost optimization open source project based on cloud native technology in China. It follows the FinOps standard and aims to provide cloud native users with a one-stop solution for cloud cost optimization.

  1. FinOps and Crane relationship:

FinOps defines a series of methodologies such as cloud financial management rules and best practices. Tencent has open-sourced a cost optimization project, Crane. Tencent's cloud-native cost reduction and efficiency enhancement best practices are based on the FinOps framework.

2. Crane capability model:


What business pain points can it help us solve?

  • Prediction is king
    • Scalable prediction algorithms
    • Evaluate prediction accuracy against large amounts of offline metric data
    • Resource utilization reports
    • Avoid runaway cloud computing costs
  • Optimization-oriented
    • Recommended billing methods
    • Identify wasted resources
    • Optimize cloud application and cost efficiency
    • Drive financial accountability and visibility
  • Stability as the foundation
    • Anomaly identification
    • Identify wasted resources
    • Enable cross-organizational trust and collaboration
    • Accelerate business value realization and business revitalization
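  3. What are the features of Crane:
  • Cost visualization and optimization assessment
1. Provides a set of Exporters that compute billing and cost data for the cluster's cloud resources and store it in your monitoring system, such as Prometheus.
2. Multi-dimensional cost insight and optimization assessment. Multi-cloud billing is supported through Cloud Provider.
  • Recommendation framework
1. Provides an extensible recommendation framework that supports analysis of many kinds of cloud resources.
2. Includes several built-in recommenders: resource recommendation, replica recommendation, HPA recommendation, and idle resource recommendation.
  • Prediction-driven horizontal autoscaler
1. EffectiveHorizontalPodAutoscaler supports prediction-driven autoscaling.
2. It uses the community HPA for the underlying scaling control and supports richer scaling triggers (prediction, observation, periodic), which makes scaling more efficient and safeguards service quality.
  • Load-aware scheduler
1. The dynamic scheduler builds a simple but efficient model from actual node utilization and filters out heavily loaded nodes to balance the cluster.
  • Topology-aware scheduler
1. Crane Scheduler works together with Crane Agent to support finer-grained resource-topology-aware scheduling and multiple CPU pinning policies.
2. It addresses the "noisy neighbor" problem in complex scenarios, so that resources are used more reasonably and efficiently.
  • Colocation based on QoS
1. The QoS capabilities ensure the stability of Pods running on Kubernetes.
2. Interference detection and active avoidance under multi-dimensional metric conditions, with support for precise operations and custom metrics.
3. Elastic resource oversubscription enhanced by prediction algorithms, reusing and limiting idle resources in the cluster.
4. Enhanced bypass cpuset management, which improves resource utilization while pinning CPU cores.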
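
The load-aware scheduling idea in the list above (filter out heavily loaded nodes, then prefer the less loaded ones) can be sketched in a few lines of Python. This is only an illustration and not Crane Scheduler's actual code: the node names, the single CPU utilization metric, and the 0.65 threshold are all assumptions made for the example.

```python
# Illustrative sketch only; not Crane Scheduler's implementation.
# Assumed input: node name -> recent CPU utilization between 0.0 and 1.0.

def schedulable_nodes(utilization, threshold=0.65):
    """Drop nodes at or above the (assumed) utilization threshold,
    then order the remaining nodes from least to most loaded."""
    candidates = {name: u for name, u in utilization.items() if u < threshold}
    return sorted(candidates, key=lambda name: (candidates[name], name))


nodes = {"node-a": 0.82, "node-b": 0.35, "node-c": 0.50}
print(schedulable_nodes(nodes))  # ['node-b', 'node-c']
```

A real scheduler would combine several metrics and time windows; the point here is only the two-phase "filter, then rank" shape.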
  4. What are the advantages of Crane:

With its two-level scheduling capability, Crane colocates high-priority, latency-sensitive services and low-priority, high-throughput services on the same node:


1. The first-level scheduling capability ensures efficient scheduling of applications, so that resources are allocated according to what is actually used.

  • Build application profiles from historical load information

  • Schedule intelligently based on application profiles and node profiles

  • Predict scaling needs with DSP algorithms, AI algorithms, and so on

2. The second-level scheduling capability substantially increases resource utilization while ensuring service quality.

  • Node profiling and idle resource recovery. The agent running on each node collects the node load, predicts the future load trend with DSP and other prediction algorithms, and reclaims idle resources as extended node resources for low-priority workloads.

  • Resource isolation and service quality assurance. Resource isolation rules ensure that the stability of high-priority workloads is not affected when colocated workloads compete for resources. The open-source solution throttles low-priority workloads with CPU quota; the closed-source solution relies on the Tencent TLinux Ruyi kernel to give high-priority workloads absolute resource preemption.

  • Interference detection and active avoidance for low-priority workloads. After the resource isolation policy takes effect, the node agent checks whether interference still exists on the node. If it does, low-priority workloads are evicted so that high-priority workloads are not affected.
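
As a rough illustration of idle resource recovery, the sketch below estimates how much CPU a node could lend to low-priority workloads. The formula (capacity minus the predicted peak of high-priority workloads minus a safety reserve), the reserve percentage, and all numbers are assumptions for this example; the Crane agent uses its own prediction algorithms and configuration.

```python
# Hypothetical model of idle resource recovery; values are in millicores
# so that the arithmetic stays in integers.

def reclaimable_millicores(capacity, predicted_peak, reserve_percent=10):
    """CPU that could be offered to low-priority workloads.

    capacity        -- total CPU of the node
    predicted_peak  -- predicted peak usage of high-priority workloads
    reserve_percent -- assumed safety buffer that is never lent out
    """
    reserve = capacity * reserve_percent // 100
    return max(0, capacity - predicted_peak - reserve)


# A 32-core node whose high-priority workloads are predicted to peak at 12 cores.
print(reclaimable_millicores(32000, 12000))  # 16800
```

If the prediction later proves too optimistic, the isolation and eviction steps described above are what protect the high-priority workloads.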

  5. Crane's record:

Currently, Crane has been deployed in production by Tencent, Xiaohongshu, Netease, Aspire, Kujiale, Mingyuan Cloud, Shushu Technology, and other companies, and its main contributors come from well-known companies such as Tencent, Xiaohongshu, Google, eBay, Microsoft, and Tesla.

  • Tencent's internal self-developed businesses have adopted Crane on a large scale. It is deployed to hundreds of Kubernetes clusters and manages millions of CPU cores. Within one month of the full rollout, the overall number of cores in use dropped by 25%.

  • Taking the cluster optimization of one internal Tencent department as an example, FinOps Crane tripled the department's resource utilization while ensuring business stability. After another self-developed Tencent business adopted FinOps, it saved a total of 400,000 CPU cores, equivalent to monthly cost savings of over 10 million yuan.

  • Using only basic capabilities such as Crane's Request recommendation, the utilization rate of equipment resources has increased from less than 10% to 16.6% at present, reducing the overall cost by 30%.


2. Hands-on experiment:

1. Install the basic software required for the project:

# Install Docker
curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun

# Install kubectl
# Download the latest release:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
# Download the kubectl checksum file:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
# Validate the kubectl binary against the checksum file:
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
# Install kubectl
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# Run a test to make sure the version you installed is up to date:
kubectl version --client

# Install Helm (apt-based distributions)
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
sudo apt-get install apt-transport-https --yes
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm

# Install kind
# For AMD64 / x86_64
[ "$(uname -m)" = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.19.0/kind-linux-amd64
# For ARM64
[ "$(uname -m)" = aarch64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.19.0/kind-linux-arm64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

  • kind is a tool for quickly creating local Kubernetes clusters for testing
  • helm is a package manager for Kubernetes

2. Install the local Kind cluster and Crane components:

curl -sf https://raw.githubusercontent.com/gocrane/crane/main/hack/local-env-setup.sh | sh -

3. Make sure all Pods are up and running:

export KUBECONFIG=${HOME}/.kube/config_crane
kubectl get pod -n crane-system

NAME                                              READY   STATUS    RESTARTS       AGE
craned-6dcc321sd-vnfsf                            2/2     Running   0              4m41s
fadvisor-5b6321dsd6-xpxzq                         1/1     Running   0              4m37s
grafana-3213ds54-6l24j                            1/1     Running   0              4m46s
metric-adapter-2314ds-swhfv                       1/1     Running   0              4m41s
prometheus-kube-state-metrics-432312d-p8l7c       1/1     Running   0              4m46s
prometheus-server-fdsad23223-4qqlv                2/2     Running   0              4m46s

Tip: The Pods take some time to start. Wait a few minutes, then run the command again and check that every Pod has the status Running.
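
Checking the table by eye works for six Pods, but the check can also be scripted. The sketch below parses text in the layout printed by `kubectl get pod` and lists the Pods that are not fully ready. The function name and the sample rows are made up for illustration; in practice the text would come from running the kubectl command above (for example through Python's `subprocess` module).

```python
def pods_not_ready(kubectl_output):
    """Return the names of Pods that are not Running with all containers ready.

    Expects the plain-text table printed by `kubectl get pod`
    (columns: NAME, READY, STATUS, RESTARTS, AGE).
    """
    not_ready = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            not_ready.append(name)
    return not_ready


sample = """\
NAME                        READY   STATUS              RESTARTS   AGE
craned-6dcc321sd-vnfsf      2/2     Running             0          4m41s
fadvisor-5b6321dsd6-xpxzq   0/1     ContainerCreating   0          4m37s
"""
print(pods_not_ready(sample))  # ['fadvisor-5b6321dsd6-xpxzq']
```

An empty list means every Pod is up and the next step can begin.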

4. Visit the Crane Dashboard:

kubectl -n crane-system port-forward service/craned 9090:9090

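# Run the following terminal commands in a new window. In every new window, set this environment variable first (otherwise you will see a connection-refused error for port 8080).
export KUBECONFIG=${HOME}/.kube/config_crane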

With the port-forward running, open http://localhost:9090 in a browser to access the Crane Dashboard. It shows the following cost figures:

  • Total cost of the month: The total cost of the cluster in the past month. Cluster cost is accumulated hourly, starting from the time Crane was installed.
  • Estimated monthly cost: Estimate the cost of the next month based on the latest hourly cost. hourly cost * 24 * 30
  • Estimated total CPU cost: Estimate the CPU cost for the next month based on the CPU cost of the last hour. Hourly CPU cost * 24 * 30
  • Estimated total memory cost: Estimate the memory cost for the next month based on the memory cost of the last hour. Memory cost per hour * 24 * 30
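
The three estimates above all use the same arithmetic: the most recent hourly cost multiplied by 24 hours and 30 days. A minimal sketch, with hourly costs invented for the example:

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # the formulas above assume a fixed 30-day month


def estimated_monthly_cost(hourly_cost):
    """Project one hour of cost onto a 30-day month."""
    return hourly_cost * HOURS_PER_DAY * DAYS_PER_MONTH


# Hypothetical costs for the last hour, in any currency unit.
last_hour = {"cluster": 2, "cpu": 1.5, "memory": 0.5}
for item, hourly in last_hour.items():
    print(item, estimated_monthly_cost(hourly))
```

Because the projection uses only the latest hour, the estimate moves whenever the cluster's hourly cost changes.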

5. Use the intelligent autoscaler EffectiveHPA:

Kubernetes HPA offers rich autoscaling capabilities. Kubernetes platform developers deploy services that implement custom metrics, and Kubernetes users configure built-in resource metrics or custom metrics to get customized horizontal scaling.

EffectiveHorizontalPodAutoscaler (EHPA for short) is the autoscaling product provided by Crane. It uses the community HPA for the underlying scaling control and supports richer scaling triggers (prediction, observation, periodic), which makes scaling more efficient and safeguards service quality.

  • Scale out in advance to ensure service quality: Algorithms predict future traffic peaks and add capacity ahead of time, avoiding avalanches and service stability failures caused by scaling out too late.
  • Reduce unnecessary scale-in: Predicting future load reduces unnecessary scale-in, stabilizes the resource usage of workloads, and eliminates misjudgments.
  • Support Cron configuration: Cron-based scaling configuration handles unusual traffic peaks such as big promotions.
  • Compatible with the community: The community HPA is used as the execution layer for scaling control, so the capability is fully compatible with the community.
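
To make "scale out in advance" concrete, the sketch below applies the replica formula documented for the Kubernetes HPA, ceil(currentReplicas * currentMetricValue / desiredMetricValue), to both an observed and a predicted metric and takes the larger result. Treating the prediction as one more metric alongside the observation is an assumption of this sketch, as are the function names and numbers; it is not EHPA's source code.

```python
import math


def desired_replicas(current_replicas, metric_value, target_value):
    """Kubernetes HPA formula: ceil(current * metric / target)."""
    return math.ceil(current_replicas * metric_value / target_value)


def replicas_with_prediction(current_replicas, observed, predicted, target,
                             min_replicas=1, max_replicas=10):
    """Take the larger of the observed and predicted recommendation,
    then clamp it to the configured replica bounds."""
    by_observation = desired_replicas(current_replicas, observed, target)
    by_prediction = desired_replicas(current_replicas, predicted, target)
    wanted = max(by_observation, by_prediction)
    return max(min_replicas, min(max_replicas, wanted))


# 2 replicas, CPU target 50%, currently 40% used, 90% predicted for the peak.
print(replicas_with_prediction(2, observed=40, predicted=90, target=50))  # 4
```

With observation alone the workload would stay at 2 replicas until the peak arrived; the predicted value raises it to 4 beforehand.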

5.1 Install Metrics Server:

Install Metrics Server with the following command:

kubectl apply -f installation/components.yaml
kubectl get pod -n kube-system

5.2 Create a test application:

Start a Deployment with the following command to run a container with the hpa-example image, and then expose it as a service (Service)

kubectl apply -f installation/php-apache.yaml

kubectl apply -f installation/nginx-deployment.yaml

5.3 Create EffectiveHPA:

kubectl apply -f installation/effective-hpa.yaml

Run the following command to view the current status of EffectiveHPA:

kubectl get ehpa

The output is similar to:

NAME         STRATEGY   MINPODS   MAXPODS   SPECIFICPODS   REPLICAS   AGE
php-apache   Auto       1         10                       0          3m39s

5.4 Increase load:

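Run the load generator in a separate terminal so that load generation continues while you carry on with the remaining steps:

# If this is a newly opened terminal, set the environment variable first
export KUBECONFIG=${HOME}/.kube/config_crane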
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"
kubectl get hpa ehpa-php-apache --watch

As the number of requests increases, CPU utilization keeps rising, and you can see EffectiveHPA automatically scale out the replicas.

Note: Prediction data appears only after more than two days of monitoring data have been collected.

6. How to calculate the cost:
Cost calculation is implemented by the Fadvisor component, which is installed together with Crane. Its two parts together provide cost display and cost analysis:

  • Server: collect cluster metric data and calculate cost
  • Exporter: expose the cost Metric

Principle:

Fadvisor's cost models provide a way to estimate and analyze the resource price of each container, Pod, or other resource in Kubernetes.

Note that the cost model only produces an estimate and is not a substitute for the cloud bill, because actual billing depends on more factors, such as the various billing rules. The calculation works as follows:

  • The simplest cost model prices the resources of all nodes or Pods at the same rate. For example, when calculating costs you can assume that all containers share the same unit prices for CPU and RAM: 2 per core-hour and 0.3 per GiB-hour.
  • Advanced cost models estimate resource prices through cost allocation. The idea is that cloud machine instances of different instance types and billing types have different prices, but the price ratio of CPU to RAM is relatively fixed, so resource costs can be derived from this ratio.
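
The simplest model can be written down directly. The sketch below uses the unit prices from the example above (2 per core-hour, 0.3 per GiB-hour); the function name and the container sizes are made up for illustration.

```python
CPU_PRICE_PER_CORE_HOUR = 2.0   # example price from the text
RAM_PRICE_PER_GIB_HOUR = 0.3    # example price from the text


def flat_hourly_cost(cpu_cores, ram_gib):
    """Hourly cost when every container pays the same unit prices."""
    cost = cpu_cores * CPU_PRICE_PER_CORE_HOUR + ram_gib * RAM_PRICE_PER_GIB_HOUR
    return round(cost, 2)  # round to two decimals, as for a currency amount


# A hypothetical container with 4 cores and 8 GiB of RAM.
print(flat_hourly_cost(4, 8))  # 10.4
```

This model ignores the fact that different instance types have different prices, which is what the cost allocation model addresses.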

The specific calculation formula under the cost allocation model is as follows:

  • Overall cluster cost: the sum of the CVM (cloud virtual machine) costs
  • The CPU/memory price ratio is relatively fixed
  • CVM cost = CPU unit price * CPU amount + memory unit price * memory amount
  • CPU request cost: multiply the overall cost by the CPU share of the CVM cost to get the overall CPU cost, then allocate it in proportion to the requested CPU relative to the total CPU
  • CPU request cost per namespace: the CPU request cost aggregated by namespace
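
The allocation steps above can be followed with a small worked example. Everything concrete here is an assumption for illustration: the cluster cost, the cluster size, the price ratio of 4 (one core priced like 4 GiB of memory), and the function names. It is a sketch of the formulas, not Fadvisor's code.

```python
from collections import defaultdict


def cpu_share_of_cost(cpu_cores, mem_gib, cpu_to_mem_price_ratio):
    """Fraction of the machine cost attributed to CPU, given a fixed
    ratio between the CPU unit price and the memory unit price."""
    weighted_cpu = cpu_to_mem_price_ratio * cpu_cores
    return weighted_cpu / (weighted_cpu + mem_gib)


def cpu_request_cost(cluster_cost, total_cpu, total_mem, ratio, requested_cpu):
    """Overall CPU cost, allocated by requested CPU relative to total CPU."""
    overall_cpu_cost = cluster_cost * cpu_share_of_cost(total_cpu, total_mem, ratio)
    return overall_cpu_cost * requested_cpu / total_cpu


def cpu_cost_by_namespace(cluster_cost, total_cpu, total_mem, ratio, requests):
    """Aggregate CPU request cost per namespace.
    requests is a list of (namespace, requested_cores) pairs."""
    totals = defaultdict(float)
    for namespace, cores in requests:
        totals[namespace] += cpu_request_cost(
            cluster_cost, total_cpu, total_mem, ratio, cores)
    return dict(totals)


# Hypothetical cluster: cost 10 per hour, 16 cores, 64 GiB, price ratio 4.
requests = [("web", 2), ("web", 4), ("batch", 1)]
print(cpu_cost_by_namespace(10, 16, 64, 4, requests))
# {'web': 1.875, 'batch': 0.3125}
```

With these numbers CPU accounts for half of the cluster cost (5 per hour), so a Pod requesting 2 of the 16 cores is charged 0.625 per hour.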

7. Clean up the experimental environment:
After the hands-on experiment is completed, you can clean up and delete the local cluster:

kind delete cluster --name=crane



3. Summary:

Throughout the experiment, the official Crane staff and the CSDN staff gave careful guidance. Special thanks to Thug, who patiently answered my questions and helped me whenever I ran into problems, especially since I am not so familiar with the Mac environment.

Finally, Crane is an open source cloud cost management tool that helps businesses better manage costs when using cloud computing resources. Crane can help enterprises realize transparent management of resource costs in the cloud computing environment, so as to better control costs and improve efficiency. Crane can associate the usage of cloud computing resources with costs through API, and provide real-time cost analysis and forecasting functions.

Origin blog.csdn.net/2301_77888392/article/details/130500757