kube-scheduler can be configured through a configuration file, defined by the KubeSchedulerConfiguration resource, whose path is passed with --config= at startup. The KubeSchedulerConfiguration API version used in each Kubernetes release is:
Versions up to and including 1.21 use v1beta1;
Version 1.22 uses v1beta2, but retains v1beta1;
Versions 1.23, 1.24, and 1.25 use v1beta3, retain v1beta2, and remove v1beta1.
In a simple KubeSchedulerConfiguration, the kubeconfig field serves the same purpose as the --kubeconfig startup flag. KubeSchedulerConfiguration is similar to the configuration files of other components, such as the KubeletConfiguration file used when kubelet is started as a service.
--kubeconfig and --config cannot be specified at the same time. If --config is specified, the other flags no longer take effect.
② Using KubeSchedulerConfiguration
Through the configuration file, users can define multiple schedulers and configure the extension points of each scheduling stage. Plug-ins provide scheduling behavior throughout the scheduling context by hooking into these extension points.
Each extension point is configured with lists of enabled and disabled plug-ins; an entry with name="*" applies to all plug-ins of that extension point.
Because kube-scheduler supports multiple schedulers, the configuration file supports multiple profiles. profiles is a list, so defining several schedulers only requires several list entries, and each profile can configure its own set of extension points.
kube-scheduler ships with many plug-ins that implement scheduling behavior. They are enabled by default unless the configuration says otherwise. For example:
ImageLocality: favors Nodes that already hold the container images the Pod needs, extension point: score;
TaintToleration: implements taints and tolerations, extension points: filter, preScore, score;
NodeName: implements the simplest scheduling method, matching the node name set in the Pod spec, extension point: filter;
NodePorts: checks whether the ports requested by the Pod are free on the Node, extension points: preFilter, filter;
NodeAffinity: provides node affinity functions, extension points: filter, score;
PodTopologySpread: implements Pod topology spread constraints, extension points: preFilter, filter, preScore, score;
NodeResourcesFit: checks whether the node has all the resources requested by the Pod, using one of three strategies: LeastAllocated (default), MostAllocated, and RequestedToCapacityRatio, extension points: preFilter, filter, score;
VolumeBinding: checks whether the node has, or can bind, the requested volumes, extension points: preFilter, filter, reserve, preBind, score;
VolumeRestrictions: checks whether the volumes mounted on the node satisfy restrictions specific to the volume provider, extension point: filter;
VolumeZone: checks whether the requested volumes meet any zone requirements they may have, extension point: filter;
InterPodAffinity: implements affinity and anti-affinity between Pods, extension points: preFilter, filter, preScore, score;
PrioritySort: provides the default priority-based sorting, extension point: queueSort.
2. How to extend kube-scheduler?
When you first think about writing a scheduler, extending kube-scheduler usually looks very difficult. In fact, the Kubernetes project anticipated this need: version 1.15 introduced the scheduling framework, which aims to make the scheduler more extensible.
The framework redefines each extension point so that it is served by plugins, and lets users register out-of-tree extensions with kube-scheduler.
① Define entry
Customizing the scheduler only requires calling NewSchedulerCommand and implementing the plugin logic.
NewSchedulerCommand accepts out-of-tree plugins, that is, externally defined custom plugins. A custom scheduler can therefore be built without modifying the scheduler source code, simply by implementing the plugins yourself:
// WithPlugin injects out-of-tree plugins, so the scheduler code itself holds no reference to them.
func WithPlugin(name string, factory runtime.PluginFactory) Option {
	return func(registry runtime.Registry) error {
		return registry.Register(name, factory)
	}
}
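The injection mechanism can be illustrated without the Kubernetes code base. The following is a minimal, self-contained sketch of the same functional-option pattern. The types Plugin, PluginFactory, Registry, and Option are simplified stand-ins written for this example (they are not the upstream definitions), and networkTraffic and buildRegistry are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Plugin is a simplified stand-in for framework.Plugin.
type Plugin interface {
	Name() string
}

// PluginFactory is a simplified stand-in for runtime.PluginFactory.
type PluginFactory func() (Plugin, error)

// Registry maps plugin names to the factories that build them.
type Registry map[string]PluginFactory

// Register adds a factory and rejects duplicate names.
func (r Registry) Register(name string, factory PluginFactory) error {
	if _, ok := r[name]; ok {
		return fmt.Errorf("a plugin named %q already exists", name)
	}
	r[name] = factory
	return nil
}

// Option mutates the registry before the scheduler starts.
type Option func(Registry) error

// WithPlugin has the same shape as the helper shown above.
func WithPlugin(name string, factory PluginFactory) Option {
	return func(registry Registry) error {
		return registry.Register(name, factory)
	}
}

// networkTraffic is a hypothetical out-of-tree plugin.
type networkTraffic struct{}

func (networkTraffic) Name() string { return "NetworkTraffic" }

// buildRegistry applies every option to a fresh registry, which is
// roughly the role the scheduler command plays for its options.
func buildRegistry(opts ...Option) (Registry, error) {
	registry := Registry{}
	for _, opt := range opts {
		if err := opt(registry); err != nil {
			return nil, err
		}
	}
	return registry, nil
}

func main() {
	registry, err := buildRegistry(WithPlugin("NetworkTraffic", func() (Plugin, error) {
		return networkTraffic{}, nil
	}))
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```

Because the plugin is handed over as an option at start-up, the registry never has to import the plugin's package, which is what allows out-of-tree plugins to live in a separate repository.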
② Plug-in implementation
A plug-in only needs to implement the interface of the corresponding extension point. The structure of the built-in NodeAffinity plug-in shows this: implementing a plug-in means implementing the abstract interface of each extension point it serves.
The plug-in structure holds a framework.FrameworkHandle, which is used for calls between the Kubernetes API and the scheduler. Its definition shows that it includes listers, informers, and so on, and every plug-in must accept this handle. The Score method of NodeAffinity uses it as follows:
func (pl *NodeAffinity) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
	}

	node := nodeInfo.Node()
	if node == nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
	}

	affinity := pod.Spec.Affinity

	var count int64
	// A nil element of PreferredDuringSchedulingIgnoredDuringExecution matches no objects.
	// An element of PreferredDuringSchedulingIgnoredDuringExecution that refers to an
	// empty PreferredSchedulingTerm matches all objects.
	if affinity != nil && affinity.NodeAffinity != nil && affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution != nil {
		// Match PreferredDuringSchedulingIgnoredDuringExecution term by term.
		for i := range affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution {
			preferredSchedulingTerm := &affinity.NodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution[i]
			if preferredSchedulingTerm.Weight == 0 {
				continue
			}

			// TODO: Avoid computing it for all nodes if this becomes a performance problem.
			nodeSelector, err := v1helper.NodeSelectorRequirementsAsSelector(preferredSchedulingTerm.Preference.MatchExpressions)
			if err != nil {
				return 0, framework.NewStatus(framework.Error, err.Error())
			}

			if nodeSelector.Matches(labels.Set(node.Labels)) {
				count += int64(preferredSchedulingTerm.Weight)
			}
		}
	}

	return count, nil
}
Finally, provide a method for registering this extension by implementing a New function, which can be injected into the scheduler as out of tree plugins in main.go:
// New initializes a new plugin and returns it.
func New(_ runtime.Object, h framework.FrameworkHandle) (framework.Plugin, error) {
	return &NodeAffinity{handle: h}, nil
}
3. Scheduling based on network traffic
Building on the explanation above of how to extend the scheduler with plug-ins, this section works through an example of traffic-based scheduling. Taking into account the network traffic a Node has used over a recent period is a common requirement in production environments.
For example, among several hosts with identical configurations, host A runs an order-processing script and host B runs ordinary services. Order processing downloads a large amount of data but uses few hardware resources. If a Pod is scheduled onto host A at that point, both workloads may suffer: the front-end proxy sees that the node has few connections and routes a large share of requests to it, while the order script slows down because its network bandwidth is being consumed.
① Environment configuration
A kubernetes cluster must have at least two nodes.
Every node in the cluster needs Prometheus node_exporter installed. It can run inside or outside the cluster; the out-of-cluster setup is used here.
The example is roughly divided into the following steps:
Define the plug-in API; the plug-in is named NetworkTraffic;
Define the extension point; the Score extension point is used here, together with the scoring algorithm;
Define how the score is obtained (by reading the corresponding data from Prometheus metrics);
Define the parameters passed to the custom scheduler;
Deploy the project to the cluster (in-cluster or out-of-cluster deployment);
Verify the result with an example.
The example follows the built-in nodeaffinity plug-in when writing the code. This plug-in was chosen because it is relatively simple and serves basically the same purpose as the one needed here. Other plug-ins would work equally well as a reference.
② Error handling
When initializing the project and running go mod tidy or similar commands, you will run into many errors in which the k8s.io staging modules required by k8s.io/kubernetes cannot be resolved, because they are pinned to the placeholder version v0.0.0.
This problem is discussed in kubernetes issue #79384. At a cursory glance, the thread does not explain why the problem occurs, but further down a contributor posted a script. When the problem cannot be solved otherwise, running the script with the target version as its argument fixes it:
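#!/usr/bin/env bash
# bash is required: the script uses arrays and "set -o pipefail".
set -euo pipefail

# Strip an optional leading "v" from the version argument.
VERSION=${1#"v"}
if [ -z "$VERSION" ]; then
    echo "Must specify version!"
    exit 1
fi
# Collect the staging modules listed in the replace section of kubernetes' go.mod.
MODS=($(
    curl -sS https://raw.githubusercontent.com/kubernetes/kubernetes/v${VERSION}/go.mod |
    sed -n 's|.*k8s.io/\(.*\) => ./staging/src/k8s.io/.*|k8s.io/\1|p'
))
# Pin every staging module to the version matching the kubernetes release.
for MOD in "${MODS[@]}"; do
    V=$(
        go mod download -json "${MOD}@kubernetes-${VERSION}" |
        sed -n 's|.*"Version": "\(.*\)".*|\1|p'
    )
    go mod edit "-replace=${MOD}=${MOD}@${V}"
done
go get "k8s.io/kubernetes@v${VERSION}"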
③ Define plug-in API
As described above, defining a plug-in only requires implementing the abstract interface of the corresponding extension point, so the project can start with the file pkg/networtraffic/networktraffice.go.
Define the plugin name and variables:
const Name = "NetworkTraffic"

var _ = framework.ScorePlugin(&NetworkTraffic{})
Next, the results need to be normalized. The framework source code shows that a Score plug-in has to implement more than the single Score method:
// Run NormalizeScore method for each ScorePlugin in parallel.
parallelize.Until(ctx, len(f.scorePlugins), func(index int) {
	pl := f.scorePlugins[index]
	nodeScoreList := pluginToNodeScores[pl.Name()]
	if pl.ScoreExtensions() == nil {
		return
	}
	status := f.runScoreExtension(ctx, pl, state, pod, nodeScoreList)
	if !status.IsSuccess() {
		err := fmt.Errorf("normalize score plugin %q failed with error %v", pl.Name(), status.Message())
		errCh.SendErrorWithCancel(err, cancel)
		return
	}
})
The code above shows that a Score plug-in must also implement ScoreExtensions; when ScoreExtensions returns nil, the framework skips normalization for that plug-in. In nodeaffinity, ScoreExtensions simply returns the plug-in object itself, and the actual normalization is carried out in NormalizeScore:
// NormalizeScore invoked after scoring all nodes.
func (pl *NodeAffinity) NormalizeScore(ctx context.Context, state *framework.CycleState, pod *v1.Pod, scores framework.NodeScoreList) *framework.Status {
	return pluginhelper.DefaultNormalizeScore(framework.MaxNodeScore, false, scores)
}

// ScoreExtensions of the Score plugin.
func (pl *NodeAffinity) ScoreExtensions() framework.ScoreExtensions {
	return pl
}
In the scheduling framework, the method of actually performing the operation is also NormalizeScore():
func (f *frameworkImpl) runScoreExtension(ctx context.Context, pl framework.ScorePlugin, state *framework.CycleState, pod *v1.Pod, nodeScoreList framework.NodeScoreList) *framework.Status {
	if !state.ShouldRecordPluginMetrics() {
		return pl.ScoreExtensions().NormalizeScore(ctx, state, pod, nodeScoreList)
	}
	startTime := time.Now()
	status := pl.ScoreExtensions().NormalizeScore(ctx, state, pod, nodeScoreList)
	f.metricsRecorder.observePluginDurationAsync(scoreExtensionNormalize, pl.Name(), status, metrics.SinceInSeconds(startTime))
	return status
}
NormalizeScore must implement the concrete algorithm for ranking nodes. The formula used here is: maximum score - (current bandwidth / highest bandwidth * 100). This ensures that a machine using more bandwidth gets a lower score. For example, if the highest bandwidth is 200,000 and the current Node's bandwidth is 140,000, the Node's score is 100 - (140,000 / 200,000 * 100) = 30:
// Returning framework.ScoreExtensions requires implementing framework.ScoreExtensions.
func (n *NetworkTraffic) ScoreExtensions() framework.ScoreExtensions {
	return n
}

// NormalizeScore and ScoreExtensions have fixed signatures.
func (n *NetworkTraffic) NormalizeScore(ctx context.Context, state *framework.CycleState, pod *corev1.Pod, scores framework.NodeScoreList) *framework.Status {
	var higherScore int64
	for _, node := range scores {
		if higherScore < node.Score {
			higherScore = node.Score
		}
	}
	// Formula: maximum score - (current bandwidth / highest bandwidth * 100).
	// The more bandwidth a machine uses, the lower its score.
	for i, node := range scores {
		if higherScore == 0 {
			// No traffic was recorded on any node; avoid dividing by zero.
			scores[i].Score = framework.MaxNodeScore
			continue
		}
		scores[i].Score = framework.MaxNodeScore - (node.Score * 100 / higherScore)
	}
	klog.Infof("[NetworkTraffic] Nodes final score: %v", scores)
	return nil
}
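The arithmetic of this normalization can be checked in isolation. The sketch below reproduces the formula with plain integers and no scheduler types. The function name normalize and the third sample value are chosen for illustration, and maxNodeScore mirrors framework.MaxNodeScore, which is 100 upstream.

```go
package main

import "fmt"

// maxNodeScore mirrors framework.MaxNodeScore (100 upstream).
const maxNodeScore int64 = 100

// normalize converts raw bandwidth readings into scores with the formula
// maxNodeScore - (bandwidth * 100 / highest bandwidth).
func normalize(bandwidth []int64) []int64 {
	var highest int64
	for _, b := range bandwidth {
		if b > highest {
			highest = b
		}
	}
	scores := make([]int64, len(bandwidth))
	for i, b := range bandwidth {
		if highest == 0 {
			// No traffic anywhere: every node keeps the maximum score.
			scores[i] = maxNodeScore
			continue
		}
		scores[i] = maxNodeScore - b*100/highest
	}
	return scores
}

func main() {
	// The busiest node gets 0, the 140,000 node gets 30, the idlest gets 75.
	fmt.Println(normalize([]int64{200000, 140000, 50000})) // prints [0 30 75]
}
```

Multiplying before dividing keeps the computation in integers without losing the ratio, which is why the code uses node.Score * 100 / higherScore rather than dividing first.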
Kubernetes supports up to 5000 nodes per cluster. Does looping over all of them to find the highest score cost a lot of performance? There is no need to worry: the scheduler provides the percentageOfNodesToScore parameter, which limits how many nodes take part in this loop in each scheduling cycle.
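To give a feeling for the numbers, here is a sketch of how the node count can be derived from percentageOfNodesToScore. It is modelled on the upstream scheduler logic of roughly the 1.20 era, as recalled here; the constants and the adaptive default are assumptions that may differ between releases, and numNodesToScore is a name chosen for this example.

```go
package main

import "fmt"

const (
	// Assumed upstream defaults; verify against the release in use.
	minFeasibleNodesToFind           int32 = 100
	minFeasibleNodesPercentageToFind int32 = 5
)

// numNodesToScore returns how many feasible nodes the scheduler looks for
// before it stops filtering and starts scoring.
func numNodesToScore(numAllNodes, percentageOfNodesToScore int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	adaptive := percentageOfNodesToScore
	if adaptive <= 0 {
		// No explicit value: shrink the percentage as the cluster grows.
		adaptive = 50 - numAllNodes/125
		if adaptive < minFeasibleNodesPercentageToFind {
			adaptive = minFeasibleNodesPercentageToFind
		}
	}
	numNodes := numAllNodes * adaptive / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	fmt.Println(numNodesToScore(5000, 0))  // adaptive default for a large cluster
	fmt.Println(numNodesToScore(5000, 30)) // explicit 30%
	fmt.Println(numNodesToScore(50, 0))    // small cluster: all nodes
}
```

Under these assumptions a 5000-node cluster with no explicit setting scores about 500 nodes per cycle, so the loop in NormalizeScore stays short even in the largest supported clusters.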
⑤ Configure the plug-in name
In order for the plugin to be used when registering, it also needs to be configured with a name:
// Name returns name of the plugin. It is used in logs, etc.
func (n *NetworkTraffic) Name() string {
	return Name
}
⑥ Define the parameters to be passed in
The NetworkTraffic plug-in also holds a prometheusHandle, which queries prometheus-server for metrics. First, a PrometheusHandle structure needs to be defined.
With the structure in place, the query action and the metric are needed. The metric node_network_receive_bytes_total is used here to measure a Node's network traffic. Because the monitoring environment is deployed outside the cluster, the metric carries no node host name, so the host name is obtained through PromQL.
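One way such a statement could be assembled is sketched below. The query itself is an assumption, not necessarily the statement used by the original project: it takes the increase of node_network_receive_bytes_total over the window and joins it with node_uname_info, a node_exporter metric that carries a nodename label, to map the instance to a host name. The names nodeTrafficQueryTemplate and buildNodeTrafficQuery are hypothetical.

```go
package main

import "fmt"

// nodeTrafficQueryTemplate is an assumed PromQL template. The join on
// "instance" attaches the nodename label from node_uname_info so that the
// result can be filtered by host name.
const nodeTrafficQueryTemplate = `increase(node_network_receive_bytes_total{device="%s"}[%ds]) * on(instance) group_left(nodename) node_uname_info{nodename="%s"}`

// buildNodeTrafficQuery fills in the device, the window in seconds,
// and the node name.
func buildNodeTrafficQuery(device string, rangeSeconds int, nodeName string) string {
	return fmt.Sprintf(nodeTrafficQueryTemplate, device, rangeSeconds, nodeName)
}

func main() {
	fmt.Println(buildNodeTrafficQuery("eth0", 60, "node01"))
}
```

Since node_uname_info always has the value 1, multiplying by it leaves the traffic figure unchanged and only adds the label.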
Because the Prometheus address, the network interface name, and the time range of the data to fetch must all be specified, the parameter structure is as follows. Note that the name of the parameter structure must follow the <PluginName>Args format:
type NetworkTrafficArgs struct {
	IP         string `json:"ip"`
	DeviceName string `json:"deviceName"`
	TimeRange  int    `json:"timeRange"`
}
To make this type parseable as part of KubeSchedulerConfiguration, one more step is needed: the corresponding resource type must be registered, much as when extending the APIServer. Kubernetes provides two ways to extend the resource types of KubeSchedulerConfiguration.
The first is the framework.DecodeInto function provided in older versions, which decodes the plug-in arguments into the structure.
The second is to implement the corresponding deep-copy methods, as is done for NodeLabel:
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// NodeLabelArgs holds arguments used to configure the NodeLabel plugin.
type NodeLabelArgs struct {
	metav1.TypeMeta

	// PresentLabels should be present for the node to be considered a fit for hosting the pod
	PresentLabels []string
	// AbsentLabels should be absent for the node to be considered a fit for hosting the pod
	AbsentLabels []string
	// Nodes that have labels in the list will get a higher score.
	PresentLabelsPreference []string
	// Nodes that don't have labels in the list will get a higher score.
	AbsentLabelsPreference []string
}
Finally, the type is registered with the scheme; the whole procedure resembles extending the APIServer:
// addKnownTypes registers known types to the given scheme
func addKnownTypes(scheme *runtime.Scheme) error {
	scheme.AddKnownTypes(SchemeGroupVersion,
		&KubeSchedulerConfiguration{},
		&Policy{},
		&InterPodAffinityArgs{},
		&NodeLabelArgs{},
		&NodeResourcesFitArgs{},
		&PodTopologySpreadArgs{},
		&RequestedToCapacityRatioArgs{},
		&ServiceAffinityArgs{},
		&VolumeBindingArgs{},
		&NodeResourcesLeastAllocatedArgs{},
		&NodeResourcesMostAllocatedArgs{},
	)
	scheme.AddKnownTypes(schema.GroupVersion{Group: "", Version: runtime.APIVersionInternal}, &Policy{})
	return nil
}
For generating deep copy functions and other files, you can use the script kubernetes/hack/update-codegen.sh in the kubernetes code base. For convenience, the framework.DecodeInto method is used here.
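What the decoding step achieves for the plug-in can be approximated with encoding/json. The sketch below is a stand-in for illustration, not the upstream DecodeInto implementation: decodeArgs is a hypothetical helper, the validation rules are assumptions, and the sample values (address, interface name, time range) are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NetworkTrafficArgs repeats the parameter structure defined earlier.
type NetworkTrafficArgs struct {
	IP         string `json:"ip"`
	DeviceName string `json:"deviceName"`
	TimeRange  int    `json:"timeRange"`
}

// decodeArgs turns the raw plug-in arguments into the typed structure and
// rejects incomplete input (the checks are assumptions for this sketch).
func decodeArgs(raw []byte) (*NetworkTrafficArgs, error) {
	args := &NetworkTrafficArgs{}
	if err := json.Unmarshal(raw, args); err != nil {
		return nil, fmt.Errorf("decoding NetworkTrafficArgs: %w", err)
	}
	if args.IP == "" || args.DeviceName == "" {
		return nil, fmt.Errorf("ip and deviceName must be set")
	}
	if args.TimeRange <= 0 {
		return nil, fmt.Errorf("timeRange must be positive, got %d", args.TimeRange)
	}
	return args, nil
}

func main() {
	raw := []byte(`{"ip":"http://192.0.2.10:9090","deviceName":"eth0","timeRange":60}`)
	args, err := decodeArgs(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *args)
}
```

In the plug-in's New function, the decoded structure would then be used to build the Prometheus handle before the plug-in is returned to the framework.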
⑧ Project deployment
Prepare the profile of the scheduler. The custom parameters are recognized as part of the KubeSchedulerConfiguration resource type.
After startup, the original kube-scheduler service is stopped to simplify verification. The original kube-scheduler already holds the leader role in the HA setup, so while it is running the custom scheduler would not become active and the Pod would stay pending.
⑨ Verification result
Prepare a Pod to deploy, and specify the name of the scheduler to use in its schedulerName field.
The experimental environment is a Kubernetes cluster with two nodes, master and node01. Because the master runs more services than node01, the Pod is always scheduled to node01 in this setup.