Open sourcing sysgrok — An AI assistant for analyzing, understanding, and optimizing systems

By Sean Heelan

In this post, I introduce sysgrok, a research prototype in which we are investigating how large language models (LLMs), such as OpenAI's GPT models, can be applied to problems in performance optimization, root cause analysis, and systems engineering. You can find it on GitHub.

What does sysgrok do?

sysgrok can do the following:

  • Take the most expensive functions and processes identified by a profiler, explain what each of them does, and suggest optimizations
  • Take a description of a host and a problem that host is encountering, automatically debug the problem, and suggest remediations and next steps
  • Take source code that has been annotated by a profiler, explain the hot paths, and suggest ways to improve the code's performance

sysgrok's capabilities target three broad categories of solutions:

  1. As an analytics engine for performance, reliability, and other system-related data. In this mode, the LLM receives output from some other tool used by the engineer, such as a Linux command-line tool, a profiler, or an observability platform. sysgrok's goal is to use the LLM to explain, summarize, and form hypotheses about the state of the system. It may then also suggest optimizations or remedial actions.
  2. As a focused automation solution for specific performance- and reliability-related tasks. In performance engineering and SRE work, some tasks recur. For these, we can build targeted automated assistants that can be used directly by engineers, or by sysgrok itself to solve other problems. For example, in performance engineering we often need to answer the question: "Is there a faster version of this library with equivalent functionality?" sysgrok supports this directly.
  3. As an automated root cause analysis tool for performance and reliability issues. The first two categories of solutions are a combination of data analysis, interpretation, search, and summarization, applied to data the engineer has collected themselves. In sysgrok, we are also investigating a third approach to problem solving, in which the LLM is combined with other tools to automate root cause analysis and resolution of a given problem. In this approach, the LLM is given a description of the problem (e.g., "the web server is experiencing high latency") and told which capabilities are available (e.g., "ssh to a host", "execute arbitrary Linux command-line tools"). The LLM is then asked to act, using its available capabilities, to diagnose the problem. sysgrok performs these actions, and the LLM is asked to analyze the results, diagnose the problem, propose remedial actions, and recommend next steps.

sysgrok is still in its early stages, but we're releasing it because it's already usable for a variety of tasks, and we hope it will be a handy basis for others to perform similar experiments. If you have any ideas, feel free to send us a PR or file an issue on GitHub!

Analyzing Performance Issues Using LLMs

LLMs, such as OpenAI's GPT models, have exploded in popularity over the past few months, providing natural language interfaces and core engines for a variety of products, from customer-assistance chatbots to data manipulation assistants to coding assistants. An interesting aspect of this trend is that essentially all of these applications use off-the-shelf general models that have not been specifically trained or fine-tuned for the task at hand. Instead, they have been trained on a broad swath of the internet and are therefore applicable to a wide variety of tasks.

So, can we leverage these models to help with performance analysis, debugging, and optimization? There are various methods for investigating performance issues, triaging root causes, and suggesting optimization options. But essentially, any performance analysis effort will involve looking at the output of various tools (such as Linux command-line tools or observability platforms) and interpreting that output to form hypotheses about the state of the system. The training data of GPT models includes material on software engineering, debugging, infrastructure analysis, operating system internals, Kubernetes, Linux commands and their usage, and performance analysis methods. These models can therefore be used to summarize, interpret, and form hypotheses about the data and problems that performance engineers encounter every day, which can speed up an engineer's analysis.

We can go a step further than purely using the LLM for data analysis and question answering within the engineer's own investigative process. As we'll show later in this article, in some cases the LLM itself can be used to drive the process, with the LLM deciding which commands to run or which data sources to look at in order to debug a problem.

Demo

For the full set of features supported by sysgrok, check out the GitHub repository. Overall, it supports three approaches to problem solving:

Approach 1: As an analytics engine for performance, reliability, and other system-related data

In this mode, the LLM receives output from another tool used by the engineer, such as a Linux command-line tool, a profiler, or an observability platform. sysgrok's goal is to explain, summarize, and propose remedies.

For example, the topn subcommand takes the most expensive functions reported by a profiler, interprets the output, and then suggests ways to optimize the system.
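The idea behind a topn-style command can be sketched as follows. This is a hypothetical illustration, not sysgrok's actual code: the prompt wording is an assumption, and the `llm()` stub stands in for a real chat-completion API call.

```python
# Sketch of a topn-style flow: render the most expensive functions from
# a profiler into a prompt and ask an LLM to explain and optimize them.

def build_topn_prompt(top_functions):
    """Render (function name, CPU %) pairs into an analysis prompt."""
    lines = [f"{pct:5.1f}%  {name}" for name, pct in top_functions]
    return (
        "The following functions consume the most CPU on this host:\n"
        + "\n".join(lines)
        + "\nExplain what each function does and suggest optimizations."
    )

def llm(prompt):
    # Stand-in for a real chat-completion call (e.g. to the OpenAI API).
    return "[model response would appear here]"

top = [("__memmove_avx_unaligned", 31.2), ("deflate", 18.4)]
print(llm(build_topn_prompt(top)))
```

A real implementation would also pass the profiler's stack traces and any source annotations as additional context.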

The video also demonstrates the chat functionality provided by sysgrok. When the --chat parameter is passed, sysgrok drops into a chat session after each response from the LLM.

This feature also applies generally to the output of Linux command-line tools. For example, in "Linux Performance Analysis in 60 Seconds," Brendan Gregg outlines 10 commands an SRE should run when first connecting to a host that is experiencing performance or stability issues. The analyzecmd subcommand takes as input a host to connect to and a command to execute, then analyzes and summarizes the command's output for the user. We can use this to automate the process Gregg describes and give the user a summary of all the data produced by the 10 commands, saving them from having to inspect each command's output one by one.
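An analyzecmd-style flow can be sketched like this. It is an illustrative assumption, not sysgrok's actual code: the sketch runs the command locally via `sh -c` so it is self-contained, where a real implementation would run it over ssh on the target host, and `llm()` stands in for a real model call.

```python
# Sketch of an analyzecmd-style flow: run a command, capture its output,
# and ask an LLM to summarize it for the engineer.
import subprocess

def run_remote(host, command):
    # A real implementation would ssh to `host`; here we run the command
    # locally so the sketch works standalone.
    result = subprocess.run(["sh", "-c", command],
                            capture_output=True, text=True)
    return result.stdout

def llm(prompt):
    # Stand-in for a real chat-completion call.
    return "[summary of the command output would appear here]"

output = run_remote("webserver-1", "uptime")
print(llm(f"Summarize this `uptime` output for an SRE:\n{output}"))
```

Running each of Gregg's 10 commands through this loop and concatenating the summaries yields the single overview described above.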


Approach 2: As a focused automation solution for specific performance- and reliability-related tasks

In performance engineering and SRE work, some tasks recur. For these, we can build targeted automated assistants that can be directly used by engineers or sysgrok itself to solve other problems.

For example, the findfaster subcommand takes the name of a library or program as input and uses the LLM to find a faster, equivalent replacement. This is a very common task in performance engineering.
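The core of such a task is a well-targeted prompt. The wording below is a hypothetical illustration of the idea, not sysgrok's actual prompt.

```python
# Sketch of a findfaster-style prompt: ask the model for faster,
# functionally equivalent replacements for a given library or program.

def findfaster_prompt(target):
    return (
        f"You are a performance engineer. Suggest faster replacements "
        f"for '{target}' that provide equivalent functionality. For each "
        f"suggestion, explain the trade-offs and migration effort."
    )

print(findfaster_prompt("zlib"))
```

For a target like zlib, a model will typically surface well-known alternatives such as zlib-ng or libdeflate; the value of the subcommand is packaging that question so it can be asked consistently.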


Another example of this approach in sysgrok is the explainfunction subcommand. It takes the name of a library and of a function in that library. It explains the purpose of the library and its common use cases, and then explains the function. Finally, it suggests possible optimizations in case the library and function are consuming significant CPU resources.
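The three stages described above can be sketched as three prompts issued in sequence. The wording is a hypothetical illustration, not sysgrok's actual prompts.

```python
# Sketch of an explainfunction-style query: explain the library, explain
# the function, then ask for optimizations given that the pair is hot.

def explainfunction_prompts(library, function):
    return [
        f"Explain the purpose of the '{library}' library and its "
        f"common use cases.",
        f"Explain what the function '{function}' in '{library}' does.",
        f"'{function}' in '{library}' is consuming significant CPU on "
        f"this host. Suggest possible optimizations.",
    ]

for prompt in explainfunction_prompts("zlib", "inflate"):
    print(prompt)
```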


Approach 3: As an automated root cause analysis tool for performance and reliability issues

The usefulness of LLMs is not limited to question answering, summarization, and similar tasks. Nor is it limited to one-shot use, i.e., asking a single, isolated question. The sysgrok debughost subcommand demonstrates how to use the LLM as the "brains" of an agent whose goal is automated problem solving. In this mode, the LLM is embedded in a harness that uses it to decide how to debug a particular problem, and that enables it to connect to hosts, execute commands, and access other data sources.
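A minimal agent loop of this kind can be sketched as follows. This is an assumption-laden illustration, not sysgrok's actual protocol: the `llm()` stub is deterministic where a real model would choose commands based on the problem and the output seen so far, the "DIAGNOSIS:" stop convention is invented for the sketch, and commands run locally rather than over ssh.

```python
# Sketch of a debughost-style agent loop: the LLM proposes a command,
# the harness runs it and appends the output to the conversation, and
# the loop ends when the LLM emits a diagnosis.
import subprocess

def llm(history):
    # Stand-in for a real model call.
    if len(history) == 1:
        return "echo 'load average: 0.01'"
    return "DIAGNOSIS: load is low; look elsewhere for the latency"

def debug_host(problem, max_steps=5):
    history = [f"Problem: {problem}"]
    for _ in range(max_steps):
        action = llm(history)
        if action.startswith("DIAGNOSIS:"):
            return action
        out = subprocess.run(["sh", "-c", action],
                             capture_output=True, text=True).stdout
        history.append(f"$ {action}\n{out}")
    return "DIAGNOSIS: inconclusive within the step budget"

print(debug_host("The web server is experiencing high latency"))
```

The step budget and the requirement that the model explicitly signal completion are the two guardrails that keep such a loop bounded.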


The debughost subcommand is probably the most experimental part of sysgrok right now. It demonstrates a step on the road to automated agents for performance debugging, but significant research and development is still needed to get there.

Conclusion

In this post, I introduced sysgrok, a new open source AI assistant for analyzing, understanding, and optimizing systems. We also discussed the three broad categories of approaches that sysgrok implements:

  1. Analysis engine for performance, reliability, and other system-related data: see the topn, stacktrace, analyzecmd, and code subcommands.
  2. Focused automation solutions for specific performance- and reliability-related tasks: see the explainprocess, explainfunction, and findfaster subcommands.
  3. Automatic root cause analysis for performance and reliability issues: see the debughost subcommand.

You can find the sysgrok project on GitHub. Feel free to open PRs and issues, or if you want to discuss LLM projects or applications in general, you can contact me directly at [email protected].


Origin blog.csdn.net/UbuntuTouch/article/details/131487546