Footprint Analytics

The emergence of GPT has drawn global attention to large language models, and all walks of life are trying to use this cutting-edge technology to improve work efficiency and accelerate industry development. Future3 Campus and Footprint Analytics jointly conducted an in-depth study of the many possibilities of combining AI and Web3, and jointly released a research report titled "Analysis of the Current Situation, Competitive Landscape, and Future Opportunities of the Integration of AI and the Web3 Data Industry." The report is divided into two parts: this article is the first part, co-authored by Footprint Analytics researchers Lesley and Shelly; the second part is co-authored by Future3 Campus researchers Sherry and Humphrey.

Summary:

  • The development of LLM technology has made people pay more attention to the combination of AI and Web3, and new application paradigms are gradually unfolding. In this article, we will focus on how to use AI to improve the experience and productivity of Web3 data.

  • Due to the early stage of the industry and the characteristics of blockchain technology, the Web3 data industry faces many challenges, including data sources, update frequency, anonymity attributes, etc., making the use of AI to solve these problems a new focus.

  • Compared with traditional AI, LLM's advantages in scalability, adaptability, efficiency, task decomposition, accessibility, and ease of use open up room for improving the experience and production efficiency of blockchain data.

  • LLM requires a large amount of high-quality data for training, and the blockchain field has rich vertical knowledge and open data, which can provide learning materials for LLM.

  • LLM can also help produce and enhance the value of blockchain data, such as data cleaning, annotation, generating structured data, etc.

  • LLM is not a panacea and needs to be applied to specific business needs. It is necessary to take advantage of the high efficiency of LLM while also paying attention to the accuracy of the results.

1. Development and combination of AI and Web3

1.1. Development History of AI

The history of artificial intelligence (AI) can be traced back to the 1950s. Since 1956, people have paid increasing attention to the field of artificial intelligence and gradually developed early expert systems to help solve problems in specialized fields. The subsequent rise of machine learning expanded the application scope of AI, and AI began to be used more widely across industries. Today, the explosion of deep learning and generative AI has opened up seemingly infinite possibilities. Each step has been driven by continuous challenge and innovation in pursuit of higher levels of intelligence and wider fields of application.

Figure 1: AI development history

On November 30, 2022, ChatGPT was launched, demonstrating for the first time the possibility of low-threshold, high-efficiency interaction between AI and humans. ChatGPT triggered a broader discussion of artificial intelligence and redefined the way people interact with AI, making it more efficient, intuitive, and human-like. It also drew attention to other generative AI efforts, and models from Anthropic (Amazon-backed), Google DeepMind, Meta's Llama, and others subsequently entered the public eye. At the same time, practitioners in various industries began actively exploring how AI could advance their fields, or sought to stand out by combining their work with AI technology, further accelerating the penetration of AI across sectors.

1.2. The integration of AI and Web3

Web3's vision begins with reforming the financial system, aims to return more power to users, and is expected to drive the transformation of modern economies and cultures. Blockchain technology provides a solid technical foundation for this goal: it not only redesigns value transmission and incentive mechanisms, but also supports resource allocation and the decentralization of power.

Figure 2: Web3 development history

As early as 2020, Fourth Revolution Capital (4RC), an investment firm in the blockchain field, pointed out that combining blockchain technology with AI could disrupt existing industries by decentralizing global sectors such as finance, healthcare, e-commerce, and entertainment.

At present, the combination of AI and Web3 mainly focuses on two major directions:

● Use AI to improve productivity and user experience.

● Combine the technical characteristics of blockchain (transparency, security, decentralized storage, traceability, and verifiability) and the decentralized production relationships of Web3 to solve pain points that traditional technologies cannot solve, or to encourage community participation and improve production efficiency.

The combination of AI and Web3 in the market has the following exploration directions:

Figure 3: Panorama of the combination of AI and Web3

● Data: Blockchain technology can be applied to model data storage, providing encrypted data sets, protecting data privacy, recording the source and usage of model data, and verifying the authenticity of the data. By accessing and analyzing data stored on the blockchain, AI can extract valuable information and use it for model training and optimization. At the same time, AI can also be used as a data production tool to improve the production efficiency of Web3 data.

● Algorithm: Algorithms in Web3 can provide AI with a more secure, trustworthy, and autonomously controlled computing environment, and can offer cryptographic protection for AI systems. Security fences embedded around model parameters prevent systems from being abused or maliciously operated. AI can interact with algorithms in Web3, for example by leveraging smart contracts to perform tasks, validate data, and execute decisions. At the same time, AI algorithms can provide more intelligent and efficient decisions and services for Web3.

● Computing power: Web3’s distributed computing resources can provide high-performance computing power for AI. AI can leverage distributed computing resources in Web3 for model training, data analysis, and prediction. By distributing computing tasks to multiple nodes on the network, AI can speed up calculations and process larger amounts of data.

In this article, we will focus on exploring how to use AI technology to improve the productivity and user experience of Web3 data.

2. Current status of Web3 data

2.1. Web2 & Web3 data industry comparison

Data is the core ingredient of AI, and Web3 data is very different from the Web2 data we are familiar with. The difference lies mainly in the application architectures of Web2 and Web3, which result in different data characteristics.

2.1.1. Web2 & Web3 application architecture comparison

Figure 4: Web2 & Web3 application architecture

In the Web2 architecture, a single entity (usually a company) controls a web page or app. The company has absolute control over the content it builds: it can decide who can access the content and logic on its servers, what rights users have, and how long the content will exist online. Many cases have shown that Internet companies have the right to change the rules on their platforms and even suspend services to users, without users being able to retain the value they created.

The Web3 architecture relies on the concept of a universal state layer to place part or all of the content and logic on a public blockchain. This content and logic are publicly recorded on the blockchain and can be accessed by everyone, and users can directly control their corresponding on-chain content and logic. Unlike in Web2, Web3 users do not need authorized accounts or API keys to interact with content on the blockchain (except for certain administrative operations).

2.1.2. Comparison of data characteristics between Web2 and Web3

Figure 5: Comparison of data characteristics between Web2 and Web3

Web2 data is typically closed and highly restricted, with complex permission controls, high maturity, multiple data formats, strict compliance with industry standards, and complex business logic abstractions. This data is large in scale but has relatively low interoperability, is usually stored on central servers, places little emphasis on privacy protection, and is mostly non-anonymous.

In contrast, Web3 data is more open, with broader access rights, though it is less mature: unstructured data dominates, standardization is rare, and business logic abstraction is relatively simple. Web3 data is smaller in scale than Web2 data, but it has high interoperability (for example, EVM compatibility) and can be stored in either a decentralized or centralized manner. It also emphasizes user privacy, and users usually interact on-chain anonymously.

2.2. Current status and prospects of the Web3 data industry, and challenges encountered

In the Web2 era, data is as precious as oil, and accessing and obtaining large-scale data has always been a great challenge. In Web3, the openness and sharing of data suddenly make it feel as if "oil is everywhere", making it easier for AI models to obtain more training data, which is crucial to improving model performance and intelligence. However, many problems remain to be solved in processing Web3's "new oil", mainly including the following:

● Data sources: On-chain data "standards" are complicated and scattered, and data processing consumes a great deal of labor.

When processing on-chain data, a time-consuming and labor-intensive indexing process has to be performed repeatedly, requiring developers and data analysts to spend considerable time and resources adapting to data differences between different chains and different projects. The on-chain data industry lacks unified production and processing standards: apart from being recorded on the blockchain ledger, events, logs, and traces are largely defined and produced (or generated) by each project itself, which makes it difficult for non-professional traders to discern and find the most accurate and trustworthy data, adding to the difficulty of making on-chain trading and investment decisions. For example, the decentralized exchanges Uniswap and PancakeSwap may differ in data processing methods and data caliber, and procedures such as checking and unifying calibers further increase the complexity of data processing.

● Data update: The data on the chain is large in volume and updated frequently, making it difficult to process it into structured data in a timely manner.

The blockchain changes constantly, and data updates are measured in seconds or even milliseconds. The frequent generation and updating of data makes it difficult to maintain high-quality processing and timely updates, so automated processing pipelines are very important; they also pose a major challenge in terms of cost and efficiency. The Web3 data industry is still in its infancy, and with the constant emergence of new contracts and iterative updates, the lack of standards and the diversity of data formats further increase the complexity of data processing.

● Data analysis: The anonymous nature of on-chain data makes it difficult to identify who is behind the data.

On-chain data often does not contain enough information to clearly identify each address, making it difficult to link the data with economic, social, or legal developments off-chain. However, trends in on-chain data are closely related to the real world, and understanding the correlation between on-chain activity and specific individuals or entities in the real world is very important for scenarios such as data analysis.

With the discussion of productivity changes triggered by large language model (LLM) technology, whether AI can be used to solve these challenges has also become one of the focuses in the Web3 field.

3. The chemical reaction caused by the collision of AI and Web3 data

3.1. Comparison of features between traditional AI and LLM

In terms of model training, traditional AI models are usually small, with parameter counts ranging from tens of thousands to millions, but ensuring accurate output requires a large amount of manually labeled data. Part of the reason LLM is so powerful is that it uses massive corpora to fit tens or hundreds of billions of parameters, which greatly improves its ability to understand natural language; but it also means that more data is needed for training, and training is very expensive.

In terms of scope of capabilities and operating methods, traditional AI is more suitable for tasks in specific fields and can provide relatively accurate and professional answers. In contrast, LLM is more suitable for general tasks, but is prone to hallucination problems, which means that in some cases its answers may not be precise or professional enough, or even completely wrong. Therefore, if objective, trustworthy, and traceable results are needed, multiple checks, multiple trainings, or the introduction of additional error correction mechanisms and frameworks may be necessary.

Figure 6: Comparison of features between traditional AI and large language models (LLM)

3.1.1. The practice of traditional AI in the field of Web3 data

Traditional AI has already shown its importance in the blockchain data industry, bringing more innovation and efficiency to this field. For example, the 0xScope team used AI technology to build a cluster analysis algorithm based on graph computing, which helps accurately identify a user's related addresses through the weighted combination of different rules; this deep learning algorithm improves the accuracy of address clustering and provides a more precise tool for data analysis. Nansen uses AI for NFT price prediction, providing insights into NFT market trends through data analysis and natural language processing. Trusta Labs uses machine learning methods based on asset graph mining and user behavior sequence analysis to enhance the reliability and stability of its Sybil detection solution and help maintain the security of the blockchain network ecosystem. Goplus leverages traditional AI in its operations to improve the security and efficiency of decentralized applications (dApps): it collects and analyzes security information from dApps and provides rapid risk alerts to help reduce risk exposure on these platforms, including detecting risks in a dApp's master contract by assessing factors such as open-source status and potential malicious behavior, as well as collecting detailed audit information such as audit company credentials, audit time, and links to audit reports. Footprint Analytics uses AI to generate code that produces structured data and to analyze NFT transactions, wash trading, and the screening of bot accounts.

However, traditional AI works with limited information and focuses on using predetermined algorithms and rules to perform preset tasks, while LLM learns from large-scale natural language data and can understand and generate natural language, which makes it better suited to processing complex and huge amounts of text data.

Recently, as LLM has made significant progress, people have also conducted some new thinking and exploration on the combination of AI and Web3 data.

3.1.2. Advantages of LLM

LLM has the following advantages over traditional artificial intelligence:

● Scalability: LLM supports large-scale data processing

LLM excels in scalability and can handle large amounts of data and user interactions efficiently. This makes it ideal for tasks that require large-scale information processing, such as text analysis or large-scale data cleaning. Its high degree of data processing capabilities provides powerful analysis and application potential for the blockchain data industry.

● Adaptability: LLM can learn to adapt to the needs of multiple fields

LLM is highly adaptable and can be fine-tuned for specific tasks or embedded in industry or private databases, allowing it to quickly learn and adapt to the nuances of different domains. This feature makes LLM an ideal choice for solving multi-domain and multi-purpose problems, providing broader support for the diversity of blockchain applications.

● Improve efficiency: LLM automates tasks to improve efficiency

The high efficiency of LLM brings significant convenience to the blockchain data industry. It automates tasks that would otherwise require significant amounts of manual time and resources, thereby increasing productivity and reducing costs. LLM can generate large amounts of text, analyze massive data sets, or perform a variety of repetitive tasks in seconds, reducing waiting and processing time and making blockchain data processing more efficient.

● Task decomposition: LLM can generate specific plans for certain tasks and divide large tasks into small steps.

LLM Agent has the unique ability to generate specific plans for certain jobs, breaking down complex tasks into small manageable steps. This feature is very beneficial for processing large-scale blockchain data and performing complex data analysis tasks. By breaking large jobs into small tasks, LLM can better manage the data processing process and output high-quality analysis.

This capability is critical for AI systems that perform complex tasks, such as robotic automation, project management, and natural language understanding and generation, enabling them to translate high-level mission goals into detailed courses of action, improving the efficiency and accuracy of task execution.

● Accessibility and ease of use: LLM provides user-friendly interactions in natural language

The accessibility of LLM enables more users to easily interact with data and systems, making these interactions more user-friendly. Through natural language, LLM makes data and systems easier to access and interact with, without requiring users to learn complex technical terms or specific commands such as SQL, R, Python, etc. for data acquisition and analysis. This feature broadens the audience for blockchain applications, allowing more people to access and use Web3 applications and services, regardless of whether they are tech-savvy or not, thereby promoting the development and popularity of the blockchain data industry.

3.2. Integration of LLM and Web3 data

Figure 7: Integration of blockchain data and LLM

The training of large language models requires relying on large-scale data to build models by learning patterns in the data. The interaction and behavioral patterns contained in blockchain data are the fuel for LLM learning. The amount and quality of data also directly affect the learning effect of the LLM model.

Data is not just a consumable for LLM; LLM also helps produce data and can even provide feedback. For example, LLM can assist data analysts with data preprocessing, such as data cleaning and annotation, or generate structured data to remove noise and highlight useful information.

3.3. Common technical solutions to enhance LLM

The emergence of ChatGPT not only shows us the general ability of LLM to solve complex problems, but has also triggered a global exploration of layering external capabilities on top of those general capabilities. This includes enhancing general capabilities (context length, complex reasoning, mathematics, code, multimodality, etc.) as well as expanding external capabilities (processing unstructured data, using more complex tools, interacting with the physical world, etc.). How to graft proprietary knowledge in the crypto field and personalized private data onto the general capabilities of large models is the core technical issue for commercializing large models in the crypto vertical.

Currently, most applications focus on retrieval-augmented generation (RAG), such as prompt engineering and embedding technology, and most existing agent tools focus on improving the efficiency and accuracy of RAG. The main reference architectures of LLM-based application stacks on the market are as follows:

● Prompt Engineering

Figure 8: Prompt Engineering

Currently, most practitioners use a foundational approach when building applications, namely prompt engineering. This method changes the input of the model by designing specific prompts to meet the needs of a particular application, and it is the most convenient and fastest approach. However, basic prompt engineering has limitations, such as untimely knowledge updates, cumbersome prompt content, limited input context length (in-context length), and constraints on multi-turn question answering.
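
To make this concrete, here is a minimal, hypothetical sketch of prompt engineering applied to on-chain data: a prompt template that asks a general-purpose LLM to classify a decoded transaction. The `call_llm` helper, the template wording, and the field names are illustrative assumptions, not any specific vendor's API.

```python
# Minimal prompt-engineering sketch: classify a decoded on-chain transaction.
# `call_llm(prompt) -> str` is a placeholder for whatever chat-completion
# client is used; only the prompt template is specific to this technique.

TRANSACTION_PROMPT = """You are a Web3 data analyst.
Classify the following transaction into one of:
swap, nft_trade, transfer, contract_deployment, other.
Answer with the label only.

Transaction:
- from: {sender}
- to: {receiver}
- method: {method}
- value (ETH): {value}
"""

def classify_transaction(tx: dict, call_llm) -> str:
    """Fill the template with transaction fields and ask the model."""
    prompt = TRANSACTION_PROMPT.format(
        sender=tx["from"],
        receiver=tx["to"],
        method=tx.get("method", "unknown"),
        value=tx.get("value", 0),
    )
    return call_llm(prompt).strip()

# Example usage with any client wrapped as call_llm(prompt) -> str:
# label = classify_transaction(
#     {"from": "0xabc...", "to": "0xdef...",
#      "method": "swapExactTokensForTokens", "value": 0.5},
#     call_llm=my_llm_client,
# )
```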

Therefore, the industry is also studying more advanced improvement solutions, including embedding and fine-tuning.

● Embedding

Embedding is a data representation method widely used in the field of artificial intelligence, which can efficiently capture the semantic information of objects. By mapping object attributes into vector form, embedding technology can quickly find the most likely correct answer by analyzing the correlation between vectors. Embeddings can be built on top of LLMs to take advantage of the model’s rich linguistic knowledge learned on a wide range of corpora. Information about specific tasks or fields is introduced into the pre-trained large model through embedding technology, making the model more specialized and more adaptable to specific tasks, while retaining the versatility of the basic model.

In layman's terms, embedding is similar to giving a reference book containing task-related knowledge to a broadly trained college student and asking him to complete the task with its help; he can consult the reference book at any time to solve specific problems.
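
As a rough illustration of this "reference book" idea, the sketch below retrieves the most relevant domain passages by embedding similarity and attaches them to the prompt. The `embed` function is assumed to be supplied by some embedding model; the rest is plain Python.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question: str, documents: list[str], embed, top_k: int = 3) -> list[str]:
    """Return the top_k documents most relevant to the question.

    `embed(text) -> np.ndarray` is assumed to be provided by any embedding
    model (a hosted API or a local sentence-embedding model).
    """
    q_vec = embed(question)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(q_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question: str, context_docs: list[str]) -> str:
    """Attach the retrieved 'reference book' passages to the question."""
    context = "\n---\n".join(context_docs)
    return f"Use the following reference material to answer.\n{context}\n\nQuestion: {question}"
```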

● Fine-tuning

Figure 9: Fine Tuning

Fine-tuning differs from embedding in that it updates the parameters of a pre-trained language model to adapt it to a specific task. This approach allows the model to exhibit better performance on that task while remaining general. The core idea of fine-tuning is to adjust model parameters to capture specific patterns and relationships relevant to the target task. However, the upper limit of a fine-tuned model's general capabilities is still constrained by the base model itself.

In layman's terms, fine-tuning is similar to giving professional courses to college students who have received a comprehensive education, allowing them to master specialized knowledge on top of their general abilities and solve problems in that professional field on their own.
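
For illustration only, the sketch below shows one common way to prepare a small supervised fine-tuning dataset in a JSONL chat format; the exact schema depends on the fine-tuning service being used, and the example question-answer pairs are made up for demonstration.

```python
import json

# Made-up domain examples pairing questions about on-chain data with the
# desired expert answers; a real dataset would contain many more pairs.
examples = [
    {
        "question": "What does a Sybil cluster typically look like on-chain?",
        "answer": "Many fresh addresses funded from one source, performing near-identical interactions in a short time window.",
    },
    {
        "question": "How is DEX trade volume usually computed?",
        "answer": "By decoding swap events and summing the USD value of one leg of each swap per period.",
    },
]

# Write one chat-formatted record per line (a common JSONL convention).
with open("finetune_dataset.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "You are a Web3 data analyst."},
                {"role": "user", "content": ex["question"]},
                {"role": "assistant", "content": ex["answer"]},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```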

● Retrain LLM

Current LLMs, while powerful, may not meet all needs. Retraining an LLM is a highly customized solution by introducing new data sets and adjusting model weights to make it more suitable for a specific task, need, or domain. However, this method requires a lot of computing resources and data, and managing and maintaining the retrained model is also one of the challenges.

● Agent model

Figure 10: Agent model

The Agent model is a method for building intelligent agents that uses an LLM as the core controller. The system also includes several key components to provide more comprehensive intelligence.

● Planning: Divide large tasks into small tasks so they are easier to complete

● Memory, reflection: improve future plans by reflecting on past actions

● Tools, tool usage: Agents can call external tools to obtain more information, such as calling search engines, calculators, etc.

The artificial intelligence agent model has strong language understanding and generation capabilities, and can solve general problems, perform task decomposition and self-reflection. This gives it broad potential in a variety of applications. However, the agent model also has some limitations, such as being limited by context length, prone to errors in long-term planning and task splitting, and unstable reliability of output content. These limitations require long-term continuous research and innovation to further expand the application of agent models in different fields.
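
The sketch below strings these components together in a minimal, illustrative agent loop: the LLM plans sub-tasks, executes each with an optional tool, and keeps a memory of past steps before synthesizing a final answer. `call_llm` and the tool registry are placeholders, not a specific agent framework.

```python
# Minimal agent loop sketch: plan, use tools, remember, then answer.
# `call_llm(prompt) -> str` and the tool functions are placeholders.

def run_agent(goal: str, call_llm, tools: dict, max_steps: int = 5) -> str:
    memory: list[str] = []

    # 1. Planning: break the large task into small steps.
    plan = call_llm(
        f"Break this task into at most {max_steps} short steps, one per line:\n{goal}"
    )
    steps = [s.strip() for s in plan.splitlines() if s.strip()][:max_steps]

    for step in steps:
        # 2. Tool use: let the model pick a tool (or none) for this step.
        choice = call_llm(
            f"Step: {step}\nAvailable tools: {list(tools)}\n"
            "Reply with one tool name, or 'none'."
        ).strip()
        observation = tools[choice](step) if choice in tools else call_llm(step)

        # 3. Memory / reflection: keep a work log that later steps can see.
        memory.append(f"{step} -> {observation}")

    # Synthesize the final answer from the accumulated memory.
    return call_llm(
        f"Goal: {goal}\nWork log:\n" + "\n".join(memory) + "\nGive the final answer."
    )
```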

The various techniques above are not mutually exclusive and can be used together in the process of training and enhancing the same model. Developers can fully exploit the potential of existing large language models and try different methods to meet increasingly complex application requirements. This combined use not only helps improve model performance but also helps drive rapid innovation and advancement in Web3 technology.

However, we believe that although existing LLMs have played an important role in the rapid development of Web3, before committing fully to these existing models (such as OpenAI's models, Llama 2, and other open-source LLMs), we can proceed from the shallow to the deep: start with RAG strategies such as prompt engineering and embedding, and only then carefully consider fine-tuning and retraining the base model.

3.4. How LLM accelerates various processes of blockchain data production

3.4.1. General processing flow of blockchain data

Today, builders in the blockchain field are gradually realizing the value of data products. This value spans multiple areas such as product operations monitoring, predictive models, recommendation systems, and data-driven applications. Although this awareness is growing, data processing, an indispensable step between data acquisition and data application, is often overlooked.

Figure 11: Blockchain data processing process

● Convert the original unstructured data of the blockchain, such as events or logs, etc., into structured data

Every transaction or event on the blockchain generates events or logs, and this data is usually unstructured. This step is the first entry point for obtaining data, but the data still needs further processing to extract useful information and obtain structured raw data, which includes organizing the data, handling exceptions, and converting it into a common format.
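
As a concrete, simplified example of this first step, the sketch below decodes a raw ERC-20 Transfer event log (as returned by a standard JSON-RPC node) into a structured record, using only the well-known event layout: topic 0 is the event signature, topics 1 and 2 hold the indexed from/to addresses, and the data field carries the amount. The output field names are illustrative.

```python
# keccak256("Transfer(address,address,uint256)"), the standard ERC-20 event signature
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def decode_erc20_transfer(log: dict):
    """Turn one raw JSON-RPC log entry into a structured ERC-20 transfer record.

    Expects the usual log shape: 'topics' (list of hex strings), 'data'
    (hex string), plus metadata fields. Returns None for non-Transfer logs.
    """
    topics = log.get("topics", [])
    if len(topics) < 3 or topics[0].lower() != TRANSFER_TOPIC:
        return None  # not an ERC-20 Transfer event

    data = log.get("data") or "0x"
    return {
        "tx_hash": log.get("transactionHash"),
        "block_number": log.get("blockNumber"),
        "token_address": log.get("address"),
        # Indexed params are 32-byte topics; an address is the last 20 bytes.
        "from_address": "0x" + topics[1][-40:],
        "to_address": "0x" + topics[2][-40:],
        # The non-indexed uint256 amount lives in the data field.
        "amount_raw": int(data, 16) if data != "0x" else 0,
    }
```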

● Convert structured raw data into abstract tables with business meaning

After obtaining the structured raw data, the business needs to be further abstracted: the data is mapped to business entities and indicators, such as transaction volume, user counts, and other business metrics, turning the raw data into data that is meaningful for the business and for decision-making.

● Calculate and extract business indicators from abstract tables

Once the abstracted business data is available, further calculations can be performed on it to obtain various important derived indicators, for example core metrics such as the monthly growth rate of total transaction volume and user retention rate. These indicators can be implemented with tools such as SQL and Python, and they help monitor business health, understand user behavior and trends, and support decision-making and strategic planning.
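
Assuming an abstracted trade table with the illustrative columns `block_time` and `volume_usd`, a derived indicator such as month-over-month volume growth might be computed along these lines with pandas:

```python
import pandas as pd

def monthly_volume_growth(trades: pd.DataFrame) -> pd.DataFrame:
    """Derive monthly total volume and its month-over-month growth rate.

    `trades` is an abstracted business table with at least:
      - 'block_time'  (timestamp of each trade)
      - 'volume_usd'  (trade size in USD)
    """
    monthly = (
        trades
        .assign(month=trades["block_time"].dt.to_period("M"))
        .groupby("month")["volume_usd"]
        .sum()
        .to_frame("total_volume_usd")
    )
    # Month-over-month growth rate of total transaction volume.
    monthly["mom_growth"] = monthly["total_volume_usd"].pct_change()
    return monthly
```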

3.4.2. Optimization after adding LLM to the blockchain data generation process

LLM can solve multiple problems in blockchain data processing, including but not limited to the following:

Process unstructured data:

● Extract structured information from transaction logs and events: LLM can analyze blockchain transaction logs and events, extract key information such as the transaction amount, counterparty addresses, and timestamp, and convert unstructured data into data with business meaning, making it easier to analyze and understand (a minimal sketch follows this list).

● Clean data and identify abnormal data: LLM can automatically identify and clean inconsistent or abnormal data to help ensure data accuracy and consistency, thereby improving data quality.
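
A minimal sketch of such LLM-assisted extraction is shown below. The `call_llm` helper and the field list are illustrative, and the parsed output is deliberately validated rather than trusted blindly, since model answers can be malformed or hallucinated.

```python
import json

EXTRACTION_PROMPT = """Extract the following fields from the transaction description
and reply with JSON only: amount, token, from_address, to_address, timestamp.
Use null for anything not present.

Description:
{text}
"""

def extract_fields(text: str, call_llm) -> dict:
    """LLM-assisted extraction of structured fields from unstructured text."""
    raw = call_llm(EXTRACTION_PROMPT.format(text=text))
    try:
        fields = json.loads(raw)
    except json.JSONDecodeError:
        return {"_error": "model did not return valid JSON", "_raw": raw}
    if not isinstance(fields, dict):
        return {"_error": "unexpected JSON shape", "_raw": raw}

    # Record which expected fields are missing instead of silently trusting the output.
    expected = {"amount", "token", "from_address", "to_address", "timestamp"}
    fields["_missing"] = sorted(expected - fields.keys())
    return fields
```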

Perform business abstraction:

● Mapping original on-chain data to business entities: LLM can map original blockchain data to business entities, such as mapping blockchain addresses to actual users or assets, making business processing more intuitive and effective.

● Process unstructured on-chain content and label it: LLM can analyze unstructured data, such as tweets, and label them as positive, negative, or neutral in sentiment, thereby helping users better understand sentiment tendencies on social media.

Natural language interpretation of data:

● Calculate core indicators: Based on business abstraction, LLM can calculate core business indicators, such as user transaction volume, asset value, market share, etc., to help users better understand the key performance of their business.

● Query data: LLM can understand user intentions and generate SQL queries via AIGC, allowing users to make query requests in natural language without having to write complex SQL statements, which increases the accessibility of database queries (see the sketch after this list).

● Indicator selection, sorting, and correlation analysis: LLM can help users select, sort, and analyze multiple different indicators to better understand the relationships and correlations between them, thereby supporting deeper data analysis and decision-making.

● Generate natural language descriptions of business abstractions: LLM can generate natural language summaries or explanations based on factual data to help users better understand business abstractions and data indicators, improve interpretability, and make decisions more rational.
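
The sketch referenced above shows one hedged way such a natural-language query interface might look: the schema and the `call_llm` helper are illustrative assumptions, and a basic guard keeps the generated statement read-only before it is ever executed.

```python
SCHEMA = """Table dex_trades(
  block_time TIMESTAMP,
  trader_address TEXT,
  token_symbol TEXT,
  volume_usd REAL
)"""

SQL_PROMPT = """You translate questions about on-chain data into SQL.
Schema:
{schema}

Question: {question}
Reply with a single SELECT statement only, no explanation."""

def question_to_sql(question: str, call_llm) -> str:
    """Generate SQL from a natural-language question, with a basic safety check."""
    sql = call_llm(SQL_PROMPT.format(schema=SCHEMA, question=question)).strip().rstrip(";")
    if not sql.lower().startswith("select"):
        raise ValueError(f"Refusing to run a non-SELECT statement: {sql!r}")
    return sql

# e.g. question_to_sql("What was last month's total DEX volume in USD?", my_llm_client)
```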

3.5. Current use cases

Based on LLM's own technical and product-experience advantages, it can be applied to different on-chain data scenarios. Technically, these scenarios can be divided into four categories, from easy to difficult:

● Data conversion: Perform operations such as data enhancement and reconstruction, for example text summarization, classification, and information extraction. This type of application is quick to develop, but is better suited to general scenarios than to simple batch processing of large volumes of data.

● Natural language interface: Connect LLM to knowledge bases or tools to automate question-and-answer or basic tool use. This can be used to build professional chatbots, but its actual value is affected by other factors such as the quality of the knowledge base it is connected to.

● Workflow automation: Use LLM to standardize and automate business processes. This can be applied to more complex blockchain data processing processes, such as deconstructing smart contract operation processes, risk identification, etc.

● Assistant bots and copilot systems: an assistant system is an enhanced system built on a natural language interface that integrates more data sources and functions, greatly improving users' work efficiency.

Figure 12: LLM application scenario

3.6. Limitations of LLM

3.6.1. Industry status: mature applications, problems being overcome and unsolved challenges

In the field of Web3 data, although some important progress has been made, there are still some challenges.

Relatively mature applications:

● Use LLM for information processing: AI technologies such as LLM have been successfully used to generate text summaries, abstracts, and explanations, helping users extract key information from long articles and professional reports and improving the readability and understandability of data.

● Use AI to solve development problems: LLM has been used to solve problems in the development process, such as replacing StackOverflow or search engines to provide developers with question answers and programming support.

Problems to be solved and being explored:

● Use LLM to generate code: The industry is working to apply LLM technology to converting natural language into SQL queries, to improve the automation and understandability of database queries. However, there are many difficulties in the process. For example, in some situations the generated code requires extremely high accuracy: the syntax must be 100% correct so that the program runs without bugs and produces correct results. Difficulties also include ensuring the success rate and accuracy of answers, as well as a deep understanding of the business.

● Data annotation issues: Data annotation is crucial for the training of machine learning and deep learning models, but in the field of Web3 data, especially when dealing with anonymous blockchain data, the complexity of annotating data is high.

● Accuracy and hallucination issues: Hallucinations in AI models can be caused by multiple factors, including biased or insufficient training data, overfitting, limited context understanding, lack of domain knowledge, adversarial attacks, and model architecture. Researchers and developers need to continuously improve model training and calibration methods to improve the credibility and accuracy of generated text.

● Using data for business analysis and article output: Generating business analysis and articles from data remains challenging. The complexity of the problems, the need for carefully designed prompts, high-quality data and sufficient data volume, and methods to reduce hallucinations are all issues still to be solved.

● Automatically index smart contract data by business domain for data abstraction: Automatically indexing smart contract data across different business domains for data abstraction is still an unsolved problem. It requires comprehensive consideration of the characteristics of different business fields as well as the diversity and complexity of the data.

● Processing time series data, tabular and document data, and other more complex modalities: Multimodal models such as DALL·E 2 are very good at generating common modalities such as images from text. In the blockchain and financial fields, some time series data need special treatment and cannot be handled simply by vectorizing text. Combining time series data with text and cross-modal joint training are important research directions for achieving intelligent data analysis and applications.

3.6.2. Why LLM alone cannot perfectly solve the problems of the blockchain data industry

As a language model, LLM is more suitable for handling scenarios that require higher fluency, but in pursuit of accuracy, further adjustments to the model may be required. The following framework can provide some reference when applying LLM to the blockchain data industry.

Figure 13: Fluency, accuracy and use case risks of LLM output in the blockchain data industry

When evaluating the suitability of LLM in different applications, it is crucial to focus on fluency and accuracy. Fluency refers to whether the model's output is natural and smooth, while accuracy indicates whether the model's answers are accurate. These two dimensions have different requirements in different application scenarios.

For tasks with high fluency requirements, such as natural language generation, creative writing, etc., LLM is usually adequate because its strong performance in natural language processing enables it to generate fluent text.

Blockchain data faces many problems such as data analysis, data processing, and data application. LLM has superior language understanding and reasoning capabilities, making it an ideal tool for interacting with, organizing, and summarizing blockchain data. However, LLM cannot solve all problems in the blockchain data field.

In terms of data processing, LLM is more suitable for rapid iteration and exploratory processing of on-chain data, constantly trying new processing methods. However, LLM still has limitations for tasks such as detailed reconciliation in production environments. Typical problems include context windows that are too short to handle long content, time-consuming prompts, unstable answers that affect downstream tasks and lead to unstable success rates, and low efficiency when executing large batches of tasks.

Second, hallucination problems are likely to arise when LLM processes content. The probability of hallucination in ChatGPT is estimated at roughly 15% to 20%, and because its processing is opaque, many errors are difficult to detect. Therefore, establishing a framework and incorporating expert knowledge become crucial. In addition, there are still many challenges in combining LLM with on-chain data:

● There are many types and large quantities of data entities on the chain. In what form they should be fed to LLM so that they can be used effectively in specific commercial scenarios still requires, as in other vertical industries, more research and exploration.

● On-chain data includes both structured and unstructured data. Most current data solutions in the industry are based on an understanding of the business data. In parsing on-chain data, ETL is used to filter, clean, supplement, and restore the business logic, further organizing unstructured data into structured data that can support more efficient analysis for various future business scenarios. For example, structured DEX trades, NFT marketplace transactions, and wallet address portfolios have the aforementioned characteristics of high quality, high value, accuracy, and authenticity, and can provide an efficient complement to a general-purpose LLM.

4. Misunderstood LLM

4.1. LLM can handle unstructured data directly, so is structured data no longer needed?

LLM is usually pre-trained based on massive text data and is naturally suitable for processing all kinds of unstructured text data. However, various industries already have a large amount of structured data, especially parsed data in the Web3 field. How to effectively use these data to enhance LLM is a hot research topic in the industry.

For LLM, structured data still has the following advantages:

● Massive: A large amount of data is stored in databases and other standard formats behind various applications, especially private data. Every company and industry still holds a great deal of such "behind-the-wall" data that LLM has never seen during pre-training.

● Existing: This data does not need to be re-produced, and the investment cost is extremely low. The only problem is how to use it.

● High quality and high value: Expert knowledge accumulated over a long period of time in the field is usually stored in structured data and used in industry, academia and research. The quality of structured data is key to data availability, including data completeness, consistency, accuracy, uniqueness and factuality.

● High efficiency: Structured data is stored in tables, databases, or other standardized formats, and the schema is predefined and consistent across the entire data set. This means that the format, type and relationship of data are predictable and controllable, making data analysis and query easier and more reliable. Moreover, the industry already has mature ETL and various data processing and management tools, which are more efficient and convenient to use. LLM can use this data through API.

● Accuracy and factuality: LLM generates text based on token probabilities and currently cannot stably output exact answers; the hallucination problem has always been a core, fundamental problem for LLM to solve. For many industries and scenarios, such as healthcare and finance, this raises security and reliability issues. Structured data is one direction that can assist and correct these problems of LLM.

● Reflect relationship graphs and specific business logic: Different types of structured data can be fed into LLM in specific organizational forms (relational databases, graph databases, etc.) to solve different types of domain problems. Structured data uses standardized query languages such as SQL, making complex queries and analysis more efficient and accurate, and a knowledge graph can better express relationships between entities, making associated queries easier.

● Low cost of use: LLM does not need to retrain the entire base model from scratch every time. Methods such as agents and LLM APIs can be combined with LLM to access it faster and at lower cost.

There are still some fanciful views on the market holding that LLM is so capable of processing text and unstructured information that simply importing raw data, including unstructured data, into LLM will achieve the goal. This idea is similar to asking a general-purpose LLM to solve math problems: without a model specifically built for mathematical ability, most LLMs are likely to make mistakes even on simple elementary school addition and subtraction. On the contrary, establishing a Crypto LLM vertical model, analogous to a mathematical capability model or an image generation model, is a more practical way to apply LLM in the Crypto field.

4.2. LLM can infer content from text such as news and tweets, so is on-chain data analysis no longer needed to draw conclusions?

Although LLM can obtain information from texts such as news and social media, insights obtained directly from on-chain data are still indispensable for the following main reasons:

● On-chain data is original, first-hand information, while news and social media information may be one-sided or misleading. Analyzing on-chain data directly reduces information bias, as well as the risk of misinterpretation that comes with LLM-based text analysis.

● On-chain data contains comprehensive historical interaction and transaction records, and analyzing it can uncover long-term trends and patterns. On-chain data can also show a complete picture of the entire ecosystem, such as capital flows and relationships between parties; these big-picture insights provide a deeper understanding of the situation, whereas news and social media information is often more fragmented and short-term.

● On-chain data is open, so anyone can verify the analysis results, avoiding information asymmetry, whereas news and social media may not always reveal the full truth. Text information and on-chain data can verify each other, and combining the two yields a more three-dimensional and accurate judgment.

On-chain data analysis remains indispensable. LLM plays an auxiliary role in obtaining information from text, but it cannot replace direct analysis of on-chain data. Making full use of the advantages of both achieves the best results.

4.3. Is it easy to build blockchain data solutions based on LLM using LangChain, LlamaIndex or other AI tools?

Tools such as LangChain and LlamaIndex make it possible to quickly build simple custom LLM applications. However, successfully applying these tools in real production environments involves far more challenges. Building an LLM application that runs efficiently and maintains high quality is a complex task that requires a deep understanding of how blockchain technology and AI tools work, and of how to integrate them effectively. This is an important but challenging task for the blockchain data industry.

In this process, we must recognize a characteristic of blockchain data: it demands extremely high accuracy and repeatable verification. Once data has been processed and analyzed through LLM, users have high expectations for its accuracy and trustworthiness, which sits in potential contradiction with the fuzzy tolerance of LLM. Therefore, when building a blockchain data solution, these two requirements must be carefully weighed in order to meet user expectations.
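
One small, illustrative way to weigh these two requirements in code is to wrap every LLM call in a deterministic validation-and-retry loop, so that fuzzy model output is only accepted once it passes hard, repeatable checks; `call_llm` and `validate` are placeholders for whatever client and business rules a team actually uses.

```python
def call_with_validation(prompt: str, call_llm, validate, max_retries: int = 2) -> str:
    """Call an LLM and re-ask when the answer fails a deterministic check.

    `validate(answer) -> bool` encodes the hard, repeatable requirements
    (e.g. the generated SQL parses, an extracted amount matches on-chain
    records); the model's output is never used without passing it.
    """
    last_answer = ""
    for _ in range(max_retries + 1):
        last_answer = call_llm(prompt)
        if validate(last_answer):
            return last_answer
        # Feed the failure back so the model can correct itself on retry.
        prompt += f"\n\nYour previous answer failed validation:\n{last_answer}\nPlease correct it."
    raise ValueError(f"No valid answer after {max_retries + 1} attempts: {last_answer!r}")
```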

Although some basic tools already exist on the market, this field is still evolving rapidly and iterating continuously. By analogy with the development of the Web2 world, from the early PHP programming language to more mature and scalable solutions such as Java, Ruby, Python, JavaScript and Node.js, and on to emerging technologies such as Go and Rust, the tooling went through continuous evolution. AI tools are also changing constantly: emerging GPT frameworks such as AutoGPT and Microsoft AutoGen, and the GPTs and agents recently launched by OpenAI alongside GPT-4 Turbo, show only part of what is possible in the future. This indicates that both the blockchain data industry and AI technology still have a great deal of room to develop and require continuous effort and innovation.

Currently, there are two pitfalls that require special attention when applying LLM:

● Expectations that are too high: Many people think LLM can solve every problem, but in fact LLM has obvious limitations. It requires large amounts of computing resources, is expensive to train, and the training process can be unstable. Have realistic expectations about LLM's capabilities, understanding that it excels in some scenarios, such as natural language processing and text generation, but may fall short in other areas.

● Ignoring business needs: Another trap is forcing LLM technology into use without fully considering business needs. Before applying LLM, it is important to identify the specific business need, evaluate whether LLM is the best technical choice, and carry out proper risk assessment and control. Effective application of LLM requires careful consideration of the actual situation to avoid misuse.

Although LLM has great potential in many fields, developers and researchers need to be cautious when applying LLM and adopt an open attitude of exploration to find more suitable application scenarios and maximize its advantages.

This article is jointly published by Footprint Analytics, Future3 Campus, and HashKey Capital.

Footprint Analytics is a blockchain data solutions provider. With the help of cutting-edge artificial intelligence technology, we provide the first code-free data analysis platform and unified data API in the Crypto field, allowing users to quickly retrieve NFT and GameFi data and track wallet address fund flows across more than 30 public chain ecosystems.

Footprint website: https://www.footprint.network

Twitter:https://twitter.com/Footprint_Data

WeChat public account: Footprint blockchain analysis

Join the community: add assistant WeChat group footprint_analytics

Future3 Campus is a Web3.0 innovation incubation platform jointly launched by Wanxiang Blockchain Laboratory and HashKey Capital. It focuses on three major tracks, Web3.0 mass adoption, DePIN, and AI, with Shanghai, the Guangdong-Hong Kong-Macao Greater Bay Area, and Singapore as its main incubation bases, radiating out to the global Web3.0 ecosystem. At the same time, Future3 Campus will launch an initial seed fund of US$50 million for Web3.0 project incubation, to genuinely serve innovation and entrepreneurship in the Web3.0 field.

HashKey Capital is an asset management institution focused on investing in blockchain technology and digital assets, with assets under management currently exceeding US$1 billion. As one of the largest and most influential blockchain investment institutions in Asia, and one of the earliest institutional investors in Ethereum, HashKey Capital plays a leading role in linking Web2 and Web3, working with entrepreneurs, investors, communities, and regulators to build a sustainable blockchain ecosystem. The company has offices in Hong Kong, Singapore, Japan, the United States, and elsewhere, and has taken the lead in investing in more than 500 companies globally across Layer 1, protocols, Crypto Finance, Web3 infrastructure, applications, NFT, the Metaverse, and other tracks. Representative portfolio projects include Cosmos, CoinList, Aztec, Blockdaemon, dYdX, imToken, Animoca Brands, Falcon X, Space and Time, Mask Network, Polkadot, Moonbeam, and Galxe (formerly Project Galaxy).
