Build your own PandasAI using LlamaIndex

Pandas AI is a Python library that enhances the popular data analysis library Pandas with the power of generative AI. With a simple prompt, Pandas AI lets you perform complex data cleaning, analysis, and visualization that previously required many lines of code.

In addition to processing numbers, Pandas AI also understands natural language. You can ask questions about your data in plain English, and it will provide summaries and insights in everyday language, saving you from deciphering complex graphs and tables.

In the example below, we provide a Pandas dataframe and ask the generative AI to create a bar chart. The result is impressive.

pandas_ai.run(df, prompt='Plot the bar chart of type of media for each year release, using different colors.')

Note: Code examples are from the Pandas AI: Your Guide to Generative AI-Powered Data Analysis tutorial.

In this post, we will use LlamaIndex to create similar tools that can understand Pandas data frames and produce complex results as shown above.

LlamaIndex supports natural language querying of data via chat and agents. It allows large language models to interpret private data at scale without retraining on new data, and it integrates large language models with various data sources and tools. LlamaIndex is a data framework that makes it easy to create chat-with-PDF applications in just a few lines of code.

Setting up

You can install the Python library using the pip command.

pip install llama-index

By default, LlamaIndex uses the OpenAI gpt-3.5-turbo model for text generation and text-embedding-ada-002 for retrieval and embeddings. To run the code easily, we have to set up the OPENAI_API_KEY. We can register and get an API key for free on the new API Tokens page.

import os
os.environ["OPENAI_API_KEY"] = "sk-xxxxxx"
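
Hard-coding the key in a script makes it easy to leak by accident. As an alternative, here is a minimal sketch that assumes the key was already exported in the shell and only checks that it is present. Note that require_api_key is a hypothetical helper written for this post, not part of LlamaIndex or the OpenAI client.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment or fail with a clear message.

    Hypothetical helper for illustration; not part of LlamaIndex or OpenAI.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the examples.")
    return key
```

Calling require_api_key() at the top of a notebook surfaces a missing key immediately instead of failing later inside the first query.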

LlamaIndex also supports integrations with Anthropic, Hugging Face, PaLM, and more models. You can learn all about them by reading the module's documentation.

Pandas query engine

Let's get into the main topic of creating your own PandasAI. After installing the library and setting up the API key, we will create a simple city dataframe with city name and population as columns.

import pandas as pd
from llama_index.query_engine.pandas_query_engine import PandasQueryEngine
df = pd.DataFrame(
    {"city": ["New York", "Islamabad", "Mumbai"], "population": [8804190, 1009832, 12478447]}
)

Using PandasQueryEngine, we will create a query engine to load and index the dataframe.

After that, we will write a query and display the response.

query_engine = PandasQueryEngine(df=df)

response = query_engine.query(
    "What is the city with the lowest population?",
)

As we can see, it generated Python code to display the least populated city in the dataframe.

> Pandas Instructions:
```
eval("df.loc[df['population'].idxmin()]['city']")
```
eval("df.loc[df['population'].idxmin()]['city']")
> Pandas Output: Islamabad

And if you print the response, you get "Islamabad". It's simple but impressive. You don't have to come up with your own logic or experiment with the code. Just type in the question and you'll get the answer.

print(response)
Islamabad
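
To see why the generated expression returns that answer, it helps to run it by hand in plain pandas, without the query engine or an API key. The sketch below uses the same city dataframe and splits the expression into two steps; the variable names lowest_label and lowest_city are introduced here for illustration only.

```python
import pandas as pd

# Same city dataframe as in the example above.
df = pd.DataFrame(
    {"city": ["New York", "Islamabad", "Mumbai"], "population": [8804190, 1009832, 12478447]}
)

# idxmin() returns the row label of the smallest population value.
lowest_label = df["population"].idxmin()

# df.loc[...] selects that row, and ["city"] picks the city name from it.
lowest_city = df.loc[lowest_label]["city"]
print(lowest_city)  # Islamabad
```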

You can also use response metadata to print the code behind the results.

print(response.metadata["pandas_instruction_str"])
eval("df.loc[df['population'].idxmin()]['city']")
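
Because the instruction is an ordinary Python string, it can be logged, reviewed, or re-run outside the query engine. Here is a minimal sketch, assuming the eval(...) wrapper is stripped so that only the bare pandas expression is evaluated. run_instruction is a hypothetical helper, a simplified stand-in rather than how LlamaIndex executes the code internally.

```python
import pandas as pd


def run_instruction(instruction: str, df: pd.DataFrame):
    """Evaluate one pandas expression with only `df` and `pd` in scope.

    Hypothetical helper; not LlamaIndex's implementation. Removing builtins
    limits what the expression can call, but eval on model-generated text
    should still be treated as unsafe for untrusted input.
    """
    return eval(instruction, {"__builtins__": {}}, {"df": df, "pd": pd})


df = pd.DataFrame(
    {"city": ["New York", "Islamabad", "Mumbai"], "population": [8804190, 1009832, 12478447]}
)

# Bare expression taken from the metadata shown above.
instruction = "df.loc[df['population'].idxmin()]['city']"
print(run_instruction(instruction, df))  # Islamabad
```

Reviewing the instruction before running it is a cheap safeguard, since the model decides what code gets executed against your data.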

Global YouTube Statistics Analysis

In the second example, we will load the 2023 global YouTube statistics dataset from Kaggle and perform some fundamental analysis. This is a step up from a simple example.

We will use read_csv to load the dataset and pass it to the query engine. We will then write a prompt to display only the columns with missing values and the number of missing values.

df_yt = pd.read_csv("Global YouTube Statistics.csv")
query_engine = PandasQueryEngine(df=df_yt, verbose=True)

response = query_engine.query(
    "List the columns with missing values and the number of missing values. Only show missing values columns.",
)
> Pandas Instructions:
```
df.isnull().sum()[df.isnull().sum() > 0]
```
df.isnull().sum()[df.isnull().sum() > 0]
> Pandas Output: category                                    46
Country                                    122
Abbreviation                               122
channel_type                                30
video_views_rank                             1
country_rank                               116
channel_type_rank                           33
video_views_for_the_last_30_days            56
subscribers_for_last_30_days               337
created_year                                 5
created_month                                5
created_date                                 5
Gross tertiary education enrollment (%)    123
Population                                 123
Unemployment rate                          123
Urban_population                           123
Latitude                                   123
Longitude                                  123
dtype: int64
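
The generated expression counts missing values per column and then filters out the columns with a zero count. The sketch below applies the same idea to a hypothetical three-row sample (the rows are made up; only the column names come from the dataset) and computes the counts once instead of twice.

```python
import pandas as pd

# Hypothetical three-row sample; the real dataset has many more rows and columns.
sample = pd.DataFrame(
    {
        "Youtuber": ["A", "B", "C"],
        "category": ["Music", None, "Games"],
        "Country": [None, None, "India"],
    }
)

missing = sample.isnull().sum()      # number of missing values per column
missing_only = missing[missing > 0]  # keep only columns that have any
print(missing_only)
```

Here Youtuber is dropped from the result because it has no missing values, while category and Country are kept with counts of 1 and 2.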

Now, we will ask a direct question about popular channel types. In my opinion, the LlamaIndex query engine is highly accurate and has not produced any hallucinations yet.

response = query_engine.query(
    "Which channel type have the most views.",
)
> Pandas Instructions:
```
eval("df.groupby('channel_type')['video views'].sum().idxmax()")
```
eval("df.groupby('channel_type')['video views'].sum().idxmax()")
> Pandas Output: Entertainment
Entertainment
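
The generated code groups rows by channel type, sums the views in each group, and returns the label of the largest total. A small sketch with hypothetical numbers (only the column names channel_type and video views are taken from the generated code) shows each step separately.

```python
import pandas as pd

# Hypothetical rows; the view counts are made up for illustration.
sample = pd.DataFrame(
    {
        "channel_type": ["Music", "Entertainment", "Music", "Entertainment", "Games"],
        "video views": [100, 250, 120, 90, 300],
    }
)

# Total views per channel type: Entertainment 340, Games 300, Music 220.
views_by_type = sample.groupby("channel_type")["video views"].sum()

# idxmax() returns the group label with the largest total.
top_type = views_by_type.idxmax()
print(top_type)  # Entertainment
```

Keep in mind that groupby drops rows whose key is missing by default, so channels without a channel_type do not contribute to any total.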

Finally, we'll ask it to visualize a bar chart, and the results are stunning.

response = query_engine.query(
    "Visualize barchat of top ten youtube channels based on subscribers and add the title.",
)
> Pandas Instructions:
```
eval("df.nlargest(10, 'subscribers')[['Youtuber', 'subscribers']].plot(kind='bar', x='Youtuber', y='subscribers', title='Top Ten YouTube Channels Based on Subscribers')")
```
eval("df.nlargest(10, 'subscribers')[['Youtuber', 'subscribers']].plot(kind='bar', x='Youtuber', y='subscribers', title='Top Ten YouTube Channels Based on Subscribers')")
> Pandas Output: AxesSubplot(0.125,0.11;0.775x0.77)
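
The AxesSubplot(...) text in the output is just the printed form of the Matplotlib axes object that the plot call returns; in a notebook the chart is rendered alongside it. When running as a script, the figure can be saved through that object. A minimal sketch, assuming Matplotlib is installed and using a hypothetical four-row sample and an arbitrary output file name:

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import pandas as pd

# Hypothetical sample; only the column names match the generated code above.
sample = pd.DataFrame(
    {"Youtuber": ["A", "B", "C", "D"], "subscribers": [50, 200, 120, 80]}
)

# Keep the three largest channels, sorted from most to fewest subscribers.
top = sample.nlargest(3, "subscribers")

# plot() returns the Matplotlib axes that the chart was drawn on.
ax = top.plot(kind="bar", x="Youtuber", y="subscribers", title="Top Channels by Subscribers")

# Save the figure that owns those axes; the path is an arbitrary choice.
out_path = os.path.join(tempfile.gettempdir(), "top_channels.png")
ax.figure.tight_layout()
ax.figure.savefig(out_path)
```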

With a simple prompt and the query engine, we can automate data analysis and perform complex tasks. LlamaIndex has much more to offer. I highly recommend you read the official documentation and try building something amazing.

Conclusion

In summary, LlamaIndex is an exciting new tool that allows developers to create their own PandasAI - leveraging the power of large language models for intuitive data analysis and conversation. By using LlamaIndex to index and embed datasets, you can enable advanced natural language capabilities on private data without compromising security or retraining models.

This is just the beginning. With LlamaIndex, you can build Q&A over documents, chatbots, automated AI, knowledge graphs, AI SQL query engines, full-stack web applications, and private generative AI applications.

Original link: Build your own PandasAI using LlamaIndex (mvrlink.com)

Origin blog.csdn.net/ygtu2018/article/details/132807967