From Data Engineering to Prompt Engineering

Data engineering forms a large part of the data science process. In CRISP-DM, this stage of the process is called "data preparation". It includes tasks such as data ingestion, data transformation, and data quality assurance. In this article [1], we solve typical data engineering tasks using ChatGPT and Python. In doing so, we explore the connection between data engineering and the new discipline of prompt engineering.

Introduction

In May 2022, Stephen Wolfram and Lex Fridman published an insightful conversation titled "Is Programming Dead?", in which they discussed whether developers will still use high-level languages in the future. According to Wolfram, many programming tasks can be automated with large language models (LLMs). At the time of writing, the most prominent example of such a model is ChatGPT. Since its launch at the end of 2022, it has produced impressive results. Specifying the actions to be performed by an LLM is called "prompt engineering". If Wolfram is right, at least part of software development will shift from writing code to writing prompts.

When it comes to data science, data preparation can be a time-consuming and tedious task. So why not try to automate it with an LLM? In the following sections, we use ChatGPT and Python to solve different data engineering problems. Instead of writing the Python code ourselves, we use prompt engineering to generate it. Our experiments were conducted on May 19, 2023, with the latest free version of ChatGPT (GPT-3.5) available at that time.

Create data frame

Let's start with a simple problem and create a Pandas data frame from a sample dataset. As an example, Table 1 contains country indicators provided by the World Bank.

[Table 1: sample country indicators provided by the World Bank]

To create a dataframe from the example above, we started a new conversation with ChatGPT and posted the following prompt:

Create a Pandas table with the following data:
Continent, Country, Year, Category, Variable, Value
Asia, China, 2022, Economy, GDP, 18.1
Asia, China, 2022, Demographics, Population, 1412
North America, USA, 2022, Economy, GDP, 25.5
North America, USA, 2022, Demographics, Population, 333
Europe, Germany, 2022, Economy, GDP, 3.9
Europe, Germany, 2022, Demographics, Population, 84

In response, the model generated the following Python code:

import pandas as pd

data = {
    'Continent': ['Asia', 'Asia', 'North America', 'North America', 'Europe', 'Europe'],
    'Country': ['China', 'China', 'USA', 'USA', 'Germany', 'Germany'],
    'Year': [2022, 2022, 2022, 2022, 2022, 2022],
    'Category': ['Economy', 'Demographics', 'Economy', 'Demographics', 'Economy', 'Demographics'],
    'Variable': ['GDP', 'Population', 'GDP', 'Population', 'GDP', 'Population'],
    'Value': [18.1, 1412, 25.5, 333, 3.9, 84]
}

df = pd.DataFrame(data)
print(df)

When we run the script, we see that the output matches our specifications:

[Image: script output]

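Although we use a sample dataset in our experiments, the code generated in the following sections can, in principle, also handle larger datasets.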
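To illustrate how the same approach might scale, here is a minimal sketch that parses the comma-separated text from the prompt with `pd.read_csv` instead of hard-coding a dictionary. The `io.StringIO` buffer is our own assumption and merely stands in for a real file; with a larger dataset one would presumably pass a file path instead.

```python
import io

import pandas as pd

# Sample data copied from the prompt; StringIO stands in for a real CSV file (assumption).
csv_text = """Continent, Country, Year, Category, Variable, Value
Asia, China, 2022, Economy, GDP, 18.1
Asia, China, 2022, Demographics, Population, 1412
North America, USA, 2022, Economy, GDP, 25.5
North America, USA, 2022, Demographics, Population, 333
Europe, Germany, 2022, Economy, GDP, 3.9
Europe, Germany, 2022, Demographics, Population, 84
"""

# skipinitialspace drops the blank that follows each comma in the prompt text.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df)
```

Because the data is read rather than typed in, the rest of the script does not need to change when more rows are added.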

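Pivot data frame

For data science, the table structure we created with the first prompt is not optimal. A so-called "flat table" would be more convenient. In Table 1, the indicators "GDP" and "Population" are represented by two columns named "Variable" and "Value". In contrast, a flat table contains a separate column for each indicator, as shown in Table 2.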

[Table 2: flat table with separate columns for GDP and Population]

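Converting the format of Table 1 into the format of Table 2 is called "pivoting". Let's try to solve this task with ChatGPT. To do so, we posted a second prompt in our conversation: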

Transform the table into the following format:
Continent, Country, Year, GDP, Population
Asia, China, 2022, 18.1, 1412
North America, USA, 2022, 25.5, 333
Europe, Germany, 2022, 3.9, 84

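The message above is based on an example of the desired data format. Alternatively, we could have expressed our requirement purely in words. As in classical programming, prompt engineering offers alternative ways of reaching a goal. Finding the best one will be an exciting task in the near future. Based on our prompt, ChatGPT adjusted the Python code as follows: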

import pandas as pd

data = {
    'Continent': ['Asia', 'Asia', 'North America', 'North America', 'Europe', 'Europe'],
    'Country': ['China', 'China', 'USA', 'USA', 'Germany', 'Germany'],
    'Year': [2022, 2022, 2022, 2022, 2022, 2022],
    'Category': ['Economy', 'Demographics', 'Economy', 'Demographics', 'Economy', 'Demographics'],
    'Variable': ['GDP', 'Population', 'GDP', 'Population', 'GDP', 'Population'],
    'Value': [18.1, 1412, 25.5, 333, 3.9, 84]
}

df = pd.DataFrame(data)

# Pivot the table
pivot_df = df.pivot_table(index=['Continent', 'Country', 'Year'], columns='Category', values='Value').reset_index()

# Rename the columns
pivot_df.columns.name = None
pivot_df.columns = [column if column != 'Variable' else None for column in pivot_df.columns]

print(pivot_df)

However, running the second version of the script shows that the output is not what we asked for:

[Image: script output]

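The table contains two columns named "Demographics" and "Economy" instead of the expected columns "GDP" and "Population". To fix this, we posted the next prompt: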

The columns are not correct, you should have used 
the variables GDP and population

In response, the model corrected the error and generated another version of the Python code:

import pandas as pd

data = {
    'Continent': ['Asia', 'Asia', 'North America', 'North America', 'Europe', 'Europe'],
    'Country': ['China', 'China', 'USA', 'USA', 'Germany', 'Germany'],
    'Year': [2022, 2022, 2022, 2022, 2022, 2022],
    'Category': ['Economy', 'Demographics', 'Economy', 'Demographics', 'Economy', 'Demographics'],
    'Variable': ['GDP', 'Population', 'GDP', 'Population', 'GDP', 'Population'],
    'Value': [18.1, 1412, 25.5, 333, 3.9, 84]
}

df = pd.DataFrame(data)

# Pivot the table
pivot_df = df.pivot_table(index=['Continent', 'Country', 'Year'], columns='Variable', values='Value').reset_index()

# Rename the columns
pivot_df.columns.name = None
pivot_df.columns = ['Continent', 'Country', 'Year', 'GDP', 'Population']

print(pivot_df)

Running the script shows that the pivoting is now implemented correctly:

[Image: script output]
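As a point of comparison, the pivot step can also be sketched with `DataFrame.pivot`, which reshapes without aggregating, and reversed with `DataFrame.melt`. This is our own illustration, not code produced by ChatGPT; the names `keys`, `flat`, and `long_again` are chosen here for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Continent": ["Asia", "Asia", "North America", "North America", "Europe", "Europe"],
    "Country": ["China", "China", "USA", "USA", "Germany", "Germany"],
    "Year": [2022] * 6,
    "Category": ["Economy", "Demographics"] * 3,
    "Variable": ["GDP", "Population"] * 3,
    "Value": [18.1, 1412, 25.5, 333, 3.9, 84],
})

keys = ["Continent", "Country", "Year"]

# pivot() reshapes without aggregating and raises a ValueError on duplicate
# key/variable pairs, whereas pivot_table() would average them by default.
flat = (
    df.pivot(index=keys, columns="Variable", values="Value")
    .reset_index()
    .rename_axis(columns=None)
)
print(flat)

# melt() is the inverse step: back from one column per indicator to long format.
long_again = flat.melt(id_vars=keys, var_name="Variable", value_name="Value")
print(long_again)
```

Whether `pivot` or `pivot_table` is the better fit depends on the data: if duplicate rows are possible and averaging them is acceptable, the generated `pivot_table` call is the more forgiving choice.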

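Enrich data frame

Another typical data engineering task is enriching a dataset with additional information. To this end, we asked ChatGPT to add ISO country codes to our data by posting the following prompt: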

For each country, in the table above, add its ISO code

As a result, the model adjusted its response and returned the following code snippet:

import pandas as pd

data = {
    'Continent': ['Asia', 'Asia', 'North America', 'North America', 'Europe', 'Europe'],
    'Country': ['China', 'China', 'USA', 'USA', 'Germany', 'Germany'],
    'Year': [2022, 2022, 2022, 2022, 2022, 2022],
    'Category': ['Economy', 'Demographics', 'Economy', 'Demographics', 'Economy', 'Demographics'],
    'Variable': ['GDP', 'Population', 'GDP', 'Population', 'GDP', 'Population'],
    'Value': [18.1, 1412, 25.5, 333, 3.9, 84]
}

iso_codes = {
    'China': 'CN',
    'USA': 'US',
    'Germany': 'DE'
}

df = pd.DataFrame(data)

# Add ISO code column
df['ISO Code'] = df['Country'].map(iso_codes)

# Pivot the table
pivot_df = df.pivot_table(index=['Continent', 'Country', 'ISO Code', 'Year'], columns='Variable', values='Value').reset_index()

# Rename the columns
pivot_df.columns.name = None
pivot_df.columns = ['Continent', 'Country', 'ISO Code', 'Year', 'GDP', 'Population']

print(pivot_df)

Running the script shows that a new column with ISO codes has been added to the data frame.

[Image: script output]
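One thing to keep in mind with the dictionary approach is that `Series.map` yields a missing value for any country that is not in the dictionary. The following sketch shows a possible alternative that joins against a lookup table and lists unmatched countries. The `iso_lookup` table is hypothetical and only mirrors the three codes from the generated script.

```python
import pandas as pd

# Flat table as produced by the pivot step.
flat = pd.DataFrame({
    "Continent": ["Asia", "Europe", "North America"],
    "Country": ["China", "Germany", "USA"],
    "Year": [2022, 2022, 2022],
    "GDP": [18.1, 3.9, 25.5],
    "Population": [1412.0, 84.0, 333.0],
})

# Hypothetical lookup table; in practice it might be loaded from a reference file.
iso_lookup = pd.DataFrame({
    "Country": ["China", "USA", "Germany"],
    "ISO Code": ["CN", "US", "DE"],
})

# A left join keeps every country; validate guards against duplicate lookup rows.
enriched = flat.merge(iso_lookup, on="Country", how="left", validate="many_to_one")

# Countries without a match end up with a missing ISO code and can be listed.
unmatched = enriched.loc[enriched["ISO Code"].isna(), "Country"].tolist()
print(enriched)
print("Countries without ISO code:", unmatched)
```

For three countries the dictionary is perfectly adequate; the join mainly pays off once the lookup data comes from an external source.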

Derive column

Next, let's derive a new column "GDP per capita" by dividing a country's GDP by its population size. To do so, we sent the following prompt:

Derive a new column "GDP per capita" by dividing the GDP 
of a country by its population size

In response, the model modified the code as follows:

import pandas as pd

data = {
    'Continent': ['Asia', 'Asia', 'North America', 'North America', 'Europe', 'Europe'],
    'Country': ['China', 'China', 'USA', 'USA', 'Germany', 'Germany'],
    'Year': [2022, 2022, 2022, 2022, 2022, 2022],
    'Category': ['Economy', 'Demographics', 'Economy', 'Demographics', 'Economy', 'Demographics'],
    'Variable': ['GDP', 'Population', 'GDP', 'Population', 'GDP', 'Population'],
    'Value': [18.1, 1412, 25.5, 333, 3.9, 84]
}

iso_codes = {
    'China': 'CN',
    'USA': 'US',
    'Germany': 'DE'
}

df = pd.DataFrame(data)

# Add ISO code column
df['ISO Code'] = df['Country'].map(iso_codes)

# Pivot the table
pivot_df = df.pivot_table(index=['Continent', 'Country', 'ISO Code', 'Year'], columns='Variable', values='Value').reset_index()

# Calculate GDP per capita
pivot_df['GDP per capita'] = pivot_df['GDP'] / pivot_df['Population']

# Rename the columns
pivot_df.columns.name = None
pivot_df.columns = ['Continent', 'Country', 'ISO Code', 'Year', 'GDP', 'Population', 'GDP per capita']

print(pivot_df)

The output shows that the script produces the expected result:

[Image: script output]
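The plain ratio is hard to read without units: for China it is 18.1 / 1412, or roughly 0.0128. The sketch below converts the result to dollars per person. It rests on an assumption the article does not state, namely that GDP is given in trillions of US dollars and population in millions of people.

```python
import pandas as pd

flat = pd.DataFrame({
    "Country": ["China", "Germany", "USA"],
    "GDP": [18.1, 3.9, 25.5],         # assumed unit: trillions of US dollars
    "Population": [1412, 84, 333],    # assumed unit: millions of people
})

# Convert both columns to base units before dividing (unit factors are assumptions).
gdp_usd = flat["GDP"] * 1e12
people = flat["Population"] * 1e6

flat["GDP per capita (USD)"] = (gdp_usd / people).round(0)
print(flat)
```

If the units in the source data differ, only the two conversion factors need to change.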

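Analyze data

Once the data transformation is complete, the actual analysis can begin. Finally, let's calculate the correlation coefficient between the GDP and the population of the countries. To do so, we posted one last prompt: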

Use the table above to calculate the correlation coefficient between 
the GDP and the population of the countries

As a result, the model returned the following code snippet:

import pandas as pd

data = {
    'Continent': ['Asia', 'Asia', 'North America', 'North America', 'Europe', 'Europe'],
    'Country': ['China', 'China', 'USA', 'USA', 'Germany', 'Germany'],
    'Year': [2022, 2022, 2022, 2022, 2022, 2022],
    'Category': ['Economy', 'Demographics', 'Economy', 'Demographics', 'Economy', 'Demographics'],
    'Variable': ['GDP', 'Population', 'GDP', 'Population', 'GDP', 'Population'],
    'Value': [18.1, 1412, 25.5, 333, 3.9, 84]
}

df = pd.DataFrame(data)

# Pivot the table
pivot_df = df.pivot_table(index=['Continent', 'Country', 'Year'], columns='Variable', values='Value').reset_index()

# Calculate correlation coefficient
correlation_coefficient = pivot_df['GDP'].corr(pivot_df['Population'])

print("Correlation Coefficient: {:.3f}".format(correlation_coefficient))

When running the script, we find a positive correlation between the two variables, which is not surprising:

[Image: script output]
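To make the calculation behind `Series.corr` more tangible, here is a sketch that computes the Pearson coefficient by hand and compares it with the pandas result. The variable names are ours. For the three sample countries the value comes out at roughly 0.35; with so few data points it is best read as an illustration rather than as evidence.

```python
import math

import pandas as pd

gdp = pd.Series([18.1, 3.9, 25.5], name="GDP")
population = pd.Series([1412, 84, 333], name="Population")

# Pearson correlation: sum of the products of the deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
gdp_dev = gdp - gdp.mean()
pop_dev = population - population.mean()
r_manual = (gdp_dev * pop_dev).sum() / math.sqrt(
    (gdp_dev ** 2).sum() * (pop_dev ** 2).sum()
)

# Series.corr uses the Pearson method by default, so both values should agree.
r_pandas = gdp.corr(population)

print("Manual: {:.3f}, pandas: {:.3f}".format(r_manual, r_pandas))
```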

Conclusion

Admittedly, our experiments are based on a simple dataset. Nevertheless, the results are remarkable. We performed several data engineering tasks without writing a single line of code. ChatGPT not only executed our prompts correctly most of the time; even when the model made a mistake, it was able to reflect on it and fix it. As in software development, the resulting code must be tested. It may also require refactoring and optimization. Using pylint is still a good idea in the age of AI. In conclusion, however, we have to agree with Wolfram: in the future, a significant part of data engineering will shift from coding to prompt engineering. This new approach will not replace data engineers, but it will make them more effective.
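A test for generated code does not have to be elaborate. The following sketch wraps the pivot logic in a hypothetical function `pivot_indicators` (the name and the wrapper are our assumption, not part of the ChatGPT output) and checks it against a two-row sample.

```python
import pandas as pd


def pivot_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical wrapper around the generated pivot logic."""
    flat = df.pivot_table(
        index=["Continent", "Country", "Year"], columns="Variable", values="Value"
    ).reset_index()
    flat.columns.name = None
    return flat


def test_pivot_indicators() -> None:
    sample = pd.DataFrame({
        "Continent": ["Asia", "Asia"],
        "Country": ["China", "China"],
        "Year": [2022, 2022],
        "Category": ["Economy", "Demographics"],
        "Variable": ["GDP", "Population"],
        "Value": [18.1, 1412],
    })
    result = pivot_indicators(sample)
    # The flat table should have one row per country and one column per indicator.
    assert list(result.columns) == ["Continent", "Country", "Year", "GDP", "Population"]
    assert len(result) == 1
    assert result.loc[0, "GDP"] == 18.1
    assert result.loc[0, "Population"] == 1412


test_pivot_indicators()
print("pivot test passed")
```

A check like this would have caught the wrong column names in the second version of the script before anyone looked at the printed output.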

Reference

[1] Source: https://towardsdatascience.com/from-data-engineering-to-prompt-engineering-5debd1c636e0

This article is published by mdnice multi-platform

Origin blog.csdn.net/swindler_ice/article/details/131197891