Formatting address information with Python

Background

Recently I have been playing with a fun library, cpca, which produces formatted output for Chinese addresses. The tutorial I saw was this:

location_str = ["徐汇区虹漕路461号58号楼5楼", "泉州市洛江区万安塘西工业区"]
import cpca
df = cpca.transform(location_str)
df

Before actually running the code, I wondered why the input addresses couldn't be random (across many application scenarios, I think the approach to the problem is the same). Along the way, I could also help test whether the library works well. So I started tinkering and found Faker, a library that can generate simulated addresses.

Before using Faker, I also came across a post that implements this without any pip package.

That article generates the random information starting from the most basic methods. For beginners, I recommend trying this approach, since it steadily improves your grasp of Python's syntax. For engineers already familiar with Python, however, the first choice is a package installed with pip. First, it works out of the box, saving time and effort. Second, you can study the package's source code when you have time, which improves your engineering thinking and skills.
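To give a feel for the no-package approach, here is a minimal sketch using only the standard library. It is not the code from that article: the `REGIONS` and `ROADS` tables and the `random_address` helper are made up for illustration, and a real generator would need far more data.

```python
import random

# Hypothetical sample data, for illustration only; a real generator
# would need a much larger and verified list of places.
REGIONS = {
    "湖北省": ["武汉市", "宜昌市"],
    "福建省": ["泉州市", "厦门市"],
}
ROADS = ["人民路", "解放路", "中山路"]


def random_address(rng: random.Random) -> str:
    """Build one fake address as province + city + road + house number."""
    province = rng.choice(list(REGIONS))
    city = rng.choice(REGIONS[province])
    road = rng.choice(ROADS)
    number = rng.randint(1, 999)
    return f"{province}{city}{road}{number}号"


rng = random.Random(42)  # fixed seed so repeated runs give the same list
addresses = [random_address(rng) for _ in range(10)]
for address in addresses:
    print(address)
```

Even this small version shows the trade-off: the logic is simple, but all the data has to be collected and maintained by hand.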

For Faker's API, you can refer to this blog. I will explain the use of Faker in more detail separately.

Install Faker

pip install faker

Randomly generate 10 addresses

from faker import Faker

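# Create a Faker object for the Chinese (mainland) locale
fake = Faker('zh_CN')

# Generate 10 random addresses
random_addresses = []
for _ in range(10):
    address = fake.address()
    # The generated address ends with a postal code; strip it
    random_addresses.append(address.split(' ')[0])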

for address in random_addresses:
    print(address)

The generated addresses are as follows:

[image: list of the generated addresses]

This is clearly much more efficient and practical than hand-writing code to generate random information.
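The loop above drops the postal code with `split(' ')[0]`, which relies on the address itself containing no spaces. A slightly more defensive variant is sketched below. It assumes the postal code is a six-digit suffix separated by whitespace, and the `strip_postcode` helper and the sample strings are made up for illustration.

```python
import re

# Assumption: the postal code is a run of six digits at the end of the
# string, separated from the address by whitespace.
_POSTCODE_SUFFIX = re.compile(r"\s+\d{6}$")


def strip_postcode(address: str) -> str:
    """Remove a trailing six-digit postal code, if one is present."""
    return _POSTCODE_SUFFIX.sub("", address.strip())


# Made-up sample strings, not real Faker output.
samples = ["湖北省武汉市香港路111号 430000", "泉州市洛江区万安塘西工业区"]
cleaned = [strip_postcode(s) for s in samples]
print(cleaned)
```

Strings without a postal code pass through unchanged, so the helper is safe to apply to a mixed list.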

Address parsing with cpca

Install the cpca package

pip install cpca

To test it, and to make the effect more obvious, I appended an address I made up myself:

random_addresses.append('湖北省武汉市香港路111号')
    
import cpca

df = cpca.transform(random_addresses)
print(df)

The final result is as follows:

[image: DataFrame printed by cpca.transform]

You can also output the positions of the matched province, city, and district by passing the following parameter to cpca.transform:

df = cpca.transform(random_addresses, pos_sensitive=True)

The official documentation explains pos_sensitive as follows: if it is True, three extra columns are returned, giving the positions of the extracted province, city, and district in the string. If one of them does not appear in the string, its position is shown as -1.
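The "-1 when absent" convention is the same one Python's `str.find` uses. The toy sketch below illustrates the meaning of the position columns only. It is not how cpca computes them, and the `find_positions` helper and the `省_pos`/`市_pos`/`区_pos` key names are assumptions made for this example.

```python
def find_positions(address: str, parts: dict) -> dict:
    """Return the index of each part in address, or -1 when it is absent."""
    return {
        f"{name}_pos": address.find(value) if value else -1
        for name, value in parts.items()
    }


address = "湖北省武汉市香港路111号"
# Hypothetical extraction result: no district was found for this address.
parts = {"省": "湖北省", "市": "武汉市", "区": None}
positions = find_positions(address, parts)
print(positions)  # {'省_pos': 0, '市_pos': 3, '区_pos': -1}
```

A position of -1 is therefore a quick way to tell which levels were actually present in the input text.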

The results are quite good. In most cases the province, city, and street address are extracted correctly, which is enough for a demo, and if the address information is complete it can also be used in production. However, the city and district were not extracted for some addresses. Because the addresses were fabricated, their accuracy was never checked, which probably explains the gaps.
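To see which addresses were only partly parsed, you can scan the result for empty fields. The sketch below works on plain dictionaries so it runs without cpca. The column names (省, 市, 区, 地址) are assumed to mirror cpca's output, and the rows are hand-written stand-ins rather than real results.

```python
import math

# Assumption: column names mirror cpca's output (省, 市, 区, 地址).
FIELDS = ("省", "市", "区")


def is_missing(value) -> bool:
    """Treat None, empty strings and NaN as 'not extracted'."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def missing_fields(row: dict) -> list:
    """List the administrative levels that were not extracted for a row."""
    return [field for field in FIELDS if is_missing(row.get(field))]


# Hand-written rows standing in for parsed results, not real cpca output.
rows = [
    {"省": "湖北省", "市": "武汉市", "区": None, "地址": "香港路111号"},
    {"省": "福建省", "市": "泉州市", "区": "洛江区", "地址": "万安塘西工业区"},
]
report = {row["地址"]: missing_fields(row) for row in rows}
print(report)
```

With a real result, `df.to_dict("records")` should yield rows in this shape, so the same check can be applied to the DataFrame returned by `cpca.transform`.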

In more complex scenarios, such as extracting city and district information from free-form text, this approach will struggle, and you may need to use NLP. You can refer to the article "Information Extraction of Express Order Based on PaddleNLP - Entity Extraction".

Reference article

  • Use python to extract the province and city information in the Chinese address description

Origin blog.csdn.net/weixin_55768452/article/details/132053987