Using Jupyter to extract the contents of an XML dataset and store them in a SQL database

Check whether the page content in wikivoyage.json can be converted for use; the output will serve as pages.

wikivoyage_xml_to_json.py:

       import json
       import xmltodict

       xml_str = open(filename, encoding='utf-8').read()
       o = xmltodict.parse(xml_str)

       json_str = json.dumps(o)       # JSON string
       json_d = json.loads(json_str)  # dict

       # text mode ('w'), since json.dumps() returns a str, not bytes
       with open('data/wikivoyage/wikivoyage.json', 'w') as j:
              j.write(json.dumps(json_d))

       return json_d

1. json.dumps() and json.loads() are JSON-format handling functions (you can think of JSON as a string):
  (1) json.dumps() encodes a Python object (dict, list, etc.) into JSON format; in other words, it turns a dict into a JSON string.
  (2) json.loads() parses a JSON string back into a dict.

2. json.dump() and json.load() are mainly used to read and write JSON files.
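A quick round trip through dumps()/loads(), using a small made-up dict:

```python
import json

d = {'age': 12}
s = json.dumps(d)           # dict -> JSON string
print(s)                    # {"age": 12}
print(json.loads(s) == d)   # loads: JSON string -> dict; True
```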

import json

# json.dump() writes JSON data straight into a file
json_info = {'age': '12'}
with open('1.json', 'w', encoding='utf-8') as file:
    json.dump(json_info, file)
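The counterpart, json.load(), reads a JSON file back into a Python object (a self-contained sketch that rewrites 1.json first):

```python
import json

# json.dump() writes to the file; json.load() reads it back
with open('1.json', 'w', encoding='utf-8') as f:
    json.dump({'age': '12'}, f)

with open('1.json', encoding='utf-8') as f:
    loaded = json.load(f)
print(loaded)   # {'age': '12'}
```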

 

EDA:

Building the article dict from the page list

# articles without ':' in their titles
articles_dict = {}
for article in articles:
    # strip the accents (via clean_word, which uses unicodedata)
    title = article['title']
    title = clean_word(title)

    if ':' not in title:
        articles_dict[title] = article

print(len(articles_dict))
articles_dict[list(articles_dict.keys())[1]]
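clean_word itself is not shown in these notes; a minimal accent-stripping version built on unicodedata (an assumption about what the helper does) could look like:

```python
import unicodedata

def clean_word(word):
    # strip accents: decompose with NFKD, then drop combining marks
    nfkd = unicodedata.normalize('NFKD', word)
    return ''.join(c for c in nfkd if not unicodedata.combining(c))

print(clean_word('São Paulo'))  # Sao Paulo
```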

final_articles = {}
for title in final_titles:
    if '(disambiguation)' not in title:
        final_articles[title] = articles_dict[title]['revision']['text']['#text']

print(len(final_articles))

In this structure, each title maps to one '#text' field.

Extract this part of the JSON, convert it to a dict, and then turn the dict into XML (dicttoxml).

articles = data['mediawiki']['page']

article (dict) -> xml

Python has a module, dicttoxml, that converts a dict into XML:

import dicttoxml

ret_xml = dicttoxml.dicttoxml(some_dict)
print(type(ret_xml))  # <class 'bytes'> in Python 3

You can also import the dicttoxml() function directly from the library:

>>> from dicttoxml import dicttoxml

>>> xml = dicttoxml(some_dict)

You can fetch a JSON object from a URL and convert it into XML:

>>> import json
>>> from urllib.request import urlopen
>>> import dicttoxml
>>> page = urlopen('http://quandyfactory.com/api/example')
>>> content = page.read()
>>> obj = json.loads(content)
>>> print(obj)
{'mylist': ['foo', 'bar', 'baz'], 'mydict': {'foo': 'bar', 'baz': 1}, 'ok': True}

>>> xml = dicttoxml.dicttoxml(obj)

>>> print(xml)

<?xml version="1.0" encoding="UTF-8" ?><root><mylist><item type="str">foo</item><item type="str">bar</item><item type="str">baz</item></mylist><mydict><foo type="str">bar</foo><baz type="int">1</baz></mydict><ok type="bool">true</ok></root>

Structure of a revision (as rendered by dicttoxml):

    <item type="dict">

        <comment type="str">otherwise a number just looks weird.. (Import from wikitravel.org/en)</comment>

        <sha1 type="str">jjresq501njc7hkah05ux8kk4t1y605</sha1>

        <format type="str">text/x-wiki</format>

        <timestamp type="str">2009-03-02T02:54:37Z</timestamp>

        <text type="dict">

            <key type="str" name="@xml:space">preserve</key>

            <key type="str" name="#text">#REDIRECT [[Town of 1770]]</key>

        </text>

        <contributor type="dict">

            <username type="str">Inas</username>

            <id type="str">1816</id>

        </contributor>

        <model type="str">wikitext</model>

        <id type="str">1</id>

    </item>

Attributes such as text and id already exist in the page.sql dataset; what we need is the content inside revision (pulling the id out of the next-level dict).

The full page set contains too much junk, so try storing cleaned data instead.

Iterate over the titles of the cleaned final_articles and pull the important content and the id out of articles_dict.

Attempt:

pip install dicttoxml

Create page_content.xml in the corresponding location.

import dicttoxml

# save before cleaning
page_content = []
for article in articles:
    page_content.append(article['revision'])

page_content_xml = dicttoxml.dicttoxml(page_content)
with open('../data/page_content.xml', 'wb') as x:
    x.write(page_content_xml)

final_articles = {}
page_contents = []
for title in final_titles:
    if '(disambiguation)' not in title:
        revision = articles_dict[title]['revision']
        final_articles[title] = revision['text']['#text']

        page_content = {}
        page_content['id'] = revision['id']
        page_content['title'] = title
        if 'comment' in revision:
            page_content['comment'] = revision['comment']
        else:
            # revisions without a comment all carry a parentid
            page_content['comment'] = 'parent id :' + revision['parentid']
        page_content['timestamp'] = revision['timestamp']
        page_content['text'] = revision['text']['#text']
        if 'id' in revision['contributor']:
            page_content['contributor_id'] = revision['contributor']['id']
            page_content['contributor_name'] = revision['contributor']['username']
        elif 'ip' in revision['contributor']:
            page_content['contributor_ip'] = revision['contributor']['ip']
        page_contents.append(page_content)

print(len(final_articles))

page_content_xml = dicttoxml.dicttoxml(page_contents)
with open('../data/page_content.xml', 'wb') as x:
    x.write(page_content_xml)

Errors encountered:

KeyError: 'Szczecin', KeyError: 'id'

A KeyError is raised when the key you look up does not exist.

To build a nested dict, the inner level must first be initialized as a dict.

Also, some revision['contributor']['id'] entries have a different structure: extra nesting, or no id at all.

Check whether 'comment' is among the keys; if it is nested, extract the nested levels; if it is missing, inspect the overall structure.

Checking whether a key is in a dict:

Besides in you can also use not in to test that a key is absent; in is faster than the old Python 2 has_key() (which was removed in Python 3).

# build a dict
d = {'name': 'Tom', 'age': 10, 'Tel': 110}

# d.keys() lists all the keys in the dict
print('name' in d.keys())
print('name' in d)

# both print True

Structure of a revision without a comment:

{'sha1': 'qw343ibxpjzzqzk6rhkqvb8yets1gvv',

'format': 'text/x-wiki',

'timestamp': '2019-03-04T08:06:17Z',

'parentid': '3497383',

'text': {'@xml:space': 'preserve', '#text': "{ {***},

'contributor': {'username': 'Ground Zero', 'id': '1423298'}, 'model': 'wikitext', 'id': '3737196', 'minor': None}

Revisions without a comment all have a parentid.

Contributor structures without an id:

{'ip': '2A02:810D:9040:51DD:B954:392F:EFAC:BFFD'}

{'ip': '84.248.202.190'}

Most of them have an ip; those without an ip look like:

{'@deleted': 'deleted'}
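Putting the observed shapes together, a small defensive helper (hypothetical, named contributor_fields here) can normalize the three contributor variants:

```python
def contributor_fields(contributor):
    # handle the three contributor shapes seen in the dump:
    # {'username': ..., 'id': ...}, {'ip': ...}, {'@deleted': 'deleted'}
    fields = {}
    if 'id' in contributor:
        fields['contributor_id'] = contributor['id']
        fields['contributor_name'] = contributor['username']
    elif 'ip' in contributor:
        fields['contributor_ip'] = contributor['ip']
    else:
        # .get() never raises KeyError
        fields['contributor_deleted'] = contributor.get('@deleted')
    return fields

print(contributor_fields({'ip': '84.248.202.190'}))
# {'contributor_ip': '84.248.202.190'}
```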

 

Use the dict built-in get(key[, default]) method: if key exists it returns the value, otherwise it returns default. This method never raises a KeyError. For example:

t = {

    'a': '1',

    'b': '2',

    'c': '3',

}

print(t.get('d', 'not exist'))

print(t)

Concatenating parentid uses Python string concatenation.

When joining with the plus sign (+), every variable or element being joined must be a string:

print('The number is: ' + str(number))

When opening a file, 'r' is read-only, 'w' overwrites, and 'a' appends.
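A quick check of the 'w' vs 'a' difference, writing a hypothetical scratch file:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'mode_demo.txt')
with open(path, 'w') as f:
    f.write('first\n')
with open(path, 'w') as f:    # 'w' overwrites the previous content
    f.write('second\n')
with open(path, 'a') as f:    # 'a' appends to the end
    f.write('third\n')
print(open(path).read())      # second\nthird\n
```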

A list of nested dicts represents multiple records:

desc = 'Memo notebook'.center(30, '-')

print(desc)

welcome = 'welcome'

print(f'{welcome} author:', __author__)

# add memo entries
"""one = {'time': '8点',
          'thing': '起床 (get up)'}
"""

all_memo = []

is_add = True

while is_add:

    one = {}

    info = input('Enter a memo: ')

    # the entry is split on the Chinese character '点' (o'clock)
    one['time'] = info[info.find('点') - 1:info.find('点') + 1]

    one['event'] = info[info.find('点') + 1:]

    all_memo.append(one)

    print(f'Memos: {all_memo}')

    num = 0

    for i in all_memo:

        num += 1

        print('Item %s: %s' % (num, i))

    print(f'{len(all_memo)} to-do items in total', end='')

    is_add = input(' Continue? Y/N: ') == 'Y'

 

Importing XML data into MySQL:

https://www.csdn.net/gather_28/MtTaMgysNTg5MS1ibG9n.html
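As a self-contained sketch of this final step, a page_content-style XML file can be parsed and inserted row by row. The XML snippet below is a made-up miniature of dicttoxml's output, and sqlite3 stands in for MySQL so the example runs anywhere; for MySQL, the same INSERT logic would go through a driver such as pymysql.

```python
import sqlite3
import xml.etree.ElementTree as ET

# miniature stand-in for page_content.xml (dicttoxml-style output)
xml = (b'<?xml version="1.0" encoding="UTF-8" ?><root>'
       b'<item type="dict"><id type="str">1</id>'
       b'<title type="str">Agnes Water</title>'
       b'<text type="str">#REDIRECT [[Town of 1770]]</text></item></root>')

conn = sqlite3.connect(':memory:')  # in-memory DB for the demo
conn.execute('CREATE TABLE page_content (id TEXT, title TEXT, text TEXT)')

root = ET.fromstring(xml)
for item in root.findall('item'):
    row = (item.findtext('id'), item.findtext('title'), item.findtext('text'))
    conn.execute('INSERT INTO page_content VALUES (?, ?, ?)', row)
conn.commit()

print(conn.execute('SELECT title FROM page_content').fetchall())
# [('Agnes Water',)]
```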

Wikipedia

Download method:

https://wenku.baidu.com/view/f647b44ae45c3b3567ec8b2d.html

Processing methods:

https://blog.csdn.net/weixin_34001430/article/details/94267243

https://blog.csdn.net/wangyangzhizhou/article/details/78348949

https://blog.csdn.net/jdbc/article/details/59483767

These basically use only the content of the text field to build a corpus, without paying attention to the other tags.

Wikivoyage has a fairly low barrier to entry for contributors, but the Chinese edition has very few participants?

Still looking for a dataset to demonstrate results, but everything loads painfully slowly; a faster proxy would help a lot.

 


Reposted from blog.csdn.net/lagoon_lala/article/details/104210239