Python Data Analysis II - Working with data sets and common file formats

1. Reading local data in common formats

Common file formats in data analysis include:

csv, json, xls, xlsx, txt, xml, etc.

TXT

A txt file is made up of lines of text (strings).

Open a file

txt_filename = './files/python_baidu.txt'
file_obj = open(txt_filename, 'r', encoding='utf-8')

Read the entire contents of the file

all_content = file_obj.read()

Close the file

file_obj.close()
print(all_content)

Read line by line

txt_filename = './files/python_baidu.txt'

# open the file
file_obj = open(txt_filename, 'r', encoding='utf-8')

# read line by line
line1 = file_obj.readline()
print(line1)

# read the next line
line2 = file_obj.readline()
print(line2)

# close the file
file_obj.close()

Read the entire contents as a list of lines

txt_filename = './files/python_baidu.txt'

# open the file
file_obj = open(txt_filename, 'r', encoding='utf-8')

lines = file_obj.readlines()

for i, line in enumerate(lines):
    print('{}: {}'.format(i, line))

# close the file
file_obj.close()
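Note that readlines() loads the entire file into memory at once. For large files it is more idiomatic to iterate over the file object itself, which yields one line at a time. A minimal, self-contained sketch (the file name here is illustrative and created on the fly):

```python
import os
import tempfile

# create a small sample file so this sketch is self-contained
tmp_path = os.path.join(tempfile.gettempdir(), 'readline_demo.txt')
with open(tmp_path, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\nthird line\n')

# iterating the file object yields one line at a time,
# without loading the whole file into memory like readlines() does
with open(tmp_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print('{}: {}'.format(i, line.rstrip('\n')))
```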

Write

txt_filename = './files/test_write.txt'

# open the file
file_obj = open(txt_filename, 'w', encoding='utf-8')

# write all content at once
file_obj.write("《Python数据分析》")
file_obj.close()

txt_filename = './files/jxtest_write.txt'

# open the file
file_obj = open(txt_filename, 'w', encoding='utf-8')

# write a list of strings
lines = ['这是第%i行\n' % n for n in range(100)]
file_obj.writelines(lines)
file_obj.close()

The with statement

txt_filename = './files/test_write.txt'
with open(txt_filename, 'r', encoding='utf-8') as f_obj:
    print(f_obj.read())
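The same pattern covers writing: the file is flushed and closed automatically when the block exits, even if an exception occurs. A minimal sketch, using an illustrative temporary file name:

```python
import os
import tempfile

out_path = os.path.join(tempfile.gettempdir(), 'with_write_demo.txt')

# the file is flushed and closed automatically when the block exits
with open(out_path, 'w', encoding='utf-8') as f_obj:
    f_obj.write('《Python数据分析》\n')

# read it back to confirm the write
with open(out_path, 'r', encoding='utf-8') as f_obj:
    content = f_obj.read()
print(content)
```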

CSV

CSV is a common, relatively simple file format that is widely used by individual users, in business, and in science. Its most common use is transferring tabular data between programs that natively operate on incompatible (often proprietary and/or non-standard) formats. For this reason, a large number of programs support some variant of CSV, at least as an alternative input/output format.

"CSV" not a single, well-defined format (RFC 4180 although there is a definition is commonly used). Thus, in practice, the term "CSV" refers to any document with the following characteristics:

  1. Plain text, using a character set such as ASCII, Unicode, EBCDIC, or GB2312;

  2. Consists of records (typically, one record per line);

  3. Each record is divided into fields by a delimiter (typically a comma, semicolon, or tab; the delimiter is sometimes followed by optional spaces);

  4. Each record has the same sequence of fields.
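These characteristics are handled directly by the standard library's csv module (delimiters, quoting, one record per row). A minimal sketch using an in-memory buffer instead of a real file:

```python
import csv
import io

# write two records with the default comma delimiter
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['country', 'gender_ratio'])
writer.writerow(['China', '1.06'])

# read them back: each record becomes a list of field strings
buf.seek(0)
rows = list(csv.reader(buf))
print(rows)
```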

    Reading CSV with pandas

    import pandas as pd
    
    filename = './files/gender_country.csv'            # path to the csv file
    df = pd.read_csv(filename, encoding='utf-16')
    
    print(type(df))
    print(df.head())        # head() returns the first five rows
    
    country_se = df[u'国家']
    print(type(country_se))
    print(country_se.head())
    

    Writing CSV with pandas

    filename = './files/pandas_output.csv'
    df.to_csv(filename, index=None, encoding='utf-8')
    

JSON(JavaScript Object Notation)

A lightweight data-interchange format.

Syntax rules:

1. Data is stored as key-value pairs

2. Items are separated by commas

3. {} holds objects, e.g. {key1: value1, key2: value2}

4. [] holds arrays, e.g. [val1, val2, val3, ...]
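These rules can be verified directly with the standard json module: JSON objects map to Python dicts and JSON arrays to lists. A minimal sketch:

```python
import json

text = '{"key1": "value1", "key2": [1, 2, 3]}'
obj = json.loads(text)          # object -> dict, array -> list
print(type(obj), obj['key2'])

round_trip = json.dumps(obj)    # serialize back to a JSON string
print(round_trip)
```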

Reading JSON

import json

filename = './files/global_temperature.json'
with open(filename, 'r') as f_obj:
    json_data = json.load(f_obj)

# the return value is a dict
print(json_data)

print(json_data.keys())     # list the top-level keys of the json data

Converting JSON to CSV

The raw JSON data is as follows:

{'description': {'title': 'Global Land and Ocean Temperature Anomalies, January-December', 'units': 'Degrees Celsius', 'base_period': '1901-2000'}, 'data': {'1880': '-0.1247', '1881': '-0.0707', '1882': '-0.0710', '1883': '-0.1481', '1884': '-0.2099', '1885': '-0.2220', '1886': '-0.2101', '1887': '-0.2559', '1888': '-0.1541', '1889': '-0.1032', '1890': '-0.3233', '1891': '-0.2552', '1892': '-0.3079', '1893': '-0.3221', '1894': '-0.2828', '1895': '-0.2279', '1896': '-0.0971', '1897': '-0.1232', '1898': '-0.2578', '1899': '-0.1172', '1900': '-0.0704', '1901': '-0.1471', '1902': '-0.2535', '1903': '-0.3442', '1904': '-0.4240', '1905': '-0.2967', '1906': '-0.2208', '1907': '-0.3767', '1908': '-0.4441', '1909': '-0.4332', '1910': '-0.3862', '1911': '-0.4367', '1912': '-0.3318', '1913': '-0.3205', '1914': '-0.1444', '1915': '-0.0747', '1916': '-0.2979', '1917': '-0.3193', '1918': '-0.2118', '1919': '-0.2082', '1920': '-0.2152', '1921': '-0.1517', '1922': '-0.2318', '1923': '-0.2161', '1924': '-0.2510', '1925': '-0.1464', '1926': '-0.0618', '1927': '-0.1506', '1928': '-0.1749', '1929': '-0.2982', '1930': '-0.1016', '1931': '-0.0714', '1932': '-0.1214', '1933': '-0.2481', '1934': '-0.1075', '1935': '-0.1445', '1936': '-0.1173', '1937': '-0.0204', '1938': '-0.0318', '1939': '-0.0157', '1940': '0.0927', '1941': '0.1974', '1942': '0.1549', '1943': '0.1598', '1944': '0.2948', '1945': '0.1754', '1946': '-0.0013', '1947': '-0.0455', '1948': '-0.0471', '1949': '-0.0550', '1950': '-0.1579', '1951': '-0.0095', '1952': '0.0288', '1953': '0.0997', '1954': '-0.1118', '1955': '-0.1305', '1956': '-0.1945', '1957': '0.0538', '1958': '0.1145', '1959': '0.0640', '1960': '0.0252', '1961': '0.0818', '1962': '0.0924', '1963': '0.1100', '1964': '-0.1461', '1965': '-0.0752', '1966': '-0.0204', '1967': '-0.0112', '1968': '-0.0282', '1969': '0.0937', '1970': '0.0383', '1971': '-0.0775', '1972': '0.0280', '1973': '0.1654', '1974': '-0.0698', '1975': '0.0060', '1976': '-0.0769', '1977': '0.1996', 
'1978': '0.1139', '1979': '0.2288', '1980': '0.2651', '1981': '0.3024', '1982': '0.1836', '1983': '0.3429', '1984': '0.1510', '1985': '0.1357', '1986': '0.2308', '1987': '0.3710', '1988': '0.3770', '1989': '0.2982', '1990': '0.4350', '1991': '0.4079', '1992': '0.2583', '1993': '0.2857', '1994': '0.3420', '1995': '0.4593', '1996': '0.3225', '1997': '0.5185', '1998': '0.6335', '1999': '0.4427', '2000': '0.4255', '2001': '0.5455', '2002': '0.6018', '2003': '0.6145', '2004': '0.5806', '2005': '0.6583', '2006': '0.6139', '2007': '0.6113', '2008': '0.5415', '2009': '0.6354', '2010': '0.7008', '2011': '0.5759', '2012': '0.6219', '2013': '0.6687', '2014': '0.7402', '2015': '0.8990'}}
# convert the keys
year_str_lst = json_data['data'].keys()
year_lst = [int(year_str) for year_str in year_str_lst]
print(year_lst)
# convert the values
temp_str_lst = json_data['data'].values()
temp_lst = [float(temp_str) for temp_str in temp_str_lst]
print(temp_lst)
import pandas as pd

# build the dataframe
year_se = pd.Series(year_lst, name = 'year')
temp_se = pd.Series(temp_lst, name = 'temperature')
result_df = pd.concat([year_se, temp_se], axis = 1)
print(result_df.head())

# save as csv
result_df.to_csv('./files/json_to_csv.csv', index = None)
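As a side note, the same DataFrame can be built from the key/value pairs in one step. The sketch below uses a small inline dict in place of the real json_data so it is self-contained:

```python
import pandas as pd

# a small stand-in for json_data['data']
data = {'1880': '-0.1247', '1881': '-0.0707', '1882': '-0.0710'}

# convert keys and values while building the DataFrame
result_df = pd.DataFrame(
    [(int(year), float(temp)) for year, temp in data.items()],
    columns=['year', 'temperature'],
)
print(result_df)
```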

Writing JSON data

book_dict = [{'书名':'无声告白', '作者':'伍绮诗'}, {'书名':'我不是潘金莲', '作者':'刘震云'}, {'书名':'沉默的大多数 (王小波集)', '作者':'王小波'}]

filename = './files/json_output.json'
with open(filename, 'w', encoding='utf-8') as f_obj:
    f_obj.write(json.dumps(book_dict, ensure_ascii=False))
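json.dumps escapes non-ASCII characters by default; ensure_ascii=False keeps them readable, and indent pretty-prints the output. A minimal sketch illustrating the difference:

```python
import json

book = {'书名': '无声告白', '作者': '伍绮诗'}

# ensure_ascii=False keeps the Chinese characters as-is
pretty = json.dumps(book, ensure_ascii=False, indent=4)
print(pretty)

# default behaviour: non-ASCII characters escaped as \uXXXX
escaped = json.dumps(book)
print(escaped)
```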

2. Python database operations

2.1 SQLite

SQLite is an embedded relational database system: it is integrated into the user's program rather than run as a separate server, and it implements most of the SQL standard.

Connect to the database

import sqlite3

db_path = './files/1test.sqlite'

conn = sqlite3.connect(db_path)
cur = conn.cursor()
conn.text_factory = str  # handle Chinese text

Obtain basic information

cur.execute('SELECT SQLITE_VERSION()')

print('SQLite version:', str(cur.fetchone()[0]))

Inserting data one row at a time

cur.execute("DROP TABLE IF EXISTS book")
cur.execute("CREATE TABLE book(id INT, name TEXT, price DOUBLE)")
cur.execute("INSERT INTO book VALUES(1,'肖秀荣考研书系列:肖秀荣(2017)考研政治命题人终极预测4套卷',14.40)")
cur.execute("INSERT INTO book VALUES(2,'法医秦明作品集:幸存者+清道夫+尸语者+无声的证词+第十一根手指(套装共5册) (两种封面随机发货)',100.00)")
cur.execute("INSERT INTO book VALUES(3,'活着本来单纯:丰子恺散文漫画精品集(收藏本)',30.90)")
cur.execute("INSERT INTO book VALUES(4,'自在独行:贾平凹的独行世界',26.80)")
cur.execute("INSERT INTO book VALUES(5,'当你的才华还撑不起你的梦想时',23.00)")
cur.execute("INSERT INTO book VALUES(6,'巨人的陨落(套装共3册)',84.90)")
cur.execute("INSERT INTO book VALUES(7,'孤独深处(收录雨果奖获奖作品《北京折叠》)',21.90)")
cur.execute("INSERT INTO book VALUES(8,'世界知名企业员工指定培训教材:所谓情商高,就是会说话',22.00)")

Bulk insert data

books = (
    (9, '人间草木', 30.00),
    (10,'你的善良必须有点锋芒', 20.50),
    (11, '这么慢,那么美', 24.80),
    (12, '考拉小巫的英语学习日记:写给为梦想而奋斗的人(全新修订版)', 23.90)
)
cur.executemany("INSERT INTO book VALUES(?, ?, ?)", books)
conn.commit()

Querying data

cur.execute('SELECT * FROM book')
rows = cur.fetchall()

# access fields by index
for row in rows:
    print('序号: {}, 书名: {}, 价格: {}'.format(row[0], row[1], row[2]))
    

conn.row_factory = sqlite3.Row
cur = conn.cursor() 
cur.execute('SELECT * FROM book')
rows = cur.fetchall()

# access fields by column name
for row in rows:
    print('序号: {}, 书名: {}, 价格: {}'.format(row['id'], row['name'], row['price']))
    
conn.close()
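For analysis work, a query result can also be loaded straight into a pandas DataFrame with pd.read_sql_query. A self-contained sketch against an in-memory database (the table mirrors a trimmed version of the book table above):

```python
import sqlite3
import pandas as pd

# an in-memory database keeps the sketch self-contained
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE book(id INT, name TEXT, price DOUBLE)")
cur.executemany("INSERT INTO book VALUES(?, ?, ?)",
                [(1, '人间草木', 30.00), (2, '孤独深处', 21.90)])
conn.commit()

# the query result arrives as a DataFrame, ready for analysis
df = pd.read_sql_query('SELECT * FROM book', conn)
print(df)
conn.close()
```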

2.2 Multi-table joins

import sqlite3

db_path = './files/test_join.sqlite'

conn = sqlite3.connect(db_path)
cur = conn.cursor()
# create the department table and insert data
cur.execute("DROP TABLE IF EXISTS department")
cur.execute("CREATE TABLE department(\
                id INT PRIMARY KEY NOT NULL, \
                dept CHAR(50) NOT NULL, \
                emp_id INT NOT NULL)")
depts = (
        (1, 'IT Builing', 1),
        (2, 'Engineerin', 2),
        (3, 'Finance', 7)
)
cur.executemany("INSERT INTO department VALUES(?, ?, ?)", depts)
# create the company table and insert data
cur.execute("DROP TABLE IF EXISTS company")
cur.execute("CREATE TABLE company(\
                    id INT PRIMARY KEY NOT NULL, \
                    name CHAR(50) NOT NULL, \
                    age INT NOT NULL, \
                    address CHAR(50) NOT NULL,\
                    salary DOUBLE NOT NULL)")
companies = (
        (1, 'Paul', 32, 'California', 20000.0),
        (2, 'Allen', 25, 'Texas', 15000.0),
        (3, 'Teddy', 23, 'Norway', 20000.0),
        (4, 'Mark', 25, 'Rich-Mond', 65000.0),
        (5, 'David', 27, 'Texas', 85000.0),
        (6, 'Kim', 22, 'South-Hall', 45000.0),
        (7, 'James', 24, 'Houston', 10000.0)
)
cur.executemany("INSERT INTO company VALUES (?, ?, ?, ?, ?)", companies)
conn.commit()

Cross join (CROSS JOIN)

cur.execute("SELECT emp_id, name, dept FROM company CROSS JOIN department;")
rows = cur.fetchall()
for row in rows:
    print(row)
(1, 'Paul', 'IT Builing')
(2, 'Paul', 'Engineerin')
(7, 'Paul', 'Finance')
(1, 'Allen', 'IT Builing')
(2, 'Allen', 'Engineerin')
(7, 'Allen', 'Finance')
(1, 'Teddy', 'IT Builing')
(2, 'Teddy', 'Engineerin')
(7, 'Teddy', 'Finance')
(1, 'Mark', 'IT Builing')
(2, 'Mark', 'Engineerin')
(7, 'Mark', 'Finance')
(1, 'David', 'IT Builing')
(2, 'David', 'Engineerin')
(7, 'David', 'Finance')
(1, 'Kim', 'IT Builing')
(2, 'Kim', 'Engineerin')
(7, 'Kim', 'Finance')
(1, 'James', 'IT Builing')
(2, 'James', 'Engineerin')
(7, 'James', 'Finance')

Inner join (INNER JOIN)

cur.execute("SELECT emp_id, name, dept FROM company INNER JOIN department \
            ON company.id = department.emp_id;")
rows = cur.fetchall()
for row in rows:
    print(row)
(1, 'Paul', 'IT Builing')
(2, 'Allen', 'Engineerin')
(7, 'James', 'Finance')

Outer join (OUTER JOIN)

# left join
cur.execute("SELECT emp_id, name, dept FROM company LEFT OUTER JOIN department \
            ON company.id = department.emp_id;")
rows = cur.fetchall()
for row in rows:
    print(row)
(1, 'Paul', 'IT Builing')
(2, 'Allen', 'Engineerin')
(None, 'Teddy', None)
(None, 'Mark', None)
(None, 'David', None)
(None, 'Kim', None)
(7, 'James', 'Finance')
# right join: older SQLite versions have no RIGHT JOIN, so swap the two tables in a LEFT JOIN
cur.execute("SELECT emp_id, name, dept FROM department LEFT OUTER JOIN company \
            ON company.id = department.emp_id;")
rows = cur.fetchall()
for row in rows:
    print(row)
(1, 'Paul', 'IT Builing')
(2, 'Allen', 'Engineerin')
(7, 'James', 'Finance')
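The joins above can be reproduced end to end against an in-memory database. The sketch below mirrors a trimmed version of the two tables in this section and contrasts INNER JOIN with LEFT OUTER JOIN:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE department(id INT, dept TEXT, emp_id INT)")
cur.executemany("INSERT INTO department VALUES(?, ?, ?)",
                [(1, 'IT Building', 1), (2, 'Engineering', 2), (3, 'Finance', 7)])
cur.execute("CREATE TABLE company(id INT, name TEXT)")
cur.executemany("INSERT INTO company VALUES(?, ?)",
                [(1, 'Paul'), (2, 'Allen'), (3, 'Teddy'), (7, 'James')])

# INNER JOIN keeps only employees with a matching department
cur.execute("SELECT name, dept FROM company "
            "INNER JOIN department ON company.id = department.emp_id")
inner_rows = cur.fetchall()
print(inner_rows)

# LEFT OUTER JOIN keeps every employee; unmatched ones get a NULL dept
cur.execute("SELECT name, dept FROM company "
            "LEFT OUTER JOIN department ON company.id = department.emp_id")
left_rows = cur.fetchall()
print(left_rows)
conn.close()
```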

Origin blog.csdn.net/weixin_37956420/article/details/104255429