Introduction to Importing Data in Python

1. Introduction and flat files

</> Exploring your working directory

Your task is to use the IPython shell command !ls to check out the contents of your current directory and answer the following question: which of the following files is in your working directory?

  • huck_finn.txt
  • titanic.csv
  • moby_dick.txt
In [1]: !ls
moby_dick.txt

</> Importing entire text files

Open the file moby_dick.txt as read-only and store it in the variable file. Make sure to pass the filename as a string, enclosed in quotation marks.

Print the contents of the file to the shell using the print() function. As Hugo showed in the video, you’ll need to apply the method read() to the object file.

Check whether the file is closed by executing print(file.closed).

Close the file using the close() method.

Check again that the file is closed as you did above.

file = open('moby_dick.txt', mode='r') # open the file read-only
print(file.read()) # print the file's contents
print(file.closed) # check whether the file is closed
file.close() # close the file
print(file.closed) # confirm it is now closed

<script.py> output:
    CHAPTER 1. Loomings.
    
    Call me Ishmael. Some years ago--never mind how long precisely--having
    little or no money in my purse, and nothing particular to interest me on
    shore, I thought I would sail about a little and see the watery part of
    the world. It is a way I have of driving off the spleen and regulating
    the circulation. Whenever I find myself growing grim about the mouth;
    whenever it is a damp, drizzly November in my soul; whenever I find
    myself involuntarily pausing before coffin warehouses, and bringing up
    the rear of every funeral I meet; and especially whenever my hypos get
    such an upper hand of me, that it requires a strong moral principle to
    prevent me from deliberately stepping into the street, and methodically
    knocking people's hats off--then, I account it high time to get to sea
    as soon as I can. This is my substitute for pistol and ball. With a
    philosophical flourish Cato throws himself upon his sword; I quietly
    take to the ship. There is nothing surprising in this. If they but knew
    it, almost all men in their degree, some time or other, cherish very
    nearly the same feelings towards the ocean with me.
    False
    True

</> Importing text files line by line

Open moby_dick.txt using the with context manager and the variable file.

Print the first three lines of the file to the shell by using readline() three times within the context manager.

with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

<script.py> output:
    CHAPTER 1. Loomings.
    
    
    
    Call me Ishmael. Some years ago--never mind how long precisely--having

</> Pop quiz: examples of flat files

You’re now well-versed in importing text files and you’re about to become a wiz at importing flat files. But can you remember exactly what a flat file is? Test your knowledge by answering the following question: which of these file types below is NOT an example of a flat file?

  • A .csv file.
  • A tab-delimited .txt.
  • A relational database (e.g. PostgreSQL).

</> Pop quiz: what exactly are flat files?

Which of the following statements about flat files is incorrect?

  • Flat files consist of rows and each row is called a record.
  • Flat files consist of multiple tables with structured relationships between the tables.
  • Flat files are pervasive in data science.

</> Why we like flat files and the Zen of Python

The question you need to answer is: what is the 5th aphorism of the Zen of Python?

  • Flat is better than nested.
  • Flat files are essential for data science.
  • The world is representable as a flat file.
  • Flatness is in the eye of the beholder.
In [1]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

</> Using NumPy to import flat files

Fill in the arguments of np.loadtxt() by passing file and a comma ‘,’ for the delimiter.

Fill in the argument of print() to print the type of the object digits. Use the function type().

Execute the rest of the code to visualize one of the rows of the data.

import numpy as np
import matplotlib.pyplot as plt

file = 'digits.csv'
digits = np.loadtxt(file, delimiter=',') # import the file as a NumPy array
print(type(digits)) # print the datatype of digits

im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

<script.py> output:
    <class 'numpy.ndarray'>


</> Customizing your NumPy import

The file that you’ll be importing, digits_header.txt, has a header and is tab-delimited.

Complete the arguments of np.loadtxt(): the file you’re importing is tab-delimited, you want to skip the first row and you only want to import the first and third columns.

Complete the argument of the print() call in order to print the entire array that you just imported.

import numpy as np
file = 'digits_header.txt'
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=(0,2))
print(data)

<script.py> output:
    [[1. 0.]
     [0. 0.]
     [1. 0.]
     ...
     [2. 0.]
     [0. 0.]
     [5. 0.]]
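The effect of skiprows and usecols is easy to see on a small in-memory example; np.loadtxt() accepts any file-like object, including io.StringIO. The data below is made up purely for illustration:

```python
import io
import numpy as np

# Made-up tab-delimited data with a one-line header
txt = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6")

# Skip the header row and keep only the first and third columns
data = np.loadtxt(txt, delimiter='\t', skiprows=1, usecols=(0, 2))
print(data)  # two rows, columns a and c only
```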

</> Importing different datatypes

Complete the first call to np.loadtxt() by passing file as the first argument.


Complete the second call to np.loadtxt(). The file you’re importing is tab-delimited, the datatype is float, and you want to skip the first row.

Print the 10th element of data_float by completing the print() command. Be guided by the previous print() call.

Execute the rest of the code to visualize the data.

import numpy as np
import matplotlib.pyplot as plt

file = 'seaslug.txt'
data = np.loadtxt(file, delimiter='\t', dtype=str)
print(data[0]) # print the first element
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
print(data_float[9]) # print the 10th element

plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()

<script.py> output:
    ['Time' 'Percent']
    [0.    0.357]


</> Working with mixed datatypes

After importing the Titanic data as a structured array (as per the instructions above), print the entire column with the name Survived to the shell. What are the last 4 values of this column?

  • 1,0,0,1.
  • 1,2,0,0.
  • 1,0,1,0.
  • 0,1,1,1.
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
data['Survived']

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       ...
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0])
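Since titanic.csv isn't available here, the same structured-array behaviour can be sketched with a few made-up rows fed through io.StringIO, which np.genfromtxt() accepts in place of a filename:

```python
import io
import numpy as np

# A few synthetic rows in the spirit of titanic.csv (hypothetical data)
csv = io.StringIO("PassengerId,Survived,Age\n1,0,22\n2,1,38\n3,1,26")

# names=True turns the header into field names; dtype=None infers each column's type
data = np.genfromtxt(csv, delimiter=',', names=True, dtype=None, encoding='utf-8')

print(data.dtype.names)   # field names taken from the header
print(data['Survived'])   # access a whole column by name
```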

Import titanic.csv using the function np.recfromcsv() and assign it to the variable d. You'll only need to pass file to it because it has the defaults delimiter=',' and names=True in addition to dtype=None!

Run the remaining code to print the first three entries of the resulting array d.

file = 'titanic.csv'
d = np.recfromcsv(file) # the defaults delimiter=',', names=True and dtype=None already apply
print(d[:3])

<script.py> output:
    [(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171',  7.25  , b'', b'S')
     (2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
     (3, 1, 3, b'female', 26., 0, 0, b'STON/O2. 3101282',  7.925 , b'', b'S')]

</> Using pandas to import flat files as DataFrames

Import the pandas package using the alias pd.

Read titanic.csv into a DataFrame called df. The file name is already stored in the file object.

In a print() call, view the head of the DataFrame.

import pandas as pd
file = 'titanic.csv'
df = pd.read_csv(file) # read the file into a DataFrame
print(df.head()) # view the first 5 rows

<script.py> output:
       PassengerId  Survived  Pclass     Sex   Age  ...  Parch            Ticket     Fare  Cabin Embarked
    0            1         0       3    male  22.0  ...      0         A/5 21171   7.2500    NaN        S
    1            2         1       1  female  38.0  ...      0          PC 17599  71.2833    C85        C
    2            3         1       3  female  26.0  ...      0  STON/O2. 3101282   7.9250    NaN        S
    3            4         1       1  female  35.0  ...      0            113803  53.1000   C123        S
    4            5         0       3    male  35.0  ...      0            373450   8.0500    NaN        S
    
    [5 rows x 11 columns]

Import the first 5 rows of the file into a DataFrame using the function pd.read_csv() and assign the result to data. You’ll need to use the arguments nrows and header (there is no header in this file).

Build a numpy array from the resulting DataFrame in data and assign to data_array.

Execute print(type(data_array)) to print the datatype of data_array.

file = 'digits.csv'
data = pd.read_csv(file, header=None, nrows=5) # read the first 5 rows into a DataFrame
data_array = data.values # build a numpy array from the DataFrame
print(type(data_array))

<script.py> output:
    <class 'numpy.ndarray'>
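A side note: .values still works, but pandas now recommends DataFrame.to_numpy() for this conversion. A minimal sketch with two made-up rows standing in for digits.csv:

```python
import io
import pandas as pd

# Two made-up rows standing in for digits.csv
csv = io.StringIO("1,2\n3,4")
data = pd.read_csv(csv, header=None, nrows=2)

# .values works, but .to_numpy() is the accessor pandas now recommends
data_array = data.to_numpy()
print(type(data_array))  # <class 'numpy.ndarray'>
```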

</> Customizing your pandas import

Complete the sep (the pandas version of delim), comment and na_values arguments of pd.read_csv(). comment takes characters that comments occur after in the file, which in this case is '#'. na_values takes a list of strings to recognize as NA/NaN, in this case the string 'Nothing'.

Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the ‘Age’ of passengers aboard the Titanic.

import matplotlib.pyplot as plt
file = 'titanic_corrupt.txt'
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
print(data.head())

pd.DataFrame.hist(data[['Age']]) # plot the 'Age' variable in a histogram
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

<script.py> output:
       PassengerId  Survived  Pclass     Sex   Age  ...  Parch            Ticket    Fare  Cabin Embarked
    0            1         0       3    male  22.0  ...      0         A/5 21171   7.250    NaN       S 
    1            2         1       1  female  38.0  ...      0          PC 17599     NaN    NaN      NaN
    2            3         1       3  female  26.0  ...      0  STON/O2. 3101282   7.925    NaN        S
    3            4         1       1  female  35.0  ...      0            113803  53.100   C123        S
    4            5         0       3    male  35.0  ...      0            373450   8.050    NaN        S
    
    [5 rows x 11 columns]
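What sep, comment and na_values each do can be checked on a tiny made-up file, again via io.StringIO:

```python
import io
import pandas as pd

# Made-up tab-delimited data with a comment line and a 'Nothing' placeholder
txt = io.StringIO("Name\tAge\n# this whole line is ignored\nAlice\t22\nBob\tNothing")

data = pd.read_csv(txt, sep='\t', comment='#', na_values='Nothing')
print(data)  # the comment line is dropped and 'Nothing' becomes NaN
```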


2. Importing data from other file types

</> Not so flat any more

Check out the contents of your current directory and answer the following questions: (1) which file is in your directory and NOT an example of a flat file; (2) why is it not a flat file?

  • database.db is not a flat file because relational databases contain structured relationships and flat files do not.
  • battledeath.xlsx is not a flat file because it is a spreadsheet consisting of many sheets, not a single table.
  • titanic.txt is not a flat file because it is a .txt, not a .csv.
In [1]: import os
... wd = os.getcwd()
... os.listdir(wd)
Out[1]: ['titanic_corrupt.txt', 'battledeath.xlsx', 'titanic.txt']

</> Loading a pickled file

Import the pickle package.

Complete the second argument of open() so that it is read only for a binary file. This argument will be a string of two letters, one signifying ‘read only’, the other ‘binary’.

Pass the correct argument to pickle.load(); it should use the file object created by open().

Print the data, d.

Print the datatype of d; take your mind back to your previous use of the function type().

import pickle
with open('data.pkl', 'rb') as file: # open the binary file in read-only mode
    d = pickle.load(file)
    
print(d)
print(type(d))

<script.py> output:
    {'June': '69.4', 'Aug': '85', 'Airline': '8', 'Mar': '84.4'}
    <class 'dict'>
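Since data.pkl isn't available here, the full round trip can be sketched by writing a pickle first and reading it back; the sample dict below is made up:

```python
import os
import pickle
import tempfile

d = {'June': '69.4', 'Aug': '85'}  # a sample dict to serialize

# Write with 'wb' and read back with 'rb': pickle files are always binary
path = os.path.join(tempfile.mkdtemp(), 'data.pkl')
with open(path, 'wb') as f:
    pickle.dump(d, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded)  # the round trip gives back an equal dict
```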

</> Listing sheets in Excel files

Specifically, you’ll be loading and checking out the spreadsheet ‘battledeath.xlsx’, modified from the Peace Research Institute Oslo’s (PRIO) dataset. This data contains age-adjusted mortality rates due to war in various countries over several years.

Assign the spreadsheet filename (provided above) to the variable file.

Pass the correct argument to pd.ExcelFile() to load the file using pandas, assigning the result to the variable xls.

Print the sheetnames of the Excel spreadsheet by passing the necessary argument to the print() function.

import pandas as pd
file = 'battledeath.xlsx'
xls = pd.ExcelFile(file) # load the spreadsheet
print(xls.sheet_names) # print the sheet names

<script.py> output:
    ['2002', '2004']

</> Importing sheets from Excel files

Load the sheet ‘2004’ into the DataFrame df1 using its name as a string.

Print the head of df1 to the shell.

Load the sheet 2002 into the DataFrame df2 using its index (0).

Print the head of df2 to the shell.

df1 = xls.parse('2004') # load a sheet into a DataFrame by name
print(df1.head())
df2 = xls.parse(0) # load a sheet into a DataFrame by index
print(df2.head())

<script.py> output:
      War(country)      2004
    0  Afghanistan  9.451028
    1      Albania  0.130354
    2      Algeria  3.407277
    3      Andorra  0.000000
    4       Angola  2.597931
      War, age-adjusted mortality due to       2002
    0                        Afghanistan  36.083990
    1                            Albania   0.128908
    2                            Algeria  18.314120
    3                            Andorra   0.000000
    4                             Angola  18.964560

</> Customizing your spreadsheet import

Parse the first sheet by index. In doing so, skip the first row of data and name the columns ‘Country’ and ‘AAM due to War (2002)’ using the argument names. The values passed to skiprows and names all need to be of type list.

Parse the second sheet by index. In doing so, parse only the first column with the usecols parameter, skip the first row and rename the column ‘Country’. The argument passed to usecols also needs to be of type list.

df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
print(df1.head())

df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
print(df2.head())

<script.py> output:
                   Country  AAM due to War (2002)
    0              Albania               0.128908
    1              Algeria              18.314120
    2              Andorra               0.000000
    3               Angola              18.964560
    4  Antigua and Barbuda               0.000000
                   Country
    0              Albania
    1              Algeria
    2              Andorra
    3               Angola
    4  Antigua and Barbuda

</> How to import SAS7BDAT

How do you correctly import the function SAS7BDAT() from the package sas7bdat?

  • import SAS7BDAT from sas7bdat
  • from SAS7BDAT import sas7bdat
  • import sas7bdat from SAS7BDAT
  • from sas7bdat import SAS7BDAT

</> Importing SAS files

Import the class SAS7BDAT from the library sas7bdat.

In the context of the file ‘sales.sas7bdat’, load its contents to a DataFrame df_sas, using the method to_data_frame() on the object file.

Print the head of the DataFrame df_sas.

Execute your entire script to produce a histogram plot!

from sas7bdat import SAS7BDAT
with SAS7BDAT('sales.sas7bdat') as file: # read the file into a DataFrame
    df_sas = file.to_data_frame()
print(df_sas.head())

import pandas as pd
import matplotlib.pyplot as plt
pd.DataFrame.hist(df_sas[['P']]) # plot a histogram of column 'P'
plt.ylabel('count')
plt.show()

<script.py> output:
         YEAR     P           S
    0  1950.0  12.9  181.899994
    1  1951.0  11.9  245.000000
    2  1952.0  10.7  250.199997
    3  1953.0  11.3  265.899994
    4  1954.0  11.2  248.500000


</> Using read_stata to import Stata files

What is the correct way of using the read_stata() function to import disarea.dta into the object df?

  • df = 'disarea.dta'
  • df = read_stata.pd('disarea.dta')
  • df = pd.read_stata('disarea.dta')
  • df = pd.read_stata(disarea.dta)

</> Importing Stata files

Use pd.read_stata() to load the file ‘disarea.dta’ into the DataFrame df.

Print the head of the DataFrame df.

Visualize your results by plotting a histogram of the column disa10. We’ve already provided this code for you, so just run it!

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_stata('disarea.dta')
print(df.head())

pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()

<script.py> output:
      wbcode               country  disa1  disa2  disa3  ...  disa21  disa22  disa23  disa24  disa25
    0    AFG           Afghanistan   0.00   0.00   0.76  ...     0.0    0.00    0.02    0.00    0.00
    1    AGO                Angola   0.32   0.02   0.56  ...     0.0    0.99    0.98    0.61    0.00
    2    ALB               Albania   0.00   0.00   0.02  ...     0.0    0.00    0.00    0.00    0.16
    3    ARE  United Arab Emirates   0.00   0.00   0.00  ...     0.0    0.00    0.00    0.00    0.00
    4    ARG             Argentina   0.00   0.24   0.24  ...     0.0    0.00    0.01    0.00    0.11
    
    [5 rows x 27 columns]


</> Using File to import HDF5 files

The h5py package has been imported in the environment and the file LIGO_data.hdf5 is loaded in the object h5py_file.

What is the correct way of using the h5py function, File(), to import the file in h5py_file into an object, h5py_data, for reading only?

  • h5py_data = File(h5py_file, 'r')
  • h5py_data = h5py.File(h5py_file, 'r')
  • h5py_data = h5py.File(h5py_file, read)
  • h5py_data = h5py.File(h5py_file, 'read')

</> Using h5py to import HDF5 files

Import the package h5py.

Assign the name of the file to the variable file.

Load the file as read only into the variable data.

Print the datatype of data.

Print the names of the groups in the HDF5 file ‘LIGO_data.hdf5’.

import numpy as np
import h5py
file = 'LIGO_data.hdf5'
data = h5py.File(file, 'r')
print(type(data))
for key in data.keys(): # print the names of the groups
    print(key)

<script.py> output:
    <class 'h5py._hl.files.File'>
    meta
    quality
    strain

</> Extracting data from your HDF5 file

Assign the HDF5 group data[‘strain’] to group.

In the for loop, print out the keys of the HDF5 group in group.

Assign to the variable strain the values of the time series data data[‘strain’][‘Strain’] using the attribute .value.

Set num_samples equal to 10000, the number of time points we wish to sample.

Execute the rest of the code to produce a plot of the time series data in LIGO_data.hdf5.

group = data['strain'] # get the HDF5 group
for key in group.keys(): # print the keys of the group
    print(key)
strain = data['strain']['Strain'].value # the time series data (note: .value was removed in h5py 3.x; use data['strain']['Strain'][()] there)
num_samples = 10000
time = np.arange(0, 1, 1/num_samples) # set the time vector

plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()

<script.py> output:
    Strain


</> Loading .mat files

Import the package scipy.io.

Load the file ‘albeck_gene_expression.mat’ into the variable mat; do so using the function scipy.io.loadmat().

Use the function type() to print the datatype of mat to the IPython shell.

import scipy.io
mat = scipy.io.loadmat('albeck_gene_expression.mat')
print(type(mat))

<script.py> output:
    <class 'dict'>

</> The structure of .mat in Python

Use the method .keys() on the dictionary mat to print the keys. Most of these keys (in fact the ones that do NOT begin and end with ‘__’) are variables from the corresponding MATLAB environment.

Print the type of the value corresponding to the key ‘CYratioCyt’ in mat. Recall that mat[‘CYratioCyt’] accesses the value.

Print the shape of the value corresponding to the key ‘CYratioCyt’ using the numpy function shape().

Execute the entire script to see some oscillatory gene expression data!

import scipy.io
import matplotlib.pyplot as plt
import numpy as np

print(mat.keys()) # print the keys of the MATLAB dictionary
print(type(mat['CYratioCyt']))
print(np.shape(mat['CYratioCyt']))

data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()

<script.py> output:
    dict_keys(['__header__', '__version__', '__globals__', 'rfpCyt', 'rfpNuc', 'cfpNuc', 'cfpCyt', 'yfpNuc', 'yfpCyt', 'CYratioCyt'])
    <class 'numpy.ndarray'>
    (200, 137)


3. Working with relational databases in Python

</> Pop quiz: The relational model

Which of the following is not part of the relational model?

  • Each row or record in a table represents an instance of an entity type.
  • Each column in a table represents an attribute or feature of an instance.
  • Every table contains a primary key column, which has a unique entry for each row.
  • A database consists of at least 3 tables.
  • There are relations between tables.

</> Creating a database engine

Import the function create_engine from the module sqlalchemy.

Create an engine to connect to the SQLite database ‘Chinook.sqlite’ and assign it to engine.

from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite') # create the engine

</> What are the tables in the database?

Import the function create_engine from the module sqlalchemy.

Create an engine to connect to the SQLite database ‘Chinook.sqlite’ and assign it to engine.

Using the method table_names() on the engine engine, assign the table names of ‘Chinook.sqlite’ to the variable table_names.

Print the object table_names to the shell.

from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite')
table_names = engine.table_names() # save the table names to a list
print(table_names)

<script.py> output:
    ['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']
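One caveat: engine.table_names() was deprecated in SQLAlchemy 1.4 and removed in 2.0; the current way is to go through sqlalchemy.inspect(). A minimal sketch, with an in-memory SQLite database standing in for Chinook.sqlite:

```python
from sqlalchemy import create_engine, inspect, text

# An in-memory SQLite database stands in for Chinook.sqlite here
engine = create_engine('sqlite:///:memory:')
with engine.begin() as con:  # begin() commits the DDL when the block exits
    con.execute(text("CREATE TABLE Album (AlbumId INTEGER, Title TEXT)"))

# inspect() replaces the deprecated engine.table_names()
print(inspect(engine).get_table_names())  # ['Album']
```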

</> The Hello World of SQL Queries!

Open the engine connection as con using the method connect() on the engine.

Execute the query that selects ALL columns from the Album table. Store the results in rs.

Store all of your query results in the DataFrame df by applying the fetchall() method to the results rs.

Close the connection!

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')
con = engine.connect() # open the connection
rs = con.execute("SELECT * FROM Album")
df = pd.DataFrame(rs.fetchall())
con.close() # close the connection
print(df.head())

<script.py> output:
       0                                      1  2
    0  1  For Those About To Rock We Salute You  1
    1  2                      Balls to the Wall  2
    2  3                      Restless and Wild  2
    3  4                      Let There Be Rock  1
    4  5                               Big Ones  3
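The same connect/execute/fetchall pattern also works with Python's built-in sqlite3 module, which needs no third-party install. A minimal sketch on a throwaway in-memory database with made-up Album rows:

```python
import sqlite3

# Build a throwaway in-memory database with an Album-like table (made-up rows)
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE Album (AlbumId INTEGER, Title TEXT, ArtistId INTEGER)")
con.executemany("INSERT INTO Album VALUES (?, ?, ?)",
                [(1, 'For Those About To Rock We Salute You', 1),
                 (2, 'Balls to the Wall', 2)])

# fetchall() plays the same role as it does on a SQLAlchemy result set
rows = con.execute("SELECT * FROM Album").fetchall()
con.close()
print(rows)
```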

</> Customizing the Hello World of SQL Queries

Execute the SQL query that selects the columns LastName and Title from the Employee table. Store the results in the variable rs.

Apply the method fetchmany() to rs in order to retrieve 3 of the records. Store them in the DataFrame df.

Using the rs object, set the DataFrame’s column names to the corresponding names of the table columns.

with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

print(len(df))
print(df.head())

<script.py> output:
    3
      LastName                Title
    0    Adams      General Manager
    1  Edwards        Sales Manager
    2  Peacock  Sales Support Agent

</> Filtering your database records using SQL’s WHERE

Complete the argument of create_engine() so that the engine for the SQLite database ‘Chinook.sqlite’ is created.

Execute the query that selects all records from the Employee table where ‘EmployeeId’ is greater than or equal to 6. Use the >= operator and assign the results to rs.

Apply the method fetchall() to rs in order to fetch all records in rs. Store them in the DataFrame df.

Using the rs object, set the DataFrame’s column names to the corresponding names of the table columns.

engine = create_engine('sqlite:///Chinook.sqlite')
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
print(df.head())

<script.py> output:
       EmployeeId  LastName FirstName       Title  ReportsTo  ... Country PostalCode              Phone                Fax                    Email
    0           6  Mitchell   Michael  IT Manager          1  ...  Canada    T3B 0C5  +1 (403) 246-9887  +1 (403) 246-9899  michael@chinookcorp.com
    1           7      King    Robert    IT Staff          6  ...  Canada    T1K 5N8  +1 (403) 456-9986  +1 (403) 456-8485   robert@chinookcorp.com
    2           8  Callahan     Laura    IT Staff          6  ...  Canada    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772    laura@chinookcorp.com
    
    [3 rows x 15 columns]

</> Ordering your SQL records with ORDER BY

Using the function create_engine(), create an engine for the SQLite database Chinook.sqlite and assign it to the variable engine.

In the context manager, execute the query that selects all records from the Employee table and orders them in increasing order by the column BirthDate. Assign the result to rs.

In a call to pd.DataFrame(), apply the method fetchall() to rs in order to fetch all records in rs. Store them in the DataFrame df.

Set the DataFrame’s column names to the corresponding names of the table columns.

engine = create_engine('sqlite:///Chinook.sqlite')
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
print(df.head())

<script.py> output:
       EmployeeId  LastName FirstName                Title  ReportsTo  ... Country PostalCode              Phone                Fax                     Email
    0           4      Park  Margaret  Sales Support Agent        2.0  ...  Canada    T2P 5G3  +1 (403) 263-4423  +1 (403) 263-4289  margaret@chinookcorp.com
    1           2   Edwards     Nancy        Sales Manager        1.0  ...  Canada    T2P 2T3  +1 (403) 262-3443  +1 (403) 262-3322     nancy@chinookcorp.com
    2           1     Adams    Andrew      General Manager        NaN  ...  Canada    T5K 2N1  +1 (780) 428-9482  +1 (780) 428-3457    andrew@chinookcorp.com
    3           5   Johnson     Steve  Sales Support Agent        2.0  ...  Canada    T3B 1Y7   1 (780) 836-9987   1 (780) 836-9543     steve@chinookcorp.com
    4           8  Callahan     Laura             IT Staff        6.0  ...  Canada    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772     laura@chinookcorp.com
    
    [5 rows x 15 columns]

</> Pandas and The Hello World of SQL Queries!

Import the pandas package using the alias pd.

Using the function create_engine(), create an engine for the SQLite database Chinook.sqlite and assign it to the variable engine.

Use the pandas function read_sql_query() to assign to the variable df the DataFrame of results from the following query: select all records from the table Album.

The remainder of the code is included to confirm that the DataFrame created by this method is equal to that created by the previous method that you learned.

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')
df = pd.read_sql_query("SELECT * FROM Album", engine)
print(df.head())

with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()
print(df.equals(df1))

<script.py> output:
       AlbumId                                  Title  ArtistId
    0        1  For Those About To Rock We Salute You         1
    1        2                      Balls to the Wall         2
    2        3                      Restless and Wild         2
    3        4                      Let There Be Rock         1
    4        5                               Big Ones         3
    True
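It's worth knowing that pd.read_sql_query() accepts a plain sqlite3 connection as well as a SQLAlchemy engine. A minimal sketch on an in-memory database with one made-up row:

```python
import sqlite3
import pandas as pd

# An in-memory database with a single made-up Album row
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE Album (AlbumId INTEGER, Title TEXT)")
con.execute("INSERT INTO Album VALUES (1, 'Big Ones')")

# read_sql_query also accepts a DBAPI connection, not just an engine
df = pd.read_sql_query("SELECT * FROM Album", con)
con.close()
print(df)
```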

</> Pandas for more complex querying

Using the function create_engine(), create an engine for the SQLite database Chinook.sqlite and assign it to the variable engine.

Use the pandas function read_sql_query() to assign to the variable df the DataFrame of results from the following query: select all records from the Employee table where the EmployeeId is greater than or equal to 6 and ordered by BirthDate (make sure to use WHERE and ORDER BY in this precise order).

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')
df = pd.read_sql_query("SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate", engine)
print(df.head())

<script.py> output:
       EmployeeId  LastName FirstName       Title  ReportsTo  ... Country PostalCode              Phone                Fax                    Email
    0           8  Callahan     Laura    IT Staff          6  ...  Canada    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772    laura@chinookcorp.com
    1           7      King    Robert    IT Staff          6  ...  Canada    T1K 5N8  +1 (403) 456-9986  +1 (403) 456-8485   robert@chinookcorp.com
    2           6  Mitchell   Michael  IT Manager          1  ...  Canada    T3B 0C5  +1 (403) 246-9887  +1 (403) 246-9899  michael@chinookcorp.com
    
    [3 rows x 15 columns]

</> The power of SQL lies in relationships between tables: INNER JOIN

Assign to rs the results from the following query: select all the records, extracting the Title of the record and Name of the artist of each record from the Album table and the Artist table, respectively. To do so, INNER JOIN these two tables on the ArtistID column of both.

In a call to pd.DataFrame(), apply the method fetchall() to rs in order to fetch all records in rs. Store them in the DataFrame df.

Set the DataFrame’s column names to the corresponding names of the table columns.

with engine.connect() as con:
    rs = con.execute("SELECT Title,Name FROM Album INNER JOIN Artist ON Album.ArtistID=Artist.ArtistID")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
print(df.head())

<script.py> output:
                                       Title       Name
    0  For Those About To Rock We Salute You      AC/DC
    1                      Balls to the Wall     Accept
    2                      Restless and Wild     Accept
    3                      Let There Be Rock      AC/DC
    4                               Big Ones  Aerosmith
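The join logic itself can be verified end to end with the stdlib sqlite3 module on two tiny made-up tables that share an ArtistId column:

```python
import sqlite3

# Two tiny made-up tables that share an ArtistId column
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE Artist (ArtistId INTEGER, Name TEXT)")
con.execute("CREATE TABLE Album (AlbumId INTEGER, Title TEXT, ArtistId INTEGER)")
con.execute("INSERT INTO Artist VALUES (1, 'AC/DC')")
con.execute("INSERT INTO Album VALUES (1, 'Let There Be Rock', 1)")

# INNER JOIN keeps only the row pairs where the ON condition matches
rows = con.execute(
    "SELECT Title, Name FROM Album "
    "INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId").fetchall()
con.close()
print(rows)  # [('Let There Be Rock', 'AC/DC')]
```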

</> Filtering your INNER JOIN

Use the pandas function read_sql_query() to assign to the variable df the DataFrame of results from the following query: select all records from PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId that satisfy the condition Milliseconds < 250000.

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite')

df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)
print(df.head())

<script.py> output:
       PlaylistId  TrackId  TrackId              Name  AlbumId  ...  GenreId  Composer Milliseconds    Bytes  UnitPrice
    0           1     3390     3390  One and the Same      271  ...       23      None       217732  3559040       0.99
    1           1     3392     3392     Until We Fall      271  ...       23      None       230758  3766605       0.99
    2           1     3393     3393     Original Fire      271  ...       23      None       218916  3577821       0.99
    3           1     3394     3394       Broken City      271  ...       23      None       228366  3728955       0.99
    4           1     3395     3395          Somedays      271  ...       23      None       213831  3497176       0.99
    
    [5 rows x 11 columns]

Reposted from blog.csdn.net/weixin_42871941/article/details/104908160