Udacity Data Analysis (Introductory): Exploring US Bikeshare Data

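Overview
Bikeshare data
The datasets
Questions
Project code

Importing libraries and datasets
Input helper function
Selecting the data
Getting the city, month, and day to analyze from user input
Loading the data for the chosen city, month, and day
Computing and displaying the most popular stations and trip
Computing and displaying the total and average trip duration
Computing and displaying user statistics
Main function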

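Overview

This project uses Python to explore data from the bike share systems of three major US cities: Chicago, New York City, and Washington, DC. You write code that imports the data and answers interesting questions about it by computing descriptive statistics. You also write a script that takes raw input from the user and creates an interactive experience in the terminal to present these statistics.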

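Bikeshare data

Over the past decade, bike share systems have grown in number and in popularity in cities around the world. A bike share system lets users rent a bicycle for a short time at a set price. A user can pick up a bike at point A and return it at point B, or, if they just want to go for a ride, return it to the same place. Each bike can serve several users per day.

Thanks to the rapid development of information technology, users can easily access a dock in the system to unlock or return a bike. These technologies also produce a wealth of data that can be used to explore how bike share systems are used.

In this project, you will use data provided by Motivate, a bike share system provider for many major US cities, to explore bike share usage patterns. You will compare system usage across three cities: Chicago, New York City, and Washington, DC.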

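The datasets

Data for the first six months of 2017 is provided for all three cities. All three data files contain the same six core columns:

Start Time (e.g., 2017-01-01 00:07:57)
End Time (e.g., 2017-01-01 00:20:53)
Trip Duration (in seconds, e.g., 776)
Start Station (e.g., Broadway & Barry Ave)
End Station (e.g., Sedgwick St & North Ave)
User Type (Subscriber/Registered or Customer/Casual)

The Chicago and New York City files also contain the following two columns (see the image below for the data format):

Gender
Birth Year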
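
The column layout above can be checked before any statistics are computed. The following is a minimal sketch, not part of the project code: `SAMPLE_CSV` is a made-up two-row sample in the documented layout (it is not real trip data), and the names `CORE_COLUMNS` and `OPTIONAL_COLUMNS` are introduced here only for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the documented layout (not real data).
SAMPLE_CSV = """Start Time,End Time,Trip Duration,Start Station,End Station,User Type,Gender,Birth Year
2017-01-01 00:07:57,2017-01-01 00:20:53,776,Broadway & Barry Ave,Sedgwick St & North Ave,Subscriber,Male,1985
2017-01-02 08:15:00,2017-01-02 08:25:00,600,Sedgwick St & North Ave,Broadway & Barry Ave,Customer,,
"""

CORE_COLUMNS = ['Start Time', 'End Time', 'Trip Duration',
                'Start Station', 'End Station', 'User Type']
OPTIONAL_COLUMNS = ['Gender', 'Birth Year']

# parse_dates converts the two timestamp columns to datetime on load.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV),
                     parse_dates=['Start Time', 'End Time'])

# Core columns should always be present; optional ones depend on the city.
missing_core = [c for c in CORE_COLUMNS if c not in sample.columns]
present_optional = [c for c in OPTIONAL_COLUMNS if c in sample.columns]

print('Missing core columns:', missing_core)
print('Optional columns present:', present_optional)
print(sample.dtypes)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the CSV path, for example one of the values in `CITY_DATA`.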

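Questions

1. Which month is most common in the start times (the Start Time column)?
2. Which day of the week (e.g., Monday, Tuesday) is most common in the start times?
3. Which hour of the day is most common in the start times?
4. What is the total trip duration (Trip Duration), and what is the average trip duration?
5. Which start station (Start Station) is most popular, and which end station (End Station) is most popular?
6. Which trip is most popular (that is, which combination of start station and end station)?
7. How many users are there of each user type?
8. How many users are there of each gender?
9. What are the earliest, most recent, and most common years of birth?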

Project code

Importing libraries and datasets

import time
import pandas as pd
import numpy as np

CITY_DATA = { 'chicago': 'chicago.csv',
              'new york city': 'new_york_city.csv',
              'washington': 'washington.csv' }

Input helper function

def input_mod(input_print, enterable_list):
    """
    Prompt repeatedly until the user enters one of the allowed values.

    Args:
        (str) input_print - prompt shown to the user
        (list) enterable_list - allowed values in title case (cities or months)
    Returns:
        (str) ret - the user's choice of city or month, in lowercase
    """
    while True:
        # Normalize to title case so the comparison ignores the user's casing
        ret = input(input_print).strip().title()
        if ret in enterable_list:
            return ret.lower()
        print('Sorry, please enter one of: {}.'.format(', '.join(enterable_list)))
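
Because the helper blocks on `input()`, it is awkward to try out by hand. One way to exercise the retry loop is to patch `input` with the standard library's `unittest.mock`. This is only a sketch: the helper is repeated here (with whitespace stripping added) so the block is self-contained, and the scripted answers are made up.

```python
from unittest.mock import patch


def input_mod(input_print, enterable_list):
    """Prompt until the user enters one of the allowed (title-case) values."""
    while True:
        ret = input(input_print).strip().title()
        if ret in enterable_list:
            return ret.lower()
        print('Sorry, please enter one of: {}.'.format(', '.join(enterable_list)))


cities = ['Chicago', 'New York City', 'Washington']

# side_effect feeds one scripted answer per input() call:
# the first is rejected, the second is accepted.
with patch('builtins.input', side_effect=['boston', 'new york city']):
    choice = input_mod('Which city?\n', cities)

print(choice)
```

The first scripted answer triggers the "Sorry" message and the loop asks again; the second is accepted and returned in lowercase, which is the form the keys of `CITY_DATA` use.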

Selecting the data

def see_datas(data):
    """
    Ask the user to choose a city, a month, or a day of the week.

    Args:
        (str) data - which choice to ask for: 'cities', 'months', or 'days'
    Returns:
        (str) the chosen city or month in lowercase, or the chosen day name (e.g., 'Sunday')
    """
    # Build the lists and dictionary of valid cities, months, and days
    cities = ['Chicago', 'New York City', 'Washington']
    months = ['January', 'February', 'March', 'April', 'May', 'June']
    days = {'1': 'Sunday', '2': 'Monday', '3': 'Tuesday', '4': 'Wednesday',
            '5': 'Thursday', '6': 'Friday', '7': 'Saturday'}
    # Get user input for the city
    if data == 'cities':
        return input_mod('Would you like to see data for Chicago, New York City or Washington: \n', cities)
    # Get user input for the month
    if data == 'months':
        return input_mod('Which month? January, February, March, April, May or June?\n', months)
    # Get user input for the day of the week
    if data == 'days':
        while True:
            day = input('Which day? Please type an integer (e.g., 1=Sunday): \n').strip()
            if day in days:
                return days[day]
            print('Sorry, please enter an integer from 1 to 7 (e.g., 1=Sunday).')
    # Any other argument is a programming error; fail instead of looping forever
    raise ValueError("data must be 'cities', 'months', or 'days'")

Getting the city, month, and day to analyze from user input

def get_filters():
    """
    Asks user to specify a city, month, and day to analyze.

    Returns:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    """
    print("Hello! Let's explore some US bikeshare data!")

    # Get user input for city (chicago, new york city, washington)
    city = see_datas('cities')

    # Ask which time filters to apply, then get the month and/or day of week
    while True:
        enter = input('Would you like to filter the data by month, day, both, or not at all? Type "none" for no time filter.\n').strip().lower()
        if enter == 'none':
            month = 'all'
            day = 'all'
            break
        elif enter == 'both':
            month = see_datas('months')
            day = see_datas('days')
            break
        elif enter == 'month':
            month = see_datas('months')
            day = 'all'
            break
        elif enter == 'day':
            month = 'all'
            day = see_datas('days')
            break
        else:
            print('Sorry, please enter month, day, both, or none.')

    print('-'*40)
    return city, month, day

Loading the data for the chosen city, month, and day

def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.

    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    Returns:
        df - Pandas DataFrame containing city data filtered by month and day
    """
    # load data file into a dataframe
    df = pd.read_csv(CITY_DATA[city])

    # convert the Start Time column to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])

    # extract month and day of week from Start Time to create new columns
    df['month'] = df['Start Time'].dt.month
    # dt.weekday_name was removed in pandas 1.0; dt.day_name() replaces it
    df['day_of_week'] = df['Start Time'].dt.day_name()

    # filter by month if applicable
    if month != 'all':
        # use the index of the months list to get the corresponding int
        months = ['january', 'february', 'march', 'april', 'may', 'june']
        month = months.index(month) + 1

        # filter by month to create the new dataframe
        df = df[df['month'] == month]

    # filter by day of week if applicable
    if day != 'all':
        # filter by day of week to create the new dataframe
        df = df[df['day_of_week'] == day.title()]

    return df
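
The main function at the end of this post calls `time_stats(df)`, but no function with that name is defined in the listing. Going by the first three questions, it should presumably report the most common month, day of the week, and hour. Here is a minimal sketch of what it might look like. Treat it as an assumption and not the original implementation: it reads `Start Time` directly so that it does not rely on helper columns, and it returns its results (which the other `*_stats` functions do not) only so they are easy to check.

```python
import time

import pandas as pd


def time_stats(df):
    """Displays statistics on the most frequent times of travel."""

    print('\nCalculating The Most Frequent Times of Travel...\n')
    start_time = time.time()

    # Convert defensively in case the column is still plain text.
    start = pd.to_datetime(df['Start Time'])

    # mode() returns the most frequent value(s) in sorted order;
    # [0] takes the first one if there is a tie.
    common_month = start.dt.month_name().mode()[0]
    common_day = start.dt.day_name().mode()[0]
    common_hour = int(start.dt.hour.mode()[0])

    print('Most common month: {}.'.format(common_month))
    print('Most common day of week: {}.'.format(common_day))
    print('Most common start hour: {}.'.format(common_hour))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
    return common_month, common_day, common_hour


# Made-up start times: two Mondays in January at 8 am, one Tuesday in February.
demo = pd.DataFrame({'Start Time': ['2017-01-02 08:05:00',
                                    '2017-01-09 08:45:00',
                                    '2017-02-14 17:30:00']})
result = time_stats(demo)
```

When a month or day filter is active, the corresponding "most common" value is simply the value that was filtered on.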

Computing and displaying the most popular stations and trip

def station_stats(df):
    """Displays statistics on the most popular stations and trip."""

    print('\nCalculating The Most Popular Stations and Trip...\n')
    start_time = time.time()

    # Display most commonly used start station
    common_start = df['Start Station'].value_counts().index[0]
    print('Most commonly used start station: {}.'.format(common_start))

    # Display most commonly used end station
    common_end = df['End Station'].value_counts().index[0]
    print('Most commonly used end station: {}.'.format(common_end))

    # Display most frequent combination of start station and end station trip.
    # Build the combination as a local Series so the caller's DataFrame is not modified.
    combination = df['Start Station'] + ' / ' + df['End Station']
    common_combine = combination.value_counts().index[0]
    print('Most frequent combination of start and end station trip: {}.'.format(common_combine))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
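
Joining the two station names into one string works, but the most popular trip can also be found by grouping on both columns, which keeps the start and end stations separate and gives the trip count at the same time. This is a sketch of that alternative, not the approach used above; the station names in `trips` are made up.

```python
import pandas as pd

# Made-up trips: 'A St' -> 'B Ave' occurs twice, every other pair once.
trips = pd.DataFrame({
    'Start Station': ['A St', 'A St', 'B Ave', 'A St'],
    'End Station': ['B Ave', 'B Ave', 'A St', 'C Rd'],
})

# size() counts the rows for each (start, end) pair.
pair_counts = trips.groupby(['Start Station', 'End Station']).size()

# idxmax() returns the index label of the largest count, here a (start, end) tuple.
top_start, top_end = pair_counts.idxmax()
top_count = int(pair_counts.max())

print('Most frequent trip: {} to {} ({} trips).'.format(top_start, top_end, top_count))
```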

Computing and displaying the total and average trip duration

def trip_duration_stats(df):
    """Displays statistics on the total and average trip duration."""

    print('\nCalculating Trip Duration...\n')
    start_time = time.time()

    # TO DO: display total travel time
    total_time=df['Trip Duration'].sum()
    print('Total travel time: {} seconds.'.format(total_time))

    # TO DO: display mean travel time
    mean_time=df['Trip Duration'].mean()
    print('Mean travel time: {} seconds.'.format(mean_time))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
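
The total over several months of trips is a very large number of seconds, which is hard to read. A small helper can convert it to days, hours, minutes, and seconds. This is an optional sketch; the name `format_duration` is introduced here and is not part of the project code.

```python
def format_duration(seconds):
    """Format a number of seconds as 'Dd HH:MM:SS'."""
    total = int(round(seconds))
    # Peel off days, then hours, then minutes; what is left is seconds.
    days, remainder = divmod(total, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, secs = divmod(remainder, 60)
    return '{}d {:02d}:{:02d}:{:02d}'.format(days, hours, minutes, secs)


print(format_duration(776))      # 0d 00:12:56
print(format_duration(93784.4))  # 1d 02:03:04
```

The first call uses the 776-second example from the dataset description, which matches the gap between its sample start and end times (00:07:57 to 00:20:53).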

Computing and displaying user statistics

def user_stats(df):
    """Displays statistics on bikeshare users."""

    print('\nCalculating User Stats...\n')
    start_time = time.time()

    # Display counts of user types. Loop over the counts instead of indexing
    # two fixed positions, since filtered data may contain more or fewer types.
    print('User type')
    for user_type, count in df['User Type'].value_counts().items():
        print('{}: {}'.format(user_type, count))

    # Display counts of gender (only some cities have this column)
    cities_columns = df.columns
    if 'Gender' in cities_columns:
        print('Gender')
        for gender, count in df['Gender'].value_counts().items():
            print('{}: {}'.format(gender, count))
    else:
        print("Sorry, this city doesn't have gender data.")

    # Display earliest, most recent, and most common year of birth
    if 'Birth Year' in cities_columns:
        earliest_birth = df['Birth Year'].min()
        recent_birth = df['Birth Year'].max()
        common_birth = df['Birth Year'].value_counts().index[0]
        print('Earliest user year of birth: %i.' % (earliest_birth))
        print('Most recent user year of birth: %i.' % (recent_birth))
        print('Most common user year of birth: %i.' % (common_birth))
    else:
        print("Sorry, this city doesn't have birth year data.")

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
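
If the Birth Year column contains missing values (an assumption about the data files, not something stated above), pandas reads it as floating point, and a filtered subset could in principle have no birth years at all. The following sketch shows one way to guard against both cases; the helper name `birth_year_summary` and the sample values are made up.

```python
import pandas as pd


def birth_year_summary(birth_years):
    """Return (earliest, most_recent, most_common) as ints, or None if there is no data."""
    years = birth_years.dropna()
    if years.empty:
        return None
    # mode() lists the most frequent value(s) in sorted order; take the first.
    return int(years.min()), int(years.max()), int(years.mode()[0])


# Made-up values with one missing entry; 1985 appears twice.
sample_years = pd.Series([1985.0, None, 1992.0, 1985.0, 1970.0])
summary = birth_year_summary(sample_years)
print(summary)

# A column with no usable values yields None instead of raising.
empty_summary = birth_year_summary(pd.Series([None, None], dtype='float64'))
print(empty_summary)
```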

Main function

def main():
    while True:
        city, month, day = get_filters()
        df = load_data(city, month, day)

        # time_stats() reports the most common month, weekday, and hour;
        # it must be defined alongside the other *_stats functions.
        time_stats(df)
        station_stats(df)
        trip_duration_stats(df)
        user_stats(df)

        restart = input('\nWould you like to restart? Enter yes or no.\n')
        if restart.lower() != 'yes':
            break

if __name__ == "__main__":
    main()

Link: https://pan.baidu.com/s/1sSgbXBaSy1IxIfJqoMil2w Password: m55o