Kaggle data science practitioner analysis report

Data Science Practitioners Investigate Python Language Analysis

Data description

On August 26, 2017, Kaggle, the world's largest data science community, released a data set for the industry-wide survey of the current state of the data science/machine learning industry. The questionnaire data was collected from August 7th to August 25th, 2017. The interviewees included 16,716+ practitioners from more than 50 countries. According to the Kaggle questionnaire data set, we dig out some nutritious information.

Variable selection

There are a total of 228 variables. We choose some of the questions we are interested in to analyze. The variable description table is as follows:
Insert picture description here

Data preprocessing

Calculate the data of Taiwan in the region into the data of China, and generate a variable description table

data = pd.read_csv('multipleChoiceResponses.csv', encoding="unicode_escape") 
data.shape
data.info() 
col_name  =list( data.columns ) 
data['Country']=['China' if i=='People \'s Republic of China' or 
                 i=='Republic of China'  else i for i in data['Country'] ]
# 选中感兴趣的变量
df = data[[
            'Age', 'Country', 'CurrentJobTitleSelect','MLToolNextYearSelect',
            'MLMethodNextYearSelect', 'EmploymentStatus', 'FormalEducation',
            'CoursePlatformSelect', 'FirstTrainingSelect', 'Tenure','JobSatisfaction',
            'LanguageRecommendationSelect', 'JobSkillImportancePython',
            'JobSkillImportanceR', 'JobSkillImportanceSQL', 'JobSkillImportanceBigData'
            ]]
v_illustration = pd.DataFrame({
    
    '变量名':df.columns, 
      '类型':[type(df.iat[1,i]) for i,col in enumerate(df.columns)],
      '缺失量':[sum(df.iloc[:,i].isnull()) for i,col in enumerate(df.columns)]})
v_illustration.to_csv('v_illustration.csv',encoding='utf-8-sig')

Univariate analysis

Variable distribution display by type

The process of drawing horizontal bars is written as a function to facilitate reuse.

def type_to_bar(type_obj):
    '''输入pd.Series(),输出水平的条形图'''
    count = type_obj.value_counts()  
    if len(count)>10 :
        count = count[:10]  # 只要前十个
    count = count[::-1] # 第一放在上面
    position  = range(len(count)) 
    color = color_list[:len(count)]
    plt.barh(y=position, width=count, color = color[::-1] )
   # plt.title(type_obj.name)
    plt.yticks(ticks=position, labels=count.index)
    z = zip(count.values,position)
    for i,j in z:
        print(i,j)
        plt.text(i,j,s=str(int(i)),ha='right',va='center',color='white')      
View the distribution of respondents by country
fig = plt.figure(dpi=140)
type_obj= df['Country']
type_to_bar(type_obj)      

Insert picture description here

Position distribution of respondents
# 受访者职位分布
fig = plt.figure(dpi=140)
type_obj= df['CurrentJobTitleSelect']
type_to_bar(type_obj) 

Insert picture description here

Machine learning tools that respondents will learn next year
# 受访者下一年准备学习的机器学习工具
fig = plt.figure(dpi=140)
type_obj= df['MLToolNextYearSelect']
type_to_bar(type_obj)

Insert picture description here

Remaining variables

The remaining sub-type variables have the same method and can be drawn together, so I won’t show them here.

for i in df.columns: 
    fig = plt.figure(dpi=140)
    type_obj= df[i]
    type_to_bar(type_obj)
Numerical variable distribution display

Age distribution of respondents

# 单变量年龄分布
# 画出一条垂线标明均值 
fig = plt.figure(figsize=(10,6),dpi=200) 
sns.kdeplot(df['Age'], shade=True, alpha=.7,)
plt.suptitle('年龄分布') 
x=df['Age'].mean()
plt.axvline(x,color='red') 
plt.text(x,0.04,'年龄均值32',size =15)
plt.show()

Insert picture description here

Bivariate analysis

Influence of sub-type variables on numeric variables

Take the age difference in different countries as an example, use a horizontal bar chart to show the average age of different countries

# 双变量关系分析, 国家对年龄的影响.   分类型——数值型
# 水平条形图 
def num_to_bar(num, label):
    '''输入pd.dataframe(),一列为数字型,一列为标签,
    输出各类别均值直方图'''
    cla  = label.value_counts().index
    count = []
    for lab in cla:
        count.append(num[label==lab].mean())
    count = pd.Series(count, index = cla)
    count.sort_values(ascending=False,inplace=True)
    if len(count)>10 :
        count = count[:10]  # 只要前十个
    count = count[::-1] # 第一放在上面
    position  = range(len(count)) 
    color = color_list[:len(count)]
    plt.barh(y=position, width=count, color = color[::-1] )
    plt.title(num.name)
    plt.yticks(ticks=position, labels=count.index)
    z = zip(count.values,position)
    for i,j in z:
       # print(i,j)
        plt.text(i,j,s=str(int(i)),
                 ha='right',va='center',color='white')
               
# 各国家受访者年龄对比
fig = plt.figure(dpi=140)
num  = df['Age']; label=df['Country']
num_to_bar(num, label)  

Insert picture description here
Comparison of job data practitioner satisfaction in various countries

# 各国家受访者满意度对比
def trans_jbsatisfy(x):
    if x=='10 - Highly Satisfied':
        y=10 
    elif x=='1 - Highly Dissatisfied':
        y=1
    elif x=='I prefer not to share':
        y= np.NaN
    elif isinstance(x, float): 
        y=x
    elif isinstance(int(x), int):
        y=int(x)
    else : print(x)
    return y
df['JobSatisfaction']=df['JobSatisfaction'].apply(trans_jbsatisfy)
fig = plt.figure(dpi=140)
num  = df['JobSatisfaction']; label=df['Country']
num_to_bar(num, label)   
plt.title("工作满意度最高的是瑞士")

Insert picture description here

Influence of sub-type variables on sub-type variables

Take the influence of countries on preferred programming languages ​​as an example

fig = plt.figure(figsize=(10,6),dpi=140)
plt.subplot(1,2,1)
type_to_bar(df.loc[df['Country']=='China','LanguageRecommendationSelect'])
plt.title('China')
plt.subplot(1,2,2)
type_obj=df.loc[df['Country']=='United States','LanguageRecommendationSelect']
type_to_bar(type_obj)
plt.title('America')

Insert picture description here

Guess you like

Origin blog.csdn.net/weixin_43705953/article/details/111241330