Offline data cleaning: comparing Spark and Python Pandas

Introduction

I recently learned the core RDD operations in Spark. To consolidate what I learned, I wrote a data cleaning job with Spark. I had previously cleaned the same data with pandas in Python, so I am posting both versions here and making a simple comparison.

Sample data

豆瓣图书标签: 小说,[日] 东野圭吾 / 李盈春 / 南海出版公司 / 2014-5 / 39.50元,解忧杂货店,8.6,(297210人评价)
豆瓣图书标签: 文学,[哥伦比亚] 加西亚·马尔克斯 / 范晔 / 南海出版公司 / 2011-6 / 39.50元,百年孤独,9.2,(138353人评价)
豆瓣图书标签: 小说,[英] 肯·福莱特 / 于大卫 / 江苏凤凰文艺出版社 / 2016-5-1 / 129.80元,巨人的陨落,8.9,(39014人评价)
豆瓣图书标签: 小说,亦舒 / 新世界出版社 / 2007-8 / 22.00元,我的前半生,7.9,(22722人评价)
豆瓣图书标签: 小说,林奕含 / 北京联合出版公司 / 2018-1 / 45.00元,房思琪的初恋乐园,9.2,(23870人评价)
豆瓣图书标签: 小说,[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元,追风筝的人,8.9,(325801人评价)
豆瓣图书标签: 小说,[哥伦比亚] 加西亚·马尔克斯 / 范晔 / 南海出版公司 / 2011-6 / 39.50元,百年孤独,9.2,(138253人评价)
豆瓣图书标签: 小说,[哥伦比亚] 加西亚·马尔克斯 / 杨玲 / 南海出版公司 / 2012-9-1 / 39.50元,霍乱时期的爱情,9.0,(76618人评价)
豆瓣图书标签: 小说,[意] 埃莱娜·费兰特 / 陈英 / 人民文学出版社 / 2017-4 / 59.00元,新名字的故事,9.0,(8257人评价)
豆瓣图书标签: 小说,[俄] 维克托·阿斯塔菲耶夫 / 夏仲翼 等 / 理想国 | 广西师范大学出版社 / 2017-4 / 78.00元,鱼王,9.0,(1915人评价)
豆瓣图书标签: 小说,[美] 戴维·伽特森 / 熊裕 / 全本书店|作家出版社 / 2017-6 / 52.00元,雪落香杉树,8.4,(3397人评价)
豆瓣图书标签: 小说,[英] 毛姆 / 傅惟慈 / 上海译文出版社 / 2006-8 / 15.00元,月亮和六便士,9.0,(82482人评价)
豆瓣图书标签: 小说,[英] 肯·福莱特 / 陈杰 / 江苏凤凰文艺出版社 / 2017-3-1 / 132.00(全三册),世界的凛冬,8.9,(12271人评价)
豆瓣图书标签: 小说,余华 / 南海出版公司 / 1998-5 / 12.00元,活着,9.1,(153115人评价)
豆瓣图书标签: 小说,[英] 乔治·奥威尔 / 刘绍铭 / 北京十月文艺出版社 / 2010-4-1 / 28.00,1984,9.3,(49985人评价)
豆瓣图书标签: 小说,[美] 哈珀·李 / 高红梅 / 译林出版社 / 2012-9 / 32.00元,杀死一只知更鸟,9.2,(21417人评价)
豆瓣图书标签: 小说,钱锺书 / 人民文学出版社 / 1991-2 / 19.00,围城,8.9,(204184人评价)
...............

The above is book information crawled from Douban Reading and saved as a CSV file with the fields book tag, book info, book title, rating, and number of ratings. Because the crawl was rough, many of the fields are messy and the data cannot be analyzed directly. It needs to be cleaned and converted into structured data with fields such as book tag, author, publication year, price, book title, rating, and number of ratings.

Python Pandas code

import pandas

if __name__ == '__main__':
    douban = pandas.read_csv('douban.csv', names=['tag', 'info', 'name', 'star', 'people'])
    douban = douban.drop_duplicates().reset_index(drop=True)
    infos = douban['info'].str.split('/')
    authors = []    # author names
    date = []       # raw publication date strings
    money = []      # prices
    country = []    # author countries
    error = []      # indices of rows that cannot be parsed
    for num in range(len(infos)):
        if len(infos[num]) >= 3:
            author_info = infos[num][0].strip()
            if len(author_info) > 3:
                # A long first field usually means a foreign author prefixed with a
                # bracketed country, e.g. "[日] 东野圭吾". Handle half-width parentheses,
                # square brackets, full-width parentheses, and full-width brackets in turn.
                if author_info.startswith('(') and ')' in author_info:
                    authors.append(author_info.split(')')[1])
                    country.append(author_info.split(")")[0].split("(")[1])
                    money.append(infos[num][-1].strip())
                    date.append(infos[num][-2].strip())
                elif author_info.startswith('[') and ']' in author_info:
                    authors.append(author_info.split(']')[1])
                    country.append(author_info.split("]")[0].split("[")[1])
                    money.append(infos[num][-1].strip())
                    date.append(infos[num][-2].strip())
                elif author_info.startswith('（') and '）' in author_info:
                    authors.append(author_info.split('）')[1])
                    country.append(author_info.split("）")[0].split("（")[1])
                    money.append(infos[num][-1].strip())
                    date.append(infos[num][-2].strip())
                elif author_info.startswith('【') and '】' in author_info:
                    authors.append(author_info.split('】')[1])
                    country.append(author_info.split("】")[0].split("【")[1])
                    money.append(infos[num][-1].strip())
                    date.append(infos[num][-2].strip())
                else:
                    error.append(num)
            else:
                # A short first field is taken as a Chinese author with no country prefix
                country.append('中')
                authors.append(author_info)
                money.append(infos[num][-1].strip())
                date.append(infos[num][-2].strip())
        else:
            error.append(num)
    # Authors tagged with a dynasty name or 台/台湾 are mapped to 中 (China)
    gudai = '唐宋元明清台台湾'
    country = ["中" if x in gudai else x for x in country]
    douban = douban.drop(index=error).reset_index(drop=True)
    douban['author'] = authors
    douban['money'] = money
    douban['country'] = country
    # Extract the publication year; rows with an unparseable date get year 0
    years = []
    for one in date:
        try:
            years.append(int(one.split("-")[0]))
        except ValueError:
            years.append(0)
    douban['year'] = years
    douban = douban[douban['year'] > 1800].reset_index(drop=True)
    # Keep only the number of ratings, e.g. "(297210人评价)" -> "297210",
    # and the tag text after the colon
    douban['people'] = douban['people'].str.split('(').str[1].str.split("人").str[0]
    douban['tag'] = douban['tag'].str.split(':').str[1].str.strip()
    douban = douban.drop('info', axis=1)
    douban.to_csv("douban2.csv", index=False)

Because I am fairly familiar with Pandas, it took me about 65 lines of code, and with some optimization it could be cut to fewer than 50 lines (see the vectorized sketch after the sample output below). The following is the result of the data cleaning:

小说,解忧杂货店,8.6,297210, 东野圭吾,39.50元,日,2014
文学,百年孤独,9.2,138353, 加西亚·马尔克斯,39.50元,哥伦比亚,2011
小说,巨人的陨落,8.9,39014, 肯·福莱特,129.80元,英,2016
小说,我的前半生,7.9,22722,亦舒,22.00元,中,2007
小说,房思琪的初恋乐园,9.2,23870,林奕含,45.00元,中,2018
小说,追风筝的人,8.9,325801, 卡勒德·胡赛尼,29.00元,美,2006
小说,百年孤独,9.2,138253, 加西亚·马尔克斯,39.50元,哥伦比亚,2011
小说,霍乱时期的爱情,9.0,76618, 加西亚·马尔克斯,39.50元,哥伦比亚,2012
小说,新名字的故事,9.0,8257, 埃莱娜·费兰特,59.00元,意,2017
小说,鱼王,9.0,1915, 维克托·阿斯塔菲耶夫,78.00元,俄,2017
小说,雪落香杉树,8.4,3397, 戴维·伽特森,52.00元,美,2017
小说,月亮和六便士,9.0,82482, 毛姆,15.00元,英,2006
小说,世界的凛冬,8.9,12271, 肯·福莱特,132.00(全三册),英,2017
小说,活着,9.1,153115,余华,12.00元,中,1998
小说,1984,9.3,49985, 乔治·奥威尔,28.00,英,2010
小说,杀死一只知更鸟,9.2,21417, 哈珀·李,32.00元,美,2012
小说,围城,8.9,204184,钱锺书,19.00,中,1991
小说,斯通纳,8.8,17304, 约翰·威廉斯,39.00元,美,2016
小说,囚鸟,8.0,1982, 库尔特·冯内古特,38.00元,美,2017
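
As a rough illustration of how the line count could come down, here is a minimal vectorized sketch that uses Series.str.extract and regular expressions instead of the explicit loop. It is only a sketch under simplifying assumptions, not the original code: the regex patterns are illustrative, and the error-row handling of the loop version is not reproduced exactly.

import pandas as pd

# A hypothetical, more vectorized version of the cleaning above.
douban = pd.read_csv('douban.csv', names=['tag', 'info', 'name', 'star', 'people'])
douban = douban.drop_duplicates().reset_index(drop=True)

info = douban['info'].str.split('/')
first = info.str[0].str.strip()

# Country is the text inside any style of brackets; no brackets means a Chinese author.
douban['country'] = first.str.extract(r'[\[(（【]([^\])）】]+)[\])）】]')[0].fillna('中')
douban['author'] = first.str.replace(r'[\[(（【][^\])）】]+[\])）】]', '', regex=True).str.strip()
douban['country'] = douban['country'].map(lambda c: '中' if c in '唐宋元明清台台湾' else c)

# Price is the last "/"-separated field; the publication year comes from the second to last.
douban['money'] = info.str[-1].str.strip()
douban['year'] = pd.to_numeric(info.str[-2].str.strip().str.split('-').str[0],
                               errors='coerce').fillna(0).astype(int)

# Number of ratings and tag, as in the loop version.
douban['people'] = douban['people'].str.extract(r'(\d+)人')[0]
douban['tag'] = douban['tag'].str.split(':').str[1].str.strip()

douban = douban[douban['year'] > 1800].drop(columns='info').reset_index(drop=True)
douban.to_csv('douban2.csv', index=False)

The point is only that the bracket parsing can be expressed as vectorized string operations; the loop version above remains the reference implementation.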

Spark code

import org.apache.spark.{SparkConf, SparkContext}

object douban {
  // Debugging helper for printing an Array[String]; not used in the final job
  def priArray(s: Array[String]): Unit = {
    for (x <- s) {
      print(x)
      print(" ")
    }
    println()
  }
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[1]").setAppName("douban")
    val sc = new SparkContext(conf)
    // Attach a line index to every record so the cleaned fields can be joined back later
    val douban = sc.textFile("douban.csv").zipWithIndex()
    // Keep only the book-info field plus the index, and strip all spaces
    val info = douban.map(x => x._1.split(",")(1) + "/" + x._2.toString).map(removeSpace)
    // Separate records whose author carries a bracketed country prefix from those without one
    val info_country = info.filter(x => x.contains("[") || x.contains("(") || x.contains("（") || x.contains("【"))
    val info_no_country = info.subtract(info_country).map(_.split("/")).filter(x => x.length >= 4 && x(0).length <= 3)
    val info_country2 = info_country.map(_.split("/")).filter(x => x.length >= 3)
    // Split out the country for foreign authors, or add "中" for Chinese authors
    val split_Country = info_country2.map(splitCountry).filter(_.length > 1).map(changeCountry)
    val add_Country = info_no_country.map(addCountry)
    // Normalize the year, drop unparseable rows, and key the cleaned fields by the original line index
    val result = split_Country.union(add_Country).map(changeDate).filter(x => x(2).toInt > 0).map(x => (x(x.length - 1).toLong, x.toList.dropRight(1).mkString(",")))
    // Join back to the original lines to recover the tag, title, rating and rating count
    val douban2 = douban.map(x => (x._2, x._1))
    val result2 = result.leftOuterJoin(douban2).map(_._2).map(x => x._1 + "," + x._2.mkString)
    val result3 = result2.map(tagAndPeople)
    result3.saveAsTextFile("douban")
  }

  // Replace the raw tag field with the text after the colon, and the raw rating-count
  // field with the plain number, e.g. "(297210人评价)" -> "297210"
  def tagAndPeople(s: String): String = {
    val resu = s.split(",")
    val tag = resu(4).split("\\:")(1).trim
    var peopleNum = "0"
    try {
      peopleNum = resu(resu.length - 1).split("人")(0).split("\\(")(1)
    } catch {
      case ex: Exception =>
    }
    resu(4) = tag
    resu(resu.length - 1) = peopleNum
    resu.mkString(",")
  }

  // Keep only the publication year; unparseable dates become "0" and are filtered out later
  def changeDate(s: Array[String]): Array[String] = {
    var year = 0
    try {
      year = s(2).split("-")(0).toInt
    } catch {
      case ex: Exception =>
    }
    s(2) = year.toString
    s
  }

  // Normalize country names: dynasty names and 台/台湾 map to 中 (China),
  // and common full country names are shortened to a single character
  def changeCountry(l: Array[String]): Array[String] = {
    val gudai = "唐宋元明清台台湾"
    var subString = ""
    l(0) match {
      case "美国" => subString = "美"
      case "日本" => subString = "日"
      case "英国" => subString = "英"
      case "俄罗斯" => subString = "俄"
      case "葡萄牙" => subString = "葡"
      case "冰岛" => subString = "冰"
      case other => subString = if (gudai.contains(other)) "中" else other
    }
    if (!subString.equals("")) {
      l(0) = subString
    }
    l
  }

  // Records without a bracketed prefix are treated as Chinese authors
  def addCountry(s: Array[String]): Array[String] = {
    val author = s(0)
    val country = "中"
    val money = s(s.length - 2)
    val date = s(s.length - 3)
    val index = s(s.length - 1)
    Array(country, author, date, money, index)
  }

  // Split the bracketed country prefix away from the author name. Records that match
  // no bracket style are reduced to a single element and filtered out by the caller.
  def splitCountry(s: Array[String]): Array[String] = {
    var author = ""
    var country = ""
    var money = ""
    var date = ""
    var index = ""
    if (s(0).startsWith("(") && s(0).contains(")")) {
      author = s(0).split("\\)")(1)
      country = s(0).split("\\)")(0).split("\\(")(1)
      money = s(s.length - 2)
      date = s(s.length - 3)
      index = s(s.length - 1)
    } else if (s(0).startsWith("[") && s(0).contains("]")) {
      author = s(0).split(']')(1)
      country = s(0).split(']')(0).split('[')(1)
      money = s(s.length - 2)
      date = s(s.length - 3)
      index = s(s.length - 1)
    } else if (s(0).startsWith("（") && s(0).contains("）")) {
      author = s(0).split('）')(1)
      country = s(0).split("）")(0).split("（")(1)
      money = s(s.length - 2)
      date = s(s.length - 3)
      index = s(s.length - 1)
    } else if (s(0).startsWith("【") && s(0).contains("】")) {
      author = s(0).split("】")(1)
      country = s(0).split("】")(0).split("【")(1)
      money = s(s.length - 2)
      date = s(s.length - 3)
      index = s(s.length - 1)
    } else {
      return Array("None")
    }
    Array(country, author, date, money, index)
  }

  // Remove all spaces from the "/"-separated info string
  def removeSpace(s: String): String = {
    s.split("/").map(_.replace(" ", "")).mkString("/")
  }
}

I used 131 lines for the Spark version, roughly twice the Pandas version. My limited Spark experience is probably part of the reason, and of course Spark's strength is not offline data cleaning in the first place but real-time cleaning with Spark Streaming. Below is the result of the Spark cleaning:

散文,中,周耀辉,2013,32.00元,7749,7.4,699
随笔,中,陈丹青,2009,39,荒废集,8.1,9600
小说,英,劳伦斯,2004,24.00元,查特莱夫人的情人,7.6,5991
诗歌,法,伊夫·博纳富瓦,2017,52.00元,杜弗的动与静,8.7,108
小说,日,东野圭吾,2008,29.80元,白夜行,9.1,219833
文学,英,亨利·希金斯,2012,35.00元,真的不用读完一本书,6.8,226
文学,中,黄德海,2017,38.00元,书到今生读已迟,9.1,34
随笔,日,金子由纪子,2015,25.00元,不被理想束缚的生活,7,296
文学,西班牙,圣地亚哥·帕哈雷斯,2017,46,螺旋之谜,7.9,174
随笔,日,堺雅人,2014,30.00元,文·堺雅人:憧憬的日子,8.1,1728
小说,中,王小波,2006,19.80元,万寿寺,8.6,5761
小说,中,都梁,2005,28.00元,亮剑,8.9,13639
散文,日,永井荷风,2012,20.00元,晴日木屐,7.8,117
随笔,中,苗炜,2015,CNY36.00,面包会有的,7.9,1005

As you can see, apart from the order of the fields, the result is the same as the Pandas output. A quick way to check this is sketched below.
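
As a rough, hypothetical check that is not part of the original post, both outputs could be loaded and compared after aligning the columns. The Spark column order below is an assumption inferred from the sample rows above, and the Spark output is assumed to sit in the local douban/ directory written by saveAsTextFile.

import glob
import pandas as pd

# Hypothetical column order of the Spark part files, inferred from the rows above
spark_cols = ['tag', 'country', 'author', 'year', 'money', 'name', 'star', 'people']

pandas_out = pd.read_csv('douban2.csv')  # header row was written by to_csv above
spark_out = pd.concat(
    (pd.read_csv(p, names=spark_cols) for p in sorted(glob.glob('douban/part-*'))),
    ignore_index=True)

# Put the Spark columns in the Pandas order, then compare the shapes and a sample
spark_out = spark_out[list(pandas_out.columns)]
print(pandas_out.shape, spark_out.shape)
print(spark_out.head())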

Summary

In Spark, working with functions such as map, filter, and reduce feels different from a traditional, sequential programming flow. Scenarios such as batch, distributed, and stream processing push us to use more functions to manipulate data. In the Spark cleaning code above, the 131 lines include 6 field-processing functions. A small sketch of this style follows.
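
To make the style concrete, here is a small, self-contained illustration of chaining map, filter, and reduce on an RDD. It is not part of the cleaning job above, and it uses the PySpark RDD API rather than Scala purely for brevity; the two sample records are taken from the data shown earlier.

from pyspark import SparkConf, SparkContext

# Minimal map/filter/reduce demonstration on an RDD (illustration only)
conf = SparkConf().setMaster("local[1]").setAppName("map_filter_demo")
sc = SparkContext(conf=conf)

lines = sc.parallelize([
    "豆瓣图书标签: 小说,余华 / 南海出版公司 / 1998-5 / 12.00元,活着,9.1,(153115人评价)",
    "豆瓣图书标签: 小说,钱锺书 / 人民文学出版社 / 1991-2 / 19.00,围城,8.9,(204184人评价)",
])

high = (lines
        .map(lambda line: float(line.split(",")[-2]))  # map: pull out the rating field
        .filter(lambda star: star > 9.0))              # filter: keep only highly rated books
total = high.reduce(lambda a, b: a + b)                # reduce: combine the remaining ratings
print(high.count(), total)

sc.stop()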

Due to time constraints I have not added detailed comments to the code. If you need a more detailed walkthrough, feel free to message me privately.

Originally published at blog.csdn.net/mrliqifeng/article/details/82107737