Spark DataFrame -> RDD

1. Converting a DataFrame to an RDD

The DataFrame here holds data read from a relational database, in tabular form.

# df.rdd is already an RDD of Row objects; collect() + sc.parallelize() is an unnecessary round trip through the driver
rdd = df.rdd

2. Counting distinct values per group: groupBy on a DataFrame, then distinct(), then count()

from pyspark.sql.functions import countDistinct
df.groupBy("a", "b").agg(countDistinct("some_column")).collect()  # "some_column" is a placeholder column name
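
To make clear what the groupBy + countDistinct call above computes, here is a minimal pure-Python sketch of the same aggregation on a small in-memory list. It is only an illustration of the semantics, not how Spark implements it; the function name count_distinct_per_group, the sample rows, and the column names a, b, c are all made up for this example.

```python
from collections import defaultdict


def count_distinct_per_group(rows, group_keys, value_key):
    """For each combination of group_keys, count the distinct non-null values of value_key."""
    seen = defaultdict(set)
    for row in rows:
        group = tuple(row[k] for k in group_keys)
        bucket = seen[group]  # create the group even if all of its values are None
        value = row[value_key]
        if value is not None:  # like SQL COUNT(DISTINCT ...), nulls are not counted
            bucket.add(value)
    return {group: len(values) for group, values in seen.items()}


# Hypothetical sample data standing in for the DataFrame rows
rows = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 1, "b": "x", "c": 10},   # duplicate value inside the same group
    {"a": 1, "b": "x", "c": 20},
    {"a": 2, "b": "y", "c": None},  # null value: the group exists, but the count stays 0
]

result = count_distinct_per_group(rows, ["a", "b"], "c")
print(result)
```

Each key of the result corresponds to one (a, b) group, which is the same shape as the rows returned by collect() in the Spark version.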

3. A puzzling problem: installing Spark on Ubuntu went smoothly, but the final startup step failed with an error

root@chenge:/usr/local/spark/sbin# ./start-all.sh 
hostname: Name or service not known
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-chenge.out
failed to launch: nice -n 0 /usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host --port 7077 --webui-port 8080
      at org.apache.spark.deploy.master.MasterArguments.<init>(MasterArguments.scala:30)
      at org.apache.spark.deploy.master.Master$.main(Master.scala:1049)
      at org.apache.spark.deploy.master.Master.main(Master.scala)
  Caused by: java.net.UnknownHostException: chenge: Name or service not known
      at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
      at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
      at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
      at java.net.InetAddress.getLocalHost(InetAddress.java:1500)
      ... 10 more
  2018-08-17 14:56:11 INFO  ShutdownHookManager:54 - Shutdown hook called
full log in /usr/local/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-chenge.out
hostname: Name or service not known

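In the end I found the cause.

The hostname chenge (as in root@chenge) cannot be resolved, so I changed the hostname:
hostname ubuntu
Then open a new terminal, and Spark starts normally. Note that a hostname set this way only lasts until the next reboot.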
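
Before running start-all.sh, a quick check can show whether the machine's own hostname resolves. This is roughly the kind of lookup that fails in the stack trace above (InetAddress.getLocalHost). The following is a minimal sketch using only the Python standard library; the function name check_local_hostname and its resolve/get_name parameters are hypothetical helpers for this example, and Python's IPv4-only lookup is not guaranteed to behave identically to the Java one.

```python
import socket


def check_local_hostname(resolve=socket.gethostbyname, get_name=socket.gethostname):
    """Return (hostname, ip) if the local hostname resolves, otherwise (hostname, None)."""
    name = get_name()
    try:
        return name, resolve(name)
    except OSError:  # socket.gaierror and socket.herror are subclasses of OSError
        return name, None


if __name__ == "__main__":
    name, ip = check_local_hostname()
    if ip is None:
        print(f"hostname {name!r} does not resolve; Spark's master will likely fail to start")
    else:
        print(f"hostname {name!r} resolves to {ip}")
```

If the check reports that the name does not resolve, changing the hostname as described above (or making the current name resolvable) should be done before starting Spark.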


Reposted from www.cnblogs.com/deepvoice/p/9485631.html