Mount the local yum source:

```
mkdir /mnt/cdrom
mount /dev/cdrom /mnt/cdrom
cd /etc/yum.repos.d/
mkdir bak
mv *.* bak/
mv bak/CentOS-Media.repo .
vi CentOS-Media.repo
```

vi basics: `i` enters edit mode, `:q!` force-quits without saving, `:wq!` saves and exits. Verify the repo with `yum list`.

CentOS-Media.repo:

```
[base]
name=RedHat
baseurl=file:///mnt/cdrom
# Note: baseurl is the directory you mounted the CD on, in this case /mnt/cdrom
enabled=1
# Note: enabled must have the value 1; gpgcheck does not matter, so it can be 0
gpgcheck=0
gpgkey=file:///mnt/cdrom/RPM-GPG-KEY-CentOS-7
# Note: cd /mnt/cdrom/ to see this key; this is just an example
```

Install a friendlier editor and SSH:

```
yum install nano
yum install openssh-server openssh-clients
```

Xshell is a convenient SSH client on the Windows side.

Bring up the network card and set a fixed IP:

```
cd /etc/sysconfig/network-scripts
nano ifcfg-ens33
```

```
ONBOOT=no       =>  ONBOOT=yes
BOOTPROTO=dhcp  =>  BOOTPROTO=static
IPADDR=192.168.211.7     # IP address
GATEWAY=192.168.211.1    # gateway
NETMASK=255.255.255.0    # netmask
```

Restart the network:

```
systemctl restart network
```

Network tools (ifconfig, the ping command):

```
yum install net-tools
```

Upload and download (rz uploads, sz downloads), e.g. to get software onto the server:

```
yum install -y lrzsz
```

Packing and unpacking, plus the current path:

```
tar -zcvf my.tar aaa/      # compress and pack
tar -xvf my.tar -C xx/xxx  # unpack into a directory
pwd                        # show the current path
```

Upload the Java package and configure the Java environment variables, then refresh the cache:

```
nano /etc/profile
export JAVA_HOME=/usr/lib/java/jdk1.8.0_11
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
```

Hadoop configuration lives in /root/bigdata/lib/hadoop-2.9.0/etc/hadoop.

hadoop-env.sh:

```
export JAVA_HOME=/root/bigdata/lib/jdk1.8.0_11
```

core-site.xml:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hdp01:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/hdpdata</value>
</property>
```

hdfs-site.xml (without this it runs as the standalone version):

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50070</value>
</property>
```

mapred-site.xml (copy mapred-site.xml.template to mapred-site.xml first):

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
</property>
```

yarn-site.xml (configures the master):

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hdp01</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```

Cloned machines need their hostname changed:

```
nano /etc/sysconfig/network
hostname hdp01
nano /etc/hosts
```

```
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.211.7 hdp01
192.168.211.8 hdp02
192.168.211.9 hdp03
```

Configure the Hadoop environment variables (pwd gives the directory, /root/bigdata/lib/hadoop-2.9.0):

```
export HADOOP_HOME=/root/bigdata/lib/hadoop-2.9.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
```
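A quick way to confirm the exports took effect; a minimal sketch, assuming /etc/profile has been sourced in the current shell (the printed version strings are whatever your JDK and Hadoop report):

```python
import os
import subprocess

# sanity check: JAVA_HOME should point at the JDK configured above
print(os.environ.get("JAVA_HOME"))

# both commands should resolve via PATH if the exports above are in effect
subprocess.run(["java", "-version"])
subprocess.run(["hadoop", "version"])
```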
Format HDFS and start the cluster:

```
hadoop namenode -format
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
yarn-daemon.sh start resourcemanager
jps    # view the running processes
```

Firewall:

```
systemctl stop firewalld.service     # stop the firewall
systemctl disable firewalld.service  # keep the firewall from starting at boot
firewall-cmd --state                 # view the firewall state ("not running" after closing, "running" after opening)
nano /etc/sysconfig/iptables
service iptables restart/start/stop
```

If the firewall stays on, open ports 8032, 8042, 8088, 9000, 50070, 50010 and 50075; note that the required ports are not uniform across services.

HDFS commands:

```
hadoop fs -put 1.txt /                       # upload a file
hadoop fs -ls /                              # list files
hadoop fs -get /1.txt                        # download a file
hadoop fs -mkdir -p /wordcount/input         # create a directory
hadoop fs -put 1.txt 2.txt /wordcount/input  # upload several files at once
```

Run wordcount (from /home/hadoop/apps/hadoop-2.9.0/share/hadoop/mapreduce):

```
hadoop jar hadoop-mapreduce-examples-2.9.0.jar wordcount /wordcount/input /wordcount/output
hadoop jar wc.jar org.bigdata.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/output
```

For the standalone version: under hadoop.tmp.dir, the namenode and the datanode each keep a VERSION file; change the clusterID in one of them so the two match.

Silence the native-code warning in etc/hadoop/log4j.properties:

```
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
```

Eclipse run parameter to impersonate the hadoop user:

```
-DHADOOP_USER_NAME=hadoop
```

Spark: in conf/, copy spark-env.sh.template to spark-env.sh and set:

```
export JAVA_HOME=/usr/lib/java/jdk1.8.0_11
export SPARK_MASTER_IP=hdp01
export SPARK_MASTER_PORT=7077
```

Workers:

```
cp slaves.template slaves
```

then list the worker hosts in slaves:

```
hdp02
hdp03
```

The Spark binaries live under /home/hadoop/bigdata/spark-2.2.1-bin-hadoop2.7/bin. Environment variables and startup:

```
export JAVA_HOME=/usr/lib/java/jdk1.8.0_11
export HADOOP_HOME=/home/hadoop/bigdata/hadoop-2.9.0
export SPARK_HOME=/home/hadoop/bigdata/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
$SPARK_HOME/sbin/start-all.sh
```

Disable SELinux:

```
nano /etc/selinux/config
SELINUX=disabled
```

All environment variables together (nano /etc/profile):

```
export JAVA_HOME=/root/bigdata/lib/jdk1.8.0_11
export HADOOP_HOME=/root/bigdata/lib/hadoop-2.9.0
export SPARK_HOME=/root/bigdata/lib/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
source /etc/profile
```
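The wordcount job above can also be expressed in a few lines of PySpark once Spark is running. A minimal sketch, assuming the files already uploaded to /wordcount/input and HDFS reachable at hdfs://hdp01:9000 as configured earlier; the output path and app name are made up for illustration:

```python
from pyspark import SparkContext

# wordcount over the same HDFS input used by the hadoop jar example above
sc = SparkContext(appName="wordcount")
counts = (sc.textFile("hdfs://hdp01:9000/wordcount/input")
            .flatMap(lambda line: line.split())   # split each line into words
            .map(lambda word: (word, 1))          # pair every word with a count of 1
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word
counts.saveAsTextFile("hdfs://hdp01:9000/wordcount/output-spark")
sc.stop()
```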
pyspark

RDD operations (a runnable sketch of several of these appears at the end of these notes):

- .map(...) — transformation; transforms each element
- .filter(...) — selects matching elements
- .flatMap(...) — transformation; returns a flattened result, unlike map
- .distinct(...) — transformation; deduplicates, at a high cost
- .sample(...) — random sampling; .sample(False, 0.1, 6666) means without replacement, a 10% sample, and a random seed
- .join(...), .leftOuterJoin(...) — join queries
- .intersection(...) — the matching (equal) parts of two RDDs
- .repartition(...) — use prudently; it changes the partitioning
- .take(n) — the first n elements
- .collect(...) — returns everything; use with caution
- .reduce(...) — aggregates the data with the given function
- .count(...) — counts the elements
- .saveAsTextFile(...) — saves to a file
- .foreach(...) — walks the elements one by one, e.g. for saving data to a database
- Caching is off by default; enable it with .cache()

DataFrame:

```
json = spark.read.json(...)           # read JSON-formatted data
json.createOrReplaceTempView("json")  # create a temporary table
```

DataFrame API query: .show(n) shows the first n rows; spark.sql(...).collect() queries with a SQL statement. .collect(), .show(n) and .take(n) can all be used here, but prefer .show(n) or .take(n) over .collect() whenever the data is not small.

Specifying a schema: .printSchema() shows the field types (the schema definition).

```
from pyspark.sql.types import *
stringCSVRDD = sc.parallelize([(123, 'Katie', 19, 'brown')])
schema = StructType([StructField("id", LongType(), True)])
spark.createDataFrame(stringCSVRDD, schema)
```

StructType and StructField define the types, but in most cases we do not need a specified schema; the default schema inference works.

DataFrame API queries: .collect(), .show(), .take(), .count(), and .filter() for filter clauses. Examples:

```
csv.select("id", "age").filter("age = 22").show()
csv.select(csv.id, csv.age, csv.name).filter(csv.age == 22).show()
csv.select("name", "eyeColor", "age").filter("eyeColor like 'b%'").show()
```

Querying with SQL via spark.sql(). Examples:

```
spark.sql("select count(0) from csv").show()
spark.sql("select id, age from csv where age = 22").show()
spark.sql("select id, age, name, eyeColor from csv where eyeColor like 'b%'").show()
```

Reading files: sequence data can be read, split on any separator, and cached:

```
spark.read.csv(airportFilePath, header='true', inferSchema='true', sep='\t').cache()
```

Checking the data before modeling — duplicates: compare df.count() with df.distinct().count().

```
df.dropDuplicates()  # remove identical rows
df.select([c for c in df.columns if c != 'id']).distinct().count()  # duplicate check ignoring one column
df.dropDuplicates(subset=[c for c in df.columns if c != 'id'])      # remove duplicates ignoring one column
```

Finding identical data with aggregation:

```
import pyspark.sql.functions as fn
df.agg(
    fn.count('id').alias('count'),
    fn.countDistinct('id').alias('distinct')
).show()
```

Reassigning new IDs:

```
df.withColumn('new_id', fn.monotonically_increasing_id()).show()
```

Know your data: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types

Descriptive statistics: import pyspark.sql.types as typ; the aggregation functions include avg(), count(), countDistinct(), first(), kurtosis(), max(), mean(), min(), skewness(), stddev(), stddev_pop(), stddev_samp(), sum(), sumDistinct(), var_pop(), var_samp() and variance().

Visualization: import matplotlib.pyplot as plt. Data at this order of magnitude cannot be displayed directly; aggregate it first (see the second sketch below).
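To make the RDD operations listed above concrete, here is a minimal runnable sketch; the numbers, keys and the app name are toy values made up purely for illustration:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-ops-demo")

# toy data to exercise the operations from the list above
nums = sc.parallelize([1, 2, 2, 3, 4, 5, 6])

print(nums.filter(lambda x: x % 2 == 0).take(3))  # .filter + .take(n)
print(nums.map(lambda x: x * x).collect())        # .map: square each element
print(nums.distinct().count())                    # .distinct + .count: 6 unique values
print(nums.sample(False, 0.5, 6666).collect())    # no replacement, 50% sample, fixed seed
print(nums.reduce(lambda a, b: a + b))            # .reduce: sum of all elements -> 23

# .join / .leftOuterJoin need key-value RDDs; .intersection takes a second RDD
left = sc.parallelize([("a", 1), ("b", 2)])
right = sc.parallelize([("a", 10), ("c", 30)])
print(left.join(right).collect())                 # [('a', (1, 10))]
print(left.leftOuterJoin(right).collect())        # 'b' is kept, paired with None
print(nums.intersection(sc.parallelize([2, 3, 99])).collect())  # 2 and 3

sc.stop()
```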
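And a short sketch of the duplicate checks and the aggregate-before-plotting advice above, using a hypothetical four-row frame; toPandas() assumes pandas is installed on the driver:

```python
import pyspark.sql.functions as fn
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-checks").getOrCreate()

# hypothetical frame containing one fully duplicated row
df = spark.createDataFrame(
    [(1, 'Katie', 19), (2, 'Mike', 22), (2, 'Mike', 22), (3, 'Ann', 22)],
    ['id', 'name', 'age'])

print(df.count(), df.distinct().count())  # 4 vs 3: a full-row duplicate exists
df = df.dropDuplicates()                  # remove the identical row

# count vs countDistinct on the id column, as in the notes above
df.agg(fn.count('id').alias('count'),
       fn.countDistinct('id').alias('distinct')).show()

# aggregate on the cluster first, then plot the small result locally
hist = df.groupBy('age').count().toPandas()
plt.bar(hist['age'], hist['count'])
plt.xlabel('age')
plt.ylabel('rows')
plt.show()
```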