Big Data: from getting started to giving up

Mount the local yum repository
    mkdir /media/cdrom/
    mount /dev/cdrom /media/cdrom/
    cd /etc/yum.repos.d/
    mkdir bak
    mv *.* bak/
    cd bak/
    mv CentOS-Media.repo ../
    cd ../
    vi CentOS-Media.repo
    vi basics:
    i       enter insert (edit) mode
    :q!     force quit without saving
    :wq     save and exit
    yum list

[base]
name=RedHat
baseurl=file:///mnt/cdrom                       # Note: baseurl is the directory you mounted, in this case /mnt/cdrom
enabled=1                                       # Note: enabled must be set to 1; gpgcheck does not matter
gpgcheck=0
gpgkey=file:///mnt/cdrom/RPM-GPG-KEY-CentOS-7   # Note: cd /mnt/cdrom/ to see this key; this is just an example
Install a new editor
yum install nano 

Install SSH
yum install openssh-server openssh-clients

SSH into the Linux box with Xshell


Enable the network card
cd /etc/sysconfig/network-scripts
nano ifcfg-ens33
ONBOOT=no  =>  ONBOOT=yes

Configure a static IP
BOOTPROTO=dhcp  =>  BOOTPROTO=static
IPADDR=192.168.211.7      # IP address
GATEWAY=192.168.211.1     # gateway
NETMASK=255.255.255.0     # netmask

restart the network service 
systemctl restart network 

Network tools
yum install net-tools
for the ifconfig and ping commands

upload and download 
yum install -y lrzsz
rz upload 
sz download 


upload software to the server 
tar -zcvf my.tar aaa/          compress and pack
tar -xvf my.tar -C xx/xxx      unpack

pwd    display the current path
upload the java package

Configure the java environment variables
export JAVA_HOME=/usr/lib/java/jdk1.8.0_11
export PATH=$PATH:$JAVA_HOME/bin
nano /etc/profile
refresh the cache
source /etc/profile

/root/bigdata/lib/hadoop-2.9.0/etc/hadoop


hadoop-env.sh
export JAVA_HOME=/root/bigdata/lib/jdk1.8.0_11

core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://hdp01:9000</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hdpdata</value>
</property>


hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>2</value>
</property>

<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50070</value>
</property>

If this is not configured, MapReduce runs in local (standalone) mode
mapred-site.xml.template (copy it to mapred-site.xml before editing)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
</property>
Configure the master (ResourceManager)
yarn-site.xml

<property>
<name>yarn.resourcemanager.hostname</name>
<value>hdp01</value>
</property>

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>


On cloned machines, change the hostname
nano /etc/sysconfig/network
hostname hdp01

nano /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.211.7 hdp01
192.168.211.8 hdp02
192.168.211.9 hdp03


Configure the hadoop environment variables (use pwd to obtain the directory)
/root/bigdata/lib/hadoop-2.9.0

export HADOOP_HOME=/root/bigdata/lib/hadoop-2.9.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
Format hadoop
hadoop namenode -format
Start the cluster
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
yarn-daemon.sh start resourcemanager

jps    view the running processes


systemctl stop firewalld.service        # stop the firewall
systemctl disable firewalld.service     # prevent the firewall from starting at boot
firewall-cmd --state                    # view the firewall state ("not running" when stopped, "running" when started)

iptables firewall
nano /etc/sysconfig/iptables
service iptables restart / start / stop

8032     ResourceManager RPC
8042     NodeManager web UI
8088     ResourceManager web UI
9000     NameNode RPC (fs.defaultFS)
50070    NameNode web UI
50010    DataNode data transfer
50075    DataNode web UI



Pay attention to the firewall ports
permissions are not uniform across users

hadoop fs -put 1.txt /                       upload a file
hadoop fs -ls /                              list files
hadoop fs -get /1.txt                        download a file
hadoop fs -mkdir -p /wordcount/input         create a directory
hadoop fs -put 1.txt 2.txt /wordcount/input  upload multiple files

wordcount
/home/hadoop/apps/hadoop-2.9.0/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.9.0.jar wordcount /wordcount/input /wordcount/output
hadoop jar wc.jar org.bigdata.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/output
Single-node version
If the namenode and datanode clusterIDs do not match, find their VERSION files under hadoop.tmp.dir and change the clusterID in one of them so that the two are the same.
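A small illustrative Python sketch of that fix (hypothetical, not from the original notes; it assumes the default dfs/name/current/VERSION and dfs/data/current/VERSION layout under hadoop.tmp.dir):

# Hypothetical helper: copy the namenode clusterID into the datanode VERSION file
# after a re-format, assuming the default dfs/name and dfs/data layout under hadoop.tmp.dir.
import re

hdp_data = "/home/hadoop/hdpdata"                       # hadoop.tmp.dir from core-site.xml
name_version = hdp_data + "/dfs/name/current/VERSION"   # written by the namenode format
data_version = hdp_data + "/dfs/data/current/VERSION"   # left over on the datanode

# read the clusterID produced by the latest namenode format
with open(name_version) as f:
    cluster_id = re.search(r"clusterID=(\S+)", f.read()).group(1)

# rewrite the datanode VERSION file so its clusterID matches
with open(data_version) as f:
    text = f.read()
with open(data_version, "w") as f:
    f.write(re.sub(r"clusterID=\S+", "clusterID=" + cluster_id, text))

print("datanode clusterID set to", cluster_id)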


etc/hadoop/log4j.properties (under $HADOOP_HOME; silences the native library warning)
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR



Eclipse: impersonate the hadoop user with a VM argument
-DHADOOP_USER_NAME=hadoop


Enter the conf/ directory
spark-env.sh.template (copy it to spark-env.sh before editing)

export JAVA_HOME=/usr/lib/java/jdk1.8.0_11
Export SPARK_MASTER_IP = hdp01 
Export SPARK_MASTER_PORT = 7077 

slaves.template 
cp slaves.template slaves
list the worker hosts
hdp02 
hdp03

/home/hadoop/bigdata/spark-2.2.1-bin-hadoop2.7/bin


export JAVA_HOME=/usr/lib/java/jdk1.8.0_11
export HADOOP_HOME=/home/hadoop/bigdata/hadoop-2.9.0
export SPARK_HOME=/home/hadoop/bigdata/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
$SPARK_HOME/sbin/start-all.sh
nano /etc/selinux/config  
SELINUX=disabled  

All environment variables
nano /etc/profile

export JAVA_HOME=/root/bigdata/lib/jdk1.8.0_11
export HADOOP_HOME=/root/bigdata/lib/hadoop-2.9.0
export SPARK_HOME=/root/bigdata/lib/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin

source /etc/profile
 

pyspark

RDD transformations and actions:
.map(...)            transformation, applied to each element
.filter(...)         filter / query
.flatMap(...)        transformation; unlike map it returns a flattened result
.distinct(...)       transformation, removes duplicates; expensive
.sample(False, 0.1, 6666)    random sampling: with-replacement flag, 10% sample fraction, random seed
.join(...) / .leftOuterJoin(...)    join queries
.intersection(...)   returns the elements present in both RDDs
.repartition(...)    use with caution, it changes the partitioning
.take(n)             take the first n elements
.collect()           returns everything, use with caution
.reduce(...)         aggregates the elements with the given function
.count()             counts the number of elements
.saveAsTextFile(...) save to a file
.foreach(...)        iterates over the elements one by one, useful for saving data to a database
Caching is off by default; call .cache()

DataFrame:
json = spark.read.json(....)             read JSON-formatted data
json.createOrReplaceTempView("json")     create a temporary table (view)

DataFrame API queries:
.show(n)             show the first n rows
spark.sql(...).collect()    query with an SQL statement; .collect(), .show(n) and .take(n) all work here,
                            but avoid .collect() when the data is not small

Specifying a schema:
.printSchema()       view the field types (schema definition)
from pyspark.sql.types import *
stringCSVRDD = sc.parallelize([(123, 'Katie', 19, 'Brown')])
schema = StructType([StructField("id", LongType(), True)])
spark.createDataFrame(stringCSVRDD, schema)
StructType and StructField define the types; in most cases there is no need to specify a schema, the default schema inference is enough.

Querying with the DataFrame API: .collect() .show() .take() .count()
.filter()    filter clause, for example:
csv.select("id","age").filter("age=22").show()
csv.select(csv.id,csv.age,csv.name).filter(csv.age==22).show()
csv.select("name","eyeColor","age").filter("eyeColor like 'b%'").show()

Querying with SQL: spark.sql(), for example:
spark.sql("select count(0) from csv").show()
spark.sql("select id, age from csv where age = 22").show()
spark.sql("select id, age, name, eyeColor from csv where eyeColor like 'b%'").show()

Reading file data (any separator can be used):
spark.read.csv(airportFilePath, header='true', inferSchema='true', sep='\t').cache()    caches the data

Data modeling - checking for duplicates:
compare df.count() with df.distinct().count()
df.dropDuplicates()                                                      remove identical rows
df.select([c for c in df.columns if c != 'id']).distinct().count()      duplicate check ignoring one column
df.dropDuplicates(subset=[c for c in df.columns if c != 'id'])           remove duplicates ignoring one column

Finding identical data:
import pyspark.sql.functions as fn
df.agg(fn.count('id').alias('count'), fn.countDistinct('id').alias('distinct')).show()
Reassign new IDs:
df.withColumn('new_id', fn.monotonically_increasing_id()).show()

Know your data: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types

Descriptive statistics:
import pyspark.sql.types as typ
aggregation functions: avg(), count(), countDistinct(), first(), kurtosis(), max(), mean(), min(), skewness(), stddev(), stddev_pop(), stddev_samp(), sum(), sumDistinct(), var_pop(), var_samp() and variance()

Visualization:
import matplotlib.pyplot as plt
Once the data reaches a certain size it cannot be plotted directly; aggregate it first.
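A minimal runnable sketch tying several of these calls together (hypothetical, not from the original notes; it assumes a local SparkSession, and the sample data and column names are made up for illustration):

# Hypothetical demo: a local SparkSession, a few RDD transformations/actions,
# a DataFrame with an explicit schema, DataFrame API filtering,
# a temporary view queried with SQL, and de-duplication/aggregation.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType
import pyspark.sql.functions as fn

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext

# RDD: flatMap + distinct (transformations), count/collect (actions)
words = sc.parallelize(["a b", "a c", "b c"]).flatMap(lambda line: line.split(" ")).distinct()
print(words.count(), words.collect())

# DataFrame with an explicit StructType schema
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
])
df = spark.createDataFrame([(123, "Katie", 19), (124, "Tom", 22), (124, "Tom", 22)], schema)

# DataFrame API query
df.select("id", "age").filter("age = 22").show()

# SQL query against a temporary view
df.createOrReplaceTempView("people")
spark.sql("select count(*) from people").show()

# duplicate check, de-duplication and aggregation
print(df.count(), df.distinct().count())
df.dropDuplicates().agg(
    fn.count("id").alias("count"),
    fn.countDistinct("id").alias("distinct"),
).show()

spark.stop()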

 


Origin www.cnblogs.com/ruralcraftsman/p/11387625.html