Overview of Spark/SparkSQL reading Hadoop LZO files

1. Pre-configuration

  • IDEA
  • Maven installation configuration
  • Scala (optional)
  • Java
  • hadoop.dll (may be required, depending on whether related error messages appear)
  • hadoop-lzo-0.xx.xx.jar (if your Spark/Hadoop version is high, you need to download a newer build from the official website, since the highest version in the mvnrepository repository is 0.4.15; I am on Spark 2.2.0 and use hadoop-lzo-0.4.21.jar; if the Spark/Hadoop version you use is relatively low, you can use the pom dependency directly)

2. Operation steps

  1. Create a new Project/Module in IDEA
  2. Introduce related dependencies in pom.xml (Spark, Hadoop, etc.)
  3. Write code to read LZO files
  4. Test run
  5. Package and run on the server

3. Operating Instructions

Steps 1 and 2 are straightforward, so this section skips them and covers steps 3, 4, and 5, where errors generally appear.

(1) Write code to read LZO files

Required configuration:

    import org.apache.hadoop.conf.Configuration

    val conf = new Configuration()
    // Connect to DataNodes by hostname instead of IP (needed with split internal/external networks)
    conf.set("dfs.client.use.datanode.hostname", "true")
    // Register the LZO codec alongside the default codecs
    conf.set("io.compression.codecs", "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzopCodec")
    conf.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec")

The dfs.client.use.datanode.hostname setting makes the HDFS client connect to DataNodes by hostname; without it, on clusters with separate internal and external network IPs, the client can fetch metadata from the NameNode but fails to connect to the DataNodes.
The latter two settings register the LZO codec for reading .lzo files; without them the job fails with: java.io.IOException: Codec for file hdfs:xxx.lzo not found, cannot run

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // Read (offset, line) pairs with the LZO-aware input format and keep only the line text
    val value = ss.sparkContext
      .newAPIHadoopFile(hdfsLzoPath, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .mapPartitions(p => p.map(row => row._2.toString))

The key point in this part is the import: make sure LzoTextInputFormat comes from the correct package (com.hadoop.mapreduce, provided by hadoop-lzo, not one of Hadoop's built-in input formats). A complete sketch is shown below.
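
Putting the pieces together, here is a minimal end-to-end sketch, not the exact code from this article: it assumes Spark 2.x with hadoop-lzo on the classpath, and the application name, path argument, column name, and view name are all placeholders.

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.sql.SparkSession

    object ReadLzoExample {
      def main(args: Array[String]): Unit = {
        val ss = SparkSession.builder().appName("read-lzo-example").getOrCreate()

        val conf = new Configuration()
        conf.set("dfs.client.use.datanode.hostname", "true")
        conf.set("io.compression.codecs",
          "org.apache.hadoop.io.compress.DefaultCodec," +
          "org.apache.hadoop.io.compress.GzipCodec," +
          "com.hadoop.compression.lzo.LzopCodec")
        conf.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec")

        val hdfsLzoPath = args(0) // placeholder, e.g. an hdfs://... path to a .lzo file

        // Read the LZO file as (offset, line) pairs and keep only the line text
        val lines = ss.sparkContext
          .newAPIHadoopFile(hdfsLzoPath, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text], conf)
          .mapPartitions(p => p.map(row => row._2.toString))

        // Hand the data to SparkSQL via a temporary view
        import ss.implicits._
        lines.toDF("line").createOrReplaceTempView("lzo_lines")
        ss.sql("SELECT count(*) FROM lzo_lines").show()

        ss.stop()
      }
    }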

(2) Test run

If this error is reported at runtime: ERROR lzo.LzoCodec: Cannot load native-lzo without native-hadoop, the required native dependency is missing from the environment.
On Linux this means the lzo and lzop native libraries (.a) are not installed; in a local development environment it means the LZO native dependency (.dll) is missing.
Solution: install the lzo packages on Linux; on Windows, add the dll file to the HADOOP_HOME directory.
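
For the Linux case, the exact package names depend on the distribution; a rough sketch (the package names here are assumptions to verify against your distro):

    # CentOS / RHEL (package names are distro-dependent)
    yum install -y lzo lzop
    # Debian / Ubuntu
    apt-get install -y liblzo2-2 lzop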

(3) Package and run on the server

If your version is relatively low and you reference the dependency coordinates from mvnrepository directly, it generally runs without problems. If instead you imported a local jar, the run may fail with ERROR lzo.LzoCodec: Cannot load native-lzo without native-hadoop. The message is the same as the error above, but here it is not an environment problem: the local jar was not bundled during packaging. Solution: install the local jar as a Maven dependency and reference it in the pom.

  1. Run: mvn install:install-file -Dfile=hadoop-lzo-0.4.21-SNAPSHOT.jar -DgroupId=hadoop-lzo -DartifactId=hadoop-lzo -Dversion=0.4.21 -Dpackaging=jar

Format:
mvn install:install-file
-Dfile=path to the jar file
-DgroupId=groupId in the pom file
-DartifactId=artifactId in the pom file
-Dversion=version in the pom file
-Dpackaging=jar

  2. The dependency can then be imported normally in the pom.
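
With the coordinates used in the install-file command above, the pom entry would look like this (a sketch matching those coordinates; adjust them to whatever you passed to install-file):

    <dependency>
        <groupId>hadoop-lzo</groupId>
        <artifactId>hadoop-lzo</artifactId>
        <version>0.4.21</version>
    </dependency>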

You can also consider finding a repository that hosts a higher version of hadoop-lzo and configuring it in Maven's settings.xml.
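
If you go that route, the repository entry lives inside a profile in settings.xml; a sketch, where the id is hypothetical and the URL is a placeholder for the actual repository address:

    <!-- settings.xml: inside <profiles><profile>...<repositories>, with the profile activated -->
    <repository>
        <id>hadoop-lzo-repo</id>                      <!-- hypothetical id -->
        <url>https://repo.example.com/maven2</url>    <!-- placeholder URL -->
    </repository>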


Origin blog.csdn.net/qq_44491709/article/details/127411544