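Parsing XML files with Hive XML SerDe

Hive XML SerDe is an XML processing library built on the Hive SerDe (serializer/deserializer) framework. It relies on XmlInputFormat from the Apache Mahout project to split the input file into XML fragments based on specific start and end tags. In essence, the XML SerDe then queries each XML fragment with an XPath processor to populate the columns of a Hive table.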
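To make the splitting step concrete, here is a minimal Python sketch of the idea. It is not the SerDe's actual Java implementation: it only scans a string for a start marker and an end marker and yields every fragment in between. The function name `split_fragments` and the sample data are made up for illustration.

```python
from typing import Iterator


def split_fragments(data: str, start: str, end: str) -> Iterator[str]:
    """Yield each substring that begins with `start` and ends with `end`."""
    pos = 0
    while True:
        begin = data.find(start, pos)
        if begin == -1:
            return  # no further start marker
        stop = data.find(end, begin + len(start))
        if stop == -1:
            return  # unterminated fragment: drop it
        stop += len(end)
        yield data[begin:stop]
        pos = stop


# Hypothetical input: two records wrapped in a root element.
sample = (
    "<root>"
    "<listing><seller_name>alice</seller_name></listing>"
    "<listing><seller_name>bob</seller_name></listing>"
    "</root>"
)

fragments = list(split_fragments(sample, "<listing>", "</listing>"))
for fragment in fragments:
    print(fragment)
```

Each fragment produced this way corresponds to one row of the table; the surrounding root element is simply ignored.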

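Preparation:

First, add the SerDe jar to Hive. There are several ways to do this; here we register it as a temporary jar, which is only available for the current session.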

ADD JAR /home/hadoop/hive_jar/hivexmlserde-1.0.5.3.jar;

Note: if Spark SQL is integrated with your Hive installation, use Spark SQL to perform this step.

Table creation statement:

CREATE TABLE ebay_listing(
seller_name STRING,
seller_rating BIGINT,
bidder_name STRING,
location STRING,
bid_history map<string,string>,
item_info map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.seller_name"="/listing/seller_info/seller_name/text()",
"column.xpath.seller_rating"="/listing/seller_info/seller_rating/text()",
"column.xpath.bidder_name"="/listing/auction_info/high_bidder/bidder_name/text()",
"column.xpath.location"="/listing/auction_info/location/text()",
"column.xpath.bid_history"="/listing/bid_history/*",
"column.xpath.item_info"="/listing/item_info/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<listing>",
"xmlinput.end"="</listing>"
);
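The column mapping in the statement above can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not how the SerDe is implemented: Python's `xml.etree.ElementTree` supports only a subset of XPath, so the leading `/listing` step and the trailing `text()` are expressed as relative paths plus `findtext`. The fragment, its values, and the helper names `scalar` and `as_map` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical fragment; element names follow the XPath expressions in the
# table definition, the values are invented for illustration.
FRAGMENT = """
<listing>
  <seller_info>
    <seller_name>alice</seller_name>
    <seller_rating>42</seller_rating>
  </seller_info>
  <bid_history>
    <highest_bid_amount>$10.00</highest_bid_amount>
    <quantity>1</quantity>
  </bid_history>
</listing>
"""


def scalar(root: ET.Element, path: str) -> Optional[str]:
    """Text of the first element matching a path relative to the root."""
    return root.findtext(path)


def as_map(root: ET.Element, path: str) -> Dict[str, str]:
    """Child elements of `path` as {tag: text}, like a map<string,string> column."""
    parent = root.find(path)
    if parent is None:
        return {}
    return {child.tag: (child.text or "") for child in parent}


root = ET.fromstring(FRAGMENT)
row = {
    # stands in for /listing/seller_info/seller_name/text()
    "seller_name": scalar(root, "seller_info/seller_name"),
    # seller_rating is declared BIGINT, so the text is converted to an integer
    "seller_rating": int(scalar(root, "seller_info/seller_rating")),
    # stands in for /listing/bid_history/*
    "bid_history": as_map(root, "bid_history"),
}
print(row)
```

Scalar columns take the text of a single element, while the `map<string,string>` columns collect each child element as a key/value pair, which is why the query further down can look up entries such as `bid_history["highest_bid_amount"]`.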

Load the data:

load data local inpath "/home/hadoop/hive_data/ebay.xml" into table ebay_listing;

Next, run a test query:

SELECT seller_name, bidder_name, location, bid_history["highest_bid_amount"], item_info["cpu"]
FROM ebay_listing;

Result:

seller_name     bidder_name     location        bid_history[highest_bid_amount] item_info[cpu]
cubsfantony     gosha555@excite.com     USA/Chicago     $620.00 Pentium III 933 System
ct-inc  petitjc@yahoo.com       USA/Los Angeles $680.00 Intel Pentium III 800EB-MHz Coppermine CPU
ct-inc  hsclm9@peganet.com      USA/Los Angeles $1,025.00       Intel Pentium III 933EB-MHz Coppermine CPU
bestbuys4systems        wizbang4        Allentown, PA 18109     $610.00 Genuine Intel Pentium III 1000MHz Processor
sales@ctgcom.com        chul2@mail.utexas.edu   LOS ANGELES, CA $535.00 INTEL Pentium III 800MHz
Time taken: 2.197 seconds, Fetched 5 row(s)

The XML file has been successfully parsed into a relational table.

Reposted from blog.csdn.net/a805814077/article/details/103310413