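Hive XML SerDe is an XML processing library built on Hive's SerDe (serializer/deserializer) framework. It relies on the XmlInputFormat from the Apache Mahout project to split input files into XML fragments based on configured start and end tags. In essence, the XML SerDe then runs an XPath processor against each fragment to populate the columns of a Hive table.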
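To make the two stages concrete, here is a minimal Python sketch of the idea: cut the input into fragments by start and end tag, then query each fragment to build a row. It is an illustration only, not the SerDe's actual implementation. The sample XML, the names `split_fragments` and `extract_row`, and the use of `xml.etree.ElementTree` in place of a full XPath processor are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def split_fragments(text, start_tag, end_tag):
    """Yield each substring that starts with start_tag and ends with end_tag."""
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            return
        end = text.find(end_tag, start)
        if end == -1:
            return
        end += len(end_tag)
        yield text[start:end]
        pos = end


# Hypothetical input, made up for this sketch (not the real ebay.xml).
SAMPLE = """<root>
<listing><seller_info><seller_name>alice</seller_name><seller_rating>42</seller_rating></seller_info></listing>
<listing><seller_info><seller_name>bob</seller_name><seller_rating>7</seller_rating></seller_info></listing>
</root>"""


def extract_row(fragment):
    """Build one (seller_name, seller_rating) row from a single fragment."""
    root = ET.fromstring(fragment)
    # ElementTree paths are relative to the fragment's root element, so the
    # XPath /listing/seller_info/seller_name/text() becomes the path below.
    name = root.findtext("seller_info/seller_name")
    rating = int(root.findtext("seller_info/seller_rating"))
    return (name, rating)


rows = [extract_row(f) for f in split_fragments(SAMPLE, "<listing>", "</listing>")]
print(rows)
```

Each fragment becomes one row, which is why the start and end tags should delimit exactly one record.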
Preparation:
First, add the SerDe JAR to Hive. There are several ways to do this; here we register it as a temporary JAR for the current session:
ADD JAR /home/hadoop/hive_jar/hivexmlserde-1.0.5.3.jar;
Note: if Spark SQL is integrated with your Hive setup, run this step from Spark SQL instead.
Create the table:
CREATE TABLE ebay_listing(seller_name STRING,
seller_rating BIGINT, bidder_name STRING,
location STRING, bid_history map<string,string>,
item_info map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.seller_name"="/listing/seller_info/seller_name/text()",
"column.xpath.seller_rating"="/listing/seller_info/seller_rating/text()",
"column.xpath.bidder_name"="/listing/auction_info/high_bidder/bidder_name/text()",
"column.xpath.location"="/listing/auction_info/location/text()",
"column.xpath.bid_history"="/listing/bid_history/*",
"column.xpath.item_info"="/listing/item_info/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<listing>",
"xmlinput.end"="</listing>"
);
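The two map columns use XPaths ending in `/*`, which select every child element of a node. A plausible reading, consistent with the later lookups `bid_history["highest_bid_amount"]` and `item_info["cpu"]`, is that each child's element name becomes a map key and its text becomes the value. The Python sketch below mimics that behaviour under this assumption. The fragment is hypothetical: its structure is inferred from the XPaths above, and the `quantity` and `memory` elements are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only its shape is inferred from the table definition.
FRAGMENT = """<listing>
  <bid_history>
    <highest_bid_amount>$620.00</highest_bid_amount>
    <quantity>1</quantity>
  </bid_history>
  <item_info>
    <cpu>Pentium III 933 System</cpu>
    <memory>256MB</memory>
  </item_info>
</listing>"""


def children_to_map(fragment, parent_path):
    """Map each child element name under parent_path to its text content."""
    root = ET.fromstring(fragment)
    parent = root.find(parent_path)
    if parent is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in parent}


bid_history = children_to_map(FRAGMENT, "bid_history")
item_info = children_to_map(FRAGMENT, "item_info")
print(bid_history["highest_bid_amount"], item_info["cpu"])
```

A map column is convenient when the child elements vary between records, since no schema change is needed to expose a new key.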
Load the data:
load data local inpath "/home/hadoop/hive_data/ebay.xml" into table ebay_listing;
Next, run a test query:
SELECT seller_name, bidder_name, location, bid_history["highest_bid_amount"], item_info["cpu"]
FROM ebay_listing;
Result:
seller_name | bidder_name | location | bid_history[highest_bid_amount] | item_info[cpu]
cubsfantony | gosha555@excite.com | USA/Chicago | $620.00 | Pentium III 933 System
ct-inc | petitjc@yahoo.com | USA/Los Angeles | $680.00 | Intel Pentium III 800EB-MHz Coppermine CPU
ct-inc | hsclm9@peganet.com | USA/Los Angeles | $1,025.00 | Intel Pentium III 933EB-MHz Coppermine CPU
bestbuys4systems | wizbang4 | Allentown, PA 18109 | $610.00 | Genuine Intel Pentium III 1000MHz Processor
sales@ctgcom.com | chul2@mail.utexas.edu | LOS ANGELES, CA | $535.00 | INTEL Pentium III 800MHz
Time taken: 2.197 seconds, Fetched 5 row(s)
The XML file has been successfully parsed into a relational table.