Big data applications: Hadoop distributions show their strengths

This article originally appeared on WeChat. Wang Wenzhong compiled Hadoop application case studies from the three major Hadoop distribution vendors: Hortonworks, Cloudera, and MapR.

Cloudera: Accelerating Data Analysis


Edo Interactive is an American marketing company that helps advertisers connect online advertising with offline data to provide data-driven, personalized recommendation services. But a few years ago, Edo ran into a problem: its data warehouse system took too long to process credit card transaction data and could not keep up with the company's business need to deliver personalized recommendations to consumers and restaurants.


Tim Garnto, Edo's senior vice president of infrastructure and information systems, said: "It took 27 hours to process each day's data, so the job simply couldn't get done." In 2013, Edo replaced the PostgreSQL-based system with a Hadoop cluster, building a data resource pool for the company.


The company collects 50 million retail transactions a day from across the United States and loads the data into a 20-node cluster running Cloudera's Hadoop distribution. Using the Pentaho data integration tool, data collected from banks and credit card companies is processed and fed into predictive models that recommend offers such as coupons to holders of bank cards or credit cards. Coupon information is sent to customers weekly by Edo's partners and automatically matched against each user's spending behavior.
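The article does not describe Edo's predictive models, but the matching step above can be illustrated with a minimal, hypothetical sketch: pairing a card holder's most frequent spending categories with available coupon offers. The field names and the category-overlap rule are assumptions standing in for the real models.

```python
# Hypothetical sketch of coupon-to-spending matching; Edo's actual
# predictive models are not described in the article.
from collections import Counter

def top_categories(transactions, n=2):
    """Most frequent merchant categories in a user's transactions."""
    counts = Counter(t["category"] for t in transactions)
    return [cat for cat, _ in counts.most_common(n)]

def match_coupons(transactions, coupons):
    """Return coupons whose category matches the user's top spend categories."""
    cats = set(top_categories(transactions))
    return [c for c in coupons if c["category"] in cats]

txns = [
    {"category": "restaurant", "amount": 42.50},
    {"category": "restaurant", "amount": 18.00},
    {"category": "grocery",    "amount": 96.10},
]
coupons = [
    {"id": "c1", "category": "restaurant", "offer": "10% off dinner"},
    {"id": "c2", "category": "travel",     "offer": "$50 off a flight"},
]
print(match_coupons(txns, coupons))  # only the restaurant coupon matches
```

In a production pipeline this rule would be replaced by the trained models, with the transaction history read from the Hadoop cluster rather than an in-memory list.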


Despite the complexity of the models, Edo's data analysts can now process data in minutes or hours, something that was previously impossible, Garnto said.


However, the company also ran into difficulties building the data pool. Initially, only one IT staffer had experience with Hadoop and the MapReduce programming framework. The company organized training for its internal staff, but switching to MapReduce programming meant employees had to abandon the relational database approach they knew, and the transition consumed considerable time.


It also took time to make the raw data entering the system consistent and to generate standardized analysis data sets. Edo currently holds 45 billion records, 255TB in all, and this data is a core asset for the company, so Garnto manages it with extra care when adding new Hadoop ecosystem technologies: even a small addition can change how the whole system works. Of all the challenges the company faces, Garnto said, this is the most interesting one, and it requires planning the cluster's future with vision.


Hortonworks: Reducing Hardware Costs


Webtrends, which collects and processes web, mobile, and IoT activity data, is another data resource pool user. The Portland, Oregon-based company deployed a Hadoop cluster built on Hortonworks' distribution in July of last year and put it into production at the beginning of this year, initially to support a product called Explore, which lets the company's marketers run ad hoc analyses of customer data. Peter Crossley, the company's director of product architecture, said roughly 500TB of data is added to the 60-node cluster each quarter, which now holds about 1.28PB.


Webtrends plans to use the Hadoop platform to replace its original storage systems. Using Kafka message queues and automated processing scripts, network click data flows into the cluster within 20 to 40 milliseconds, making reporting and analysis essentially real-time, much faster than with the old system. The Hadoop cluster also supports more advanced analytics and has cut hardware costs by 25% to 50%.
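The article mentions automated processing scripts sitting between the click stream and the cluster. A minimal sketch of what such a script might do is shown below: normalize a raw click event before it is published to a Kafka topic. The field names (`visitor_id`, `url`, `ts`) are illustrative assumptions; Webtrends' actual event schema is not described in the article.

```python
# Hypothetical normalization step for a raw click event, the kind of work
# an automated processing script might do before publishing to Kafka.
import json
from datetime import datetime, timezone

def normalize_click_event(raw: str) -> dict:
    """Parse a raw JSON click event and normalize it for the cluster."""
    event = json.loads(raw)
    return {
        "visitor_id": event["visitor_id"],
        # Canonicalize URLs so downstream aggregations group correctly.
        "url": event["url"].lower().rstrip("/"),
        # Store timestamps as UTC ISO-8601 so jobs sort them consistently.
        "ts": datetime.fromtimestamp(event["ts"], tz=timezone.utc).isoformat(),
    }

raw = '{"visitor_id": "v42", "url": "https://Example.com/Home/", "ts": 1700000000}'
print(normalize_click_event(raw))
```

In the real pipeline the normalized record would then be serialized and sent with a Kafka producer; the broker handles delivery into the cluster's ingestion jobs.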


Using a Hadoop data pool means changing how the company manages and uses information. Previously, it had to build common data reports up front from predefined columns in the data warehouse.


Companies also need to think about data pool architecture and data governance processes to better manage data across a Hadoop cluster. Raw data enters the system loosely structured, but governance is strict. Webtrends divided its Hadoop cluster into three separate layers: one for raw data, a second for incremental daily data sets, and a third for third-party information. Each layer has its own data classification and governance policies, which vary by data set.
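The three-layer split above can be sketched as a simple path-and-policy scheme. The layer names, path layout, and policy fields are assumptions; the article only says the cluster is divided into raw, daily-incremental, and third-party layers, each with its own governance rules.

```python
# Minimal sketch of a three-layer data pool layout with per-layer policies.
# Layer names, paths, and policy values are illustrative assumptions.
LAYER_POLICIES = {
    "raw":         {"retention_days": 365, "pii_allowed": True},
    "daily":       {"retention_days": 90,  "pii_allowed": False},
    "third_party": {"retention_days": 180, "pii_allowed": False},
}

def dataset_path(layer: str, dataset: str, date: str) -> str:
    """Return the cluster path for a data set, enforcing known layers."""
    if layer not in LAYER_POLICIES:
        raise ValueError(f"unknown layer: {layer}")
    return f"/data/{layer}/{dataset}/dt={date}"

print(dataset_path("daily", "clickstream", "2015-06-01"))
# /data/daily/clickstream/dt=2015-06-01
```

Keeping the policy table next to the path scheme makes it straightforward for ingestion jobs to refuse writes to an unknown layer and for cleanup jobs to apply the right retention per layer.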


MapR: Organized Data Storage


Suren Nathan, CTO of cloud-based predictive analytics software provider Razorsight, also stressed being very "disciplined and organized" when building and using a Hadoop data pool. Otherwise, the system becomes an unmanageable dumping ground.


Razorsight provides cloud-based analytics services for the telecommunications industry and adopted a Hadoop cluster based on MapR's distribution in the second quarter of 2014. Customer, operational, and network data from carriers is loaded into the system through a home-built ingestion tool and made available to data scientists through the Spark processing engine. The cluster has five production nodes and 120TB of storage capacity.


Like Webtrends, Razorsight divides its data pool into three parts: one for data less than six months old, one for older data that is still useful, and one for data that is no longer used but must be preserved. The company currently holds over 20TB of data in the first two tiers. To keep the system running smoothly, it hired new employees with experience in data governance and distributed systems deployment, while existing employees took responsibility for Hadoop, Spark, and related technologies.
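The age-based split above can be sketched as a small classifier. The tier names and the cutoff for "older but still useful" data are assumptions; the article only gives the six-month boundary and a preserve-only archive tier.

```python
# Minimal sketch of Razorsight's three-tier, age-based data split.
# Tier names and the two-year "useful" window are assumptions.
from datetime import date

def tier_for(record_date: date, today: date, useful_days: int = 730) -> str:
    """Classify a record as hot (<6 months), warm (older but still useful),
    or archive (kept only for preservation)."""
    age_days = (today - record_date).days
    if age_days < 183:            # roughly six months
        return "hot"
    if age_days < useful_days:    # assumed usefulness window
        return "warm"
    return "archive"

today = date(2015, 6, 1)
print(tier_for(date(2015, 3, 1), today))   # hot
print(tier_for(date(2014, 1, 1), today))   # warm
print(tier_for(date(2010, 1, 1), today))   # archive
```

A classifier like this lets a nightly maintenance job move records between tiers (and storage classes) instead of relying on ad hoc decisions, which is what Nathan's "disciplined and organized" advice amounts to in practice.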


A Hadoop cluster costs about $2,000 per terabyte of data, one-tenth the cost of the IBM Netezza data warehouse system. Even so, Razorsight initially built its Hadoop cluster just for data storage; analytical models and data visualization still run on the old system, partly because the Netezza hardware is bundled with IBM's SPSS analytics software. Nathan expects to finish migrating the visualization layer and analytics pool to the Hadoop data pool architecture by the end of this year.
