In the Big Data era, you first have to understand the concept of Big Data.

Big Data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. They are massive, fast-growing, and diverse information assets that require new processing models in order to deliver stronger decision-making power, insight, and process-optimization capabilities.
The smallest basic unit is the bit; the units in ascending order are: bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, DB.
1 Byte = 8 bit
1 KB = 1,024 Bytes = 8,192 bit
1 MB = 1,024 KB = 1,048,576 Bytes
1 GB = 1,024 MB = 1,048,576 KB
1 TB = 1,024 GB = 1,048,576 MB
1 PB = 1,024 TB = 1,048,576 GB
1 EB = 1,024 PB = 1,048,576 TB
1 ZB = 1,024 EB = 1,048,576 PB
1 YB = 1,024 ZB = 1,048,576 EB
1 BB = 1,024 YB = 1,048,576 ZB
1 NB = 1,024 BB = 1,048,576 YB
1 DB = 1,024 NB = 1,048,576 BB
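Because each step up the ladder is a factor of 1,024 (2^10), the conversions above can be verified with simple bit shifts. A minimal Java sketch (units beyond EB overflow a signed 64-bit long, so they are omitted):

```java
public class StorageUnits {
    // Each unit is 1,024 (2^10) times the previous one.
    static final long BYTE = 1L;        // 1 Byte = 8 bit
    static final long KB = BYTE << 10;  // 1,024 Bytes
    static final long MB = KB << 10;    // 1,048,576 Bytes
    static final long GB = MB << 10;
    static final long TB = GB << 10;
    static final long PB = TB << 10;
    static final long EB = PB << 10;    // 2^60: the largest unit that fits in a long

    public static void main(String[] args) {
        System.out.println("1 TB = " + (TB / MB) + " MB"); // 1048576 MB
        System.out.println("1 PB = " + (PB / GB) + " GB"); // 1048576 GB
    }
}
```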
Big data technology mainly solves two problems: the storage of massive data and the computation and analysis of massive data.
2. Characteristics of Big Data
1. Volume. The first characteristic of big data is "big". In the early MP3 era, a file of a few MB met most people's needs, but over time storage has grown from the GB level to TB, and now to the PB and EB level. With the rapid development of information technology, data has exploded. Social networks (Weibo, Twitter, Facebook), mobile networks, and all kinds of smart tools and service tools have become sources of data. Taobao's nearly 400 million members generate about 20 TB of commodity-transaction data every day; Facebook's roughly one billion users produce more than 300 TB of log data daily. There is an urgent need for intelligent algorithms, powerful data-processing platforms, and new data-processing techniques to perform statistics, analysis, forecasting, and real-time processing of data at this scale.
2. Variety. Data comes from a wide range of sources, which determines the diversity of its forms. Any form of data may be useful. The most widely used example is the recommendation system: platforms such as Taobao, NetEase Cloud Music, and Toutiao analyze users' log data and then recommend things the users may like. Log data is clearly structured; other data, such as images, audio, and video, has little obvious structure and weak causal relationships, and needs to be labeled manually.
3. Velocity. Big data grows very quickly and is mainly transmitted through the Internet. Everyone's daily life is inseparable from the Internet, which means every individual supplies a large amount of data to big data every single day. This data needs to be processed promptly, because spending a great deal of capital to store historical data of little use is very uneconomical; a platform may keep only the last few days or the last month of data and clear out anything older, since retaining it would cost too much. For this reason, big data imposes very strict requirements on processing speed: large numbers of servers are used to process and compute the data, and many platforms need real-time analysis. Data is generated all the time; whoever is faster gains the advantage.
 
4. Value. This is the core characteristic of big data. Of the data generated in the real world, only a small proportion is valuable. Compared with traditional small data, the greatest value of big data lies in digging valuable information out of huge volumes of seemingly unrelated data: through machine learning, artificial intelligence, or data-mining methods, it conducts deep analysis to predict future trends and patterns and to discover new laws and new knowledge, which is then applied in fields such as agriculture, finance, and health care, ultimately improving social governance, raising productivity, and advancing scientific research.
HADOOP Background
1.1 What is HADOOP
Introduction from the official website, hadoop.apache.org (machine translation can help with reading it).
Apache Hadoop is open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows distributed processing of large data sets (massive data) across clusters of machines using simple programming models. Its modules include:
• Hadoop Common: common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
• Hadoop YARN: a framework for job scheduling and cluster resource management.
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
Each of these modules has its own independent function, and the modules are also related to one another.
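As a taste of how an application touches HDFS, here is a minimal sketch of reading a file through the HDFS Java client API. The NameNode address hdfs://localhost:9000 and the path /user/demo/input.txt are placeholders assumed for this sketch, not values from the text.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode (the address is an assumption for this sketch).
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Open the file and print it line by line.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/demo/input.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```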
Broadly speaking, HADOOP usually refers to a broader concept: the HADOOP ecosystem.
1.2 HADOOP background
The prototype began in 2002 with Apache Nutch, an open-source search engine implemented in Java. It provided all the tools needed to run one's own search engine, including full-text search and a Web crawler. Nutch's design goal was to build a large-scale, whole-Web search engine covering crawling, indexing, and querying, but as the number of crawled pages grew it hit a serious scalability problem: how to store and index billions of pages.
• In 2003 Google published a technical academic paper on the Google File System (GFS). GFS is a proprietary file system that Google designed to store the massive amounts of data behind its search business.
• In 2004 Doug Cutting, the founder of Nutch, implemented a distributed file storage system called NDFS, based on Google's GFS paper.
ps: In 2003-2004 Google disclosed the details of its GFS and MapReduce ideas; over roughly two years of spare time, Doug Cutting and others implemented the DFS and MapReduce mechanisms as a miniature version inside Nutch.
• In 2004 Google published a technical academic paper on MapReduce. MapReduce is a programming model for parallel analysis and computation over large data sets (greater than 1 TB); the canonical word-count example appears in the sketch after this list.
• In 2005 Doug Cutting implemented these functions in the Nutch search engine based on MapReduce and NDFS.
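To make the MapReduce programming model concrete, here is the canonical word count in the Hadoop MapReduce Java API: the map phase emits a (word, 1) pair per token, and the reduce phase sums the pairs per word. The input and output paths come from the command line; class and variable names are mine, not from the text.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }
    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```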
HADOOP applications at home and abroad
• Log analysis for a large website's Web server cluster: about 800 GB of click logs are collected every 5 minutes, with peaks of 9 million clicks per second. Every 5 minutes the data is loaded into memory, the site's hot URLs are computed at high speed, and the results are fed back to the front-end cache servers to improve the cache hit rate.
• Carrier traffic analysis: about 2 TB to 5 TB of traffic data per day is copied onto HDFS, where an interactive analysis engine framework can run hundreds of complex data-cleaning and reporting jobs, with a total runtime 2 to 3 times faster than a minicomputer cluster of similar hardware configuration running DB2.
1.5 Employment outlook for HADOOP in China
You can check job sites such as Zhaopin online.
 
There are three main directions for big data jobs:
• Data-analysis talent, corresponding position: big data analyst
• Systems R&D talent, corresponding position: big data systems R&D engineer
• Application-development talent, corresponding position: big data application development engineer
The big data technology ecosystem
The technical terms involved in the ecosystem diagram are explained as follows:
1) Sqoop: Sqoop is an open-source tool mainly used to transfer data between Hadoop (Hive) and traditional databases such as MySQL. It can import data from a relational database (e.g. MySQL or Oracle) into Hadoop's HDFS, and it can also export data from HDFS into a relational database.
2) Flume: Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume supports customizing all kinds of data senders in a logging system to collect data, and it also provides the ability to do simple processing of the data and write it to various (customizable) data receivers.
3) Kafka: Kafka is a high-throughput distributed publish-subscribe messaging system with the following features:
(1) Message persistence through an O(1) disk data structure, which maintains stable performance over long periods even with TB-scale message storage.
(2) High throughput: even on very ordinary hardware, Kafka can support millions of messages per second.
(3) Support for partitioning messages across Kafka servers and consumer clusters.
(4) Support for parallel data loading into Hadoop.
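As a small illustration of the publish side, here is a minimal sketch of a Java Kafka producer. The broker address localhost:9092, the topic name "logs", and the message keys are assumptions made for this sketch.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Messages with the same key are routed to the same partition.
                producer.send(new ProducerRecord<>("logs", "host-" + (i % 3), "log line " + i));
            }
        } // close() flushes any buffered messages
    }
}
```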
4) Storm: Storm provides a set of general primitives for distributed real-time computation. It can be used for "stream processing", handling messages and updating databases in real time; this is another way of managing queues and worker clusters. Storm can also be used for "continuous computation", running continuous queries over data streams and streaming the results out to users as they are computed.
5) Spark: Spark is currently the most popular open-source in-memory computing framework for big data. It can run computations over big data stored on Hadoop.
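To show what "in-memory computation over data stored on Hadoop" looks like, here is a minimal word-count sketch in Spark's Java API. The HDFS input path and the local[*] master setting are placeholders assumed for this sketch.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read a file that lives on HDFS (path is an assumption for this sketch).
            JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/user/demo/input.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum); // aggregation happens in memory across the cluster
            // Print a small sample of the results.
            counts.take(10).forEach(t -> System.out.println(t._1() + "\t" + t._2()));
        }
    }
}
```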
6) Oozie: Oozie is a workflow scheduling and management system for Hadoop jobs. An Oozie coordinator job triggers the current Oozie workflow based on time (frequency) and the availability of data.
7) HBase: HBase is a distributed, column-oriented open-source database. Unlike a typical relational database, HBase is a database well suited to storing unstructured data.
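For a feel of HBase's data model (rows, column families, columns), here is a minimal Java client sketch that writes one cell and reads it back. The table name "user" and column family "info" are assumptions for this sketch, and the table would need to exist beforehand (e.g. create 'user', 'info' in the HBase shell).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Write: row key "row1", column info:name = "alice".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);
            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("info:name = " + Bytes.toString(name));
        }
    }
}
```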
8) Hive: Hive is a data-warehouse tool based on Hadoop. It can map structured data files to database tables and provides simple SQL query capability, translating SQL statements into MapReduce jobs for execution. Its advantage is a low learning curve: simple MapReduce statistics can be produced quickly with SQL-like statements, with no need to develop dedicated MapReduce applications, which makes it very suitable for statistical analysis of a data warehouse.
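Since Hive exposes itself through SQL, a common way to use it from Java is JDBC against HiveServer2; below is a minimal sketch. The connection URL, the credentials, and the "words" table are assumptions for this sketch (the hive-jdbc driver must be on the classpath); the GROUP BY is the kind of statement Hive translates into a MapReduce job.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 listens on port 10000 by default; "default" is the database name.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```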
9) Mahout: Apache Mahout is a scalable machine-learning and data-mining library. Mahout currently supports four main use cases:
Recommendation mining: collecting user actions and using them to recommend things the user might like.
Clustering: collecting documents and grouping related documents together.
Classification: learning from existing categorized documents which features the documents of a category share, and assigning unlabeled documents to the correct category.
Frequent-itemset mining: grouping a set of items and identifying which individual items often appear together.
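As an illustration of the recommendation-mining use case, here is a minimal sketch using Mahout's "Taste" user-based recommender. The ratings.csv input (lines of userID,itemID,preference), the neighborhood size of 10, and user ID 1 are assumptions made for this sketch.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
    public static void main(String[] args) throws Exception {
        // Each line of ratings.csv: userID,itemID,preference (assumed format).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the 10 most similar users when predicting preferences.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println("item " + item.getItemID() + " score " + item.getValue());
        }
    }
}
```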
10) ZooKeeper: ZooKeeper is an open-source implementation of Google's Chubby. It is a reliable coordination system for large distributed systems, providing functions including configuration maintenance, naming service, distributed synchronization, and group services. ZooKeeper's goal is to encapsulate complex and error-prone key services, giving users simple, easy-to-use interfaces and an efficient, stable system.
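To illustrate the configuration-maintenance use, here is a minimal sketch with the ZooKeeper Java client: it stores one configuration value in a znode and reads it back. The ensemble address localhost:2181, the znode path /app-config, and the session timeout are assumptions for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // The connection is asynchronous; wait until the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        // Store a config value; PERSISTENT means the znode outlives this session.
        zk.create("/app-config", "maxConn=100".getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Read the value back.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```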
