Big data: characteristics and development

Big data refers to data sets whose volume grows so fast that they cannot be collected, processed, stored, or computed within an acceptable time using conventional tools.

The author believes that data with the following five characteristics (4V + 1O) can be called big data:

Large volume (Volume): The first characteristic is sheer volume; the amounts of data collected, stored, and computed are all very large. Big data starts at the petabyte scale (1 PB = 1,000 TB) and extends to exabytes (1 EB = 1,000,000 TB) and zettabytes (1 ZB = 1,000,000,000 TB).
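To make these scales concrete, the following Python sketch (illustrative only; the unit constants use decimal SI multiples) converts between the units just mentioned:

```python
# Decimal (SI) data-volume units, expressed in bytes.
TB = 10 ** 12
PB = 10 ** 15  # 1 PB = 1,000 TB
EB = 10 ** 18  # 1 EB = 1,000,000 TB
ZB = 10 ** 21  # 1 ZB = 1,000,000,000 TB

def in_terabytes(n_bytes: int) -> int:
    """Express a byte count as a whole number of terabytes."""
    return n_bytes // TB

print(in_terabytes(PB))  # terabytes in a petabyte
print(in_terabytes(EB))  # terabytes in an exabyte
print(in_terabytes(ZB))  # terabytes in a zettabyte
```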

Diverse types (Variety): The second characteristic is the diversity of data types and sources, spanning structured, semi-structured, and unstructured data: web logs, audio, video, pictures, location information, and so on. The many types of data place higher demands on data processing capabilities.

Low value density (Value): The third characteristic is the relatively low value density of the data: like gold panned from sand, the valuable part is sparse but precious. With the widespread use of the Internet and the Internet of Things, information is sensed everywhere and arrives in floods, yet its value density is low. How to combine business logic with powerful data mining algorithms to extract that value is the most pressing problem of the big data era.

High velocity (Velocity): The fourth characteristic is that the data grows fast, must be processed fast, and carries strict timeliness requirements. For example, search engines must make news published a few minutes ago available to user queries, and personalized recommendation algorithms must complete their recommendations as close to real time as possible. This is a significant feature distinguishing big data from traditional data mining.

Data is online (Online): Data is always online and can be reached by computation at any time; this is the biggest feature distinguishing big data from traditional data. What we call big data today is not merely large; more important is that the data has gone online, a characteristic born of the rapid development of the Internet. For a ride-hailing tool, for example, the driver and passenger data are online in real time, and that is precisely what makes the data meaningful. If the same data sat offline on a disk, its business value would be far lower.


Regarding the characteristics of big data, the online property deserves particular emphasis, because many people equate big data with a large amount of data and overlook it. Only online data, data generated while a user is connected to the product or while a customer is interacting, is meaningful. For example, while a user is using an Internet application, behavioral data is transmitted in real time to the provider; after efficient processing (data analysis or data mining), the provider can optimize the content it pushes, delivering the content users most want to see and improving the user experience.

Key events in the development of big data

The Hadoop project was born in 2005. Hadoop was initially created at Yahoo to solve a web-search problem; because of its efficiency, it was later adopted by the Apache Software Foundation and became an open-source project. Hadoop itself is not a single product but an ecosystem of software products that together provide fully functional, flexible big data analytics. Technically, Hadoop consists of two key services: reliable data storage through the Hadoop Distributed File System (HDFS), and high-performance parallel data processing through a technique called MapReduce. The common goal of the two services is to provide a foundation that makes fast, reliable analysis of both structured and complex data practical.
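The MapReduce model named above can be illustrated with its classic word-count example. The sketch below is a plain-Python simulation of the map, shuffle, and reduce phases, not actual Hadoop code:

```python
from collections import defaultdict

def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is online"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'online': 1}
```

In real Hadoop the map and reduce functions run in parallel on many machines, with HDFS holding the input and output; the logic per key, however, is the same as in this toy version.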

At the end of 2008, "big data" gained recognition from some well-known American computer science researchers when the industry organization Computing Community Consortium published an influential white paper, "Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science, and Society." It pushed thinking beyond data-processing machines and argued that what really matters about big data is the new uses and new insights, not the data itself. This consortium can be said to be the first organization to propose the concept of big data.

In 2009, the Indian government established a biometric database for identity management, and the United Nations Global Pulse project investigated how data from mobile phones and social networking sites could be used to analyze and forecast problems ranging from disease outbreaks to price spirals.

In 2009, the US government further opened the door to data by launching the Data.gov website, which offers the public a wide variety of government data. Its more than 44,500 data sets power websites and smartphone apps that track information ranging from flights to product recalls to unemployment rates in particular regions, an initiative that inspired governments from Kenya to the United Kingdom to launch similar efforts.

In 2009, some of Europe's leading research libraries and scientific and technical information research institutions established a partnership committed to improving the ease of obtaining scientific data on the Internet.

In February 2010, Kenneth Cukier published in The Economist a 14-page special report on big data, "Data, data everywhere." Cukier noted in the report that the world contains an unimaginably vast and fast-growing amount of digital information, and that from science to commerce, from government departments to the arts, many fields were already feeling its effects; computer scientists and engineers had coined a new term for the phenomenon: "big data." Cukier thus became one of the first to gain insight into the coming big data trend.

In February 2011, IBM's Watson supercomputer, capable of scanning and analyzing 4 TB of data (about 200 million pages of text) per second, defeated two human champions on "Jeopardy!", the famous American TV quiz show, and won. The New York Times later called this moment a "victory for big data computing."

In May 2011, the McKinsey Global Institute (MGI), the research arm of the world-renowned consulting firm McKinsey & Company, released the report "Big Data: The Next Frontier for Innovation, Competition, and Productivity," the first comprehensive presentation and forecast of big data by a professional organization, and big data began to attract attention from all sides. The report notes that big data has penetrated every industry and business function and has become an important factor of production. People's mining and use of vast amounts of data heralds a new wave of productivity growth and consumer surplus. The report also observes that "big data" stems from a dramatic rise in the capacity and speed with which data is produced and collected: as ever more people, devices, and sensors are connected by digital networks, the capacity to generate, transmit, share, and access data has been revolutionized.

In December 2011, China's Ministry of Industry and Information Technology released its Twelfth Five-Year Plan for the Internet of Things, which listed information-processing technology as one of four key technology innovation projects, covering massive data storage, data mining, and intelligent image and video analysis, all of which are important components of big data.

In January 2012, at the World Economic Forum held in Davos, Switzerland, big data was one of the themes; the report released at the meeting, "Big Data, Big Impact," declared that data has become a new class of economic asset, just like currency or gold.

In March 2012, the Obama administration announced the "Big Data Research and Development Initiative" on the White House website, a move marking big data as an important characteristic of the times. On March 22, 2012, the administration announced a 200 million dollar investment in the field, elevating big data technology from a business practice to a national science and technology strategy. On a conference call the following day, the government defined data as "the new oil of the future" and the big data field as a matter of national security and of the future. It stated that a country's competitiveness will in part be reflected in the scale and activity of the data it owns and in its ability to interpret and use that data, and that national digital sovereignty is embodied in the possession and control of data. Digital sovereignty will be another arena of great power competition, after land borders, coastal defense, and air and space defense.

In April 2012, the American software company Splunk was successfully listed on NASDAQ on the 19th, becoming the first listed big data processing company. Against the backdrop of a sluggish US economy and a volatile stock market, Splunk's first trading day was especially striking: its share price more than doubled. Splunk, founded in 2003, is a leading provider of big data monitoring and analysis software. Its successful listing pushed the capital markets to focus on big data and prompted IT vendors to accelerate their big data plans.

In July 2012, the United Nations released in New York a white paper on big data for government, summarizing how governments can use big data to better serve and protect their people. The white paper describes the roles, motivations, and needs of individuals, the public sector, and the private sector within a data ecosystem: individuals, motivated by lower prices and better services, provide personal data and crowdsourced information while demanding privacy and the right to opt out; the public sector, aiming to improve services and efficiency, provides statistics, device information, health indicators, and tax and consumption information, likewise demanding privacy and the right to opt out; the private sector, seeking to improve customer understanding and forecast trends, provides aggregated data and consumption and usage information, paying closer attention to the ownership of sensitive data and to business models. The white paper also points out that people today can draw on a vast wealth of data resources, old data and new alike, to conduct unprecedented real-time analysis of society and population. The United Nations cited the example that growth in social networking activity in Ireland and the United States can serve as an early signal of rising unemployment: if a government can sensibly analyze the data resources available to it, it can respond quickly, turning data into action.

In July 2012, to tap the value of big data, Alibaba Group created the post of "Chief Data Officer" in its management team, responsible for comprehensively advancing its "data sharing platform" strategy, and launched a large-scale data sharing platform, "Jushita" (聚石塔), which provides cloud data services to merchants and service providers on Tmall and Taobao. Subsequently, Alibaba's board chairman Jack Ma said in a speech at the 2012 Netrepreneur Summit that from January 1, 2013 the company would transform and reorganize itself around three businesses: platform, finance, and data. Ma stressed: "If we have a data set for prediction, the enterprise is like a ship fitted with GPS and radar; you will sail the seas with far more confidence." Alibaba Group thus hopes, through the sharing and mining of massive data, to provide value to the country and to small enterprises. The move was the first time a domestic enterprise elevated big data to the level of corporate management, a milestone of great significance. Alibaba was also the first to propose running its business on data.

In April 2014, the World Economic Forum released its "Global Information Technology Report (13th edition)" on the theme of the rewards and risks of big data. The report argues that all kinds of ICT policies will become even more important in the coming years, and that issues such as data confidentiality and network governance will be actively discussed. The increasingly active global data industry and the accelerating development and application of innovative technologies have led governments to recognize the significance of big data for promoting economic development, improving public services, raising people's well-being, and even safeguarding national security.

In May 2014, the White House released its 2014 global big data white paper, the research report "Big Data: Seizing Opportunities, Preserving Values." The report encourages the use of data to drive social progress, especially in fields where markets and existing institutions do not otherwise support such progress; at the same time, it calls for appropriate frameworks, structures, and research to help protect Americans' personal privacy, ensure fairness, and prevent discrimination.


The development of big data technologies

Big data technology is a new generation of technologies and architectures that extract value from very large, diverse data sets through low-cost, fast acquisition, processing, and analysis. Big data technologies keep emerging and developing, making it ever easier, cheaper, and faster for us to handle huge amounts of data; they have become a good assistant for putting data to use and are even changing the business models of many industries. The development of big data technologies can be divided into six major directions:

(1) Big data acquisition and preprocessing. The most common problem in this direction is that data comes from many sources and in many forms, so data quality varies, seriously affecting usability. To address these problems, many companies have released a variety of data cleaning and quality control tools (such as IBM's DataStage).
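As a minimal illustration of the kind of cleaning such tools automate (this is a plain-Python sketch, not DataStage; the field names are invented), the following code trims whitespace, fills a missing field with a default, and drops duplicates and unusable records:

```python
def clean(records, default_city="unknown"):
    """Deduplicate records, normalize whitespace/case, fill missing cities."""
    seen = set()
    cleaned = []
    for rec in records:
        name = rec.get("name", "").strip()
        city = (rec.get("city") or default_city).strip().lower()
        key = (name.lower(), city)
        if name and key not in seen:  # drop empty names and duplicates
            seen.add(key)
            cleaned.append({"name": name, "city": city})
    return cleaned

raw = [
    {"name": " Alice ", "city": "Beijing"},
    {"name": "Alice", "city": "beijing"},  # duplicate after normalization
    {"name": "Bob", "city": None},         # missing value, filled with default
    {"name": "", "city": "Shanghai"},      # unusable record, dropped
]
print(clean(raw))
# [{'name': 'Alice', 'city': 'beijing'}, {'name': 'Bob', 'city': 'unknown'}]
```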

(2) Big data storage and management. The most common challenges in this direction are the sheer storage scale and the complexity of storage management, with structured, unstructured, and semi-structured data all needing support. Advances in distributed file systems and distributed databases are effectively addressing these problems. Within this direction, developments in big data indexing and query technology, and in real-time and streaming storage and processing, deserve particular attention.
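One core idea behind distributed storage is partitioning records across nodes by hashing the key, so that every node independently agrees on where a record lives. A toy sketch (illustrative only; real systems like HDFS place data quite differently):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Assign a key to a shard using a stable hash, so all nodes agree."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Distribute six user records across three storage shards.
shards = {i: [] for i in range(3)}
for user_id in ["user-1", "user-2", "user-3", "user-4", "user-5", "user-6"]:
    shards[shard_for(user_id, 3)].append(user_id)

print(shards)  # every key lands on exactly one of the three shards
```

Because the hash is deterministic, a reader can later locate `user-4` by recomputing `shard_for("user-4", 3)` instead of searching every node.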

(3) Big data computing models. Driven by the diverse requirements of big data processing, several typical computing models have emerged, including big data query and analytical computing (e.g., Hive), batch computing (e.g., Hadoop MapReduce), stream computing (e.g., Storm), iterative computing (e.g., HaLoop), graph computing (e.g., Pregel), and in-memory computing (e.g., HANA). Hybrids of these models will become an effective means of meeting diverse big data processing and application needs.
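To contrast stream computing with batch computing, the following plain-Python sketch (not Storm) processes events one at a time and keeps per-key state that can be queried at any moment, instead of waiting for a complete data set:

```python
from collections import Counter

class StreamingCounter:
    """Stream-style computation: update state per event, query anytime."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, key: str):
        """Incorporate one event; the result is usable immediately after."""
        self.counts[key] += 1

    def top(self, n: int):
        """Current top-n keys by count, without waiting for end of input."""
        return self.counts.most_common(n)

stream = StreamingCounter()
for event in ["click", "view", "click", "click", "view", "buy"]:
    stream.on_event(event)  # results are up to date after every event

print(stream.top(2))  # [('click', 3), ('view', 2)]
```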

(4) Big data analysis and mining. While data volumes expand rapidly, deep analysis and mining are still required, and the demand for automated analysis keeps rising; more and more big data analysis tools and products have emerged in response, such as the RHadoop packages for big data mining and data-mining algorithms built on MapReduce.
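As a small, self-contained taste of such mining algorithms, here is a deterministic one-dimensional k-means clustering sketch in plain Python (the starting centroids are fixed for reproducibility; real libraries choose them differently and work in many dimensions):

```python
def kmeans_1d(points, centroids, iterations=10):
    """Tiny 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups of values; start with deliberately poor centroids.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 8.0])
print(centroids)  # [2.0, 11.0]
```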

(5) Big data visual analytics. Visualization helps people explore and interpret complex data, which helps decision makers extract the commercial value hidden in the data and in turn promotes the development of big data. Many companies are conducting research in this area, trying to bring visualization into their various data analysis and presentation products, and related products will keep appearing. The successful IPO of the visualization tool Tableau reflects the demand for big data visualization.

(6) Big data security. While we use big data analysis and data mining to extract business value, hackers may well be attacking us and collecting useful information. Big data security has therefore long been a research focus for both industry and academia. Techniques such as file access control to restrict operations on data, device-level encryption, anonymization, and cryptographic protection are being used to protect data to the greatest possible extent.
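A minimal example of the anonymization idea mentioned above: replacing direct identifiers with salted hashes, so records can still be linked consistently without exposing the raw values. This is only a sketch of pseudonymization, not a complete privacy solution (the field names and salt here are invented for illustration):

```python
import hashlib

def pseudonymize(record, fields, salt):
    """Replace the named sensitive fields with salted SHA-256 pseudonyms."""
    out = dict(record)
    for field in fields:
        raw = (salt + str(out[field])).encode("utf-8")
        out[field] = hashlib.sha256(raw).hexdigest()[:16]
    return out

record = {"name": "Alice", "phone": "13800000000", "city": "Beijing"}
safe = pseudonymize(record, fields=["name", "phone"], salt="s3cret")

print(safe["city"])             # non-sensitive fields pass through unchanged
print(safe["name"] != "Alice")  # True: the identifier was replaced
```

Because the same salt yields the same pseudonym for the same value, two cleaned data sets can still be joined on `name` without either side ever seeing the real names.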

The growth of the Internet is the biggest driver of big data

By June 2014, China had 632 million Internet users, 14.42 million more than at the end of 2013, with an Internet penetration rate of 46.9%; in other words, nearly half of all Chinese were using the Internet. The Internet's growth rate exceeded many people's expectations: four years earlier, in June 2010, the penetration rate was 31.8%, so in just four years it rose by more than 15 percentage points.


More importantly, CNNIC data also show that by June 2014, 83.4% of Chinese Internet users went online by mobile phone, surpassing for the first time the overall usage rate of traditional PCs (80.9%). The mobile phone's position as the leading Internet access device was further consolidated; phones make going online possible anytime and anywhere, and mobile Internet use has penetrated ever deeper into people's daily work and life.

As a result, the spread of the Internet has made user behavior more diverse, and the data generated through the Internet is growing ever faster and becoming ever more representative. Product information in the online world, pictures and text on social media, video on video sites, person-to-person interaction data, location data, and the like have become the most important and fastest-growing sources of big data.



Origin blog.csdn.net/tttttt012/article/details/91471203