Why am I against big data?

Key points:

  • Now that the era of big data has arrived, we should use big-data thinking to explore the potential value of big data
  • In China, most companies do not have much data
  • The reality is often that data can only verify the present; it cannot foresee the future
  • Any technology that does not start from solving a business problem is just fooling around
  • Data is often not as valuable as we think, especially data that is easy to collect on the Internet
  • Big data should evolve gradually from small data. That is the normal ecology, not an overnight change.

Statement

These are personal opinions; this article only looks at big data from a different angle. If you disagree, please laugh it off. There is no need for pointless arguments.


Introduction

Many people love to talk about big data, but ask them what big data is and what it has to do with them, and few can give a clear answer.

One reason is that people have a deep-seated appetite for new technologies, or at least do not want to seem out of touch when chatting. The other is that there are too few cases, at work or in daily life, where people can actually take part in big data practice.


1. What is big data

McKinsey was the first to declare the arrival of the big data era, stating:

"Data has been used in every industry and business function area today and has become an important production factor. People's mining and application of massive data heralds the arrival of a new wave of productivity growth and consumer surplus."


IBM first summarized the characteristics of big data into four "Vs", namely:

  • Volume: the amount of data is huge. Big data is measured in units of at least P (1,000 T), E (1 million T), or Z (1 billion T).
  • Variety: there are many types of data, for example web logs, videos, pictures, and geographic location information.
  • Value: low value density, but high commercial value.
  • Velocity: fast processing speed. This is also what fundamentally distinguishes it from traditional data mining.

In fact, these Vs can't really explain all the characteristics of big data. The following figure provides an effective illustration of some relevant characteristics of big data.


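Viktor Mayer-Schönberger cites many examples in the book "The Age of Big Data", all to illustrate one point:

Now that the era of big data has arrived, we should use big-data thinking to explore the potential value of big data.

The example the author returns to most often is how Google mines the secondary value of people's search records, for instance to predict flu outbreak trends in a given region;

how Amazon uses customers' purchase and browsing history to make targeted book recommendations, effectively boosting sales;

and how Farecast uses ten years of discounted fare data across all routes to predict whether now is a good time to buy a ticket.


The book says that the core of big data is prediction. This involves three shifts in thinking:

  • Not random samples, but all the data;
  • Not exactness, but messiness;
  • Not causation, but correlation.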


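2. Analysis of the current situation

According to the "Communiqué on the Fourth National Economic Census (No. 2)", released by the National Bureau of Statistics on November 20, 2019:

At the end of 2018, there were 18.570 million enterprise legal-person units in the secondary and tertiary industries nationwide, 10.362 million more than at the end of 2013, an increase of 126.2%. Of these, domestically funded enterprises accounted for 98.8%, enterprises with investment from Hong Kong, Macao, and Taiwan for 0.6%, and foreign-invested enterprises for 0.6%. Among domestically funded enterprises, state-owned enterprises accounted for 0.4% of all enterprise legal-person units and private enterprises for 84.1%. (See the figure below.)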
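As a quick sanity check on the census figures quoted above, the 2013 baseline and the growth rate can be recomputed from the two published numbers. This is only a back-of-the-envelope sketch; the variable names are illustrative, and the private-enterprise count is simply the 84.1% share applied to the 2018 total.

```python
# Figures quoted from the census communique, in units of 10,000 enterprises.
total_2018 = 1857.0
increase_since_2013 = 1036.2

# The 2013 baseline is implied by the two figures above.
total_2013 = total_2018 - increase_since_2013
growth_pct = increase_since_2013 / total_2013 * 100

print(f"Implied 2013 total: {total_2013:.1f} (x10,000)")
print(f"Growth from 2013 to 2018: {growth_pct:.1f}%")

# Rough count of private enterprises, applying the quoted 84.1% share.
private_2018 = total_2018 * 0.841
print(f"Private enterprises in 2018: about {private_2018:.0f} (x10,000)")
```

The recomputed growth rate agrees with the 126.2% in the communiqué, and the private share works out to roughly 15.6 million enterprises.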

[Figure: enterprise legal-person units by registration type]

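These figures show that the vast majority of Chinese enterprises are small and medium-sized. Under these circumstances, how many of them have massive amounts of data?

Let us look at it from another angle and check the website rankings of a few typical domestic customers (the lookup site is Alexa).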


A well-known domestic financial software company:

[Figure: Alexa ranking screenshot]

A well-known domestic IT solutions and services company:

[Figure: Alexa ranking screenshot]

A leading domestic network security company:

[Figure: Alexa ranking screenshot]

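As these show, Yonyou has the most page views (PV), yet that is only about 63,000 a day, or roughly 23 million records a year. Even after adding other data, the total is on the order of gigabytes, far from terabytes, let alone petabytes.

At this scale, one decent PC server can handle most of the requirements; if reliability is a concern, two at most are needed.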
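The estimate above can be reproduced in a few lines. The daily page-view figure comes from the text; the size of one log record (500 bytes) is purely an assumption for illustration, and decimal units (1 GB = 10^9 bytes) are used to match the "P = 1,000 T" convention earlier in the article.

```python
DAILY_PAGE_VIEWS = 63_000   # page views per day, as quoted in the text
BYTES_PER_RECORD = 500      # assumption: average size of one access-log record

records_per_year = DAILY_PAGE_VIEWS * 365
bytes_per_year = records_per_year * BYTES_PER_RECORD

GB = 10**9
TB = 10**12

print(f"Records per year: {records_per_year:,}")
print(f"Log volume per year: {bytes_per_year / GB:.1f} GB")
print(f"Years needed to reach 1 TB: {TB / bytes_per_year:.0f}")
```

Even if each record were ten times larger than assumed here, a year of traffic would still be far below one terabyte.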

From the analysis above we can see that in China, the vast majority of companies do not have much data.


3. The core value of big data


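"The Age of Big Data" says that the core value of big data is prediction. Yet when we talk about big data, we usually talk about big data technologies such as Hadoop, Spark, Storm, HBase, and Hive, and people never tire of discussing them.


But the reality is often that data can only verify the present; it cannot foresee the future!

Take a recent example:

Big data tells us that a stock market crash is inevitably followed by a rebound. So after the plunge on June 25, everyone assumed there would be a rebound on Friday. Instead, the big players taught everyone a harsh lesson that Friday.

After the double cut on June 28 (to interest rates and the reserve requirement ratio), everyone said the market would rise on Monday, June 29. But on Monday the big players in China's market showed retail investors that data and experience are only wishful thinking, and that the market will not give you a moment to catch your breath.


Any technology that does not start from solving a business problem is just fooling around. Computer technology develops very quickly, and a technology is often replaced or upgraded before long.

Without a business scenario to support it, studying big data technology purely for its own sake has little value. I believe in learning in order to apply.

One obvious trait of the brain is forgetfulness. If you never use these technologies, you will forget them soon after learning them, so you might as well hold off and learn them when you actually need them (principles and fundamentals excepted).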


4. Is data really valuable?


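Data is often not as valuable as we imagine, especially data that is easy to collect on the Internet. Take web crawlers, for example.

I did not really understand them at first, but after spending a little time I basically got the hang of it. Whether you write one yourself in Python or use one of the many ready-made tools, you can deploy it and start collecting very quickly.

China has a large number of programmers, plus computer hobbyists who know a bit of programming, and crawler software lets even a beginner learn to collect data with little effort. So the barrier to collection keeps falling.

In addition, data is cheap because it can be copied, especially unstructured data. The sheer volume of reposted articles on the Internet today shows how cheap it is to spread and copy knowledge.


It is the use of data that has value. For example:

If a boss has dozens of scattered figures put in front of him every day, but nobody tells him how the behavioral data relates to the business data, what use is that?

For a company CEO, looking at dozens of figures every day, such as PV, PU, and UV, is meaningless.

All they need to know is: Is there a problem? What is the problem? Are there any new findings? What needs to be done? That is enough.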


5. The big data bubble


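Professor Jordan of Berkeley has given an answer. He is one of the most respected machine learning experts in the world (the translation below is by the Zhihu author Quinn Sure).

The results that big data currently produces are too unreliable. Rushing to apply them in practice is like building a bridge before mastering civil engineering: the result can only be a shoddy "tofu-dreg project".

  1. A big wave of false positives is approaching, because the growth of data cannot keep up with our desire to apply big data everywhere. As a science it is not rigorous enough (the original wording is "no error bars"). Civil engineering, after many years of accumulated experience, can tell us clearly when a bridge can be built and when it cannot. Big data cannot.

  2. Progress in computer vision is still small. Recognition only works within a very limited scope, in very specific applications such as face recognition. (Although this is not directly about big data, it shows that in his view we are still far from putting sensors on everything, and the capacity to collect big data remains limited.)
  3. Neural networks are simply not the same thing as the neural networks of the human brain. Our understanding of the brain is nowhere near the point where it can be applied to computer science.

The back propagation technique that deep learning now uses is clearly not how the brain works, and the network structures are completely different. Claims that its handling of fuzzy data has reached the level of the human brain are mostly media nonsense.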
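The worry about false positives can be made concrete with a little arithmetic. The sketch below is not Jordan's own calculation; it assumes a set of independent tests in which no real effect exists, each run at a conventional 5% significance level, and shows how quickly spurious "discoveries" pile up as more hypotheses are tried. The function names are illustrative.

```python
def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of spurious hits when every null hypothesis is true."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests: int, alpha: float) -> float:
    """Chance of at least one spurious hit, assuming independent tests."""
    return 1 - (1 - alpha) ** num_tests


ALPHA = 0.05  # assumption: conventional 5% significance level

for num_tests in (1, 20, 1000):
    expected = expected_false_positives(num_tests, ALPHA)
    at_least_one = prob_at_least_one_false_positive(num_tests, ALPHA)
    print(f"{num_tests:>5} tests: expect {expected:.1f} false positives, "
          f"P(at least one) = {at_least_one:.3f}")
```

Under these assumptions, testing 20 unrelated hypotheses already gives a better-than-even chance of at least one false finding, which is why results reported without error bars deserve caution.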


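A summary of his views:

  • To make things easier for the public to understand, some media outlets have used analogies, but these analogies have caused too many misunderstandings and, in turn, too much hype.
  • Big data is still a science without sufficient rigor. It may have some probability of making useful predictions, but used improperly, or relied on too heavily and too early, it can lead to disastrous consequences.
  • People often get excited about a technology too early and hope it will change the world. If it delivers no results in the short term, the enthusiasm may suddenly cool, people decide it was a mistake, and resources are pulled out of the technology all the faster.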

Clearly, Michael is worried that the public's enthusiasm for this technology is not grounded in an understanding of it and may go through exactly this kind of swing in attitude. He believes the field is real and that many important applications will create value over time. But much of the current media hype, and even investment activity, is a bubble.


6. Start with small data

So what should we do?


Start from small data. Small data is individualized data: the digital information of each individual or organization.

For example, I have a couple of drinks every day, and one day I suddenly get a stomachache after drinking. So I ask myself: how was this day different from the others? It turns out that the wine I drank that day was a new brand, so perhaps the new brand caused the stomachache.
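The reasoning in this example amounts to comparing the unusual day against the ordinary ones and looking for what changed. A minimal sketch of that comparison follows; the log entries, field names, and brand names are all hypothetical.

```python
# Hypothetical personal log: one record per day.
daily_log = [
    {"day": "Mon", "brand": "Brand A", "glasses": 2, "stomachache": False},
    {"day": "Tue", "brand": "Brand A", "glasses": 2, "stomachache": False},
    {"day": "Wed", "brand": "Brand A", "glasses": 2, "stomachache": False},
    {"day": "Thu", "brand": "Brand B", "glasses": 2, "stomachache": True},
]


def unusual_attributes(log, outcome_key="stomachache", ignore=("day",)):
    """Return (day, attribute, value) for values seen only on bad days."""
    good_days = [record for record in log if not record[outcome_key]]
    bad_days = [record for record in log if record[outcome_key]]
    findings = []
    for record in bad_days:
        for key, value in record.items():
            if key == outcome_key or key in ignore:
                continue
            # Keep the value only if it never appeared on a good day.
            if all(good[key] != value for good in good_days):
                findings.append((record["day"], key, value))
    return findings


print(unusual_attributes(daily_log))
```

With only four records this points at the new brand as the one thing that changed. It suggests where to look rather than proving a cause, which matches the "maybe" in the example.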

This is the "small data" in my life. It is not as vast and complicated as big data, but it matters a great deal to me.

The biggest problem facing many companies is not how to use big data, but that they are trying to use big data before they have made good use of their small data.


Big data should evolve gradually from small data. That is the normal ecology, not an overnight change.

First of all, we must understand what the core of our own business and industry is. In competition, many companies are defeated not by their current competitors but by players they never regarded as competitors.

A very simple example: everyone thinks of Amazon as an e-commerce company, but that is not the whole picture. A large share of its profit now comes from cloud services.

So finding the core data of the enterprise is the most critical step. Only on that basis should you use and analyze the data, and then extend from there.

Secondly, start with internal data and grow it slowly. It is a bit like a snowball: the first layer is the core, the second layer is related peripheral data, the third layer is structured data from external agencies, and the fourth layer is social data and the various kinds of so-called unstructured data.

You need to work through these layers one at a time and find the valuable things that relate to you. Only then can your data actually be put to use.



Origin blog.51cto.com/14153008/2665721