Infrastructure 3.0: The Cornerstone of the Machine Learning Revolution

Like it or not, we have entered the age of machine learning and artificial intelligence. The combination of massive amounts of data, cheap storage, elastic computing, and algorithmic advances (especially deep learning) has brought applications once confined to science fiction and the movies into everyday life. Turning those former "illusions" into reality owes a great deal to the continuous upgrading of the underlying infrastructure. From 1.0 to 2.0 to 3.0, that infrastructure keeps evolving, carrying us into the era of machine learning and AI.

Machines have surpassed human beings at complex strategy games, and advances in machine learning, speech processing, and other areas have begun to make us wonder whether our earlier definitions of uniquely human traits were too simplistic. Voice-based personal assistant applications are now ubiquitous, and fully autonomous vehicles appear to be on the road in the near future.

With these recent advances, much of the discussion around ML/AI has focused on breakthroughs in algorithms and their applications. Far less attention, however, has been paid to the infrastructure on which these intelligent systems actually run.

Just as in the early days of computing, when building even a simple application required experts in assembly language, compilers, and operating systems, creating and deploying AI systems at scale today requires deep expertise in statistics and distributed systems. The scarcity of tools for applying ML/AI has kept it an expensive discipline reserved for a small group of elite engineers.

This situation is also tied to the lagging development of infrastructure. So far, innovation in machine learning techniques has far outpaced the development of the infrastructure beneath them. In short, the systems and tools currently available for machine learning applications fall far short of what future AI development will demand.

To unlock the enormous potential of ML/AI, an entirely new toolchain that developers and businesses can operate and use is essential. The next big opportunity in infrastructure, therefore, may well be supporting intelligent systems.

From 1.0 to 2.0 to 3.0

From 1.0 to 2.0 and beyond, applications and infrastructure have moved forward in tandem.

With advances in hardware and system software, new applications keep emerging, fueling a virtuous cycle of innovation at the infrastructure level. Better, faster, and cheaper building blocks have given users unprecedented experiences at the application layer, from punch cards, to Pong, to PowerPoint, to the picture app Pinterest.

The commercial Internet of the late 1990s and early 2000s was built on the x86 instruction set (Intel), standardized operating systems (Microsoft), relational databases (Oracle), Ethernet networking (Cisco), and networked data storage (EMC). The earliest iterations of Amazon, eBay, Yahoo, Google, and Facebook were all built on this stack, which we call Infrastructure 1.0.

However, as the Internet matured and the number of users grew from 16 million in 1995 to more than 3 billion by the end of 2015, the scale and performance requirements of applications changed. For large Internet companies, the technologies developed in the client-server era were no longer technically or economically viable for running their businesses.

At this point, these companies began combining strong engineering with parallel-computing research from academia and from Google, Facebook, and Amazon to define a new category of scalable, programmable, (often) open-source infrastructure. Technologies such as Linux, KVM, Xen, Docker, Kubernetes, Mesos, MySQL, MongoDB, Kafka, Hadoop, and Spark defined the cloud era, which we call Infrastructure 2.0.

This generation of technology extended the Internet to billions of end users and efficiently stored the information captured from them. In doing so, the innovations of Infrastructure 2.0 set off a dramatic explosion of data.

Combined with near-limitless parallel computation and algorithmic advances, this generation of infrastructure laid the foundation for today's machine learning era.

Infrastructure 3.0: Towards Intelligent Systems

The defining question Infrastructure 2.0 set out to answer was: How do we connect the world? The question has now become: How do we make sense of the world?

This difference between connectivity and cognition is what distinguishes ML/AI from previous generations of software. The computational challenge of encoding cognition is that it upends the traditional programming paradigm. In a traditional application, logic is hand-coded to perform a specific task; in ML/AI, an algorithm is instead trained to infer the logic from data, and that learned logic is then executed to make decisions and predictions.
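To make the contrast concrete, here is a minimal sketch in Python. The spam-filter task, the toy data, and the use of scikit-learn are illustrative assumptions, not details from any system discussed here.

```python
# Traditional application: the logic is hand-coded for a specific task.
def is_spam_rule_based(message: str) -> bool:
    # A developer enumerates the decision rules explicitly.
    return "free money" in message.lower() or "winner" in message.lower()

# ML application: an algorithm is trained to infer the logic from data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = ["free money now", "lunch at noon?", "you are a winner", "meeting notes attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy training data)

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(messages, labels)               # the logic is learned from examples
print(model.predict(["win free money"]))  # the learned logic makes a prediction
```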

However, these "intelligent", data-intensive applications are computationally expensive, which makes ML/AI a poor fit for the general-purpose von Neumann computing paradigm. Instead, ML/AI represents a fundamentally new architecture, one that requires rebuilding infrastructure, tooling, and development practices.

Yet to date, ML/AI research and innovation have concentrated on new algorithms, model-training techniques, and optimizations. Only a small fraction of the code in an ML/AI system does the learning or the predicting; most of the complexity lies in data preparation, feature engineering, and operating the distributed-systems infrastructure needed to carry out these tasks at scale.

Successfully building and deploying ML/AI requires a complex, carefully orchestrated workflow involving multiple discrete systems. First, data must be ingested, cleaned, and labeled. Then the right attributes on which predictions will depend, known as features, must be identified. Finally, developers must train the model, validate it, and keep refining it. From start to finish, this process can take months, even for experienced teams.
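The workflow above can be compressed into a short sketch. Everything here, the synthetic dataset, the feature selection, and the choice of a random-forest model, is a hypothetical illustration, not a prescription.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Ingest and label data (a synthetic stand-in for a real labeled dataset).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(1000, 10))
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(int)

# 2. Feature engineering: decide which attributes predictions will rely on.
features = X_raw[:, :5]  # keep a subset of columns as the feature set

# 3. Train, validate, and iterate on the model configuration.
X_train, X_val, y_train, y_val = train_test_split(features, y, random_state=0)
best = 0.0
for n_trees in (10, 50, 100):  # iterate over configurations to refine the model
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    best = max(best, accuracy_score(y_val, model.predict(X_val)))
print(f"best validation accuracy: {best:.3f}")
```

In production, each numbered step is its own system with its own failure modes, which is exactly where the complexity described above accumulates.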

For ML/AI to realize its full potential, it has to keep advancing beyond this state of affairs. In practice, that means developers need new interfaces, systems, and tools that make it far easier to build intelligent applications.

These are not trivial changes. On the contrary, they are disruptive, foundational shifts in system design and in the development workflow.

Accordingly, a wave of new ML/AI-optimized platforms and tools will emerge, for example:

Specialized hardware, in the mold of processor chips, with many compute cores and high-bandwidth memory (HBM). These chips are optimized for the highly parallel numerical computation neural networks require: high-speed, low-precision floating-point operations.

Distributed computing frameworks for training and inference that can efficiently scale model operations across many nodes (a toy sketch of the data-parallel pattern they implement appears after this list).

Data and metadata management systems that provide reliable, uniform, and reproducible pipelines for creating and managing both training and prediction data.

Extremely low-latency serving infrastructure that enables machines to act intelligently, and quickly, on real-time data and context.

Tools for model interpretation, QA, debugging, and observability, used to monitor, learn from, and optimize models and applications at scale.

End-to-end platforms that manage the whole ML/AI workflow, such as the in-house systems Michelangelo at Uber and FBLearner at Facebook, as well as commercial products like Determined AI.
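As promised above, here is a toy, single-process illustration of the data-parallel pattern that distributed training frameworks implement: each worker computes a gradient on its own data shard, and the updates are averaged (the all-reduce step) before being applied. The linear model and all constants are illustrative assumptions, not any particular framework's API.

```python
import numpy as np

def gradient(w, X, y):
    # Gradient of mean squared error for the linear model y ≈ X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=1024)

# Partition the data across four simulated worker nodes.
n_workers = 4
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

w = np.zeros(4)
for step in range(200):
    # Each "node" computes a gradient on its local shard...
    grads = [gradient(w, X_shard, y_shard) for X_shard, y_shard in shards]
    # ...then the updates are combined (all-reduce) and applied everywhere.
    w -= 0.1 * np.mean(grads, axis=0)

print(np.round(w, 2))  # approximately recovers the true weights
```

Real frameworks add what this sketch omits: communication across machines, fault tolerance, and the overlap of computation with network transfer.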

The past decade saw the emergence of the cloud-native stack; over the next few years, we will watch a vast ecosystem of infrastructure and tools spring up around ML/AI.
