Is there no real machine learning platform?

What exactly are these platforms?

I can understand why major technology companies are working to build machine learning platforms. After all, a major technology supplier that makes no move in the AI field may soon be forgotten by the market. But what exactly are these platforms, and why is the market competition so fierce?

To answer this question, the key is to recognize how machine learning and data science projects differ from the typical application or hardware development projects of the past. Previously, hardware and software development focused on the functions of a system or application. By contrast, data science and machine learning projects put more emphasis on managing data, continuously learning from data, and iteratively evolving data models. From a data-centric point of view, traditional development processes and platforms simply do not work well in these new scenarios. Therefore, a new kind of platform is needed.

What is a machine learning platform?

Whoever can truly simplify the creation, training, and iteration of machine learning models will win this competition.

In fact, machine learning platforms and data science platforms overlap. For example, both apply data science techniques and machine learning algorithms to large data sets in order to develop machine learning models. The tools that data scientists use every day are quite similar to those used by scientists and engineers who focus on machine learning. However, similar does not mean identical. The actual needs of machine learning scientists and engineers still differ somewhat from those of general data scientists and engineers.

Generally speaking, the people responsible for managing machine learning projects not only need to manage notebooks and their ecosystem and handle collaboration across notebooks, but also need to coordinate various machine learning algorithms, libraries, and infrastructure, and then train these algorithms on large and constantly evolving data sets. An ideal machine learning platform helps machine learning engineers, data scientists, and data engineers understand which machine learning method is most effective and how to tune hyperparameters. It also lets them deploy computationally intensive training on on-premises or cloud-based CPU, GPU, or TPU clusters, and provides the ecosystem needed to manage and monitor supervised and unsupervised training.

Clearly, developing and managing machine learning models requires the kind of collaborative, interactive visualization system that data science platforms provide, but for a machine learning platform such support is far from enough. As mentioned above, a core challenge in keeping a machine learning system working properly is setting and tuning hyperparameters.

Conceptually, a machine learning model learns a set of parameters from data. In other words, what the model actually learns is those parameters, which it then uses to fit new data. Hyperparameters, by contrast, are configurable values that are set before training and cannot be learned from the data. They directly affect factors such as model complexity and learning speed. Different machine learning algorithms require different combinations of hyperparameters, and care should be taken to avoid unnecessary ones. In this regard, machine learning platforms can help discover, set, and manage hyperparameters, and in particular offer algorithm selection and comparison functions that non-machine-learning data science platforms lack.
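
To make the distinction concrete, here is a minimal sketch in plain Python, not taken from any particular platform. The weight `w` is a parameter learned from the data, while `learning_rate` and `steps` are hyperparameters fixed before training starts. The toy data set, the function names `train` and `grid_search`, and the candidate values are all assumptions chosen for illustration.

```python
from itertools import product

# Toy data generated by y = 2x, so the ideal learned parameter is w = 2.
XS = [1.0, 2.0, 3.0, 4.0]
YS = [2.0, 4.0, 6.0, 8.0]


def mse(w, xs, ys):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(learning_rate, steps, xs=XS, ys=YS):
    """Learn the parameter w by gradient descent.

    learning_rate and steps are hyperparameters: they are chosen
    before training and are never updated from the data.
    """
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w


def grid_search(learning_rates, step_counts):
    """Train once per hyperparameter combination and keep every result."""
    results = []
    for lr, steps in product(learning_rates, step_counts):
        w = train(lr, steps)
        results.append(
            {"learning_rate": lr, "steps": steps, "w": w, "loss": mse(w, XS, YS)}
        )
    best = min(results, key=lambda r: r["loss"])
    return best, results


best, all_results = grid_search([0.001, 0.01, 0.1], [10, 100])
print(best)
```

A real platform would typically run such a search across many algorithms and on a cluster, but the bookkeeping follows the same idea: record each configuration together with its score, then compare.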

What qualities should it have?

Ultimately, what machine learning project managers want is tools that improve their own productivity. However, machine learning projects are complex and varied, and each has different needs. Some focus on conversational systems, some emphasize recognition or predictive analytics, and others focus on reinforcement learning or autonomous systems.

In addition, models differ in how they are deployed (or operationalized). Some run in the cloud or on on-premises servers, while others are deployed on edge devices or used in offline batch processing. These differences in how data scientists, data engineers, machine learning engineers, and developers apply and deploy machine learning make the idea of a single machine learning platform impractical. Such a platform would end up as a "jack of all trades, master of none."

Therefore, there are currently four different kinds of platform on the market: the first focuses on the needs of data scientists and model builders; the second emphasizes big data management and data engineering; the third is oriented toward model "building" and systems that interact with models; and the fourth is used for model lifecycle management, that is, "machine learning operations." To truly deliver on the promise of the machine learning platform, developers need to work hard in all four areas.

Figure: Four application environments of AI

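Whoever can truly simplify the creation, training, and iteration of machine learning models will win this race. With the help of such powerful solutions, users can quickly and easily move from clumsy, non-intelligent systems to systems that use machine learning to solve problems that were previously unsolvable. By contrast, data science platforms that cannot adapt to the needs of machine learning will be relegated. Likewise, big data platforms with native data engineering capabilities will come out as winners in the market. Future application development tools will also need to treat machine learning models as a primary part of the lifecycle. In short, machine learning operations is only just emerging, and it is bound to become the next big thing in the industry over the coming years.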

What is a data science platform?

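The job of data scientists is to sort useful information out of massive amounts of data and to translate business and operational information into the language of data and mathematics. They need to master statistics, probability, mathematics, and algorithms in order to gather useful insights from large volumes of information. Data scientists are also responsible for forming hypotheses about the data, running tests and analyses, and then turning the results into a form that people across the organization can easily view and understand.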

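A pure data science platform should therefore meet the following requirements: help build data models, determine which hypothesis best fits the current information, test hypotheses, facilitate collaboration among teams of data scientists, and support the management and development of data models as the information keeps changing.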

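In addition, data scientists do not do most of their work in code-centric integrated development environments (IDEs). Instead, notebooks are their home ground. The notebook concept was first introduced by math-centric academic platforms such as Mathematica and MATLAB, and it is now very popular in the Python, R, and SAS communities. In essence, a notebook records the results of data research and lets users rerun it against different source data, which makes results easier to reproduce. A good notebook should serve as a shared, collaborative environment in which a team of data scientists can work together and iterate on models using constantly evolving data sets. Although notebooks are not an ideal environment for developing code, they provide strong support for collaborating on, exploring, and visualizing data. In fact, given sufficient access to clean data, data scientists will not hesitate to use notebooks to browse large data sets quickly.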

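Without access to large amounts of clean data, however, data scientists' work gets stuck. Clearly, extracting, cleaning, and moving data is not the data scientist's responsibility; that work should be done by data engineers. The main challenge for data engineers is to extract data in structured and unstructured formats from all kinds of systems, and this data is often not "clean": fields are missing, data types do not match, and there are many other problems related to the form of the data.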
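
As a rough illustration of the two problems named above, missing fields and mismatched data types, the following sketch validates raw records against a small schema. It uses only the Python standard library; the schema, the field names, and the `clean_record` helper are hypothetical and stand in for whatever a real data engineering pipeline would use.

```python
def clean_record(raw, schema):
    """Cast each expected field to its target type.

    Returns (cleaned, problems): the successfully converted fields
    and a list of human-readable issues found in the raw record.
    """
    cleaned, problems = {}, []
    for field, caster in schema.items():
        value = raw.get(field)
        if value is None or value == "":
            problems.append(f"missing field: {field}")
            continue
        try:
            cleaned[field] = caster(value)
        except (TypeError, ValueError):
            problems.append(f"type mismatch: {field}={value!r}")
    return cleaned, problems


# Hypothetical schema: field name -> type the downstream model expects.
SCHEMA = {"user_id": int, "age": int, "country": str}

raw_rows = [
    {"user_id": "17", "age": "34", "country": "DE"},       # clean
    {"user_id": "18", "age": "unknown", "country": "FR"},  # wrong type
    {"user_id": "19", "country": ""},                      # missing fields
]

clean_rows, rejected = [], []
for row in raw_rows:
    cleaned, problems = clean_record(row, SCHEMA)
    if problems:
        rejected.append((row, problems))
    else:
        clean_rows.append(cleaned)

print(clean_rows)
print(rejected)
```

Keeping the rejected rows together with their reasons, rather than silently dropping them, is one possible design choice; it lets data engineers see which upstream systems produce the most problems.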

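From this perspective, data engineers are the engineers responsible for designing, building, and arranging data. A good data science platform should also help data scientists easily draw on more computing power as their needs grow. It should let them start working without copying data sets to a local machine, so that they always have the simplest, most convenient access to compute and data sets. To achieve this, a data science platform naturally also needs to provide the necessary data engineering capabilities. In short, a practical data science platform should combine a range of data science and data engineering capabilities.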

What is everyone fighting over?

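Without a doubt, technology vendors of all sizes are focusing on platform development. After all, data scientists and machine learning project managers must rely on these platforms to develop, run, operate, and manage the data models used in the enterprise.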

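For these vendors, the machine learning platforms of the future are like the operating systems, cloud environments, and mobile development platforms of the past and present. A vendor that captures market share in data science and machine learning platforms can reap rich rewards for decades to come.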

As a result, every participant in this emerging battle hopes to seize as large a share of the market as possible.

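So, when a vendor claims in its marketing to have an artificial intelligence or machine learning platform, we might ask one more question: "Which kind of platform?" By now it should be clear that there is more than one kind of machine learning platform, each aimed at different practical needs. A little more thought helps ensure that market hype does not lead us to trust the wrong vendor or choose the wrong product.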



Origin blog.51cto.com/15060462/2677036