Choice of Machine Learning Research and Development Platform

I have recently started learning machine learning and data mining. I have read the blogs of several technical experts and learned a lot from them, and I am sharing here what I found interesting. This article is reprinted on this blog after re-typesetting. The original article is: Choice of Machine Learning Research and Development Platform.

Machine learning is flourishing at present, but if you want to learn or research it and then use it in a production environment, choosing the platform, development language, and machine learning libraries takes a lot of thought. Here are some suggestions based on my own machine learning experience, for reference only.

The first thing to consider is how to choose a platform. That depends on whether you want to use it in a production environment (that is, in a specific product) or only for research.

1. Building a machine learning platform for a production environment

If the platform is for a production environment, you first need to estimate how much data the product will have to analyze. If the volume is large, you need a big data platform; otherwise a stand-alone (single-machine) platform is enough.
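As a rough illustration, the estimate can be sketched in a few lines of Python. The function names, the 50% memory headroom, and the sample volumes below are all assumptions chosen for illustration; they are not figures from a real project.

```python
def estimate_dataset_gib(rows_per_day: int, bytes_per_row: int, retention_days: int) -> float:
    """Rough size of the data to analyze, in GiB (hypothetical helper)."""
    return rows_per_day * bytes_per_row * retention_days / 2**30


def suggest_platform(dataset_gib: float, machine_ram_gib: float, headroom: float = 0.5) -> str:
    """Suggest 'stand-alone' if the data fits in a fraction of one machine's RAM.

    The headroom factor is an assumed safety margin, not an established rule.
    """
    budget = machine_ram_gib * headroom
    return "stand-alone" if dataset_gib <= budget else "big data"


# Two made-up workloads, compared against one machine with 64 GiB of RAM.
small = estimate_dataset_gib(rows_per_day=100_000, bytes_per_row=200, retention_days=365)
large = estimate_dataset_gib(rows_per_day=50_000_000, bytes_per_row=500, retention_days=365)

print(f"{small:.1f} GiB ->", suggest_platform(small, machine_ram_gib=64))
print(f"{large:.1f} GiB ->", suggest_platform(large, machine_ram_gib=64))
```

A real estimate would also account for growth, intermediate results, and feature expansion, so treat the threshold as a starting point only.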

1.1 Building a big data machine learning platform for production

The most mainstream big data platform in production is Spark, together with a cluster resource manager such as YARN or Mesos. If you need to collect online data in real time, add Kafka. In short, a general big data processing platform integrates Spark + YARN (or Mesos) + Kafka. The product projects I am working on now are all based on Spark + YARN + Kafka, and this combination is currently the mainstream choice.

Of course, some people will say that integrating so many open source components is troublesome and bound to be full of pitfalls. Is there a general platform that provides the same big data functions as Spark + YARN + Kafka? As far as I know, the best candidate at present is CDAP. It integrates Spark, YARN, Kafka, and other mainstream open source data processing software, and developers only need to build on the API layer it wraps around them. This seems like a good idea, but we had not seen any successful commercial deployments, so we did not consider CDAP when selecting our architecture.

Therefore, a big data platform built around Spark + YARN + Kafka is still the first choice. Because Spark MLlib's selection of machine learning algorithms is limited and not always easy to use, if your product needs algorithms that MLlib lacks, you will have to find open source implementations yourself.

1.2 Building a stand-alone machine learning platform for production

If the amount of data in the production environment is not large, a big data platform is over-engineered, and we have more choices. The first choice is still Spark, but without the YARN resource manager and without Kafka for distributing data. Why is Spark still the first choice? Because we have to consider expansion: a small amount of data today does not mean the amount will stay small in the future. This is why I chose Spark for some of the small data analysis projects I was involved in. Another reason is that Spark supports Python, Java, Scala, and R, which lowers the barrier to entry for many programmers. In the Spark projects I am involved in, the development languages are mainly Java and Scala. Python was not chosen, partly for speed reasons and partly because the rest of the system is written in Java.

The second option is the family of Python tools built around scikit-learn, including NumPy, SciPy, pandas, Matplotlib, and others. Its strength is a rich set of libraries, especially scikit-learn's own machine learning library, which covers practically every tool you are likely to need. In addition, because you can write programs interactively, it is convenient for developing prototypes quickly. Two projects I took part in used scikit-learn during the feasibility analysis stage, both for prototyping and for demos to customers.
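To give a sense of how little code such a prototype needs, here is a minimal sketch using scikit-learn's bundled iris dataset. The dataset, the random forest model, and the 70/30 split are assumptions picked for illustration; they are not taken from the projects mentioned above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for evaluation, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit an off-the-shelf classifier with mostly default settings.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score the model on the held-out samples.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping in another estimator usually means changing only the line that constructs the model, which is a large part of why this stack suits quick feasibility checks.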

Therefore, for a single-machine machine learning platform in production, Spark is the first choice for products, while the scikit-learn family is well suited to rapid prototyping and validation.

2. Building a machine learning platform for a research environment

If you are only doing research, there are many choices, of which three are mainstream.

The first is to learn with Spark MLlib. The advantage is that what you learn carries over seamlessly to the production environment, but the disadvantages are also obvious. Spark is a large system, and running it on a single machine consumes a lot of memory and is relatively slow. Moreover, MLlib's library is not rich, and for many algorithms you have to find a library yourself. According to feedback from colleagues, this makes it harder to get started, so I personally do not think Spark MLlib is a good choice for learning machine learning.

The second is the family of Python tools built around scikit-learn, including NumPy, SciPy, pandas, Matplotlib, and others, as mentioned above. The advantages are the large number of libraries, powerful APIs that let you focus on data analysis, and plenty of examples, so it is not hard to learn. The disadvantage is that it takes a while to become proficient with so many Python libraries. Personally, I recommend this approach. Among my colleagues, scikit-learn is also the mainstream choice for learning and discussion.

The third is an R-based platform for machine learning (excluding SparkR), with RStudio as the main tool. R is a long-established language with rich APIs for data processing and machine learning, which makes it especially attractive to people who have worked as data analysts. But R is a relatively closed language, its community is far less active than Python's, and programmers tend to find its syntax uncomfortable. A few years ago R was generally considered better than Python for machine learning, but Python has since left R far behind. So unless you already know R, I do not recommend using it to study machine learning. By the way, I do not mean to disparage R here.

In conclusion, if you want to study machine learning and do not have a particular R background, scikit-learn is your first choice. Of course, some people will say: I prefer to implement machine learning algorithms myself, bit by bit, rather than call a library directly. Is that not an option? Of course it is, and it is a very good way to deepen your understanding of each algorithm. It just takes considerably more time. If, like me, you do not have much time, it is more straightforward to call the API directly and get on with studying the data.
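For readers who do want to try the do-it-yourself route, here is a small sketch of what implementing an algorithm by hand can look like: a k-nearest-neighbours classifier in plain Python. The function name `knn_predict` and the toy data are made up for this illustration; scikit-learn provides the same technique ready-made as `KNeighborsClassifier`.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query, keep the k closest.
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )[:k]
    # Count the labels of those neighbours and return the most common one.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters labelled "a" and "b".
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))  # prints "a"
print(knn_predict(points, labels, (5.0, 4.8)))  # prints "b"
```

Writing even a simple version like this makes the role of the distance metric and of k very concrete, at the cost of the time that a library call would have saved.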


