Data science notebooks in one article

Editor's note:

This article introduces what a notebook is, why notebooks matter in data science, and the key factors data scientists and algorithm teams should consider when choosing one. It also offers a preliminary comparison of common notebooks along those selection dimensions, as a reference for data scientists and algorithm engineers.

A notebook is a web-based interactive computing environment in which users can develop code, write documentation, run programs, display results, and share their work. Compared with traditional non-interactive development environments, the defining feature of a notebook is that scripts can be executed cell by cell. Notebooks are a vital tool in data science: data scientists use them for experiments and exploratory work, and in recent years, with the growth of big data, non-technical users such as business analysts have begun using notebooks as well.

01. The core advantages of Notebook

In a traditional non-interactive development environment, the program a developer writes is compiled into an executable, and the executable is then run in its entirety. If an error occurs, the developer must return to the editor, modify the code, and run the whole program again.

In a notebook, developers write and run programs cell by cell. When an error occurs, they only need to fix and re-run the cell that failed; the state produced by cells that already ran correctly is kept in memory and does not need to be recomputed, which greatly improves development efficiency. Notebooks are therefore very popular among data scientists and algorithm engineers, and are widely used in AI algorithm development and training. Take deep learning experiments as an example: model training typically takes several hours to more than ten hours. When debugging in a notebook, a minor change does not force retraining of the entire model, which saves data scientists and algorithm engineers a great deal of time.
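This time saving can be illustrated outside a notebook with a small Python sketch (the `train_model` function and its simulated cost are invented for illustration): an expensive step is computed once and kept in memory, so re-running a cheap downstream step does not repeat it, just as a notebook keeps the results of already-executed cells.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def train_model(seed: int) -> float:
    """Stand-in for an expensive training step (cost simulated with sleep)."""
    time.sleep(0.2)            # pretend this takes hours in real life
    return seed * 0.1          # pretend this is the trained model's score

# "Cell 1": the expensive step runs once.
start = time.perf_counter()
score = train_model(42)
first_run = time.perf_counter() - start

# "Cell 2": tweaked downstream code reuses the in-memory result instantly.
start = time.perf_counter()
rounded = round(train_model(42), 2)
second_run = time.perf_counter() - start

print(second_run < first_run)  # the cached call is far faster
```

In a real notebook the kernel's namespace plays the role of the cache: as long as the kernel is alive, variables from earlier cells stay available.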

02. Basic structure of Notebook

The earliest notebook was Mathematica, launched in 1988. Early notebooks were used mainly in academia. As notebooks have moved from academia into industry over the past decade, more and more products have appeared on the market: open-source Jupyter and Apache Zeppelin; commercially hosted Colab, JetBrains Datalore, and IDP Studio**; and multi-language notebooks such as Polynote.

Although there are many kinds of notebooks, they share two core components:

  • The front-end client, an ordered list of input/output cells into which users enter code, text, and more.
  • The backend kernel, which can run locally or in the cloud. Code is passed from the front end to the kernel; the kernel executes it and returns the results to the user. The kernel determines a notebook's computing performance. IDP Studio rewrote its kernel in Rust, improving notebook startup and resource-provisioning speed by an order of magnitude.

(** In this article, "IDP Studio" refers only to the notebook interactive programming environment within IDP Studio. Other IDP Studio features, such as model management and model publishing, are beyond the scope of this article.)
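The client/kernel split described above can be sketched in a few lines of Python (this toy kernel is an illustration only, not how Jupyter or IDP Studio is actually implemented): the kernel holds a persistent namespace, receives source strings from the front end, executes them, and reports results, so state survives across cells.

```python
class ToyKernel:
    """A toy notebook kernel: executes code cells in one shared namespace."""

    def __init__(self):
        self.namespace = {}          # state shared by all cells

    def execute(self, cell_source: str):
        """Run one cell, returning ('ok', None) or ('error', message)."""
        try:
            exec(cell_source, self.namespace)
            return ("ok", None)
        except Exception as err:
            return ("error", repr(err))

kernel = ToyKernel()
print(kernel.execute("x = 10"))        # ('ok', None)
print(kernel.execute("y = x + oops"))  # error: 'oops' is undefined
print(kernel.execute("y = x + 5"))     # the fixed cell still sees x
print(kernel.namespace["y"])           # 15
```

Note that after the failing cell, only that cell is corrected and re-run; `x` from the first cell is still in the namespace, mirroring the behavior described in section 01.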

03. How to choose a suitable Notebook

Different notebooks have their own strengths, and data scientists and algorithm engineers need to choose the most appropriate notebook for their core requirements. Based on interviews with a large number of data scientists, we have summarized the four core concerns data scientists weigh when choosing a notebook, which algorithm developers and data-mining practitioners can use as selection criteria.

1) Complete basic functionality and ease of use

Installation and deployment: For data scientists new to notebooks, commercially hosted notebooks (such as IDP Studio, Colab, and JetBrains Datalore) follow a SaaS model and work out of the box, making them easy to install and get started with. Open-source notebooks must be installed by the user: local installation is usually straightforward, but installing and running them on a remote server can be quite challenging.

Version management: Both algorithm models and algorithm interfaces need constant updating and optimization, so version management is crucial. Notebooks differ in the completeness and ease of use of their version-management features. For example, open-source products such as Jupyter support version management through Git; IDP Studio and others support Git while also providing built-in version management that automatically saves historical versions; Colab does not yet support version management.
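The "automatically save historical versions" behavior can be approximated with a few lines of Python (the file names and checkpoint directory here are invented for illustration): every save also writes a timestamped copy, which is roughly what notebook checkpointing does under the hood.

```python
import shutil
import time
from pathlib import Path

def save_with_checkpoint(notebook: Path, checkpoint_dir: Path) -> Path:
    """Copy the notebook into a timestamped checkpoint and return its path."""
    checkpoint_dir.mkdir(exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    checkpoint = checkpoint_dir / f"{notebook.stem}-{stamp}{notebook.suffix}"
    shutil.copy2(notebook, checkpoint)
    return checkpoint

# Hypothetical usage: keep a history for analysis.ipynb.
nb = Path("analysis.ipynb")
nb.write_text('{"cells": []}')                  # stand-in notebook content
cp = save_with_checkpoint(nb, Path(".checkpoints"))
print(cp.read_text() == nb.read_text())         # checkpoint matches the file
```

Git, by contrast, stores full history with diffs and branching, which is why notebooks that combine Git support with automatic checkpoints cover both casual and rigorous versioning needs.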

Language support: The most common languages in machine learning and data science are Python, SQL, and R, with Python far in the lead: according to Kaggle's 2021 survey of more than 25,000 data scientists, 84% use Python. All common notebooks support Python well, but their depth of support for SQL and R, the second and third most common languages, varies. Data scientists should therefore consider whether a notebook natively supports the languages they rely on. For example, Jupyter supports Python, Julia, and R well, but using SQL requires installing and configuring plug-ins yourself; IDP Studio natively supports Python and SQL in depth, but does not yet support other common languages; and if you need good support for Scala and several other languages, consider Polynote.


2) Efficiency improvement

Beyond basic functionality, data scientists care about whether a notebook can reduce non-core work and improve development efficiency.

Code assistance: Code assistance saves developers significant time and improves efficiency. Its main forms include code completion, error prompts, quick fixes, and jump-to-definition. Open-source tools have rich ecosystems and generally rely on third-party plug-ins for code assistance; commercially hosted products have it built in, though with differing focus and performance, code completion being the most common feature. IDP Studio has the most comprehensive code-assistance features and comparatively better speed and performance, although its completion for some third-party libraries still needs improvement.
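Code completion of the kind described above can be demonstrated with Python's standard library (`rlcompleter` is the simple completer behind the interactive interpreter; full-featured notebooks use richer engines): given a prefix typed by the user, the completer proposes matching attribute names from the live namespace.

```python
import json  # the module whose attributes we will complete against
import rlcompleter

# Complete against a namespace, the way a kernel completes user code.
completer = rlcompleter.Completer({"json": json})

# Collect every proposal for the prefix "json.du".
matches = []
i = 0
while True:
    m = completer.complete("json.du", i)
    if m is None:
        break
    matches.append(m)
    i += 1

print(matches)  # e.g. proposals starting with 'json.dump' and 'json.dumps'
```

Because the completer inspects live objects, it can suggest names that static analysis would miss, which is one reason kernel-side completion works well in notebooks.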

Access to data sources: Data is the cornerstone of a data scientist's daily work. Data sources are usually scattered across many places, which makes easy data access both challenging and crucial. Data scientists should choose a notebook based on how their own data sources are distributed. Currently, the open-source Jupyter and Zeppelin require data scientists to configure access themselves; Colab only supports data access from Google Drive; IDP Studio integrates with mainstream data sources, letting users connect with one click.
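What "configuring data access yourself" looks like in practice can be sketched with the standard library (the table and query here are invented; a real setup would point at a warehouse or data lake rather than an in-memory SQLite database, and would involve credentials and drivers):

```python
import sqlite3

# An in-memory SQLite database stands in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", 3.0), ("bob", 5.5), ("alice", 1.5)],
)

# The kind of exploratory query a data scientist runs from a notebook cell.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 4.5), ('bob', 5.5)]
```

Notebooks with built-in data-source integration hide the connection boilerplate above, which is exactly the convenience the comparison in this section is about.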

Environment management: Mature data scientists and algorithm teams place high importance on convenient environment setup and management. They want to configure environments quickly, and to build and manage consistent environments that can be shared between individuals and across teams. Notebooks vary in their support for environment configuration and reuse; generally speaking, Datalore, which natively supports team collaboration, is slightly more usable for environment management. Users can choose based on their own environment-management needs.
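One half of environment management, capturing a reproducible snapshot, can be sketched with the standard library (`importlib.metadata` lists installed distributions; a real team would pin these lines in a requirements.txt or use conda or Docker for full consistency):

```python
from importlib import metadata

def snapshot_environment() -> list[str]:
    """Return installed packages as 'name==version' lines, requirements-style."""
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]          # skip entries with broken metadata
    )

# A teammate can recreate the environment from this snapshot.
for line in snapshot_environment()[:5]:
    print(line)
```

Hosted notebooks that manage environments for you effectively maintain and share such a snapshot automatically, which is why their environment consistency is easier to keep across a team.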


3) Accelerate collaboration

Cross-team collaborative analysis: Algorithms and business analysis are increasingly intertwined. Algorithm developers want to share results with business staff as interactive visual reports, enabling efficient collaborative analysis between algorithm teams and business teams. Data scientists with strong cross-team collaboration needs should look at notebooks such as Datalore and IDP Studio, newly launched this year, which emphasize team collaboration in their positioning.
       Collaborative programming: Beyond cross-team collaboration, notebook sharing, real-time collaborative editing, and commenting have become increasingly prominent needs among data scientists, with demand currently appearing stronger overseas. Common notebooks support collaborative programming to some degree, but differ in real-time performance and ease of use.


4) Cost

Cost is usually an important consideration for data scientists and algorithm engineers, but for notebook selection we believe it matters relatively less than performance and ease of use, because even commercial notebooks usually offer a free basic edition for individual users.

We are glad to see notebooks becoming more popular and gradually serving as a communication bridge between algorithm teams and business teams. Notebooks are also finding wider industrial application, giving data scientists solid support for algorithm development, experimentation, and exploration.

For more technical content, please follow: Baihai IDP
 


Origin blog.csdn.net/iamonlyme/article/details/132799945