Smile: a Java scientific computing library

Official website

https://haifengl.github.io/

GitHub

https://github.com/haifengl/smile

Introduction

The Statistical Machine Intelligence and Learning Engine, or Smile for short, is a promising modern machine learning system that is similar in some respects to Python's scikit-learn. It is developed in Java and also provides a Scala API. The library stands out for its speed, broad applicability, efficient memory usage, and large collection of machine learning algorithms for classification, regression, nearest neighbor search, feature selection, and more. (As of 2023-07-07, the project is still being updated.)

Supported algorithms:

Usage:

Still studying it; my English proficiency is only average.

Introduction to libraries in other scientific fields

Top 15 Scala Libraries for Data Science in 2018


In a previous article, we discussed the top Python libraries for data science. This time we will focus on Scala, which has recently become another important language for data scientists. Its popularity is mainly due to the rise of Spark, a widely used big data processing engine that is written in Scala and therefore provides a native Scala API.

We won't compare Scala and Python in depth here, but it's important to note that, unlike Python, Scala is a compiled language. As a result, Scala code typically runs much faster than pure Python (though not necessarily faster than specialized libraries such as NumPy).

Writing in Scala is much more enjoyable than writing in Java because the same logic can often be expressed in fewer lines. Scala is no less capable than Java and even has some more advanced features. Java veterans may raise plenty of counterarguments here, but in our view Scala is better suited to data science tasks.

Currently, Python and R are still the leading languages for fast data analysis and for building, exploring, and manipulating powerful models, while Scala is becoming a key language for developing big data products, because such products require stability, flexibility, high speed, scalability, and so on. Typically, the analysis and modeling are done in Python during the research phase and then implemented in Scala for production.

For your convenience, we have prepared an overview of the most important libraries for machine learning and data science tasks in Scala. We'll draw analogies with the corresponding Python tools to make some important aspects easier to understand. In fact, only one top-level comprehensive tool forms the basis for data science and big data solution development in Scala: Apache Spark. It is complemented by a large number of libraries and tools written in both Scala and Java. Let's take a closer look at them.

Data Analysis and Mathematics

1. Breeze (commits: 3316, contributors: 84)

Breeze is considered the main scientific computing library for Scala. It draws inspiration from MATLAB's data structures and Python's NumPy classes. Breeze provides fast and efficient operations on data arrays, and supports the implementation of many other operations, including:

  • Matrix and vector operations for creating, transposing, filling with numbers, performing element-wise operations, inverting, computing determinants, and many more options for almost any need.

  • Probability and statistics functions, from statistical distributions and descriptive statistics (such as mean, variance, and standard deviation) to Markov chain models. The main packages used for statistics are breeze.stats and breeze.stats.distributions.

  • Optimization, which means finding a local or global minimum of a function. The optimization methods live in the breeze.optimize package.

  • Linear algebra: All basic operations rely on the netlib-java library, making Breeze's algebraic calculations extremely fast.

  • Signal processing operations, which are necessary for working with digital signals. Important examples in Breeze are convolution and the Fourier transform, which decomposes a given function into a sum of sine and cosine components.
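To make the Fourier transform item above more concrete, here is a minimal sketch of a naive discrete Fourier transform in plain Java. It does not use Breeze and is not how Breeze implements it; the class name NaiveDft and the test signal are assumptions made purely for illustration.

```java
// Illustrative only: a naive O(n^2) discrete Fourier transform in plain Java.
// This is NOT Breeze's API; the class and method names are hypothetical.
class NaiveDft {
    /** Returns the magnitude of each frequency bin of a real-valued signal. */
    static double[] magnitudes(double[] signal) {
        int n = signal.length;
        double[] mags = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0.0;
            double im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += signal[t] * Math.cos(angle); // cosine component
                im -= signal[t] * Math.sin(angle); // sine component
            }
            mags[k] = Math.hypot(re, im);
        }
        return mags;
    }

    public static void main(String[] args) {
        int n = 8;
        double[] signal = new double[n];
        for (int t = 0; t < n; t++) {
            // A cosine that completes two cycles over the 8-sample window.
            signal[t] = Math.cos(2 * Math.PI * 2 * t / n);
        }
        double[] mags = magnitudes(signal);
        for (int k = 0; k < n; k++) {
            System.out.printf("bin %d: %.3f%n", k, mags[k]);
        }
    }
}
```

For this input, the energy concentrates in bins 2 and 6 (the frequency and its mirror image), which is the "sum of sine and cosine components" view of the signal. A real library would use a fast Fourier transform instead of this quadratic loop.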

Breeze also offers plotting capabilities, which we'll discuss below.

2. Saddle (commits: 184, contributors: 10)

Another data manipulation toolkit for Scala is Saddle. It is the Scala analog of R's data frames and Python's pandas library. Like data frames in pandas or R, Saddle is based on a Frame structure (a 2D indexed matrix).

There are five main data structures in total, namely: 

  • Vec (one-dimensional vector)

  • Mat (two-dimensional matrix)

  • Series (1D indexed matrix)

  • Frame (2D indexed matrix)

  • Index (similar to hashmap)

The Vec and Mat classes form the base of Series and Frame. You can perform different operations on these data structures and use them for basic data analysis. Another advantage of Saddle is its robustness to missing values.
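The idea of an indexed vector that tolerates missing values can be sketched in a few lines of plain Java. This is only an illustration of the concept: the TinySeries class and its methods are hypothetical and are not Saddle's API, and NaN is assumed here as the marker for a missing value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a tiny "indexed vector" in the spirit of a Series.
// Class and method names are hypothetical and are NOT Saddle's API.
class TinySeries {
    // Insertion-ordered map from index key to value.
    private final Map<String, Double> data = new LinkedHashMap<>();

    TinySeries put(String key, double value) {
        data.put(key, value);
        return this;
    }

    /** Looks a value up by its index key; NaN marks a missing value. */
    double get(String key) {
        return data.getOrDefault(key, Double.NaN);
    }

    /** Mean over the present values only, skipping NaN entries. */
    double mean() {
        double sum = 0.0;
        int count = 0;
        for (double v : data.values()) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        TinySeries prices = new TinySeries()
                .put("mon", 10.0)
                .put("tue", Double.NaN) // missing observation
                .put("wed", 14.0);
        System.out.println(prices.get("wed"));
        System.out.println(prices.mean());
    }
}
```

The mean is computed over the two present values only, so the missing Tuesday entry does not poison the result with NaN.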

3. ScalaLab (commits: 23, contributors: 1)

ScalaLab is Scala's interpretation of MATLAB's computational capabilities. Moreover, ScalaLab can directly call and access the results of MATLAB scripts.

The main difference from the previous computing libraries is that ScalaLab uses its own domain-specific language called ScalaSci. ScalaLab provides easy access to various scientific Java and Scala libraries, so you can easily import data and then use different methods for manipulation and calculation. Most techniques are similar to those in Breeze and Saddle. As with Breeze, there are also plotting capabilities that help interpret the resulting data.

Natural Language Processing

4. Epic (commits: 1790, contributors: 15) & 5. Puck (commits: 536, contributors: 1)

Scala has some great natural language processing libraries as part of ScalaNLP, including Epic and Puck. These libraries are mainly used as text parsers. If you need to parse thousands of sentences, Puck is the more convenient choice because of its high speed and GPU usage. Epic is also known as a prediction framework that uses structured prediction to build complex systems.

Visualization

6. Breeze-viz (Commits: 29, Contributors: 3)

As the name suggests, Breeze-viz is a plotting library for Scala built on Breeze. It is based on the well-known Java charting library JFreeChart and has a MATLAB-like syntax. Although Breeze-viz has far fewer capabilities than MATLAB, matplotlib in Python, or R, it can still be very helpful when developing and building new models.

7. Vegas (Commits: 210, Contributors: 14)

Another Scala library for data visualization is Vegas. It is much more powerful than Breeze-viz and supports plotting specifications such as filtering, transformation, and aggregation. It is similar in structure to Python's Bokeh and Plotly.

Vegas provides declarative visualizations, allowing you to focus on specifying what should be done with the data and on analyzing the resulting visualizations, without worrying about how the code is implemented.

Machine Learning

8. Smile (Commits: 1019, Contributors: 21)

The Statistical Machine Intelligence and Learning Engine, or Smile for short, is a promising modern machine learning system that is similar in some respects to Python's scikit-learn. It is developed in Java and also provides a Scala API. The library stands out for its speed, broad applicability, efficient memory usage, and large collection of machine learning algorithms for classification, regression, nearest neighbor search, feature selection, and more.

 

9. Apache Spark MLlib and ML 

The MLlib library is built on top of Spark and provides a variety of machine learning algorithms. It is written in Scala and also provides powerful APIs for Java, Python, and R, but the Scala API is the most flexible. The library consists of two separate packages: MLlib and ML. Let's look at them one by one in more detail.

  • MLlib is an RDD-based library containing core machine learning algorithms for classification, clustering, and unsupervised learning, supported by tools for basic statistics such as correlation, hypothesis testing, and random data generation.

  • ML is a newer library that, unlike MLlib, operates on dataframes and datasets. The main purpose of this library is to provide the ability to build pipelines of different transformations on the data. A pipeline can be viewed as a series of stages, where each stage is either a Transformer (which transforms one dataframe into another) or an Estimator (an algorithm that can fit a dataframe to produce a Transformer).

Each package has its advantages and disadvantages, and in practice, it turns out that applying both is often more effective.
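The Transformer/Estimator distinction described for ML pipelines can be sketched in plain Java as follows. This is not Spark code: the Stage, Transformer, and Estimator interfaces and the MiniPipeline class are hypothetical stand-ins, and a plain double[] is assumed in place of a dataframe column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: the Transformer/Estimator idea in plain Java.
// These types are hypothetical stand-ins and are NOT Spark's API;
// a double[] stands in for a dataframe column.
interface Stage {}

interface Transformer extends Stage {
    double[] transform(double[] column);
}

interface Estimator extends Stage {
    Transformer fit(double[] column);
}

class MiniPipeline {
    /** An Estimator: learns the column mean and yields a Transformer that centers data. */
    static final Estimator CENTERING = column -> {
        double sum = 0.0;
        for (double v : column) sum += v;
        final double mean = column.length == 0 ? 0.0 : sum / column.length;
        return input -> {
            double[] out = new double[input.length];
            for (int i = 0; i < input.length; i++) out[i] = input[i] - mean;
            return out;
        };
    };

    /** A plain Transformer: it needs no fitting. */
    static final Transformer DOUBLING = input -> {
        double[] out = new double[input.length];
        for (int i = 0; i < input.length; i++) out[i] = 2 * input[i];
        return out;
    };

    /** Fits each Estimator on the data produced by the stages before it. */
    static List<Transformer> fit(List<Stage> stages, double[] training) {
        List<Transformer> fitted = new ArrayList<>();
        double[] current = training;
        for (Stage stage : stages) {
            Transformer t = (stage instanceof Estimator)
                    ? ((Estimator) stage).fit(current)
                    : (Transformer) stage;
            current = t.transform(current);
            fitted.add(t);
        }
        return fitted;
    }

    /** Applies the already-fitted stages, in order, to new data. */
    static double[] apply(List<Transformer> fitted, double[] column) {
        double[] current = column;
        for (Transformer t : fitted) current = t.transform(current);
        return current;
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(CENTERING, DOUBLING);
        List<Transformer> model = fit(stages, new double[] {1.0, 2.0, 3.0});
        System.out.println(Arrays.toString(apply(model, new double[] {4.0})));
    }
}
```

The point of the split is that fitting happens once on training data, after which the fitted pipeline is just a chain of transformers that can be applied to any new data with the learned parameters (here, the training mean).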

10. DeepLearning.scala (commits: 1647, contributors: 14)

DeepLearning.scala is an alternative machine learning toolkit that provides efficient solutions for deep learning. It uses mathematical formulas to create complex dynamic neural networks through a combination of object-oriented and functional programming. The library uses a wide range of types as well as applicative type classes. The latter allow multiple computations to be started at the same time, which we consider critical for a data scientist's work. It is worth mentioning that the library's neural networks are ordinary programs and support all Scala features.

11. Summingbird (Commits: 1772, Contributors: 31)

Summingbird is a domain-specific data processing framework that allows integrated batch and online MapReduce computation as well as hybrid batch/online processing modes. The main catalyst for designing the language came from Twitter developers who often had to write the same code twice: first for batch processing and then again for online processing.

Summingbird consumes and produces two types of data: streams (infinite sequences of tuples) and snapshots that are considered the complete state of the dataset at a point in time. Finally, Summingbird provides platform implementations of Storm, Scalding, and an in-memory execution engine for testing purposes.
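The "write the logic once, run it in batch and online" idea can be illustrated with a small plain-Java sketch. It is not Summingbird code: the WordCountJob class, its methods, and the sample sentences are assumptions chosen for illustration, with a word count standing in for the aggregation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: one aggregation, two execution modes.
// Plain Java with hypothetical names; this is NOT Summingbird's API.
class WordCountJob {
    /** The shared logic: fold one sentence into a running word-count map. */
    static void accumulate(Map<String, Integer> counts, String sentence) {
        for (String word : sentence.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    /** Batch mode: the whole dataset is available up front. */
    static Map<String, Integer> runBatch(List<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String s : sentences) accumulate(counts, s);
        return counts;
    }

    /** Online mode: state is kept between events that arrive one at a time. */
    static final class Online {
        private final Map<String, Integer> counts = new HashMap<>();

        void onEvent(String sentence) {
            accumulate(counts, sentence);
        }

        /** A copy of the state at this point in time. */
        Map<String, Integer> snapshot() {
            return new HashMap<>(counts);
        }
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("to be or not to be", "to do");
        Online online = new Online();
        for (String s : data) online.onEvent(s);
        // Both modes reuse accumulate(), so they agree on the result.
        System.out.println(runBatch(data).equals(online.snapshot()));
    }
}
```

Because both modes call the same accumulate function, the counting logic is written only once, which is the duplication problem the paragraph above describes.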

12. PredictionIO (commits: 4343, contributors: 125)

Of course, we cannot ignore PredictionIO, a machine learning server for building and deploying prediction engines. It is built on top of Apache Spark, MLlib, and HBase, and was even voted the most popular Apache Spark-based machine learning product on GitHub. It enables you to easily and efficiently build, evaluate, and deploy engines, implement your own machine learning models, and incorporate them into your engines.

Additional

13. Akka (commits: 21430, contributors: 467)

Developed by the company that created Scala, Akka is a concurrency framework for building distributed applications on the JVM. It uses an actor-based model, where actors represent objects that receive messages and take appropriate actions. Akka replaces the functionality of the Actor class provided in previous Scala versions.

The main difference (and also considered the most important improvement) is the additional layer between the actor and the underlying system, which only requires the actor to process the messages, while the framework handles all other complexities. All actors are arranged hierarchically, thereby creating an actor system that helps actors interact with each other more efficiently and solve complex problems by dividing them into smaller tasks.
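The core of the actor model (private state plus a mailbox that is processed one message at a time) can be sketched in plain Java. This is not Akka code: the CounterActor class, its tell/ask/stop methods, and the string messages are hypothetical, and a single-threaded executor is assumed as a stand-in for the mailbox.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: the core actor idea in plain Java.
// Names are hypothetical; this is NOT Akka's API.
class CounterActor {
    // A single-threaded executor plays the role of the mailbox: messages are
    // queued and handled sequentially, so 'count' needs no locking.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private int count = 0;

    /** Fire-and-forget send: enqueue the message and return immediately. */
    void tell(String message) {
        mailbox.execute(() -> receive(message));
    }

    /** The actor's behavior: react to one message at a time. */
    private void receive(String message) {
        if ("increment".equals(message)) {
            count++;
        } else if ("reset".equals(message)) {
            count = 0;
        }
    }

    /** Request-reply: the query is queued behind all earlier messages. */
    int ask() {
        try {
            return mailbox.submit(() -> count).get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("no reply from actor", e);
        }
    }

    /** Stops accepting new messages and lets the worker thread exit. */
    void stop() {
        mailbox.shutdown();
    }

    public static void main(String[] args) {
        CounterActor actor = new CounterActor();
        actor.tell("increment");
        actor.tell("increment");
        System.out.println(actor.ask());
        actor.stop();
    }
}
```

Callers never touch the counter directly; they only send messages, and the ordering guarantees of the mailbox take care of concurrency. A real actor framework adds supervision hierarchies, remote messaging, and scheduling across many actors on top of this idea.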

14. Spray (Commits: 2663, Contributors: 74)

Now let's look at Spray, a set of Scala libraries for building REST/HTTP web services on top of Akka. It provides asynchronous, non-blocking, actor-based, high-performance request processing, while its internal Scala DSL is used to define web service behavior and offers efficient and convenient testing capabilities.

UPD: Spray is no longer maintained and has been superseded by Akka HTTP. While most of the library's functionality is still there, this shift brought some changes and improvements in streams, module structure, the routing DSL, and so on. The migration guide will help you understand all the changes.

15. Slick (Commits: 1940, Contributors: 92)

Last but not least on our list is Slick, which stands for Scala Language-Integrated Connection Kit. It is a library for creating and executing database queries that supports a variety of databases such as H2, MySQL, and PostgreSQL. Some databases are available through slick-extensions.

For building queries, Slick provides a powerful DSL that makes the code look as if you were working with Scala collections. Slick supports both plain SQL queries and strongly typed joins of multiple tables. In addition, simple subqueries can be combined to construct more complex queries.
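To give a feel for what "looks like working with collections" means, here is a plain-Java analogy using the standard streams API over in-memory rows. It is not Slick code and involves no database: the Coffee class, the sample data, and the SQL in the comment are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: a collection-style query over in-memory rows.
// The Coffee class and data are hypothetical; this is NOT Slick's API.
// A query DSL would translate a similar-looking expression into SQL
// instead of running it in memory.
class CollectionStyleQuery {
    static final class Coffee {
        final String name;
        final double price;

        Coffee(String name, double price) {
            this.name = name;
            this.price = price;
        }
    }

    /** Roughly: SELECT name FROM coffees WHERE price < ? ORDER BY name */
    static List<String> namesCheaperThan(List<Coffee> coffees, double maxPrice) {
        return coffees.stream()
                .filter(c -> c.price < maxPrice) // WHERE
                .map(c -> c.name)                // SELECT
                .sorted()                        // ORDER BY
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Coffee> coffees = Arrays.asList(
                new Coffee("Espresso", 9.99),
                new Coffee("Colombian", 7.99),
                new Coffee("French Roast", 8.99));
        System.out.println(namesCheaperThan(coffees, 9.0));
    }
}
```

The benefit of this style is that filters and projections are checked by the compiler, so a misspelled column or a type mismatch is caught before the query ever runs.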

Conclusion

In this article, we've outlined some Scala libraries that are useful when performing major data science tasks. They have proven to be very helpful and effective in achieving the best results. You can also view activity statistics for each provided repository, fetched from GitHub.

Source: Google spreadsheet

Please note that the list above is not comprehensive; there are many other tools suitable for different use cases. If you have had a positive experience with any other useful Scala library or framework that deserves to be added to this list, feel free to share it in the comments section below.

Thank you very much for your attention and cooperation!

Origin blog.csdn.net/qq_26408545/article/details/131596655