The rat race reaches high school: a 16-year-old student wrote a C++ machine learning library from scratch, with 13,000+ lines of code

Source: Heart of the Machine

Is it now the trend for high school students to save the world through artificial intelligence?

A teenager who loves computers can already accomplish a lot by the age of 16: developing a Cantonese programming language, winning a Kaggle competition, writing a game, developing a cryptocurrency investment bot, building a C++ machine learning library from scratch, and so on.

Today we introduce a 16-year-old (@novak-99) who built a C++ machine learning library from scratch. His post presenting the project has received hundreds of upvotes on Reddit.

The library he built (ML++) is over 13,000 lines of code and covers topics such as statistics, linear algebra, numerical analysis, machine learning, and deep learning.

Project address: https://github.com/novak-99/MLPP

@novak-99 said he built the library because C++ is his language of choice, yet C++ is rarely used on the front end of ML.

C++ is efficient and well suited to fast execution, so most libraries (such as TensorFlow, PyTorch, or NumPy) use C/C++ or a C/C++-derived language for optimization and speed.

But when he looked at front-end implementations of machine learning algorithms, he noticed that most were written in Python, MATLAB, R, or Octave. He believes C++ is less used on the ML front end mainly because of the lack of user support and the complex syntax of C++.

Compared to Python, C++ has very few machine learning frameworks. Even in popular frameworks such as PyTorch or TensorFlow, the C++ implementation is not as complete as the Python one: documentation is lacking, not all major functions are present, and few people are willing to contribute.

Additionally, several key libraries of Python's ML suite are not available in C++: neither pandas nor Matplotlib supports C++. This increases the time needed to implement ML algorithms, because data visualization and data analysis are harder to come by.

Therefore, he decided to write a C++ machine learning library himself.

He also noted that because ML algorithms can be implemented so easily, some engineers may overlook the implementation and mathematical details behind them. This can cause problems, since customizing an ML algorithm for a specific use case is impossible without knowing those details. So in addition to the library, he plans to release comprehensive documentation explaining the mathematical background of each machine learning algorithm in the library, covering topics such as statistics, linear regression, Jacobians, and backpropagation.

Opening the project, we can see some of the details:

Covering 19 major topics, ML++ is large and comprehensive

Like most frameworks, the ML++ library created by this high school student is dynamic and constantly changing. This is especially important in the world of machine learning, where new algorithms and techniques are developed every day.

Currently, the following models and techniques are being developed in the ML++ library:

  • Convolutional Neural Network (CNN)
  • Kernels for Support Vector Machines (SVM)
  • Support Vector Regression

Overall, the ML++ library includes 19 topics and related subdivisions, as follows:

  • Regression (Linear Regression, Logistic Regression, Softmax Regression, Exponential Regression, Probit Regression, Cloglog Regression, Tanh Regression)
  • Deep, dynamically sized neural networks (activation functions, optimization algorithms, loss functions, regularization methods, weight initialization methods, learning rate schedulers)
  • Prebuilt Neural Networks (Multilayer Perceptrons, Autoencoders, Softmax Networks)
  • Generative Modeling (Tabular Generative Adversarial Networks)
  • Natural language processing (Word2Vec, stemming, bag-of-words model, TF-IDF, auxiliary text processing functions)
  • Computer Vision (convolution operations, max/min/average pooling, global max/min/average pooling, prebuilt feature detectors)
  • Principal component analysis
  • Naive Bayes Classifier (Multinomial Naive Bayes, Bernoulli Naive Bayes, Gaussian Naive Bayes)
  • Support vector classification (primal formulation, dual formulation)
  • K-Means algorithm
  • K-nearest neighbors algorithm
  • Outlier Finder (using standard scores)
  • Matrix decompositions (SVD decomposition, Cholesky decomposition, QR decomposition)
  • Numerical analysis (numerical differentiation, Jacobian vector calculator, Hessian matrix calculator, function approximator, differential equation solver)
  • Mathematical Transform (Discrete Cosine Transform)
  • Linear Algebra Module
  • Statistics module
  • Data processing modules (feature scaling, mean normalization, one-hot representation, reverse one-hot representation, supported color space conversions)
  • Utilities (TP/FP/TN/FN functions, precision, recall, accuracy, F1 score)

Please refer to the original project for more details.

Netizens: What am I supposed to do?

Seeing such an excellent project built by a 16-year-old, some netizens could not help but sigh: what are high school students doing these days?! "At their age I was still biting my fingers, and they are already publishing papers at conferences such as ICLR and NeurIPS..."

Some netizens said that if high school students are doing these things, imagine how intense PhD applications will be in a few years. Today you need 3+ NeurIPS papers; in the future you may need a Turing Award.

It sounds like a joke, but to some extent it does reflect how intense the competition ("involution") has become.

However, some netizens pointed out that the project has 13,000 lines of code but no tests. Another netizen argued that it is a hobby project not meant for practical use cases, so testing is not important here.

Reference link:

https://www.reddit.com/r/MachineLearning/comments/srbvnc/p_c_machine_learning_library_built_from_scratch/

Origin blog.csdn.net/osfront/article/details/123041399