InfoWorld announces the best open source software of 2023

InfoWorld has announced its 2023 Best Open Source Software list, which recognizes the year's leading open source tools for software development, data management, analytics, artificial intelligence and machine learning.

InfoWorld is an international technology media brand aimed at IT decision-makers working at the forefront of technology. Every year it selects the Best of Open Source Software Awards (the "Bossies"), a program that has run for more than ten years, based on each project's contribution to the open source community and its influence in the industry.

The 25 projects on the list cover programming languages, runtimes, application frameworks, databases, analytics engines, machine learning libraries, large language models (LLMs), tools for deploying LLMs, and more. The details are as follows (each project name links to its introduction page):

Apache Hudi

When building an open data lake or lakehouse, many industries need a platform that is more scalable and can evolve with changing data. Take an advertising platform serving publishers, advertisers, and media buyers: fast analytics alone is not enough. InfoWorld notes that Apache Hudi not only provides a fast data format, tables, and SQL, but also enables low-latency, real-time analytics. It integrates with Apache Spark and Apache Flink, as well as with tools such as Presto, StarRocks, and Amazon Athena. In short, if you want real-time analytics on a data lake, Hudi is an excellent choice.
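For illustration, here is a minimal PySpark sketch of writing and querying a Hudi table. The table name, record key, paths, and columns are hypothetical, the matching Hudi Spark bundle must be on Spark's classpath, and option names can differ between Hudi releases.

```python
# Minimal sketch: write a DataFrame as a Hudi table, then query it with Spark SQL.
# Table name, paths, and columns are hypothetical; requires the Hudi Spark bundle jar.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.createDataFrame(
    [(1, "click", "2023-10-01 12:00:00"), (2, "view", "2023-10-01 12:01:00")],
    ["event_id", "event_type", "ts"],
)

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

df.write.format("hudi").options(**hudi_options).mode("overwrite").save("/tmp/hudi/events")

# Read the Hudi table back and run a query over it.
spark.read.format("hudi").load("/tmp/hudi/events").createOrReplaceTempView("events")
spark.sql("SELECT event_type, count(*) FROM events GROUP BY event_type").show()
```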

Apache Iceberg

" HDFS and Hive are too slow " . Apache Iceberg not only works with Hive, but also directly with Apache Spark and Apache Flink, as well as other systems such as ClickHouse, Dremio and StarRocks. Iceberg provides high-performance tabular formats for all of these systems, while supporting full schema evolution, data compression, and version rollback. Iceberg is a key component of many modern open data lakes.

Apache Superset

Apache Superset has been a leader in data visualization for many years. For those who want to deploy self-service, customer-facing or user-facing analytics tools at scale, Superset is pretty much the only option. Superset provides visualization capabilities for virtually any analysis scenario, from pie charts to complex geospatial charts. It works with most SQL databases and provides a drag-and-drop builder and SQL IDE. If you want to visualize data, Superset is worth a try.

Bun

Bun is a high-performance JavaScript runtime written in Zig, officially billed as an "all-in-one JavaScript runtime." Bun bundles, transpiles, installs, and runs JavaScript and TypeScript projects, with a built-in native bundler, transpiler, task runner, npm client, and Web APIs such as fetch and WebSocket.

InfoWorld commented that just when you thought JavaScript had settled into a predictable routine, Bun appeared. The frivolous name belies its serious goal: to bring everything you need for server-side JavaScript (runtime, bundler, package manager) into a single tool that works as a drop-in replacement for Node.js and npm, only much faster. This simple proposition arguably makes Bun the most disruptive JavaScript tool since Node itself upset the applecart.

Part of Bun's speed comes from Zig, and the rest comes from founder Jarred Sumner's obsession with performance. Beyond raw speed, integrating all of these tools into one package also makes Bun a powerful alternative to Node and Deno.

Claude 2

Anthropic's Claude 2 can accept up to 100K tokens (about 70,000 words) in a single prompt and can generate stories that run to thousands of tokens. Claude can edit, rewrite, summarize, classify, extract structured data, and answer questions about supplied content. It was trained primarily on English but also performs well in a range of other common languages, and it has extensive knowledge of common programming languages.

Claude was trained from the start to be a helpful, honest, and harmless assistant, and it has been extensively trained to be more harmless and less likely to produce offensive or dangerous output. It doesn't train on your data or look up answers on the internet.
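As a quick illustration, here is a hedged sketch of calling Claude 2 through Anthropic's Python SDK. It assumes an ANTHROPIC_API_KEY environment variable and uses the completion-style API available around Claude 2's release; newer SDK versions expose a different interface.

```python
# Minimal sketch of calling Claude 2 with Anthropic's Python SDK (2023-era completion API).
# Assumes ANTHROPIC_API_KEY is set in the environment.
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

completion = client.completions.create(
    model="claude-2",
    max_tokens_to_sample=300,
    prompt=f"{HUMAN_PROMPT} Summarize the main idea of open source licensing in two sentences.{AI_PROMPT}",
)
print(completion.completion)
```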

CockroachDB

CockroachDB is a distributed SQL database that supports strongly consistent ACID transactions. By scaling database reads and writes horizontally, it solves a key scalability problem for high-performance, transaction-heavy applications. CockroachDB also supports multi-region and multi-cloud deployments to reduce latency and comply with data regulations. Example deployments include Netflix's data platform, which runs more than 100 CockroachDB production clusters supporting media applications and device management. Other major customers include Hard Rock Sportsbook, JPMorgan Chase, Santander, and DoorDash.
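Because CockroachDB speaks the PostgreSQL wire protocol, a standard Postgres driver is enough to talk to it. The sketch below assumes a local, insecure single-node cluster; the host, database, and table names are placeholders.

```python
# Sketch: connect to CockroachDB with a standard PostgreSQL driver.
# Connection string assumes a local insecure cluster listening on the default port 26257.
import psycopg2

conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb?sslmode=disable")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance DECIMAL)")
    # UPSERT is CockroachDB's insert-or-update statement.
    cur.execute("UPSERT INTO accounts (id, balance) VALUES (1, 100.00), (2, 250.00)")
    cur.execute("SELECT id, balance FROM accounts ORDER BY id")
    for row in cur.fetchall():
        print(row)

conn.close()
```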

CPython

Across Python 3.11 and Python 3.12, the Python core development team has made a series of revolutionary upgrades to CPython, the reference implementation of the Python interpreter. The result is a massive improvement in Python runtime performance for everyone, not just the few who adopt new libraries or cutting-edge syntax.

InfoWorld also points to the Global Interpreter Lock (GIL) as a long-standing obstacle that has kept Python from achieving true multi-threaded parallelism.
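A small, self-contained sketch makes the point: under the GIL, CPU-bound work gains nothing from threads, while separate processes do run in parallel (exact timings vary by machine and Python version).

```python
# Illustration of the GIL limitation: CPU-bound work does not speed up with threads
# in CPython, but does speed up with separate processes.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls) -> float:
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(busy, [5_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"threads:   {timed(ThreadPoolExecutor):.2f}s")   # serialized by the GIL
    print(f"processes: {timed(ProcessPoolExecutor):.2f}s")  # runs in parallel
```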

DuckDB

DuckDB is an analytical database in the spirit of small but powerful projects like SQLite. DuckDB provides all the familiar RDBMS functionality—SQL queries, ACID transactions, secondary indexes—but adds analytical capabilities such as joins and aggregations of large data sets. It can also ingest and directly query common big data formats such as Parquet.
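A minimal sketch of DuckDB's Python API, assuming a hypothetical orders.parquet file with the columns shown:

```python
# Sketch: query a Parquet file in place with DuckDB.
# The file path and column names are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database

result = con.execute("""
    SELECT customer_id, count(*) AS orders, sum(amount) AS total
    FROM 'orders.parquet'          -- DuckDB can scan Parquet files directly
    WHERE order_date >= DATE '2023-01-01'
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").fetchall()

for row in result:
    print(row)
```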

HTMX and Hyperscript 

HTMX takes the HTML people know and love and extends it with enhancements that make it easier to write modern web applications. HTMX eliminates the bulk of boilerplate JavaScript used to connect web frontends and backends. Instead, it uses intuitive HTML attributes to perform tasks such as making AJAX requests and populating elements with data.

A similar project, Hyperscript, introduces a syntax similar to HyperCard that simplifies many JavaScript tasks, including asynchronous operations and DOM manipulation. All in all, HTMX and Hyperscript offer a bold alternative to the current trend of reactive frameworks.
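To make the HTMX pattern concrete, here is a hedged sketch using Flask as a stand-in backend (Flask is an illustrative choice, not part of HTMX). The server returns plain HTML fragments, and the hx-get, hx-target, and hx-swap attributes decide when to fetch them and where to insert them.

```python
# Sketch of the HTMX pattern: the server returns HTML fragments rather than JSON,
# and hx-* attributes in the page trigger the request and swap in the response.
from flask import Flask

app = Flask(__name__)

PAGE = """
<script src="https://unpkg.com/htmx.org"></script>
<button hx-get="/fragment" hx-target="#result" hx-swap="innerHTML">Load data</button>
<div id="result"></div>
"""

@app.route("/")
def index():
    return PAGE

@app.route("/fragment")
def fragment():
    # No client-side templating: just return the HTML to insert.
    return "<ul><li>Item one</li><li>Item two</li></ul>"

if __name__ == "__main__":
    app.run(debug=True)
```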

Istio

Istio is a service mesh that simplifies networking and communications for container-based microservices, providing traffic routing, monitoring, logging, and observability while enhancing security with encryption, authentication, and authorization capabilities.

Istio decouples communication and its security capabilities from applications and infrastructure, enabling more secure and consistent configurations. Its architecture consists of a control plane deployed in a Kubernetes cluster and a data plane that enforces communication policies. In 2023, Istio graduated from CNCF incubation, with support and contributions from Google, IBM, Red Hat, Solo.io, and other companies in the cloud native community.

Kata Containers

Combining the speed of containers with the isolation of virtual machines, Kata Containers is a secure container runtime built from Intel Clear Containers and Hyper.sh runV. Kata Containers works with Kubernetes and Docker and supports multiple hardware architectures, including x86_64, AMD64, Arm, IBM p-Series, and IBM z-Series.

It has been sponsored by Google Cloud, Microsoft, AWS, Alibaba Cloud, Cisco, Dell, Intel, Red Hat, SUSE and Ubuntu.

LangChain

LangChain is a modular framework that simplifies the development of applications driven by language models. LangChain lets language models connect to data sources and interact with their environment. LangChain components are modular abstractions paired with collections of implementations of those abstractions.

LangChain's off-the-shelf chains are structured combinations of components that accomplish specific high-level tasks. You can use components to customize existing chains or build new ones. LangChain currently comes in three versions: Python, TypeScript/JavaScript, and Go. As of this writing there are roughly 160 LangChain integrations.
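A minimal sketch of the Python version, assuming an OPENAI_API_KEY is set; LangChain's import paths have moved between releases, so this reflects the 2023-era layout rather than the current one.

```python
# Minimal LangChain sketch: a prompt template wired to an LLM as a reusable chain.
# Assumes OPENAI_API_KEY is set; import paths differ in newer LangChain versions.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["product"],
    template="Suggest a short, catchy name for a company that makes {product}.",
)

chain = LLMChain(llm=OpenAI(temperature=0.7), prompt=prompt)
print(chain.run(product="solar-powered e-bikes"))
```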

Language Model Evaluation Harness

When a new large language model (LLM) is released, it is usually evaluated against ChatGPT on a handful of benchmarks. Many companies use lm-eval-harness to generate those evaluation scores. Created by EleutherAI, a distributed artificial intelligence research institute, lm-eval-harness contains more than 200 benchmarks and is easily extensible. The tool has even been used to find deficiencies in existing benchmarks, and it powers Hugging Face's Open LLM Leaderboard.
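A hedged sketch of driving the harness from Python follows; the simple_evaluate entry point and its argument names reflect the 2023-era API and may differ between releases (many users run the harness from the command line instead). The model ID and task names are examples.

```python
# Hedged sketch of scoring a model with lm-evaluation-harness from Python.
# Entry point and argument names follow the 2023-era API and may differ by version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
)
print(results["results"])
```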

Llama 2

Llama 2 is Meta AI's next-generation large language model. Compared with Llama 1, its training data has grown by 40% (2 trillion tokens from public sources) and its context length has doubled (to 4,096 tokens).

Llama 2 is an auto-regressive language model built on an optimized Transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety. Code Llama, trained by fine-tuning Llama 2 on code-specific datasets, can generate code, and natural language about code, from either code or natural-language prompts.
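For a concrete sense of usage, here is a sketch of running a Llama 2 chat model with Hugging Face Transformers. The model ID is an example and requires accepting Meta's license on the Hugging Face Hub; a GPU with enough memory is assumed.

```python
# Sketch: running a Llama 2 chat model with Hugging Face Transformers.
# Model ID and prompt are examples; the model weights require license acceptance on the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Explain what a context window is in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```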

Ollama

Ollama is a command line tool that runs Llama 2, Code Llama and other models natively on macOS and Linux, with support for Windows planned. Ollama currently supports nearly two dozen language model families, each of which has many available "tags". Tags are variations of models that are trained at different scales using different fine-tuning methods and quantized at different levels to perform well locally. The higher the quantization level, the more accurate the model, but it runs slower and requires more memory.
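Besides the command line, a running Ollama server exposes a local REST API (by default on port 11434). The sketch below shows a single non-streaming generation request; the fields follow the documented /api/generate call and could change between releases.

```python
# Sketch: call a locally running Ollama server after `ollama pull llama2`.
# Endpoint and JSON fields follow the documented /api/generate API.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
print(response.json()["response"])
```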

Polars

Polars can't necessarily do everything Pandas can do, but what it does do, it does very fast: up to 10 times faster than Pandas, using only half the memory. Developers coming from PySpark will find the Polars API easier to use than the one in Pandas. If you're working with large amounts of data, Polars will make your work faster.
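A small sketch of a lazy Polars query, with a hypothetical orders.csv and column names; note the group-by method has been spelled both groupby and group_by across Polars versions.

```python
# Sketch: a lazy Polars pipeline; scan_csv defers work so the query is optimized as a whole.
# File and column names are hypothetical.
import polars as pl

top_customers = (
    pl.scan_csv("orders.csv")
    .filter(pl.col("amount") > 0)
    .group_by("customer_id")          # spelled groupby in older Polars releases
    .agg(pl.col("amount").sum().alias("total"))
    .sort("total", descending=True)
    .head(10)
    .collect()                        # execution happens here
)
print(top_customers)
```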

PostgreSQL

PostgreSQL is 35 years old, has more than 700 contributors, and holds an estimated 16.4% market share among relational database management systems. A recent survey of 90,000 developers found that 45% preferred PostgreSQL.

PostgreSQL 16, released in September, improves the performance of aggregate and SELECT DISTINCT queries, increases query parallelism, brings new I/O monitoring capabilities, and adds finer-grained security access controls. Also in 2023, Amazon Aurora PostgreSQL added pgvector to support generative AI embeddings, and Google Cloud released similar functionality for AlloyDB for PostgreSQL.
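To illustrate the pgvector side, here is a hedged sketch of a similarity search against a plain PostgreSQL instance with the extension installed. The table, columns, and vectors are hypothetical, and the <-> operator is pgvector's L2 distance.

```python
# Hedged sketch: pgvector similarity search from Python.
# Connection string, table, and vector values are placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://postgres@localhost:5432/postgres")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body text, embedding vector(3))")
    cur.execute("INSERT INTO docs (body, embedding) VALUES (%s, %s)", ("hello", "[0.1, 0.2, 0.3]"))
    # Return the rows whose embeddings are closest to the query vector.
    cur.execute("SELECT body FROM docs ORDER BY embedding <-> %s::vector LIMIT 5", ("[0.1, 0.2, 0.25]",))
    print(cur.fetchall())
conn.close()
```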

QLoRA

QLoRA is an efficient fine-tuning method from the University of Washington that reduces memory usage enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into low-rank adapters (LoRA).

Using QLoRA means you can fine-tune a massive 30B+ parameter model on your desktop with very little accuracy loss compared with full tuning across multiple GPUs. In fact, QLoRA sometimes does even better. InfoWorld commented that lower-resource inference and training mean more people can use LLMs, and asked: isn't that what open source is all about?
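A hedged sketch of a QLoRA-style setup with the 2023-era transformers, bitsandbytes, and peft stack: load the base model in 4-bit NF4 and attach small LoRA adapters. The model ID and hyperparameters are illustrative.

```python
# Hedged sketch of a QLoRA setup: 4-bit (NF4) quantized base model plus LoRA adapters.
# Model ID and hyperparameters are examples; requires transformers, peft, and bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, introduced by the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which projection layers receive adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the small adapter weights are trainable
```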

RAPIDS

RAPIDS is a collection of GPU-accelerated libraries for common data science and analytics tasks. Each library handles a specific task, such as cuDF for data frame processing, cuGraph for graph analysis, and cuML for machine learning.

Other libraries cover image processing, signal processing, and spatial analysis, while integrations bring RAPIDS to Apache Spark, SQL, and other workloads. If none of the existing libraries fits the bill, RAPIDS also includes RAFT, a collection of GPU-accelerated primitives for building your own solutions. RAPIDS also works with Dask to scale across multiple nodes and with Slurm to run in high-performance computing environments.
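A small cuDF sketch, assuming an NVIDIA GPU with RAPIDS installed and a hypothetical events.parquet file:

```python
# Sketch: cuDF mirrors the pandas API but runs on the GPU.
# File and column names are hypothetical; requires an NVIDIA GPU with RAPIDS installed.
import cudf

df = cudf.read_parquet("events.parquet")   # loaded directly into GPU memory
summary = (
    df[df["duration_ms"] > 0]
    .groupby("event_type")["duration_ms"]
    .agg(["count", "mean"])
    .sort_values("count", ascending=False)
)
print(summary.head(10))
# summary.to_pandas() hands the result back to CPU-side libraries when needed.
```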

Spark NLP

Spark NLP is a natural language processing library that runs on Apache Spark and supports Python, Scala and Java. This library helps developers and data scientists experiment with large language models, including Transformer models from Google, Meta, OpenAI, and more.

Spark NLP's Models Hub offers more than 20,000 models and pipelines to download for language translation, named entity recognition, text classification, question answering, sentiment analysis, and other use cases. In 2023, Spark NLP released a number of LLM integrations, a new image-to-text annotator, support for all major public cloud storage systems, and support for ONNX (Open Neural Network Exchange).
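A minimal sketch of running one of the standard pretrained English pipelines; pipeline names and output keys can vary by Spark NLP release.

```python
# Sketch: run a pretrained Spark NLP pipeline on a short text.
# The pipeline name is one of the standard English pipelines; names vary by release.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a Spark session with Spark NLP on the classpath

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships thousands of pretrained models and pipelines.")

print(result.get("entities"))  # named entities found in the text (key may differ by version)
print(result.get("pos"))       # part-of-speech tags
```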

StarRocks

Analytics technology has changed. Today's companies often have to serve complex data in real time to millions of concurrent users, and even petabyte-scale queries must complete within seconds. StarRocks is a query engine that combines native code (C++), an efficient cost-based optimizer, vector processing using the SIMD instruction set, caching and materialized views to efficiently handle large-scale joins.

StarRocks can deliver near-native performance even when directly querying data lakes and lakehouses, including Apache Hudi and Apache Iceberg. InfoWorld believes that whether you are pursuing real-time analytics, serving customer-facing analytics, or simply want to query a data lake without moving the data, StarRocks is worth a try.

TensorFlow.js

TensorFlow.js packages the power of Google's TensorFlow machine learning framework into a JavaScript library, bringing extraordinary capability to JavaScript developers with a minimal learning curve. You can run TensorFlow.js in the browser as a pure JavaScript stack with WebGL acceleration, or via the tfjs-node library on the server. The Node library exposes the same JavaScript API but runs on top of the C binary for maximum speed and CPU/GPU utilization.

"TensorFlow.js is clearly a good choice for JS developers interested in machine learning. It makes a welcome contribution to the JS ecosystem and makes artificial intelligence more accessible to developers."

vLLM

The rush to deploy large language models in production has produced a wave of frameworks focused on making inference as fast as possible. vLLM is one of the most promising: it supports Hugging Face models, offers an OpenAI-compatible API, and uses the PagedAttention algorithm to manage attention key/value memory efficiently.

It is now the obvious choice for serving LLMs in production, and new features such as FlashAttention 2 support are being added quickly.
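For offline batch inference, the Python API is only a few lines; the model ID below is an example, and vLLM also ships an OpenAI-compatible HTTP server for online serving.

```python
# Sketch: offline batch inference with vLLM's Python API.
# Model ID is an example; a GPU with enough memory is assumed.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.8, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "List three use cases for a vector database.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```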

Weaviate

The boom in generative AI has spurred the need for a new kind of database that can handle massive amounts of complex, unstructured data. Vector databases emerged to meet that need.

Weaviate offers developers a lot of flexibility in terms of deployment models, ecosystem integration, and data privacy. Weaviate combines keyword search with vector search for fast, scalable discovery of multimodal data (text, images, audio, video). It also has out-of-the-box modules for Retrieval Augmented Generation (RAG), which provides domain-specific data to chatbots and other generative AI applications, making them more useful.
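A hedged sketch using the v3 Weaviate Python client: the class name, fields, and query are hypothetical, and near-text search assumes a vectorizer module is configured for the class.

```python
# Hedged sketch: semantic query with the v3 Weaviate Python client.
# Class name, fields, and concepts are hypothetical; requires a running Weaviate
# instance with a vectorizer module configured for the "Article" class.
import weaviate

client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("Article", ["title", "summary"])
    .with_near_text({"concepts": ["renewable energy policy"]})  # vector (semantic) search
    .with_limit(3)
    .do()
)
print(result["data"]["Get"]["Article"])
```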

Zig

InfoWorld says Zig may be the most important of all open source projects today.

Zig is an effort to create a general-purpose programming language with program-level memory control that outperforms C while offering a more powerful, less error-prone syntax. Its goal is to displace C as the baseline language of the programming ecosystem. Since C is ubiquitous (that is, the most common component across systems and devices), success for Zig could mean widespread improvements in performance and stability.

"This is something we should all look forward to. Plus, Zig is a good, old-fashioned grassroots project with huge ambition and an open source spirit."


The above are the 2023 InfoWorld Bossie Award projects. For details, including the specific selection comments for each project, see the original article on InfoWorld's website.

Source: www.oschina.net/news/263384/2023-infoworld-bossie-awards