Four post-2000s developers' crazy open source project: converting the entire Internet into a large-model corpus, with embedding costs of just $1 per 100 million tokens

Source | QbitAI (WeChat official account: QbitAI)

All the papers on Arxiv, converted into tokens, add up to just 14.1 GB.

This is the feat of Alexander, the latest open source project to go viral.

In fact, this is only the first step.

Their ultimate goal is to turn the entire Internet into tokens; in other words, to transform it all into the form in which ChatGPT and other large models understand the world.

Once such a dataset exists, wouldn't it be another powerful tool for developing large models like GPT-4? A model that knows everything under the sun would be just around the corner!

As soon as the news came out, it drew wide attention.

Netizens' verdict: epic.

And behind it all are just four young developers with an average age of 20. The full Arxiv paper dataset has already been released, and an embedding search platform is due out next week.

Starting with every paper on Arxiv

More than 4 million papers, 600 million tokens, and 3.07 billion vector dimensions in total.
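As a quick sanity check, these headline numbers line up if each paper gets a single 768-dimension vector; 768 is the output size of the InstructorXL model mentioned below, and the one-vector-per-paper assumption is mine:

```python
# Sanity check on the article's numbers: "more than 4 million papers"
# and "3.07 billion vector dimensions" are consistent if every paper
# is embedded as one 768-dim vector (an assumption, not a stated fact).
papers = 4_000_000          # approximate paper count from the article
dims_per_vector = 768       # assumed InstructorXL embedding dimension
total_dims = papers * dims_per_vector
print(total_dims)           # 3072000000, i.e. about 3.07 billion
```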

This open source project, named Alexander, started with the papers on Arxiv.

The chosen method is embedding. Simply put, it maps objects in the real world into vectors that computers can understand.

The most classic example is representing an image as its grayscale pixel values.

The biggest strength of this technique is that it captures the semantic similarity that humans perceive.

For example, keyword search struggles to find a paper when ten different words all mean the same thing, but embeddings handle this easily, which makes them well suited for search, clustering, recommendation, and classification.
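The similarity idea can be made concrete with cosine similarity, the standard way to compare embedding vectors. The 4-dimension vectors below are toy values I made up for illustration (real models like InstructorXL output 768 dimensions); the point is that synonyms land close together even though they share no characters:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means 'points the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings: synonyms get nearly parallel vectors.
emb = {
    "car":        [0.90, 0.10, 0.00, 0.20],
    "automobile": [0.85, 0.15, 0.05, 0.25],
    "banana":     [0.00, 0.10, 0.95, 0.10],
}

# "car" is far more similar to "automobile" than to "banana".
assert cosine_similarity(emb["car"], emb["automobile"]) > \
       cosine_similarity(emb["car"], emb["banana"])
```

A keyword index sees no overlap between "car" and "automobile"; the vectors do, which is exactly what search, clustering, recommendation, and classification can exploit.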

For practicality and efficiency, the team chose to embed only each paper's title and abstract.

After testing various models, they settled on the InstructorXL text embedding model, which works across many tasks (classification, retrieval, clustering, text evaluation, etc.) and domains (science, finance, medicine, etc.).

Next week they will launch an Arxiv search. The current flow: first run a similarity search to find the 100 closest papers, then compute embeddings for those candidates on the fly and run a second, finer-grained search over them.
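That two-stage flow can be sketched as follows. The corpus format, the `fine_embed` callback, and the toy 2-d vectors are hypothetical placeholders of mine, not the team's actual code; the structure just mirrors the coarse-then-fine re-ranking described above:

```python
import math

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def two_stage_search(query_vec, corpus, fine_embed, k_coarse=100, k_final=10):
    """Stage 1: rank every paper by its precomputed (coarse) embedding, keep top k_coarse.
    Stage 2: re-embed only those candidates on the fly and re-rank with finer vectors."""
    coarse = sorted(corpus, key=lambda p: cos(query_vec, p["vec"]), reverse=True)[:k_coarse]
    fine = sorted(coarse, key=lambda p: cos(query_vec, fine_embed(p)), reverse=True)
    return fine[:k_final]

# Toy demo: three "papers" with 2-d stand-in embeddings; fine_embed just reuses them.
corpus = [
    {"id": "a", "vec": [1.0, 0.0]},
    {"id": "b", "vec": [0.0, 1.0]},
    {"id": "c", "vec": [0.7, 0.7]},
]
top = two_stage_search([1.0, 0.0], corpus, lambda p: p["vec"], k_coarse=2, k_final=2)
print([p["id"] for p in top])  # ['a', 'c']
```

The design point is that the expensive step (re-embedding) runs on only 100 candidates, not 4 million papers.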

The ultimate goal is to embed the entire Internet.

The crazy open source project of four 20-year-olds

There are two main reasons for launching such a crazy open source project.

On the one hand, embeddings carry huge value. Many problems in the world boil down to search, clustering, recommendation, or classification, and those are exactly what embeddings are good at. And, as mentioned, they can crack some more complex puzzles too.

On the other hand, the cost is one-time and cheap: in most cases there is no need to re-embed the same document, and the current price is only $1 per 100 million tokens.
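Putting the article's own numbers together, the one-time price tag for the whole Arxiv dataset is tiny:

```python
# Back-of-the-envelope cost, using only figures quoted in the article.
PRICE_PER_100M_TOKENS_USD = 1.0   # "$1 per 100 million tokens"
arxiv_tokens = 600_000_000        # "600 million tokens" in the Arxiv dataset

cost = arxiv_tokens / 100_000_000 * PRICE_PER_100M_TOKENS_USD
print(f"One-time cost to embed all of Arxiv: ${cost:.2f}")  # $6.00
```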

But they couldn't find any open embedding datasets, and so the organization was born.

They will open up more datasets in the future, all chosen by users: beyond the public datasets on the official website, the remaining candidate projects are open for voting.

It is worth mentioning that the team behind it all has an average age of just 20.

And their team name is suitably grand: the Macrocosm alliance.

"Zoom out far enough, and humanity becomes a single organism."

According to the official introduction, they are committed to building plugins for ChatGPT and similar products, and are also developing a core product: a personal research assistant based on large models, to help with learning, teaching, and scientific research.

Interested friends can check out the links below~

https://alex.macrocosm.so/download
Reference links:
[1]https://www.macrocosm.so/
[2]https://twitter.com/willdepue/status/1661781355452325889
[3]https://github.com/macrocosmcorp
[4]https://www.pinecone.io/learn/vector-embeddings/

Origin: blog.csdn.net/lqfarmer/article/details/131131493