Search engine notes

1. Search engine notes

1.1. Why Google is successful

It has long been a truism in the technology industry that people are unwilling to change their usage habits. Ramaswamy said frankly in the interview, "One of the biggest obstacles we face is changing users' ingrained habits. People forget that Google's success is not just about building better products. To achieve our goals, we had to make a series of precise distribution decisions."

According to reports, Google pays Apple up to $15 billion a year to remain the default search engine in the Safari browser across Apple's devices. Google also pays Mozilla as much as $450 million a year to be the preferred search engine in the Firefox browser. Google has similar partnerships with other device manufacturers and browser developers, and even with telecom operators. According to the Wall Street Journal, Samsung briefly considered ending its deal with Google in 2023, but ultimately dropped the idea for various reasons, including "the possible impact on its extensive business relationship with Google."

Google's real strength lies in its other products. Android is currently the most popular mobile operating system in the world, with a market share of approximately 78%. Chrome is the most popular web browser, accounting for about 62% of the market. On these two major platforms, Google has naturally become the unshakable default search engine.

1.2. Building a search engine is both complex and simple

Search engines are magical things—both incredibly complex and yet pure and simple.

Essentially, what a search engine does is compile a database of web pages (a "search index"), then query that database each time a search comes in, extracting and returning the highest quality, most relevant set of pages. But every step in the process involves huge complexity and requires a series of trade-offs. There are two core trade-offs: time and money.
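
To make those two halves concrete, here is a minimal Python sketch of a toy in-memory search index and a query against it. The example pages, URLs and the crude term-count scoring are invented for illustration and bear no resemblance to production ranking.

    # Build a tiny inverted index from "crawled" pages, then answer a query
    # by scoring pages on how often they contain the query terms.
    from collections import defaultdict

    pages = {  # hypothetical crawled documents
        "https://example.com/pasta": "easy pasta recipe with garlic and olive oil",
        "https://example.com/rock": "dwayne johnson movie news and biography",
    }

    index = defaultdict(dict)          # term -> {url: term count}
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1

    def search(query):
        """Return urls ranked by how many query terms they contain."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for url, count in index.get(term, {}).items():
                scores[url] += count
        return sorted(scores, key=scores.get, reverse=True)

    print(search("pasta recipe"))      # ['https://example.com/pasta']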

Even if an entrepreneur could build a continuously updated database covering the hundreds of billions of pages on the Internet, the storage and bandwidth costs alone would be enough to bankrupt almost any giant company on the planet, and that is before counting the cost of running countless searches against the database every day. Speed matters too: every millisecond of a search response counts, which is why Google displays how long each query took above the results. In short, there is simply not enough time to scan the entire database for every query.

In addition, building a search engine starts from a basic philosophical question: what is a high-quality web page? Entrepreneurs must decide which disagreements are reasonable debate and which information is pure nonsense. They must figure out how much advertising counts as excessive. A website written by AI and stuffed with SEO garbage is certainly not good, but a food blog written by an individual and stuffed with SEO keywords is not necessarily bad.

Once those questions are settled and clear boundaries are drawn, the search engine has essentially determined the few thousand domains it must cover. These include news sites such as CNN and Breitbart, popular discussion boards such as Reddit, Stack Overflow and Twitter, utilities such as Wikipedia and Craigslist, platforms such as YouTube and Amazon, and the top recipe, sports and shopping sites. Sometimes entrepreneurs can negotiate partnerships with these websites and obtain their data directly in structured form instead of crawling individual pages. It is worth mentioning that many large platforms have dedicated teams for this, and some are even willing to cooperate for free.

After that, it's time to unleash the crawlers. These bots fetch the content of a given web page, index it, then find and follow every link on the page, repeating the fetch-index-follow cycle. Every page a crawler visits is evaluated against the quality standards set earlier; content deemed good enough is downloaded to a server, and the search index begins to expand rapidly.
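
A minimal sketch of that crawl loop, assuming a simple breadth-first frontier and leaving out politeness delays, robots.txt checks and quality filtering; the seed URL is hypothetical.

    # Fetch a page, store it, collect its links, and follow them until a
    # page budget is exhausted.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        seen, frontier, indexed = {seed}, deque([seed]), {}
        while frontier and len(indexed) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue                      # skip unreachable pages
            indexed[url] = html               # "download to the server"
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:         # follow every link on the page
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return indexed

    # crawl("https://example.com")   # hypothetical seed URL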

Of course, crawlers are not welcome everywhere. Every time a crawler opens a web page, the content provider pays for the bandwidth. Now imagine a fleet of search engine crawlers loading and saving individual pages of your website every second; the cost would quickly exceed what the provider can afford.

Therefore, most websites have a file called robots.txt that defines which crawlers may access their content, which may not, and which URLs they are allowed to crawl. Technically, search engines are free to ignore the rules in robots.txt, but respecting them is part of the structure and culture of the Web. Almost all websites welcome Google and Bing, because the discoverability they bring outweighs the bandwidth cost. Many block specific companies, for example not wanting Amazon to crawl and analyze their shopping sites. Others set a blanket rule: no crawlers except Google's and Bing's.
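
To make this concrete, here is a small sketch using Python's standard urllib.robotparser of how a well-behaved crawler would check a robots.txt implementing that "only Google and Bing" policy. The file content and URLs are invented for illustration.

    # Parse a robots.txt and ask whether a given user agent may fetch a URL.
    from urllib import robotparser

    robots_txt = """
    User-agent: Googlebot
    Allow: /

    User-agent: Bingbot
    Allow: /

    User-agent: *
    Disallow: /
    """

    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())

    print(parser.can_fetch("Googlebot", "https://example.com/page"))     # True
    print(parser.can_fetch("MyNewCrawler", "https://example.com/page"))  # False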

Soon, the crawlers bring back a fairly extensive snapshot of the internet. The next step is to rank those pages for every query the search engine might receive. One approach is to sort pages by topic, dividing them into smaller, more searchable indexes rather than one all-encompassing behemoth: local queries match against local pages, shopping against shopping, and news against news. Determining the topic and content of each page takes a great deal of machine learning, and human assistance is still indispensable.
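
A rough sketch of that topic sharding idea: route each page into a small per-topic index and search only the shard that matches the query. The keyword lists below stand in for the machine-learning classifiers (and human labeling) described above and are purely illustrative.

    # Classify pages and queries into topics, keep one small index per topic,
    # and search only the matching shard.
    TOPIC_KEYWORDS = {
        "shopping": {"buy", "price", "deal", "cheap"},
        "news": {"breaking", "election", "report"},
        "local": {"near", "restaurant", "open", "hours"},
    }

    topic_indexes = {topic: [] for topic in TOPIC_KEYWORDS}
    topic_indexes["general"] = []

    def classify(text):
        words = set(text.lower().split())
        for topic, keywords in TOPIC_KEYWORDS.items():
            if words & keywords:
                return topic
        return "general"

    def add_page(url, text):
        topic_indexes[classify(text)].append((url, text))

    def search(query):
        # shopping queries match against shopping pages, news against news...
        shard = topic_indexes[classify(query)]
        return [url for url, text in shard
                if any(term in text.lower() for term in query.lower().split())]

    add_page("https://example.com/tv-deal", "best price to buy a cheap tv")
    print(search("buy tv"))   # ['https://example.com/tv-deal']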

Additionally, a team of human raters is brought in, shown queries and their results, and asked to rate the quality of the results from 0 to 10. Sometimes the verdict is obvious: if someone searches for "Facebook" and the first result is not facebook.com, that is plainly unacceptable. But most of the time, the ratings from many raters are combined, fed back into the indexes and topic models, and the process repeats.
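
One way such ratings might be folded back in, sketched under the assumption of a simple (query, result, score) log: average the 0-10 scores per pair to produce labels for the ranking models. The data below is invented.

    # Aggregate rater scores into per-(query, result) labels.
    from collections import defaultdict
    from statistics import mean

    ratings = [                       # (query, result url, rater score 0-10)
        ("facebook", "https://facebook.com", 10),
        ("facebook", "https://facebook-fan-blog.example.com", 2),
        ("facebook", "https://facebook.com", 9),
    ]

    by_pair = defaultdict(list)
    for query, url, score in ratings:
        by_pair[(query, url)].append(score)

    labels = {pair: mean(scores) for pair, scores in by_pair.items()}
    print(labels)   # averaged labels to feed back into the ranking models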

At this point, the problem is only half solved. We also need what is called "query understanding", which means realizing that people who search for "The Rock" and people who search for "Dwayne Johnson" are actually looking for the same information. Over time, we accumulate a large library of synonyms and similarities, and use it to rewrite queries so they are easier to answer. And as Google says, 15% of the searches it sees each day are entirely new, so this race to understand what people really want and to absorb new knowledge never ends.
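
A tiny sketch of query rewriting from such a synonym library; the synonym table and the exact-match rewrite rule are illustrative assumptions, not how any real engine normalizes queries.

    # Map surface forms that mean the same thing onto one canonical query
    # before searching.
    SYNONYMS = {
        "the rock": "dwayne johnson",
        "nyc": "new york city",
    }

    def rewrite(query):
        normalized = query.lower().strip()
        return SYNONYMS.get(normalized, normalized)

    print(rewrite("The Rock"))     # "dwayne johnson"
    print(rewrite("NYC weather"))  # unchanged: only exact matches rewrite here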

After a while, the search engine officially launches and begins to attract attention, clicks and loyal users. There is a gold-standard signal here: if a user clicks a result and does not immediately search again or click another link, the current results are good enough. Conversely, the more clicks the engine collects, the better it understands what users really want.
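
A sketch of how that gold-standard signal might be measured from a click log, assuming a simple event format and an arbitrary 60-second window; both are invented for illustration.

    # Count a click as satisfied if the user does not search again or click
    # another result within a short window afterwards.
    SATISFIED_AFTER_SECONDS = 60

    def satisfied_click_rate(events):
        """events: list of (timestamp_seconds, action) sorted by time,
        where action is 'click' or 'search'."""
        clicks = satisfied = 0
        for i, (ts, action) in enumerate(events):
            if action != "click":
                continue
            clicks += 1
            follow_ups = [t for t, _ in events[i + 1:]
                          if t - ts < SATISFIED_AFTER_SECONDS]
            if not follow_ups:
                satisfied += 1
        return satisfied / clicks if clicks else 0.0

    log = [(0, "search"), (2, "click"), (5, "click"), (300, "search")]
    print(satisfied_click_rate(log))   # 0.5: the first click was followed quickly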

In addition, running a search engine means constantly balancing speed, cost and quality. When someone types "YouTube" and presses Enter, scanning the entire database would take too long and waste bandwidth and compute; keeping a database that covers the whole Internet means both high storage costs and slow searches; restricting results to the 100 most popular websites keeps things fast and cheap, but the coverage is incomplete and the quality unreliable. Meanwhile, every website keeps changing, and the crawlers and ranking systems have to keep up.
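
One common compromise, sketched here as an assumption rather than anything a particular engine is known to do: answer the most popular "head" queries from a small cache and fall back to the full index for everything else.

    # Serve popular queries from a cache; the cached results and the
    # fallback search function are hypothetical.
    HEAD_QUERY_CACHE = {
        "youtube": ["https://youtube.com"],
        "facebook": ["https://facebook.com"],
    }

    def search(query, full_index_search):
        key = query.lower().strip()
        if key in HEAD_QUERY_CACHE:           # fast and cheap for popular queries
            return HEAD_QUERY_CACHE[key]
        return full_index_search(key)         # slow, expensive full-index path

    print(search("YouTube", lambda q: []))    # ['https://youtube.com'], no index hit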
