Doug Cutting: The Father of Search

Original author: time friend

Original address: Doug Cutting, the father of Hadoop

Doug Cutting's son, while learning to speak, carried around a little yellow stuffed elephant he affectionately called "Hadoop." In a flash of inspiration, Cutting gave the technology that name and made the little yellow elephant its logo, though the actual toy is thin and long, not chubby like the one in the logo. "My son is 17 now, so he gave me the baby elephant. When there's an event, I bring it along; when there isn't, it stays in the sock drawer at home," Cutting said with a laugh.

Figure: The yellow elephant held by Doug Cutting inspired the name Hadoop


Cutting graduated from Stanford University in 1985. He had not initially set out to join the IT industry: in his first two years of college he took standard courses such as physics and geography. Under the pressure of tuition, he realized he needed to learn skills that were more practical and interesting, both to help pay off his loans and to lay plans for his future. Since Stanford sits in Silicon Valley, the "holy land" of the IT industry, learning software was a natural choice for a young person there.

Cutting's first job was an internship at Xerox. Xerox's laser scanners ran three different operating systems, one of which had no screen saver, so Cutting set about developing one for it. Because the program was built on the system's lower layers, other colleagues could add different themes to it. The job gave Cutting a real sense of satisfaction and was his earliest "platform-level" work.

Although Xerox let Cutting accumulate a great deal of technical knowledge, he felt that the research he was doing stayed on paper: no one had tested whether those theories were practical. So he decided to take a brave step and make search technology usable by more people.

At the end of 1997, Cutting began devoting two days a week at home to turning this idea into reality using Java. Soon after, Lucene was born. As the first open-source library to provide full-text search, Lucene's importance needs no elaboration. Cutting then pressed on, deepening his commitment to open source on the foundation Lucene provided.
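At the heart of full-text search engines like Lucene is an inverted index, a structure that maps each term to the documents containing it. The sketch below illustrates that idea in a few lines of Python; it is a toy model of the concept only, and the function names bear no relation to Lucene's actual Java API.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "open source full text search",
    2: "distributed storage and processing",
    3: "full text search with an inverted index",
}
index = build_index(docs)
print(search(index, "full text search"))  # documents 1 and 3
```

A real engine layers much more on top of this (tokenization, ranking, compressed postings lists), but the inverted index is the structure that makes querying fast: lookups touch only the documents that contain the query terms.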

In 2004, Cutting and fellow programmer Mike Cafarella decided to develop an open-source search engine that could rival the mainstream search products of the time. The project was named Nutch. Cutting hoped to build search technology on an open architecture, similar to today's Google Search or Microsoft's Bing. That same year, Google Labs published a paper on its big data analysis and MapReduce algorithm. Cutting used the techniques Google disclosed to extend the Lucene search technology he had developed, and out of that work built Hadoop.

When the project began in 2006, the word "Hadoop" referred to just two components: HDFS and MapReduce. Today it refers to a "core" (the Core Hadoop project) plus a growing ecosystem around it. In this it resembles Linux, which also consists of a kernel and an ecosystem.

Hadoop is open-source software for the distributed processing and analysis of huge data sets on computer clusters; it can also be thought of as a platform for storing and managing large amounts of data. Its two core technologies are the Hadoop Distributed File System (HDFS) and MapReduce. Precisely because it divides work across many nodes, it solves the file-storage problem and greatly shortens computation time, which made Hadoop the mainstream technology for big data. Well-known companies such as Google, Facebook, Walmart, UnionPay, China Unicom, and TSMC all use Hadoop technology.
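The MapReduce model mentioned above can be sketched in a single process: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The following Python sketch mimics those three phases with a word count, the canonical MapReduce example; the function names are illustrative and are not part of Hadoop's actual Java API, where the framework itself distributes these phases across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(word_count(["hadoop stores data", "hadoop processes data"]))
# {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

What makes the model scale is that mappers run independently on separate blocks of input (stored in HDFS) and reducers run independently per key, so each phase parallelizes across nodes with no shared state.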

Doug Cutting has said that Hadoop's significance lies not in the technology but in "digital transformation." Its success teaches several lessons. First, open source is necessary. When he open-sourced Lucene, the predecessor of Hadoop, some twenty years ago, he did not expect it to succeed: "It was not the best technology, it was not perfect, but because it was open source, the power of the community made it the best search technology." What users want now, above all, is open-source software. Second, digital transformation requires different computing and storage architectures. A few years after completing Lucene, Cutting began investing in the research and development of Hadoop: "You can see that adoption has risen overall; its success lies in meeting everyone's needs." Before Hadoop, almost all data and applications were stored in separate systems; with Hadoop they can live in a single system, which scales better and handles processing workloads more effectively.

Cutting has also pointed out that Hadoop is closely tied to machine learning and AI. Training, testing, and evaluating artificial intelligence requires data, and developers have written many applications on the Hadoop platform to collect huge amounts of it. The data supporting AI and machine learning is growing explosively, and it would be difficult for any single company to provide such a large set of tools on its own.

Today, in addition to being the father of Hadoop, Doug Cutting is the chief architect at Cloudera, arguably the best-known company in the Hadoop ecosystem. Its core product is a Hadoop-based big data platform for enterprise customers, helping companies install, configure, and run Hadoop for massive-scale data processing, analysis, and machine learning.


In his message for 2017, Doug Cutting pointed out five ways to make open-source projects succeed:

1. Embrace the constant change and evolution of open source

Constant change is the first lesson anyone new to open source needs to learn, and it is the biggest difference between open source and traditional software. Open source is by nature changeable and flexible, and its new projects often originate in special use cases. This dynamic cycle makes products better, faster. If companies want to benefit fully from open source, they must therefore remain open to technological change. The Spark versus MapReduce debate demonstrates the point perfectly:

In fact, when people build new applications, they use MapReduce less and less; Spark has become their default data processing engine. MapReduce is gradually becoming the underlying engine for Hive and Pig, which does not mean it is obsolete: it will serve existing applications well for many years and remains an excellent tool for some large-scale batch loads. The trend follows the natural evolution of open-source technology: MapReduce was the 1.0 engine of the open-source data ecosystem, Spark is the 2.0, and one day a 3.0 will appear that makes Spark history.

2. When introducing a new technology stack, start small and go from top to bottom

Whatever solutions we build and deploy, we now have many common data platforms and many tools that can be flexibly combined for search, stream processing, machine learning, and more. These jobs demand not only a different skill set but also cultural changes in management methods and organizational structure. It is therefore important to win the support of senior management and make data management a board-level concern. At the same time, it is better to build the new culture gradually through a few new applications, rather than replacing everything at once, so that everyone can adapt to the change through concrete use cases.

3. Choose open source software carefully to avoid bundling by cloud vendors

As more and more enterprises and industries adopt cloud computing, they should consider that open-source software not only keeps improving in robustness, scalability, and security, but also helps them avoid lock-in by cloud vendors. By building on open-source platforms, organizations can arbitrage between cloud providers to reduce costs, use different clouds in different regions, or take a hybrid approach combining cloud and on-premises deployment. Open-source platforms have already proven their technical superiority and may see even wider adoption in 2017. Large numbers of organizations collaborate through open-source projects, and it is very hard for a single vendor to compete: open-source data systems now lead in performance and flexibility, and they are improving faster.

4. For job seekers, the open source ecological environment should focus on forests, not trees

Job seekers in the IT field, whether in programming or data science, should focus not only on mastering individual technologies but on understanding how the various components of the open-source data ecosystem are best used and how to connect them to solve problems. This understanding of the larger structure is the most valuable skill for companies pursuing technological innovation. As new technologies arrive, it is important to understand how adaptable they are, what they can replace, and what they can do.

5. Look for opportunities in the skills gap

The big-data skills gap will remain roughly stable next year, but that should not be an obstacle to adopting Hadoop and other open-source technologies. When new technologies are created and are still competing for users, they remain unfamiliar to outsiders. Only when a certain kind of software matures into a standard part of the stack do large numbers of skilled practitioners begin to appear, and even then a skills gap persists. The gap would only disappear if we stopped making major improvements to the technology stack, and Doug does not think we want that. In short, the skills gap is one of the main factors governing the pace of platform change, and a sign of innovation to come.


Origin blog.csdn.net/sanmi8276/article/details/112978035