Essential skills for an Alibaba big data architect: do you have your "Paige"?

Over the past two days, the "What is Paige" commercial has flooded everyone's feeds. Paige is obviously a comedy character, yet she brought everyone watching to tears!

In the story, the grandson says one sentence, "I want Paige", and so the grandfather starts searching the whole village for Paige. What he finally comes up with is, in my opinion, the best-looking Paige of all.

I don't know how you felt after watching it, but I was very moved. After a few days of fermenting, though, the word "Paige" seems to have taken on more meanings! All kinds of "Paige" keep popping up: what is a woman's "Paige"? What is a programmer's "Paige"?

Here I would also like to recommend the big data learning exchange group I set up: 529 867 072. The group is all about big data development, so if you are learning big data, you are welcome to join. We are all software developers and share practical material from time to time (only about big data software development), including an up-to-date big data and advanced development course that I compiled myself. Anyone who wants to advance and dig deeper into big data is welcome to join.

Today I'll share with you what a big data engineer's "Paige" is!

"Paige" skills

1.Programming skills

Whether it is Java or Python, to learn a programming language you must first settle down and specialize in one. This matters especially for languages with open source tooling, which are widely used at every company.

For example, a Java foundations track covers syntax, object-oriented programming, multithreading and network programming, the MySQL database, and Maven project management. Working through it builds the basic coding ability that big data work requires, and it lays a solid foundation for later study of advanced topics such as big data analysis or recommendation systems.

2.Hadoop

Hadoop plays a crucial role in the big data technology stack. It is the foundation of big data technology, and how solidly you grasp its basics determines how far you can go down the big data road. Hadoop consists of several components: HDFS, MapReduce, and YARN. HDFS is where the data is stored, much like a computer's hard disk on which files are kept. MapReduce processes and computes over the data. Its defining trait is that no matter how large the data is, it will finish the job given enough time. It may not be fast, though, which is why it is called batch processing.
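To make the MapReduce idea above more concrete, here is a minimal pure-Python sketch of the map, shuffle, and reduce steps for a word count. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample input are made up for illustration, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster, each phase would run in parallel on many machines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data on Hadoop", "Hadoop stores big files"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, the same logic can be spread over as many machines as the data needs, which is what lets a batch job finish no matter how large the input is.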

YARN is the key component that embodies the idea of Hadoop as a platform. With it, other software in the big data ecosystem can run on Hadoop, so we can make better use of the large storage capacity of HDFS and save resources. For example, we do not need to build a separate Spark cluster; Spark can run directly on the existing Hadoop YARN.

[Figure: commonly used Hadoop module architecture]

3.Spark

Spark makes up for the slow data processing speed of MapReduce. Its hallmark is that it loads data into memory for computation instead of repeatedly reading from hard disks, which are painfully slow and improve slowly as well. It is especially well suited to iterative computation, so algorithm people are particularly fond of it. Spark is written in Scala, and you can work with it from either Scala or Java, because both run on the JVM.
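The benefit of keeping data in memory for iterative work can be shown with a small pure-Python sketch. This is not Spark code: the helper names (`read_numbers`, `gradient_step`, `run`) and the tiny data file are assumptions for illustration only. The same iterative algorithm runs twice, once re-reading the file on every iteration and once reading it a single time and keeping it in memory.

```python
import os
import tempfile

def read_numbers(path, counter):
    """Read one number per line from disk and record that a disk pass happened."""
    counter["disk_reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]

def gradient_step(values, guess, rate=0.5):
    """One iteration: move the guess toward the mean of the values."""
    gradient = sum(guess - v for v in values) / len(values)
    return guess - rate * gradient

def run(path, iterations, cache):
    counter = {"disk_reads": 0}
    # With cache=True the data is loaded once and reused by every iteration.
    cached = read_numbers(path, counter) if cache else None
    guess = 0.0
    for _ in range(iterations):
        values = cached if cache else read_numbers(path, counter)
        guess = gradient_step(values, guess)
    return guess, counter["disk_reads"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    with open(path, "w") as f:
        f.write("\n".join(str(n) for n in [2, 4, 6, 8]))
    disk_result, disk_reads = run(path, iterations=20, cache=False)
    memory_result, memory_reads = run(path, iterations=20, cache=True)

print(disk_reads, memory_reads)
```

Both runs arrive at the same answer, but the cached run touches the disk once instead of once per iteration. On a real cluster with large data sets, that difference is where most of the time goes.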

4.Storm

Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data: what Hadoop does for batch processing of large volumes of data, Storm does for real-time data. Storm is simple and can be used with any programming language.
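As a rough illustration of what processing an unbounded stream means, the following pure-Python sketch uses generators. It does not use the Storm API; `click_stream` and `running_counts` are hypothetical names, and the endless source is simulated by cycling over a short list.

```python
import itertools
from collections import Counter

def click_stream():
    """Hypothetical unbounded source: yields page names forever."""
    pages = ["home", "search", "home", "cart"]
    for page in itertools.cycle(pages):
        yield page

def running_counts(stream):
    """Update the count for each event as it arrives and emit the new total."""
    counts = Counter()
    for page in stream:
        counts[page] += 1
        yield page, counts[page]

# The stream never ends, so the demo takes only the first six results.
first_six = list(itertools.islice(running_counts(click_stream()), 6))
print(first_six)
```

Unlike a batch job, there is no point at which all the input has been read. Each event updates the result immediately, and a consumer can look at the current totals at any time.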

5.Kafka

Kafka is a distributed, partitioned, replicated commit log service. It provides functionality similar to JMS but is completely different in design and implementation, and it is not an implementation of the JMS specification. Kafka categorizes the messages it stores by topic. The sender of a message is called a producer and the receiver is called a consumer. A Kafka cluster is made up of multiple Kafka instances, and each instance (server) is called a broker. In the classic architecture, the Kafka cluster, the producers, and the consumers all rely on ZooKeeper to store metadata and keep the system available (recent Kafka releases replace ZooKeeper with the built-in KRaft mode).
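The terms above (topic, partition, producer, consumer, broker, offset) can be tied together with a toy in-memory model. This is only a sketch under simplifying assumptions and not the Kafka client API: the `Broker` and `Consumer` classes are invented for illustration, there is a single broker with no replication, and the partition is chosen with a CRC32 hash rather than Kafka's own partitioner.

```python
import zlib
from collections import defaultdict

class Broker:
    """Toy stand-in for one Kafka server; real brokers persist and replicate logs."""

    def __init__(self, partitions_per_topic=2):
        self.partitions_per_topic = partitions_per_topic
        # topic -> list of partitions, each partition an append-only list
        self.topics = defaultdict(
            lambda: [[] for _ in range(partitions_per_topic)]
        )

    def append(self, topic, key, value):
        """Producer side: append to the partition chosen by the key's hash."""
        partition = zlib.crc32(key.encode()) % self.partitions_per_topic
        log = self.topics[topic][partition]
        log.append((key, value))
        return partition, len(log) - 1  # the record's offset in that partition

    def read(self, topic, partition, offset):
        return self.topics[topic][partition][offset:]

class Consumer:
    """Tracks, per partition, the offset of the next record to read."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self):
        records = []
        for partition in range(self.broker.partitions_per_topic):
            batch = self.broker.read(self.topic, partition, self.offsets[partition])
            self.offsets[partition] += len(batch)
            records.extend(batch)
        return records

broker = Broker()
p1, _ = broker.append("orders", "user-1", "created")
p2, _ = broker.append("orders", "user-2", "created")
p3, _ = broker.append("orders", "user-1", "paid")

consumer = Consumer(broker, "orders")
first_poll = consumer.poll()
second_poll = consumer.poll()  # nothing new since the last poll
```

Records with the same key always land in the same partition, so their order is preserved. The consumer only remembers an offset per partition, which is why the log itself can stay append-only.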

6.Flink

Flink is a distributed computing engine. It can be used for batch processing, that is, processing static or historical data sets. It can also be used for stream processing, that is, handling real-time data streams and producing results in real time. And it can be used for event-driven applications: for example, Didi monitors the behavior streams of its users and drivers in real time and uses Flink CEP to judge whether a user's or driver's behavior is legitimate.
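The kind of pattern detection that CEP performs can be sketched in a few lines of plain Python. This is not the Flink CEP API. The rule (three consecutive cancellations by one driver within 60 seconds), the event format, and the name `detect_repeated_cancels` are all assumptions made up for this example, and the events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

def detect_repeated_cancels(events, threshold=3, window_seconds=60):
    """Yield (driver_id, timestamp) whenever a driver has `threshold`
    consecutive cancellations within `window_seconds` of each other."""
    recent = defaultdict(deque)  # driver_id -> timestamps of consecutive cancels
    for timestamp, driver_id, action in events:
        cancels = recent[driver_id]
        if action != "cancel":
            cancels.clear()  # any other action breaks the run of cancels
            continue
        cancels.append(timestamp)
        # Drop cancels that have fallen out of the time window.
        while timestamp - cancels[0] > window_seconds:
            cancels.popleft()
        if len(cancels) >= threshold:
            yield driver_id, timestamp

events = [
    (0, "d1", "cancel"),
    (10, "d2", "cancel"),
    (20, "d1", "cancel"),
    (30, "d1", "cancel"),   # third cancel by d1 inside the window
    (40, "d2", "accept"),
    (50, "d2", "cancel"),
    (200, "d1", "cancel"),  # too late to count with the earlier ones
]
alerts = list(detect_repeated_cancels(events))
print(alerts)
```

The detector keeps only a small amount of state per driver and reacts as each event arrives, which is the general shape of an event-driven streaming application.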

7.Hive

Hive was implemented and open-sourced by Facebook.

It is a data warehouse tool built on Hadoop.

It can map structured data to database tables.

It provides HQL (Hive SQL) queries.

The underlying data is stored in HDFS.

In essence, Hive translates SQL statements into MapReduce jobs and runs them.

Users who are not familiar with MapReduce can easily use HQL to process and compute structured data on HDFS, which makes Hive suitable for offline batch computation.
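The idea that a SQL statement is turned into a MapReduce job can be illustrated by computing the same aggregation both ways. In this sketch Python's built-in `sqlite3` stands in for Hive (an assumption made so the example runs anywhere; nothing here talks to Hive or Hadoop), and the `orders` table and its rows are invented.

```python
import sqlite3
from collections import defaultdict

rows = [("beijing", 30), ("shanghai", 20), ("beijing", 50), ("shanghai", 5)]

# SQL version: sqlite3 stands in for Hive; the query is plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute("SELECT city, SUM(amount) FROM orders GROUP BY city")
)
conn.close()

# MapReduce-style version of the same query.
def map_row(row):
    """Map: the GROUP BY column becomes the key, the summed column the value."""
    city, amount = row
    return city, amount

grouped = defaultdict(list)
for key, value in map(map_row, rows):  # shuffle: group values by key
    grouped[key].append(value)

# Reduce: apply the aggregate function to each group.
mapreduce_result = {city: sum(amounts) for city, amounts in grouped.items()}

print(sql_result == mapreduce_result)
```

The GROUP BY column plays the role of the map output key and the aggregate function plays the role of the reducer, which is roughly the translation that saves users from writing MapReduce by hand.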

8.Elasticsearch

ES is a distributed full-text search server based on Lucene. It is somewhat similar to the full-text index (Fulltext Index) of SQL Server: both are full-text search engines built on word segmentation, with tokenization, synonym, and stemming query features. ES, however, is distributed and real-time by nature. It can be installed in a Windows environment, and the Head plugin can be used to manage it.
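Word segmentation, synonyms, and stemming all feed into one data structure, the inverted index, which maps each term to the documents that contain it. The following is a deliberately crude pure-Python sketch, not Elasticsearch or Lucene code: the one-entry synonym table, the "strip a trailing s" stemmer, and the sample documents are all assumptions for illustration.

```python
import re
from collections import defaultdict

SYNONYMS = {"quick": "fast"}  # hypothetical synonym table

def stem(token):
    """Very crude stemmer: strip a trailing 's'. Real analyzers do much more."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyze(text):
    """Tokenize, stem, and map synonyms to one canonical term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [stem(token) for token in tokens]
    return [SYNONYMS.get(term, term) for term in stems]

def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "Fast search engines",
    2: "Quick search over logs",
    3: "Slow batch jobs",
}
index = build_index(docs)
hits = search(index, "quick engine")
print(hits)
```

Because the query goes through the same analysis as the documents, "quick engine" matches a document that says "Fast search engines". In a distributed engine the index is additionally split into shards spread over several machines.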

To sum up

In the technology industry, something new appears every day, so you need to follow the latest technology trends and keep learning. With any technology, you generally learn the theory first and then refine your understanding of it through continual practice.

If you find that learning by reading is too slow, you can look for some online courses.

The ability to learn quickly, to solve problems, and to communicate is a really important measure in this industry.

Get good at using Stack Overflow and Google to help with the problems you run into while learning.

Origin blog.51cto.com/14296550/2401271