Big Data Second-Round Interview: A Rejection Write-Up

Before the interview

I was full of hope, happy about everything. I had bought a train ticket to Hangzhou, thinking of it as a ticket to my dream. To catch the train I got up at six in the morning. The sky was still dark, the supermarket hadn't opened, and the auntie who sold breakfast wasn't up yet, so I sat hungry on the bus for more than an hour, bumping all the way to the train station. I arrived early: I had bought a 10:27 ticket and exchanged it for an earlier one at 8:52. I bought a sweet-scented osmanthus cake in the station supermarket and ate it while I waited. Watching the roadside scenery after boarding, I seemed to fall into a painting. Full of expectation, my longing for the metropolis rose up on its own: a city in my mind, people striving, the city bustling. Suddenly the train tilted. I thought the track was uneven; after a while it returned to level, then leaned again, and I was startled. It turned out the rails are banked, probably for the curves. There were a lot of big brothers with backpacks in the carriage; looking at their faces, I felt they were programmers. They are role models for me, and seeing them made me more motivated. To steady myself for the interview, I took out my notes and reviewed them for a while...

When I arrived in Hangzhou, everything was just as I had imagined: an endless stream of people and traffic, tall buildings, and a tiny me. I had that feeling of "looking up at the vastness of the universe, looking down at the abundance of all things"; I was a little ant.

After searching for a while, I found the company I was interviewing with. The HR lady welcomed me warmly, poured me a cup of hot water, and took me to meet the senior engineer (the hair on the top of his head was a little sparse, hahaha). The interview was scheduled for after lunch. After we chatted for a while, the HR lady gave me a meal ticket. Once I had eaten my fill, the interview began.

1. Introduce yourself

2. Introduce your project

3. Hadoop cluster construction

1. Install the JDK
2. Set the hostname
3. Install ssh and configure password-free login
4. Modify the hosts file
5. Set up time synchronization
6. Upload the Hadoop package and unzip it
7. Configure core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, hadoop-env.sh, slaves (workers)
8. Format the NameNode
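
As a sanity check once the cluster is formatted and started, a minimal Java sketch along these lines (an assumption on my part: the Hadoop client jars and the *-site.xml files from step 7 are on the classpath, and the class name is illustrative) confirms that the configuration is actually being picked up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sanity check for a freshly built cluster: prints the configured
// default filesystem and lists the HDFS root directory.
public class ClusterCheck {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml, hdfs-site.xml, etc. from the classpath.
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```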

4. The working principle of MapReduce

1. When an MR program is launched, the first component to start is the MRAppMaster. Based on the job's description, the MRAppMaster calculates how many maptask instances are needed and then asks the cluster to start that many maptask processes.

2. After a maptask process starts, it processes the data in its assigned input split. The main flow is:

2.1 Use the client-specified InputFormat to obtain a RecordReader, which reads the data into input KV pairs

2.2 Pass the input KV pairs to the user-defined map() method, run the logic, and collect the KV pairs output by map() into an in-memory buffer

2.3 Sort the buffered KV pairs by key within each partition, then spill them to disk files

3. Once the MRAppMaster sees that all maptask processes have finished, it starts the number of reducetask processes specified by the client and tells each reducetask which range of data (which partition) it is responsible for

4. After a reducetask process starts, it fetches the relevant maptask output files from the machines where the map tasks ran (at the locations announced by MRAppMaster), merge-sorts them locally, and groups the KVs that share the same key. It then calls the user-defined reduce() method on each group, collects the output KVs, and finally calls the client-specified OutputFormat to write the result data to external storage
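
To make steps 1 through 4 concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (class names and paths are illustrative): ctx.write() in map() collects pairs into the in-memory buffer of step 2.2, the framework partition-sorts and spills them as in step 2.3, and reduce() receives the key-grouped values of step 4.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Step 2: the RecordReader hands each line to map() as a (offset, line) KV pair.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE); // collected into the buffer, then partition-sorted and spilled
            }
        }
    }

    // Step 4: values arriving here are already merge-sorted and grouped by key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```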

5. Talk about the join process

Simply put, a MapJoin reads the small table's data from HDFS into an in-memory hash table, then serializes that hash table into a hash table file. When the MapReduce job starts, the hash table file is uploaded to the Hadoop distributed cache, which ships it to the local disk of every Mapper. Every Mapper can therefore load the persisted hash table back into memory and join as before: the large table is scanned sequentially and the join completes in the map phase. This avoids the expensive shuffle and the reduce phase altogether.

MapJoin is divided into two stages:

1. A MapReduce Local Task reads the small table into memory, generates HashTableFiles, and uploads them to the Distributed Cache, where the HashTableFiles are compressed.

2. In the map phase of the MapReduce job, each Mapper reads the HashTableFiles from the Distributed Cache into memory, scans the large table sequentially, performs the join directly in the map phase, and passes the joined data to the next MapReduce task.
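
Hive generates and distributes its HashTableFiles internally, but the same idea can be sketched by hand with the plain MapReduce distributed-cache API. In the illustrative sketch below, the small table is assumed to be a comma-separated key,value file that the driver registered with job.addCacheFile(); the file name and layout are my assumptions, not Hive's actual format:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the small table arrives via the distributed cache and is
// loaded into an in-memory hash map once per mapper; the large table is
// streamed through map() and joined with no shuffle and no reduce phase.
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // The driver registered the small table with something like
        // job.addCacheFile(new URI("/dim/lookup.txt")) (path is illustrative);
        // the cached file is localized into the task's working directory.
        Path cached = new Path(ctx.getCacheFiles()[0].getPath());
        try (BufferedReader br = new BufferedReader(new FileReader(cached.getName()))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] kv = line.split(",", 2); // assumed layout: key,value
                if (kv.length == 2) smallTable.put(kv[0], kv[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String joinKey = value.toString().split(",", 2)[0];
        String matched = smallTable.get(joinKey);
        if (matched != null) {
            // Join completed in the map phase; the driver can set
            // job.setNumReduceTasks(0) so no shuffle happens at all.
            ctx.write(new Text(value + "," + matched), NullWritable.get());
        }
    }
}
```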
A Common Join, by contrast, shuffles both tables to the reducers on the join key and performs the join in the reduce phase; this is exactly the shuffle cost that MapJoin avoids.

6. The advantages of Hadoop's distributed design and how they are achieved

Convenient----->no need for expensive, highly reliable hardware; commodity hardware is sufficient.
Elastic----->cluster nodes are easy to add or remove.
Robust----->failure detection and recovery are automatic.
Simple----->users can quickly write efficient parallel distributed code.

7. If I were a junior student of yours, how would you explain distributed computing to me?

8. The differences, or advantages and disadvantages, of distributed databases versus ordinary databases

9. Tell me what the star schema is

10. Why use a data warehouse instead of an ordinary database? The difference between Hive and a traditional database

1. Schema on write vs. schema on read

A traditional database uses schema on write: data is parsed as it is loaded, so columns can be indexed and compressed, which improves query performance but costs more loading time.

Hive uses schema on read: LOAD DATA is very fast, because the data does not need to be parsed at load time; the files are only copied or moved.

2. Data format. Hive does not define a special data format; the user specifies it with three attributes: the column separator, the row separator, and the method for reading file data (see the JDBC sketch after this list). In a database, the storage engine defines its own data format, and all data is stored in that organization.

3. Data updates. Hive content is read often and written rarely, so Hive does not support rewriting or deleting data; the data is fixed when it is loaded. Data in a database usually needs to be modified frequently.

4. Execution latency. Hive has to scan the whole table (or partition) when querying, so its latency is high; it only has an advantage when processing big data. A database has low latency when processing small data.

5. Indexes. Hive's indexing is weak, so it is not suited to real-time queries; databases have mature indexes.

6. Execution engine. Hive executes via MapReduce; a database uses its own executor.

7. Scalability. Hive's is high; a database's is low.

8. Data scale. Hive handles large data; databases handle small data.
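
Points 1 and 2 above can be seen in action over JDBC. Below is a minimal sketch, under some assumptions of mine: a HiveServer2 listening on localhost:10000, the hive-jdbc driver on the classpath, and an illustrative CSV already sitting in HDFS at /data/logs.csv. The DDL declares the column and row separators, LOAD DATA only moves the file, and parsing happens at query time.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Schema on read over JDBC: the table DDL declares the delimiters and storage
// format; LOAD DATA just moves the file into the table's HDFS directory.
public class HiveSchemaOnRead {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement st = conn.createStatement()) {

            st.execute("CREATE TABLE IF NOT EXISTS logs (id INT, msg STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LINES TERMINATED BY '\\n' STORED AS TEXTFILE");

            // Fast: the file is moved into the table directory, not parsed.
            st.execute("LOAD DATA INPATH '/data/logs.csv' INTO TABLE logs");

            // Parsing happens here, at read time, using the declared delimiters.
            try (ResultSet rs = st.executeQuery("SELECT id, msg FROM logs LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
                }
            }
        }
    }
}
```

The LOAD DATA statement returns almost immediately precisely because nothing is read or validated; that is the schema-on-read trade-off from point 1.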

11. Campus experience

12. Career planning

13. Your turn to ask the interviewer questions

14. HR comes back and chats with you about non-technical topics

Then you can go home

Source: blog.csdn.net/qq_42706464/article/details/109128393