Big data interview questions (frequently asked)

**Interview questions (focus)**

1. The characteristics of an RDD (explain what an RDD is)
1. An RDD can be seen as consisting of a list of partitions
2. An RDD has dependencies on other RDDs
3. Operators (functions) act on the partitions
4. Partitioners act on (K, V)-format RDDs
5. Each partition provides preferred locations for the computation, i.e. data locality: move the computation to the data rather than moving the data
PS: an RDD does not store the data itself; an RDD can be seen as a reference to the data
Why an RDD is "resilient" (elastic)
1) Automatic switching between memory and disk storage
Spark puts data in memory first; if it does not fit, it spills to disk, and the program switches storage automatically.
2) Efficient lineage-based fault tolerance
Transformations on RDDs form a lineage (dependency chain); when part of an RDD is lost, the missing data can be regenerated by recomputing from the upstream RDDs.
3) Failed tasks are automatically retried a certain number of times
If a task computing an RDD fails, it is automatically recomputed; the default is 4 attempts.
4) Failed stages are automatically retried a certain number of times
If a stage of a job fails, the framework automatically recomputes its tasks; the default is 4 attempts.
5) Checkpoint and persist can be triggered explicitly or implicitly
An RDD can be cached in memory or on disk with persist, so the next time the RDD is used it is simply read back; an RDD can also be checkpointed, in which case the data is stored on HDFS and all parent dependencies of that RDD are removed (a sketch of both follows this list).
6) Elastic data scheduling
Spark abstracts job execution into a general directed acyclic graph (DAG); the tasks of the stages can run in parallel or in series, and the engine automatically handles stage and task failures.
7) Highly elastic data partitioning
The number of partitions can be adjusted dynamically according to the characteristics of the workload to improve the overall efficiency of the application.
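A minimal sketch of persist and checkpoint as described in point 5; the SparkContext `sc`, the input path and the checkpoint directory are placeholder assumptions:

```scala
import org.apache.spark.storage.StorageLevel

// assumes an existing SparkContext `sc` and reachable HDFS paths (placeholders)
sc.setCheckpointDir("hdfs:///tmp/ckpt")

val rdd = sc.textFile("hdfs:///data/input").map(_.toUpperCase)

// persist: keep the RDD in memory (spilling to disk) so later actions reuse it
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// checkpoint: write the RDD to HDFS and cut its lineage (parent dependencies dropped)
rdd.checkpoint()

rdd.count()   // the first action materializes both the cache and the checkpoint
rdd.first()   // served from the cache / checkpoint, not recomputed from the source
```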



2. The two types of RDD operators
The RDD programming API
RDDs support two kinds of operations: transformations and actions. A transformation returns a new RDD, e.g. map() and filter(); an action returns a result to the driver program or writes the result to an external system, e.g. count() and first().
Spark uses lazy evaluation: an RDD is only actually computed the first time it is used in an action, which allows Spark to optimize the whole computation. By default, Spark recomputes an RDD every time an action runs on it; to reuse the same RDD across multiple actions, use RDD.persist() to have Spark cache it, as the sketch below shows.
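A small sketch of the transformation/action distinction and lazy evaluation; the SparkContext `sc` and the log path are placeholder assumptions:

```scala
// transformations: build a new RDD, nothing is computed yet (lazy)
val lines  = sc.textFile("hdfs:///data/app.log")          // placeholder path
val errors = lines.filter(_.contains("ERROR"))
val pairs  = errors.map(line => (line.split(" ")(0), 1))

// cache before running several actions, otherwise each action recomputes the chain
pairs.persist()

// actions: trigger the actual computation and return results to the driver
val total = pairs.count()
val first = pairs.first()
```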
3.25.17 Transformation operators (important)
All transformations on an RDD are lazy: they do not compute their results right away. Instead, they just remember the transformations applied to the base data set (e.g. a file). The transformations only actually run when an action asks for a result to be returned to the driver. This design lets Spark run more efficiently. (A short example follows the list.)
The main transformations and their meanings:
map(func): returns a new RDD formed by passing each element of the source through the function func.
filter(func): returns a new RDD formed by selecting the elements of the source on which func returns true.
flatMap(func): similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).
mapPartitions(func): similar to map, but runs independently on each partition of the RDD; when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].
mapPartitionsWithIndex(func): similar to mapPartitions, but func also receives an integer index of the partition; when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].
sample(withReplacement, fraction, seed): samples the data at the specified fraction, with or without replacement, using the given random-number-generator seed.
union(otherDataset): returns a new RDD that is the union of the source RDD and the argument RDD.
intersection(otherDataset): returns a new RDD that is the intersection of the source RDD and the argument RDD.
distinct([numTasks]): returns a new RDD containing the distinct elements of the source RDD.
groupByKey([numTasks]): when called on an RDD of (K, V) pairs, returns an RDD of (K, Iterator[V]) pairs.
reduceByKey(func, [numTasks]): when called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values of each key are aggregated with the given reduce function; like groupByKey, the number of reduce tasks can be set through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): aggregates the values of the same key using a neutral initial value. zeroValue: the neutral value, which defines the return type and takes part in the computation; seqOp: combines values within a partition; combOp: combines the results of different partitions.
sortByKey([ascending], [numTasks]): when called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key.
sortBy(func, [ascending], [numTasks]): similar to sortByKey, but more flexible.
join(otherDataset, [numTasks]): when called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherDataset, [numTasks]): when called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable<V>, Iterable<W>)).
cartesian(otherDataset): Cartesian product.
pipe(command, [envVars]): pipes the RDD through a shell command, producing a new RDD.
coalesce(numPartitions): repartitions (reduces the number of partitions).
repartition(numPartitions): repartitions.
repartitionAndSortWithinPartitions(partitioner): repartitions and sorts within each partition.
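A short sketch exercising several of the transformations above (flatMap, filter, map, reduceByKey, aggregateByKey, join, sortByKey); the SparkContext `sc` and the sample data are placeholder assumptions:

```scala
val words = sc.parallelize(Seq("spark shuffle", "spark rdd", "hive sql"))
  .flatMap(_.split(" "))                      // flatMap: one line -> many words
  .filter(_.nonEmpty)
  .map(word => (word, 1))                     // map to (K, V) pairs

val counts = words.reduceByKey(_ + _)         // aggregate values of the same key

// aggregateByKey: same result as reduceByKey here, with an explicit zero value
val counts2 = words.aggregateByKey(0)(_ + _, _ + _)

val meta   = sc.parallelize(Seq(("spark", "engine"), ("hive", "warehouse")))
val joined = counts.join(meta)                // (K, (count, description))

val sorted = counts.sortByKey()               // K must have an ordering
sorted.collect().foreach(println)             // collect is an action: triggers execution
```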
3.25.18 Action operators (important)
Actions run a computation on the RDD and return the result to the driver, or write it to the file system. (A short example follows the list.)
The main actions and their meanings:
reduce(func): aggregates all elements of the RDD with the function func, which must be commutative and associative so it can be computed in parallel.
collect(): returns all elements of the data set as an array to the driver.
count(): returns the number of elements in the RDD.
first(): returns the first element of the RDD (similar to take(1)).
take(n): returns an array with the first n elements of the data set.
takeSample(withReplacement, num, [seed]): returns an array of num randomly sampled elements of the data set, with or without replacement, optionally specifying a random-number-generator seed.
takeOrdered(n, [ordering]): similar to top, but returns the elements in the opposite (ascending) order.
saveAsTextFile(path): saves the elements of the data set as a text file to HDFS or another supported file system; for each element, Spark calls toString to convert it into a line of text.
saveAsSequenceFile(path): saves the elements of the data set in Hadoop SequenceFile format to the given directory on HDFS or another Hadoop-supported file system.
saveAsObjectFile(path): saves the elements of the data set using Java serialization.
countByKey(): for an RDD of type (K, V), returns a map of (K, Int) giving the number of elements for each key.
foreach(func): runs the function func on each element of the data set.
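A short sketch of several of the actions above; the SparkContext `sc` and the output path are placeholder assumptions:

```scala
val nums = sc.parallelize(1 to 10, 2)

val sum  = nums.reduce(_ + _)           // commutative + associative function
val all  = nums.collect()               // Array(1, 2, ..., 10) on the driver
val n    = nums.count()                 // 10
val head = nums.take(3)                 // Array(1, 2, 3)

val pairs  = nums.map(x => (x % 2, x))
val perKey = pairs.countByKey()         // Map(0 -> 5, 1 -> 5)

nums.foreach(x => println(x))           // runs on the executors, not the driver
nums.saveAsTextFile("hdfs:///tmp/nums") // placeholder output directory
```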



3. How operators work (which operators cause a shuffle, and why)
4. How shuffle works (and how it differs from Hadoop's shuffle)
Shuffle happens between map and reduce and is the bridge between them: it is the process of turning irregularly distributed data into regularly partitioned data. Data is pulled from the map side, sorted, spilled to the local disk and merged; the reduce side then reads the sorted, grouped data from the map-side disks.
Early Spark versions used hash-based shuffle (HashShuffle).
Before optimization, each Executor ran multiple map tasks, and each map task produced one bucket file per reducer, which the reduce side then had to group. This created a huge number of bucket files that were troublesome to read and slow to group, hurting overall efficiency.
After the consolidation optimization (Spark 1.6-2.0), the bucket files produced by the map tasks were grouped (consolidated) before being processed by the reduce side; this reduced the number of buckets, but there were still many small files.
Third, sort-based shuffle.
To alleviate the problems of too many files and excessive Writer cache overhead during shuffle, Spark introduced a shuffle mechanism similar to Hadoop MapReduce's. Under this mechanism, each ShuffleMapTask no longer creates a separate file for each downstream task; instead it writes all of its results into a single data file and generates a corresponding index file. Previously, data was cached in memory until it had all been written out to disk; now, to reduce memory usage, output can be spilled to disk when memory is insufficient, and at the end the in-memory data and the spilled files are merged, which reduces the amount of memory used. On one hand this greatly reduces the number of files; on the other hand it reduces the memory occupied by the Writer cache, while also avoiding GC risk and frequency.

Sort-based shuffle has several different strategies: BypassMergeSortShuffleWriter (the bypass mechanism), SortShuffleWriter (the general mechanism), and UnsafeShuffleWriter.

BypassMergeSortShuffleWriter has the following characteristics:
# It is used for shuffles that need neither map-side aggregation nor sorting; data is written directly to files, so with large data volumes the network I/O and memory burden is heavier.
# It mainly targets cases where the number of reducer tasks is relatively small.
# Each partition is written to a separate file, and the files are merged at the end to reduce the file count; however, this requires many files to be open concurrently, so memory consumption is relatively high.
Because this path is faster than SortShuffleWriter, it is chosen (i.e. its enabling conditions are) when the number of reducers is not large, no map-side aggregation or sorting is needed, and the number of reducers is smaller than the threshold given by spark.shuffle.sort.bypassMergeThreshold (default 200).

SortShuffleWriter has the following characteristics:
# It is suitable for scenarios with large data volumes or large clusters.
# Its sorter supports both map-side (local) aggregation and no aggregation.
# Its external sorter supports spilling: if memory is not enough, output is first spilled to the local disk, and at the end the in-memory data and the spilled files are merged.
Also, sort-based shuffle has no relationship with the number of Executor cores, i.e. with the degree of parallelism: each ShuffleMapTask produces one data file and one index file, and the so-called merge only merges, inside each ShuffleMapTask's own data file, the data belonging to each partition. This must be distinguished from the consolidation mechanism of hash-based shuffle.
UnsafeShuffleWriter needs to be used with caution, so we will not analyze it here.

Shuffle tuning (a configuration sketch follows):
1. Consolidate map output files: enable the output-file consolidation mechanism with spark.shuffle.consolidateFiles.
2. Adjust the map-side write buffer spark.shuffle.file.buffer and the reduce-side aggregation memory fraction spark.shuffle.memoryFraction (default 0.2).
3. Choose the ShuffleManager.
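A hedged sketch of setting these parameters on a SparkConf; note that spark.shuffle.consolidateFiles, spark.shuffle.memoryFraction and spark.shuffle.manager belong to older Spark versions (hash shuffle / static memory management) and are ignored or removed in newer releases, and the values shown are only illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-tuning-sketch")
  // 1. consolidate map output files (hash-shuffle era only)
  .set("spark.shuffle.consolidateFiles", "true")
  // 2. map-side write buffer per shuffle file (default 32k) and
  //    reduce-side aggregation memory fraction (default 0.2, legacy memory manager)
  .set("spark.shuffle.file.buffer", "64k")
  .set("spark.shuffle.memoryFraction", "0.3")
  // 3. choose the shuffle manager / bypass threshold
  .set("spark.shuffle.manager", "sort")
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")

val sc = new SparkContext(conf)
```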

5. Checkpoint, persist (cache), and shared variables
https://blog.csdn.net/wjn19921104/article/details/80268661
6. Partitioning (custom partitioners, the default partitioner, their differences and roles)

7. Parallelism (the parallelism of running tasks); see sections 1.2.3 of the tuning document


8. How a Spark job runs (the key focus)

9. How tasks work (locality levels)
1. Concept: before a task is executed, the partition and location information of its data is obtained, and the task is preferentially assigned to the node where the data lives, to reduce network transfer as much as possible.
2. Process: the scheduler waits (3s by default) and retries 5 times; once assignment at a level times out, it falls back to a locality level one step worse. If data transfer does happen, the task first asks its local BlockManager for the data; if the data is not local, it is fetched via getRemote from the BlockManager of the node that holds it and returned to the node running the task.
3. Levels:
PROCESS_LOCAL: process-local, the best performance. The code and the data are in the same process, i.e. the same executor; the task is executed by the executor whose BlockManager holds the data.
NODE_LOCAL: node-local. The code and data are on the same node, e.g. the data is an HDFS block on that node and the task runs in an executor on that node; or the data and the task are in different executors on the same node, so the data has to be transferred between processes.
NO_PREF: the data is equally fast to access from anywhere, e.g. data read from a database; locality makes no difference for the task.
RACK_LOCAL: the task and the data are on two different nodes of the same rack; the data is transferred between nodes over the network.
ANY: the task and the data can be anywhere in the cluster, not even on the same rack; the worst performance.
4. Tuning: spark.locality.wait defaults to 3s; by default spark.locality.wait.process, spark.locality.wait.node and spark.locality.wait.rack all inherit the spark.locality.wait value.
Adjust them to different values based on the actual workload to get the best assignment behavior (a configuration sketch follows).
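A minimal sketch of adjusting the locality-wait parameters from point 4; the values are illustrative only:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-tuning-sketch")
  // global wait before falling back to the next (worse) locality level
  .set("spark.locality.wait", "6s")
  // per-level overrides; by default they inherit spark.locality.wait
  .set("spark.locality.wait.process", "6s")
  .set("spark.locality.wait.node", "3s")
  .set("spark.locality.wait.rack", "1s")
```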

10. How the DAG works (at the source-code level)
When stages are divided, the ResultStage is created by the createResultStage method, and getShuffleDependencies is then used to derive dependencies recursively. The principle: start from the last RDD and examine its relationship with its parent RDDs; if the dependency is a shuffle (wide) dependency, a ShuffleMapStage is split off; if not, keep pushing forward to the parents until all dependencies have been parsed, which yields the set of wide dependencies; getOrCreateShuffleMapStage then generates a stage for each of them, completing the stage division. Finally, submitStage recursively finds all parent stages via getMissingParentStages (similar to the RDD traversal) and calls submitMissingTasks to wrap each stage into Tasks.

11. The difference between SparkSQL and Hive

PS: https://www.cnblogs.com/lixiaochun/p/9446350.html
12. The relationship between DataFrame (DF) and Dataset (DS) (in terms of typing)
DataFrame is weakly typed: it is an abstract data set, essentially an RDD plus a Schema, and can be operated on like a two-dimensional table.
Dataset: the parent abstraction of DataFrame (DataFrame = Dataset[Row]), strongly typed. (A small sketch follows.)
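A small sketch of the DataFrame / Dataset relationship; the SparkSession `spark` and the Person case class are placeholder assumptions:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._   // assumes an existing SparkSession named `spark`

case class Person(name: String, age: Int)

// Dataset[Person]: strongly typed, fields checked at compile time
val ds: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

// DataFrame is just an alias for Dataset[Row]: weakly typed, columns resolved at runtime
val df: DataFrame = ds.toDF()

df.filter($"age" > 26).show()   // column name checked only at runtime
ds.filter(_.age > 26).show()    // typed lambda, checked at compile time
```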

13. Window functions (ranking functions)
rank() ranks with gaps: if two rows tie for second place, the next row is ranked fourth. dense_rank() ranks consecutively: if two rows tie for second place, the next row is ranked third. over() opens the window:
after an aggregate function, many rows become one row, whereas a window function keeps every row (one output row per input row);
and with an aggregate function, any extra column you want to display must be added to the GROUP BY, whereas with a window function no GROUP BY is needed and all columns can be displayed directly.
A window function appends the result of the aggregate as an extra column to every row.
How window functions are used:
1. Show aggregate information for each row: aggregate_function() OVER ().
2. Give each row the aggregate result of its group: aggregate_function() OVER (PARTITION BY field) AS alias
- the data is grouped by the field and the aggregate is computed per group.
3. Combine with ranking functions: ROW_NUMBER() OVER (ORDER BY field) AS alias (the most commonly used analytic pattern: number the sorted rows 1, 2, 3, ...).
1. row_number() over(partition by ... order by ...)
2. rank() over(partition by ... order by ...)
3. dense_rank() over(partition by ... order by ...)
4. count() over(partition by ... order by ...)
5. max() over(partition by ... order by ...)
6. min() over(partition by ... order by ...)
7. sum() over(partition by ... order by ...)
8. avg() over(partition by ... order by ...)
9. first_value() over(partition by ... order by ...)
10. last_value() over(partition by ... order by ...)
11. lag() over(partition by ... order by ...)
12. lead() over(partition by ... order by ...)
lag and lead fetch, for the current row, the value of a column from a row that is a given number of positions before or after it in the window's sort order (without joining the result set to itself); lag looks at preceding rows and lead looks at following rows;
both lag and lead take three parameters: the column name, the offset, and the default value to use when the offset falls outside the window. (A Spark SQL sketch follows.)
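A Spark SQL sketch of the ranking and lag/lead functions above; the SparkSession `spark` and the sales table are placeholder assumptions:

```scala
import spark.implicits._   // assumes an existing SparkSession `spark`

Seq(("east", "a", 100), ("east", "b", 100), ("east", "c", 80),
    ("west", "d", 90),  ("west", "e", 70))
  .toDF("region", "shop", "amount")
  .createOrReplaceTempView("sales")

spark.sql("""
  SELECT region, shop, amount,
         rank()       OVER (PARTITION BY region ORDER BY amount DESC) AS rk,   -- 1,1,3 on ties
         dense_rank() OVER (PARTITION BY region ORDER BY amount DESC) AS drk,  -- 1,1,2 on ties
         row_number() OVER (PARTITION BY region ORDER BY amount DESC) AS rn,
         sum(amount)  OVER (PARTITION BY region)                      AS region_total,
         lag(amount, 1, 0)  OVER (PARTITION BY region ORDER BY amount DESC) AS prev_amount,
         lead(amount, 1, 0) OVER (PARTITION BY region ORDER BY amount DESC) AS next_amount
  FROM sales
""").show()
```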

14. SparkSQL UDFs (user-defined functions)



15. The two ways SparkStreaming integrates with Kafka (key)


16. How to solve a data backlog (e.g. enable the backpressure mechanism, or add partitions and consumers)
Backpressure mechanism:
The cause of the backlog: data flows in continuously, and every batch interval Spark Streaming treats the data that arrived during that interval as one batch, then submits a new job whose DAG takes that batch's data as its input RDDs. When the processing time of a batch is greater than the batch interval, the data is arriving faster than it can be processed, and it accumulates at the receiving end (the Receiver, which usually runs on an Executor), where it is managed by the BlockManager. If the data is stored with MEMORY_ONLY this leads to OOM; with MEMORY_AND_DISK the excess is saved to disk, but that increases data-read time.

Parameters (a configuration sketch follows):
spark.streaming.backpressure.enabled set to true enables backpressure; spark.streaming.kafka.maxRatePerPartition caps the per-partition, per-second consumption rate; spark.streaming.backpressure.rateEstimator selects the rate-estimator class, default pid, currently the only one Spark supports.
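A minimal sketch of the backpressure-related settings above; the max-rate value is only illustrative:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("backpressure-sketch")
  // let Spark Streaming adapt the ingestion rate to the processing rate
  .set("spark.streaming.backpressure.enabled", "true")
  // upper bound on records consumed per Kafka partition per second
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // rate estimator used by backpressure; "pid" is the default and only built-in one
  .set("spark.streaming.backpressure.rateEstimator", "pid")
```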


17. How to guarantee data consistency (producer and consumer)



18. Kafka's data transmission mechanism (refer to the diagram)

19. How Kafka guarantees data is not lost (consumer side)

20. Redis data types

21. Redis persistence
RDB persistence is done through snapshots (snapshotting): when certain conditions are met, Redis automatically snapshots the data in memory and persists it to disk.
RDB is the persistence mode Redis uses by default; the default redis.conf file contains configuration such as: save 900 1
With RDB persistence, once Redis exits abnormally you lose all data changed after the last snapshot. Developers therefore need to set the automatic-snapshot conditions according to the specific application so that the possible data loss stays within an acceptable range. If the data is so important that no loss is acceptable, consider using AOF persistence instead.

22. Redis cache avalanche and cache breakdown (see the interview collection, at the bottom)

23. How a Redis cluster works (what to pay attention to when building a cluster)


24. Accumulating state in SparkStreaming (refer to updateStateByKey and mapWithState; a sketch follows the links below)
https://blog.csdn.net/zhanglh046/article/details/78505124
https://www.cnblogs.com/yinpin2011/p/5539708.html
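A hedged sketch of cumulative word counting with updateStateByKey; the StreamingContext `ssc`, the checkpoint path and the socket source are placeholder assumptions (a checkpoint directory is required for stateful operations):

```scala
// assumes an existing StreamingContext `ssc`; stateful ops need a checkpoint dir
ssc.checkpoint("hdfs:///tmp/streaming-ckpt")          // placeholder path

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// merge this batch's counts into the running total kept per key
val totals = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
}

totals.print()
```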
25. The difference between transform and foreachRDD


26. The difference between partition-level and element-level operators (mapPartitions vs map, foreachPartition vs foreach)


27. What is a DStream
Spark Streaming overview
Spark Streaming is similar to Apache Storm and is used to process streaming data. According to its official documentation, Spark Streaming offers high throughput and strong fault tolerance. Spark Streaming supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has been ingested, it can be processed with Spark's high-level primitives such as map, reduce, join and window, and the results can be saved in many places, such as HDFS or databases. In addition, Spark Streaming integrates well with MLlib (machine learning) and GraphX.
The DStream concept
A Discretized Stream (DStream) is Spark Streaming's basic abstraction; it represents a continuous stream of data, either the input stream or the stream produced by applying Spark primitives to another DStream. Internally, a DStream is represented as a series of consecutive RDDs, each containing the data of one time interval.
Types of DStream primitives (important)
DStream primitives are similar to those on RDDs; they are divided into Transformations and Output Operations, and besides the ordinary transformations there are some special primitives, such as updateStateByKey(), transform(), and the various window-related primitives.
Details: see the Spark Streaming courseware.


28. Tuning (refer to the Spark tuning document)
29. Data skew solutions (refer to the Spark core analysis and tuning guide)
30. JVM tuning (refer to the Spark tuning document)
31. The GC garbage collection mechanism (algorithm theory)
32. Handwrite quicksort or another algorithm (algorithm basics)
33. The Spark memory model
34. Handwrite the singleton pattern

Hadoop
1. The HDFS file storage mechanism (write process)


2. How MapReduce works (Map and Reduce)
3. How shuffle works
4. Hive internal (managed) and external tables
5. Hive dynamic partitioning
6. Hive bucketing
7. The difference between Hive and MySQL
8. The difference between Hive and HBase
9. Hive tuning (see the interview collection)



10. HBase hot spots (when they are triggered, how to avoid them)
One: causes of the hotspot problem
1. HBase sorts data by the lexicographic order of the rowkey; when a large number of consecutive rowkeys are written into a few individual regions, the data is distributed unevenly across regions;
2. The table was created without pre-splitting, so by default it has only one region and large volumes of writes all go to that region;
3. The table was pre-split in advance, but the rowkey design follows no rules; the rowkey should be composed of regionNo + messageId.

Two: solutions
Salting
The salt here is not a cryptographic salt but a random number added in front of the rowkey; specifically, the rowkey is given a random prefix that makes it start differently from its original value. The number of different prefixes should equal the number of regions you want the data spread across. After salting, rowkeys are scattered across the regions according to their randomly generated prefixes, which avoids hot spots.
Hashing
Hashing makes a given row always get the same prefix, unlike random salting. Hashing also spreads the load across the cluster, but reads stay predictable: using a deterministic hash lets the client reconstruct the complete rowkey and use a get operation to fetch a specific row precisely.
Reversal
A third way to prevent hot spots is to reverse a fixed-length or numeric rowkey, so that the part that changes most often (the least significant part) comes first. This effectively randomizes the rowkey, but at the cost of losing its ordering.
An example of rowkey reversal is using a phone number as the rowkey: the reversed phone-number string can be used as the rowkey, which avoids the hotspot problem caused by the fixed prefixes at the start of phone numbers.
A common data-access problem is quickly getting the most recent version of a record; using a reverse timestamp as part of the rowkey is very useful here. Append Long.MAX_VALUE - timestamp to the end of the key, e.g. [key][reverse_timestamp]; the most recent value of [key] can then be obtained from the first row of a Scan on [key]. (A sketch of these three techniques follows.)
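A small Scala sketch of the three rowkey techniques above (salting, hashing, reversal plus reverse timestamp); the number of salt buckets and the key layout are illustrative assumptions, not a fixed HBase API:

```scala
import java.security.MessageDigest

val saltBuckets = 10   // assumed to match the number of pre-split regions

// 1. salting: random prefix, spreads writes but makes point reads harder
def saltedKey(key: String): String =
  f"${scala.util.Random.nextInt(saltBuckets)}%02d_$key"

// 2. hashing: deterministic prefix, spreads writes and still allows exact GETs
def hashedKey(key: String): String = {
  val md5 = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"))
  f"${(md5(0) & 0xff) % saltBuckets}%02d_$key"
}

// 3. reversal: put the fast-changing digits first, e.g. a phone number
def reversedKey(phone: String): String = phone.reverse

// reverse timestamp so the newest version of a key sorts first in a Scan
def keyWithReverseTs(key: String, ts: Long): String =
  s"${key}_${Long.MaxValue - ts}"
```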

11. How HBase works (reads and storage/writes)


12. HBase rowkey design principles
1. Rowkey length principle: the maximum length is 64 KB, but it is usually designed with a fixed length. The advice is to keep it as short as possible, no longer than 16 bytes, because long rowkeys take up HFile and memory space.
2. Rowkey hashing principle: if the rowkey is an incrementing timestamp, do not put the time at the front of the binary layout; it is recommended to use a hash as the high-order part of the rowkey (generated by the program) and put the time field in the low-order part.
3. Rowkey uniqueness principle: the design must guarantee uniqueness. Rowkeys are stored sorted in lexicographic order; take full advantage of this property by storing data that is often read together next to each other, so that the most recently accessed data is likely to be in the same block.

13. HBase's advantages compared to other databases (features)


14. Which Flume sources are there


15. Flume high availability (how to achieve it)


16. Linux commands, HDFS commands (basic commands)
17. ZooKeeper's election mechanism (how it is implemented internally)
18. The differences between Oozie and Azkaban (mainly configuration)
19. Bloom filters (how they work)


Master one or two algorithms (principles)

Things you need to know
1. Project architecture: all of it
2. Project flow: where the data goes, data processing time, metrics
3. Project staffing
4. Project data volume
5. Project issues: mostly Kafka questions (data consistency, data integrity, dynamic partition expansion, data backlog, Kafka throughput) and SparkStreaming questions (per-batch data volume, batch interval, job queueing)
6. Cluster size (high availability)
7. Project description


Source: https://www.cnblogs.com/-courage/p/11497355.html