2019-12-30 Interview reflection

  • Interview company: a small-to-medium-sized electricity provider

  • Interview questions:

    • Don't let your thinking jump around too much, and don't get flustered easily.

    • Practice quickly working out in your head what to say next.

    • For the topics you know well, prepare thoroughly enough to be more than 80% sure of the outcome.

    • Sorting algorithms: if asked to pick one, quicksort and merge sort are recommended; practice writing them out by hand before the interview to get a feel for them.

      • Bucket sort: a difficulty to prepare for: under what conditions can the utilization of sub-bucket space be optimized?

      • Next time I plan to dig into bucket sort under those conditions and see.
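A minimal handwritten sketch (not from the original notes) of the two recommended sorts, plus a simple bucket sort for the point above:

```python
def quick_sort(arr):
    """Concise interview-style quicksort via list partitioning."""
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    mid = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + mid + quick_sort(right)

def merge_sort(arr):
    """Top-down merge sort."""
    if len(arr) <= 1:
        return arr
    m = len(arr) // 2
    left, right = merge_sort(arr[:m]), merge_sort(arr[m:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

def bucket_sort(arr, n_buckets=10):
    """Bucket sort for roughly uniform data; n_buckets is the knob that
    controls how well sub-bucket space is utilized."""
    if not arr:
        return arr
    lo, hi = min(arr), max(arr)
    width = (hi - lo) / n_buckets or 1   # guard against all-equal input
    buckets = [[] for _ in range(n_buckets)]
    for x in arr:
        idx = min(int((x - lo) / width), n_buckets - 1)
        buckets[idx].append(x)
    return [x for b in buckets for x in sorted(b)]
```

If the data is not uniformly distributed, most elements land in a few buckets and the sub-bucket space is wasted, which is exactly the optimization question raised above.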

    • Spark Core -> RDD -> operator categories -> resource scheduling -> Spark on YARN

    • Small file processing:

      • hdfs
        • Hadoop Archive(HAR)
        • Sequence file
        • CombineFileInputFormat
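As an illustration of the HAR approach, a typical invocation looks like this (paths are hypothetical examples):

```shell
# Pack the small files under /user/data/logs into one archive
hadoop archive -archiveName logs.har -p /user/data logs /user/data/archived
# Read back through the har:// scheme
hdfs dfs -ls har:///user/data/archived/logs.har
```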
      • HBase
        • Basically: minor compaction, major compaction (actually deletes data; trigger the merge manually at midnight with a crontab script), and rowKey design
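The midnight major-compaction crontab mentioned above might look like this (table name and schedule are hypothetical):

```shell
# crontab entry: trigger a major compaction on a given table at 00:30 every night
30 0 * * * echo "major_compact 'my_table'" | hbase shell
```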
    • SparkStreaming

      • Direct mode is recommended; pair it with Kafka 0.11 or later, which have so far remained compatible.
        • Consumer offsets are not managed with ZooKeeper; Spark manages them itself. By default they are held in memory, and if a checkpoint is set, a copy is also saved in the checkpoint.
        • Kafka is read with the Simple Consumer API, so you can maintain consumer offsets manually
        • Parallelism corresponds one-to-one with the partitions of the topic being read
        • You can manage consumer offsets through checkpoints and recover with StreamingContext.getOrCreate(ckDir, CreateStreamingContext).
        • If the code logic changes, checkpoint-based offset management can no longer be used; instead, maintain consumer offsets manually and store them in an external system.
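The manual-offset pattern above can be modeled in a few lines of plain Python (a toy model, not the real Kafka API; the store and "log" are stand-ins):

```python
# Toy model of manual offset management: offsets live in an external store
# and are committed only after a batch is fully processed, so a restart
# resumes from the last committed offset.

external_store = {}                  # stands in for ZooKeeper/Redis/MySQL/etc.
log = ["a", "b", "c", "d", "e"]      # stands in for one Kafka partition

def process_batch(group, batch_size=2):
    start = external_store.get(group, 0)              # last committed offset
    batch = log[start:start + batch_size]
    results = [record.upper() for record in batch]    # "processing"
    external_store[group] = start + len(batch)        # commit after success
    return results
```

Committing after processing gives at-least-once semantics: if the job dies mid-batch, the uncommitted batch is simply reprocessed on restart.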
      • Integration with Kafka 0.11 and above
      • kafka characteristics:
        • ISR
        • ACK
        • PageCache
        • ZeroCopy(netty)
        • Disk Sequential Write
        • Messages are retained for seven days by default
        • RocksDB can be used directly as the bundled underlying (state) database
        • Fault Tolerance (checkpoint)
        • Whether state is kept (stateful vs. stateless)
        • Backpressure mechanism
          • Dynamically adjusts the data processing rate according to the ingestion rate, in order to achieve rate limiting.
          • When the batch processing time (Batch Processing Time) exceeds the batch interval (Batch Interval, i.e. BatchDuration), data is being processed slower than it is ingested; if this lasts too long or the source data spikes, data piles up in memory and tasks or Executors can crash with OOM.
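The backpressure mechanism described above is switched on through Spark configuration; the rate values below are illustrative, not recommendations:

```properties
spark.streaming.backpressure.enabled=true
# cap for the very first batch, before the rate estimator has data
spark.streaming.backpressure.initialRate=1000
# hard per-partition upper bound in Direct mode
spark.streaming.kafka.maxRatePerPartition=10000
```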
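For reference, the ISR/ACK and retention items in the Kafka list above map to configuration like this (the `min.insync.replicas` value is illustrative):

```properties
# producer: wait for all in-sync replicas to acknowledge a write
acks=all
# broker/topic: minimum ISR size for an acks=all write to succeed
min.insync.replicas=2
# broker: default retention of seven days (168 hours)
log.retention.hours=168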
      • Message idempotence
        • At least once
        • Exactly once
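A toy illustration of the idea (not Kafka code): an idempotent sink turns at-least-once delivery into effectively-exactly-once results, because replaying a message changes nothing.

```python
# Keyed upserts make redelivered messages harmless.
sink = {}

def write_idempotent(message_id, value):
    """Upsert by message id: applying the same message twice is a no-op."""
    sink[message_id] = value

# at-least-once delivery may replay message 1
for mid, val in [(1, "a"), (2, "b"), (1, "a")]:
    write_idempotent(mid, val)
```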
    • SparkSQL

      • Spark on Hive / Hive on Spark
      • Dataset vs. DataFrame
        • Dataset -> DataFrame
      • He brought up the underlying architecture; I couldn't remember it and had nothing to say.
      • Predicate pushdown (this was supposed to be mentioned)
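Predicate pushdown in one toy illustration (plain Python, not the Spark API): the point is to apply the filter while scanning the source instead of materializing everything first.

```python
def scan_without_pushdown(rows, predicate):
    """Load every row, then filter: materializes the whole source."""
    loaded = list(rows)
    return [r for r in loaded if predicate(r)], len(loaded)

def scan_with_pushdown(rows, predicate):
    """Filter during the scan: only matching rows are materialized."""
    kept = [r for r in rows if predicate(r)]
    return kept, len(kept)
```

In SparkSQL the optimizer does this automatically where the data source supports it (e.g. Parquet), which is why it was worth saying out loud.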
    • Machine Learning:

      • He asked a simple question about gradient descent for linear regression
      • If you must use linear regression on clustered data, how do you deal with it?
        • By adding degree-L polynomial features (for example, with sklearn's PolynomialFeatures)
        • Conclusion: on low-dimensional datasets a linear model often underfits; after expanding the dataset with polynomial features, the underfitting of a linear model can be resolved to some extent.
        • After raising the dimensionality by several degrees, linear regression generally no longer has a performance advantage.
        • High-dimensional data is better suited to a support vector machine (SVM).
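A small sketch tying the two ML points together (data and names are illustrative, not from the interview): plain gradient descent for linear regression, plus a hand-rolled polynomial expansion standing in for sklearn's PolynomialFeatures to show how it fixes underfitting.

```python
def fit(X, ys, lr=0.01, epochs=5000):
    """Gradient descent on mean squared error for y ≈ w·x + b."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grads_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(X, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(d):
                grads_w[j] += 2 * err * x[j] / n
            grad_b += 2 * err / n
        w = [wi - lr * g for wi, g in zip(w, grads_w)]
        b -= lr * grad_b
    return w, b

def poly_expand(xs, degree=2):
    """Toy stand-in for PolynomialFeatures on a 1-D input: [x, x^2, ...]."""
    return [[x ** d for d in range(1, degree + 1)] for x in xs]

# y = x^2: a straight line underfits (it flattens to the mean of y),
# but a degree-2 expansion lets the "linear" model recover the curve.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [4.0, 1.0, 0.0, 1.0, 4.0]
w_lin, b_lin = fit([[x] for x in xs], ys)      # underfits: w ≈ 0, b ≈ 2
w_poly, b_poly = fit(poly_expand(xs, 2), ys)   # recovers w ≈ [0, 1], b ≈ 0
```

This is the conclusion above in miniature: the model is still linear in its parameters, only the feature space changed.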

Origin www.cnblogs.com/ronnieyuan/p/12127159.html