Advanced interview questions (Part 6)

Data warehouse layering, and how it differs from an ordinary database

       Traditional layering has three layers: ODS, DW, and DM. In our project the DW layer is further split into a DWD layer and a DWS layer; DWD is mainly used to hold the fact tables.

Source data layer: the original data. Sources include business databases, tracking (event) logs, and other data sources.

ODS layer: Operational Data Store. This is the layer closest to the data in the data sources; source data is loaded into this layer after extraction, cleaning, and transfer, in other words after ETL. Data in this layer is mostly organized according to the classification of the source business systems. However, the data in this layer is not identical to the original data. When source data is loaded into this layer, a series of operations is performed: noise removal (for example, a record giving a person's age as 300 is abnormal data and needs to be handled in advance), deduplication (for example, if a personal-data table contains two records with the same ID, they must be deduplicated on ingestion), and field-name standardization.
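
The noise removal and deduplication described above can be sketched in a few lines of Scala. This is only an illustration on an in-memory collection, not how the project's ETL is implemented; the Person record, its field names, and the age bound of 150 are assumptions made for the example.

```scala
// Hypothetical record shape; the field names are assumptions for illustration.
final case class Person(id: Long, name: String, age: Int)

object OdsCleaning {
  // Assumed plausibility bound for the age check (not from the original text).
  val MaxPlausibleAge = 150

  // Noise removal: drop records whose age is outside a plausible range.
  def removeNoise(rows: Seq[Person]): Seq[Person] =
    rows.filter(p => p.age >= 0 && p.age <= MaxPlausibleAge)

  // Deduplication: keep the first record seen for each ID, preserving input order.
  def dedupById(rows: Seq[Person]): Seq[Person] = {
    val seen = scala.collection.mutable.HashSet.empty[Long]
    rows.filter(p => seen.add(p.id)) // add returns false if the ID was already present
  }

  def clean(rows: Seq[Person]): Seq[Person] = dedupById(removeNoise(rows))

  def main(args: Array[String]): Unit = {
    val raw = Seq(
      Person(1L, "Ann", 30),
      Person(2L, "Bob", 300), // abnormal age, removed as noise
      Person(1L, "Ann", 30),  // duplicate ID, removed by deduplication
      Person(3L, "Cid", 45)
    )
    clean(raw).foreach(println)
  }
}
```

In a real warehouse the same two steps would normally be expressed in the ETL tool or in SQL; the point here is only the order of operations: filter out abnormal rows, then keep one row per ID.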

 

DW layer: Data Warehouse, the data warehouse layer. Here, the data obtained from the ODS layer is built into various data models organized by subject. For example, to study the subject of travel consumption, one could combine airline boarding and travel records with UnionPay credit card records, analyze them together, and produce a data set. Four concepts need to be understood here: dimension (Dimension), fact (Fact), metric (Index), and granularity (Granularity).

(The DW layer can be divided into DWD and DWS.)

DWD: Data Warehouse Detail, the detail data layer. It is the isolation layer between the business layer and the data warehouse. This layer mainly addresses data quality and data completeness problems. For example, user profile information comes from many different tables and often suffers from delays and lost data; to make the data easier for its various consumers to use, these problems can be shielded in this layer. DWD mainly normalizes and cleans the data of the ODS layer.

DWS layer: Data Warehouse Service, the data service layer, a lightly summarized layer. It makes a preliminary summary of user behavior from the ODS layer, abstracting some common dimensions (time, ip, id) and computing statistics over these dimensions, such as the number of purchases users made in each time period when logged in from different IPs. Doing a light summary at this layer makes later computation more efficient. The data is integrated and aggregated into subject areas to serve analysis, usually as wide tables. The DWS layer mainly stores lightly aggregated data. It can cover about 70 percent of a company's needs; what it cannot cover requires a higher degree of aggregation, and those results go into the DM layer.
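
A light summary of this kind can be sketched as a group-by over the common dimensions. The sketch below assumes a hypothetical PurchaseEvent detail record (user id, ip, hour of day, amount in cents); the names are illustrative and do not come from the project.

```scala
// Hypothetical detail-level event; field names are assumptions for illustration.
final case class PurchaseEvent(userId: Long, ip: String, hour: Int, amountCents: Long)

// One row of the lightly summarized table, keyed by the common dimensions.
final case class PurchaseSummary(userId: Long, ip: String, hour: Int, purchases: Int, totalCents: Long)

object DwsSummary {
  // Group detail rows by (user, ip, hour) and keep a count and a sum per group.
  def summarize(events: Seq[PurchaseEvent]): Seq[PurchaseSummary] =
    events
      .groupBy(e => (e.userId, e.ip, e.hour))
      .map { case ((userId, ip, hour), group) =>
        PurchaseSummary(userId, ip, hour, group.size, group.map(_.amountCents).sum)
      }
      .toSeq
      .sortBy(s => (s.userId, s.hour, s.ip)) // stable output order for display

  def main(args: Array[String]): Unit = {
    val events = Seq(
      PurchaseEvent(1L, "10.0.0.1", 9, 500L),
      PurchaseEvent(1L, "10.0.0.1", 9, 250L),
      PurchaseEvent(1L, "10.0.0.2", 9, 100L)
    )
    summarize(events).foreach(println)
  }
}
```

Because the summary rows are far fewer than the detail rows, later queries over the same dimensions can read the summary instead of rescanning the detail data.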

Whether the DWD and DWS layers depend on each other mainly comes down to whether there is such a need. From an idealized point of view, if the ODS-layer data is already well structured and can meet most of our needs, that is of course good, and the DWD layer is not really needed. In practice, however, the quality of the data reaching the ODS layer is hard to guarantee: the data sources are varied, and each pushing party has its own push logic. In that case an extra DWD layer is needed to shield the layers above from these underlying differences.

 

DM layer (Data Mart): also known as the data mart or wide-table layer. It is divided by business area, such as traffic, orders, and users, and generates wide tables with many fields that serve subsequent business queries, OLAP analysis, and data distribution. How the data is generated: it is computed from the lightly summarized layer and the detail layer. This is deeper data processing. For example, in this project the DWS layer holds statistics over a past period, while the DM layer holds indicators over the past 30 days and the past 90 days; the DM layer can be understood as a further aggregation on top of DWS.

APP layer: the application layer. Based on business needs, it takes statistical results from the preceding three layers; these can be served directly for queries and display, or imported into MySQL for use. How the data is generated: from the lightly summarized layer, the data mart layer, and the detail layer; data for general requirements comes mainly from the data mart layer. This layer provides the data used by products and by data analysis, and is generally stored in online systems such as ES and MySQL. For example, the report data we often talk about, or large wide tables of that kind, are usually put here.
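
As a rough sketch of "another layer of aggregation on top of DWS", the following rolls daily summary rows up into 30-day and 90-day indicators. The DailyOrders record and the choice of an inclusive window ending on the reference date are assumptions made for the example.

```scala
import java.time.LocalDate

// Hypothetical daily summary row, as it might come from a DWS table.
final case class DailyOrders(day: LocalDate, orders: Long)

object DmRollup {
  // Sum the daily rows in the `days`-day window that ends at `asOf` (both ends inclusive).
  def lastNDays(daily: Seq[DailyOrders], asOf: LocalDate, days: Int): Long = {
    val start = asOf.minusDays(days.toLong - 1)
    daily
      .filter(d => !d.day.isBefore(start) && !d.day.isAfter(asOf))
      .map(_.orders)
      .sum
  }

  def main(args: Array[String]): Unit = {
    val asOf = LocalDate.of(2024, 3, 31)
    val daily = Seq(
      DailyOrders(LocalDate.of(2024, 3, 31), 10L),
      DailyOrders(LocalDate.of(2024, 3, 10), 5L),
      DailyOrders(LocalDate.of(2024, 1, 15), 7L),
      DailyOrders(LocalDate.of(2023, 12, 1), 100L)
    )
    println(s"30 days: ${lastNDays(daily, asOf, 30)}")
    println(s"90 days: ${lastNDays(daily, asOf, 90)}")
  }
}
```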

Steps for building a data warehouse:

Step One: Select Business Process

Dimensional modeling starts by identifying which business processes the data warehouse is to cover, so the first step is to describe the business processes that need to be modeled. The description can simply be recorded in plain text, or use BPMN (Business Process Model and Notation); UML and similar notations can also be used.

Step Two: Determine the granularity

Granularity determines what a row in the fact table represents. Choose the finest granularity, with the fact table storing individual transaction records. In this project the table is updated incrementally every hour, with a full update of yesterday's data at 2:00.

Step Three: Confirm the dimensions

A dimension is a perspective from which the data is viewed, such as version, channel, or feature.

Dimension hierarchies (Hierarchy) and levels (Level)

Dimension tables change over time (slowly changing dimensions): fixed, migrated, new

Step Four: Confirm the facts

Join the original tables with the dimension tables to generate the fact table.

We also need to build a number of related intermediate summary tables to make queries easier.

Step Five: Physical model

 

How is Oozie used?

       You can answer from the following aspects:

  1. How workflows are developed and what kinds of jobs are scheduled, such as hive, shell, etc.
  2. How task scheduling is configured: timing, scheduling mode, etc.
  3. Describe it in the context of your project

What specifically does Hive do in the project?

       ETL data cleaning, data processing, and data analysis. Answer in terms of the project's business.

Besides Oozie, have you used other scheduling tools?

       Other scheduling tools include Azkaban, Airflow, etc.; it is worth reading up on them.

 

Which interfaces does a custom Hive function implement?

1) A custom UDF needs to extend org.apache.hadoop.hive.ql.exec.UDF

2) It needs to implement an evaluate method; evaluate supports overloading.

There are also UDAF and UDTF.

A few examples of Hive statements that trigger an MR job

The count, sum, min, avg, and max functions, and the various join operations

Use cases for Hive external tables

       Mapping raw data onto a Hive table

 

What data types does Scala have?

Scala's data types are similar to Java's; essentially every Java primitive type has a corresponding class in the scala package. Scala code compiles to Java bytecode, and the Scala compiler uses Java primitive types wherever possible, which gives the performance advantages of primitives. There are eight basic data types, located in the scala package: Byte, Short, Int, Long, Float, Double, Char, Boolean

Any: the superclass of all other classes

AnyRef: the base class of all reference classes in Scala

Unit: has only one instance value, (); a method that returns Unit corresponds to a Java method that returns void

Null: a subtype of every reference class

Nothing: at the very bottom of the Scala class hierarchy; it is a subtype of every other type, so an expression of type Nothing can be used wherever any other type is expected. It is used for expressions that throw an exception, that is, that never return normally

String type
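
The roles of Any, Unit, Null, and Nothing can be seen in a short sketch. All names below (TypeHierarchy, describe, fail, safeDiv) are made up for the example.

```scala
object TypeHierarchy {
  // Any is the root type: value types and reference types can share one list.
  val mixed: List[Any] = List(1, 2.5, 'c', true, "text")

  // A method returning Unit yields the single value (), like void in Java.
  def log(msg: String): Unit = println(msg)

  // Nothing is the type of an expression that never returns normally.
  def fail(msg: String): Nothing = throw new IllegalStateException(msg)

  // Nothing is a subtype of Int, so both branches type-check as Int.
  def safeDiv(a: Int, b: Int): Int =
    if (b == 0) fail("division by zero") else a / b

  // null has type Null, which conforms to every reference type such as String.
  val noName: String = null

  // Pattern matching recovers the runtime type of a value typed as Any.
  def describe(x: Any): String = x match {
    case _: Int     => "Int"
    case _: Double  => "Double"
    case _: Char    => "Char"
    case _: Boolean => "Boolean"
    case _: String  => "String"
    case _          => "other"
  }

  def main(args: Array[String]): Unit = {
    mixed.foreach(x => log(describe(x)))
    log(safeDiv(6, 3).toString)
  }
}
```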

 

Can you talk about partial functions in Scala?

In Scala, a partial function is a function of type PartialFunction[-T, +V]. T is the type of argument it accepts, and V is the type of the result it returns. The defining feature of a partial function is that it accepts and handles only a subset of its parameter domain; for arguments outside this subset, a run-time exception is thrown. This fits the characteristics of case statements perfectly, because case statements usually match a specific set of patterns and end with "_" to represent all remaining patterns. If a group of case statements does not cover all cases, that group of case statements can be seen as a partial function.

https://blog.csdn.net/bluishglc/article/details/50995939
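
As a minimal sketch of the behavior described above, the following defines a partial function from a group of case clauses with no "_" case, then uses the standard isDefinedAt, collect, lift, and orElse methods on it. The value names are made up for the example.

```scala
object PartialFunctionDemo {
  // Defined only for 1, 2 and 3; there is no catch-all "_" case.
  val smallNumberName: PartialFunction[Int, String] = {
    case 1 => "one"
    case 2 => "two"
    case 3 => "three"
  }

  // A fallback that is defined for every Int.
  val fallback: PartialFunction[Int, String] = { case n => s"number $n" }

  // orElse tries the first partial function and falls back to the second.
  val anyNumberName: PartialFunction[Int, String] = smallNumberName.orElse(fallback)

  def main(args: Array[String]): Unit = {
    println(smallNumberName.isDefinedAt(2))          // is 2 inside the handled subset?
    println(List(1, 5, 3).collect(smallNumberName))  // collect skips undefined inputs
    println(smallNumberName.lift(5))                 // lift turns "undefined" into None
    println(anyNumberName(5))
    // smallNumberName(5) would throw scala.MatchError at run time.
  }
}
```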

Talk about currying in Scala

Currying is the technique of converting a function that takes multiple parameters into a function that takes a single parameter (the first parameter of the original function) and returns a new function that takes the remaining arguments and returns the result.

For example:

def add(x:Int,y:Int)=x+y

After currying:

def add(x:Int)(y:Int)=x+y
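
A short sketch of what the curried form makes possible: supplying only the first argument list gives back a function that waits for the rest. The curried version is named addCurried here only so that both definitions can live in one object.

```scala
object CurryingDemo {
  // The uncurried and curried forms side by side.
  def add(x: Int, y: Int): Int = x + y
  def addCurried(x: Int)(y: Int): Int = x + y

  // Supplying only the first argument list yields a function awaiting the second.
  val addFive: Int => Int = addCurried(5)

  // An ordinary two-argument function value can be converted with .curried.
  val addFn: (Int, Int) => Int = add
  val curriedFromFunction: Int => Int => Int = addFn.curried

  def main(args: Array[String]): Unit = {
    println(addFive(3))
    println(curriedFromFunction(2)(3))
    println(List(1, 2, 3).map(addCurried(10)))
  }
}
```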

 

What do the apply and unapply methods do in Scala?

apply method

Typically, the apply method is defined in a class's companion object. It creates instances of the class, which removes the need for the new keyword.

unapply method

The unapply method can be regarded as the reverse of apply: apply takes constructor parameters and builds an object, whereas unapply takes an object and extracts values from it.

Both apply and unapply are called implicitly.
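
A minimal sketch with a hand-written companion object shows both implicit calls: Email(...) invokes apply, and a pattern such as case Email(user, domain) invokes unapply. The Email class is hypothetical and chosen only for the example.

```scala
class Email(val user: String, val domain: String)

object Email {
  // apply: build an instance without writing `new`.
  def apply(user: String, domain: String): Email = new Email(user, domain)

  // unapply: take an instance apart again so it can be used in patterns.
  def unapply(e: Email): Option[(String, String)] = Some((e.user, e.domain))
}

object ApplyUnapplyDemo {
  // Each `case Email(...)` pattern calls Email.unapply behind the scenes.
  def describe(e: Email): String = e match {
    case Email(user, "example.com") => s"$user (internal)"
    case Email(user, domain)        => s"$user at $domain"
  }

  def main(args: Array[String]): Unit = {
    val e = Email("ann", "example.com") // calls Email.apply
    println(describe(e))
    println(describe(Email("bob", "other.org")))
  }
}
```

A case class generates an equivalent apply and unapply automatically, which is why case classes can be constructed without new and used directly in pattern matches.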

What ways are there to define collections in Scala?

Three common collections: List, Set, Map

List inherits from Seq; elements of the collection can be repeated: val l = List (1, 1, 2, 2, 3, 5)

Set: elements cannot be repeated: val s = Set (1, 1, 2, 2, 3, 5)

Map: elements take the form key -> value, where the first part of the mapping is the key and the second part is the value. The key of each element in a Map is unique.

val m = Map(1 -> 1, 1 -> 3, 2 -> 3, 3 -> 4, 4 -> 4, 5 -> 7)

The collections above can also be obtained by converting from other collections.
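
The three definitions and a few conversions between them can be tried out as follows. Note what happens to the repeated elements: the Set drops duplicates, and in the Map the later binding for key 1 replaces the earlier one. The names fromList, indexed, and keys are made up for the example.

```scala
object CollectionsDemo {
  val l = List(1, 1, 2, 2, 3, 5)
  val s = Set(1, 1, 2, 2, 3, 5)                                // duplicates are dropped
  val m = Map(1 -> 1, 1 -> 3, 2 -> 3, 3 -> 4, 4 -> 4, 5 -> 7)  // the later 1 -> 3 wins

  // Conversions between the three kinds of collection.
  val fromList: Set[Int] = l.toSet
  val indexed: Map[Int, Int] = l.zipWithIndex.toMap // element -> last index at which it occurs
  val keys: List[Int] = m.keys.toList.sorted

  def main(args: Array[String]): Unit = {
    println(s"list size = ${l.size}, set size = ${s.size}, map size = ${m.size}")
    println(m(1))
    println(fromList)
    println(indexed)
    println(keys)
  }
}
```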

Can a tuple be changed after it is defined in Scala?

A Map is a collection of key/value pairs, and a pair is the simplest form of tuple. A tuple can hold multiple values of different types.

A tuple is a very useful container object in the Scala language. Like lists, tuples are immutable; unlike lists, however, tuples can contain elements of different types.
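
A short sketch of these points: a tuple mixes element types, its fields are read with _1, _2, ... or by destructuring, and "changing" a field means building a new tuple with copy. The names are made up for the example.

```scala
object TupleDemo {
  // A tuple can mix element types.
  val person: (String, Int, Boolean) = ("Ann", 30, true)

  // Destructuring binds each component to a name.
  val (name, age, active) = person

  // `->` builds a pair, which is why Map literals are written as key -> value.
  val pair: (String, Int) = "k" -> 1

  // Tuples are immutable: copy returns a new tuple and leaves `person` unchanged.
  val older: (String, Int, Boolean) = person.copy(_2 = 31)

  // A Map's entries are pairs, so converting it to a list yields tuples.
  val entries: List[(String, Int)] = Map("a" -> 1, "b" -> 2).toList

  def main(args: Array[String]): Unit = {
    println(person._1)
    println(s"$name is $age")
    println(older)
    println(entries)
  }
}
```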

Origin www.cnblogs.com/lingboweifu/p/11909792.html