Data Development Engineer-Interview Questions

1. What is the structure of the data warehouse?

A data warehouse is a centralized data storage system used to store, manage, and analyze large-scale data. Its structure usually includes the following main components and layers.

  • Data sources: Data sources for data warehouses include various data systems, such as relational databases, log files, external APIs, cloud storage, etc. Data is collected, extracted, transformed and loaded from these sources into the data warehouse.
  • Data Extraction: In this stage, data is extracted from various data sources for subsequent processing. The extracted data may include original data, historical data, transaction data, etc.
  • Data transformation: In a data warehouse, data often needs to be cleaned, transformed, and integrated to meet analysis and reporting needs. Transformation tasks include data cleaning, field mapping, data merging, calculating indicators, etc.
  • Data loading: The transformed data is loaded into the different levels of the data warehouse, usually including raw data storage, intermediate data storage, and data warehouse storage. Loading can be done in batches or as real-time streaming.
  • Data storage levels: The data warehouse includes multiple storage levels: ① raw data storage, which retains unprocessed data for different analysis needs; ② data warehouse storage, which contains data that has been cleaned, transformed, and integrated, used for advanced analysis and reporting; ③ data summary storage, which contains summarized and precomputed data used to improve query performance and support complex analysis tasks (see the sketch after this list).
  • Metadata management: The data warehouse maintains metadata describing the structure, source, definition, transformation rules, and other information about the data it stores. Metadata is very important for data warehouse management and query optimization.
  • Data access layer: The data warehouse provides different ways for users and applications to access data, including SQL queries, OLAP (online analytical processing) tools, reporting tools and APIs, etc.
  • Query and analysis tools: Data warehouses are usually integrated with various query and analysis tools to support users in data exploration, query and reporting.
  • Security and permission control: The data warehouse needs to implement strict security and permission control to ensure that only authorized users can access and operate the data.
  • Monitoring and performance optimization: Data warehouses require real-time monitoring to ensure performance and availability. Performance optimization is a continuous process, including indexing, partitioning, caching and other technologies.
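
To make the layering concrete, here is a minimal, hypothetical Python sketch of the flow described above (staging, then warehouse, then summary), using sqlite3 purely as a stand-in engine; the table names stg_orders, dw_orders, and dm_daily_sales are illustrative only.

# A minimal, hypothetical sketch of the layered flow described above
# (staging -> warehouse -> summary), using sqlite3 purely as a stand-in engine.
# Table names stg_orders, dw_orders and dm_daily_sales are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Extract/load: land raw, untyped records in a staging table.
cur.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT, order_date TEXT)")
cur.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?)",
    [("1", "10.5", "2024-01-01"), ("2", " 20.0", "2024-01-01"), ("3", None, "2024-01-02")],
)

# 2. Transform: clean and type the data into the warehouse layer.
cur.execute("CREATE TABLE dw_orders (order_id INTEGER, amount REAL, order_date TEXT)")
cur.execute(
    "INSERT INTO dw_orders "
    "SELECT CAST(order_id AS INTEGER), CAST(TRIM(amount) AS REAL), order_date "
    "FROM stg_orders WHERE amount IS NOT NULL"
)

# 3. Summarize: precompute aggregates to speed up reporting queries.
cur.execute(
    "CREATE TABLE dm_daily_sales AS "
    "SELECT order_date, SUM(amount) AS total_amount, COUNT(*) AS order_cnt "
    "FROM dw_orders GROUP BY order_date"
)

print(cur.execute("SELECT * FROM dm_daily_sales").fetchall())
conn.close()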

2. Python-Count inversion pairs modulo a number

Counting inversion pairs modulo a number is usually done with merge sort. In an array, if a number is greater than a number that appears after it, the two numbers form an inversion (reverse-order) pair.

def merge_sort(arr, mod):
    if len(arr) > 1:
        mid = len(arr) // 2
        left_half = arr[:mid]
        right_half = arr[mid:]

        left_half, left_inv_count = merge_sort(left_half, mod)
        right_half, right_inv_count = merge_sort(right_half, mod)

        merged_arr, split_inv_count = merge(left_half, right_half, mod)

        inv_count = left_inv_count + right_inv_count + split_inv_count
        return merged_arr, inv_count
    else:
        return arr, 0

def merge(left, right, mod):
    merged_arr = []
    inv_count = 0
    i = j = 0

    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged_arr.append(left[i])
            i += 1
        else:
            merged_arr.append(right[j])
            j += 1
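            # every remaining element in left is greater than right[j],
            # so each of them forms an inversion pair with right[j]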
            inv_count += len(left) - i

    merged_arr.extend(left[i:])
    merged_arr.extend(right[j:])

    return merged_arr, inv_count

def count_inverse_pairs(arr, mod):
    _, inv_count = merge_sort(arr, mod)
    return inv_count % mod

# Example usage
arr = [2, 4, 1, 3, 5]
mod = 1000000007
result = count_inverse_pairs(arr, mod)
print("Inversion count modulo mod:", result)

3. Python-Merge two sorted arrays

Method: Use two pointers to traverse the two input arrays, compare their elements one by one, and append the smaller element to the new array.

def merge_sorted_arrays(arr1, arr2):
    merged_arr = []
    i = j = 0

    while i < len(arr1) and j < len(arr2):
        if arr1[i] < arr2[j]:
            merged_arr.append(arr1[i])
            i += 1
        else:
            merged_arr.append(arr2[j])
            j += 1

    # Append the remaining elements to the new array
    while i < len(arr1):
        merged_arr.append(arr1[i])
        i += 1

    while j < len(arr2):
        merged_arr.append(arr2[j])
        j += 1

    return merged_arr

# Example usage
arr1 = [1, 3, 5, 7]
arr2 = [2, 4, 6, 8]
result = merge_sorted_arrays(arr1, arr2)
print("Merged sorted array:", result)

4. Spark and Hadoop are two important frameworks in the field of big data processing. They share several similarities and have key differences.

Similarities:

  • Both are open-source frameworks that can be used and modified for free.
  • Both are designed to process large-scale data and adopt a distributed computing model that can execute tasks in parallel across multiple computing nodes.
  • Fault tolerance: Both are fault tolerant and can handle node failures while ensuring that tasks complete.
  • Support for multiple data sources: Both can handle different data sources, including structured data (such as relational data), semi-structured data (XML, JSON), and unstructured data (text, logs, etc.).
  • Support for multiple programming languages: including Java, Python, Scala, etc., allowing developers to use the languages they are familiar with for big data processing.

Differences:

  • Data processing model: ① Hadoop uses the MapReduce model, a batch processing model suited to offline data processing. MapReduce tasks typically write intermediate data to disk and are therefore less efficient for iterative algorithms or real-time processing. ② Spark uses a memory-based computing model that can perform iterative and real-time processing more efficiently because it can keep intermediate data in memory (see the PySpark sketch after this list).
  • Performance: ① Spark is generally faster than Hadoop's MapReduce, especially for iterative algorithms (such as machine learning) and real-time data processing, because Spark makes full use of in-memory computing and reduces disk I/O. ② Hadoop's MapReduce is better suited to one-off, batch-processing tasks.
  • API and ecosystem: ① Spark provides higher-level APIs, including Spark SQL for SQL queries, the machine learning library MLlib, and the graph processing library GraphX, making development more convenient. ② The Hadoop ecosystem includes Hive (SQL queries), Pig (a data flow scripting language), HBase (a NoSQL database), etc., but more tools and libraries are needed to achieve similar functionality.
  • Complexity: ① Spark is generally easier to use because it provides high-level APIs and a friendlier programming model. ② Hadoop's MapReduce requires more manual coding and configuration.
  • Resource manager: ① Hadoop uses YARN (Yet Another Resource Negotiator) as its resource manager, which is responsible for allocating cluster resources to different tasks. ② Spark ships with its own standalone cluster manager and can also integrate with external resource managers such as YARN and Mesos.
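
For illustration, here is a minimal PySpark sketch of the in-memory model (assuming pyspark is installed and running in local mode; the data and column names are hypothetical): an intermediate result is cached once and reused across repeated computations instead of being re-materialized from disk, as MapReduce-style jobs would.

# A minimal PySpark sketch (assumes pyspark is installed and runs in local mode;
# the data and column names are hypothetical). The cleaned dataset is cached in
# memory once and reused by repeated computations, instead of re-materializing
# intermediate results from disk as MapReduce-style jobs would.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, -1.0)], ["id", "amount"])

cleaned = df.filter(df.amount > 0).cache()  # keep the intermediate result in memory

# Stand-ins for iterative steps: each pass reads the cached data, not the source.
for _ in range(3):
    print(cleaned.agg({"amount": "sum"}).collect())

spark.stop()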

5. Java-binary search

// Recursive binary search on a sorted array; returns the index of target, or -1 if absent.
public int search(int[] nums, int left, int right, int target) {
        if (left > right) {
            return -1;
        }
        int mid = left + (right - left) / 2; // avoids overflow of (left + right) / 2
        if (nums[mid] == target) {
            return mid;
        } else if (nums[mid] < target) {
            return search(nums, mid + 1, right, target);
        } else {
            return search(nums, left, mid - 1, target);
        }
}

6. Java-Two pointers

Given an array sorted in increasing order, determine whether two numbers in it sum to a given target. The idea is to use two pointers, one at the beginning and one at the end of the array, moving one pointer at a time.

public int[] twoSum(int[] numbers, int target) {
        int p1 = 0;
        int p2 = numbers.length - 1;
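        // a smaller sum needs a larger left value; a larger sum needs a smaller right value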
        while (p1 < p2){
            int sum = numbers[p1] + numbers[p2];
            if (sum == target){
                return new int[]{p1 + 1, p2 + 1};
            } else if (sum < target){
                p1++;
            } else {
                p2--;
            }
        }
        // no solution found
        return new int[]{-1, -1};
}

7. Java-Longest Increasing Subsequence LIS (Dynamic Programming)

Given an integer array nums, find the length of the longest strictly increasing subsequence. The dynamic-programming solution below runs in O(n²); the optimal time complexity is O(n log n), achieved with a greedy approach plus binary search (see the sketch after the code).

public int lengthOfLIS(int[] nums) {
        if (nums == null || nums.length == 0) {
            return 0;
        }
        // dp[i] = length of the longest strictly increasing subsequence ending at index i
        int[] dp = new int[nums.length];
        dp[0] = 1;
        int maxSeqLen = 1;
        for (int i = 1; i < nums.length; i++) {
            // every element alone is a subsequence of length 1
            dp[i] = 1;
            for (int j = 0; j < i; j++) {
                // extend dp[j] only when the sequence stays strictly increasing
                if (nums[i] > nums[j]) {
                    dp[i] = Math.max(dp[i], dp[j] + 1);
                }
            }
            maxSeqLen = Math.max(maxSeqLen, dp[i]);
        }
        return maxSeqLen;
}
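
For reference, a minimal Python sketch (not the Java solution above) of the O(n log n) approach, using a greedy strategy plus binary search via bisect; tails[k] holds the smallest possible tail value of a strictly increasing subsequence of length k + 1.

# Minimal sketch: LIS length in O(n log n) via greedy + binary search (bisect).
from bisect import bisect_left

def length_of_lis(nums):
    tails = []  # tails[k] = smallest tail of a strictly increasing subsequence of length k + 1
    for x in nums:
        pos = bisect_left(tails, x)   # first position whose value is >= x (keeps the increase strict)
        if pos == len(tails):
            tails.append(x)           # x extends the longest subsequence found so far
        else:
            tails[pos] = x            # x becomes a smaller tail for subsequences of length pos + 1
    return len(tails)

print(length_of_lis([10, 9, 2, 5, 3, 7, 101, 18]))  # 4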

8. Java-Add two numbers stored in reverse order in linked lists

You are given two non-empty linked lists representing two non-negative integers. The digits are stored in reverse order, and each node stores a single digit. Add the two numbers and return the sum as a linked list in the same form.

/**
 * Definition for singly-linked list.
 * public class ListNode {
 *     int val;
 *     ListNode next;
 *     ListNode(int x) { val = x; }
 * }
 */
class Solution {
    public ListNode addTwoNumbers(ListNode l1, ListNode l2) {
        ListNode pre = new ListNode(0);
        ListNode cur = pre;
        int carry = 0;
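        // walk both lists, adding digit by digit and carrying into the next node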
        while(l1 != null || l2 != null) {
            int x = l1 == null ? 0 : l1.val;
            int y = l2 == null ? 0 : l2.val;
            int sum = x + y + carry;
            
            carry = sum / 10;
            sum = sum % 10;
            cur.next = new ListNode(sum);

            cur = cur.next;
            if(l1 != null)
                l1 = l1.next;
            if(l2 != null)
                l2 = l2.next;
        }
        if(carry == 1) {
            cur.next = new ListNode(carry);
        }
        return pre.next;
    }
}

9. The three normal forms of MySQL

  • First Normal Form (1NF): Each column in the table contains only atomic, indivisible values; each cell stores a single value. No column may contain non-atomic data such as sets, arrays, or nested tables.

  • Second Normal Form (2NF): The table must first satisfy 1NF. All non-key columns must depend on the whole candidate key (primary key); that is, there are no partial dependencies: every column should relate to the entire primary key rather than to just part of it.

  • Third Normal Form (3NF): The table must first satisfy 2NF. No non-key column may transitively depend on the primary key; in other words, there are no transitive dependencies between non-key columns. If a non-key column depends on another non-key column, which in turn depends on the primary key, that transitive dependency must be eliminated by decomposing the table. Every non-key column relates to the primary key directly, not indirectly (see the sketch below).
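
As a small, hypothetical illustration of eliminating a transitive dependency (3NF), the following Python sketch uses sqlite3 as a stand-in engine; all table and column names are invented for the example. In orders_denormalized, customer_city depends on customer_id, which depends on the key order_id, so the table is decomposed.

# A small, hypothetical illustration of removing a transitive dependency (3NF),
# using sqlite3 as a stand-in; all table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Violates 3NF: customer_city depends on customer_id, which depends on the key order_id.
cur.execute(
    "CREATE TABLE orders_denormalized ("
    " order_id INTEGER PRIMARY KEY, customer_id INTEGER, customer_city TEXT)"
)

# 3NF decomposition: every non-key column depends only on the key of its own table.
cur.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_city TEXT)"
)
cur.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customers(customer_id))"
)
conn.close()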

10. The difference between MySQL's InnoDB and MyISAM

  • InnoDB supports transactions; MyISAM does not.
  • InnoDB supports foreign keys; MyISAM does not.
  • InnoDB uses a clustered index; MyISAM uses a non-clustered index.
  • InnoDB does not store the exact row count of a table (a full COUNT(*) requires a scan), while MyISAM does; the minimum lock granularity of InnoDB is the row lock, while MyISAM only supports table-level locks.

11. MySQL isolation levels

  • Read uncommitted: A transaction can read data that other transactions have modified but not yet committed, even mid-transaction. This easily produces dirty reads, non-repeatable reads, and phantom reads.
  • Read committed: Data can only be read after the transaction that wrote it has committed, which avoids dirty reads but not non-repeatable reads or phantom reads.
  • Repeatable read: Repeated reads within the same transaction return the same result, which avoids dirty reads and non-repeatable reads; phantom reads can still occur in principle.
  • Serializable: The highest isolation level; all transactions execute serially, avoiding dirty reads, non-repeatable reads, and phantom reads.
