Common interview algorithms for big data

1. Mass data processing

1. Given a log file larger than 100G that stores IP addresses, design an algorithm to find the IP address that occurs most often

  • A 100G file is far too large to handle at once. A typical machine has only about 4G of memory, so the file cannot be loaded in one go and has to be split into 100 smaller files. An IP address is a string, so we convert it to an integer and take it modulo 100. The result falls in the range 0-99, and every IP address with the same result is written to the same file, so all occurrences of one IP end up together. We then use a hash table to find the most frequent IP address in each file, and finally compare those 100 candidates to get the overall winner.
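The split-then-count idea can be sketched as follows. This is a minimal in-memory illustration, not a full solution: the class name `MostFrequentIp`, the helper `ipToLong`, and the use of lists in place of the 100 small files on disk are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MostFrequentIp {
    static final int BUCKETS = 100; // the 100 small files from the text

    // Pack a dotted IPv4 string such as "1.2.3.4" into a non-negative long.
    static long ipToLong(String ip) {
        long value = 0;
        for (String part : ip.split("\\.")) {
            value = value * 256 + Integer.parseInt(part);
        }
        return value;
    }

    static String mostFrequent(List<String> ips) {
        // Step 1: partition. Each list stands in for one small file on disk.
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < BUCKETS; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String ip : ips) {
            buckets.get((int) (ipToLong(ip) % BUCKETS)).add(ip);
        }
        // Step 2: count one bucket at a time and keep the overall winner.
        String best = null;
        int bestCount = 0;
        for (List<String> bucket : buckets) {
            Map<String, Integer> counts = new HashMap<>();
            for (String ip : bucket) {
                int c = counts.merge(ip, 1, Integer::sum);
                if (c > bestCount) {
                    bestCount = c;
                    best = ip;
                }
            }
        }
        return best;
    }
}
```

Because equal IPs always map to the same bucket, the count found inside one bucket is already the global count for that IP, which is why comparing the per-bucket winners is enough.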

2. Under the same conditions as the previous question, how do we find the top k IPs?

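  • A top-k requirement immediately suggests a heap. Note that the heap to build here is a min-heap of size k. With a max-heap we can only guarantee that the top element is the largest one. With a min-heap, the top is the smallest of the k candidates kept so far, so each new count only has to be compared with the top: if it is larger, it replaces the top and the heap is adjusted. What remains at the end is the k most frequent IPs.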
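A minimal sketch of the min-heap approach, assuming the per-IP counts have already been collected into a map (for example, one small file at a time). The class name `TopKIps` and the method `topK` are hypothetical; `PriorityQueue` is Java's standard binary heap and is a min-heap under the given comparator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKIps {
    // Returns the k keys with the highest counts, most frequent first.
    static List<String> topK(Map<String, Integer> counts, int k) {
        // Min-heap ordered by count: the root is the weakest of the current top k.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (heap.size() < k) {
                heap.offer(entry);
            } else if (k > 0 && entry.getValue() > heap.peek().getValue()) {
                heap.poll();       // drop the current weakest candidate
                heap.offer(entry); // and keep the stronger one
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(0, heap.poll().getKey()); // polls ascending, so prepend
        }
        return result;
    }
}
```

The heap never holds more than k entries, so memory stays at O(k) regardless of how many distinct IPs are scanned.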

3. Given 10 billion integers, design an algorithm to find the integers that appear only once

  • Integers are either signed or unsigned. A signed 32-bit integer ranges from -2147483648 to 2147483647 (about -2.1 billion to +2.1 billion), and an unsigned one from 0 to 4294967295 (about 4.2 billion values). We are given 10 billion integers, so many values must repeat, and to find those that appear only once we still use the idea of a hash table. An integer array is a poor choice, though: 4.2 billion * 4B is about 16G, and splitting an array that large again would be too troublesome. Instead we can use a BitMap with two bits per value: 00 means the number has not appeared, 01 means it has appeared once, and 10 means it has appeared more than once. This reduces the array to 1/16 of the original size. The remaining question is how to index the array. Non-negative numbers are easy. A negative number can be XORed with 32 bits of all 1s (that is, -1) to obtain a non-negative position, or it can take the same position as the corresponding positive number in a two-dimensional array whose two halves hold the positive and the negative numbers. Either way, 1G of space is enough to solve the problem.
  • Extension: what if the interviewer says only 500M of memory or less is available?
  • The same idea of segmentation applies, but here I think we can segment directly by the range of the numbers. With 500M of memory we cut the range once and process one half per pass. If a single such number is all we need, there is about a 50% chance of finding it in the first half, which may be more efficient.
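The two-bit BitMap can be sketched as below. The class name `TwoBitMap` and its methods are hypothetical. As a simplifying assumption, values are indexed by their offset from a lower bound `min` rather than by the XOR trick, which also makes the range segmentation of the 500M variant visible: one map is built per sub-range. Covering the full 32-bit range this way takes 2^32 * 2 bits = 1G.

```java
class TwoBitMap {
    private final long min;     // smallest value this map covers
    private final long[] words; // 2 bits per value, 32 values per long

    TwoBitMap(long min, long max) {
        this.min = min;
        long slots = max - min + 1;
        this.words = new long[(int) ((slots + 31) / 32)];
    }

    // States: 00 = never seen, 01 = seen once, 10 = seen more than once.
    void add(long value) {
        long slot = value - min;
        int word = (int) (slot / 32);
        int shift = (int) (slot % 32) * 2;
        long state = (words[word] >>> shift) & 3L;
        if (state < 2) {
            // Clear the two bits, then store state + 1.
            words[word] = (words[word] & ~(3L << shift)) | ((state + 1) << shift);
        }
    }

    boolean seenExactlyOnce(long value) {
        long slot = value - min;
        int shift = (int) (slot % 32) * 2;
        return ((words[(int) (slot / 32)] >>> shift) & 3L) == 1L;
    }
}
```

For the 500M case one would, under the same assumptions, create `new TwoBitMap(Integer.MIN_VALUE, -1)` for the first pass and `new TwoBitMap(0, Integer.MAX_VALUE)` for the second, skipping values outside the current range while scanning the input.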

4. Given two files with 10 billion queries each and only 1G of memory, how do we find the intersection of the two files? Give an exact algorithm and an approximate algorithm

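  • Finding the intersection of two files inevitably involves comparison. If we simply split both files into 100 parts and compare each part of one file with all 100 parts of the other, the efficiency is too low. Instead we can borrow the idea from the first question: hash each query and take the result modulo the number of parts, so that identical queries in the two files always land in parts with the same number. We then only compare the two parts with the same number, and mark the queries that appear in both. That is the exact algorithm. For an approximate algorithm, the BloomFilter of the next question fits: build one from the first file and test every query of the second file against it, accepting a small rate of false positives.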
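A minimal sketch of the exact variant, with in-memory sets standing in for the numbered part files. The class name `QueryIntersection`, the use of `String.hashCode` as the hash, and the bucket count passed in by the caller are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class QueryIntersection {
    // Split queries into numbered parts; equal queries always get the same number.
    static List<Set<String>> partition(List<String> queries, int buckets) {
        List<Set<String>> parts = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            parts.add(new HashSet<>());
        }
        for (String q : queries) {
            parts.get(Math.floorMod(q.hashCode(), buckets)).add(q);
        }
        return parts;
    }

    static Set<String> intersect(List<String> a, List<String> b, int buckets) {
        List<Set<String>> partsA = partition(a, buckets);
        List<Set<String>> partsB = partition(b, buckets);
        Set<String> common = new TreeSet<>();
        for (int i = 0; i < buckets; i++) {
            // Only parts with the same number can share a query.
            Set<String> hit = new HashSet<>(partsA.get(i));
            hit.retainAll(partsB.get(i));
            common.addAll(hit);
        }
        return common;
    }
}
```

`Math.floorMod` is used instead of `%` because `hashCode` can be negative, and a negative remainder would not be a valid part number.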

5. How to extend BloomFilter so that it supports the operation of deleting elements?

  • BloomFilter (a Bloom filter is a bit vector or bit array; look up an introduction to Bloom filters if needed) does not support deleting elements because hash collisions are likely: positions computed for different elements can point to the same bit, so clearing a bit may affect the result for other elements. We can solve this with the idea behind the shared_ptr smart pointer, namely "reference counting". We add a counter to each position: whenever the position is used to represent an element we do count++, and whenever an element represented by the position is deleted we do count--. The position is treated as 0 only when its count drops to 0, which completes the delete operation.
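The reference-counting extension can be sketched as follows. The class name `CountingBloomFilter`, the table size, the number of hash functions, and the way positions are derived from `String.hashCode` are all assumptions chosen for brevity, not a tuned implementation.

```java
class CountingBloomFilter {
    private final int[] counters; // one counter per position instead of one bit
    private final int hashCount;  // number of hash functions (assumed parameter)

    CountingBloomFilter(int size, int hashCount) {
        this.counters = new int[size];
        this.hashCount = hashCount;
    }

    // i-th position for a key, derived from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 * 0x9E3779B9) | 1; // illustrative second hash, forced odd
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            counters[position(key, i)]++;
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (counters[position(key, i)] == 0) {
                return false;
            }
        }
        return true;
    }

    // Should only be called for keys that were really added; removing a
    // false positive would corrupt the counters of other keys.
    void remove(String key) {
        if (!mightContain(key)) {
            return;
        }
        for (int i = 0; i < hashCount; i++) {
            counters[position(key, i)]--;
        }
    }
}
```

The price of deletion is memory: each position now needs a counter rather than a single bit. Real implementations often use small 4-bit counters for that reason.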

6. Given thousands of files, each 1K-100M in size, and n words, design an algorithm that finds, for each word, the files that contain it, using only 100K of memory

  • We can use Bloom filters to decide whether a file contains each of the n words: generate n Bloom filters and keep them in external storage. We also define in advance a file, info, that holds the n words; whenever a word is found in a file, the information about that file is written to the word's location in info. We only have 100K of memory, and it has to hold both a Bloom filter and part of a file. Because a file can be much larger than 100K, we can try splitting each file into small files of about 50K, each tagged with the large file it came from. Each time we read one Bloom filter and one small file. If the small file contains the corresponding word, we record in info the large file it belongs to; if not, we read the next Bloom filter. After all Bloom filters have been used, we read the next small file and repeat the steps above until all files have been traversed.

7. A dictionary contains N English words. Given an arbitrary string, design an algorithm to find all English words that contain this string

  • To judge whether a word contains a string we can use a substring function such as instr. If the string only needs to match the prefix of a word, I think a dictionary tree (trie) can be used for the lookup. We can also assume that N is very large: in that case we store the words in a file and read only a fixed number of words at a time for checking.
  • Summary: for this type of big data problem we generally use hash splitting, that is, we take a hash value modulo the number of buckets to assign each item to a location, and in that way split a large file into small files. This makes it convenient to compare the data with other values or to operate on the elements inside each piece, as when the IP address is converted to an integer and hash-split. Use a BloomFilter to determine whether an element exists in a collection.
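For question 7, the "fixed number of words at a time" scan might look like the sketch below. It assumes the dictionary arrives as any `Iterable<String>` (for instance, lines read from a file) and uses `String.contains` in place of instr. The class name `SubstringDictionarySearch` and the `batchSize` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SubstringDictionarySearch {
    static List<String> wordsContaining(Iterable<String> dictionary, String needle, int batchSize) {
        List<String> matches = new ArrayList<>();
        List<String> batch = new ArrayList<>(batchSize);
        for (String word : dictionary) {
            batch.add(word);
            if (batch.size() == batchSize) {
                scan(batch, needle, matches);
                batch.clear(); // only one batch is held in memory at a time
            }
        }
        scan(batch, needle, matches); // leftover partial batch
        return matches;
    }

    private static void scan(List<String> batch, String needle, List<String> matches) {
        for (String word : batch) {
            if (word.contains(needle)) {
                matches.add(word);
            }
        }
    }
}
```

This is a linear scan, so every query costs a pass over all N words. The trie mentioned above only helps when the string must match a prefix.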

2. Data structure

1. Bubble sort

  • Algorithm idea:
  • 1. Compare adjacent elements of the sequence in pairs, swapping when needed, so that the largest one ends up at the end
  • 2. Do the same for the remaining elements of the sequence, so that the largest of them ends up at the end of that part
  • 3. Repeat the second step until only one number remains
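/**
 * Bubble sort: compare adjacent pairs and swap when the left one is larger,
 * so the largest number has moved to the end after each round.
 * Time complexity O(n^2), space complexity O(1).
 * @param arr the array to sort in place
 */
private static void bubbleSort(int[] arr) {
    // The outer loop runs length-1 times
    for (int i = 0; i < arr.length - 1; i++) {
        // Each outer round puts one more number in place at the end,
        // so the inner loop runs length-1-i times
        for (int j = 0; j < arr.length - 1 - i; j++) {
            if (arr[j] > arr[j + 1]) {
                int temp = arr[j];
                arr[j] = arr[j + 1];
                arr[j + 1] = temp;
            }
        }
    }
}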

2. Binary search

  • 1. Algorithm concept
  • Binary search, also called half-interval search, is an algorithm for finding a specific element in an ordered array. Note that it only works on an ordered array
  • 2. Algorithm idea
  • 1) The search process starts from the middle element of the array. If the middle element happens to be the element to be found, the search process ends
  • 2) If the target is greater than or less than the middle element, continue in the half of the array that is greater than or less than the middle element, again starting the comparison from the middle element of that half
  • 3) If the array is empty at a certain step, it means it cannot be found
  • 4) This search algorithm reduces the search range by half with every comparison
// Time complexity O(log n)
public static int binarySearch(int[] srcArray, int des) {
    int low = 0;
    int height = srcArray.length - 1;
    while (low <= height) {
        // low + (height - low) / 2 avoids the int overflow of (low + height) / 2
        int middle = low + (height - low) / 2;
        if (des == srcArray[middle]) {
            return middle;
        } else if (des < srcArray[middle]) {
            height = middle - 1;
        } else {
            low = middle + 1;
        }
    }
    return -1; // not found
}

3. Recursive method to achieve binary search

public static int binarySearch(int[] dataset, int data, int beginIndex, int endIndex) {
    // Check the range first so that an empty range never indexes the array
    if (beginIndex > endIndex || data < dataset[beginIndex] || data > dataset[endIndex]) {
        return -1;
    }
    // Written this way to avoid the int overflow of (beginIndex + endIndex) / 2
    int midIndex = beginIndex + (endIndex - beginIndex) / 2;
    if (data < dataset[midIndex]) {
        return binarySearch(dataset, data, beginIndex, midIndex - 1);
    } else if (data > dataset[midIndex]) {
        return binarySearch(dataset, data, midIndex + 1, endIndex);
    } else {
        return midIndex;
    }
}

4. Single linked list reversal

class ListNode {
    int val;
    ListNode next;

    ListNode(int x) {
        val = x;
    }
}

public static ListNode reverseList(ListNode head) {
    ListNode prev = null; // head of the already reversed part
    while (head != null) {
        ListNode temp = head.next; // remember the rest of the list
        head.next = prev;          // point the current node backwards
        prev = head;
        head = temp;
    }
    return prev;
}

5. Insertion sort

  • 1. Initially, it is assumed that the first record forms an ordered sequence by itself, and the remaining records are disordered records
  • 2. Then, starting from the second record, insert the currently processed records into the ordered sequence before it in turn according to the size of the record
  • 3. Until the last record is inserted into the ordered sequence
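public static void insertSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
        // Move a[i] left through the ordered part and stop as soon as
        // the element before it is no larger
        for (int j = i; j > 0 && a[j - 1] > a[j]; j--) {
            int temp = a[j - 1];
            a[j - 1] = a[j];
            a[j] = temp;
        }
    }
}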

6. Selection sort

  • Repeatedly select the smallest (or largest) remaining record
  • 1) For a given set of records, the smallest record is obtained after the first round of comparison, and then the record is exchanged with the position of the first record
  • 2) Then perform a second round of comparison of other records except the first record, get the smallest record and exchange position with the second record
  • 3) Repeat the process until there is only one record for comparison
public static void selectSort(int[] a) {
    if (a == null || a.length <= 0) {
        return;
    }
    for (int i = 0; i < a.length; i++) {
        // Find the index of the smallest record in a[i..]
        int min = i;
        for (int j = i + 1; j < a.length; j++) {
            if (a[j] < a[min]) {
                min = j;
            }
        }
        // Exchange it with the record at position i
        if (i != min) {
            int temp = a[min];
            a[min] = a[i];
            a[i] = temp;
        }
    }
}

7. Queue: a simple data structure whose scheduling strategy is first in, first out (FIFO)
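First in, first out can be shown with a short sketch using `java.util.ArrayDeque` as the queue. The class name `QueueDemo` and the method `drainInOrder` are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class QueueDemo {
    // Enqueue all values, then dequeue them; the output order equals the input order.
    static List<Integer> drainInOrder(int... values) {
        Queue<Integer> queue = new ArrayDeque<>();
        for (int v : values) {
            queue.offer(v); // enqueue at the tail
        }
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            out.add(queue.poll()); // dequeue from the head
        }
        return out;
    }
}
```

Calling `QueueDemo.drainInOrder(3, 1, 2)` returns the list `[3, 1, 2]`: elements leave in the order they arrived, not in sorted order.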

8. Concept and characteristics of binary tree

  • 1. Binary tree concept

  • A binary tree is a very important data structure. It combines characteristics of arrays and linked lists: it can be searched quickly like a sorted array, and elements can be added quickly as in a linked list. It also has a shortcoming: the deletion operation is complicated. The number of levels of a binary tree is its depth. The specific classification is as follows:

  • Binary tree: an ordered tree in which each node has at most two subtrees. When a binary tree is used for searching, data is not inserted at arbitrary nodes: the key of a node's left child must be less than the node's key, and the key of its right child must be greater than or equal to it. Such a tree is also called a binary search tree or binary sort tree

  • Complete binary tree: if the height of the binary tree is h, every level except the h-th (levels 1 to h-1) holds the maximum number of nodes, and the nodes of the h-th level are leaves arranged in order from left to right. This is a complete binary tree.
    Full binary tree: every node except the leaf nodes has both a left and a right child, and the leaf nodes are all on the bottom level of the tree

  • 2. Features of Binary Tree

  • 1) Search, deletion, and insertion in the tree all take O(logN) time when the tree is balanced; in a degenerate tree they take O(N)

  • 2) The ways of traversing a binary tree include pre-order, in-order, and post-order

  • 3) An unbalanced tree is one in which the numbers of nodes on the left and right sides of the root are uneven

  • 4) In a non-empty binary tree, the i-th level holds at most 2^(i-1) nodes, i>=1

  • 5) A binary tree with depth h has at most 2^h-1 nodes (h>=1), and at least h nodes

  • 6) For any binary tree, if the number of leaf nodes is N0 and the number of nodes with degree 2 is N2, then N0=N2+1
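The ordering rule and the in-order traversal can be illustrated with a small, unbalanced binary search tree. This is a sketch under the convention stated above (smaller keys go left, equal or larger keys go right); the class name `BinarySearchTree` and its methods are hypothetical, and deletion is left out.

```java
import java.util.ArrayList;
import java.util.List;

class BinarySearchTree {
    private static class Node {
        final int key;
        Node left, right;

        Node(int key) {
            this.key = key;
        }
    }

    private Node root;

    void insert(int key) {
        root = insert(root, key);
    }

    private Node insert(Node node, int key) {
        if (node == null) {
            return new Node(key);
        }
        if (key < node.key) {
            node.left = insert(node.left, key);
        } else {
            node.right = insert(node.right, key); // equal keys go right
        }
        return node;
    }

    boolean contains(int key) {
        Node cur = root;
        while (cur != null) {
            if (key == cur.key) {
                return true;
            }
            cur = key < cur.key ? cur.left : cur.right;
        }
        return false;
    }

    // In-order traversal: left subtree, node, right subtree.
    List<Integer> inOrder() {
        List<Integer> out = new ArrayList<>();
        inOrder(root, out);
        return out;
    }

    private void inOrder(Node node, List<Integer> out) {
        if (node == null) {
            return;
        }
        inOrder(node.left, out);
        out.add(node.key);
        inOrder(node.right, out);
    }
}
```

Inserting 5, 3, 8, 1, 4, 8 and calling `inOrder()` yields `[1, 3, 4, 5, 8, 8]`: in-order traversal of a binary search tree visits the keys in sorted order. Inserting already sorted keys instead would produce a chain, which is the O(N) degenerate case.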

9. Given an array of odd and even numbers, write an algorithm that moves all odd numbers to the left side and all even numbers to the right side

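/**
 * Define a method that accepts an int array and creates a new array inside it.
 * Copy the elements of the incoming array into the new one, with odd numbers
 * on the left and even numbers on the right, then return the new array.
 * The main method defines an array, calls the method, takes the return value,
 * and prints the returned array.
 */
public class Test {
    public static int[] newArray(int[] arr) {
        int[] newArr = new int[arr.length]; // the new array
        // Two cursors: the next free slot from the left and from the right
        int index1 = 0;
        int index2 = arr.length - 1;
        for (int i = 0; i < arr.length; i++) {
            if (arr[i] % 2 != 0) {
                // Odd numbers go to the left of the new array
                newArr[index1] = arr[i];
                index1++;
            } else {
                // Even numbers go to the right of the new array
                newArr[index2] = arr[i];
                index2--;
            }
        }
        return newArr;
    }

    public static void main(String[] args) {
        int[] arr = {1, 2, 3, 4, 5, 6, 7, 8, 9, 0};
        int[] newArr = newArray(arr);
        // Print the result
        for (int i = 0; i < newArr.length; i++) {
            System.out.print(newArr[i] + "\t");
        }
    }
}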

10. Java data structures

Java mainly has the following data structures

  • 1) Array
  • 2) Linked list
  • 3) Stack and queue
  • 4) Binary tree
  • 5) Heap and stack
  • 6) Hash table
  • 7) Red-black tree

Origin blog.csdn.net/sun_0128/article/details/108571791