Crawler section

Python interview questions (crawler topics)

Part One: the questions I can answer

Note: questions 1-31 are worth 1 point each; the remaining questions are worth 3 points each.

1. What basic crawler-related modules do you know?
re
requests
BeautifulSoup
lxml
selenium
scrapy
pandas
numpy
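
A minimal sketch of how a few of these modules fit together (the URL, headers, and selector below are placeholders assumed for illustration, not from the article):

```
# Fetch a page with requests and parse it with BeautifulSoup using the lxml parser.
import re

import requests
from bs4 import BeautifulSoup

url = "https://example.com"                  # placeholder target URL
headers = {"User-Agent": "Mozilla/5.0"}      # common header so the request looks like a browser

resp = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(resp.text, "lxml")      # lxml as the parser backend

# Pull out link texts and hrefs with a CSS selector.
for a in soup.select("a"):
    print(a.get_text(strip=True), a.get("href"))

# re helps with patterns that are awkward to express as selectors.
print(re.findall(r"https?://[\w./-]+", resp.text))
```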

2. What common data analysis libraries/methods do you use?
pandas
numpy

3. List the anti-crawling mechanisms you found hardest to deal with while crawling.
JS obfuscation

4. Briefly describe how to crawl dynamically loaded data.

5. How do you crawl data from mobile clients?

6. What types of data have you crawled, and at what scale?
csv, txt
on the order of one million records

7. What crawler frameworks are you familiar with?
The scrapy framework

8. Talk about your knowledge of scrapy.
Scrapy is a framework developed on top of Twisted, a popular event-driven Python networking framework. Scrapy therefore uses non-blocking (i.e. asynchronous) code to implement concurrency. A minimal spider sketch is given after the feature list below.

  • Scrapy (asynchronous) provides:
    - high-performance network requests
    - data parsing
    - persistent storage
    - whole-site data crawling
    - deep crawling
    - distributed crawling
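
A minimal Scrapy spider sketch to make the asynchronous flow above concrete; the site (quotes.toscrape.com), the selectors, and the field names are assumptions for illustration only:

```
import scrapy


class QuotesSpider(scrapy.Spider):
    # Placeholder spider: the site and selectors are illustrative assumptions.
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Data parsing: each yielded dict is handed to the item pipelines.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Whole-site / deep crawling: yielded Requests go back to the scheduler,
        # and the Twisted-based downloader fetches them concurrently (non-blocking).
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy crawl quotes` from inside a Scrapy project; the concurrency comes from the engine, scheduler, and downloader described in question 10 below.
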
9. How do you parse local page data that carries tags?

10. What are scrapy's core components?
Engine (Scrapy)
Handles the data flow of the whole system and triggers transactions (the core of the framework).
Scheduler
Receives requests sent over by the engine, pushes them into a queue, and returns them when the engine asks again. You can think of it as a priority queue of URLs (the pages or links to be crawled): it decides which URL is crawled next and removes duplicate URLs.
Downloader
Downloads web page content and returns it to the spider (the downloader is built on Twisted, an efficient asynchronous model).
Spiders
The spider does the main work: it extracts the information it needs, the so-called entities (Items), from specific web pages. The user can also extract links from the page and let Scrapy continue crawling the next page.
Item Pipeline
Responsible for processing the entities extracted by the spider. Its main functions are persisting entities, validating them, and removing unneeded information. After the spider has parsed a page, the items are sent to the pipeline and processed in a specific order.
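To make the spider-to-pipeline flow concrete, here is a small sketch; the item fields, pipeline class, and project name are assumptions for illustration:

```
# items.py -- the "entity" (Item) that spiders extract.
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()


# pipelines.py -- the engine forwards every item the spider yields to the enabled pipelines.
from scrapy.exceptions import DropItem


class ValidationPipeline:
    def process_item(self, item, spider):
        # Validate the entity and drop invalid/unneeded data before storage.
        if not item.get("text"):
            raise DropItem("missing text")
        return item


# settings.py -- pipelines only run if registered here (lower number = runs earlier).
ITEM_PIPELINES = {
    "myproject.pipelines.ValidationPipeline": 300,
}
```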

11. What are scrapy middlewares used for?

12. How do you implement whole-site data crawling?

13. How do you detect that a website's data has been updated?

14. What is the principle of a distributed crawler?

15. How do you improve crawling efficiency (asynchronous crawlers)?

16. List the anti-crawling mechanisms you have come across.

17. What are depth-first and breadth-first crawling? (pros and cons)
The depth-first algorithm uses less memory but is slower; the breadth-first algorithm uses more memory but is faster, and can quickly find the optimal solution when the cost is proportional to the depth.
Depth-first and breadth-first search have very similar control structures; the only difference lies in which node is selected for expansion. Because breadth-first search keeps all predecessor nodes, some overlapping nodes can be removed when generating successors, which improves search efficiency.
Both algorithms expand all child nodes of a node at each step; the difference is what gets expanded next. Depth-first next expands one of the child nodes it just generated, while breadth-first next expands a sibling of the current node. To implement this efficiently, the two use different data structures.
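A small illustration of the two strategies on a toy link graph (the graph and node names are made up for the example): depth-first uses a stack, breadth-first uses a queue, which is the "different data structures" point above.

```
# Depth-first vs breadth-first traversal of a toy link graph.
# Keys are pages, values are outgoing links (made-up example data).
from collections import deque

graph = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}


def dfs(start):
    # Stack-based: goes deep first; the frontier stays small (less memory).
    stack, seen, order = [start], {start}, []
    while stack:
        node = stack.pop()              # LIFO -> depth first
        order.append(node)
        for nxt in reversed(graph[node]):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return order


def bfs(start):
    # Queue-based: expands level by level; the frontier (memory) can grow large.
    queue, seen, order = deque([start]), {start}, []
    while queue:
        node = queue.popleft()          # FIFO -> breadth first
        order.append(node)
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


print(dfs("A"))  # ['A', 'B', 'D', 'E', 'C', 'F']
print(bfs("A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```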

18. How does scrapy implement persistent storage?

  • Persistent storage based on terminal commands
    - can only store the return value of parse(), and only into text files with specific extensions (e.g. csv)
    - command: scrapy crawl spiderName -o filePath
  • Persistent storage based on pipelines
    - the pipeline receives each item and can persist the item's data in any form (pipelines.py); a sketch follows this list
    - process_item(): receives the item object and performs its persistent storage
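
A minimal pipelines.py sketch for file-based persistence (the class name and output path are assumptions):

```
import json


class FilePersistencePipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file a single time.
        self.fp = open("./items.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item: receive the item object and persist it.
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # return it so any later pipelines also receive it

    def close_spider(self, spider):
        # Called once when the spider closes: release the file handle.
        self.fp.close()
```

The pipeline still has to be enabled in settings.py via ITEM_PIPELINES for the engine to route items through it.
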
19. Talk about your understanding of CrawlSpider and how to use it for deep crawling.

20. How do you perform data cleaning?
Handling null values
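
A brief sketch of null-value handling with pandas (the DataFrame, column names, and fill strategy are made-up examples):

```
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, np.nan, 12.5, None],
    "title": ["a", "b", None, "d"],
})

print(df.isnull().sum())                              # count missing values per column
df["price"] = df["price"].fillna(df["price"].mean())  # fill numeric nulls with the mean
df = df.dropna(subset=["title"])                      # drop rows still missing a title
df = df.drop_duplicates()                             # deduplicate as part of cleaning
```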

21. Do you know anything about machine learning?
Machine learning automatically analyzes sample data to obtain rules (a model), and then uses those rules (the model) to make predictions on unknown data.

22. Why do crawlers use selenium? What is the relationship between selenium and crawlers?

23. List the common methods of the selenium module that you know and their purposes.

24. Explain the role of the event loop (loop) in multi-task asynchronous coroutines.

25. How do multi-task asynchronous coroutines achieve asynchrony?

26. How do you handle captchas (verification codes)?

27. What is the difference between scrapy and scrapy-redis?

28. Describe the entire process, from opening the browser and visiting www.baidu.com to getting the result.

29. List the request header fields and the information they carry that you know of.

30. Briefly describe scrapy's deduplication principle.

31. Which of the following statements is wrong? (1 point)

```
A. A stack is a linear structure        B. A queue is a linear structure
C. A linear list is a linear structure  D. A binary tree is a linear structure
```

D (a binary tree is a nonlinear structure)

32. Describe preorder, inorder, and postorder traversal.

33. Write code to implement bubble sort.
"" "
DEF Sort (a_list):
for J in Range (0, len (a_list) -1):
for I in Range (len (a_list) -1-J):
IF a_list [I]> a_list [I +. 1 ]:
a_list [I], a_list [I +. 1] a_list = [I +. 1], a_list [I]
return a_list
a_list = [3,5,2,0,55,32]
Print (Sort (a_list))
" ""
34. Write code to implement quicksort.
"" "
The while Low <high: \ n-",
IF alist [high] <MID: less than #high base
alist [Low] alist = [high]
BREAK
the else: # cardinality is less than a high, the high offset to the left
high - = 1 #high think the left is offset by a

low the while <High:
IF alist [low] <mid: # if so low is less than the low mid want RightOffset
low + = 1 # shifted to the right so that a low
the else:
alist [High] alist = [low]
BREAK
if low == high: #low and high repetition, assigned to the low base or high position
alist [low] = MID #alist [high] = MID
return alist
"" "

Part Two: supplementary questions

1. List the git commands you commonly use.
git init: create a new repo locally; enter a project directory and run git init to initialize a repo, which creates a .git folder in the current directory.
git clone: fetch the remote Git repo at a given url and create a local copy.
git status: check the status of the repo.
git log: view the commit history of a branch.
git diff: view the differences between the current files and the staging area.
git commit: commit the staged changes.
git reset: restore to the state of a given commit.
git checkout: switch branches.
git merge: merge a branch into the current branch.
git tag: put a bookmark (tag) on a commit.
git pull: update the local repository from the remote.
git push: push commits to a branch on the remote server.
2. How does your company/team do collaborative development?
git / github

3. How does your company do code review? Who does it?

4. How do you fix a bug that appears in production code?

5. What is git rebase used for?

Git hands-on video: https://www.bilibili.com/video/av70772636
