Python interview questions, part 2

1. How do you read a large data file?

 

 

  1. Use a generator (see the sketch below)
  2. Iterate over the file object line by line: for line in file
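
A minimal sketch of both approaches, assuming the file name "data.txt" and the chunk size are placeholders for illustration:

    def read_in_chunks(file_path, chunk_size=1024 * 1024):
        """Yield the file in fixed-size chunks so the whole file never sits in memory."""
        with open(file_path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    # Line-by-line iteration: the file object is itself an iterator.
    with open("data.txt") as f:          # "data.txt" is a placeholder path
        for line in f:
            record = line.strip()        # do something with each line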

2. What is the difference between an iterator and a generator?

 

  1. An iterator is the more abstract concept: any object whose class has a __next__() method (next() in Python 2) and an __iter__() method that returns the object itself is an iterator. Container objects such as strings, lists, dicts and tuples are convenient to traverse with a for loop; behind the scenes the for statement calls iter() on the container object. iter() is a Python built-in function that returns an iterator object defining __next__(), which accesses the container's elements one at a time; next() is also a Python built-in. When there are no more elements, next() raises a StopIteration exception.
  2. A generator is a simple and powerful tool for creating iterators. Generators are written like regular functions, but use the yield statement whenever they need to return data. Each time next() is called on it, the generator resumes from where it left off (it remembers where the last statement ran and all of its data values).

Differences: a generator can do everything an iterator can do, and in addition it creates the __iter__() and next() methods automatically, so generators are particularly concise as well as efficient; using a generator expression instead of a list comprehension can save memory at the same time. Besides creating and saving the program state automatically, a generator raises a StopIteration exception automatically when it terminates.
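
A side-by-side sketch of the two, with made-up names and Python 3 syntax (where next() is spelled __next__()):

    class CountDown:
        """Hand-written iterator: must supply __iter__() and __next__() itself."""
        def __init__(self, start):
            self.current = start

        def __iter__(self):
            return self

        def __next__(self):
            if self.current <= 0:
                raise StopIteration        # signals the for loop to stop
            self.current -= 1
            return self.current + 1

    def count_down(start):
        """Generator: __iter__()/__next__() and StopIteration are handled automatically."""
        while start > 0:
            yield start
            start -= 1

    print(list(CountDown(3)))   # [3, 2, 1]
    print(list(count_down(3)))  # [3, 2, 1]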

 

3. What are decorators used for?

 

  • Logging
  • Timing how long a function takes to execute (see the sketch after this list)
  • Pre-processing before the function runs
  • Cleanup after the function has run
  • Permission checks and similar scenarios
  • Caching
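
A minimal timing decorator as a sketch of the second use case; the decorated function slow_add is a made-up example:

    import functools
    import time

    def timed(func):
        """Report how long the wrapped function takes to run."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
            return result
        return wrapper

    @timed
    def slow_add(a, b):          # made-up example function
        time.sleep(0.1)
        return a + b

    slow_add(1, 2)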

4. Briefly explain the GIL

GIL (Global Interpreter Lock)

    Python code execution is controlled by the Python virtual machine (also known as the interpreter main loop; CPython is the reference implementation). Python was designed so that only one thread executes in the interpreter's main loop at any time, i.e. at any moment only one thread is running in the interpreter. Access to the Python virtual machine is controlled by the Global Interpreter Lock (GIL), and it is this lock that guarantees only one thread runs at a time.

 

In a multithreaded environment, the Python virtual machine executes as follows:

1. Acquire the GIL

2. Switch to a thread to run

3. Run:

    a. for a specified number of bytecode instructions, or

    b. until the thread voluntarily gives up control (for example by calling time.sleep(0))

4. Put the thread back to sleep

5. Release the GIL

6. Repeat all of the above steps

When external code is called (such as a C/C++ extension function), the GIL stays locked until that function returns (no Python bytecode is executed in the meantime, so no thread switch takes place).
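
A small demonstration of the effect, as a sketch: a CPU-bound task does not get faster with threads under the GIL, because only one thread executes bytecode at a time. The workload count_down and the loop count are invented for illustration.

    import time
    from threading import Thread

    def count_down(n):
        while n > 0:
            n -= 1

    N = 10_000_000

    start = time.perf_counter()
    count_down(N)
    print("single thread:", round(time.perf_counter() - start, 2), "s")

    start = time.perf_counter()
    threads = [Thread(target=count_down, args=(N // 2,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Roughly the same wall-clock time: the two threads take turns holding the GIL.
    print("two threads:  ", round(time.perf_counter() - start, 2), "s")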

 

5. find and grep

grep is a powerful text-search tool. Its search pattern can be any string or regular expression, and it looks for that pattern in text files; if a match is found, grep prints every line containing the pattern (for example, grep 'pattern' file.txt prints all matching lines of file.txt).

find is commonly used to search a given directory for files that meet certain conditions; it can also be used to search for files owned by a specific user.

 

6. An online service may hang for all kinds of reasons; what do you do?

Use supervisor, the background-process management tool on Linux.

After modifying the relevant files, run service supervisord restart on Linux so the managed process is brought back up.

 

7. How do you improve the running efficiency of Python code?

    Use generators; wrap key code with external packages (Cython, PyInline, PyPy, Pyrex); optimize for loops, e.g. avoid accessing attributes of variables inside the loop (see the sketch below).
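
A small sketch of the loop-optimization point: hoisting the attribute lookups out of the loop avoids re-resolving them on every iteration. The data size and names are made up for illustration.

    import time

    values = [str(i) for i in range(1_000_000)]

    # Attribute lookups inside the loop: out.append and v.upper are resolved on every pass.
    start = time.perf_counter()
    out = []
    for v in values:
        out.append(v.upper())
    print("lookup each iteration:", round(time.perf_counter() - start, 3), "s")

    # Hoist the bound method out of the loop (or use a list comprehension / generator).
    start = time.perf_counter()
    out = []
    append = out.append          # look the attribute up once
    upper = str.upper
    for v in values:
        append(upper(v))
    print("lookup hoisted:       ", round(time.perf_counter() - start, 3), "s")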

 

8. Commonly used Linux commands

    ls,help,cd,more,clear,mkdir,pwd,rm,grep,find,mv,su,date

 

9. How is yield used in Python?

    Put simply, yield turns a function into a generator: it lets the function remember the position of its last return, so that the second (or nth) call to next() jumps back to that position and resumes from there.
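
A minimal sketch of how yield resumes where it left off; the generator name is made up:

    def fibonacci():
        """Infinite Fibonacci generator: each next() resumes right after the yield."""
        a, b = 0, 1
        while True:
            yield a
            a, b = b, a + b

    gen = fibonacci()
    print(next(gen))  # 0  -- runs up to the first yield, then pauses
    print(next(gen))  # 1  -- resumes after the yield, loops once, pauses again
    print(next(gen))  # 1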

 

10. How does Python manage memory?

First, garbage collection:

    Unlike C++, Java and similar languages, Python does not require you to declare a variable's type in advance; you simply assign to the variable. In Python the type and memory of an object are determined at run time, which is why we call Python a dynamically typed language (here "dynamically typed" can be loosely summarized as: memory is allocated on assignment at run time, and the variable's type is determined automatically).

Second, reference counting:

    Python manages memory in a way similar to Windows kernel objects: every object maintains a count of the references pointing to it. When a variable is bound to an object, the reference count of that object is 1 (other situations also increase the reference count); the system maintains these counts automatically and scans them periodically, and when an object's reference count drops to 0, the object is reclaimed.

Third, the memory pool mechanism:

    Python's memory mechanism is organized in layers, like a pyramid:

    Layers -1 and -2 are operated mainly by the operating system;

    Layer 0 consists of C's malloc, free and the other memory allocation and deallocation functions;

    Layers 1 and 2 are the memory pool, implemented by Python interface functions such as PyMem_Malloc; objects smaller than 256 bytes are allocated directly at this layer;

    Layer 3, the topmost layer, is where we operate on Python objects directly.

If C's malloc and free are called too frequently, performance problems arise, and frequently allocating and freeing small blocks of memory produces memory fragmentation.

The main work Python does here is:

if the requested allocation is between 1 and 256 bytes, Python uses its own memory management system; otherwise it calls malloc directly.

Python still calls malloc here to allocate the memory, but each time it allocates a large block of 256 KB.

 

Memory registered through the memory pool is eventually recycled back to the pool rather than released with C's free, so it can be reused next time. For simple objects such as numbers, strings and tuples (a tuple cannot be changed), assignment effectively behaves like copying: when variable A is assigned to variable B, A and B initially share the same memory, but as soon as the value of A changes, new space is allocated for A, and the addresses of A and B are no longer the same.
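
A small sketch illustrating the last point with id() and sys.getrefcount(); the variable names and values are arbitrary:

    import sys

    a = 1000
    b = a                      # a and b are bound to the same int object
    print(id(a) == id(b))      # True: same address
    print(sys.getrefcount(a))  # reference count (one extra ref comes from the call itself)

    a = a + 1                  # ints are immutable: rebinding allocates a new object for a
    print(id(a) == id(b))      # False: the addresses now differ
    print(b)                   # 1000 -- b still refers to the original object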

 

11. Describe the differences between arrays, linked lists, queues and stacks.

Arrays and linked lists describe how data is stored: an array stores its data in contiguous memory, while a linked list may store its data in non-contiguous memory;

Stacks and queues describe how data is accessed: a queue is FIFO (first in, first out) while a stack is LIFO (last in, first out); both a stack and a queue can be implemented with either an array or a linked list.
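
A minimal sketch of a stack and a queue on top of Python's built-in structures (list and collections.deque); the pushed values are arbitrary:

    from collections import deque

    # Stack: LIFO -- append and pop from the same end of a list.
    stack = []
    stack.append(1)
    stack.append(2)
    stack.append(3)
    print(stack.pop())      # 3 -- the last element pushed comes out first

    # Queue: FIFO -- deque gives O(1) appends on one end and pops on the other.
    queue = deque()
    queue.append(1)
    queue.append(2)
    queue.append(3)
    print(queue.popleft())  # 1 -- the first element enqueued comes out first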

 

12. What sorting algorithms do you know? Talk about the one you are most familiar with.
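
The notes leave this one unanswered; as one commonly given answer, a minimal quicksort sketch:

    def quicksort(items):
        """Simple (not in-place) quicksort: average O(n log n), worst case O(n^2)."""
        if len(items) <= 1:
            return items
        pivot = items[len(items) // 2]
        smaller = [x for x in items if x < pivot]
        equal   = [x for x in items if x == pivot]
        larger  = [x for x in items if x > pivot]
        return quicksort(smaller) + equal + quicksort(larger)

    print(quicksort([3, 6, 1, 5, 2, 4]))  # [1, 2, 3, 4, 5, 6]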

 

Web framework section

1. A user logs in on Django application server A (entering a logged-in state); if nginx then proxies the next request to application server B, what happens?

If application server A does not share the user's login session data with application server B, then as far as B is concerned the user has not logged in.

2. How does Django solve the problem of CSRF (cross-site request forgery)? (Principle)

Enable the CSRF middleware;

use POST requests;

use a verification (captcha) code;

and add the {% csrf_token %} tag in the form (see the sketch below).
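
A minimal sketch of the two concrete pieces: the middleware entry in settings.py and the template tag (the template markup is shown as a string; the form fields are made up):

    # settings.py -- Django's CSRF protection middleware must be enabled
    MIDDLEWARE = [
        "django.middleware.security.SecurityMiddleware",
        "django.contrib.sessions.middleware.SessionMiddleware",
        "django.middleware.csrf.CsrfViewMiddleware",   # checks the token on unsafe methods (POST, ...)
        # ...
    ]

    # A login template (illustrative) would carry the tag inside the form:
    LOGIN_FORM = """
    <form method="post" action="/login/">
      {% csrf_token %}   <!-- renders a hidden input with the per-session token -->
      <input type="text" name="username">
      <button type="submit">Log in</button>
    </form>
    """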

 

3. Explain or describe the Django framework

Django follows the MVC design pattern, but uses its own terminology for it: MVT.

M stands for Model, with the same role as M in MVC: it is responsible for the data layer and embeds an ORM framework.

V stands for View, with the same role as C in MVC: it receives the HttpRequest, performs the business logic and returns an HttpResponse.

T stands for Template, with the same role as V in MVC: it is responsible for packaging the HTML structure to be returned and embeds a template engine.

 

4. In Django, how do you sort query results? How do you sort in descending order? How do you query for a field greater than some value?

Sort with order_by().

For descending order, prefix the field name with a minus sign: order_by('-field').

To query for a field greater than some value: use filter(field__gt=value). A short sketch follows.
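
A sketch against a hypothetical model (Article, with made-up created and views fields):

    from django.db import models

    class Article(models.Model):          # hypothetical model for illustration
        title = models.CharField(max_length=100)
        views = models.IntegerField(default=0)
        created = models.DateTimeField(auto_now_add=True)

    # Ascending and descending sort
    oldest_first = Article.objects.order_by("created")
    recent_first = Article.objects.order_by("-created")

    # "Greater than" lookup: double underscore before the lookup name
    popular = Article.objects.filter(views__gt=1000).order_by("-views")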

 

5. Talk about the role of Django middleware (MIDDLEWARE).

Middleware is a processing hook interposed between the request and the response. It is relatively lightweight and changes Django's input and output globally.

 

6. What do you know about Django?

Django goes in the "large and comprehensive" direction, and it is best known for its fully automatic admin backend: just use the ORM and write simple object definitions, and it can automatically generate the database structure as well as a full-featured admin backend.

Django's built-in ORM is highly coupled with the other modules inside the framework.

An application must use Django's built-in ORM, otherwise it cannot enjoy the various ORM-based conveniences the framework provides. It is theoretically possible to swap out the ORM module, but that is like tearing down and re-renovating an already renovated house; it is easier to start from an empty room and decorate it from scratch.

 

Django's selling point is extremely high development efficiency, but its ability to scale in performance is limited; once traffic reaches a certain scale, a Django project needs to be refactored to meet performance requirements.

Django is suitable for small and medium-sized websites, or as a tool for large websites to quickly build product prototypes.

Django's template design philosophy is the complete separation of code and presentation; it fundamentally rules out coding and processing data inside templates.

 

7. How do you implement redirects in Django? Which status codes are used?

Use HttpResponseRedirect,

or redirect and reverse.

Status codes: 302 (temporary) and 301 (permanent).
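
A minimal sketch of both forms, assuming a URL pattern named "home" exists:

    from django.http import HttpResponseRedirect
    from django.shortcuts import redirect
    from django.urls import reverse

    def old_view(request):
        # Explicit response object; 302 by default
        return HttpResponseRedirect(reverse("home"))

    def old_view_short(request):
        # Shortcut; permanent=True switches the status code from 302 to 301
        return redirect("home", permanent=True)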

 

8. What are nginx forward proxies and reverse proxies?

A forward proxy is a server that sits between the client and the origin server: to obtain content from the origin server, the client sends a request to the proxy and specifies the target (the origin server); the proxy then forwards the request to the origin server and returns the obtained content to the client. The client must do some special configuration to use a forward proxy.

A reverse proxy is the opposite: to the client it looks just like the origin server, and the client needs no special configuration. The client sends an ordinary request for content in the reverse proxy's namespace; the reverse proxy then decides where to forward the request (which origin server) and returns the obtained content to the client as if it were its own.

 

9. What is the core of Tornado?

Tornado's core is the ioloop and iostream modules: the former provides an efficient I/O event loop, and the latter wraps a non-blocking socket. By adding network I/O events to the ioloop, using non-blocking sockets and attaching the corresponding callback functions, you can achieve the dream of efficient asynchronous execution.
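
A minimal Tornado application as a sketch of the event loop in use; the port and handler are arbitrary:

    import tornado.ioloop
    import tornado.web

    class MainHandler(tornado.web.RequestHandler):
        def get(self):
            self.write("Hello, Tornado")

    def make_app():
        return tornado.web.Application([(r"/", MainHandler)])

    if __name__ == "__main__":
        app = make_app()
        app.listen(8888)                          # non-blocking server socket
        tornado.ioloop.IOLoop.current().start()   # run the I/O event loop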

 

10. Django itself provides runserver; why can't it be used for deployment?

 

runserver is the run mode commonly used while debugging Django: it uses the WSGI server that ships with Django, is mainly intended for testing and development, and runs as a single process.

uWSGI is a web server that implements the WSGI protocol as well as the uwsgi, http and other protocols. Note that uwsgi is a communication protocol, while uWSGI is the web server that implements both the uwsgi protocol and the WSGI protocol. uWSGI offers very fast performance, low memory footprint, multi-app management and more, and combined with Nginx

it forms a production environment that isolates user requests from the application, which is a real deployment. Compared with runserver, it supports much higher concurrency, makes multi-process management easier and takes advantage of multiple cores to improve performance.

 

Network programming and front-end section

 

 

1. What is AJAX, and how do you use it?

AJAX (asynchronous JavaScript and XML) can refresh part of a page's data without reloading the entire page.

Step one: create an XMLHttpRequest object, var xmlhttp = new XMLHttpRequest(); the XMLHttpRequest object is used to exchange data with the server.

Step two: use the XMLHttpRequest object's open() and send() methods to send a resource request to the server.

Step three: use the XMLHttpRequest object's responseText or responseXML property to receive the response.

Step four: the onreadystatechange function. When we want some function to run once the server has responded to the request, we use onreadystatechange; it is triggered every time the XMLHttpRequest object's readyState changes.

2. What are some common HTTP status codes?

200 OK

301 Moved Permanently

302 Found

304 Not Modified

307 Temporary Redirect

400 Bad Request

401 Unauthorized

403 Forbidden

404 Not Found

410 Gone

500 Internal Server Error

501 Not Implemented

3. What is the difference between POST and GET?

1. With a GET request, the request data is appended to the URL (separated from the URL by a ?, with multiple parameters joined by &). The URL is encoded in ASCII rather than Unicode, which means every non-ASCII character must be encoded before transmission.

With a POST request, the request data is placed in the body of the HTTP request packet; a form field such as item=bandsaw would be the actual transmitted data there.

Therefore, the data of a GET request is exposed in the address bar, while the data of a POST request is not.

 

2. Size of the transmitted data

The HTTP specification places no limit on the length of the URL or the size of the transmitted data. In actual development, however, specific browsers and servers do restrict the URL length for GET, so the amount of data transmitted by a GET request is limited by the URL length.

For POST, since the values are not passed through the URL, there is in theory no limit; in practice, though, each server restricts the size of POST submissions, and Apache, IIS and others have their own configuration for this.

 

3. Security

POST is more secure than GET. Security here means real security, which differs from the sense of "safe" in safe methods mentioned above, where it only means not modifying data on the server. For example, if a login is performed with a GET request, the username and password are exposed in the URL; because the login page may be cached by the browser, and because others can look at the browser history, the username and password are then very easy for others to obtain. In addition, the data submitted by a GET request may also be exploited for a Cross-Site Request Forgery attack.

 

4. What is the difference between cookies and sessions?

1. Cookie data is stored in the client's browser; session data is stored on the server.

2. Cookies are not very secure: others can analyze the cookies stored locally and forge them, so for security you should use sessions.

3. Sessions are kept on the server for a certain period of time. As accesses increase they consume more of the server's resources, so to reduce the load on the server you should use cookies where possible.

4. A single cookie cannot hold more than 4 KB of data, and many browsers limit a site to storing at most 20 cookies.

5. Recommendation:
   store important information such as login state in the SESSION;
   other information, if it needs to be kept, can be placed in COOKIEs.

 

5. What are the steps needed to create a simple TCP server?

1. socket: create a socket

2. bind: bind an IP and port

3. listen: make the socket able to accept connections passively

4. accept: wait for a client to connect

5. recv / send: receive and send data
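
A minimal sketch of those five steps (the address 127.0.0.1:8000 is arbitrary):

    import socket

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)        # 1. create the socket
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 8000))                                   # 2. bind ip and port
    server.listen(5)                                                   # 3. become a passive (listening) socket

    while True:
        client, addr = server.accept()                                 # 4. wait for a client
        data = client.recv(1024)                                       # 5. receive ...
        client.send(b"echo: " + data)                                  #    ... and send back
        client.close()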

 

Crawler and database section

 

 

1. What is the difference between scrapy and scrapy-redis? Why use the redis database?

1) scrapy is a Python crawler framework with very efficient crawling and a high degree of customization, but it does not support distribution by itself. scrapy-redis is a set of components based on the redis database that runs on top of the scrapy framework and lets scrapy support a distributed strategy: the slaver side shares the item queue, request queue and request-fingerprint set stored in the master side's redis database.

2) Why the redis database? Because redis supports master-slave synchronization and the data is cached in memory, a redis-based distributed crawler achieves very high efficiency for high-frequency requests and data reads.

2. Which crawler modules or frameworks have you used? Talk about their differences or their advantages and disadvantages.

Python comes with: urllib, urllib2

Third party: requests

Framework: Scrapy

Both urllib and urllib2 perform operations related to requesting URLs, but they provide different functionality.

urllib2: urllib2.urlopen can accept either a URL or a Request object (when it receives a Request object, the headers for a URL can be set), whereas urllib.urlopen only accepts a URL.

Also, urllib has urlencode while urllib2 does not, which is why urllib and urllib2 are often used together.

scrapy is an encapsulated framework that includes a downloader, parsers, logging and exception handling. It is based on multithreading and twisted's asynchronous processing; for developing a crawler for a single, fixed website it has advantages, but for crawling 100 websites across multiple sites it is inflexible for concurrent and distributed processing and inconvenient to adjust and extend.

requests is an HTTP library used only for making HTTP requests. It is a powerful library in itself; downloading and parsing are handled entirely by you, which gives greater flexibility, and it is also very flexible for high concurrency and distributed deployment, letting you implement features better.

Scrapy advantages and disadvantages:

Advantages: scrapy is asynchronous;

it uses XPath instead of regular expressions, which is more readable;

it has a powerful statistics and logging system;

it can crawl different URLs at the same time;

it supports a shell, which makes independent debugging convenient;

you can write middleware, which makes it easy to write unified filters;

and it stores items in the database through pipelines.

Disadvantages: it is a crawler framework based on Python, so extensibility is poor;

and being based on the twisted framework, an exception in a running task does not kill the reactor, and the asynchronous framework does not stop other tasks on error, so data errors are hard to detect.

3. Which MySQL engine do you prefer? What are the differences between the engines?

The two main engines are MyISAM and InnoDB; the main differences are as follows:

1. InnoDB supports transactions while MyISAM does not, and this is very important. A transaction is a high-level way of processing: a series of inserts, updates and deletes can be rolled back on error, which MyISAM cannot do;

2. MyISAM is suited to query- and insert-heavy applications, while InnoDB is suited to applications with frequent modifications and higher safety requirements;

3. InnoDB supports foreign keys, MyISAM does not;

4. MyISAM is the default engine, InnoDB has to be specified;

5. InnoDB does not support the FULLTEXT index type;

6. InnoDB does not store a table's row count, so SELECT COUNT(*) FROM table requires scanning the whole table to count the rows, while MyISAM simply reads the saved row count. Note that when the COUNT(*) statement contains a WHERE condition, MyISAM also needs to scan the whole table;

7. For an auto-increment field, InnoDB must have an index containing only that field, whereas in MyISAM a joint index with other fields can be built;

8. When clearing an entire table, InnoDB deletes row by row, which is very slow, while MyISAM rebuilds the table;

9. InnoDB supports row locks (in some cases the whole table is locked, e.g. UPDATE table SET a=1 WHERE user LIKE '%lee%').

4. Describe how the scrapy framework runs.

First, the initial URLs are taken from start_urls and requests are sent; the engine hands the requests to the scheduler, which puts them into the request queue. The scheduler then passes queued requests to the downloader, which fetches the resources corresponding to the requests and obtains the responses. The responses go to the parsing methods you have written for extraction: 1) if the extracted result is the data you need, it is handed to the item pipeline for processing; 2) if the extracted result is a URL, the earlier steps are repeated (a request is sent for the URL, the engine hands the request to the scheduler's queue, ...), until the request queue holds no more requests and the program ends.

5. What are relational (join) queries, and what kinds are there?

Querying several tables together; mainly inner joins, left joins, right joins and full (outer) joins.

6. Is it better to write a crawler with multiple processes or multiple threads? Why?

For IO-bound code (file processing, web crawlers and so on), multithreading can effectively improve efficiency: with a single thread, IO operations force waiting and waste time unnecessarily, while with multiple threads, execution automatically switches to thread B while thread A waits on IO, so CPU resources are not wasted and the program runs more efficiently. In an actual data-collection process, besides response speed you also need to consider the machine's own hardware when deciding between multiple processes and multiple threads.
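
A small sketch of the IO-bound case, using a thread pool to fetch several pages concurrently (the URL list is a placeholder):

    from concurrent.futures import ThreadPoolExecutor

    import requests

    URLS = ["https://example.com/page/%d" % i for i in range(1, 6)]   # placeholder URLs

    def fetch(url):
        # Each thread spends most of its time waiting on the network, so the GIL
        # is released during the wait and the downloads overlap.
        resp = requests.get(url, timeout=10)
        return url, resp.status_code

    with ThreadPoolExecutor(max_workers=5) as pool:
        for url, status in pool.map(fetch, URLS):
            print(status, url)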

7. How do you optimize a database?

1. Optimize indexes and SQL statements, and analyze slow queries;

2. Design the tables strictly according to the database design paradigms (normal forms);

3. Use a cache: put data that is accessed frequently but does not need to change often into a cache, saving disk IO;

4. Optimize the hardware: use SSDs, and use disk-queue technology (RAID0, RAID1, RAID5) and the like;

5. Use MySQL's built-in partitioned tables to spread data across different files at different levels, which can improve disk read efficiency;

6. Split tables vertically: put columns that are rarely read into a separate table, saving disk I/O;

7. Separate reads from writes on the master: use master-slave replication to separate the database's read operations from its write operations;

8. Split databases, tables and machines (for particularly large data volumes); the main principle is data routing;

9. Choose an appropriate storage engine and optimize the table's parameters;

10. Do caching at the architecture level, and use static content and distribution;

11. Do not use full-text indexes;

12. Use faster storage, e.g. NoSQL, for frequently accessed data.

8. What are common anti-crawler measures and how do we deal with them?

1) Anti-crawling via request Headers

Anti-crawling based on the request Headers sent by the user is the most common anti-crawler strategy. Many sites check the User-Agent in the Headers, and some also check the Referer (some sites use the Referer to detect hotlinking of resources). If you run into this type of anti-crawler mechanism, you can add Headers directly in the crawler, copying the browser's User-Agent into the crawler's Headers, or changing the Referer to the target site's domain. For anti-crawlers that check Headers, modifying or adding Headers is usually enough to bypass them.
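
A small sketch of that workaround with requests; the URL and the User-Agent string are placeholders:

    import requests

    headers = {
        # Copied from a real browser; this exact string is only an example.
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36"),
        "Referer": "https://example.com/",     # set to the target domain
    }

    resp = requests.get("https://example.com/page", headers=headers, timeout=10)
    print(resp.status_code)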

2) Anti-crawling based on user behavior

Some sites detect user behavior, such as the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period.

Most sites fall into the first situation, and using IP proxies solves it. You can write a dedicated crawler that crawls public proxy IPs from the internet and saves them all after testing them. Such proxy IPs are needed often by crawlers, so it is best to prepare your own pool. Once you have a large number of proxy IPs, you can switch IPs every few requests; this is very easy to do with requests or urllib2, so the first kind of anti-crawler is easily bypassed.

For the second situation, you can wait a random interval of a few seconds after each request before making the next one. On some sites with weak logic, you can also request several times, log out, log back in and continue requesting to get around the restriction that the same account cannot make the same request several times in a short period.
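
A sketch combining both workarounds: rotating through a (placeholder) proxy pool and sleeping a random interval between requests.

    import random
    import time

    import requests

    # Placeholder proxy pool; in practice these come from your own tested list.
    PROXIES = [
        {"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"},
        {"http": "http://10.0.0.2:8080", "https": "http://10.0.0.2:8080"},
    ]

    def fetch(url):
        proxy = random.choice(PROXIES)            # change IP between requests
        resp = requests.get(url, proxies=proxy, timeout=10)
        time.sleep(random.uniform(1, 5))          # random pause to look less like a bot
        return resp

    fetch("https://example.com/page")             # placeholder URL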

3) Anti-crawling with dynamic pages

The situations above mostly apply to static pages, but on some sites the data we need to crawl is obtained through AJAX requests or generated by JavaScript. First analyze the network requests with Fiddler. If you can find the AJAX request and can work out the meaning of its parameters and its response, you can use the method above: simulate the AJAX request directly with requests or urllib2, and analyze the response JSON to get the data you need.

Being able to simulate the AJAX request directly to obtain the data is of course excellent, but some sites encrypt all the parameters of their AJAX requests, and we simply cannot construct the request data we need. In that case use selenium + PhantomJS: call a browser kernel and use PhantomJS to execute JavaScript, simulating human actions and triggering the page's JS scripts. From filling out a form to clicking a button to scrolling the page, everything can be simulated, without worrying about the specific request and response process; it fully simulates a person browsing the page to fetch the data.

With this approach you can bypass almost all anti-crawler measures, because it does not pretend to be a browser to fetch data (adding Headers, as above, is to some extent pretending to be a browser); it is itself a browser. PhantomJS is simply a browser without an interface, just not one operated by a person. selenium + PhantomJS can do many things, for example recognizing touch-type (12306) or slide-type captchas, and brute-forcing page forms.

9. What problems does a distributed crawler mainly solve?

1) IP

2) Bandwidth

3) CPU

4) IO

10. How do you handle captchas during crawling?

1. Use what scrapy itself comes with

2. Use a paid captcha-recognition interface


Origin www.cnblogs.com/xiaoxiaoxl/p/11110137.html