Things to pay attention to when developing high-performance websites in Java

Original address: http://www.javabloger.com/java-development-concern-those-things/

At the industry technology conferences recently held by the various IT media, many websites have been disclosing their technical internals and sharing them with peers, from Facebook and Baidu down to small sites that have only just launched. The technology and extraordinary processing power of large sites such as Facebook and Baidu are indeed eye-opening, but not every website is a Facebook or a Baidu, with hundreds of millions of visitors and a huge amount of data to store, so MapReduce/parallel computing and HBase/column storage are not always needed. Technology exists to support the business and should fit the current operating environment; there is no need to chase fashion and insist on touching every popular technology before you are satisfied.

At these recent conferences the spotlight has been on the large websites, but the technical systems of small and medium portals are just as worth discussing and paying attention to. Engineers around the world do not all serve the big portals; far more of them quietly serve small and medium websites that have only just started, and they account for well over 60% of the profession. While we watch the large portals, the technical evolution and practical experience of small and medium portals are even more worth sharing.

Large portals and small or medium vertical websites alike pursue stability, performance, and scalability. The experience shared by large websites is worth studying and borrowing from, but when it comes down to concrete practice it does not apply to every site. A few points, then:

JVM
The JEE container runs inside the JVM, and configuring the JVM parameters correctly is directly related to the performance and processing capacity of the whole system. JVM tuning is mainly about memory management, and the optimization work falls into the following four areas:
1. HeapSize: the size of the heap, which is essentially the JVM's strategy for using memory; this is critical.
2. GarbageCollector: choose among the garbage collection algorithms (strategies) offered by the JVM by configuring the relevant parameters.
3. StackSize: the stack is the JVM's instruction memory area; each thread has its own stack, and the stack size limits the number of threads you can run.
4. DeBug/Log: the JVM can also be configured to log its runtime behaviour and to dump information when it crashes. This is essential: only with this output can you pick the appropriate parameters for your JVM.
JVM configuration tips can be found everywhere on the Internet, but I still recommend reading the two official Sun documents below to get a proper understanding of the configuration parameters.
1.Java HotSpot VM Options
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
2.Troubleshooting Guide for Java SE 6 with HotSpot VM  http://www.oracle.com/technetwork/java/javase/index-137495.html
Also, not every engineer faces these JVM parameters every day; if you forget the key ones, you can run java -X (capital X) to have them listed.
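As a rough illustration of the four areas above, here is a minimal, hedged example of a HotSpot launch command; the sizes, the GC choice, the log paths, and the application jar are placeholders that would have to be tuned for a real workload:

```
# Heap: initial/maximum size and young generation (placeholder values)
# GC: CMS + ParNew as one possible strategy, not a recommendation
# Stack: 256k per thread
# Logging: GC details at runtime, heap dump when the JVM dies of OutOfMemoryError
java -server -Xms2g -Xmx2g -Xmn512m -Xss256k \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:+PrintGCDetails -Xloggc:/var/log/myapp/gc.log \
     -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/myapp/ \
     -jar myapp.jar
```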

JDBC
The JDBC parameters for MySQL have also been introduced in previous articles. Using the JDBC configuration parameters sensibly, whether on a single machine or in a cluster, likewise has a great impact on how the application works with the database.
Some so-called high-performance Java ORM open-source frameworks do little more than turn on many of JDBC's own parameters:
1. For example: autoReconnect, prepStmtCacheSize, cachePrepStmts, useNewIO, blobSendChunkSize, etc.
2. For example, in a cluster environment: roundRobinLoadBalance, failOverReadOnly, autoReconnectForPools, secondsBeforeRetryMaster.
For details, please refer to MySQL's JDBC official manual:
http://dev.mysql.com/doc/refman/5.1/zh/connectors.html#cj-jdbc-reference
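As a hedged sketch of how such parameters are usually applied, they can simply be appended to the JDBC URL; the host, database, credentials, and values below are placeholders, and each parameter should be checked against the Connector/J manual for the driver version in use:

```
import java.sql.Connection;
import java.sql.DriverManager;

public class JdbcUrlExample {
    public static void main(String[] args) throws Exception {
        // Not needed with JDBC 4 auto-loading; kept for older drivers.
        Class.forName("com.mysql.jdbc.Driver");

        // Placeholder host/db/credentials; the parameter values are illustrative only.
        String url = "jdbc:mysql://localhost:3306/mydb"
                + "?cachePrepStmts=true"     // enable the prepared-statement cache
                + "&prepStmtCacheSize=250"   // how many statements to cache per connection
                + "&autoReconnect=true"      // re-open dropped connections (see the manual for caveats)
                + "&useNewIO=true";          // listed in older Connector/J manuals; may not exist in newer drivers

        Connection conn = DriverManager.getConnection(url, "user", "password");
        try {
            System.out.println("connected: " + !conn.isClosed());
        } finally {
            conn.close(); // always give the connection back
        }
    }
}
```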

Database connection pool (DataSource)
Frequent opening and closing of database connections creates a bottleneck and a lot of overhead that hurts system performance. A JDBC connection pool is responsible for allocating, managing, and releasing database connections. It lets the application reuse an existing connection instead of establishing a new one each time, so the application no longer has to switch connections with the database constantly, and connections whose idle time exceeds the maximum idle time can be released, avoiding leaks caused by connections that are never given back. This technique can significantly improve the performance of database operations.
One thing I think needs explaining here: connections taken from the pool still need to be closed. When the connection pool starts up, it obtains its connections from the database in advance, and the application no longer deals with the database directly; using a connection pool is a "borrowing" concept, so the application borrows a resource from the pool and must give it back. It is like having 20 buckets by a pool of water: anyone who needs water can take a bucket and fetch it, but if 20 people finish fetching water and none of them puts the bucket back, the people who come afterwards can only wait for someone to return one. After use the bucket must be put back, otherwise those behind keep waiting and the resource is blocked. In the same way, when the application is handed a Connection object from the "pool", it must return that connection once it has finished using it, keeping to the "borrow and return" principle of the connection pool.
Reference:
http://dev.mysql.com/doc/refman/5.1/zh/connectors.html#cj-connection-pooling
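To make the "borrow and return" point concrete, here is a minimal sketch against the standard javax.sql.DataSource interface; which pool implementation sits behind it (DBCP, c3p0, a container-managed JNDI DataSource, ...) is left open, and the users table is purely illustrative:

```
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class UserDao {
    private final DataSource dataSource; // configured elsewhere (pool implementation of your choice)

    public UserDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public String findUserName(long id) throws SQLException {
        Connection conn = dataSource.getConnection();   // "borrow" a connection from the pool
        try {
            PreparedStatement ps = conn.prepareStatement("SELECT name FROM users WHERE id = ?");
            try {
                ps.setLong(1, id);
                ResultSet rs = ps.executeQuery();
                return rs.next() ? rs.getString(1) : null;
            } finally {
                ps.close();
            }
        } finally {
            conn.close();   // "return" the connection to the pool, never hold on to it
        }
    }
}
```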

Data access
Which kind of data is best kept where is a question worth thinking about for the database server and the data access layer; future storage is likely to be mixed, with Cache, NoSQL, DFS, and Database all present in one system. The tableware and the clothes in daily life both need to be stored at home, but not in the same kind of furniture; nobody puts tableware and clothes in the same cabinet. It is the same with the different kinds of data in a system: each kind needs an appropriate storage environment. Files and images should first be classified by how frequently they are accessed, or by file size. Data with strong relationships that needs transaction support belongs in a traditional database; weakly relational data that does not need transactions can go to NoSQL; for massive file storage, consider a DFS that supports network storage; and whether to cache depends on the size of each stored item and the read/write ratio.
Another point worth noting is the separation of reads and writes. Whether in a Database or a NoSQL environment, reads usually far outnumber writes, so the design must spread the reads across multiple machines while also keeping the data consistent between them. With MySQL this means one master and several slaves; by adding MySQL-Proxy, or by borrowing some JDBC parameters (roundRobinLoadBalance, failOverReadOnly, autoReconnectForPools, secondsBeforeRetryMaster) in the application, reads and writes can be separated so that the heavy read load is spread across multiple machines while data consistency is preserved.
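A hedged sketch of the JDBC side of this, based on the ReplicationDriver described in the Connector/J manual; the host names, database, and credentials are placeholders, and the exact usage should be verified against the manual for your driver version:

```
import java.sql.Connection;
import java.util.Properties;
import com.mysql.jdbc.ReplicationDriver;

public class ReadWriteSplitExample {
    public static void main(String[] args) throws Exception {
        ReplicationDriver driver = new ReplicationDriver();

        Properties props = new Properties();
        props.put("autoReconnect", "true");
        props.put("roundRobinLoadBalance", "true"); // spread reads across the slaves
        props.put("user", "appuser");               // placeholder credentials
        props.put("password", "secret");

        // First host is the master, the rest are slaves (placeholder host names).
        Connection conn = driver.connect("jdbc:mysql://master,slave1,slave2/mydb", props);

        conn.setReadOnly(false);   // writes and transactions go to the master
        // ... INSERT/UPDATE statements here ...

        conn.setReadOnly(true);    // reads are load-balanced across the slaves
        // ... SELECT statements here ...

        conn.close();
    }
}
```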

Cache
From a macro point of view, caches are generally divided into two types: local caches and distributed caches.
1. Local cache: pull the frequently used data out of the static data set and keep it in memory; in a high-concurrency environment, ConcurrentHashMap or CopyOnWriteArrayList is recommended as the local cache (see the sketches at the end of this section). More concretely, using a cache means using system memory, and the amount of memory it occupies needs to stay within a sensible proportion; going beyond that is counterproductive and drags down the operating efficiency of the whole system.
2. Distributed cache: generally used in a distributed environment, spreading the cached data across the machines in the cluster. Besides plain caching, it is also used as a way to synchronize/transfer data between the subsystems of a distributed system. The most commonly used products are Memcached and Redis (also sketched at the end of this section).
The read/write efficiency of data stored on different media varies enormously. How to use caches well so that your data sits closer to the CPU is illustrated by Jeff Dean's well-known numbers ( Ref ), as shown in the figure:
[Figure: cache-speed, access latency of the different storage layers]
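Two small sketches to go with the two cache types above. First, a minimal local cache built on ConcurrentHashMap; eviction, size limits, and expiry are deliberately left out:

```
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** A minimal in-memory cache; eviction and size limits are deliberately left out. */
public class LocalCache<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<K, V>();

    public V get(K key) {
        return map.get(key);
    }

    public void put(K key, V value) {
        map.put(key, value);
    }

    /** Atomically caches the value only if the key is not present yet. */
    public V putIfAbsent(K key, V value) {
        V existing = map.putIfAbsent(key, value);
        return existing != null ? existing : value;
    }
}
```

Second, a distributed-cache call, assuming the Jedis client for Redis; the host, port, key, and TTL are placeholders:

```
import redis.clients.jedis.Jedis;

public class RedisCacheExample {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379); // placeholder host/port
        try {
            jedis.setex("news:100", 300, "cached page fragment"); // cache with a 5-minute TTL
            String cached = jedis.get("news:100");
            System.out.println(cached);
        } finally {
            jedis.disconnect();
        }
    }
}
```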
Concurrency / multi-threading
In a high-concurrency environment, developers are advised to use the concurrency package (java.util.concurrent) that ships with the JDK. From JDK 1.5 onwards, the utility classes under java.util.concurrent simplify multi-threaded development considerably. They fall into the following main parts:
1. Thread pools: the thread-pool interfaces (Executor, ExecutorService) and their implementations (ThreadPoolExecutor, ScheduledThreadPoolExecutor). The thread-pool framework that comes with the JDK manages the queuing and scheduling of tasks and allows a controlled shutdown. Because a running thread consumes CPU, and creating and ending threads also costs CPU, a thread pool not only manages multiple threads effectively but also improves how efficiently they run (see the sketch after this list).
2. Queues: ConcurrentLinkedQueue provides an efficient, scalable, thread-safe non-blocking FIFO queue, and five implementations in java.util.concurrent support the extended BlockingQueue interface, which defines blocking versions of put and take: LinkedBlockingQueue, ArrayBlockingQueue, SynchronousQueue, PriorityBlockingQueue, and DelayQueue. Between them these classes cover the most common usage contexts for producer-consumer, message passing, parallel task execution, and related concurrency designs.
3. Synchronizers: four classes assist with common special-purpose synchronization idioms. Semaphore is a classic concurrency tool. CountDownLatch is an extremely simple yet very common utility for blocking until a given number of signals, events, or conditions has occurred. CyclicBarrier is a resettable multiway synchronization point useful in certain styles of parallel programming. Exchanger allows two threads to exchange objects at a rendezvous point, which is useful in multi-pipeline designs.
4. Concurrent collections: the package also provides Collection implementations designed for multithreaded contexts: ConcurrentHashMap, ConcurrentSkipListMap, ConcurrentSkipListSet, CopyOnWriteArrayList, and CopyOnWriteArraySet. When many threads are expected to access a given collection, ConcurrentHashMap is generally preferable to a synchronized HashMap and ConcurrentSkipListMap to a synchronized TreeMap; CopyOnWriteArrayList is preferable to a synchronized ArrayList when reads and traversals greatly outnumber updates to the list.
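A minimal sketch of item 1, the thread-pool usage; the pool size and the task body are arbitrary placeholders:

```
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadPoolExample {
    public static void main(String[] args) throws InterruptedException {
        // A fixed-size pool; the size (10) is an arbitrary placeholder.
        ExecutorService pool = Executors.newFixedThreadPool(10);

        for (int i = 0; i < 100; i++) {
            final int taskId = i;
            pool.execute(new Runnable() {
                public void run() {
                    // stand-in for real work
                    System.out.println("task " + taskId + " on " + Thread.currentThread().getName());
                }
            });
        }

        pool.shutdown();                            // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES); // controlled shutdown: wait for queued tasks
    }
}
```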

Queues
Queues can be divided into two types: local queues and distributed queues.
Local queues are generally used for batched writing of non-time-critical data: the incoming data is buffered, and once it reaches a certain count it is written out in a single batch. This can be implemented with a BlockingQueue or a List/Map, as sketched below.
Related information: Sun Java API.
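A minimal sketch of the batch-write pattern just described, using LinkedBlockingQueue and drainTo; the flush threshold and the writeBatch step are placeholders (in a real system the flush would typically be a single multi-row INSERT or a timed job):

```
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchWriter {
    private static final int BATCH_SIZE = 100; // arbitrary flush threshold
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();

    /** Called by producers; the record just sits in the local queue. */
    public void add(String record) throws InterruptedException {
        queue.put(record);
        // The size check is approximate under concurrency; good enough for a sketch.
        if (queue.size() >= BATCH_SIZE) {
            flush();
        }
    }

    /** Drains whatever is queued and writes it out in one go. */
    public void flush() {
        List<String> batch = new ArrayList<String>();
        queue.drainTo(batch, BATCH_SIZE);
        if (!batch.isEmpty()) {
            writeBatch(batch); // placeholder for the real batched write
        }
    }

    private void writeBatch(List<String> batch) {
        System.out.println("writing " + batch.size() + " records in one batch");
    }
}
```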
Distributed queues are generally used as message middleware, building a bridge for communication between subsystems in a distributed environment. The most widely used in the JEE world are Apache's ActiveMQ and Sun's OpenMQ.
Lightweight MQ middleware has also been introduced here before, for example Kestrel and Redis (Ref http://www.javabloger.com/article/mq-kestrel-redis-for-java.html). I recently heard that LinkedIn's search technology team has released an MQ product called Kafka (Ref http://sna-projects.com/kafka); it is worth keeping an eye on.
Related information:
1.ActiveMQ  http://activemq.apache.org/getting-started.html
2.OpenMQ   http://mq.java.net/about.html
3.Kafka        http://sna-projects.com/kafka
4.JMS article   http://www.javabloger.com/article/category/jms

NIO
NIO appeared in JDK 1.4. Before that, the JDK offered only stream-oriented I/O: reading or writing a file processed data one byte at a time, with an input stream producing one byte and an output stream consuming one byte. Stream-oriented I/O is quite slow, and a data packet has either been received in full or not at all. Java NIO's non-blocking approach follows the Reactor pattern: the program is notified automatically when data arrives, without waiting or busy-looping, which greatly improves system performance. In practice NIO is mostly used in two areas: first, reading and writing files, and second, handling data streams on the network. The core objects to master in NIO are: 1. Selector, 2. Channel, 3. Buffer.
A couple of additional remarks from me:
1. Within Java NIO, memory-mapped files are an efficient technique. They can be used, for example, to separate the hot and cold data held in a cache and handle part of the cold data this way. In practice this is much faster than conventional stream-based or channel-based I/O: the file's data is made to appear as the contents of a memory array, and only the parts actually read or written are mapped into memory, instead of the whole file being read in (see the sketch after this list).
2. MySQL's JDBC driver can also use NIO to talk to the database, which helps system performance.
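A minimal sketch of point 1, reading a file through a memory-mapped buffer; the file name is a placeholder:

```
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFileExample {
    public static void main(String[] args) throws Exception {
        RandomAccessFile file = new RandomAccessFile("data.bin", "r"); // placeholder file name
        try {
            FileChannel channel = file.getChannel();
            // Only the mapped region is paged in on demand, not the whole file.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (buffer.hasRemaining()) {
                sum += buffer.get(); // the file's bytes appear as an in-memory buffer
            }
            System.out.println("checksum: " + sum);
        } finally {
            file.close();
        }
    }
}
```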

Long connection / Servlet 3.0
The long connection meant here is long polling. In the past, for a browser (client) to notice data changes on the server it had to keep polling the server, and as the number of clients grew this inevitably put great pressure on the server side; think of in-site messages on a forum. The Servlet 3.0 specification now provides a new feature, asynchronous I/O, which keeps a long connection open; using Servlet 3 asynchronous requests can greatly relieve the pressure on the server.
The principle of Servlet 3.0 async is to suspend the request and set a waiting timeout. When a background event fires, the result is written back to the suspended request; when the timeout expires, the request is likewise returned to the client. The client then issues the request again, and the exchange between client and server goes back and forth this way.
It is as if you come to me in advance and say: "if anyone comes looking for me, let me know at once and I will come to see him." Without that arrangement you would have to keep asking me whether anyone is looking for you, whether or not anyone actually is, and then both the one asking and the one being asked are worn out.
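A hedged sketch of the Servlet 3.0 asynchronous API described above (startAsync/AsyncContext); the timeout, the URL pattern, and the waitForNewMessage stand-in for a real server-side event source are all assumptions:

```
import java.io.IOException;
import javax.servlet.AsyncContext;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/messages", asyncSupported = true)
public class LongPollServlet extends HttpServlet {

    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        final AsyncContext ctx = req.startAsync();   // suspend the request
        ctx.setTimeout(30000);                       // arbitrary 30-second wait

        ctx.start(new Runnable() {                   // run off the container's request thread
            public void run() {
                try {
                    String message = waitForNewMessage(); // stand-in for a real server-side event
                    ctx.getResponse().getWriter().write(message);
                } catch (Exception e) {
                    // ignored in this sketch
                } finally {
                    ctx.complete();                  // hand the response back to the client
                }
            }
        });
    }

    private String waitForNewMessage() throws InterruptedException {
        Thread.sleep(1000);                          // placeholder for waiting on an event
        return "new site message";
    }
}
```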

Log4j
Log4j is a tool practically everyone uses. When a system has just gone live the log level is generally set to INFO; once it is running in production it is usually set to ERROR. At any time, though, pay attention to what the logs contain: developers generally rely on the log output to locate problems or to tune the system's performance, and the logs are also the basis for reporting the system's running state and for troubleshooting.
Simply put, logs should be written to different destinations according to defined strategies and levels, so that they are easy to analyse and manage. Without an output strategy, once there are many machines and a long stretch of time you end up with heaps of messy logs and no idea where to start when something goes wrong. The output strategy is therefore the key point of using logs.
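As one example of such an output strategy, a log4j 1.2 properties file might keep the root logger at ERROR while leaving a single package at INFO; the file path, pattern, and package name below are placeholders:

```
# Root logger: ERROR level in production, written to a daily rolling file (placeholder path).
log4j.rootLogger=ERROR, file

log4j.appender.file=org.apache.log4j.DailyRollingFileAppender
log4j.appender.file.File=/var/log/myapp/app.log
log4j.appender.file.DatePattern='.'yyyy-MM-dd
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d [%t] %-5p %c - %m%n

# A specific package can stay at INFO while the rest of the system logs only errors.
log4j.logger.com.example.order=INFO
```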
Reference: http://logging.apache.org/log4j/1.2/manual.html

Packaging/deployment
When designing the code, it is best to split the different kinds of functional modules into separate projects in the IDE, so that they can be packaged into different jars and deployed into different environments. Consider this scenario: every day the system has to fetch a hundred news items and the weather forecasts for a number of cities remotely from an SP. The daily data volume is small, but the concurrent front-end traffic is large, so the architecture clearly needs read/write separation.
If the web project and the scheduled crawling module are packaged into a single project, then when you need to scale out, every machine ends up running both the web application and the timer. Because the modules are not separated, every machine runs the timer job, and the database fills up with duplicate data.
If the web part and the timer are split into two projects during development, they can be packaged and deployed separately: ten web instances can share a single timer, which spreads the front-end request load and avoids duplicate writes.
Another benefit is sharing. In the scenario above, both the web part and the timer need to read the database, so both projects would contain database-access code and the logic would get messy. If a separate DAL-layer jar is extracted, the developers of the web and timer modules only need to reference that jar and program against its interfaces to build their business logic, without worrying about the concrete database operations. Development tasks can then be divided cleanly, and the teams do not get in each other's way.

Framework
The so-called popular "lightweight" SSH (Struts/Spring/Hibernate) stack is not lightweight at all for many small and medium projects: developers must maintain not only the code but also cumbersome XML configuration files, and a single badly written configuration file can stop the whole project from running. There are plenty of products that can replace the SSH (Struts/Spring/Hibernate) stack without configuration files; I have introduced some of them before ( Ref ).
I am not blindly opposed to using the SSH (Struts/Spring/Hibernate) stack. In my eyes, what the SSH stack really delivers is standardized development; using it does not improve performance by much.
The SSH stack suits very large project teams of hundreds of people: a company that needs to keep growing its team has to choose technologies that are widely recognized and familiar in the market, and since the SSH (Struts/Spring/Hibernate) stack is relatively mature, it becomes the default choice.
For a small team of highly skilled developers, however, choosing a more concise framework can genuinely speed up development; dropping the SSH stack early and picking simpler technology is the wiser choice for small-team development.
