Foreword
During project research and development we go through requirements review, development review, coding, test case review, project testing, and product and UI acceptance, investing a great deal of manpower and energy along the way.
Yet the final launch stage is always full of uncertainty. A feature can pass N rounds of testing without a problem and still show bugs as soon as it goes online (the curse of Murphy's law).
After years of accumulated experience and painful lessons, we have sorted out the known and potential risk points in detail, in the hope that every project launch can be solid and smooth.
This article covers preventing online risks from three angles: operational precautions, double-post checks and self-inspection, and monitoring and alarming.
1. Operational precautions
This covers four categories of prevention: R&D, configuration, operation and maintenance, and approval.
1.1 R&D prevention
1.1.1 General layer
1. Unified standard for Loading/Confirm
2. Specification for error pages, skeleton screens, no-data states and network exceptions
3. Specification for announcements and pop-up windows
1.1.2 Code layer
1. Use https, prohibit non-jd sources, and verify availability on the external network
2. Distinguish environments through system variables
3. Commit specification
   ▪ Each commit contains one independent piece of functionality
   ▪ All R&D code is submitted to Coding
4. Unified IDE and scaffolding
5. Unified versions of node.js, npm, joyer, taro and other tools in the development environment
6. Unified common components
   ▪ Immersive navigation
   ▪ Shared components, with version-support detection and fault-tolerant handling
   ▪ Ladder components
   ▪ Encryption and anti-brushing: AKS, AAR
   ▪ Risk-control device fingerprint
   ▪ Singularity event tracking (buried points)
   ▪ Download and app-evoke component
   ▪ Share component and golden password
7. Data processing
   ▪ Paged loading that avoids endless request loops
   ▪ Gateway-layer error code handling
   ▪ Server interface-layer error code handling
   ▪ Exceptions from main-function interfaces get the necessary handling, such as jumping to an error page or showing a retry popup
   ▪ Basic plan
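The "paged loading that avoids endless request loops" item can be illustrated with a small sketch. It assumes a hypothetical `fetch_page` callable that returns a batch of items plus a `has_more` flag; neither the name nor the return shape comes from a real JD component. The idea is simply that the loop has two independent exits (an empty page and a hard page cap), so a misbehaving backend cannot keep the client requesting forever.

```python
from typing import Callable, List, Tuple

# fetch_page is a hypothetical callable: page number -> (items, has_more)
FetchPage = Callable[[int], Tuple[List[dict], bool]]


def load_all_pages(fetch_page: FetchPage, max_pages: int = 50) -> List[dict]:
    """Collect items page by page, stopping on any sign that paging has stalled."""
    items: List[dict] = []
    # Hard cap: the loop can never issue more than max_pages requests.
    for page in range(1, max_pages + 1):
        batch, has_more = fetch_page(page)
        if not batch:
            # An empty page ends paging even if the server still claims has_more.
            break
        items.extend(batch)
        if not has_more:
            break
    return items
```

With a backend stub that always reports `has_more=True`, `load_all_pages(stub, max_pages=5)` stops after five requests instead of looping.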
1.1.3 UI layer
1. Whether there are animation effects
2. Whether audio and video are compatible
3. Whether there is performance lag
1.1.4 Security layer
1. Coding problems: whether the code passes eslint
2. Compatibility: handling of syntax, methods and attributes, the minimum supported version of component libraries, etc.
3. Logic: whether the code logic matches the expected function
4. Abnormal conditions: whether downgrade, fault tolerance, timeout and similar cases are considered
5. User experience: whether new or modified functions degrade performance or the experience
6. Security: ciphertext transmission, anti-brushing, script injection, etc.
7. Mock: whether the display of mock data is handled correctly
8. Sensitive data: whether data processing carries a potential customer-complaint risk
1.2 Configuration prevention
1.2.1 R&D configuration
1. On the content configuration platform, take care when re-operating a configuration that is already online so that production is not affected; prefer adding a new configuration
2. The configuration data types do not support time controls, so do not configure time or timestamp data there
3. Verify data before configuring it (for example, whether the link format is correct and whether the data length needs to be limited)
4. Data fault tolerance: important data must be marked as a required item ("required": true)
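The pre-configuration checks above (link format, length limits, required items) can be sketched as a small validator. This is only an illustration under assumptions: the `SCHEMA` layout, the field names and the limits are hypothetical, and a real configuration platform will have its own schema format.

```python
from urllib.parse import urlparse

# Hypothetical schema: field name -> rules. Names and limits are illustrative only.
SCHEMA = {
    "title": {"required": True, "max_length": 20},
    "link": {"required": True, "https_url": True},
    "subtitle": {"required": False, "max_length": 40},
}


def validate_config(config: dict, schema: dict = SCHEMA) -> list:
    """Return a list of readable problems; an empty list means the config passed."""
    errors = []
    for field, rules in schema.items():
        value = config.get(field)
        if value is None or value == "":
            if rules.get("required"):
                errors.append(f"{field}: required field is missing")
            continue
        if not isinstance(value, str):
            errors.append(f"{field}: expected a string")
            continue
        max_length = rules.get("max_length")
        if max_length is not None and len(value) > max_length:
            errors.append(f"{field}: longer than {max_length} characters")
        if rules.get("https_url"):
            parsed = urlparse(value)
            # Require both the https scheme and a host part.
            if parsed.scheme != "https" or not parsed.netloc:
                errors.append(f"{field}: not a well-formed https link")
    return errors
```

Running the validator before a configuration is saved turns each checklist item into an explicit error message rather than a silent front-end failure.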
1.2.2 Operation configuration
1. All production configurations must be completed before the event goes live
2. Re-operating a configuration that has already launched requires confirmation with product and R&D, plus double-post confirmation within the operations team
3. Verify in the pre-release environment that the data is consistent with production (prize type, coupon type, seckill time, task type, etc.)
4. Rewards, coupons and tasks must be verified as issuable or claimable before they are configured and displayed on the front end
5. When benefit points in the event lead to other event pages, make sure the benefit points work normally on the secondary event page
1.2.3 Environment Configuration
1. All new projects are initialized with the joyer scaffolding
2. Unified commands for the local environment, packaging, publishing environments, etc.
3. vconsole, comments, etc. are only included in non-production packages
4. Each project must have a mock environment and use mock data to verify the various situations, instead of modifying or commenting out code
5. Whether old projects follow the unified webpack and vue-cli specification
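Keeping debugging tools out of production packages is easier when the environment is chosen in one place from a system variable rather than by editing code. The sketch below assumes a variable named `APP_ENV` and placeholder URLs; both are hypothetical. It defaults to production when the variable is missing, so a misconfigured build can never switch debugging tools on by accident.

```python
import os

# Hypothetical per-environment settings; the URLs are placeholders.
SETTINGS = {
    "development": {"api_base": "https://dev.example.com", "debug_tools": True},
    "pre": {"api_base": "https://pre.example.com", "debug_tools": True},
    "production": {"api_base": "https://api.example.com", "debug_tools": False},
}


def load_settings(env=None) -> dict:
    """Pick settings from APP_ENV (an assumed variable name); fail fast on unknown values."""
    name = env if env is not None else os.environ.get("APP_ENV", "production")
    if name not in SETTINGS:
        # An unknown name is an error, not a silent fallback to some other environment.
        raise ValueError(f"unknown environment: {name!r}")
    return SETTINGS[name]
```

The same pattern applies in a front-end build, where the bundler injects the environment name at build time.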
1.3 Operation and Maintenance Prevention
1.3.1 Domain name resolution operation
1. The IP does not exist under the application
2. Whether there is an accessible project on the instance
3. Work with operation and maintenance to check whether the project is accessible
4. Make sure the operation and maintenance work order is accepted and the developer is notified to verify it promptly
1.3.2 CDN operation
1. Make sure the origin site domain name and the accelerated domain name are different
2. Make sure the uploaded accelerated content matches the distribution method (images, large files, videos, live streams)
3. Make sure the files under the accelerated domain name are static resources (consider whether dynamic and static content need to be separated)
4. Make sure the origin site IP is correct
5. A completed access application only means CDN onboarding is done; the DNS resolution change must still be configured
6. Query the domain name to check whether the resolution has taken effect in all regions of the country
1.3.3 HSTS operation
1. Make sure the client or application uses https, and check whether the forced https redirect causes problems
2. For multiple applications or domain names under the same VIP, notify all parties to confirm whether there is any impact
1.3.4 http2 operation
The domain name must use https before http2 can be enabled
1.3.5 DDoS operation
Not yet available for CDN domain names
1.3.6 Expansion operation
1. Machine approval is complete and all execution results are confirmed successful
2. Confirm the configuration of the newly added machines and the project deployment
3. For hybrid deployment with multiple applications, every application must be deployed and verified (with operation and maintenance cooperating)
4. In a mixed deployment, each application must go through a multiplexing work order
5. Make sure to restart the machines after the expansion is complete
1.3.7 Shrinking operation
1. Make sure the CDN domain name resolves to the intranet VIP (if it resolves to a RIP, go through the change-VIP work order process)
2. In a mixed deployment, each application must go through a work order
3. Pre-release machines need an additional work order for the pre-release domain name reverse proxy change
1.3.8 Offline operation
1. Confirm whether taking the machine offline affects production (for an independently deployed project)
2. Follow the steps in order: remove traffic first, then remove the machine, and only then take it offline
1.3.9 Rollback operation
1. Use the rollback operation in JDOS
2. When selecting the package to roll back to, check carefully that it is the last online version
1.3.10 Bastion machine operation
1. The container must start normally
2. For an image built from a dockerfile, only root permission can be applied for, and port 22 must be open
3. For public images, only root permission can be applied for, and port 22 must be open
1.4 Approval precautions
1. Whether it has been approved at the test node
2. Whether it has been reviewed by the leader
3. Whether development and approval permissions are separated
2. Double post & self-inspection
Double-post self-inspection before going online is a standard process we have established. Before going online, R&D personnel must work through the list below and ask another colleague to help check the project code (those involved are often blind to what a bystander sees clearly).
2.1 Front end
2.1.1 Environment inspection
1. Whether the domain name is connected to the CDN
2. Whether the jen configuration is consistent
3. Whether jen is fully online
4. Whether gzip is enabled
5. Whether the number of deployed machines matches expectations
2.1.2 Common components
1. Whether AAR is integrated
2. Whether AKS is integrated
3. Whether risk control is integrated
4. Whether SGM monitoring is added
2.1.3 Requirements Check
1. Whether this release contains content that is not part of this product iteration
2. Whether all resources introduced by the page belong to this release
3. Whether the resources going online are the version tested in pre-release
2.1.4 Code inspection
1. Whether there is third-party code injection
2. Whether there are sensitive fields
3. Whether debugging tools such as log/mock/Vconsole have been removed
4. Whether the project contains resources on http domain names
5. Whether the server interface is online
6. Check that all resource domain names are online, external-network domain names
7. Whether the hashes of the packaged resource files are those deployed to production
8. Whether the repository master code is up to date
9. For hybrid deployment applications, whether this launch updates only the current application's code
10. For Babel custom components, whether this change considers lower versions and whether it affects templates referenced in other projects
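The check for http domain name resources lends itself to automation. The following is a minimal sketch, not an existing tool: it scans text (for example the contents of a built HTML, CSS or JS file) for plain `http://` URLs. The `ALLOWED_PREFIXES` allowlist is an assumption added for illustration, covering identifiers such as XML namespaces that are never actually fetched.

```python
import re

# Matches http:// URLs up to the next whitespace, quote, bracket or parenthesis.
INSECURE_URL = re.compile(r"http://[^\s\"'()<>]+")

# Hypothetical allowlist, e.g. XML namespace identifiers that are never requested.
ALLOWED_PREFIXES = ("http://www.w3.org/",)


def find_insecure_urls(text: str) -> list:
    """Return every plain-http URL in text that is not allowlisted, in order of appearance."""
    return [
        url
        for url in INSECURE_URL.findall(text)
        if not url.startswith(ALLOWED_PREFIXES)
    ]
```

Running this over every file in the build output and failing the release when the result is non-empty makes the manual check repeatable.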
2.1.5 Regression check
1. Verify over 4G/5G
2. After going online, check that the CDN serves the latest resources
3. Post-launch verification: for mixed deployment projects, whether the latest branch is merged into master
2.1.6 Process work order
1. Double-post inspection confirmed as passed
2. UI walkthrough confirmed as passed
3. Risk control acceptance confirmed as passed
4. Security test work order submitted and completed
2.2 Server
2.2.1 Monitoring checkpoints
1. Business monitoring
   ◦ Orders
   ◦ Log exceptions
   ◦ SQL exceptions
   ◦ SQL time consumption
   ◦ Business time-consumption monitoring
   ◦ Abnormal business status monitoring
   ◦ Abnormal process monitoring
2. Basic monitoring
   ◦ The first type of operation and maintenance: the hardware, virtual machines, network, etc. that the application system depends on
   ◦ The second type of operation and maintenance: the operating system level, such as cpu, memory, hard disk, IO, etc.
   ◦ The third type of operation and maintenance: the middleware level, such as database, cache, tomcat, nginx, etc.
   ◦ The fourth type of operation and maintenance: the application itself, such as JVM monitoring, log collection, etc.
   ◦ The fifth type of operation and maintenance: new-function launch operations and daily emergency drills
2.2.2 General self-check points
1. Launch order
   ◦ When several internal applications go online together, whether dependencies and launch order have been considered
   ◦ Whether the related table structure must be created, or mq, rpc, etc. registered, before the application goes online
   ◦ Whether this release involves external applications, needs other modules to cooperate, or has launch-sequence requirements
2. Security
   ◦ Whether external network security issues are considered, such as SQL injection, XSS attacks, sensitive information encryption, and account brute-forcing
   ◦ Whether interface communication security is considered: signature verification, secret key management, etc.
   ◦ Whether a whitelist, certificate, or SMS verification should be added for the various kinds of access
   ◦ Whether sensitive database fields are encrypted
3. Anti-brushing and anti-duplication
   ◦ Anti-duplication mechanism: in which states and scenarios an order may be submitted repeatedly
   ◦ Whether there is a limit on accepting multiple identical orders within the same second
   ◦ Whether the platform's unique ID generation can produce duplicates
   ◦ Whether optimistic locking is used for all request entry points, timers and API requests; consider concurrent and repeated processing, and check the number of rows affected by each update
4. Exception handling
   ◦ Whether the abnormal branches outside each business's main branch are handled
   ◦ Do not swallow the detailed exception stack trace
   ◦ Whether third-party interaction is fully handled
      ▪ Catch and handle IOException
      ▪ Print the URL when an IOException occurs, to make alarm troubleshooting easier
      ▪ Set both a connection timeout and a read timeout
      ▪ Whether external network access must go through a proxy
      ▪ Whether a whitelist entry must be added again
      ▪ Whether the third party enforces a maximum number limit
      ▪ Set the number of http connections sensibly and close connections
5. Log specification
   ◦ Whether log printing follows a business-specific convention that makes logs easier to inspect
6. Scheduled tasks
   ◦ Whether a business timer can run concurrently and process data repeatedly, and whether its concurrent setting is false
   ◦ Whether the amount of data processed by a scheduled task matches the expected size, and whether it can keep growing in abnormal situations
7. SQL
   ◦ Whether a unique index is used
   ◦ Whether the unique index is used correctly; for example, with a joint unique index over multiple fields, whether any field can be null
   ◦ Whether update and select statements touch the expected number of rows
   ◦ Whether complex SQL is avoided
   ◦ Whether the SQL execution plan has been checked, whether it hits an index, and whether the SQL could become slow as the business grows
8. Use of cache
   ◦ Whether a timeout is set, whether it is set correctly, and whether it is in seconds or milliseconds
   ◦ Evaluate solutions to cache synchronization problems (database pessimistic locks + transactions + sorting, redis pessimistic locks, CAS)
   ◦ Be clear about the usage scenarios for redis
9. Use of transactions
   ◦ Consider deadlock scenarios when using transactions in code
10. Management console
   ◦ Whether management console functions such as downloads and queries have count and frequency limits
11. Type conversion
   ◦ Whether the type conversion is correct, and whether the value is checked for null before converting
12. Number of connections and threads
   ◦ Whether thread creation is bounded by a reasonable thread count
   ◦ Whether the connection pool sizes of related middleware are set reasonably
13. Response code parsing
   ◦ Whether response codes are parsed correctly, especially for special cases such as network exceptions, caught exceptions, and "no such order"
   ◦ For a network exception or "order does not exist" (caused by a network exception, or by a query that arrives before the transaction), do not set the order to failed unless the failure is explicit
14. System design issues
   ◦ Asynchronous to synchronous: if the back-end asynchronous component is down or restarting, synchronous dispatch stays blocked
   ◦ Whether there is a single node
   ◦ Whether distributed deployment is supported
   ◦ Use optimistic or pessimistic locks to prevent concurrent modification
15. Timeout settings
   ◦ Whether connection and response timeouts are set at every RPC call site, including HTTP, redis, database, etc.
16. Financial attributes
   ◦ For accounting functions, whether balance and loss stay accurate under concurrency
   ◦ Whether the unit and precision of amounts are correct
   ◦ Whether amount type conversions are correct
17. Time writing
   ◦ Whether the time format has precision problems, and whether rounding on write to the database can cause query mismatches
   ◦ Database time zone configuration: whether UTC+8 is set, and whether the activity's times use UTC+8
18. Configuration files
   ◦ Whether the online configuration file is separated from the release package and has been configured on the platform in advance
   ◦ If a configuration file cannot be separated, whether the copy submitted with the code has been checked to contain the production configuration
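One check from the list above is worth seeing in code: the anti-duplication item that asks for optimistic locking and for judging the number of rows affected by an update. The sketch below uses Python's built-in `sqlite3` purely for illustration; the `orders` table, its `version` column and the status values are hypothetical. The update only succeeds if the row still carries the version that was read earlier, and the caller decides success from the affected-row count.

```python
import sqlite3


def claim_order(conn: sqlite3.Connection, order_id: int, expected_version: int) -> bool:
    """Move an order from NEW to PAID only if nobody changed it since it was read."""
    cur = conn.execute(
        "UPDATE orders SET status = 'PAID', version = version + 1 "
        "WHERE id = ? AND version = ? AND status = 'NEW'",
        (order_id, expected_version),
    )
    conn.commit()
    # rowcount is the number of rows the UPDATE changed:
    # 1 means this caller won, 0 means a concurrent writer got there first.
    return cur.rowcount == 1


# Hypothetical table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 'NEW', 0)")
conn.commit()

first = claim_order(conn, 1, expected_version=0)   # matches the stored version
second = claim_order(conn, 1, expected_version=0)  # stale version, changes no rows
```

Here `first` is `True` and `second` is `False`: the repeated request is rejected by the database itself, without an explicit lock.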
2.2.3 Resource Support Items
1. Whether operations needs to provide additional support, such as parameter configuration in the operations back end
2. Whether operation and maintenance needs to provide additional support, such as configuring the network environment, adding certificate keys, creating file directories, or adding and deleting jar packages
3. Whether the DBA needs to provide additional support, such as adding database access whitelist entries for new modules
3. Monitoring and alarming
Monitoring and alarming is a necessary mechanism for managing risk after launch. Once an alarm fires, we can investigate and resolve the problem promptly and prevent further customer complaints.
1. RPC layer monitoring
   ◦ Timeout monitoring
   ◦ Exception error reporting
   ◦ Availability
2. Cache monitoring
   ◦ redis connection exceptions
   ◦ r2m availability rate
   ◦ r2m capacity
   ◦ r2m master-slave switching
3. MQ monitoring
   ◦ MQ duplicate receipt
   ◦ MQ send failure
   ◦ MQ processing failure
4. Task monitoring
   ◦ Scheduled task not executed
   ◦ Scheduled task timeout
   ◦ Scheduled task execution exception
5. Business exception monitoring
   ◦ Lock acquisition exception
   ◦ AKS and anti-brushing failure exception
   ◦ Exceptions in tasks such as receiving or accepting tasks
   ◦ The user group does not have permission
6. JVM monitoring
   ◦ Full GC logs and alarms
   ◦ JVM monitoring alarms
7. Container monitoring
   ◦ Instance survival
   ◦ CPU load and usage
   ◦ Machine memory
8. DB monitoring
   ◦ DB-layer CRUD execution exceptions
   ◦ cleverBD regular slow SQL inspection
   ◦ DB query operations that take too long
   ◦ Whether the approver for the online environment (application, database, configuration, etc.) is the current leader
9. Benefit point monitoring
   ◦ Marketing reward issuance failed
   ◦ Insufficient stock
   ◦ Event not started or already ended
   ◦ Blocked by risk control
   ◦ Anti-duplication failure
   ◦ The number of benefits received by a single user exceeds the configured warning line
   ◦ The overall distribution for the event exceeds the configured warning line
   ◦ Other exception failures
10. Business response code monitoring
   ◦ Configure normal and abnormal codes for third-party interfaces to monitor availability
11. Configuration verification
   ◦ Exception when fetching the configuration
   ◦ A required field is missing from the configuration
   ◦ A field in the configuration has an abnormal type
   ◦ No configuration matches the current time
   ◦ The campaign has ended but still receives a large number of visits
   ◦ Time points conflict across multiple configurations
   ◦ The configured reward ID, task ID, etc. cannot be found through the third-party interface
   ◦ Every configuration change made by operations is sent to R&D as a graded alarm listing the modified items
12. Activity qualification verification
   ◦ Alarm when a verification is bypassed
   ◦ The reward is meant for old users, but a new user passes pre-verification and enters the award process
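As an illustration of the configuration verification item "time points conflict across multiple configurations", the sketch below finds pairs of configurations whose active windows overlap. The tuple layout and the half-open interval convention are assumptions made for the example, not the format of any particular configuration platform.

```python
from datetime import datetime
from typing import List, Tuple

# Each entry is (config_id, start, end); the window is treated as half-open [start, end).
Window = Tuple[str, datetime, datetime]


def find_time_conflicts(windows: List[Window]) -> List[Tuple[str, str]]:
    """Return pairs of config ids whose active windows overlap."""
    ordered = sorted(windows, key=lambda w: w[1])
    conflicts = []
    for i, (id_a, _, end_a) in enumerate(ordered):
        for id_b, start_b, _ in ordered[i + 1:]:
            if start_b >= end_a:
                # Sorted by start time, so no later window can overlap id_a either.
                break
            conflicts.append((id_a, id_b))
    return conflicts
```

A monitoring job could run a check like this whenever operations saves a configuration and raise an alarm for each returned pair; because windows are half-open, a configuration that starts exactly when another ends is not reported.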
Author: Hu Jun, JD Technology
Source: JD Cloud developer community. Please credit the source when reprinting.