[Safeguarding Production] Preventing and Investigating Go-Live Risks | JD Cloud technical team

Foreword

Over the course of a project, a team goes through requirements review, development review, coding, test case review, project testing, and product and UI acceptance, investing a great deal of manpower and energy along the way.

The final launch stage, however, is always full of uncertainty. A feature can pass N rounds of testing without a problem and then show a bug the moment it goes online (Murphy's law at work).

Drawing on years of experience and some painful lessons, we have catalogued the known and potential risk points in detail, in the hope that every project launch can be solid, safe, and smooth.

This article covers preventing online risks from three angles: operational prevention, double-post (two-person) check and self-inspection, and monitoring and alarming.

1. Operational prevention

This covers four categories: R&D prevention, configuration prevention, operation and maintenance prevention, and approval prevention.

1.1 R&D prevention

1.1.1 General layer

1. Standardize Loading and Confirm interactions
2. Specify the error page, skeleton screen, empty-data state, and network-exception state
3. Specify announcements and pop-up windows

1.1.2 Code layer

1. Use HTTPS, prohibit non-JD resource sources, and verify that resources are reachable from the external network
2. Switch environments through system variables
3. Commit specification
▪Each commit contains the code for one independent function
▪All R&D code is submitted to Coding
4. Use a unified IDE and scaffolding
5. Unify the versions of node.js, npm, joyer, taro, and other tools in the development environment
6. Unify common components
▪Immersive navigation
▪Shared components, with version-support detection and fault-tolerant handling
▪Ladder components
▪Encryption and anti-brushing: AKS, AAR
▪Risk-control device fingerprint
▪Singularity event tracking
▪Download and evoke (app wake-up) component
▪Share component and golden password
7. Data processing
▪Paged loading must avoid endless request loops
▪Handle gateway-layer error codes
▪Handle server interface-layer error codes
▪When a main-function interface fails, handle it properly, for example by jumping to the error page or showing a retry pop-up
▪Basic fallback plan
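
The paged-loading item in the list above can be illustrated with a short sketch. It is written in Python for brevity rather than in the project's front-end stack, and `fetch_page`, `MAX_PAGES`, and the response fields `items` and `has_more` are hypothetical names, not an actual gateway API. The idea is to stop on an empty page, on a cleared "has more" flag, or at a hard page cap, so that a misbehaving interface cannot keep the client requesting forever.

```python
from typing import Callable, Dict, List

MAX_PAGES = 50  # hypothetical hard cap; tune per business


def load_all(fetch_page: Callable[[int], Dict], max_pages: int = MAX_PAGES) -> List:
    """Collect items page by page, stopping on any of three exit conditions."""
    items: List = []
    for page in range(1, max_pages + 1):
        resp = fetch_page(page)
        batch = resp.get("items") or []
        items.extend(batch)
        # Exit 1: the server says there is nothing more.
        # Exit 2: an empty page, even if has_more is (wrongly) still true.
        if not resp.get("has_more") or not batch:
            break
    # Exit 3: the for loop ends at max_pages regardless of what the server says.
    return items


def fake_fetch(page: int) -> Dict:
    """Stand-in for a real request: three pages of two items each."""
    data = {1: [1, 2], 2: [3, 4], 3: [5, 6]}
    return {"items": data.get(page, []), "has_more": page < 3}


def broken_fetch(page: int) -> Dict:
    """Simulates a faulty interface that always claims there is more data."""
    return {"items": [page], "has_more": True}
```

With `broken_fetch`, the loop still terminates because of the page cap, which is the property the checklist item asks for.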

1.1.3 UI layer

1. Whether there are animation effects
2. Whether audio and video are compatible
3. Whether there is performance lag

1.1.4 Security layer

1. Coding: whether the code passes ESLint
2. Compatibility: whether syntax, methods and attributes, and the minimum supported component-library version are handled
3. Logic: whether the code logic matches the expected function
4. Exceptions: whether degradation, fault tolerance, timeouts, and similar cases are considered
5. User experience: whether new or modified functions hurt performance or usability
6. Security: encrypted transmission, anti-brushing, script injection, etc.
7. Mock: whether the display of mock data is handled correctly
8. Sensitive data: whether data handling carries a potential risk of customer complaints

1.2 Configuration prevention

1.2.1 R&D configuration

1. On the content configuration platform, take care that re-editing a live configuration does not affect production; prefer adding a new configuration instead
2. The configuration data types do not support time controls, so do not configure times or timestamps there
3. Validate data before configuring it (for example, whether a link's format is correct and whether the data length needs a limit)
4. Handle data fault-tolerantly; important data must be marked as a required item with "required": true
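
As a rough illustration of items 3 and 4, the sketch below validates a configuration entry before it is saved. It is written in Python as a stand-alone example; `FIELD_RULES`, `validate_config`, and the field names are hypothetical and do not describe the content configuration platform's real schema. Only the `"required": True` flag is taken from the item above.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical schema: field name -> rules. "required" mirrors the item above.
FIELD_RULES: Dict[str, Dict] = {
    "title": {"required": True, "max_length": 20},
    "link": {"required": True, "is_link": True},
    "subtitle": {"required": False, "max_length": 40},
}


def validate_config(config: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for name, rules in FIELD_RULES.items():
        value = config.get(name)
        if value is None or value == "":
            if rules.get("required"):
                errors.append(f"{name}: required field is missing")
            continue
        max_length = rules.get("max_length")
        if max_length is not None and len(value) > max_length:
            errors.append(f"{name}: longer than {max_length} characters")
        if rules.get("is_link"):
            parsed = urlparse(value)
            # Only https links with a host are accepted.
            if parsed.scheme != "https" or not parsed.netloc:
                errors.append(f"{name}: not a valid https link")
    return errors
```

Returning a list of problems, rather than raising on the first one, lets the person configuring the page fix everything in one pass.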

1.2.2 Operation configuration

1. All production configurations must be completed before the event goes live
2. Changes to a configuration that is already live must be confirmed with product and R&D, and double-checked by a second person within the operations team
3. Verify in the pre-release environment that the data matches production (prize type, coupon type, seckill time, task type, etc.)
4. Verify that rewards, coupons, and tasks can be issued or claimed normally before configuring them for display on the front end
5. When benefit points from one event are used on other event pages, make sure they work normally on the secondary event page

1.2.3 Environment Configuration

1. All new projects are initialized with joyer scaffolding
2. Unify the commands for the local, packaging, and publishing environments
3. Configure vconsole, comments, and similar debugging aids only in non-production packages
4. Every project must have a mock environment; use mock data to verify each scenario instead of modifying or commenting out code
5. Check whether old projects follow the unified webpack and vue-cli specification
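
One way to keep debugging aids out of production packages is to derive every setting from a single environment variable. The sketch below shows the idea in Python rather than in the joyer or webpack tooling itself; `APP_ENV`, `SETTINGS`, and the URLs are placeholders chosen for illustration. Unknown or missing values fall back to production, so a typo can never switch debug tools on by accident.

```python
import os
from typing import Dict, Mapping

# Hypothetical per-environment settings; names and URLs are placeholders.
SETTINGS: Dict[str, Dict] = {
    "development": {"api_base": "https://dev.example.com", "debug_tools": True},
    "prerelease": {"api_base": "https://pre.example.com", "debug_tools": True},
    "production": {"api_base": "https://api.example.com", "debug_tools": False},
}


def load_settings(env: Mapping[str, str] = os.environ) -> Dict:
    """Pick settings from APP_ENV; unknown or missing values mean production."""
    name = env.get("APP_ENV", "production")
    return SETTINGS.get(name, SETTINGS["production"])
```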

1.3 Operation and Maintenance Prevention

1.3.1 Domain name resolution operation

1. Confirm that the IP exists under the application
2. Whether the instance hosts an accessible project
3. Work with operations and maintenance to check that the project is accessible
4. Make sure the operations work order is accepted and the developer is notified promptly to verify

1.3.2 CDN operation

1. Make sure the origin domain name and the accelerated domain name are different
2. Make sure the uploaded accelerated content matches the distribution type (images, large files, videos, live streams)
3. Make sure the files under the accelerated domain name are static resources (consider whether dynamic and static content need to be separated)
4. Make sure the origin IP is correct
5. An approved access request only means the CDN side is ready; the DNS resolution must still be changed
6. Look up the domain name to check that the resolution has taken effect in all regions of the country

1.3.3 HSTS operation

1. Make sure the client or application uses HTTPS, and check whether a forced HTTPS redirect causes problems
2. When several applications or domain names share a VIP, notify all parties to confirm whether they are affected

1.3.4 http2 operation

Make sure the domain name uses HTTPS; only then can HTTP/2 be enabled

1.3.5 DDoS operation

Temporarily unavailable for CDN domain names

1.3.6 Expansion operation

1. Confirm that machine approval is complete and every execution result is successful
2. Verify the configuration and project deployment on the newly added machines
3. For mixed deployments with multiple applications, deploy and verify every application (ask operations and maintenance to assist)
4. For mixed deployments, make sure each application goes through a reuse work order
5. Restart the machines after the expansion is complete

1.3.7 Shrinking operation

1. Make sure the CDN domain name resolves to the intranet VIP (if it resolves to a RIP, go through the VIP-change work order process)
2. For mixed deployments, make sure each application goes through a work order
3. For pre-release machines, also file a reverse-proxy change work order for the pre-release domain name

1.3.8 Offline operation

1. Confirm whether taking the machine offline affects production (for an independently deployed project)
2. Follow the steps in order: remove traffic first, then remove the machine, and only then take it offline

1.3.9 Rollback operation

1. Trigger the rollback from JDOS
2. Carefully check that the package selected for rollback is the previous online release

1.3.10 Bastion machine operation

1. The container must start normally
2. For images built from a Dockerfile, only root permission can be requested, and port 22 must be open
3. For public images, only root permission can be requested, and port 22 must be open

1.4 Approval prevention

1. Whether the test node has approved it
2. Whether the leader has reviewed it
3. Whether development and approval permissions are separated

2. Double-post check & self-inspection

The double-post self-inspection before going online is a standard process we have established. Before launch, R&D staff must work through the checklists below and ask another colleague to review the project code (those involved are often blind to what a bystander sees clearly).

2.1 Front end

2.1.1 Environment check

1. Whether the domain name is connected to the CDN
2. Whether the jen configuration is consistent
3. Whether jen is fully online
4. Whether gzip is enabled
5. Whether the number of deployed machines matches expectations

2.1.2 Common components

1. Whether AAR is integrated
2. Whether AKS is integrated
3. Whether risk control is integrated
4. Whether SGM monitoring has been added

2.1.3 Requirements Check

1. Whether the resources going online include content outside this iteration's product requirements
2. Whether all resources referenced by the page belong to this release
3. Whether the resources going online are the version tested in pre-release

2.1.4 Code inspection

1. Whether there is third-party code injection
2. Whether there are sensitive fields
3. Whether debugging aids such as log/mock/vConsole have been removed
4. Whether the project references any resources over plain http
5. Whether the server interface is online
6. Whether all resource domain names are production, external-network domain names
7. Whether the hashes of the packaged resource files match those deployed to production
8. Whether the repository's master branch is up to date
9. For mixed-deployment applications, whether this launch updates only the current application's code
10. For Babel custom components, whether this change accounts for lower versions and whether it affects templates referenced by other projects
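
Some of these checks can be automated. As a minimal sketch of item 4, the Python script below scans a build directory for plain-http URLs. The regular expression, the `SCAN_SUFFIXES` set, and the function names are assumptions made for illustration; a real project would tune them to its own file types and allow-list.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Matches plain-http URLs; "https://" does not contain the literal "http://".
HTTP_URL = re.compile(r"http://[^\s\"'<>)]+")
# Hypothetical set of file types worth scanning in a front-end build.
SCAN_SUFFIXES = {".js", ".ts", ".vue", ".css", ".html", ".json"}


def find_http_urls(text: str) -> List[str]:
    """Return every plain-http URL found in a piece of text."""
    return HTTP_URL.findall(text)


def scan_project(
    root: Path, suffixes: Iterable[str] = SCAN_SUFFIXES
) -> List[Tuple[str, int, str]]:
    """Walk the tree and report (file, line number, url) for each hit."""
    wanted = set(suffixes)
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in wanted:
            continue
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for url in find_http_urls(line):
                hits.append((str(path), lineno, url))
    return hits
```

Running `scan_project(Path("dist"))` before release and failing the check when the result is non-empty turns the manual item into a repeatable gate.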

2.1.5 Regression check

1. Verify over 4G/5G
2. After launch, check that the CDN serves the latest resources
3. After launch, for mixed-deployment projects, verify that the latest branch has been merged into master

2.1.6 Process work order

1. The double-post check has passed and been confirmed
2. The UI walkthrough has passed and been confirmed
3. Risk control acceptance has passed and been confirmed
4. The security test work order has been submitted and completed

2.2 Server

2.2.1 Monitoring checkpoints

1. Business monitoring
◦Orders
◦Log exceptions
◦SQL exceptions
◦SQL execution time
◦Business latency monitoring
◦Abnormal business status monitoring
◦Abnormal process monitoring
2. Basic monitoring
◦The first type of operation and maintenance: the hardware, virtual machines, network, etc. that the application system depends on
◦The second type of operation and maintenance: the operating system level, such as CPU, memory, hard disk, IO, etc.
◦The third type of operation and maintenance: the middleware level, such as database, cache, tomcat, nginx, etc.
◦The fourth type of operation and maintenance: the application itself, such as JVM monitoring, log collection, etc.
◦The fifth type of operation and maintenance: launch operations for new functions and routine emergency drills

2.2.2 General self-check points

1. Launch order
◦When several internal applications go online together, whether their dependencies and launch order have been considered
◦Whether table structures must be created, or MQ and RPC registered, before the application goes online
◦Whether this release involves external applications, needs other modules to cooperate, or has launch-sequence requirements
2. Security
◦Whether external-network security issues are considered, such as SQL injection, XSS attacks, encryption of sensitive information, and brute-force attacks on accounts
◦Whether interface communication security is considered: signature verification, secret key management, etc.
◦Whether each kind of access should require a whitelist, certificate, or SMS verification
◦Whether sensitive database fields are encrypted
3. Anti-brushing and deduplication
◦Deduplication mechanism: in which states and scenarios an order may be submitted repeatedly
◦Whether there is a limit on accepting several identical orders within the same second
◦Whether the platform's unique ID generation can produce duplicates
◦Whether optimistic locking is used at every request entry point, including timers and API requests. Consider concurrent and repeated processing, and check the number of rows affected by each update
4. Exception handling
◦Whether the exception branches outside each business flow's main path are handled
◦Do not swallow the detailed exception stack trace
◦Whether interaction with third parties is handled completely
▪Catch and handle IOException
▪Log the URL when an IOException occurs, to make alarm troubleshooting easier
▪Set a connection timeout and a read timeout
▪Whether outbound access must go through a proxy
▪Whether the whitelist needs to be updated again
▪Whether the third party imposes a maximum number limit
▪Set a reasonable number of HTTP connections, and close connections when done
5. Log specification
◦Whether logging follows a business-specific convention that makes logs easier to inspect
6. Scheduled tasks
◦Whether business timers can run concurrently or process items repeatedly, and whether the concurrent setting is configured to false
◦Whether each scheduled task processes the expected amount of data, and whether the batch could keep growing in abnormal situations
7. SQL
◦Whether a unique index is used
◦Whether the unique index is used correctly; for example, when several fields form a composite unique index, whether any of them can be null
◦Whether update and select statements affect the expected number of rows
◦Whether complex SQL is avoided
◦Whether the execution plan has been checked, whether the query hits an index, and whether it could become slow SQL as the business grows
8. Use of cache
◦Whether an expiry time is set, whether it is set correctly, and whether its unit is seconds or milliseconds
◦Evaluate solutions to cache synchronization problems (database pessimistic lock + transaction + ordering, redis pessimistic lock, CAS)
◦Be clear about the usage scenarios for redis
9. Use of transactions
◦Consider deadlock scenarios when using transactions in code
10. Management back office
◦Whether back-office functions such as downloads and queries have count and frequency limits
11. Type conversion
◦Whether type conversions are correct, and whether values are null-checked before conversion
12. Number of connections, number of threads
◦Whether thread creation is bounded by a reasonable thread limit
◦Whether the connection-pool sizes of related middleware are set reasonably
13. Return code parsing
◦Whether response codes are parsed correctly, especially in special cases such as network exceptions, caught exceptions, and order-not-found
◦When parsing responses, a network exception or an order-not-found result (caused by a network error or by a query arriving before the transaction) is not a definite failure and must not be recorded as failed
14. System design issues
◦Asynchronous-to-synchronous conversion: if the back-end asynchronous component goes down or restarts, synchronous dispatch will stay blocked
◦Whether there is a single point of failure
◦Whether distributed deployment is supported
◦Whether optimistic or pessimistic locks are used to prevent concurrent modification
15. Timeout setting
◦Whether every remote call site, including HTTP, redis, and database calls, sets a connection timeout and a response timeout
16. Financial attributes
◦For accounting functions, whether the balance and loss figures stay accurate under concurrency
◦Whether the amount's unit and precision are correct
◦Whether amount type conversions are correct
17. Writing time values
◦Whether the time format and precision are correct, and whether rounding on write to the database could cause query mismatches
◦Whether the database time zone is set to UTC+8, and whether the activity's times use UTC+8
18. Configuration file
◦Whether the production configuration file is separated from the release package and has been configured on the platform in advance
◦If a configuration file cannot be separated, whether the one committed with the code has been checked to contain the production environment's settings
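
The optimistic-locking item in the list above (item 3) asks developers to check how many rows an update affected. A minimal sketch of that pattern follows, using Python's standard `sqlite3` module and an in-memory table. The `item` table, its `version` column, and the function names are hypothetical and chosen only for illustration; the article's own services may use a different database and schema.

```python
import sqlite3


def update_if_version(conn: sqlite3.Connection, item_id: int,
                      new_stock: int, expected_version: int) -> bool:
    """Write only if the row still has the version we read earlier."""
    cur = conn.execute(
        "UPDATE item SET stock = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_stock, item_id, expected_version),
    )
    conn.commit()
    # rowcount == 0 means another writer changed the row first, so the
    # caller must retry or fail instead of assuming success.
    return cur.rowcount == 1


def deduct_stock(conn: sqlite3.Connection, item_id: int, qty: int) -> bool:
    """Read stock and version, then attempt a version-checked deduction."""
    row = conn.execute(
        "SELECT stock, version FROM item WHERE id = ?", (item_id,)
    ).fetchone()
    if row is None or row[0] < qty:
        return False
    stock, version = row
    return update_if_version(conn, item_id, stock - qty, version)


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory table with one item (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item (id INTEGER PRIMARY KEY, stock INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO item VALUES (1, 10, 0)")
    conn.commit()
    return conn
```

A request that arrives with a stale version gets `False` back and leaves the row untouched, which is what makes repeated or concurrent processing safe to detect.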

2.2.3 Resource Support Items

1. Whether operations staff need to provide extra support, such as configuring parameters in the operations back office
2. Whether operations and maintenance need to provide extra support, such as configuring the network environment, adding certificate keys, creating file directories, or adding and removing jar packages
3. Whether the DBA needs to provide extra support, such as adding database access whitelist entries for new modules

3. Monitoring and alarming

Monitoring and alarming is a necessary mechanism for managing risk after launch. Once an alarm fires, we can investigate and resolve the problem quickly and limit further customer complaints.

1. RPC layer monitoring
◦Timeout monitoring
◦Exception reporting
◦Availability
2. Cache monitoring
◦Abnormal redis connection
◦r2m availability rate
◦r2m capacity
◦r2m master-slave switching
3. MQ monitoring
◦Duplicate MQ messages received
◦MQ send failure
◦MQ processing failure
4. Task monitoring
◦Scheduled task not executed
◦Scheduled task timeout
◦Scheduled task execution exception
5. Business exception monitoring
◦Lock acquisition exception
◦AKS and anti-brushing failure exceptions
◦Exceptions when claiming or accepting tasks
◦User group lacks permission
6. JVM monitoring
◦Full GC logs and alarms
◦JVM monitoring alarms
7. Container monitoring
◦Instance survival
◦CPU load & usage
◦Machine memory
8. DB monitoring
◦DB-layer CRUD execution exceptions
◦cleverBD regular slow-SQL inspection
◦DB query takes too long
◦Whether the approver for the online environment (application, database, configuration, etc.) is the current leader
9. Benefit point monitoring
◦Marketing reward issuance failed
◦Insufficient stock
◦Event not started/ended
◦Blocked by risk control
◦Deduplication failure
◦The number of benefits received by a single user exceeds the configured warning line
◦The overall distribution of the event exceeds the configured warning line
◦Other failures
10. Business response code monitoring
◦Configure the normal and abnormal response codes of third-party interfaces to monitor availability
11. Configuration verification
◦Exception when fetching the configuration
◦A required field is missing from the configuration
◦A field in the configuration has the wrong type
◦No configuration matches the current time
◦The campaign has ended but still receives a large number of visits
◦Time ranges of multiple configurations conflict
◦The configured reward ID, task ID, etc. cannot be found through the third-party interface
◦Every configuration change made by operations is sent to R&D as a graded alarm listing the modified items
12. Activity qualification verification
◦Alarm when a verification step is bypassed
◦A reward meant for existing users is claimed by a new user who passed the pre-verification and entered the reward flow
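
Two of the configuration checks in item 11, conflicting time ranges and no configuration matching the current time, are simple to express in code. The sketch below is a Python illustration under stated assumptions: each configuration is reduced to a hypothetical `(id, start, end)` tuple with an exclusive end, and the function names are invented for the example rather than taken from any monitoring platform.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical shape: (config id, start, end), with end exclusive.
Window = Tuple[str, datetime, datetime]


def find_conflicts(windows: List[Window]) -> List[Tuple[str, str]]:
    """Return pairs of config ids whose time ranges overlap."""
    ordered = sorted(windows, key=lambda w: w[1])
    conflicts: List[Tuple[str, str]] = []
    for i, (id_a, _, end_a) in enumerate(ordered):
        for id_b, start_b, _ in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later window overlaps this one
            conflicts.append((id_a, id_b))
    return conflicts


def active_config(windows: List[Window], now: datetime) -> Optional[str]:
    """Return the id of the config covering `now`, or None (the alarm case)."""
    for config_id, start, end in windows:
        if start <= now < end:
            return config_id
    return None
```

A non-empty result from `find_conflicts`, or `None` from `active_config` while an event is expected to be running, would each map to one of the alarms listed above.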

Author: Hu Jun, JD Technology

Source: JD Cloud developer community. Please credit the source when reprinting.

Origin my.oschina.net/u/4090830/blog/10102756