2020 China Postgraduate Mathematical Contest in Modeling Question A

2020 China Postgraduate Mathematical Contest in Modeling, Problem A (proposed by Huawei)

Design and Implementation of Carrier Recovery DSP Algorithm on ASIC Chip

Optical digital signal processing (oDSP) chips are the "heart" of optical transmission. Such chips are usually implemented as application-specific integrated circuits (ASICs). For example, an optical transmission chip manufactured in a 7nm process can reach a capacity of 800Gbps, which corresponds to a capacity of 48Tbps on a single optical fiber and supports the explosive growth of network traffic. Designing a DSP algorithm for an ASIC chip generally comprises two main steps. The first step is to design a compensation algorithm according to the physical model of the channel impairments; at this stage only floating-point calculation needs to be considered. The second step is to transform the algorithm, under chip resource and power constraints, into a fixed-point form that can be realized on an ASIC chip. At this stage the algorithm must be refined into the most basic on-chip operations such as multiplication and addition, and the influence of fixed-point quantization noise must be considered. How to trade off performance against resources to reach the optimal design for a specific scenario is a perennial topic in DSP chip algorithm engineering. This problem takes a key oDSP algorithm, carrier recovery, as an example to discuss the optimal engineering design of algorithm and chip.

We first introduce the basics of the communication system and of algorithm design on ASIC chips.

1. Communication system model

This problem considers a simplified performance evaluation model of a digital communication system, as shown in Figure 1. At the transmitter, the encoded binary sequence is mapped and modulated into symbols on constellation points and sent out. The number of symbols sent per second is called the baud rate fBaud. In the channel the signal is affected by dispersion and phase noise, and additive white Gaussian noise is artificially added. The amount of noise is expressed by the ratio of signal power to noise power. The receiver first compensates for the dispersion, then compensates for the phase noise with the Carrier Recovery (CR) algorithm, and finally makes a decision on each symbol and demaps it into a binary bit sequence. Because of the impairments and noise in the channel, the constellation spreads, so some symbols are decided incorrectly and the received binary sequence differs from the transmitted one, which produces bit errors. The ratio of erroneous bits to total bits is called the bit error rate (BER). As long as the BER is below a certain threshold, the BER after error correction coding can fall below the order of 1e-15, achieving "error-free" transmission in the engineering sense. Error correction coding is not considered in this problem, and BER here refers to the BER obtained directly from the symbol decisions.

The RSNR (Required SNR) cost is commonly used to evaluate algorithm performance. SNR (Signal-to-Noise Ratio) is the ratio of signal power to noise power. For example, in Figure 1, if only additive white Gaussian noise is present, the relationship between SNR and BER is fixed for a given modulation format. The SNR applied to the channel when the BER reaches the threshold is defined as the Required SNR (RSNR), which can be understood as the amount of noise the system can tolerate. When impairments such as phase noise and chromatic dispersion are present, the SNR needed to reach the same pre-correction BER threshold increases, meaning that the system can tolerate less noise. This increase in RSNR is called the RSNR cost. The RSNR cost is a commonly used index of system and algorithm performance; for example, the better the CR algorithm performs, the lower the RSNR cost should be. The noise used to calculate the RSNR in the model of Figure 1 is added "artificially" for the purpose of evaluating system performance. In real optical transmission systems, noise may come from various system components such as electrical devices, optical devices, and optical amplifiers.
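The definition of RSNR cost can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses a widely used closed-form approximation for the BER of Gray-coded 16QAM in additive white Gaussian noise, a hypothetical BER threshold `THRESHOLD`, and a toy "impaired" system modeled as the same curve shifted by 0.25 dB. None of these values come from the problem statement.

```python
import math

def ber_16qam(snr_db):
    """Approximate BER of Gray-coded 16QAM in AWGN (nearest-neighbour approximation)."""
    snr = 10 ** (snr_db / 10)
    return 0.375 * math.erfc(math.sqrt(snr / 10))

def required_snr_db(ber_curve, threshold, lo=0.0, hi=30.0):
    """Bisection for the SNR (dB) where a decreasing BER curve crosses the threshold."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if ber_curve(mid) > threshold:
            lo = mid   # still too many errors: more SNR is needed
        else:
            hi = mid
    return (lo + hi) / 2

THRESHOLD = 2e-2  # hypothetical pre-correction BER threshold (assumption)

# Reference system: white noise only.
baseline = required_snr_db(ber_16qam, THRESHOLD)

# Toy impaired system (assumption): behaves like the reference with 0.25 dB less SNR.
impaired = required_snr_db(lambda s: ber_16qam(s - 0.25), THRESHOLD)

rsnr_cost = impaired - baseline
print(round(rsnr_cost, 3))
```

In an actual evaluation the impaired BER curve would come from simulating the full chain of Figure 1 rather than from a shifted formula.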

The terms in the model are explained further below.

 

Figure 1 Simplified performance evaluation model of digital communication system

[Modulation, constellation diagram and bit error rate BER]

A binary sequence is usually transmitted K bits at a time as one "symbol", so each symbol has 2^K different states. Optical transmission uses the complex amplitude of the light wave to carry the signal, so different points on the complex plane can represent different symbol states. A plot of the symbol states on the complex plane is called a "constellation diagram", and the points on it are called "constellation points". Figure 2(a) shows QPSK (Quadrature Phase Shift Keying) modulation: after noise is superimposed in the channel and the signal is processed by the receiver, the constellation at the receiving end is no longer the four ideal points but is spread out. When the receiver receives a symbol, it decides that the transmitted symbol is the constellation point closest to the received one. If the noise is too large, the received symbol may be decided incorrectly and cause bit errors, as shown by the blue dot in Figure 2(b). The bit error rate (BER) is defined as the ratio of the number of erroneous bits to the total number of transmitted bits. For example, if 50 QPSK symbols (100 bits in total) are transmitted and one symbol is misjudged as an adjacent symbol, so that 1 bit is wrong, the BER is 0.01. BER is the most fundamental indicator of the performance of a communication system.
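The nearest-point decision and the BER count described above can be sketched in a few lines. This is a minimal illustration: the Gray bit mapping `QPSK_MAP` and the function names are assumptions chosen for the example and are not specified by the problem.

```python
# Hypothetical Gray mapping for QPSK: K = 2 bits per symbol, 2**K = 4 states.
QPSK_MAP = {
    (0, 0): complex(1, 1),
    (0, 1): complex(-1, 1),
    (1, 1): complex(-1, -1),
    (1, 0): complex(1, -1),
}

def decide(received):
    """Return the bit pair whose constellation point is closest to `received`."""
    return min(QPSK_MAP, key=lambda bits: abs(received - QPSK_MAP[bits]))

def bit_error_rate(tx_bits, rx_symbols):
    """Decide every received symbol and count the bits that differ."""
    errors = 0
    for bits, r in zip(tx_bits, rx_symbols):
        decided = decide(r)
        errors += sum(b != d for b, d in zip(bits, decided))
    return errors / (2 * len(tx_bits))

# 50 symbols (100 bits); one symbol is pushed across a decision boundary.
tx_bits = [(0, 0)] * 50
rx_symbols = [QPSK_MAP[b] for b in tx_bits]
rx_symbols[7] = complex(-0.2, 0.9)  # now closest to the adjacent point (-1, 1)

ber = bit_error_rate(tx_bits, rx_symbols)
print(ber)  # 0.01
```

With Gray mapping, a decision error to an adjacent point changes only one of the two bits, which is why one wrong symbol out of 50 gives 1 wrong bit out of 100.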

   

Figure 2 Schematic diagram of the constellation diagram and of errors caused by noise

Figure 3 Schematic diagram of the definitions of signal and noise

In Figure 3, the ideal constellation point is denoted by $s_k$ and the received symbol by $r_k$; the noise is then

$$n_k = r_k - s_k \qquad (1)$$

Noise usually follows a normal distribution with a mean value of 0. The variance of the noise is equal to its average power, defined as

$$P_n = \frac{1}{N}\sum_{k=1}^{N} |n_k|^2 \qquad (2)$$

where $N$ is the total number of symbols transmitted. The average signal power is defined as the mean value of the square of the absolute value of the transmitted symbols:

$$P_s = \frac{1}{N}\sum_{k=1}^{N} |s_k|^2 \qquad (3)$$

The ratio of signal power to noise power is defined as the signal-to-noise ratio (SNR),

$$\mathrm{SNR} = P_s / P_n \qquad (4)$$

In engineering, dB is usually used as the unit of SNR, which is defined as

$$\mathrm{SNR}_{\mathrm{dB}} = 10\log_{10}(P_s / P_n) \qquad (5)$$
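Equations (1) to (5) translate directly into code. The following sketch is an illustration only; the function names and the toy data are assumptions made for the example.

```python
import math

def average_power(samples):
    """Mean of |x|^2 over the sequence, as in Equations (2) and (3)."""
    return sum(abs(x) ** 2 for x in samples) / len(samples)

def snr_db(ideal, received):
    """SNR in dB from ideal symbols s_k and received symbols r_k."""
    noise = [r - s for r, s in zip(received, ideal)]        # Equation (1)
    ratio = average_power(ideal) / average_power(noise)     # Equation (4)
    return 10 * math.log10(ratio)                           # Equation (5)

# Toy data: four QPSK points, each displaced by 0.1 along the real axis.
ideal = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
received = [s + 0.1 for s in ideal]

value = snr_db(ideal, received)
print(round(value, 2))  # about 23.01 dB, since Ps = 2 and Pn = 0.01
```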

[Phase noise and CR algorithm]

Phase noise adds a time-varying phase to the signal,

$$S_1(t) = S_0(t)\exp\{j\theta(t)\} \qquad (6)$$

where $S_0(t)$ is the waveform before the phase noise is superimposed, $S_1(t)$ is the waveform after the phase noise is superimposed, and $\theta(t)$ is the phase noise. Communication systems are usually described in discrete form after sampling at equal intervals, with each sample corresponding to the waveform at a certain instant. For phase noise, the phase difference between samples k+1 and k is expressed as:

$$d\theta = \theta_{k+1} - \theta_k = \sqrt{2\pi \cdot \frac{LW}{f_b}} \cdot X_k \qquad (7)$$

where LW is the laser linewidth, in kHz, f_b is the baud rate, and X_k is a random variable with a mean value of 0 and a variance of 1. A typical evolution of phase noise over time is shown in Figure 4. The phase may also drift to negative values.

 

Figure 4 Typical phase noise evolution curve
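A phase-noise trace of the kind shown in Figure 4 can be generated as a random walk. The sketch below rests on explicit assumptions: the step from symbol k to k+1 is taken as sigma * X_k with sigma = sqrt(2*pi*LW/f_b) (the usual Wiener-process model, with LW and f_b both in Hz), X_k is drawn from a standard normal distribution, and the function names are hypothetical.

```python
import cmath
import math
import random

def step_sigma(linewidth_hz, baud_rate_hz):
    """Standard deviation of one phase step (assumed Wiener-process model)."""
    return math.sqrt(2 * math.pi * linewidth_hz / baud_rate_hz)

def phase_noise(num_symbols, linewidth_hz, baud_rate_hz, seed=0):
    """Accumulate independent Gaussian steps into a phase trajectory theta_k."""
    rng = random.Random(seed)
    sigma = step_sigma(linewidth_hz, baud_rate_hz)
    theta = [0.0]
    for _ in range(num_symbols - 1):
        theta.append(theta[-1] + sigma * rng.gauss(0.0, 1.0))
    return theta

def apply_phase_noise(symbols, theta):
    """Rotate each symbol by its phase, the discrete form of Equation (6)."""
    return [s * cmath.exp(1j * t) for s, t in zip(symbols, theta)]

# 100 kHz linewidth at 150 Gbaud.
theta = phase_noise(num_symbols=10000, linewidth_hz=100e3, baud_rate_hz=150e9)
noisy = apply_phase_noise([1 + 1j] * len(theta), theta)
print(min(theta), max(theta))
```

Because the walk is symmetric, the trajectory can wander to positive or negative values, as the text notes.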

As shown in Figure 5, a typical CR algorithm inserts known pilot symbols (Pilots) at regular intervals, estimates the current phase noise from the phase difference between the received signal and the known symbol, and then multiplies the affected symbols at the receiving end by the reverse of this phase difference to compensate for the phase noise. The ratio of Pilots to total symbols is called the Pilot overhead. For example, if every N symbols contain M Pilots, the overhead is M/N. Several considerations arise when designing the CR algorithm. The number of Pilots should be minimized to reduce system overhead, because a Pilot is a known quantity and carries no information. The phase of the payload between Pilots can be approximated by various interpolation methods. Additive white noise degrades the accuracy of the phase estimate; its impact can be suppressed by averaging over 2 consecutive Pilot symbols, or by averaging over Pilots that are interleaved with the payload. These are only examples, and the actual CR algorithm is not limited to them.

 

Figure 5 Pilot-based carrier recovery algorithm
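The pilot-based procedure can be sketched as follows. This is one simple variant under stated assumptions, not the algorithm the problem asks for: one Pilot per period, linear interpolation between neighbouring Pilots, no averaging against white noise, and no phase unwrapping (so it only works while the phase between Pilots stays well inside (-pi, pi)). All names and the toy frame are hypothetical.

```python
import cmath

def pilot_phases(received, positions, pilot):
    """Phase difference between each received pilot and the known pilot symbol."""
    return [cmath.phase(received[i] * pilot.conjugate()) for i in positions]

def interpolate(positions, phases, length):
    """Linear interpolation between pilots; hold the edge values outside them."""
    est = [phases[0]] * length
    for p0, p1, a0, a1 in zip(positions, positions[1:], phases, phases[1:]):
        for k in range(p0, p1 + 1):
            est[k] = a0 + (a1 - a0) * (k - p0) / (p1 - p0)
    for k in range(positions[-1], length):
        est[k] = phases[-1]
    return est

def compensate(received, est):
    """Multiply each symbol by the reverse of its estimated phase."""
    return [r * cmath.exp(-1j * a) for r, a in zip(received, est)]

# Toy frame: one pilot every 8 symbols, slowly rising phase, no additive noise.
PERIOD = 8
pilot = complex(1, 1)
sent = [pilot if k % PERIOD == 0 else complex(-1, 1) for k in range(33)]
true_phase = [0.01 * k for k in range(33)]
received = [s * cmath.exp(1j * t) for s, t in zip(sent, true_phase)]

positions = list(range(0, len(sent), PERIOD))
est = interpolate(positions, pilot_phases(received, positions, pilot), len(received))
recovered = compensate(received, est)

worst = max(abs(r - s) for r, s in zip(recovered, sent))
overhead = 1 / PERIOD  # M/N with M = 1 pilot per N = 8 symbols
print(worst, overhead)
```

In the toy frame the phase is exactly linear, so linear interpolation recovers it almost perfectly; with a random-walk phase and white noise, the residual error grows with the pilot spacing, which is the trade-off the problem explores.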

[Dispersion and dispersion compensation algorithm]

The effect of dispersion in the fiber can be regarded as applying, to the frequency-domain data obtained by Fourier transforming the signal, a phase that varies with the square of the frequency, as shown in Equation 8.

$$H(f) = \exp\left[ j \cdot \frac{\lambda^2 \pi D_z}{c} f^2 \right] \qquad (8)$$

where λ is the wavelength (1550nm), Dz is the dispersion value, c is the speed of light, and f is the frequency. In this problem, the dispersion applied in the channel and the dispersion compensation in the algorithm are shown in Figure 6. Assuming the dispersion value is known, the received data is first converted to the frequency domain by FFT, then multiplied by the frequency-domain response in the formula, and finally converted back to the time domain by IFFT. The frequency-domain response of the channel dispersion and that of the compensation in the algorithm are complex conjugates of each other.

 

 Figure 6 Dispersion and dispersion compensation method
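The transform, multiply, inverse-transform procedure can be sketched as below. To stay self-contained the example uses a naive O(N^2) DFT instead of an FFT, treats the whole block as periodic, and assumes one sample per symbol; these choices, the unit conversion of the dispersion value from ps/nm to s/m, and all names are assumptions made for illustration.

```python
import cmath
import math
import random

C = 299_792_458.0      # speed of light, m/s
WAVELENGTH = 1550e-9   # m

def dft(x, inverse=False):
    """Naive O(N^2) DFT, sufficient for a short demonstration block."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def dispersion_response(n, sample_rate_hz, dz_ps_per_nm, conjugate=False):
    """H(f) of Equation (8) on the DFT grid; conjugate=True gives the compensator."""
    dz = dz_ps_per_nm * 1e-3  # 1 ps/nm = 1e-3 s/m
    coeff = WAVELENGTH ** 2 * math.pi * dz / C
    response = []
    for k in range(n):
        f = (k if k < n // 2 else k - n) * sample_rate_hz / n  # signed frequency
        h = cmath.exp(1j * coeff * f ** 2)
        response.append(h.conjugate() if conjugate else h)
    return response

def apply_filter(x, response):
    """Transform, multiply by the frequency response, transform back."""
    spectrum = dft(x)
    return dft([s * h for s, h in zip(spectrum, response)], inverse=True)

rng = random.Random(1)
points = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
sent = [rng.choice(points) for _ in range(32)]

channel = dispersion_response(32, 150e9, 20000)
compensator = dispersion_response(32, 150e9, 20000, conjugate=True)

dispersed = apply_filter(sent, channel)
restored = apply_filter(dispersed, compensator)
residual = max(abs(r - s) for r, s in zip(restored, sent))
print(residual)
```

Because the two responses are conjugates with unit magnitude, their product is 1 at every frequency and the original block is recovered up to rounding error.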

2. Algorithm implementation on ASIC chip

Unlike the software-based mathematical calculations that we usually carry out on computers, calculations on an ASIC are carried out directly in hardware circuits. For example, to evaluate a formula, a general-purpose computer translates it into logic instructions that run sequentially on the same CPU and finally output the result. On an ASIC, a calculation is decomposed into basic operations such as addition and multiplication, and each basic operation is mapped to its own dedicated logic on the chip, each occupying a certain area. Designing DSP algorithms on ASIC chips requires consideration of constraints such as parallel implementation, fixed-point quantization, timing constraints, and resource/power consumption constraints, which are briefly described below.

[Parallel implementation]

All calculations on the chip run beat by beat under the system clock. The throughput of the chip must exceed the signal transmission rate so that no information is lost. With serial processing the required chip clock frequency would be extremely high, and chip power consumption grows approximately with the square of the main clock frequency, so throughput cannot be raised simply by raising the clock frequency. Parallel processing must be used instead, spending more resources in exchange for higher throughput. For example, for the squaring operation shown in Figure 7, the serial case processes one symbol per clock cycle, so a baud rate of 100Gbaud requires a clock frequency of at least 100GHz, far beyond what is achievable in practice. If N times the resources are spent and N symbols are processed simultaneously in each cycle, the clock frequency drops to 1/N of the serial value and the power consumption is greatly reduced. The main clock frequency of current oDSP chips is on the order of 500MHz~1GHz, with a corresponding symbol parallelism of about 100~200.

 

Figure 7 Schematic diagram of ASIC chip parallel operation
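The clock arithmetic behind parallel processing is easy to check. In the sketch below, `required_clock_hz` follows directly from the text, while `relative_power` relies on an additional simplifying assumption that power scales as (amount of logic) x (clock frequency)^2; both function names are hypothetical.

```python
def required_clock_hz(baud_rate_hz, parallelism):
    """Clock needed when `parallelism` symbols are processed in every beat."""
    return baud_rate_hz / parallelism

def relative_power(parallelism):
    """Power relative to the serial design, ASSUMING power ~ resources * clock**2."""
    return parallelism * (1 / parallelism) ** 2

serial_clock = required_clock_hz(100e9, 1)       # 100 GHz, not realizable
parallel_clock = required_clock_hz(100e9, 128)   # 781.25 MHz
question1_clock = required_clock_hz(150e9, 128)  # 150 Gbaud with parallelism 128

print(parallel_clock / 1e6, question1_clock / 1e9, relative_power(128))
```

With a parallelism of 128, a 100Gbaud signal needs a clock of about 781MHz, which lies inside the 500MHz~1GHz range quoted in the text.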

[Fixed-point quantization]

On computers, double-precision floating-point numbers are commonly used for parameters and variables, while ASICs usually represent them as fixed-point numbers. Double-precision floating-point calculation loses almost no accuracy in most cases, whereas fixed-point numbers on an ASIC tend to have few bits, so rounding errors increase and appear as quantization noise. The number of binary digits used to represent a fixed-point number is called the fixed-point bit width. Fixed-point numbers are divided into signed and unsigned numbers, as shown in Figure 8. For example, s(8,4) denotes a signed 8-bit fixed-point number with 4 fractional bits, and u(7,4) denotes an unsigned 7-bit fixed-point number with 4 fractional bits. A single fixed-point number can only represent a real number; a complex number uses two fixed-point numbers, for the real part and the imaginary part respectively. Every calculation on an ASIC must be quantized into a fixed-point calculation, and the influence of quantization noise is one of the key issues in algorithm design. For example, in the CR algorithm, the received symbols affected by phase noise are often represented by 6-9 bit fixed-point numbers, and the bit width used in the phase-noise calculation varies according to actual conditions.

 

Figure 8 Fixed-point representation
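A fixed-point format can be emulated in software by scaling, rounding, and clipping. The sketch below is an illustration under assumptions that the text does not spell out: the sign bit is counted inside the total width (two's complement range), values are rounded to the nearest step with Python's `round`, and out-of-range values saturate. The function names are hypothetical.

```python
def quantize(value, total_bits, frac_bits, signed=True):
    """Map a real number to the nearest representable fixed-point value."""
    scale = 1 << frac_bits
    if signed:
        lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << total_bits) - 1
    code = max(lo, min(hi, round(value * scale)))  # integer code, saturated
    return code / scale

def quantize_complex(z, total_bits, frac_bits):
    """A complex number uses two fixed-point numbers: real and imaginary part."""
    return complex(quantize(z.real, total_bits, frac_bits),
                   quantize(z.imag, total_bits, frac_bits))

print(quantize(3.14159, 8, 4))               # 3.125  (step 1/16)
print(quantize(100.0, 8, 4))                 # 7.9375 (largest signed value)
print(quantize(5.3, 7, 4, signed=False))     # 5.3125
print(quantize_complex(0.7 - 0.2j, 8, 4))
```

The difference between a value and its quantized version is the quantization noise; narrowing the bit width saves resources but raises this noise.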

[Basic operation, timing constraints and resource/power consumption constraints]

Algorithm design on an ASIC requires every calculation to be decomposed into basic operations. This problem considers four basic operations: addition, multiplication, table look-up, and data buffering, represented as shown in Figure 9. Both addition and multiplication operate on two numbers, and in complex-number operations the real part and the imaginary part are calculated separately. Subtraction can be regarded as a multiplication by -1 followed by an addition. Because the ASIC uses a binary fixed-point representation, multiplying or dividing by a power of 2 is equivalent to shifting the binary point and incurs no additional resource cost. Table look-up is used for operations that cannot be implemented directly with addition and multiplication. For example, to evaluate the sin function, the input-output mapping can be quantized into discrete input-output pairs, and the output is obtained by looking up the table. For example, to evaluate d = sin((a+b)/2 * c), first compute a+b with one level of addition, then obtain (a+b)/2 by shifting, then obtain (a+b)/2 * c with one level of multiplication, and finally obtain d by looking up the sin table. Each operation consumes resources on the chip, and the larger the bit width of the data involved, the more resources are occupied.

 

Figure 9 Basic operation of ASIC chip
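The decomposition of d = sin((a+b)/2 * c) into basic operations can be mimicked in software. The sketch is illustrative only: the table input width `IN_BITS`, the output format with `OUT_FRAC` fractional bits, and the choice to address the table by angle over [-pi, pi) are hypothetical, and the arithmetic before the table is left in floating point for brevity.

```python
import math

IN_BITS = 8    # hypothetical table input width
OUT_FRAC = 7   # hypothetical output format: signed 8 bits, 7 fractional bits
STEP = 2 * math.pi / 2 ** IN_BITS   # angle range covered by one table address

# Table contents are computed once, offline; on chip they are fixed values.
SIN_TABLE = [
    min(2 ** OUT_FRAC - 1, round(math.sin(-math.pi + addr * STEP) * 2 ** OUT_FRAC))
    for addr in range(2 ** IN_BITS)
]

def sin_lookup(angle):
    """Quantize the angle to a table address and return the stored value."""
    addr = int((angle + math.pi) // STEP) % 2 ** IN_BITS
    return SIN_TABLE[addr] / 2 ** OUT_FRAC

def evaluate(a, b, c):
    total = a + b                # one level of addition
    half = total / 2             # division by 2: a shift in hardware, no extra cost
    product = half * c           # one level of multiplication
    return sin_lookup(product)   # one table look-up

d = evaluate(0.5, 0.3, 1.2)
print(d, math.sin(0.48))
```

The gap between the two printed values comes from the coarse 8-bit table; widening the table input reduces the error but doubles the table size for every extra bit.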

On the other hand, the chip runs beat by beat at the main clock frequency, and only a limited number of consecutive basic operations can be completed within one beat. A pipeline structure can, however, be realized by adding buffers. For example, in Figure 10, if operation 2 cannot be completed in the time remaining within the beat after operation 1 produces its result, a one-level buffer must be added at the output of operation 1 to store its output data. In each beat, operation 2 reads the previous beat's output of operation 1 from the buffer and computes on it, while operation 1 computes on the current input data and stores its result in the buffer. In this way operation 1 and operation 2 run simultaneously in every beat, but the output appears one beat later than it would if both operations were completed in the same beat.

 

Figure 10 Pipeline structure implemented with a buffer

Different computing paths have different clock delays. When an algorithm contains several computing paths, the delays of all paths must be aligned. For example, in the CR algorithm the calculated phase noise must be multiplied back onto the data, but the phase noise calculation usually cannot be completed within 1 beat, which introduces a delay between the phase and the data; this delay must be aligned using buffers. In this problem, it is assumed that 1 level of multiplication, 4 levels of addition, and 1 level of table look-up can be completed in 1 clock cycle.
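The one-beat latency introduced by a buffer can be seen in a small beat-by-beat simulation. The function name `run_pipeline` and the two example operations are hypothetical.

```python
def run_pipeline(inputs, op1, op2):
    """Simulate two operations separated by a one-level buffer, beat by beat."""
    buffer = None   # holds the result of operation 1 from the previous beat
    outputs = []
    for x in inputs:
        # Operation 2 works on what the buffer held at the start of this beat...
        outputs.append(op2(buffer) if buffer is not None else None)
        # ...while operation 1 processes the current input and refills the buffer.
        buffer = op1(x)
    return outputs

out = run_pipeline([1, 2, 3, 4], op1=lambda v: v * v, op2=lambda v: v + 1)
print(out)  # [None, 2, 5, 10]
```

The first beat has no valid output, and the result for the last input (4*4 + 1 = 17) would only appear one beat after the last input arrives.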

Addition, multiplication, table look-up, and buffering all consume resources, which must also be considered in algorithm design. Addition and multiplication are two-operand operations. The resource of an adder is approximately proportional to the larger of the two bit widths, the resource of a multiplier is approximately proportional to the product of the two bit widths, and the resource of a delay (buffer) is approximately proportional to the product of bit width and delay depth. For a look-up table with M input bits and N output bits, the resource is proportional to (2^M)*N. For example, consider a look-up table that returns the phase of a known signal, with signal bit width s(10,1) and phase bit width s(9,1); then M=20 and N=9. The table input is the concatenation of the real and imaginary parts of the signal, so 2^20 addresses must be covered, and the phase value stored at each address has bit width s(9,1).

Table 1

Basic operation                      Resources
8+8 bit adder                        1 U
8*8 bit multiplier                   8 U
8 bit to 8 bit lookup table          128 U
8 bit delay, per 2048 symbols        1 U
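Combining the proportionality rules with the reference points of Table 1 gives a simple resource estimator. The scaling constants below are derived by fitting each rule to its Table 1 entry, and the reading of the last entry as "an 8-bit value delayed by 2048 symbols costs 1 U" is an assumption; the function names are hypothetical.

```python
def adder_cost(width_a, width_b):
    """Proportional to the larger bit width; an 8+8 bit adder costs 1 U."""
    return max(width_a, width_b) / 8

def multiplier_cost(width_a, width_b):
    """Proportional to the product of bit widths; an 8*8 bit multiplier costs 8 U."""
    return width_a * width_b / 8

def lookup_cost(in_bits, out_bits):
    """Proportional to (2**M) * N; an 8 bit to 8 bit table costs 128 U."""
    return (2 ** in_bits) * out_bits / 16

def delay_cost(width, depth_symbols):
    """Proportional to width * depth; 8 bits over 2048 symbols costs 1 U (assumed)."""
    return width * depth_symbols / (8 * 2048)

# The four reference points of Table 1.
print(adder_cost(8, 8), multiplier_cost(8, 8), lookup_cost(8, 8), delay_cost(8, 2048))

# Phase look-up from the text: s(10,1) real and imaginary parts in, s(9,1) phase out.
print(lookup_cost(20, 9))  # 589824.0 U
```

Under these rules a direct 20-bit phase table costs far more than any adder or multiplier, which illustrates why the input bit width of a look-up table matters so much.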

3. Typical steps of ASIC chip algorithm design

  Based on the above content, the algorithm design on the chip usually includes the following specific steps:

1) Design the algorithm prototype according to the physical model and coefficient characteristics: For example, in the CR algorithm, Pilot is used to estimate the phase difference, and the phase noise of the payload is calculated by interpolation and other methods, and finally the estimated phase difference is multiplied back to the data to compensate for the phase noise.

2) Taking the degree of parallelism and the timing constraints into account, refine the algorithm into realizable basic operations, and roughly assess how implementation complexity affects the algorithm. At this stage the data can be assumed to be floating-point, and fixed-point quantization noise need not be considered. For example, in the CR algorithm, it may be necessary to consider how the interpolation of the payload phase noise is realized with basic operations, and how the delay of the path that calculates the phase noise is matched to the delay of the path that compensates it.

3) Further consider the impact of fixed-point quantization noise and try to achieve it with lower resources. At this time, bit width optimization is a key step.

---------------------------------------- Problem ----------------------------------------

Question 1: Consider a standard 16QAM signal with a baud rate of 150Gbaud, a linewidth of 100kHz, and a dispersion value of 20,000 ps/nm. The parallelism of the algorithm is fixed at 128, and fixed-point quantization is not considered. Design a CR algorithm based on basic addition, multiplication, table look-up and buffering that minimizes the Pilot overhead while keeping the RSNR cost <0.3dB.

Question 2: Consider the scenario where the linewidth varies from 10kHz to 10MHz and the dispersion Dz varies from 0 to 10,000 ps/nm. With an RSNR cost <0.3dB as the target, quantitatively explore the relationship between dispersion, linewidth and Pilot overhead.

Question 3: In the scenario of Question 2, the resources used by the chip implementation are additionally taken into account. The impact of fixed-point quantization on performance and resources must now be considered, and the Pilot overhead can be changed arbitrarily (but the payload rate must remain >145Gbaud). How should the CR algorithm be designed to use the lowest resources?

Question 4: In practice, the trade-off between performance and resources depends on the specific scenario. For example, long-distance trunk transmission often has higher performance requirements than short-distance transmission and can spend more resources to reduce the RSNR cost. Choose one scenario from Question 3 that your team considers representative, give an overall algorithm design approach that jointly considers performance and resources, construct a combined cost function of performance and resources, and try to give a scheme that automatically optimizes the bit widths and the implementation design, with quantitative results to guide algorithm development.

Note: In this question, the complexity and resources of dispersion compensation and bit error rate calculation are not considered, only the resources related to the CR algorithm (computing phase noise + compensating phase noise) are considered.
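The payload-rate constraint of Question 3 bounds the Pilot overhead. The short calculation below assumes that the 150Gbaud baud rate of Question 1 also applies to Question 3, which the text does not state explicitly, and uses a hypothetical helper name. Rates are kept as integers (in Gbaud) so that the strict inequality is evaluated exactly.

```python
def min_pilot_period(baud_rate, min_payload_rate, pilots_per_period=1):
    """Smallest period N (in symbols) with M pilots such that
    baud_rate * (N - M) / N is strictly greater than min_payload_rate."""
    n = pilots_per_period + 1
    while baud_rate * (n - pilots_per_period) <= min_payload_rate * n:
        n += 1
    return n

period = min_pilot_period(150, 145)   # one pilot per period
max_overhead = 1 / period
print(period, round(max_overhead, 4))  # 31 0.0323
```

With one Pilot per period, a period of 30 symbols gives exactly 145Gbaud of payload, so the strict inequality requires at least 31 symbols per Pilot, that is an overhead of at most about 3.2%.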

 

 


Origin blog.csdn.net/weixin_41971010/article/details/108634786