Tabla de contenido

2. Eficiencia mejorada de la instrucción SSE

La realización de memcpy es una pregunta de entrevista que a todas las grandes empresas les gusta hacer, principalmente porque hay dos puntos de prueba para memcpy: direcciones superpuestas y eficiencia mejorada. Debido a que la interfaz de la biblioteca C estándar de Linux no resuelve estos dos problemas, y la copia en memoria es una función de biblioteca muy común, los entrevistadores la prefieren.

1. Superposición de direcciones

El manual del usuario de Linux indica que memcpy no admite la superposición de direcciones. Si una dirección se superpone, debe ser reemplazada por memmove. Echemos un vistazo a la implementación del código fuente de memcpy:

/**
 * memcpy - Copy one area of memory to another
 * @dest: Where to copy to
 * @src: Where to copy from
 * @count: The size of the area.
 *
 * You should not use this function to access IO space, use memcpy_toio()
 * or memcpy_fromio() instead.
 */
void *memcpy(void *dest, const void *src, size_t count)
{
	char *tmp = dest;
	const char *s = src;

	while (count--)
		*tmp++ = *s++;
	return dest;
}

Se puede ver que memcpy simplemente copia el contenido de cada byte de adelante hacia atrás. Cuando las direcciones de origen y destino no se superponen, no hay problema. Cuando ocurre la superposición, memcpy definitivamente traerá algunos problemas. Lo primero que hay que tener claro es que existen dos casos de superposición de direcciones:

1. La dirección se superpone hacia adelante, es decir, src> dest && dest + size> src , como se muestra a continuación:

Utilice memcpy para copiar correctamente, pero es inevitable que el contenido señalado por src deba cambiar después de que se complete la copia.

2. Las direcciones se superponen hacia atrás, es decir, dest> src && src + tamaño> dest , como se muestra a continuación:

El uso de memcpy no se puede copiar correctamente, porque algunos de los datos detrás del espacio de memoria al que apunta src se sobrescriben con los datos recién copiados, por lo que los datos definitivamente no son los datos originales al final de la copia. Cuando la dirección se superpone hacia atrás, debe ser reemplazada por memmove. memmove es principalmente para juzgar la dirección para juzgar si hay una superposición hacia atrás en la dirección, Y: copia de atrás hacia adelante; N: copia de adelante hacia atrás.

/**
 * memmove - Copy one area of memory to another
 * @dest: Where to copy to
 * @src: Where to copy from
 * @count: The size of the area.
 *
 * Unlike memcpy(), memmove() copes with overlapping areas.
 */
void *memmove(void *dest, const void *src, size_t count)
{
	char *tmp;
	const char *s;

	if (dest <= src) {
		tmp = dest;
		s = src;
		while (count--)
			*tmp++ = *s++;
	} else {
		tmp = dest;
		tmp += count;
		s = src;
		s += count;
		while (count--)
			*--tmp = *--s;
	}
	return dest;
}

Por lo tanto, para resolver la superposición de direcciones, solo necesita mirar el método de superposición de direcciones y juzgar si es copiar de adelante hacia atrás o copiar de atrás hacia adelante . El código específico es el siguiente:

void *memcpy_func(void *dest, const void *src, size_t n)
{
	if(dest ==  NULL || src == NULL || n <= 0){
		return NULL;
	}

	char *p_dst = (char *)dest;
	char *p_src = (char *)src;
	
	if((p_src + n > p_dst) && (p_dst > p_src)){//地址向后重叠
		p_dst += n;
		p_src += n;
		while(n-- > 0){
			p_dst--;
			p_src--;
			*p_dst = *p_src;
		}
	}else{
		while(n-- > 0){
			*p_dst++ = *p_src++;
		}
	}
	return (char *)dest;
}

2. Eficiencia mejorada de la instrucción SSE

Una copia de un solo byte es definitivamente una tontería. Cuando el valor de n es pequeño, no habrá problemas. Cuando el número de bytes a copiar alcanza el nivel M, copiar una gran cantidad de datos de una vez es sin duda la mejor opción. Con la ayuda de las instrucciones SSE, el código ensamblador se incrusta en C para realizar una copia única de 16, 32, 64, 128, 256 bytes, etc. El código es el siguiente:

/**
 * Copy 16 bytes from one location to another using optimised SSE
 * instructions. The locations should not overlap.
 *
 * @param s1
 *   Pointer to the destination of the data.
 * @param s2
 *   Pointer to the source data.
 */
static inline void
mov16(uint8_t *dst, const uint8_t *src)
{
	asm volatile ("movdqu (%[src]), %%xmm0\n\t"
		      "movdqu %%xmm0, (%[dst])\n\t"
		      :
		      : [src] "r" (src),
			[dst] "r"(dst)
		      : "xmm0", "memory");
}

/**
 * Copy 32 bytes from one location to another using optimised SSE
 * instructions. The locations should not overlap.
 *
 * @param s1
 *   Pointer to the destination of the data.
 * @param s2
 *   Pointer to the source data.
 */
static inline void
mov32(uint8_t *dst, const uint8_t *src)
{
	asm volatile ("movdqu (%[src]), %%xmm0\n\t"
		      "movdqu 16(%[src]), %%xmm1\n\t"
		      "movdqu %%xmm0, (%[dst])\n\t"
		      "movdqu %%xmm1, 16(%[dst])"
		      :
		      : [src] "r" (src),
			[dst] "r"(dst)
		      : "xmm0", "xmm1", "memory");
}

/**
 * Copy 48 bytes from one location to another using optimised SSE
 * instructions. The locations should not overlap.
 *
 * @param s1
 *   Pointer to the destination of the data.
 * @param s2
 *   Pointer to the source data.
 */
static inline void
mov48(uint8_t *dst, const uint8_t *src)
{
	asm volatile ("movdqu (%[src]), %%xmm0\n\t"
		      "movdqu 16(%[src]), %%xmm1\n\t"
		      "movdqu 32(%[src]), %%xmm2\n\t"
		      "movdqu %%xmm0, (%[dst])\n\t"
		      "movdqu %%xmm1, 16(%[dst])"
		      "movdqu %%xmm2, 32(%[dst])"
		      :
		      : [src] "r" (src),
			[dst] "r"(dst)
		      : "xmm0", "xmm1", "memory");
}

/**
 * Copy 64 bytes from one location to another using optimised SSE
 * instructions. The locations should not overlap.
 *
 * @param s1
 *   Pointer to the destination of the data.
 * @param s2
 *   Pointer to the source data.
 */
static inline void
mov64(uint8_t *dst, const uint8_t *src)
{
	asm volatile ("movdqu (%[src]), %%xmm0\n\t"
		      "movdqu 16(%[src]), %%xmm1\n\t"
		      "movdqu 32(%[src]), %%xmm2\n\t"
		      "movdqu 48(%[src]), %%xmm3\n\t"
		      "movdqu %%xmm0, (%[dst])\n\t"
		      "movdqu %%xmm1, 16(%[dst])\n\t"
		      "movdqu %%xmm2, 32(%[dst])\n\t"
		      "movdqu %%xmm3, 48(%[dst])"
		      :
		      : [src] "r" (src),
			[dst] "r"(dst)
		      : "xmm0", "xmm1", "xmm2", "xmm3","memory");
}

/**
 * Copy 128 bytes from one location to another using optimised SSE
 * instructions. The locations should not overlap.
 *
 * @param s1
 *   Pointer to the destination of the data.
 * @param s2
 *   Pointer to the source data.
 */
static inline void
mov128(uint8_t *dst, const uint8_t *src)
{
	asm volatile ("movdqu (%[src]), %%xmm0\n\t"
		      "movdqu 16(%[src]), %%xmm1\n\t"
		      "movdqu 32(%[src]), %%xmm2\n\t"
		      "movdqu 48(%[src]), %%xmm3\n\t"
		      "movdqu 64(%[src]), %%xmm4\n\t"
		      "movdqu 80(%[src]), %%xmm5\n\t"
		      "movdqu 96(%[src]), %%xmm6\n\t"
		      "movdqu 112(%[src]), %%xmm7\n\t"
		      "movdqu %%xmm0, (%[dst])\n\t"
		      "movdqu %%xmm1, 16(%[dst])\n\t"
		      "movdqu %%xmm2, 32(%[dst])\n\t"
		      "movdqu %%xmm3, 48(%[dst])\n\t"
		      "movdqu %%xmm4, 64(%[dst])\n\t"
		      "movdqu %%xmm5, 80(%[dst])\n\t"
		      "movdqu %%xmm6, 96(%[dst])\n\t"
		      "movdqu %%xmm7, 112(%[dst])"
		      :
		      : [src] "r" (src),
			[dst] "r"(dst)
		      : "xmm0", "xmm1", "xmm2", "xmm3",
			"xmm4", "xmm5", "xmm6", "xmm7", "memory");
}

/**
 * Copy 256 bytes from one location to another using optimised SSE
 * instructions. The locations should not overlap.
 *
 * @param s1
 *   Pointer to the destination of the data.
 * @param s2
 *   Pointer to the source data.
 */
static inline void
mov256(uint8_t *dst, const uint8_t *src)
{
	/*
	 * There are 16XMM registers, but this function does not use
	 * them all so that it can still be compiled as 32bit
	 * code. The performance increase was neglible if all 16
	 * registers were used.
	 */
	mov128(dst, src);
	mov128(dst + 128, src + 128);
}


/**
 * Copy bytes from one location to another. The locations should not overlap.
 *
 * @param s1
 *   Pointer to the destination of the data.
 * @param s2
 *   Pointer to the source data.
 * @param n
 *   Number of bytes to copy.
 * @return
 *   s1
 */
void * fast_memcpy(void *s1, const void *s2, size_t n)
{
	uint8_t *dst = (uint8_t *)s1;
	const uint8_t *src = (const uint8_t *)s2;

	/* We can't copy < 16 bytes using XMM registers so do it manually. */
	if (n < 16) {
		if (n & 0x01) {
			*dst = *src;
			dst += 1;
			src += 1;
		}
		if (n & 0x02) {
			*(uint16_t *)dst = *(const uint16_t *)src;
			dst += 2;
			src += 2;
		}
		if (n & 0x04) {
			*(uint32_t *)dst = *(const uint32_t *)src;
			dst += 4;
			src += 4;
		}
		if (n & 0x08) {
			*(uint64_t *)dst = *(const uint64_t *)src;
		}
		return dst;
	}

	/* Special fast cases for <= 128 bytes */
	if (n <= 32) {
		mov16(dst, src);
		mov16(dst - 16 + n, src - 16 + n);
		return s1;
	}
	if (n <= 64) {
		mov32(dst, src);
		mov32(dst - 32 + n, src - 32 + n);
		return s1;
	}
	if (n <= 128) {
		mov64(dst, src);
		mov64(dst - 64 + n, src - 64 + n);
		return s1;
	}

	/*
	 * For large copies > 128 bytes. This combination of 256, 64 and 16 byte
	 * copies was found to be faster than doing 128 and 32 byte copies as
	 * well.
	 */
	for (; n >= 256; n -= 256, dst += 256, src += 256) {
		mov256(dst, src);
	}

	/*
	 * We split the remaining bytes (which will be less than 256) into
	 * 64byte (2^6) chunks.
	 * Using incrementing integers in the case labels of a switch statement
	 * enourages the compiler to use a jump table. To get incrementing
	 * integers, we shift the 2 relevant bits to the LSB position to first
	 * get decrementing integers, and then subtract.
	 */
	switch (3 - (n >> 6)) {
	case 0x00:
		mov64(dst, src);
		n -= 64;
		dst += 64;
		src += 64;       /* fallthrough */
	case 0x01:
		mov64(dst, src);
		n -= 64;
		dst += 64;
		src += 64;       /* fallthrough */
	case 0x02:
		mov64(dst, src);
		n -= 64;
		dst += 64;
		src += 64;       /* fallthrough */
	default:
		;
	}

	/*
	 * We split the remaining bytes (which will be less than 64) into
	 * 16byte (2^4) chunks, using the same switch structure as above.
	 */
	switch (3 - (n >> 4)) {
	case 0x00:
		mov16(dst, src);
		n -= 16;
		dst += 16;
		src += 16;       /* fallthrough */
	case 0x01:
		mov16(dst, src);
		n -= 16;
		dst += 16;
		src += 16;       /* fallthrough */
	case 0x02:
		mov16(dst, src);
		n -= 16;
		dst += 16;
		src += 16;       /* fallthrough */
	default:
		;
	}

	/* Copy any remaining bytes, without going beyond end of buffers */
	if (n != 0) {
		mov16(dst - 16 + n, src - 16 + n);
	}
	return s1;
}

Cabe señalar que la eficiencia de esta interfaz no es alta para copias de bytes más pequeños correspondientes al orden de magnitud, y este método de implementación aún no admite la superposición de direcciones hacia atrás.

programación linux C: reescribir la función memcpy

1. Superposición de direcciones

2. Eficiencia mejorada de la instrucción SSE

Supongo que te gusta