Pointers, Arrays, and Strings in the C Programming Language

Pointers

In C, a pointer is a derived type. A pointer is a variable that holds the address of an object or a function ("A pointer is a variable that contains the address of a variable")[10](p93). For example:

int* pa;

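Here pa is a pointer to an int, and int is the referenced type of pa. A constant pointer is declared as follows:

float * const const_pointer_a;

const_pointer_a is a constant pointer to a float; the const qualifies const_pointer_a itself. In the following line,

const float * pointer_const_a;

pointer_const_a is a pointer to a constant float; the const qualifies float. As another example,

struct tag (*[5])(float)

is the type of an array of five function pointers, where the referenced type of each pointer is a function that takes one float parameter and returns a struct tag.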
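
The declarations above can be exercised with a small sketch. The struct layout and the functions twice, negate, call_entry, and const_demo are hypothetical names introduced only for illustration; the commented-out lines are the ones a compiler would reject.

```c
#include <stddef.h>

/* Hypothetical struct and functions, used only for illustration. */
struct tag { float value; };

static struct tag twice(float x)  { struct tag t = { 2.0f * x }; return t; }
static struct tag negate(float x) { struct tag t = { -x };       return t; }

/* Calls the idx-th entry of a table whose type is struct tag (*[5])(float). */
float call_entry(size_t idx, float x)
{
    /* Only two entries are given; the remaining three are null pointers. */
    struct tag (*table[5])(float) = { twice, negate };
    if (idx >= 5 || table[idx] == NULL)
        return 0.0f;
    return table[idx](x).value;
}

/* Shows what each placement of const permits. */
float const_demo(void)
{
    float a = 1.0f, b = 2.0f;

    float * const const_pointer_a = &a;   /* the pointer itself is constant */
    *const_pointer_a = 3.0f;              /* OK: the float may change */
    /* const_pointer_a = &b; */           /* error: pointer cannot be re-seated */

    const float *pointer_const_a = &a;    /* the pointed-to float is constant */
    pointer_const_a = &b;                 /* OK: pointer may be re-seated */
    /* *pointer_const_a = 4.0f; */        /* error: cannot write through it */

    return a + *pointer_const_a;          /* 3.0f + 2.0f */
}
```

Reading a declaration from the identifier outward helps: whatever const sits next to is what becomes read-only.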

Arrays

In C, declaring an array:

int a[10];

The corresponding assembly (VC2015) is:

COMM	_a:DWORD:0aH

where "0aH" is the array length, 10 (0a in hexadecimal).

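The declaration means the following:

1. An array named a is declared (a is the name of an array);
2. A contiguous block of memory is reserved for the array a;
3. This block can hold 10 variables of type int, named a[0], a[1], ... , a[9];
4. The array name a is not an ordinary variable (it is not a modifiable l-value and cannot be assigned to); it designates an array made up of several variables. (Machine code cannot copy the memory block of an array between memory and the CPU as a single unit the way it copies an int. In practice, therefore, the array name a stands for the address of the first element of the array.)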

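If an array declaration gives neither a length nor an initializer, the array type is incomplete and the declaration leads to a compile error (VC2015 simply ignores the line). For example:

int array[]; // VC2015 emits nothing for this line; any later use of array in the program causes a compile error.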

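Arrays and pointers are closely related in C. Any operation that can be done with array subscripts can also be done with pointers, and K&R state that the pointer version will in general be faster [10]p97 ("Any operation that can be achieved by array subscripting can also be done with pointers. The pointer version will in general be faster but, at least to the uninitiated, somewhat harder to understand").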
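
As a minimal sketch of that equivalence, the two functions below (sum_index and sum_pointer are names made up for this example) compute the same sum, once with subscripts and once with pointer arithmetic. With a modern optimizing compiler the two forms usually compile to the same code, so the choice is mostly one of readability.

```c
#include <stddef.h>

/* Sums n ints using array subscripting. */
int sum_index(const int a[], size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; ++i)
        total += a[i];            /* a[i] is defined as *(a + i) */
    return total;
}

/* Sums n ints using pointer arithmetic only. */
int sum_pointer(const int *p, size_t n)
{
    int total = 0;
    for (const int *end = p + n; p != end; ++p)
        total += *p;
    return total;
}
```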

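Strings

C has no string type among its basic data types. Strings are stored in character arrays, with the null ('\0') character marking the end. In other words, a string in C is a character array terminated by the null ('\0') character. For example: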

char label[] = "Single";

Its layout in memory is as follows [1]:

------------------------------
| S | i | n | g | l | e | \0 |
------------------------------

The corresponding assembly (VC2015) is:

_label	DB	'Single', 00H

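The null ('\0') character is the built-in end-of-string marker in C, so a string should not contain a null ('\0') character anywhere but at its end.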
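
A short sketch of what happens when a null character is embedded anyway; the function names visible_length and stored_size are hypothetical. The array still stores every byte, but the string library stops at the first '\0'.

```c
#include <string.h>

/* Length as strlen sees it: counts up to the first '\0'. */
size_t visible_length(void)
{
    char buf[] = "Sin\0gle";      /* 8 bytes stored: S i n \0 g l e \0 */
    return strlen(buf);           /* stops at the embedded '\0' */
}

/* Total bytes the array occupies, including both '\0' characters. */
size_t stored_size(void)
{
    char buf[] = "Sin\0gle";
    return sizeof buf;
}
```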

C provides string constants (string literals), for example:

char str[] = "Single";
char *message = "Single";

The corresponding assembly (VC2015) is:

PUBLIC	_message
_DATA	SEGMENT
_str	DB	'Single', 00H
	ORG $+1
_message DD	FLAT:$SG4519
$SG4519	DB	'Single', 00H
_DATA	ENDS

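A string constant creates an unnamed array of characters, and using one to initialize a pointer yields a pointer to that array: message in the code above is really a pointer (it holds the address of $SG4519). Brian W. Kernighan and Dennis M. Ritchie [10] note that the result of trying to modify the contents of a string constant is undefined, and the C standard (ANSI C) likewise states explicitly that the effect of modifying a string constant is undefined.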

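The relationship between pointers, arrays, and strings

The relationship between pointers, arrays, and strings can be summarized as follows: an array occupies a contiguous block of memory; a string is a character array terminated by the null ('\0') character; and operations on array elements are carried out through pointers.

Array -> pointer

When an array name is passed as an argument to a function, what is actually passed is the address of the array's first element, because C passes arguments by value. In C, function parameters act as local variables during the call, so an array-name parameter is really a local variable that holds an address, that is, a pointer [10]p99. Consequently, applying the sizeof operator to an array parameter inside the function does not yield the size of the array; it yields the size of a pointer variable.

For example: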

void foo( int pVar[] )
{
    int tmp = sizeof(pVar);
    return;
}

The corresponding assembly (VC2015) is:

; Listing generated by Microsoft (R) Optimizing Compiler Version 19.00.23506.0 
;
; ...

_TEXT	SEGMENT
_tmp$ = -4						; size = 4
_pVar$ = 8						; size = 4
_foo	PROC
; Line  
	push	ebp
	mov	ebp, esp
	push	ecx
; Line  
	mov	DWORD PTR _tmp$[ebp], 4
; Line  
	mov	esp, ebp
	pop	ebp
	ret	0
_foo	ENDP
_TEXT	ENDS
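
The usual remedy is to pass the element count alongside the pointer. The sketch below uses hypothetical function names; the parameter is written as int * because that is exactly the type that "int pVar[]" is adjusted to in a parameter list.

```c
#include <stddef.h>

/* A parameter written as "int pVar[]" is adjusted to "int *pVar",
   so it is declared that way here to make the type explicit. */
size_t size_seen_by_callee(int *pVar)
{
    return sizeof(pVar);          /* size of a pointer, not of the array */
}

/* Conventional remedy: pass the element count alongside the pointer. */
int last_element(const int *a, size_t count)
{
    return count > 0 ? a[count - 1] : 0;
}
```

At the call site, where the array is still an array, the count can be computed as sizeof v / sizeof v[0] and handed to the callee.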

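sizeof and strlen

sizeof looks like a function in code, but in both C and C++ it is an operator. A sizeof expression is evaluated at compile time (the one exception is a C99 variable-length array operand). For example: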

int array[2];

int main(void)
{
    int len_1   = 8;
    int len_2 = sizeof(array);
    return 0;
}

Save the code above in a file named foo.c. With VC2015 installed, open the Developer Command Prompt for VS2015 and run

cl.exe /FA foo.c

to produce the assembly listing foo.asm. Open foo.asm (Notepad is enough); it contains these two pieces of assembly:

; Line 5
	mov	DWORD PTR _len_1$[ebp], 8
; Line 6
	mov	DWORD PTR _len_2$[ebp], 8

This shows that sizeof() was replaced by the corresponding integer at compile time.
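
Because sizeof yields a constant here, it can be used wherever C requires a constant expression. The following is a small sketch assuming a C11 compiler (for _Static_assert); ARRAY_LEN, mirror, array_len, and mirror_size are names invented for this example.

```c
#include <stddef.h>

/* Element count of a true array (not of a pointer); a common idiom. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

int array[2];

/* sizeof is a constant expression here, so it can size another array
   and can be checked at compile time (C11 _Static_assert). */
char mirror[sizeof(array)];
_Static_assert(ARRAY_LEN(array) == 2, "array has two elements");

size_t array_len(void)   { return ARRAY_LEN(array); }
size_t mirror_size(void) { return sizeof mirror; }
```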

The C function strlen(s) computes the length of a string by scanning for the terminating null ('\0') character. For example:

char label[10] = "Single";

The corresponding assembly (VC2015) is:

_label	DB	'Single', 00H
	ORG $+3
	ORG $+2

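This means:

1. An array named label is declared;
2. A contiguous block of memory large enough for 10 chars is reserved for label; the elements are named label[0], label[1], ... , label[9];
3. The name label designates this block; in most expressions it converts to the address of label[0];
4. S, i, n, g, l, e are copied in order into label[0], ... , label[5];
5. '\0' is copied into label[6]; the remaining elements label[7] to label[9] are also initialized to '\0'.

Its layout in memory is as follows [1]: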

------------------------------------------
| S | i | n | g | l | e | \0 |   |   |   |
------------------------------------------

The last 3 bytes are not used by the string: they still belong to the character array but are not part of the string.

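	char label[10] = "Single";

	printf("len = %zu\n", strlen(label) );
	printf("size = %zu\n", sizeof(label) );

To compute the length of the string label, strlen(label) starts at label[0] and visits each character in turn until it reaches '\0'. The value of strlen(label) is 6.

sizeof(label) is replaced by 10 at compile time. The program therefore prints: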

len = 6 
size = 10

For a pointer to a string constant,

	char* label = "Single";

	printf("len = %zu\n", strlen(label) );
	printf("size = %zu\n", sizeof(label) );

the pointer declaration corresponds to the following assembly:

_label	DD	FLAT:$SG4518
$SG4518	DB	'Single', 00H

The result (on the 32-bit target shown above, where a pointer occupies 4 bytes) is:

len = 6 
size = 4
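
The two cases can be put side by side. The function report below is a hypothetical helper written for this comparison; it prints both pairs of values and returns 1 when the array reports its full 10 bytes while the pointer reports only the size of a char *, whatever that is on the target.

```c
#include <stdio.h>
#include <string.h>

/* Prints strlen and sizeof for an array and for a pointer to a literal. */
int report(void)
{
    char        label_array[10] = "Single";
    const char *label_ptr       = "Single";

    /* %zu is the printf conversion for size_t. */
    printf("array:   len = %zu, size = %zu\n",
           strlen(label_array), sizeof(label_array));
    printf("pointer: len = %zu, size = %zu\n",
           strlen(label_ptr), sizeof(label_ptr));

    /* Same string length, but sizeof measures different things. */
    return strlen(label_array) == strlen(label_ptr)
        && sizeof(label_array) == 10
        && sizeof(label_ptr) == sizeof(char *);
}
```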

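Initialization

The differences between a character array and a string constant go beyond the points above: depending on how each is initialized, its storage lies in a different region of the program's memory. For example: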

	char *c = "abc";
	c[1] = 'a';

With VC2010 this compiles, but the program crashes at run time. With VC2015 it does not compile.

The following code works as expected.

	char c[]="abc";
	c[1] = 'a';

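In a multitasking operating system, every process runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system and consulted by the processor [2][3]. The standard memory segment layout of a Linux process is illustrated in [2].

The memory used by a process is generally divided into the code segment (Code or Text), the read-only data segment (RO Data), the initialized data segment (RW Data), the uninitialized data segment (BSS), the heap, and the stack. The code, read-only data, read-write data, and uninitialized data segments are static regions, while the heap and the stack are dynamic regions [4]. Depending on how strings and character arrays are initialized, their storage lies in different regions of the program's memory. For example [4]: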

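#include <stdlib.h>
#include <string.h>

const char ro[] = { "this is read only data" }; // read-only data segment
static char rw_1[] = { "this is global read write data" }; // initialized read-write data segment
char BSS_1[100]; // uninitialized data segment (BSS)
const char *ptrconst = "constant data"; // the string is placed in the read-only data segment

int main()
{
	short b; // on the stack, 2 bytes

	char a[100]; // 100 bytes on the stack; the name a converts to the address of a[0]

	char s[] = "abcdefg"; // s is an 8-byte array on the stack, initialized with a copy of "abcdefg" (the source copy is typically kept in read-only data)

	char *p1; // p1 is on the stack, 4 bytes on a 32-bit target

	char *p2 = "123456"; // p2 is on the stack; the "123456" it points to is in read-only data and must not be modified

	static char rw_2[] = { "this is local read write data" }; // initialized read-write data segment (local static)

	static char BSS_2[100]; // uninitialized data segment (local static)

	static int c = 0; // static storage (zero-initialized statics are typically placed in BSS)

	p1 = (char *) malloc(10 * sizeof(char)); // the allocated block is on the heap
	if (p1 == NULL) return 1; // allocation failed

	strcpy(p1, "xxxx"); // "xxxx" is in read-only data, 5 bytes

	free(p1); // release the block p1 points to

	return 0;
}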

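Stack memory is allocated and released automatically by the compiler and holds function arguments, local variables, and so on. It operates like the stack data structure. Stack memory can hold a function's automatic (non-static) variables, its parameters, and its return value. On a typical 32-bit x86 call, the arguments are pushed first (from right to left in most C compilers), then the address of the instruction that follows the call statement (the return address), and then space is reserved for the function's local variables. Note that static variables are not placed on the stack [5].

Heap memory is allocated and released under the programmer's control; whatever is still allocated when the program ends is reclaimed by the operating system. It is obtained with the malloc function in C and the new operator in C++ [5].

Because the storage regions of a program serve different purposes, they permit different operations. Ignoring these differences when programming leads to compile-time or run-time errors. For example: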

int main(){
    char *pa = "Hello, world."; 
    return 1;
}

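The string constant "Hello, world." is stored in a read-only region (the code area, according to [7]) and must not be modified. The character pointer pa is stored on the stack and can be modified. However, if the program tries to modify the string contents through pa, the code compiles but the behavior is undefined, and the program typically raises an exception at run time [6]. For example: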

int main(){
    char *pa;
    pa = "Hello, world.";
    pa[2]='a';
    return 1;
}

To avoid this kind of run-time failure, treat pa as a pointer to constant data and rewrite the code as:

const char *pa  = "Hello, world.";

Then any attempt to modify the characters through pa is reported at compile time:

you cannot assign to a variable that is const

If a character array is used instead, for example:

char c[] = "abc";

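4 bytes are allocated on the stack (for an array declared inside a function), holding 'a', 'b', 'c', and '\0'. Stack memory is writable, so the code shown earlier that modifies the contents of the character array compiles and runs normally.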
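
When the text starts out as a string constant but has to be edited, one option is to copy it into an array first. This is a minimal sketch with a hypothetical function name, edit_copy; the buffer is sized from the literal itself, so the copy cannot overflow.

```c
#include <string.h>

/* Edits a private copy of the literal; returns 1 if only the copy changed. */
int edit_copy(void)
{
    const char *pa = "Hello, world.";   /* literal: treat as read-only */
    char buf[sizeof "Hello, world."];   /* 14 bytes, including '\0' */

    strcpy(buf, pa);                    /* buf is large enough by construction */
    buf[2] = 'a';                       /* legal: buf is writable storage */

    return buf[2] == 'a'
        && pa[2] == 'l'
        && strcmp(buf, "Healo, world.") == 0;
}
```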

Initialization:

If the string used to initialize a char array is longer than the declared length of the array, for example

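	char label[10] = "SingleSingle123213213121";

	printf("len = %zu\n", strlen(label) );
	printf("size = %zu\n", sizeof(label) );

the compiler (VC2015) reports an "array bounds overflow" error.

When a char array is declared without an initializer, its length must be given by a constant expression. In C++ a const int qualifies, as shown here; in C, where a const int is not a constant expression, use a macro or an enum constant instead. For example: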

const int len = 10;
        //...
	char label_array[len];

If the array length is omitted,

char label_array[];

the compiler reports an "unknown size" error.

If a non-constant variable is used,

	int len = 10;
	char label_array[len];

the compiler reports "error C2131: expression did not evaluate to a constant note: failure was caused by non-constant arguments or reference to a non-constant symbol".
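
Two common ways around these errors in C are sketched below, with hypothetical names throughout: an enum constant when the length is known at compile time, and heap allocation when it is known only at run time.

```c
#include <stdlib.h>
#include <string.h>

enum { LABEL_LEN = 10 };   /* an enum constant is a constant expression in C */

/* Fixed-size buffer whose length is known at compile time. */
size_t fixed_size(void)
{
    char label_array[LABEL_LEN];
    return sizeof label_array;
}

/* Length known only at run time: allocate on the heap instead.
   The caller owns the result and must free() it. */
char *make_filled(size_t len, char fill)
{
    char *p = malloc(len + 1);    /* one extra byte for the terminator */
    if (p == NULL)
        return NULL;
    memset(p, fill, len);
    p[len] = '\0';
    return p;
}
```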

The following code was used to test the differences between character arrays and strings:

//
//  Char* operation
//

#include <stdio.h>
#include <string.h>
#include <stdlib.h>


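/* Copy the leftmost n characters of src into dst */
char * left(char *dst,char *src, int n)
{
    char *p = src;
    char *q = dst;
    int len = strlen(src);
    if(n<0) n = 0;
    if(n>len) n = len;
    /*p += (len-n);*/   /* would start at the n-th character from the right */
    while(n--) *(q++) = *(p++);
    *(q++)='\0'; /* required: the copy loop does not write a terminator */
    return dst;
}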


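/* Copy n characters from the middle of src into dst */
char * mid(char *dst,char *src, int n,int m) /* n is the length, m is the start position */
{
    char *p = src;
    char *q = dst;
    int len = strlen(src);
    if(m<0) m=0;    /* start from the first character */
    if(m>len) return NULL;
    if(n<0) n=0;
    if(n>len-m) n = len-m;    /* copy from position m to the end */
    p += m;
    while(n--) *(q++) = *(p++);
    *(q++)='\0'; /* required: the copy loop does not write a terminator */
    return dst;
}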


struct Card{
	char*  _pData = NULL;
	int _fieldNumber;
	int _fieldSize;
	bool _alive;

	void show()
	{
		if (NULL != _pData)
		{
			printf("data is: %s \n", _pData);
		}else{
			printf("data pointer is NULL. \n" );
		}
		return;
	}

	char* get( int index ){
//		char result[9];
//		int startLocation = (index*_fieldSize + 8) - 8;
//		mid(result, _pData, _fieldSize, startLocation );
//		return result;
		//memset(buffer,9,sizeof(buffer))

		if (NULL == _pData)
		{
			printf("data pointer is NULL. \n" );
			return NULL;
		}

		char* result =  (char *)malloc(_fieldSize+1);
		int startLocation = (index*_fieldSize + 8) - 8;
		mid(result, _pData, _fieldSize, startLocation );
		return result;
	}

	void pushEnetry( char* entryData )
	{
		 int entryLen = strlen(entryData);
		 int len = 0;

		if (NULL != _pData)
		{
			len = strlen(_pData);
		}

		 printf("pushEnetry: len = %d \n", len);

		 char *theEntry = (char *)malloc((_fieldSize)* sizeof(char));
		 {

		 }

		 int newLen = len + entryLen;

		 char* newpData =  (char *)malloc((newLen+1)* sizeof(char));

		char *p = entryData;
		char *q = newpData;
		char* oldData = _pData;
		 while( len-- )
		 {
			 *(q++) = *(oldData++);
		 }

		 while( entryLen-- )
		 {
			 *(q++) = *(p++);
		 }
		 *(q++)='\0';

		 printf("newpData = %s \n", newpData);


////
//		 memcpy(newpData, _pData, len * sizeof(char));
//
//		 //memcpy(&newpData[len], entryData, entryLen * sizeof(char));
//
//
//		 newpData += len;
//
//
//		  newpData[newLen]='\0';

		if (NULL != _pData)
		{
			free(_pData);
		}

		 _pData = newpData;

	}
};


void test_1()
{
	char* text = "1234567890abcdefghijklmn";
	printf("string is: %s \n", text );

    char* result_1 =  (char *)malloc(8);   // 7 characters plus the terminating '\0'
    mid(result_1, text, 7, 5 );
	printf("string is: %s\n", result_1 );
	free(result_1);   // release the buffer allocated above
}

void test_2()
{
	Card myCard;

	char* text = "1234567890abcdefghijklmn";
	//myCard._pData = text;
	myCard._fieldSize = 8;
	myCard.show();


	char* result;
	result = myCard.get(0);
	if (NULL != result)
	{
		printf("string is: %s\n", result);
	}

	char* foo = "foo";
	myCard.pushEnetry(foo);

	// myCard.show();

	result = myCard.get(0);
	if (NULL != result)
	{
		printf("string is: %s\n", result);
	}

	return;

}


void test_3(char pa[])
{
	printf("test_3: len = %zu\n", strlen(pa) );
	printf("test_3: size = %zu\n", sizeof(pa) );   // size of a char*, not of the caller's array
	return;
}



void printArray(int data[], int length)
{
//    for(int i(0); i < length; ++i)
//    {
//        std::cout << data[i] << ' ';
//    }
//    std::cout << std::endl;
}


const int len = 10;
int main()
{
	test_1();
	test_2();
	return 0;
}

[8] explains in some detail how an out-of-bounds array access can corrupt the heap and crash a later call to malloc:

"Okay, I'm going to have to assume that you mean SIGSEGV (segmentation fault) is firing in malloc. This is usually caused by heap corruption. Heap corruption, that itself does not cause a segmentation fault, is usually the result of an array access outside of the array's bounds. This is usually nowhere near the point where you call malloc."

"malloc stores a small header of information "in front of" the memory block that it returns to you. This information usually contains the size of the block and a pointer to the next block. Needless to say, changing either of these will cause problems. Usually, the next-block pointer is changed to an invalid address, and the next time malloc is called, it eventually dereferences the bad pointer and segmentation faults. Or it doesn't and starts interpreting random memory as part of the heap. Eventually its luck runs out."[8]

"Note that free can have the same thing happen, if the block being released or the free block list is messed up."[8]

"How you catch this kind of error depends entirely on how you access the memory that malloc returns. A malloc of a single struct usually isn't a problem; it's malloc of arrays that usually gets you. Using a negative (-1 or -2) index will usually give you the block header for your current block, and indexing past the array end can give you the header of the next block. Both are valid memory locations, so there will be no segmentation fault."[8]

"So the first thing to try is range checking. You mention that this appeared at the customer's site; maybe it's because the data set they are working with is much larger, or that the input data is corrupt (e.g. it says to allocate 100 elements and then initializes 101), or they are performing things in a different order (which hides the bug in your in-house testing), or doing something you haven't tested. It's hard to say without more specifics. You should consider writing something to sanity check your input data."[8]

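The range checking suggested in [8] can be as simple as routing writes through a helper that knows the size of the block. The function checked_store below is a hypothetical name and only a sketch of the idea, not a complete defense against heap corruption.

```c
#include <stddef.h>

/* Stores value at buf[index] only if index is in range; returns 1 on success. */
int checked_store(int *buf, size_t count, size_t index, int value)
{
    if (buf == NULL || index >= count)
        return 0;            /* would write outside the block: refuse */
    buf[index] = value;
    return 1;
}
```

Because index is a size_t, a negative index passed by mistake converts to a very large value and is rejected by the same comparison.
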
References:

[1] https://www.cs.bu.edu/teaching/cpp/string/array-vs-ptr/
[2] http://duartes.org/gustavo/blog/comments/anatomy.html
[3] http://www.cnblogs.com/lancidie/archive/2011/06/26/2090547.html
[4] http://jingyan.baidu.com/article/4665065864601ff549e5f8a9.html
[5] http://blog.csdn.net/codingkid/article/details/6858395
[6] http://www.cnblogs.com/nzbbody/p/3553222.html
[7] http://www.cnblogs.com/dejavu/archive/2012/08/13/2627498.html
[8] http://stackoverflow.com/questions/7480655/how-to-troubleshoot-crashes-in-malloc
[9] http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/
[10] Brian W. Kernighan, Dennis M. Ritchie. The C Programming Language (2nd edition). Tsinghua University Press, Prentice Hall, 1997.
[11] International standard of programming language C. April 12, 2011.