Finding an integer that is missing from 4 billion integers

Foreword

Given a sequential file that contains at most four billion 32-bit integers in random order, find a 32-bit integer that is not in the file. (At least one such number must be missing. Why?) How would you solve this problem with ample memory? How would you solve it if several external "temporary" files are available but only a few hundred bytes of memory?

Analysis

This is another problem from "Programming Pearls". Earlier we covered the bitmap method, and it can solve this problem too. There are 4,294,967,296 distinct 32-bit integers, which is more than 4 billion, so at least one of them must be missing from the file. With a bitmap we need 536,870,912 bytes (512 MB) of memory, one bit for each possible integer: set the bit that corresponds to each integer read, then scan the bitmap and output the position of the first bit that is still 0. But what if only a few "temporary" files and a few hundred bytes of memory are available?
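The bitmap approach described above can be sketched as follows. This is only a minimal illustration, not the article's code: the function name find_missing_bitmap and its bits parameter are assumptions introduced here so that the idea can be tried on small value ranges, and the values are treated as unsigned bit patterns. With bits set to 32 the map occupies the 512 MB mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the original article.
 * Finds the smallest value in [0, 2^bits) that does not occur in vals[0..n-1].
 * Returns 0 and stores the value in *out on success,
 * 1 if every value in the range is present, -1 if allocation fails. */
int find_missing_bitmap(const uint32_t *vals, size_t n, unsigned bits, uint32_t *out)
{
    assert(bits >= 1 && bits <= 32 && out != NULL);
    uint64_t range = (uint64_t)1 << bits;     /* number of possible values */
    size_t bytes = (size_t)((range + 7) / 8); /* one bit per possible value */
    unsigned char *map = calloc(bytes, 1);
    if (map == NULL)
        return -1;

    /* Pass 1: set the bit of every value that occurs in the input. */
    for (size_t i = 0; i < n; i++) {
        uint32_t v = vals[i];
        if (v < range)
            map[v >> 3] |= (unsigned char)(1u << (v & 7));
    }

    /* Pass 2: the first bit that is still 0 marks a missing value. */
    int rc = 1;
    for (uint64_t v = 0; v < range; v++) {
        if ((map[v >> 3] & (1u << (v & 7))) == 0) {
            *out = (uint32_t)v;
            rc = 0;
            break;
        }
    }
    free(map);
    return rc;
}
```

For example, calling it on the values 0, 1, 2, 4, 5 with bits set to 4 stores 3 in *out. Signed input can be handled by reinterpreting each number as its unsigned bit pattern first.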

Can we use binary search? The 4 billion integers are in random order, so an ordinary binary search cannot find the missing number. But we can borrow the idea behind binary search.

Every bit of a 32-bit integer is either 0 or 1, so each bit can be used to split the data into two piles. Starting from the most significant bit:

  • Put the integers whose bit is 0 in one pile and those whose bit is 1 in another
  • If the two piles are the same size, pick either one; for example, pick the 0 pile, and the bit obtained is 0
  • If the sizes differ, continue with the smaller pile; for example, if the 1 pile is smaller, the bit obtained is 1

Some explanation is in order:

  • There are 2^32 32-bit integers in total, and for any given bit exactly half of them have that bit set to 0 and half have it set to 1. If, among the 4 billion integers, the number with the bit set to 0 equals the number with the bit set to 1, then both piles are missing some numbers, so either pile can be selected.
  • If more integers have the bit set to 1 than to 0, then the 0 pile is definitely missing some numbers, while the 1 pile may or may not be missing any. Therefore we choose the smaller pile, in this case the 0 pile.
  • Each selection records a 0 or a 1. After at most 32 selections we have at least one integer that does not exist among the 4 billion.
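The selection rule above can be sketched in memory before moving to files. The following is a minimal illustration under stated assumptions, not the article's implementation: the function name find_missing_by_bits is hypothetical, the values are treated as unsigned bit patterns of a chosen width, and the input array is reordered in place instead of being written to temporary files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory sketch, not part of the original article.
 * vals holds n values, each below 2^bits, with n < 2^bits so that at least
 * one value is missing. The array is reordered in place.
 * Returns a value in [0, 2^bits) that does not occur in vals. */
uint32_t find_missing_by_bits(uint32_t *vals, size_t n, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    assert((uint64_t)n < ((uint64_t)1 << bits));
    uint32_t result = 0;

    /* Walk from the most significant bit down; stop once the pile is empty. */
    for (unsigned b = bits; b > 0 && n > 0; b--) {
        uint32_t mask = (uint32_t)1 << (b - 1);

        /* Partition: values whose bit is 0 are moved to the front. */
        size_t zeros = 0;
        for (size_t i = 0; i < n; i++) {
            if ((vals[i] & mask) == 0) {
                uint32_t t = vals[i];
                vals[i] = vals[zeros];
                vals[zeros] = t;
                zeros++;
            }
        }
        size_t ones = n - zeros;

        if (ones < zeros) {
            /* The 1 pile is smaller, so it must be missing something. */
            result |= mask;
            vals += zeros;
            n = ones;
        } else {
            /* Tie, or the 0 pile is smaller: continue with the 0 pile. */
            n = zeros;
        }
    }
    /* Any bits not yet decided stay 0; the selected pile is empty by now. */
    return result;
}
```

For instance, with the 4-bit patterns 3, 5, 2, 6, 7, 15, 12, 10, 13, 1, 11 the function returns 8 (binary 1000). Each round keeps at most half of the remaining values, so the total work stays proportional to the input size.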

Example shows

Because 32-bit data is too large to walk through conveniently, we illustrate the idea with 4-bit data. Four bits can represent at most 16 numbers.
Consider the following data source:

3 5 2 6 7 -1 -4 -6 -3 1 -5

The corresponding binary forms are as follows (negative numbers are stored in two's complement form):

0011 0101 0010 0110 0111 1111 1100 1010 1101 0001 1011

1. Process the first (most significant) bit. The data is split into two parts:

  • Bit is 0: 0011 0101 0010 0110 0111 0001
  • Bit is 1: 1111 1100 1010 1101 1011

Five numbers have a first bit of 1, fewer than the six whose first bit is 0, so we select the numbers whose first bit is 1 and continue with them. The first bit obtained is 1.

2. Process the second bit. The remaining data is again split into two parts:

  • Bit is 0: 1010 1011
  • Bit is 1: 1111 1100 1101

Three numbers have a second bit of 1, more than the two whose second bit is 0, so we select the numbers whose second bit is 0 and continue with them. The second bit obtained is 0.

3. Process the third bit. The remaining data is again split into two parts:

  • Bit is 0: none
  • Bit is 1: 1010 1011

No number has a third bit of 0, so we select 0, and the third bit obtained is 0. At this point there is no need to continue.

The first three bits are therefore 100, so at least 1000 and 1001, i.e. -8 and -7, do not exist in the data.

Code

C language:

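#include <stdio.h>
#include <stdlib.h>
#define MAX_STR 16 /* long enough for "-2147483648\n" plus the terminating NUL */
#define SOURCE_FILE "source.txt" /* the original input file; it must be preserved */
#define SRC_FILE "src.txt" /* the file to be split in the current round */
#define BIT_1_FILE "bit1.txt"
#define BIT_0_FILE "bit0.txt"
#define INT_BIT_NUM 32
/*
 * FILE *src     source data file
 * FILE *fpBit1  receives the numbers whose processed bit is 1
 * FILE *fpBit0  receives the numbers whose processed bit is 0
 * int bit       index of the bit to process
 * int *nums     receives the size of the selected (smaller) pile
 * Return value:
 *   0: continue with the numbers whose bit is 0
 *   1: continue with the numbers whose bit is 1
 *  -1: error
 */
int splitByBit(FILE *src, FILE *fpBit1, FILE *fpBit0, int bit, int *nums)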
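{
    /* check the arguments */
    if (NULL == src || NULL == fpBit1 || NULL == fpBit0 || NULL == nums)
    {
        printf("input para is NULL");
        return -1;
    }
    /* check the bit index; valid indexes are 0..INT_BIT_NUM-1 */
    if (bit < 0 || bit >= INT_BIT_NUM)
    {
        printf("the bit is wrong");
        return -1;
    }
    char string[MAX_STR] = {0};
    unsigned int mask = 1u << bit; /* unsigned, so shifting into bit 31 is well defined */
    long long bit0num = 0; /* one pile can exceed INT_MAX with 4 billion inputs */
    long long bit1num = 0;
    int num = 0;
    //printf("mask is %x\n", mask);
    /* read the source data line by line */
    while (fgets(string, MAX_STR, src) != NULL)
    {
        num = atoi(string);
        //printf("%d&%u %u\n", num, mask, (unsigned int)num & mask);
        /* write the number to one file or the other according to the bit;
           the parentheses matter because == binds tighter than & */
        if (0 == ((unsigned int)num & mask))
        {
            //printf("bit 0 %d\n", num);
            fprintf(fpBit0, "%d\n", num);
            bit0num++;
        }
        else
        {
            //printf("bit 1 %d\n", num);
            fprintf(fpBit1, "%d\n", num);
            bit1num++;
        }
    }
    //printf("bit0num:%lld,bit1num:%lld\n", bit0num, bit1num);
    if (bit0num > bit1num)
    {
        /* fewer numbers have the bit set to 1 */
        *nums = (int)bit1num; /* the smaller pile always fits in an int */
        return 1;
    }
    else
    {
        *nums = (int)bit0num;
        return 0;
    }
}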
/*
 * Close every file that is still open and reset its pointer to NULL.
 */
void closeAllFile(FILE **src, FILE **bit0, FILE **bit1)
{
    if (NULL != src && NULL != *src)
    {
        fclose(*src);
        *src = NULL;
    }
    if (NULL != bit1 && NULL != *bit1)
    {
        fclose(*bit1);
        *bit1 = NULL;
    }
    if (NULL != bit0 && NULL != *bit0)
    {
        fclose(*bit0);
        *bit0 = NULL;
    }
}
int findNum(int *result)
{
    if (NULL == result)
    {
        return -1;
    }
    int loop = 0;
    /* open the original file */
    FILE *src = fopen(SOURCE_FILE, "r");
    if (NULL == src)
    {
        printf("failed to open %s", SOURCE_FILE);
        return -1;
    }
    FILE *bit1 = NULL;
    FILE *bit0 = NULL;
    unsigned int num = 0; /* unsigned, so setting bit 31 is well defined */
    int bitNums = 0; /* size of the pile selected in this round */
    int findBit = 0; /* bit value obtained in this round */
    /* Bits are processed from the lowest (loop = 0) to the highest; the
       halving argument holds for any bit order. */
    for (loop = 0; loop < INT_BIT_NUM; loop++)
    {
        /* src is already open in the first round (the original file, which
           is kept); later rounds reopen src.txt */
        if (NULL == src)
        {
            src = fopen(SRC_FILE, "r");
        }
        if (NULL == src)
        {
            return -1;
        }
        /* if an open fails, close every file that is already open */
        bit1 = fopen(BIT_1_FILE, "w+");
        if (NULL == bit1)
        {
            closeAllFile(&src, &bit1, &bit0);
            printf("failed to open %s", BIT_1_FILE);
            return -1;
        }
        bit0 = fopen(BIT_0_FILE, "w+");
        if (NULL == bit0)
        {
            closeAllFile(&src, &bit1, &bit0);
            printf("failed to open %s", BIT_0_FILE);
            return -1;
        }
        findBit = splitByBit(src, bit1, bit0, loop, &bitNums);
        if (-1 == findBit)
        {
            printf("process error\n");
            closeAllFile(&src, &bit1, &bit0);
            return -1;
        }
        closeAllFile(&src, &bit1, &bit0);
        //printf("find bit %d\n", findBit);
        /* rename the smaller pile to src.txt for the next round;
           some platforms will not rename onto an existing file, so remove it first */
        remove(SRC_FILE);
        if (1 == findBit)
        {
            rename(BIT_1_FILE, SRC_FILE);
            num |= (1u << loop);
            printf("mv bit1 file to src file\n");
        }
        else
        {
            printf("mv bit0 file to src file\n");
            rename(BIT_0_FILE, SRC_FILE);
        }
        /* if the selected pile is empty, there is no need to go on */
        if (0 == bitNums)
        {
            printf("no need to continue\n");
            break;
        }
    }
    *result = (int)num;
    return 0;
}
int main(void)
{
    int num = 0;
    if (0 != findNum(&num))
    {
        return 1;
    }
    printf("final num is %d or 0x%x\n", num, (unsigned int)num);
    return 0;
}

Code Description:

  • splitByBit splits the data into two parts according to the given bit
  • closeAllFile closes the files that are still open
  • findNum loops over the 32 bits, processing one bit per round, and ends up with an integer that does not exist in the file

A script was used to generate about 20 million integers:

$ wc -l source.txt
20000001 source.txt
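The generating script itself is not shown. As a stand-in, a comparable input file could be produced with something like the sketch below. Everything in it is an assumption made for illustration: the function name write_random_ints, the seed parameter, and the choice of non-negative values below 2^30 are not taken from the original.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical generator, not the author's script.
 * Writes `count` pseudo-random integers in [0, 2^30), one per line, to `path`.
 * Returns 0 on success, -1 on failure. */
int write_random_ints(const char *path, long count, unsigned seed)
{
    assert(path != NULL && count >= 0);
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    srand(seed);
    for (long i = 0; i < count; i++) {
        /* rand() may provide as few as 15 random bits, so combine two calls. */
        long hi = (long)(rand() & 0x7fff);
        long lo = (long)(rand() & 0x7fff);
        if (fprintf(fp, "%ld\n", (hi << 15) | lo) < 0) {
            fclose(fp);
            return -1;
        }
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```

Calling write_random_ints("source.txt", 20000000L, 1u) would produce a file of roughly the size used here. The values will differ from the author's data, so the integer the program reports will differ as well.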

Compile and run:

$ gcc -o binarySearch binarySearch.c
$ time ./binarySearch
final num is 18950401 or 0x1212901
real 0m8.001s
user 0m6.466s
sys 0m0.445s

The program spends most of its time reading and writing files, and it uses very little memory.

Summary

This article approached the problem from a particular angle, using the familiar idea of binary search: after at most 32 splits, we find an integer that does not exist in the file. If you have a better idea or an optimization to suggest, feel free to leave a comment.

Original: Large column  to find a non-existent from 4 billion integers



Origin www.cnblogs.com/wangziqiang123/p/11618442.html