[Software Engineering Practice] Pig Project 9: Source Code Analysis of the Data Directory - Other Bags

2021SC@SDUSC

The previous post covered most of Pig's basic data structures. This post looks at variants built on that same architecture.

InternalCachedBag

First up is the internal cached bag.

Its inheritance relationship is as follows:

public class InternalCachedBag extends SelfSpillBag

UML of the parent class (figure)

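The class comment does not say what the class is for, but following its references led me to a test class, TestDataBag, which contains test methods for the various bag classes. I have not yet found the entry point or worked out how the whole project is run, so I could not execute the tests directly. The UML of the test class is as follows (figure):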

 

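None of that matters much, though. What matters is the source of the test method, which exercises the various properties of InternalCachedBag. assertEquals asserts that its two arguments are equal and fails the test otherwise, so the second value in each call can be read as the value the first one is expected to have. (JUnit declares the parameters as (expected, actual); this test passes them the other way round, which affects only the failure message.)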

@Test
public void testInternalCachedBag() throws Exception {
    // check adding empty tuple
    DataBag bg0 = new InternalCachedBag();
    bg0.add(TupleFactory.getInstance().newTuple());
    bg0.add(TupleFactory.getInstance().newTuple());
    assertEquals(bg0.size(), 2);

    // check equal of bags
    DataBag bg1 = new InternalCachedBag(1, 0.5f);
    assertEquals(bg1.size(), 0);

    String[][] tupleContents = new String[][] { { "a", "b" }, { "c", "d" }, { "e", "f" } };
    for (int i = 0; i < tupleContents.length; i++) {
        bg1.add(Util.createTuple(tupleContents[i]));
    }

    // check size, and isSorted(), isDistinct()
    assertEquals(bg1.size(), 3);
    assertFalse(bg1.isSorted());
    assertFalse(bg1.isDistinct());

    tupleContents = new String[][] { { "c", "d" }, { "a", "b" }, { "e", "f" } };
    DataBag bg2 = new InternalCachedBag(1, 0.5f);
    for (int i = 0; i < tupleContents.length; i++) {
        bg2.add(Util.createTuple(tupleContents[i]));
    }
    assertEquals(bg1, bg2);

    // check bag with data written to disk
    DataBag bg3 = new InternalCachedBag(1, 0.0f);
    tupleContents = new String[][] { { "e", "f" }, { "c", "d" }, { "a", "b" } };
    for (int i = 0; i < tupleContents.length; i++) {
        bg3.add(Util.createTuple(tupleContents[i]));
    }
    assertEquals(bg1, bg3);

    // check iterator
    Iterator<Tuple> iter = bg3.iterator();
    DataBag bg4 = new InternalCachedBag(1, 0.0f);
    while (iter.hasNext()) {
        bg4.add(iter.next());
    }
    assertEquals(bg3, bg4);

    // call iterator methods with irregular order
    iter = bg3.iterator();
    assertTrue(iter.hasNext());
    assertTrue(iter.hasNext());
    DataBag bg5 = new InternalCachedBag(1, 0.0f);
    bg5.add(iter.next());
    bg5.add(iter.next());
    assertTrue(iter.hasNext());
    bg5.add(iter.next());
    assertFalse(iter.hasNext());
    assertFalse(iter.hasNext());
    assertEquals(bg3, bg5);

    bg4.clear();
    assertEquals(bg4.size(), 0);
}
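The test treats bg1, bg2 and bg3 as equal even though their tuples were added in different orders. One way to get that behaviour is to compare sorted copies of the two collections. The sketch below assumes that approach and models a tuple as a plain list of strings; BagEqualitySketch and sameBag are hypothetical names and this is not Pig's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for order-insensitive bag equality; not Pig's implementation.
class BagEqualitySketch {

    // A "tuple" is modelled as a list of strings, compared field by field.
    static final Comparator<List<String>> TUPLE_ORDER = (a, b) -> {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.size(), b.size());
    };

    // Two bags are equal when they hold the same tuples with the same
    // multiplicities, regardless of insertion order.
    static boolean sameBag(List<List<String>> left, List<List<String>> right) {
        if (left.size() != right.size()) {
            return false;
        }
        List<List<String>> l = new ArrayList<>(left);
        List<List<String>> r = new ArrayList<>(right);
        l.sort(TUPLE_ORDER);
        r.sort(TUPLE_ORDER);
        return l.equals(r);
    }

    public static void main(String[] args) {
        List<List<String>> bg1 = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c", "d"), Arrays.asList("e", "f"));
        List<List<String>> bg2 = Arrays.asList(
                Arrays.asList("c", "d"), Arrays.asList("a", "b"), Arrays.asList("e", "f"));
        System.out.println(sameBag(bg1, bg2));
    }
}
```

Sorting copies leaves the callers' lists untouched, which matters because the test also asserts that the bag itself is not sorted.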

I forgot to show the constructors earlier, so here they are:

public InternalCachedBag() {
    this(1, -1f);
}

public InternalCachedBag(int bagCount) {
    this(bagCount, -1f);
}

public InternalCachedBag(int bagCount, float percent) {
    super(bagCount, percent);
    init();
}

private void init() {
    factory = TupleFactory.getInstance();
    mContents = new ArrayList<Tuple>();
    addDone = false;
}

The constructor calls super, so we also need to look at the SelfSpillBag constructors:

public SelfSpillBag(int bagCount) {
    memLimit = new MemoryLimits(bagCount, -1);
}

public SelfSpillBag(int bagCount, float percent) {
    memLimit = new MemoryLimits(bagCount, percent);
}

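Surprisingly simple. The second parameter is the memory limit, given as a fraction of memory rather than an absolute size. A value of -1 looks like it might mean "no limit", but it more likely means "use the default": the InternalDistinctBag constructor shown later in this post replaces a negative percent with 0.2 or with the configured PIG_CACHEDBAG_MEMUSAGE value. The remaining properties can be read off the source, so I won't list them one by one. In short, this bag is a spillable bag whose memory footprint can be capped.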
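To make the two parameters concrete, here is a minimal sketch of how a per-bag memory budget could be derived from bagCount and percent. It assumes the budget is the JVM's maximum memory times the fraction, split evenly across the bags, and that a negative fraction falls back to 0.2. MemoryBudgetSketch and budgetBytes are hypothetical names; the body of Pig's MemoryLimits is not shown in this post, so treat the formula as an assumption.

```java
// Hypothetical sketch of a per-bag memory budget; not Pig's MemoryLimits class.
class MemoryBudgetSketch {

    // Assumed fallback fraction, matching the 0.2F default that appears
    // in the InternalDistinctBag constructor.
    static final double DEFAULT_PERCENT = 0.2;

    // maxMem:   total memory available, in bytes
    // bagCount: number of bags expected to share that memory
    // percent:  fraction of memory the bags may use; negative means "use the default"
    static long budgetBytes(long maxMem, int bagCount, double percent) {
        if (bagCount <= 0) {
            throw new IllegalArgumentException("bagCount must be positive");
        }
        double effective = percent < 0 ? DEFAULT_PERCENT : percent;
        return (long) (maxMem * effective / bagCount);
    }

    public static void main(String[] args) {
        long maxMem = Runtime.getRuntime().maxMemory();
        System.out.println("default budget: " + budgetBytes(maxMem, 1, -1));
        System.out.println("zero budget:    " + budgetBytes(maxMem, 1, 0.0));
    }
}
```

Under this reading, a fraction of 0.0f, as used for bg3 in the test, leaves no in-memory budget at all, which would explain the test comment about data being written to disk.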

InternalDistinctBag

Next, the internal distinct bag.

Its inheritance relationship:

public class InternalDistinctBag extends SortedSpillBag

UML of the parent class (figure)

Here is the class comment:

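/**
 * An unordered collection of Tuples with no multiples. Data is stored
 * without duplicates as it comes in. When it is time to spill, the data
 * is sorted and written to disk.
 * The data is stored in a HashSet. When it is time to sort, it is placed
 * in an ArrayList and then sorted.
 * Despite all these machinations, this was found to be faster than storing
 * it in a TreeSet. This bag spills proactively when the number of tuples
 * in memory reaches a limit.
 */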
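The storage strategy described in that comment can be sketched in a few lines: keep elements in a HashSet so duplicates are dropped on insert, and copy them into an ArrayList and sort only when a sorted view is needed. DistinctStoreSketch is a hypothetical name, and tuples are modelled as plain strings to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "HashSet for storage, ArrayList for sorting".
class DistinctStoreSketch {

    // Duplicates are rejected by the set as elements arrive.
    private final Set<String> contents = new HashSet<>();

    void add(String tuple) {
        contents.add(tuple);
    }

    int size() {
        return contents.size();
    }

    // Only when a sorted view is needed (for example before spilling)
    // are the elements copied to a list and sorted.
    List<String> sortedForSpill() {
        List<String> sorted = new ArrayList<>(contents);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        DistinctStoreSketch bag = new DistinctStoreSketch();
        for (String t : new String[] { "e,f", "a,b", "e,d", "a,b", "e,f" }) {
            bag.add(t);
        }
        System.out.println(bag.size() + " " + bag.sortedForSpill());
    }
}
```

Inserts stay cheap this way, and the sorting cost is paid only at the moment a sorted run is actually required.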

The comment already describes the basic properties fairly clearly, so let's go back to TestDataBag and find the corresponding test method.

@Test
public void testInternalDistinctBag() throws Exception {
    // check adding empty tuple
    DataBag bg0 = new InternalDistinctBag();
    bg0.add(TupleFactory.getInstance().newTuple());
    bg0.add(TupleFactory.getInstance().newTuple());
    assertEquals(bg0.size(), 1); // both tuples are empty, so they count as the same tuple

    // check equal of bags
    DataBag bg1 = new InternalDistinctBag();
    assertEquals(bg1.size(), 0);

    String[][] tupleContents = new String[][] { { "e", "f" }, { "a", "b" }, { "e", "d" }, { "a", "b" }, { "e", "f" } };
    for (int i = 0; i < tupleContents.length; i++) {
        bg1.add(Util.createTuple(tupleContents[i]));
    }

    // check size, and isSorted(), isDistinct()
    assertEquals(bg1.size(), 3);
    assertFalse(bg1.isSorted());
    assertTrue(bg1.isDistinct());

    tupleContents = new String[][] { { "a", "b" }, { "e", "d" }, { "e", "d" }, { "e", "f" } };
    DataBag bg2 = new InternalDistinctBag();
    for (int i = 0; i < tupleContents.length; i++) {
        bg2.add(Util.createTuple(tupleContents[i]));
    }
    assertEquals(bg1, bg2); // like a set: order and repeats do not affect equality

    // note: the results of equals() below are discarded, so nothing is asserted here
    Iterator<Tuple> iter = bg1.iterator();
    iter.next().equals(Util.createTuple(new String[] { "a", "b" }));
    iter.next().equals(Util.createTuple(new String[] { "c", "d" }));
    iter.next().equals(Util.createTuple(new String[] { "e", "f" }));

    // check bag with data written to disk
    DataBag bg3 = new InternalDistinctBag(1, 0.0f);
    tupleContents = new String[][] { { "e", "f" }, { "a", "b" }, { "e", "d" }, { "a", "b" }, { "e", "f" } };
    for (int i = 0; i < tupleContents.length; i++) {
        bg3.add(Util.createTuple(tupleContents[i]));
    }
    assertEquals(bg2, bg3);
    assertEquals(bg3.size(), 3);

    // call iterator methods with irregular order
    iter = bg3.iterator();
    assertTrue(iter.hasNext());
    assertTrue(iter.hasNext());

    DataBag bg4 = new InternalDistinctBag(1, 0.0f); // the familiar memory limit again
    bg4.add(iter.next());
    bg4.add(iter.next());
    assertTrue(iter.hasNext());
    bg4.add(iter.next());
    assertFalse(iter.hasNext());
    assertFalse(iter.hasNext());
    assertEquals(bg3, bg4);

    // check clear
    bg3.clear();
    assertEquals(bg3.size(), 0);

    // test with all data spilled
    DataBag bg5 = new InternalDistinctBag();
    for (int j = 0; j < 3; j++) {
        for (int i = 0; i < tupleContents.length; i++) {
            bg5.add(Util.createTuple(tupleContents[i]));
        }
        bg5.spill();
    }
    assertEquals(bg5.size(), 3);

    // test with most data spilled, some data still in memory, and spill files merged
    DataBag bg6 = new InternalDistinctBag();
    for (int j = 0; j < 104; j++) {
        for (int i = 0; i < tupleContents.length; i++) {
            bg6.add(Util.createTuple(tupleContents[i]));
        }
        if (j != 103) {
            bg6.spill();
        }
    }
    assertEquals(bg6.size(), 3);

    // check that the two implementations of sorted bag compare correctly
    DataBag bg7 = new DistinctDataBag();
    for (int j = 0; j < 104; j++) {
        for (int i = 0; i < tupleContents.length; i++) {
            bg7.add(Util.createTuple(tupleContents[i]));
        }
        if (j != 103) {
            bg7.spill();
        }
    }
    assertEquals(bg6, bg7);
}
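The bg5 and bg6 cases show that the bag still reports three tuples after the same data has been spilled many times, so duplicates must also be removed across spills. A minimal sketch of one way to do that follows: each spill becomes a sorted run, and reading merges all runs while skipping repeated values. In-memory lists stand in for spill files here, tuples are modelled as strings, and SpillMergeSketch is a hypothetical name, so this illustrates the idea rather than Pig's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch: distinct bag with sorted spill runs merged on read.
class SpillMergeSketch {

    private final Set<String> inMemory = new HashSet<>();
    // Each inner list is one sorted, duplicate-free run (stand-in for a spill file).
    private final List<List<String>> spilledRuns = new ArrayList<>();

    void add(String tuple) {
        inMemory.add(tuple);
    }

    // Sort the in-memory contents, store them as a run, and free the memory.
    void spill() {
        if (inMemory.isEmpty()) {
            return;
        }
        List<String> run = new ArrayList<>(inMemory);
        Collections.sort(run);
        spilledRuns.add(run);
        inMemory.clear();
    }

    // K-way merge of all runs plus the in-memory data, dropping duplicates.
    List<String> distinctContents() {
        final List<List<String>> runs = new ArrayList<>(spilledRuns);
        List<String> current = new ArrayList<>(inMemory);
        Collections.sort(current);
        runs.add(current);

        // Heap entries are {runIndex, position}, ordered by the value they point at.
        Comparator<int[]> byValue =
                (x, y) -> runs.get(x[0]).get(x[1]).compareTo(runs.get(y[0]).get(y[1]));
        PriorityQueue<int[]> heads = new PriorityQueue<>(byValue);
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heads.add(new int[] { r, 0 });
            }
        }

        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            String value = runs.get(head[0]).get(head[1]);
            // Values arrive in sorted order, so a duplicate always equals the last output.
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(value)) {
                merged.add(value);
            }
            if (head[1] + 1 < runs.get(head[0]).size()) {
                heads.add(new int[] { head[0], head[1] + 1 });
            }
        }
        return merged;
    }

    long size() {
        return distinctContents().size();
    }

    public static void main(String[] args) {
        String[] tuples = { "e,f", "a,b", "e,d", "a,b", "e,f" };
        SpillMergeSketch bag = new SpillMergeSketch();
        for (int j = 0; j < 104; j++) {
            for (String t : tuples) {
                bag.add(t);
            }
            if (j != 103) {
                bag.spill();
            }
        }
        System.out.println(bag.size());
    }
}
```

Because every run is sorted, equal values from different runs come out of the merge next to each other, so a comparison with the last emitted value is enough to drop them.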

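The main properties can be read off the test code. To summarise, InternalDistinctBag is a duplicate-free, unordered collection of Tuples stored in a HashSet (isSorted() returns false; the data is sorted only when it is spilled). A memory limit can be specified, and the bag spills to disk once that limit is exceeded.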

As usual, here are the constructors:

public InternalDistinctBag() {
    this(1, -1.0f);
}

public InternalDistinctBag(int bagCount) {
    this(bagCount, -1.0f);
}

public InternalDistinctBag(int bagCount, float percent) {
    super(bagCount, percent);

    if (percent < 0) {
        percent = 0.2F;
        if (PigMapReduce.sJobConfInternal.get() != null) {
            String usage = PigMapReduce.sJobConfInternal.get().get(PigConfiguration.PIG_CACHEDBAG_MEMUSAGE);
            if (usage != null) {
                percent = Float.parseFloat(usage);
            }
        }
    }

    init(bagCount, percent);
}

private void init(int bagCount, double percent) {
    mContents = new HashSet<Tuple>();
}
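The percent handling in the three-argument constructor amounts to a small fallback chain: an explicit non-negative value wins, otherwise a configured value is used, otherwise 0.2. The sketch below isolates that logic, with java.util.Properties standing in for the job configuration. PercentResolutionSketch and resolvePercent are hypothetical names, and the key string is my assumption about what PigConfiguration.PIG_CACHEDBAG_MEMUSAGE resolves to.

```java
import java.util.Properties;

// Hypothetical sketch of the percent fallback chain; Properties stands in for the job conf.
class PercentResolutionSketch {

    // Assumed value of PigConfiguration.PIG_CACHEDBAG_MEMUSAGE.
    static final String MEMUSAGE_KEY = "pig.cachedbag.memusage";
    static final float DEFAULT_PERCENT = 0.2f;

    static float resolvePercent(float requested, Properties conf) {
        if (requested >= 0) {
            return requested; // caller asked for an explicit fraction
        }
        float percent = DEFAULT_PERCENT;
        if (conf != null) {
            String usage = conf.getProperty(MEMUSAGE_KEY);
            if (usage != null) {
                percent = Float.parseFloat(usage);
            }
        }
        return percent;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MEMUSAGE_KEY, "0.5");
        System.out.println(resolvePercent(-1f, null));
        System.out.println(resolvePercent(-1f, conf));
        System.out.println(resolvePercent(0.0f, conf));
    }
}
```

Note that in the constructor above the resolved percent is passed to init, which ignores it and only creates the HashSet; the value handed to super is still the original, unresolved one.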

The parent class is involved again, so here is its constructor:

SortedSpillBag(int bagCount, float percent) {
    super(bagCount, percent);
}

The parent simply calls its own parent, so let's check the inheritance relationship:

public abstract class SortedSpillBag extends SelfSpillBag

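That makes everything clear. SelfSpillBag is the same class that InternalCachedBag extends, and its source is shown above, so it is easy to see why this bag can also limit its memory usage.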
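The constructor chain can be reduced to a small model in which the limit is created once in the base class and every subclass only forwards its arguments. All class names in the sketch are hypothetical stand-ins for SelfSpillBag, SortedSpillBag, InternalDistinctBag and InternalCachedBag, and Limits is a simplified placeholder for MemoryLimits.

```java
// Hypothetical model of the constructor chain; not the Pig classes themselves.
class ConstructorChainSketch {

    // Simplified placeholder for MemoryLimits: it only records its arguments.
    static class Limits {
        final int bagCount;
        final float percent;

        Limits(int bagCount, float percent) {
            this.bagCount = bagCount;
            this.percent = percent;
        }
    }

    // Plays the role of SelfSpillBag: the only place the limit is created.
    abstract static class SpillBase {
        final Limits limits;

        SpillBase(int bagCount, float percent) {
            this.limits = new Limits(bagCount, percent);
        }
    }

    // Plays the role of SortedSpillBag: forwards to its parent unchanged.
    abstract static class SortedBase extends SpillBase {
        SortedBase(int bagCount, float percent) {
            super(bagCount, percent);
        }
    }

    // Plays the role of InternalDistinctBag (two levels below the base).
    static class DistinctLeaf extends SortedBase {
        DistinctLeaf(int bagCount, float percent) {
            super(bagCount, percent);
        }
    }

    // Plays the role of InternalCachedBag (directly below the base).
    static class CachedLeaf extends SpillBase {
        CachedLeaf(int bagCount, float percent) {
            super(bagCount, percent);
        }
    }

    public static void main(String[] args) {
        System.out.println(new DistinctLeaf(1, 0.0f).limits.percent);
        System.out.println(new CachedLeaf(1, 0.5f).limits.percent);
    }
}
```

Both leaves end up with a limit built by the same base constructor, which is why the two bag variants share the same memory-limiting behaviour.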

That's all for this post. The next one will continue with other bag variants.

Origin blog.csdn.net/Aulic/article/details/121465548