1. Database usage notes
The workflow for using a database:
About SQLite3:
SQLite is an embedded database: an entire database is a single file. Because SQLite itself is written in C and has a very small footprint, it is often embedded into all kinds of applications; both iOS and Android apps can bundle it. Python ships with the sqlite3 module built in, so using SQLite from Python requires no extra installation.

Before using SQLite, a few concepts need to be clear. A table is a collection of relational data in a database; a database usually contains multiple tables, for example a students table, a classes table, a schools table, and so on. Tables are linked to one another through foreign keys.

To work with a relational database, you first connect to it; a database connection is called a Connection. Once connected, you open a cursor, called a Cursor, execute SQL statements through the Cursor, and then fetch the results. Python defines a standard database API (DB-API); any database can be used from Python as long as it provides a driver that conforms to this standard.
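The Connection/Cursor flow and the foreign-key linkage described above can be sketched in a few lines. The table and column names here (school, student) are illustrative, not from the notes; note that SQLite does not enforce foreign keys unless the pragma is switched on.

```python
import sqlite3

# In-memory database: two tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
cur = conn.cursor()

cur.execute("CREATE TABLE school(id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE student(
                   id INTEGER PRIMARY KEY,
                   name TEXT,
                   school_id INTEGER REFERENCES school(id))""")

cur.execute("INSERT INTO school VALUES (1, 'No.1 Middle School')")
cur.execute("INSERT INTO student VALUES (1, 'Jock', 1)")
conn.commit()

# Follow the foreign key with a JOIN.
cur.execute("""SELECT student.name, school.name
               FROM student JOIN school ON student.school_id = school.id""")
rows = cur.fetchall()
print(rows)  # [('Jock', 'No.1 Middle School')]
conn.close()
```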
Common commands:
sqlite3.connect() | Open a connection to the SQLite database file `database`.
connection.cursor() | Create a Cursor object.
cursor.execute() | Execute an SQL statement. The sqlite3 module supports two placeholder styles: question marks (qmark style) and named placeholders (named style).
connection.commit() | Commit the current transaction.
connection.close() | Close the database connection.
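The two placeholder styles that cursor.execute() accepts can be shown side by side. This is a minimal sketch with made-up sample values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE class(id INTEGER, name TEXT, age INTEGER, sex TEXT)")

# qmark style: values come from a tuple, matched by position
cur.execute("INSERT INTO class VALUES(?, ?, ?, ?)", (1, 'Jock', 8, 'M'))

# named style: values come from a dict, matched by name
cur.execute("INSERT INTO class VALUES(:id, :name, :age, :sex)",
            {"id": 2, "name": "Sarah", "age": 9, "sex": "F"})

conn.commit()
rows = cur.execute("SELECT name FROM class ORDER BY id").fetchall()
print(rows)  # [('Jock',), ('Sarah',)]
conn.close()
```

Using placeholders rather than string formatting also protects against SQL injection.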
Next, let's build a simple database. The code is as follows:
import sqlite3

def SQLite_Test():
    # =========== Connect to databases ============
    # 1. Connect to a local database file
    connectA = sqlite3.connect("example.db")
    # 2. Connect to an in-memory database (a temporary database in RAM)
    connectB = sqlite3.connect(":memory:")

    # =========== Create cursor objects ============
    cursorA = connectA.cursor()
    cursorB = connectB.cursor()

    # =========== Create tables ============
    cursorA.execute("CREATE TABLE class(id real, name text, age real, sex text)")
    cursorB.execute("CREATE TABLE family(relation text, job text, age real)")

    # =========== Insert data ============
    cursorA.execute("INSERT INTO class VALUES(1,'Jock',8,'M')")
    cursorA.execute("INSERT INTO class VALUES(2,'Mike',10,'M')")
    # Using ? placeholders
    cursorA.execute("INSERT INTO class VALUES(?,?,?,?)", (3, 'Sarah', 9, 'F'))

    families = [
        ['Dad', 'CEO', 35],
        ['Mom', 'singer', 33],
        ['Brother', 'student', 8]
    ]
    cursorB.executemany("INSERT INTO family VALUES(?,?,?)", families)

    # Commit, otherwise the inserts into the file database are lost on close
    connectA.commit()

    # =========== Query data ============
    # Using a named placeholder; fetchall() returns every matching row
    cursorA.execute("SELECT * FROM class WHERE sex=:SEX", {"SEX": 'M'})
    print("TABLE class: >>>select Male\n", cursorA.fetchall())
    cursorA.close()

    cursorB.execute("SELECT * FROM family ORDER BY relation")
    print("TABLE family:\n", cursorB.fetchall())
    cursorB.close()

    # =========== Close the connections ============
    connectA.close()
    connectB.close()

SQLite_Test()
The result is as shown in the figure:
Crawling information and saving it as a CSV file
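The idea can be sketched as: extract the fields of interest from the fetched page text, then write them out with the csv module. The HTML snippet and column names below are illustrative assumptions; a real crawler would first download the page, e.g. with urllib.request.urlopen() or the requests library.

```python
import csv
import re

# Illustrative stand-in for a downloaded page (a real crawler would
# fetch this over HTTP instead of hard-coding it).
html = """
<li><span class="title">Book A</span><span class="price">12.5</span></li>
<li><span class="title">Book B</span><span class="price">30.0</span></li>
"""

# Extract (title, price) pairs with a regular expression.
rows = re.findall(
    r'class="title">([^<]+)</span><span class="price">([^<]+)<', html)

# Write a header row followed by the scraped rows.
with open("items.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])
    writer.writerows(rows)

print(rows)  # [('Book A', '12.5'), ('Book B', '30.0')]
```

For anything beyond trivially regular pages, an HTML parser such as the standard library's html.parser (or BeautifulSoup) is more robust than regular expressions.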
1 # -*- coding: utf-8 -*- 2 from collections import defaultdict 3 from functools import partial 4 import itertools 5 import operator 6 import re 7 8 import numpy as np 9 10 from pandas._libs import internals as libinternals, lib 11 from pandas.compat import map, range, zip 12 from pandas.util._validators import validate_bool_kwarg 13 14 from pandas.core.dtypes.cast import ( 15 find_common_type, infer_dtype_from_scalar, maybe_convert_objects, 16 maybe_promote) 17 from pandas.core.dtypes.common import ( 18 _NS_DTYPE, is_datetimelike_v_numeric, is_extension_array_dtype, 19 is_extension_type, is_list_like, is_numeric_v_string_like, is_scalar) 20 import pandas.core.dtypes.concat as _concat 21 from pandas.core.dtypes.generic import ABCExtensionArray, ABCSeries 22 from pandas.core.dtypes.missing import isna 23 24 import pandas.core.algorithms as algos 25 from pandas.core.arrays.sparse import _maybe_to_sparse 26 from pandas.core.base import PandasObject 27 from pandas.core.index import Index, MultiIndex, ensure_index 28 from pandas.core.indexing import maybe_convert_indices 29 30 from pandas.io.formats.printing import pprint_thing 31 32 from .blocks import ( 33 Block, CategoricalBlock, DatetimeTZBlock, ExtensionBlock, 34 ObjectValuesExtensionBlock, _extend_blocks, _merge_blocks, _safe_reshape, 35 get_block_type, make_block) 36 from .concat import ( # all for concatenate_block_managers 37 combine_concat_plans, concatenate_join_units, get_mgr_concatenation_plan, 38 is_uniform_join_units) 39 40 # TODO: flexible with index=None and/or items=None 41 42 43 class BlockManager(PandasObject): 44 """ 45 Core internal data structure to implement DataFrame, Series, Panel, etc. 46 47 Manage a bunch of labeled 2D mixed-type ndarrays. 
Essentially it's a 48 lightweight blocked set of labeled data to be manipulated by the DataFrame 49 public API class 50 51 Attributes 52 ---------- 53 shape 54 ndim 55 axes 56 values 57 items 58 59 Methods 60 ------- 61 set_axis(axis, new_labels) 62 copy(deep=True) 63 64 get_dtype_counts 65 get_ftype_counts 66 get_dtypes 67 get_ftypes 68 69 apply(func, axes, block_filter_fn) 70 71 get_bool_data 72 get_numeric_data 73 74 get_slice(slice_like, axis) 75 get(label) 76 iget(loc) 77 78 take(indexer, axis) 79 reindex_axis(new_labels, axis) 80 reindex_indexer(new_labels, indexer, axis) 81 82 delete(label) 83 insert(loc, label, value) 84 set(label, value) 85 86 Parameters 87 ---------- 88 89 90 Notes 91 ----- 92 This is *not* a public API class 93 """ 94 __slots__ = ['axes', 'blocks', '_ndim', '_shape', '_known_consolidated', 95 '_is_consolidated', '_blknos', '_blklocs'] 96 97 def __init__(self, blocks, axes, do_integrity_check=True): 98 self.axes = [ensure_index(ax) for ax in axes] 99 self.blocks = tuple(blocks) 100 101 for block in blocks: 102 if block.is_sparse: 103 if len(block.mgr_locs) != 1: 104 raise AssertionError("Sparse block refers to multiple " 105 "items") 106 else: 107 if self.ndim != block.ndim: 108 raise AssertionError( 109 'Number of Block dimensions ({block}) must equal ' 110 'number of axes ({self})'.format(block=block.ndim, 111 self=self.ndim)) 112 113 if do_integrity_check: 114 self._verify_integrity() 115 116 self._consolidate_check() 117 118 self._rebuild_blknos_and_blklocs() 119 120 def make_empty(self, axes=None): 121 """ return an empty BlockManager with the items axis of len 0 """ 122 if axes is None: 123 axes = [ensure_index([])] + [ensure_index(a) 124 for a in self.axes[1:]] 125 126 # preserve dtype if possible 127 if self.ndim == 1: 128 blocks = np.array([], dtype=self.array_dtype) 129 else: 130 blocks = [] 131 return self.__class__(blocks, axes) 132 133 def __nonzero__(self): 134 return True 135 136 # Python3 compat 137 __bool__ = __nonzero__ 
138 139 @property 140 def shape(self): 141 return tuple(len(ax) for ax in self.axes) 142 143 @property 144 def ndim(self): 145 return len(self.axes) 146 147 def set_axis(self, axis, new_labels): 148 new_labels = ensure_index(new_labels) 149 old_len = len(self.axes[axis]) 150 new_len = len(new_labels) 151 152 if new_len != old_len: 153 raise ValueError( 154 'Length mismatch: Expected axis has {old} elements, new ' 155 'values have {new} elements'.format(old=old_len, new=new_len)) 156 self.axes[axis] = new_labels 157 158 def rename_axis(self, mapper, axis, copy=True, level=None): 159 """ 160 Rename one of axes. 161 162 Parameters 163 ---------- 164 mapper : unary callable 165 axis : int 166 copy : boolean, default True 167 level : int, default None 168 """ 169 obj = self.copy(deep=copy) 170 obj.set_axis(axis, _transform_index(self.axes[axis], mapper, level)) 171 return obj 172 173 @property 174 def _is_single_block(self): 175 if self.ndim == 1: 176 return True 177 178 if len(self.blocks) != 1: 179 return False 180 181 blk = self.blocks[0] 182 return (blk.mgr_locs.is_slice_like and 183 blk.mgr_locs.as_slice == slice(0, len(self), 1)) 184 185 def _rebuild_blknos_and_blklocs(self): 186 """ 187 Update mgr._blknos / mgr._blklocs. 
188 """ 189 new_blknos = np.empty(self.shape[0], dtype=np.int64) 190 new_blklocs = np.empty(self.shape[0], dtype=np.int64) 191 new_blknos.fill(-1) 192 new_blklocs.fill(-1) 193 194 for blkno, blk in enumerate(self.blocks): 195 rl = blk.mgr_locs 196 new_blknos[rl.indexer] = blkno 197 new_blklocs[rl.indexer] = np.arange(len(rl)) 198 199 if (new_blknos == -1).any(): 200 raise AssertionError("Gaps in blk ref_locs") 201 202 self._blknos = new_blknos 203 self._blklocs = new_blklocs 204 205 @property 206 def items(self): 207 return self.axes[0] 208 209 def _get_counts(self, f): 210 """ return a dict of the counts of the function in BlockManager """ 211 self._consolidate_inplace() 212 counts = dict() 213 for b in self.blocks: 214 v = f(b) 215 counts[v] = counts.get(v, 0) + b.shape[0] 216 return counts 217 218 def get_dtype_counts(self): 219 return self._get_counts(lambda b: b.dtype.name) 220 221 def get_ftype_counts(self): 222 return self._get_counts(lambda b: b.ftype) 223 224 def get_dtypes(self): 225 dtypes = np.array([blk.dtype for blk in self.blocks]) 226 return algos.take_1d(dtypes, self._blknos, allow_fill=False) 227 228 def get_ftypes(self): 229 ftypes = np.array([blk.ftype for blk in self.blocks]) 230 return algos.take_1d(ftypes, self._blknos, allow_fill=False) 231 232 def __getstate__(self): 233 block_values = [b.values for b in self.blocks] 234 block_items = [self.items[b.mgr_locs.indexer] for b in self.blocks] 235 axes_array = [ax for ax in self.axes] 236 237 extra_state = { 238 '0.14.1': { 239 'axes': axes_array, 240 'blocks': [dict(values=b.values, mgr_locs=b.mgr_locs.indexer) 241 for b in self.blocks] 242 } 243 } 244 245 # First three elements of the state are to maintain forward 246 # compatibility with 0.13.1. 
247 return axes_array, block_values, block_items, extra_state 248 249 def __setstate__(self, state): 250 def unpickle_block(values, mgr_locs): 251 return make_block(values, placement=mgr_locs) 252 253 if (isinstance(state, tuple) and len(state) >= 4 and 254 '0.14.1' in state[3]): 255 state = state[3]['0.14.1'] 256 self.axes = [ensure_index(ax) for ax in state['axes']] 257 self.blocks = tuple(unpickle_block(b['values'], b['mgr_locs']) 258 for b in state['blocks']) 259 else: 260 # discard anything after 3rd, support beta pickling format for a 261 # little while longer 262 ax_arrays, bvalues, bitems = state[:3] 263 264 self.axes = [ensure_index(ax) for ax in ax_arrays] 265 266 if len(bitems) == 1 and self.axes[0].equals(bitems[0]): 267 # This is a workaround for pre-0.14.1 pickles that didn't 268 # support unpickling multi-block frames/panels with non-unique 269 # columns/items, because given a manager with items ["a", "b", 270 # "a"] there's no way of knowing which block's "a" is where. 271 # 272 # Single-block case can be supported under the assumption that 273 # block items corresponded to manager items 1-to-1. 
274 all_mgr_locs = [slice(0, len(bitems[0]))] 275 else: 276 all_mgr_locs = [self.axes[0].get_indexer(blk_items) 277 for blk_items in bitems] 278 279 self.blocks = tuple( 280 unpickle_block(values, mgr_locs) 281 for values, mgr_locs in zip(bvalues, all_mgr_locs)) 282 283 self._post_setstate() 284 285 def _post_setstate(self): 286 self._is_consolidated = False 287 self._known_consolidated = False 288 self._rebuild_blknos_and_blklocs() 289 290 def __len__(self): 291 return len(self.items) 292 293 def __unicode__(self): 294 output = pprint_thing(self.__class__.__name__) 295 for i, ax in enumerate(self.axes): 296 if i == 0: 297 output += u'\nItems: {ax}'.format(ax=ax) 298 else: 299 output += u'\nAxis {i}: {ax}'.format(i=i, ax=ax) 300 301 for block in self.blocks: 302 output += u'\n{block}'.format(block=pprint_thing(block)) 303 return output 304 305 def _verify_integrity(self): 306 mgr_shape = self.shape 307 tot_items = sum(len(x.mgr_locs) for x in self.blocks) 308 for block in self.blocks: 309 if block._verify_integrity and block.shape[1:] != mgr_shape[1:]: 310 construction_error(tot_items, block.shape[1:], self.axes) 311 if len(self.items) != tot_items: 312 raise AssertionError('Number of manager items must equal union of ' 313 'block items\n# manager items: {0}, # ' 314 'tot_items: {1}'.format( 315 len(self.items), tot_items)) 316 317 def apply(self, f, axes=None, filter=None, do_integrity_check=False, 318 consolidate=True, **kwargs): 319 """ 320 iterate over the blocks, collect and create a new block manager 321 322 Parameters 323 ---------- 324 f : the callable or function name to operate on at the block level 325 axes : optional (if not supplied, use self.axes) 326 filter : list, if supplied, only call the block if the filter is in 327 the block 328 do_integrity_check : boolean, default False. Do the block manager 329 integrity check 330 consolidate: boolean, default True. 
Join together blocks having same 331 dtype 332 333 Returns 334 ------- 335 Block Manager (new object) 336 337 """ 338 339 result_blocks = [] 340 341 # filter kwarg is used in replace-* family of methods 342 if filter is not None: 343 filter_locs = set(self.items.get_indexer_for(filter)) 344 if len(filter_locs) == len(self.items): 345 # All items are included, as if there were no filtering 346 filter = None 347 else: 348 kwargs['filter'] = filter_locs 349 350 if consolidate: 351 self._consolidate_inplace() 352 353 if f == 'where': 354 align_copy = True 355 if kwargs.get('align', True): 356 align_keys = ['other', 'cond'] 357 else: 358 align_keys = ['cond'] 359 elif f == 'putmask': 360 align_copy = False 361 if kwargs.get('align', True): 362 align_keys = ['new', 'mask'] 363 else: 364 align_keys = ['mask'] 365 elif f == 'fillna': 366 # fillna internally does putmask, maybe it's better to do this 367 # at mgr, not block level? 368 align_copy = False 369 align_keys = ['value'] 370 else: 371 align_keys = [] 372 373 # TODO(EA): may interfere with ExtensionBlock.setitem for blocks 374 # with a .values attribute. 
375 aligned_args = {k: kwargs[k] 376 for k in align_keys 377 if hasattr(kwargs[k], 'values') and 378 not isinstance(kwargs[k], ABCExtensionArray)} 379 380 for b in self.blocks: 381 if filter is not None: 382 if not b.mgr_locs.isin(filter_locs).any(): 383 result_blocks.append(b) 384 continue 385 386 if aligned_args: 387 b_items = self.items[b.mgr_locs.indexer] 388 389 for k, obj in aligned_args.items(): 390 axis = getattr(obj, '_info_axis_number', 0) 391 kwargs[k] = obj.reindex(b_items, axis=axis, 392 copy=align_copy) 393 394 applied = getattr(b, f)(**kwargs) 395 result_blocks = _extend_blocks(applied, result_blocks) 396 397 if len(result_blocks) == 0: 398 return self.make_empty(axes or self.axes) 399 bm = self.__class__(result_blocks, axes or self.axes, 400 do_integrity_check=do_integrity_check) 401 bm._consolidate_inplace() 402 return bm 403 404 def quantile(self, axis=0, consolidate=True, transposed=False, 405 interpolation='linear', qs=None, numeric_only=None): 406 """ 407 Iterate over blocks applying quantile reduction. 408 This routine is intended for reduction type operations and 409 will do inference on the generated blocks. 410 411 Parameters 412 ---------- 413 axis: reduction axis, default 0 414 consolidate: boolean, default True. 
Join together blocks having same 415 dtype 416 transposed: boolean, default False 417 we are holding transposed data 418 interpolation : type of interpolation, default 'linear' 419 qs : a scalar or list of the quantiles to be computed 420 numeric_only : ignored 421 422 Returns 423 ------- 424 Block Manager (new object) 425 """ 426 427 # Series dispatches to DataFrame for quantile, which allows us to 428 # simplify some of the code here and in the blocks 429 assert self.ndim >= 2 430 431 if consolidate: 432 self._consolidate_inplace() 433 434 def get_axe(block, qs, axes): 435 from pandas import Float64Index 436 if is_list_like(qs): 437 ax = Float64Index(qs) 438 elif block.ndim == 1: 439 ax = Float64Index([qs]) 440 else: 441 ax = axes[0] 442 return ax 443 444 axes, blocks = [], [] 445 for b in self.blocks: 446 block = b.quantile(axis=axis, qs=qs, interpolation=interpolation) 447 448 axe = get_axe(b, qs, axes=self.axes) 449 450 axes.append(axe) 451 blocks.append(block) 452 453 # note that some DatetimeTZ, Categorical are always ndim==1 454 ndim = {b.ndim for b in blocks} 455 assert 0 not in ndim, ndim 456 457 if 2 in ndim: 458 459 new_axes = list(self.axes) 460 461 # multiple blocks that are reduced 462 if len(blocks) > 1: 463 new_axes[1] = axes[0] 464 465 # reset the placement to the original 466 for b, sb in zip(blocks, self.blocks): 467 b.mgr_locs = sb.mgr_locs 468 469 else: 470 new_axes[axis] = Index(np.concatenate( 471 [ax.values for ax in axes])) 472 473 if transposed: 474 new_axes = new_axes[::-1] 475 blocks = [b.make_block(b.values.T, 476 placement=np.arange(b.shape[1]) 477 ) for b in blocks] 478 479 return self.__class__(blocks, new_axes) 480 481 # single block, i.e. 
ndim == {1} 482 values = _concat._concat_compat([b.values for b in blocks]) 483 484 # compute the orderings of our original data 485 if len(self.blocks) > 1: 486 487 indexer = np.empty(len(self.axes[0]), dtype=np.intp) 488 i = 0 489 for b in self.blocks: 490 for j in b.mgr_locs: 491 indexer[j] = i 492 i = i + 1 493 494 values = values.take(indexer) 495 496 return SingleBlockManager( 497 [make_block(values, 498 ndim=1, 499 placement=np.arange(len(values)))], 500 axes[0]) 501 502 def isna(self, func, **kwargs): 503 return self.apply('apply', func=func, **kwargs) 504 505 def where(self, **kwargs): 506 return self.apply('where', **kwargs) 507 508 def setitem(self, **kwargs): 509 return self.apply('setitem', **kwargs) 510 511 def putmask(self, **kwargs): 512 return self.apply('putmask', **kwargs) 513 514 def diff(self, **kwargs): 515 return self.apply('diff', **kwargs) 516 517 def interpolate(self, **kwargs): 518 return self.apply('interpolate', **kwargs) 519 520 def shift(self, **kwargs): 521 return self.apply('shift', **kwargs) 522 523 def fillna(self, **kwargs): 524 return self.apply('fillna', **kwargs) 525 526 def downcast(self, **kwargs): 527 return self.apply('downcast', **kwargs) 528 529 def astype(self, dtype, **kwargs): 530 return self.apply('astype', dtype=dtype, **kwargs) 531 532 def convert(self, **kwargs): 533 return self.apply('convert', **kwargs) 534 535 def replace(self, **kwargs): 536 return self.apply('replace', **kwargs) 537 538 def replace_list(self, src_list, dest_list, inplace=False, regex=False): 539 """ do a list replace """ 540 541 inplace = validate_bool_kwarg(inplace, 'inplace') 542 543 # figure out our mask a-priori to avoid repeated replacements 544 values = self.as_array() 545 546 def comp(s, regex=False): 547 """ 548 Generate a bool array by perform an equality check, or perform 549 an element-wise regular expression matching 550 """ 551 if isna(s): 552 return isna(values) 553 if hasattr(s, 'asm8'): 554 return 
_compare_or_regex_search(maybe_convert_objects(values), 555 getattr(s, 'asm8'), regex) 556 return _compare_or_regex_search(values, s, regex) 557 558 masks = [comp(s, regex) for i, s in enumerate(src_list)] 559 560 result_blocks = [] 561 src_len = len(src_list) - 1 562 for blk in self.blocks: 563 564 # its possible to get multiple result blocks here 565 # replace ALWAYS will return a list 566 rb = [blk if inplace else blk.copy()] 567 for i, (s, d) in enumerate(zip(src_list, dest_list)): 568 new_rb = [] 569 for b in rb: 570 m = masks[i][b.mgr_locs.indexer] 571 convert = i == src_len 572 result = b._replace_coerce(mask=m, to_replace=s, value=d, 573 inplace=inplace, 574 convert=convert, regex=regex) 575 if m.any(): 576 new_rb = _extend_blocks(result, new_rb) 577 else: 578 new_rb.append(b) 579 rb = new_rb 580 result_blocks.extend(rb) 581 582 bm = self.__class__(result_blocks, self.axes) 583 bm._consolidate_inplace() 584 return bm 585 586 def reshape_nd(self, axes, **kwargs): 587 """ a 2d-nd reshape operation on a BlockManager """ 588 return self.apply('reshape_nd', axes=axes, **kwargs) 589 590 def is_consolidated(self): 591 """ 592 Return True if more than one block with the same dtype 593 """ 594 if not self._known_consolidated: 595 self._consolidate_check() 596 return self._is_consolidated 597 598 def _consolidate_check(self): 599 ftypes = [blk.ftype for blk in self.blocks] 600 self._is_consolidated = len(ftypes) == len(set(ftypes)) 601 self._known_consolidated = True 602 603 @property 604 def is_mixed_type(self): 605 # Warning, consolidation needs to get checked upstairs 606 self._consolidate_inplace() 607 return len(self.blocks) > 1 608 609 @property 610 def is_numeric_mixed_type(self): 611 # Warning, consolidation needs to get checked upstairs 612 self._consolidate_inplace() 613 return all(block.is_numeric for block in self.blocks) 614 615 @property 616 def is_datelike_mixed_type(self): 617 # Warning, consolidation needs to get checked upstairs 618 
self._consolidate_inplace() 619 return any(block.is_datelike for block in self.blocks) 620 621 @property 622 def any_extension_types(self): 623 """Whether any of the blocks in this manager are extension blocks""" 624 return any(block.is_extension for block in self.blocks) 625 626 @property 627 def is_view(self): 628 """ return a boolean if we are a single block and are a view """ 629 if len(self.blocks) == 1: 630 return self.blocks[0].is_view 631 632 # It is technically possible to figure out which blocks are views 633 # e.g. [ b.values.base is not None for b in self.blocks ] 634 # but then we have the case of possibly some blocks being a view 635 # and some blocks not. setting in theory is possible on the non-view 636 # blocks w/o causing a SettingWithCopy raise/warn. But this is a bit 637 # complicated 638 639 return False 640 641 def get_bool_data(self, copy=False): 642 """ 643 Parameters 644 ---------- 645 copy : boolean, default False 646 Whether to copy the blocks 647 """ 648 self._consolidate_inplace() 649 return self.combine([b for b in self.blocks if b.is_bool], copy) 650 651 def get_numeric_data(self, copy=False): 652 """ 653 Parameters 654 ---------- 655 copy : boolean, default False 656 Whether to copy the blocks 657 """ 658 self._consolidate_inplace() 659 return self.combine([b for b in self.blocks if b.is_numeric], copy) 660 661 def combine(self, blocks, copy=True): 662 """ return a new manager with the blocks """ 663 if len(blocks) == 0: 664 return self.make_empty() 665 666 # FIXME: optimization potential 667 indexer = np.sort(np.concatenate([b.mgr_locs.as_array 668 for b in blocks])) 669 inv_indexer = lib.get_reverse_indexer(indexer, self.shape[0]) 670 671 new_blocks = [] 672 for b in blocks: 673 b = b.copy(deep=copy) 674 b.mgr_locs = algos.take_1d(inv_indexer, b.mgr_locs.as_array, 675 axis=0, allow_fill=False) 676 new_blocks.append(b) 677 678 axes = list(self.axes) 679 axes[0] = self.items.take(indexer) 680 681 return self.__class__(new_blocks, 
axes, do_integrity_check=False) 682 683 def get_slice(self, slobj, axis=0): 684 if axis >= self.ndim: 685 raise IndexError("Requested axis not found in manager") 686 687 if axis == 0: 688 new_blocks = self._slice_take_blocks_ax0(slobj) 689 else: 690 slicer = [slice(None)] * (axis + 1) 691 slicer[axis] = slobj 692 slicer = tuple(slicer) 693 new_blocks = [blk.getitem_block(slicer) for blk in self.blocks] 694 695 new_axes = list(self.axes) 696 new_axes[axis] = new_axes[axis][slobj] 697 698 bm = self.__class__(new_blocks, new_axes, do_integrity_check=False) 699 bm._consolidate_inplace() 700 return bm 701 702 def __contains__(self, item): 703 return item in self.items 704 705 @property 706 def nblocks(self): 707 return len(self.blocks) 708 709 def copy(self, deep=True): 710 """ 711 Make deep or shallow copy of BlockManager 712 713 Parameters 714 ---------- 715 deep : boolean o rstring, default True 716 If False, return shallow copy (do not copy data) 717 If 'all', copy data and a deep copy of the index 718 719 Returns 720 ------- 721 copy : BlockManager 722 """ 723 # this preserves the notion of view copying of axes 724 if deep: 725 if deep == 'all': 726 copy = lambda ax: ax.copy(deep=True) 727 else: 728 copy = lambda ax: ax.view() 729 new_axes = [copy(ax) for ax in self.axes] 730 else: 731 new_axes = list(self.axes) 732 return self.apply('copy', axes=new_axes, deep=deep, 733 do_integrity_check=False) 734 735 def as_array(self, transpose=False, items=None): 736 """Convert the blockmanager data into an numpy array. 737 738 Parameters 739 ---------- 740 transpose : boolean, default False 741 If True, transpose the return array 742 items : list of strings or None 743 Names of block items that will be included in the returned 744 array. 
``None`` means that all block items will be used 745 746 Returns 747 ------- 748 arr : ndarray 749 """ 750 if len(self.blocks) == 0: 751 arr = np.empty(self.shape, dtype=float) 752 return arr.transpose() if transpose else arr 753 754 if items is not None: 755 mgr = self.reindex_axis(items, axis=0) 756 else: 757 mgr = self 758 759 if self._is_single_block and mgr.blocks[0].is_datetimetz: 760 # TODO(Block.get_values): Make DatetimeTZBlock.get_values 761 # always be object dtype. Some callers seem to want the 762 # DatetimeArray (previously DTI) 763 arr = mgr.blocks[0].get_values(dtype=object) 764 elif self._is_single_block or not self.is_mixed_type: 765 arr = np.asarray(mgr.blocks[0].get_values()) 766 else: 767 arr = mgr._interleave() 768 769 return arr.transpose() if transpose else arr 770 771 def _interleave(self): 772 """ 773 Return ndarray from blocks with specified item order 774 Items must be contained in the blocks 775 """ 776 from pandas.core.dtypes.common import is_sparse 777 dtype = _interleaved_dtype(self.blocks) 778 779 # TODO: https://github.com/pandas-dev/pandas/issues/22791 780 # Give EAs some input on what happens here. Sparse needs this. 
781 if is_sparse(dtype): 782 dtype = dtype.subtype 783 elif is_extension_array_dtype(dtype): 784 dtype = 'object' 785 786 result = np.empty(self.shape, dtype=dtype) 787 788 itemmask = np.zeros(self.shape[0]) 789 790 for blk in self.blocks: 791 rl = blk.mgr_locs 792 result[rl.indexer] = blk.get_values(dtype) 793 itemmask[rl.indexer] = 1 794 795 if not itemmask.all(): 796 raise AssertionError('Some items were not contained in blocks') 797 798 return result 799 800 def to_dict(self, copy=True): 801 """ 802 Return a dict of str(dtype) -> BlockManager 803 804 Parameters 805 ---------- 806 copy : boolean, default True 807 808 Returns 809 ------- 810 values : a dict of dtype -> BlockManager 811 812 Notes 813 ----- 814 This consolidates based on str(dtype) 815 """ 816 self._consolidate_inplace() 817 818 bd = {} 819 for b in self.blocks: 820 bd.setdefault(str(b.dtype), []).append(b) 821 822 return {dtype: self.combine(blocks, copy=copy) 823 for dtype, blocks in bd.items()} 824 825 def xs(self, key, axis=1, copy=True, takeable=False): 826 if axis < 1: 827 raise AssertionError( 828 'Can only take xs across axis >= 1, got {ax}'.format(ax=axis)) 829 830 # take by position 831 if takeable: 832 loc = key 833 else: 834 loc = self.axes[axis].get_loc(key) 835 836 slicer = [slice(None, None) for _ in range(self.ndim)] 837 slicer[axis] = loc 838 slicer = tuple(slicer) 839 840 new_axes = list(self.axes) 841 842 # could be an array indexer! 
843 if isinstance(loc, (slice, np.ndarray)): 844 new_axes[axis] = new_axes[axis][loc] 845 else: 846 new_axes.pop(axis) 847 848 new_blocks = [] 849 if len(self.blocks) > 1: 850 # we must copy here as we are mixed type 851 for blk in self.blocks: 852 newb = make_block(values=blk.values[slicer], 853 klass=blk.__class__, 854 placement=blk.mgr_locs) 855 new_blocks.append(newb) 856 elif len(self.blocks) == 1: 857 block = self.blocks[0] 858 vals = block.values[slicer] 859 if copy: 860 vals = vals.copy() 861 new_blocks = [make_block(values=vals, 862 placement=block.mgr_locs, 863 klass=block.__class__)] 864 865 return self.__class__(new_blocks, new_axes) 866 867 def fast_xs(self, loc): 868 """ 869 get a cross sectional for a given location in the 870 items ; handle dups 871 872 return the result, is *could* be a view in the case of a 873 single block 874 """ 875 if len(self.blocks) == 1: 876 return self.blocks[0].iget((slice(None), loc)) 877 878 items = self.items 879 880 # non-unique (GH4726) 881 if not items.is_unique: 882 result = self._interleave() 883 if self.ndim == 2: 884 result = result.T 885 return result[loc] 886 887 # unique 888 dtype = _interleaved_dtype(self.blocks) 889 890 n = len(items) 891 if is_extension_array_dtype(dtype): 892 # we'll eventually construct an ExtensionArray. 
893 result = np.empty(n, dtype=object) 894 else: 895 result = np.empty(n, dtype=dtype) 896 897 for blk in self.blocks: 898 # Such assignment may incorrectly coerce NaT to None 899 # result[blk.mgr_locs] = blk._slice((slice(None), loc)) 900 for i, rl in enumerate(blk.mgr_locs): 901 result[rl] = blk._try_coerce_result(blk.iget((i, loc))) 902 903 if is_extension_array_dtype(dtype): 904 result = dtype.construct_array_type()._from_sequence( 905 result, dtype=dtype 906 ) 907 908 return result 909 910 def consolidate(self): 911 """ 912 Join together blocks having same dtype 913 914 Returns 915 ------- 916 y : BlockManager 917 """ 918 if self.is_consolidated(): 919 return self 920 921 bm = self.__class__(self.blocks, self.axes) 922 bm._is_consolidated = False 923 bm._consolidate_inplace() 924 return bm 925 926 def _consolidate_inplace(self): 927 if not self.is_consolidated(): 928 self.blocks = tuple(_consolidate(self.blocks)) 929 self._is_consolidated = True 930 self._known_consolidated = True 931 self._rebuild_blknos_and_blklocs() 932 933 def get(self, item, fastpath=True): 934 """ 935 Return values for selected item (ndarray or BlockManager). 
936 """ 937 if self.items.is_unique: 938 939 if not isna(item): 940 loc = self.items.get_loc(item) 941 else: 942 indexer = np.arange(len(self.items))[isna(self.items)] 943 944 # allow a single nan location indexer 945 if not is_scalar(indexer): 946 if len(indexer) == 1: 947 loc = indexer.item() 948 else: 949 raise ValueError("cannot label index with a null key") 950 951 return self.iget(loc, fastpath=fastpath) 952 else: 953 954 if isna(item): 955 raise TypeError("cannot label index with a null key") 956 957 indexer = self.items.get_indexer_for([item]) 958 return self.reindex_indexer(new_axis=self.items[indexer], 959 indexer=indexer, axis=0, 960 allow_dups=True) 961 962 def iget(self, i, fastpath=True): 963 """ 964 Return the data as a SingleBlockManager if fastpath=True and possible 965 966 Otherwise return as a ndarray 967 """ 968 block = self.blocks[self._blknos[i]] 969 values = block.iget(self._blklocs[i]) 970 if not fastpath or not block._box_to_block_values or values.ndim != 1: 971 return values 972 973 # fastpath shortcut for select a single-dim from a 2-dim BM 974 return SingleBlockManager( 975 [block.make_block_same_class(values, 976 placement=slice(0, len(values)), 977 ndim=1)], 978 self.axes[1]) 979 980 def delete(self, item): 981 """ 982 Delete selected item (items if non-unique) in-place. 
983 """ 984 indexer = self.items.get_loc(item) 985 986 is_deleted = np.zeros(self.shape[0], dtype=np.bool_) 987 is_deleted[indexer] = True 988 ref_loc_offset = -is_deleted.cumsum() 989 990 is_blk_deleted = [False] * len(self.blocks) 991 992 if isinstance(indexer, int): 993 affected_start = indexer 994 else: 995 affected_start = is_deleted.nonzero()[0][0] 996 997 for blkno, _ in _fast_count_smallints(self._blknos[affected_start:]): 998 blk = self.blocks[blkno] 999 bml = blk.mgr_locs 1000 blk_del = is_deleted[bml.indexer].nonzero()[0] 1001 1002 if len(blk_del) == len(bml): 1003 is_blk_deleted[blkno] = True 1004 continue 1005 elif len(blk_del) != 0: 1006 blk.delete(blk_del) 1007 bml = blk.mgr_locs 1008 1009 blk.mgr_locs = bml.add(ref_loc_offset[bml.indexer]) 1010 1011 # FIXME: use Index.delete as soon as it uses fastpath=True 1012 self.axes[0] = self.items[~is_deleted] 1013 self.blocks = tuple(b for blkno, b in enumerate(self.blocks) 1014 if not is_blk_deleted[blkno]) 1015 self._shape = None 1016 self._rebuild_blknos_and_blklocs() 1017 1018 def set(self, item, value): 1019 """ 1020 Set new item in-place. Does not consolidate. 
Adds new Block if not 1021 contained in the current set of items 1022 """ 1023 # FIXME: refactor, clearly separate broadcasting & zip-like assignment 1024 # can prob also fix the various if tests for sparse/categorical 1025 1026 # TODO(EA): Remove an is_extension_ when all extension types satisfy 1027 # the interface 1028 value_is_extension_type = (is_extension_type(value) or 1029 is_extension_array_dtype(value)) 1030 1031 # categorical/spares/datetimetz 1032 if value_is_extension_type: 1033 1034 def value_getitem(placement): 1035 return value 1036 else: 1037 if value.ndim == self.ndim - 1: 1038 value = _safe_reshape(value, (1,) + value.shape) 1039 1040 def value_getitem(placement): 1041 return value 1042 else: 1043 1044 def value_getitem(placement): 1045 return value[placement.indexer] 1046 1047 if value.shape[1:] != self.shape[1:]: 1048 raise AssertionError('Shape of new values must be compatible ' 1049 'with manager shape') 1050 1051 try: 1052 loc = self.items.get_loc(item) 1053 except KeyError: 1054 # This item wasn't present, just insert at end 1055 self.insert(len(self.items), item, value) 1056 return 1057 1058 if isinstance(loc, int): 1059 loc = [loc] 1060 1061 blknos = self._blknos[loc] 1062 blklocs = self._blklocs[loc].copy() 1063 1064 unfit_mgr_locs = [] 1065 unfit_val_locs = [] 1066 removed_blknos = [] 1067 for blkno, val_locs in libinternals.get_blkno_placements(blknos, 1068 self.nblocks, 1069 group=True): 1070 blk = self.blocks[blkno] 1071 blk_locs = blklocs[val_locs.indexer] 1072 if blk.should_store(value): 1073 blk.set(blk_locs, value_getitem(val_locs)) 1074 else: 1075 unfit_mgr_locs.append(blk.mgr_locs.as_array[blk_locs]) 1076 unfit_val_locs.append(val_locs) 1077 1078 # If all block items are unfit, schedule the block for removal. 
                if len(val_locs) == len(blk.mgr_locs):
                    removed_blknos.append(blkno)
                else:
                    self._blklocs[blk.mgr_locs.indexer] = -1
                    blk.delete(blk_locs)
                    self._blklocs[blk.mgr_locs.indexer] = np.arange(len(blk))

        if len(removed_blknos):
            # Remove blocks & update blknos accordingly
            is_deleted = np.zeros(self.nblocks, dtype=np.bool_)
            is_deleted[removed_blknos] = True

            new_blknos = np.empty(self.nblocks, dtype=np.int64)
            new_blknos.fill(-1)
            new_blknos[~is_deleted] = np.arange(self.nblocks -
                                                len(removed_blknos))
            self._blknos = algos.take_1d(new_blknos, self._blknos, axis=0,
                                         allow_fill=False)
            self.blocks = tuple(blk for i, blk in enumerate(self.blocks)
                                if i not in set(removed_blknos))

        if unfit_val_locs:
            unfit_mgr_locs = np.concatenate(unfit_mgr_locs)
            unfit_count = len(unfit_mgr_locs)

            new_blocks = []
            if value_is_extension_type:
                # This code (ab-)uses the fact that sparse blocks contain only
                # one item.
                new_blocks.extend(
                    make_block(values=value.copy(), ndim=self.ndim,
                               placement=slice(mgr_loc, mgr_loc + 1))
                    for mgr_loc in unfit_mgr_locs)

                self._blknos[unfit_mgr_locs] = (np.arange(unfit_count) +
                                                len(self.blocks))
                self._blklocs[unfit_mgr_locs] = 0

            else:
                # unfit_val_locs contains BlockPlacement objects
                unfit_val_items = unfit_val_locs[0].append(unfit_val_locs[1:])

                new_blocks.append(
                    make_block(values=value_getitem(unfit_val_items),
                               ndim=self.ndim, placement=unfit_mgr_locs))

                self._blknos[unfit_mgr_locs] = len(self.blocks)
                self._blklocs[unfit_mgr_locs] = np.arange(unfit_count)

            self.blocks += tuple(new_blocks)

            # Newly created block's dtype may already be present.
            self._known_consolidated = False

    def insert(self, loc, item, value, allow_duplicates=False):
        """
        Insert item at selected position.

        Parameters
        ----------
        loc : int
        item : hashable
        value : array_like
        allow_duplicates: bool
            If False, trying to insert non-unique item will raise

        """
        if not allow_duplicates and item in self.items:
            # Should this be a different kind of error??
            raise ValueError('cannot insert {}, already exists'.format(item))

        if not isinstance(loc, int):
            raise TypeError("loc must be int")

        # insert to the axis; this could possibly raise a TypeError
        new_axis = self.items.insert(loc, item)

        block = make_block(values=value, ndim=self.ndim,
                           placement=slice(loc, loc + 1))

        for blkno, count in _fast_count_smallints(self._blknos[loc:]):
            blk = self.blocks[blkno]
            if count == len(blk.mgr_locs):
                blk.mgr_locs = blk.mgr_locs.add(1)
            else:
                new_mgr_locs = blk.mgr_locs.as_array.copy()
                new_mgr_locs[new_mgr_locs >= loc] += 1
                blk.mgr_locs = new_mgr_locs

        if loc == self._blklocs.shape[0]:
            # np.append is a lot faster, let's use it if we can.
            self._blklocs = np.append(self._blklocs, 0)
            self._blknos = np.append(self._blknos, len(self.blocks))
        else:
            self._blklocs = np.insert(self._blklocs, loc, 0)
            self._blknos = np.insert(self._blknos, loc, len(self.blocks))

        self.axes[0] = new_axis
        self.blocks += (block,)
        self._shape = None

        self._known_consolidated = False

        if len(self.blocks) > 100:
            self._consolidate_inplace()

    def reindex_axis(self, new_index, axis, method=None, limit=None,
                     fill_value=None, copy=True):
        """
        Conform block manager to new index.
        """
        new_index = ensure_index(new_index)
        new_index, indexer = self.axes[axis].reindex(new_index, method=method,
                                                     limit=limit)

        return self.reindex_indexer(new_index, indexer, axis=axis,
                                    fill_value=fill_value, copy=copy)

    def reindex_indexer(self, new_axis, indexer, axis, fill_value=None,
                        allow_dups=False, copy=True):
        """
        Parameters
        ----------
        new_axis : Index
        indexer : ndarray of int64 or None
        axis : int
        fill_value : object
        allow_dups : bool

        pandas-indexer with -1's only.
        """
        if indexer is None:
            if new_axis is self.axes[axis] and not copy:
                return self

            result = self.copy(deep=copy)
            result.axes = list(self.axes)
            result.axes[axis] = new_axis
            return result

        self._consolidate_inplace()

        # some axes don't allow reindexing with dups
        if not allow_dups:
            self.axes[axis]._can_reindex(indexer)

        if axis >= self.ndim:
            raise IndexError("Requested axis not found in manager")

        if axis == 0:
            new_blocks = self._slice_take_blocks_ax0(indexer,
                                                     fill_tuple=(fill_value,))
        else:
            new_blocks = [blk.take_nd(indexer, axis=axis, fill_tuple=(
                fill_value if fill_value is not None else blk.fill_value,))
                for blk in self.blocks]

        new_axes = list(self.axes)
        new_axes[axis] = new_axis
        return self.__class__(new_blocks, new_axes)

    def _slice_take_blocks_ax0(self, slice_or_indexer, fill_tuple=None):
        """
        Slice/take blocks along axis=0.

        Overloaded for SingleBlock

        Returns
        -------
        new_blocks : list of Block

        """
        allow_fill = fill_tuple is not None

        sl_type, slobj, sllen = _preprocess_slice_or_indexer(
            slice_or_indexer, self.shape[0], allow_fill=allow_fill)

        if self._is_single_block:
            blk = self.blocks[0]

            if sl_type in ('slice', 'mask'):
                return [blk.getitem_block(slobj, new_mgr_locs=slice(0, sllen))]
            elif not allow_fill or self.ndim == 1:
                if allow_fill and fill_tuple[0] is None:
                    _, fill_value = maybe_promote(blk.dtype)
                    fill_tuple = (fill_value, )

                return [blk.take_nd(slobj, axis=0,
                                    new_mgr_locs=slice(0, sllen),
                                    fill_tuple=fill_tuple)]

        if sl_type in ('slice', 'mask'):
            blknos = self._blknos[slobj]
            blklocs = self._blklocs[slobj]
        else:
            blknos = algos.take_1d(self._blknos, slobj, fill_value=-1,
                                   allow_fill=allow_fill)
            blklocs = algos.take_1d(self._blklocs, slobj, fill_value=-1,
                                    allow_fill=allow_fill)

        # When filling blknos, make sure blknos is updated before appending to
        # blocks list, that way new blkno is exactly len(blocks).
        #
        # FIXME: mgr_groupby_blknos must return mgr_locs in ascending order,
        # pytables serialization will break otherwise.
        blocks = []
        for blkno, mgr_locs in libinternals.get_blkno_placements(blknos,
                                                                 self.nblocks,
                                                                 group=True):
            if blkno == -1:
                # If we've got here, fill_tuple was not None.
                fill_value = fill_tuple[0]

                blocks.append(self._make_na_block(placement=mgr_locs,
                                                  fill_value=fill_value))
            else:
                blk = self.blocks[blkno]

                # Otherwise, slicing along items axis is necessary.
                if not blk._can_consolidate:
                    # A non-consolidatable block, it's easy, because there's
                    # only one item and each mgr loc is a copy of that single
                    # item.
                    for mgr_loc in mgr_locs:
                        newblk = blk.copy(deep=True)
                        newblk.mgr_locs = slice(mgr_loc, mgr_loc + 1)
                        blocks.append(newblk)

                else:
                    blocks.append(blk.take_nd(blklocs[mgr_locs.indexer],
                                              axis=0, new_mgr_locs=mgr_locs,
                                              fill_tuple=None))

        return blocks

    def _make_na_block(self, placement, fill_value=None):
        # TODO: infer dtypes other than float64 from fill_value

        if fill_value is None:
            fill_value = np.nan
        block_shape = list(self.shape)
        block_shape[0] = len(placement)

        dtype, fill_value = infer_dtype_from_scalar(fill_value)
        block_values = np.empty(block_shape, dtype=dtype)
        block_values.fill(fill_value)
        return make_block(block_values, placement=placement)

    def take(self, indexer, axis=1, verify=True, convert=True):
        """
        Take items along any axis.
        """
        self._consolidate_inplace()
        indexer = (np.arange(indexer.start, indexer.stop, indexer.step,
                             dtype='int64')
                   if isinstance(indexer, slice)
                   else np.asanyarray(indexer, dtype='int64'))

        n = self.shape[axis]
        if convert:
            indexer = maybe_convert_indices(indexer, n)

        if verify:
            if ((indexer == -1) | (indexer >= n)).any():
                raise Exception('Indices must be nonzero and less than '
                                'the axis length')

        new_labels = self.axes[axis].take(indexer)
        return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
                                    axis=axis, allow_dups=True)

    def merge(self, other, lsuffix='', rsuffix=''):
        # We assume at this point that the axes of self and other match.
        # This is only called from Panel.join, which reindexes prior
        # to calling to ensure this assumption holds.
        l, r = items_overlap_with_suffix(left=self.items, lsuffix=lsuffix,
                                         right=other.items, rsuffix=rsuffix)
        new_items = _concat_indexes([l, r])

        new_blocks = [blk.copy(deep=False) for blk in self.blocks]

        offset = self.shape[0]
        for blk in other.blocks:
            blk = blk.copy(deep=False)
            blk.mgr_locs = blk.mgr_locs.add(offset)
            new_blocks.append(blk)

        new_axes = list(self.axes)
        new_axes[0] = new_items

        return self.__class__(_consolidate(new_blocks), new_axes)

    def equals(self, other):
        self_axes, other_axes = self.axes, other.axes
        if len(self_axes) != len(other_axes):
            return False
        if not all(ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
            return False
        self._consolidate_inplace()
        other._consolidate_inplace()
        if len(self.blocks) != len(other.blocks):
            return False

        # canonicalize block order, using a tuple combining the type
        # name and then mgr_locs because there might be unconsolidated
        # blocks (say, Categorical) which can only be distinguished by
        # the iteration order
        def canonicalize(block):
            return (block.dtype.name, block.mgr_locs.as_array.tolist())

        self_blocks = sorted(self.blocks, key=canonicalize)
        other_blocks = sorted(other.blocks, key=canonicalize)
        return all(block.equals(oblock)
                   for block, oblock in zip(self_blocks, other_blocks))

    def unstack(self, unstacker_func, fill_value):
        """Return a blockmanager with all blocks unstacked.

        Parameters
        ----------
        unstacker_func : callable
            A (partially-applied) ``pd.core.reshape._Unstacker`` class.
        fill_value : Any
            fill_value for newly introduced missing values.

        Returns
        -------
        unstacked : BlockManager
        """
        n_rows = self.shape[-1]
        dummy = unstacker_func(np.empty((0, 0)), value_columns=self.items)
        new_columns = dummy.get_new_columns()
        new_index = dummy.get_new_index()
        new_blocks = []
        columns_mask = []

        for blk in self.blocks:
            blocks, mask = blk._unstack(
                partial(unstacker_func,
                        value_columns=self.items[blk.mgr_locs.indexer]),
                new_columns,
                n_rows,
                fill_value
            )

            new_blocks.extend(blocks)
            columns_mask.extend(mask)

        new_columns = new_columns[columns_mask]

        bm = BlockManager(new_blocks, [new_columns, new_index])
        return bm


class SingleBlockManager(BlockManager):
    """ manage a single block """

    ndim = 1
    _is_consolidated = True
    _known_consolidated = True
    __slots__ = ()

    def __init__(self, block, axis, do_integrity_check=False, fastpath=False):

        if isinstance(axis, list):
            if len(axis) != 1:
                raise ValueError("cannot create SingleBlockManager with more "
                                 "than 1 axis")
            axis = axis[0]

        # passed from constructor, single block, single axis
        if fastpath:
            self.axes = [axis]
            if isinstance(block, list):

                # empty block
                if len(block) == 0:
                    block = [np.array([])]
                elif len(block) != 1:
                    raise ValueError('Cannot create SingleBlockManager with '
                                     'more than 1 block')
                block = block[0]
        else:
            self.axes = [ensure_index(axis)]

            # create the block here
            if isinstance(block, list):

                # provide consolidation to the interleaved_dtype
                if len(block) > 1:
                    dtype = _interleaved_dtype(block)
                    block = [b.astype(dtype) for b in block]
                    block = _consolidate(block)

                if len(block) != 1:
                    raise ValueError('Cannot create SingleBlockManager with '
                                     'more than 1 block')
                block = block[0]

        if not isinstance(block, Block):
            block = make_block(block, placement=slice(0, len(axis)), ndim=1)

        self.blocks = [block]

    def _post_setstate(self):
        pass

    @property
    def _block(self):
        return self.blocks[0]

    @property
    def _values(self):
        return self._block.values

    @property
    def _blknos(self):
        """ compat with BlockManager """
        return None

    @property
    def _blklocs(self):
        """ compat with BlockManager """
        return None

    def get_slice(self, slobj, axis=0):
        if axis >= self.ndim:
            raise IndexError("Requested axis not found in manager")

        return self.__class__(self._block._slice(slobj),
                              self.index[slobj], fastpath=True)

    @property
    def index(self):
        return self.axes[0]

    def convert(self, **kwargs):
        """ convert the whole block as one """
        kwargs['by_item'] = False
        return self.apply('convert', **kwargs)

    @property
    def dtype(self):
        return self._block.dtype

    @property
    def array_dtype(self):
        return self._block.array_dtype

    @property
    def ftype(self):
        return self._block.ftype

    def get_dtype_counts(self):
        return {self.dtype.name: 1}

    def get_ftype_counts(self):
        return {self.ftype: 1}

    def get_dtypes(self):
        return np.array([self._block.dtype])

    def get_ftypes(self):
        return np.array([self._block.ftype])

    def external_values(self):
        return self._block.external_values()

    def internal_values(self):
        return self._block.internal_values()

    def formatting_values(self):
        """Return the internal values used by the DataFrame/SeriesFormatter"""
        return self._block.formatting_values()

    def get_values(self):
        """ return a dense type view """
        return np.array(self._block.to_dense(), copy=False)

    @property
    def asobject(self):
        """
        return an object dtype array. datetime/timedelta like values are boxed
        to Timestamp/Timedelta instances.
        """
        return self._block.get_values(dtype=object)

    @property
    def _can_hold_na(self):
        return self._block._can_hold_na

    def is_consolidated(self):
        return True

    def _consolidate_check(self):
        pass

    def _consolidate_inplace(self):
        pass

    def delete(self, item):
        """
        Delete single item from SingleBlockManager.

        Ensures that self.blocks doesn't become empty.
        """
        loc = self.items.get_loc(item)
        self._block.delete(loc)
        self.axes[0] = self.axes[0].delete(loc)

    def fast_xs(self, loc):
        """
        fast path for getting a cross-section
        return a view of the data
        """
        return self._block.values[loc]

    def concat(self, to_concat, new_axis):
        """
        Concatenate a list of SingleBlockManagers into a single
        SingleBlockManager.

        Used for pd.concat of Series objects with axis=0.

        Parameters
        ----------
        to_concat : list of SingleBlockManagers
        new_axis : Index of the result

        Returns
        -------
        SingleBlockManager

        """
        non_empties = [x for x in to_concat if len(x) > 0]

        # check if all series are of the same block type:
        if len(non_empties) > 0:
            blocks = [obj.blocks[0] for obj in non_empties]
            if len({b.dtype for b in blocks}) == 1:
                new_block = blocks[0].concat_same_type(blocks)
            else:
                values = [x.values for x in blocks]
                values = _concat._concat_compat(values)
                new_block = make_block(
                    values, placement=slice(0, len(values), 1))
        else:
            values = [x._block.values for x in to_concat]
            values = _concat._concat_compat(values)
            new_block = make_block(
                values, placement=slice(0, len(values), 1))

        mgr = SingleBlockManager(new_block, new_axis)
        return mgr


# --------------------------------------------------------------------
# Constructor Helpers

def create_block_manager_from_blocks(blocks, axes):
    try:
        if len(blocks) == 1 and not isinstance(blocks[0], Block):
            # if blocks[0] is of length 0, return empty blocks
            if not len(blocks[0]):
                blocks = []
            else:
                # It's OK if a single block is passed as values, its placement
                # is basically "all items", but if there're many, don't bother
                # converting, it's an error anyway.
                blocks = [make_block(values=blocks[0],
                                     placement=slice(0, len(axes[0])))]

        mgr = BlockManager(blocks, axes)
        mgr._consolidate_inplace()
        return mgr

    except (ValueError) as e:
        blocks = [getattr(b, 'values', b) for b in blocks]
        tot_items = sum(b.shape[0] for b in blocks)
        construction_error(tot_items, blocks[0].shape[1:], axes, e)


def create_block_manager_from_arrays(arrays, names, axes):

    try:
        blocks = form_blocks(arrays, names, axes)
        mgr = BlockManager(blocks, axes)
        mgr._consolidate_inplace()
        return mgr
    except ValueError as e:
        construction_error(len(arrays), arrays[0].shape, axes, e)


def construction_error(tot_items, block_shape, axes, e=None):
    """ raise a helpful message about our construction """
    passed = tuple(map(int, [tot_items] + list(block_shape)))
    # Correcting the user facing error message during dataframe construction
    if len(passed) <= 2:
        passed = passed[::-1]

    implied = tuple(len(ax) for ax in axes)
    # Correcting the user facing error message during dataframe construction
    if len(implied) <= 2:
        implied = implied[::-1]

    if passed == implied and e is not None:
        raise e
    if block_shape[0] == 0:
        raise ValueError("Empty data passed with indices specified.")
    raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
        passed, implied))


# -----------------------------------------------------------------------

def form_blocks(arrays, names, axes):
    # put "leftover" items in float bucket, where else?
    # generalize?
    items_dict = defaultdict(list)
    extra_locs = []

    names_idx = ensure_index(names)
    if names_idx.equals(axes[0]):
        names_indexer = np.arange(len(names_idx))
    else:
        assert names_idx.intersection(axes[0]).is_unique
        names_indexer = names_idx.get_indexer_for(axes[0])

    for i, name_idx in enumerate(names_indexer):
        if name_idx == -1:
            extra_locs.append(i)
            continue

        k = names[name_idx]
        v = arrays[name_idx]

        block_type = get_block_type(v)
        items_dict[block_type.__name__].append((i, k, v))

    blocks = []
    if len(items_dict['FloatBlock']):
        float_blocks = _multi_blockify(items_dict['FloatBlock'])
        blocks.extend(float_blocks)

    if len(items_dict['ComplexBlock']):
        complex_blocks = _multi_blockify(items_dict['ComplexBlock'])
        blocks.extend(complex_blocks)

    if len(items_dict['TimeDeltaBlock']):
        timedelta_blocks = _multi_blockify(items_dict['TimeDeltaBlock'])
        blocks.extend(timedelta_blocks)

    if len(items_dict['IntBlock']):
        int_blocks = _multi_blockify(items_dict['IntBlock'])
        blocks.extend(int_blocks)

    if len(items_dict['DatetimeBlock']):
        datetime_blocks = _simple_blockify(items_dict['DatetimeBlock'],
                                           _NS_DTYPE)
        blocks.extend(datetime_blocks)

    if len(items_dict['DatetimeTZBlock']):
        dttz_blocks = [make_block(array,
                                  klass=DatetimeTZBlock,
                                  placement=[i])
                       for i, _, array in items_dict['DatetimeTZBlock']]
        blocks.extend(dttz_blocks)

    if len(items_dict['BoolBlock']):
        bool_blocks = _simple_blockify(items_dict['BoolBlock'], np.bool_)
        blocks.extend(bool_blocks)

    if len(items_dict['ObjectBlock']) > 0:
        object_blocks = _simple_blockify(items_dict['ObjectBlock'], np.object_)
        blocks.extend(object_blocks)

    if len(items_dict['SparseBlock']) > 0:
        sparse_blocks = _sparse_blockify(items_dict['SparseBlock'])
        blocks.extend(sparse_blocks)

    if len(items_dict['CategoricalBlock']) > 0:
        cat_blocks = [make_block(array, klass=CategoricalBlock, placement=[i])
                      for i, _, array in items_dict['CategoricalBlock']]
        blocks.extend(cat_blocks)

    if len(items_dict['ExtensionBlock']):

        external_blocks = [
            make_block(array, klass=ExtensionBlock, placement=[i])
            for i, _, array in items_dict['ExtensionBlock']
        ]

        blocks.extend(external_blocks)

    if len(items_dict['ObjectValuesExtensionBlock']):
        external_blocks = [
            make_block(array, klass=ObjectValuesExtensionBlock, placement=[i])
            for i, _, array in items_dict['ObjectValuesExtensionBlock']
        ]

        blocks.extend(external_blocks)

    if len(extra_locs):
        shape = (len(extra_locs),) + tuple(len(x) for x in axes[1:])

        # empty items -> dtype object
        block_values = np.empty(shape, dtype=object)
        block_values.fill(np.nan)

        na_block = make_block(block_values, placement=extra_locs)
        blocks.append(na_block)

    return blocks


def _simple_blockify(tuples, dtype):
    """ return a single array of a block that has a single dtype; if dtype is
    not None, coerce to this dtype
    """
    values, placement = _stack_arrays(tuples, dtype)

    # CHECK DTYPE?
    if dtype is not None and values.dtype != dtype:  # pragma: no cover
        values = values.astype(dtype)

    block = make_block(values, placement=placement)
    return [block]


def _multi_blockify(tuples, dtype=None):
    """ return an array of blocks that potentially have different dtypes """

    # group by dtype
    grouper = itertools.groupby(tuples, lambda x: x[2].dtype)

    new_blocks = []
    for dtype, tup_block in grouper:

        values, placement = _stack_arrays(list(tup_block), dtype)

        block = make_block(values, placement=placement)
        new_blocks.append(block)

    return new_blocks


def _sparse_blockify(tuples, dtype=None):
    """ return an array of blocks that potentially have different dtypes (and
    are sparse)
    """

    new_blocks = []
    for i, names, array in tuples:
        array = _maybe_to_sparse(array)
        block = make_block(array, placement=[i])
        new_blocks.append(block)

    return new_blocks


def _stack_arrays(tuples, dtype):

    # fml
    def _asarray_compat(x):
        if isinstance(x, ABCSeries):
            return x._values
        else:
            return np.asarray(x)

    def _shape_compat(x):
        if isinstance(x, ABCSeries):
            return len(x),
        else:
            return x.shape

    placement, names, arrays = zip(*tuples)

    first = arrays[0]
    shape = (len(arrays),) + _shape_compat(first)

    stacked = np.empty(shape, dtype=dtype)
    for i, arr in enumerate(arrays):
        stacked[i] = _asarray_compat(arr)

    return stacked, placement


def _interleaved_dtype(blocks):
    # type: (List[Block]) -> Optional[Union[np.dtype, ExtensionDtype]]
    """Find the common dtype for `blocks`.

    Parameters
    ----------
    blocks : List[Block]

    Returns
    -------
    dtype : Optional[Union[np.dtype, ExtensionDtype]]
        None is returned when `blocks` is empty.
    """
    if not len(blocks):
        return None

    return find_common_type([b.dtype for b in blocks])


def _consolidate(blocks):
    """
    Merge blocks having same dtype, exclude non-consolidating blocks
    """

    # sort by _can_consolidate, dtype
    gkey = lambda x: x._consolidate_key
    grouper = itertools.groupby(sorted(blocks, key=gkey), gkey)

    new_blocks = []
    for (_can_consolidate, dtype), group_blocks in grouper:
        merged_blocks = _merge_blocks(list(group_blocks), dtype=dtype,
                                      _can_consolidate=_can_consolidate)
        new_blocks = _extend_blocks(merged_blocks, new_blocks)
    return new_blocks


def _compare_or_regex_search(a, b, regex=False):
    """
    Compare two array_like inputs of the same shape or two scalar values

    Calls operator.eq or re.search, depending on regex argument. If regex is
    True, perform an element-wise regex matching.

    Parameters
    ----------
    a : array_like or scalar
    b : array_like or scalar
    regex : bool, default False

    Returns
    -------
    mask : array_like of bool
    """
    if not regex:
        op = lambda x: operator.eq(x, b)
    else:
        op = np.vectorize(lambda x: bool(re.search(b, x)) if isinstance(x, str)
                          else False)

    is_a_array = isinstance(a, np.ndarray)
    is_b_array = isinstance(b, np.ndarray)

    # numpy deprecation warning to have i8 vs integer comparisons
    if is_datetimelike_v_numeric(a, b):
        result = False

    # numpy deprecation warning if comparing numeric vs string-like
    elif is_numeric_v_string_like(a, b):
        result = False
    else:
        result = op(a)

    if is_scalar(result) and (is_a_array or is_b_array):
        type_names = [type(a).__name__, type(b).__name__]

        if is_a_array:
            type_names[0] = 'ndarray(dtype={dtype})'.format(dtype=a.dtype)

        if is_b_array:
            type_names[1] = 'ndarray(dtype={dtype})'.format(dtype=b.dtype)

        raise TypeError(
            "Cannot compare types {a!r} and {b!r}".format(a=type_names[0],
                                                          b=type_names[1]))
    return result


def _concat_indexes(indexes):
    return indexes[0].append(indexes[1:])


def items_overlap_with_suffix(left, lsuffix, right, rsuffix):
    """
    If two indices overlap, add suffixes to overlapping entries.

    If corresponding suffix is empty, the entry is simply converted to string.

    """
    to_rename = left.intersection(right)
    if len(to_rename) == 0:
        return left, right
    else:
        if not lsuffix and not rsuffix:
            raise ValueError('columns overlap but no suffix specified: '
                             '{rename}'.format(rename=to_rename))

        def lrenamer(x):
            if x in to_rename:
                return '{x}{lsuffix}'.format(x=x, lsuffix=lsuffix)
            return x

        def rrenamer(x):
            if x in to_rename:
                return '{x}{rsuffix}'.format(x=x, rsuffix=rsuffix)
            return x

        return (_transform_index(left, lrenamer),
                _transform_index(right, rrenamer))


def _transform_index(index, func, level=None):
    """
    Apply function to all values found in index.

    This includes transforming multiindex entries separately.
    Only apply function to one level of the MultiIndex if level is specified.

    """
    if isinstance(index, MultiIndex):
        if level is not None:
            items = [tuple(func(y) if i == level else y
                           for i, y in enumerate(x)) for x in index]
        else:
            items = [tuple(func(y) for y in x) for x in index]
        return MultiIndex.from_tuples(items, names=index.names)
    else:
        items = [func(x) for x in index]
        return Index(items, name=index.name, tupleize_cols=False)


def _fast_count_smallints(arr):
    """Faster version of set(arr) for sequences of small numbers."""
    counts = np.bincount(arr.astype(np.int_))
    nz = counts.nonzero()[0]
    return np.c_[nz, counts[nz]]


def _preprocess_slice_or_indexer(slice_or_indexer, length, allow_fill):
    if isinstance(slice_or_indexer, slice):
        return ('slice', slice_or_indexer,
                libinternals.slice_len(slice_or_indexer, length))
    elif (isinstance(slice_or_indexer, np.ndarray) and
          slice_or_indexer.dtype == np.bool_):
        return 'mask', slice_or_indexer, slice_or_indexer.sum()
    else:
        indexer = np.asanyarray(slice_or_indexer, dtype=np.int64)
        if not allow_fill:
            indexer = maybe_convert_indices(indexer, length)
        return 'fancy', indexer, len(indexer)


def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
    """
    Concatenate block managers into one.

    Parameters
    ----------
    mgrs_indexers : list of (BlockManager, {axis: indexer,...}) tuples
    axes : list of Index
    concat_axis : int
    copy : bool

    """
    concat_plans = [get_mgr_concatenation_plan(mgr, indexers)
                    for mgr, indexers in mgrs_indexers]
    concat_plan = combine_concat_plans(concat_plans, concat_axis)
    blocks = []

    for placement, join_units in concat_plan:

        if len(join_units) == 1 and not join_units[0].indexers:
            b = join_units[0].block
            values = b.values
            if copy:
                values = values.copy()
            elif not copy:
                values = values.view()
            b = b.make_block_same_class(values, placement=placement)
        elif is_uniform_join_units(join_units):
            b = join_units[0].block.concat_same_type(
                [ju.block for ju in join_units], placement=placement)
        else:
            b = make_block(
                concatenate_join_units(join_units, concat_axis, copy=copy),
                placement=placement)
        blocks.append(b)

    return BlockManager(blocks, axes)
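As a quick illustration of the `_fast_count_smallints` helper used by `delete` and `insert` above, here is a standalone copy of the function run on a toy `blknos`-style array; only numpy is required:

```python
import numpy as np

def _fast_count_smallints(arr):
    """Faster version of set(arr) for sequences of small numbers."""
    counts = np.bincount(arr.astype(np.int_))
    nz = counts.nonzero()[0]
    return np.c_[nz, counts[nz]]

# block numbers for five consecutive manager locations
result = _fast_count_smallints(np.array([0, 2, 2, 0, 0]))
# rows of (value, count): block 0 appears 3 times, block 2 twice
# -> [[0, 3], [2, 2]]
```

Each row pairs a distinct block number with how many manager locations map to it, which is exactly what the callers iterate over.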
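For intuition about what `_consolidate` / `_stack_arrays` accomplish, the sketch below groups same-dtype 1-D arrays and stacks each group into a single 2-D block, recording which original positions each row came from. `toy_consolidate` is a hypothetical, heavily simplified stand-in for the real Block-based machinery, not pandas code:

```python
import itertools
import numpy as np

def toy_consolidate(arrays):
    """Group 1-D arrays by dtype; stack each group into one 2-D block.

    Returns a list of (block_values, placement) pairs, where placement
    lists the original positions of the block's rows. Hypothetical
    helper -- a simplified analogue of _consolidate/_stack_arrays.
    """
    indexed = list(enumerate(arrays))
    # sort so that equal dtypes are adjacent (groupby requires this),
    # like _consolidate's sort on _consolidate_key
    indexed.sort(key=lambda pair: str(pair[1].dtype))
    blocks = []
    for dtype, group in itertools.groupby(indexed,
                                          key=lambda pair: pair[1].dtype):
        group = list(group)
        placement = [i for i, _ in group]
        stacked = np.vstack([a for _, a in group])  # one block per dtype
        blocks.append((stacked, placement))
    return blocks

arrays = [np.array([1.0, 2.0]),                  # float64
          np.array([1, 2], dtype=np.int64),      # int64
          np.array([3.0, 4.0])]                  # float64
blocks = toy_consolidate(arrays)
# two blocks: one float64 block holding rows 0 and 2, one int64 block
```

The real manager does the same grouping per `_consolidate_key`, but merges actual `Block` objects and preserves `mgr_locs` instead of plain position lists.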