[Python] Introduction to Pandas, introduction to data structure Series and DataFrame, CSV file processing, JSON file processing

No.  Content
1 [Python] Introduction to Pandas, introduction to data structure Series and DataFrame, CSV file processing, JSON file processing
2 [Python] Pandas data cleaning operation, summary of commonly used functions

1. Introduction to Pandas

Pandas is an extension library for the Python language for data analysis that provides high-performance, easy-to-use data structures and data analysis tools. The name Pandas is derived from the terms "panel data" and "Python data analysis".

Pandas is a powerful toolset for analyzing structured data, built on NumPy (which provides high-performance array and matrix operations). Pandas can import data from various file formats such as CSV, JSON, SQL, and Microsoft Excel. It supports operations such as merging, reshaping, and selection, as well as data cleaning and data processing. Pandas is widely used in data analysis fields such as academia, finance, and statistics.

2. Pandas data structure

The main data structures of Pandas are:

  • Series (one-dimensional data)
  • DataFrame (two-dimensional data)

1. Series (one-dimensional data)

A Series is a one-dimensional array-like object that consists of a set of data (of any NumPy data type) and an associated set of data labels (i.e., the index).

A Pandas Series is similar to a column in a table or to a one-dimensional array, and can hold any data type.


A Series consists of an index and values. The constructor is as follows:

pandas.Series(data, index, dtype, name, copy)

Parameter Description:

  • data: The data (for example an ndarray, list, dict, or scalar value).
  • index: Index labels for the data; if not specified, integer labels starting from 0 are used.
  • dtype: Data type; inferred from the data by default.
  • name: Name of the Series.
  • copy: Whether to copy the input data; default is False.

For a code example, refer to: Pandas Data Structure - Series.
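As a minimal sketch of the constructor described above, the following builds one Series with explicit labels and one with the default index. The values, the labels "x", "y", "z", and the name "sites" are illustrative assumptions, not data from the original article.

```python
import pandas as pd

# Illustrative data; the labels and the name "sites" are assumptions.
s = pd.Series(["Google", "Runoob", "Wiki"], index=["x", "y", "z"], name="sites")

print(s)
print(s["y"])  # look up a value by its index label

# Without an explicit index, integer labels starting from 0 are used.
default_index = pd.Series([10, 20, 30])
print(list(default_index.index))
```

Label-based lookup such as s["y"] is what distinguishes a Series from a plain list, where only positions are available.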


2. DataFrame (two-dimensional data)

A DataFrame is a tabular data structure that contains an ordered set of columns; each column can hold a different value type (numeric, string, Boolean). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that share one index.



The DataFrame constructor is as follows:

pandas.DataFrame(data, index, columns, dtype, copy)

Parameter Description:

  • data: The data (ndarray, Series, list, dict, and similar types).
  • index: Index values, also called row labels.
  • columns: Column labels; the default is a RangeIndex (0, 1, 2, …, n).
  • dtype: Data type.
  • copy: Whether to copy the input data; default is False.

Where a row has no corresponding data for a column, the value is NaN.
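A small sketch of how missing entries become NaN, assuming the DataFrame is built from a list of dictionaries whose keys do not all match. The keys "a", "b", and "c" are made up for illustration.

```python
import pandas as pd

# The first record has no "c" key, so that cell is filled with NaN.
rows = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(rows)

print(df)
print(df["c"].isna().tolist())  # which rows are missing a value for "c"
```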

The loc attribute returns the data of the specified row. If no index is set, the index of the first row is 0, the index of the second row is 1, and so on.

For a code example, refer to: Pandas data structure - DataFrame.
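The row lookup with loc can be sketched as follows. The column names "calories" and "duration" and the labels "day1" to "day3" are assumptions chosen for illustration.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# No index given: rows are labeled 0, 1, 2.
df = pd.DataFrame(data)
first_row = df.loc[0]        # a single label returns a Series
first_two = df.loc[[0, 1]]   # a list of labels returns a DataFrame
print(first_row)
print(first_two)

# With a custom index, loc looks rows up by those labels instead.
labeled = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(labeled.loc["day2"])
```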


3. Process CSV files

CSV (Comma-Separated Values, sometimes also called character-separated values because the separator does not have to be a comma) files store tabular data (numbers and text) in plain text.

CSV is a versatile, relatively simple file format that is widely used in consumer, business, and scientific applications.


to_string() returns the entire DataFrame as a string. Without it, printing a large DataFrame shows only the first 5 rows and the last 5 rows, with the middle part replaced by "...".

import pandas as pd
df = pd.read_csv('nba.csv')
print(df.to_string())
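Since nba.csv may not be at hand, here is a self-contained sketch of the same truncation behavior on a generated 100-row DataFrame. The display options are set explicitly so the result does not depend on local settings; the column name "x" is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# Force the shortened display: at most 10 rows, split between head and tail.
with pd.option_context("display.max_rows", 10, "display.min_rows", 10):
    short = repr(df)

# to_string() always renders every row, regardless of display options.
full = df.to_string()

print(short)
print(len(short.splitlines()), len(full.splitlines()))
```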

The to_csv() method stores the DataFrame as a CSV file.

import pandas as pd
df = pd.read_csv('nba.csv')
df.to_csv('site.csv')
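A round trip can be sketched without touching the disk, assuming an in-memory buffer is acceptable for illustration. The columns "name" and "points" are made up. Note that to_csv() writes the row index as an extra first column unless index=False is passed.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["A", "B"], "points": [10, 20]})

# With no path, to_csv() returns the CSV text instead of writing a file.
# index=False leaves out the row index column.
csv_text = df.to_csv(index=False)
print(csv_text)

# Read the text back the same way a file would be read.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```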

The head(n) method returns the first n rows. If n is omitted, 5 rows are returned by default.

import pandas as pd
df = pd.read_csv('nba.csv')
print(df.head(10))

The tail(n) method returns the last n rows. If n is omitted, 5 rows are returned by default. Fields that are empty in a row are returned as NaN.

import pandas as pd
df = pd.read_csv('nba.csv')
print(df.tail(10))
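The same two methods can be tried on a small generated DataFrame, which makes the row counts easy to check. The column name "n" is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)  # rows 0, 1, 2
last_two = df.tail(2)     # rows 8, 9
print(first_three)
print(last_two)

# With no argument, both methods return 5 rows.
print(len(df.head()), len(df.tail()))
```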

The info() method prints some basic information about the table:

import pandas as pd
df = pd.read_csv('nba.csv')
# info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()
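If the summary is needed as text rather than printed output, one option is the buf parameter of info(). This is a sketch on made-up data; the columns "name" and "points" are assumptions.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", None], "points": [10, 20, 30]})

# Redirect the summary into a text buffer instead of standard output.
buffer = io.StringIO()
result = df.info(buf=buffer)
summary = buffer.getvalue()

print(result)   # info() itself returns None
print(summary)  # row count, column names, non-null counts, dtypes
```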

4. Process JSON files

JSON (JavaScript Object Notation) is a syntax for storing and exchanging text information, similar to XML.

JSON is smaller, faster, and easier to parse than XML. For more on JSON, please refer to the JSON tutorial.


The following is the data content of the file named sites.json:

[
   {
   "id": "A001",
   "name": "菜鸟教程",
   "url": "www.runoob.com",
   "likes": 61
   },
   {
   "id": "A002",
   "name": "Google",
   "url": "www.google.com",
   "likes": 124
   },
   {
   "id": "A003",
   "name": "淘宝",
   "url": "www.taobao.com",
   "likes": 45
   }
]

read_json() loads the file into a DataFrame, and to_string() returns the whole DataFrame as a string for printing. We can also process JSON strings directly.

import pandas as pd
df = pd.read_json('sites.json')
print(df.to_string())
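Processing a JSON string directly can be sketched as below. The string is wrapped in io.StringIO because recent pandas versions deprecate passing literal JSON text to read_json(); the records are a shortened, illustrative variant of the file above.

```python
import io
import pandas as pd

# Illustrative JSON text; in practice this might come from an API response.
json_text = """[
  {"id": "A001", "name": "Google", "likes": 124},
  {"id": "A002", "name": "Runoob", "likes": 61}
]"""

df = pd.read_json(io.StringIO(json_text))
print(df.to_string())
```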

JSON objects have a format very similar to Python dictionaries, so we can also convert Python dictionaries directly into a DataFrame.
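A minimal sketch of the dictionary conversion, assuming a dictionary that maps column names to lists of equal length. The keys "col1" and "col2" are made up for illustration.

```python
import pandas as pd

# Each key becomes a column; each list holds that column's values.
sites = {"col1": [1, 2], "col2": ["x", "y"]}
df = pd.DataFrame(sites)
print(df)

# to_dict(orient="list") converts back to the same column-to-list layout.
round_trip = df.to_dict(orient="list")
print(round_trip)
```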

Read JSON data from URL:

import pandas as pd
URL = 'https://static.runoob.com/download/sites.json'
df = pd.read_json(URL)
print(df)

The json_normalize() method can fully flatten nested (embedded) data.
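The following sketch shows two common uses of pd.json_normalize() on hypothetical nested data: expanding a nested list into rows with record_path and meta, and flattening a nested dictionary into dotted column names. All field names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical nested record: a list of students inside a school object.
school = {
    "school_name": "ABC primary school",
    "class": "Year 1",
    "students": [
        {"id": "A001", "name": "Tom", "math": 60},
        {"id": "A002", "name": "James", "math": 89},
    ],
}

# One row per student, with the top-level fields repeated as extra columns.
students = pd.json_normalize(
    school, record_path="students", meta=["school_name", "class"]
)
print(students)

# Nested dictionaries are flattened into dotted column names.
records = [{"id": 1, "info": {"city": "X", "zip": "1"}}]
flat = pd.json_normalize(records)
print(list(flat.columns))
```

Reading the same data with read_json() alone would leave the nested list or dictionary inside a single column, which is why the flattening step is useful.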



Origin blog.csdn.net/weixin_36815313/article/details/132138453