6 Best Practices for Python Reusable Functions

Writing clean code is a must-have skill for data scientists working in a team with various roles because:

  • Clean code enhances readability, making it easier for team members to understand and contribute to the codebase.

  • Clean code improves maintainability and simplifies tasks such as debugging, modifying, and extending existing code.

For maintainability, our Python functions should:

  • be small

  • do only one task

  • contain no repetition

  • keep one level of abstraction

  • have a descriptive name

  • take fewer than four arguments


Let's first look at the get_data function below.

import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path
import gdown

def get_data(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
):
    # Download data from Google Drive
    gdown.download(url, zip_path, quiet=False)

    # Unzip data
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")

    # Extract texts from files in the train directory
    t_train = []
    for file_path in Path(raw_train_path).glob("*.xml"):
        list_train_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
        train_doc_1 = " ".join(t for t in list_train_doc_1)
        t_train.append(train_doc_1)
    t_train_docs = " ".join(t_train)

    # Extract texts from files in the test directory
    t_test = []
    for file_path in Path(raw_test_path).glob("*.xml"):
        list_test_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
        test_doc_1 = " ".join(t for t in list_test_doc_1)
        t_test.append(test_doc_1)
    t_test_docs = " ".join(t_test)

    # Write processed data to a train file
    with open(processed_train_path, "w") as f:
        f.write(t_train_docs)

    # Write processed data to a test file
    with open(processed_test_path, "w") as f:
        f.write(t_test_docs)


if __name__ == "__main__":
    get_data(
        url="https://drive.google.com/uc?id=1jI1cmxqnwsmC-vbl8dNY6b4aNBtBbKy3",
        zip_path="Twitter.zip",
        raw_train_path="Data/train/en",
        raw_test_path="Data/test/en",
        processed_train_path="Data/train/en.txt",
        processed_test_path="Data/test/en.txt",
    )

Although there are many comments in this function, it is difficult to understand what this function does because:

  • This function is very long.

  • This function attempts to accomplish multiple tasks.

  • The code inside the function mixes different levels of abstraction.

  • This function has many parameters.

  • There are multiple code duplications.

  • The function lacks a descriptive name.

We'll refactor this code by using the six practices mentioned at the beginning of the article.

small

A function should be kept small to improve its readability. Ideally, a function should not exceed 20 lines of code, and its body should not be indented more than one or two levels.

import zipfile
import gdown

def get_raw_data(url: str, zip_path: str) -> None:
    gdown.download(url, zip_path, quiet=False)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")

only do one task

Functions should have a single focus and perform a single task. The function get_data attempts to accomplish several tasks, including retrieving data from Google Drive, performing text extraction, and saving the extracted text.

Therefore, this function should be divided into several smaller functions, as shown in the main function below:

def main(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
) -> None:
    get_raw_data(url, zip_path)
    t_train, t_test = get_train_test_docs(raw_train_path, raw_test_path)
    save_train_test_docs(processed_train_path, processed_test_path, t_train, t_test)

Each of these functions should have a single purpose:

def get_raw_data(url: str, zip_path: str) -> None:
    gdown.download(url, zip_path, quiet=False)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")

The function get_raw_data performs only one action: getting the raw data.

no repetition

We should avoid repetition because:

  • Duplicated code impairs code readability.

  • Duplicated code complicates code modification. If modifications are required, they need to be modified in multiple places, increasing the possibility of errors.

The code below contains duplication: the code for retrieving the training data and the test data is almost identical.

from pathlib import Path

# Extract texts from files in the train directory
t_train = []
for file_path in Path(raw_train_path).glob("*.xml"):
    list_train_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
    train_doc_1 = " ".join(t for t in list_train_doc_1)
    t_train.append(train_doc_1)
t_train_docs = " ".join(t_train)

# Extract texts from files in the test directory
t_test = []
for file_path in Path(raw_test_path).glob("*.xml"):
    list_test_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
    test_doc_1 = " ".join(t for t in list_test_doc_1)
    t_test.append(test_doc_1)
t_test_docs = " ".join(t_test)

We can eliminate the duplication by merging the duplicated code into a single function called extract_texts_from_multiple_files, which extracts text from all the files in a specified location.

def extract_texts_from_multiple_files(folder_path: str) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        text_in_one_file = " ".join(list_of_text_in_one_file)
        all_docs.append(text_in_one_file)

    return " ".join(all_docs)

Now you can use this function to extract text from different locations without duplicating the code.

t_train = extract_texts_from_multiple_files(raw_train_path)
t_test  = extract_texts_from_multiple_files(raw_test_path)

one level of abstraction

The level of abstraction refers to the complexity of a system. High-level refers to a more general view of the system, while low-level refers to more specific aspects of the system.

It is a good practice to keep the same level of abstraction within a code segment to make the code easier to understand.

The following function demonstrates this:

def extract_texts_from_multiple_files(folder_path: str) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        text_in_one_file = " ".join(list_of_text_in_one_file)
        all_docs.append(text_in_one_file)

    return " ".join(all_docs)

The function itself is at a higher level, but the code inside the for loop involves lower-level operations related to XML parsing, text extraction, and string manipulation.

To resolve this mix of abstraction levels, we can encapsulate the low-level operations in a separate function, extract_texts_from_each_file:

def extract_texts_from_multiple_files(folder_path: str) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        text_in_one_file = extract_texts_from_each_file(file_path)
        all_docs.append(text_in_one_file)

    return " ".join(all_docs)
    

def extract_texts_from_each_file(file_path: str) -> str:
    list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
    return " ".join(list_of_text_in_one_file)
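To see concretely what extract_texts_from_each_file returns, here is a minimal run on a throwaway XML file. The tag names below are illustrative (the real dataset may use different ones), but the shape is what getroot()[0] assumes: a root element, one wrapper child, and text elements inside it.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def extract_texts_from_each_file(file_path) -> str:
    list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
    return " ".join(list_of_text_in_one_file)

# A tiny file in the expected shape: root -> wrapper -> text elements.
xml_doc = (
    "<author><documents>"
    "<document>hello</document><document>world</document>"
    "</documents></author>"
)
path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(path, "w") as f:
    f.write(xml_doc)

print(extract_texts_from_each_file(path))  # hello world
```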

This introduces a higher level of abstraction to the text extraction process, making the code more readable.

descriptive name

A function's name should be descriptive enough that users can understand its purpose without reading the code. Long, descriptive names are better than vague ones. For example, naming a function get_texts is not as clear as naming it extract_texts_from_multiple_files.

However, if a function's name becomes too long, for example retrieve_data_extract_text_and_save_data, it's a sign that the function might be doing too much and should be split into smaller functions.

fewer than four parameters

As the number of function parameters increases, it becomes more complicated to keep track of the order, purpose, and relationship among the many parameters. This makes it difficult for developers to understand and use the function.

def main(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
) -> None:
    get_raw_data(url, zip_path)
    t_train, t_test = get_train_test_docs(raw_train_path, raw_test_path)
    save_train_test_docs(processed_train_path, processed_test_path, t_train, t_test)

To improve code readability, you can use data classes or Pydantic models to encapsulate multiple related parameters in a single data structure.

from pydantic import BaseModel

class RawLocation(BaseModel):
    url: str
    zip_path: str
    path_train: str
    path_test: str


class ProcessedLocation(BaseModel):
    path_train: str
    path_test: str


def main(raw_location: RawLocation, processed_location: ProcessedLocation) -> None:
    get_raw_data(raw_location)
    t_train, t_test = get_train_test_docs(raw_location)
    save_train_test_docs(processed_location, t_train, t_test)
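If you prefer to avoid a third-party dependency, the standard library's dataclasses give the same grouping of related parameters (without Pydantic's validation); this is a sketch of the equivalent models:

```python
from dataclasses import dataclass

# Standard-library alternative to the Pydantic models above.
@dataclass
class RawLocation:
    url: str
    zip_path: str
    path_train: str
    path_test: str

@dataclass
class ProcessedLocation:
    path_train: str
    path_test: str

raw_location = RawLocation(
    url="https://drive.google.com/uc?id=1jI1cmxqnwsmC-vbl8dNY6b4aNBtBbKy3",
    zip_path="Twitter.zip",
    path_train="Data/train/en",
    path_test="Data/test/en",
)
processed_location = ProcessedLocation(
    path_train="Data/train/en.txt",
    path_test="Data/test/en.txt",
)
```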

How do I write such functions?

You don't need to keep all these best practices in mind when writing Python functions. A good indicator of the quality of a Python function is its testability. If a function can be easily tested, it indicates that the function is modular, performs a single task, and has no duplicated code.

def save_data(processed_path: str, processed_data: str) -> None:
    with open(processed_path, "w") as f:
        f.write(processed_data)


def test_save_data(tmp_path):
    processed_path = tmp_path / "processed_data.txt"
    processed_data = "Sample processed data"

    save_data(processed_path, processed_data)

    assert processed_path.exists()
    assert processed_path.read_text() == processed_data

References
Martin, R. C. (2009). Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River, NJ: Prentice Hall.

Origin blog.csdn.net/m0_59596937/article/details/132462076