Python is arguably one of the easiest programming languages to get started with. With foundational packages such as numpy and scipy, it is also one of the strongest choices for data processing and machine learning.
Thanks to prominent experts and enthusiastic contributors, Python has a huge community driving its development, and a wide variety of Python packages have been built to help with data work.
1、Knockknock
Knockknock is a simple Python package that notifies you when a machine learning model finishes training or crashes. We can be notified through various channels such as email, Slack, Microsoft Teams, etc.
In order to install the package, we use the following code.
pip install knockknock
For example, the following code sends the training status of a machine learning model to the specified email addresses.
from knockknock import email_sender
from sklearn.linear_model import LinearRegression
import numpy as np
@email_sender(recipient_emails=["<[email protected]>", "<[email protected]>"], sender_email="<[email protected]>")
def train_linear_model(your_nicest_parameters):
    x = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(x, np.array([1, 2])) + 3
    regression = LinearRegression().fit(x, y)
    return regression.score(x, y)
This way you can be notified when there is a problem with the function or when it completes.
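Conceptually, a notifier like this is a decorator that wraps the training function, reports when it returns, and reports and re-raises when it fails. The sketch below illustrates that pattern with the standard library only. The names notify_on_finish and send are hypothetical and are not part of Knockknock's API.

```python
import functools
import traceback


def notify_on_finish(send=print):
    """Hypothetical decorator: call `send` when the wrapped function ends."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Report the crash, then let the exception propagate.
                send(f"{func.__name__} crashed:\n{traceback.format_exc()}")
                raise
            send(f"{func.__name__} finished, returned {result!r}")
            return result
        return wrapper
    return decorator


messages = []  # collect notifications here instead of sending an email


@notify_on_finish(send=messages.append)
def train(n):
    return sum(range(n))


train(5)
```

Swapping messages.append for a function that sends an email or a chat message would give roughly the behavior described above.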
2、tqdm
What if you need to show a progress bar when iterating or looping? Then tqdm is what you need. This package will provide a simple progress meter in your notebook or command prompt.
Let's start by installing the package.
pip install tqdm
The following code can then be used to display a progress bar during the loop.
from tqdm import tqdm
q = 0
for i in tqdm(range(10000000)):
    q = i + 1
This displays a progress bar in the notebook or terminal that updates as the loop runs. It is very useful when you have a complex iteration and want to track its progress.
3、Pandas-log
pandas-log provides feedback on basic pandas operations such as .query, .drop, and .merge. It is inspired by the Tidyverse of R, and you can use it to understand every step of a data analysis.
Install the package.
pip install pandas-log
After installing the package, take a look at the example below.
import pandas as pd
import numpy as np
import pandas_log
df = pd.DataFrame({
    "name": ['Alfred', 'Batman', 'Catwoman'],
    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})
Then let's log a simple chain of pandas operations with the following code.
with pandas_log.enable():
    res = (df.drop("born", axis=1)
             .groupby('name')
          )
Through pandas-log, we can get all the execution information.
4、Emoji
As the name suggests, Emoji is a Python package that supports emoji text parsing. Usually, we have a hard time dealing with emoji in Python, but the Emoji package can help us with the conversion.
Use the following code to install the Emoji package.
pip install emoji
Take a look at the following code:
import emoji
print(emoji.emojize('Python is :thumbs_up:'))
Python is 👍
With this package, you can easily output emoji.
5、TheFuzz
TheFuzz uses Levenshtein distance to compute how similar two strings are.
pip install thefuzz
The following code describes how to use TheFuzz for similarity text matching.
from thefuzz import fuzz, process
#Testing the score between two sentences
fuzz.ratio("Test the word", "test the Word!")
81
TheFuzz can also extract similarity scores from multiple words at the same time.
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract("new york jets", choices, limit=2)
[('New York Jets', 100),
 ('New York Giants', 79)]
TheFuzz is suitable for similarity detection on any text data, a task that is very important in NLP.
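To make the idea of Levenshtein distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming algorithm. The similarity helper is an illustrative 0 to 100 score and is an assumption for this example; it is not the exact formula TheFuzz uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative 0-100 score; NOT TheFuzz's exact scoring formula."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


distance = levenshtein("kitten", "sitting")
```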
6、Numerizer
Numerizer converts written numeric literals to corresponding integer or floating point numbers.
pip install numerizer
Then let's try a few inputs to convert.
from numerizer import numerize
numerize('forty two')
'42'
It also works if another writing style is used.
numerize('forty-two')
'42'
numerize('nine and three quarters')
'9.75'
If the input is not a numeric expression, it will be preserved:
numerize('maybe around nine and three quarters')
'maybe around 9.75'
7、PyAutoGUI
PyAutoGUI can automatically control the mouse and keyboard.
pip install pyautogui
Then we can test with the following code.
import pyautogui
pyautogui.moveTo(10, 15)
pyautogui.click()
pyautogui.doubleClick()
pyautogui.press('enter')
The above code moves the mouse to the given position, clicks, double-clicks, and then presses Enter. This is useful for repetitive actions such as downloading files or collecting data.
8、Weightedcalcs
Weightedcalcs is used for weighted statistical calculations. Usage ranges from simple statistics such as weighted means, medians, and standard deviations to weighted counts, distributions, and more.
pip install weightedcalcs
Compute weighted distributions using available data.
import seaborn as sns
df = sns.load_dataset('mpg')
import weightedcalcs as wc
calc = wc.Calculator("mpg")
Then we compute the weighted result by passing in the dataset and the variable of interest.
calc.distribution(df, "origin")
origin
europe 0.208616
japan 0.257042
usa 0.534342
Name: mpg, dtype: float64
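As a rough illustration of what a weighted distribution means, the sketch below sums the weights in each group and divides by the total weight. It uses only the standard library and made-up toy data rather than the mpg dataset, and the function name weighted_distribution is hypothetical, not part of Weightedcalcs.

```python
from collections import defaultdict


def weighted_distribution(groups, weights):
    """Share of the total weight that falls in each group."""
    totals = defaultdict(float)
    for group, weight in zip(groups, weights):
        totals[group] += weight
    grand_total = sum(totals.values())
    return {group: total / grand_total for group, total in totals.items()}


# Made-up toy data: the second list plays the role of the weight column.
origins = ["usa", "usa", "japan", "europe"]
weights = [10.0, 20.0, 30.0, 40.0]
dist = weighted_distribution(origins, weights)
```

Here "usa" holds 30 of the 100 weight units, so its share is 0.3, even though it accounts for half of the rows.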
9、scikit-posthocs
scikit-posthocs is a Python package for "post hoc" test analysis, often used for pairwise comparisons in statistical analysis. This package provides a simple scikit-learn-like API for analysis.
pip install scikit-posthocs
Then let's start with a simple dataset and run an ANOVA test.
import statsmodels.api as sa
import statsmodels.formula.api as sfa
import scikit_posthocs as sp
df = sa.datasets.get_rdataset('iris').data
df.columns = df.columns.str.replace('.', '', regex=False)  # literal dot, not a regex wildcard
lm = sfa.ols('SepalWidth ~ C(Species)', data=df).fit()
anova = sa.stats.anova_lm(lm)
print(anova)
               df     sum_sq   mean_sq        F        PR(>F)
C(Species)    2.0  11.344933  5.672467  49.1600  4.492017e-17
Residual    147.0  16.962000  0.115388      NaN           NaN
The ANOVA result tells us that the groups differ, but not which classes differ from each other. To find out, run a pairwise post hoc test with the following code.
sp.posthoc_ttest(df,
val_col='SepalWidth',
group_col='Species',
p_adjust='holm')
Using scikit-posthocs, we simplified the pairwise post hoc analysis and obtained the adjusted P values.
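The p_adjust='holm' argument refers to the Holm step-down correction for multiple comparisons. As a sketch of the idea only (not the library's implementation), the smallest p-value is multiplied by the number of tests m, the next by m - 1, and so on, while keeping the adjusted values non-decreasing and capped at 1. The p-values below are made up.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up p-values from three pairwise tests
adj = holm_adjust(raw)
```

For these inputs the adjusted values are approximately 0.03, 0.06, and 0.06.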
10、Cerberus
Cerberus is a lightweight python package for data validation.
pip install cerberus
The basic usage of Cerberus is to validate the structure of a document (a dictionary) against a schema.
from cerberus import Validator
schema = {
    'name': {'type': 'string'},
    'gender': {'type': 'string'},
    'age': {'type': 'integer'}}
v = Validator(schema)
After the structure that needs to be verified is defined, the instance can be verified.
document = {'name': 'john doe', 'gender': 'male', 'age': 15}
v.validate(document)
True
If there is a match, the Validator class will output True. This way we can make sure the data structure is correct.
11、ppscore
ppscore is used to calculate the predictive power of variables related to the target variable. This package computes a score that can detect a linear or non-linear relationship between two variables. Scores range from 0 (no predictive power) to 1 (perfect predictive power).
pip install ppscore
Use the ppscore package to calculate the score of every column against the target.
import seaborn as sns
import ppscore as pps
df = sns.load_dataset('mpg')
pps.predictors(df, 'mpg')
The results are sorted. Lower-ranked variables are less predictive of the target.
12、Maya
Maya is designed to parse DateTime data as easily as possible.
pip install maya
Then we can easily get the current date using the following code.
import maya
now = maya.now()
print(now)
It can also parse relative expressions such as tomorrow's date.
tomorrow = maya.when('tomorrow')
tomorrow.datetime()
datetime.datetime(2022, 8, 8, 6, 44, 10, 141499, tzinfo=<UTC>)
13、Pendulum
Pendulum is another python package dealing with DateTime data. It is used to simplify any DateTime analysis process.
pip install pendulum
Here are a few common operations to try.
import pendulum
now = pendulum.now("Europe/Berlin")
now.in_timezone("Asia/Tokyo")
now.to_iso8601_string()
now.add(days=2)
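For comparison, the same three operations (time zone conversion, ISO 8601 formatting, and adding days) can be sketched with the standard library alone. This is only an illustration: fixed UTC offsets stand in for the named zones, so daylight saving time is not handled, and the date used is an arbitrary fixed example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stand in for named zones (assumption: no DST handling here).
BERLIN_WINTER = timezone(timedelta(hours=1))  # CET
TOKYO = timezone(timedelta(hours=9))          # JST

moment = datetime(2022, 1, 15, 12, 0, tzinfo=BERLIN_WINTER)
in_tokyo = moment.astimezone(TOKYO)           # same instant, different wall clock
iso_text = in_tokyo.isoformat()               # ISO 8601 string
two_days_later = moment + timedelta(days=2)   # date arithmetic
```

Libraries such as Pendulum wrap these steps in a more convenient API and handle named time zones for you.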
14、category_encoders
category_encoders is a python package for encoding (converting to numerical data) categorical data. This package is a collection of various encoding methods that we can apply to various categorical data as needed.
pip install category_encoders
The transformation can be applied using the following example.
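import seaborn as sns
from category_encoders import BinaryEncoder

# Reuse the seaborn mpg dataset; 'origin' is one of its categorical columns
df = sns.load_dataset('mpg')
enc = BinaryEncoder(cols=['origin']).fit(df)
numeric_dataset = enc.transform(df)
numeric_dataset.head()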
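The idea behind binary encoding can be sketched in plain Python: give each category an integer, then spread that integer's bits across a few 0/1 columns. The function below is a hypothetical illustration under that assumption; the exact numbering and column layout produced by BinaryEncoder may differ.

```python
def binary_encode(values):
    """Map each category to an integer, then spread its bits over columns."""
    if not values:
        return []
    # Ordinal step: number categories from 1 in order of first appearance.
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    width = max(codes.values()).bit_length()
    # Binary step: one 0/1 column per bit, most significant bit first.
    return [[(codes[v] >> shift) & 1 for shift in range(width - 1, -1, -1)]
            for v in values]


rows = binary_encode(["usa", "japan", "europe", "usa"])
```

Three categories fit in two columns here, whereas one-hot encoding would need three. The saving grows with the number of categories.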
15、scikit-multilearn
scikit-multilearn is used for multi-label classification, where each sample can carry several labels at once (as opposed to multiclass classification, where each sample gets exactly one of several classes). This package provides APIs for training machine learning models that predict such multi-label targets.
pip install scikit-multilearn
Use the sample data set to perform multi-label KNN to train the classifier and measure the performance indicators.
from skmultilearn.dataset import load_dataset
from skmultilearn.adapt import MLkNN
import sklearn.metrics as metrics
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ = load_dataset('emotions', 'test')
classifier = MLkNN(k=3)
prediction = classifier.fit(X_train, y_train).predict(X_test)
metrics.hamming_loss(y_test, prediction)
emotions:train - exists, not redownloading
emotions:test - exists, not redownloading
0.2953795379537954
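The Hamming loss reported above is the fraction of individual label slots that were predicted incorrectly, so lower is better. A minimal pure-Python sketch on made-up dense label matrices (the real example uses sparse matrices and scikit-learn's implementation):

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that were predicted incorrectly."""
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            wrong += t != p  # True counts as 1
            total += 1
    return wrong / total


# Two samples with three labels each (made-up data).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)  # 2 wrong slots out of 6
```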
16、Multiset
Multiset is similar to the built-in set type, but it allows multiple occurrences of the same element.
pip install multiset
You can create a Multiset with the following code.
from multiset import Multiset
set1 = Multiset('aab')
set1
Multiset({'a': 2, 'b': 1})
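For readers who want to see the multiset idea without installing anything, the standard library's collections.Counter behaves in a similar way. The sketch below is a comparison, not a description of the Multiset package's own API.

```python
from collections import Counter

bag = Counter("aab")        # {'a': 2, 'b': 1}
other = Counter("abc")      # {'a': 1, 'b': 1, 'c': 1}

union = bag | other         # per-element maximum of the counts
intersection = bag & other  # per-element minimum of the counts
total = bag + other         # per-element sum of the counts
```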
17、Jazzit
Jazzit plays music when our code raises an error or while we wait for it to finish running.
pip install jazzit
Use the following code to try the sample music for the error case.
from jazzit import error_track
@error_track("curb_your_enthusiasm.mp3", wait=5)
def run():
    for num in reversed(range(10)):
        print(10/num)
Although this package is not very practical, it is a fun one to try.
18、handcalcs
handcalcs is used to simplify the process of mathematical formulas in notebooks. It converts any mathematical function into its equation form.
pip install handcalcs
Use the following code to test the handcalcs package. The %%render magic command renders the calculation as LaTeX. As a cell magic, it must be the first line of its own notebook cell.
import handcalcs.render
from math import sqrt
%%render
a = 4
b = 6
c = sqrt(3*a + b/7)
19、NeatText
NeatText simplifies text cleaning and preprocessing. It is useful for any NLP project and text machine learning project data.
pip install neattext
Use the following code to generate test data.
import neattext as nt
mytext = "This is the word sample but ,our WEBSITE is https://exaempleeele.com ✨."
docx = nt.TextFrame(text=mytext)
TextFrame is used to start the NeatText class and various functions can then be used to view and clean the data.
docx.describe()
Key Value
Length : 72
vowels : 21
consonants: 33
stopwords : 5
punctuations: 6
special_char: 6
tokens(whitespace): 11
tokens(words): 13
Using the describe function, per-text statistics can be displayed. To further clean the data, the following code can be used.
docx.normalize()
20、Combo
Combo is a Python package for combining machine learning models and scores. This package provides a toolbox that allows various machine learning models to be combined into one model. In other words, it supports model ensembling.
pip install combo
Create a machine learning ensemble using the breast cancer dataset from scikit-learn and various classification models from scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from combo.models.classifier_stacking import Stacking
from combo.utils.data import evaluate_print
Next, look at how each individual classifier performs on its own.
# Define data file and read X and y
random_state = 42
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=random_state)
# initialize a group of clfs
classifiers = [DecisionTreeClassifier(random_state=random_state),
LogisticRegression(random_state=random_state),
KNeighborsClassifier(),
RandomForestClassifier(random_state=random_state),
GradientBoostingClassifier(random_state=random_state)]
clf_names = ['DT', 'LR', 'KNN', 'RF', 'GBDT']
for i, clf in enumerate(classifiers):
    clf.fit(X_train, y_train)
    y_test_predict = clf.predict(X_test)
    evaluate_print(clf_names[i] + ' | ', y_test, y_test_predict)
print()
DT | Accuracy: 0.9386, ROC:0.9383, F1:0.9521
LR | Accuracy: 0.9693, ROC:0.962, F1:0.9766
KNN | Accuracy: 0.9561, ROC:0.9519, F1:0.9662
RF | Accuracy: 0.9781, ROC:0.9716, F1:0.9833
GBDT | Accuracy: 0.9605, ROC:0.9524, F1:0.9699
Now build a stacking model from the same classifiers using the Combo package.
clf = Stacking(classifiers, n_folds=4, shuffle_data=False,
keep_original=True, use_proba=False,
random_state=random_state)
clf.fit(X_train, y_train)
y_test_predict = clf.predict(X_test)
evaluate_print('Stacking | ', y_test, y_test_predict)
21、PyAztro
Do you need horoscope data, or are you just curious about today's luck? You can use PyAztro to get this information. The package provides lucky numbers, lucky symbols, moods, and more. It could even serve as the base data for an AI fortune-teller.
pip install pyaztro
Use the code below to access today's horoscope information.
import pyaztro
pyaztro.Aztro(sign='gemini').description
"A very sexy visitor will cross your path soon.
If not today, then within a few days.
Once they arrive, lots of things will change.
Your organized schedule, for one.
Not that you'll mind, of course!"
22、Faker
Faker can be used to simplify generating synthetic data. Many developers use this package to create test data.
pip install Faker
Generate synthetic data using the Faker package as follows.
from faker import Faker
fake = Faker()
# generate a name
fake.name()
'Danielle Cobb'
Each call to the .name() method of the Faker instance returns a newly generated random name.
23、Fairlearn
Fairlearn is used to evaluate and mitigate unfairness in machine learning models. This package provides many of the APIs needed to inspect bias.
pip install fairlearn
You can then use Fairlearn's dataset to see how much bias is in the model.
from fairlearn.metrics import MetricFrame, selection_rate
from fairlearn.datasets import fetch_adult
data = fetch_adult(as_frame=True)
X = data.data
y_true = (data.target == '>50K') * 1
sex = X['sex']
selection_rates = MetricFrame(metrics=selection_rate,
y_true=y_true,
y_pred=y_true,
sensitive_features=sex)
fig = selection_rates.by_group.plot.bar(
legend=False, rot=0,
title='Fraction earning over $50,000')
Fairlearn's selection_rate metric measures the fraction of positive predictions. Computing it per group with MetricFrame exposes the differences between groups, so that we can see the bias in the results.
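To show what a per-group selection rate amounts to, here is a small standard-library sketch on made-up predictions. The function name selection_rate_by_group is hypothetical and only mirrors the idea. It is not Fairlearn's implementation.

```python
from collections import defaultdict


def selection_rate_by_group(y_pred, groups):
    """Fraction of positive (1) predictions within each group."""
    positives = defaultdict(int)
    counts = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        positives[group] += pred
        counts[group] += 1
    return {group: positives[group] / counts[group] for group in counts}


# Made-up predictions for six people.
preds = [1, 0, 1, 1, 0, 0]
sexes = ["Male", "Male", "Male", "Female", "Female", "Female"]
rates = selection_rate_by_group(preds, sexes)
```

A large gap between the groups' rates, as in this toy data, is the kind of disparity the bar chart above is meant to reveal.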
24、tiobeindexpy
tiobeindexpy is used to get TIOBE index data. The TIOBE Index is a ranking of programming languages that matters to developers because we don't want to miss out on the next big thing in the programming world.
pip install tiobeindexpy
The top 20 programming language rankings of the month can be obtained through the following code.
from tiobeindexpy import tiobeindexpy as tb
df = tb.top_20()
25、pytrends
pytrends is an unofficial API for Google Trends that retrieves keyword trend data. This package is useful if you want to understand current web trends or trends related to your keywords. Note that it requires network access to Google.
pip install pytrends
Let's say we want to know the current trends related to the keyword "Present Gift".
from pytrends.request import TrendReq
import pandas as pd
pytrend = TrendReq()
keywords = pytrend.suggestions(keyword='Present Gift')
df = pd.DataFrame(keywords)
df
This package will return the top 5 trends related to the keyword.
26、visions
visions is a python package for semantic data analysis. This package can detect data types and infer what a column's data should be.
pip install visions
The following code can be used to detect the column data type in the data. The Titanic dataset of seaborn is used here.
import seaborn as sns
from visions.functional import detect_type, infer_type
from visions.typesets import CompleteSet
df = sns.load_dataset('titanic')
typeset = CompleteSet()
# Detect the data type of each column
print(detect_type(df, typeset))
27、Schedule
Schedule lets you run any function on a recurring schedule.
pip install schedule
For example, suppose we want to run a job every 10 seconds:
import schedule
import time
def job():
    print("I'm working...")
schedule.every(10).seconds.do(job)
while True:
    schedule.run_pending()
    time.sleep(1)
I'm working...
I'm working...
I'm working...
I'm working...
I'm working...
I'm working...
I'm working...
I'm working...
I'm working...
I'm working...
28、autocorrect
autocorrect is a Python package for spelling correction that supports many languages. It is simple to use and very useful in the data cleaning process.
pip install autocorrect
Autocorrection can be done using code similar to the following.
from autocorrect import Speller
spell = Speller()
spell("I'm not sleaspy and tehre is no place, I'm giong to.")
"I'm not sleaspy and tehre is no place,
I'm giong to."
29、funcy
funcy contains nifty utility functions for everyday data analysis use. There are too many functions in the package to show them all here, so please check its documentation if you are interested.
pip install funcy
Here is just an example function for selecting an even number from an iterable variable as shown in the following code.
from funcy import select, even
select(even, {i for i in range(20)})
{0, 2, 4, 6, 8, 10, 12, 14, 16, 18}
30、IceCream
IceCream can make the debugging process easier. This package provides more verbose output during printing/logging.
pip install icecream
You can use the following code.
from icecream import ic
def some_function(i):
    i = 4 + (1 * 2)/ 10
    return i + 35
ic(some_function(121))
39.2
It can also be called without arguments to trace which parts of a function were executed.
def foo():
    ic()
    if some_function(12):
        ic()
    else:
        ic()
foo()
The level of detail printed is ideal for analysis.
Summary
In this article, we summarized 30 unique Python packages that are useful in data work. Most of them are straightforward and easy to use, but some have more features and require further reading of their documentation. If you are interested, search for a package on the PyPI website to view its homepage and documentation. I hope this article helps you.