30 Most Used Python Packages for Data Science Jobs

Python is arguably the easiest programming language to get started with. With foundational packages such as NumPy and SciPy, it is also arguably the best language for data processing and machine learning.

Thanks to industry leaders and enthusiastic contributors, Python has a huge community supporting its development, and a wide range of Python packages has been built to make data work easier.

1、Knockknock

Knockknock is a simple Python package that notifies you when a machine learning model finishes training or crashes. We can be notified through various channels such as email, Slack, Microsoft Teams, etc.

In order to install the package, we use the following code.

pip install knockknock  

For example, we can use the following code to send the training status of a machine learning model to the specified email addresses.

from knockknock import email_sender
from sklearn.linear_model import LinearRegression
import numpy as np

@email_sender(recipient_emails=["<[email protected]>", "<[email protected]>"], sender_email="<[email protected]>")
def train_linear_model(your_nicest_parameters):
    x = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(x, np.array([1, 2])) + 3
    regression = LinearRegression().fit(x, y)
    return regression.score(x, y)

This way you can be notified when there is a problem with the function or when it completes.
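
Knockknock supports other channels too. For example, a Slack notification can be set up with the slack_sender decorator; the webhook URL and channel below are placeholders you would replace with your own:

from knockknock import slack_sender

# webhook_url and channel are placeholders for your own Slack incoming webhook
@slack_sender(webhook_url="<your_webhook_url>", channel="<your_channel>")
def train_model():
    # ... training code here ...
    return {"score": 0.95}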

2、tqdm

What if you need to show a progress bar when iterating or looping? Then tqdm is what you need. This package will provide a simple progress meter in your notebook or command prompt.

Let's start by installing the package.

pip install tqdm  

The following code can then be used to display a progress bar during the loop.

from tqdm import tqdm

q = 0
for i in tqdm(range(10000000)):
    q = i + 1

tqdm renders a neat progress bar in the notebook. It is very useful when you have a long-running iteration and want to track its progress.
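
tqdm also accepts options such as a description label. A minimal sketch (the desc text here is arbitrary):

from tqdm import tqdm
import time

# desc puts a label in front of the progress bar
for i in tqdm(range(100), desc="processing"):
    time.sleep(0.01)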

3、Pandas-log

Pandas-log provides feedback on basic pandas operations such as .query, .drop, and .merge. It is inspired by R's Tidyverse, and you can use it to understand every step of a data analysis pipeline.

Install the package:

pip install pandas-log  

After installing the package, take a look at the example below.

import pandas as pd
import numpy as np
import pandas_log

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

Then let's log a simple pandas operation with the following code.

with pandas_log.enable():
    res = (df.drop("born", axis=1)
             .groupby('name'))

Through pandas-log, we can get all the execution information.

4、Emoji

As the name suggests, Emoji is a Python package that supports emoji text parsing. Usually, we have a hard time dealing with emoji in Python, but the Emoji package can help us with the conversion.

Use the following code to install the Emoji package.

pip install emoji  

Take a look at the following code:

import emoji  
print(emoji.emojize('Python is :thumbs_up:'))  

Python is 👍

With this package, you can easily output emoji.
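
The conversion also works in the other direction: emoji.demojize turns emoji back into their text aliases.

import emoji

# convert an emoji back to its text alias
print(emoji.demojize('Python is 👍'))

Python is :thumbs_up: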

5、TheFuzz

TheFuzz uses Levenshtein distance to calculate the similarity between texts.

pip install thefuzz  

The following code describes how to use TheFuzz for similarity text matching.

from thefuzz import fuzz, process  
  
#Testing the score between two sentences  
fuzz.ratio("Test the word", "test the Word!")  

81

TheFuzz can also extract similarity scores from multiple words at the same time.

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract("new york jets", choices, limit=2)
[('New York Jets', 100),
 ('New York Giants', 79)]

TheFuzz is suitable for similarity detection on any text data, a task that is very important in NLP.
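
TheFuzz also provides other scorers. For example, fuzz.partial_ratio scores against the best-matching substring:

from thefuzz import fuzz

# partial_ratio matches against the best-fitting substring
fuzz.partial_ratio("New York Jets", "The New York Jets won the game")

100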

6、Numerizer

Numerizer converts numbers written out as text into the corresponding integer or floating-point values.

pip install numerizer  

Then let's try a few inputs to convert.

from numerizer import numerize  
numerize('forty two')  

'42'

It also works if another writing style is used.

numerize('forty-two')  

'42'
numerize('nine and three quarters')  

'9.75'

If the input is not a numeric expression, it will be preserved:

numerize('maybe around nine and three quarters')  
'maybe around 9.75'

7、PyAutoGUI

PyAutoGUI can automatically control the mouse and keyboard.

pip install pyautogui  

Then we can test with the following code.

import pyautogui

pyautogui.moveTo(10, 15)   # move the mouse to coordinates (10, 15)
pyautogui.click()          # left click
pyautogui.doubleClick()    # double click
pyautogui.press('enter')   # press the Enter key

The above code moves the mouse to a given position and clicks. This is useful when repetitive actions are required, such as downloading files or collecting data.
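
PyAutoGUI can also type text and capture the screen. A minimal sketch:

import pyautogui

# type a string with a small delay between keystrokes
pyautogui.write('Hello, world!', interval=0.1)

# capture the screen and save it as an image
screenshot = pyautogui.screenshot()
screenshot.save('screen.png')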

8、Weightedcalcs

Weightedcalcs is used for weighted statistical calculations, from simple statistics such as weighted means, medians, and standard deviations to weighted counts and distributions.

pip install weightedcalcs  

First, create a Calculator with the column to weight by; here we use the mpg column of seaborn's mpg dataset as the weight.

import seaborn as sns  
df = sns.load_dataset('mpg')  
import weightedcalcs as wc  
calc = wc.Calculator("mpg")  

Then we compute the weighted distribution of a variable by passing the dataset and the column of interest.

calc.distribution(df, "origin")  
origin  
europe  0.208616  
japan   0.257042  
usa     0.534342  
Name: mpg, dtype: float64
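
The same Calculator object exposes other weighted statistics. For example, a weighted mean (horsepower is just an illustrative column from the same dataset; rows with missing values are dropped first):

# weighted mean of horsepower, weighted by mpg
calc.mean(df.dropna(), "horsepower")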

9、scikit-posthocs

scikit-posthocs is a Python package for post hoc test analysis, often used for pairwise comparisons in statistical analysis. It provides a simple scikit-learn-like API.

pip install scikit-posthocs  

Then let's start with a simple dataset and run an ANOVA test.

import statsmodels.api as sa  
import statsmodels.formula.api as sfa  
import scikit_posthocs as sp  
df = sa.datasets.get_rdataset('iris').data  
df.columns = df.columns.str.replace('.', '')  
  
lm = sfa.ols('SepalWidth ~ C(Species)', data=df).fit()  
anova = sa.stats.anova_lm(lm)  
print(anova)  
              df     sum_sq   mean_sq        F        PR(>F)
C(Species)    2.0  11.344933  5.672467  49.1600  4.492017e-17
Residual    147.0  16.962000  0.115388      NaN           NaN

We now have the ANOVA result, but we are not yet sure which pairs of classes differ. The following code runs pairwise t-tests to find out.

sp.posthoc_ttest(df,   
                 val_col='SepalWidth',   
                 group_col='Species',   
                 p_adjust='holm')  

Using scikit-posthocs, we simplified the pairwise post hoc analysis and obtained the p-values.

10、Cerberus

Cerberus is a lightweight Python package for data validation.

pip install cerberus  

The basic usage of Cerberus is to validate a document's structure against a schema.

from cerberus import Validator

schema = {'name': {'type': 'string'},
          'gender': {'type': 'string'},
          'age': {'type': 'integer'}}
v = Validator(schema)

Once the schema is defined, a document instance can be validated.

document = {'name': 'john doe', 'gender': 'male', 'age': 15}
v.validate(document)
True

If there is a match, the Validator class will output True. This way we can make sure the data structure is correct.
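
When validation fails, validate returns False and the errors attribute explains why (the exact message may vary slightly by version):

bad_document = {'name': 'jane doe', 'gender': 'female', 'age': 'fifteen'}
v.validate(bad_document)
False
v.errors
{'age': ['must be of integer type']}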

11、ppscore

ppscore is used to calculate the predictive power of variables related to the target variable. This package computes a score that can detect a linear or non-linear relationship between two variables. Scores range from 0 (no predictive power) to 1 (perfect predictive power).

pip install ppscore  

Use the ppscore package to calculate the scores of every predictor against the target.

import seaborn as sns  
import ppscore as pps  
df = sns.load_dataset('mpg')  
pps.predictors(df, 'mpg')  

The results are sorted. Lower-ranked variables are less predictive of the target.
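
ppscore can also compute the full matrix of scores between every pair of columns:

# pairwise predictive power scores for all column pairs
matrix = pps.matrix(df)
matrix[['x', 'y', 'ppscore']].head()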

12、Maya

Maya is designed to parse DateTime data as easily as possible.

pip install maya

Then we can easily get the current date using the following code.

import maya  
now = maya.now()  
print(now)  

We can also get tomorrow's date.

tomorrow = maya.when('tomorrow')  
tomorrow.datetime()  

datetime.datetime(2022, 8, 8, 6, 44, 10, 141499, tzinfo=<UTC>)
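
Maya can also parse date strings directly:

# parse a date string into a timezone-aware datetime
maya.parse('2022-08-08').datetime()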

13、Pendulum

Pendulum is another Python package for working with DateTime data. It simplifies any DateTime analysis.

pip install pendulum  

With Pendulum, timezone conversion, formatting, and date arithmetic are all straightforward:

import pendulum

now = pendulum.now("Europe/Berlin")

# convert to another timezone
now.in_timezone("Asia/Tokyo")

# format as an ISO 8601 string
now.to_iso8601_string()

# date arithmetic
now.add(days=2)

14、category_encoders

category_encoders is a Python package for encoding categorical data (converting it to numeric form). The package is a collection of encoding methods that can be applied to categorical data as needed.

pip install category_encoders  

The transformation can be applied using the following example.

from category_encoders import BinaryEncoder
import pandas as pd
import seaborn as sns

# reuse the mpg dataset; its 'origin' column is categorical
df = sns.load_dataset('mpg')

enc = BinaryEncoder(cols=['origin']).fit(df)
numeric_dataset = enc.transform(df)
numeric_dataset.head()
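
The other encoders in the collection follow the same fit/transform pattern. A sketch with TargetEncoder, using mpg as the supervision target:

from category_encoders import TargetEncoder

# encode 'origin' by the (smoothed) mean of the target variable
target_enc = TargetEncoder(cols=['origin'])
encoded = target_enc.fit_transform(df[['origin']], df['mpg'])
encoded.head()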

15、scikit-multilearn

scikit-multilearn is dedicated to multi-label classification models. This package provides APIs for training machine learning models on datasets where each sample can have more than one target label.

pip install scikit-multilearn  

Using the sample dataset, we train a multi-label KNN classifier and measure its performance.

from skmultilearn.dataset import load_dataset  
from skmultilearn.adapt import MLkNN  
import sklearn.metrics as metrics  
  
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')  
X_test, y_test, _, _ = load_dataset('emotions', 'test')  
  
classifier = MLkNN(k=3)  
prediction = classifier.fit(X_train, y_train).predict(X_test)  
  
metrics.hamming_loss(y_test, prediction)  
emotions:train - exists, not redownloading    
emotions:test - exists, not redownloading     
  
0.2953795379537954

16、Multiset

Multiset is similar to the built-in set type, but it allows multiple occurrences of the same element.

pip install multiset  

You can use the Multiset class as follows.

from multiset import Multiset

set1 = Multiset('aab')
set1

Multiset({'a': 2, 'b': 1})
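
Multisets also support the usual set operations, with element counts taken into account (a sketch; the ordering inside the repr may differ):

set2 = Multiset('abc')

set1 + set2   # combine counts
Multiset({'a': 3, 'b': 2, 'c': 1})

set1 & set2   # intersection takes the minimum count per element
Multiset({'a': 1, 'b': 1})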

17、Jazzit

Jazzit can play music when our code throws an error or while we wait for it to finish running.

pip install jazzit  

Use the following code to try the sample track when an error occurs.

from jazzit import error_track

@error_track("curb_your_enthusiasm.mp3", wait=5)
def run():
    for num in reversed(range(10)):
        # 10/0 on the last iteration raises ZeroDivisionError, triggering the track
        print(10/num)

This package is not exactly essential, but its functionality is very amusing.

18、handcalcs

handcalcs is used to simplify the process of mathematical formulas in notebooks. It converts any mathematical function into its equation form.

pip install handcalcs  

Use the following code to test the handcalcs package. Use the %%render magic command to render Latex.

import handcalcs.render
from math import sqrt

Then, in a separate notebook cell:

%%render
a = 4
b = 6
c = sqrt(3*a + b/7)

19、NeatText

NeatText simplifies text cleaning and preprocessing. It is useful for any NLP project and text machine learning project data.

pip install neattext  

Use the following code to generate test data:

import neattext as nt   
mytext = "This is the word sample but ,our WEBSITE is https://exaempleeele.com ✨."  
docx = nt.TextFrame(text=mytext)  

TextFrame instantiates the NeatText class, after which its various methods can be used to inspect and clean the data.

docx.describe()  

Key       Value  
Length    : 72  
vowels    : 21  
consonants: 33  
stopwords : 5  
punctuations: 6  
special_char: 6  
tokens(whitespace): 11  
tokens(words): 13

Using the describe function, per-text statistics can be displayed. To further clean the data, the following code can be used.

docx.normalize()  
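
NeatText also offers a functional API. A sketch using two of its helpers, clean_text and remove_urls:

from neattext.functions import clean_text, remove_urls

text = "This is the word sample but ,our WEBSITE is https://exaempleeele.com ✨."

# strip URLs only
remove_urls(text)

# apply the default cleaning pipeline (lowercasing, removing punctuation, etc.)
clean_text(text)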

20、Combo

Combo is a Python package for combining machine learning models and scores. It provides a toolbox that allows multiple machine learning models to be combined into a single model; in other words, it supports model ensembling.

pip install combo  

Let's create an ensemble using the breast cancer dataset and several classification models, all from scikit-learn.

from sklearn.tree import DecisionTreeClassifier  
from sklearn.linear_model import LogisticRegression  
from sklearn.ensemble import GradientBoostingClassifier  
from sklearn.ensemble import RandomForestClassifier  
from sklearn.neighbors import KNeighborsClassifier  
  
from sklearn.model_selection import train_test_split  
from sklearn.datasets import load_breast_cancer  
  
from combo.models.classifier_stacking import Stacking  
from combo.utils.data import evaluate_print  

Next, let's see how the individual classifiers perform on their own.

# Define data file and read X and y
random_state = 42
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=random_state)

# initialize a group of clfs
classifiers = [DecisionTreeClassifier(random_state=random_state),
               LogisticRegression(random_state=random_state),
               KNeighborsClassifier(),
               RandomForestClassifier(random_state=random_state),
               GradientBoostingClassifier(random_state=random_state)]
clf_names = ['DT', 'LR', 'KNN', 'RF', 'GBDT']

for i, clf in enumerate(classifiers):
    clf.fit(X_train, y_train)
    y_test_predict = clf.predict(X_test)
    evaluate_print(clf_names[i] + '   |   ', y_test, y_test_predict)
    print()

DT   | Accuracy: 0.9386, ROC:0.9383, F1:0.9521
LR   | Accuracy: 0.9693, ROC:0.962,  F1:0.9766
KNN  | Accuracy: 0.9561, ROC:0.9519, F1:0.9662
RF   | Accuracy: 0.9781, ROC:0.9716, F1:0.9833
GBDT | Accuracy: 0.9605, ROC:0.9524, F1:0.9699

Now let's build a stacking model with the Combo package.

clf = Stacking(classifiers, n_folds=4, shuffle_data=False,  
                   keep_original=True, use_proba=False,  
                   random_state=random_state)  
                     
clf.fit(X_train, y_train)  
y_test_predict = clf.predict(X_test)  
  
evaluate_print('Stacking | ', y_test, y_test_predict)  

21、PyAztro

Need horoscope data, or just curious about today's luck? You can use PyAztro to get this information! The package provides lucky numbers, lucky signs, moods, and more: the raw material for AI fortune-telling, if you will.

pip install pyaztro  

Use the code below to access today's horoscope information.

import pyaztro  
pyaztro.Aztro(sign='gemini').description  
"A very sexy visitor will cross your path soon.   
If not today, then within a few days.   
Once they arrive, lots of things will change.   
Your organized schedule, for one.   
Not that you'll mind, of course!"
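
The Aztro object exposes more attributes than just the description. Note that the package calls an external web API, so availability can vary:

horoscope = pyaztro.Aztro(sign='gemini')

horoscope.mood            # today's mood
horoscope.lucky_number    # lucky number
horoscope.lucky_time      # lucky time
horoscope.compatibility   # most compatible sign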

22、Faker

Faker can be used to simplify generating synthetic data. Many developers use this package to create test data.

pip install Faker  

To generate synthetic data with the Faker package, first create a Faker instance.

from faker import Faker  
fake = Faker()  

Generate a name:

fake.name()  
'Danielle Cobb'

Faker generates new random data every time the .name() method is called on the Faker instance.
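
Faker ships with many more providers; a few examples:

fake.address()        # a random postal address
fake.email()          # a random email address
fake.date_of_birth()  # a random date of birth
fake.company()        # a random company name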

23、Fairlearn

Fairlearn is used to evaluate and mitigate unfairness in machine learning models. This package provides the APIs needed to inspect bias.

pip install fairlearn  

You can then use Fairlearn's built-in dataset to see how much bias there is in the data.

from fairlearn.metrics import MetricFrame, selection_rate  
from fairlearn.datasets import fetch_adult  
  
data = fetch_adult(as_frame=True)  
X = data.data  
y_true = (data.target == '>50K') * 1  
sex = X['sex']  
  
selection_rates = MetricFrame(metrics=selection_rate,  
                              y_true=y_true,  
                              y_pred=y_true,  
                              sensitive_features=sex)  
                                
fig = selection_rates.by_group.plot.bar(  
    legend=False, rot=0,  
    title='Fraction earning over $50,000')  

Fairlearn's selection_rate function lets us compare selection rates across groups in model predictions, so we can spot bias in the results.

24、tiobeindexpy

tiobeindexpy is used to get TIOBE index data. The TIOBE Index ranks programming languages by popularity; it is worth following so that you don't miss the next big thing in the programming world.

pip install tiobeindexpy  

The top 20 programming language rankings of the month can be obtained through the following code.

from tiobeindexpy import tiobeindexpy as tb  
df = tb.top_20()  

25、pytrends

pytrends uses the Google API to get keyword trend data. This package is useful if you want to understand current web trends or trends related to specific keywords. Note that it requires access to Google, which may not be available everywhere.

pip install pytrends  

Let's say I want to know the current trends related to the keyword "Present Gift":

from pytrends.request import TrendReq  
import pandas as pd  
pytrend = TrendReq()  
  
keywords = pytrend.suggestions(keyword='Present Gift')  
df = pd.DataFrame(keywords)  
df  

This package will return the top 5 trends related to the keyword.
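
pytrends can also retrieve search interest over time. A sketch (the timeframe string follows pytrends' accepted formats):

# build a payload and fetch interest over the last 12 months
pytrend.build_payload(kw_list=['Present Gift'], timeframe='today 12-m')
interest = pytrend.interest_over_time()
interest.head()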

26、visions

visions is a Python package for semantic data analysis. It can detect the stored data types and infer what type a column's data should really be.

pip install visions  

The following code can be used to detect the column data type in the data. The Titanic dataset of seaborn is used here.

import seaborn as sns
from visions.functional import detect_type, infer_type
from visions.typesets import CompleteSet

df = sns.load_dataset('titanic')
typeset = CompleteSet()

# detect the types of each column as currently stored
print(detect_type(df, typeset))
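
detect_type reports the types as stored, while infer_type goes a step further and suggests what each column could be converted to:

# infer the semantic type each column should really have
print(infer_type(df, typeset))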

27、Schedule

Schedule can set up job scheduling for any code.

pip install schedule  

For example, suppose we want a job to run every 10 seconds:

import schedule  
import time  
  
def job():  
    print("I'm working...")  
  
schedule.every(10).seconds.do(job)  
  
while True:  
    schedule.run_pending()  
    time.sleep(1)  
I'm working...    
I'm working...     
I'm working...     
I'm working...     
I'm working...     
I'm working...     
I'm working...     
I'm working...     
I'm working...     
I'm working...  
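
Schedule supports many other intervals as well, for example:

import schedule

schedule.every().day.at("10:30").do(job)   # every day at 10:30
schedule.every().monday.do(job)            # every Monday
schedule.every(5).to(10).minutes.do(job)   # every 5 to 10 minutes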

28、autocorrect

autocorrect is a Python package for text spelling correction that supports many languages. It is simple to use and very handy in the data cleaning process.

pip install autocorrect  

Autocorrection can be done using code similar to the following.

from autocorrect import Speller

spell = Speller()
spell("I'm not sleaspy and tehre is no place, I'm giong to.")

"I'm not sleepy and there is no place, I'm going to."
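
Speller also supports other languages via the lang parameter; for example, a Spanish spell checker (assuming the Spanish dictionary is available to the package):

from autocorrect import Speller

# create a spell checker for Spanish
spell_es = Speller(lang='es')
spell_es('ola ke ase')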

29、funcy

funcy contains nifty utility functions for everyday data analysis. There are too many functions in the package to show them all; check its documentation if you are interested.

pip install funcy  

Here is just one example: selecting the even numbers from an iterable, as shown in the following code.

from funcy import select, even

select(even, {i for i in range(20)})

{0, 2, 4, 6, 8, 10, 12, 14, 16, 18}
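
Another handy funcy helper is chunks, which splits a sequence into fixed-size pieces:

from funcy import chunks

# split a range into chunks of three
list(chunks(3, range(10)))

[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]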

30、IceCream

IceCream can make the debugging process easier. This package provides more verbose output during printing/logging.

pip install icecream  

You can use the following code:

from icecream import ic

def some_function(i):
    i = 4 + (1 * 2) / 10
    return i + 35

ic(some_function(121))

ic| some_function(121): 39.2

It can also be used as a function checker, printing where execution reaches:

def foo():  
    ic()  
      
    if some_function(12):  
        ic()  
    else:  
        ic()  
  
foo()  

The level of detail in the output is ideal for debugging and analysis.

Summary

In this article, we covered 30 unique Python packages that are useful in data work. Most of them are simple and straightforward to use, but some have many more features and require a closer read of their documentation. If you are interested, search for a package on the PyPI website to find its homepage and documentation. I hope this article helps you.
