【Copulas】Copula python(4)

Synthetic Data for Machine Learning

Loading the dataset

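# Note: load_boston was removed in scikit-learn 1.2, so this import needs scikit-learn < 1.2.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Features and target as NumPy arrays, then the default 75/25 train/test split.
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)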

Generating synthetic data. This simulates a scenario where a company may be unwilling to share the real dataset but is willing to release a synthetic copy that preserves many of the real dataset's properties for researchers to use.

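import numpy as np
from copulas.multivariate import GaussianMultivariate

def create_synthetic(X, y):
    # Stack features and target so the copula models their joint distribution.
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)
    model = GaussianMultivariate()
    model.fit(dataset)

    # Sample as many synthetic rows as there are real rows.
    synthetic = model.sample(len(dataset))
    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]

    return X, y

X_syn, y_syn = create_synthetic(X_train, y_train)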
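
To give a feel for what a Gaussian copula model does with the data, here is a minimal two-column sketch using only the standard library. It is not the Copulas library's implementation: the helper names (normal_scores, pearson, sample_gaussian_copula) and the toy data are made up for illustration, and the marginals are handled through empirical ranks and quantiles, whereas GaussianMultivariate fits univariate distributions to each column.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def normal_scores(column):
    """Replace each value by the normal quantile of rank / (n + 1)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def sample_gaussian_copula(x, y, n_samples, seed=0):
    """Hypothetical helper: draw synthetic (x, y) pairs from a fitted sketch."""
    rng = random.Random(seed)
    # 1. Dependence: correlation of the normal scores of both columns.
    rho = pearson(normal_scores(x), normal_scores(y))
    residual_scale = max(0.0, 1.0 - rho * rho) ** 0.5
    sorted_x, sorted_y = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        # 2. Draw a pair of standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + residual_scale * rng.gauss(0.0, 1.0)
        # 3. Map to uniforms, then back through the empirical quantiles.
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        ix = min(int(u1 * len(sorted_x)), len(sorted_x) - 1)
        iy = min(int(u2 * len(sorted_y)), len(sorted_y) - 1)
        samples.append((sorted_x[ix], sorted_y[iy]))
    return samples


# Toy "real" data (an assumption for this sketch): y depends strongly on x.
data_rng = random.Random(42)
real_x = [data_rng.uniform(0.0, 10.0) for _ in range(300)]
real_y = [2.0 * v + data_rng.gauss(0.0, 1.0) for v in real_x]

synthetic = sample_gaussian_copula(real_x, real_y, n_samples=300)
syn_x = [pair[0] for pair in synthetic]
syn_y = [pair[1] for pair in synthetic]
print('real corr:', pearson(real_x, real_y))
print('syn corr:', pearson(syn_x, syn_y))
```

The synthetic pairs are new combinations of values, yet their correlation should stay close to that of the toy data, which is the property the scenario above relies on.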

# Training
from sklearn.linear_model import ElasticNet
model = ElasticNet()  # elastic net regression

# Train on the synthetic data, evaluate on the real test set.
model.fit(X_syn, y_syn)
score_syn = model.score(X_test, y_test)

# Train on the real data, evaluate on the same test set.
model.fit(X_train, y_train)
score_real = model.score(X_test, y_test)

print('syn score is {}'.format(score_syn))
print('real score is {}'.format(score_real))
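
For a scikit-learn regressor, score returns the coefficient of determination R², so the two printed numbers are compared on that scale: 1.0 is a perfect fit, 0.0 matches always predicting the mean, and negative values are worse than that. The sketch below recomputes R² by hand on made-up numbers; the function name r_squared and the sample values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]              # made-up targets, mean 6.0
perfect = r_squared(y_true, y_true)         # 1.0
mean_only = r_squared(y_true, [6.0] * 4)    # 0.0
reversed_fit = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])  # -3.0
print(perfect, mean_only, reversed_fit)
```

If the synthetic data preserves the relationships that matter for the regression, score_syn should land close to score_real.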

Vine Copula

A vine is a graphical tool for labeling constraints in high-dimensional probability distributions. An R-vine (regular vine) is a special case in which all constraints are two-dimensional or conditional two-dimensional. This matters because parametric multivariate copula families with flexible dependence are few, while parametric families of bivariate copulas are plentiful, so a vine lets a high-dimensional dependence structure be assembled from bivariate pieces. R-vines have also proven useful in other problems, such as (constrained) sampling of correlation matrices and building non-parametric continuous Bayesian networks.
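
A three-variable case may make the "two-dimensional or conditional two-dimensional" idea concrete. The notation is introduced here for illustration: f_i and F_i are the marginal densities and distribution functions, and each c is a bivariate copula density. One possible vine decomposition of the joint density is:

```latex
f(x_1, x_2, x_3)
  = f_1(x_1)\, f_2(x_2)\, f_3(x_3)
    \cdot c_{12}\bigl(F_1(x_1), F_2(x_2)\bigr)
    \cdot c_{23}\bigl(F_2(x_2), F_3(x_3)\bigr)
    \cdot c_{13|2}\bigl(F_{1|2}(x_1 \mid x_2),\, F_{3|2}(x_3 \mid x_2)\bigr)
```

The first two copulas are unconditional pairs (the first tree of the vine), and the last one couples x_1 and x_3 conditionally on x_2 (the second tree). Each factor can be chosen from a different bivariate family.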

Reference

Copula docs

Reposted from blog.csdn.net/qq_18822147/article/details/118529582