How a tech nerd used machine learning to become a women's clothing guru

Did you buy any new women's clothing during the May Day holiday?

When it comes to buying clothes, you have probably had this experience: you are walking down the street and suddenly see someone wearing a gorgeous outfit. You can't help wondering where they bought such beautiful clothes and wish you could buy the same piece, but you don't know the person, so all you can do is gaze at the outfit and sigh. But what if there were a way to find online sellers based on a photo of the clothes?

A programmer in Germany, Aleksandr Movchan, gave this problem a name, "Street-to-Shop" shopping, and decided to solve it with distance metric learning (DML), a machine learning technique. (No more fear of not being able to buy your favorite women's clothing!)

Metric learning

Before getting into this "shopping" tutorial, let's briefly talk about metric learning. Metric learning is also known as similarity learning. Suppose we need to measure the similarity between two pictures. The goal of metric learning is to learn a similarity measure such that pictures from different categories have low similarity and pictures from the same category have high similarity.

In mathematics, a metric (or distance function) is a function that defines the distance between elements of a set. A set equipped with a metric is called a metric space. For example, if our goal is to recognize faces, we need to build a distance function that emphasizes appropriate features (such as hair color, face shape, etc.); and if our goal is to recognize poses, we need a distance function that captures pose similarity. To handle a wide variety of feature similarities, we can manually construct a distance function by selecting appropriate features for a specific task. However, this requires a lot of manual effort and may not be robust to variations in the data. As an ideal alternative, metric learning can learn a metric distance function on its own, tailored to each specific task.

Metric learning methods can be divided into metric learning via linear transformations and metric learning via nonlinear models. Some classic unsupervised linear dimensionality reduction algorithms, such as principal component analysis and multidimensional scaling, can also be regarded as unsupervised Mahalanobis metric learning.
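To make the linear case concrete (my own sketch, not part of the original tutorial): a Mahalanobis metric is just a Euclidean distance after a linear map L, i.e. d(x, y) = ||Lx - Ly||, and learning the metric means learning L. A minimal numpy illustration, with an arbitrary fixed L standing in for a learned one:

import numpy as np

# A Mahalanobis metric parameterized by a linear map L (M = L^T L).
# In real metric learning, L would be learned from labeled data;
# here it is just a fixed illustrative matrix.
L = np.array([[2.0, 0.0],
              [0.5, 1.0]])

def mahalanobis(x, y, L):
    # Distance in the space transformed by L: ||Lx - Ly||
    diff = L @ (x - y)
    return np.sqrt(diff @ diff)

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(mahalanobis(x, y, L))      # distance under the learned metric
print(np.linalg.norm(x - y))     # plain Euclidean distance, for comparison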

Metric learning has been applied to image retrieval and classification in computer vision, face recognition, human activity recognition and pose estimation, text analysis, and other fields such as music analysis, automated program debugging, and microarray data analysis. Now let's take a look at the tutorial itself.

Build the dataset

First, as with any machine learning problem, we need a dataset. The idea actually came to me one day when I was browsing the huge number of clothing photos on AliExpress: why not use this data to build a photo-based search function? To keep things simple, I decided to focus on women's clothing, since girls (and some boys) love buying clothes.

Here are the women's clothing categories I scraped images for:

  • Skirts

  • Women's shirts

  • Hoodies & sweatshirts

  • Sweaters

  • Jackets & coats

I used requests and BeautifulSoup to scrape the images. The sellers' clothing photos can be obtained from the main page of each clothing category, but the photos uploaded by buyers have to be fetched from the review section. There is a "color" attribute on the clothing page, which can be used to determine whether an item is another color or even a completely different garment, so we treat clothes of different colors as different products.
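As a rough illustration of what such a scraper looks like (a sketch only: the URL and CSS selectors here are invented, and the real code is linked below):

import requests
from bs4 import BeautifulSoup

# Hypothetical category page; the real AliExpress URLs and selectors differ.
listing = requests.get('https://example.com/women-clothing/dresses?page=1')
soup = BeautifulSoup(listing.text, 'html.parser')

# Collect product-page links from the listing (selector is illustrative).
item_urls = [a['href'] for a in soup.select('a.product-link')]

for url in item_urls:
    page = BeautifulSoup(requests.get(url).text, 'html.parser')
    # One seller photo per "color" variant; buyer photos live in the reviews.
    seller_photos = [img['src'] for img in page.select('div.colors img')]
    buyer_photos = [img['src'] for img in page.select('div.feedback img')]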

You can click here to see the code I used to get all the information about a garment.

We just need to iterate over the clothing pages by category, get the URLs of all the items, and use the code above to get the information for each item.

In the end, we get two sets of images for each garment: images from the seller (the URL field of each element in item['colors']) and images from buyers (the URL field of each element in item['feedbacks']).

For each color, we only get one photo from the seller, but there may be more than one photo from buyers, and sometimes there are none at all (this is just like buyer shows on Tmall: some buyers post elaborate photo shows, others post nothing).

Very good! We got the data we wanted. However, the resulting dataset contains a lot of noise, especially in the images from buyers: photos of courier packages, photos showing only the fabric texture or a small part of a garment, and photos taken right after the package was opened.

To mitigate this problem, we labeled 5,000 images into two categories: good photos and noisy photos. Initially, my plan was to train a two-class classifier and use it to clean the dataset. But then I decided to leave that work for later and just add the manually cleaned data to the test and validation sets.

The second problem is that sometimes several sellers sell the same garment, and sometimes several shops show the same photos of a garment (or photos with only slight edits). How do we solve this problem?

The easiest way is to do nothing and use a robust distance metric learning algorithm. But this would hurt validation, because we would have the same garment in both the validation and the training data, i.e. data leakage. Another way is to find similar (or even identical) garments and merge them into one item. We can use perceptual hashing to find identical clothing photos, or we can train a model on the noisy data to find them. I chose the second method, because it can merge even slightly edited photos into one item.
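For reference, the perceptual-hashing route (which I did not take) can be sketched with the imagehash library; near-identical photos produce hashes that differ in only a few bits. The file names and the threshold here are illustrative:

import imagehash
from PIL import Image

# Perceptual hashes of near-identical photos differ in only a few bits.
h1 = imagehash.phash(Image.open('seller_photo_a.jpg'))
h2 = imagehash.phash(Image.open('seller_photo_b.jpg'))

# The subtraction gives the Hamming distance between the two hashes;
# a small threshold (chosen here arbitrarily) flags likely duplicates.
if h1 - h2 <= 6:
    print('Probably the same garment photo; merge the items')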

Distance metric learning

One of the most commonly used distance metric learning methods is the triplet loss:

L = max(d(F(a), F(p)) - d(F(a), F(n)) + M, 0)

where max(x, 0) is the hinge function, d(x, y) is the distance function between x and y, F(x) is the deep neural network, M is the margin, a is the anchor, p is the positive sample, and n is the negative sample. F(a), F(p), F(n) are points (vectors) in a high-dimensional space produced by the deep neural network. It is worth mentioning that, to make the model more robust to changes in lighting and contrast, the vectors are usually normalized to unit length, e.g. ||x|| = 1. The anchor and the positive sample belong to the same class; the negative sample is an example from another class.

The main idea of the triplet loss is thus to separate the distance of the positive pair (anchor and positive) from the distance of the negative pair (anchor and negative) by a margin M.

But how do we choose the triplets (a, p, n)? We could sample them randomly, but this leads to the following problem: there are N³ possible triplets, so traversing all of them would take a huge amount of time. In fact, we don't need to, because after a few training iterations most triplets already satisfy the triplet constraint (i.e. have zero loss), which means they are useless for training.

One of the most common ways to select triplets is hard negative mining, which for each anchor picks the negative closest to it:

n* = argmin_n d(F(a), F(n))

Choosing the hardest negatives can actually lead to bad local minima early in training. Specifically, it can cause a collapsed model (e.g. F(x) = 0). To alleviate this problem, we can use semi-hard negative mining.

Semi-hard negatives are further away from the anchor than the positive sample, but they are still "hard" (they violate the triplet constraint) because they lie inside the margin M; that is, d(F(a), F(p)) < d(F(a), F(n)) < d(F(a), F(p)) + M.

There are two ways to generate semi-hard (and hard) samples: online and offline.

  • The online method means we randomly select a mini-batch from the training set and choose triplets from the samples inside it. The online method requires large batches, which was not possible in my case because I only had a GTX 1070 with 8 GB of memory.

  • In the offline method, we pause training every so often, predict vectors for a certain number of samples, select triplets from them, and train the model on these triplets. This means doing two forward passes, which is a small price to pay for the offline method.
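A minimal sketch of the offline selection step (my own illustration, not the author's code): given embeddings predicted for a pool of samples, for each anchor-positive pair we pick a random negative that falls inside the margin:

import numpy as np

def select_semi_hard(anchors, positives, negatives, margin):
    # Offline semi-hard mining on precomputed embedding vectors.
    # anchors, positives: arrays of shape (n, dim), row i is a matching pair.
    # negatives: array of shape (m, dim), embeddings of other items.
    # Returns, for each pair, the index of one semi-hard negative (or -1).
    chosen = []
    for a, p in zip(anchors, positives):
        d_ap = np.sum((a - p) ** 2)
        d_an = np.sum((negatives - a) ** 2, axis=1)
        # Semi-hard: farther than the positive, but inside the margin.
        candidates = np.where((d_an > d_ap) & (d_an < d_ap + margin))[0]
        chosen.append(np.random.choice(candidates) if len(candidates) else -1)
    return np.array(chosen)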

Very good! We can now train the model with the triplet loss and offline semi-hard mining. But! (There's always a "but" when things are going well, and this time is no exception.) We still need a way to properly solve the "street-to-shop" problem: given a buyer's street photo, the seller photo of the same garment should rank as high as possible.

However, the quality of seller photos is usually much better than that of photos uploaded by buyers (understandably so: photos posted by online shops generally go through rounds of Photoshop editing), so we have two domains: seller photos and buyer photos. To obtain an effective model, we need to close the gap between these two domains. This problem is called domain adaptation.

I propose a very simple way to close the gap between the two domains: select anchors from the seller photos, and positives and negatives from the buyer photos. That's it! Simple but effective.
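To make this concrete (again my own sketch): the triplets are interleaved as [a, p, n, a, p, n, ...] in the mini-batch, with anchors drawn from the seller domain and positives/negatives from the buyer domain. This is also the layout the loss function below assumes:

import numpy as np

def build_triplet_batch(seller_imgs, buyer_pos_imgs, buyer_neg_imgs):
    # Interleave triplets as [a, p, n, a, p, n, ...].
    # Anchors come from the seller domain; positives and negatives
    # come from the buyer domain, which pulls the two domains together.
    batch = np.empty((3 * len(seller_imgs),) + seller_imgs.shape[1:],
                     dtype=seller_imgs.dtype)
    batch[0::3] = seller_imgs      # anchors   (seller photos)
    batch[1::3] = buyer_pos_imgs   # positives (buyer photos, same item)
    batch[2::3] = buyer_neg_imgs   # negatives (buyer photos, other items)
    return batch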

Implementation

To implement my idea and experiment quickly, I used the Keras library and the TensorFlow backend.

I chose Inception V3 as the base convolutional neural network for my model. As usual, I initialized it with ImageNet weights. Then I added two fully connected layers at the end of the network, followed by L2 normalization. The vector size is 128.

from keras.applications.inception_v3 import InceptionV3
from keras.layers import Dense, Lambda
from keras.models import Model
from keras import backend as K

def get_model():
    # Inception V3 trunk with ImageNet weights and global average pooling
    no_top_model = InceptionV3(include_top=False, weights='imagenet', pooling='avg')

    x = no_top_model.output
    x = Dense(512, activation='elu', name='fc1')(x)
    x = Dense(128, name='fc2')(x)
    # L2-normalize so all embeddings lie on the unit hypersphere
    x = Lambda(lambda x: K.l2_normalize(x, axis=1), name='l2_norm')(x)
    return Model(no_top_model.inputs, x)

We also need to implement the triplet loss function. The anchors, positives, and negatives are passed into the function together as a single mini-batch, which is split into 3 tensors inside the function. The distance function is the squared Euclidean distance.

import tensorflow as tf

def margin_triplet_loss(y_true, y_pred, margin, batch_size):
    # The mini-batch is interleaved as [a, p, n, a, p, n, ...],
    # so a stride of 3 recovers each of the three roles.
    out_a = tf.gather(y_pred, tf.range(0, batch_size, 3))
    out_p = tf.gather(y_pred, tf.range(1, batch_size, 3))
    out_n = tf.gather(y_pred, tf.range(2, batch_size, 3))

    # Hinge on squared Euclidean distances: max(M + d(a,p) - d(a,n), 0)
    loss = K.maximum(margin
                 + K.sum(K.square(out_a-out_p), axis=1)
                 - K.sum(K.square(out_a-out_n), axis=1),
                 0.0)
    return K.mean(loss)

Then we compile the model with an optimizer:

import keras

# utility function to freeze some portion of a function's arguments
from functools import partial, update_wrapper
def wrapped_partial(func, *args, **kwargs):
    partial_func = partial(func, *args, **kwargs)
    update_wrapper(partial_func, func)
    return partial_func

opt = keras.optimizers.Adam(lr=0.0001)
model = get_model()
# margin and batch_size are hyperparameters defined elsewhere
model.compile(loss=wrapped_partial(margin_triplet_loss, margin=margin, batch_size=batch_size),
              optimizer=opt)
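Training then alternates between mining triplets offline and fitting on the interleaved batches. Roughly, under the assumptions above (a schematic sketch; select_semi_hard and build_triplet_batch are the illustrative helpers from earlier, and num_epochs, seller_pool, buyer_pool are assumed to be defined):

import numpy as np

# Schematic outer loop: pause, mine triplets offline, train on them.
for epoch in range(num_epochs):
    # Forward pass 1: embed a pool of images to mine triplets from.
    seller_vecs = model.predict(seller_pool, batch_size=64)
    buyer_vecs = model.predict(buyer_pool, batch_size=64)

    # Offline semi-hard selection; pairs without a semi-hard negative
    # (index -1) are ignored here for brevity.
    neg_idx = select_semi_hard(seller_vecs, buyer_vecs, buyer_vecs, margin)
    batch = build_triplet_batch(seller_pool, buyer_pool, buyer_pool[neg_idx])

    # Forward pass 2 happens inside fit; y_true is a dummy, the loss ignores it.
    model.fit(batch, np.zeros(len(batch)), batch_size=batch_size, epochs=1)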

Experimental results

The performance measure of the model is called R@K. Let's see how to compute it. Each buyer photo in the validation set serves as a query, and we need to find the corresponding seller photo. For each query photo, we compute its embedding vector and search for that vector's nearest neighbors among all seller photos. We use not only the seller photos from the validation set but also those from the training set, since this increases the number of distractors and makes the task more challenging.

So for each query photo we get a list of the most similar seller photos. If the corresponding seller photo is among the K most similar photos, we return 1 for that query; otherwise we return 0. The average of these results over all queries in the validation set is R@K.
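A minimal sketch of this computation (my own illustration), using scikit-learn for the nearest-neighbor search:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def recall_at_k(query_vecs, query_item_ids, seller_vecs, seller_item_ids, k):
    # R@K: fraction of queries whose true seller photo is in the top K.
    nn = NearestNeighbors(n_neighbors=k).fit(seller_vecs)
    _, idx = nn.kneighbors(query_vecs)
    hits = [query_item_ids[i] in np.asarray(seller_item_ids)[idx[i]]
            for i in range(len(query_vecs))]
    return np.mean(hits)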

As I said above, I cleaned a small subset of the buyer photos of noise. So I measured the quality of the model on two validation sets: the full validation set and a subset containing only clean data.

The results of the model are not ideal yet; here is what could be done to improve them:

  • Clean the buyer data of noise. I have already taken the first step by cleaning a small subset.

  • Merge clothing photos more accurately (at least in the validation set).

  • Further narrow the gap between the domains. I think this could be done with domain-specific augmentation (such as varying the illumination of images) and other specialized methods (like the one in this paper).

  • Use another distance metric learning method. I tried the method in this paper, but it worked worse.

  • And of course, collect more data.

Demo, code and trained model

I made a demo of the model; you can click here to view it.

You can take a photo on the street of women's clothing you like (carefully!), or pick a random photo from the validation set, and upload it to the model to see how it performs.

Click here to view the code base of this project.
