How to Enter Your First Kaggle Competition


Full text: 4,621 words. Estimated reading time: 14 minutes.

Source: ijiandao

Kaggle is probably the most famous machine learning contest website.

A Kaggle competition provides a data set, available from the website, together with a problem that you solve using machine learning, deep learning, or other data science techniques.

Once you have developed a solution, you upload your predictions back to the site. How accurate the predictions are determines your position on the competition leaderboard, and you may even win a cash prize.

 

Kaggle is an excellent platform for honing machine learning and data science skills, comparing yourself with others, and learning new techniques. This article is a guide to entering a Kaggle competition for the first time. It covers the following:

 

· Developing a model to predict whether a tweet is about a real disaster.

· Using the model to make predictions on the test data set provided by Kaggle.

· Making a first submission and getting a place on the Kaggle leaderboard.

 

Detecting disaster tweets

 

A recent competition on the site provides a data set of tweets, each with a label that tells you whether the tweet is really about a disaster. The competition has nearly 3,000 participants and a top cash prize of $10,000. The data and the competition overview are available on the competition page.

 

If you do not have a Kaggle account, you can create one for free.

 

On the competition page, selecting "Download all" gives you a compressed file containing three CSV files.

 

The first data set contains a set of features and the corresponding target labels for training. It has the following attributes:

 

· id: a numeric identifier for the tweet. It comes in handy when you upload your predictions to the leaderboard.

· keyword: a keyword from the tweet, which in some cases may be missing.

· location: the location the tweet was sent from. It may also be missing.

· text: the full text of the tweet.

· target: the label you are trying to predict. It is 1 if the tweet is indeed about a disaster and 0 otherwise.

 

Read these files in to learn more about them. Note that the code below includes a set_option call. The pandas set_option function lets you control how data frame results are displayed. The call here ensures that the full contents of the text column are shown, which makes the results and the analysis easier to read.

 

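import pandas as pd

# Show full column contents; None means no truncation
# (the older value -1 is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)

train_data = pd.read_csv('train.csv')
train_data.head()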

The second data set contains only the features. You predict the target label for each of its rows, and those predictions determine your place on the leaderboard.

 

test_data = pd.read_csv('test.csv')
test_data.head()

 

The third file shows the format the submission file should take. It contains the id column from the test.csv file and the target predicted by the model. Once you have created a file in this format, you submit it to the site to get a place on the leaderboard.

 

sample_submission = pd.read_csv('sample_submission.csv')
sample_submission.head()

 

Preparing the data for machine learning

 

Whatever the machine learning task, data cleaning and preprocessing are necessary before you can train a model. This is particularly important when working with text data.

 

To keep the first model simple, and because a lot of data is missing in these columns, I drop the location and keyword features and train only on the actual tweet text. I also drop the id column, as it is not useful for training the model.

 

train_data = train_data.drop(['keyword', 'location', 'id'], axis=1)
train_data.head()
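Before dropping columns for this reason, it can help to check how much data is actually missing. The following is a minimal sketch on a small hand-made data frame that stands in for train.csv; the rows are invented for illustration, and on the real data the same isnull().mean() call would be applied to the full training frame.

```python
import pandas as pd

# Toy stand-in for train.csv; the rows are invented for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": [None, "fire", None, None],
    "location": [None, None, "London", None],
    "text": ["Forest fire near the town", "Fire downtown", "Nice day", "Hello"],
    "target": [1, 1, 0, 0],
})

# Fraction of missing values in each column (0.0 = complete, 1.0 = empty)
missing_share = toy.isnull().mean()
print(missing_share)
```

Columns with a high share of missing values are candidates for dropping in a first, simple model.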

 

The data set now contains only the text and target columns.

 

Text, and tweets in particular, often contains many special characters that are not necessarily meaningful to a machine learning algorithm. So the first step I take is to remove these characters. I also convert all words to lowercase.

 

import re

def clean_text(df, text_field):
    # Lowercase the text, then strip @mentions, non-alphanumeric
    # characters, URLs, and a leading "rt"
    df[text_field] = df[text_field].str.lower()
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", elem))
    return df

data_clean = clean_text(train_data, "text")
data_clean.head()
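To see what this kind of cleaning does to a single tweet, here is a minimal sketch on one made-up string. The pattern is my reading of the expression used in clean_text, written with its escape sequences (\t, \w, \S) in place; the helper clean_string and the sample tweet are hypothetical.

```python
import re

# Mentions, non-alphanumeric characters, URLs, and a leading "rt"
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"

def clean_string(text):
    """Lowercase a single string and delete everything PATTERN matches."""
    return re.sub(PATTERN, "", text.lower())

sample = "Forest FIRE near @CityNews!!! #wildfire http://t.co/abc123"
cleaned = clean_string(sample)

# Removed pieces can leave extra spaces behind, so compare word by word
print(cleaned.split())
```

The mention, the punctuation, the hash sign, and the link are removed, while the words themselves are kept in lowercase.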

 

Another useful text cleaning step is removing stop words. Stop words are used very frequently but generally carry little meaning. In English they include words such as "the", "it", and "as". Leaving these words in the text generates a lot of noise, which makes it harder for the algorithm to learn.

 

The Natural Language Toolkit (NLTK) is a collection of Python libraries and tools for working with text data; see the NLTK documentation for the full details. Besides processing tools, NLTK also includes a large collection of text corpora and lexical resources, including stop word lists for a variety of languages. I use this library to remove the stop words from the data set.

 

NLTK can be installed with pip. Once it is installed, you need to import the corpus module and download the stop words file.

 

import nltk.corpus
nltk.download('stopwords')

 

Once this step is complete, you can read in the stop words and use them to remove the stop words from the tweets.

 

from nltk.corpus import stopwords

stop = stopwords.words('english')
# Keep only the words that are not in the stop word list
data_clean['text'] = data_clean['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data_clean.head()
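The filtering step itself can be illustrated without downloading the NLTK corpus. The sketch below assumes a tiny hand-written stop list, which is far shorter than NLTK's English list; the helper remove_stop_words is a hypothetical name.

```python
# Tiny hand-written stop list for illustration; NLTK's English list is much longer
toy_stop = {"the", "it", "as", "is", "a", "in"}

def remove_stop_words(text, stop_words):
    """Keep only the whitespace-separated words that are not in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

filtered = remove_stop_words("the fire is spreading in the forest", toy_stop)
print(filtered)  # fire spreading forest
```

Storing the stop words in a set rather than a list makes each membership test faster, which can matter when the filter runs over many tweets.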

 

Data preprocessing

 

Once the data is clean, it needs further preprocessing before it can be used by a machine learning algorithm.

 

All machine learning algorithms use mathematical computation to map patterns in the features (in this case, text or words) to the target variable. The text must therefore be converted into numerical form before a machine learning model can be trained on it.

 

There are a variety of methods for this kind of preprocessing, but I will illustrate two from the scikit-learn library.

 

The first step is to split the data into tokens, or individual words, count how often each word appears in each text, and represent the counts as a sparse matrix.

 

The CountVectorizer class does this.
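As a rough illustration of the counting step, the sketch below runs scikit-learn's CountVectorizer on two made-up, already-cleaned tweets. The example texts are assumptions; the vocabulary is read from the fitted vocabulary_ attribute, which maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up, already-cleaned tweets
docs = ["fire fire forest", "forest walk"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # SciPy sparse matrix of word counts

# Order the words by their column index in the matrix
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)             # ['fire', 'forest', 'walk']
print(counts.toarray())  # one row per tweet, one column per word
```

Each row describes one tweet: the first tweet contains "fire" twice and "forest" once, so its row is [2, 1, 0].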

 

The next step is to weight the word counts produced by CountVectorizer. The purpose of the weighting is to scale down the impact of words that occur very frequently in the text, so that during model training, less frequent and possibly more informative words receive more attention. TfidfTransformer performs this function.
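The effect of the weighting can be seen on a small example. In the sketch below, which uses three made-up texts, the word "fire" appears in every text while the other words appear in only one, so the default TfidfTransformer settings should give "fire" a lower weight than the rarer word in the same row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up texts: "fire" occurs in all of them, the other words only once
docs = ["fire forest", "fire town", "fire river"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

# Columns follow alphabetical vocabulary order: fire, forest, river, town
weights = tfidf.toarray()
print(weights.round(2))

fire_weight, forest_weight = weights[0][0], weights[0][1]
print(fire_weight < forest_weight)
```

Both words occur once in the first text, so the difference in weight comes only from how common each word is across the whole collection.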

 

Machine learning pipeline

 

I put all the preprocessing and model fitting into a scikit-learn pipeline to see how the model performs. For the first attempt, I use a linear support vector machine classifier (SGDClassifier), which is generally regarded as one of the best algorithms for text classification.

 

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Hold out part of the training data for evaluation
X_train, X_test, y_train, y_test = train_test_split(data_clean['text'], data_clean['target'], random_state=0)

pipeline_sgd = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', SGDClassifier()),
])
model = pipeline_sgd.fit(X_train, y_train)

 

Use the trained model to make predictions on the held-out test data to see how the model performs.

 

from sklearn.metrics import classification_report

y_predict = model.predict(X_test)
print(classification_report(y_test, y_predict))
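For readers new to classification_report, the sketch below shows what it summarizes, using hand-made labels rather than the model's real output. The report lists precision, recall, and F1 for each class; f1_score computes the F1 value for the positive class alone.

```python
from sklearn.metrics import classification_report, f1_score

# Hand-made true labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

print(classification_report(y_true, y_pred))

# Class 1: 2 true positives, 1 false positive, 1 false negative,
# so precision = recall = 2/3 and F1 = 2/3
score = f1_score(y_true, y_pred)
print(round(score, 3))  # 0.667
```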

 

For a first attempt, the model performs quite well.

 

Making a first submission

 

Now let's see how the model performs on the competition test data set and where it ranks on the leaderboard.

 

First, the text in the test file needs to be cleaned, and then the model is used to make predictions. The code below takes a copy of the test data and applies the same cleaning that was applied to the training data. The result is a series containing only the cleaned text.

 

submission_test_clean = test_data.copy()
submission_test_clean = clean_text(submission_test_clean, "text")
submission_test_clean['text'] = submission_test_clean['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
# Keep only the cleaned text column for prediction
submission_test_clean = submission_test_clean['text']
submission_test_clean.head()

 

Then use the model to make predictions.

 

submission_test_pred = model.predict(submission_test_clean)

 

To create the submission, you need to build a data frame containing only the id from the test set and the prediction.

 

id_col = test_data['id']
submission_df_1 = pd.DataFrame({
    "id": id_col,
    "target": submission_test_pred})
submission_df_1.head()

 

Finally, save it as a CSV file. It is important to include index=False; otherwise the index is saved as an extra column in the file and the submission will be rejected.

 

submission_df_1.to_csv('submission_1.csv', index=False)
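The effect of index=False is easy to check on a small example. The sketch below uses hypothetical ids and predictions and writes to an in-memory buffer instead of a file, then compares the header line with and without the index.

```python
import io
import pandas as pd

# Hypothetical ids and predictions standing in for the real ones
submission = pd.DataFrame({"id": [0, 2, 3], "target": [1, 0, 1]})

# Write to in-memory buffers instead of files, for illustration
with_index = io.StringIO()
submission.to_csv(with_index)                  # default: the index is written too
without_index = io.StringIO()
submission.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))     # ',id,target'
print(repr(header_without))  # 'id,target'
```

With the default setting the file gains an unnamed first column, so it no longer matches the two-column layout of the sample submission file.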

Once you have the CSV file, return to the competition page and select the "Submit Predictions" button. This opens a form where you can upload the CSV file. It is a good idea to add some notes about your approach so that you have a record of earlier submission attempts.

 

A results screen appears after you submit.

 

The submission is now complete!

 

This model gave me a score of 0.78 on the leaderboard and a rank of 2,371. Clearly there is some room for improvement, but I now have a benchmark against which to compare future submissions.

 

This article has outlined how to make a first submission to a Kaggle competition. There are a number of further steps you can take to improve the score, such as better text cleaning, different preprocessing methods, trying other machine learning algorithms, and hyperparameter tuning.
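As one possible starting point for hyperparameter tuning, the sketch below runs scikit-learn's GridSearchCV over a pipeline of the same shape as the one used earlier. The texts, labels, and parameter grid are all made up for illustration, and the grid values are assumptions rather than recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up cleaned tweets and labels, for illustration only
texts = ["forest fire spreading", "earthquake hits city", "flood destroys homes",
         "wildfire evacuation ordered", "lovely sunny day", "great coffee today",
         "watching a movie", "new phone arrived"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Hypothetical grid; "step__parameter" addresses a parameter inside the pipeline
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3],
}

# cv=2 only because the toy data set is tiny
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

On the real training data you would pass the cleaned text and target columns to fit and use more cross-validation folds.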

 


Thanks for reading!



Compile Group: Zhou fruit, Qi Xin

Related Links:

https://towardsdatascience.com/how-to-enter-your-first-kaggle-competition-4717e7b232db


Origin blog.csdn.net/duxinshuxiaobian/article/details/104958144