Kaggle Competition - Text Classification Task

Submission Format

This is a text classification exercise, run as an online data science competition. Every document (a line in the data file) is a movie review from IMDB. Your goal is to classify each document into one of two categories based on its sentiment: 1 if the review is positive; 0 if the review is negative. The training data contains 10,000 reviews, each already labeled with one of these categories. The test data contains 5,000 unlabeled reviews. The submission should be a .csv file with a header line "Id,Category" followed by exactly 5,000 lines.

In each line, there should be exactly two integers, separated by a comma. The first integer is the line ID of a test example (0 - 5,000), and the second integer is the category your classifier predicts, one of {0, 1}. You can make 8 submissions per day. Once you submit your results, you will get an accuracy score computed on 50% of the test data. This score determines your position on the leaderboard. Once the competition ends, you will see the final accuracy computed on the other 50% of the test data. The evaluation metric is the accuracy of your classifier, so the higher the better.
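The file layout described above can be produced with Python's standard csv module. The following is a minimal sketch, not an official helper: the function name `format_submission` is hypothetical, and it assumes you already have the test IDs in file order and one 0/1 prediction per ID.

```python
import csv
import io


def format_submission(ids, predictions):
    """Return the submission file contents as a string.

    ids         -- line IDs of the test examples (assumed to follow test-file order)
    predictions -- predicted categories, each 0 or 1
    """
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Id", "Category"])  # header line required by the assignment
    for doc_id, label in zip(ids, predictions):
        if label not in (0, 1):
            raise ValueError(f"category must be 0 or 1, got {label!r}")
        writer.writerow([int(doc_id), int(label)])
    return buf.getvalue()


# Hypothetical usage: write the result to disk before uploading.
# with open("submission.csv", "w", newline="") as f:
#     f.write(format_submission(test_ids, predicted_labels))
```

Before uploading, it is worth checking that the file has the header plus exactly 5,000 data lines, since a file with the wrong number of rows does not match the required format.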

The grading of this assignment includes three parts: two rounds of competition and a 1-page memo. For the two rounds of competition, we will evaluate your classification results by accuracy.

1. Round 1 (5 + 0.5 pts): The first round ends on 11/23. The evaluation is simple:

(a) Everyone who beats a correctly implemented Naive Bayes classifier (0.82440 accuracy on the public test set) gets full points. Otherwise, points will be deducted based on your accuracy.

(b) If you beat a well-tuned SVM classifier (0.8768 on the public test set), you will receive a 0.5-point bonus.

(c) Late Policy: You will receive 4 pts if you submit after the Round 1 deadline but before the Round 2 deadline. No bonus points will be awarded if you submit late.
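As a point of reference for the baseline named in (a), a multinomial Naive Bayes classifier can be sketched in plain Python. This is only an illustration, not the reference implementation used for grading: the class name `NaiveBayesSketch`, the lowercased whitespace tokenizer, and the add-one smoothing are all assumptions, and the sketch is not claimed to reach the 0.82440 baseline.

```python
import math
from collections import Counter


class NaiveBayesSketch:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    @staticmethod
    def tokenize(doc):
        # Assumption: lowercase and split on whitespace; real systems usually do more.
        return doc.lower().split()

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)            # documents per class
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(self.tokenize(doc))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.total_words = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict_one(self, doc):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[c] / n_docs)
            for word in self.tokenize(doc):
                if word in self.vocab:               # words unseen in training are skipped
                    count = self.word_counts[c][word]
                    score += math.log((count + 1) / (self.total_words[c] + vocab_size))
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    def predict(self, docs):
        return [self.predict_one(doc) for doc in docs]
```

Working in log space avoids numeric underflow on long reviews, and the smoothing keeps a word that never appeared with one class from driving that class's probability to zero.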

2. Round 2 (10 pts): The second round ends two weeks later. Your score will be determined by both the accuracy of your best classifier (referred to as acc hereafter) and its ranking on the private leaderboard. Specifically, if and only if your acc beats a correctly implemented SVM classifier (0.8456 on the public test set), you will receive at least 8 points. If your acc beats the best performance of the previous class, which is 0.9, you will get at least 10 points. Otherwise, your grade will be computed from your position on the leaderboard using the following formula:

grade = 7 + 3 * 2 / log2(2 + rank).

Note that the winner can get as much as 10.8 points! However, if your submission does not beat the SVM baseline, your score will be less than 8 points regardless of the ranking.
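To see how the Round 2 formula behaves, it can be evaluated for a few leaderboard positions. The sketch below assumes rank 1 is the top entry (which matches the 10.8 figure for the winner) and that "beats" means strictly greater; the function names are hypothetical. How a score below the SVM baseline is kept under 8 points is not specified in the text, so that case is deliberately left out.

```python
import math

SVM_BASELINE = 0.8456   # Round 2 SVM accuracy quoted in the assignment
PREVIOUS_BEST = 0.9     # best accuracy of the previous class, as quoted


def leaderboard_grade(rank):
    """grade = 7 + 3 * 2 / log2(2 + rank), assuming rank 1 is the top entry."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return 7 + 3 * 2 / math.log2(2 + rank)


def grade_floor(acc):
    """Minimum grade guaranteed by accuracy alone, or None if no floor applies."""
    if acc > PREVIOUS_BEST:
        return 10.0
    if acc > SVM_BASELINE:
        return 8.0
    return None  # below the SVM baseline: the text only says "less than 8"


def show_grades(ranks=(1, 2, 6, 14, 30, 62)):
    """Print the formula value for a handful of example ranks."""
    for rank in ranks:
        print(f"rank {rank:>3}: {leaderboard_grade(rank):.2f}")
```

Under these assumptions the formula gives about 10.79 for rank 1, 10.0 for rank 2, 9.0 for rank 6, and 8.0 for rank 62, so past rank 62 the 8-point floor for beating the SVM baseline is what determines the grade.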

3. One-Page Memo (5 pts): Please submit a one-page memo (in .pdf) describing the preprocessing, features, models, and parameter tuning you explored and the corresponding results. Please include your name and the display name you used in the competition. In addition, please submit your source code with your submission. If you received help from anyone, list their name(s) in your memo.
