A language model predicts the probability of the next word in a sequence based on the words already observed. Neural networks are a preferred method for developing such language models because they can use a large context of observed words and a distributed representation (as opposed to a symbolic representation) of words. In this exercise, you will learn how to develop a neural language model in PyTorch. This assignment will cover:
(a) How to prepare texts for developing a word-based language model.
(b) How to design and fit/train a basic neural language model.
(c) How to use the learned language model to generate new texts.
The dataset for this exercise will be the wikitext-2 dataset.
(i) Download the dataset and the code. The dataset should have three files: train, test, and valid. The code includes basic preprocessing (see data.py) and a data loader (see main.py) that you can use for your work. Try to run the code.
(ii) You should understand the preprocessing and data loading functions.
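As a reference for what to look for, below is a minimal sketch of the kind of word-to-index mapping and per-split tokenization that data.py performs; the names Dictionary and tokenize here are illustrative and are not necessarily the ones used in the provided code.

```python
import torch

class Dictionary:
    """Maps each word to an integer index and back."""
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

def tokenize(path, dictionary):
    """Convert one split (train/valid/test) into a flat tensor of token ids."""
    ids = []
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            for word in line.split() + ['<eos>']:
                ids.append(dictionary.add_word(word))
    return torch.tensor(ids, dtype=torch.long)
```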
(iii) Write a class FNNModel(nn.Module) similar to the class RNNModel(nn.Module). The FNNModel class should implement a language model with a feed-forward network architecture. For your reference, RNNModel implements a recurrent network architecture, more specifically a Long Short-Term Memory (LSTM) network that you will learn about later in the course.
The FNN model should have the architecture shown in Figure 1. This is, historically, the first neural language model [1]. The model learns a distributed representation of each word (the embedding matrix C) and the probability function of a word sequence as a function of these distributed representations. It has a hidden layer with tanh activation, and the output layer is a softmax layer. For each input of (n − 1) previous words, the output of the model is a probability distribution over the |V| words in the vocabulary for the next word. A minimal sketch of such a model is given after Figure 1.
Figure 1: A Neural Probabilistic Language Model by [1]
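As a starting point (not a complete solution), here is a minimal sketch of what FNNModel could look like. The constructor arguments (ntoken, ninp, nhid, context_size) are illustrative names, not the ones required by the assignment; context_size corresponds to the (n − 1) previous words.

```python
import torch
import torch.nn as nn

class FNNModel(nn.Module):
    """Feed-forward (Bengio-style) n-gram language model.

    Predicts the next word from the previous (n - 1) words.
    """
    def __init__(self, ntoken, ninp, nhid, context_size):
        super().__init__()
        self.context_size = context_size           # n - 1 previous words
        self.encoder = nn.Embedding(ntoken, ninp)  # embedding matrix C
        self.fc1 = nn.Linear(context_size * ninp, nhid)
        self.decoder = nn.Linear(nhid, ntoken)     # scores over |V| words

    def forward(self, inputs):
        # inputs: (batch, context_size) indices of the previous words
        emb = self.encoder(inputs)             # (batch, context_size, ninp)
        emb = emb.view(inputs.size(0), -1)     # concatenate the context embeddings
        hidden = torch.tanh(self.fc1(emb))     # hidden layer with tanh activation
        logits = self.decoder(hidden)          # (batch, ntoken)
        return torch.log_softmax(logits, dim=-1)
```

Because the model returns log-probabilities, it can be trained with nn.NLLLoss (equivalently, you could return raw logits and use nn.CrossEntropyLoss).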
(iv) Train the model with any SGD variant (e.g., Adam, RMSProp), using n = 8, i.e., an 8-gram language model.
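One possible training loop is sketched below. It assumes the FNNModel sketch above, a corpus object with a dictionary and per-split token-id tensors as in data.py, and illustrative hyperparameters; get_batches is a hypothetical helper that slides a window of n tokens over the data.

```python
import torch
import torch.nn as nn

def get_batches(data, n, batch_size):
    """Yield (context, target) pairs: (n - 1) previous words -> next word."""
    contexts, targets = [], []
    for i in range(len(data) - n + 1):
        contexts.append(data[i:i + n - 1])
        targets.append(data[i + n - 1])
        if len(contexts) == batch_size:
            yield torch.stack(contexts), torch.stack(targets)
            contexts, targets = [], []

n = 8                                    # 8-gram model: 7 words of context
ntokens = len(corpus.dictionary)         # vocabulary size from the preprocessing
model = FNNModel(ntokens, ninp=200, nhid=200, context_size=n - 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()                 # the model outputs log-probabilities

for epoch in range(10):
    model.train()
    for context, target in get_batches(corpus.train, n, batch_size=64):
        optimizer.zero_grad()
        loss = criterion(model(context), target)
        loss.backward()
        optimizer.step()
```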
(v) Report the perplexity score on the test set. You should select your best model based on the perplexity score on the validation set.
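Perplexity is the exponential of the average per-token negative log-likelihood, so it can be computed directly from the evaluation loss. A sketch, reusing get_batches and criterion from the training sketch above; valid_data is assumed to be the token-id tensor of the validation split.

```python
import math
import torch

def evaluate(model, data, n, batch_size, criterion):
    """Return the average per-token negative log-likelihood on a split."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for context, target in get_batches(data, n, batch_size):
            total_loss += criterion(model(context), target).item() * target.numel()
            total_tokens += target.numel()
    return total_loss / total_tokens

val_loss = evaluate(model, valid_data, n=8, batch_size=64, criterion=criterion)
print(f'valid perplexity: {math.exp(val_loss):.2f}')
```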
(vi) Repeat steps (iv)-(v), but now share the input embeddings (the look-up matrix) with the output layer weights (the final layer weights), i.e., weight tying.
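One minimal way to tie the weights, assuming the FNNModel sketch above, is to make the output layer reuse the embedding matrix as its weight; note that this requires the hidden size to match the embedding size.

```python
class TiedFNNModel(FNNModel):
    """FNNModel variant whose output layer shares weights with the embedding.

    The decoder weight has shape (ntoken, nhid) and the embedding matrix has
    shape (ntoken, ninp), so tying requires nhid == ninp.
    """
    def __init__(self, ntoken, ninp, nhid, context_size):
        super().__init__(ntoken, ninp, nhid, context_size)
        assert nhid == ninp, 'weight tying needs matching dimensions'
        self.decoder.weight = self.encoder.weight  # share the same parameter
```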
(vii) Adapt generate.py so that you can generate texts using your language model (FNNModel).
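Unlike the RNN in generate.py, the FNN model has no hidden state to carry over, so generation simply feeds the last (n − 1) generated token ids back in as the context. A sketch, with hypothetical seed_ids and temperature arguments:

```python
import torch

def generate(model, dictionary, seed_ids, num_words=100, temperature=1.0):
    """Sample num_words tokens from the FNN model, one word at a time.

    seed_ids: list of at least (n - 1) token ids used as the initial context.
    """
    model.eval()
    ids = list(seed_ids)
    with torch.no_grad():
        for _ in range(num_words):
            context = torch.tensor(ids[-model.context_size:]).unsqueeze(0)
            log_probs = model(context).squeeze(0)                  # (ntoken,)
            probs = torch.softmax(log_probs / temperature, dim=-1) # temperature sampling
            next_id = torch.multinomial(probs, num_samples=1).item()
            ids.append(next_id)
    return ' '.join(dictionary.idx2word[i] for i in ids)
```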
(viii) In your opinion, which computation/operation is the most expensive one in the inference or forward pass? Can you think of ways to improve this? If yes, please describe them.