CE314 Natural Language Engineering
Task
Regular expression (40%)
(You can save your code as part1_regularexpression_studentID.py)
1: Write a regular expression that can find all amounts of money in a text. Your expression should be able to deal with different formats and currencies, for example £50,000 and £117.3m as well as 30p, 500m euro, 338bn euros, $15bn and $92.88. Make sure that you can at least detect amounts in Pounds, Dollars and Euros. (20pts)
For full marks: include the output of a Python program that applies your regular expression to the following BBC News web page:
https://www.bbc.co.uk/news/business-41779341
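One possible starting point is sketched below. It covers only the example formats listed in the task and runs on a sample string rather than on the BBC page; fetching the page is left out. The names NUMBER, SCALE and MONEY_RE are illustrative, and the set of scale suffixes and currency words is an assumption that a full solution would likely need to extend.

```python
import re

# Building blocks (hypothetical names, chosen for this sketch only).
NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"           # 50,000  117.3  92.88
SCALE = r"(?:bn|m|k|billion|million|trillion)"  # assumed set of magnitude suffixes

MONEY_RE = re.compile(
    rf"""
    [£$€]{NUMBER}(?:\s?{SCALE})?\b                               # £50,000  £117.3m  $15bn  $92.88
    | \b{NUMBER}(?:\s?{SCALE})?\s(?:euros?|pounds?|dollars?)\b   # 500m euro  338bn euros
    | \b{NUMBER}p\b                                              # 30p
    """,
    re.VERBOSE | re.IGNORECASE,
)

sample = ("for example £50,000 and £117.3m as well as 30p, 500m euro, "
          "338bn euros, $15bn and $92.88.")
for amount in MONEY_RE.findall(sample):
    print(amount)
```

Because every group is non-capturing, findall returns the full matched amount each time. To meet the full-marks requirement, the same pattern would be applied to the text of the page above instead of the sample string.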
2: Write a regular expression that matches all of the phone numbers listed below. (You can write a Python program to check the matching results.)
555.123.4565
+1-(800)-545-2468
2-(800)-545-2468
3-800-545-2468
555-123-3456
555 222 3342
(234) 234 2442
(243)-234-2342
1234567890
123.456.7890
123.4567
123-4567
1234567900
12345678900
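A minimal sketch of one way to check a candidate pattern against the list above. PHONE_RE is an illustrative name, and the pattern reflects one reading of the list: an optional one-digit prefix, an optional three-digit area code with or without parentheses, then a seven-digit local number, with ".", "-" or a space as optional separators.

```python
import re

PHONE_RE = re.compile(
    r"""
    (?:\+?\d[-. ]?)?                  # optional one-digit prefix: +1-  2-  3-
    (?:(?:\(\d{3}\)|\d{3})[-. ]?)?    # optional area code: (800)-  555.  123
    \d{3}[-. ]?\d{4}                  # local number: 545-2468  123.4567
    """,
    re.VERBOSE,
)

numbers = [
    "555.123.4565", "+1-(800)-545-2468", "2-(800)-545-2468", "3-800-545-2468",
    "555-123-3456", "555 222 3342", "(234) 234 2442", "(243)-234-2342",
    "1234567890", "123.456.7890", "123.4567", "123-4567",
    "1234567900", "12345678900",
]

for number in numbers:
    # fullmatch requires the whole string to be a phone number.
    status = "match" if PHONE_RE.fullmatch(number) else "NO MATCH"
    print(f"{number:20} {status}")
```

Using fullmatch rather than search makes the check strict: a string passes only if the pattern accounts for every character in it.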
NLTK (10%)
1. Find the 50 most frequent words in the Wall Street Journal corpus in nltk.book (text7). Submit your code as part2_NLTK_studentID.py. (Remove all punctuation and lowercase all words before counting.)
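The counting step can be sketched with the standard library alone, as below. The function top_words and the toy token list are illustrative. Stripping punctuation characters from each token and dropping tokens that become empty is one interpretation of "all punctuation removed", not the only one.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def top_words(tokens, n=50):
    """Return the n most frequent lowercased, punctuation-free words."""
    cleaned = (tok.lower().translate(PUNCT_TABLE) for tok in tokens)
    counts = Counter(word for word in cleaned if word)  # skip emptied tokens
    return counts.most_common(n)

# Toy input standing in for the corpus tokens.
toy_tokens = ["The", "market", ",", "the", "Market", "fell", ".", "The", "end", "."]
print(top_words(toy_tokens, n=2))  # [('the', 3), ('market', 2)]
```

Assuming NLTK and its book data are installed, the same function should accept the corpus directly, for example top_words(text7) after "from nltk.book import text7", since text7 can be iterated token by token.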
Language modelling:
1. Build an n-gram language model based on NLTK's Brown corpus and provide the code. (You can build a language model in a few lines of code using the NLTK package; you may use bigrams, trigrams or higher-order n-grams.) (20pts)
2. Using the language model you built in question 1, make simple predictions. Start with the two words “I am” and have your n-gram model predict the next word. Show both the code and the model-generated results. (15 pts)
3. Based on your work in questions 1 and 2, generate a few sentences that start with “I am”. (15 pts)
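The three steps above can be sketched as follows, using plain trigram counts from the standard library rather than NLTK's own language-model classes. All function names and the toy sentences are illustrative. In a real solution the sentences would come from the Brown corpus, for example nltk.corpus.brown.sents(), assuming NLTK and the corpus data are installed. Words are lowercased, so generated text starts with "i am".

```python
import random
from collections import Counter, defaultdict

def build_trigram_model(sentences):
    """Map each two-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sent in sentences:
        words = ["<s>", "<s>"] + [w.lower() for w in sent] + ["</s>"]
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            model[(w1, w2)][w3] += 1
    return model

def predict_next(model, w1, w2, k=3):
    """Return up to k (word, relative frequency) pairs for the context."""
    counts = model.get((w1.lower(), w2.lower()), Counter())
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(k)]

def generate(model, w1, w2, max_words=20, rng=None):
    """Sample words in proportion to their counts until </s> or max_words."""
    if rng is None:
        rng = random.Random()
    out = [w1.lower(), w2.lower()]
    while len(out) < max_words:
        counts = model.get((out[-2], out[-1]))
        if not counts:          # unseen context: stop early
            break
        words = list(counts)
        nxt = rng.choices(words, weights=[counts[w] for w in words])[0]
        if nxt == "</s>":       # end-of-sentence marker
            break
        out.append(nxt)
    return " ".join(out)

# Toy sentences standing in for the Brown corpus.
toy_sentences = [
    ["I", "am", "happy"],
    ["I", "am", "here", "today"],
    ["I", "am", "happy"],
    ["We", "are", "here", "today"],
]
model = build_trigram_model(toy_sentences)
print(predict_next(model, "I", "am"))
for seed in range(3):
    print(generate(model, "I", "am", rng=random.Random(seed)))
```

This sketch uses unsmoothed counts, so a context never seen in the training data yields no prediction and ends generation early. Passing a seeded random.Random makes the generated sentences repeatable between runs.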