1. Download 5 news articles from your favourite news site. Each article should be at least long.
2. Pre-process the articles and remove any noise such as HTML or other web formatting characters. After this process, the articles should contain only ASCII characters, with one sentence per line.
3. Name these articles using the following naming convention.
a. One article per file.
b. The filename should be _X.txt, where X is a number from 1 to 5 corresponding to each of your 5 articles.
As an example, if your ID number is 0527825, then you should have the following 5 filenames:
· 0527825_1.txt
· 0527825_2.txt
· 0527825_3.txt
· 0527825_4.txt
· 0527825_5.txt
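Steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration rather than a required method: the function names `clean_article` and `article_filename` are hypothetical, the tag stripping relies on simple regular expressions, and the sentence splitter is a naive heuristic (it will break on abbreviations such as "Dr. Smith"), so the output should still be checked by hand.

```python
import html
import re
import unicodedata


def clean_article(raw: str) -> str:
    """Return ASCII-only text with one sentence per line (naive heuristics)."""
    # Drop script/style blocks entirely, then strip any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode HTML entities such as &amp; or &eacute;.
    text = html.unescape(text)
    # Decompose accented letters, then discard whatever is still non-ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive split: break after . ! or ? when a capital, digit or quote follows.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9\"'])", text)
    return "\n".join(s for s in sentences if s)


def article_filename(student_id: str, index: int, ext: str = "txt") -> str:
    """Build a name such as 0527825_1.txt from an ID and an article number."""
    if not 1 <= index <= 5:
        raise ValueError("article number must be between 1 and 5")
    return f"{student_id}_{index}.{ext}"
```

For example, `article_filename("0527825", 3)` gives the third filename in the list above, and the cleaned text returned by `clean_article` can be written to that file with an ASCII encoding.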
4. Now read through each of the above files and identify the following named entities:
a. PERSON
b. ORGANISATION
c. GPE (location)
5. Create 5 additional files using Notepad/WordPad containing the above 3 types of named entity, using the following notation rules.
a. The filename should use the same convention as above, but with the extension “.dat”.
b. One entity per line, with the entity type in brackets, as follows.
For the example text “Jacinda Ardern is the prime minister of New Zealand.” you should have:
Jacinda Ardern (PERSON)
NEW ZEALAND (GPE)
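The notation in rule b can be checked mechanically. The sketch below is one possible way to write and read lines in that format. The helper names `format_entity` and `parse_dat_line` are hypothetical, and the regular expression assumes the entity type is the last bracketed, upper-case word on the line.

```python
import re

# The three entity types named in step 4.
ALLOWED_LABELS = {"PERSON", "ORGANISATION", "GPE"}

# Entity text, optional spaces, then the type in round brackets at end of line.
_LINE = re.compile(r"^(?P<text>.+?)\s*\((?P<label>[A-Z]+)\)\s*$")


def format_entity(text: str, label: str) -> str:
    """Render one annotation line, e.g. 'NEW ZEALAND (GPE)'."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected entity type: {label}")
    return f"{text.strip()} ({label})"


def parse_dat_line(line: str) -> tuple[str, str]:
    """Split one annotation line into (entity text, entity type)."""
    match = _LINE.match(line.strip())
    if match is None or match.group("label") not in ALLOWED_LABELS:
        raise ValueError(f"line does not follow the notation: {line!r}")
    return match.group("text"), match.group("label")
```

Running `parse_dat_line` over every line of a .dat file before submission is a quick way to catch a misspelt type or a missing bracket.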
6. At the end of this exercise you should have the following 5 files for the same ID as above.
· 0527825_1.dat
· 0527825_2.dat
· 0527825_3.dat
· 0527825_4.dat
· 0527825_5.dat
7. Use 7zip to zip the 10 files (5 txt files containing your cleaned text and the 5 dat files containing your named entities) using the filename
0527825.7zip
8. Submit this zip file to bb as your assessment for Part A.
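Before archiving, it may help to confirm that all 10 files are present. The following is a small sketch under the naming convention described above. The function names `expected_files` and `missing_files` are hypothetical, and the check looks only at filenames, not at file contents.

```python
from pathlib import Path


def expected_files(student_id: str) -> list[str]:
    """List the 10 filenames expected for one ID: 5 .txt files, then 5 .dat files."""
    return [
        f"{student_id}_{index}.{ext}"
        for ext in ("txt", "dat")
        for index in range(1, 6)
    ]


def missing_files(folder: str, student_id: str) -> list[str]:
    """Return the expected filenames that are not present in the folder."""
    base = Path(folder)
    return [name for name in expected_files(student_id) if not (base / name).is_file()]
```

An empty list from `missing_files` means the folder holds every file needed for the archive.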
9. Download the zipped data file compiled collectively as part of Part A. You will be using this full dataset for all of your work in Part B.
10. You can use either NLTK or another tool (such as the Stanford NER tool) to do the following exercise.
11. Use the NER tool to extract the 3 categories of named entities from the dataset. If the tool you are using identifies more categories of entities, you can ignore those that are not annotated in the dataset.
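As one illustration of step 11, the sketch below uses NLTK's `word_tokenize`, `pos_tag` and `ne_chunk` functions and keeps only the three annotated categories. It assumes NLTK and the data packages those functions depend on are installed (the package names vary between NLTK versions). The names `LABEL_MAP`, `entities_from_tree` and `extract_entities` are hypothetical. Note that NLTK spells its label ORGANIZATION, so it is mapped here to the ORGANISATION spelling used in the annotations.

```python
# Map NLTK chunk labels onto the three categories annotated in the dataset.
# Any other label the chunker produces (e.g. FACILITY) is ignored.
LABEL_MAP = {"PERSON": "PERSON", "ORGANIZATION": "ORGANISATION", "GPE": "GPE"}


def entities_from_tree(tree):
    """Collect (entity text, category) pairs from a chunked sentence tree."""
    found = []
    for node in tree:
        # Plain (token, tag) tuples have no label(); only chunk subtrees do.
        label = getattr(node, "label", None)
        if callable(label) and label() in LABEL_MAP:
            text = " ".join(token for token, _tag in node.leaves())
            found.append((text, LABEL_MAP[label()]))
    return found


def extract_entities(sentence: str):
    """Run tokenisation, POS tagging and NE chunking on one sentence."""
    import nltk  # imported here so the helper above works without NLTK installed

    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)
    return entities_from_tree(tree)
```

Because each cleaned article has one sentence per line, `extract_entities` can be applied line by line and the results pooled per file.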
12. Use FPR metrics to evaluate the accuracy of the tool you are using.
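The brief does not spell out FPR. The sketch below assumes it refers to F-measure, precision and recall, and that an extracted entity counts as correct only when both its text and its category exactly match a manual annotation. Both points are assumptions, and a more lenient matching rule (for example, ignoring case) would give different numbers. The function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of (entity text, category) pairs by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: share of the tool's entities that are in the annotations.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the annotated entities that the tool found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical data: 'gold' stands for the .dat annotations, 'predicted' for tool output.
gold = {("Jacinda Ardern", "PERSON"), ("New Zealand", "GPE")}
predicted = {("Jacinda Ardern", "PERSON"), ("Wellington", "GPE")}
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Using sets means a repeated entity is counted once per file. If repeated mentions should count separately, the comparison would need to work on lists with positions instead.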
1. You are required to write a report describing your programming activity in a maximum of 10 pages, excluding references and any appendices. Your report should contain the following:
a. An introduction describing what you set out to do. Here you should also briefly describe what NER is and why it is an important step in text processing.
b. An algorithmic description of the steps and the tools used to achieve the task.
c. The FPR values calculated for each of the categories, as well as the overall values, with the results presented appropriately.
d. A discussion of the FPR values obtained and whether they are reasonable or not.
e. Conclusions and reflection on your learning.
f. The code, with appropriate comments and in an easily readable format, should be included in the appendix.
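For item c, one possible way to obtain per-category and overall figures and lay them out as a table is sketched below. It again assumes FPR means F-measure, precision and recall with exact matching, and it computes the overall row by pooling the counts across categories (micro-averaging), which is only one of several reasonable choices. The names `score`, `evaluate` and `format_table` are hypothetical.

```python
from collections import Counter

CATEGORIES = ("PERSON", "ORGANISATION", "GPE")


def score(tp: int, fp: int, fn: int):
    """Turn raw counts into (precision, recall, F1), guarding against zero division."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(gold, predicted):
    """Score each category separately, then pool the counts for an overall row."""
    gold, predicted = set(gold), set(predicted)
    counts = {category: Counter() for category in CATEGORIES}
    for text, label in predicted:
        if label in counts:
            counts[label]["tp" if (text, label) in gold else "fp"] += 1
    for text, label in gold - predicted:
        if label in counts:
            counts[label]["fn"] += 1
    results = {c: score(n["tp"], n["fp"], n["fn"]) for c, n in counts.items()}
    total = sum(counts.values(), Counter())
    results["OVERALL"] = score(total["tp"], total["fp"], total["fn"])
    return results


def format_table(results) -> str:
    """Render the results as a fixed-width text table."""
    lines = [f"{'Category':<14}{'P':>7}{'R':>7}{'F':>7}"]
    for name, (p, r, f) in results.items():
        lines.append(f"{name:<14}{p:>7.2f}{r:>7.2f}{f:>7.2f}")
    return "\n".join(lines)
```

Printing `format_table(evaluate(gold, predicted))` for the pooled annotations and tool output gives one row per category plus an overall row, which can be copied into the report or used as the basis for a chart.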