If this date falls on a UK public holiday or a University of York closure day, the submission date will change. Please check the submission point in the ‘Assignments’ area of the module in Canvas for the exact submission deadline.
I. Module Learning Outcomes
1. Analyse different data mining and text processing tasks and the algorithms most appropriate for addressing them.
2. Critically evaluate and select the appropriate open-source or commercial data mining and text processing toolkits and implement the algorithms.
3. Critically evaluate the algorithms with respect to the accuracy of their results.
4. Develop and communicate a data mining and text processing solution to a real-world problem.
5. Identify and discuss the challenging research issues in the area of data mining and text processing.
II. Assessment Background/Scenario
You have been approached by a credit card company which is looking to explore credit card fraud patterns and profile potential targets. The company has identified four areas (problems) it would like to identify and test potential solutions for, using the data mining and/or text analysis techniques that you have learned about on this module.
Using only the given data set/s, the company would like you to explore and present possible solutions for only two of the following:
Top profiles: Being able to profile potential targets effectively may help improve fraud prevention in the future. Examine the data and identify three distinct profiles (differing sets of personal attributes; there may be some overlap) that are linked to high levels of fraudulent actions. You will need to define and clearly state what you have identified as ‘high level’ as part of your assumptions for this problem.
Location: Determine whether a ‘transaction’s location’ is a good predictor of the likelihood of fraud and clearly demonstrate this against 2-3 of the other attributes in the data set. Make sure you clearly state which other attributes you selected as comparators and justify why you have selected them for the role.
Recommender: Consider how you could use the data to recommend to credit card users safe places to perform their transactions on a daily/weekly basis. As part of this problem, also consider how this information could be best communicated to credit card users in a visual way and put forward or demonstrate one option. You may need to consider the data sparsity problem here and research possible solutions.
Transaction Search: The company is looking to provide an extra service/level of security for high-value targets. Ascertain if there is a strong relationship between the transaction amounts and the time of day of the transactions. Consider what other attributes (within the data set) could be included to help protect high-value targets.
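As an illustration only, and not a required approach, the strength of the relationship between time of day and transaction amount could be summarised with a correlation ratio (eta squared), which measures how much of the variance in amounts is explained by a time-of-day band. The sketch below uses invented values and a hypothetical four-band split of the day; the function names `correlation_ratio` and `hour_band` are assumptions and do not refer to anything in the provided data set.

```python
from collections import defaultdict
from statistics import mean


def correlation_ratio(categories, values):
    """Return eta squared: the share of variance in `values` explained by `categories`."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # no variance at all, so nothing to explain
    ss_between = sum(
        len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


def hour_band(hour):
    """Map an hour (0-23) to a coarse time-of-day band (hypothetical split)."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


# Invented (hour, amount) pairs for illustration; not taken from the data set.
toy = [
    (1, 900.0), (2, 1100.0),
    (9, 40.0), (10, 60.0),
    (14, 45.0), (15, 55.0),
    (20, 100.0), (21, 120.0),
]
bands = [hour_band(h) for h, _ in toy]
amounts = [a for _, a in toy]
eta_sq = correlation_ratio(bands, amounts)
print(f"eta squared = {eta_sq:.3f}")
```

A value close to 1 suggests that the band explains most of the variation in amount, while a value close to 0 suggests little relationship. Banding is used here because hour of day is cyclic, so a plain linear correlation on the raw hour may be misleading.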
The credit card company has provided you with a simulated sample set of data as two CSV files – one suitable for training and the other for testing. How you use these is up to you, and you may not need both depending on your approach. You should clearly state in your submission which of the data sets you have used at appropriate points.
These data samples are provided under a Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license and are available from the Kaggle data repository.
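For the Top profiles and Location problems, one possible starting point is to compare the fraud rate across the values of an attribute and to flag those above a stated threshold. The following is a minimal sketch under stated assumptions: the column names `state`, `category` and `is_fraud`, the records themselves and the threshold `HIGH_LEVEL` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def fraud_rate_by_value(records, attribute, label="is_fraud"):
    """Return {attribute value: fraction of records with that value marked as fraud}."""
    totals, frauds = Counter(), Counter()
    for row in records:
        key = row[attribute]
        totals[key] += 1
        frauds[key] += row[label]  # label is assumed to be 0 or 1
    return {key: frauds[key] / totals[key] for key in totals}


# Invented records; the column names are assumptions, not the real schema.
toy_records = [
    {"state": "NY", "category": "travel", "is_fraud": 1},
    {"state": "NY", "category": "grocery", "is_fraud": 1},
    {"state": "NY", "category": "grocery", "is_fraud": 0},
    {"state": "TX", "category": "travel", "is_fraud": 0},
    {"state": "TX", "category": "grocery", "is_fraud": 0},
    {"state": "TX", "category": "travel", "is_fraud": 1},
    {"state": "CA", "category": "grocery", "is_fraud": 0},
    {"state": "CA", "category": "travel", "is_fraud": 0},
]

HIGH_LEVEL = 0.5  # assumed definition of a 'high level' of fraud; state yours explicitly

rates = fraud_rate_by_value(toy_records, "state")
high = sorted(value for value, rate in rates.items() if rate >= HIGH_LEVEL)
print(rates)
print("High-fraud values:", high)
```

Running the same function over a location attribute and over two or three comparator attributes gives a simple, like-for-like basis for discussion, although it does not by itself establish predictive power.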
III. Assessment Task/s
Given the scenario above, your task is to write a report detailing possible solutions to your two selected options. You should clearly draw on the current literature and can use examples from your work throughout the module, including your formative assessment as supporting evidence for your approach. There is NO requirement for you to cite yourself where you have reused work from your formative. However, please be aware that the context for this assessment is different, and you may find that success in one does not necessarily translate to success in the other.
Your report should provide an initial executive summary and consist of three clear sections – one for each task. Your response in one section will not contribute to grades in another, so you should consider this assignment in the same way you would an examination. Further formatting details are given below.
The maximum word count for the three tasks combined is 3,000 words, plus your executive summary, which should be 400 words.
Executive summary (All LOs, 10%)
(400 words, and no fewer than 390)
Overview/summary of the report which should at least contain the following points:
Which options you have chosen to present solutions for;
What was achieved/undertaken;
What processes were applied;
What the results demonstrated;
What should be reconsidered in future.
Task 1: Discussion of techniques used in your two solutions (LO1, 3 and 4, 40%)
Given the scenario above, design and discuss the potential solution(s) to the problem(s) you have selected. You will need to write small programs and/or use tools to run simulations as supporting evidence on the given test data. In this task you should make it clear which problem you are presenting a solution for.
Your report should clearly cover the following:
Any assumptions you are making about the scenario or selected problem;
Any pre-processing you would undertake to make the data fit for purpose;
Which data mining/text analysis techniques you have employed in your solutions;
Justification for the selection of those techniques, given the nature of the data and the requirements of the problem you are attempting to solve;
An evaluation of the techniques you have applied in terms of the accuracy of their results. You will need to clearly define and state the measures/methods by which you are evaluating the techniques. It is perfectly acceptable for your techniques to have been unsuccessful. Whether successful or not, it should be clear how your evaluation has informed your conclusion.
All code examples and results (output) should be presented in the appendices as screenshots only, not as handwritten (typed) code. All supporting evidence in the appendices must be referred to and discussed in the body of the report. You will need to present evidence of your prototype programs and tests to pass this assessment. To attain a higher grade, your discussion should be supported by reference to relevant literature in this section.
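As a hedged illustration of what clearly defined evaluation measures might look like, the sketch below computes accuracy, precision, recall and F1 from predicted and actual fraud labels. The labels are invented, and the function names `confusion_counts` and `evaluate` are assumptions made for this example. Fraud data is typically imbalanced, so accuracy alone can look strong even when little or no fraud is detected.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 means fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return a dict of accuracy, precision, recall and F1 (0.0 where undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels: 2 fraudulent transactions out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # a hypothetical classifier's output
y_never = [0] * 10                        # a baseline that never predicts fraud

model_scores = evaluate(y_true, y_model)
baseline_scores = evaluate(y_true, y_never)
print("model   :", model_scores)
print("baseline:", baseline_scores)
```

In this toy case both the hypothetical classifier and the never-fraud baseline reach the same accuracy, but only the classifier has non-zero recall, which is why stating the measures used matters for the evaluation.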
Task 2: Evaluation of the tools/languages (LO2, 20%)
Given the languages/tools you have selected and used, provide a critical evaluation of their effectiveness in the context of the given scenario. You should clearly make comparisons to other options available and draw on the specific requirements of the scenario when presenting your argument. To attain a higher grade, your discussion should be supported by reference to relevant literature in this section.
Consider the question: If you undertook this assignment a second time, would you use the same languages/tools and why?
Task 3: Discussion of the current literature (LO5, 30%)
(Suggested word count for this section: 1,000 words)
Given the scenario above and the nature of the problems you have selected, research and identify the main areas of investigation the research community is currently tackling. Consider the following questions:
What are the current ‘problem’ areas?
What solutions have been put forward and how are they being evaluated?
Given your experience, would you consider these potentially successful solutions?
Justify why you consider them successful or not.
Present a discussion around these questions and consider how current research could potentially change or improve your solutions to the given scenario. To attain a pass, your discussion must be supported by reference to relevant literature in this section.
IV. Deliverables
You are to produce and submit a REPORT that presents your response to the three tasks, given two of the problems presented in the scenario. Your report should adhere to the following format guidelines.
You may choose to redistribute the given indicative word counts between the three tasks as you see fit, providing your total response to these does not exceed 3,000 words. However, the executive summary must be 400 words and you cannot redistribute that word count to support other sections.
Document Format
You should submit a single word-processed file as .doc, .docx or .pdf. Other formats are NOT acceptable and are not accessible by your marker.
Text that exceeds the overall word limits will not be reviewed. A line will be drawn at the limit as indicated above.
You must state on the front page of your document the number of words used and this will be checked.
The main text should be written using a consistent sans serif font and font size.
All images and diagrams must be clear and viewable on the page without scaling. They should be accompanied by appropriate captions and be referred to and discussed in the main body of the text. Those that are not will NOT be considered.
Your document should be fully justified OR left justified, but NOT centrally justified.
You should not use more than 3 levels of section headings, i.e. main heading, sub-heading 1, sub-heading 2. Your title is not classed as a heading.
All source material that is used, whether by direct quotation or not, must be acknowledged, following the IEEE referencing style. See the University of York Academic Integrity site.
Appendices may be used but should not exceed 5 additional pages and all content must be referred to and discussed in the main body of the text. Those that are not will NOT be considered.
Appendices should ONLY be used for supportive information, such as over-large figures or tables of data. They are NOT a device to incorporate material that would otherwise cause you to exceed the word limit. These are not included in the word count.
The word count does not include: any title pages, bibliography and/or reference list, any tables of contents/figures/diagrams/etc.
Your reference list should come after any appendices and is not included in the word count. It should be formatted using the IEEE guidelines.