The first option is to program ID3 using the algorithm described in class. You need to develop software to solve a supervised learning problem (i.e. to build a model against a training set), then run the software against a test dataset and report the accuracy of the model. Your program should do the following things:
You should build a decision tree using the ID3 algorithm given in the 3rd lecture (it is a pretty simple algorithm; feel free to learn it yourself if you choose to start this assignment before Week 3). This algorithm uses the information gain measure to calculate the splits. You should build the decision tree using the training data supplied, then calculate the error on the supplied test/validation data. Since the mushroom dataset is categorical, you will not need to consider the complexities added with real-valued attributes. There is missing data in the mushroom dataset (flagged by "?" values). Don't treat the missing data specially; just pretend that "?" is another value for the attribute in question. Also, do not worry about pruning the tree.
The program must display a text representation of the decision tree. You are free to display the tree in any way you think makes sense, so long as it shows what attributes are tested at each node in the tree. It is acceptable to utilise diagnostic tools provided by machine learning packages for the display of the tree **as long as the tree is built by your own program, i.e. it is NOT acceptable to form a second tree using the package and display that second tree directly**.
Hint #1: The trick with building the decision tree is not really the ID3 algorithm, which is fairly straightforward. The tricky bit is managing the dataset. Remember that you need to be able to easily split the dataset based on the value of a specific attribute. That means you need to devise a suitable data structure that makes it easy to do this split and to work out class frequencies; one possible shape is sketched below.
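As a rough illustration only, here is a minimal Python sketch, assuming each example is stored as a dict of attribute values with its label under a hypothetical "class" key (the function names are illustrative, not prescribed by the assignment):

```python
from collections import Counter, defaultdict

def partition(rows, attribute):
    """Group a list of example dicts by the value of one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return groups

def class_counts(rows, target="class"):
    """Frequency of each class label within a subset of examples."""
    return Counter(row[target] for row in rows)
```

With this shape, computing the information gain of an attribute is just a matter of calling partition once and weighting the entropy of each group by its size.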
Hint #2: Think carefully about the entropy function you need to use when calculating information gain. It's not quite as simple as in our theoretical discussion. Specifically, what happens when all of the dataset you're looking at has only one of the two class values, i.e. all the mushrooms are edible or all are poisonous? How will you deal with this?
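One common way to handle this, sketched below, is to let zero counts contribute nothing to the sum, relying on the limit p·log p → 0 as p → 0, so a pure subset gets entropy 0 rather than triggering log(0); this is a sketch, not the required implementation:

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a mapping label -> count.

    Zero counts are skipped, so a pure subset (all edible or all
    poisonous) yields 0.0 instead of a math domain error on log(0).
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)
```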
Hint #3: Follow carefully the online learning materials provided in Week 3.
The second option allows you to choose another algorithm to program, so long as you seek approval from me. One potential method is a multilayer perceptron neural network. You may use a supporting mathematical library to help with the details so long as you code the machine learning algorithm part yourself. Note: It is not acceptable to simply write code to call the Java Weka algorithm or the Python scikit-learn code for the algorithm. I expect you to write the main algorithm yourself. The dataset to be used for the classification (or regression) problem will need to be determined in consultation with me, but as a default we would probably use the mushroom dataset from choice 1 if it makes sense.
The third choice is to use an existing package to solve a data mining problem. If you want to do this it will not be enough to just use one classification algorithm and copy the output. You need to explore the data, systematically try several algorithms and parameter settings to find the best (by evaluating the quality of the classifiers) and then provide a recommendation.
With the emergence of big data, it has become important for industries to mine the available data about their business processes and customers in order to improve performance. In this way, organizations can gain a competitive advantage in their respective markets.
Nowadays the retail sector has become one of the most competitive industries. In order to survive in the market, organizations mostly rely on comprehensive, undirected mass marketing.
Every potential customer receives similar catalogues, advertising mail, pamphlets and announcements. As a result, most customers get annoyed by the huge number of offers, while the response rate for these campaigns drops for every organization.
The following report covers the exploration of a dataset from a retail organization, the modelling of the selected dataset and the implementation of different classifier algorithms. In addition, the sections of this report present interpretations of the different insights gained from the exploration of the selected dataset.
The selected BigMart dataset was collected from the link below:
The dataset contains 8523 rows and twelve columns that represent different attributes of the products. The columns are: 'Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales'.
Following is the statistical information for the selected dataset:
       Item_Weight  Item_Visibility     Item_MRP  Outlet_Establishment_Year  Item_Outlet_Sales
count  7060.000000      8523.000000  8523.000000                8523.000000        8523.000000
mean     12.857645         0.066132   140.992782                1997.831867        2181.288914
std       4.643456         0.051598    62.275067                   8.371760        1706.499616
min       4.555000         0.000000    31.290000                1985.000000          33.290000
25%       8.773750         0.026989    93.826500                1987.000000         834.247400
50%      12.600000         0.053931   143.012800                1999.000000        1794.331000
75%      16.850000         0.094585   185.643700                2004.000000        3101.296400
max      21.350000         0.328391   266.888400                2009.000000       13086.964800
The above table provides a statistical description of all the numerical attributes in the selected dataset. For Item_Weight there are 7060 rows in total, while for Item_Visibility, Item_MRP and Item_Outlet_Sales there are 8523 rows of data. From the table it is evident that the minimum, mean and maximum values for Item_MRP are 31.2900, 140.9928 and 266.8884 respectively.
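For reference, a minimal pandas sketch that would reproduce the summary above; the file name Train.csv is an assumption standing in for the actual download:

```python
import pandas as pd

# Hypothetical file name for the BigMart data described above.
df = pd.read_csv("Train.csv")
print(df.describe())  # reproduces the summary statistics table
```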
For Item_Weight, the minimum, mean and maximum values are 4.5550, 12.8576 and 21.3500 respectively.
For the attribute Item_Visibility, the minimum value is found to be zero. This result does not make sense: if a product is sold in a store, its visibility cannot be 0.
On the other hand, Outlet_Establishment_Year ranges from 1985 to 2009. Therefore, using the age of a store it will be possible to find out the impact on the sales of products from specific stores.
The lower count for Item_Weight compared to Item_Outlet_Sales indicates that there are missing values in the selected dataset.
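Continuing with the df frame loaded above, a quick check for these gaps could look like this:

```python
# Count missing values per column; Item_Weight (and Outlet_Size)
# are expected to show non-zero counts.
print(df.isnull().sum())
```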
For this project a linear regression model is used. Simple linear regression is a method that summarizes and defines the relationship between two continuous variables. One variable, denoted x, is considered the independent or explanatory variable; the other, denoted y, is considered the dependent variable.
There are mainly two types of regression models: simple linear regression and multiple linear regression. Simple linear regression considers one independent variable, while multiple linear regression considers more than one independent variable for the prediction process. When finding the best fit line, a polynomial or curvilinear function can also be fitted; this is known as polynomial or curvilinear regression.
In implementing the linear regression model the main objective is to fit a straight line through the distribution of the selected data. The best fit line is the one nearest to all of the points, which reduces the error when predicting new data points from the fitted line.
Following are some of the standard properties of the least-squares regression line: it passes through the point formed by the means of the two variables; the residuals around it sum to zero; and among all straight lines it minimizes the sum of squared errors.
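A minimal scikit-learn sketch of fitting such a line, assuming the df frame from the exploration above and using Item_MRP as the single explanatory variable (this pairing is an illustrative choice):

```python
from sklearn.linear_model import LinearRegression

# Fit outlet sales against price for rows where both values are present.
xy = df[["Item_MRP", "Item_Outlet_Sales"]].dropna()
model = LinearRegression()
model.fit(xy[["Item_MRP"]], xy["Item_Outlet_Sales"])
print(model.coef_[0], model.intercept_)  # slope and intercept of the line
```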
The linear regression model is implemented as part of machine learning, which is a branch of artificial intelligence. The developed algorithm enables computer systems to adapt their behaviour and make predictions for a specific data point in view of the available empirical data used for training (Chen and Zhang 2014). A critical focus of machine learning is discovering knowledge by recognizing patterns and making intelligent decisions based on data.
Machine learning can assist retailers in becoming more accurate and granular in making predictions (Davenport, 2013). Examples of machine learning include natural language processing, association rule learning and ensemble learning (Chen and Zhang 2014).
Thematic analysis of the dataset was chosen as the data analysis strategy for this study, as it was appropriate for determining patterns and themes relating to the use of big data analytics. Patterns and themes were identified by breaking thematic analysis down into six steps, briefly discussed below.
1. Becoming familiar with the data that had been gathered from the semi-structured interviews. The key approach here was transcribing the interviews and checking the transcripts against the originals for accuracy.
2. Reading all of the transcripts and generating codes which described interesting features of the responses from respondents.
3. Identifying themes, which involved reading all of the codes and assigning them a common theme. Examples of initial themes were: use, challenge, definition, barriers to adoption, future, design, value, technologies, development, perception and motivation.
4. Sorting the codes under their respective themes, then re-reading the coded extracts under each theme in order to identify subthemes. The process of identifying subthemes grouped common codes together and removed some of the codes which did not form part of a coherent group.
For the selected dataset, the unique values for each of the columns were investigated, as listed below:
Attribute                     count
Item_Identifier                1559
Item_Weight                     416
Item_Fat_Content                  5
Item_Visibility                7880
Item_Type                        16
Item_MRP                       5938
Outlet_Identifier                10
Outlet_Establishment_Year         9
Outlet_Size                       4
Outlet_Location_Type              3
Outlet_Type                       4
Item_Outlet_Sales              3493
dtype: int64
From the above table it can be stated that the dataset contains 10 unique outlets with unique identifiers and 1559 different items.
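These counts can be reproduced with a single pandas call on the df frame loaded earlier:

```python
print(df.nunique())  # unique value count per column, as tabled above
```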
In the next stage of this project, the most important factors impacting the sales of products at the stores are investigated. In this analysis the relationships between the different factors were found through their correlations. The following heat map depicts these relationships.
[Figure: correlation heat map of the numerical attributes]
From the above heat map, it is evident that Item_MRP has the most significant impact on the outlet sales of products from the stores.
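A sketch of how such a heat map could be produced with seaborn, assuming the df frame from above (the colour map is an arbitrary choice):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric columns only.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```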
In this stage, the distributions of the different types of items according to their contents and types are segregated. These are given below.
Frequency of Item_Fat_Content in the dataset:
Low Fat    8485
Regular    4824
low fat     522
LF          195
reg         178
For the different outlet sizes, the frequency of the stores is given by:
Medium    2793
Small     2388
High       932
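Both frequency tables above can be reproduced with pandas value counts on the df frame:

```python
print(df["Item_Fat_Content"].value_counts())
print(df["Outlet_Size"].value_counts())
```

Note that Item_Fat_Content encodes the same two categories under five different spellings, which is dealt with in the feature engineering stage later on.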
In further investigation, the comparison between the different item types and item weights was examined. A box plot is used to find the relationship, which leads to the following plot.
[Figure: box plot of Item_Weight by Item_Type]
From the above plot it is evident that household products and seafood are available in a wide range of weights.
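A seaborn sketch that would draw such a box plot, assuming the df frame and the matplotlib/seaborn imports from the heat map sketch:

```python
plt.figure(figsize=(10, 6))
sns.boxplot(x="Item_Type", y="Item_Weight", data=df)
plt.xticks(rotation=90)  # the sixteen item types need rotated labels
plt.show()
```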
The following plot, which includes several variables, helps in determining the relationships between them.
[Figure: scatter plot of outlet establishment year against outlet sales]
From the above plot it can be said that most of the stores established before 1990 have larger sales than the newer Big Mart stores. Moreover, items with weights between 10 and 15 are the ones mostly sold by the stores.
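One possible sketch of such a multi-variable plot, using seaborn's scatter plot with point size encoding Item_Weight; the exact encoding used in the original figure is an assumption:

```python
sns.scatterplot(x="Outlet_Establishment_Year", y="Item_Outlet_Sales",
                size="Item_Weight", data=df)
plt.show()
```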
In this data mining project, it was found that numerous rows in the dataset contain missing values, which may lead to a wrongly specified classifier model. Thus, the missing values in the dataset are removed or imputed in order to make the dataset and the generated model reliable.
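A minimal imputation sketch along these lines, assuming mean imputation for the numeric gaps and mode imputation for the categorical ones; treating zero visibility as missing follows the observation made in the exploration above:

```python
# Fill numeric gaps with the column mean, categorical gaps with the mode.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

# Zero visibility cannot occur for a sold product, so treat it as missing.
mean_vis = df.loc[df["Item_Visibility"] > 0, "Item_Visibility"].mean()
df.loc[df["Item_Visibility"] == 0, "Item_Visibility"] = mean_vis
```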
The analysis of big data to gain insights is not a new idea. Big data analytics has been defined in various ways and there appears to be a lack of consensus on the definition. It has been defined in terms of the technologies and techniques which are used to analyse large-scale complex data to help improve the performance of a firm. Kwon et al. (2014) characterize big data analytics as the application of advanced analytic techniques on big data sets. Fisher et al. (2012) define big data analytics as a workflow which distils terabytes of low-value data down into more granular data of high value. For the purposes of this report, big data analytics was defined as the use of analytic methods and technologies to analyse big data in order to obtain information which is of value for decision-making.
The majority of big data is available in an unstructured state that, by itself, does not provide any value for business organizations. From the unstructured dataset, through the use of the right set of tools and analytics, it is possible to find relevant insights for the organizations.
With a specifically crafted model it is possible to make predictions for the desired feature selected from the database. Using a predictive model on the selected dataset can help in finding trends and insights about sales and business processes that drive operational efficiencies for business organizations. In this way an organization can create and launch new products as well as gain competitive advantages over its competitors. Exploiting the value of big data in this way also removes the tremendous effort otherwise required for sampling.
Furthermore, analysis of big data can bring other benefits for organizations. These benefits include the launch of new products and services with customer-centric product recommendations that better meet customer requirements. In this way data analytics can facilitate growth in the market.
Previously, extracting insights from such huge amounts of data was too costly to be practical.
The underlying concept of this project was knowledge discovery through the use of different packages in the Python language, a process known as data mining. The core of this process is machine learning, through defining the features for the classification process.
Descriptive analytics is the set of techniques used to describe and report on the past (Davenport and Dyché, 2013). Retailers can use descriptive analytics to describe and summarize sales by region and inventory levels. Examples of methods include data visualization, descriptive statistics and some data mining techniques (Camm et al., 2014).
Predictive analytics consists of a set of techniques which use statistical models and empirical methods on past data in order to make accurate predictions about the future, or to determine the impact of one variable on another.
In the retail industry, predictive analytics can extract patterns from data to make forecasts about future sales, repeat visits by customers and the likelihood of making an online purchase (Camm et al., 2014). Examples of predictive analytic techniques which can be applied to big data include data mining methods and linear regression.
In order to avoid decreasing rates of response and success, it is important to provide personalized recommendations for customers that are specific to their needs. To do that, it is important to determine the factors that have a significant impact on the fitting of the regression line. For this, feature engineering is the most important stage, where the variables are selected for the classifier modelling process.
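A small feature engineering sketch consistent with this stage, normalizing the inconsistent fat-content labels seen earlier and one-hot encoding an illustrative selection of categorical predictors (the exact feature set used is an assumption):

```python
# Collapse the five fat-content spellings into the two real categories.
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(
    {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"})

# One-hot encode the categorical predictors alongside the numeric ones.
features = pd.get_dummies(
    df[["Item_MRP", "Item_Weight", "Item_Fat_Content",
        "Outlet_Size", "Outlet_Type"]])
```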
For this project we investigated the selected dataset from multiple perspectives using different classification models. The selected dataset is small in size, which may not be fruitful for an organization's large-scale sales model. Throughout the project it was found that the dataset has missing values for different attributes in numerous rows; these were managed in the data cleansing process to obtain a better classification model.
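An evaluation sketch in that spirit, assuming the encoded features frame from the previous sketch and an 80/20 random split; the split ratio and random seed are illustrative assumptions:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, df["Item_Outlet_Sales"], test_size=0.2, random_state=42)
reg = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, reg.predict(X_test)) ** 0.5
print(f"RMSE on the held-out split: {rmse:.1f}")
```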
Even though altered combinations of feature sets were used for the modelling, the results deviated from each other because of the noisy dataset. On the other hand, the models which yielded higher accuracies support the conclusion that the dataset contains a demonstrable amount of information.
Some of the noise in the results may have been caused by the random split method applied to the complete dataset. The results from the project show that linear models are not really suitable for this kind of project, as this kind of model was introduced to predict continuous attributes rather than categorical data.
Davenport, T.H. and Dyché, J., 2013. Big data in big companies. International Institute for Analytics, 3.