Using K-means for Customer Segmentation in E-commerce essay.

Research Objective

Requirements (Tasks)
The whole task of this assignment consists of the following procedural steps.

Step 1 :
Set up (from your imagination of a realistic business situation, or by applying an actual analysis problem case) a scenario in which you are given a domain-specific dataset and asked to analyze it. The purpose of the analysis might be to understand (overview or learn about) the given data or to solve a specific analytical problem, depending on the scenario you made up.

Step 2 :
Find and get your own domain-specific dataset that fits the scenario you made up. The dataset could be unique or publicly available. Some public datasets are available from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). Also refer to the Resources folder of our LearnJCU subject site for more sources.

Step 3 :
Choose appropriate data mining techniques (algorithms) – see more details for each option in Step 4 below.
** Note: The order of the above three steps can be changed. For example, you may find an interesting dataset first and then set up a specific data mining scenario which fits the analysis of the chosen dataset. **

Step 4 :
You can select either of two options for this assignment.

Option (1) – Programming-intensive Assignment
- Once you have your own domain-specific dataset and chosen data mining algorithm, then you need to design and implement the chosen algorithm in your preferred programming language.
- A series of preprocessing steps will be required at this step. The preprocessing procedure should be designed carefully (considering what kind of processing will be required, how, and why) to make your data ready to be fed to your program. Some parts of this preprocessing procedure can be included in your program as a part of the “pre-data-mining module”.
Note: This assignment can be worked either as a group (two students at maximum) or as an individual. If you work as a group, then group members must contribute equally to the group work. Also, all group members must participate in the presentation.
- Your final program must become a stand-alone data-mining tool designed for your own purpose of data analysis. It is expected that your program should include the following modules (and may include more sub-modules if needed);
1) pre-data-mining module – designed for necessary preprocessing and for getting the data ready to be fed to the next module (data-mining module). You don’t need to include all required pre-processing in this module. It is assumed that some initial preprocessing (e.g. cleaning noise data) can be done externally using other software tools (e.g. Excel or Weka).
2) data-mining module – the chosen data mining algorithm is implemented. You can directly borrow the algorithm from one popular existing data mining method, or you can design your own algorithm (by amending the existing one)
3) post-mining module – this module is for presenting/reporting the output result produced through previous modules. The result can be made in a simple text report or additionally in a non-text visualization way (e.g. graph, chart or diagram).
- This programming-intensive assignment still requires an analysis. Try to find all the patterns you can detect with your implemented algorithm. Try to compare and contrast the result obtained with your chosen preprocessing scheme and algorithm against the results obtained with other existing algorithms or other preprocessing methods.

Note: to compare the result of your program with that of other existing algorithms, you can use existing data mining tools (e.g. Weka) to obtain the results of the other algorithms.

Option (2) – Analysis-intensive Assignment
- Once you have your own domain-specific dataset chosen, you need to design your own data-mining analysis scheme. This analysis scheme can consist of multiple steps of procedures:
1) Set up a strategy for preprocessing your data. A series of preprocessing steps will be required and needs to be designed carefully (considering what kind of processing will be required, how, and why). You may include multiple different preprocessing schemes for the comparison analysis.
2) Set up a strategy for data mining. You need to select one data mining area (clustering, classification, association rules mining) of your choice and select AT LEAST ONE existing data mining algorithm in your chosen data mining area. For example, if you chose clustering as your data mining area, you can apply two algorithms, DBSCAN and k-means, and compare the two results. Alternatively, you can design a combined algorithm which applies multiple algorithms from the same or different data mining areas in a series. Your strategy can also be designed to apply different parameters for one algorithm. Another strategy is to apply multiple preprocessing (attribute selection) schemes for one algorithm.
- You can choose one data mining tool (e.g. Weka) to analyze your chosen dataset. Apply the data-mining strategy (you had set up) on your chosen data (preprocessed) using the data mining tool and try to find all the patterns you can detect.
- Do various comparison experiments either by applying different data mining algorithms (or strategy) to the same chosen dataset or by applying the same algorithm to the differently pre-processed datasets.
- Critically analyze experimental results and discuss/demonstrate why a chosen algorithm (strategy) is superior/inferior to other algorithms (strategies).

Step 5 :
- You need to give a short presentation (5~10 minutes) based on your chosen algorithm (strategy) and experimental test, and you also need to write a scientific paper as an experimental report.
• The presentation must generally include a good overview of your project, aims/objectives, reasons for your choice, a brief overview of the strategy/algorithm you chose, findings, comparison including experimental results, and conclusion.
• The presentation pitch will occur during classes in Week 13. All members of the group should equally participate in the presentation pitch.
• You need to write a research report paper of minimum 10~15 pages in length on your project to summarise your algorithm and experimental results. The report should contain all topics listed above for presentation but with more details. For CP5634 students, you need to add in your report one additional section for a brief (mini) literature review about the data mining methods (strategy, algorithm and/or preprocessing methods) you chose for your project. Please refer to the following link if you need to get further idea of “literature review”:
- The research paper must follow the generally accepted format of a research article consisting of an introduction, related work (a brief review of the methodologies, algorithms or strategies used), a summarized description of your experimental settings and procedures (description of data, justification of chosen data mining area, justification of chosen algorithm, preprocessing details, etc.), comparison, discussion, issues, conclusion, possible future work and a list of references (you may add more sections if needed).
- In addition to the general components listed above, the report from “Programming-intensive option” should include a summary of your program (including the program structure, implementation details, a summarized algorithm for the main modules etc. including code if necessary).
- For the “Analysis-intensive option”, it is required to include a more in-depth analysis of the investigation and experimental comparison made through the project.

Research Paper

Use of the K-means algorithm in customer segmentation

Introduction

In data mining, clustering is a critical step. Statistically, it is a multivariate procedure that is suitable for many kinds of segmentation of a given data set. This research paper applies the k-means clustering technique, using the Python programming language, to a selected e-commerce data set.

For a business, market or customer segmentation is the process of dividing its customer base into multiple homogeneous groups of users who share similar buyer characteristics. These attributes include buying habits, interest in products, product preferences and so on (Al-Wakeel & Wu, 2016). Customer segmentation is considered the most fundamental step in strategic planning for acquiring potential customers as well as retaining existing ones. Buyers are mainly segmented into categories according to their purchasing capability and interests.

With the present rate of growth in e-commerce, data clustering promises effective answers to many of the issues that arise as customers interact with the expanding volume of data generated by browsing and buying products on websites (Kwac, Flora and Rajagopal, 2014). By clustering we mean the unsupervised procedure through which large amounts of data are segmented into homogeneous, disjoint groups. This segmentation depends on the similarity between attributes. The k-means algorithm is also used in other applications, such as decision making and pattern classification.

This paper discusses the preprocessing of the data set and its analysis using the k-means clustering algorithm, followed by a discussion of the findings from that analysis.

Related work: customer segmentation using the K-means algorithm

In their paper, Kwac, Flora and Rajagopal (2014) state that clustering is effective at finding subtle yet strategic connections hidden inside unlabelled data sets. This type of analysis is a form of unsupervised learning. There are many clustering algorithms, including the Self-Organizing Map (SOM), k-means clustering, k-Nearest Neighbour clustering and so on (Al-Wakeel & Wu, 2016). These techniques are useful because, with no prior information about the data set, they can recognize clusters through repeated comparisons of the input patterns until stable clusters are reached.

Each cluster contains data points that are similar to one another but differ significantly from the data points of other clusters. Clustering has a huge range of applications in image analysis, pattern recognition and so on. In their paper, the authors searched for segments using the k-means algorithm implemented in MATLAB code.

The authors describe the k-means algorithm as follows. K-means clustering is a centroid-based technique that takes k as the input parameter and then partitions the set of data objects into k groups. The purpose of clustering is to make the similarity within a group as high as possible and the similarity between groups as low as possible. The similarity of a cluster is measured by the mean value of the objects in the group, which can be regarded as the centroid of the group.

Methodology

On the other hand, researchers such as Dhanachandra, Manglem & Chanu (2015) describe k-means as a non-hierarchical, partitional clustering strategy suitable for classifying large amounts of data into multiple patterns or clusters. It is among the simplest and most widely used methods for data analysis, and it uses the squared-error criterion to determine the clusters.

For a given data set consisting of n numeric items and an integer k, it computes a partition of the patterns into k clusters. The procedure is iterative: it begins from an arbitrary partition and continues until it finds a partition of the n items that minimizes the within-cluster sum of squared errors (Dhanachandra, Manglem & Chanu, 2015). K-means is computed in four stages:

1) Initialize the k cluster centres so that they coincide with k arbitrarily chosen patterns, or with k randomly defined points inside the region that contains the pattern set.
2) Assign each pattern to the nearest cluster centre (group mean).
3) Recompute the cluster centres using the current cluster members.
4) Check for convergence of the clusters; if they have not converged, repeat from stage 2.
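
The four stages above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the program used for this paper: the function names `kmeans` and `within_cluster_sse`, the `max_iter` and `seed` parameters and the six toy points are all hypothetical, and convergence is taken to mean that no pattern changes its cluster between two passes.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Stage 1: pick k distinct patterns as the initial cluster centres.
    centres = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Stage 2: assign every pattern to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        # Stage 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Stage 3: recompute each centre as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centres, labels


def within_cluster_sse(points, centres, labels):
    """Squared-error criterion: sum of squared distances to own centre."""
    return sum(
        math.dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels)
    )


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centres, labels = kmeans(points, k=2)
print(labels, within_cluster_sse(points, centres, labels))
```

Because the initial centres are chosen at random, different seeds can lead to different local optima on less clearly separated data, so in practice the procedure is often repeated with several seeds and the run with the lowest squared error is kept.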

Analysis of dataset

Description of data

We selected a data set that describes the customers of an online distributor. It records annual spending, in monetary units, on different types of products. It consists of 8 columns and 348 rows of data that are used for the analysis.

Justification for selecting K-means algorithm over the DBSCAN

To segment the customers in the selected data set we chose the k-means algorithm. The primary reason is that clustering creates groups based on continuous variables (Al-Wakeel & Wu, 2016). Because this project tries to create customer groups from several attributes, clustering is a natural way to find the boundaries between groups in the selected data set.

In this project we work with multiple variables of interest from the data set, all of which are treated as input variables in the analysis. The resulting clusters can then be interpreted in light of the selected variables (Maldonado, Carrizosa & Weber, 2015).

K-means clustering is also useful in exploratory analysis, where it helps to build a picture of typical customer characteristics from the selected data set.

Homogeneity is another important factor in cluster analysis. With k-means clustering, the variance within each resulting group is found to be very small. In rule-based segmentation, by contrast, the resulting groups can contain customers whose attributes are actually very different from each other.

K-means also provides dynamic clustering results, because the cluster definitions change with every iteration and every time the algorithm runs. For real-time data clustering, this ensures that the resulting groups always reflect the current state of the data being analysed.

On the other hand, DBSCAN and other density-based clustering algorithms discover clusters as areas of high density separated by areas of low density, which can make it hard to determine the buyer clusters explicitly (Dhanachandra, Manglem & Chanu, 2015). The density-based spatial clustering of applications with noise (DBSCAN) algorithm classifies every available point as a core point, a border point or a noise point.

Core points are those that have at least MinPts points within the distance ε (Eps). Border points are points that are not core points but are neighbours of core points. Noise points are those that are neither core points nor border points.
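
The three point types can be illustrated with a short sketch. It only labels points and does not build the clusters themselves; the function name `classify_points`, the toy coordinates and the parameter values are hypothetical, and the sketch assumes that a point counts as a member of its own neighbourhood when comparing against MinPts.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' in the DBSCAN sense."""
    # Neighbourhood of a point: indices of all points within distance eps
    # (assumption: the point itself is included in the count).
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}
    kinds = []
    for i, nb in enumerate(neighbours):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # not core, but next to a core point
        else:
            kinds.append("noise")   # neither core nor border
    return kinds


# Hypothetical toy data: a dense square, one fringe point, one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (2.2, 1.0), (10.0, 10.0)]
kinds = classify_points(points, eps=1.5, min_pts=4)
print(kinds)
```

The result depends strongly on the two parameters, which is one practical reason the segment boundaries are harder to state explicitly than with a fixed number of k-means centres.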

Pre-processing on the dataset

The data used is available at archive.ics.uci.edu under the title “Wholesale Customer data”.

The data set includes 440 rows; for this assignment we reduced the number of rows to 348. Before analysing the data set, we pre-processed the data for better results. The summary statistics of the data set are as follows:

	Channel	Region	Fresh	Milk	Grocery	Frozen	DetergentsPaper	Delicatessen
count	348.000000	348.000000	348.000000	348.000000	348.000000	348.000000	348.000000	348.000000
mean	1.324713	2.422414	12027.491379	5583.442529	7762.818966	3091.054598	2814.166667	1576.672414
std	0.468942	0.829751	13143.047477	7131.588022	9206.970951	5100.776773	4654.992536	3109.744474
min	1.000000	1.000000	3.000000	55.000000	3.000000	33.000000	3.000000	3.000000
25%	1.000000	2.000000	2916.000000	1471.500000	2141.250000	779.000000	261.500000	408.000000
50%	1.000000	3.000000	8305.000000	3539.500000	4725.000000	1456.500000	771.000000	900.500000
75%	2.000000	3.000000	16850.500000	7190.250000	10550.000000	3505.250000	3971.500000	1795.750000
max	2.000000	3.000000	112151.000000	73498.000000	92780.000000	60869.000000	40827.000000	47943.000000

From the above table it can be seen that there are 348 rows in total and that each row records the monetary value spent on different product categories such as Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen.

For the products, we have the following statistics:

Product            mean            std             min    max
Fresh              12027.491379    13143.047477    3      112151
Milk               5583.442529     7131.588022     55     73498
Grocery            7762.818966     9206.970951     3      92780
Frozen             3091.054598     5100.776773     33     60869
Detergents_Paper   2814.166667     4654.992536     3      40827
Delicatessen       1576.672414     3109.744474     3      47943

At this stage we explored some details for a few arbitrarily selected customers by subtracting the mean and median values from the purchases of those customers, which gives results such as the following:

     Fresh     Milk   Grocery   Frozen  Detergents_Paper  Delicatessen
0   4198.0  -3758.0   -5998.0  -2238.0           -2644.0        -510.0
1 -11455.0   4180.0   14419.0   -870.0            2068.0         986.0
2  10294.0  -2367.0   -6316.0   -883.0           -2636.0        1025.0

From the above table we found that the first selected customer (row 0) buys more Fresh products than average, whereas the second customer (row 1) buys more Milk, Grocery, Detergents_Paper and Delicatessen than average but less Fresh and Frozen. Finally, the third customer (row 2) buys more Fresh and more Delicatessen than the other two customers.
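
The comparison against the average can be sketched as follows. The numbers are hypothetical and are not taken from the data set, the helper name `deviations` is an assumption, and rounding the column centre before subtracting is a choice made here only to keep the output readable.

```python
from statistics import mean, median

# Hypothetical annual spending for four customers (not from the data set).
rows = [
    {"Fresh": 1000, "Milk": 500},
    {"Fresh": 2000, "Milk": 700},
    {"Fresh": 3000, "Milk": 900},
    {"Fresh": 6000, "Milk": 1100},
]


def deviations(rows, samples, centre=mean):
    """Subtract the rounded column centre (mean or median) from each sample."""
    columns = list(rows[0])
    centres = {c: round(centre([r[c] for r in rows])) for c in columns}
    return [{c: s[c] - centres[c] for c in columns} for s in samples]


samples = rows[:2]  # two arbitrarily selected customers
print(deviations(rows, samples, centre=mean))
print(deviations(rows, samples, centre=median))
```

A positive value means the customer spends more than the typical customer on that category, and a negative value means less; with skewed spending data the median is usually the more representative reference point.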

In the next stage, in order to find the relationships between the features of the data set, we examined the distribution of each feature, which is depicted below.

From the above scatter plot it is evident that the Grocery and Detergents_Paper features have the highest correlation in the selected data set. The distributions are mostly right-skewed, and the data also suggest that high spending on frozen items is not paired with high spending on fresh food.

In order to reduce the skewness of the data, we applied a non-linear scaling of the features using the natural logarithm.
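
The effect of the logarithm on a right-skewed feature can be checked with a small sketch. The spending values are hypothetical (a doubling sequence chosen so that the effect is easy to see), and the helper names `skewness` and `log_transform` are assumptions; the skewness used is the simple population moment coefficient.

```python
import math


def skewness(values):
    """Population skewness: the mean cubed z-score."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return sum(((v - mu) / sd) ** 3 for v in values) / n


def log_transform(values):
    """Natural-log scaling; every value must be strictly positive."""
    return [math.log(v) for v in values]


# Hypothetical right-skewed spending values (each one doubles the last).
spending = [100, 200, 400, 800, 1600, 3200, 6400]
print(skewness(spending))                 # clearly positive: long right tail
print(skewness(log_transform(spending)))  # close to zero after the log
```

Since the logarithm is undefined at zero, a data set that contains zero spending would need an offset or a different transform; the minimum values reported in the summary table are all positive.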

Analysis of the Result

One metric that is commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to the data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the same as the number of data points (Dhanachandra, Manglem & Chanu, 2015). Therefore, this metric cannot be used as the sole target.
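
The behaviour of this metric can be seen in a minimal sketch that runs a compact k-means for every K from 1 up to the number of points. The toy points, the function names and the `seed` parameter are hypothetical; the initial centres are sampled from the data, which is why the metric reaches exactly zero when K equals the number of (distinct) points.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Compact k-means; returns (centres, labels)."""
    centres = random.Random(seed).sample(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centres[j]
            )
        if updated == centres:  # centres stopped moving
            break
        centres = updated
    return centres, labels


def mean_distance_to_centroid(points, k):
    """Average distance between each point and its own cluster centre."""
    centres, labels = kmeans(points, k)
    total = sum(math.dist(p, centres[lab]) for p, lab in zip(points, labels))
    return total / len(points)


# Hypothetical toy data: two groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
scores = {k: mean_distance_to_centroid(points, k)
          for k in range(1, len(points) + 1)}
for k, score in scores.items():
    print(k, round(score, 3))
```

On this toy data the largest drop occurs between K = 1 and K = 2, after which the gains become small; looking for such a bend, rather than for the minimum, is the usual way to read this metric.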

From the processed data we got the following clusters marked with the black outline.

From the above cluster analysis, it can be stated that each cluster depicted in the figure has a central point. These centres (or means) are not specific data points from the data set, but rather the averages of all the data points assigned to the respective clusters. For the problem of creating customer segments, a cluster's central point corresponds to the average customer of that cluster (Maldonado, Carrizosa & Weber, 2015). Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these centres by applying the inverse transformations.
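
A minimal sketch of the inverse transformations is given below. It assumes that the dimension reduction was a linear projection onto orthonormal directions after centring on the mean of the log-scaled data (as in principal component analysis); the source does not name the method, and all values, the `components` matrix and the function names are hypothetical.

```python
import math


def inverse_projection(z, components, mean):
    """Map a reduced point z back to the log-scaled feature space."""
    return [
        mean[j] + sum(z_i * row[j] for z_i, row in zip(z, components))
        for j in range(len(mean))
    ]


def recover_spending(z, components, mean):
    """Undo the projection, then undo the natural logarithm."""
    return [math.exp(v) for v in inverse_projection(z, components, mean)]


# Hypothetical values: three log-scaled features reduced to two dimensions.
log_mean = [math.log(1000.0), math.log(500.0), math.log(200.0)]
components = [[1.0, 0.0, 0.0],   # one row per retained direction
              [0.0, 1.0, 0.0]]
centre = [math.log(2.0), 0.0]    # a cluster centre in the reduced space

spending = recover_spending(centre, components, log_mean)
print([round(v, 2) for v in spending])
```

Any variation along the discarded directions is lost by the projection, so the recovered figures describe a representative customer for the segment rather than any actual customer in the data.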

It can additionally be combined with other decision strategies that can be built effectively on top of it, allowing the customer to redefine his criteria and preferences in light of the clusters computed. Moreover, the customer need not reveal his purchasing strategy but can rather use the scheme produced to shape it (Maldonado, Carrizosa & Weber, 2015).

The customer can combine data items from different sources, filter them using the range search and classify them, and is thereby able to integrate product catalogues from different suppliers.
The online store need not keep information about the customer apart from his preference rectangle and final purchase decisions, which supports ethical factors such as anonymous purchasing by capturing only changes in customer preferences per session.
In the case of mobile computing, the algorithm described in this paper can be run by the online shop's server cluster, and the customer can retrieve the filtered information on his mobile phone in incremental steps, or apply decision-making software to shape his personalized purchasing criteria and preferences.

Possible future work

For this project we used a small data set to implement cluster analysis with the k-means algorithm. In future, this work may be extended by applying the same algorithm to a larger data set in order to improve the efficiency of the developed algorithm.

Conclusion

The aim of customer segmentation is to anticipate customers' requirements accurately so that organizations can retain their customers. Consequently, organizations can improve the productivity of the business and profit from it by buying or manufacturing products in the right quantity, at the right time and at an optimum cost for their loyal customers.

To meet these stringent requirements, the k-means clustering strategy can be very helpful for forecasting and for determining future business strategies. Through this process it is possible to classify brands, items, durability, utility, convenience and so on. For instance, it can be determined which brands are grouped together when customer buying patterns include several specific brands at once.

Cite This Work

My Assignment Help. (2020). Using K-means For Customer Segmentation In E-commerce Essay.. Retrieved from https://myassignmenthelp.com/free-samples/cp5634-data-mining/algorithm-is-implemented.html.