Wine is, and has always been, a peculiar beverage. Since the dawn of modern civilization, wine has played a major role in both commerce and recreational social life. However, there are myriad differences between varieties of wine. For instance, wine from Spain may not be similar to wine from Italy and, more interestingly, wines from the same region may or may not be similar to each other. In their article on the composition of wine, Lingli, Na and Xueling (2016) identify wine as a key source of “Natural antioxidants” due to one of its components, the phenolic compounds, which are predominantly found in grapes, a major raw input for wine processing.
Grapes to a large extent define the type and quality of the processed wine. Therefore, in determining what influences differences in wine quality, it is prudent to examine the composition of wine processed in different regions and what the grapes and corresponding wines fetch in the market, as well as to establish the common preferences among wine lovers. Dharmadhikari (2013) notes that the chemical composition of grapes includes:
- Pectic substances
- Aroma compounds
- Organic acids
Besides the above, grapes are made of up to 70% water as well as other inorganic compounds.
The problem addressed in this paper is to determine whether there is a difference in the chemical composition of wine from different cultivars and, if there is, how this difference affects the quality of the wine.
Purpose of the paper
The purpose of this paper is to employ data mining techniques on a specific dataset in order to solve a given business problem. The data mining techniques include: Clustering, Classification, Association rules mining, Outlier detection, Pattern tracking, Regression, and Prediction. However, clustering will be the main data mining technique used throughout this paper.
Data mining is the systematic process of analyzing data and transforming it into useful forms that provide a basis for business decisions. Sunil et al. (2017) argue that data mining and machine learning are closely related, in that “…there is a significant overlap in the underlying principles and methodologies…” between the two fields. An article by the Economic Times (2018) defines data mining as the extraction of valuable information from raw data. According to the article, clustering is founded on the exploration and visualization of previously unknown patterns and facts present in a given dataset.
Process of data mining
Data mining is often a phased endeavor and therefore follows an outlined process to ensure that useful insights are drawn from the data. Brown (2017) identifies the “Cross-industry standard process for Data Mining (CRISP-DM)” as one of the most widely adopted data-mining frameworks. In his book on data mining, Brown states that a data mining process includes:
- Business understanding, i.e. identification of business goals and their assessment
- Data comprehension through description, exploration and quality verification
- Data preparation, i.e. cleaning
- Modeling of the data to determine underlying patterns
- Evaluation, i.e. examining the exposed patterns and determining their use in the given business scenario
- Deployment, the last stage of data mining, which involves reporting and practical use of the new insights
A cluster consists of objects grouped together according to their similarity in pre-defined characteristics, i.e. a class. Clustering is therefore the procedure of placing a group of objects into “classes of similar objects” (Tutorials point-Data mining, 2018). Among other uses, clustering is mainly used in pattern recognition and data analysis, where it is valued for its (Tutorials point-Data mining, 2018):
- Ability to handle noisy data
- Ability to handle high-dimensional data
There are a number of clustering techniques, including the partitioning method, the hierarchical method and the constraint-based method, among others.
Scope of study
This paper aims to explore the functionality of different data mining techniques so as to demonstrate the usefulness of the specific processes used in the data mining field. It additionally seeks to investigate the evolution of data mining technologies through practical use on a real dataset that is presumably drawn from a real business scenario. The paper does not explore the specific details of how different components of wine affect its quality; rather, it seeks to explore the patterns underlying wine production in different regions.
The data used in this study is obtained from the UCI Machine Learning Repository. The raw data contains 13 variables and one class identifier variable, i.e. the three different cultivars from which the wine was processed. In addition, there are a total of 158 data entries. The data variables include:
- Malic acid
- Total phenols
- Nonflavanoid phenols
- Color intensity
- OD280/OD315 of diluted wines
Due to the nature of our problem, i.e. exploring composition of wine from different regions, we split the data into subsets according to the identifier class to obtain three sub-sets.
In the raw data, the records are identified by a class attribute (1-3); we remove this attribute through filtering (Winter school, 2012).
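The splitting and filtering described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function name `split_by_class`, the assumption that the class label sits in the first column, and the sample records are all hypothetical.

```python
from collections import defaultdict

def split_by_class(rows, class_index=0):
    """Group records by class label and drop the label from each record."""
    subsets = defaultdict(list)
    for row in rows:
        label = row[class_index]
        # Filter out the class attribute so only the measured variables remain.
        features = row[:class_index] + row[class_index + 1:]
        subsets[label].append(features)
    return dict(subsets)

# Illustrative records only (class label, alcohol, malic acid); not taken from the dataset.
sample_rows = [
    [1, 14.2, 1.7],
    [2, 12.4, 1.1],
    [1, 13.2, 1.8],
    [3, 12.9, 1.4],
]
subsets = split_by_class(sample_rows)
```

Each key of `subsets` then corresponds to one cultivar, and each record holds only the chemical measurements, which is the form a clustering algorithm expects.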
Clustering falls under the category of unsupervised machine learning. It endeavors to expose how data is grouped and distributed. There exists a range of clustering techniques, which include K-means and Expectation Maximization (EM).
Simple K-Means Clustering
K-means clustering partitions the data into k clusters whose centers are derived from the data itself, so cluster membership remains flexible while the clusters are being built.
The general algorithm for K-means, according to Suman and Pooja (2014), is:
- Choose k objects as the initial cluster centers
- Calculate the distance from each data point to each cluster center
- If a data point is nearest to its own cluster, leave it there; otherwise move it to the nearest cluster
- Repeat the two preceding steps until the most suitable cluster has been found for every data point
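The steps listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used to produce the results in this paper: the function `kmeans`, its parameters and the sample points are hypothetical, Euclidean distance is assumed, and the recomputation of each center as the mean of its members is the standard update, which the list above does not spell out.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: returns (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Choose k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Measure distances and place each point in its nearest cluster.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels

# Illustrative two-dimensional points only; not taken from the wine data.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
centers, labels = kmeans(points, k=2)
```

On well-separated data such as this, the two groups of points end up in different clusters regardless of which objects are drawn as initial centers; on real data the outcome can depend on the initial choice.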
Expectation Maximization assigns every instance a probability of belonging to each cluster, rather than placing it in a single cluster outright. The number of clusters is determined through cross-validation in the model.
According to Sharova (2016), the steps of EM are:
- Conduct a probabilistic assignment of each data point to a class based on the current hypothesis h for the distributional class parameters
- Update the hypothesis h for the distributional class parameters given the new assignments
In the first step, the expected values of the cluster assignments are obtained, whereas in the second a maximum-likelihood estimate for the hypothesis is obtained.
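The two steps can be illustrated with a small one-dimensional example. The sketch below assumes that each class is a Gaussian, so that the hypothesis h consists of a mean, a variance and a mixing weight per class; the function names, starting values and data are hypothetical and are not taken from the study.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a one-dimensional Gaussian mixture; returns parameters and responsibilities."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    resp = []
    for _ in range(n_iter):
        # Expectation step: probabilistic assignment of each point to each class.
        resp = []
        for x in data:
            scores = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # Maximization step: maximum-likelihood update of the class parameters.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
    return means, variances, weights, resp

# Illustrative values only; not taken from the wine data.
data = [12.1, 12.3, 12.2, 12.4, 13.6, 13.8, 13.7, 13.9]
means, variances, weights, resp = em_gmm_1d(
    data, means=[12.0, 14.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

Each row of `resp` holds the membership probabilities of one data point, which is what distinguishes this approach from a method that assigns each point to exactly one cluster.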
Results and analysis
Figure 3: Region A
Figure 4: Region B
Figure 5: Region C
Figure 6: Region A
Figure 7: Region B
Figure 8: Region C
In employing both K-means clustering and Expectation Maximization clustering, the differences between the two sets of results are negligible. For instance, the mean Alcohol content of wine from region C (see Figure 8) is 12.9644 in the first cluster when applying Expectation Maximization clustering, whereas the mean for the same region when applying K-means clustering is 13.1537, a difference of 0.1893. In comparison, the mean of total phenols in wine from region B is 2.3524 in the second cluster when applying Expectation Maximization clustering, compared to a mean of 3.0345 under K-means clustering, a difference of 0.6821.
In a paper on the difference between the two methods of clustering, Yordan (2015) states that “…K-means define hard clusters”, where the samples are explicitly linked to the groups in question. Expectation Maximization, by contrast, assigns each sample a probability of membership in each group rather than a single fixed association. Expectation Maximization can therefore be seen as a generalized variant of K-means, one which tends to “…include the covariance structure of the data as well as centers of the latent Gaussians” (Yordan, 2015).
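The contrast between hard and probabilistic assignment can be made concrete with a short sketch. The centers, the sample value and the assumption of equally weighted Gaussians with a shared variance are all hypothetical and chosen only for illustration.

```python
import math

def hard_assignment(x, centers):
    """K-means style: index of the single nearest center."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))

def soft_assignment(x, centers, variance=1.0):
    """EM style: membership probabilities under equally weighted Gaussians."""
    scores = [math.exp(-(x - c) ** 2 / (2 * variance)) for c in centers]
    total = sum(scores)
    return [s / total for s in scores]

centers = [12.0, 14.0]  # hypothetical cluster centers
x = 12.8                # hypothetical sample lying between them
hard = hard_assignment(x, centers)
soft = soft_assignment(x, centers)
```

Here the hard assignment places the sample wholly in the first cluster, while the soft assignment gives it roughly 0.6 probability of belonging to the first cluster and 0.4 of belonging to the second, which preserves the information that the sample lies between the two groups.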
In most cases, the Expectation Maximization model is more reliable than the K-means model when the goal is to acquire data and plot it in a roughly normalized form, whereas K-means is suitable for, according to Zheng (2012), “…convergence points to which we get a general sectioning.” Hence, in a data mining scenario, K-means is used to explore a general dataset by iterating and gravitating inwards towards a possible division of the values held in the overall set, which is similar to machine learning except that there is no training or testing of data (Business Science, 2018). The major difference between the two clustering methods therefore lies in the fact that Expectation Maximization is useful when there is no need for a vector basis, whereas K-means usually does need one. Where the effectiveness of K-means clustering is in doubt, Expectation Maximization can be used as an alternative, since K-means clustering is convenient mainly for simple clustering tasks (Zheng, 2012).
From the results in the preceding section, it is evident that the components of wine from different regions differ. For example, the mean Alcohol content of wine from region A differs from that of regions B and C; the same is true of the other components, whose means also differ from region to region.
Clustering is a useful data mining technique for establishing the underlying data structures and for exploring distributions that were previously hidden in the data. It thereby provides up-to-date information that is important in business decision-making.
It is therefore important to choose the right tool for data mining in any given business set-up since it is through the data that any business grows eventually.
Recommendation for Further research
Following the research on the different data mining techniques and on how clustering specifically can be used to obtain insights from a given dataset, in this case the study of the “Difference in wine composition from different cultivars”, we recommend further research on how the differing composition of wine influences its quality and eventually its value, as well as on the economic impact of such differences on the socio-economic growth of the regions that grow the different cultivars.
Business science solutions. (2018). What is Clustering in data mining? What is its significance? Retrieved from: www.quora.com/whta-is-clustering-in-data-mining-What-is-its-significance
Brown et al. (2017). Machine Learning and Emotional Content Prediction. Centiment, 33(2), 21-24.
Dharmadhikari, M.R., Sebastian, D. & Jennifer, H. (2013). Iowa State Research Farm Progress Reports. Retrieved from: https://lib.dr.iastate.edu/farms_reports/332/
Economic Times. (2018). Data Mining. Retrieved from: www.m.economictimes.com/definition/datamining
Sharova, E. (2016). The Expectation Maximization Algorithm. Retrieved from: www.kdnuggets.com/2016/08/tutorial-expectation-maximization-algorithm.html
Suman, K. & Pooja, M. (2014). Comparison and Analysis of Various Clustering Methods in Data Mining On Education data set using the WEKA tool. International Journal of Emerging Trends & Technology in Computer Science, 3(2), 240-244. ISSN 2278-6856
Lingli, Z., Na, L. & Xueling, G. (2016). Phenolic Compounds and anti-oxidant Activity of Wines Fermented using Ten Blueberry Varieties. American Journal of Food Technology, 11(6), 291-297. DOI: 10.3923/ajft.2016.291.297
Sunil, C. (2014). Evaluating and Analyzing Clusters in Data Mining. Journal of Computer Science and Mobile Computing, 3(12), 345-349.
Tutorials point. (2018). Data mining- Cluster analysis. Retrieved from: www.tutorialspoint/data_mining/dm_cluster_analysis.htm
Yordan, P. R., Alexis, B. & Max, S.L. (2015). What to Do When K-Means Clustering Fails: A simple yet Principled Alternative Algorithm. Retrieved from: www.journals.plos.org/plosone/article?id=10.1371/journal.pone.0162259
Winter School. (2012). Data processing Techniques for data mining. Data mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets, 2(13), 139-144.
Zheng, Y. (2012). Clustering methods in Data Mining with its Applications in High Education. International Conference on Education Technology and Computer, 43(12), 36-54.