Wine Quality Prediction

Sumantha.NTS
7 min read · May 23, 2021


Wine quality refers to the factors that go into producing a wine, as well as the indicators or characteristics that tell you if the wine is of high quality.

When you know what influences and signifies wine quality, you’ll be in a better position to make good purchases. You’ll also begin to recognize your preferences and how your favorite wines can change with each harvest. Your appreciation for wines will deepen once you’re familiar with wine quality levels and how wines vary in taste from region to region.

Factors that Affect Wine Quality

1. Soil

The mineral content of the soil and the groundwater determines the composition of acids and other trace minerals that influence the aroma of the wine.

2. Grape variety

Each grape variety has a distinct aroma and other features that play an important role in determining the kind of wine to be produced.

3. Climate

The climate can be a blessing or a curse for the grapes. Extreme sunshine, hailstorms, wind, frost, or rain can damage them. The average yearly temperature of the region should not fall below 10 degrees Celsius; the ideal average is 14 degrees Celsius.

4. Latitude

Most wine-producing countries lie between 30 and 50 degrees latitude. Countries near the 30-degree latitude have higher temperatures, which accelerate the fermentation process and produce poor-quality wine.

5. Viticulture

This is the most important factor. Every stage of viticulture (plowing, pruning, weeding, spraying, and harvesting) happens in a particular month of the year, depending on the weather.

6. Vinification

It refers to the method of making wine. Wine producers have many options at each stage of winemaking. Wine produced in the New World uses the latest technology, while the traditional winemaking countries follow the old methods. Each has its own characteristics.

7. Aging

Aging determines the character of the wine. The longer the wine is matured, the mellower and smoother it becomes, taking on the flavor of vanillin from the wood.

8. Storing

Wines should be stored at an appropriate temperature in rooms free from direct sunlight and vibration, and should not be subjected to extreme temperature fluctuations. Poor storage mars the quality of the wine.

Effects of Bad Wine Quality

  1. Liver disease
  2. Sulfite reactions
  3. Migraine headache
  4. Weight gain
  5. Breast cancer

Benefits of drinking Wine

Research on the benefits of wine consumption will continue, especially since it is believed to have several curative properties. In the meantime, people should remember to drink responsibly in order to avoid any of the adverse effects associated with alcohol consumption.

The benefits of drinking good-quality wine are as follows:

  1. Increases good cholesterol
  2. Relieves anxiety and tension
  3. Helps prevent ulcers
  4. Prevents cardiac fibrosis
  5. Improves lung function
  6. Protects the blood vessels from damage

Moderate and regular consumption of good-quality wine has many health benefits.

This project explains the parameters that govern wine quality and predicts that quality using machine learning algorithms.

Dataset Details:

The dataset is taken from Kaggle: https://www.kaggle.com/rajyellow46/wine-quality

The dataset consists of 6497 records with 12 input variables and ‘quality’ as the output variable. Sample data is shown below.

Sample data of wine dataset

There are a few missing values in some of the input variables, as shown below.

Missing values

The descriptive statistics are checked to decide how to impute the missing values.

Descriptive statistics for wine dataset

From the above descriptive statistics, the mean and median of each variable are almost equal.

Hence the missing values are imputed with the median of each variable.

Sample code for the same is given below.

Filling missing values using median values
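The imputation step can be sketched roughly as follows. This is a minimal sketch and not necessarily the author's exact code: the small DataFrame is made-up stand-in data, and the helper name `fill_missing_with_median` is hypothetical.

```python
import numpy as np
import pandas as pd


def fill_missing_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with NaNs in numeric columns replaced by the column median."""
    filled = df.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    # DataFrame.median() skips NaN by default, so each median comes from observed values only.
    filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
    return filled


# Toy stand-in for the wine data (values are made up for illustration).
sample = pd.DataFrame({
    "type": ["white", "red", "white", "red"],
    "fixed acidity": [7.0, np.nan, 8.1, 7.4],
    "pH": [3.0, 3.3, np.nan, 3.5],
})
clean = fill_missing_with_median(sample)
print(clean.isnull().sum())
```

Working on a copy keeps the raw data available for comparison, which is handy when checking how much the imputation changed each column.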

Exploratory Data Analysis (EDA):

After filling the missing values, exploratory data analysis is carried out to get insights from the data.

For every variable, the distribution is checked with a histogram and outliers with a boxplot.

The EDA graphs are presented below.

EDA of fixed acidity variable
EDA of Volatile acidity variable
EDA of Citric acid variable
EDA of Residual sugar variable
EDA of Chlorides variable

For the complete EDA of all variables, please refer to the GitHub repository linked at the end of this blog.
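A per-variable histogram and boxplot, as described above, could be produced along these lines. This is a sketch under assumptions: the function names are hypothetical, the demo values are made up, and the outlier count uses the 1.5 × IQR rule that matplotlib's boxplot applies to its whiskers by default.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_distribution(series: pd.Series):
    """Draw a histogram and a boxplot of one variable side by side."""
    values = series.dropna().to_numpy()
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(12, 4))
    ax_hist.hist(values, bins=30)
    ax_hist.set_title(f"Histogram of {series.name}")
    ax_box.boxplot(values)
    ax_box.set_title(f"Boxplot of {series.name}")
    return fig


def count_iqr_outliers(series: pd.Series) -> int:
    """Count points lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(outside.sum())


# Made-up values with one obvious extreme point.
demo = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="residual sugar")
fig = plot_distribution(demo)
print(count_iqr_outliers(demo))
```

Calling `plot_distribution` in a loop over the numeric columns would reproduce one figure per variable.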

The output variable, quality, does not have an equal number of records for each class, as shown below.

Output variable

Since the output variable is imbalanced, an oversampling technique is used to balance the classes.

Random Over Sampler is used as the oversampling technique. Sample code for the same is given below.

Random over sampler code
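To show what random oversampling does mechanically, here is a hand-rolled sketch using only pandas. It is not the author's code: the imbalanced-learn library offers a `RandomOverSampler` class for this purpose, while the function `random_oversample` and the toy data below are hypothetical illustrations of the same idea (duplicating randomly chosen minority-class rows until every class matches the largest one).

```python
import pandas as pd


def random_oversample(df: pd.DataFrame, target: str, random_state: int = 42) -> pd.DataFrame:
    """Resample each class with replacement until it is as large as the biggest class."""
    largest = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        parts.append(group)  # keep every original row
        shortfall = largest - len(group)
        if shortfall > 0:
            # Add randomly drawn duplicates of this class to close the gap.
            parts.append(group.sample(n=shortfall, replace=True, random_state=random_state))
    return pd.concat(parts).reset_index(drop=True)


# Made-up, imbalanced toy data: four 5s, two 6s, one 7.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 10.5, 11.2, 11.9, 12.8],
    "quality": [5, 5, 5, 5, 6, 6, 7],
})
balanced = random_oversample(toy, target="quality")
print(balanced["quality"].value_counts())
```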

Feature Selection:

Since most of the input variables are continuous, the correlation between them is computed to check for multicollinearity. The correlation is plotted as a heatmap, as shown below.

Correlation graph

From the above graph it is evident that free sulfur dioxide and total sulfur dioxide have a correlation of 0.72. Since the two variables carry largely overlapping information, one of them should be dropped for model building. I have dropped free sulfur dioxide.
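The same check can be done numerically instead of reading values off a heatmap. The sketch below lists every pair of numeric columns whose absolute Pearson correlation reaches a threshold; the function name, the 0.7 threshold, and the toy values are all assumptions made for illustration.

```python
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.7):
    """Return (col_a, col_b, r) for each pair of numeric columns with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs


# Made-up values: the two sulfur dioxide columns move together, pH does not.
toy = pd.DataFrame({
    "free sulfur dioxide": [10, 20, 30, 40, 50],
    "total sulfur dioxide": [30, 55, 80, 120, 140],
    "pH": [3.1, 3.4, 3.0, 3.3, 3.2],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

From each reported pair, one column can then be dropped before model building.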

Random forest and decision tree feature importances are used to select the important features, as represented below.

Random forest feature importance
Decision tree feature importance

From the above graphs, it is evident that the type variable is the least important feature for predicting the output. Hence it is dropped.
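Feature importances like those plotted above can be obtained from scikit-learn's tree-based models through their `feature_importances_` attribute. The following is a sketch on synthetic data, not the author's code: the helper `rank_features` and the columns `signal` and `noise` are hypothetical. On the real dataset, a categorical column such as type would first need to be encoded as numbers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def rank_features(model, X: pd.DataFrame, y) -> pd.Series:
    """Fit a tree-based model and return its impurity-based importances, highest first."""
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)


# Synthetic stand-in data: 'signal' determines the label, 'noise' is unrelated to it.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y_demo = (X_demo["signal"] > 0).astype(int)

rf_rank = rank_features(RandomForestClassifier(n_estimators=100, random_state=0), X_demo, y_demo)
dt_rank = rank_features(DecisionTreeClassifier(random_state=0), X_demo, y_demo)
print(rf_rank)
print(dt_rank)
```

Features that rank at the bottom for both models are candidates for removal.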

Based on the correlation and feature importance graphs, the ‘type’ and ‘free sulfur dioxide’ variables are dropped for model building, and the data is split into 80% train and 20% test as shown below.

Splitting the data
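Dropping the two variables and making the 80/20 split might look like the sketch below. The helper name `split_wine_data`, the `random_state` value, and the toy rows are assumptions; only the dropped column names, the target name, and the 80/20 ratio come from the text above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_wine_data(df: pd.DataFrame, target: str = "quality",
                    drop_cols=("type", "free sulfur dioxide")):
    """Drop the unused columns, separate the target, and split 80% train / 20% test."""
    X = df.drop(columns=[target, *drop_cols])
    y = df[target]
    # random_state is fixed only to make the sketch reproducible.
    return train_test_split(X, y, test_size=0.2, random_state=42)


# Made-up rows with a subset of the dataset's columns.
toy = pd.DataFrame({
    "type": ["white", "red"] * 5,
    "fixed acidity": [7.0, 6.3, 8.1, 7.2, 7.4, 7.8, 6.6, 8.3, 7.9, 6.9],
    "free sulfur dioxide": [45, 14, 30, 47, 11, 25, 15, 17, 9, 28],
    "alcohol": [8.8, 9.5, 10.1, 9.9, 9.4, 9.8, 11.0, 10.5, 9.2, 12.0],
    "quality": [6, 6, 6, 6, 5, 5, 5, 7, 7, 6],
})
x_train, x_test, y_train, y_test = split_wine_data(toy)
print(x_train.shape, x_test.shape)
```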

Model Building:

Different models have been built for the above dataset. The models are optimized by tuning their hyper-parameters with the Grid Search CV method.

Accuracy is used as the evaluation metric. The accuracies of the different models are tabulated below.

Accuracy Result
Result

From the above results, the XGBoost classifier is selected for building the web API.

Sample code for the XGBoost classifier is shown below.

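#importing libraries
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report
from xgboost import XGBClassifier
#initializing the model
#note: recent XGBoost releases expect class labels encoded as 0..n_classes-1 (e.g. via sklearn's LabelEncoder)
xgb = XGBClassifier(eval_metric='merror')
#fitting the model and predicting for test data
pred_xgb = xgb.fit(x_train,y_train).predict(x_test)
#plotting the confusion matrix
#(plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
print('CONFUSION MATRIX : ')
fig, ax = plt.subplots(figsize=(15, 8))
ConfusionMatrixDisplay.from_estimator(xgb, x_test, y_test, cmap='YlGn_r', ax=ax)
plt.show()
#printing the classification report
print('REPORT: ',classification_report(y_pred=pred_xgb,y_true=y_test))
#calculating accuracy
acc_xgb = accuracy_score(y_test,pred_xgb)
print('ACCURACY : ',acc_xgb)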

Result of the above code is shown below.

Confusion Matrix
Report

Sample code to check the hyper-parameters is shown below.

# checking hyper parameters
print(xgb.get_params().keys())

Result of the above code is shown below.

XG Boost Hyper parameters

Grid search CV is used for Hyper-parameter tuning. Sample code for the same is shown below.

#Hyper parameter tuning using GridSearchCV
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
#hyper parameters are
params = {'n_estimators' : [650],
          'max_depth':[10,15]
          }
#initializing the grid
grid = GridSearchCV(estimator=xgb,param_grid=params,cv=3,verbose=3,n_jobs=-1)
#fitting the model
grid.fit(x_train,y_train)
print('Best Score : ',grid.best_score_)
print('Best parameters : ',grid.best_params_,'\n')
#predicting for test data (grid.predict uses the best estimator found)
pred_xgb = grid.predict(x_test)
#plotting the confusion matrix
#(plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
print('CONFUSION MATRIX : ')
fig, ax = plt.subplots(figsize=(15, 8))
ConfusionMatrixDisplay.from_estimator(grid.best_estimator_, x_test, y_test, cmap='YlGn_r', ax=ax)
plt.show()
#printing the classification report
print('REPORT: ',classification_report(y_pred=pred_xgb,y_true=y_test))
#calculating accuracy
acc_xgb = accuracy_score(y_test,pred_xgb)
print('ACCURACY : ',acc_xgb)

Result of the above code is shown below.

Best score and parameters
Confusion Matrix
Report

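The model is exported as a pickle file so that the web API can use it. Sample code for pickling the model is shown below.

#pickling the files
import pickle
#a with-block closes the file even if dumping fails
#xgb is the model fitted above; to export the tuned model instead, dump grid.best_estimator_
with open('classifier.pkl','wb') as pickle_out:
    pickle.dump(xgb,pickle_out)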

Model Deployment:

The web API is developed with the help of the Flask framework. The pickled file is used to predict for new data.
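A Flask service of this kind could be structured roughly as follows. This is only a sketch of one possible layout, not the deployed application: the `/predict` route, the JSON shape with a `features` key, and the function names `create_app` and `load_model` are all hypothetical. The file name `classifier.pkl` is the one used in the pickling step.

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request


def load_model(path="classifier.pkl"):
    """Load the pickled classifier written during model building."""
    with open(path, "rb") as pickle_in:
        return pickle.load(pickle_in)


def create_app(model):
    """Build a Flask app that serves predictions from an already-loaded model."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(force=True)
        # Assumed request body: {"features": [[...], [...]]}, one inner list per wine sample,
        # with values in the same column order used for training.
        rows = np.asarray(payload["features"], dtype=float)
        predictions = model.predict(rows)
        return jsonify({"quality": [int(p) for p in predictions]})

    return app


# To serve locally, one could run: create_app(load_model()).run()
```

Passing the model into `create_app` rather than loading it at import time makes the route easy to exercise with Flask's test client and a stand-in model.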

The developed API is deployed on the Heroku platform.

A screenshot of the web API is shown below.

Screen-shot of web API

Link for Web API :

All the files are uploaded to the GitHub repository given in the link below.

Conclusion:

As moderate and regular consumption of wine has many health benefits, it is important to choose good-quality wine.


Sumantha.NTS

Data Science enthusiast. Eager to solve real-world problems related to AI.