Wine Quality Prediction

May 23, 2021

Wine quality refers to the factors that go into producing a wine, as well as the indicators or characteristics that tell you if the wine is of high quality.

When you know what influences and signifies wine quality, you’ll be in a better position to make good purchases. You’ll also begin to recognize your preferences and how your favorite wines can change with each harvest. Your appreciation for wines will deepen once you’re familiar with wine quality levels and how wines vary in taste from region to region.

Factors that affect Wine Quality

1. Soil

The mineral content of the soil and the groundwater determines the composition of acids and other trace minerals that influence the aroma of the wine.

2. Grape variety

Each grape variety has a distinct aroma and other features that play an important role in determining the kind of wine to be produced.

3. Climate

The climate can be a blessing or a curse for the grapes. Extremes of sunshine, hail, wind, frost, or rain can damage them. The average yearly temperature of the region should not be below 10 degrees Celsius; the ideal average temperature is 14 degrees Celsius.

4. Latitude

Most wine-producing countries lie between 30 and 50 degrees latitude. Countries near 30 degrees latitude have higher temperatures, which accelerate the fermentation process and produce poor-quality wine.

5. Viticulture

This is the most important factor. Every stage of viticulture (plowing, pruning, weeding, spraying, harvesting, and so on) happens in a particular month of the year, depending on the weather.

6. Vinification

It refers to the method of making wine. The wine producers have a lot of options before them at each stage of making wine. The wine produced in the new world uses the latest technology while the traditional winemaking countries follow the old wine-making methods. Each has its own characteristics.

7. Aging

Aging determines the character of the wine. The longer the wine matures, the mellower and smoother it becomes, taking on the flavor of vanillin from the wood.

8. Storing

Wines should be stored at an appropriate temperature, in rooms free from direct sunlight and vibration, and should not be subjected to extreme temperature fluctuations. Poor storage mars the quality of the wine.

Effects of Bad Wine Quality

Liver disease
Sulfite reactions
Migraine Headache
Weight gain
Breast Cancer

Benefits of drinking Wine

Research on the benefits of wine consumption will continue, especially since it is believed to have several curative properties. In the meantime, people should remember to drink responsibly in order to avoid any of the adverse effects associated with alcohol consumption.

The benefits of drinking good-quality wine are as follows:

Increases good cholesterol
Relieves anxiety and tension
Helps prevent ulcers
Prevents cardiac fibrosis
Increases lung function
Protects the blood vessels from damage

Moderate and regular consumption of good-quality wine has many health benefits.

This project explains the parameters that govern the quality of wine and predicts that quality using machine learning algorithms.

Dataset Details:

The dataset is taken from Kaggle: https://www.kaggle.com/rajyellow46/wine-quality

The dataset consists of 6497 records with 12 input variables and ‘quality’ as the output variable. Sample data is shown below.

A few of the input variables have missing values, as shown below.

The descriptive statistics are checked to decide how to impute the missing values.

From the descriptive statistics above, the mean and median of every variable are almost equal.

Hence the missing values are imputed with the median values.

Sample code for this is given below.

Filling missing values using median values
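The imputation step can be sketched as follows. This is a minimal illustration, assuming the data sits in a pandas DataFrame; the helper name fill_missing_with_median and the toy values are made up for the example and are not taken from the Kaggle file.

```python
import pandas as pd


def fill_missing_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with NaNs in numeric columns replaced by the column median."""
    filled = df.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    # DataFrame.median skips NaN values by default
    medians = filled[numeric_cols].median()
    filled[numeric_cols] = filled[numeric_cols].fillna(medians)
    return filled


# Hypothetical toy data standing in for the wine dataset
toy = pd.DataFrame({
    "type": ["white", "red", "white", "red"],
    "pH": [3.0, None, 3.2, 3.4],
    "sulphates": [0.5, 0.6, None, 0.7],
})
toy_filled = fill_missing_with_median(toy)
print(toy_filled.isna().sum())
```

The median is less sensitive to outliers than the mean, so it remains a reasonable choice even for variables where the two differ slightly.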

Exploratory Data Analysis (EDA):

After filling the missing values, exploratory data analysis is carried out to get insights from the data.

For every variable, the distribution is checked with a histogram and outliers are checked with a boxplot.

The EDA graphs are presented below.

For the complete EDA of all variables, please refer to the GitHub repository given at the end of this blog.
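A histogram and boxplot pair for a single variable can be produced roughly as below. This is only a sketch, assuming matplotlib and pandas; the function name plot_distribution and the sample values are hypothetical, and the repository's own plotting code may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_distribution(series: pd.Series):
    """Draw a histogram and a boxplot of one numeric variable side by side."""
    values = series.dropna().to_numpy()
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].hist(values, bins=10)          # shape of the distribution
    axes[0].set_title(f"{series.name}: histogram")
    axes[1].boxplot(values)                # points beyond the whiskers are outliers
    axes[1].set_title(f"{series.name}: boxplot")
    return fig, axes


# Hypothetical values standing in for one column of the dataset
alcohol = pd.Series([9.4, 9.8, 10.1, 10.5, 11.0, 11.2, 12.3, 14.9], name="alcohol")
fig, axes = plot_distribution(alcohol)
```

In a notebook, drop the matplotlib.use("Agg") line and call plt.show() to display the figure.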

The output variable, quality, does not have an equal number of records for each class, as shown below.

Since the output variable is imbalanced, an oversampling technique is used to balance it.

A random over-sampler is used as the oversampling technique. Sample code for this is given below.
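Random oversampling duplicates randomly chosen rows of the smaller classes until every class is as large as the biggest one. The original call is not shown here; imbalanced-learn's RandomOverSampler is one library implementation of the technique. The pandas-only sketch below illustrates the same idea, with a hypothetical helper name and toy data.

```python
import pandas as pd


def random_oversample(df: pd.DataFrame, target: str, seed: int = 42) -> pd.DataFrame:
    """Duplicate randomly chosen rows of each smaller class until every class
    has as many rows as the largest class."""
    max_count = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        parts.append(group)
        extra = max_count - len(group)
        if extra > 0:
            # sample with replacement so even a one-row class can be grown
            parts.append(group.sample(n=extra, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# Hypothetical toy data: class 5 has three rows, class 6 two, class 7 one
toy = pd.DataFrame({
    "alcohol": [9.5, 10.1, 11.0, 12.3, 9.8, 10.5],
    "quality": [5, 5, 5, 6, 6, 7],
})
balanced = random_oversample(toy, target="quality")
print(balanced["quality"].value_counts())
```

Because the added rows are exact copies, oversampling only the training portion keeps duplicated records out of the test set.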

Feature Selection:

Since most of the input variables are continuous, their correlation is plotted to check for multicollinearity among the inputs. The correlation is plotted as a heatmap, shown below.

From the above graph it is evident that free sulfur dioxide and total sulfur dioxide have a correlation of 0.72. One of the two should therefore be left out of model building, and I have dropped free sulfur dioxide.
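Besides reading the heatmap, strongly correlated pairs can be listed programmatically. The sketch below assumes pandas and NumPy; the helper name, the 0.7 threshold, and the synthetic data are assumptions for illustration, not values from the project.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.7):
    """Return (col_a, col_b, r) for column pairs whose absolute Pearson
    correlation is at least `threshold`."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs


# Synthetic data: "total" is built from "free" plus noise, pH is independent
rng = np.random.default_rng(0)
free = rng.normal(30, 10, 200)
toy = pd.DataFrame({
    "free sulfur dioxide": free,
    "total sulfur dioxide": 3 * free + rng.normal(0, 10, 200),
    "pH": rng.normal(3.2, 0.1, 200),
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for removal.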

Random forest and decision tree feature importances are used to select the important features, as shown below.

From the above graphs, it is evident that the type variable is the least important feature for predicting the output. Hence it is dropped.
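Feature importances of this kind can be obtained from a fitted tree ensemble. The following is a rough sketch using scikit-learn's RandomForestClassifier on synthetic data; the column names "informative" and "noise" and the helper rank_features are hypothetical and only show the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_features(X: pd.DataFrame, y, seed: int = 0) -> pd.Series:
    """Fit a random forest and return impurity-based importances, highest first."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


# Synthetic data: the label depends only on the "informative" column
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = (X["informative"] > 0).astype(int)

ranking = rank_features(X, y)
print(ranking)
```

Features at the bottom of the ranking contribute little to the splits and are candidates for removal.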

Based on the correlation and feature importance graphs, the ‘type’ and ‘free sulfur dioxide’ variables are dropped for model building, and the entire dataset is split into 80% train and 20% test as shown below.
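Dropping the two columns and making the 80/20 split might look like the sketch below, which uses scikit-learn's train_test_split. The toy frame and the random_state value are assumptions; the variable names x_train, x_test, y_train, and y_test match those used in the model code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a few of the dataset's columns
toy = pd.DataFrame({
    "type": ["white", "red"] * 5,
    "free sulfur dioxide": [30.0, 11.0, 45.0, 17.0, 28.0, 9.0, 35.0, 15.0, 40.0, 12.0],
    "total sulfur dioxide": [120.0, 34.0, 170.0, 60.0, 110.0, 40.0, 140.0, 54.0, 160.0, 45.0],
    "alcohol": [9.5, 9.8, 10.1, 10.5, 11.0, 11.2, 12.3, 9.4, 10.0, 12.8],
    "quality": [5, 5, 6, 6, 6, 7, 7, 5, 6, 7],
})

# Drop the two neglected inputs and separate the target
X = toy.drop(columns=["type", "free sulfur dioxide", "quality"])
y = toy["quality"]

# test_size=0.2 gives the 80% / 20% split; random_state makes it repeatable
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape)
```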

Model Building:

Several models were built for the above dataset. The models are tuned over their hyperparameters using the GridSearchCV method.

Accuracy is used as the evaluation metric. The accuracies of the different models are tabulated below.

Based on the above results, the XGBoost classifier is chosen for building the web API.

Sample code for the XGBoost classifier is shown below.

# importing libraries
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report
from xgboost import XGBClassifier

# initializing the model
xgb = XGBClassifier(eval_metric='merror')
# fitting the model and predicting for test data
pred_xgb = xgb.fit(x_train, y_train).predict(x_test)

# plotting the confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
print('CONFUSION MATRIX : ')
fig, ax = plt.subplots(figsize=(15, 8))
ConfusionMatrixDisplay.from_predictions(y_test, pred_xgb, cmap='YlGn_r', ax=ax)
plt.show()

# printing the classification report
print('REPORT: ', classification_report(y_pred=pred_xgb, y_true=y_test))
# calculating accuracy
acc_xgb = accuracy_score(y_test, pred_xgb)

Result of the above code is shown below.

Sample code to check the hyperparameters is shown below.

# checking hyper parameters
xgb.get_params().keys()

Result of the above code is shown below.

XGBoost hyperparameters

GridSearchCV is used for hyperparameter tuning. Sample code for the same is shown below.

# Hyper parameter tuning using GridSearchCV
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

# hyper parameters are
params = {'n_estimators': [650],
          'max_depth': [10, 15]
         }

# initializing the grid
grid = GridSearchCV(estimator=xgb, param_grid=params, cv=3, verbose=3, n_jobs=-1)
# fitting the model
grid.fit(x_train, y_train)
print('Best Score : ', grid.best_score_)
print('Best parameters : ', grid.best_params_, '\n')

# predicting for test data
pred_xgb = grid.predict(x_test)

# plotting the confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
print('CONFUSION MATRIX : ')
fig, ax = plt.subplots(figsize=(15, 8))
ConfusionMatrixDisplay.from_predictions(y_test, pred_xgb, cmap='YlGn_r', ax=ax)
plt.show()

# printing the classification report
print('REPORT: ', classification_report(y_pred=pred_xgb, y_true=y_test))
# calculating accuracy
acc_xgb = accuracy_score(y_test, pred_xgb)

Result of the above code is shown below.

Best score and parameters

The model is exported as a pickle file so that the web API can use it. Sample code for pickling the model is shown below.

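# pickling the model
import pickle

# `xgb` is the model fitted with default parameters;
# pickle grid.best_estimator_ instead to export the tuned model
with open('classifier.pkl', 'wb') as pickle_out:
    pickle.dump(xgb, pickle_out)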
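On the serving side, the pickled file is loaded once and reused for predictions. The round trip can be sketched as below; a small scikit-learn decision tree stands in for the trained XGBoost model (an assumption made only to keep the example self-contained), and a temporary directory is used instead of the project folder.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Stand-in for the trained classifier: one feature, two quality classes
model = DecisionTreeClassifier(random_state=0)
model.fit([[9.0], [10.0], [12.0], [13.0]], [5, 5, 7, 7])

# Write the model to disk
path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, for example when the web app starts: load the model back
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The loaded object predicts exactly like the original one
prediction = loaded.predict([[12.5]])
print(prediction)
```

The library versions used for loading should match those used for pickling, since pickled estimators are not guaranteed to load across versions.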

Model Deployment:

The web API is developed with the Flask framework. The pickled model is used to predict on new data.

The developed API is deployed on the Heroku platform.
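A prediction endpoint of this kind can be sketched in Flask as follows. This is not the deployed app's code: the /predict route, the JSON payload shape, the FEATURES list, and the ConstantModel stand-in are all assumptions for illustration. In the real app the model would be the object loaded from classifier.pkl.

```python
from flask import Flask, jsonify, request

# Hypothetical feature order; a real app must use the columns the model was trained on
FEATURES = ["alcohol", "pH", "sulphates"]


def create_app(model):
    """Build a Flask app that serves predictions from `model`."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            return jsonify({"error": f"missing fields: {missing}"}), 400
        # One row, with features in the order the model expects
        row = [[float(payload[name]) for name in FEATURES]]
        quality = model.predict(row)[0]
        return jsonify({"quality": int(quality)})

    return app


class ConstantModel:
    """Stand-in for the unpickled classifier; always predicts quality 6."""

    def predict(self, rows):
        return [6 for _ in rows]


app = create_app(ConstantModel())

# Exercise the endpoint with Flask's built-in test client
client = app.test_client()
response = client.post("/predict", json={"alcohol": 10.5, "pH": 3.2, "sulphates": 0.6})
print(response.get_json())
```

Passing the model into create_app keeps the route logic independent of how the model is loaded, which makes the endpoint easy to test with a stub.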

The screen-shot of the web API is shown below.

Link for Web API: wine--quality.herokuapp.com

All the files are uploaded to the GitHub repository linked below.

github.com/sumantha-NTS/Wine-Quality-Prediction

Conclusion:

As moderate and regular consumption of wine has many health benefits, it is important to choose good-quality wine.

References:

Flask Documentation (2.0.x): flask.palletsprojects.com

XGBoost Documentation (xgboost 1.5.0-SNAPSHOT): xgboost.readthedocs.io

Heroku Documentation: devcenter.heroku.com


10 Factors Influencing The Character Of Wine

Written by Sumantha.NTS