In this report, I will be exploring a dataset that contains review data for 1,599 red wine varieties. The dataset includes several chemical and physical properties, as well as the final overall quality assessment for each wine.
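The dataset has 13 variables and 1,599 observations. One variable, X, is simply a row index; the others are 11 chemical and physical measurements plus the quality rating. Each observation is from a different wine variety.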
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
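The two listings above are the kind of output that `str()` and `summary()` produce. The report's loading code is not shown, so the following is only a minimal sketch: it rebuilds three columns from the first ten values printed in the structure listing (the name `wine_head` is illustrative) and runs the same two calls on them.

```r
# First ten rows of three columns, copied from the str() listing above.
# The full dataset has 1,599 rows; `wine_head` is an illustrative name.
wine_head <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality   = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, median, mean and max per column
```

Because only ten rows are used, the printed summary will not match the full-dataset figures above.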
First, I’d like to get an overall idea of the quality of the wines in the dataset:
The minimum quality is 3, and the maximum is 8. Ratings are discrete integers, which makes categorization a bit easier. Most wines score a 5 or a 6 (the median is 6 and the mean is about 5.6), which is unsurprising. I wonder if there are any specific properties common to middling wines that result in such “average” ratings.
As I have no wine-tasting experience, I would also like to tease out which specific properties are common for a score of 8. I believe I will explore this in the bivariate section.
There are four separate variables specifically for acidity, so I imagine this is a very important factor in wine quality assessment.
##
## Fixed Acidity :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Standard Deviation: 1.7410963181277
##
## Volatile Acidity :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Standard Deviation: 0.179059704153535
##
## Citric Acid :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## Standard Deviation: 0.194801137405319
##
## pH :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## Standard Deviation: 0.154386464903543
It would appear that all wines tasted are acidic: even the highest pH, 4.01, is well below the neutral value of 7. At least one observation in each of the volatile acidity and citric acid variables sits far above the rest, while one pH observation is unusually low. I wonder if these all come from the same highly acidic wine.
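Whether the extremes belong to one wine can be checked by comparing row indices. This is a sketch on the first ten rows printed in the structure listing only (`acid_head` is an illustrative name), so it shows the technique rather than the answer for the full dataset.

```r
# First ten rows of the acidity columns, copied from the str() listing.
acid_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Row index of each extreme value.
extreme_rows <- c(
  max_volatile = which.max(acid_head$volatile.acidity),
  max_citric   = which.max(acid_head$citric.acid),
  min_pH       = which.min(acid_head$pH)
)
print(extreme_rows)

# TRUE only if all three extremes sit in the same row (the same wine).
same_wine <- length(unique(extreme_rows)) == 1
print(same_wine)
```

In this ten-row sample the highest citric acid and the lowest pH fall in the same row, while the highest volatile acidity does not. The same three calls on the full data frame would settle the question raised above.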
From the description of volatile acidity, this wine would have a high amount of acetic acid. I wonder if it had spoiled and turned to vinegar.
Most wines seem to have a fairly low amount of citric acid (the median is 0.26), and the majority have a volatile acidity greater than 0.3 and less than 0.8.
The mean pH of the wines is 3.3.
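The per-variable listings above pair a `summary()` with a standard deviation. A minimal sketch of a helper that prints that layout follows; `describe` is a hypothetical name and not necessarily how the report produced its output, and only the first ten pH values from the structure listing are used.

```r
# Print a summary plus the standard deviation for one variable.
# `describe` is a hypothetical helper, not the report's original code.
describe <- function(x, label) {
  cat("\n", label, ":\n")
  print(summary(x))
  s <- sd(x)
  cat("Standard Deviation:", s, "\n")
  invisible(s)  # return the standard deviation without printing it twice
}

# First ten pH values from the str() listing; the full column has 1,599.
ph_head <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
ph_sd <- describe(ph_head, "pH (first ten rows)")
```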
There are variables for residual sugars and chlorides, which respectively influence the sweetness and saltiness of the wine.
##
## Residual Sugar :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## Standard Deviation: 1.40992805950728
##
## Chlorides :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## Standard Deviation: 0.0470653020100901
Most wines seem to have lower quantities of residual sugars and chlorides, while the long right tails of both variables indicate that some very sweet and very salty wines are sprinkled throughout the sample.
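The long right tails can also be read off the summary figures alone, since in both cases the mean sits above the median. As a rough indicator, Pearson's second skewness coefficient, 3 × (mean − median) / sd, can be computed from the values printed above. This is a sketch of one conventional shortcut, not a calculation the report itself performed.

```r
# Pearson's second skewness coefficient: 3 * (mean - median) / sd.
# Positive values suggest a longer right tail.
pearson_skew <- function(mean, median, sd) {
  3 * (mean - median) / sd
}

# Inputs are the summary figures printed above for each variable.
skew_sugar     <- pearson_skew(2.539, 2.200, 1.40992805950728)
skew_chlorides <- pearson_skew(0.08747, 0.07900, 0.0470653020100901)

print(c(residual.sugar = skew_sugar, chlorides = skew_chlorides))
```

Both coefficients come out positive, which agrees with the right-skewed shape described above.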
##
## Free Sulfur Dioxide :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Standard Deviation: 10.4601569698097
##
## Total Sulfur Dioxide :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## Standard Deviation: 32.8953244782991
The free sulfur dioxide and total sulfur dioxide histograms are closely mirrored. The description of these variables indicates that they will minimally affect taste, but may affect the shelf stability and aging processes of the wines. Most wines seem to have relatively similar levels of sulfur dioxide.
This chart shows regular gaps in the sulfur dioxide ratings. I wonder if this is indicative of the use of rounding values, or if there is something physical going on here.
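One way to probe the rounding hypothesis is to test whether the recorded values are whole numbers. If they are, a histogram with narrow or rescaled bins would be expected to leave regular empty gaps. The sketch below uses only the first ten values printed in the structure listing, and `is_whole` is a hypothetical helper.

```r
# First ten sulfur dioxide values, copied from the str() listing.
free_so2_head  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2_head <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# TRUE when every value in x is a whole number.
is_whole <- function(x) {
  all(x == round(x))
}

print(is_whole(free_so2_head))
print(is_whole(total_so2_head))
```

All twenty printed values are whole numbers, which is consistent with rounding, although the full columns would need the same check.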
##
## Density :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## Standard Deviation: 0.00188733395384256
While there is very little variance in density, it can be noted that the majority of wines are less dense than water, with very few having a higher density than water. As sugar is more dense than water, and alcohol is less dense than water, I wonder if the density of a wine is a good predictor of the levels of its sugar and alcohol.
##
## Sulphates :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## Standard Deviation: 0.16950697959011
I’m not sure what to do with this data, but I do wonder how it affects the quality rating of wines, since it is listed as being used as an antimicrobial and antioxidant. Perhaps wines with lower levels of sulphates will have lower scores, as they may have slightly spoiled.
##
## Alcohol :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## Standard Deviation: 1.06566758184739
Most of the wines seem to fall between 9% and 12% abv, although there are some very strong wines in this list.
This dataset contains 1,599 wine varieties with 11 physical and chemical properties, one quality rating, and a row index (X). All physical and chemical properties are numerical floating-point values, while quality is an integer rating on a scale from 0 to 10, although no wines tested scored less than 3 or greater than 8.
Other Observations:
The main feature of interest in this dataset is the quality ranking of the wines. I would like to find out which variables can be used to reliably predict the quality of a wine. In particular, I would like to see which properties can maximize the perceived quality of a wine.
I believe residual sugars, acidity factors (particularly pH), chlorides, alcohol, and density will all be helpful to analyze. These factors are the biggest contributors to flavor (sweetness, sourness, saltiness, “booziness,” and mouthfeel).
I did not create any new variables in this dataset. These values are generally very straightforward, and I feel that I will tease out the most interesting relationships in this data in the bivariate analysis.
There were some very highly acidic outliers in the data. This is unusual, and I assume this is indicative of a wine that has turned to vinegar or otherwise gone bad. The residual sugar and chloride distributions were also very positively skewed with very long tails.
The only operations I performed were adjusting the scale and bins of a few of the histograms. I did this to better tease out a useful representation of the distributions.
## Alcohol - Quality correlation: 0.476166324001136
Alcohol and quality are moderately positively correlated, with a correlation of ~0.48, the strongest of any pairing examined here. We can see that stronger wines, on average, receive a higher rating. We can also see that, at the extremes, the highest rated wines are stronger.
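The correlation lines in this section can be produced with base R's `cor()`, which computes Pearson's r by default. A minimal sketch follows, assuming a hypothetical helper `report_cor` and using only the first ten rows printed in the structure listing, so the value it prints will differ from the full-dataset figure above.

```r
# Compute and print a labelled Pearson correlation.
# `report_cor` is a hypothetical helper, not the report's original code.
report_cor <- function(x, y, label) {
  r <- cor(x, y)  # method = "pearson" is the default
  cat(label, "correlation:", r, "\n")
  invisible(r)
}

# First ten rows from the str() listing; the full columns have 1,599 values.
alcohol_head <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality_head <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

r_head <- report_cor(alcohol_head, quality_head, "Alcohol - Quality")
```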
## Residual Sugar - Quality correlation: 0.0137316373400663
I have chosen this plot to identify if there is a relationship between the residual sugar in a wine and its quality. I suspected that it may, as the sweetness of a wine will be expected to affect an individual’s preference.
Residual sugar only has a correlation with quality of ~0.014. This is a very weak relationship.
## Sulphates - Quality correlation: 0.251397079069261
I have chosen this plot to identify if there is a relationship between the sulphate levels in a wine and its quality. I expected it would, as it affects the shelf-life and “freshness” of a wine.
Sulphates have a moderate correlation with quality, at ~0.25. This makes sense, as the description of the sulphates feature indicates that it should improve the “freshness” of the wine. This is worth exploring further.
## pH - Quality correlation: -0.0577313912053821
I have chosen this plot to identify if there is a relationship between the acidity level (pH) of a wine and its quality. As the acidity of a wine is a strong aspect of its flavor, I expected that it would have a strong correlation.
Acidity does not seem to be a strong predictor of quality: pH has only a very weak negative correlation with it, at ~-0.06. This is surprising.
## Density - Quality correlation: -0.174919227783349
I have chosen this plot to identify if there is a relationship between the density of a wine and its quality. I suspected that it would, as the density of a wine affects its mouthfeel.
There is a slight, but not insignificant, negative correlation between density and quality at ~-0.17.
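Whether a correlation this small is statistically distinguishable from zero can be estimated from r and the sample size alone, using the t statistic \(t = r\sqrt{(n-2)/(1-r^2)}\) with \(n-2\) degrees of freedom. The sketch below applies it to two of the correlations printed in this section. It assumes independent observations and treats the integer quality score as numeric, so it is an approximation; `cor_t_test` is a hypothetical helper.

```r
# Two-sided t-test for a Pearson correlation, from r and n only.
cor_t_test <- function(r, n) {
  t_stat  <- r * sqrt((n - 2) / (1 - r^2))
  p_value <- 2 * pt(-abs(t_stat), df = n - 2)
  c(t = t_stat, p = p_value)
}

n <- 1599  # observations in the dataset

# Correlations with quality, as printed in this section.
density_test <- cor_t_test(-0.174919227783349, n)
sugar_test   <- cor_t_test(0.0137316373400663, n)

print(density_test)
print(sugar_test)
```

With 1,599 observations, the density correlation is clearly distinguishable from zero even though it is weak, whereas the residual sugar correlation is not.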
## Chlorides - Quality correlation: -0.128906559930053
I have chosen this plot to identify if there is a relationship between the chloride levels in a wine and its quality. A wine with a higher level of chlorides should be saltier, so I suspect that it should have some effect.
Chlorides have a slight negative correlation with quality at ~-0.13. It would seem that saltier wines are less favorable.
## Residual Sugar - Chlorides correlation: 0.0556095352035321
Up to this point, I have only compared the relationships between chemical and physical properties of a wine with its quality. I have chosen this plot to identify if there is a relationship between residual sugars and chlorides.
Residual sugars and chlorides have a very weak correlation at ~0.06. As such, the level of residual sugars does not seem to have a significant bearing on the level of chlorides.
## Residual Sugar - Alcohol correlation: 0.0420754372097311
I have chosen this plot to identify if there is a relationship between alcohol and residual sugars. I suspected there would be, as alcohol is a by-product of yeast consuming sugar.
Residual sugars and alcohol have a very weak correlation at ~0.04. This, surprisingly, appears to be another dead-end.
## Density - Residual Sugar correlation: 0.355283370983376
I have chosen this plot to identify if there is a relationship between residual sugars and the density of a wine. I assumed there would be, as sugar is denser than water.
Residual sugars and density have a moderate positive correlation at ~0.36! This is the only notable correlation I have found among the chemical and physical properties themselves. I assume this has to do with the high relative density of sugar. If a wine contains more sugar left over from the fermentation process, it will naturally be more dense. Even if this is only an observation of a physical property, it is still an interesting relationship to see.
There was a clear positive relationship between the quality of the wine and several properties. Namely, average quality ratings increased with alcohol content and sulphate levels. Stronger wines performed better on average, and all of the top performing wines except for two had an ABV of at least 11%, whereas all of the worst performing wines were 11% or under.
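Group averages like the ones described above can be computed with `tapply()`. The sketch below groups alcohol content by quality score using only the first ten rows printed in the structure listing (`wine_head` is an illustrative name), so the numbers illustrate the mechanics rather than the full-dataset trend.

```r
# First ten rows from the str() listing; the full dataset has 1,599.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean alcohol content for each quality score present in the sample.
mean_alcohol <- tapply(wine_head$alcohol, wine_head$quality, mean)
print(mean_alcohol)
```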
The relationship between sulphates and quality was particularly interesting. While the average quality did increase with sulphate levels, it seems to cap off around 1.2. Levels higher than this result in scores averaging around 5.
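There was only a very weak negative relationship between quality and pH (~-0.06). Because a lower pH means a more acidic wine, this suggests that more acidic wines receive marginally higher ratings, although the effect is too small to rely on.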
Density and chlorides also displayed a negative relationship with quality, with both saltier and denser wines performing worse on average.
I did not find any significant relationships between other features. I compared residual sugars and chlorides, and found nothing of note.
I did find it interesting that there was no significant relationship between residual sugars and alcohol content. I had expected lower residual sugar to yield a higher ABV wine, as the yeast would have consumed more of it. Since I did not find a relationship, I suspect the residual sugars may have more to do with the tolerance levels of the yeast strain used in fermentation than with the total amount of sugar consumed by the yeast.
The strongest relationship I found was by far the relationship between alcohol and quality. The correlation between alcohol and quality was ~0.48. This was almost twice as strong of a correlation as the second strongest, which was sulphates and quality at a correlation of ~0.25.
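The "almost twice as strong" comparison refers to the correlation coefficients themselves. Another common lens is the squared correlation, which for a simple linear fit gives the share of variance in quality associated with each variable. The arithmetic below uses the two correlations printed earlier in the report.

```r
# Correlations with quality, as printed earlier in the report.
r_alcohol   <- 0.476166324001136
r_sulphates <- 0.251397079069261

ratio_r  <- r_alcohol / r_sulphates      # ratio of the correlations
ratio_r2 <- r_alcohol^2 / r_sulphates^2  # ratio of the squared correlations

print(c(r2_alcohol = r_alcohol^2, r2_sulphates = r_sulphates^2))
print(c(ratio_r = ratio_r, ratio_r2 = ratio_r2))
```

On the squared scale the gap is wider: alcohol accounts for roughly 23% of the variance in quality against about 6% for sulphates.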
The three strongest factors in the quality of a wine were its alcohol, density, and sulphates. As such, I would like to test the relationships between these three factors and their effect on quality.
Here, we can clearly see the relationship between quality and alcohol, but the relationship between density and quality is not immediately apparent.
This may be indicative that the relationship between quality and density is overshadowed by the strength of the relationship between quality and alcohol. There does not appear to be a synergistic relationship between the two features, or, if there is, then it is very weak.
Here, we have isolated “low quality” wines (those scoring less than 4). They appear to be fairly randomly spread out.
Here, we have isolated “high quality” wines (those scoring greater than 6). These points appear similarly random compared to the low quality wines. This appears to be a dead-end. I will stick to my original hypothesis that density and alcohol do not have a synergistic relationship.
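The quality groups used in these plots can be sketched with `cut()`, taking the thresholds stated above: "low" for scores below 4, "high" for scores above 6, and "medium" for everything in between. The report's own grouping code is not shown, so this is one possible implementation, run on the first ten quality scores from the structure listing.

```r
# First ten quality scores from the str() listing.
quality_head <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Intervals are right-closed by default: (-Inf, 3], (3, 6], (6, Inf).
quality_group <- cut(
  quality_head,
  breaks = c(-Inf, 3, 6, Inf),
  labels = c("low", "medium", "high")
)

print(table(quality_group))
```

A grouped column like this can then be used to subset the data or to color points in a scatterplot.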
This is a much stronger relationship. We can see a cluster of low quality wines where alcohol is under 10% and sulphate levels are less than 0.8.
“Medium” quality wines begin to cluster around the middle of the group, showing that wines with greater than 10% abv and sulphate levels greater than 0.6 or so are given average ratings.
High quality wines begin to cluster where alcohol levels surpass 12% and sulphates are at least 0.7.
Wines with less than 10% abv seem to do poorly across the board, regardless of sulphate levels.
Again, we do not see a strong synergy between density and sulphates. I suspect separating high and low quality wines will yield a similar result to alcohol and density.
I could not discern any influence from density, but there appeared to be a synergistic relationship between alcohol and sulphates. High-alcohol, high-sulphate wines performed much better than low-alcohol, low-sulphate wines. There seemed to be a minimum threshold of alcohol for a wine to do well (almost all wines under 10% abv performed poorly).
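A conventional way to test for a synergistic effect, which this report did not do, is a linear model with an interaction term. The sketch below fits `quality ~ alcohol * sulphates` on the first ten rows printed in the structure listing. Ten rows are far too few for meaningful estimates, so this only shows the shape of the model; `wine_head` is an illustrative name.

```r
# First ten rows from the str() listing; the full dataset has 1,599.
wine_head <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# `alcohol * sulphates` expands to both main effects plus their interaction.
fit <- lm(quality ~ alcohol * sulphates, data = wine_head)
print(coef(fit))
```

On the full dataset, an interaction coefficient clearly different from zero would support the synergy suggested by the scatterplots.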
The very best performing wines were high-alcohol, high-sulphates, while the very worst performing wines were low-alcohol, low-sulphates.
It was surprising to see how small a role density played in the quality of a wine after alcohol content and sulphates are taken into account. By itself, density had a correlation with quality of ~-0.175. While this is not negligible, it is completely overshadowed by alcohol and sulphates.
Here, we can see both the distribution of quality ratings along the x-axis, as well as the quality rating of wines based on their alcohol percent. We can see that most wines received ratings of 5 or 6, and that the highest rated wines are only found at higher alcohol percentages.
We can also see the inverse relationship, in that the worst rated wines are overwhelmingly found among the weaker wines.
Wines between 9% and 10% represent the majority of wines that received a quality rating of 5, whereas wines that received a quality rating of 6 have a much larger variance in their alcohol content.
This chart demonstrates the relationship between alcohol and sulphates, and its effect on quality.
A large cluster of low-quality wines can be found in the lower-left corner of this chart, indicating that wines with low alcohol content and low sulphate levels are perceived as inferior.
As we move up and to the right, we see a clear increase in quality, with average-quality wines clustering in the middle of the distribution.
The highest quality wines are only found where higher alcohol contents meet higher sulphate levels. It can be seen, however, that slightly lower levels of sulphates can be made up for with much higher alcohol levels, as many wines with sulphate levels less than \(1.0\ \mathrm{g/dm^3}\) still receive high ratings when their alcohol content is greater than 12% abv.
I selected this chart because it showed the strongest relationship between two non-main features of interest. There is a clear positive relationship between residual sugars and density. There is a moderate correlation at ~0.36.
As there are many wines with the same density, only the mean residual sugar is shown for each point of density.
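Collapsing the data to one mean per density value can be done with `aggregate()`. The sketch below uses the first five rows printed in the structure listing, with density rounded to three decimals as it appears there; `density_head` is an illustrative name.

```r
# First five rows from the str() listing (density as printed, rounded).
density_head <- data.frame(
  density        = c(0.998, 0.997, 0.997, 0.998, 0.998),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9)
)

# One row per distinct density, holding the mean residual sugar.
mean_sugar <- aggregate(residual.sugar ~ density, data = density_head, FUN = mean)
print(mean_sugar)
```

Plotting the aggregated means instead of the raw points reduces overplotting where many wines share a density.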
As sugar is more dense than water, this graph demonstrates how the density increases with every bit of sugar that has not been metabolized by the yeast. As such, residual sugars do not only affect the sweetness of a wine, but indirectly affect the density, and, as such, the mouthfeel of the wine.
In this report, I explored the relationships between the chemical and physical properties of almost 1,600 red wine varieties and their impact on quality assessment. I found that the strongest predictors for quality were alcohol content and sulphate levels. Other chemical and physical properties only had a small correlation with quality.
I was surprised that residual sugars did not have a stronger effect on the quality of wines, with a correlation of only ~0.014. I had imagined that higher residual sugars would result in a sweeter wine, which would affect the quality in some capacity. After further thought, I believe that wines may have been rated on their expected sweetness based on the type of wine they represent. If a wine is expected to be dry, a low residual sugar level should not negatively impact its quality. This would mean that the residual sugars alone should not heavily impact wine quality.
My greatest difficulty was trying to find relationships between the physical and chemical features. While most features influenced quality to some degree, there was very little relationship amongst the features themselves. I was only able to find a moderate correlation between residual sugars and density. While this does explain an expected relationship, it does not provide much predictive insight. I wish I had been able to find a stronger relationship, but this may be the strongest that is readily discoverable.
I believe the combined effect of sulphates and alcohol on quality provided very strong footing for insight. Due to the strength of this relationship, I believe I was able to provide what, in my opinion at least, were insightful visualizations.
I have mostly only performed descriptive statistics and visualization in this report. If I had more time, I would like to have performed more inferential statistics. Ideally, I would like to have implemented rigorous mathematical models and performed predictive analysis to see if I could predict the quality rating of a wine based on inputs of its physical and chemical properties.