DATA ANALYST
CHOCOLATE RATINGS
Tools
Microsoft Excel
Python:
Pandas.
Jupyter.
Seaborn.
Anaconda.
Matplotlib.
NumPy.
Scikit-learn.
Tableau
Skills
Use of open-source data.
Data cleaning, wrangling and subsetting.
Creating Geographical visualizations.
Supervised machine learning with linear regression.
Unsupervised machine learning with k-means clustering.
Sourcing and analyzing time-series data.
Creating a dashboard.
Data
The Chocolate Bar 2020 dataset from Flavors of Cacao outlines over 2,700 types of plain dark chocolate bars, with their ratings, ingredients, and tastes. The dataset used here was acquired from Rachael Tatman's Chocolate Bar Ratings dataset on Kaggle.
The goal is to learn about chocolate: identify the countries that produce the best cocoa beans, understand the relationship between cocoa solids percentage and ratings, and find out whether the number and selection of ingredients matter to the quality of the taste.
Chocolate is one of the most popular candies in the world. Each year, consumers in the United States eat more than 2.8 billion pounds. However, not all chocolate bars are created equal!
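A minimal sketch of the cleaning and subsetting step, assuming the raw file stores cocoa percent as text such as "70%" and uses the column names that appear in this deck (cocoa_percent, bean_origin, rating). The rows and the helper name clean_chocolate are made up for illustration; they are not records or code from the project.

```python
import pandas as pd

# Hypothetical rows that mimic the assumed raw layout; not real records.
raw = pd.DataFrame({
    "company": ["A", "B", "B", "C"],
    "bean_origin": ["Peru", "Venezuela", "Venezuela", None],
    "cocoa_percent": ["70%", "63%", "63%", "88%"],
    "rating": [3.5, 3.25, 3.25, 2.75],
})


def clean_chocolate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows and rows without an origin, and make cocoa_percent numeric."""
    out = df.drop_duplicates().dropna(subset=["bean_origin"]).copy()
    # Turn text such as "70%" into the number 70.0
    out["cocoa_percent"] = out["cocoa_percent"].str.rstrip("%").astype(float)
    return out.reset_index(drop=True)


clean = clean_chocolate(raw)
print(clean)
```

Keeping the cleaning in one function makes it easy to rerun on a refreshed download of the data.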
Cocoa Percent and Ratings:
Clusters tend to concentrate on the middle right side. This aligns with the conclusion of my hypothesis: there is no clear relationship between rating and cocoa_percent.
Bean Origin and Ratings:
Clusters tend to concentrate on the middle right side. The points group more by bean_origin than by rating.
This aligns with the conclusion that there is no clear relationship between rating and bean_origin.
Company Location and Ratings:
Clusters tend to concentrate on the middle right side. I can see a slight relationship between company location and ratings.
Company and Ratings:
This last cluster plot is the most interesting: ratings and company are more concentrated, which suggests a slightly stronger relationship than in the other cluster plots.
This leads me to conclude that the ratings could carry some type of bias in this case.
Unsupervised machine learning: gap statistic method (K-means Clustering)
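A simplified sketch of how a gap statistic can guide the choice of k for k-means. scikit-learn has no built-in gap statistic, so the helper gap_statistic below is hypothetical: it compares the log inertia of the data with the average log inertia of uniform reference data drawn from the same bounding box, and it skips the standard-error rule of the full method. The points are synthetic, not the chocolate data.

```python
import numpy as np
from sklearn.cluster import KMeans


def gap_statistic(X, k_values, n_refs=10, seed=0):
    """Return one gap value per k: mean log inertia of uniform
    reference data minus log inertia of the real data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        log_w = np.log(km.fit(X).inertia_)
        ref_log_w = []
        for _ in range(n_refs):
            # Reference sample: uniform noise inside the data's bounding box
            ref = rng.uniform(lo, hi, size=X.shape)
            ref_log_w.append(np.log(km.fit(ref).inertia_))
        gaps.append(np.mean(ref_log_w) - log_w)
    return np.array(gaps)


# Synthetic data: three tight blobs standing in for two scaled features
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((40, 2)) for c in centers])

k_values = [1, 2, 3, 4, 5]
gaps = gap_statistic(X, k_values)
best_k = k_values[int(np.argmax(gaps))]  # simplified rule: largest gap
print(dict(zip(k_values, gaps.round(2))), "chosen k:", best_k)
```

On real columns such as cocoa_percent and rating, the features would normally be standardized first so that neither one dominates the distance.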
Hypothesis:
- Does the number of ingredients affect the ratings?
Results:
The relationship is not purely linear. This means that there is no significant relationship between the number of ingredients and rating, so the hypothesis is not supported (we fail to reject the null hypothesis of no relationship).
Linear regression: Is more ingredients better?
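A regression of this kind can be sketched as follows, assuming one feature (the number of ingredients) and the rating as the target. The values are made-up placeholders chosen only to show the calls, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical toy values; not taken from the chocolate dataset.
num_ingredients = np.array([2, 2, 3, 3, 4, 4, 5, 5]).reshape(-1, 1)
rating = np.array([3.25, 3.0, 3.5, 3.0, 3.25, 2.75, 3.0, 3.25])

model = LinearRegression().fit(num_ingredients, rating)
predicted = model.predict(num_ingredients)

slope = model.coef_[0]             # change in rating per extra ingredient
r2 = r2_score(rating, predicted)   # share of rating variance explained
print(f"slope={slope:.3f}, R^2={r2:.3f}")
```

A slope close to zero together with a low R^2 is the pattern that matches "no meaningful linear relationship".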
Hypothesis:
- Does a higher cocoa percentage lead to better ratings?
Results:
The linear regression shows a weak negative correlation between chocolate bar rating and cocoa percent. This means that there is no significant relationship between cocoa percent and rating, so the hypothesis is not supported (we fail to reject the null hypothesis of no relationship).
Cluster Analysis:
Chocolate bars are usually categorized by the cocoa percent in a bar. Looking at the average rating of bars by cocoa percent, we can see that semi-sweet and bittersweet chocolate bars are the most represented and have the least varied average ratings. The cocoa percent with the highest average rating is 63% (3.50), followed by 69% (3.40). The highest rated 'Outstanding' chocolate bars contain between 63% and 88% cocoa.
Linear regression: Is more cocoa percent better?
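To make the significance check explicit, here is a small sketch using scipy.stats.linregress, which tests the null hypothesis that the slope is zero. The numbers are invented placeholders and the 0.05 threshold is an assumed convention, not a value stated in the project.

```python
from scipy.stats import linregress

# Hypothetical toy values; not taken from the chocolate dataset.
cocoa_percent = [60, 63, 65, 70, 70, 72, 75, 80, 85, 100]
rating = [3.0, 3.5, 3.0, 3.25, 3.5, 3.0, 3.25, 2.75, 3.25, 2.75]

result = linregress(cocoa_percent, rating)

alpha = 0.05  # assumed significance level
significant = bool(result.pvalue < alpha)

print(f"slope={result.slope:.4f}, r={result.rvalue:.3f}, p={result.pvalue:.3f}")
if significant:
    print("Reject the null hypothesis of no linear relationship.")
else:
    print("Fail to reject the null hypothesis of no linear relationship.")
```

A negative r with a p-value above the threshold is read as a weak negative trend that is not statistically significant.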
The ratings indicate that the United States is the leading manufacturing country in the 'Highly Recommended' and 'Outstanding' rating categories.
For bean origin, the country with the most 'Highly Recommended' and 'Outstanding' ratings is Peru, followed by Venezuela.
Exploratory Analysis: who makes the best bars
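One way to derive rating categories and count the top ones per country is sketched below. The bin edges are an assumption about the rating scale (4.0 and above as 'Outstanding', 3.5 to 3.9 as 'Highly Recommended', and so on); they are not stated in this deck, and the rows are made up.

```python
import pandas as pd

# Hypothetical rows; not real records.
df = pd.DataFrame({
    "company_location": ["U.S.A.", "U.S.A.", "Canada", "France", "U.S.A."],
    "rating": [4.0, 3.5, 3.75, 2.75, 3.0],
})

# Assumed category edges; each bin includes its lower edge.
bins = [1.0, 2.0, 3.0, 3.5, 4.0, 5.01]
labels = ["Unpleasant", "Disappointing", "Recommended",
          "Highly Recommended", "Outstanding"]
df["rating_category"] = pd.cut(df["rating"], bins=bins, labels=labels, right=False)

# Count bars in the two top categories per manufacturing country
top = df[df["rating_category"].isin(["Highly Recommended", "Outstanding"])]
top_counts = top["company_location"].value_counts()
print(top_counts)
```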
The two countries with the highest average bar rating are Tobago and Sao Tome & Principe, each with an average rating of 3.5, based on 1 and 2 reviews, respectively. Solomon Islands is the next highest rated location, with a rating of 3.45 from 10 reviews. Venezuela and Peru are the countries with the most 'Outstanding' bars, with 13. We can see all the countries that produce chocolate bars with 70%+ cocoa percent.
The cacao bean can be considered the most important part in determining the quality and flavor of a chocolate bar. By looking at the different locations that cultivate cacao beans, we can see that most of them come from South America, including Venezuela, Peru and Ecuador.
Exploratory Analysis: where is the cocoa from?
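Because an average built on one or two reviews is fragile, it can help to report the review count next to each average and to filter on it. A minimal sketch, with made-up ratings and a hypothetical min_reviews threshold:

```python
import pandas as pd

# Hypothetical rows; the ratings are placeholders, not real reviews.
df = pd.DataFrame({
    "bean_origin": ["Tobago", "Peru", "Peru", "Peru", "Venezuela", "Venezuela"],
    "rating": [3.5, 3.25, 3.5, 3.0, 3.25, 3.25],
})

# Average rating and number of reviews per bean origin
summary = (
    df.groupby("bean_origin")["rating"]
      .agg(avg_rating="mean", reviews="count")
      .sort_values("avg_rating", ascending=False)
)

min_reviews = 2  # hypothetical cut-off for a trustworthy average
reliable = summary[summary["reviews"] >= min_reviews]
print(summary)
print(reliable)
```

With the filter in place, an origin that tops the list on a single review no longer outranks origins with many reviews.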
The most common and highly rated combination of ingredients is cacao bean, sugar, cocoa butter, vanilla, lecithin and salt, followed by cacao bean, sugar and cocoa butter.
The ratio between rating categories is similar for the top 3 combinations with just some minor changes.
The main ingredient in a chocolate bar is the cacao bean, and bars are categorized by the percent of cocoa they contain. In the histogram, we can see that the most common percentage of cocoa is between 65% and 70%.
Ingredients:
B - Beans,
S - Sugar,
S* - Sweetener other than white cane or beet sugar,
C - Cocoa Butter,
V - Vanilla,
L - Lecithin,
Sa - Salt
Exploratory Analysis: What is inside of a chocolate bar?
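The ingredient codes in the legend can be expanded and counted with a few lines of pandas. This sketch assumes the ingredients column stores text such as "3- B,S,C" (a count, a dash, then the codes); that layout, the helper parse_ingredients, and the sample values are assumptions for illustration.

```python
import pandas as pd

# Legend from the slide, with shortened display names
LEGEND = {
    "B": "Beans", "S": "Sugar", "S*": "Other sweetener",
    "C": "Cocoa Butter", "V": "Vanilla", "L": "Lecithin", "Sa": "Salt",
}

# Hypothetical values in the assumed "count- codes" layout
ingredients = pd.Series(["3- B,S,C", "2- B,S", "3- B,S,C", "4- B,S,C,L"])


def parse_ingredients(text):
    """Turn '3- B,S,C' into ['Beans', 'Sugar', 'Cocoa Butter']."""
    codes = text.split("-", 1)[1]
    return [LEGEND[code.strip()] for code in codes.split(",")]


parsed = ingredients.apply(parse_ingredients)
num_ingredients = parsed.apply(len)

# How often each combination of codes appears
combos = ingredients.str.split("-", n=1).str[1].str.strip()
combo_counts = combos.value_counts()
print(combo_counts)
```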
Conclusion
The most common chocolate bars have a cocoa percent between 65% and 70%.
The most common and important ingredients in the highest performing chocolate bars are the cacao bean, cocoa butter, vanilla, lecithin and salt.
The most common and highest rated cacao beans come from Peru and Venezuela.
The percentage of cocoa in a chocolate bar is not directly related to its rating. The best rated chocolate bars contain between 63% and 88% cocoa.
Semi-sweet (63%-88%) chocolate bars contain two of the highest average rated percentages (63% and 69%), with little variability between the average ratings.
The U.S.A. has the most chocolate manufacturing companies, as well as the most 'Outstanding' ratings.
Soma (Canada) has the highest number of reviews, as well as the most 'Outstanding' ratings. Arete (U.S.A.) has the 3rd highest number of reviews and the 3rd most 'Outstanding' ratings.
Recommendations
It is recommended to open the ratings survey to regular consumers instead of limiting it to a small group of experts. With consumer ratings, we could determine whether cocoa percent affects flavor and quality from the regular consumer's perspective, and use that to project revenues and product variety.
Limitations
The analysis in this project is limited by the nature of the dataset: the ratings are assigned based on expert opinion, which can introduce bias. Because of this, there is an absence of quantitative, continuous variables that could support a more thorough analysis. Another limitation is the size of the dataset, which limits the insight that can be gained from measuring averages and performing time-series decomposition.
Data Source:
Chocolate bar ratings 2022 (kaggle.com)
Data originates from the Flavors of Cacao website
RECOMMENDATIONS & CONCLUSIONS
Ivonne Aspilcueta
Data Analyst
Hermosa Beach, CA, United States