The Price of Beauty: A Data Analysis of Chemicals in Cosmetics

Madhu Shree Aravindan
Published in Code Like A Girl
5 min read · Jan 3, 2024


The cosmetics industry has been on a continuous upward trajectory in recent years, driven by changing consumer preferences, technological advancements, and a worldwide increase in beauty awareness. This expanding sector covers a broad spectrum of products such as skincare, haircare, makeup, and fragrance, meeting the diverse needs of consumers.

Chemicals play a crucial role in the formulation and effectiveness of cosmetics, contributing to product stability, texture, colour, and preservation. While not all chemicals are harmful, and many undergo rigorous testing to ensure they are safe for use, some may have the potential to be harmful.

Let’s dive in and analyze the chemicals used in cosmetics with the help of Python libraries such as pandas, NumPy, and Matplotlib.

Dataset

Step 1: Download the dataset from Kaggle.

Step 2: Import the required libraries and the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno

df = pd.read_csv('dataset.csv', encoding='ISO-8859-1')

Step 3: Let’s understand the dataset

df.head()
df.info()
df.describe().transpose()
df.shape

The dataset has 114,635 rows and 22 columns.

df.isnull().sum()
df.select_dtypes(include='int64').nunique()
df.select_dtypes(include='object').nunique()
# drop duplicated rows; drop_duplicates() returns a new DataFrame, so assign it back
df = df.drop_duplicates()
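
One thing worth keeping in mind is that `drop_duplicates()` does not change the DataFrame it is called on; it returns a new one. Here is a minimal sketch on a made-up toy frame (the rows are illustrative, not taken from the Kaggle dataset):

```python
import pandas as pd

# Toy data (hypothetical rows, not from the real dataset)
toy = pd.DataFrame({
    "ProductName": ["Lip Balm", "Lip Balm", "Shampoo"],
    "ChemicalName": ["Titanium dioxide", "Titanium dioxide", "Cocamide DEA"],
})

toy.drop_duplicates()            # result is discarded, toy is unchanged
rows_before = len(toy)

deduped = toy.drop_duplicates()  # keep the returned DataFrame instead
rows_after = len(deduped)

print(rows_before, rows_after)
```

The duplicates only disappear from the frame that captures the return value.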

Data Cleaning

Step 1: Check whether the data has any missing values

msno.matrix(df)

The “CSFId,” “CSF,” “CASNumber,” “DiscontinuedDate,” and “ChemicalDateRemoval” columns have missing values. We could drop these columns, but I don’t have a clear understanding of what these terms mean, so I will leave them as they are to avoid losing any vital data.
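
Before deciding whether a column is worth keeping, it can help to put a number on how much of it is missing. The sketch below does that on a small made-up frame; the values are placeholders, and only the column names are borrowed from the dataset:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "ProductName": ["A", "B", "C", "D"],
    "CSF": [np.nan, np.nan, "Rose", np.nan],
    "DiscontinuedDate": [np.nan, np.nan, np.nan, "01/01/2015"],
})

# isnull() gives True/False per cell; the mean of booleans is the missing share
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `df` would give the share of missing values for each of the 22 columns.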

Step 2: Check the number of unique values

print("Unique number of CDPHId: ",len(df['CDPHId'].unique()))
print("Unique number of product names: ",len(df['ProductName'].unique()))
print("Unique number of company names: ",len(df['CompanyName'].unique()))
print("Unique number of Brand names: ",len(df['BrandName'].unique()))
print("Unique number of primary category: ",len(df['PrimaryCategory'].unique()))
print("Unique number of sub category: ",len(df['SubCategory'].unique()))
print("Unique number of chemical names: ",len(df['ChemicalName'].unique()))
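
A small caveat with `len(series.unique())`: it counts a missing value as one more unique entry, while `nunique()` skips missing values by default. A quick sketch with made-up values shows the difference:

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical) with one missing brand name
brands = pd.Series(["NYX", "Sephora", np.nan, "NYX"])

with_nan = len(brands.unique())   # NaN is included in the unique values
without_nan = brands.nunique()    # NaN is excluded by default

print(with_nan, without_nan)
```

For columns without missing values, both approaches give the same number.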

Data Analysis

Let’s divide the products based on the number of chemicals present in them.

below_five = df[df["ChemicalCount"]<5]
above_five = df[df["ChemicalCount"]>=5]
sorted_df = above_five.sort_values(by='ChemicalCount', ascending=False)

Now, let’s plot them

plt.barh(sorted_df['ProductName'],sorted_df['ChemicalCount'])
plt.xlabel("Number of Chemicals used")
plt.title("Cosmetic Products with 5 or more toxic chemicals")
plt.show()
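
If the dataset holds one row per product and chemical (my assumption, based on it having both a ChemicalName and a ChemicalCount column), the same product name shows up several times in `sorted_df`. A possible way to get one bar per product is to deduplicate first and keep only the top entries. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy data (hypothetical): "Hair Dye" is listed once per chemical
toy = pd.DataFrame({
    "ProductName": ["Hair Dye", "Hair Dye", "Face Cream", "Lip Balm"],
    "ChemicalCount": [6, 6, 5, 1],
})

# Keep one row per product, then the products with 5 or more chemicals
per_product = toy.drop_duplicates(subset="ProductName")
top = per_product[per_product["ChemicalCount"] >= 5].nlargest(10, "ChemicalCount")

print(top)
```

Deduplicating on the product name alone assumes that no two companies sell a product under the same name; adding the company column to `subset` would relax that assumption.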

Hair and skin care products seem to use a lot of toxic chemicals compared to other products.

Let’s calculate the average number of chemicals used per brand.

average_chemicals_per_brand = df.groupby('BrandName')['ChemicalCount'].mean().reset_index()
average_chemicals_per_brand = average_chemicals_per_brand[average_chemicals_per_brand["ChemicalCount"]>=3]
average_chemicals_per_brand = average_chemicals_per_brand.sort_values(by='ChemicalCount', ascending=False)
print(average_chemicals_per_brand)
plt.barh(average_chemicals_per_brand['BrandName'],average_chemicals_per_brand['ChemicalCount'])
plt.xlabel("Average number of Chemicals used")
plt.title("Cosmetic Brands with 3 or more average number of toxic chemicals used")
plt.show()
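
One thing to watch with a plain `groupby(...).mean()` here: if a product is repeated once per chemical (again, an assumption about how the dataset is laid out), products with many chemicals pull the brand average up because they are counted several times. The sketch below compares the row-level mean with a per-product mean on made-up data:

```python
import pandas as pd

# Toy data (hypothetical): product P1 has 3 chemicals, so it appears 3 times
toy = pd.DataFrame({
    "BrandName": ["X", "X", "X", "X"],
    "ProductName": ["P1", "P1", "P1", "P2"],
    "ChemicalCount": [3, 3, 3, 1],
})

# Mean over all rows: P1 is counted three times
row_mean = toy.groupby("BrandName")["ChemicalCount"].mean()

# Mean over distinct products: each product is counted once
product_mean = (
    toy.drop_duplicates(subset=["BrandName", "ProductName"])
       .groupby("BrandName")["ChemicalCount"]
       .mean()
)

print(row_mean["X"], product_mean["X"])
```

Which of the two is the better summary depends on the question being asked; the per-product version answers "how many chemicals does a typical product of this brand contain?"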

Mastercuts, Redis Design Line, and Careline seem to use more than 4 toxic chemicals per product on average. It’s best to avoid buying products from these brands.

Similarly, calculate the average number of chemicals used per company.

average_chemicals_per_company = df.groupby('CompanyName')['ChemicalCount'].mean().reset_index()
average_chemicals_per_company = average_chemicals_per_company[average_chemicals_per_company["ChemicalCount"]>=3]
average_chemicals_per_company = average_chemicals_per_company.sort_values(by='ChemicalCount', ascending=False)
print(average_chemicals_per_company)
plt.barh(average_chemicals_per_company['CompanyName'],average_chemicals_per_company['ChemicalCount'])
plt.xlabel("Average number of Chemicals used")
plt.title("Cosmetic Companies with 3 or more average number of toxic chemicals used")
plt.show()

Regis Corporation and Cosmopharm Ltd. use the highest number of chemicals on average in their products. It’s best to avoid them.

Let’s now calculate the number of products per company.

df['ProductCount'] = df.groupby('CompanyName')['ProductName'].transform('count')
no_of_products_per_company = df[df["ProductCount"]>=4000]

no_of_products_per_company.groupby('CompanyName').size().plot(kind='barh')
plt.title("Cosmetic Companies with highest number of products")
plt.show()
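
Note that `transform('count')` counts rows, so a product that appears on several rows is counted several times. If the goal is the number of distinct product names per company, `nunique()` is one option. A small sketch with made-up company and product names:

```python
import pandas as pd

# Toy data (hypothetical): "Gloss" appears on two rows for the same company
toy = pd.DataFrame({
    "CompanyName": ["Acme", "Acme", "Acme", "Beta"],
    "ProductName": ["Gloss", "Gloss", "Mascara", "Soap"],
})

row_counts = toy.groupby("CompanyName")["ProductName"].count()         # rows
distinct_counts = toy.groupby("CompanyName")["ProductName"].nunique()  # distinct names

print(row_counts["Acme"], distinct_counts["Acme"])
```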

L’Oreal USA, S+, and Coty seem to produce the most products.

Let’s now count the number of products per brand.

df['ProductCount_Brand'] = df.groupby('BrandName')['ProductName'].transform('count')
no_of_products_per_brand = df[df["ProductCount_Brand"]>=3000]

no_of_products_per_brand.groupby('BrandName').size().plot(kind='barh')
plt.title("Brand names with highest number of products")
plt.show()

Sephora and NYX have the highest number of products.

Let’s now find the most commonly used chemicals.

# f = number of rows in which each chemical appears
df['f'] = df.groupby('ChemicalName')['ChemicalName'].transform('count')
# log10 keeps rare chemicals visible next to very common ones
df['logf'] = np.log10(df['f'])
sorted_df = df.sort_values(by='logf', ascending=False)

plt.figure(figsize = (10,30))
plt.barh(sorted_df['ChemicalName'],sorted_df['logf'])
plt.xlabel("Frequency of chemical use (log10)")
plt.title("Most commonly used chemicals")
plt.show()
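
An alternative way to get the same frequencies is `value_counts()`, which returns one entry per chemical, already sorted from most to least frequent. A minimal sketch on a made-up series (the counts are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical counts): 100, 10 and 1 occurrences
chemicals = pd.Series(["Titanium dioxide"] * 100 + ["Mica"] * 10 + ["Retinol"])

freq = chemicals.value_counts()  # one row per chemical, sorted descending
log_freq = np.log10(freq)        # same log scale as in the plot

print(freq.index[0], freq.iloc[0])
```

Because the result has one row per chemical rather than one per dataset row, it can be passed to `plt.barh` directly and draws far fewer bars.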

Titanium dioxide is contained in most of the products. While its exact harms aren’t well understood, the International Agency for Research on Cancer classifies it as possibly carcinogenic to humans when inhaled.

Now let’s find which categories these products fall under.

df['Primary_Count'] = df.groupby('PrimaryCategory')['PrimaryCategory'].transform('count')
sorted_df = df.sort_values(by='Primary_Count', ascending=False)
plt.barh(sorted_df['PrimaryCategory'],sorted_df['Primary_Count'])
plt.xlabel("Number of Products")
plt.title("Number of products per primary category")
plt.show()
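
The chart above is based on how many rows each category has. To look at row counts and chemicals per product side by side, one option is a named aggregation over `ChemicalCount`. The sketch below uses made-up numbers, and `rows` and `mean_chemicals` are names I picked for the output columns:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "PrimaryCategory": ["Makeup", "Makeup", "Makeup", "Baby Products"],
    "ChemicalCount": [1, 1, 2, 3],
})

# One row per category: number of rows and mean ChemicalCount
summary = toy.groupby("PrimaryCategory")["ChemicalCount"].agg(
    rows="count",
    mean_chemicals="mean",
)

print(summary)
```

In this toy example the category with the most rows is not the one with the highest average, which is why it can be useful to look at both numbers.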

Makeup products use the most chemicals, and baby products use the least.

Conclusion

  • Most of the famous cosmetic companies use toxic chemicals in their products. It’s best to check the ingredients before purchasing these items.
  • Makeup is the category with the most chemicals overall. At the level of individual products, hair and skin care products contain the highest number of chemicals.
  • Titanium dioxide is used in the majority of the products.
