Measuring Statistical Bias with Amazon SageMaker Data Wrangler
Last updated: Wed Oct 27 2021

Learning Objectives

  • Statistical Bias Metrics
  • Measuring Statistical Bias with SageMaker Data Wrangler


Well, you've had quite a day! You've reached home, ordered a burger, and want to watch Netflix. Netflix starts throwing recommendations at you because its recommendation engine weighs several factors and shows movies that you might like. What we observe here is called selection bias: you mostly watch and review what was recommended to you. Selection bias is one cause of statistical bias in any data generated from views and reviews of the recommended movies.

Metrics for Statistical Bias

There are several such causes of statistical bias (activity bias, societal bias, and so on). To measure statistical bias, we use certain metrics. I do not want to explain all of them here. Please read about them here.
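To make the idea of a bias metric concrete, here is a minimal sketch of one commonly used metric, class imbalance (CI), which compares how many rows belong to one facet value against all the others. The formula CI = (n_a - n_d) / (n_a + n_d) is the usual definition; the function name and the toy column below are assumptions made up for illustration and are not taken from the data set used later.

```python
from collections import Counter


def class_imbalance(facet_values, disadvantaged_value):
    """Return CI = (n_a - n_d) / (n_a + n_d) for one facet value.

    n_d counts the rows whose facet equals `disadvantaged_value`;
    n_a counts all remaining rows. The result lies in [-1, 1].
    """
    if not facet_values:
        raise ValueError("facet_values must not be empty")
    counts = Counter(facet_values)
    n_d = counts[disadvantaged_value]
    n_a = len(facet_values) - n_d
    return (n_a - n_d) / (n_a + n_d)


# Hypothetical toy column (illustration only, not the tutorial's data).
marital_status = ["Married"] * 6 + ["Single"] * 3 + ["Divorced"] * 1

ci_single = class_imbalance(marital_status, "Single")
ci_divorced = class_imbalance(marital_status, "Divorced")
print(ci_single, ci_divorced)
```

A value near 0 means the facet value is about as common as everything else combined, while a value near 1 means it is rare in the data.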

Evaluating Statistical Bias

Our objective for this learnbook is to use Amazon SageMaker Data Wrangler to measure statistical bias by evaluating the metrics we saw earlier. We can also use Amazon SageMaker Clarify for this evaluation. You might be wondering what the difference between the two is. Please look at the illustration below for an overview.

[Illustration: statistical Bias.PNG]


We are going to use this data set for the analysis. Credits: thanks to Akash Patel for this data.

SageMaker Data Wrangler

  • Create an Amazon S3 bucket and upload the data to it.
  • Create a SageMaker Studio environment [For reference]. Creating SageMaker Studio might take some time.
  • Once SageMaker Studio is active, open it and create a new data flow.
  • Import the data that you uploaded to S3.
  • Create a new analysis in Data Wrangler [For reference].
  • Create a statistical bias report from the new analysis.
  • Once you run the analysis, the report is generated at the specified S3 location.
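The first step above, uploading the data to S3, can also be scripted instead of done in the console. The sketch below assumes the boto3 library is installed and AWS credentials are configured; the helper name, file name, bucket, and key are placeholders made up for illustration, and the bucket must already exist.

```python
def upload_dataset(local_path, bucket, key, s3_client=None):
    """Upload a local file to S3 and return its s3:// URI.

    `s3_client` is optional so a stub can be passed in for testing;
    by default a real boto3 S3 client is created.
    """
    if s3_client is None:
        import boto3  # imported lazily; needs AWS credentials at call time

        s3_client = boto3.client("s3")
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example call with placeholder names (uncomment to run against AWS):
# uri = upload_dataset("dataset.csv", "my-bias-demo-bucket", "input/dataset.csv")
# print(uri)
```

The returned URI is the location you would then pick when importing the data into the Data Wrangler flow.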

You have now finished generating the report, but what exactly does the report mean?


  • The class imbalance (CI) score for YOLO is 1, which means the YOLO class is heavily underrepresented in the marital status attribute, so results for this class are likely to be biased. The sample size for YOLO is also very small, so inferences should only be drawn from a sufficiently large sample.
  • The Lp-norm difference for the "Single" class is 0.13. The closer it is to zero, the smaller the bias.
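To see where a number like the Lp-norm difference comes from, here is a minimal sketch that compares the label distribution of one facet value against the label distribution of all other rows. It assumes p = 2 (the Euclidean norm) and uses made-up toy columns, so the report's exact computation and values may differ; the function names are hypothetical.

```python
from collections import Counter


def label_distribution(labels, categories):
    """Return the share of each category in `labels`, in `categories` order."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def lp_norm_difference(facet_values, labels, facet_value, p=2):
    """Lp norm of the gap between two label distributions.

    One distribution covers rows where the facet equals `facet_value`,
    the other covers all remaining rows. 0 means identical distributions.
    """
    categories = sorted(set(labels))
    in_group = [y for f, y in zip(facet_values, labels) if f == facet_value]
    others = [y for f, y in zip(facet_values, labels) if f != facet_value]
    if not in_group or not others:
        raise ValueError("both groups need at least one row")
    dist_in = label_distribution(in_group, categories)
    dist_out = label_distribution(others, categories)
    return sum(abs(a - d) ** p for a, d in zip(dist_out, dist_in)) ** (1 / p)


# Hypothetical toy data (illustration only): facet column and binary label.
facet = ["Single"] * 4 + ["Married"] * 4
label = [1, 1, 0, 0] + [1, 0, 0, 0]

lp_single = lp_norm_difference(facet, label, "Single")
print(round(lp_single, 4))
```

In this toy data half of the "Single" rows have label 1 versus a quarter of the other rows, so the distributions differ and the result is above zero; identical distributions would give exactly 0.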

As you finish your experiments, make sure you delete all compute instances as well as SageMaker Studio to save cloud costs.

Quiz Time!

  • Using the same procedure, measure the bias on the Education facet using SageMaker Data Wrangler.
  • Reflect a bit on your reading and take this quiz here to validate your learning.
  • Wish to learn from a bigger community and work on an open project? Join us on Slack!

Created with 💙 by
Cloud Engineer, Mentorskool