Web Scraping Basics

Data Ingestion - Learning to scrape the web

We'll be doing the following activities in this class:

  • Understanding an HTML page
  • Learning to web scrape: follow-along project
  • Working on a lab assignment to assess what you've learnt
  • Discussion with your guide on learnings and shortcomings

Learning Outcomes

Below are the takeaways from this notebook:

  • Learning how to read an HTML page (the DOM, or Document Object Model)
  • Learning how Python can be used to read an HTML page and extract data from it
  • Discovering the packages Python uses to implement web scraping
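
To make the idea of the DOM concrete, here is a minimal sketch that prints the nesting of tags in a small page. It uses only the `html.parser` module from Python's standard library. The sample HTML and the names `OutlineParser` and `outline_of` are made up for illustration, and the sketch assumes every tag in the page is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical sample page, made up for illustration only.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="heading">Hello</h1>
    <p class="intro">First paragraph</p>
  </body>
</html>
"""


class OutlineParser(HTMLParser):
    """Records each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) tuples, in document order

    def handle_starttag(self, tag, attrs):
        # A new element starts: remember it, then go one level deeper.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # The element is closed: come back up one level.
        # Assumption: every tag is closed, so depth never goes out of sync.
        self.depth -= 1


def outline_of(html):
    """Return the (depth, tag) outline of an HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    return parser.outline


if __name__ == "__main__":
    # Print the tags indented by depth, which mirrors the tree shape of the DOM.
    for depth, tag in outline_of(SAMPLE_HTML):
        print("  " * depth + tag)
```

The indented output shows which elements are parents and which are children. A scraping package relies on the same parent-child relationships when it locates the data you want.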

01. Learning to web scrape: follow-along project

  • Follow the article here by DataQuest end-to-end and try to replicate the results as shown

    🚩 Please type out each piece of code instead of copying and pasting it. You'll discover how quickly you learn when you try to remember the code and then type it!

  • Use this notebook itself to write the code

  • For each piece of code, add enough comments to explain what that step does

  • Once you are done, submit your assignment in the classroom

Note: Here is another article to help you develop a good understanding of the web scraping process

  • You are a Data Scientist hired by the BCCI to analyse the IPL 2019 season, which concluded last year

  • Refer to this page, which lists the details of the matches held

  • Use the BeautifulSoup Python package to scrape the match details

  • Your output should be a CSV file that looks like the one below: (image: web-scraping learnbook image.png)

  • Your output should contain every row present at the source URL
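
As a starting point, the general pattern of turning an HTML table into CSV with BeautifulSoup can be sketched as below. This is not the solution to the assignment. The markup, column names and cell values are hypothetical placeholders, the real page may be structured differently, and `table_rows` and `rows_to_csv` are illustrative helper names. Fetching the page over the network is left out, so the sketch parses an inline string instead.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup with placeholder values. Inspect the real page in your
# browser to find the actual tags, classes and columns before adapting this.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Team 1</th><th>Team 2</th><th>Result</th></tr>
  <tr><td>Day 1</td><td>Team A</td><td>Team B</td><td>Team A won</td></tr>
  <tr><td>Day 2</td><td>Team C</td><td>Team D</td><td>Team D won</td></tr>
</table>
"""


def table_rows(html):
    """Return the first table in the HTML as a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumption: the data sits in the first <table>
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are handled alike.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text; write this string to a file to save it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = table_rows(SAMPLE_HTML)
    print(rows_to_csv(rows))
    # Compare this count with the number of rows on the source page.
    print(len(rows) - 1, "data rows scraped")
```

Comparing the scraped row count with the source page is a quick way to confirm that no match has been dropped.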

