How are you doing today?
In this learnbook, I'll share my experience of building an end-to-end data pipeline using plain Python.
As a Data Analyst, I once got the opportunity to work on an interesting problem. Here is how it went:
What was the problem?
I was working at an Agritech startup. We had good data on crop growth, germination and so on. However, one of our customers wanted a model that could predict the harvest outcome from several inputs, such as:
- Temperature variation
- Soil Moisture
- Relative Humidity
- Rainfall, etc.
The problem was, we didn't have any weather data of our own. The challenge: fetch data from a given source on a daily basis, save it in a database, and make it ready for ad-hoc analysis as quickly as possible, using only open-source tools.
How did we solve it?
Thinking through the problem, we broke the pipeline down into the following components:
- Data Ingestion: Ingest data from an API after registering on the website and fetching the API key.
  - First, set up an account on Openweather.org
  - Next, fetch the API key
  - Finally, look into the API docs to find the API endpoint
- Data Wrangling: Convert the JSON responses into dataframes. As a Data Analyst, I love having data in dataframes; it opens up a lot of avenues for data analysis and machine learning.
- Data Storage: Persist the dataframes to disk or a database, so they can be referred to later for analytics or predictive modeling.
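Putting the ingestion and wrangling steps together, a minimal sketch might look like the following. The endpoint URL, query parameters, and response field names here are illustrative assumptions; confirm the exact schema in the OpenWeather API docs and substitute your own API key.

```python
import requests
import pandas as pd

API_KEY = "YOUR_API_KEY"  # obtained after registering on the website
# Illustrative endpoint -- verify the exact URL and parameters in the API docs
URL = "https://api.openweathermap.org/data/2.5/weather"

def fetch_weather(city: str) -> dict:
    """Call the API endpoint and return the parsed JSON payload."""
    response = requests.get(URL, params={"q": city, "appid": API_KEY, "units": "metric"})
    response.raise_for_status()  # fail loudly on a bad key or endpoint
    return response.json()

def to_dataframe(payload: dict) -> pd.DataFrame:
    """Flatten the fields we care about into a one-row DataFrame."""
    row = {
        "city": payload["name"],
        "temperature": payload["main"]["temp"],
        "humidity": payload["main"]["humidity"],
        # rainfall is only present in the payload when it actually rained
        "rain_1h": payload.get("rain", {}).get("1h", 0.0),
    }
    return pd.DataFrame([row])

if __name__ == "__main__":
    # df = to_dataframe(fetch_weather("Pune"))  # needs a valid API key
    sample = {"name": "Pune", "main": {"temp": 28.4, "humidity": 61}}
    print(to_dataframe(sample))
```

Running this daily (e.g. via cron) and appending each one-row dataframe to storage gives you the time series the predictive model needs.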
Given that we wanted a quick, open-source solution, Python was a superb choice. We already had a SQL Server database on the Azure cloud with write access.
The following resources helped us immensely in filling our knowledge gaps:
- The requests package, to read from an API endpoint: refer to this excellent article by DataQuest to understand how to use requests to fetch data from an API in Python.
- The pyodbc/pymssql packages, to perform the tasks below:
  - Connecting to a relational database
  - Creating a new table in the database
  - Writing the dataframe into the newly created table
  - Reading from the SQL table and printing the output row by row
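Those four tasks can be sketched with the standard Python DB-API pattern that pyodbc, pymssql, and the stdlib sqlite3 module all share. The sketch below uses sqlite3 so you can run it anywhere; against the real SQL Server database you would swap the connection line for something like `pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=...")`. The table and column names are illustrative.

```python
import sqlite3  # stand-in for pyodbc/pymssql; all three follow the DB-API pattern
import pandas as pd

def save_and_read(df: pd.DataFrame, conn) -> list:
    """Create a table, write the dataframe into it, then read it back row by row."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS daily_weather "
        "(city TEXT, temperature REAL, humidity REAL)"
    )
    # Write each dataframe row using parameterised INSERTs
    for row in df.itertuples(index=False):
        cur.execute(
            "INSERT INTO daily_weather (city, temperature, humidity) VALUES (?, ?, ?)",
            (row.city, row.temperature, row.humidity),
        )
    conn.commit()
    # Read back and return the rows one by one
    return [tuple(r) for r in cur.execute("SELECT * FROM daily_weather")]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # replace with pyodbc.connect(...) for SQL Server
    df = pd.DataFrame([{"city": "Pune", "temperature": 28.4, "humidity": 61.0}])
    for record in save_and_read(df, conn):
        print(record)
```

Note that SQL Server's parameter placeholder in pyodbc is also `?`, so the INSERT statement carries over largely unchanged.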
Note: With either approach, if you run into "driver not available" or "driver not identified" errors, you may need to manually download and install the ODBC drivers on your system. Refer to this link to download the required drivers.
Solution in progress
Notice how several libraries, such as requests, json and csv, come together to solve the problem at hand.
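For instance, json and csv alone are enough to turn one day's API response into a row appended to a growing CSV file. This is a minimal sketch; the payload fields are assumed to follow the shape shown earlier.

```python
import csv
import io
import json

def append_weather_row(json_text: str, csv_file) -> None:
    """Parse one API response (a JSON string) and append a CSV row to a file handle."""
    payload = json.loads(json_text)
    writer = csv.writer(csv_file)
    writer.writerow([payload["name"], payload["main"]["temp"], payload["main"]["humidity"]])

if __name__ == "__main__":
    sample = '{"name": "Pune", "main": {"temp": 28.4, "humidity": 61}}'
    buf = io.StringIO()  # in a daily job, use open("weather.csv", "a", newline="")
    append_weather_row(sample, buf)
    print(buf.getvalue())
```

Running such a snippet once a day steadily builds the historical weather dataset the original problem called for.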
Hopefully you have spent some quality time going through the articles by now. How about applying your skills right away?
Well, this mini lab has got you covered.
Head over to this Google Colab notebook and try implementing the above scenario yourself!
Remember: always type out your code by hand. Resisting the temptation to copy/paste will go a long way toward building the muscle memory good programmers need.