Scope of the Project
Airbnb wants to analyze the historical data of all the listings on its platform since its early stages and improve the recommendations it makes to customers. To do this, it needs to gather the average rating, number of ratings, and prices of Airbnb listings over the years. As a data engineer at the company, I took up the task of building an ETL pipeline that extracts the relevant data (listings, properties, and host details) and loads it into a data warehouse that makes querying easier for decision-makers and analysts.
End-Use Cases
- Query-based analytical tables that can be used by decision-makers.
- An analytical table that can be used by analysts to explore further and develop recommendations for users.
Data Description and Sources
The data has been sourced from the Inside Airbnb site, which publishes actual Airbnb data. The dataset contains information about reviews, calendars, and listings for many cities. As I was interested in Austin, TX and Los Angeles (LA), CA, I took the data for those two cities and tried to extract meaningful information from it. The data comes in three files, namely Reviews, Listings, and Calendar.
- Reviews: contains all the reviews of the listings on the Airbnb website. This file has more than 1,000,000 rows/records.
- Listings: contains all the house listings on Airbnb. The number of listings in Austin and LA comes to around 40,000 rows/records.
- Calendar: contains the availability of each listing across a wide range of dates. This file has more than 13,000,000 rows/records.
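Before designing the warehouse tables, it helps to load these files and look at their size and schema. The snippet below is a minimal exploration sketch, assuming the Inside Airbnb CSV files have been downloaded next to the notebook; the file names are illustrative and may differ between data snapshots.

```python
from pyspark.sql import SparkSession

# Minimal exploration sketch; file names below are assumptions for illustration.
spark = SparkSession.builder.appName("airbnb-exploration").getOrCreate()

# Inside Airbnb publishes gzipped CSVs, which Spark can read directly.
listings = spark.read.csv("listings.csv.gz", header=True, multiLine=True, escape='"')
reviews = spark.read.csv("reviews.csv.gz", header=True, multiLine=True, escape='"')
calendar = spark.read.csv("calendar.csv.gz", header=True, multiLine=True, escape='"')

# Basic sanity checks: row counts and a peek at the schemas.
print(listings.count(), reviews.count(), calendar.count())
listings.printSchema()
calendar.show(5)
```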
Tools and Technologies
The tools used in this project are notebooks (Google Colab), Apache Spark, Amazon S3, Amazon Redshift, and Apache Airflow.
- To explore the dataset, I started with Google Colab's free computing resources and Apache Spark.
- Spark handles huge numbers of records well and provides superior performance because it keeps data in memory, shared across the cluster.
- The data lake is stored on Amazon S3, an object storage service that offers scalability, data availability, security, and performance.
- S3 is well suited to storing data partitioned and grouped into files, at low cost and with a lot of flexibility (a sketch of writing partitioned data to S3 follows this list).
- It also offers flexibility in adding and removing data.
- Schema design is straightforward, and the data is available to a wide range of users.
- For the ETL process, I used an Amazon EMR cluster, with Amazon Redshift serving as the data warehouse (a sketch of loading data into Redshift follows this list).
- To orchestrate the overall data pipeline, I used Apache Airflow, as it provides an intuitive UI that helps track the progress of our pipelines (a minimal DAG sketch closes this section).
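As mentioned in the S3 bullets above, the transformed data is kept in the data lake partitioned and grouped into files. A minimal sketch of writing the calendar data to S3 as partitioned Parquet is shown below; the bucket name, prefix, and partition columns are assumptions for illustration, not the project's actual layout.

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical bucket/prefix; year/month partition columns chosen for illustration.
DATA_LAKE_PATH = "s3a://my-airbnb-data-lake/calendar/"

spark = SparkSession.builder.appName("airbnb-s3-load").getOrCreate()

calendar = (
    spark.read.csv("calendar.csv.gz", header=True, multiLine=True, escape='"')
    .withColumn("date", F.to_date("date"))
    .withColumn("year", F.year("date"))
    .withColumn("month", F.month("date"))
)

# Parquet plus year/month partitioning keeps storage cheap and later scans selective.
(calendar
    .write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet(DATA_LAKE_PATH))
```

Writing to `s3a://` paths requires the Hadoop AWS connector and credentials to be configured on the cluster, which an EMR cluster typically provides out of the box.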
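For the load step into the warehouse, data staged in S3 is typically pulled into Redshift with the COPY command. The sketch below issues such a command from Python; the cluster endpoint, credentials, table, bucket, and IAM role are all placeholders, and the real pipeline may use a different client or operator.

```python
import psycopg2

# Standard Redshift COPY from S3; every name below is a placeholder.
COPY_SQL = """
    COPY public.calendar
    FROM 's3://my-airbnb-data-lake/calendar/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    FORMAT AS PARQUET;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="airbnb",
    user="awsuser",
    password="<password>",
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.close()
```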
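Finally, a minimal sketch of how the overall pipeline can be wired together in Airflow. The DAG id, schedule, and task callables are placeholders standing in for the real staging, EMR transformation, and Redshift load steps, not the project's actual operators.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real Spark/EMR and Redshift steps.
def stage_raw_files_to_s3():
    ...

def transform_on_emr():
    ...

def load_to_redshift():
    ...

with DAG(
    dag_id="airbnb_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_raw_files_to_s3", python_callable=stage_raw_files_to_s3)
    transform = PythonOperator(task_id="transform_on_emr", python_callable=transform_on_emr)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    # Linear dependency: stage -> transform -> load.
    stage >> transform >> load
```

Airflow's UI then shows each of these tasks and their status per run, which is the progress tracking referred to above.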