Scope of the Project
Airbnb wants to analyze the historical data of all the listings on its platform since its early stages and improve the recommendations it makes to customers. To do this, it needs to gather the average rating, number of ratings, and prices of Airbnb listings over the years. As a data engineer at the company, I took up the task of building an ETL pipeline that extracts the relevant data (listings, properties, and host details) and loads it into a data warehouse that makes querying easier for decision-makers and analysts.
End-Use Cases
- Query-based analytical tables that can be used by decision-makers.
- An analytical table that can be used by analysts to explore further and develop recommendations for users.
Data Description and Sources
The data has been sourced from the Inside Airbnb site, which publishes actual Airbnb data. The dataset contains information about reviews, calendars, and listings for many cities. As I was interested in Austin, TX and Los Angeles (LA), CA, I took the data for those two cities and tried to extract meaningful information from it. The data comes in three files, namely Reviews, Listings, and Calendar.
- Reviews: contains all the reviews of the listings on the Airbnb website. This file has more than 1,000,000 rows/records.
- Listings: contains all the house listings on Airbnb. The number of listings in Austin and LA comes to around 40,000 rows/records.
- Calendar: contains the availability of each listing across a wide range of dates. This file has more than 13,000,000 rows/records.
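Before designing the warehouse tables, it helps to load these files and look at their size and schema. The snippet below is a minimal exploration sketch, assuming the Inside Airbnb CSV files have been downloaded next to the notebook; the file names are illustrative and may differ between data snapshots.

```python
from pyspark.sql import SparkSession

# Minimal exploration sketch; file names below are assumptions for illustration.
spark = SparkSession.builder.appName("airbnb-exploration").getOrCreate()

# Inside Airbnb publishes gzipped CSVs, which Spark can read directly.
listings = spark.read.csv("listings.csv.gz", header=True, multiLine=True, escape='"')
reviews = spark.read.csv("reviews.csv.gz", header=True, multiLine=True, escape='"')
calendar = spark.read.csv("calendar.csv.gz", header=True, multiLine=True, escape='"')

# Basic sanity checks: row counts and a peek at the schemas.
print(listings.count(), reviews.count(), calendar.count())
listings.printSchema()
calendar.show(5)
```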
Tools and Technologies
The tools used in this project are notebooks (Google Colab), Apache Spark, Amazon S3, Amazon Redshift, and Apache Airflow.
- To explore the dataset, I started with Google Colab's free computing resources and Apache Spark.
- Spark handles huge numbers of records well and provides superior performance because it keeps data in memory, shared across the cluster.
- The data lake is stored on Amazon S3, an object storage service that offers scalability, data availability, security, and performance.
- S3 is well suited to storing data partitioned and grouped into files, at low cost and with a lot of flexibility (a sketch of writing partitioned data to S3 follows this list).
- It also offers flexibility in adding and removing data.
- Schema design is straightforward, and the data is available to a wide range of users.
- For the ETL process, I used an Amazon EMR cluster, with Amazon Redshift serving as the data warehouse (a sketch of loading data into Redshift follows this list).
- To orchestrate the overall data pipeline, I used Apache Airflow, as it provides an intuitive UI that helps track the progress of our pipelines (a minimal DAG sketch closes this section).
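As mentioned in the S3 bullets above, the transformed data is kept in the data lake partitioned and grouped into files. A minimal sketch of writing the calendar data to S3 as partitioned Parquet is shown below; the bucket name, prefix, and partition columns are assumptions for illustration, not the project's actual layout.

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical bucket/prefix; year/month partition columns chosen for illustration.
DATA_LAKE_PATH = "s3a://my-airbnb-data-lake/calendar/"

spark = SparkSession.builder.appName("airbnb-s3-load").getOrCreate()

calendar = (
    spark.read.csv("calendar.csv.gz", header=True, multiLine=True, escape='"')
    .withColumn("date", F.to_date("date"))
    .withColumn("year", F.year("date"))
    .withColumn("month", F.month("date"))
)

# Parquet plus year/month partitioning keeps storage cheap and later scans selective.
(calendar
    .write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet(DATA_LAKE_PATH))
```

Writing to `s3a://` paths requires the Hadoop AWS connector and credentials to be configured on the cluster, which an EMR cluster typically provides out of the box.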
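For the load step into the warehouse, data staged in S3 is typically pulled into Redshift with the COPY command. The sketch below issues such a command from Python; the cluster endpoint, credentials, table, bucket, and IAM role are all placeholders, and the real pipeline may use a different client or operator.

```python
import psycopg2

# Standard Redshift COPY from S3; every name below is a placeholder.
COPY_SQL = """
    COPY public.calendar
    FROM 's3://my-airbnb-data-lake/calendar/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    FORMAT AS PARQUET;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="airbnb",
    user="awsuser",
    password="<password>",
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.close()
```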
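Finally, a minimal sketch of how the overall pipeline can be wired together in Airflow. The DAG id, schedule, and task callables are placeholders standing in for the real staging, EMR transformation, and Redshift load steps, not the project's actual operators.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real Spark/EMR and Redshift steps.
def stage_raw_files_to_s3():
    ...

def transform_on_emr():
    ...

def load_to_redshift():
    ...

with DAG(
    dag_id="airbnb_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_raw_files_to_s3", python_callable=stage_raw_files_to_s3)
    transform = PythonOperator(task_id="transform_on_emr", python_callable=transform_on_emr)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    # Linear dependency: stage -> transform -> load.
    stage >> transform >> load
```

Airflow's UI then shows each of these tasks and their status per run, which is the progress tracking referred to above.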