Key facts:
-
Anomaly detection is a particularly important and active area for many business fields. It is a technique used to identify unusual patterns that do not conform to expected behavior.
-
It has many applications across businesses, such as health (detecting health discrepancies), cybersecurity (intrusions), electricity (huge and sudden surges), finance (fraud detection), manufacturing (fault detection), etc. This shows that anomaly detection touches everyday life and that its core concepts are worth studying.
-
Applying data science to anomaly detection combines multiple concepts, such as classification, regression, and clustering.
Business Case
-
Society's electricity consumption has been increasing steadily day by day, and it has become important for generators and suppliers to meet the demand and be wary of sudden increases or surges in their distribution.
-
Typically, there is more energy consumption than necessary due to various supply losses, malfunctioning equipment, incorrectly configured systems, and inappropriate operating procedures. Therefore, quick response and reliable fault detection are important to ensure that demand is met, service is uninterrupted, and energy is saved.
-
Electricity supply can exhibit point, contextual, or collective anomalies, and a given event could fall into any of the three categories: a sudden surge in a locality, unusual usage on holidays, an activity taking place at a site, etc.
-
These anomalies could follow a pattern, such as high electricity usage in summer, a sudden spike due to wedding activity (wedding season), or a point anomaly (for example, when someone starts mining bitcoin, which consumes a lot of power).
-
Detecting these anomalies helps stakeholders plan for additional supply, curb improper electricity usage and leakages, and save a great deal of money and energy for everyone involved.
Steps Involved:
Initial Thinking
-
Identifying anomalies is a tricky task because the data does not always tell you what it is: a point could be just an outlier, or it could be a true anomaly. The set of anomalies/outliers is very small, so the data is imbalanced for any algorithm to learn from. Electricity anomaly detection is especially complicated because of the many factors involved, such as weekends, office areas, holiday areas, etc.
-
How, then, do we capture an anomaly? My initial idea was to use the time series data, predict the values from various engineered features, and compute the residual. When the residual exceeds a threshold, mark that point as an anomaly. This threshold can be decided based on several factors, such as a combination of duration, temperature, day of activity, etc. Then compare all the residuals and mark those exceeding the threshold as anomalies. But before this, I must remove all the outliers already present in the data.
-
Another idea that occurred to me was to find the nearest neighbors and predict the values based on various features. Once the value predicted for a day from its nearest neighbors differs from the usual value, mark it as unusual, and then consider other factors to decide whether to mark it as an anomaly or not. KNN would help here, but defining the number of nearest neighbors and the relationships between neighbors might be difficult at first, because there could be too many correlations. (A minimal sketch of this idea appears at the end of this section.)
-
A key point to remember: Outliers are observations that are distant from the mean or location of a distribution. However, they don’t necessarily represent abnormal behavior or behavior generated by a different process. On the other hand, anomalies are data patterns that are generated by different processes.
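As a rough illustration only, here is a minimal sketch of the nearest-neighbour idea mentioned above, assuming NumPy arrays of engineered features (X_train, X_new) and electric unit values (y_train, y_new); the number of neighbours and the z-score cut-off are illustrative defaults, not values from this project.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_unusual_flags(X_train, y_train, X_new, y_new, n_neighbors=5, z=3.0):
    """Flag values that differ sharply from a nearest-neighbour prediction."""
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)

    # Residual between the observed value and the KNN-based prediction.
    residual = np.asarray(y_new) - knn.predict(X_new)

    # Standardise the residuals and mark the extreme ones as "unusual";
    # other factors (holiday, site activity, ...) then decide whether an
    # unusual point is really an anomaly.
    scores = (residual - residual.mean()) / residual.std()
    return np.abs(scores) > z
```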
Technologies Used
-
Python
-
Pandas
-
Numpy
-
Plotly
-
Scikit-Learn
-
Boosting
-
XGBoost
-
Isolation Forest
SKILLS
-
EDA
-
Time Series Analysis
-
Supervised ML
-
Statistics
-
Data Visualization
Data Acquisition
-
Data downloaded from Schneider Electric.
-
The data consisted of 5 CSV files.
-
Training data with timestamp and values of electric units for respective meter ids.
-
Weather data with a timestamp, temperature, site ids, and distance.
-
The holidays CSV file consisted of the date, type of holiday, and site_ids.
-
The metadata CSV file consisted of site_id, meter_id, meter description, units, surface, and activity.
-
The submission file had the true labels indicating whether each record is an anomaly or not.
-
Preprocessing – Data formatting and Munging
-
The given data is for three sites (Site 38, 234_203, 334_16) and more than 20 meter_ids, covering more than four years. It also covers various kinds of activities, such as restaurants, offices, laboratories, and general use. The meter_ids have an even wider distribution across the 3 sites, covering various kinds of electricity usage such as elevators, generators, compressed air, etc.
-
Exploring the above data made it clear that a lot of data is missing across the various tables; not all site_ids are listed for the respective meter_ids.
-
The data is skewed towards one site.
-
Incorrect values are present in the electric units.
-
Metadata is filled with many unknowns
-
The weather data table doesn’t contain data for site 234_203
-
Feature Selection & Engineering
-
Considering the subject (electric unit value prediction) and the data given, I started wondering what factors drive electricity consumption for a place or activity. For a house, it could be the outside temperature or people staying at home on a holiday; for an office, it could be the outside temperature, a full staff, a hectic day, a long day, etc.
-
I started extracting features such as year, month, day, hour, and the sin and cos values of month and day; whether it is a working day, whether it is a weekend, day of the year, night, reset hour, the outside temperature, mean temperature across the day, mean temperature across the month, whether it is a holiday, etc.
-
Some of these features occurred to me; others came from exploring the internet.
-
Now the important question is how many of these factors are important and what weight they carry.
-
Before moving on to modeling, I wanted to check how the electric unit values are spread across 4 years.
-
Below is a representation of weekday (blue) and weekend (green) values for sites 334 and 234.
-
We can see a few spikes here and there. We don't know whether they are anomalies or not, but they are not outliers.
-
Also, we can see that the green values are not very high. This is because sites 334_61 & 234_203 are office-activity sites and are assumed to be non-operational on weekends.
-
Features Code
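The project's own feature-extraction code is not reproduced here. Below is a minimal sketch of the kind of features described above, assuming a pandas DataFrame with columns named timestamp, value (electric units), and temperature (joined from the weather table); these column names and window sizes are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar, cyclical, temperature, and rolling features (illustrative names)."""
    ts = pd.to_datetime(df["timestamp"])
    df = df.copy()
    df["year"] = ts.dt.year
    df["month"] = ts.dt.month
    df["day"] = ts.dt.day
    df["hour"] = ts.dt.hour
    df["dayofyear"] = ts.dt.dayofyear
    df["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)

    # Cyclical encodings so that December is "close" to January and hour 23 to hour 0.
    df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
    df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
    df["day_sin"] = np.sin(2 * np.pi * df["dayofyear"] / 365)
    df["day_cos"] = np.cos(2 * np.pi * df["dayofyear"] / 365)
    df["is_night"] = ((df["hour"] < 6) | (df["hour"] >= 22)).astype(int)

    # Mean temperature across the day and across the month.
    df["temp_day_mean"] = df.groupby(ts.dt.date)["temperature"].transform("mean")
    df["temp_month_mean"] = df.groupby([ts.dt.year, ts.dt.month])["temperature"].transform("mean")

    # Rolling mean of the target itself (window size is illustrative).
    df["rolling_mean_24"] = df["value"].rolling(24, min_periods=1).mean()
    return df
```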
Model Selection & Training
-
As the electricity data is time-series data, I decided to use a regression-based supervised learning model (XGBoost) to predict future values based on past values and the engineered features. The features I used are as follows:
-
Rolling mean, moving average, temperature, and other features such as year, month, weekend, weekday, the cos and sin features of day and month, etc.
-
Choosing XGBoost was a deliberate choice, as it is one of the most efficient gradient boosting implementations. It produces predictions by combining an ensemble of weak learners.
-
To avoid overfitting, I used cross-validation to split the train and test data sets.
-
Since I am using regression for prediction, I used the RMSE metric to calculate the error.
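A minimal sketch of this training step, using a time-ordered split for the cross-validation (an assumption, since the exact split was not specified) and placeholder data in place of the real feature matrix:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

# Placeholder data: in the project, X holds the engineered features and
# y the electric unit values for one meter_id, ordered in time.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = rng.normal(size=1000)

# Hyperparameters are illustrative, not the project's tuned values.
model = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=6)

rmse_scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    rmse_scores.append(np.sqrt(mean_squared_error(y[test_idx], pred)))

print("mean RMSE:", np.mean(rmse_scores))
```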
-
We have seen above that the weekend and weekday consumption values differ a lot, so the same model cannot be used for both. Hence, I decided to use the Isolation Forest model for weekend prediction. While using the isolation forest, I used normalized 24-hour values rather than raw values, because weekend consumption appeared to be evenly distributed and I did not want the prediction to rely solely on a few factors.
-
For weekend anomaly detection, I borrowed some features from a GitHub solution.
-
The features selected in that solution are as follows. (These are extensive, well-chosen features for an isolation forest algorithm, and I could not improve on them.)
-
The power values of each hour are extracted as features.
-
Temperature
-
The power values after 9:00 am, since on a weekend the consumption values after the early morning reflect the real consumption.
-
Kullback-Leibler divergence: a measure of how one probability distribution diverges from another. We take the mean power usage of each hour as the empirical distribution, then calculate the KL divergence of each weekend's 24-hour power usage against that empirical distribution (a sketch of this computation follows this feature list).
-
The hour at which the daily power values peak, because if the peak suddenly occurs later in the day, it could be an anomaly.
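A minimal sketch of the KL-divergence feature, assuming the weekend power values have already been arranged as an array with one row per day and one column per hour; the smoothing constant is an assumption used to keep the distributions strictly positive.

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence_features(hourly_power: np.ndarray) -> np.ndarray:
    """KL divergence of each day's 24-hour usage profile vs. the mean hourly profile.

    hourly_power: array of shape (n_days, 24) with power values per hour.
    """
    eps = 1e-9
    profiles = hourly_power + eps
    profiles = profiles / profiles.sum(axis=1, keepdims=True)

    # Empirical distribution: mean power usage of each hour, normalised to sum to 1.
    empirical = profiles.mean(axis=0)
    empirical = empirical / empirical.sum()

    # scipy's entropy(p, q) computes KL(p || q).
    return np.array([entropy(p, empirical) for p in profiles])
```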
-
Isolation forest is an effective unsupervised algorithm with small memory requirements.
-
The algorithm works by first building isolation trees: it randomly selects features and splits the data at random break points of those features. Anomalies are then detected as instances that have short average path lengths in the isolation trees.
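A minimal sketch of fitting an isolation forest on the weekend feature matrix described above (normalised 24-hour values plus temperature, KL divergence, peak hour, etc.); the placeholder data and the contamination setting are illustrative assumptions, not the project's values.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder: one row per weekend day, columns = normalised hourly values
# plus the extra features listed above.
rng = np.random.default_rng(0)
X_weekend = rng.normal(size=(200, 27))

iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=0)
labels = iso.fit_predict(X_weekend)        # -1 = anomaly, 1 = normal
scores = iso.decision_function(X_weekend)  # lower scores are more anomalous

anomalous_days = np.where(labels == -1)[0]
```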
-
But the work is not over yet: we have predicted the consumption values and now must detect the anomalies.
-
I defined a point as an anomaly if its prediction error exceeds a rule-based threshold.
-
Here I plotted the error values after converting them into an approximately Gaussian distribution by resampling to daily mean values, and identified the error values that fall beyond the 2-sigma level.
-
The extremely large errors are labeled as anomalies.
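A minimal sketch of this thresholding step, assuming y_true and y_pred are pandas Series of actual and predicted consumption indexed by timestamp; resampling to daily means and the 2-sigma cut follow the description above, but the names and defaults are illustrative.

```python
import pandas as pd

def flag_anomalies(y_true: pd.Series, y_pred: pd.Series, n_sigma: float = 2.0) -> pd.Series:
    """Label days whose mean prediction error is extreme under a Gaussian assumption."""
    error = (y_true - y_pred).abs()

    # Resample errors to daily means, then mark days beyond n_sigma standard deviations.
    daily_error = error.resample("D").mean()
    mu, sigma = daily_error.mean(), daily_error.std()
    return daily_error > mu + n_sigma * sigma
```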
-
The predicted anomalies are visualized below (light-shaded straight lines belong to weekdays and dark-shaded straight lines belong to weekends).
-
For site 334
-
For site 234
-
Feature Importance from the model:
SITE 234:
SITE 334:
From the above feature importance images, temperature, minutes, day, and hour play a key role in predicting electric consumption values.
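The importance plots themselves are not reproduced here; below is a minimal sketch of how such a chart can be produced from the fitted model, assuming the trained XGBRegressor from the training step and a feature_names list matching the feature matrix columns (both assumptions):

```python
import pandas as pd
import plotly.express as px

# 'model' is the fitted XGBRegressor and 'feature_names' the columns of the
# feature matrix (both assumed to be available from the training step).
imp = (pd.DataFrame({"feature": feature_names,
                     "importance": model.feature_importances_})
         .sort_values("importance", ascending=False))
px.bar(imp, x="feature", y="importance", title="XGBoost feature importance").show()
```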
Future Work and Scope
-
This work can be further improved by using a more rule-based approach to decide the right anomalies for each site and meter_id. PCA and more feature engineering can give us more accurate predictions, and rule-based fine-tuning on top of that can help achieve even higher accuracy.
-
There is a lot of ongoing research on detecting anomalies in various sectors. Neural networks and GANs (novel unsupervised learning methods) are becoming more efficient at identifying anomalies. These methods can very well be applied to tabular data as well.
MY LEARNINGS:
-
My biggest takeaway from this project is that anomaly pattern detection needs subject matter expertise more than any other data science field.
-
This is one of the most important and practical fields, with extremely high stakes in the real world.
-
Combining time series and anomaly detection raises the stakes in a business even higher.
-
It is especially important to define the rules that fix the threshold used to identify anomalies.
-
Having more anomaly alerts is better than having fewer.
-
Isolation forest, an unsupervised algorithm, detects anomalies via short average path lengths and runs with low linear time complexity and a small memory requirement.
-
I tried using RAPIDS AI to run XGBoost on a GPU. It wasn't completely successful; I need to investigate more.
What I tried and couldn’t achieve
-
The data had more than a million rows for some sites. I couldn't figure out a way to use the GPU to train on site-level data; I was only able to train per meter_id.
-
I need to figure out GPU-based training and incorporate more data into training, thereby getting more accurate anomaly detection.
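For reference, a minimal sketch of GPU-backed training in XGBoost itself (independent of RAPIDS); the gpu_hist tree method applies to the 1.x releases, while newer releases select the device explicitly:

```python
from xgboost import XGBRegressor

# XGBoost 1.x style: run the histogram tree method on the GPU.
gpu_model = XGBRegressor(n_estimators=300, tree_method="gpu_hist")

# XGBoost >= 2.0 style (equivalent intent):
# gpu_model = XGBRegressor(n_estimators=300, tree_method="hist", device="cuda")
```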