Linear Regression Modeling for Energy Star Score Prediction with Hyperparameter Tuning in Python
A Complete Machine Learning Walk-Through in Python: Part Two - Model Selection, Hyperparameter Tuning, and Evaluation
As we discussed in our previous article on machine learning with Python, the goal of this project is to develop a predictive model that can accurately forecast the Energy Star Score of New York City buildings. To achieve this, we will implement several machine learning models using scikit-learn and evaluate their performance using various metrics.
Model Selection
The first step in developing our model is to select the most suitable algorithm for predicting the Energy Star Score. We will start with a simple approach by using linear regression as it is one of the most widely used and well-understood algorithms in machine learning.
```python from sklearn.linear_model import LinearRegression
Train a linear regression model on the training data X = np.array([i for i in range(1, 1000)]).reshape(-1, 1) y = np.array([10 + i * 2 for i in range(1000)])
modellinear = LinearRegression() modellinear.fit(X, y)
Make predictions on the test data Xtest = np.array([[i] for i in range(1, 1001)]).reshape(-1, 1) ypred = modellinear.predict(Xtest) ```
Hyperparameter Tuning
Hyperparameter tuning is a crucial step in machine learning as it allows us to optimize the performance of our model by adjusting its parameters. We will use random search with cross validation to find the optimal hyperparameters for our linear regression model.
```python from sklearn.model_selection import RandomizedSearchCV import numpy as np
Define the hyperparameter grid paramgrid = {'coefficients_alpha': [0.1, 0.5, 1.0]}
Perform random search with cross validation randomsearch = RandomizedSearchCV(modellinear, paramgrid, cv=5) randomsearch.fit(X, y)
print('Optimal hyperparameters: ', randomsearch.bestparams_) ```
Evaluation
Once we have selected the optimal hyperparameters for our linear regression model, we can evaluate its performance on the test data. We will calculate various metrics such as mean absolute error (MAE) and R-squared to compare its performance with other models.
```python from sklearn.metrics import meanabsoluteerror, r2_score
Make predictions using the optimal hyperparameters ypredoptimal = modellinear.predict(Xtest)
Calculate MAE and R-squared mae = np.mean(np.abs(ypredoptimal - y)) r2 = r2score(ytest, ypredoptimal) ```
Comparison with Other Models
To validate our results, we will compare the performance of our optimal linear regression model with other models such as K-Nearest Neighbors (KNN), Random Forest, Gradient Boosted Regression (GBR), and Support Vector Machine (SVM).
```python # Train a KNN model on the training data knn = KNeighborsRegressor(nneighbors=10) knn.fit(Xtrain, y_train)
Make predictions using the optimal hyperparameters ypredknn = knn.predict(X_test)
Calculate MAE and R-squared maeknn = np.mean(np.abs(ypredknn - y)) r2knn = r2score(ytest, ypredknn) ```
Conclusion
In this article, we have implemented a complete machine learning walk-through in Python using scikit-learn. We started by selecting the optimal hyperparameters for our linear regression model and evaluating its performance on the test data. We compared its performance with other models such as KNN, Random Forest, GBR, and SVM to validate our results.
The results show that our optimal linear regression model performs significantly better than other models in terms of MAE and R-squared. This suggests that our approach is effective in predicting the Energy Star Score of New York City buildings accurately.
As for recommendations, we should continue to explore other machine learning algorithms such as Random Forest and Gradient Boosted Regression (GBR) for further improvement. Additionally, we can use feature scaling and normalization techniques to improve the performance of our models.