This project analyzes real estate data to predict house prices. The dataset contains features such as house size, number of bedrooms, and other key attributes.
The goal is to explore the dataset, prepare it for analysis, visualize key trends, and build a predictive model that estimates house prices.
- Loaded the real estate data from a CSV file (`real_state_dataset.csv`).

```python
import pandas as pd

df = pd.read_csv('real_state_dataset.csv')
```
- Displayed basic dataset information including shape, column names, and data types.

```python
print(df.shape)
df.info()  # prints column names, dtypes, and non-null counts
```
- Inspected the first few rows using `head()`.

```python
print(df.head())
```
- Checked for missing values using `df.isnull().sum()`.

```python
print(df.isnull().sum())
```
- Dropped unnecessary columns: `brokered_by`, `zip_code`, and `prev_sold_date`.

```python
df.drop(columns=['brokered_by', 'zip_code', 'prev_sold_date'], inplace=True)
```
- Removed rows with missing values using `dropna()`.

```python
df.dropna(inplace=True)
```
- Checked for duplicate entries and removed them using `drop_duplicates()`.

```python
df.drop_duplicates(inplace=True)
```
- Calculated descriptive statistics (count, mean, min, max) for numerical columns using `describe()`.

```python
print(df.describe())
```
- Analyzed the distribution of key features.
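The exact distribution plots are not included here; as a minimal sketch, histograms of the numeric columns already used in this project (`price`, `bed`, `bath`, `house_size`) could be produced like this:

```python
# Illustrative sketch (not the original notebook code): histograms of key numeric features.
import matplotlib.pyplot as plt

df[['price', 'bed', 'bath', 'house_size']].hist(bins=50, figsize=(10, 6))
plt.tight_layout()
plt.show()
```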
- Visualized the top 10 states with the most houses using a bar plot.

```python
import matplotlib.pyplot as plt

df['state'].value_counts().sort_values(ascending=False).head(10).plot(kind='bar')
plt.title('Top 10 States with Most Houses')
plt.show()
```
- Calculated average house prices by state and city.

```python
avg_price_by_state = df.groupby('state')['price'].mean()
print(avg_price_by_state)
```
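The snippet above covers the state-level averages; the city-level aggregation is not shown. A minimal sketch, assuming the dataset keeps a `city` column:

```python
# Sketch only: average price per (state, city) pair, assuming a 'city' column exists.
avg_price_by_city = df.groupby(['state', 'city'])['price'].mean()
print(avg_price_by_city.sort_values(ascending=False).head(10))
```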
- Displayed the correlation between numerical features and the target variable (`price`).

```python
# numeric_only=True avoids errors from remaining non-numeric columns on recent pandas versions
print(df.corr(numeric_only=True)['price'])
```
- Selected relevant features (`bed`, `bath`, `house_size`) for model building.

```python
X = df[['bed', 'bath', 'house_size']]
y = df['price']
```
- No additional feature engineering was performed.
- Split the dataset into training and testing sets using `train_test_split`.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
- Standardized the numerical features using `StandardScaler` to improve model performance.

```python
from sklearn.preprocessing import StandardScaler
import joblib

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
joblib.dump(scaler, 'scaler.pkl')
```
- Trained a Linear Regression model using the training data.

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
```
- Made predictions on the test data and evaluated the model using Mean Absolute Error (MAE).

```python
from sklearn.metrics import mean_absolute_error

lr_pred = lr.predict(X_test)
mae = mean_absolute_error(y_test, lr_pred)
print(f'Mean Absolute Error: {mae}')
```
- Saved the trained model and scaler using `joblib.dump()`.

```python
joblib.dump(lr, 'model.pkl')
```
- A Streamlit app was developed to allow users to input house features and get a predicted price.

```python
import streamlit as st
import joblib
import numpy as np

# Load the fitted scaler and trained model
scaler = joblib.load('scaler.pkl')
model = joblib.load('model.pkl')

st.title('House Price Prediction')
st.divider()

bed = st.number_input('Bedrooms', value=2, step=1)
bath = st.number_input('Bathrooms', value=1, step=1)
house_size = st.number_input('House Size', value=1000, step=50)
X = [bed, bath, house_size]

st.divider()
predict_btn = st.button('Predict')
st.divider()

if predict_btn:
    st.balloons()
    X1 = np.array(X)
    X_array = scaler.transform([X1])  # scale the single input row before predicting
    prediction = model.predict(X_array)[0]
    st.write(f'Predicted Price: {prediction:.2f}')
else:
    st.write('Click the button to predict the price')
```
- The model was evaluated using Mean Absolute Error (MAE), which measures how close predictions are to actual values.
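For reference, MAE is simply the average absolute difference between predicted and actual prices. A tiny illustration with made-up numbers (not from the project's data):

```python
# Illustrative only: MAE is the mean of |actual - predicted| over the test set.
import numpy as np

actual = np.array([250000, 300000, 150000])       # hypothetical actual prices
predicted = np.array([240000, 320000, 155000])    # hypothetical predictions
mae_example = np.mean(np.abs(actual - predicted))  # (10000 + 20000 + 5000) / 3 ≈ 11666.67
print(mae_example)
```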
- The Streamlit app provides an interactive interface for predicting house prices.
- Handle outliers (if present) and perform feature scaling.
- Compare the Linear Regression model with other machine learning models (e.g., Decision Trees, Random Forests); a rough sketch of such a comparison follows this list.
- Tune hyperparameters to improve model performance.
- Deploy the final model as a web application.
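As a starting point for the model comparison mentioned above, one possible sketch (not part of the current project code) reusing the scaled train/test split from earlier:

```python
# Sketch only: compare Linear Regression against a Random Forest on the same split.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)  # X_train/X_test are the scaled arrays from earlier
    pred = model.predict(X_test)
    print(f'{name} MAE: {mean_absolute_error(y_test, pred):.2f}')
```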
Contributions are welcome! Please fork the repository and create a pull request for any enhancements or bug fixes.
This project is licensed under the MIT License.