{ "cells": [ { "cell_type": "markdown", "id": "9a9ada08-c9f0-466f-91a0-265fa539f9f5", "metadata": {}, "source": [ "# [Predict the strength of concrete](https://app.datacamp.com/workspace/w/6062fa1f-85aa-48e2-a156-66a5fba7ff2a)\n", "\n", "## 📖 Background\n", "\n", "Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives. \n", "\n", "The compressive strength of concrete is a function of its components and age, so the team is testing different combinations of ingredients at different time intervals. \n", "\n", "Find a simple way to estimate strength to predict how a particular sample is expected to perform.\n", "\n", "The objectives are to find:\n", "1. The average strength of the concrete samples at 1, 7, 14, and 28 days of age.\n", "2. The coefficients of the regression model, using the formula provided to us:\n", "![Strength Equation](str_eq.png)" ] }, { "cell_type": "markdown", "id": "10dcc269-3659-4851-99cd-f1ffb7f818aa", "metadata": {}, "source": [ "## 💾 The data\n", "The team has already tested more than a thousand samples ([source](https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength)):\n", "\n", "### Compressive strength data:\n", "- \"cement\" - Portland cement in kg/m3\n", "- \"slag\" - Blast furnace slag in kg/m3\n", "- \"fly_ash\" - Fly ash in kg/m3\n", "- \"water\" - Water in liters/m3\n", "- \"superplasticizer\" - Superplasticizer additive in kg/m3\n", "- \"coarse_aggregate\" - Coarse aggregate (gravel) in kg/m3\n", "- \"fine_aggregate\" - Fine aggregate (sand) in kg/m3\n", "- \"age\" - Age of the sample in days\n", "- \"strength\" - Concrete compressive strength in megapascals (MPa)\n", "\n", "***Acknowledgments**: I-Cheng Yeh, \"Modeling of strength of high-performance concrete using artificial neural networks,\" Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)*." 
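, "\n", "\n", "A minimal sketch, not part of the original analysis, of how the columns listed above would plug into a strength formula. It assumes the formula in `str_eq.png` is linear in these eight columns (an intercept plus one coefficient per feature), which matches the equation printed at the end of this notebook. The function name `estimate_strength`, the sample mix, and every coefficient value are hypothetical placeholders.\n", "\n", "
```python
# Feature names follow the column list above.
FEATURES = ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer',
            'coarse_aggregate', 'fine_aggregate', 'age']


def estimate_strength(sample, intercept, coefs):
    # Linear estimate: intercept + sum of (coefficient * feature value).
    return intercept + sum(coefs[name] * sample[name] for name in FEATURES)


# Hypothetical mix and placeholder coefficients, for illustration only.
sample = {'cement': 100.0, 'slag': 0.0, 'fly_ash': 0.0, 'water': 0.0,
          'superplasticizer': 0.0, 'coarse_aggregate': 0.0,
          'fine_aggregate': 0.0, 'age': 10}
coefs = {name: 0.0 for name in FEATURES}
coefs['cement'] = 0.5
coefs['age'] = 2.0

print(estimate_strength(sample, intercept=1.0, coefs=coefs))
```
", "\n", "Keeping the coefficients in a dictionary keyed by column name makes it harder to pair a coefficient with the wrong ingredient than relying on list positions.\n"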
] }, { "cell_type": "markdown", "id": "0864ae32-db01-4bad-b6f3-3b506daf0ba4", "metadata": {}, "source": [ "## Import Libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "7f0964c3-2705-41df-af08-d8a8efe2378c", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.preprocessing import StandardScaler, OrdinalEncoder\n", "from sklearn.model_selection import train_test_split, cross_val_score, KFold\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.linear_model import LinearRegression, LogisticRegression\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor\n", "from sklearn.svm import SVR\n", "from sklearn.model_selection import GridSearchCV\n", "import tensorflow as tf\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "id": "30988536-8315-4f6c-b131-7bbe2183e440", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "code", "execution_count": 2, "id": "40251159-3eea-47f8-bfb4-09e38d5ced65", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cement slag fly_ash water superplasticizer coarse_aggregate \\\n", "0 540.0 0.0 0.0 162.0 2.5 1040.0 \n", "1 540.0 0.0 0.0 162.0 2.5 1055.0 \n", "2 332.5 142.5 0.0 228.0 0.0 932.0 \n", "3 332.5 142.5 0.0 228.0 0.0 932.0 \n", "4 198.6 132.4 0.0 192.0 0.0 978.4 \n", "\n", " fine_aggregate age strength \n", "0 676.0 28 79.986111 \n", "1 676.0 28 61.887366 \n", "2 594.0 270 40.269535 \n", "3 594.0 365 41.052780 \n", "4 825.5 360 44.296075 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/concrete_data.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "id": "90a8a67f-1b24-4d00-af31-bf179f76b885", "metadata": {}, "source": [ "## Data Preprocessing\n", "\n", "### Data Size and Structure\n", "\n", "- Dataset comprises of 1030 observations.\n", "- 8 features and 1 target variable `strength`.\n", "- All columns are numerical\n", "- No null values" ] }, { "cell_type": "code", "execution_count": 3, "id": "643e657c-22b6-4505-b453-6e4e8f29677f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape: (1030, 9) \n", "\n", "\n", "RangeIndex: 1030 entries, 0 to 1029\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 cement 1030 non-null float64\n", " 1 slag 1030 non-null float64\n", " 2 fly_ash 1030 non-null float64\n", " 3 water 1030 non-null float64\n", " 4 superplasticizer 1030 non-null float64\n", " 5 coarse_aggregate 1030 non-null float64\n", " 6 fine_aggregate 1030 non-null float64\n", " 7 age 1030 non-null int64 \n", " 8 strength 1030 non-null float64\n", "dtypes: float64(8), int64(1)\n", "memory usage: 72.5 KB\n" ] } ], "source": [ "print(f'Shape: {df.shape} \\n')\n", "df.info()" ] }, { "cell_type": "markdown", "id": "0d90e6c3-eb12-4ae7-8a5d-6e20afc4c163", "metadata": {}, "source": [ "### Duplicated Data\n", "A small 2.43% of the dataset are duplicated. 
I dropped these, which leaves 1005 observations." ] }, { "cell_type": "code", "execution_count": 4, "id": "bbb3bde7-0e78-4bd1-87b7-f26932d0fd52", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Duplicated percentage: 2.43%\n", "df shape after removing duplicates: (1005, 9)\n" ] } ], "source": [ "dup_perc = df.duplicated().sum() / len(df) * 100\n", "print(f'Duplicated percentage: {np.round(dup_perc, 2)}%')\n", "df.drop_duplicates(inplace=True)\n", "print(f'df shape after removing duplicates: {df.shape}')" ] }, { "cell_type": "markdown", "id": "aaec1f84-fbe9-44e4-9393-9435bf2a4016", "metadata": {}, "source": [ "### Outliers\n", "\n", "Boxplots show that there are outliers in our data, but not many.\n", "We replace each outlier (a value more than 1.5 × IQR beyond the quartiles) with the median of its column." ] }, { "cell_type": "code", "execution_count": 5, "id": "d7f08425-08c3-48a8-87be-013d3be2786d", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(12,6))\n", "sns.boxplot(data=df, orient=\"h\", palette=\"Set3\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 6, "id": "464b6a04-691a-4a12-9e6c-c53a26f890fb", "metadata": {}, "outputs": [], "source": [ "# Replace IQR outliers in every feature column (the target, strength, is excluded)\n", "for col_name in df.columns[:-1]:\n", "    q1 = df[col_name].quantile(0.25)\n", "    q3 = df[col_name].quantile(0.75)\n", "    iqr = q3 - q1\n", "    low = q1 - 1.5 * iqr\n", "    high = q3 + 1.5 * iqr\n", "    # Values outside [low, high] are overwritten with the column median\n", "    df.loc[(df[col_name] < low) | (df[col_name] > high), col_name] = df[col_name].median()" ] }, { "cell_type": "markdown", "id": "d2ceaa14-7e2f-4597-86ed-10f3f34458e4", "metadata": {}, "source": [ "### Summary Statistics\n", "\n", "Variance, skew, and kurtosis are also included.
\n", "We can see that every variable has non-zero skew, so none of the distributions is perfectly symmetric." ] }, { "cell_type": "code", "execution_count": 7, "id": "61d09a65-7a7f-43de-951b-decb042f5bd0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count mean std min 25% \\\n", "cement 1005.0 278.629055 104.345003 102.000000 190.680000 \n", "slag 1005.0 71.367711 85.239740 0.000000 0.000000 \n", "fly_ash 1005.0 55.535075 64.207448 0.000000 0.000000 \n", "water 1005.0 182.521294 20.114500 127.300000 168.000000 \n", "superplasticizer 1005.0 5.791846 5.396851 0.000000 0.000000 \n", "coarse_aggregate 1005.0 974.376468 77.579534 801.000000 932.000000 \n", "fine_aggregate 1005.0 771.628905 78.821267 594.000000 724.300000 \n", "age 1005.0 32.117413 27.665333 1.000000 7.000000 \n", "strength 1005.0 35.250273 16.284808 2.331808 23.523542 \n", "\n", " 50% 75% max var skew \\\n", "cement 265.000000 349.00000 540.000000 10887.879601 0.564997 \n", "slag 20.000000 141.30000 342.100000 7265.813343 0.830465 \n", "fly_ash 0.000000 118.27000 200.100000 4122.596436 0.497324 \n", "water 185.700000 192.00000 228.000000 404.593097 0.126521 \n", "superplasticizer 6.100000 9.90000 23.400000 29.126001 0.514240 \n", "coarse_aggregate 968.000000 1031.00000 1145.000000 6018.584052 -0.065242 \n", "fine_aggregate 780.000000 822.00000 945.000000 6212.792066 -0.335220 \n", "age 28.000000 28.00000 120.000000 765.370663 1.312008 \n", "strength 33.798114 44.86834 82.599225 265.194960 0.395653 \n", "\n", " kurtosis \n", "cement -0.432463 \n", "slag -0.524116 \n", "fly_ash -1.366457 \n", "water -0.076357 \n", "superplasticizer -0.312220 \n", "coarse_aggregate -0.583034 \n", "fine_aggregate -0.198316 \n", "age 0.876534 \n", "strength -0.305402 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "desc = df.describe().T\n", "desc['var'] = df.var(axis=0)\n", "desc['skew'] = df.skew(axis=0)\n", "desc['kurtosis'] = df.kurtosis(axis=0)\n", "desc" ] }, { "cell_type": "markdown", "id": "dcb2e04a-85c8-49aa-99c4-529100618704", "metadata": { "tags": [] }, "source": [ "## Exploratory Data Analysis" ] }, { "cell_type": "markdown", "id": "a326a037-17a1-4f82-b7c5-c490d58140aa", "metadata": {}, 
"source": [ "### Correlation matrix\n", "\n", "Let's look at the relationship between all the variables and their correlation.
\n", "From the heatmap, the target variable `strength` is most positively correlated with `cement` (0.49), `superplasticizer` (0.32), and `age` (0.5).
\n", "There is a strong negative correlation (-0.61) between `superplasticizer` and `water`." ] }, { "cell_type": "code", "execution_count": 8, "id": "cf29bd8a-2619-4d3c-a601-249bbf5011aa", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(12,6))\n", "sns.heatmap(df.corr(), annot=True)\n", "plt.title('Heatmap of correlation with all features')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "3e9f2d12-a822-42a5-87e3-94d391412429", "metadata": {}, "source": [ "### Age\n", "\n", "Age is a very important characteristic that affects concrete strength." ] }, { "cell_type": "code", "execution_count": 9, "id": "08fdc396-cfbe-4914-985b-b56c5fc74339", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(12,6))\n", "sns.countplot(x='age', data=df)\n", "plt.title(label=\"Count of ages of concrete\")\n", "plt.xlabel(\"Age (Days)\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "0cb1203b-5e5c-4012-afcc-9e8ab70b3ebd", "metadata": {}, "source": [ "This bar plot shows that age is a categorical variable with a high number of observations at 28 days.
\n", "The 28-day time frame is significant because this is the period in which concrete reaches about 99% of its strength. While the concrete continues to gain strength after that period, the rate of gain is much lower than during the first 28 days." ] }, { "cell_type": "code", "execution_count": 10, "id": "f9337430-04ba-4e21-9711-7c7487817a18", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average concrete strength: 35.25027287623584 \n", "\n" ] }, { "data": { "text/html": [ "<table>
<caption>Average concrete strength at different ages (Days):</caption>
<thead><tr><th></th><th>age</th><th>mean</th><th>count</th></tr></thead>
<tbody>
<tr><th>0</th><td>1</td><td>9.452716</td><td>2</td></tr>
<tr><th>1</th><td>3</td><td>18.378023</td><td>129</td></tr>
<tr><th>2</th><td>7</td><td>25.181843</td><td>122</td></tr>
<tr><th>3</th><td>14</td><td>28.751038</td><td>62</td></tr>
<tr><th>4</th><td>28</td><td>37.383788</td><td>478</td></tr>
<tr><th>5</th><td>56</td><td>50.715152</td><td>86</td></tr>
<tr><th>6</th><td>90</td><td>40.480809</td><td>54</td></tr>
<tr><th>7</th><td>91</td><td>68.674649</td><td>17</td></tr>
<tr><th>8</th><td>100</td><td>47.668780</td><td>52</td></tr>
<tr><th>9</th><td>120</td><td>39.647168</td><td>3</td></tr>
</tbody>
</table>
\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Mean strength and sample count for each age\n", "avg_str_by_day = df.groupby('age')['strength'].agg(['mean', 'count'])\n", "avg_str_by_day.reset_index(inplace=True)\n", "avg_str_by_day.age = avg_str_by_day.age.astype(int)\n", "avg_str_by_day = avg_str_by_day.style.set_caption(\"Average concrete strength at different ages (Days):\")\n", "print(f'Average concrete strength: {df.strength.mean()} \\n')\n", "avg_str_by_day" ] }, { "cell_type": "markdown", "id": "707706ff-5622-481a-a043-1dd841f59fb1", "metadata": {}, "source": [ "From the table we can see that the average concrete strength at 28 days (37.38 MPa) is close to the average strength of the entire dataset (35.25 MPa)." ] }, { "cell_type": "code", "execution_count": 11, "id": "1cf18be3-7283-43af-93c7-13a3866238de", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(12,6))\n", "sns.boxplot(x='age', y='strength', data=df);\n", "plt.title(label=\"Boxplot of concrete strength at different ages\") \n", "plt.xlabel(\"Age (Days)\")\n", "plt.ylabel(\"Strength\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "8d15b593-4f66-463b-ac70-425c266650b4", "metadata": {}, "source": [ "In this boxplot, concrete strength is highest at 91 days: these samples have the highest median strength with little variation. We can also see that strength varies most at 28 days." ] }, { "cell_type": "markdown", "id": "81da4f90-bd9f-4f6b-b876-652c8c8d1970", "metadata": { "tags": [] }, "source": [ "## 1. Average strength of the concrete samples at 1, 7, 14, and 28 days of age.\n", "\n", "Average strength is lowest at 1 day (9.45 MPa) and increases sharply by 7 days (25.18 MPa). It continues to rise at 14 days (28.75 MPa) and 28 days (37.38 MPa)." ] }, { "cell_type": "code", "execution_count": 12, "id": "cfc23420-047f-49a4-8456-c26186edab7a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average strength of the concrete sample is 9.45 MPa at 1 day.\n", "Average strength of the concrete sample is 25.18 MPa at 7 days.\n", "Average strength of the concrete sample is 28.75 MPa at 14 days.\n", "Average strength of the concrete sample is 37.38 MPa at 28 days.\n" ] } ], "source": [ "ages = [1, 7, 14, 28]\n", "\n", "for age in ages:\n", "    # Select the samples tested at this age and average their strength\n", "    cond = df.age == age\n", "    avg_str = df[cond].strength.mean().round(2)\n", "    if age == 1:\n", "        print(f'Average strength of the concrete sample is {avg_str} MPa at {age} day.')\n", "    else:\n", "        print(f'Average strength of the concrete sample is {avg_str} MPa at {age} days.')" ] }, { "cell_type": "markdown", "id": "dc886cef-86b2-4288-b124-c35d2a47f512", "metadata": {}, "source": [ "## 2. 
Creating a predictive model\n", "\n", "Now let's help our colleagues in the engineering department find the coefficients $\\beta_{0}$, $\\beta_{1}$ ... $\\beta_{8}$, to use in the following formula:\n", "\n", "![Strength Equation](str_eq.png)" ] }, { "cell_type": "markdown", "id": "3dede16c-0c6f-4de4-b653-79adf29002cb", "metadata": {}, "source": [ "## Train Test Split\n", "\n", "Split the data into training (80%) and test (20%) sets to estimate how the machine learning algorithms perform when making predictions on data not used to train the model." ] }, { "cell_type": "code", "execution_count": 13, "id": "20ac4305-ba34-4a3a-92f2-d6d430b68a09", "metadata": {}, "outputs": [], "source": [ "features = ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age']\n", "X = df[features] # Features\n", "y = df['strength'] # Target" ] }, { "cell_type": "code", "execution_count": 14, "id": "8df4a9f2-2a5e-4a0c-89bd-a957b6988345", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X train data (804, 8)\n", "y train data (804,)\n", "X test data (201, 8)\n", "y test data (201,)\n" ] } ], "source": [ "# Train Test Split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022)\n", "\n", "print('X train data {}'.format(X_train.shape))\n", "print('y train data {}'.format(y_train.shape))\n", "print('X test data {}'.format(X_test.shape))\n", "print('y test data {}'.format(y_test.shape))" ] }, { "cell_type": "markdown", "id": "779d6366-9986-43f0-9c2c-65bbb934ea8f", "metadata": { "tags": [] }, "source": [ "### Encoding age\n", "\n", "We observed that `age` is a variable that takes a finite set of discrete values with a ranked ordering between them, so we encode it ordinally." 
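, "\n", "\n", "A minimal sketch of ordinal encoding, written in plain Python as a stand-in for scikit-learn's `OrdinalEncoder`. The helper names `fit_ordinal_mapping` and `apply_ordinal_mapping`, the sample ages, and the `-1` code for unseen values are assumptions made for illustration. The mapping is learned from the training values only and then reused for the test values, so a given age always receives the same code.\n", "\n", "
```python
def fit_ordinal_mapping(train_values):
    # Sorted unique training values become consecutive integer codes.
    return {value: code for code, value in enumerate(sorted(set(train_values)))}


def apply_ordinal_mapping(values, mapping, unknown_code=-1):
    # Values never seen while fitting receive the sentinel unknown_code.
    return [mapping.get(value, unknown_code) for value in values]


# Hypothetical ages (in days) for a training and a test split.
train_ages = [28, 7, 3, 28, 90, 7]
test_ages = [7, 90, 56]

mapping = fit_ordinal_mapping(train_ages)
encoded_train = apply_ordinal_mapping(train_ages, mapping)
encoded_test = apply_ordinal_mapping(test_ages, mapping)

print(mapping)
print(encoded_train)
print(encoded_test)
```
", "\n", "If each split were fitted separately, the same age could map to different codes in train and test whenever the two splits do not contain exactly the same set of ages.\n"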
] }, { "cell_type": "code", "execution_count": 15, "id": "11ada35a-1bbd-4b29-8fec-671437863513", "metadata": {}, "outputs": [], "source": [ "def ordinal_encoding_feature(data, feature): \n", "    '''Transform selected column with ordinal encoder\n", "    \n", "    INPUTS:\n", "    data: dataframe\n", "    feature: column name\n", "    \n", "    OUTPUT:\n", "    d: dataframe with feature ordinally encoded\n", "    ''' \n", "    d = data.copy()\n", "    encoder = OrdinalEncoder()\n", "    # Reshape because only one column is transformed\n", "    encoder.fit(data[feature].values.reshape(-1,1))\n", "    # Transform the requested column (previously hardcoded to data.age)\n", "    d[feature] = encoder.transform(data[feature].values.reshape(-1,1)).ravel()\n", "    return d\n", "# Note: each call fits a fresh encoder, so train and test are encoded independently\n", "X_train = X_train.pipe(ordinal_encoding_feature, 'age')\n", "X_test = X_test.pipe(ordinal_encoding_feature, 'age')" ] }, { "cell_type": "markdown", "id": "f4dbd66c-9dfa-48b2-8cb5-49016047e32a", "metadata": {}, "source": [ "## Model Selection\n", "\n", "We will be using a supervised regression model since the target variable `strength` is labeled and continuous.
\n", "We use k-fold cross-validation with the R2 score to estimate and compare how the models perform on out-of-sample data. This lets us identify which model is worth improving.
\n", "The gradient boosting model (`GradientBoostRegressor` in the table) gives the best result of the eight models, with the highest mean R2 score of 0.89." ] }, { "cell_type": "code", "execution_count": 16, "id": "d5761d7b-94ca-4492-b3ce-335903f65209", "metadata": {}, "outputs": [], "source": [ "pipelines = []\n", "pipelines.append(('Linear Regression', Pipeline([('scaler', StandardScaler()), ('LR', LinearRegression())])))\n", "pipelines.append(('KNN Regressor', Pipeline([('scaler', StandardScaler()), ('KNNR', KNeighborsRegressor())])))\n", "pipelines.append(('SupportVectorRegressor', Pipeline([('scaler', StandardScaler()), ('SVR', SVR())])))\n", "pipelines.append(('DecisionTreeRegressor', Pipeline([('scaler', StandardScaler()), ('DTR', DecisionTreeRegressor())])))\n", "pipelines.append(('AdaboostRegressor', Pipeline([('scaler', StandardScaler()), ('ABR', AdaBoostRegressor())])))\n", "pipelines.append(('RandomForestRegressor', Pipeline([('scaler', StandardScaler()), ('RBR', RandomForestRegressor())])))\n", "pipelines.append(('BaggingRegressor', Pipeline([('scaler', StandardScaler()), ('BGR', BaggingRegressor())])))\n", "pipelines.append(('GradientBoostRegressor', Pipeline([('scaler', StandardScaler()), ('GBR', GradientBoostingRegressor())])))" ] }, { "cell_type": "code", "execution_count": 17, "id": "9ccd3c02-eb69-41b2-9c85-56e990b643c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Regressor R2 Std\n", "0 Linear Regression 0.766578 0.026529\n", "1 KNN Regressor 0.788851 0.022900\n", "2 SupportVectorRegressor 0.702593 0.018576\n", "3 DecisionTreeRegressor 0.804405 0.012659\n", "4 AdaboostRegressor 0.771321 0.011322\n", "5 RandomForestRegressor 0.882186 0.013177\n", "6 BaggingRegressor 0.865953 0.009271\n", "7 GradientBoostRegressor 0.885445 0.011429" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create empty dataframe to store the results\n", "cv_scores = pd.DataFrame({'Regressor':[], 'R2':[], 'Std':[]})\n", "\n", "# Cross-validation score for each pipeline for training data\n", "for ind, val in enumerate(pipelines):\n", " name, pipeline = val\n", " kfold = KFold(n_splits=5) \n", " score = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring=\"r2\")\n", " cv_scores.loc[ind] = [name, score.mean(), score.std()]\n", "cv_scores" ] }, { "cell_type": "markdown", "id": "08b10e60-ac8e-45a7-8f24-1f86358c14e4", "metadata": { "tags": [] }, "source": [ "## Model Tuning\n", "\n", "Using Grid Search to tune hyperparameters of Gradient Boosting Regressor." 
] }, { "cell_type": "code", "execution_count": 18, "id": "ab9a6bb0-b7b2-471d-a3fe-cddcca4c9c44", "metadata": {}, "outputs": [], "source": [ "steps = [('scaler', StandardScaler()), ('GBR', GradientBoostingRegressor())]\n", "pipeline = Pipeline(steps)\n", "\n", "param_grid=[{'GBR__n_estimators':[100,500,1000], \n", " 'GBR__learning_rate': [0.1,0.05,0.02,0.01], \n", " 'GBR__max_depth':[4,6], \n", " 'GBR__min_samples_leaf':[3,5,9,17], \n", " 'GBR__max_features':[1.0,0.3,0.1] }]\n", "\n", "#search = GridSearchCV(pipeline, param_grid, cv = 5, scoring = 'r2', n_jobs=-1, verbose=1)\n", "#search.fit(X_train, y_train)\n", "#print(search.best_estimator_) \n", "#print(\"R Squared:\", search.best_score_)" ] }, { "cell_type": "markdown", "id": "42a7a747-75ea-4678-9f20-dbf29b41ecd8", "metadata": {}, "source": [ "Now that we have identified the best model, let's use it to predict on our unseen data `X_test`." ] }, { "cell_type": "code", "execution_count": 19, "id": "8898c103-918d-41d4-96a5-fc7588505237", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pipeline(steps=[('GBR',\n", " GradientBoostingRegressor(learning_rate=0.05, max_depth=6,\n", " max_features=0.3,\n", " min_samples_leaf=17,\n", " n_estimators=1000))])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#best_model = search.best_estimator_\n", "\n", "# Let's hardcode the parameters to save time\n", "best_model = Pipeline(steps=[('GBR',\n", " GradientBoostingRegressor(learning_rate=0.05, max_depth=6,\n", " max_features=0.3,\n", " min_samples_leaf=17,\n", " n_estimators=1000))])\n", "\n", "es = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', mode='min',patience=5, verbose=1)\n", "best_model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 20, "id": "04799719-2b10-475b-be66-900fc872e01d", "metadata": {}, "outputs": [], "source": [ "y_pred = best_model.predict(X_test)" ] }, { "cell_type": "markdown", "id": "9adb02e1-2434-4c64-a9b7-0e860202e791", "metadata": { "tags": [] }, "source": [ "## Model Evaluation\n", "\n", "Let's evaluate the model's performance on the testing data to assess the likely future performance of a model.\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "c31aba09-98ec-463b-979c-bfcb668b5b56", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Absolute Error (MAE): 6.366971463051054\n", "Mean Squared Error (MSE): 68.3596957408342\n", "RMSE: 8.267992243636552\n", "R2 Score: 0.7333563920523734\n" ] } ], "source": [ "print(f'Mean Absolute Error (MAE): {mean_absolute_error(y_test, y_pred)}')\n", "print(f'Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred)}')\n", "print(f'RMSE: {mean_squared_error(y_test, y_pred)**0.5}')\n", "print(f'R2 Score: {r2_score(y_test, y_pred)}')" ] }, { "cell_type": "markdown", "id": "2694da9c-dbc2-40e1-867c-1da7f956b178", "metadata": {}, "source": [ "## Feature Importance\n", "\n", "Using the feature importance of the model, we get the coefficients 
$\\beta_{0}$, $\\beta_{1}$ ... $\\beta_{8}$, to use in the following formula:\n", "![Strength Equation](str_eq.png)\n", "\n", "Note that feature importances are non-negative scores normalized to sum to 1, not fitted regression coefficients, so the resulting equation is only a rough approximation." ] }, { "cell_type": "code", "execution_count": 22, "id": "a4d579af-d464-4f43-b4d2-e93bfae07be3", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Feature importances of the fitted gradient boosting step\n", "betas = best_model[0].feature_importances_\n", "plt.figure(figsize=(12,6))\n", "plt.barh(features, betas)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 23, "id": "543a8dd6-ad13-4662-816f-d241ab4a5033", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Concrete Strength = -169.86276182969183 +\n", "0.22443683977322534 * cement +\n", "0.053931814571658084 * slag +\n", "0.04824340477622348 * fly_ash +\n", "0.12338099419069883 * water +\n", "0.07416135005702751 * superplasticizer +\n", "0.04618145961914725 * coarse_aggregate +\n", "0.07343117951771443 * fine_aggregate +\n", "0.356232957494305 * age\n" ] } ], "source": [ "betas_dict = dict(zip(features, betas))\n", "# Intercept chosen so that the mean residual over the full dataset is zero\n", "beta_0 = np.mean(y - np.sum(X*betas, axis=1))\n", "print(f\"Concrete Strength = {beta_0} +\")\n", "for key in betas_dict.keys():\n", "    # The last term (age) is printed without a trailing plus sign\n", "    if key == \"age\":\n", "        print(f\"{betas_dict[key]} * {key}\")\n", "    else:\n", "        print(f\"{betas_dict[key]} * {key} +\")" ] }, { "cell_type": "code", "execution_count": null, "id": "f93094a5-fde0-43a6-9036-b9e73edf99bb", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }