{
"cells": [
{
"cell_type": "markdown",
"id": "9a9ada08-c9f0-466f-91a0-265fa539f9f5",
"metadata": {},
"source": [
"# [Predict the strength of concrete](https://app.datacamp.com/workspace/w/6062fa1f-85aa-48e2-a156-66a5fba7ff2a)\n",
"\n",
"## 📖 Background\n",
"\n",
"Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives. \n",
"\n",
"The compressive strength of concrete is a function of its components and age, so the team is testing different combinations of ingredients at different time intervals. \n",
"\n",
"Find a simple way to estimate strength to predict how a particular sample is expected to perform.\n",
"\n",
"The objective is to find:\n",
"1. The average strength of the concrete samples at 1, 7, 14, and 28 days of age.\n",
"2. The coefficients of the regression model, using the formula provided to us.\n",
""
]
},
{
"cell_type": "markdown",
"id": "10dcc269-3659-4851-99cd-f1ffb7f818aa",
"metadata": {},
"source": [
"## 💾 The data\n",
"The team has already tested more than a thousand samples ([source](https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength)):\n",
"\n",
"### Compressive strength data:\n",
"- \"cement\" - Portland cement in kg/m3\n",
"- \"slag\" - Blast furnace slag in kg/m3\n",
"- \"fly_ash\" - Fly ash in kg/m3\n",
"- \"water\" - Water in liters/m3\n",
"- \"superplasticizer\" - Superplasticizer additive in kg/m3\n",
"- \"coarse_aggregate\" - Coarse aggregate (gravel) in kg/m3\n",
"- \"fine_aggregate\" - Fine aggregate (sand) in kg/m3\n",
"- \"age\" - Age of the sample in days\n",
"- \"strength\" - Concrete compressive strength in megapascals (MPa)\n",
"\n",
"***Acknowledgments**: I-Cheng Yeh, \"Modeling of strength of high-performance concrete using artificial neural networks,\" Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)*."
]
},
{
"cell_type": "markdown",
"id": "0864ae32-db01-4bad-b6f3-3b506daf0ba4",
"metadata": {},
"source": [
"## Import Libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7f0964c3-2705-41df-af08-d8a8efe2378c",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.preprocessing import StandardScaler, OrdinalEncoder\n",
"from sklearn.model_selection import train_test_split, cross_val_score, KFold\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.linear_model import LinearRegression, LogisticRegression\n",
"from sklearn.neighbors import KNeighborsRegressor\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor\n",
"from sklearn.svm import SVR\n",
"from sklearn.model_selection import GridSearchCV\n",
"import tensorflow as tf\n",
"from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"id": "30988536-8315-4f6c-b131-7bbe2183e440",
"metadata": {},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "40251159-3eea-47f8-bfb4-09e38d5ced65",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(12,6))\n",
"sns.heatmap(df.corr(), annot=True)\n",
"plt.title('Heatmap of correlation with all features')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "3e9f2d12-a822-42a5-87e3-94d391412429",
"metadata": {},
"source": [
"### Age\n",
"\n",
"Age is a very important characteristic that affects concrete strength."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "08fdc396-cfbe-4914-985b-b56c5fc74339",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"[Figure: Count of ages of concrete]"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(12,6))\n",
"sns.countplot(x='age', data=df)\n",
"plt.title(label=\"Count of ages of concrete\")\n",
"plt.xlabel(\"Age (Days)\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "0cb1203b-5e5c-4012-afcc-9e8ab70b3ebd",
"metadata": {},
"source": [
"This bar plot shows that `age` takes only a small set of discrete values, so it behaves like a categorical variable, with by far the most observations at 28 days. \n",
"The 28-day time frame is significant because this is the period over which concrete reaches about 99% of its strength. While concrete continues to gain strength after that period, the rate of gain is much lower than during the first 28 days."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "f9337430-04ba-4e21-9711-7c7487817a18",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average concrete strength: 35.25027287623584 \n",
"\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>Average concrete strength at different ages (Days):</caption>\n",
"  <thead>\n",
"    <tr><th></th><th>age</th><th>mean</th><th>count</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>9.452716</td><td>2</td></tr>\n",
"    <tr><th>1</th><td>3</td><td>18.378023</td><td>129</td></tr>\n",
"    <tr><th>2</th><td>7</td><td>25.181843</td><td>122</td></tr>\n",
"    <tr><th>3</th><td>14</td><td>28.751038</td><td>62</td></tr>\n",
"    <tr><th>4</th><td>28</td><td>37.383788</td><td>478</td></tr>\n",
"    <tr><th>5</th><td>56</td><td>50.715152</td><td>86</td></tr>\n",
"    <tr><th>6</th><td>90</td><td>40.480809</td><td>54</td></tr>\n",
"    <tr><th>7</th><td>91</td><td>68.674649</td><td>17</td></tr>\n",
"    <tr><th>8</th><td>100</td><td>47.668780</td><td>52</td></tr>\n",
"    <tr><th>9</th><td>120</td><td>39.647168</td><td>3</td></tr>\n",
"  </tbody>\n",
"</table>\n"
],
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"avg_str_by_day = df.groupby('age').agg(['mean', 'count'])['strength']\n",
"avg_str_by_day.reset_index(inplace=True)\n",
"avg_str_by_day.age = avg_str_by_day.age.astype(int)\n",
"avg_str_by_day = avg_str_by_day.style.set_caption(\"Average concrete strength at different ages (Days):\")\n",
"print(f'Average concrete strength: {df.strength.mean()} \\n')\n",
"avg_str_by_day"
]
},
{
"cell_type": "markdown",
"id": "707706ff-5622-481a-a043-1dd841f59fb1",
"metadata": {},
"source": [
"From the table we can see that the average concrete strength at 28 days (37.38 MPa) is close to the average strength of the entire dataset (35.25 MPa). This is unsurprising, since nearly half of the samples (478) were tested at 28 days."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "1cf18be3-7283-43af-93c7-13a3866238de",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"[Figure: Boxplot of concrete strength at different ages]"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(12,6))\n",
"sns.boxplot(x='age', y='strength', data=df);\n",
"plt.title(label=\"Boxplot of concrete strength at different ages\") \n",
"plt.xlabel(\"Age (Days)\")\n",
"plt.ylabel(\"Strength\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "8d15b593-4f66-463b-ac70-425c266650b4",
"metadata": {},
"source": [
"In this boxplot, concrete strength is highest at 91 days: these samples have the highest median strength with a narrow spread, although the group contains only 17 samples. Strength shows the largest variation at 28 days."
]
},
{
"cell_type": "markdown",
"id": "81da4f90-bd9f-4f6b-b876-652c8c8d1970",
"metadata": {
"tags": []
},
"source": [
"## 1. Average strength of the concrete samples at 1, 7, 14, and 28 days of age.\n",
"\n",
"The average strength of concrete is lowest after 1 day (an average that rests on only two samples) and is markedly higher at 7 days. It continues to increase at 14 and 28 days."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "cfc23420-047f-49a4-8456-c26186edab7a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average strength of the concrete sample is 9.45 MPa at 1 day.\n",
"Average strength of the concrete sample is 25.18 MPa at 7 days.\n",
"Average strength of the concrete sample is 28.75 MPa at 14 days.\n",
"Average strength of the concrete sample is 37.38 MPa at 28 days.\n"
]
}
],
"source": [
"ages = [1, 7, 14, 28]\n",
"\n",
"for age in ages:\n",
"    cond = df.age == age\n",
"    avg_str = df[cond].strength.mean().round(2)\n",
"    if age == 1:\n",
"        print(f'Average strength of the concrete sample is {avg_str} MPa at {age} day.')\n",
"    else:\n",
"        print(f'Average strength of the concrete sample is {avg_str} MPa at {age} days.')"
]
},
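{
"cell_type": "markdown",
"id": "dc886cef-86b2-4288-b124-c35d2a47f512",
"metadata": {},
"source": [
"## 2. Creating predictive model\n",
"\n",
"Now let's help our colleagues in the engineering department find the coefficients $\\beta_{0}$, $\\beta_{1}$ ... $\\beta_{8}$, to use in the following formula:\n",
"\n",
"$$\\text{strength} = \\beta_{0} + \\beta_{1} \\cdot \\text{cement} + \\beta_{2} \\cdot \\text{slag} + \\beta_{3} \\cdot \\text{fly ash} + \\beta_{4} \\cdot \\text{water} + \\beta_{5} \\cdot \\text{superplasticizer} + \\beta_{6} \\cdot \\text{coarse aggregate} + \\beta_{7} \\cdot \\text{fine aggregate} + \\beta_{8} \\cdot \\text{age}$$"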
]
},
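{
"cell_type": "markdown",
"id": "c4e1a7d2-5b3f-4c8e-9a61-2f7d0b8e3a15",
"metadata": {},
"source": [
"As a minimal sketch of how the coefficients $\\beta_{0}$ ... $\\beta_{8}$ could be read off a fitted scikit-learn `LinearRegression`: the data below is synthetic, and `X_demo`, `y_demo`, and `true_betas` are hypothetical names and values chosen for illustration, not the team's samples.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# Synthetic stand-in for the concrete data (assumption: NOT the real samples)\n",
"rng = np.random.default_rng(0)\n",
"features = ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer',\n",
"            'coarse_aggregate', 'fine_aggregate', 'age']\n",
"X_demo = pd.DataFrame(rng.uniform(0, 100, size=(200, len(features))), columns=features)\n",
"true_betas = np.arange(1, len(features) + 1) / 10  # hypothetical beta_1 ... beta_8\n",
"y_demo = 5.0 + X_demo.to_numpy() @ true_betas      # hypothetical beta_0 = 5, no noise\n",
"\n",
"model = LinearRegression().fit(X_demo, y_demo)\n",
"\n",
"# intercept_ plays the role of beta_0; coef_ holds beta_1 ... beta_8 in column order\n",
"betas = pd.Series([model.intercept_, *model.coef_], index=['intercept', *features])\n",
"print(betas.round(3))\n",
"```\n",
"\n",
"Because the synthetic target is noise-free, the fitted values should match the hypothetical coefficients closely. On real data they would be estimates, and they are only directly usable in the formula if the model is fitted on unscaled features."
]
},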
{
"cell_type": "markdown",
"id": "3dede16c-0c6f-4de4-b653-79adf29002cb",
"metadata": {},
"source": [
"## Train Test Split\n",
"\n",
"Split the data into training (80%) and test (20%) sets to estimate how the machine learning algorithms perform when making predictions on data not used to train the model."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "20ac4305-ba34-4a3a-92f2-d6d430b68a09",
"metadata": {},
"outputs": [],
"source": [
"features = ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age']\n",
"X = df[features] # Features\n",
"y = df['strength'] # Target"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "8df4a9f2-2a5e-4a0c-89bd-a957b6988345",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X train data (804, 8)\n",
"y train data (804,)\n",
"X test data (201, 8)\n",
"y test data (201,)\n"
]
}
],
"source": [
"# Train Test Split\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022)\n",
"\n",
"print('X train data {}'.format(X_train.shape))\n",
"print('y train data {}'.format(y_train.shape))\n",
"print('X test data {}'.format(X_test.shape))\n",
"print('y test data {}'.format(y_test.shape))"
]
},
{
"cell_type": "markdown",
"id": "779d6366-9986-43f0-9c2c-65bbb934ea8f",
"metadata": {
"tags": []
},
"source": [
"### Encoding age\n",
"\n",
"We observed that `age` is a variable comprising a finite set of discrete values with a ranked ordering between them, so we encode it ordinally."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "11ada35a-1bbd-4b29-8fec-671437863513",
"metadata": {},
"outputs": [],
"source": [
"def ordinal_encoding_feature(data, feature):\n",
"    '''Transform selected column with ordinal encoder\n",
"\n",
"    INPUTS:\n",
"    data: dataframe\n",
"    feature: column name\n",
"\n",
"    OUTPUT:\n",
"    d: dataframe with feature ordinally encoded\n",
"    '''\n",
"    d = data.copy()\n",
"    encoder = OrdinalEncoder()\n",
"    # Reshape because only one column is transformed\n",
"    values = data[feature].values.reshape(-1, 1)\n",
"    # Encode the requested column and flatten back to one dimension\n",
"    d[feature] = encoder.fit_transform(values).ravel()\n",
"    return d\n",
"\n",
"# Note: each call fits a fresh encoder, so train and test are encoded independently\n",
"X_train = X_train.pipe(ordinal_encoding_feature, 'age')\n",
"X_test = X_test.pipe(ordinal_encoding_feature, 'age')"
]
},
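{
"cell_type": "markdown",
"id": "e8b2f6a1-7c4d-4e93-b5a0-6d1c9f3e7b42",
"metadata": {},
"source": [
"One thing to keep in mind with the helper above: it fits a new encoder on whatever frame it receives, so the training and test sets can end up with different codes for the same age if the test split happens to lack some ages. A minimal sketch of the alternative, fitting once on the training ages and reusing that mapping, is shown below. The frames `train_demo` and `test_demo` are made-up toy data, not the notebook's split.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.preprocessing import OrdinalEncoder\n",
"\n",
"# Toy frames (assumption: made-up ages, not the notebook's data)\n",
"train_demo = pd.DataFrame({'age': [1, 7, 28, 90]})\n",
"test_demo = pd.DataFrame({'age': [7, 90]})\n",
"\n",
"# Fit on the training ages only, then apply the same mapping to the test ages\n",
"shared = OrdinalEncoder().fit(train_demo[['age']])\n",
"test_shared = shared.transform(test_demo[['age']]).ravel()\n",
"\n",
"# Fitting a second encoder on the test ages alone yields a different mapping\n",
"test_refit = OrdinalEncoder().fit_transform(test_demo[['age']]).ravel()\n",
"\n",
"print(test_shared)  # codes that agree with the training mapping\n",
"print(test_refit)   # codes that restart from 0\n",
"```\n",
"\n",
"With a shared encoder, an age in the test set that never appears in the training set raises an error by default, so that case would need separate handling."
]
},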
{
"cell_type": "markdown",
"id": "f4dbd66c-9dfa-48b2-8cb5-49016047e32a",
"metadata": {},
"source": [
"## Model Selection\n",
"\n",
"We will be using a supervised regression model since the target variable `strength` is labeled and continuous. \n",
"We use k-fold cross-validation to estimate and compare the performance of the models on out-of-sample data, using the R² score. This enables us to identify which model is worth improving upon. \n",
"GradientBoostRegressor gives the best results of the models compared, with the highest R² score of 0.89."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "d5761d7b-94ca-4492-b3ce-335903f65209",
"metadata": {},
"outputs": [],
"source": [
"pipelines = []\n",
"pipelines.append(('Linear Regression', Pipeline([('scaler', StandardScaler()), ('LR', LinearRegression())])))\n",
"pipelines.append(('KNN Regressor', Pipeline([('scaler', StandardScaler()), ('KNNR', KNeighborsRegressor())])))\n",
"pipelines.append(('SupportVectorRegressor', Pipeline([('scaler', StandardScaler()), ('SVR', SVR())])))\n",
"pipelines.append(('DecisionTreeRegressor', Pipeline([('scaler', StandardScaler()), ('DTR', DecisionTreeRegressor())])))\n",
"pipelines.append(('AdaboostRegressor', Pipeline([('scaler', StandardScaler()), ('ABR', AdaBoostRegressor())])))\n",
"pipelines.append(('RandomForestRegressor', Pipeline([('scaler', StandardScaler()), ('RBR', RandomForestRegressor())])))\n",
"pipelines.append(('BaggingRegressor', Pipeline([('scaler', StandardScaler()), ('BGR', BaggingRegressor())])))\n",
"pipelines.append(('GradientBoostRegressor', Pipeline([('scaler', StandardScaler()), ('GBR', GradientBoostingRegressor())])))"
]
},
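{
"cell_type": "markdown",
"id": "a3d9c5e7-2f1b-4b6a-8e04-9c7a1d5f2b68",
"metadata": {},
"source": [
"A minimal sketch of how pipelines like these might be scored with k-fold cross-validation. The data comes from `make_regression`, so it only stands in for `X_train` and `y_train`; the names `demo_pipelines` and `scores`, the five folds, and the random seeds are assumptions made for illustration.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.ensemble import GradientBoostingRegressor\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import KFold, cross_val_score\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Synthetic regression data (assumption: stands in for the training split)\n",
"X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)\n",
"\n",
"demo_pipelines = [\n",
"    ('Linear Regression', Pipeline([('scaler', StandardScaler()), ('LR', LinearRegression())])),\n",
"    ('GradientBoostRegressor', Pipeline([('scaler', StandardScaler()),\n",
"                                         ('GBR', GradientBoostingRegressor(random_state=0))])),\n",
"]\n",
"\n",
"# Shuffled 5-fold split; the fold count is an arbitrary choice for this sketch\n",
"kfold = KFold(n_splits=5, shuffle=True, random_state=0)\n",
"\n",
"scores = {}\n",
"for name, pipe in demo_pipelines:\n",
"    # The scaler is refit inside each fold, so no test-fold statistics leak into training\n",
"    cv = cross_val_score(pipe, X_demo, y_demo, cv=kfold, scoring='r2')\n",
"    scores[name] = cv.mean()\n",
"    print(f'{name}: mean r2 = {cv.mean():.3f} (+/- {cv.std():.3f})')\n",
"```\n",
"\n",
"The ranking on this synthetic, purely linear data says nothing about the concrete dataset; the point is only the mechanics of scoring each pipeline on the same folds."
]
},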
{
"cell_type": "code",
"execution_count": 17,
"id": "9ccd3c02-eb69-41b2-9c85-56e990b643c0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.