ANIL NEBİ ŞENTÜRK
5 min read · Jan 16, 2022


Machine learning for diabetes prediction

Hi everyone, today I will talk about the diabetes data set and try to predict diabetes. The data set relates to Pima Indian women aged 21 and above living in Phoenix, the fifth-largest city in the US state of Arizona. Whether or not each woman had diabetes was recorded along with several other variables. The data set consists of 9 variables and 768 observations.

Data Set Story: what is the purpose of this project?
Given this data set, we build a machine learning model (a random forest classifier) that predicts whether a newly entered person is diabetic or not.

The target variable is “Outcome”; 1 indicates a positive diabetes test result, 0 a negative one.
Pregnancies: Number of pregnancies
Glucose: 2-hour plasma glucose concentration in the oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-hour serum insulin (mu U/ml)
DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history
BMI: Body mass index
Age: Age (years)

Let's import the necessary libraries and set our display options!
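The post's import cell isn't reproduced here, so below is a minimal set that covers everything used in the rest of the walkthrough (the CSV file name is an assumption):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

pd.set_option("display.max_columns", None)

# Load the Pima Indians diabetes data (file name is an assumption)
df = pd.read_csv("diabetes.csv")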

With the grab_col_names function, we capture the categorical variables, numeric variables, categorical-but-cardinal variables, and numeric-but-categorical variables in the data set.

def grab_col_names(dataframe, cat_th=10, car_th=20):
    """
    Return the categorical, cardinal and numeric column names of a dataframe.

    :param dataframe: dataframe whose column names will be grabbed
    :param cat_th: unique-value threshold below which a numeric column is treated as categorical
    :param car_th: unique-value threshold above which a categorical column is treated as cardinal
    :return: cat_cols, cat_but_car, num_cols, num_but_cat
    """
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f"cat_cols: {len(cat_cols)}")
    print(f"num_cols: {len(num_cols)}")
    print(f"cat_but_car: {len(cat_but_car)}")
    print(f"num_but_cat: {len(num_but_cat)}")

    return cat_cols, cat_but_car, num_cols, num_but_cat


cat_cols, cat_but_car, num_cols, num_but_cat = grab_col_names(df)

To check for outliers, let's compute lower and upper thresholds based on the 5th and 95th percentiles (q1=0.05, q3=0.95) and see whether any values fall outside them.
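The grab_outliers function below relies on an outlier_thresholds helper that the post doesn't show; here is a minimal sketch consistent with the q1=0.05 / q3=0.95 choice above (the 1.5 × IQR multiplier is an assumption):

def outlier_thresholds(dataframe, col_name, q1=0.05, q3=0.95):
    # Thresholds based on the 5th and 95th percentiles plus a 1.5 * IQR margin
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    low_limit = quartile1 - 1.5 * interquantile_range
    up_limit = quartile3 + 1.5 * interquantile_range
    return low_limit, up_limit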

def grab_outliers(dataframe, col_name, index=False):
    low, up = outlier_thresholds(dataframe, col_name)
    outliers = dataframe[(dataframe[col_name] < low) | (dataframe[col_name] > up)]

    if outliers.shape[0] >= 10:
        print("#####################################################")
        print(str(col_name) + " variable has too many outliers")
        print("#####################################################")
        print(outliers.head(15))
        print("#####################################################")
        print("Lower threshold: " + str(low) + " Lowest outlier: " + str(dataframe[col_name].min()) +
              " Upper threshold: " + str(up) + " Highest outlier: " + str(dataframe[col_name].max()))
        print("#####################################################")
    elif outliers.shape[0] > 0:
        print("#####################################################")
        print(str(col_name) + " variable has fewer than 10 outliers")
        print("#####################################################")
        print(outliers)
        print("#####################################################")
        print("Lower threshold: " + str(low) + " Lowest outlier: " + str(dataframe[col_name].min()) +
              " Upper threshold: " + str(up) + " Highest outlier: " + str(dataframe[col_name].max()))
        print("#####################################################")
    else:
        print("#####################################################")
        print(str(col_name) + " variable has no outliers")
        print("#####################################################")

    if index:
        print(str(col_name) + " variable's outlier indexes")
        print("#####################################################")
        return outliers.index

for col in num_cols:
    grab_outliers(df, col)

After looking at which columns contain outliers, we cap the outliers at the lower and upper thresholds we created.
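The capping step itself isn't shown in the post; a minimal sketch of the usual approach (the helper name replace_with_thresholds is my choice):

def replace_with_thresholds(dataframe, col_name):
    low, up = outlier_thresholds(dataframe, col_name)
    # Cap values outside the thresholds at the threshold values
    dataframe.loc[dataframe[col_name] < low, col_name] = low
    dataframe.loc[dataframe[col_name] > up, col_name] = up

for col in num_cols:
    replace_with_thresholds(df, col)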

We check whether there are missing observations in the data; some variables should never be zero (age, skin thickness, body mass index, blood pressure), so let's check those as well.

There are no explicitly missing observations, but we see zero values in variables that should never be 0. To handle them, let's first convert the 0 values to NaN and then fill them with the column mean.
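A sketch of that replacement, with the column list taken from the description above:

zero_cols = ["Age", "SkinThickness", "BMI", "BloodPressure"]
for col in zero_cols:
    # Treat impossible zeros as missing, then impute with the column mean
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].mean())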

Let's draw a heatmap to observe how the variables correlate with each other.
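A minimal seaborn sketch of such a heatmap (figure size and color map are my choices):

f, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="RdBu", ax=ax)
ax.set_title("Correlation Matrix")
plt.show()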

Let's create new variables from the existing data to use in the prediction model. First, let's create a new variable by multiplying the age and body mass index variables and add it to the dataframe.
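A sketch of that step (the column name NEW_AGE_BMI is hypothetical):

# Interaction feature: age multiplied by body mass index
df["NEW_AGE_BMI"] = df["Age"] * df["BMI"]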

After segmenting age, number of pregnancies, glucose, and BMI into categories (see the sketch below), let's split the data set into categorical, numerical, and cardinal variables again and find the columns with only two classes.
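The segmentation itself isn't reproduced in the post; a minimal sketch for the age variable using pd.cut (the bin edges and labels are my assumptions, and the post's actual cut points may differ):

# Bucket age into hypothetical young / middle / old segments
df["NEW_AGE_CAT"] = pd.cut(df["Age"],
                           bins=[20, 35, 50, int(df["Age"].max())],
                           labels=["young", "middle", "old"])

After re-running grab_col_names on the enriched dataframe, the two-class (binary) columns can be found like this: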

binary_cols = [col for col in df.columns if df[col].dtype not in [int, float]
               and df[col].nunique() == 2]

Let's convert the class values to 1 and 0 with a label encoder so that the algorithm can work with them.

def label_encoder(dataframe, binary_col):
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

for col in binary_cols:
    label_encoder(df, col)

We also convert variables containing more than two string classes with a one-hot encoder, dropping the first dummy variable (drop_first=True) to avoid the dummy variable trap. After the new variables are created, we split the data set into categorical, numerical, and cardinal variables once more, and standardize the numeric columns.

ohe_cols = [col for col in df.columns if 10 >= df[col].nunique() > 2]

def one_hot_encoder(dataframe, categorical_cols, drop_first=True):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

df = one_hot_encoder(df, ohe_cols)
df.head()

# Standardize the numeric columns
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

Finally, we train a random forest classifier on the prepared data and use it to make predictions.
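The modelling step isn't shown in the post; a minimal scikit-learn sketch (the test size, random state, and accuracy metric are my choices):

y = df["Outcome"]
X = df.drop("Outcome", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=17)

rf_model = RandomForestClassifier(random_state=17).fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")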
