
Machine learning project in Python to predict loan approval (Part 6 of 6)

We have a dataset with loan applicants' data and whether each application was approved or not. In this tutorial we will build a machine learning model to predict the loan approval probability. This is the last project in this course.

Steps involved in this machine learning project:

The following steps are involved in creating a well-defined ML project:

  • Understand and define the problem
  • Analyse and prepare the data
  • Apply the algorithms
  • Reduce the errors
  • Predict the result

Our third project: predict whether a loan application will get approved

We have loan application information such as the applicant's name, personal details, financial information, the requested loan amount and related details, and the outcome (whether the application was approved or rejected). Based on this, we are going to train a model and predict whether a loan will get approved.

Here's our data set:


Loading our Loan-applications-information dataset:

  • Launch Anaconda Navigator and open the terminal
  • Type the command below to start the Python environment
python
  • Let's make sure the Python environment is up and running. Copy and paste the command below into the terminal to check that it's working properly
print("Hello World")
  • Let's load the required libraries for our analysis
#Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_graphviz
  • Let's load the loan applications training data set and assign it to a variable called "dataset". We are using pandas to load the data. We will also use pandas next to explore the data
#Load dataset
url = "https://raw.githubusercontent.com/callxpert/datasets/master/Loan-applicant-details.csv"
names = ['Loan_ID','Gender','Married','Dependents','Education','Self_Employed','ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History','Property_Area','Loan_Status']
dataset = pd.read_csv(url, names=names)

  • Let's take a peek at the data
print(dataset.head(20))
  • sklearn requires all inputs to be numeric, so we should convert all our categorical variables into numbers by encoding the categories. This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    dataset[i] = le.fit_transform(dataset[i])
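
To see what the encoding does, here is a minimal sketch on a few made-up rows (the values are illustrative, not taken from the dataset). It keeps one encoder per column in a hypothetical `encoders` dict. That is a small departure from the loop above: reusing a single `le` overwrites its `classes_` on every pass, so the mapping for earlier columns can no longer be inspected or reversed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})

# One encoder per column, so each mapping stays available afterwards
encoders = {}
for col in sample.columns:
    encoders[col] = LabelEncoder()
    sample[col] = encoders[col].fit_transform(sample[col])

# classes_ lists the original categories in sorted order;
# the position of a category is the code assigned to it
loan_status_classes = encoders["Loan_Status"].classes_.tolist()
print(loan_status_classes)              # ['N', 'Y'], so N -> 0 and Y -> 1
print(sample["Loan_Status"].tolist())   # [1, 0, 1]

# inverse_transform turns codes back into the original labels
decoded = encoders["Loan_Status"].inverse_transform([1, 0]).tolist()
print(decoded)                          # ['Y', 'N']
```

Because the codes follow the sorted category order, "Y" (approved) ends up as 1 and "N" as 0 in this sketch.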

Splitting the Data set

As we have seen already, in machine learning we have two kinds of datasets:

  • Training dataset - used to train our model
  • Testing dataset - used to test if our model is making accurate predictions

Our dataset has 480 records. We are going to use 80% of them (384 records) to train the model and the remaining 20% (96 records) to evaluate it. Copy and paste the commands below to prepare our datasets.

Though our dataset has a lot of columns, we are only going to use the income fields, loan amount, loan duration and credit history (columns 6 to 10) to train our model, with Loan_Status (column 12) as the target.

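array = dataset.values
# dataset.values is an object-dtype array because the frame mixes text and
# numbers; cast the slices to numeric types so sklearn accepts them
X = array[:,6:11].astype(float)
Y = array[:,12].astype(int)
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=7)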
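
A note on dtypes: calling `.values` on a frame that mixes text and numeric columns returns an object-dtype array, and scikit-learn can reject an object-dtype label array with `ValueError: Unknown label type: 'unknown'`. The sketch below uses randomly generated stand-in data (not the loan dataset) with the same 480-row, 13-column layout to show explicit casts and the resulting 80/20 split sizes. The `rng` seed and the generated values are assumptions made purely for illustration.

```python
import numpy as np
from sklearn import model_selection
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)

# Stand-in for dataset.values: 480 rows, 13 columns, object dtype
array = np.empty((480, 13), dtype=object)
array[:, :] = "x"                                      # placeholder text columns
array[:, 6:11] = rng.integers(0, 500, size=(480, 5))   # numeric feature columns
array[:, 12] = rng.integers(0, 2, size=480)            # encoded Loan_Status (0/1)

# Cast the slices to numeric dtypes before handing them to scikit-learn
X = array[:, 6:11].astype(float)
Y = array[:, 12].astype(int)

print(array[:, 12].dtype, "->", Y.dtype)
print(type_of_target(Y))  # binary

x_train, x_test, y_train, y_test = model_selection.train_test_split(
    X, Y, test_size=0.2, random_state=7)
print(x_train.shape, x_test.shape)  # (384, 5) (96, 5)
```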

Evaluating and training the model

We are going to apply the three algorithms below to this problem, evaluate their effectiveness, and finally choose the best one.

  • Logistic regression : Logistic regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent a binary / categorical outcome, we use dummy variables
model = LogisticRegression()
model.fit(x_train,y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

Result: 0.7708333333333334

  • Decision tree : Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables
model = DecisionTreeClassifier()
model.fit(x_train,y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

Result: 0.6458333333333334

  • Random forest : Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees
model = RandomForestClassifier(n_estimators=100)
model.fit(x_train,y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

Result: 0.7604166666666666
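
The calls above return hard 0/1 labels. Since the stated goal is an approval probability, note that scikit-learn classifiers such as these also provide `predict_proba`. Here is a minimal sketch on synthetic data from `make_classification` (a stand-in, not the loan dataset; `max_iter=1000` is an added assumption to avoid convergence warnings):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data: 480 rows, 5 features
X, Y = make_classification(n_samples=480, n_features=5, random_state=7)
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=7)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# One row per sample; columns follow model.classes_,
# so column 1 holds the estimated probability of class 1
proba = model.predict_proba(x_test)
approval_probability = proba[:, 1]

print(model.classes_)
print(approval_probability[:5].round(3))
```

With the encoded loan data, class 1 would correspond to an approved application if "Y" was encoded as 1.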

So, logistic regression gives the best accuracy of the three (about 0.77) for our use case, with random forest close behind (about 0.76). We can experiment with the variables used for training the model to see whether we get better accuracy.
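
A single 96-row test split gives a fairly noisy accuracy figure, so when trying different feature sets or models it can help to compare them with k-fold cross-validation instead. The sketch below uses `cross_val_score` on synthetic stand-in data. The 5-fold setting, the `models` dict and `max_iter=1000` are illustrative choices, not part of the original tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the loan dataset)
X, Y = make_classification(n_samples=480, n_features=5, random_state=7)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=7),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=7),
}

# Mean accuracy over 5 folds for each model
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, Y, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f} (+/- {fold_scores.std():.3f})")
```

To try a different set of input columns, only `X` needs to change; the comparison loop stays the same.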

Summary

We built an end-to-end project and tested different algorithms in this tutorial. This concludes this mini course on machine learning. We hope the course gave you a good primer on machine learning concepts and boosted your overall confidence with machine learning.

Next steps:

  • Continue working on more projects and build your portfolio
  • Learn more about various machine learning algorithms
  • Understand how the algorithms work behind the scenes and how we can fine-tune them
  • Sign up and share your knowledge with others in this copycoding community

nVector

posted on 06 Sep 18

victor 12-Sep-18

Thanks for writing this, I enjoyed it

In the following code:

CODE
model = LogisticRegression()
model.fit(x_train,y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))

ERROR
ValueError Traceback (most recent call last)
 in ()
1 model = LogisticRegression()
----> 2 model.fit(x_train,y_train)
3 predictions = model.predict(x_test)
4 print(accuracy_score(y_test, predictions))

~AppDataLocalContinuumanaconda3libsite-packagessklearnlinear_modellogistic.py in fit(self, X, y, sample_weight)
1215 X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype,
1216 order="C")
-> 1217 check_classification_targets(y)
1218 self.classes_ = np.unique(y)
1219 n_samples, n_features = X.shape

~AppDataLocalContinuumanaconda3libsite-packagessklearnutilsmulticlass.py in check_classification_targets(y)
170 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
171 'multilabel-indicator', 'multilabel-sequences']:
--> 172 raise ValueError("Unknown label type: %r" % y_type)
173
174

ValueError: Unknown label type: 'unknown'

anish 21-Apr-19

For the above comment: the type of the values is Long, and I guess that's why it's not accepting them. Just convert them to int with:

X = X.astype('int')
Y = Y.astype('int')

and then separate into test and train data. Worked for me, hope it helps.

Jyotir 21-Dec-19

I just added 

y=y.astype('int')

below 

X = loan_dataset.values[:, 6:11]

y = loan_dataset.values[:,12]

it worked fine.

Jyotir 21-Dec-19

Thanks so much for sharing these three projects! I enjoyed working on them and learnt the basics.

I was looking for the hyperlinks for the texts in Next steps. Are those tasks for us?

Thanks.

nVector 21-Dec-19

Glad it was useful. Yes, the next steps are action items.

Jyotir 22-Dec-19

Thanks again!

asish-cse 19-Jul-20

Without approval status, how can I label data? Any suggestions or tutorial?

nVector 20-Jul-20

I do not understand the question. The training data needs labels so that the model can train on it. We cannot have training data without labels.

Sai-Aishwarya 17-Mar-21

I'm getting an error while training the data

ValueError Traceback (most recent call last)
<ipython-input-16-5fd6998833ff> in <module>
1 model = LogisticRegression()
----> 2 model.fit(x_train,y_train)
3 predictions = model.predict(x_test)
4 print(accuracy_score(y_test, predictions))

~anaconda3libsite-packagessklearnlinear_model\_logistic.py in fit(self, X, y, sample_weight)
1343 order="C",
1344 accept_large_sparse=solver != 'liblinear')
-> 1345 check_classification_targets(y)
1346 self.classes_ = np.unique(y)
1347

~anaconda3libsite-packagessklearnutilsmulticlass.py in check_classification_targets(y)
170 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
171 'multilabel-indicator', 'multilabel-sequences']:
--> 172 raise ValueError("Unknown label type: %r" % y_type)
173
174

ValueError: Unknown label type: 'unknown'