Logistic Regression Model using Python

Logistic regression is a widely used statistical method for binary classification. It is a type of regression analysis that predicts the outcome of a categorical dependent variable from one or more predictor variables. In this post, we will explore the basics of logistic regression, how it works, and how to implement it using Python.

Introduction to Logistic Regression

Logistic regression is a type of regression analysis where the dependent variable is binary (0 or 1, true or false, success or failure). Unlike linear regression, which predicts an unbounded continuous value, logistic regression estimates the probability that a given input belongs to a specific class, so its output always lies between 0 and 1.

The logistic function (also called the sigmoid function) is used to map predicted values to probabilities. The logistic function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

where $\sigma(x)$ is the output of the logistic function for input $x$.
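
To make the formula concrete, here is a minimal sketch of the logistic function in NumPy. The function name `sigmoid` and the sample inputs are illustrative choices, not part of scikit-learn's API.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative inputs: a large negative value, zero, and a large positive value
values = np.array([-5.0, 0.0, 5.0])
probs = sigmoid(values)

# Negative inputs map below 0.5, zero maps to exactly 0.5, positive inputs above 0.5
print(probs)
```

Note that the function is symmetric around zero: the outputs for -5 and 5 add up to 1.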

How Logistic Regression Works

Logistic regression works by fitting a linear equation to the data, then applying the logistic function to the output of this linear equation to produce a probability. The linear equation has the form:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$$

where:

  • $\beta_0$ is the intercept,
  • $\beta_1, \beta_2, \ldots, \beta_n$ are the coefficients,
  • $x_1, x_2, \ldots, x_n$ are the input features.

The logistic function then transforms $z$ to a value between 0 and 1, representing the probability that the input point belongs to the positive class.
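
The two stages can be traced by hand for a single input point. The sketch below uses a made-up intercept and made-up coefficients for two features; these numbers are purely hypothetical and do not come from any fitted model.

```python
import numpy as np

# Hypothetical intercept and coefficients for two features (illustrative only)
beta_0 = -1.0
betas = np.array([0.8, -0.5])

# One input point with two feature values
x = np.array([2.0, 1.0])

# Stage 1: linear combination z = beta_0 + beta_1*x_1 + beta_2*x_2
z = beta_0 + np.dot(betas, x)

# Stage 2: the logistic function turns z into a probability
probability = 1.0 / (1.0 + np.exp(-z))

print(f"z = {z:.2f}, probability = {probability:.3f}")
# prints: z = 0.10, probability = 0.525
```

Because the probability is above 0.5, a classifier using the usual 0.5 threshold would assign this point to the positive class.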

Steps to Implement Logistic Regression in Python

To implement logistic regression in Python, we will use the popular machine learning library scikit-learn. Here are the steps involved:

  1. Import Libraries: Import the necessary libraries.
  2. Load Dataset: Load the dataset for training and testing.
  3. Preprocess Data: Prepare the data for modeling.
  4. Train the Model: Fit the logistic regression model to the training data.
  5. Make Predictions: Use the model to make predictions on new data.
  6. Evaluate the Model: Assess the performance of the model.

Step 1: Import Libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns        


Step 2: Load Dataset

For this example, we will use the famous Pima Indians Diabetes dataset, which is available in many data repositories.

# Load dataset
url = "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
df = pd.read_csv(url)

# Display the first few rows of the dataset
# (print is needed when running as a script rather than in a notebook)
print(df.head())
        

Step 3: Preprocess Data

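We preprocess the data by checking for missing values and splitting it into training and testing sets. The features are left unscaled in this example; normalizing them is optional, although it often helps the solver converge. Keep in mind that this dataset is commonly reported to use 0 in place of missing measurements in columns such as Glucose and BMI, so a null check alone will not flag them.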

# Check for missing values (number of nulls per column)
print(df.isnull().sum())

# Split the data into features (X) and target (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
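
The preprocessing step mentions normalizing the data. One possible way to do this, sketched below, is to chain a `StandardScaler` with the model in a scikit-learn pipeline so that the scaling statistics are learned from the training split only. The synthetic data from `make_classification` and the variable names are assumptions made so the sketch runs without a download; with the diabetes data you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, so the sketch runs offline)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split, then the classifier on
# the scaled training data; at prediction time the same scaling is reapplied
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

test_accuracy = scaled_model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Wrapping both steps in one pipeline avoids leaking information from the test split into the scaler.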
        


Step 4: Train the Model

Next, we will create an instance of the logistic regression model and fit it to the training data.

# Create an instance of Logistic Regression
model = LogisticRegression(max_iter=1000)

# Fit the model to the training data
model.fit(X_train, y_train)
        

Step 5: Make Predictions

We can now use the trained model to make predictions on the test data.

# Make predictions on the test data
y_pred = model.predict(X_test)
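
Besides hard class labels, a fitted model can also return the underlying probabilities through `predict_proba`. The sketch below, which uses synthetic stand-in data (an assumption, to keep it self-contained), also recomputes those probabilities from `intercept_` and `coef_` to show that they follow the linear equation and logistic function described earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, so the sketch runs offline)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1
proba_pos = clf.predict_proba(X_demo)[:, 1]

# Recompute by hand: z = intercept + X @ coefficients, then the logistic function
z = clf.intercept_[0] + X_demo @ clf.coef_[0]
manual = 1.0 / (1.0 + np.exp(-z))

# Thresholding the probabilities at 0.5 reproduces the labels from predict()
labels = (proba_pos > 0.5).astype(int)
agreement = np.mean(labels == clf.predict(X_demo))

print("Matches manual computation:", np.allclose(proba_pos, manual))
print(f"Agreement with predict(): {agreement:.2f}")
```

Working with probabilities rather than labels makes it possible to choose a threshold other than 0.5 when the two kinds of error have different costs.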
        

Step 6: Evaluate the Model

Finally, we will evaluate the model's performance using various metrics.

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Generate classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

# Visualize the confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
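
To see how the numbers in the classification report relate to the confusion matrix, the following sketch derives accuracy, precision, and recall by hand from the four cells. The small label arrays are invented for illustration and are not results from the diabetes model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration: 5 negatives and 5 positives
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
# prints: TN=4, FP=1, FN=1, TP=4
# prints: accuracy=0.80, precision=0.80, recall=0.80
```

The hand-computed precision and recall agree with scikit-learn's `precision_score` and `recall_score` on the same arrays.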
        

Complete Example: Predicting the Likelihood of Diabetes

Let's put it all together in a complete example where we predict the likelihood of diabetes based on various health metrics.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
url = "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
df = pd.read_csv(url)

# Preprocess data
X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
        


Logistic regression is a dependable baseline for binary classification problems. It is simple to understand, quick to train, and its coefficients are easy to interpret, making it a sound first choice for many practical applications. In this post, we covered the basics of logistic regression, how it works, and how to implement it in Python using scikit-learn. We also provided a complete example using the Pima Indians Diabetes dataset.

By following these steps, you can apply logistic regression to your own binary classification problems and gain valuable insights from your data.


Author

Nadir Riyani is an accomplished and visionary Engineering Manager specialising in AI/ML technologies. With a wealth of experience leading high-performing engineering teams, Nadir is passionate about leveraging artificial intelligence and machine learning to drive innovation and solve complex challenges. His expertise spans across software development principles, encompassing Agile, Automation and DevOps methodologies. Nadir's commitment to engineering excellence and ability to align technical strategies with business objectives make him a valuable asset to any organization. For further inquiries, please feel free to reach out to him at riyaninadir@gmail.com.


Adrian Olszewski

Clinical Trials Biostatistician at 2KMM (100% R-based CRO) ⦿ Frequentist/NHST (non-Bayesian) paradigm only ⦿ NOT a Data Scientist (no ML/AI/Big data) ⦿ Against anti-car/-meat/-cash/-house and C40 restrictions

1y

Let me only add that outside Machine Learning, in classic statistics, it's a key example of a regression algorithm - just by its definition of conditional expectation. No surprise - it was invented (McFadden, Berkson, Cox, Nelder) and embedded into the Generalized Linear Model family (Nelder, Wedderburn) exactly for the purpose of doing regression work years before it was used for classification (mainly to replace the probit regression). Nowadays it is used for regression and testing hypotheses by thousands of statisticians, e.g. in experimental and exploratory research, like clinical trials. For instance, it's my daily regression tool (and I have never used it for classification so far). I mention that, because some of ML specialists may want to join, one day, a field, where classic statistics is used and the logistic regression has many more applications (and classification is not the major one). https://www.linkedin.com/pulse/logistic-regression-has-been-since-its-birth-adrian-olszewski-haygf/ or - if you prefer Medium - https://medium.com/@r.clin.res/is-logistic-regression-a-regression-46dcce4945dd?source=friends_link&sk=e9cb5449363197f85c0d6d0f4211a562
