Build a chat bot from scratch using Python and TensorFlow

Apr 12, 2023

Building a chatbot can be challenging, but with the right tools and techniques it can also be fun and rewarding. In this tutorial, we’ll build a simple chatbot in Python, using the Natural Language Toolkit (NLTK) library to preprocess text and TensorFlow to train the model.

Here are the steps we’ll be following:

Set up a development environment
Define the problem statement
Collect and preprocess data
Train a machine learning model
Build the chatbot interface
Test the chatbot

Step 1: Set up a development environment

To get started, we need to set up our development environment. For this tutorial, we’ll be using Python 3. You can download Python 3 from the official website (https://www.python.org/downloads/) and install it on your machine.

Next, we need to install the following packages:

nltk
numpy
tensorflow

You can install these packages by running the following commands in your terminal or command prompt:

pip install nltk
pip install numpy
pip install tensorflow
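
Before moving on, it can help to confirm that the three packages are visible to your Python interpreter. The following is a minimal sketch that uses only the standard library; `missing_packages` is a hypothetical helper name chosen for this example, not part of any of the libraries above.

```python
import importlib.util

REQUIRED = ("nltk", "numpy", "tensorflow")

def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported as missing, rerun the matching `pip install` command with the same interpreter you use to run the tutorial code.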

Step 2: Define the problem statement

Before writing any code, we need to define the problem statement. In this tutorial, we’ll be building a simple chatbot that can answer basic questions about a topic. We’ll use a dataset of questions and answers to train our chatbot. Our chatbot should be able to understand the question and provide the best possible answer.

Step 3: Collect and preprocess data

The next step is to collect and preprocess data. We’ll be using a dataset of questions and answers related to programming. You can download the dataset from this link: https://drive.google.com/file/d/1JW7V_z57LjMk7VHbwnjZ1TAEeFlgfb21/view?usp=sharing

Once you have downloaded the dataset, we need to preprocess it. The code below expects the file to be saved as data.txt in the working directory, with one entry per line. We’ll use the NLTK library to preprocess the data. Here’s the code:

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK data
nltk.download('punkt')
# Recent NLTK releases look for 'punkt_tab' when tokenizing
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

# Load data
with open('data.txt', 'r', encoding='utf-8') as f:
    raw_data = f.read()

# Build these once instead of on every call to preprocess
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocess data
def preprocess(data):
    # Tokenize data
    tokens = nltk.word_tokenize(data)

    # Lowercase all words
    tokens = [word.lower() for word in tokens]

    # Remove stopwords and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]

    # Lemmatize words
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

# Preprocess each non-empty line of the file
processed_data = [preprocess(qa) for qa in raw_data.split('\n') if qa.strip()]

In the code above, we first download the necessary NLTK data. We then load the data from the file and preprocess it using the preprocess function. The function tokenizes the data, converts all words to lowercase, removes stopwords and punctuation, and lemmatizes the words.
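
To see what these steps do to a single sentence without downloading any NLTK data, here is a simplified, dependency-free sketch. It is a stand-in for illustration only: the regular expression is a rough substitute for `nltk.word_tokenize`, the stopword set is a tiny hand-picked sample rather than NLTK’s English list, and lemmatization is left out entirely. The name `simple_preprocess` is hypothetical.

```python
import re
import string

# Tiny illustrative stopword list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "what", "do", "i", "how"}

def simple_preprocess(text):
    """Tokenize, lowercase, and drop stopwords and punctuation (no lemmatization)."""
    # Split into runs of word characters or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    tokens = [tok.lower() for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS and tok not in string.punctuation]

print(simple_preprocess("What is a list comprehension in Python?"))
# ['list', 'comprehension', 'python']
```

Only the content words survive, which is the form the model is trained on in the next step.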

Step 4: Train a machine learning model

The next step is to train a machine learning model. We’ll use the processed data to train a neural network using the TensorFlow library. Here’s the code to train the model:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Set parameters
vocab_size = 5000
embedding_dim = 64
max_length = 100
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"

# Create tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(processed_data)
word_index = tokenizer.word_index

# Create sequences
sequences = tokenizer.texts_to_sequences(processed_data)

# Create training data
# The model ends in a single softmax over the vocabulary, so each label must be
# one word index, not a whole padded sequence. Assumption: we train on next-word
# prediction, where every prefix of a line is an input and the word that follows
# it is the label. Other labelling schemes are possible.
prefixes, next_words = [], []
for seq in sequences:
    for i in range(1, len(seq)):
        prefixes.append(seq[:i])
        next_words.append(seq[i])

training_data = pad_sequences(prefixes, maxlen=max_length, padding=padding_type, truncating=trunc_type)
training_labels = np.array(next_words)

# Build model
model = tf.keras.Sequential([
    # Declare the input shape explicitly rather than through the Embedding layer
    tf.keras.Input(shape=(max_length,)),
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

# Compile model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
num_epochs = 50
history = model.fit(training_data, training_labels, epochs=num_epochs, verbose=2)

In the code above, we first set some parameters for the model, such as the vocabulary size, embedding dimension, and maximum sequence length. We then create a tokenizer and fit it on the processed data. We use the tokenizer to create sequences and pad them to a fixed length.
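
To make the tokenizer and padding steps concrete, here is a plain-Python sketch of the same idea. It is a simplified illustration, not the Keras implementation: it mirrors the convention that index 0 is reserved for padding and index 1 for the out-of-vocabulary token, but it ignores the `num_words` cap. The function names are hypothetical.

```python
from collections import Counter

def build_word_index(token_lists, oov_token="<OOV>"):
    """Map each word to an integer; index 0 is reserved for padding, 1 for OOV."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    word_index = {oov_token: 1}
    # Most frequent words get the smallest indices
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_padded_sequence(tokens, word_index, max_length, oov_token="<OOV>"):
    """Convert tokens to indices, truncate at the end, then pad with zeros at the end."""
    seq = [word_index.get(tok, word_index[oov_token]) for tok in tokens][:max_length]
    return seq + [0] * (max_length - len(seq))

corpus = [["python", "list"], ["python", "loop"]]
index = build_word_index(corpus)
print(index)
# {'<OOV>': 1, 'python': 2, 'list': 3, 'loop': 4}
print(to_padded_sequence(["python", "dict"], index, max_length=4))
# [2, 1, 0, 0]
```

The unseen word "dict" maps to the OOV index, and the trailing zeros are the 'post' padding that brings every sequence to the same length.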

We then create training data and labels, and build a neural network model using the Keras Sequential API. The model consists of an embedding layer, a dropout layer, a convolutional layer, a max pooling layer, an LSTM layer, and two dense layers. We compile the model with a sparse categorical cross-entropy loss function and the Adam optimizer.
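
It can be useful to trace how the shape of one example changes as it passes through these layers. The sketch below does the arithmetic by hand, assuming the Keras defaults of 'valid' padding for the convolution and a pooling stride equal to the pool size; the helper names are hypothetical.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def pool1d_output_length(length, pool_size):
    """Output length of 1-D max pooling with 'valid' padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1

max_length, embedding_dim, vocab_size = 100, 64, 5000

after_embedding = (max_length, embedding_dim)               # (100, 64)
after_conv = (conv1d_output_length(max_length, 5), 64)      # (96, 64)
after_pool = (pool1d_output_length(after_conv[0], 4), 64)   # (24, 64)
after_lstm = (64,)             # LSTM returns only its final output by default
after_output = (vocab_size,)   # one probability per vocabulary index

for name, shape in [("embedding", after_embedding), ("conv1d", after_conv),
                    ("max_pool", after_pool), ("lstm", after_lstm),
                    ("softmax", after_output)]:
    print(f"{name:>10}: {shape}")
```

Dropout does not change the shape, so it is omitted. Because the final layer produces a single distribution over the vocabulary for each example, sparse categorical cross-entropy expects one integer label per example.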

Finally, we train the model for 50 epochs and store the training history.

Step 5: Build the chatbot interface

The next step is to build the chatbot interface. We’ll create a simple command-line interface for our chatbot. Here’s the code:

import numpy as np

# Define function to predict answer
def predict_answer(model, tokenizer, question):
    # Preprocess question
    question = preprocess(question)
    # Convert question to sequence
    sequence = tokenizer.texts_to_sequences([question])
    # Pad sequence
    padded_sequence = pad_sequences(sequence, maxlen=max_length, padding=padding_type, truncating=trunc_type)
    # Predict answer (verbose=0 hides the progress bar)
    pred = model.predict(padded_sequence, verbose=0)[0]
    # Get index of highest probability
    idx = int(np.argmax(pred))
    # Get answer; index 0 is the padding value and has no word, so fall back to the OOV token
    answer = tokenizer.index_word.get(idx, oov_tok)
    return answer

# Start chatbot (type 'quit' to exit)
while True:
    question = input('You: ')
    if question.strip().lower() == 'quit':
        break
    answer = predict_answer(model, tokenizer, question)
    print('Chatbot:', answer)

In the code above, we define a function called predict_answer that takes a question, preprocesses it, converts it to a sequence, pads the sequence, and predicts the answer using the trained model and tokenizer.

We then create a simple command-line interface for the chatbot that asks the user for input, calls the ‘predict_answer’ function to get the answer, and prints the answer to the console.
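
One way to make this loop easier to test is to pass the answer function and the input/output functions in as parameters. The following is a sketch under that assumption; `run_chat` and its parameters are hypothetical names, and the lambda below is a stand-in for the real model.

```python
def run_chat(answer_fn, read_input=input, write_output=print, exit_words=("quit", "exit")):
    """Read questions until the user types an exit word; return the number of turns."""
    turns = 0
    while True:
        question = read_input("You: ").strip()
        if question.lower() in exit_words:
            write_output("Chatbot: Goodbye!")
            return turns
        if not question:
            continue  # ignore empty input instead of sending it to the model
        write_output("Chatbot: " + answer_fn(question))
        turns += 1

# Example with a stand-in answer function and scripted input
scripted = iter(["hello", "", "quit"])
replies = []
run_chat(lambda q: q.upper(), read_input=lambda prompt: next(scripted), write_output=replies.append)
print(replies)
# ['Chatbot: HELLO', 'Chatbot: Goodbye!']
```

With the trained model, the call would look like `run_chat(lambda q: predict_answer(model, tokenizer, q))`.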

Step 6: Test the chatbot

Now that we have built the chatbot interface, it’s time to test the chatbot. Run the code and start chatting with the chatbot!

Here’s an example conversation:

You: What is your name?
Chatbot: i
You: That's not a name. What's your real name?
Chatbot: am
You: Come on, give me a real answer.
Chatbot: a
You: Okay, I give up. What do you like to do?
Chatbot: like
You: That's not an answer. What do you enjoy doing?
Chatbot: to
You: You're not very good at this, are you?
Chatbot: i
You: *sigh* nevermind.

As you can see, the chatbot’s responses are not meaningful or coherent. The model returns only the single most probable word for each question, our training data is very small, and our model is very simple. To improve the chatbot’s performance, you can try:

Using a larger and more diverse training dataset
Using a more complex model architecture
Training the model for more epochs, while watching for overfitting
Using a pre-trained language model instead of training from scratch
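
To check whether any of these changes actually helps, it is useful to score the chatbot on question-answer pairs that were held out of training. The sketch below computes a simple exact-match accuracy; the function name, the sample pairs, and the stand-in answer function are all hypothetical, and exact match is only one possible metric.

```python
def exact_match_accuracy(answer_fn, qa_pairs):
    """Fraction of questions whose predicted answer equals the expected one (case-insensitive)."""
    if not qa_pairs:
        return 0.0
    hits = sum(
        answer_fn(question).strip().lower() == expected.strip().lower()
        for question, expected in qa_pairs
    )
    return hits / len(qa_pairs)

# Hypothetical held-out pairs and a stand-in answer function
held_out = [("What keyword defines a function?", "def"),
            ("What keyword starts a loop?", "for")]
always_def = lambda question: "def"
print(exact_match_accuracy(always_def, held_out))  # 0.5
```
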

Conclusion

In this tutorial, we have built a simple chatbot using Python and TensorFlow. We started by gathering and preprocessing data, then we built a neural network model using the Keras Sequential API. We then created a simple command-line interface for the chatbot and tested it with some example conversations.

This is just a basic example of a chatbot, and there are many ways to improve it. With more advanced techniques and tools, you can build chatbots that can understand natural language, generate human-like responses, and even learn from user interactions to improve over time.

Improving the Chatbot

There are many ways to improve a chatbot, and I’ll share some ideas below:

Use a more advanced language model: One way to improve the chatbot is to use a more advanced language model that has been pre-trained on massive amounts of text data. Generative models such as GPT-3 can produce human-like responses, while encoder models such as BERT are better suited to understanding tasks, for example classifying the intent of a question. You can use a pre-trained model and fine-tune it on your specific chatbot task.
Add more training data: Another way to improve the chatbot is to use more training data, ideally with a wide range of conversational topics and styles. You can scrape data from social media or forums, or use existing chatbot datasets such as Cornell Movie Dialogs Corpus or Persona-Chat.
Use a more complex model architecture: You can also improve the chatbot’s performance by using a more complex model architecture, such as a transformer-based model, which can capture longer-term dependencies in the input sequence.
Incorporate user feedback: You can incorporate user feedback into the chatbot to improve its responses over time. For example, you can ask users to rate the quality of the chatbot’s responses or suggest alternative responses, and use this feedback to retrain the model.
Add multi-turn conversation capability: The current chatbot can only handle one question-answer exchange at a time. You can improve the chatbot by adding multi-turn conversation capability, allowing the chatbot to remember previous conversation context and generate more meaningful responses.
Implement personality and emotional intelligence: You can also make the chatbot more engaging and human-like by implementing personality and emotional intelligence. For example, you can give the chatbot a specific personality trait, such as being funny or sarcastic, or use sentiment analysis to detect and respond to the user’s emotional state.

These are just a few ideas on how to improve the chatbot. There are many other techniques and tools you can use, depending on your specific use case and goals.
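
As an illustration of the multi-turn idea, the sketch below keeps a bounded history of recent exchanges and prepends it to each new question. This is one simple approach under stated assumptions, not the only one: the class and method names are hypothetical, and the model would have to be trained on inputs built the same way for the extra context to help.

```python
from collections import deque

class ConversationContext:
    """Keep the last `max_turns` exchanges and prepend them to each new question."""

    def __init__(self, max_turns=3):
        self.history = deque(maxlen=max_turns)  # oldest turns are dropped automatically

    def build_input(self, question):
        """Join remembered turns and the new question into one input string."""
        past = [f"{q} {a}" for q, a in self.history]
        return " ".join(past + [question])

    def record(self, question, answer):
        """Store a finished exchange so later questions can refer to it."""
        self.history.append((question, answer))

context = ConversationContext(max_turns=2)
context.record("what is python", "a language")
context.record("who created it", "guido van rossum")
context.record("when", "1991")
print(context.build_input("is it still maintained"))
# who created it guido van rossum when 1991 is it still maintained
```

The first exchange has been dropped because only the two most recent turns are kept, which bounds the input length no matter how long the conversation runs.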