12 Nov 2024 · Software Engineering

    How to Build a RAG Chatbot to Chat with Documents Using Sparse Data

    24 min read

    You’re getting started with a new tool. You need to find your way around the documentation quickly. You’re not sure where to start reading. But imagine if someone could tell you which part of the documentation you should start reading first. So let’s build an AI that does that. Even if the documentation is partially written.

    We will be building a bot that answers questions about the Semaphore documentation. You will learn how to build a RAG chatbot. And you will also build a tool that will help you get started with Semaphore.

    But this will not be an ordinary RAG chatbot. It will also be useful while the documentation is still incomplete. So you will also learn a technique for working with sparse data.

    What we are actually building

    We are building a Python chatbot that will answer questions based on the new Semaphore documentation on GitHub. The bot will run in the terminal. It will provide a short answer and then direct you to the source document where you can read more.

    These types of chat-with-documents systems are called Retrieval-Augmented Generation (RAG). When the user asks a question, the retriever searches for the relevant information. The retrieved information is inserted into the LLM’s prompt. And the LLM is instructed to generate the answer to the user’s question. The LLM’s answer must be based solely on the information provided in the prompt.

    The LLM has a context window of a fixed size. The size of the input context window limits how much text we can include in the prompt. The size of the output context limits how much text the LLM can output. Therefore, we only augment the prompt with the most relevant chunks of documents. Because in most cases, we can’t fit all of our source documents into a single prompt.

    In production, the source documents are stored in the system’s search database. In this tutorial, the documents will be stored in memory. Because we are building a first iteration quickly.
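
    To make the flow concrete, here is a rough sketch of the RAG loop in plain Python. It is illustrative only; the real chain is built later in this article with Langchain, and the retriever and llm names below simply stand in for the components that we will set up.

    def answer_question(question, retriever, llm):
        #1. The retriever searches for the most relevant documents
        relevant_docs = retriever.invoke(question)
        #2. The retrieved text is inserted into the prompt as the "context"
        context = "\n\n".join(doc.page_content for doc in relevant_docs)
        prompt = "Answer based only on this context:\n" + context + "\n\nQuestion: " + question
        #3. The LLM generates an answer that is grounded in the context
        return llm.invoke(prompt)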

    The code in this tutorial runs on Python 3.10.9 and uses the following packages from pip.

    langchain==0.3.0
    langchain-openai==0.2.0
    langchain-community==0.3.0
    faiss-cpu==1.8.0.post1
    

    The LLM that we are using is gpt-4o-mini. But you could swap it for any LLM that you prefer to use.
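
    For example, swapping the model is a one-line change in the setup code that you will see later. The alternative model name below is only an example.

    from langchain_openai import ChatOpenAI
    
    model = ChatOpenAI(model="gpt-4o-mini")
    # Any other chat model could be dropped in instead, for example:
    # model = ChatOpenAI(model="gpt-4o")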

    Is the data usable? This is the greatest challenge

    The greatest challenge is that the documentation is still being written. At the time of writing, the new documentation is very sparse. Some documents consist of three letters, “WIP”. Some documents are mostly example code. Some documents are mostly a diagram or two. You can imagine a presenter at a data conference bleating “Garbage-In Garbage-Out!” But it’s my job to deliver a solution despite the sparse data. So here’s the solution that I came up with.

    We can often guess the purpose of a document before it has been completed. We look for clues. For example, a file called “SECURITY.md” will probably be about security. Even if the contents of that file are just three letters, “WIP”.

    Let’s imagine another example. A file that contains a single YAML template, and no text to describe the template. This file is probably related to config.

    We will be using the LLM to augment our documents with information that is implied by their path, file name, and content so far. Some documents have a lot of text already. Others have a few letters. To make every “document” roughly the same size, we will use the LLM to generate summaries of those documents. Our RAG chatbot will only use the document summaries and the augmented context. The bot will explicitly direct the user to the source documents in each answer.

    At the time of writing, the document preprocessing step takes approximately five minutes. And it costs less than $0.10 in OpenAI fees.

    Anthropic have recently published an article that also describes a technique of augmenting data for RAG. In a typical RAG use case, the documents are split into chunks. When the user asks a question, the retriever retrieves only the most relevant chunks. Usually from several different source documents. In Anthropic’s Contextual Retrieval, they generate additional context for each chunk of text. This can help the retriever understand what the source document was talking about. In our case, we are not chunking our “documents”. For each source document, we are retrieving the whole summary plus augmented context.

    For example, here is a “summary plus augmented context” for docs/getting-started/guided-tour.md.

    The path 'docs/getting-started/guided-tour.md' suggests that this document serves as an introductory guide for new users of SemaphoreCI, helping them navigate the initial steps of setting up and using the tool effectively.
    This document provides a guided tour for new users of SemaphoreCI. It outlines essential tasks such as signing up, creating an organization, setting up a first project, and understanding Continuous Integration and Continuous Delivery processes. The document aims to help users get started with SemaphoreCI quickly and effectively, ensuring they can leverage the tool for their CI/CD workflows.
    The 'Guided Tour' document is a 20-minute tutorial designed to help new users of SemaphoreCI get acquainted with the platform. It emphasizes that no credit card is required to try out Semaphore, making it accessible for users to explore its features without any commitment. The tutorial covers four key learning objectives: first, it guides users through the sign-up process and organization creation; second, it introduces a simple 'Hello, World!' project to demonstrate the basic functionalities of CI; third, it provides a real-world example of Continuous Integration, including build and test steps; and finally, it explains the Continuous Delivery process, which includes releasing and deploying apps. Overall, this document serves as a foundational resource within the broader documentation, aimed at onboarding users efficiently and encouraging them to leverage SemaphoreCI for their development workflows.
    

    Another example is SECURITY.md. At the time of creating this tutorial, the file is a work in progress literally containing just three letters: “WIP”. On their own, these three letters would be useless to the retriever. Here is a “summary plus augmented context” for SECURITY.md:

    The path 'SECURITY.md' suggests that this document is focused on security practices, guidelines, or considerations relevant to the Semaphore CI/CD tool. It likely outlines how users can secure their CI/CD processes, manage sensitive data, and comply with security standards.
    This document is intended to provide insights into the security protocols and best practices associated with using Semaphore CI/CD. Security is a critical aspect of CI/CD tools as they handle code, configurations, and potentially sensitive data. This document will help users understand how to implement security measures within their Semaphore workflows, contributing to overall operational security.
    The document labeled 'Security' is currently marked as 'WIP' (Work In Progress), indicating that it is not yet complete. While no specific content is provided, the intention behind the document is to outline security practices and considerations for using SemaphoreCI. Once fully developed, it is expected to cover topics such as securing deployment pipelines, managing access controls, ensuring the integrity of build artifacts, and maintaining compliance with security regulations. This document will be part of the broader SemaphoreCI documentation suite, which addresses various functionalities such as Continuous Integration, Pipelines, and Automation, among others. It will be particularly useful for developers and operations teams looking to implement secure CI/CD practices.
    

    Our prompt for summarizing and augmenting the documents

    Here’s the prompt template that we will be using to summarize and augment the source documents. Notice that I wrapped it in a Python multiline string. A prompt template is different to a finished prompt. A prompt template has space for us to insert our data. The strings {path} and {doc} will be replaced by data. They will hold the path to a source document and the text of the source document.

    This Python string will live in a file called prompts.py. prompts.py contains three Python strings. You will see the other two strings later in this tutorial.

    #This is the prompt for summarizing and augmenting the docs
    summarize_and_augment_prompt = """
    The document in the "document" XML tags is part of the documentation for a CI/CD tool called SemaphoreCI.
    The path in the "path_to_document" XML tags is path to that particular document. Pay attention to the file and folder names because they suggest the purpose of the document.
    Semaphore is a CI/CD solution to streamline developer workflows.
    Semaphore features Continuous Integration and Pipelines, Deployments and Automation, Metrics and Observability, Security and Compliance, a Developer Toolkit, Test Reports, Monorepos, and Self-hosted build agents.
    Semaphore works in the cloud, on-premises, and in the hybrid cloud.
    
    <path_to_document>
    {path}
    </path_to_document>
    
    <document>
    {doc}
    </document>
    
    The document in the "document" XML tags is part of the documentation for a CI/CD tool called Semaphore.
    The path in the "path_to_document" XML tags is path to that particular document. Pay attention to the file and folder names because they suggest the purpose of the document.
    
    Summarize the document. Add context to the document as well. You will need to explain what task it is useful for. And where this document fits into the broader documentation.
    Use the following JSON template.
    {{
        "what_path_suggests": Look at the path to the document. What does the path suggest about this document's purpose?
        "what_is_the_document_about": What is this document about? How does it help us use SempahoreCI?
        "summarize_document": Summarize the document. If the document has a lot of text, then write many paragraphs. If the document has code or a template, describe what you think tha template is for. What language is the code or template in?
    }}
    """

    Building the retriever

    Preprocessing the documents

    The new Semaphore documents are stored in a git repo. We will first clone the repo to our local disk. Then we will use Python’s os.walk() to iterate through each document. For each document, we will call the LLM to summarize and augment it. We will also generate an embedding for the combination of the summary and the augmented context.

    The code below highlights the important details. The complete code file will be presented later in this article.

    How to iterate through each document in the repo

    import os
    
    #1. Setup
    # Define the directory containing the markdown files
    docs_directory = "../semaphore/docs/"
    
    #3.0 Loop through and summarize the markdown files.
    for root, dirs, files in os.walk(docs_directory):
        #3.1 We loop through the markdown files only
        for filename in [ fi for fi in files if fi.endswith(".md") ]:        
            file_path = os.path.join(root, filename)
            #3.2 Read the contents of the markdown file
            with open(file_path, "r") as file:
                content = file.read()
            #...Then we call the LLM for each document in a loop here.

    How to use the LLM to summarize and augment each document

    We are using the Langchain library. Note that we are not using any of Langchain’s output parsers. We are using OpenAI’s JSON mode to get a structured output. Our chain returns a dictionary object. To enable JSON mode, we use model = ChatOpenAI(model=model_to_use).with_structured_output(None, method="json_mode").
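
    For example, the dictionary returned by the chain has one key for each field in the JSON template that you saw in our prompt. The values below are only placeholders.

    json_formatted = chain.invoke({"path": our_path, "doc": content})
    # json_formatted is a plain Python dict, for example:
    # {
    #     "what_path_suggests": "...",
    #     "what_is_the_document_about": "...",
    #     "summarize_document": "..."
    # }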

    We are also creating metadata for each document. We are creating a field called source. The source of a document is the URL of its location on GitHub. The chatbot will display the full URL when answering the user’s questions. The user will be able to click on the URL and view the document on GitHub. The code below shows how it’s done.

    import os
    import prompts #This is prompts.py, which I mentioned above.
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate
    
    #1. Setup
    # Define the directory containing the markdown files
    clickable_url_prefix = "https://github.com/semaphoreci/semaphore/tree/main/docs/"
    docs_directory = "../semaphore/docs/"
    output_dir = "summarized_docs/"
    model_to_use = "gpt-4o-mini"
    
    # Check if output_dir is missing. mkdir if missing.
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    
    summaries = []
    summary_metadata = []
    
    #2.0 Setting the Langchain objects
    ##2.1 Setting up a Langchain "model" object
    model = ChatOpenAI(model=model_to_use).with_structured_output(None, method="json_mode")
    ##2.2 Setting up our prompt template
    our_prompt_template = ChatPromptTemplate.from_messages(
        [ ("user", prompts.summarize_and_augment_prompt)]
    )
    
    #3.0 Loop through and summarize the markdown files.
    for root, dirs, files in os.walk(docs_directory):
        #3.1 We loop through the markdown files only
        for filename in [ fi for fi in files if fi.endswith(".md") ]:        
            file_path = os.path.join(root, filename)
            #3.2 Read the contents of the markdown file
            with open(file_path, "r") as file:
                content = file.read()
                #3.3 Use the LLM to summarize the markdown file
                our_path = file_path.replace(docs_directory,"")
                clickable_url = clickable_url_prefix + our_path
                chain = our_prompt_template | model 
                json_formatted = chain.invoke({"path": our_path, "doc": content})
                #3.4 Append to the summary and path to the lists
                the_summary = json_formatted["what_path_suggests"]
                the_summary += "\n"
                the_summary += json_formatted["what_is_the_document_about"]
                the_summary += "\n"
                the_summary += json_formatted["summarize_document"]
                the_summary += "\n"
                summaries.append(the_summary)
                summary_metadata.append({"source": clickable_url})

    How to compute the embeddings

    Our retriever searches based on embeddings rather than based on text. Next we need to compute embeddings based on the text that we have just generated. We will compute the “document” embeddings once and then save them to disk.

    When the user asks our chatbot a question, we compute the embedding for the question. We then compare our embedded question with the embeddings of our “documents”. We retrieve the documents with the most similar embeddings. The exact implementation is called FAISS, and we’ll be using a library for it.
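
    Conceptually, “most similar” means the embedding vectors that are closest to the question’s embedding vector. The sketch below uses cosine similarity with numpy (already present as a dependency of faiss-cpu) purely as an illustration; FAISS performs this kind of nearest-neighbour search for us, far more efficiently. The function name and parameters are illustrative.

    import numpy as np
    
    def most_similar(question_embedding, doc_embeddings, k=4):
        q = np.array(question_embedding)
        docs = np.array(doc_embeddings)
        # Cosine similarity between the question and every document
        scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
        # Indices of the k closest documents
        return np.argsort(scores)[::-1][:k]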

    At the end of the code snippet below, we set up a dictionary object that holds the text to be retrieved, the embeddings, and the metadata. In this case, the metadata is the path to the source document. We use the pickle library to save this dictionary object to disk.

    import os
    import pickle
    from langchain_openai import OpenAIEmbeddings
    
    #1. Setup
    # Define the directory containing the markdown files
    clickable_url_prefix = "https://github.com/semaphoreci/semaphore/tree/main/docs/"
    docs_directory = "../semaphore/docs/"
    output_dir = "summarized_docs/"
    model_to_use = "gpt-4o-mini"
    output_pickle_file_name = "texts_and_embeddings.pkl"
    output_pickle_file_path = os.path.join(output_dir, output_pickle_file_name)
    
    #2.3 Set up the embeddings model
    embeddings = OpenAIEmbeddings()
    
    #...Document looping/generation code has been omitted for clarity.
    
    #4.0 Generate the embeddings
    print("Embedding summaries...")
    summary_embeddings = embeddings.embed_documents(summaries)
    #5.0 zip everything together and save to disk
    text_embedding_metadata_triplets = {"embeddings":zip(summaries, summary_embeddings), "meta": summary_metadata}
    with open(output_pickle_file_path, "wb") as p_file:
        pickle.dump(text_embedding_metadata_triplets, p_file)

    The complete preprocessing file

    We store the code to preprocess the documents in a file called summarize_docs.py.

    #This file summarizes the documents, creates embeddings, and saves the data to a pickle file.
    
    import os
    import pickle
    import prompts
    from langchain_openai import ChatOpenAI
    from langchain_openai import OpenAIEmbeddings
    from langchain_core.prompts import ChatPromptTemplate
    
    #1. Setup
    # Define the directory containing the markdown files
    clickable_url_prefix = "https://github.com/semaphoreci/semaphore/tree/main/docs/"
    docs_directory = "../semaphore/docs/"
    output_dir = "summarized_docs/"
    model_to_use = "gpt-4o-mini"
    output_pickle_file_name = "texts_and_embeddings.pkl"
    output_pickle_file_path = os.path.join(output_dir, output_pickle_file_name)
    
    # Check if output_dir is missing. mkdir if missing.
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    
    summaries = []
    summary_metadata = []
    
    # Optional: Prompt the user for the API key
    #import getpass
    #os.environ["OPENAI_API_KEY"] = getpass.getpass()
    
    #2.0 Setting the Langchain objects
    ##2.1 Setting up a Langchain "model" object
    model = ChatOpenAI(model=model_to_use).with_structured_output(None, method="json_mode")
    ##2.2 Setting up our prompt template
    our_prompt_template = ChatPromptTemplate.from_messages(
        [ ("user", prompts.summarize_and_augment_prompt)]
    )
    #2.3 Set up the embeddings model
    embeddings = OpenAIEmbeddings()
    
    #3.0 Loop through and summarize the markdown files.
    for root, dirs, files in os.walk(docs_directory):
        #3.1 We loop through the markdown files only
        for filename in [ fi for fi in files if fi.endswith(".md") ]:        
            file_path = os.path.join(root, filename)
            #3.2 Read the contents of the markdown file
            with open(file_path, "r") as file:
                content = file.read()
                #3.3 Use the LLM to summarize the markdown file
                our_path = file_path.replace(docs_directory,"")
                clickable_url = clickable_url_prefix + our_path
                chain = our_prompt_template | model 
                json_formatted = chain.invoke({"path": our_path, "doc": content})
                #3.4 Append to the summary and path to the lists
                the_summary = json_formatted["what_path_suggests"]
                the_summary += "\n"
                the_summary += json_formatted["what_is_the_document_about"]
                the_summary += "\n"
                the_summary += json_formatted["summarize_document"]
                the_summary += "\n"
                summaries.append(the_summary)
                summary_metadata.append({"source": clickable_url})
                #3.5 Optional: Write the summarized file to disk
                s_file_name = os.path.join(output_dir, our_path.replace("/","_"))
                with open(s_file_name, "w") as s_file:
                    print(s_file_name)
                    path_spec_text = "Path to file: {0} \n\n".format(our_path)
                    s_file.write(path_spec_text)
                    s_file.write(the_summary)
    
    #4.0 Generate the embeddings
    print("Embedding summaries...")
    summary_embeddings = embeddings.embed_documents(summaries)
    #5.0 zip everything together and save to disk
    text_embedding_metadata_triplets = {"embeddings":zip(summaries, summary_embeddings), "meta": summary_metadata}
    with open(output_pickle_file_path, "wb") as p_file:
        pickle.dump(text_embedding_metadata_triplets, p_file)

    How to retrieve the relevant documents at query time

    We’re using FAISS as our retriever. I chose FAISS for this tutorial because it runs in memory. We don’t need to worry about installing any databases. In production, you might decide to outsource the retriever to a cloud vendor’s search product.

    The code below is an extract from a larger file that is called rag_bot.py. The complete file will be presented later in this article.

    We read the pickled data from disk. Then we instantiate a retriever object. To test our retriever, we could search our documents with retriever.invoke(). For example, retriever.invoke("What do I need to keep in mind for security?")

    import os
    import pickle
    from langchain_openai import OpenAIEmbeddings
    from langchain_community.vectorstores import FAISS
    
    #1. Setup
    docs_dir = "summarized_docs/"
    docs_pickle_file_name = "texts_and_embeddings.pkl"
    docs_pickle_file_path = os.path.join(docs_dir, docs_pickle_file_name)
    with open(docs_pickle_file_path, "rb") as p_file:  
        text_embedding_metadata_triplets = pickle.load(p_file)
    
    #2.3 Set up the embeddings model
    embeddings = OpenAIEmbeddings()
    
    #3.0 Setting up the retriever
    faiss_vectorstore = FAISS.from_embeddings(text_embedding_metadata_triplets["embeddings"], embeddings,text_embedding_metadata_triplets["meta"])
    retriever = faiss_vectorstore.as_retriever()
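
    As a quick sanity check, we can call the retriever directly with the test question mentioned above. Each returned Document carries the summary plus augmented context as its page_content, and the GitHub URL that we stored as its source metadata. The variable name test_docs is just an example.

    test_docs = retriever.invoke("What do I need to keep in mind for security?")
    for doc in test_docs:
        print(doc.metadata["source"])    # the clickable GitHub URL
        print(doc.page_content[:200])    # the start of the summary plus augmented context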

    Building the question-answering component

    The RAG question-answering prompt

    The question-answering component has a system prompt and a user prompt. The system prompt and the user prompt template are shown below. They are stored in prompts.py.

    #This is the prompt for the RAG 
    rag_system_prompt = """
    You are helping me find my way through the Semaphore CI documentation. 
    You answer my questions based on the context. And then you direct me to the source of the document, where I can read more.
    You must answer based on the context only. Do not use your general knowledge.
    If you don't know the answer then reply with "I don't know".
    You are always very optimistic and encouraging.
    
    """
    
    
    rag_user_prompt = """
    The question is in the "question" XML tags.
    The context is in the "context" XML tags.
    The context is made up of documents. Each "document" is delimited by its own XML tag.
    Each document has a "source" property. You need to direct me to the source of the document. Because the source of the document will have more details.
    Answer the question based on the context. If you don't know the answer then reply with "I don't know".
    
    <context>
    {context}
    </context>
    
    
    <question>
    {question}
    </question>
    
    The question is in the "question" XML tags.
    The context is in the "context" XML tags.
    The context is made up of documents. Each "document" is delimited by its own XML tag.
    Each document has a "source" property. You need to direct me to the source of the document. Because the source of the document will have more details.
    Answer the question based on the context. If you don't know the answer then reply with "I don't know".
    
    Use the following JSON template for your reply. Write valid JSON.
    {{
        "answer": Answer the question. Make sure that you explain where I can find more information - in the source of the relevant documents.
        "sources": [Output the sources of the documents that you used to answer the question into this array.]
    }}
    """

    How to use the LLM to answer the question

    The retriever retrieves the documents that are relevant to our question. Then we need to insert those documents into the LLM’s prompt. The inserted documents are called the context. The prompt instructs the LLM to answer the question, but only based on the context in the prompt.

    We need to concatenate all of the retrieved documents into one string. And the final string must be in a style that matches our prompt. We will write a function called format_docs(). You can see it in the code below. Note how we are using the metadata as well.

    import prompts
    from textwrap import dedent
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    
    
    #0. Function used to arrange the search results. Could be used in other code.
    def format_docs(docs):
        """
        This function formats the documents for insertion into the RAG prompt
        """
        doc_template_string = dedent("""
        <document source="{the_source}">
        {content}
        </document>""")
        return "\n\n".join(doc_template_string.format(content=doc.page_content,the_source=doc.metadata["source"]) for doc in docs)
    
    #1. Setup
    model_to_use = "gpt-4o-mini"
    
    
    #...The retriever parts of the file have been omitted for clarity
    
    #4.0 Setting up the RAG chain
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | our_prompt_template
        | model
    
    )
    
    #5.0 Calling the RAG chain in a loop to answer question in the terminal
    while True:
        user_question = input("Your question: ")
        rag_answer = rag_chain.invoke(user_question)
        print("\nBot's answer: {0}\n".format(rag_answer["answer"]))
        print("For more details see: {0}\n\n".format(rag_answer["sources"]))

    The complete RAG chatbot file

    The retriever code and the question-answering code are part of a file called rag_bot.py. Here’s the complete file.

    #This file has the chatbot
    
    import os
    import pickle
    import prompts
    from textwrap import dedent
    from langchain_openai import ChatOpenAI
    from langchain_openai import OpenAIEmbeddings
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_community.vectorstores import FAISS
    from langchain_core.runnables import RunnablePassthrough
    
    
    #0. Function used to arrange the search results. Could be used in other code.
    def format_docs(docs):
        """
        This function formats the documents for insertion into the RAG prompt
        """
        doc_template_string = dedent("""
        <document source="{the_source}">
        {content}
        </document>""")
        return "\n\n".join(doc_template_string.format(content=doc.page_content,the_source=doc.metadata["source"]) for doc in docs)
    
    
    if __name__ == "__main__":
        # Optional: Prompt the user for the API key
        #import getpass
        #os.environ["OPENAI_API_KEY"] = getpass.getpass()
    
        #1. Setup
        model_to_use = "gpt-4o-mini"
        docs_dir = "summarized_docs/"
        docs_pickle_file_name = "texts_and_embeddings.pkl"
        docs_pickle_file_path = os.path.join(docs_dir, docs_pickle_file_name)
    
        with open(docs_pickle_file_path, "rb") as p_file:  
            text_embedding_metadata_triplets = pickle.load(p_file)
    
    
        #2.0 Setting the Langchain objects
        ##2.1 Setting up a Langchain "model" object
        model = ChatOpenAI(model=model_to_use).with_structured_output(None, method="json_mode")
        ##2.2 Setting up our prompt template
        our_prompt_template = ChatPromptTemplate.from_messages(
            [ ("system", prompts.rag_system_prompt), ("user", prompts.rag_user_prompt)]
        )
        #2.3 Set up the embeddings model
        embeddings = OpenAIEmbeddings()
    
        #3.0 Setting up the retriever
        faiss_vectorstore = FAISS.from_embeddings(text_embedding_metadata_triplets["embeddings"], embeddings,text_embedding_metadata_triplets["meta"])
        retriever = faiss_vectorstore.as_retriever()
    
        #4.0 Setting up the RAG chain
        rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | our_prompt_template
            | model
    
        )
    
        #5.0 Calling the RAG chain in a loop to answer question in the terminal
        while True:
            user_question = input("Your question: ")
            rag_answer = rag_chain.invoke(user_question)
            print("\nBot's answer: {0}\n".format(rag_answer["answer"]))
            print("For more details see: {0}\n\n".format(rag_answer["sources"]))

    Putting it all together

    Directory structure

    We have three .py files. Our three prompts go into prompts.py. summarize_docs.py preprocesses the documents. rag_bot.py is the RAG chatbot that runs in the terminal.

    I have drawn the directory structure below. The semaphore folder is a clone of the new Semaphore git repo. I haven’t shown all of the files. We are working with the markdown files in the docs directory.

    All of the paths that the chatbot displays are relative to semaphore/docs/. There is a second docs directory within semaphore/docs/. So semaphore/docs/docs/getting-started/guided-tour.md will be displayed as docs/getting-started/guided-tour.md.

    📂Top_level_dir
     ┣ 📂docs_chatbot
     ┃ ┣ 📂summarized_docs
     ┃ ┃ ┗ 📜texts_and_embeddings.pkl <== This file is generated by summarize_docs.py
     ┃ ┣ 📜__init__.py <== Optional.
     ┃ ┣ 📜prompts.py
     ┃ ┣ 📜rag_bot.py
     ┃ ┣ 📜requirements.txt
     ┃ ┗ 📜summarize_docs.py
     ┗ 📂semaphore <== This is a clone of the repo at https://github.com/semaphoreci/semaphore.git
     ┃ ┣ 📁.git
     ┃ ┣ 📁.github
     ┃ ┣ 📁.semaphore
     ┃ ┣ 📁docs <== This is the docs directory that we are working with
     ┃ ┣ 📜LICENSE
     ┃ ┗ 📜README.md
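
    Worked through with the constants defined in summarize_docs.py, the path handling described above looks like this for the guided tour document.

    docs_directory = "../semaphore/docs/"
    clickable_url_prefix = "https://github.com/semaphoreci/semaphore/tree/main/docs/"
    
    file_path = "../semaphore/docs/docs/getting-started/guided-tour.md"
    our_path = file_path.replace(docs_directory, "")
    # our_path == "docs/getting-started/guided-tour.md"
    clickable_url = clickable_url_prefix + our_path
    # clickable_url == "https://github.com/semaphoreci/semaphore/tree/main/docs/docs/getting-started/guided-tour.md"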
    
    
    

    Setting the OpenAI API KEY

    I have set the OpenAI API KEY as an environment variable on my machine. I use the following code in the terminal. Note how I clear history.

    export OPENAI_API_KEY="my_api_key_here_123"
    history -c

    Then I press ctrl+shift+L to clear the terminal window. Your keyboard shortcuts might be different.

    Another option is for the Python code to prompt the user for the API key. You can find some commented-out code in summarize_docs.py and rag_bot.py.

    # Optional: Prompt the user for the API key
    import getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass()

    Optional code that writes the summaries and augmented context to disk

    summarize_docs.py has some optional code that writes the summaries and augmented context to disk, along with the path to the source file. You can remove it if you don’t want it. These files are a copy of the text that we are storing in our retriever.

    #3.5 Optional: Write the summarized file to disk
    s_file_name = os.path.join(output_dir, our_path.replace("/","_"))
    with open(s_file_name, "w") as s_file:
        print(s_file_name)
        path_spec_text = "Path to file: {0} \n\n".format(our_path)
        s_file.write(path_spec_text)
        s_file.write(the_summary)

    Running the chatbot

    Here are the shell commands to preprocess the documents and run the RAG chatbot. I have put the code for this chatbot into a public git repo. So you can set up all of the files by just cloning two git repos.

    git clone https://github.com/semaphoreci/semaphore.git
    git clone https://github.com/slavarazbash/docs_chatbot.git
    

    Then you will need to install the required packages. I recommend using a virtual environment.

    cd docs_chatbot
    python -m venv myenv
    source myenv/bin/activate
    pip install -r requirements.txt
    

    These are the commands to preprocess the documents and start the chatbot.

    python summarize_docs.py
    python rag_bot.py

    Try it yourself. Please reach out if you have any questions or difficulties.


    Written by:
    Slava Razbash
    Slava is an experienced AI Engineer. He has worked for some of Australia's largest companies. He publishes AI articles, gives AI talks, and teaches AI courses. Feel free to connect with Slava on LinkedIn.
    Reviewed by:
    I picked up most of my soft/hardware troubleshooting skills in the US Army. A decade of Java development drove me to operations, scaling infrastructure to cope with the thundering herd. Engineering coach and CTO of Teleclinic.