Photo by Debby Hudson / Unsplash

Tagging and Summarizing Articles with OpenAI and LangChain

openai Jan 17, 2025

In this post, we build a simple AI-enabled application for tagging and summarizing articles by leveraging OpenAI and LangChain.

📌
TL;DR: we will try to mimic some of the functionality of Inshorts.

Under the hood, our application will use OpenAI function calling to interact with a Large Language Model and get a structured output with a summary of the article, its language, and associated tags.

First, we load our OpenAI API key into an environment variable and add the necessary imports.

import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("OPENAI_API_KEY")

from pydantic import BaseModel, Field
from langchain_core.utils.function_calling import convert_to_openai_function
# Document loaders live in the langchain-community package in recent releases.
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers.openai_functions import JsonOutputFunctionsParser

Then, we initialize the model with temperature=0 so that its responses are as consistent and repeatable as possible rather than varied and creative.

model = ChatOpenAI(temperature=0)

Next, we load the article we would like to summarize.

loader = WebBaseLoader("https://blogs.nvidia.com/blog/ces-2025-jensen-huang/")
documents = loader.load()
doc = documents[0]

Next, we create a Pydantic data model to structure the output. The Pydantic library offers a concise way to define data structures and provide validation support.

We must add a docstring to the model and descriptions to all the fields to help the LLM understand the desired responses. We use Pydantic data models to create a JSON schema, allowing easy integration with OpenAI models.

class Overview(BaseModel):
    """Overview of an article."""
    summary: str = Field(description="Provide an excerpt of the content in 60 words.")
    language: str = Field(description="Provide the language that the content is written in.")
    keywords: str = Field(description="Provide keywords related to the content. All the keywords should be in lowercase.")

Next, we use the convert_to_openai_function function to convert the Pydantic model into the JSON schema that OpenAI function calling expects.

overview_tagging_function = [
    convert_to_openai_function(Overview)
]
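
For orientation, the converted function is a plain dictionary. The sketch below hand-writes an approximation of it using only the standard library. The exact keys, ordering, and description text that convert_to_openai_function produces may differ between LangChain versions, so treat this as illustrative rather than as the library's literal output.

```python
import json

# Hand-written approximation of the schema derived from the Overview model.
# Assumption: the real output may contain extra keys or a different order.
# The descriptions are abridged versions of the Field(...) descriptions.
approx_overview_function = {
    "name": "Overview",
    "description": "Overview of an article.",
    "parameters": {
        "type": "object",
        "properties": {
            "summary": {
                "type": "string",
                "description": "Excerpt of the content in 60 words.",
            },
            "language": {
                "type": "string",
                "description": "Language the content is written in.",
            },
            "keywords": {
                "type": "string",
                "description": "Lowercase keywords related to the content.",
            },
        },
        # Fields without defaults on the Pydantic model are required.
        "required": ["summary", "language", "keywords"],
    },
}

print(json.dumps(approx_overview_function, indent=2))
```

The docstring becomes the function description and each Field description becomes a property description, which is why both matter to the LLM.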

We then bind the function to the model, forcing it to call Overview on every request, and create a simple prompt and an output parser.

tagging_model = model.bind(
    functions=overview_tagging_function,
    function_call={"name":"Overview"}
)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}")
])
json_output_function_parser = JsonOutputFunctionsParser()
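
To make the parser's role concrete: with function calling, the model's reply carries the function name and a JSON-encoded arguments string rather than free text, and the parser decodes that string into a dictionary. The sketch below mimics that step with the standard library only. The fake_function_call payload is made up for illustration, and parse_function_arguments is a hypothetical helper, not a LangChain API.

```python
import json

# Made-up payload shaped like the function_call part of a model reply.
# Note that "arguments" is a JSON string, not a dictionary.
fake_function_call = {
    "name": "Overview",
    "arguments": '{"summary": "A short summary.", "language": "english", "keywords": "ai, gpu"}',
}


def parse_function_arguments(function_call: dict) -> dict:
    """Decode the JSON string held in the 'arguments' field."""
    return json.loads(function_call["arguments"])


parsed = parse_function_arguments(fake_function_call)
print(parsed["language"])
```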

Then, we create a LangChain runnable named tagging_chain by piping together the prompt, tagging_model, and the json_output_function_parser output parser.

tagging_chain = prompt | tagging_model | json_output_function_parser
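
The | operator composes the steps so that each step's output feeds the next. As a rough mental model only (this is a toy class, not how LangChain implements runnables), the same idea can be sketched with Python's __or__ hook. Step and the three fake_* objects are hypothetical stand-ins for the prompt, model, and parser.

```python
from typing import Any, Callable


class Step:
    """Toy stand-in for a runnable: wraps a single function."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b returns a new Step that runs a, then b on a's result.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Hypothetical stand-ins: format a prompt, "call" a model, parse the reply.
fake_prompt = Step(lambda inputs: f"Tag this text: {inputs['input']}")
fake_model = Step(lambda text: {"arguments": text.upper()})
fake_parser = Step(lambda message: message["arguments"])

toy_chain = fake_prompt | fake_model | fake_parser
result = toy_chain.invoke({"input": "hello"})
print(result)
```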

Finally, we invoke the runnable with the contents of the page to get well-structured output like the following:

tagging_chain.invoke({"input": doc.page_content})

{
  'summary': 'NVIDIA CEO Jensen Huang discussed the advancements in AI at CES 2025, unveiling new products like the NVIDIA Cosmos platform and Blackwell RTX 50 Series GPUs. The focus was on physical AI, AI tools for PCs, and innovations in autonomous vehicles and robotics.',
  'language': 'english',
  'keywords': 'nvidia, ces 2025, ai advancements, jensen huang, nvidia cosmos, blackwell rtx 50 series gpus, physical ai, robotics, autonomous vehicles'
}
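
Because keywords is declared as a single string, the tags arrive comma-separated. If a list is more convenient downstream, one option is to split the string after the fact. split_keywords below is a hypothetical helper, sketched with the standard library, and the input is a shortened, made-up keyword string.

```python
def split_keywords(keywords: str) -> list[str]:
    """Split a comma-separated keyword string into a clean list of tags."""
    seen = set()
    tags = []
    for raw in keywords.split(","):
        tag = raw.strip().lower()
        # Skip empty entries and duplicates while preserving order.
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags


tags = split_keywords("nvidia, ces 2025, ai advancements, jensen huang, nvidia")
print(tags)
```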

https://platform.openai.com/docs/guides/function-calling

https://openai.com/index/function-calling-and-other-api-updates/

https://python.langchain.com/docs/concepts/runnables/
