Jointly Querying Text and Structured Data? The Power of Large Language Models and the Quest for Reliable Queries
In today’s data-driven world, structured information stored within databases is fundamental for countless organisations, facilitating efficient data manipulation and querying through the well-known SQL language. However, a vast wealth of valuable knowledge remains locked within unstructured text. Despite progress in automatic extraction, obtaining structured data from text is still time-consuming and error-prone. Even when structured data has been extracted, it is still a challenge to integrate it with existing databases.
While human comprehension is the traditional gateway to understanding and manipulating text, Large Language Models (LLMs), such as ChatGPT, offer a revolutionary advantage by effectively querying and interpreting unstructured text. These LLMs can read vast textual datasets, from corporate reports to internet repositories, and answer complex questions expressed in natural language. However, when a query posed to an LLM becomes more complicated and errors become costly, can we find a way to trust the answers these tools provide?
EURECOM professor Paolo Papotti, an expert in databases and AI, explains how he and his colleagues added a layer of declarative querying on top of LLMs, effectively handling the original free text as if it were tabular data. This process makes it possible to query an LLM with SQL, as depicted in Figure 1. A SQL query is more precise than a question expressed in natural language, and its output is a relation, which can be directly parsed by data-centric applications.
In this article, we delve into the complex topic of LLMs as potential data storage and retrieval systems, examining the possibilities and challenges of using their language skills to enhance structured databases.
Q. What is the motivation and challenge for this work?
PP. The main motivation is to be able to run queries over all the resources of an institution. As LLMs become popular for processing proprietary documents, we make it possible to query both corporate databases and text resources through a single declarative interface, as depicted in Figure 2.
One of the main challenges concerns the fragility of generative AI, which becomes apparent in structured data. When AI generates images or text, slight deviations or anomalies are usually acceptable for the final user. Consider, for instance, the scenario where you request an AI system to conjure an image of a dog traveling on an airplane, a task it will execute successfully in most cases. Current models are becoming very good at avoiding obvious mistakes, such as a dog with two tails. However, minor irregularities, such as slightly larger ears or marginally misaligned eyes, might escape notice, leading to the perception of a satisfactory outcome. This tolerance extends to the domain of natural language text, where the choice of an unusual adjective or a suspicious comma can be considered a matter of style. In structured data, however, even minor errors, including stray punctuation or typos, can swiftly contaminate the result, introducing inaccuracies that render the data unreliable, particularly when employed in critical decision-making processes. Our work focuses on making the queries executed on LLMs robust to this sensitivity, so that the resulting data is accurate.
Q. What is your proposed approach for augmenting query accuracy with AI?
PP. In our research, we leverage pre-trained Large Language Models (LLMs), such as ChatGPT, and query processing techniques from database literature to obtain highly accurate queries. It is widely acknowledged that LLMs excel when a difficult task is broken down into small ones in a conversational format, facilitating a back-and-forth exchange with the model. This is known as chain-of-thought prompting and requires humans to carefully craft an LLM’s prompt for every task.
Our innovative approach takes as input an arbitrarily complex SQL query and generates a structured table containing the requisite information based on the knowledge embedded within the LLM. We achieve this by deconstructing the query into smaller, more manageable pieces and automatically posing clear questions to the LLM for each component. This strategy is akin to constructing a logical plan, comprising a sequence of operations to retrieve and integrate data, ultimately providing the desired answers. Notably, our tool operates by deriving a query plan that collects and validates the LLM’s stored knowledge, which acts as a proxy of the information in the original corpus of documents. By adopting this logical plan approach, we effectively mitigate the risk of introducing errors into the results, making our tool considerably more precise than conventional natural language approaches in data extraction.
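To make the idea of a query plan over an LLM more concrete, here is a minimal sketch in Python. It is not the authors' system: the function names (fake_llm, run_plan), the prompt wording, and the canned answers are all hypothetical, and the "LLM" is a simple lookup table so that the example runs without any external service. It only illustrates how one SQL query might be split into small steps, each answered by a short, unambiguous prompt.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM: canned answers keyed by prompt.
# The population figures are illustrative placeholders, not verified data.
CANNED_ANSWERS = {
    "List the names of cities in Italy, separated by commas.": "Rome, Milan, Naples",
    "What is the population of Rome? Answer with a number only.": "2,800,000",
    "What is the population of Milan? Answer with a number only.": "1,400,000",
    "What is the population of Naples? Answer with a number only.": "900,000",
}


def fake_llm(prompt: str) -> str:
    """Return the canned answer for a prompt (assumption: no real model is called)."""
    return CANNED_ANSWERS[prompt]


def parse_int(text: str) -> int:
    """Turn an answer such as '2,800,000' into the integer 2800000."""
    return int(text.replace(",", "").strip())


def run_plan(llm: Callable[[str], str]) -> List[Tuple[str, int]]:
    """Evaluate, step by step, the query:
    SELECT name, population FROM city
    WHERE country = 'Italy' AND population > 1000000
    """
    # Step 1 (scan with a selection on country): ask for the key values.
    answer = llm("List the names of cities in Italy, separated by commas.")
    names = [n.strip() for n in answer.split(",") if n.strip()]

    # Step 2 (attribute retrieval): one small prompt per tuple.
    rows = []
    for name in names:
        reply = llm(f"What is the population of {name}? Answer with a number only.")
        rows.append((name, parse_int(reply)))

    # Step 3 (selection on population): done in ordinary code, not by the model.
    return [row for row in rows if row[1] > 1_000_000]


if __name__ == "__main__":
    for row in run_plan(fake_llm):
        print(row)
```

In this sketch the numeric filter is applied outside the model, which is one plausible way to keep the comparison exact; whether a given operator is better pushed into the prompt or executed in code is a design choice the sketch does not settle.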
Q. What are the risks in this line of work?
PP. Our approach leverages a well-established principle — breaking down complex tasks — for transformative results. Our experiments over four LLMs, including Flan and TK for local use, as well as GPT-3 and ChatGPT, highlight the versatility of this approach. We have demonstrated the impact of prompt variations, showcasing how subtle changes in input can yield differing outcomes — a testament to the precision achievable through prompt generation from query plans.
However, it is of the utmost importance that users exercise due diligence in verifying outputs. In navigating the domain of LLMs, it is crucial to dispel misconceptions, recognizing that these models possess vast probabilistic knowledge rather than true intelligence. Their interpretation of text occurs word by word, and even a minor mistake can carry significant consequences in data-driven contexts. Yet, LLMs remain an ever-improving force, poised to transform roles traditionally fulfilled by humans.
It is also essential to consider who benefits from this transformative technology. LLMs are primarily driven by influential corporations, prompting debates surrounding the use of publicly available content to train these models and the resulting disparities in benefits. In this evolving landscape, these ethical considerations will remain central in discussions surrounding LLMs.
Q. What are promising research directions for the future?
PP. In the context of our work, our next challenge is hybrid querying over LLMs and databases. As depicted in Figure 2, we are getting close to a unified interface, with the expressive power of SQL, that allows precise access to all the knowledge in an institution. This would be the first time users get access to this combined information without any data pre-processing, demonstrating the benefit of combining techniques from the database community with the new opportunities from the NLP field.
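As a rough illustration of what hybrid querying could look like, the sketch below joins an ordinary database table with a relation whose values come from a language model. Everything in it is an assumption made for the example: the office table, the prompt template, and the fake_llm function, which returns canned text instead of calling a real model. It uses Python's standard sqlite3 module and should be read as one possible shape for such a pipeline, not as the interface described in the interview.

```python
import sqlite3
from typing import Callable, List, Tuple

# Hypothetical prompt template used to fetch one attribute per city.
PROMPT = "In which country is {city}? Answer with the country name only."


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns canned text."""
    canned = {
        PROMPT.format(city="Lyon"): "France",
        PROMPT.format(city="Milan"): "Italy",
        PROMPT.format(city="Rome"): "Italy",
    }
    return canned[prompt]


def hybrid_query(llm: Callable[[str], str]) -> List[Tuple[str, int]]:
    """Total headcount per country, where the country comes from the LLM."""
    conn = sqlite3.connect(":memory:")
    try:
        # Corporate data that already lives in a database (made-up rows).
        conn.execute("CREATE TABLE office (city TEXT, headcount INTEGER)")
        conn.executemany(
            "INSERT INTO office VALUES (?, ?)",
            [("Rome", 120), ("Lyon", 45), ("Milan", 80)],
        )

        # Materialise the LLM's answers as a second relation.
        conn.execute("CREATE TABLE llm_city (city TEXT, country TEXT)")
        cities = [
            row[0]
            for row in conn.execute("SELECT DISTINCT city FROM office ORDER BY city")
        ]
        for city in cities:
            country = llm(PROMPT.format(city=city)).strip()
            conn.execute("INSERT INTO llm_city VALUES (?, ?)", (city, country))

        # From here on, plain SQL combines both sources.
        return conn.execute(
            "SELECT c.country, SUM(o.headcount) "
            "FROM office AS o JOIN llm_city AS c ON o.city = c.city "
            "GROUP BY c.country ORDER BY c.country"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(hybrid_query(fake_llm))
```

Materialising the model's answers into a temporary table keeps the join and aggregation inside the database engine, so the model is only asked short factual questions.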
In general, training and running inference on these massive neural networks demand significant energy and resources, presenting a sustainability concern. The academic community, often resource-constrained, seeks innovative approaches to contribute meaningfully to LLM research without the overwhelming burden of massive computational requirements. Moreover, the prospect of open-source LLMs, supported by EU regulatory bodies, holds promise. This approach, where LLMs are made accessible to all, could democratize the field, with various stakeholders customizing models to suit their specific needs. In essence, the future of LLMs may see a diverse ecosystem of models, each tailored to its user’s unique angle and interests, underpinned by a commitment to energy-efficient and sustainable AI advancements.
References
1. Mohammed Saeed, Nicola De Cao, Paolo Papotti, Querying Large Language Models with SQL, EDBT 2024