By Kai Spriestersbach, first published under:

I’m thrilled to introduce something very special today: a brand-new paper that has just landed as a pre-print on arXiv, and one I had the privilege of contributing to as a researcher at the German Research Center for Artificial Intelligence (DFKI)! Our team in the Data Science and its Applications (DSA) working group, led by Prof. Sebastian Vollmer, has been addressing challenges in data science since the group was founded at DFKI in 2021. Our current focus: exploring the potential that large language models (LLMs) offer for the analysis and interpretation of data.

In data science, researchers often face the tricky problem of working with incomplete datasets. Most conventional algorithms simply can’t handle these gaps in the data. Previous solutions involved data scientists consulting experts to help fill in the missing parts, a process that is not only time-consuming but often impractical.

Imagine if a machine could take on the role of an expert – that’s exactly what we explored. We wanted to find out if large language models (LLMs) have the potential to act as digital experts in various fields. These models, trained on vast amounts of data, potentially bring a deep understanding of a wide range of topics – from medicine to social sciences.

Our investigation, comparing the performance of LLMs with real data and recognized statistical methods for treating data gaps, has revealed some exciting findings. The results show: In many cases, LLMs can make predictions as accurately as conventional methods, without the need to rely on human experts. This discovery opens up entirely new possibilities for data analysis and could radically change the approach to the problem of expert shortages.

Let’s take a closer look at how this research not only fills data gaps but also challenges the definition of expertise itself.

In data analysis, whether in medicine, business, or environmental research, one often encounters the problem of incomplete information. Traditionally, two main strategies are employed: Prior Elicitation and Data Imputation.

Here’s a brief explanation of both methods:

Prior Elicitation aims to systematically capture existing expert knowledge and turn it into assumptions (prior distributions) about certain model parameters. This reduces the uncertainty of our conclusions and refines our models.

Data Imputation is used when our datasets have gaps. Instead of discarding valuable data because some information is missing, scientists fill these gaps with statistically plausible values. This way, the datasets are complete again and can be used for analyses.

Our focus here was on the OpenML-CC18 Curated Classification Benchmark, a benchmark suite of 72 classification datasets from various fields, from credit evaluation to medicine to marketing. This diversity ensured that our experiments simulate a wide range of real-world scenarios and give insight into how well LLMs perform under different conditions.

A key aspect of our methodology was the artificial creation of data gaps following the “Missing at Random” (MAR) principle to simulate realistic situations where data points are missing. The goal was to create a setting in which one would normally consult a domain expert. Subsequently, we defined a specific expert role for each dataset, which we used to initialize the LLM for answering questions about the missing data.
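To make the MAR idea concrete, here is a minimal sketch of one way such gaps can be generated. It is an illustration only, not the exact procedure from the paper: the function name `make_mar`, the logistic link, and the default 20% missing rate are all assumptions. The defining property of MAR is that the chance of a value being missing depends only on other, fully observed columns.

```python
import numpy as np

def make_mar(X, target_col, driver_col, missing_rate=0.2, seed=0):
    """Blank out values in target_col; missingness depends only on driver_col.

    Hypothetical helper for illustration (not the paper's code).
    Returns the masked copy of X and the boolean mask of removed entries.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    driver = X[:, driver_col]
    # Standardise the fully observed driver column so the logistic is well scaled.
    z = (driver - driver.mean()) / (driver.std() + 1e-12)
    # Rows with a larger driver value are more likely to lose their target value.
    p = 1.0 / (1.0 + np.exp(-z))
    # Rescale so that roughly `missing_rate` of the rows end up missing.
    p = np.clip(p * missing_rate / p.mean(), 0.0, 1.0)
    mask = rng.random(len(X)) < p
    X[mask, target_col] = np.nan
    return X, mask

# Example: 1000 rows, 3 features; column 0 gets gaps driven by column 1.
data = np.random.default_rng(1).normal(size=(1000, 3))
data_mar, removed = make_mar(data, target_col=0, driver_col=1)
```

Because the mask is returned alongside the data, the true values stay available for scoring any imputation afterwards.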

For the actual data imputation, we relied on a series of LLMs, including LLaMA 2 13B Chat, LLaMA 2 70B Chat, Mistral 7B Instruct, and Mixtral 8x7B Instruct. Each of these models was evaluated separately and compared with three traditional approaches to data imputation: mean imputation for continuous features and mode imputation for categorical ones, k-Nearest Neighbors (k-NN) Imputation, and Random Forest Imputation. We measured the quality of imputation using the “Normalized Root Mean Square Error” (NRMSE) for continuous features and the F1 score for categorical features.
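The simplest baseline and the two metrics can be sketched in a few lines. This is a rough illustration under stated assumptions rather than our evaluation code: the NRMSE here is normalised by the standard deviation of the true values (other conventions divide by the range or the mean), and the F1 score is macro-averaged over classes. All function names are hypothetical.

```python
from collections import Counter
import numpy as np

def mean_impute(column):
    """Fill NaNs in a continuous column with the mean of the observed values."""
    column = np.asarray(column, dtype=float)
    filled = column.copy()
    filled[np.isnan(column)] = np.nanmean(column)
    return filled

def mode_impute(column, missing=None):
    """Fill missing entries in a categorical column with the most frequent value."""
    observed = [v for v in column if v is not missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in column]

def nrmse(y_true, y_imputed):
    """RMSE divided by the standard deviation of the true values (assumed convention)."""
    y_true = np.asarray(y_true, dtype=float)
    y_imputed = np.asarray(y_imputed, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_imputed) ** 2))
    return rmse / np.std(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted average of per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With this normalisation, always guessing the true mean gives an NRMSE of 1, so values well below 1 indicate that an imputer is doing better than the mean baseline.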

This methodical approach allowed us not only to assess to what extent LLMs can serve as experts in data imputation but also to measure their performance against conventional methods.

Our investigation revealed surprising findings: The imputation quality of LLMs did not consistently surpass that of the three empirical methods used. However, LLM-based imputations can be of great benefit in certain areas, especially in engineering and computer vision. In datasets like ‘pc1’, ‘pc3’, and ‘satimage’ from these domains, LLM-based imputation reached an NRMSE of about 0.1. Similar positive results were also observed in the fields of biology and natural language processing (NLP).

Interestingly, the performance of LLM-based imputation varied significantly depending on the field of application. While the social sciences and psychology provided rather disappointing results, the results in medicine, business, and biology were better. The strong performance in the business sector was particularly remarkable.

These findings suggest that LLMs could represent a promising alternative to traditional imputation methods in specific contexts, especially in scenarios where human expertise is hard to access. However, the universal applicability of LLM-based imputation is limited, and further research is needed to identify the areas and conditions under which this technique can be most effectively used.

Our detailed analysis of missing data imputation showed a variable performance of LLMs compared to traditional methods. Although the LLMs did not outperform the empirical methods in all cases, they revealed significant strengths in specific application areas.

The results suggest that LLMs, thanks to their extensive training data, can provide estimates that come close to the real data in some cases. These mixed results underscore that using LLMs for data imputation is promising but requires careful consideration of the specific domain and area of application.

Our research findings significantly contribute to understanding the potential and limits of LLMs in data science. They demonstrate that targeted use of this technology, taking into account its strengths and weaknesses, is necessary. While LLMs in certain areas already achieve accuracy comparable to traditional imputation methods and human expertise, we are just at the beginning of what is possible with this technology.

The second part of our project addressed the fascinating question of how large language models (LLMs) can be used in the area of Prior Elicitation. We sought to find out what impact LLM-generated prior distributions have on data collection and subsequent analysis and how effective these are compared to traditional methods.

Our approach involved comparing the capabilities of LLMs with the results of an experiment by Stefan et al. (2022). In this experiment, six psychology researchers were asked about the typical small to medium effect sizes and Pearson correlations in their fields. Similarly, we asked LLMs to simulate either a single expert, a group of experts, or a non-expert and generate prior distributions, both with and without reference to the survey protocol used in the comparison study.

A particular challenge was developing a methodology that allows models to generate expert knowledge in areas where direct quantitative statements are restricted by built-in safety measures. Due to their alignment, conventional Instruct or Chat models typically do not provide quantitative information on sensitive topics such as health conditions.

To overcome these limitations, we applied an innovative prompting strategy. Instead of asking for specific mean values or standard deviations, we asked the models to express their assessments in the form of pseudocode for Stan distributions, such as y ∼ normal(120, 10), to indicate a distribution for the typical systolic blood pressure of a randomly selected person.
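The following sketch shows how a reply in this format could be turned back into numbers. The prompt wording in `build_prompt` is hypothetical and only illustrates the idea, it is not the prompt from the paper, and `parse_prior` is an assumed helper name. The parser accepts both the ASCII tilde and the “∼” character.

```python
import re

# Matches e.g. "y ~ normal(120, 10)" or "y ∼ normal(120, 10)".
PRIOR_PATTERN = re.compile(r"(\w+)\s*[~∼]\s*(\w+)\s*\(([^)]*)\)")

def build_prompt(quantity):
    """Hypothetical prompt wording, for illustration only."""
    return (
        "Express your belief about " + quantity + " as a single line of "
        "Stan-style pseudocode, for example: y ~ normal(mu, sigma)."
    )

def parse_prior(reply):
    """Extract variable, distribution family, and numeric parameters from a reply.

    Returns None if no distribution statement with numeric arguments is found.
    """
    match = PRIOR_PATTERN.search(reply)
    if match is None:
        return None
    variable, family, raw_args = match.groups()
    try:
        params = [float(arg) for arg in raw_args.split(",") if arg.strip()]
    except ValueError:
        # Symbolic arguments such as "mu, sigma" are not usable as a prior.
        return None
    return {"variable": variable, "family": family, "params": params}

prior = parse_prior("A reasonable choice: y ∼ normal(120, 10)")
```

For the blood pressure example, `prior` holds the family `normal` with parameters 120 and 10, which can then be passed on to any Bayesian modelling tool.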

In our experiments on Prior Elicitation, ChatGPT 3.5 played a central role by successfully applying its knowledge of academic elicitation frameworks, such as the Sheffield Elicitation Framework in combination with the histogram method. We applied these methods to create prior distributions for the typical daily temperature and precipitation in 25 small and large cities worldwide for the month of December.

ChatGPT drew on its extensive knowledge acquired from the training data to lead a simulated discussion among experts and develop a parametric probability distribution. This approach highlights how LLMs can be effectively used for complex tasks like Prior Elicitation even with existing restrictions. Through innovative prompting strategies, LLMs can be utilized as sources of expert knowledge, enriching research in Bayesian data analysis and beyond.

A crucial part of our analysis was to assess the “concentration” or “breadth” of the AI-generated distributions compared to real data. Our goal was to determine how many real data points would be needed to confirm or refute the AI’s predictions. This investigation helped us understand the reliability of AI-supported collection of prior knowledge compared to traditional methods.
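One textbook way to think about this “data point” currency is the conjugate normal model, sketched below. It is an illustration under simplifying assumptions (a normal prior on a mean and a known observation standard deviation), not necessarily the measure used in the paper, and the function names are hypothetical. A prior with standard deviation `prior_sd` then carries as much information as `(data_sd / prior_sd) ** 2` observations.

```python
import math

def prior_equivalent_n(prior_sd, data_sd):
    """Number of observations whose sample mean is as precise as the prior."""
    return (data_sd / prior_sd) ** 2

def observations_to_beat_prior(prior_sd, data_sd):
    """Smallest n for which the standard error data_sd / sqrt(n) is below prior_sd."""
    return math.floor(prior_equivalent_n(prior_sd, data_sd)) + 1

# Example: a prior of normal(5, 2) on a mean temperature, daily values varying with sd 10.
n_equiv = prior_equivalent_n(prior_sd=2.0, data_sd=10.0)
n_needed = observations_to_beat_prior(prior_sd=2.0, data_sd=10.0)
```

A narrow prior is therefore worth many observations if it is centred correctly, and it takes correspondingly many observations to correct it if it is not.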

Our investigations led to surprising insights: The adoption of various expert roles by the LLMs had no significant impact on the generated priors. Regardless of the simulated role, the artificial experts tended to make cautious predictions, mostly pointing to small effects. An exception was GPT-4, which showed more courage and predicted moderately strong effects.

Particularly interesting was the observation of relationships between variables, such as the influence of weather on our shopping behavior. Here, the LLMs revealed their own, sometimes unexpected perspectives. While some models showed a “bathtub” curve, GPT-4 presented a more uniform, bell-shaped distribution.

Looking at how much confidence the models place in their own predictions revealed further differences. Some models expressed themselves conservatively with cautious estimates, while GPT-4 was more informative. The Mistral 7B Instruct model stood out for its extremely high confidence: its priors carried as much weight as thousands of data points. Interestingly, the beta priors for Pearson correlations offered by the LLMs differed significantly from those of real experts. GPT-4 provided a symmetrical unimodal distribution, while other models showed a right-skewed “bathtub” distribution.
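These shape descriptions follow directly from the two parameters of a beta distribution, as the short sketch below shows. It relies only on standard properties of Beta(a, b); the function names are hypothetical, and how a beta prior on the unit interval is mapped onto a correlation between -1 and 1 is left out because that mapping is an assumption that depends on the elicitation protocol.

```python
import math

def beta_shape(a, b):
    """Classify the density shape of Beta(a, b) on (0, 1)."""
    if a > 1 and b > 1:
        return "unimodal"   # single interior peak
    if a < 1 and b < 1:
        return "bathtub"    # mass piles up at both ends
    if a == 1 and b == 1:
        return "uniform"
    return "monotone"       # rises or falls across the whole interval

def beta_skewness(a, b):
    """Skewness of Beta(a, b); positive means right-skewed, zero means symmetric."""
    return 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))

def beta_pseudo_count(a, b):
    """a + b is commonly read as the number of prior 'observations'."""
    return a + b

symmetric_bell = (beta_shape(5, 5), beta_skewness(5, 5))
skewed_bathtub = (beta_shape(0.4, 0.8), beta_skewness(0.4, 0.8))
```

A symmetric bell corresponds to equal parameters above 1, while a right-skewed bathtub needs both parameters below 1 with the first smaller than the second.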

In our meteorological task, we investigated how many real weather observations would be needed to make more accurate predictions than the artificial experts. This helped us evaluate whether relying on digital assistants or traditional weather models would be the better choice for future weather forecasts.

These results highlight that LLMs have the ability to generate priors that can compete with human expert assessments in some areas, yet show significant deviations in others. The suitability of LLMs as a replacement for human expertise in creating prior distributions varies depending on the specific task and the model used, underscoring the complexity and potential limits of this technology.

The findings of our research reveal the impressive potential of large language models (LLMs) to synthesize knowledge from a variety of sources and apply it effectively in various application contexts. Especially in situations where expertise is scarce or very costly, LLMs prove to be an invaluable resource.

Our studies show that LLMs can already provide valuable insights in fields such as medicine, economics, and biology, comparable to those of traditional data imputation methods. Moreover, when cost is weighed against precision, the prior knowledge provided by LLMs can be a high-quality and cost-effective alternative to conventional approaches in certain scenarios. Thus, using LLMs for Prior Elicitation could be a preferable option in some cases.

In summary, our research marks a decisive step towards a greater integration of LLMs into data science. The prospects are promising, and with further developments in technology and methodology, we may be at the beginning of a revolutionary era of data analysis where LLMs play a key role.
