How good are LLMs in investor risk profiling?

As the financial industry increasingly looks to large language models for new solutions, a study asks how OpenAI’s ChatGPT-4 and Google’s Bard compare with Swiss bank experts in determining investors’ risk profiles. The results: relying on the AI-based evaluations does not appear to make a significant difference to the bottom line. However, the LLMs are not good at explaining the reasoning behind the risk profiles.

Large language models (LLMs) are increasingly prevalent in the financial sector and are likely to have greater influence over the long term. The application of LLMs in financial advisory is an emerging field now attracting attention in research. Increasingly powerful LLM-based chatbots, such as OpenAI’s ChatGPT and Google’s Bard, represent the latest developments in natural language processing and are currently being used and explored in a wide range of applications.

Risk profiling is a crucial aspect of providing investment advice, and creating investment portfolios and making investment decisions require a deep understanding of the individual investor. The applicability of GPTs in risk profiling has seen limited research to date. While ChatGPT and Bard have shown promising theoretical potential, their capabilities in comprehending and assessing investor risk profiles are not yet clear. It is widely acknowledged that LLM-based systems have certain inherent limitations. For instance, they are not necessarily output-consistent and are susceptible to producing hallucinations, information that is incorrect or untrue.

In a recent paper, we studied how such systems perform in risk profiling and to what extent they can be effectively applied. We showed the extent to which the current iterations of ChatGPT and Bard demonstrate accuracy and consistency in categorizing individual risk profiles for investors. Despite their power and transformative potential in many applications, the complexities and intricacies of these tools give rise to specific constraints that must be acknowledged and addressed in their future use.

Why use AI for risk profiling?

Recent studies evaluating the efficiency and precision of AI applications compared with human advisors in financial advisory have shown positive results. A 2018 study, for example, suggests that in specific areas of financial planning, particularly in processing data, risk assessments, and portfolio management, AI has the potential to equal or even exceed the capabilities of human advisors. Other research has described how AI might provide investors with tailored financial advice based on their risk tolerance, investment objectives and other considerations.

Alongside these opportunities, conversational AI tools like ChatGPT and Bard also have certain inherent limitations. An LLM is a language model consisting of a neural network with billions of parameters, trained on a large amount of unannotated data through self-supervised learning, and used, for example, to predict and generate text and other content. In considering ChatGPT’s and Bard’s capabilities to categorize investors’ risk profiles, their probabilistic nature has significant implications. Because chatbots can be sensitive to nuances in the phrasing and structuring of a prompt, how the investor information and the assessment request are stated may influence the responses the chatbots provide. Any variability in their responses might undermine their reliability in assessing investors: even with similar or identical input, the chatbots may generate varied responses. Inconsistent output in terms of wording and phrasing is comparable to human advisors expressing themselves differently; what is crucial is that the assessment itself remains consistent.
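
To illustrate this consistency concern, the sketch below repeats an identical risk-profiling prompt several times and summarizes the spread of the returned scores. The function ask_chatbot is a hypothetical placeholder for whichever chatbot interface is used; it is not an existing API.

```python
# Minimal sketch of a consistency probe: send the same risk-profiling prompt
# several times and inspect how much the returned scores vary.
# ask_chatbot is a hypothetical placeholder, not a real API call.
from statistics import mean, stdev

def ask_chatbot(prompt: str) -> float:
    """Placeholder: send the prompt to a chatbot and return the numeric risk score it gives."""
    raise NotImplementedError("connect this to the chatbot interface being studied")

PROMPT = (
    "Categorize the investor between one and five indicating the order: "
    "1 lowest to 5 highest possibility to take risk. Client description: ..."
)

def consistency_check(n_repeats: int = 16) -> None:
    # Identical input every time; any spread in the scores reflects the model's variability.
    scores = [ask_chatbot(PROMPT) for _ in range(n_repeats)]
    print(f"mean={mean(scores):.2f}  stdev={stdev(scores):.2f}")
```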

LLMs are also prone to generating information that is not factual or true, a phenomenon commonly referred to as “hallucination.” This is of particular concern in financial advisory, where accuracy and trustworthiness are crucial. If hallucinations occur when assessing investor risk profiles, they could lead to misclassifications or inappropriate financial advice, with serious implications in real-world contexts.

Another limitation of LLM-based tools is that they might struggle with common-sense reasoning tasks. Additionally, the reliability of chatbots such as ChatGPT and Bard may be affected by their inability to fully comprehend the complexity of human language and conversations. Since such chatbots generate words sequentially based on a given prompt, they are incapable of truly understanding the meaning behind words, implying that responses generated are likely to be shallow and lacking in depth and insight.

A potential role for LLMs

Risk profiling typically includes four steps: collecting information from the client, evaluating the collected information, constructing the portfolio for the client and reporting. Traditional methods of risk profiling have mostly relied on questionnaires and interviews, which aim to assess the investor’s attitudes toward risk, their past investment experiences and reactions to hypothetical market scenarios.
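
As a concrete illustration of the first two steps, the following minimal sketch evaluates a questionnaire-style client profile. The fields and the scoring rule are illustrative assumptions, not the methodology of any particular firm or study.

```python
from dataclasses import dataclass

@dataclass
class ClientProfile:
    """Step 1 (collection): the kind of information a questionnaire gathers. Fields are illustrative."""
    risk_attitude: int      # self-reported comfort with risk, 1 (low) to 5 (high)
    years_experience: int   # past investment experience in years
    crash_reaction: str     # stated reaction to a hypothetical downturn: "sell", "hold" or "buy"

def evaluate(profile: ClientProfile) -> int:
    """Step 2 (evaluation): map the collected answers to a 1-5 risk score (illustrative rule)."""
    score = profile.risk_attitude
    if profile.crash_reaction == "sell":
        score -= 1
    elif profile.crash_reaction == "buy":
        score += 1
    if profile.years_experience >= 10:
        score += 1
    return max(1, min(5, score))

# Example: an experienced client who would hold through a downturn
print(evaluate(ClientProfile(risk_attitude=3, years_experience=12, crash_reaction="hold")))  # 4
```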


These traditional methods of risk profiling have multiple limitations. Douglas Rice examined questionnaires used by investment firms in the United States and found that the number of questions ranged from 1 to 49, with 11% of the questionnaires directly requesting the investor to choose a risk profile or portfolio themselves. Among other issues, he further found that the scoring and assignment of responses from questionnaires were often subjectively conducted by the advisor, and that the suggested asset allocations tended to benefit the investment firm more than the investor.


A 2014 study discovered that the advisor’s influence was a more significant factor in the composition of investor portfolios than the factors typically evaluated in questionnaires. This supports the idea that traditional questionnaire-based methods of risk profiling are not fully valid and are subject to subjectivity: different advisors might assess the same investor’s risk profile differently.


Another challenge is that investors often have multiple and vaguely defined goals, complicating the task of creating a clear and quantitative investment strategy. A common situation may involve an investor who wants to save for retirement in ten years, while also saving for a down payment on a house and maintaining liquid assets as emergency funds. Other limitations relate to the subjectivity of the client’s self-reported data, which can lead to potential inconsistencies and biases in the risk assessment.

The use of AI and machine learning in financial advisory is transforming traditional approaches, allowing for the implementation of more data-driven methods in assessing clients’ risk profiles. In ‘Behavioral Finance for Private Banking’, the authors outlined different risk profiling methodologies, including artificial intelligence and machine learning techniques. With such techniques, banks can utilize past observations to identify typical patterns in client behavior and predict their future reactions to risk. While these tools can be powerful, they can also be misleading, as they may not adequately differentiate between risk tolerance and other personality traits or behavioral biases, and may therefore fail to distinguish between past and ideal behavior.
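
As a rough illustration of this data-driven idea, the sketch below fits a simple classifier to past client observations and predicts a risk class for a new client. The features, data and model choice are illustrative assumptions rather than the approach of any cited work.

```python
# Sketch: learn typical patterns from past client observations and predict a
# risk class for a new client. Features, data, and model are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical observations: [age, income (kCHF), max drawdown tolerated (%)]
X = np.array([[28, 80, 25], [62, 120, 5], [45, 95, 15], [35, 60, 20], [55, 200, 10]])
y = np.array([4, 1, 3, 3, 2])  # risk class each client ended up comfortable with (1-5)

model = LogisticRegression(max_iter=1000).fit(X, y)

new_client = np.array([[30, 70, 18]])
print("predicted risk class:", model.predict(new_client)[0])
```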

Methodology and research design

To understand chatbot performance in investor risk profiling, our study used ten different client cases provided by Alvin Amstein. The cases encompassed a variety of investor descriptions, each detailing information about their financial situations, investment objectives, risk preferences and knowledge and experience. While there were variations in the specific details, such as age, profession and investment objectives, the cases shared consistent overarching features for comparability. The investor descriptions were characterized by brief and direct statements, reflecting the real-world tendencies of investors to have limited information about their risk tolerance and related factors. Client cases ranged from an economics graduate student looking to manage a family inheritance and a middle-aged architect with two young children to a professional soccer player and a late-career entrepreneur.

The data from the bankers were collected in the Amstein study through an online survey in the fall of 2023. The bankers were all employed at the same bank in Switzerland, one with operations nationwide. They were incentivized to score the risk profiles of the 10 clients on a scale from 1 to 5, with each client rated about 48 times.

Meanwhile, over seven weeks in 2023, ChatGPT and Bard each categorized the 10 clients a total of 16 times. To achieve comparability, the LLMs were asked the same questions as the bankers. We required the LLMs to “categorize the investor between one and five indicating the order: 1 lowest to 5 highest possibility to take risk.” Although the chatbots were requested to provide the same categorical responses as the human bankers, in some instances they provided non-integer answers such as “2 or 3” or “2 leaning towards 3.” In such cases, the mean value of the suggested range was used for the quantitative analysis.
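
As an illustration of that normalization step, the sketch below maps free-text answers such as “2 or 3” to the mean of the numbers mentioned; the parsing rule is our reading of the examples above, not the study’s exact procedure.

```python
import re

def score_from_response(text: str) -> float:
    """Map a chatbot's free-text risk rating to a single number.

    Integer answers ("3") are returned as-is; range-like answers ("2 or 3",
    "2 leaning towards 3") are replaced by the mean of the numbers mentioned.
    """
    numbers = [float(n) for n in re.findall(r"[1-5]", text)]
    if not numbers:
        raise ValueError(f"no risk score found in: {text!r}")
    return sum(numbers) / len(numbers)

# Example usage
print(score_from_response("3"))                    # 3.0
print(score_from_response("2 or 3"))               # 2.5
print(score_from_response("2 leaning towards 3"))  # 2.5
```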

Results

Our quantitative analysis showed that the risk profiles did not differ much between the LLMs and the bankers. For most client cases, the average risk scores assigned by the LLMs were not significantly different from those of the bankers. Moreover, based on the portfolio theory of Harry Markowitz, we showed that the remaining small differences are of no economic significance. We also conducted a qualitative analysis of the explanations provided by the chatbots, which showed that the LLMs mainly base their risk profiles on general principles that miss the specific characteristics of the clients.
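
The paper’s exact test statistics are not reproduced here, but the comparison of average scores can be sketched as follows, using a Welch two-sample t-test on illustrative placeholder ratings rather than the study’s data.

```python
# Sketch: compare mean risk scores from bankers and an LLM for one client case.
# The rating lists below are illustrative placeholders, not the study's data.
from scipy import stats

def compare_scores(banker_scores, llm_scores):
    """Welch's two-sample t-test on the risk scores assigned by bankers vs. an LLM."""
    result = stats.ttest_ind(banker_scores, llm_scores, equal_var=False)
    print(f"banker mean={sum(banker_scores) / len(banker_scores):.2f}, "
          f"LLM mean={sum(llm_scores) / len(llm_scores):.2f}, "
          f"t={result.statistic:.2f}, p={result.pvalue:.3f}")

# Roughly 48 banker ratings and 16 chatbot ratings per client, on the 1-5 scale
bankers = [3, 4, 3, 3, 4, 2, 3, 3] * 6
chatbot = [3, 3.5, 3, 4, 3, 3, 2.5, 3] * 2
compare_scores(bankers, chatbot)
```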

The chatbots’ frequent use of age in assessing a client’s investment horizon indicated an “understanding” (so to speak) that younger clients often have a higher risk capacity due to their longer time to recover from market downturns. However, their frequent reliance on such general principles sometimes resulted in poor matches in specific cases. For instance, the chatbots’ interpretation that the engineering student has a long time horizon appeared to underlie the higher risk scores they assigned to this client compared to those of the advisors.

Financial obligations related to children and family were repeatedly understood as affecting the clients’ risk capacity. Additionally, the chatbots’ assessment of the clients reflected an understanding that higher income was linked to a higher risk capacity, and correspondingly, that high expenses were linked to a lower risk capacity. However, they struggled with an integrated analysis of these factors. This applied to assessments of investment horizons, and at times, they also seemed to struggle to relate and weight elements of the clients’ financial situations to each other.

Instances where the chatbots’ reasoning seemed to lack nuance — where there appeared to be a “lack of human-like understanding” — can be related to the chatbots’ inability to fully comprehend the complexity of human language and conversations. These findings support the notion that chatbots cannot fully understand the meaning behind words, which might lead to responses that are shallow, lacking depth and insight.

In some cases, this weakness in responses may be associated with divergent ratings compared to financial advisors. For instance, ChatGPT’s higher assessment of the insurance consultant may relate to this limitation, as the rating was associated with a perceived higher risk tolerance, likely overestimating the client’s actual risk tolerance. This was also supported by Bard’s reasoning, which directly linked the risk tolerance to the goal of 3–5% annual returns, corresponding with lower assessments compared to both ChatGPT and the bankers.

Although this cannot be established with certainty, Bard’s arguments based on information that was never provided could be related to the phenomenon of extrinsic hallucination, meaning generations that cannot be supported or contradicted by the source content. Statements in the chatbot’s generated responses contained information that could not be verified from the input. For example, Bard stated that the engineering student was considering diversifying their portfolio, despite no such information being provided, claimed a “stable income” without any details of income, and claimed that the student was willing to invest in riskier assets such as stocks and cryptocurrencies.

Considering the limitations identified in their reasoning, the explanations provided by the chatbots are not well suited to tasks requiring judgment and personalized advice. Their frequent oversimplification and misinterpretation of nuanced information revealed an inherent lack of human-like understanding, which raises concerns about the depth and reliability of their assessments. The potential occurrence of extrinsic hallucinations adds to these concerns. These errors reflect a broader issue with LLMs and their current limitations in fully grasping and responding to the intricacies of human language and individual circumstances. Their use in sensitive areas such as risk profiling therefore requires careful consideration. The shortcomings identified here indicate that the human capacity to interpret, empathize and adapt to the different situations of individuals cannot yet be fully replicated by these chatbots.

Conclusion on LLMs and risk profiling

Our study asked how ChatGPT and Bard categorize investor risk profiles compared to financial advisors. For half of the clients, the study revealed no statistically significant differences in the risk scores assigned by ChatGPT and Bard, compared to those assigned by bankers.

Moreover, on average, the differences had minor economic relevance. However, while the chatbots found the right risk profile on average, their reasoning was based on general principles and often missed the specific characteristics of the clients.

Since understanding one’s risk profile is essential for investing consistently over time, our study suggests that chatbots cannot replace bankers. But they may be used as a second opinion to check whether the banker’s assessment deviates from the prevailing opinion. Certainly, this was just one study, albeit the first, to assess such differences. Further research should draw on client advisors from a different bank and should repeat the risk scoring with the LLMs, since these models improve over time.

Authors

  • Thorsten Hens is a Swiss Finance Institute Professor of Financial Economics at the University of Zurich and Adjunct Professor of Finance at the University of Lucerne, Switzerland, as well as at the Norwegian School of Economics, NHH, in Bergen. His main research areas are behavioral finance with applications in wealth management and evolutionary finance with applications in asset management. Dr. Hens has published more than eighty peer-reviewed journal articles and is the co-author of more than ten books.

  • Trine Nordlie is a Master’s student at the Norwegian School of Economics, NHH, in Bergen.
