Beyond the Hype: Exploring the Impact of Large Language Models in Business

2001: A Space Odyssey features perhaps the most iconic example of a sophisticated AI language model in popular culture: HAL 9000. Just over 20 years after the setting of the film, the release of ChatGPT has brought the imaginings of Stanley Kubrick closer to reality. With 100 million weekly users, ChatGPT is one of the most widely used online tools, and its introduction has brought the capabilities of generative AI to the forefront once again.

With the storm of generative AI tooling flooding every aspect of society and life, does the tech actually live up to the hype? At Keen AI we primarily use AI to analyse visual data, but we are interested in all cutting-edge technologies, especially those that have the potential to change how we work. We therefore set out to investigate whether the technology behind ChatGPT, large language models (LLMs), can be used effectively in a business setting.

Generative AI is a paradigm shift but companies with sensitive data need alternatives to OpenAI and ChatGPT

The main question that remains to be answered is how large businesses can make LLMs available to their employees in a sustainable and cost-effective way. For many companies, using cloud-based tools like ChatGPT is not viable, as it requires uploading sensitive data to a third party.

Large, sophisticated LLMs require complex hardware infrastructure, so businesses have two options for running them privately: rent dedicated cloud servers with powerful computational resources, or set up equivalent infrastructure on premises.

Keeping the hardware on premises is the more secure option, as the data never needs to leave the business. However, setting up complex infrastructure locally is not only costly up front; the energy demands of such hardware also make it expensive to run. Dedicated cloud servers remove the need to acquire expensive hardware, but are costly over the long run.

Both options involve high costs, and the value brought by LLMs may not be worth it when weighed against those costs.

Is there a way to get some of the benefits of LLMs without needing to shell out on fancy infrastructure and hardware? Is it possible to achieve most of the benefits of powerful LLMs like ChatGPT from smaller LLMs that run on limited hardware?

Can an LLM running on an employee machine provide useful information acting as an interface and interpreter to a company’s digital assets?

Running an LLM locally has the potential to greatly augment a user’s experience by allowing them to extract insights from their local files or databases.

To test this application, we conducted a series of experiments to determine whether we could derive benefit from running LLMs on a laptop, albeit one with a dedicated graphics card. The objective was to answer the following key questions:

  • Can we run an LLM on the selected hardware in a usable way?
  • Does the LLM give accurate answers?
  • Is this approach better than alternative ways of getting the same information?

We also decided to limit the use case of the LLM to one applicable to our customer base. As such, our experiments focussed on using LLMs to extract insights from a database. This is particularly relevant for us, as it resembles the reports we generate for our customers. For large companies that own a lot of physical assets and infrastructure, defect detection and the subsequent reporting are an important part of the condition assessment process. Being able to query this information, without generating a report or needing technical knowledge of database querying, presents a genuine use case.

We used Python and models from Hugging Face to perform the experiments

We set the experiments up with the following tasks:

  • Create a test database with dummy data on defects raised by users that relate to OHL towers
  • Understand which LLMs and tools are available to see the current standard of widely available LLMs
  • Stand up a local LLM which we are able to interact with
  • Create an LLM pipeline that can interact and understand a representative database and answer basic questions
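As an illustration of the first task, a dummy defects database can be stood up along the following lines. This is a simplified sketch: the USERS table, tower_id column, and the specific names and dates are illustrative stand-ins, and our actual test schema was more elaborate.

```python
import random
import sqlite3

# Sketch of a dummy defects database (illustrative schema, not the exact one used)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE USERS (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""
    CREATE TABLE DEFECTS (
        id INTEGER PRIMARY KEY,
        tower_id INTEGER,                      -- the OHL tower the defect relates to
        user_id INTEGER REFERENCES USERS(id),  -- the user who raised the defect
        date_raised TEXT,
        description TEXT
    )
""")
names = ["John Doe", "Jane Smith", "Ada Jones", "Sam Khan", "Lee Park"]
cur.executemany("INSERT INTO USERS (name) VALUES (?)", [(n,) for n in names])

random.seed(0)
for _ in range(2500):  # the experiments used a database of 2500 dummy defects
    cur.execute(
        "INSERT INTO DEFECTS (tower_id, user_id, date_raised, description) "
        "VALUES (?, ?, ?, ?)",
        (random.randint(1, 200),
         random.randint(1, len(names)),
         f"20{random.randint(18, 23)}-06-01",
         "corrosion on cross-arm"),
    )
conn.commit()
```

A database like this can then be handed to the pipeline so that generated SQL queries run against known, checkable data.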

We ran the experiments on laptops with dedicated graphics cards, namely:

  • Lenovo Legion 5 with GeForce RTX 3070
  • MacBook Pro M1 Pro with dedicated GPU

We also chose Python as the programming language. Its simplicity, readability, and extensive libraries for LLMs make it a good starting point for working with LLMs. All of the models used in this experiment are available on Hugging Face, an open source data science and machine learning platform. The list of packages and models used is provided below:

We chained a set of domain specific LLMs to build an agent for answering common questions about a domain specific dataset

Chaining refers to creating a chain of LLMs, with the output of one being used as the input to the next, until the final output provides the result. The pipeline that formed the basis for the majority of the experiments is illustrated below:
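The idea can be sketched with placeholder functions standing in for the actual specialised LLMs. The helper names and stub logic below are ours, purely for illustration; the point is the shape of the chain, where each stage's output becomes the next stage's input.

```python
# Sketch of an LLM chain for database question-answering.
# Each function stands in for a specialised LLM call (stubbed here).
def select_tables(question, schema):
    # Stage 1: an LLM picks the tables relevant to the question
    return [t for t in schema if any(word in t.lower() for word in question.lower().split())]

def generate_sql(question, tables):
    # Stage 2: a text-to-SQL LLM writes a query over the chosen tables
    return f"SELECT COUNT(id) FROM {tables[0]};"

def summarise_result(question, result):
    # Stage 3: an LLM turns the raw query result back into natural language
    return f"There are {result} defects."

def run_chain(question, schema, execute):
    tables = select_tables(question, schema)
    sql = generate_sql(question, tables)
    return summarise_result(question, execute(sql))

answer = run_chain("how many defects have been raised?",
                   ["DEFECTS", "USERS"],
                   execute=lambda sql: 2500)  # execute would normally run the SQL
```

In the real pipeline, each stub is replaced by a call to a specialised model, and `execute` runs the generated SQL against the database.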

The LLMs used in the pipeline above are specialised for their particular purpose: they have been trained on data specific to the task being asked of them, in contrast to generalised models, which are trained on large and diverse datasets from many domains. When implementing LLMs for a specific purpose, it is often more effective to use specialised LLMs. The obvious trade-off is that they often cannot provide useful results for more general tasks, which in our case is an acceptable downside.

The results were mixed; accuracy, consistency and response times were an issue

We tested the pipeline with a number of prompts, and the results were not promising. Response times were slow: for example, asking the pipeline “how many defects have been raised?” resulted in a wait of approximately one minute for a response. More complex questions, such as “how many defects were raised in 2020?”, would result in a pipeline failure. Having tested a number of prompts, we can answer the questions defined as the success criteria:

  • Can we run an LLM on the selected hardware in a usable way? No. Response times were too long, even for simple questions.
  • Does the LLM give accurate answers? Sometimes. When the LLM was able to generate an executable SQL query, it could usually translate the results of the query into a response; however, generating the SQL query is where the pipeline would most often fail.
  • Is this approach better than alternative ways of getting the same information? No. Given that the pipeline is unable to provide accurate answers in a timely manner, this approach is not a viable alternative.

A deeper dive into why some of the results were so poor

In order to gain a clearer picture of the inner workings of the LLM chain, let’s look at some example prompts we tested and the outputs at each stage.

As this database essentially contains data on defects, the easiest question we can ask is “how many defects have been raised?” This unsurprisingly yields the correct answer of 2500 defects. However, we start to see interesting behaviour if we change the prompt to “how many defects have been created?”. This again provides the same end result, but the intermediate step of deciding which tables are relevant to the prompt offers some insight into how the first LLM makes that decision. This is a fairly simple query requiring one table, and the first prompt results in just two tables being returned, one of them the correct table.

The second prompt, however, results in almost all of the tables in the database being returned. Upon further investigation, there is no obvious reason why the second prompt should return so many more tables than the first, although the DEFECTS table does contain the column “date_raised”, whose name echoes the wording of the first prompt. This is the first example showing that prompt engineering is important, and that using language consistent with what is used in the database can lead to better results. It is worth noting that both prompts resulted in the same query being generated by the second LLM in the chain.

If we wanted to know more about defects raised by a particular user, we could ask something like “how many defects have been raised by John Doe?”. The correct response to this question would be 846; however, the response was that John Doe hasn’t raised any defects. The issue in this case is with the SQL query that is generated:

SELECT COUNT(id) AS number_of_defects FROM NEW_DEFECTS WHERE actioned = 0 AND actioned_by != 757071;

We can see an attempt here to filter the list of defects using the actioned and actioned_by columns, where the relevant column is actually user_id. If we amend the prompt to “how many defects have been raised by the user John Doe?” to hint that the user_id column is what is needed, the generated SQL query is as follows:

SELECT COUNT(id) AS number_of_defects FROM NEW_DEFECTS WHERE actioned = 0 AND actioned_by::TEXT ilike '%John%Doe%'::TEXT;

This SQL query has syntax errors, and it still has not identified that the relevant column is user_id. This is where we really start to see the limitations of smaller LLMs. Although the LLM used here is optimised for SQL, its limited size means it is still unable to produce queries that follow SQL syntax rules.
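For comparison, the query the pipeline failed to produce can be written by hand and run against a toy copy of the schema. We are assuming here a USERS lookup table that maps names to ids; the exact details of the schema are illustrative.

```python
import sqlite3

# Toy copy of the schema (illustrative; a USERS lookup table is assumed)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE USERS (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE NEW_DEFECTS (id INTEGER PRIMARY KEY, user_id INTEGER)")
cur.execute("INSERT INTO USERS (id, name) VALUES (1, 'John Doe')")
cur.executemany("INSERT INTO NEW_DEFECTS (user_id) VALUES (?)", [(1,)] * 846)
conn.commit()

# The hand-written query: resolve the name to a user_id and count defects
query = """
    SELECT COUNT(d.id)
    FROM NEW_DEFECTS d
    JOIN USERS u ON u.id = d.user_id
    WHERE u.name = 'John Doe';
"""
count = cur.execute(query).fetchone()[0]
```

Run against data seeded with 846 of John Doe's defects, this returns the expected count, which is exactly the join-on-user_id reasoning the generated queries never arrived at.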

There are also other considerations to be made that limit the usefulness of local LLMs

There are a number of other issues that are important to consider. The first is that LLMs are very resource intensive, especially when chained together. We could end up incurring a large cost for very little benefit, especially if they are implemented purely for experimentation.

The second is that giving an LLM access to a database poses a security risk. From being able to alter or delete data, to being engineered into returning data a user should not have access to, there are numerous security concerns with allowing an LLM free rein over a database, though these can be easily mitigated in various ways.
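As an example of one such mitigation, Python's built-in sqlite3 module exposes SQLite's authorizer hook, which can deny everything except read operations. The sketch below is one simple way to stop generated SQL from altering data; it does not address the subtler problem of row-level access control.

```python
import sqlite3

def read_only_authorizer(action, *args):
    # Permit only reads: SELECT statements, column reads, and SQL functions
    # (e.g. COUNT). Everything else — INSERT, UPDATE, DELETE, DROP — is denied.
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEFECTS (id INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO DEFECTS DEFAULT VALUES")
conn.set_authorizer(read_only_authorizer)

# Reads still work...
count = conn.execute("SELECT COUNT(id) FROM DEFECTS").fetchone()[0]
# ...but destructive statements now raise sqlite3.DatabaseError
try:
    conn.execute("DELETE FROM DEFECTS")
    deleted = True
except sqlite3.DatabaseError:
    deleted = False
```

Connecting to a read-only replica, or a read-only database user, achieves the same effect for client-server databases.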

Finally, it remains to be seen how LLMs handle context-specific data. Businesses often have their own conventions and standards, which can even vary from team to team. How will LLMs cope when being presented with data which requires some prior understanding in order to make sense of it?

Local models aren’t yet ready for prime time but they aren’t far away, and we expect they will be ready within the next two or three years

Our initial foray into the world of LLMs showed that they are not quite ready for local deployment on the types of machines we tested. It is worth noting that speeds were significantly improved on the more powerful MacBook Pro M1 Max, which, despite being a consumer-level product, is more powerful than the standard-issue company machine.

At present, a model’s performance is heavily linked to its size, namely the number of parameters it has. Without getting too technical, parameters represent the knowledge and patterns the model has learned from its training data. Essentially, the more parameters a model has, the better it tends to perform.

Over the past few years, the advancement of LLMs has largely come through increasing their size and the number of parameters. Model size has been increasing 10x year on year since 2018, as shown by this chart by Hugging Face:

Chart – Hugging Face

OpenAI’s GPT-4 was released in March 2023 and is the largest model to date, with an estimated 1.7 trillion parameters. But this increase in size comes at a massive cost: training is reported to have exceeded $100 million, which does not even include the ongoing costs of inference.

The trend of increasing model size in order to better its performance isn’t sustainable. In fact, OpenAI’s CEO Sam Altman believes we will need to find different ways of improving the performance of LLMs: “I think we’re at the end of the era where it’s going to be these, like, giant, giant models,” he told an audience at an event held at MIT in 2023. “We’ll make them better in other ways.”

Instead of continually increasing the size of models, there’s a shift in focus on improving model efficiency. This includes optimising model architectures, reducing computational requirements, and enhancing inference speed without sacrificing performance. Techniques like model distillation and pruning can also be employed to compress large models into smaller, more efficient versions while retaining their performance to some extent.
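To give a flavour of one of these techniques, the sketch below implements magnitude pruning in its simplest form: the fraction of weights with the smallest absolute values is zeroed out, shrinking the effective model while ideally preserving performance. Real pruning operates on full weight tensors and is usually followed by fine-tuning; this toy version works on a plain list of weights.

```python
def prune_weights(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Rank weight positions by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

# Pruning 40% of five weights removes the two with the smallest magnitude
pruned = prune_weights([0.9, -0.01, 0.4, 0.002, -0.7], sparsity=0.4)
```

The zeroed weights can then be stored and computed sparsely, which is where the memory and speed savings come from.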

What does this mean for locally run LLMs?

In our experiments, the size of the models we were able to run was largely dictated by the RAM on our machines, which will likely remain a bottleneck in the near future. But with the focus shifting away from model size to areas such as model efficiency and compression, we may see vast improvements in what we can run on humble consumer products.
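A back-of-the-envelope calculation shows why RAM is the limiting factor: a model's memory footprint is roughly its parameter count multiplied by the bytes stored per parameter, which is why quantisation (fewer bits per weight) is central to local inference. The figures below are rough estimates for a hypothetical 7-billion-parameter model, ignoring activation and overhead memory.

```python
def model_memory_gb(n_params, bytes_per_param):
    # Rough weights-only footprint: parameters x bytes per parameter, in GiB
    return n_params * bytes_per_param / 1024**3

seven_b = 7e9
full = model_memory_gb(seven_b, 4)     # 32-bit floats: ~26 GiB, beyond most laptops
half = model_memory_gb(seven_b, 2)     # 16-bit floats: ~13 GiB, high-end laptops only
quant = model_memory_gb(seven_b, 0.5)  # 4-bit quantisation: ~3 GiB, laptop territory
```

At 4 bits per weight the same model fits comfortably alongside the operating system, which is exactly the kind of compression gain that could make local deployment practical.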

About Hamzah Reta

Hamzah is excited by the potential of AI to take engineering processes to even greater heights. Following his passion for integrating these two worlds to build a better future, he is dedicated to helping Keen AI grow and achieve that vision.
