Making Search Work for Investment Research
The Challenges, Pitfalls, and How to Overcome Them
Simon Gregory | CTO & Co-Founder
Series: Making Search Work for Investment Research
Part 1 – Time, Evolution & Chronology
Part 2 – Understanding Information Loss
Part 2a – Garbage In, Garbage Out
In this series of articles, I will discuss some common challenges to address, and pitfalls to avoid, when producing an optimal search experience for your research customers.
You may want to read this if you are considering updating or investing in your research search engine…
Part 1 – Time, Evolution & Chronology
Our experience has shown that to be a true master of applying search to financial research, you not only need advanced search capabilities but also the domain expertise to understand investment research and all its quirks. Research has many unique factors to consider that make achieving an effective search solution a difficult endeavour.
Financial research is a dynamic content stream, so you need to handle time, evolution, and chronology appropriately. Analytical reports are multi-thematic, richly formatted documents that use a specialised markets language and terminology, so accurately modelling information and context is hugely important. Lastly, investment research documents are financially sensitive and subject to regulation, so control, auditability and content integrity are paramount.
Search and the fourth-dimension problem – A very brief history of time
For most of the age of mankind, it was assumed there were only three dimensions. Time was a separate variable factored into the mathematical models as an additional piece of information. Everything worked well that way… until it didn’t. Hermann Minkowski and Albert Einstein revolutionized our understanding of the physical world in the early 20th century by treating time as a first-class citizen, promoting it to the fourth dimension. They found that, as we stretched our limits to consider more extreme scenarios, such as very fast-moving objects or objects in strong gravitational fields, the existing three-dimensional models broke down.
This may initially sound like a wild analogy for a discussion of search, but it might be surprising to hear that research content actually presents an extreme scenario for traditional search. To get the best search results for financial research, you must understand that it deserves to be thought of differently from a typical document corpus, and that research consumers are not typical users of search tools. You need to consider that time dimension.
Out of the box, search engines do not give good results for research content. This is not a critique of any specific technology, just an acknowledgement that the nature of research content does not play well with how traditional search engines are fundamentally designed.
The term “search” in the research sphere means different things to different people; in its purest form, however, it should be about enabling a user to find the relevant pieces of information for a given search query.
Firstly, search engines (whether full-text, graph, or vector) are essentially indexes designed to find the documents that match the input text in some way (text match, knowledge graph expression, semantic meaning) from a document corpus. This can work extremely well for some types of content.
Research, though, is peculiar in that the document corpus (like the universe) is always rapidly expanding, and the nature of the content means that the most recent update is usually the most relevant. Research is more akin to a news stream than a relatively static document corpus. Search engines are not geared up to incorporate time, and you can see that when you are offered the option to order the responses by either most relevant or most recent. In traditional search engines, you can’t have both, as ordering by one of these properties is always at the expense of the other. So, ordering by time loses all the benefits of any relevancy information the search engine can calculate.
Financial research is extremely time-sensitive, recency plays a core factor in a document’s overall relevance, and its contextual relevance with other documents in the chronology. Ideally for research, you really need the documents ordered by time relevance and filtered by content relevance. This is fundamentally not how typical search engine indexes are optimized and there are numerous hacks we have seen used to overcome this, many of which involve other compromises.
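To make the compromise concrete: one common workaround (a mitigation at ranking time, not how a purpose-built engine solves the problem) is to blend the text-relevance score with an exponential recency decay. A minimal sketch, where the half-life parameter is an assumption you would need to tune per content cycle:

```python
from datetime import datetime, timezone

def time_weighted_score(text_score: float, published: datetime,
                        now: datetime, half_life_days: float = 30.0) -> float:
    """Combine a text-relevance score with an exponential recency decay.

    half_life_days controls how quickly older documents lose weight:
    a document exactly one half-life old keeps 50% of its text score.
    """
    age_days = (now - published).total_seconds() / 86400.0
    decay = 0.5 ** (age_days / half_life_days)
    return text_score * decay

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
fresh = time_weighted_score(1.0, datetime(2024, 5, 31, tzinfo=timezone.utc), now)
stale = time_weighted_score(1.0, datetime(2020, 6, 1, tzinfo=timezone.utc), now)
# the 2020 document scores far lower despite identical text relevance
```

Even this simple blend illustrates the trade-off: a single global half-life cannot capture content that cycles intraday alongside content that cycles annually.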
Another feature of research content is its inherently cyclical nature. Some content topics are updated intraday, while others are daily, weekly, monthly, quarterly, or annually (and various wavelengths in between). Themes also come and go, while specific topics have underlying trends that ripple over time. The picture I’m drawing here is not unlike the surface of the ocean, affected by tides, winds and storms in the financial markets. All this does is confuse the ranking engine. It will be looking for the biggest waves in the ocean regardless of how far away they are or what direction they are travelling. It is not necessarily considering the medium-sized ones you’d like to know are about to hit you on the beach. For example, with no search engine customisation, if you do a search now for Covid-19 you’ll very likely get documents dating back to 2020-2021, at the height of the pandemic. In 2024 that’s probably not what you are after.
It’s an extreme example, but it highlights that you simply can’t ignore time. Search engines are predominantly designed and optimized to match documents using the content index, while time is just another variable in the metadata. This problem is usually underestimated, and so in most cases it falls upon the user to make the forced compromise between most relevant and most recent.
When developing the technology at Limeglass, we made it a principle to make time a first-class citizen. This eventually led us down the path of developing our own search engine, as we didn’t want to be tied to the compromises and workarounds of existing systems. This has other challenges, of course, and certainly doesn’t mean that you should immediately drop your existing search engine in favour of ours. We do believe there are opportunities to optimize search for research in your own off-the-shelf engine, and we also have many ways to help you improve that experience.
LLMs are not search engines
A common misconception since the emergence of LLMs, particularly given how ChatGPT and MS Copilot have disrupted the search market, is that LLMs themselves can be used instead of a search engine. Out of the box they can certainly recall facts from their trained model. These models are also extremely impressive, having likely been trained on millions of documents. The limiting factor is that LLMs are all (by design) fixed models – they have what’s called a ‘knowledge cutoff’. They encapsulate only the data used to train them and so, as described above, on their own they are simply not suitable for searching the research content stream.
Both ChatGPT and MS Copilot achieve what they do by searching the web, and that is where their dynamic capabilities come from. They sit on top of, or can directly query, a search engine for their source information. This means that the results from LLMs are only as good as the results of the underlying search engine, which, as we are discussing, is extremely difficult to get right for research. This technique (of an LLM using a search engine to generate its answers) has a name – ‘Retrieval Augmented Generation’ (RAG).
What LLMs essentially provide is an impressive user interface to interact with an existing search engine. Compared to traditional full-text search, they are better equipped to understand the user’s intent behind a search query and provide more relevant results from a search result set. Think of them as your assistant who searches through the results of a search and writes a summary of the most pertinent ones. The primary challenge of RAG is optimising what results go into the LLM prompt, which has hard technical limits and cost implications – something I plan to cover in a future article on information and context.
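The retrieval-then-prompt step described above can be sketched as follows. The `Document` shape and the character budget are illustrative assumptions, not any particular vendor’s API; a real system would count tokens rather than characters:

```python
from dataclasses import dataclass

@dataclass
class Document:
    title: str
    published: str  # ISO date, so string sort == chronological sort
    body: str

def build_rag_prompt(question: str, results: list[Document],
                     max_chars: int = 4000) -> str:
    """Pack search results into an LLM prompt, newest first,
    stopping when the (hard) context budget is reached."""
    context, used = [], 0
    for doc in sorted(results, key=lambda d: d.published, reverse=True):
        snippet = f"[{doc.published}] {doc.title}: {doc.body}"
        if used + len(snippet) > max_chars:
            break
        context.append(snippet)
        used += len(snippet)
    return ("Answer using only the sources below.\n\n"
            + "\n".join(context)
            + f"\n\nQuestion: {question}")
```

The key point is visible in the budget check: whatever the underlying search engine fails to surface, or whatever falls outside the context limit, the LLM never sees.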
So, we’re back to the same problem: how do we optimize search? Naturally, the next place to look is another related technology that’s gaining traction – vector search.
Falling into the fixed model trap – vector embeddings
So, what about ‘vector embeddings’? Haven’t these been shown to get great results for search? Yes, vector embeddings are a fantastic leap forward from the full-text search approaches, which I’ll discuss further in a separate article. However, beware of falling into the exact same trap as described above. Vector search engines are ultimately no different in their approach to indexing and have all of the above pitfalls related to relevance vs time. There are additional ramifications to be aware of as well.
Vector embeddings are a fixed model system like LLMs (in fact, LLMs can also produce vector embeddings). They can be trained to understand the world of financial markets and produce vectors encapsulating important context information, and it is the model itself (the coordinates in this vector space) that represents the close connections and relationships. If you want to update the model in any significant way (just like the moving planets and stars that warp space-time in the universe), the vector spaces need to reshape and all coordinates change. This means that any previously generated vector embeddings are no longer valid.
Think about that for a moment. Every significant change means your fine-tuned indexes become instantly obsolete and you essentially start again. To use this new model, you will need to re-process the entire document corpus to update the index before being able to search an updated model. This can be a costly thing to do and has an unpredictable risk element, as changes to models can have unintended side-effects. This is a hidden cost that is only going to get worse over time given the rapidly evolving corpus, driven by the financial research content stream.
Black swan events and other events that significantly alter the market landscape are also important to think about. These are times when active search traffic will be high, yet these very events may have just rendered the model out of date on a topic of great interest. Is this when you want to be retraining and testing your model and reindexing your document corpus? How quickly can you confidently do this without risk? Do you live with suboptimal search during this period, or do you perform the update and risk the side effects? There’s no correct answer. It will have to be finely judged at the time.
There is also the question of whether applying modern-day relationships to older content is what you want. The model of the financial world evolves at relative speed due to shifting political landscapes, economic conditions, company rebrands and mergers. Ideally, you’d want documents to retain the knowledge of the time they were originally indexed, but this is something you give up when opting for a fixed model solution.
Vector databases are like a beautiful snapshot of the financial research galaxy. However, remember that in the research universe it is a snapshot of a system, where planets rotate around their stars, stars collide or explode and are reborn, asteroids pass through solar systems, black holes consume matter and all manner of weird and wonderful things happen over time, which cannot be modelled in a single image.
All the above means that when testing vector databases, beware that small test corpora and day-zero results will appear better than the running production system will be. The evolution of research content means that results will degrade over time, and there will be many difficult decisions and much ongoing maintenance required to sustain them.
We use vector embeddings for various tasks at Limeglass, however we made the decision to build our search engine around a knowledge graph search stack. This doesn’t limit us from combining it with full-text, vector search or RAG in any way. Knowledge graphs are a powerful way to encapsulate relationships – they can greatly enhance search for the concepts that are important to clients in a similar manner to vector search. I’ll go into their other benefits in the next article, but in the context of this discussion, the important thing is that they are dynamic and can easily be adapted and grown over time without affecting the previously indexed content. They can also contain temporal constraints and allow reindexing of past data whilst maintaining the historic relationships.
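As an illustration of such temporal constraints (a deliberately simplified sketch, not the Limeglass schema), a graph edge can carry a validity interval so that entity expansion respects a document’s date. The Facebook/Meta rebrand and ticker-change dates below are used for illustration:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Edge:
    source: str
    relation: str
    target: str
    valid_from: Optional[date] = None  # None = valid since forever
    valid_to: Optional[date] = None    # None = still valid

    def active_on(self, when: date) -> bool:
        """An edge applies only within its validity window."""
        return ((self.valid_from is None or self.valid_from <= when)
                and (self.valid_to is None or when <= self.valid_to))

edges = [
    Edge("Facebook", "trades_as", "FB", valid_to=date(2022, 6, 8)),
    Edge("Meta", "trades_as", "META", valid_from=date(2022, 6, 9)),
    Edge("Meta", "formerly_known_as", "Facebook",
         valid_from=date(2021, 10, 28)),
]

def expand(entity: str, when: date) -> set[str]:
    """Expand a query entity using only edges valid at the given date."""
    return {e.target for e in edges
            if e.source == entity and e.active_on(when)}
```

A 2020 document indexed against this graph keeps its link from Facebook to the FB ticker, while a 2023 document resolves to META – the historic relationships survive without re-embedding anything.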
Time is of the essence
We are always more than happy to discuss advanced customisation of your search engine for research, or to demonstrate how our engine works and where it can deliver very high-quality benchmarks for relevant research results. Fixed model solutions can provide excellent results in many cases, but ultimately your tech stack will need that dynamic edge to get the best results for time-oriented financial content.
We can also show you how our financial knowledge graph and content tagging is a perfect fit for RAG solutions (with or without vector search). This configuration already has a name in the tech sphere: ‘Graph RAG’. It is a rapidly growing area of technology, as everybody looks to make the best use of all the new AI tools. We have already implemented our own solution; it can also be incorporated into your own RAG solutions, and, like the research universe, it is evolving!