April 17, 2025
in AI, Data, Taxonomies, LLM
13 min read

The Rising Value of Taxonomies in the Age of LLMs

Introduction

Large Language Models (LLMs) are increasing the demand for structured data, creating a significant opportunity for companies that specialize in organizing it. This article explores how this trend makes expertise in taxonomies and data matching increasingly valuable for businesses that want to use LLMs effectively.

LLMs Need Structure

LLMs excel at understanding and generating human language. They perform even better when that language is organized in a structured way, which improves accuracy, consistency, and reliability. Imagine asking an LLM to find all research papers related to a specific protein interaction in a particular type of cancer. If the LLM only has access to general scientific abstracts and articles, it might provide a broad overview of cancer research but fail to produce a precise list of papers that focus on that specific interaction.

However, if the LLM has access to a structured database of scientific literature with detailed metadata and relationships, it can perform much more targeted research. This database would include details like:

Protein names and identifiers
Cancer types and subtypes
Experimental methods and results
Genetic and molecular pathways
Relationships to other research papers and datasets

With this structured data, the LLM can quickly identify the relevant papers, analyze their findings, and provide a more focused and accurate summary of the research. This structured approach ensures that the LLM considers critical scientific details and avoids generalizations that might not be relevant to the specific research question. Taxonomies and ontologies are essential for organizing and accessing this kind of complex scientific information.

Large Language Models often benefit significantly from a technique called Retrieval-Augmented Generation (RAG). RAG involves retrieving relevant information from an external knowledge base and providing it to the LLM as context for generating a response. However, RAG systems are only as effective as the data they retrieve. Without well-structured data, the retrieval process can return irrelevant, ambiguous, or incomplete information, leading to poor LLM output. This is where taxonomies, ontologies, and metadata become crucial. They provide the 'well-defined scope' and 'high-quality retrievals' that are essential for successful RAG implementation. By organizing information into clear categories, defining relationships between concepts, and adding rich context, taxonomies enable RAG systems to pinpoint the most relevant data and provide LLMs with the necessary grounding for accurate and insightful responses.
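As a rough illustration of how taxonomy metadata can narrow retrieval before anything reaches the LLM, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the Paper record, the placeholder titles, the protein and cancer-type labels, and the naive keyword ranking, which stands in for a real search or vector index.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    title: str
    abstract: str
    # Taxonomy tags stored as metadata (placeholder category names).
    cancer_type: str
    proteins: set = field(default_factory=set)


# Invented placeholder records, not real publications.
CORPUS = [
    Paper("Paper A", "binding interaction between PROT1 and PROT2 in tumour cells",
          "cancer-type-1", {"PROT1", "PROT2"}),
    Paper("Paper B", "broad review of treatment options across cancer types",
          "general", set()),
    Paper("Paper C", "PROT1 expression levels and patient outcomes",
          "cancer-type-2", {"PROT1"}),
    Paper("Paper D", "PROT1 and PROT2 co-expression measured by assay",
          "cancer-type-1", {"PROT1", "PROT2", "PROT3"}),
]


def retrieve(query, cancer_type, proteins, k=3):
    """Filter by taxonomy metadata first, then rank by naive keyword overlap."""
    candidates = [
        p for p in CORPUS
        if p.cancer_type == cancer_type and proteins <= p.proteins
    ]
    query_terms = set(query.lower().split())

    def overlap(paper):
        return len(query_terms & set(paper.abstract.lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:k]


def build_prompt(question, papers):
    """Assemble retrieved passages into context for an LLM call (not made here)."""
    context = "\n".join(f"- {p.title}: {p.abstract}" for p in papers)
    return f"Context:\n{context}\n\nQuestion: {question}"


hits = retrieve("PROT1 PROT2 interaction", "cancer-type-1", {"PROT1", "PROT2"})
print(build_prompt("How do PROT1 and PROT2 interact?", hits))
```

The metadata filter removes the general review and the paper on a different cancer type before any ranking happens, so the context handed to the model is already scoped to the question.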

To address these challenges and provide the necessary structure, we can turn to taxonomies. Let's delve into what exactly a taxonomy is and how it can benefit LLMs.

What is a Taxonomy

A taxonomy is a way of organizing information into categories and subcategories. Think of it as a hierarchical classification system. A good example is the biological taxonomy used to classify animals. For instance, red foxes are classified as follows:

Domain: Eukarya (organisms whose cells have nuclei)
Kingdom: Animalia (all animals)
Phylum: Chordata (animals with a notochord, including all vertebrates)
Class: Mammalia (mammals)
Order: Carnivora (carnivores)
Family: Canidae (dogs and their relatives)
Genus: Vulpes (foxes)
Species: Vulpes vulpes (red fox)

Image credit: Annina Breen, CC BY-SA 4.0, via Wikimedia Commons

This hierarchical structure shows how we move from a very broad category (all organisms whose cells have nuclei) to a very specific one (the red fox). Just like this biological taxonomy, other taxonomies organize information in a structured way.
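One simple way to hold such a hierarchy in code is a child-to-parent mapping. The sketch below uses the red fox classification from this section; the function names lineage and is_a are illustrative choices, not part of any standard library.

```python
# Each entry maps a taxon to its parent, following the red fox example.
PARENT = {
    "Vulpes vulpes": "Vulpes",
    "Vulpes": "Canidae",
    "Canidae": "Carnivora",
    "Carnivora": "Mammalia",
    "Mammalia": "Chordata",
    "Chordata": "Animalia",
    "Animalia": "Eukarya",
}


def lineage(taxon):
    """Walk parent links from a taxon up to the root, returning broadest first."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def is_a(taxon, ancestor):
    """True when `ancestor` appears anywhere on the taxon's path to the root."""
    return ancestor in lineage(taxon)


print(" > ".join(lineage("Vulpes vulpes")))
print(is_a("Vulpes vulpes", "Mammalia"))  # True
```

Because every taxon has exactly one parent here, a plain dictionary is enough. Taxonomies in which a term can have several parents need a different structure.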

Taxonomies provide structure by:

Improving Performance: Taxonomies help LLMs focus on specific areas, reducing the risk of generating incorrect or nonsensical information and improving the relevance of their output.
Facilitating Data Integration: Taxonomies can integrate data from various sources, providing LLMs with a more comprehensive and unified view of information. This is crucial for tasks that require broad knowledge and context.
Providing Contextual Understanding: Taxonomies offer a framework for understanding the relationships between concepts, enabling LLMs to generate more coherent and contextually appropriate responses.

Types of Taxonomies

There are several different types of taxonomies, each with its own strengths and weaknesses, and each relevant to how LLMs can work with data:

Hierarchical Taxonomies: Organize information in a tree-like structure, with broader categories at the top and more specific categories at the bottom. This is the most common type, often used in library classification or organizational charts. For LLMs, this provides a clear, nested structure that aids in understanding relationships and navigating data.

Faceted Taxonomies: Allow information to be categorized in multiple ways, enabling users to filter and refine their searches. Think of e-commerce product catalogs with filters for size, color, and price. This is particularly useful for LLMs that need to handle complex queries and provide highly specific results, as they can leverage multiple facets to refine their output.

Polyhierarchical Taxonomies: A type of hierarchical taxonomy where a concept can belong to multiple parent categories. For example, "tomato" could be classified under both "fruits" and "red foods." This allows LLMs to understand overlapping categories and handle ambiguity in classification.

Associative Taxonomies: Focus on relationships between concepts, rather than just hierarchical structures. For example, a taxonomy of "car" could include terms like "wheel," "engine," "road," and "transportation," highlighting the interconnectedness of these concepts. This helps LLMs understand the broader context and semantic relationships between terms, improving their ability to generate coherent and relevant responses.
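To make two of these types more concrete, here is a small Python sketch of a polyhierarchy (the tomato example) and a faceted filter (the product catalog example). The top-level "foods" category, the product records, and the function names are assumptions added for illustration.

```python
# Polyhierarchy: a term may list more than one parent.
PARENTS = {
    "tomato": ["fruits", "red foods"],
    "fruits": ["foods"],      # "foods" is an assumed top-level category
    "red foods": ["foods"],
    "foods": [],
}


def ancestors(term):
    """Collect every category reachable by following parent links upward."""
    found = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return found


# Faceted classification: each item carries independent facet values.
PRODUCTS = [
    {"name": "shirt-1", "size": "M", "color": "blue", "price": 20},
    {"name": "shirt-2", "size": "L", "color": "blue", "price": 35},
    {"name": "shirt-3", "size": "M", "color": "red", "price": 25},
]


def filter_by_facets(items, **facets):
    """Keep items whose values match every requested facet exactly."""
    return [i for i in items if all(i.get(k) == v for k, v in facets.items())]


print(sorted(ancestors("tomato")))  # ['foods', 'fruits', 'red foods']
print([p["name"] for p in filter_by_facets(PRODUCTS, size="M", color="blue")])  # ['shirt-1']
```

The polyhierarchy answers "which categories does this term belong to?", while facets answer "which items satisfy these independent criteria?". An application can use both over the same data.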

Ultimately, growing reliance on LLM-generated content calls for well-defined taxonomies. The right type of taxonomy depends on the application, but the underlying principle remains: taxonomies enhance the value and utility of LLM outputs.

LLMs and Internal Knowledge Representation

While we've discussed various types of external taxonomies, it's important to note that LLMs also develop their own internal representations of knowledge. These internal representations differ significantly from human-curated taxonomies and play a crucial role in how LLMs process information.

One way LLMs represent knowledge is through word vectors (embeddings). These are numerical representations in which words with similar meanings sit close to each other in a multi-dimensional space. For example, the relationship "king - man + woman ≈ queen" can be captured through vector arithmetic: subtracting the vector for "man" from "king" and adding "woman" yields a vector close to "queen." This demonstrates how LLMs can represent semantic relationships.

Image credit: Ben Vierck, Word Vector Illustration, CC0 1.0

The word vector graph illustrates semantic relationships captured by LLMs using numerical representations of words. Each word is represented as a vector in a multi-dimensional space. In this example, the vectors for 'royal,' 'king,' and 'queen' originate at the coordinate (0,0), depicting their positions in this space. The vector labeled 'man' extends from the end of the 'royal' vector to the end of the 'king' vector, while the vector labeled 'woman' extends from the end of the 'royal' vector to the end of the 'queen' vector. This arrangement demonstrates how LLMs can represent semantic relationships such as 'king' being 'royal' plus 'man,' and 'queen' being 'royal' plus 'woman.' The spatial relationships between these vectors reflect the conceptual relationships between the words they represent.
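The arrangement in the illustration can be reproduced with a few hand-picked two-dimensional vectors. The numbers below are toy values chosen for this sketch, not real embeddings, which typically have hundreds of dimensions and only approximate such relationships.

```python
import math

# Hand-picked 2-D toy vectors (assumed values, not learned embeddings).
royal = (1.0, 3.0)
man = (2.0, -1.0)
woman = (-2.0, -1.0)


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


# As in the illustration: king = royal + man, queen = royal + woman.
king = add(royal, man)
queen = add(royal, woman)

VOCAB = {"royal": royal, "king": king, "queen": queen, "man": man, "woman": woman}


def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(target, exclude=()):
    """Return the vocabulary word whose vector is most similar to `target`."""
    words = [w for w in VOCAB if w not in exclude]
    return max(words, key=lambda w: cosine(VOCAB[w], target))


result = add(sub(king, man), woman)  # king - man + woman
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```

With these toy values the arithmetic lands exactly on the "queen" vector. With learned embeddings the result is only the nearest neighbour, which is why the relationship is usually written with an approximation sign.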

However, these internal representations, unlike human-curated taxonomies, are:

Learned, Not Curated: Acquired through exposure to massive amounts of text data, rather than through a process of human design and refinement. This means the LLM infers relationships, rather than having them explicitly defined.
Unstructured: The relationships learned by LLMs may not always fit into a clear, hierarchical structure.
Context-Dependent: The meaning of a word or concept can vary depending on the surrounding text, making it difficult for LLMs to consistently apply a single, fixed categorization.
Incomplete: It's important to understand that LLMs don't know what they don't know. They might simply be missing knowledge of specific domains or specialized terminology that wasn't included in their training data.

This is where taxonomies become crucial. They provide an external, structured framework that can:

Constrain LLM Output: By mapping LLM output to a defined taxonomy, we can ensure that the information generated is consistent, accurate, and relevant to a specific domain.
Ground LLM Knowledge: Taxonomies can provide LLMs with access to authoritative, curated knowledge that may be missing from their training data.
Bridge the Gap: Taxonomies can bridge the gap between the unconstrained, often ambiguous language that humans use and the more structured, formal representations that LLMs can effectively process.
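As a minimal sketch of constraining LLM output, the snippet below maps free-text terms onto a controlled vocabulary and rejects anything that does not match. The skills vocabulary, the synonym lists, the sample "LLM output," and the 0.8 similarity cutoff are all assumptions for illustration; fuzzy matching uses difflib from the Python standard library.

```python
import difflib

# Hypothetical controlled vocabulary: preferred label -> known synonyms.
SKILLS = {
    "Data Analysis": ["data analytics", "analysing data"],
    "Project Management": ["project mgmt", "managing projects"],
    "Python Programming": ["python", "python coding"],
}

# Lowercased lookup from every label or synonym to its preferred label.
LOOKUP = {label.lower(): label for label in SKILLS}
for label, synonyms in SKILLS.items():
    for synonym in synonyms:
        LOOKUP[synonym.lower()] = label


def normalize(term, cutoff=0.8):
    """Map a free-text term to a preferred label, or None if nothing is close."""
    key = term.strip().lower()
    if key in LOOKUP:
        return LOOKUP[key]
    close = difflib.get_close_matches(key, list(LOOKUP), n=1, cutoff=cutoff)
    return LOOKUP[close[0]] if close else None


# Stand-in for terms extracted from an LLM response.
llm_output = ["Python", "project managment", "underwater basket weaving"]
print([normalize(term) for term in llm_output])
# ['Python Programming', 'Project Management', None]
```

Terms that come back as None can be dropped, flagged for human review, or used to prompt the model again, depending on how strict the application needs to be.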

Taxonomy Companies as Service Providers

Companies that create and manage taxonomies, and that develop the metadata schemas and ontologies that complement them, are well-positioned to become key service providers in the LLM ecosystem. Their existing expertise in organizing information and structuring data makes them well qualified to help businesses harness LLMs effectively.

For example, companies that specialize in organizing complex data for specific industries, such as healthcare or finance, often create proprietary systems to analyze and categorize information for their clients. In the healthcare sector, a company might create a proprietary methodology for evaluating healthcare plan value, categorizing patients based on risk factors and predicting healthcare outcomes. In the realm of workforce development, a company might develop a detailed taxonomy of job skills, enabling employers to evaluate their current workforce capabilities and identify skill gaps. This same taxonomy can also empower job seekers to understand the skills needed for emerging roles and navigate the path to acquiring them. These companies develop expertise in data acquisition, market understanding, and efficient data processing to deliver valuable insights.

Companies that specialize in creating and managing taxonomies are not only valuable for general LLM use but also for improving the effectiveness of Retrieval-Augmented Generation systems. RAG's limitations, such as retrieving irrelevant or ambiguous information, often stem from underlying data organization issues. Taxonomy providers can address these issues by creating robust knowledge bases, defining clear data structures, and adding rich metadata. This ensures that RAG systems can retrieve the most relevant and accurate information, thereby significantly enhancing the quality of LLM outputs. In essence, taxonomy experts can help businesses transform their RAG systems from potentially unreliable tools into highly effective knowledge engines.

Strategic Opportunities for Taxonomy Providers in the LLM Era

The rapid advancement and adoption of LLMs are driving an increase in demand for automated content generation. Businesses are increasingly looking to replace human roles with intelligent agents capable of handling various tasks, from customer service and marketing to data analysis and research. This drive towards agent-driven automation creates a fundamental need for well-structured data and robust taxonomies. Companies specializing in these areas are strategically positioned to capitalize on this demand.

Here's how taxonomy companies can leverage this market shift:

1. Capitalizing on the Content Generation Boom:

Demand-Driven Growth: The primary driver will be the sheer volume of content that businesses want to generate using LLMs and agents. Taxonomies are essential to ensure this content is organized, accurate, and aligned with specific business needs. The core opportunity lies in meeting this growing demand.

Agent-Centric Focus: The demand is not just for general content but for content that powers intelligent agents. This requires taxonomies that are not just broad but highly specific and contextually rich.

2. Building Partnerships:

The surge in demand for LLM-powered applications and intelligent agents is creating a wave of new organizations focused on developing these solutions. Many of these companies will need specialized data to power their agents effectively. Take the job skills taxonomy from the workforce development example earlier: agent builders in that space need exactly this kind of data, which gives a job skills taxonomy provider an opportunity to forge strategic partnerships.

Addressing the "Build vs. Buy" Decision: Many new agent builders will face the decision of whether to build their own skills taxonomy from scratch or partner with an existing provider. Given the rapid pace of LLM development and the complexity of creating and maintaining a robust taxonomy, partnering often proves to be the most efficient and cost-effective route. The taxonomy company can highlight the advantages of partnering:

Faster time to market
Higher quality data
Ongoing updates and maintenance

By targeting these emerging agent-building organizations, a job skills taxonomy company can capitalize on the growing demand for LLM-powered solutions and establish itself as a critical data provider in the evolving AI-driven workforce development landscape. This approach focuses on the new opportunities created by the LLM boom rather than on the provider's existing operations.

Seamless Integration via MCP: To further enhance the value proposition, taxonomy providers should consider surfacing their capabilities using the Model Context Protocol (MCP). MCP is an open protocol that standardizes how LLM applications and agents connect to external data sources and tools. By making their taxonomies accessible through an MCP server, providers can let agent builders incorporate the data into their workflows with less custom integration work, reducing friction and accelerating development.
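To give a feel for what exposing a taxonomy as a tool might involve, here is a simplified Python sketch. It does not use an official MCP SDK. It only mimics the general shape of a JSON-RPC 2.0 "tools/call" exchange, which is the message style MCP builds on. The tool name lookup_skill, the taxonomy entries, and the handler are hypothetical, and a real server would also need initialization, tool listing, and a transport layer.

```python
import json

# Hypothetical slice of a job skills taxonomy.
SKILL_TAXONOMY = {
    "python programming": {"parent": "software development",
                           "related": ["data analysis"]},
    "data analysis": {"parent": "analytics",
                      "related": ["python programming"]},
}


def lookup_skill(name):
    """Return the taxonomy entry for a skill, or None if it is unknown."""
    return SKILL_TAXONOMY.get(name.strip().lower())


def handle_request(raw):
    """Answer one JSON-RPC 2.0 request that calls the hypothetical tool."""
    request = json.loads(raw)
    reply = {"jsonrpc": "2.0", "id": request.get("id")}
    params = request.get("params", {})
    if request.get("method") != "tools/call":
        reply["error"] = {"code": -32601, "message": "Method not found"}
    elif params.get("name") != "lookup_skill":
        reply["error"] = {"code": -32602, "message": "Unknown tool"}
    else:
        entry = lookup_skill(params.get("arguments", {}).get("name", ""))
        text = json.dumps(entry) if entry else "No matching skill"
        reply["result"] = {
            "content": [{"type": "text", "text": text}],
            "isError": entry is None,
        }
    return json.dumps(reply)


request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "lookup_skill",
               "arguments": {"name": "Python Programming"}},
})
print(handle_request(request))
```

In practice a provider would more likely build on an existing MCP server library than hand-roll message handling. The point of the sketch is that the taxonomy lookup itself stays a small, well-defined function behind a standard interface.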

3. Capitalizing on Existing Expertise as an Established Player:

Market Advantage: Established taxonomy companies have a significant advantage thanks to their existing expertise, data assets, and client relationships. This position allows them to adapt quickly to the agent-driven market.

Economic Efficiency: Using an established taxonomy provider is often more cost-effective than building an in-house solution. Businesses looking to deploy agents quickly will likely prefer to partner with existing experts.

By focusing on the demand for content generation driven by the rise of intelligent agents and by targeting partnerships with agent-building organizations, taxonomy companies can position themselves for significant growth and success in this evolving market.

Why This Matters to You

We rely on AI more and more every day. From getting quick answers to complex research, we expect AI to provide us with accurate and reliable information. But what happens when the volume of information becomes overwhelming? What happens when AI systems need to sift through massive amounts of data to make critical decisions?

That's where organized data becomes vital. Imagine AI as a powerful detective tasked with solving a complex case. Without a well-organized case file (a robust taxonomy), the detective might get lost in a sea of clues, missing crucial details or drawing the wrong conclusions. But with a meticulously organized file, the detective can:

Quickly Identify Key Evidence: AI can pinpoint the most relevant and reliable information, even in a sea of data.
Connect the Dots: AI can understand the complex relationships between different pieces of information, revealing hidden patterns and insights.
Ensure a Clear Narrative: AI can present a coherent and accurate picture of the situation, avoiding confusion or misinterpretation.

In essence, the better the data is organized, the more effectively AI can serve as a reliable source of truth. It's about ensuring that AI doesn't just process information, but that it processes it in a way that promotes clarity, accuracy, and ultimately, a shared understanding of the world. This is why the role of taxonomies, ontologies, and metadata is so critical—they are the foundation for building AI systems that help us navigate an increasingly complex information landscape with confidence.

The Indispensable Role of Human Curation

While LLMs can be valuable tools in the taxonomy development process, they cannot fully replace human expertise (yet). Human curation is essential because taxonomies are ultimately designed for human consumption. Human curators can ensure that taxonomies are intuitive, user-friendly, and aligned with how people naturally search for and understand information. Human experts are needed not just for creating the taxonomy itself, but also for defining and maintaining the associated metadata and ontologies.

For example, imagine an LLM generating a taxonomy for a complex subject like "fine art." While it might group works by artist or period, a human curator would also consider factors like artistic movement, cultural significance, and thematic connections, creating a taxonomy that is more nuanced and useful for art historians, collectors, and enthusiasts.

Image credit: Michelangelo, Public Domain, https://commons.wikimedia.org/w/index.php?curid=9097336

Developing a high-quality taxonomy often requires specialized knowledge of a particular subject area. Human experts can bring this knowledge to the process, ensuring that the taxonomy accurately reflects the complexities of the domain (for now).

Challenges and Opportunities

The rise of LLMs directly fuels the demand for sophisticated taxonomies. While LLMs can assist in generating content, taxonomies ensure that this content is organized, accessible, and contextually relevant. This dynamic creates both opportunities and challenges for taxonomy providers. The evolving nature of LLMs requires constant adaptation in taxonomy strategies, and the integration of metadata and ontologies becomes essential to maximize the utility of LLM-generated content. So, the expertise in developing and maintaining these taxonomies becomes a critical asset in the age of LLMs.

Enhanced Value Through Metadata and Ontologies

The value of taxonomies is significantly amplified when combined with robust metadata and ontologies. Metadata provides detailed descriptions and context, making taxonomies more searchable and understandable for LLMs. Ontologies, with their intricate relationships and defined properties, enable LLMs to grasp deeper contextual meanings and perform complex reasoning.

Metadata is data that describes other data. For example, the title, author, and publication date of a book are metadata. High-quality metadata, such as detailed descriptions, keywords, and classifications, makes taxonomies more easily searchable and understandable by both humans and machines, including LLMs. This rich descriptive information provides essential context that enhances the utility of the taxonomy.

Ontologies are related to taxonomies but go beyond simple hierarchical classification. While taxonomies primarily focus on organizing information into categories and subcategories, often representing "is-a" relationships (e.g., "A dog is a mammal"), ontologies provide a more detailed, formal, and expressive representation of knowledge. They define concepts, their properties, and the complex relationships between them. Ontologies answer questions like "What is this?", "What are its properties?", "How is it related to other things?", and "What can we infer from these relationships?"

Key Distinctions:

Relationship Types: Taxonomies mostly deal with hierarchical ("is-a") relationships. Ontologies handle many different types of relationships (e.g., causal, temporal, spatial, "part-of," "has-property").
Formality: Taxonomies can be informal and ad-hoc. Ontologies are more formal and often use standardized languages and logic (e.g., OWL - Web Ontology Language).
Expressiveness: Taxonomies are less expressive and can't represent complex rules or constraints. Ontologies are highly expressive and can represent complex knowledge and enable sophisticated reasoning.
Purpose: Taxonomies are primarily for organizing and categorizing. Ontologies are for representing knowledge, defining relationships, and enabling automated reasoning.

For instance, an ontology about products would not only categorize them (e.g., "electronics," "clothing") but also define properties like "manufacturer," "material," "weight," and "price," as well as relationships such as "is made of," "is sold by," and "is a component of." This rich, interconnected structure allows an LLM to understand not just the category of a product but also its attributes and how it relates to other products. This added layer of detail is what makes ontologies so valuable for LLMs, as they provide the deep, contextual understanding needed for complex reasoning and knowledge-based tasks. However, this level of detail also makes them more complex to develop and maintain, requiring specialized expertise and ongoing updates.
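A lightweight way to picture this is a set of subject-predicate-object triples plus a rule that follows one relationship transitively. The sketch below is an assumption-laden toy: the product facts, the predicate names, and the placeholder retailer and manufacturer are invented, and a production ontology would normally use a formal language such as OWL rather than Python tuples.

```python
# Hypothetical product facts stored as (subject, predicate, object) triples.
TRIPLES = {
    ("laptop", "is_a", "electronics"),
    ("battery", "is_component_of", "laptop"),
    ("cell", "is_component_of", "battery"),
    ("laptop", "is_sold_by", "RetailerX"),
    ("laptop", "has_manufacturer", "MakerY"),
}


def objects(subject, predicate):
    """All objects linked to `subject` by `predicate`."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def components_of(product):
    """Direct and indirect components, treating is_component_of as transitive."""
    found = set()
    frontier = [product]
    while frontier:
        whole = frontier.pop()
        for s, p, o in TRIPLES:
            if p == "is_component_of" and o == whole and s not in found:
                found.add(s)
                frontier.append(s)
    return found


print(objects("laptop", "is_a"))         # {'electronics'}
print(sorted(components_of("laptop")))   # ['battery', 'cell']
```

No triple states that a cell is a component of a laptop. That fact is inferred from the other two, which is the kind of reasoning a plain category tree cannot express.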

Therefore, companies that can integrate and provide these elements alongside taxonomies will offer a more compelling and valuable service in the LLM ecosystem. The combination of well-structured taxonomies, rich metadata, and detailed ontologies provides the necessary context and depth for LLMs to operate at their full potential.

Conclusion

The rise of LLMs is creating a classic supply and demand scenario. As more businesses adopt LLMs and techniques like RAG, the demand for structured data and the services of taxonomy providers will increase. However, it's crucial to recognize that the effectiveness of RAG hinges on high-quality data organization. Companies specializing in creating robust taxonomies, ontologies, and metadata are positioned to meet this demand by providing the essential foundation for successful RAG implementations. Their expertise ensures that LLMs and RAG systems can retrieve and utilize information effectively, making their services increasingly valuable for organizations looking to take advantage of LLM-generated content.