活在梦里

[Knowledge Graph Construction] KGGen

Paper Summary

Paper: https://arxiv.org/abs/2502.09956

Code: https://github.com/stair-lab/kg-gen

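Research Questions and Motivation

This work focuses on two key challenges in knowledge graph construction:

  • Sparse, incomplete knowledge graphs severely limit embedding models: the models cannot effectively learn or infer missing links, so they perform poorly on knowledge completion and reasoning tasks
  • The effectiveness of a GraphRAG system depends heavily on the quality of the underlying knowledge graph, yet automatically constructed graphs are often noisy and incomplete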

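Main Contributions

Extraction Framework: KGGen

The paper proposes KGGen, an open-source toolkit based on large language models that extracts high-quality knowledge graphs from plain text. It works in three stages:

  1. Entity and relation extraction (generate stage): a language model extracts the entities and the relations between them from each source text
  2. Cross-source graph aggregation (aggregate stage): the unique entities and edges from all source graphs are collected and merged into one unified graph, with lowercase normalization applied to reduce redundant entities
  3. Iterative entity and edge clustering (cluster stage): an iterative, language-model-based clustering method merges nodes and edges that refer to the same real-world entity or concept, yielding a denser knowledge graph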
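
To make the aggregate stage concrete, here is a minimal sketch of merging per-source graphs with lowercase normalization. The Graph dataclass and the aggregate function are hypothetical names of my own for illustration, not the kg-gen API, and the sketch assumes a graph is just a set of entity names plus a set of (subject, predicate, object) triples.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical stand-in for a per-source knowledge graph."""
    entities: set[str] = field(default_factory=set)
    # Each relation is a (subject, predicate, object) triple.
    relations: set[tuple[str, str, str]] = field(default_factory=set)


def aggregate(graphs: list[Graph]) -> Graph:
    """Union entities and edges from all source graphs after lowercasing."""
    merged = Graph()
    for g in graphs:
        merged.entities.update(e.lower() for e in g.entities)
        merged.relations.update(
            (s.lower(), p.lower(), o.lower()) for s, p, o in g.relations
        )
    return merged


g1 = Graph({"Marie Curie", "Radium"}, {("Marie Curie", "discovered", "Radium")})
g2 = Graph({"marie curie", "Polonium"}, {("marie curie", "Discovered", "Polonium")})
merged = aggregate([g1, g2])
print(sorted(merged.entities))
```

Lowercasing only removes duplicates that differ in case; variants such as plurals or synonyms are left for the cluster stage.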

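Evaluation Framework: MINE

The paper also proposes MINE, a benchmark designed specifically for text-to-knowledge-graph extractors. The evaluation procedure is:

  1. Corpus: 100 test articles covering diverse topics such as history, art, and science, each about 1,000 words long
  2. Key fact extraction: 15 key factual statements are extracted from each article and manually verified for accuracy and consistency with the source text
  3. Querying and evaluation
    • Each generated knowledge graph is queried with the 15 facts of its article
    • The top-k nodes most relevant to each fact are found by semantic similarity (vectors come from the all-MiniLM-L6-v2 model of Sentence Transformers)
    • All nodes and relations within two hops of those nodes are collected as the query result
    • A language model judges whether the query result is sufficient to infer the fact and outputs a binary score (1 = inferable, 0 = not inferable)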
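
The retrieval part of this procedure can be sketched roughly as below. This is my own illustration, not the MINE code: a toy bag-of-words vector stands in for the all-MiniLM-L6-v2 embeddings, edge direction is ignored when expanding the neighborhood (an assumption), the final LLM judgement is left out, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_nodes(fact: str, nodes: set[str], k: int) -> list[str]:
    """Rank nodes by similarity to the fact and keep the k best."""
    fact_vec = embed(fact)
    return sorted(nodes, key=lambda n: (-cosine(fact_vec, embed(n)), n))[:k]


def neighborhood(seeds: list[str], triples: set[tuple[str, str, str]], hops: int = 2):
    """Collect every triple within `hops` edges of the seed nodes."""
    frontier, seen, result = set(seeds), set(seeds), set()
    for _ in range(hops):
        next_frontier = set()
        for s, p, o in triples:
            if s in frontier or o in frontier:
                result.add((s, p, o))
                next_frontier.update({s, o} - seen)
        seen |= next_frontier
        frontier = next_frontier
    return result


triples = {
    ("marie curie", "discovered", "radium"),
    ("radium", "is a", "chemical element"),
    ("chemical element", "listed in", "periodic table"),
}
nodes = {n for s, _, o in triples for n in (s, o)}
seeds = top_k_nodes("Marie Curie discovered radium", nodes, k=1)
subgraph = neighborhood(seeds, triples)
print(seeds, sorted(subgraph))
```

The resulting subgraph is what would be handed to the judging model together with the fact.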

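Implementation

I skimmed the code; it is built on the DSPy framework. The overall approach is not especially novel, but the third step, iterative entity and edge clustering (that is, entity/edge disambiguation), deserves a closer look.

Prompts for entity and relation extraction: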

"""Extract key entities from the source text. Extracted entities are subjects or objects.
  This is for an extraction task, please be THOROUGH and accurate to the reference text."""

"""Extract subject-predicate-object triples from the source text. Subject and object must be from entities list. Entities provided were previously extracted from the same source text.
    This is for an extraction task, please be thorough, accurate, and faithful to the reference text."""

The core clustering code is in the cluster_items function (https://github.com/stair-lab/kg-gen/blob/main/src/kg_gen/steps/3_cluster_graph.py#L27). Its workflow is as follows:

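flowchart TD
    A["Start"] --> B["Initialize: copy items, create an empty clusters list"]

    B --> C{"Phase 1:<br>actively search for clusters"}
    C --> D["ExtractCluster: extract a candidate cluster"]
    D --> E["ValidateCluster: validate the cluster"]
    E --> F{"Validation passed?"}
    F -- Yes --> G["Choose a representative and create the cluster<br>Remove clustered items from remaining_items"]
    F -- No --> H["no_progress_count + 1"]

    G --> I{"remaining_items empty<br>or no progress for LOOP_N rounds?"}
    H --> I

    I -- No --> C
    I -- Yes --> J{"Phase 2:<br>process remaining items"}

    J --> K["Process remaining items in batches"]
    K --> L{"Are existing clusters empty?"}
    L -- Yes --> M["Create a standalone cluster for each item"]
    L -- No --> N["CheckExistingClusters:<br>try to add items to existing clusters<br>or create new clusters"]

    M --> O["Prepare the result"]
    N --> O
    O --> P["Return the representative set and the cluster dict"]
    P --> Q["End"]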

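Clustering is iterative: in each pass of the loop, the remaining entities are placed in the prompt and the LLM identifies and extracts one set of synonymous entities. Each candidate cluster is checked a second time by ValidateCluster; if it passes, ChooseRepresentative picks the best representative name for the cluster. If several consecutive rounds fail to produce a new cluster, iteration stops once a preset threshold is reached.

After iteration stops, CheckExistingClusters decides for each remaining entity whether it can be assigned to one of the existing clusters. All of the prompts are listed below: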
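
The control flow of the loop can be sketched without any LLM. In this rough illustration, which is not the repository's code, the three model calls are replaced by plain callables and simple rule-based stand-ins; cluster_items_sketch, the stem helper, and the loop_n default of 3 are all my own assumptions, and the second phase is reduced to turning leftovers into singleton clusters.

```python
from typing import Callable


def cluster_items_sketch(
    items: set[str],
    extract: Callable[[set[str]], list[str]],
    validate: Callable[[list[str]], list[str]],
    choose_rep: Callable[[list[str]], str],
    loop_n: int = 3,  # arbitrary threshold chosen for this sketch
) -> dict[str, set[str]]:
    """Extract, validate, and name one cluster per round until no progress."""
    remaining = set(items)
    clusters: dict[str, set[str]] = {}
    no_progress = 0
    while remaining and no_progress < loop_n:
        candidate = [c for c in extract(remaining) if c in remaining]
        validated = validate(candidate) if len(candidate) > 1 else []
        if len(validated) > 1:
            clusters[choose_rep(validated)] = set(validated)
            remaining -= set(validated)
            no_progress = 0  # a successful round resets the counter
        else:
            no_progress += 1
    # Simplified second phase: every leftover item becomes its own cluster.
    for item in remaining:
        clusters.setdefault(item, {item})
    return clusters


def stem(word: str) -> str:
    """Crude stand-in for 'same meaning': drop a trailing plural s."""
    return word.lower().rstrip("s")


def extract(remaining: set[str]) -> list[str]:
    """Stand-in for ExtractCluster: return the first group sharing a stem."""
    groups: dict[str, list[str]] = {}
    for item in sorted(remaining):
        groups.setdefault(stem(item), []).append(item)
    return next((g for g in groups.values() if len(g) > 1), [])


def validate(candidate: list[str]) -> list[str]:
    """Stand-in for ValidateCluster: accept only if all stems agree."""
    return candidate if len({stem(c) for c in candidate}) == 1 else []


def choose_rep(cluster: list[str]) -> str:
    """Stand-in for ChooseRepresentative: prefer the shortest name."""
    return min(cluster, key=lambda c: (len(c), c))


clusters = cluster_items_sketch(
    {"cat", "cats", "dog", "dogs", "bird"}, extract, validate, choose_rep
)
print(clusters)
```

Each round costs several model calls in the real setting, so the no-progress threshold is what bounds the total cost when nothing more can be merged.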

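import dspy
from typing import Optional
from pydantic import BaseModel

# ItemsLiteral, ClusterLiteral, and BatchLiteral are not defined in this excerpt;
# they are presumably Literal types built at runtime from the current item names,
# so that the model can only return names it was actually given.

class ExtractCluster(dspy.Signature):
    """Find one cluster of related items from the list.
    A cluster should contain items that are the same in meaning, with different tenses, plural forms, stem forms, or cases. 
    Return populated list only if you find items that clearly belong together, else return empty list."""
    
    items: set[ItemsLiteral] = dspy.InputField()
    context: str = dspy.InputField(desc="The larger context in which the items appear")
    cluster: list[ItemsLiteral] = dspy.OutputField()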

class ValidateCluster(dspy.Signature):
    """Validate if these items belong in the same cluster.
    A cluster should contain items that are the same in meaning, with different tenses, plural forms, stem forms, or cases. 
    Return populated list only if you find items that clearly belong together, else return empty list."""

    cluster: set[ClusterLiteral] = dspy.InputField()
    context: str = dspy.InputField(desc="The larger context in which the items appear")
    validated_items: list[ClusterLiteral] = dspy.OutputField(desc="All the items that belong together in the cluster")

class ChooseRepresentative(dspy.Signature):
  """Select the best item name to represent the cluster, ideally from the cluster.
  Prefer shorter names and generalizability across the cluster."""
  
  cluster: set[str] = dspy.InputField()
  context: str = dspy.InputField(desc="the larger context in which the items appear")
  representative: str = dspy.OutputField()

# Cluster is defined first because CheckExistingClusters uses it in an annotation.
class Cluster(BaseModel):
    representative: str
    members: set[str]

class CheckExistingClusters(dspy.Signature):
    """Determine if the given items can be added to any of the existing clusters.
    Return representative of matching cluster for each item, or None if there is no match."""

    items: list[BatchLiteral] = dspy.InputField()
    clusters: list[Cluster] = dspy.InputField(desc="Mapping of cluster representatives to their cluster members")
    context: str = dspy.InputField(desc="The larger context in which the items appear")
    cluster_reps_that_items_belong_to: list[Optional[str]] = dspy.OutputField(desc="Ordered list of cluster representatives where each is the cluster where that item belongs to, or None if no match. THIS LIST LENGTH IS SAME AS ITEMS LIST LENGTH")

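The final stage updates the graph with the disambiguated names, and the implementation is short and simple. The original relations are represented as SPO triples:

  • S: Subject, the entity where the relation starts
  • P: Predicate, the type of relation between the two entities
  • O: Object, the entity where the relation ends

During the update, the system checks whether S and O are in the clustered entity list; if not, it looks up the matching representative through Cluster.members and renames them. P is handled the same way.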
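
The renaming step amounts to a dictionary lookup. A minimal sketch, assuming clusters are given as a mapping from representative to member set; build_lookup and remap_triples are hypothetical helper names, not functions from the repository.

```python
def build_lookup(clusters: dict[str, set[str]]) -> dict[str, str]:
    """Invert {representative: members} into {member: representative}."""
    return {m: rep for rep, members in clusters.items() for m in members}


def remap_triples(
    triples: set[tuple[str, str, str]],
    entity_clusters: dict[str, set[str]],
    edge_clusters: dict[str, set[str]],
) -> set[tuple[str, str, str]]:
    """Rewrite S, P, and O to their cluster representatives."""
    ent = build_lookup(entity_clusters)
    edge = build_lookup(edge_clusters)
    # Names that belong to no cluster are kept unchanged.
    return {(ent.get(s, s), edge.get(p, p), ent.get(o, o)) for s, p, o in triples}


triples = {("cats", "eats", "fish"), ("cat", "eat", "fish")}
updated = remap_triples(
    triples,
    entity_clusters={"cat": {"cat", "cats"}},
    edge_clusters={"eat": {"eat", "eats"}},
)
print(updated)
```

Because the result is a set, triples that become identical after renaming collapse into one edge, which is where the graph actually gets denser.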

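A few likely problems stand out:

  • With a large corpus, the number of entities and relations may exceed the model's context window
  • When entities and relations are complex, the method places high demands on the model's reasoning ability, and the paper lacks targeted ablation experiments
  • Clustering granularity is hard to control precisely; for example, in different iterations the model might group [sparrow, eagle] as "birds" and later group [cat, dog] as "animals"

Even so, the approach is a useful reference for scenarios with few entities or for applications that need small-scale text clustering.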