text2sql: multi-agent实现思路MAC-SQL

server/2024/10/21 4:19:00/

MAC-SQL出自2023年12月的论文《MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL》(github),它是用基于LLM的multi-agent来实现text2sql

MAC-SQL的整体思路如论文图2所示,由Decomposer、Selector、Refiner三个agent组成,个人觉得除了有multi-agent的思想外,MAC-SQL与DIN-SQL的思路很类似。

在这里插入图片描述

MAC-SQL的三个agent的协作过程如论文算法1所示。
在这里插入图片描述

接下来分别介绍MAC-SQL三个agent的实现:

  • Selector:其作用是schema linking,选择只与问题相关的数据库schema元素,其prompt如下图所示。值得注意的是,论文强调了只有在数据库schema对应的prompt长度超过了长度阈值时才会被激活,否则直接使用原始的数据库schema。

在这里插入图片描述

  • Decomposer: 在生成最终SQL之前生成一系列的中间步骤来提高LLM的推理能力。如论文图2所示意,Decomposer指导LLM将原始复杂问题分解成推理步骤后生成最后的SQL查询。作者使用CoT来生成子问题和其对应的SQL,实现时会先判断用户问题的难易程度,如果用户的问题比较简单,则直接生成SQL;如果用户的问题很复杂,则先生成子问题对应的SQL,逐步得到最终的SQL。在in-context learning时使用了few-shot例子。

    decomposer_prompt = '''
    Given a [Database schema] description, a knowledge [Evidence] and the [Question], you need to use valid SQLite and understand the database and knowledge, and then decompose the question into subquestions for text-to-SQL generation.When generating SQL, we should always consider constraints:
    [Constraints]
    - In 'SELECT <column>', just select needed columns in the [Question] without any unnecessary column or value - In 'FROM <table>' or 'JOIN <table>', do not include unnecessary table 
    - If use max or min func, 'JOIN <table>' FIRST, THEN use 'SELECT MAX(<column>)' or 'SELECT MIN(<column>)' - If [Value examples] of <column> has 'None' or None, use 'JOIN <table>' or 'WHERE <column> is NOT NULL' is better 
    - If use 'ORDER BY <column> ASC|DESC', add 'GROUP BY <column>' before to select distinct values==========[Database schema]
    # Table: frpm 
    [ (CDSCode, CDSCode. Value examples: ['01100170109835', '01100170112607'].), (Charter School (Y/N), Charter School (Y/N). Value examples: [ 1, 0, None] . And 0: N;. 1: Y),(Enrollment (Ages 5-17), Enrollment (Ages 5-17). Value examples: [5271.0, 4734.0].), (Free Meal Count (Ages 5-17), Free Meal Count (Ages 5-17). Value examples: [ 3864.0, 2637.0 ].
    And eligible free rate = Free Meal Count / Enrollment) 
    ] 
    # Table: satscores 
    [ (cds, California Department Schools. Value examples: ['10101080000000', '10101080109991'].), (sname, school name. Value examples: [ 'None', 'Middle College High', 'John F. Kennedy High', 'Independence High', 'Foothill High'].),(NumTstTakr, Number of Test Takers in this school. Value examples: [ 24305, 4942, 1, 0, 280] . And number of test takers in each school),(AvgScrMath, average scores in Math. Value examples: [699, 698, 289, None, 492 ] . And average scores in Math), (NumGE1500, Number of Test Takers Whose Total SAT Scores Are Greater or Equal to 1500. Value examples: [ 5837, 2125, 0, None, 191] . And Number of Test Takers Whose Total SAT Scores Are Greater or Equal to 1500. . commonsense evidence:. . Excellence Rate = NumGE1500 / NumTstTakr) 
    ]
    [Foreign keys]
    frpm.'CDSCode' = satscores.'cds'
    [Question]
    List school names of charter schools with an SAT excellence rate over the average.
    [Evidence]
    Charter schools refers to 'Charter School (Y/N)' = 1 in the table frpm; Excellence rate = NumGE1500 / NumTstTakrDecompose the question into sub questions, considering [Constraints], and generate the SQL after thinking step by step: 
    Sub question 1: Get the average value of SAT excellence rate of charter schools. 
    SQL 
    "'sql 
    SELECT AVG(CAST(T2.'NumGE1500' AS REAL) / T2.'NumTstTakr') FROM frpm AS T1 INNER JOIN satscores AS T2 ON T1.'CDSCode' = T2.'cds' WHERE T1.'Charter School (Y/N)' = 1 
    "'Sub question 2: List out school names of charter schools with an SAT excellence rate over the average. 
    SQL 
    "' sql 
    SELECT T2.'sname' FROM frpm AS T1 INNER JOIN satscores AS T2 ON T1.'CDSCode' = T2.'cds' WHERE T2.'sname' IS NOT NULL AND T1.'Charter School (Y/N)' = 1 AND CAST(T2.'NumGE1500' AS REAL) / T2.'NumTstTakr' > ( SELECT AVG(CAST(T4.'NumGE1500' AS REAL) / T4.'NumTstTakr')FROM frpm AS T3 INNER JOIN satscores AS T4 ON T3.'CDSCode' = T4.'cds' WHERE T3.'Charter School (Y/N)' = 1 ) 
    "' Question Solved. ==========[Database schema]
    # Table: account 
    [ (account_id, the id of the account. Value examples: [11382, 11362, 2, 1, 2367].), (district_id, location of branch. Value examples: [77, 76, 2, 1, 39].),(frequency, frequency of the acount. Value examples: ['POPLATEK MESICNE', 'POPLATEK TYDNE', 'POPLATEK PO OBRATU'].), (date, the creation date of the account. Value examples: ['1997-12-29', '1997-12-28'].) 
    ] 
    # Table: client 
    [ (client_id, the unique number. Value examples: [13998, 13971, 2, 1, 2839].), (gender, gender. Value examples: ['M', 'F']. And F:female . M:male ), (birth_date, birth date. Value examples: ['1987-09-27', '1986-08-13'].), (district_id, location of branch. Value examples: [77, 76, 2, 1, 39].) 
    ] 
    # Table: district 
    [ (district_id, location of branch. Value examples: [77, 76, 2, 1, 39].), (A4, number of inhabitants . Value examples: ['95907', '95616', '94812'].), (A11, average salary. Value examples: [12541, 11277, 8114].) ]
    [Foreign keys]
    account.'district_id' = district.'district_id' 
    client.'district_id' = district.'district_id'
    [Question]
    What is the gender of the youngest client who opened account in the lowest average salary branch?[Evidence]
    Later birthdate refers to younger age; A11 refers to average salary Decompose the question into sub questions, considering [Constraints], and generate the SQL after thinking step by step: 
    Sub question 1: What is the district_id of the branch with the lowest average salary?
    SQL 
    "' sql 
    SELECT 'district_id' FROM district ORDER BY 'A11' ASC LIMIT 1 
    "'
    Sub question 2: What is the youngest client who opened account in the lowest average salary branch? 
    SQL 
    "' sql 
    SELECT T1.'client_id' FROM client AS T1 INNER JOIN district AS T2 ON T1.'district_id' = T2.'district_id' ORDER BY T2.'A11' ASC, T1.'birth_date' DESC LIMIT 1 
    "'Sub question 3: What is the gender of the youngest client who opened account in the lowest average salary branch? 
    SQL 
    "' sql 
    SELECT T1.'gender' FROM client AS T1 INNER JOIN district AS T2 ON T1.'district_id' = T2.'district_id' ORDER BY T2.'A11' ASC, T1.'birth_date' DESC LIMIT 1 
    "' 
    Question Solved. ==========[Database schema]
    {desc_str}
    [Foreign keys]
    {fk_str}
    [Question]
    {query}
    [Evidence]
    {evidence} Decompose the question into sub questions, considering [Constraints], and generate the SQL after thinking step by step:
    '''
    

在这里插入图片描述

  • Refiner:其作用是为了检测到自动校正SQL错误,如论文图4所示意。如论文图2所示,在收到一个SQL查询之后,Refiner诊断SQL来评估器语法准确性、执行可行性,并从数据库中检索到非空结果。

    refiner_prompt = '''
    [Instruction]
    When executing SQL below, some errors occurred, please fix up SQL based on query and database info. Solve the task step by step if you need to. Using SQL format in the code block, and indicate script type in the code block. When you find an answer, verify the answer carefully. Include verifiable evidence in your response if possible.
    [Constraints]
    - In 'SELECT <column>', just select needed columns in the [Question] without any unnecessary column or value 
    - In 'FROM <table>' or 'JOIN <table>', do not include unnecessary table 
    - If use max or min func, 'JOIN <table>' FIRST, THEN use 'SELECT MAX(<column>)' or 'SELECT MIN(<column>)' - If [Value examples] of <column> has 'None' or None, use 'JOIN <table>' or 'WHERE <column> is NOT NULL' is better 
    - If use 'ORDER BY <column> ASC|DESC', add 'GROUP BY <column>' before to select distinct values
    [Query]
    {query}
    [Evidence]
    {evidence}
    [Database info]
    {desc_str}
    [Foreign keys]
    {fk_str}
    [old SQL]
    "' sql 
    {sql} 
    "'
    [SQLite error]
    {sqlite_error}
    [Exception class]
    {exception_class}Now please fixup old SQL and generate new SQL again.[correct SQL]
    '''
    

在这里插入图片描述

除了MAC-SQL思路外,论文作者还尝试构建了一个用于微调LLM的mulit-agent任务指令集,并用这个指令集微调了Code Llama 7B模型得到了开源模型SQL-Llama。


http://www.ppmy.cn/server/133531.html

相关文章

目标检测系统中需要【重新训练模型】说明

上百种【基于YOLOv8/v10/v11的目标检测系统】目录&#xff08;pythonpyside6界面系统源码可训练的数据集也完成的训练模型&#xff09;-CSDN博客 目标检测系统操作说明【用户使用指南】&#xff08;pythonpyside6界面系统源码可训练的数据集也完成的训练模型&#xff09;-CSDN…

算法专题七: 分治归并

目录 1. 排序数组2. 交易逆序对的总数3. 计算右侧小于当前元素的个数4. 翻转对 1. 排序数组 算法思路: 本道题使用归并的思路进行排序, 先讲数组分为左右两个区间, 然后合并两个有序数组. class Solution {vector<int> tmp; public:vector<int> sortArray(vector&…

wifi、热点密码破解 - python

乐子脚本&#xff0c;有点小慢&#xff0c;试过多线程&#xff0c;系统 wifi 连接太慢了&#xff0c;需要时间确认&#xff0c;多线程的话系统根本反应不过来。 也就可以试试破解别人的热点&#xff0c;一般都是 123456 这样的傻鸟口令 # coding:utf-8 import pywifi from pyw…

python 打包为动态链接库(.so文件)

参考资料&#xff1a;https://medium.com/yeap0022/linux-compress-python-packages-into-shared-library-so-8342bffab001 注意事项&#xff1a;.py, .c后缀文件都可以删掉&#xff0c;cache也可以删掉&#xff0c;但是__init__.py文件不能删&#xff0c;建立的子文件夹也不能…

JavaWeb环境下Spring Boot在线考试系统的优化策略

摘要 随着信息技术在管理上越来越深入而广泛的应用&#xff0c;管理信息系统的实施在技术上已逐步成熟。本文介绍了基于JavaWeb技术的在线考试系统设计与实现的开发全过程。通过分析基于Java Web技术的在线考试系统设计与实现管理的不足&#xff0c;创建了一个计算机管理基于Ja…

Android——通过MediaStore查询图片

查询图片&#xff1a; private void loadImageList() {String[] columns new String[]{MediaStore.Images.Media._ID, // 编号MediaStore.Images.Media.TITLE, // 标题MediaStore.Images.Media.SIZE, // 文件大小MediaStore.Images.Media.DATA, // 文件路径};Cursor cursor g…

基于FPGA的以太网设计(四)

1.ARP协议简介 ARP&#xff08;Address Resolution Protocol&#xff09;&#xff0c;即地址解析协议&#xff0c;是根据IP地址&#xff08;逻辑地址&#xff09;获取MAC地址的一种TCP/IP协议。在以太网通信中&#xff0c;数据是以“帧”的格式进行传输的&#xff0c;帧格式里…

零代码快速开发智能体 |甘肃旅游通

在互联网信息爆炸的时代&#xff0c;寻找一处让人心动的旅游胜地往往需要花费大量的时间和精力。而今天&#xff0c;我要向大家介绍一款能够帮助你轻松规划甘肃之行的智能体——“甘肃旅游通”。这款智能体通过低代码开发&#xff0c;集合了丰富的旅游信息和个性化推荐功能&…