使用 SQL 和表格数据进行问答和 RAG（6）—将指定目录下的 CSV 或 Excel 文件导入 SQLite 数据库

将指定目录下的 CSV 或 Excel 文件导入 SQLite 数据库。以下是详细代码逻辑：

1. 类结构

该类包含三个主要方法：

_prepare_db：负责将文件夹中的 CSV 和 XLSX 文件转换为 SQL 表。
_validate_db：用于验证 SQL 数据库中创建的表是否成功。
run_pipeline：主方法，按顺序调用 _prepare_db 和 _validate_db 方法完成整个数据导入流程。

2. 构造方法 `init`

def __init__(self, files_dir) -> None:

作用：

初始化类的实例，设置必要的变量，如数据文件目录和数据库路径。
从配置文件中加载数据库路径。

参数：

files_dir：指定包含数据文件（CSV/XLSX）的目录。

执行流程：

通过 LoadConfig 加载配置（如数据库存储路径）。
列出指定目录下的所有文件并存储到 self.file_dir_list。
使用 SQLAlchemy 的 create_engine 方法初始化 SQLite 数据库连接。

3. 私有方法 `_prepare_db`

def _prepare_db(self):

作用：

遍历指定目录中的所有文件，将 CSV 和 XLSX 文件转换为 Pandas DataFrame，然后保存到 SQLite 数据库中。

执行流程：

遍历目录中的文件：
- 判断文件扩展名，读取 CSV 或 XLSX 文件到 DataFrame。
- 如果文件不是 .csv 或 .xlsx，抛出 ValueError。
检查 SQL 数据库中是否已存在与文件名同名的表：
- 使用 inspect(self.engine).get_table_names() 检查表是否已存在。
- 如果存在，跳过该文件。
如果表不存在，则通过 Pandas 的 to_sql 方法将 DataFrame 写入数据库。

额外说明：

表名：使用文件名（去掉扩展名）作为表名。
重复表处理：如果表已存在，输出提示并跳过。

4. 私有方法 `_validate_db`

def _validate_db(self):

作用：

验证 SQL 数据库中的表是否已成功创建，并打印所有表名。

执行流程：

调用 inspect(self.engine).get_table_names() 获取数据库中所有表的列表。
打印数据库中的表名，便于用户确认表是否成功创建。

5. 公共方法 `run_pipeline`

def run_pipeline(self):

作用：

作为主入口，依次调用 _prepare_db 和 _validate_db，完成数据导入和验证。

执行流程：

调用 _prepare_db 方法，将数据文件转换为 SQL 表。
调用 _validate_db 方法，验证所有表的创建情况。

示例执行流程

初始化类实例：

processor = PrepareSQLFromTabularData("path/to/your/files")

运行数据导入和验证管道：

processor.run_pipeline()

假设文件夹包含文件 cancer.csv 和 diabetes.xlsx，数据库中已有表 cancer。程序输出可能如下：

Number of csv files: 2
Table 'cancer' already exists. Skipping...
==============================
All csv files are saved into the sql database.
==============================
Available table names in created SQL DB: ['cancer', 'diabetes']
==============================

关键点总结

SQLAlchemy 的使用：
- 使用 create_engine 连接 SQLite 数据库。
- 使用 inspect 检查数据库中已存在的表。
重复表的处理：
- 如果表已存在，则跳过，不覆盖数据。
Pandas 的集成：
- 使用 pd.read_csv 和 pd.read_excel 读取文件。
- 使用 to_sql 方法将数据写入数据库。
代码逻辑清晰：
- 数据准备与验证分开处理，方便调试和扩展。

扩展建议

支持其他文件格式：可以扩展支持 JSON 或 Parquet 文件。
异常处理：为数据库连接、文件读取等关键步骤添加更细化的异常处理。
日志功能：将信息输出（如表跳过提示）写入日志文件，便于后续分析。

完整代码：
app_config.yml：

directories:stored_csv_xlsx_directory: data/csv_xlsxsqldb_directory: data/sqldb.dbuploaded_files_sqldb_directory: data/uploaded_files_sqldb.dbstored_csv_xlsx_sqldb_directory: data/csv_xlsx_sqldb.dbpersist_directory: data/chromallm_config:agent_llm_system_role: "Given the following user question, corresponding SQL query, and SQL result, answer the user question.\nQuestion: {question}\nSQL Query: {query}\nSQL Result: {result}\nAnswer: "rag_llm_system_role: "You will recieve the user's question along with the search results of that question over a database. Give the user the proper answer."engine: "gpt-35-turbo"temperature: 0.0rag_config:collection_name: titanic_smalltop_k: 1

prepare_sqlitedb_from_csv_xlsx.py：

import os
import pandas as pd
from utils.load_config import LoadConfig
from sqlalchemy import create_engine, inspectclass PrepareSQLFromTabularData:"""A class that prepares a SQL database from CSV or XLSX files within a specified directory.This class reads each file, converts the data to a DataFrame, and thenstores it as a table in a SQLite database, which is specified by the application configuration."""def __init__(self, files_dir) -> None:"""Initialize an instance of PrepareSQLFromTabularData.Args:files_dir (str): The directory containing the CSV or XLSX files to be converted to SQL tables."""APPCFG = LoadConfig()self.files_directory = files_dirself.file_dir_list = os.listdir(files_dir)db_path = APPCFG.stored_csv_xlsx_sqldb_directorydb_path = f"sqlite:///{db_path}"self.engine = create_engine(db_path)print("Number of csv files:", len(self.file_dir_list))def _prepare_db(self):"""Private method to convert CSV/XLSX files from the specified directory into SQL tables.Each file's name (excluding the extension) is used as the table name.The data is saved into the SQLite database referenced by the engine attribute."""for file in self.file_dir_list:full_file_path = os.path.join(self.files_directory, file)file_name, file_extension = os.path.splitext(file)if file_extension == ".csv":df = pd.read_csv(full_file_path)elif file_extension == ".xlsx":df = pd.read_excel(full_file_path)else:raise ValueError("The selected file type is not supported")insp = inspect(self.engine)# 检查表是否已存在insp = inspect(self.engine)if file_name in insp.get_table_names():print(f"Table '{file_name}' already exists. Skipping...")else:df.to_sql(file_name, self.engine, index=False)print("==============================")print("All csv files are saved into the sql database.")def _validate_db(self):"""Private method to validate the tables stored in the SQL database.It prints out all available table names in the created SQLite databaseto confirm that the tables have been successfully created."""insp = inspect(self.engine)table_names = insp.get_table_names()print("==============================")print("Available table nasmes in created SQL DB:", table_names)print("==============================")def run_pipeline(self):"""Public method to run the data import pipeline, which includes preparing the databaseand validating the created tables. It is the main entry point for converting filesto SQL tables and confirming their creation."""self._prepare_db()self._validate_db()

输出：
在这里插入图片描述
查看sqlite3数据库中的数据：

import pandas as pd
from pyprojroot import here
from sqlalchemy import create_engine, inspect,textdb_path = str(here("data")) + "/csv_xlsx_sqldb.db"
db_path = f"sqlite:///{db_path}"engine = create_engine(db_path)def list_tables_and_data(engine):# 列出所有表名insp = inspect(engine)table_names = insp.get_table_names()print("Available tables in the database:", table_names)# 遍历每张表并打印前5行数据with engine.connect() as connection:for table in table_names:print(f"\nData from table '{table}':")query = text(f"SELECT * FROM {table} LIMIT 5;")result = connection.execute(query)for row in result:print(row)list_tables_and_data(engine)