I have an Excel spreadsheet that I need to import into SQL Server daily. The spreadsheet will contain roughly 250,000 rows and about 50 columns. I tested both openpyxl and xlrd using nearly identical code.
Here is the code I'm using (minus the debugging statements):

```python
import xlrd
import openpyxl

def UseXlrd(file_name):
    workbook = xlrd.open_workbook(file_name, on_demand=True)
    worksheet = workbook.sheet_by_index(0)
    first_row = []
    for col in range(worksheet.ncols):
        first_row.append(worksheet.cell_value(0, col))
    data = []
    for row in range(1, worksheet.nrows):
        record = {}
        for col in range(worksheet.ncols):
            if isinstance(worksheet.cell_value(row, col), str):
                record[first_row[col]] = worksheet.cell_value(row, col).strip()
            else:
                record[first_row[col]] = worksheet.cell_value(row, col)
        data.append(record)
    return data

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    first_row = []
    for col in range(1, sheet.max_column + 1):
        first_row.append(sheet.cell(row=1, column=col).value)
    data = []
    for r in range(2, sheet.max_row + 1):
        record = {}
        for col in range(sheet.max_column):
            if isinstance(sheet.cell(row=r, column=col + 1).value, str):
                record[first_row[col]] = sheet.cell(row=r, column=col + 1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r, column=col + 1).value
        data.append(record)
    return data

xlrd_results = UseXlrd('foo.xls')
openpyxl_results = UseOpenpyxl('foo.xls')
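For comparison, here is a sketch of the same openpyxl reader rewritten around `iter_rows(values_only=True)`, which streams each row once rather than addressing every cell individually. This is not the code I benchmarked above; the demo workbook, file name, and column names are invented so the snippet is self-contained and runnable:

```python
import openpyxl

def build_demo_file(path):
    # Hypothetical tiny workbook, created only so the example runs standalone.
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.append(['name', 'qty'])   # header row
    ws.append(['widget ', 10])   # trailing space, to show the strip() handling
    ws.append(['gadget', 7])
    wb.save(path)

def use_iter_rows(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.iter_rows(values_only=True)  # yields each row as a plain tuple
    header = next(rows)                       # first row holds the column names
    data = []
    for row in rows:
        record = {}
        for key, value in zip(header, row):
            record[key] = value.strip() if isinstance(value, str) else value
        data.append(record)
    wb.close()
    return data

build_demo_file('demo.xlsx')
records = use_iter_rows('demo.xlsx')  # list of dicts keyed by the header row
```

The structure of the output (a list of dicts keyed by the header row) matches what both functions above produce.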
Passing the same Excel file containing 3,500 rows produces drastically different run times. With xlrd I can read the entire file into a list of dictionaries in under 2 seconds. With openpyxl I get the following results:

```
Reading Excel File...
Read 100 lines in 114.14509415626526 seconds
Read 200 lines in 471.43183994293213 seconds
Read 300 lines in 982.5288782119751 seconds
Read 400 lines in 1729.3348784446716 seconds
Read 500 lines in 2774.886833190918 seconds
Read 600 lines in 4384.074863195419 seconds
Read 700 lines in 6396.7723388671875 seconds
Read 800 lines in 7998.775000572205 seconds
Read 900 lines in 11018.460735321045 seconds
```
While I could just use xlrd in the final script, I would have to hard-code a lot of formatting because of various issues (i.e., ints read as floats, dates read as ints, datetimes read as floats). Since I need to reuse this code for several more imports, it makes no sense to hard-code specific columns just to format them correctly, and then maintain similar code across 4 different scripts.
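To make the float-date problem concrete, here is a stdlib-only sketch of what xlrd's float cell values mean for dates, assuming the workbook uses the default 1900 date system: the serial counts days since 1899-12-30 (that epoch absorbs Excel's phantom 1900 leap day, so it is correct for dates from 1900-03-01 onward), and the fractional part encodes the time of day:

```python
from datetime import datetime, timedelta

# Epoch for the 1900 date system; the two-day shift from 1900-01-01
# accounts for Excel's serial numbering and its fake 1900-02-29.
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial):
    # Whole days come from the integer part, time of day from the fraction.
    return EXCEL_EPOCH + timedelta(days=serial)

excel_serial_to_datetime(43831.0)  # -> datetime(2020, 1, 1, 0, 0)
excel_serial_to_datetime(43831.5)  # -> datetime(2020, 1, 1, 12, 0)
```

In practice xlrd ships a helper for this (`xlrd.xldate_as_datetime(value, workbook.datemode)`), which also handles the 1904 date system used by some Mac-created workbooks.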
Any suggestions on how to proceed?