2020国赛C中小微企业的信贷策略数据预处理代码

news/2025/1/18 9:42:29/

本题的思路做法有很多,本文不再赘述,只进行简单的数据预处理为后面几问分析做基础
本文目标是用python提取数据整理出最基础的数据

原题一共有3个附件,我们主要针对前两个附件进行处理

  • 附件1 123家有信贷记录企业的相关数据
  • 附件2 302家无信贷记录企业的相关数据
  • 附件3 银行贷款年利率与客户流失率关系的2019年统计数据

整理目标指标体系:
进项:单位个数,月均价税合计累计,负金额个数,月均负金额累计,作废发票个数,月均作废金额累计
销项:单位个数,月均价税合计累计,负金额个数,月均负金额累计,作废发票个数,月均作废金额累计

123家进项:

import numpy as np
import pandas as pd
import os
data1=pd.read_excel('data1.xlsx','销项发票信息')
data1
企业代号发票号码开票日期购方单位代号金额税额价税合计发票状态
0E1114593562017-08-04B037119401.711598.2911000.0有效发票
1E150762392017-08-09B008448170.941389.069560.0有效发票
2E150762402017-08-09B008448170.941389.069560.0有效发票
3E150762412017-08-09B008444085.47694.534780.0有效发票
4E150762422017-08-09B008444085.47694.534780.0有效发票
...........................
162479E12388877012019-12-17B109444827.67144.834972.5有效发票
162480E12388877022019-12-17B109447412.62222.387635.0有效发票
162481E123341730852019-12-17B130931917.4757.531975.0有效发票
162482E12388877032019-12-25B130937252.42217.587470.0有效发票
162483E12388877042019-12-25B130936660.19199.816860.0有效发票

162484 rows × 8 columns

data1['发票状态'] =data1['发票状态'].map({'作废发票':0,'有效发票':1})
data1['发票状态'].value_counts()
1.0    151278
0.0     11159
Name: 发票状态, dtype: int64
data1
企业代号发票号码开票日期购方单位代号金额税额价税合计发票状态
0E1114593562017-08-04B037119401.711598.2911000.01.0
1E150762392017-08-09B008448170.941389.069560.01.0
2E150762402017-08-09B008448170.941389.069560.01.0
3E150762412017-08-09B008444085.47694.534780.01.0
4E150762422017-08-09B008444085.47694.534780.01.0
...........................
162479E12388877012019-12-17B109444827.67144.834972.51.0
162480E12388877022019-12-17B109447412.62222.387635.01.0
162481E123341730852019-12-17B130931917.4757.531975.01.0
162482E12388877032019-12-25B130937252.42217.587470.01.0
162483E12388877042019-12-25B130936660.19199.816860.01.0

162484 rows × 8 columns

data1['开票日期']=data1['开票日期'].map(lambda x: 100*x.year + x.month)
data1
企业代号发票号码开票日期购方单位代号金额税额价税合计发票状态
0E111459356201708B037119401.711598.2911000.01.0
1E15076239201708B008448170.941389.069560.01.0
2E15076240201708B008448170.941389.069560.01.0
3E15076241201708B008444085.47694.534780.01.0
4E15076242201708B008444085.47694.534780.01.0
...........................
162479E1238887701201912B109444827.67144.834972.51.0
162480E1238887702201912B109447412.62222.387635.01.0
162481E12334173085201912B130931917.4757.531975.01.0
162482E1238887703201912B130937252.42217.587470.01.0
162483E1238887704201912B130936660.19199.816860.01.0

162484 rows × 8 columns

data1.groupby(['企业代号']).agg('count')
发票号码开票日期购方单位代号金额税额价税合计发票状态
企业代号
E18110811081108110811081108110
E10568568568568568568568
E10099999999999999
E10163636363636363
E102177177177177177177177
........................
E95913913913913913913913
E9620202020202020
E97264264264264264264264
E98153153153153153153153
E998888888

123 rows × 7 columns

dd=pd.pivot_table(data1, index=['企业代号','开票日期'],aggfunc=['count'])
#dd.to_excel('dd.xlsx')
data1.groupby(['企业代号'],sort=False).agg('count')
发票号码开票日期购方单位代号金额税额价税合计发票状态
企业代号
E18110811081108110811081108110
E212707127071270712707127071270712707
E324073240732407324073240732407324073
E42231223122312231223122312231
E51060106010601060106010601060
........................
E11921212121212121
E12029292929292929
E121186186186186186186186
E122118118118118118118118
E12365656565656565

123 rows × 7 columns

index=data1['企业代号']
index=index.drop_duplicates()
index
0           E1
8110        E2
20817       E3
44890       E4
47121       E5... 
162065    E119
162086    E120
162115    E121
162301    E122
162419    E123
Name: 企业代号, Length: 123, dtype: object
index.iloc[1]
'E2'
count=[0 for i in range(len(index))]
for i in range(len(data1['企业代号'])):for j in range(len(index)):if index.iloc[j] == data1['企业代号'][i]:count[j]+=1continue
count
[8110,12707,24073,2231,1060,1052,8149,9425,5906,568,1117,285,8005,3336,2412,439,676,381,3099,777,1466,1562,3570,1782,978,504,1468,569,475,800,2608,552,1073,707,770,1547,1273,628,264,883,663,327,245,2435,325,974,3518,1016,513,285,895,379,454,3501,590,495,438,388,2149,513,1992,219,673,1375,485,679,215,73,587,578,375,248,236,555,8655,245,258,268,86,48,649,343,706,498,293,22,74,400,172,261,129,39,148,196,913,20,264,153,8,99,63,177,64,21,122,153,54,33,33,83,250,22,47,75,6,47,44,144,21,29,186,118,65]
index2=index.drop_duplicates()
index2=index2.reset_index(drop=True)
unitcount=pd.Series(count)
namecop=pd.Series(index2)
df=pd.concat([namecop,unitcount],axis=1)
df.columns=['企业代号','发票数']
df
企业代号发票数
0E18110
1E212707
2E324073
3E42231
4E51060
.........
118E11921
119E12029
120E121186
121E122118
122E12365

123 rows × 2 columns

newdata=data1.set_index('企业代号')#将企业代号作为列索引
newdata
发票号码开票日期购方单位代号金额税额价税合计发票状态
企业代号
E111459356201708B037119401.711598.2911000.01.0
E15076239201708B008448170.941389.069560.01.0
E15076240201708B008448170.941389.069560.01.0
E15076241201708B008444085.47694.534780.01.0
E15076242201708B008444085.47694.534780.01.0
........................
E1238887701201912B109444827.67144.834972.51.0
E1238887702201912B109447412.62222.387635.01.0
E12334173085201912B130931917.4757.531975.01.0
E1238887703201912B130937252.42217.587470.01.0
E1238887704201912B130936660.19199.816860.01.0

162484 rows × 7 columns

index=index2
index
0        E1
1        E2
2        E3
3        E4
4        E5... 
118    E119
119    E120
120    E121
121    E122
122    E123
Name: 企业代号, Length: 123, dtype: object
df
企业代号发票数
0E18110
1E212707
2E324073
3E42231
4E51060
.........
118E11921
119E12029
120E121186
121E122118
122E12365

123 rows × 2 columns

aa=[]
for i in index:pp=newdata.loc[i]#按照企业代号进行处理daytemp=pp['开票日期']daytemp=pd.Series(daytemp)day_index=daytemp.drop_duplicates()#去重a=day_index.count()#计算总共有多少个月份aa.append(a)
len(aa)day=pd.Series(aa)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数']
df
企业代号发票数月数
0E1811030
1E21270737
2E32407338
3E4223135
4E5106033
............
118E1192110
119E120296
120E12118624
121E12211826
122E1236515

123 rows × 3 columns

d#f.drop(df.columns[0], axis=1, inplace=True)    # 删除第1列
#df
0
1
2
3
4
...
162479
162480
162481
162482
162483

162484 rows × 0 columns

#销售单位代号个数
ab=[]
for i in index:pp=newdata.loc[i]#按照企业代号进行处理daytemp=pp['购方单位代号']daytemp=pd.Series(daytemp)day_index=daytemp.drop_duplicates()#去重a=day_index.count()#计算总共有多少个月份ab.append(a)
len(ab)day=pd.Series(ab)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数']
df
企业代号发票数月数购方单位代号个数
0E1811030360
1E212707371611
2E32407338136
3E422313526
4E510603333
...............
118E119211017
119E12029626
120E12118624123
121E1221182670
122E12365157

123 rows × 4 columns

#价税合计累计
ac=[]
for i in index:pp=newdata.loc[i]#按照企业代号进行处理daytemp=pp['价税合计']daytemp=pd.Series(daytemp)c=daytemp.sum()ac.append(c)
len(ac)
day=pd.Series(ac)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计']
df
企业代号发票数月数购方单位代号个数价税合计累计
0E18110303604.798812e+09
1E2127073716116.947082e+08
2E324073381366.827398e+08
3E4223135262.345631e+09
4E5106033332.427882e+08
..................
118E1192110173.724200e+04
119E120296261.963934e+05
120E121186241231.177928e+05
121E12211826708.748906e+04
122E123651572.671682e+05

123 rows × 5 columns

#负金额个数
ad=[]
d=0
for i in index:pp=newdata.loc[i]#按照企业代号进行处理daytemp=pp['金额']daytemp=pd.Series(daytemp)for j in range(len(daytemp)):if daytemp[j]<0:d+=1ad.append(d)d=0
len(ad)day=pd.Series(ad)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计','负金额个数']
df
企业代号发票数月数购方单位代号个数价税合计累计负金额个数
0E18110303604.798812e+09229
1E2127073716116.947082e+08360
2E324073381366.827398e+084501
3E4223135262.345631e+099
4E5106033332.427882e+0810
.....................
118E1192110173.724200e+040
119E120296261.963934e+050
120E121186241231.177928e+050
121E12211826708.748906e+048
122E123651572.671682e+050

123 rows × 6 columns

# 负金额总额
ae=[]
sum=0
for i in index:pp=newdata.loc[i]#按照企业代号进行处理daytemp=pp['金额']daytemp=pd.Series(daytemp)for j in range(len(daytemp)):if daytemp[j]<0:sum+=daytemp[j]ae.append(round(sum,2))sum=0
len(ae)
day=pd.Series(ae)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计','负金额个数','负金额累计']
df
企业代号发票数月数购方单位代号个数价税合计累计负金额个数负金额累计
0E18110303604.798812e+09229-1.300769e+08
1E2127073716116.947082e+08360-1.909563e+07
2E324073381366.827398e+084501-2.047618e+08
3E4223135262.345631e+099-7.221601e+06
4E5106033332.427882e+0810-9.652789e+05
........................
118E1192110173.724200e+0400.000000e+00
119E120296261.963934e+0500.000000e+00
120E121186241231.177928e+0500.000000e+00
121E12211826708.748906e+048-1.952390e+03
122E123651572.671682e+0500.000000e+00

123 rows × 7 columns

#作废发票个数
af=[]
f=0
for i in index:pp=newdata.loc[i]#按照企业代号进行处理daytemp=pp['发票状态']daytemp=pd.Series(daytemp)for j in range(len(daytemp)):if daytemp[j]==0:f+=1af.append(f)f=0
len(af)
day=pd.Series(af)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计','负金额个数','负金额累计','作废发票个数']
df
企业代号发票数月数购方单位代号个数价税合计累计负金额个数负金额累计作废发票个数
0E18110303604.798812e+09229-1.300769e+08224
1E2127073716116.947082e+08360-1.909563e+071042
2E324073381366.827398e+084501-2.047618e+08385
3E4223135262.345631e+099-7.221601e+06190
4E5106033332.427882e+0810-9.652789e+0555
...........................
118E1192110173.724200e+0400.000000e+003
119E120296261.963934e+0500.000000e+0020
120E121186241231.177928e+0500.000000e+0023
121E12211826708.748906e+048-1.952390e+0316
122E123651572.671682e+0500.000000e+0032

123 rows × 8 columns

#作废发票金额
ag=[]
gsum=0
for i in index:pp=newdata.loc[i]#按照企业代号进行处理daytemp=pp['发票状态']money=pp['金额']daytemp=pd.Series(daytemp)for j in range(len(daytemp)):if daytemp[j]==0:gsum+=money[j]ag.append(round(gsum,2))gsum=0
len(ag)    
day=pd.Series(ag)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计','负金额个数','负金额累计','作废发票个数','作废发票金额累计']
df
企业代号发票数月数购方单位代号个数价税合计累计负金额个数负金额累计作废发票个数作废发票金额累计
0E18110303604.798812e+09229-1.300769e+082248.715034e+07
1E2127073716116.947082e+08360-1.909563e+0710426.454014e+07
2E324073381366.827398e+084501-2.047618e+083851.860897e+07
3E4223135262.345631e+099-7.221601e+061901.712102e+08
4E5106033332.427882e+0810-9.652789e+05558.168982e+06
..............................
118E1192110173.724200e+0400.000000e+0034.854400e+02
119E120296261.963934e+0500.000000e+00200.000000e+00
120E121186241231.177928e+0500.000000e+00233.184300e+03
121E12211826708.748906e+048-1.952390e+03162.565260e+04
122E123651572.671682e+0500.000000e+00323.647341e+04

123 rows × 9 columns

df.to_excel('销项数据整理.xlsx')

http://www.ppmy.cn/news/271449.html

相关文章

OpenCV中出现“Microsoft C++ 异常: cv::Exception,位于内存位置 0x0000005C8ECFFA80 处。”的异常

1 致谢 对于OpenCV的安装 要感谢网友空晴拜小白提供的教程 链接如下&#xff1a; https://blog.csdn.net/sinat_36264666/article/details/73135823?refmyread#reply 教程写的十分详细 截图什么的也很好 2 问题描述 今天在安装和测试OpenCV的时候遇到了这样的问题 测试代…

Max length of title attribute in html

测了一下chrome是1024个utf-8字符。 具体可见&#xff1a; http://stackoverflow.com/questions/8516235/max-length-of-title-attribute 有人回答说IE是512个。 有人回答按照W3C标准&#xff0c;理想情况下应该少于64个字符。

nowcoder15251白兔的式子

链接 点击跳转 题解 一共往下走 n − 1 n-1 n−1次&#xff0c;其中 m − 1 m-1 m−1次往右走&#xff0c;所以答案是 C n − 1 m − 1 a ( n − 1 ) − ( m − 1 ) b m − 1 C_{n-1}^{m-1} a^{(n-1)-(m-1)}b^{m-1} Cn−1m−1​a(n−1)−(m−1)bm−1 代码 #include <bi…

第一批鸿蒙系统手机型号,华为鸿蒙第一批名单

产品 型号 招募基线版本 Mate X2 TET-ANOO 11.0.0.160(SP2C0OE160R9P4) 11.0.0.160(SP2C00E160R10P2) Mate40 OCE AN10 11.0.0.170(C00E170R6P3) 11.0.0.170(C00E170R7P1) 11.0.0.170(C00E170R8P1) 11.0.0.170(C01E170R6P3) 11.0.0.170(C01E170R7P1) 11.0.0.170(C01E170R8P1) M…

竹海码路(python篇6) 新手常见错误-

1 ERRORS: app0.userModel.nameM: (fields.E120) CharFields must define a max_length attribute. app0.userModel.passwordM: (fields.E120) CharFields must define a max_length attribute. djiango模型必须有一些参数的定义&#xff0c;例如CharFields并不是可以null的。 …

pycharm django数据库操作

修改默认sqlite3为mysql 1&#xff0c;在pycharm中添加pymysql&#xff08;File-->settings-->project-->Project Interpreter-->-->搜索pymysql-->install) 2&#xff0c;修改settings的默认设置 default: {ENGINE: django.db.backends.mysql,NAME: 数据库…

(Django)Web网页导入pycharm时遇到的问题集合

后台 入门项目毕业设计*创建项目过程**问题* 功能实现*华为身份证识别步骤&#xff1a;**点击按钮删除条目信息&#xff1a;**后台添加内容前端显示无法显示换行和空格&#xff1a;**前端HTML页面判断数据库是否有数据显示不同内容&#xff1a;**前端HTML页面判断是否为管理员用…

日常错误集

A B C D E ERRORS:myApp.Students.scontend: (fields.E120) CharFields must define a ‘max_length’ attribute. CharFields类型必须添加max_length属性 Exception: HTTP 599: SSL certificate problem: unable to get local issuer cer 在self.crawl()最后加上validate…