A short summary of cleaning up CBS Sports' NBA live game data


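I forget exactly which day it was a few months ago, but I stumbled on the fact that shot-location data can be read straight out of the HTML of CBS's live game pages. I was thrilled at the time, since I had not yet figured out how to handle ESPN's XML data.

Looking back now at the CBS shot data I processed, I have to say it is of limited use.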


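The processed files include a table mapping CBSplayerID to player names, shotdata for the 8 seasons from 2003 to 2011, and a table explaining the shotType codes.

In total, the CBS shot counts differ noticeably from the official season statistics, often by several hundred to over a thousand shots in either direction. As a proportion, the totals still match to better than 98%, which is decent. At the level of a single game, though, the shotdata has inaccurate timestamps and wrongly attributed shooters (judged mainly against the play-by-play (PBP) timelines from the NBA's official site and ESPN). Compared with the ESPN XML shot data I obtained later, that is a clear weakness.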


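One more thing worth mentioning: the shot-type descriptions from CBS and the NBA's official site are quite rich, while ESPN's categories are somewhat coarser.

I once came across an unusual putback dunk.


What I was originally curious about was whether the play counted as an assisted alley-oop or as a putback dunk off a missed shot and an offensive rebound. Instead I found, unexpectedly, that only CBS described it as a dunk, while ESPN and the NBA's official site recorded a layup.

This suggests there are other inconsistent shot descriptions, though probably only a few. Since the timelines do not match either, reconciling the sources would be fairly tedious, so I have not dealt with this yet.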


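A brief record of the basic crawling and processing steps:

1. For the 8 seasons from 2003 to 2011, save the scoreboard page for one arbitrary day of each season, and extract every game day of all 8 seasons from those pages.

For example: http://www.cbssports.com/nba/scoreboard/20110101

The idea is to match the string <a href="/nba/scoreboard/ in the page and add the 8-digit string that follows it to the set of game days.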
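
The matching in step 1 can be sketched in a few lines of Java. This is only an illustration under assumptions: the class name ScoreboardDates, the method extractGameDays, and the HTML fragment in main are made up here, and the real scoreboard markup may differ.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original tool chain.
class ScoreboardDates {
    // The anchor prefix described in the text, followed by exactly eight digits (yyyyMMdd).
    private static final Pattern GAME_DAY =
            Pattern.compile("<a href=\"/nba/scoreboard/(\\d{8})");

    /** Returns the distinct game days found in one scoreboard page, in ascending order. */
    static Set<String> extractGameDays(String html) {
        Set<String> days = new TreeSet<>();
        Matcher m = GAME_DAY.matcher(html);
        while (m.find()) {
            days.add(m.group(1)); // group 1 is the eight-digit date
        }
        return days;
    }

    public static void main(String[] args) {
        // Made-up page fragment, for illustration only.
        String html = "<a href=\"/nba/scoreboard/20110101\">Jan 1</a>"
                + "<a href=\"/nba/scoreboard/20110102\">Jan 2</a>"
                + "<a href=\"/nba/scoreboard/20110101\">Jan 1</a>";
        System.out.println(extractGameDays(html)); // prints [20110101, 20110102]
    }
}
```

A sorted set is used so that duplicate links collapse and the game days come out in calendar order.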


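2. Use the links for all game days as seeds and configure a Heritrix job to fetch the shotchart page of every game.

The key is to match "NBA_[0-9]+_[A-Z]*@[A-Z]*" and add each match to the crawl queue.

In principle there is no need to write a small Extractor subclass yourself. Without one, however, you have to set up link filter rules in the job, and the default link extraction modules pull out many useless links that then have to be checked against those rules, so the crawl takes longer.

Alternatively, you can first fetch the list of game-day pages with a download tool, extract the key strings of all games with a regular expression (this requires some programming), and then fetch the shotchart pages from the resulting links. The downloading itself is easy with Xunlei (Thunder), and each file is named after the game's key string.

For example, for http://www.cbssports.com/nba/gametracker/shotchart/NBA_20110101_CLE@CHI the downloaded file is named "NBA_20110101_CLE@CHI".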
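
As a rough sketch of the download-tool variant, the snippet below turns the text of a scoreboard page into a list of shotchart URLs and derives the file name from each URL. The regular expression and the URL prefix are the ones quoted above; the class name ShotchartLinks, its methods, and the sample fragment are assumptions for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original tool chain.
class ShotchartLinks {
    private static final Pattern GAME_KEY = Pattern.compile("NBA_[0-9]+_[A-Z]*@[A-Z]*");
    private static final String SHOTCHART =
            "http://www.cbssports.com/nba/gametracker/shotchart/";

    /** Builds one shotchart URL per distinct game key, in order of first appearance. */
    static Set<String> shotchartUrls(String scoreboardHtml) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = GAME_KEY.matcher(scoreboardHtml);
        while (m.find()) {
            urls.add(SHOTCHART + m.group());
        }
        return urls;
    }

    /** The saved file is named after the last path segment, i.e. the game key. */
    static String fileNameFor(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Made-up fragment; a real page mentions each game key several times.
        String html = "<a href=\"/nba/gametracker/live/NBA_20110101_CLE@CHI\">live</a>"
                + "<a href=\"/nba/gametracker/recap/NBA_20110101_CLE@CHI\">recap</a>";
        for (String url : shotchartUrls(html)) {
            System.out.println(url + " -> " + fileNameFor(url));
        }
    }
}
```

The resulting URL list can be handed to any batch downloader.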

In the end, though, I went with the programmatic approach, a custom Heritrix Extractor:

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;
import org.archive.crawler.extractor.Link;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

public class CBSScoreboardExtractor extends Extractor {

    private static final long serialVersionUID = 5855731422080471017L;

    private static Logger logger = Logger.getLogger(CBSScoreboardExtractor.class.getName());

    public CBSScoreboardExtractor(String name) {
        this(name, "CBSSport Scoreboard Extractor");
    }

    public CBSScoreboardExtractor(String name, String description) {
        super(name, description);
    }

    // Pattern of the per-game key string found on a CBS scoreboard page
    private static final String CBS_FEATURE = "NBA_[0-9]+_[A-Z]*@[A-Z]*";
    private static final String SHOTCHART = "http://www.cbssports.com/nba/gametracker/shotchart/";

    protected void extract(CrawlURI curi) {
        // Fetch the response body of the current URI so its content can be analyzed
        ReplayCharSequence cs = null;
        try {
            HttpRecorder hr = curi.getHttpRecorder();
            if (hr == null) {
                throw new IOException("Why is recorder null here?");
            }
            cs = hr.getReplayCharSequence();
        } catch (IOException e) {
            curi.addLocalizedError(this.getName(), e,
                    "Failed get of replay char sequence " + curi.toString() + " " + e.getMessage());
            logger.log(Level.SEVERE, "Failed get of replay char sequence in "
                    + Thread.currentThread().getName(), e);
        }
        if (cs == null) {
            return;
        }
        // Convert the response body to a string
        String content = cs.toString();
        try {
            // Run the regular expression over the content to pull out the game keys
            Pattern pattern = Pattern.compile(CBS_FEATURE);
            Matcher matcher = pattern.matcher(content);
            // For every game key found, queue the matching shotchart link
            while (matcher.find()) {
                int start = matcher.start();
                int end = matcher.end();
                String aShotchartLink = SHOTCHART + content.substring(start, end);
                addLinkFromString(curi, aShotchartLink, "", Link.NAVLINK_HOP);
            }
            curi.linkExtractorFinished();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Record the link so the crawler can process it later
    private void addLinkFromString(CrawlURI curi, String uri, CharSequence context, char hopType) {
        try {
            curi.createAndAddLinkRelativeToBase(uri, context.toString(), hopType);
        } catch (URIException e) {
            if (getController() != null) {
                getController().logUriError(e, curi.getUURI(), uri);
            } else {
                logger.info("Failed createAndAddLinkRelativeToBase " + curi + ", " + uri + ", "
                        + context + ", " + hopType + ": " + e);
            }
        }
    }
}
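This way I crawled shotchart data for more than 10,000 games in total.


3. Manually collect each season's games into one folder and remove the All-Star games and postponed games. About 10 games were not fetched because of a broken link on one of the pages, so I saved those pages by hand.


4. From each shotchart page, extract the player information (CBSplayerID and player name) and the shot information, and write them to text files by season.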

package CBS;

import java.io.*;
import java.util.Iterator;
import java.util.TreeSet;

/*
 * Saves the complete shot data of each season from 2003 to 2011 to its own text file.
 * 20031028-20040615 1189 + 82
 * 20041102-20050623 1230 + 84
 * 20051101-20060620 1230 + 89
 * 20061031-20070614 1230 + 79
 * 20071030-20080617 1230 + 86
 * 20081028-20090618 1230 + 85
 * 20091027-20100617 1230 + 82
 * 20101026-20110612 1230 + 81
 * Damon Jones & Dwayne Jones 2007-08 Cavaliers
 * James Jones & Jumaine Jones 2006-07 Suns
 *
 * rescheduled game
 *
 * The source data contains wrong player information:
 * the same player under different IDs (Awvee Scorey), and the same ID under
 * different names (e.g. Yao Ming, Ming Yao).
 *
 * CBSplayerInfo (fields id, firstName, lastName; must be sortable for the TreeSet)
 * is a separate class that is not listed in this post.
 */
public class CBSShotchartParser {
    public static void main(String[] args) throws Exception {
        File directory = new File("E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\");
        String[] shotcharts = directory.list();
        //FileWriter fr0304 = new FileWriter("E:\\2003-04shotdata.txt");
        //FileWriter fr0405 = new FileWriter("E:\\2004-05shotdata.txt");
        //FileWriter fr0506 = new FileWriter("E:\\2005-06shotdata.txt");
        FileWriter fr0607 = new FileWriter("E:\\2006-07shotdata.txt");
        //FileWriter fr0708 = new FileWriter("E:\\2007-08shotdata.txt");
        //FileWriter fr0809 = new FileWriter("E:\\2008-09shotdata.txt");
        //FileWriter fr0910 = new FileWriter("E:\\2009-10shotdata.txt");
        //FileWriter fr1011 = new FileWriter("E:\\2010-11shotdata.txt");
        // Rescheduled games, or games whose shot data is empty
        FileWriter frReschGames = new FileWriter("E:\\rescheduledGames.txt");
        // Pages in which a player name contains a special space character
        FileWriter frSpecialName = new FileWriter("E:\\SpecialName.txt");
        TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
        //FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");

        for (int i = 0; i < shotcharts.length; i++) {
            String pageFile = "E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\" + shotcharts[i];
            // "NBA_20110101_CLE@CHI" becomes "20110101CLECHI"
            String gameKey = shotcharts[i].substring(4).replaceAll("_|@", "");
            String pageContent = "";
            BufferedReader br = new BufferedReader(new FileReader(pageFile));
            String aLine = br.readLine();
            while (aLine != null) {
                pageContent = pageContent + aLine;
                aLine = br.readLine();
            }
            br.close();

            // The shot data is the string literal assigned to currentShotData
            int cur = pageContent.indexOf("currentShotData = new String");
            int lcur = pageContent.indexOf("\"", cur);
            int rcur = pageContent.indexOf("\"", lcur + 1);
            String rawShotdata = pageContent.substring(lcur + 1, rcur);
            if (rawShotdata.equals("")) {
                // Possibly a rescheduled game (empty shot data)
                frReschGames.append(shotcharts[i] + "\r\n");
                continue;
            }
            // Shots are separated by "~"; write one shot per line, prefixed with the game key
            String shotData = gameKey + "," + rawShotdata.replaceAll("~", "\r\n" + gameKey + ",");

            // Player index (only CBSplayerId, first name and last name are kept)
            // e.g. (240304:Tony&nbsp;Parker,9,PG,8-20,1-3,0-0,17|) is kept as (240304,Tony,Parker)
            cur = pageContent.indexOf("playerDataHomeString = new String", rcur);
            lcur = pageContent.indexOf("\"", cur);
            rcur = pageContent.indexOf("\"", lcur + 1);
            String homePlayers = pageContent.substring(lcur + 1, rcur);
            cur = pageContent.indexOf("playerDataAwayString = new String", rcur);
            lcur = pageContent.indexOf("\"", cur);
            rcur = pageContent.indexOf("\"", lcur + 1);
            String awayPlayers = pageContent.substring(lcur + 1, rcur);
            String players = homePlayers + "|" + awayPlayers;

            for (int j = 0; j < players.length(); j++) {
                CBSplayerInfo aPlayer = new CBSplayerInfo();
                int cur1 = players.indexOf(":", j);
                aPlayer.id = players.substring(j, cur1);
                // Assumption: the separator literal is "&nbsp;" (6 characters, matching SPACE_LEN);
                // the published listing showed a plain space here, most likely an HTML-decoding artifact.
                int cur2 = players.indexOf("&nbsp;", cur1);
                // Special cases: in 20071103DALSAC the separator is a plain " ";
                // in 20071211INDCLE it is garbled by the character set (recorded for now,
                // handled later), so cur2 is -1.
                int SPACE_LEN = 6;
                if (cur2 == -1) {
                    frSpecialName.append(shotcharts[i] + "\r\n");
                    break;
                    //cur2 = players.indexOf(" ",cur1);
                    //SPACE_LEN = 1;
                }
                aPlayer.firstName = players.substring(cur1 + 1, cur2);
                int cur3 = players.indexOf(",", cur2);
                aPlayer.lastName = players.substring(cur2 + SPACE_LEN, cur3);
                playerInfoSet.add(aPlayer);    // add the player ID record
                j = players.indexOf("|", cur3);
                if (j == -1) break;
            }

            // Save the shotchart data to the file of the matching season
            if (gameKey.compareTo("200407") < 0) {
                //fr0304.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200507") < 0) {
                //fr0304.close();
                //fr0405.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200607") < 0) {
                //fr0405.close();
                //fr0506.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200707") < 0) {
                //fr0506.close();
                fr0607.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200807") < 0) {
                fr0607.close();
                //fr0708.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200907") < 0) {
                //fr0708.close();
                //fr0809.append(shotData + "\r\n");
            } else if (gameKey.compareTo("201007") < 0) {
                //fr0809.close();
                //fr0910.append(shotData + "\r\n");
            } else if (gameKey.compareTo("201107") < 0) {
                //fr0910.close();
                //fr1011.append(shotData + "\r\n");
            }
            System.out.println(shotcharts[i]);
        }
        //fr1011.close();
        // Close the writer of the season being processed so buffered rows are flushed
        // (closing an already closed writer has no effect)
        fr0607.close();

        // Save the player ID data
        Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
        while (it.hasNext()) {
            CBSplayerInfo nextPlayer = it.next();
            String playerInfo = nextPlayer.id + "\t" + nextPlayer.firstName + "\t" + nextPlayer.lastName;
            //frID.append(playerInfo + "\r\n");
        }
        frReschGames.close();
        frSpecialName.close();
        //frID.close();
    }
}
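
To make the row format concrete, here is a small standalone sketch of the two string transformations used above: deriving the game key from the file name, and splitting the "~"-separated shot string into one row per shot. The class name ShotRows and its methods are made up for illustration; the two sample shots are the ones quoted in the comment of the period-fixing program in step 5.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original programs.
class ShotRows {
    /** "NBA_20101026_HOU@LAL" becomes "20101026HOULAL". */
    static String gameKey(String fileName) {
        return fileName.substring(4).replaceAll("[_@]", "");
    }

    /** Splits the raw shot string on "~" and prefixes every shot with the game key. */
    static List<String> rows(String fileName, String rawShotData) {
        List<String> rows = new ArrayList<>();
        if (rawShotData.isEmpty()) {
            return rows; // empty shot data, e.g. a rescheduled game
        }
        String key = gameKey(fileName);
        for (String shot : rawShotData.split("~")) {
            rows.add(key + "," + shot);
        }
        return rows;
    }

    public static void main(String[] args) {
        String raw = "0,5.0,3,1622542,1,0,25,40,25~0,11:41,3,1622542,5,1,0,42,0";
        for (String row : rows("NBA_20101026_HOU@LAL", raw)) {
            System.out.println(row);
        }
    }
}
```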
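Some pages have encoding problems that make the space character in player names inconsistent; those pages are handled separately.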

package CBS;

import java.io.*;
import java.util.Iterator;
import java.util.TreeSet;

public class CBSspecialName {
    public static void main(String[] args) throws Exception {
        TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
        FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");
        // File for pages whose player names contain a special space character
        FileWriter frSpecialName = new FileWriter("E:\\SpecialNameSpace.txt");
        // List of page files recorded by CBSShotchartParser
        BufferedReader br = new BufferedReader(new FileReader("E:\\NBA\\data\\SpecialName.txt"));
        String str = br.readLine();
        int cnt = 1;
        while (str != null) {
            String page = "E:\\NBA\\data\\2003-2011CBSshotchart\\" + str;
            BufferedReader br2 = new BufferedReader(new FileReader(page));
            String pageContent = "";
            String aLine = br2.readLine();
            while (aLine != null) {
                pageContent = pageContent + aLine;
                aLine = br2.readLine();
            }
            br2.close();

            int cur = pageContent.indexOf("playerDataHomeString = new String");
            int lcur = pageContent.indexOf("\"", cur);
            int rcur = pageContent.indexOf("\"", lcur + 1);
            String homePlayers = pageContent.substring(lcur + 1, rcur);
            cur = pageContent.indexOf("playerDataAwayString = new String", rcur);
            lcur = pageContent.indexOf("\"", cur);
            rcur = pageContent.indexOf("\"", lcur + 1);
            String awayPlayers = pageContent.substring(lcur + 1, rcur);
            String players = homePlayers + "|" + awayPlayers;
            players = new String(players.getBytes("iso-8859-1"));

            for (int j = 0; j < players.length(); j++) {
                CBSplayerInfo aPlayer = new CBSplayerInfo();
                int cur1 = players.indexOf(":", j);
                aPlayer.id = players.substring(j, cur1);
                int cur2 = players.indexOf(" ", cur1);
                int cur2p = players.indexOf("|", cur1);
                // No space inside the current player record: fall back to "?"
                if (cur2 == -1 || (cur2 > cur2p && cur2p != -1)) {
                    cur2 = players.indexOf("?", cur1);    // the space as it appears under iso-8859-1
                }
                aPlayer.firstName = players.substring(cur1 + 1, cur2);
                int cur3 = players.indexOf(",", cur2);
                aPlayer.lastName = players.substring(cur2 + 1, cur3);
                playerInfoSet.add(aPlayer);    // add the player ID record
                System.out.println(str + ":" + aPlayer.display());
                j = players.indexOf("|", cur3);
                if (j == -1) break;
            }
            str = br.readLine();
        }
        frSpecialName.close();
        br.close();

        // Save the player ID data
        Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
        while (it.hasNext()) {
            CBSplayerInfo nextPlayer = it.next();
            String playerInfo = nextPlayer.id + ";" + nextPlayer.firstName + ";" + nextPlayer.lastName;
            frID.append(playerInfo + "\r\n");
        }
        frID.close();
    }
}

5. In the raw CBS shotchart data, the fourth quarter and all overtime periods are recorded as period 3; this is corrected in code.

package CBS;

/*
 * By default, the fourth quarter and all overtime periods are recorded as period 3
 * in the CBS data; this program renumbers them as 4, 5, 6, ... in order.
 * 20101026HOULAL,0,5.0,3,1622542,1,0,25,40,25
 * 20101026HOULAL,0,11:41,3,1622542,5,1,0,42,0
 * Rule: when period >= 3 within the same gameID, the previous shot's time is in
 * seconds (contains ".") and the next one contains minutes (":"), then period++.
 */
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;

public class CBSTime {
    public static void main(String args[]) throws Exception {
        String directoryPath = "E:\\2006-07shotdata\\";
        File directory = new File(directoryPath);
        String[] shotdata = directory.list();
        for (int i = 0; i < shotdata.length; i++) {
            BufferedReader br = new BufferedReader(new FileReader(directoryPath + shotdata[i]));
            String aLine = br.readLine();
            // The corrected copy is written next to the input, prefixed with "CBS"
            FileWriter fr = new FileWriter(directoryPath + "CBS" + shotdata[i]);
            String[] lastShot = new String[]{"", "", "", "", "", "", "", "", "", ""};
            while (aLine != null) {
                String[] newShot = aLine.split(",");
                // Clock jumps from seconds back to minutes: a new period has started
                if (lastShot[0].equals(newShot[0]) && lastShot[3].compareTo("3") >= 0
                        && lastShot[2].contains(".") && newShot[2].contains(":")) {
                    Integer tmp = Integer.parseInt(lastShot[3]) + 1;
                    newShot[3] = tmp.toString();
                }
                // Within one game the period never goes backwards
                if (lastShot[0].equals(newShot[0]) && newShot[3].compareTo(lastShot[3]) < 0)
                    newShot[3] = lastShot[3];
                lastShot = newShot;
                String aShot = lastShot[0] + "," + lastShot[1] + "," + lastShot[2] + "," + lastShot[3] + ","
                        + lastShot[4] + "," + lastShot[5] + "," + lastShot[6] + "," + lastShot[7] + ","
                        + lastShot[8] + "," + lastShot[9];
                fr.append(aShot + "\r\n");
                System.out.println(aShot);
                aLine = br.readLine();
            }
            br.close();
            fr.close();
        }
    }
}
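
The correction rule is easier to check when it is separated from the file handling. The following is a minimal sketch of the same rule as a pure function over a list of rows; the class name PeriodFixer and the method fixPeriods are made up here, and the periods are compared as integers rather than as strings. The two input rows are the ones from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original programs.
class PeriodFixer {
    /** Returns a copy of the rows with field 3 (the period) corrected. */
    static List<String> fixPeriods(List<String> rows) {
        List<String> fixed = new ArrayList<>();
        String lastGame = "";
        String lastClock = "";
        int lastPeriod = 0;
        for (String row : rows) {
            String[] f = row.split(",");
            int period = Integer.parseInt(f[3]);
            if (f[0].equals(lastGame)) {
                // Clock goes from seconds ("5.0") back to minutes ("11:41"): next period.
                boolean rollover = lastPeriod >= 3
                        && lastClock.contains(".") && f[2].contains(":");
                if (rollover) {
                    period = lastPeriod + 1;
                } else if (period < lastPeriod) {
                    period = lastPeriod; // carry the corrected period forward
                }
            }
            f[3] = Integer.toString(period);
            fixed.add(String.join(",", f));
            lastGame = f[0];
            lastClock = f[2];
            lastPeriod = period;
        }
        return fixed;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "20101026HOULAL,0,5.0,3,1622542,1,0,25,40,25",
                "20101026HOULAL,0,11:41,3,1622542,5,1,0,42,0");
        for (String row : fixPeriods(rows)) {
            System.out.println(row);
        }
        // The second row is printed with period 4 instead of 3.
    }
}
```

One limitation follows directly from the rule: if the last recorded shot of a period was taken with a minute or more left on the clock, its time contains ":" rather than ".", so the change of period is not detected.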

6. Once the shotdata text files are imported into a database, you can run some simple queries.
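
As a stand-in for such a query, the sketch below does the equivalent of a "count shots grouped by game and period" directly in memory, without a database. It relies only on the two columns whose meaning is confirmed by the programs above (field 0 is the game key, field 3 is the period); the class name ShotCounts and the method name are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original programs.
class ShotCounts {
    /** Counts rows per "gameKey P<period>", sorted by key. */
    static Map<String, Long> shotsPerGamePeriod(List<String> rows) {
        return rows.stream()
                .map(row -> row.split(","))
                .collect(Collectors.groupingBy(
                        f -> f[0] + " P" + f[3],   // grouping key: game key and period
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "20101026HOULAL,0,5.0,3,1622542,1,0,25,40,25",
                "20101026HOULAL,0,11:41,4,1622542,5,1,0,42,0");
        System.out.println(shotsPerGamePeriod(rows));
    }
}
```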

