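I no longer remember exactly which day it was, a few months ago, but I stumbled on the fact that CBS's live game pages expose shot location data directly in the HTML file. I must have been ecstatic at the time, since back then I still had no idea how to handle ESPN's XML data...
Looking back now at the CBS shot data I processed, I have to admit it is of marginal value: not very useful, yet a pity to throw away.
The processed files include a table mapping CBSplayerID to player names, shotdata for the eight seasons from 2003-04 to 2010-11, and a table explaining shotType.
The CBS shot totals differ noticeably from the overall season statistics, often by several hundred to over a thousand shots in either direction. The totals still agree to better than 98%, which should count as fairly good. At the level of a single game, however, the shotdata turns out to have inaccurate timestamps and wrongly attributed shooters (judged mainly against the play-by-play (PBP) timelines from the NBA official site and ESPN). That is a clear shortcoming compared with the ESPN XML shot data I obtained later.
One more point worth mentioning: the shot type descriptions from CBS and the NBA official site are quite rich, whereas ESPN's classification is somewhat coarser.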
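I once happened to come across an unusual putback dunk.
What I was originally curious about was whether the play counted as an assisted alley-oop or as an offensive-rebound putback dunk off a missed shot. Instead I unexpectedly found that only CBS describes it as a dunk, while ESPN and the NBA official site record it as a layup.
This suggests there are other inconsistent shot descriptions as well, although they should be a small minority. Given that the timelines do not agree either, reconciling the sources would be fairly troublesome, so I have not dealt with this issue yet.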
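A brief record of the basic crawling and processing steps:
1. For each of the eight seasons from 2003-04 to 2010-11, save the scoreboard file of one arbitrary day, then extract every game day of the eight seasons from those files.
For example: http://www.cbssports.com/nba/scoreboard/20110101.
This mainly means matching “<a href=\"/nba/scoreboard/” in the page and adding the 8-digit string that follows it to the set of game days.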
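The game-day extraction in step 1 can be sketched roughly as follows. This is a minimal illustration, not the code I actually used: the class name `GameDayExtractor`, the method name and the sample HTML fragment are all made up for the example, and the pattern tolerates an optional backslash before the quote because I am assuming the backslash in the anchor quoted above may be either Java string escaping or a literal character in the page.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the yyyyMMdd strings linked from a scoreboard page.
class GameDayExtractor {
    // The anchor prefix from the description, with an optional literal backslash
    // before the quote, followed by exactly 8 digits (captured as group 1).
    private static final Pattern GAME_DAY =
            Pattern.compile("<a href=\\\\?\"/nba/scoreboard/(\\d{8})");

    /** Returns the distinct game days found in the page, in ascending order. */
    static Set<String> extractGameDays(String html) {
        Set<String> days = new TreeSet<>();
        Matcher m = GAME_DAY.matcher(html);
        while (m.find()) {
            days.add(m.group(1));
        }
        return days;
    }

    public static void main(String[] args) {
        // Made-up fragment; the real page markup may differ.
        String html = "<a href=\"/nba/scoreboard/20110101\">Sat</a>"
                + "<a href=\"/nba/scoreboard/20110102\">Sun</a>"
                + "<a href=\"/nba/scoreboard/20110101\">Sat</a>";
        System.out.println(extractGameDays(html)); // prints [20110101, 20110102]
    }
}
```

A sorted set is used so that a day linked several times on the same page, or on the pages of different seasons, ends up in the seed list only once.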
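2. Use all the game-day links as seeds and configure a Heritrix job to fetch the shotchart page of every game.
This mainly means matching “NBA_[0-9]+_[A-Z]*@[A-Z]*” and adding the resulting links to the crawl queue.
In principle you do not have to write your own small Extractor subclass. Without it, though, you would need to configure link filtering rules in the job, and the default link extraction module pulls out many useless links that each have to be evaluated, so the crawl takes somewhat longer.
Another option is to fetch the list of game days with a download tool first, then extract the feature string of every game with a regular expression (this part needs some programming), and finally use the extracted links to fetch the shotchart pages. The downloading itself can be handled easily with Xunlei, and each file is named after the game's feature string.
For example, http://www.cbssports.com/nba/gametracker/shotchart/NBA_20110101_CLE@CHI is saved under the file name “NBA_20110101_CLE@CHI”.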
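For the download-tool route, the small amount of programming involved might look like the sketch below. It reuses the feature-string pattern and the shotchart URL prefix quoted in this post; the class name `ShotchartLinks`, its method names and the sample HTML are hypothetical, and it has nothing to do with Heritrix.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns scoreboard HTML into a list of shotchart URLs.
class ShotchartLinks {
    private static final Pattern FEATURE = Pattern.compile("NBA_[0-9]+_[A-Z]*@[A-Z]*");
    private static final String SHOTCHART =
            "http://www.cbssports.com/nba/gametracker/shotchart/";

    /** Returns one shotchart URL per distinct game feature string, in page order. */
    static Set<String> shotchartUrls(String scoreboardHtml) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = FEATURE.matcher(scoreboardHtml);
        while (m.find()) {
            urls.add(SHOTCHART + m.group());
        }
        return urls;
    }

    /** The saved file name is simply the last path segment of the URL. */
    static String fileNameFor(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Made-up fragment: the same game is usually linked more than once per page.
        String html = "<a href=\"/nba/gametracker/recap/NBA_20110101_CLE@CHI\">Recap</a>"
                + "<a href=\"/nba/gametracker/boxscore/NBA_20110101_CLE@CHI\">Box</a>";
        for (String url : shotchartUrls(html)) {
            System.out.println(url + " -> " + fileNameFor(url));
        }
    }
}
```

The resulting URL list can be pasted into any batch downloader.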
Still, I went with the programming approach...
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;
import org.archive.crawler.extractor.Link;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

public class CBSScoreboardExtractor extends Extractor {

    private static final long serialVersionUID = 5855731422080471017L;

    private static Logger logger =
            Logger.getLogger(CBSScoreboardExtractor.class.getName());

    public CBSScoreboardExtractor(String name) {
        this(name, "CBSSport Scoreboard Extractor");
    }

    public CBSScoreboardExtractor(String name, String description) {
        super(name, description);
    }

    // Feature string of each CBS game, extracted from the scoreboard page
    private static final String CBS_FEATURE = "NBA_[0-9]+_[A-Z]*@[A-Z]*";
    private static final String SHOTCHART = "http://www.cbssports.com/nba/gametracker/shotchart/";

    protected void extract(CrawlURI curi) {
        // Obtain the response body of the current URI so its content can be analyzed
        ReplayCharSequence cs = null;
        try {
            HttpRecorder hr = curi.getHttpRecorder();
            if (hr == null) {
                throw new IOException("Why is recorder null here?");
            }
            cs = hr.getReplayCharSequence();
        } catch (IOException e) {
            curi.addLocalizedError(this.getName(), e,
                    "Failed get of replay char sequence " + curi.toString()
                    + " " + e.getMessage());
            logger.log(Level.SEVERE, "Failed get of replay char sequence in "
                    + Thread.currentThread().getName(), e);
        }
        if (cs == null) {
            return;
        }
        // Convert the response content to a string
        String content = cs.toString();
        try {
            // Run the regular expression over the content
            // and pull out the game feature strings
            Pattern pattern = Pattern.compile(CBS_FEATURE);
            Matcher matcher = pattern.matcher(content);
            // For every match, build the shotchart link and queue it
            while (matcher.find()) {
                int start = matcher.start();
                int end = matcher.end();
                String aShotchartLink = SHOTCHART + content.substring(start, end);
                addLinkFromString(curi, aShotchartLink, "", Link.NAVLINK_HOP);
            }
            curi.linkExtractorFinished();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Record the link so that it is crawled later
    private void addLinkFromString(CrawlURI curi, String uri,
            CharSequence context, char hopType) {
        try {
            curi.createAndAddLinkRelativeToBase(uri, context.toString(), hopType);
        } catch (URIException e) {
            if (getController() != null) {
                getController().logUriError(e, curi.getUURI(), uri);
            } else {
                logger.info("Failed createAndAddLinkRelativeToBase "
                        + curi + ", " + uri + ", " + context
                        + ", " + hopType + ": " + e);
            }
        }
    }
}
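In total this fetched shotchart data for more than 10,000 games.
3. Manually gather the games of each season into one folder and remove the All-Star games and postponed games. About ten more games had not been fetched because of a broken link on some page, so I saved those pages by hand.
4. From each individual shotchart page, extract the player information (CBSplayerID and player name) and the shot information, and write them to text files by season.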
package CBS;

import java.io.*;
import java.util.Comparator;
import java.util.Iterator;
import java.util.TreeSet;

/**
 * Saves the complete shot data of each season from 2003 to 2011 as one text file.
 * 20031028-20040615 1189 + 82
 * 20041102-20050623 1230 + 84
 * 20051101-20060620 1230 + 89
 * 20061031-20070614 1230 + 79
 * 20071030-20080617 1230 + 86
 * 20081028-20090618 1230 + 85
 * 20091027-20100617 1230 + 82
 * 20101026-20110612 1230 + 81
 * Damon Jones & Dwayne Jones 2007-08 Cavaliers
 * James Jones & Jumaine Jones 2006-07 Suns
 *
 * rescheduled game
 *
 * The source data contains incorrect player information:
 * same player with different IDs, e.g. Awvee Scorey; same ID with different names, e.g. Yao Ming / Ming Yao
 */
public class CBSShotchartParser {
    public static void main(String[] args) throws Exception {
        File directory = new File("E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\");
        String[] shotcharts = directory.list();
        //FileWriter fr0304 = new FileWriter("E:\\2003-04shotdata.txt");
        //FileWriter fr0405 = new FileWriter("E:\\2004-05shotdata.txt");
        //FileWriter fr0506 = new FileWriter("E:\\2005-06shotdata.txt");
        FileWriter fr0607 = new FileWriter("E:\\2006-07shotdata.txt");
        //FileWriter fr0708 = new FileWriter("E:\\2007-08shotdata.txt");
        //FileWriter fr0809 = new FileWriter("E:\\2008-09shotdata.txt");
        //FileWriter fr0910 = new FileWriter("E:\\2009-10shotdata.txt");
        //FileWriter fr1011 = new FileWriter("E:\\2010-11shotdata.txt");
        // Rescheduled games, or games whose shot data is empty
        FileWriter frReschGames = new FileWriter("E:\\rescheduledGames.txt");
        // Pages in which a player name contains a special space character
        FileWriter frSpecialName = new FileWriter("E:\\SpecialName.txt");
        TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
        //FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");

        for (int i = 0; i < shotcharts.length; i++) {
            String pageFile = "E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\" + shotcharts[i];
            // "NBA_20110101_CLE@CHI" -> "20110101CLECHI"
            String gameKey = shotcharts[i].substring(4).replaceAll("_|@", "");
            // Read the whole page into one string (line breaks dropped)
            StringBuilder sb = new StringBuilder();
            BufferedReader br = new BufferedReader(new FileReader(pageFile));
            String aLine = br.readLine();
            while (aLine != null) {
                sb.append(aLine);
                aLine = br.readLine();
            }
            br.close();
            String pageContent = sb.toString();

            // The shot data is the quoted string after "currentShotData = new String"
            int cur = pageContent.indexOf("currentShotData = new String");
            int lcur = pageContent.indexOf("\"", cur);
            int rcur = pageContent.indexOf("\"", lcur + 1);
            String rawShotdata = pageContent.substring(lcur + 1, rcur);
            if (rawShotdata.equals("")) {
                // Possibly a rescheduled game (shot data is empty)
                frReschGames.append(shotcharts[i] + "\r\n");
                continue;
            }
            // Shots are separated by "~"; put each on its own line, prefixed with the game key
            String shotData = gameKey + "," + rawShotdata.replaceAll("~", "\r\n" + gameKey + ",");

            // Player index (keep only CBSplayerId, first name, last name)
            // e.g. (240304:Tony Parker,9,PG,8-20,1-3,0-0,17|) keeps (240304,Tony,Parker)
            cur = pageContent.indexOf("playerDataHomeString = new String", rcur);
            lcur = pageContent.indexOf("\"", cur);
            rcur = pageContent.indexOf("\"", lcur + 1);
            String homePlayers = pageContent.substring(lcur + 1, rcur);
            cur = pageContent.indexOf("playerDataAwayString = new String", rcur);
            lcur = pageContent.indexOf("\"", cur);
            rcur = pageContent.indexOf("\"", lcur + 1);
            String awayPlayers = pageContent.substring(lcur + 1, rcur);
            String players = homePlayers + "|" + awayPlayers;

            for (int j = 0; j < players.length(); j++) {
                CBSplayerInfo aPlayer = new CBSplayerInfo();
                int cur1 = players.indexOf(":", j);
                aPlayer.id = players.substring(j, cur1);
                // First and last name are matched on the 6-character entity "&nbsp;",
                // which is what SPACE_LEN = 6 below corresponds to.
                int cur2 = players.indexOf("&nbsp;", cur1);
                // Special cases: in 20071103DALSAC the separator is a plain space " ";
                // in 20071211INDCLE it is mojibake caused by the character set
                // (recorded here, handled separately later). cur2 is then -1.
                int SPACE_LEN = 6;
                if (cur2 == -1) {
                    frSpecialName.append(shotcharts[i] + "\r\n");
                    break;
                    //cur2 = players.indexOf(" ",cur1);
                    //SPACE_LEN = 1;
                }
                aPlayer.firstName = players.substring(cur1 + 1, cur2);
                int cur3 = players.indexOf(",", cur2);
                aPlayer.lastName = players.substring(cur2 + SPACE_LEN, cur3);
                playerInfoSet.add(aPlayer); // add the player ID record
                j = players.indexOf("|", cur3);
                if (j == -1) break;
            }

            // Save the shotchart data into the file of the matching season
            if (gameKey.compareTo("200407") < 0) {
                //fr0304.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200507") < 0) {
                //fr0304.close();
                //fr0405.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200607") < 0) {
                //fr0405.close();
                //fr0506.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200707") < 0) {
                //fr0506.close();
                fr0607.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200807") < 0) {
                fr0607.close();
                //fr0708.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200907") < 0) {
                //fr0708.close();
                //fr0809.append(shotData + "\r\n");
            } else if (gameKey.compareTo("201007") < 0) {
                //fr0809.close();
                //fr0910.append(shotData + "\r\n");
            } else if (gameKey.compareTo("201107") < 0) {
                //fr0910.close();
                //fr1011.append(shotData + "\r\n");
            }
            System.out.println(shotcharts[i]);
        }
        //fr1011.close();
        // Close the active season writer so buffered shot data is flushed
        // (closing an already closed writer is harmless)
        fr0607.close();

        // Save the player ID data
        Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
        while (it.hasNext()) {
            CBSplayerInfo nextPlayer = it.next();
            String playerInfo = nextPlayer.id
                    + "\t" + nextPlayer.firstName + "\t" + nextPlayer.lastName;
            //frID.append(playerInfo + "\r\n");
        }
        frReschGames.close();
        frSpecialName.close();
        //frID.close();
    }
}
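Some pages have an encoding problem in which the space character in player names is inconsistent; these are handled separately.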
package CBS;

import java.io.*;
import java.util.Iterator;
import java.util.TreeSet;

public class CBSspecialName {
    public static void main(String[] args) throws Exception {
        TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
        FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");
        // File for pages in which a player name contains a special space character
        FileWriter frSpecialName = new FileWriter("E:\\SpecialNameSpace.txt");
        // Each line of SpecialName.txt is the path of one page, relative to the shotchart folder
        BufferedReader br = new BufferedReader(new FileReader("E:\\NBA\\data\\SpecialName.txt"));
        String str = br.readLine();
        int cnt = 1;
        while (str != null) {
            String page = "E:\\NBA\\data\\2003-2011CBSshotchart\\" + str;
            BufferedReader br2 = new BufferedReader(new FileReader(page));
            String pageContent = "";
            String aLine = br2.readLine();
            while (aLine != null) {
                pageContent = pageContent + aLine;
                aLine = br2.readLine();
            }
            br2.close();

            // Extract the quoted home and away player strings
            int cur = pageContent.indexOf("playerDataHomeString = new String");
            int lcur = pageContent.indexOf("\"", cur);
            int rcur = pageContent.indexOf("\"", lcur + 1);
            String homePlayers = pageContent.substring(lcur + 1, rcur);
            cur = pageContent.indexOf("playerDataAwayString = new String", rcur);
            lcur = pageContent.indexOf("\"", cur);
            rcur = pageContent.indexOf("\"", lcur + 1);
            String awayPlayers = pageContent.substring(lcur + 1, rcur);
            String players = homePlayers + "|" + awayPlayers;
            // Re-encode so that the unusual space character shows up as "?"
            players = new String(players.getBytes("iso-8859-1"));

            for (int j = 0; j < players.length(); j++) {
                CBSplayerInfo aPlayer = new CBSplayerInfo();
                int cur1 = players.indexOf(":", j);
                aPlayer.id = players.substring(j, cur1);
                int cur2 = players.indexOf(" ", cur1);
                int cur2p = players.indexOf("|", cur1);
                // No plain space inside the current player entry
                if (cur2 == -1 || (cur2 > cur2p && cur2p != -1)) {
                    cur2 = players.indexOf("?", cur1); // the space as it appears under iso-8859-1
                }
                aPlayer.firstName = players.substring(cur1 + 1, cur2);
                int cur3 = players.indexOf(",", cur2);
                aPlayer.lastName = players.substring(cur2 + 1, cur3);
                playerInfoSet.add(aPlayer); // add the player ID record
                System.out.println(str + ":" + aPlayer.display());
                j = players.indexOf("|", cur3);
                if (j == -1) break;
            }
            str = br.readLine();
        }
        frSpecialName.close();
        br.close();

        // Save the player ID data
        Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
        while (it.hasNext()) {
            CBSplayerInfo nextPlayer = it.next();
            String playerInfo = nextPlayer.id + ";" + nextPlayer.firstName + ";" + nextPlayer.lastName;
            frID.append(playerInfo + "\r\n");
        }
        frID.close();
    }
}
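An alternative to special-casing these pages would be to normalize the separator before splitting the name. The sketch below is only an illustration of that idea and not what I ran: it assumes the odd separators are either the literal entity `&nbsp;` (which SPACE_LEN = 6 in the parser points to) or the non-breaking space character U+00A0, and the class `PlayerNames` and its methods are invented for the example. The mojibake case in 20071211INDCLE may well need different handling.

```java
// Hypothetical helper: splits one "id:First Last,..." player entry after
// normalizing the separator between first and last name.
class PlayerNames {
    /** Replaces the entity &nbsp; and the character U+00A0 with a plain space. */
    static String normalizeSpaces(String s) {
        return s.replace("&nbsp;", " ").replace('\u00A0', ' ');
    }

    /** Returns {id, firstName, lastName}; firstName is empty if the name has no space. */
    static String[] parseEntry(String entry) {
        String clean = normalizeSpaces(entry);
        int colon = clean.indexOf(':');
        int comma = clean.indexOf(',', colon);
        String fullName = clean.substring(colon + 1, comma);
        int space = fullName.indexOf(' ');
        String first = space < 0 ? "" : fullName.substring(0, space);
        String last = fullName.substring(space + 1);
        return new String[] { clean.substring(0, colon), first, last };
    }

    public static void main(String[] args) {
        // Entry format taken from the parser comment; the separator variants are assumptions.
        String[] p = parseEntry("240304:Tony&nbsp;Parker,9,PG,8-20,1-3,0-0,17");
        System.out.println(p[0] + "," + p[1] + "," + p[2]);
    }
}
```

With the separator normalized up front, the same parsing path could serve all pages, at the cost of assuming that every odd separator is one of the listed forms.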
5. In CBS's default shotchart data, the fourth quarter and every overtime period are all recorded as period 3; this is corrected programmatically.
package CBS;

/**
 * By default the fourth quarter and all overtime periods in CBS's period data are 3;
 * this program changes them to 4, 5, 6, ... in order.
 * 20101026HOULAL,0,5.0,3,1622542,1,0,25,40,25
 * 20101026HOULAL,0,11:41,3,1622542,5,1,0,42,0
 * Rule: for period >= 3 within the same gameID, when the previous shot's time is in
 * seconds (contains ".") and the next one contains minutes (":"), period++.
 */
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;

public class CBSTime {
    public static void main(String args[]) throws Exception {
        String directoryPath = "E:\\2006-07shotdata\\";
        File directory = new File(directoryPath);
        String[] shotdata = directory.list();
        for (int i = 0; i < shotdata.length; i++) {
            BufferedReader br = new BufferedReader(new FileReader(directoryPath + shotdata[i]));
            String aLine = br.readLine();
            // The corrected copy is written next to the input, with a "CBS" prefix
            FileWriter fr = new FileWriter(directoryPath + "CBS" + shotdata[i]);
            String[] lastShot = new String[]{"", "", "", "", "", "", "", "", "", ""};
            while (aLine != null) {
                String[] newShot = aLine.split(",");
                // Clock jumped from seconds back to minutes: a new period has started
                if (lastShot[0].equals(newShot[0])
                        && lastShot[3].compareTo("3") >= 0
                        && lastShot[2].contains(".")
                        && newShot[2].contains(":")) {
                    Integer tmp = Integer.parseInt(lastShot[3]) + 1;
                    newShot[3] = tmp.toString();
                }
                // Within a game the period never decreases; carry the corrected value forward
                if (lastShot[0].equals(newShot[0]) && newShot[3].compareTo(lastShot[3]) < 0)
                    newShot[3] = lastShot[3];
                lastShot = newShot;
                String aShot = lastShot[0] + "," + lastShot[1] + "," + lastShot[2] + ","
                        + lastShot[3] + "," + lastShot[4] + "," + lastShot[5] + ","
                        + lastShot[6] + "," + lastShot[7] + "," + lastShot[8] + "," + lastShot[9];
                fr.append(aShot + "\r\n");
                System.out.println(aShot);
                aLine = br.readLine();
            }
            br.close();
            fr.close();
        }
    }
}
6. Once the shotdata text files are imported into a database, some simple queries can be run on them.
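Even before a database is involved, the simplest kind of query, the number of shots per game, can be run straight on the text lines. The sketch below plays the role of a `GROUP BY` on the game key. It relies only on the first field being the game key that the parser prepends to every shot; the class name `ShotCounts` is invented for the example, and the third sample line is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts shot records per game key.
class ShotCounts {
    /** Maps each game key (text before the first comma) to its number of shot lines. */
    static Map<String, Integer> shotsPerGame(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            if (comma <= 0) {
                continue; // skip blank or malformed lines
            }
            counts.merge(line.substring(0, comma), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "20101026HOULAL,0,5.0,3,1622542,1,0,25,40,25",
                "20101026HOULAL,0,11:41,3,1622542,5,1,0,42,0",
                "20101026MIABOS,0,11:30,0,1,1,0,0,0,0"); // made-up line
        System.out.println(shotsPerGame(lines));
    }
}
```

Comparing such per-game counts with the box score attempts is one way to see where the CBS totals drift from the official statistics.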