  • Tuesday, December 5, 2017

PageRank-based College football (NCAA) ranking using OAAgraph

By: Jie Liu | Data Scientist

NCAA college football is American football played by teams of student athletes fielded by American universities, colleges, and military academies. It is one of the major weekend entertainments in the US, and the match results capture most of the Sunday headlines. One key focus is the team rankings, of which there are several: the CFP rankings, the AP Poll, the Coaches Poll, etc. These rankings look similar to each other, with slight differences. The ranking methods are not disclosed, but one thing is certain: the rankings are not based on a single algorithm or formula but are generated by votes or polls of sportswriters and other non-players. This is largely because the NCAA does not organize the season as a single tournament. Instead, the teams play in separate 'conferences' such as the Big Ten, the ACC, etc., so many pairs of teams never get a chance to play against each other. Moreover, there are teams like Notre Dame that play as 'independents' and thus are not confined to a single conference.

A natural question is: can we create our own algorithm to rank college football teams? The answer is yes, and it has been done already; see the GitHub repository https://github.com/joebluems/CollegeFootball2015. Its author used a customized PageRank algorithm to calculate the PageRank of each football team, which is then used to generate the rankings. The results look good for the 2015 college football season. Why does PageRank work? Is there a way to improve the development or analytic process? In this blog, we show how to carry out the same analysis using OAAgraph, an interface that integrates Oracle R Enterprise, part of the Advanced Analytics option, with the Parallel Graph AnalytiX (PGX) engine, part of the Oracle Spatial and Graph option.

NCAA College Football Data

Easy-to-read NCAA football results can be found at https://www.sports-reference.com/cfb/years/2017-schedule.html, which lists the games of all college football teams. The data can also be downloaded in CSV format.
The following R code connects to Oracle Database through Oracle R Enterprise, reads the NCAA data from the .csv file, and prepares it as a data frame.

library(ORE)
ore.connect(...)
# Read the game results and keep only the team names and scores
scores.df <- read.csv('scores.csv', header = TRUE, stringsAsFactors = FALSE)
scores.df <- scores.df[, c('Winner', 'Pts', 'Loser', 'Pts.1')]
colnames(scores.df) <- c('TEAM1', 'SCORE1', 'TEAM2', 'SCORE2')
# Strip a poll-rank prefix such as '(12) ' from the team names (gsub is vectorized)
scores.df$TEAM1 <- gsub('\\(.*\\) ', '', scores.df$TEAM1)
scores.df$TEAM2 <- gsub('\\(.*\\) ', '', scores.df$TEAM2)
# Attach a numeric ID to the winner (No1) and to the loser (No2)
teams.df <- read.csv('teams.csv', header = FALSE, stringsAsFactors = FALSE)
colnames(teams.df) <- c('No', 'TEAM')
scores.df <- merge(scores.df, teams.df, by.x = 'TEAM1', by.y = 'TEAM')
colnames(scores.df)[colnames(scores.df) == 'No'] <- 'No1'
scores.df <- merge(scores.df, teams.df, by.x = 'TEAM2', by.y = 'TEAM')
colnames(scores.df)[colnames(scores.df) == 'No'] <- 'No2'

Let us take a look at the data frame:

> head(scores.df)
              TEAM2           TEAM1 SCORE1 SCORE2 No1 No2
1 Abilene Christian      New Mexico     38     14 118  11
2 Abilene Christian  Colorado State     38     10  72  11
3         Air Force        Michigan     29     13  29  50
4         Air Force            Army     21      0  43  50
5         Air Force         Wyoming     28     14 123  50
6         Air Force San Diego State     28     24 144  50

Each row of the data frame is a record of one match. TEAM1 and TEAM2 are team names in the match, and here TEAM1 is the winner and TEAM2 the loser. We also record the scores in SCORE1 and SCORE2. For ease of indexing the team, we also created IDs for the teams, which are columns No1, No2.
With this data frame ready, we can start our analysis.
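
The name cleanup and ID lookup above can be tried on a few made-up strings. The sketch below uses only base R; the input strings and the small ID table are hypothetical and only assume that ranked teams appear as '(N) Team' in the CSV.

```r
# Made-up examples of how team names may appear in the CSV (assumed format)
raw_names <- c("(1) Alabama", "Ohio State", "(10) Miami (FL)", "Miami (FL)")

# Same pattern as in the code above: drop a parenthesised prefix followed by a space
clean_names <- gsub("\\(.*\\) ", "", raw_names)
print(clean_names)

# Hypothetical ID table, mirroring the two-column layout of teams.csv
teams <- data.frame(No = 1:3,
                    TEAM = c("Alabama", "Miami (FL)", "Ohio State"),
                    stringsAsFactors = FALSE)

# Look up the ID of each cleaned name (NA would signal a name that failed to match)
ids <- teams$No[match(clean_names, teams$TEAM)]
print(ids)
```

Note that the '(FL)' suffix survives because it is not followed by a space. A name that comes back as NA in the lookup would be silently dropped by merge(), so a check like this can be worth running on the real data.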

Generate Graph by OAAgraph

Now we show how to create a graph with OAAgraph based on the data frame scores.df. Here, we model each team as a node in the graph.
The following code creates a table TEAM in Oracle Database containing all nodes, i.e. the team IDs and names.

VID <- teams.df$No
NAME <- teams.df$TEAM
nodes.df <- data.frame(VID, NAME)
ore.drop(table = 'TEAM')
ore.create(nodes.df, table = 'TEAM')

A match between two teams is modeled as a directed edge from the loser to the winner. We can interpret the edge, or relation, as 'beaten by'. The following code creates a table of edges, TEAM_EDGES. Each row contains EID (the edge ID), SVID (the source node ID, i.e. the losing team in the match), DVID (the destination node ID, i.e. the winning team), EL (the label of the edge, 'beaten_by'), and further edge properties such as MARGIN, which is computed as the ratio of the score difference to the winning score. By design, an edge property like this can hold any numerical value. We will show how this property is used later.

# Drop games with missing scores
scores.df <- scores.df[!is.na(scores.df$SCORE1) & !is.na(scores.df$SCORE2), ]
# If team1 is beaten by team2, then team1 -> team2
EID <- rownames(scores.df)
SVID <- ifelse(scores.df$SCORE1 > scores.df$SCORE2, scores.df$No2, scores.df$No1)
DVID <- ifelse(scores.df$SCORE1 > scores.df$SCORE2, scores.df$No1, scores.df$No2)
EL <- rep('beaten_by', nrow(scores.df))
# Score difference relative to the winning score of each game (pmax is element-wise)
MARGIN <- abs(scores.df$SCORE2 - scores.df$SCORE1) / pmax(scores.df$SCORE1, scores.df$SCORE2)
edges.df <- data.frame(EID, SVID, DVID, EL, MARGIN)
ore.drop(table = 'TEAM_EDGES')
ore.create(edges.df, table = 'TEAM_EDGES')
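
To see what the edge table looks like without a database connection, here is a minimal base-R sketch on three made-up games. The team names and scores are hypothetical, and MARGIN follows the description above (score difference divided by the winning score of that game).

```r
# Three made-up games; in the last one TEAM1 is the loser, to exercise both branches
games <- data.frame(
  TEAM1 = c("A", "A", "C"), SCORE1 = c(35, 21, 0),
  TEAM2 = c("B", "C", "B"), SCORE2 = c(14, 20, 28),
  stringsAsFactors = FALSE
)

team1_won <- games$SCORE1 > games$SCORE2
edges <- data.frame(
  SVID   = ifelse(team1_won, games$TEAM2, games$TEAM1),  # source: the loser
  DVID   = ifelse(team1_won, games$TEAM1, games$TEAM2),  # destination: the winner
  EL     = "beaten_by",
  # Element-wise maximum gives the winning score of each individual game
  MARGIN = abs(games$SCORE1 - games$SCORE2) / pmax(games$SCORE1, games$SCORE2),
  stringsAsFactors = FALSE
)
print(edges)
```

A shutout yields a MARGIN of 1 and a one-point win yields a value close to 0, so the property stays within (0, 1] for any decided game.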

Note that OREdplyr can also be used for the data transformation if the data already resides in the database.
Once the node and edge tables are in Oracle Database, we are ready to create the graph. A single command does this, given the edge and node tables.

try(oaa.rm(graph), silent = TRUE)
graph <- oaa.graph(TEAM_EDGES, TEAM, "teamGraph")
> graph
Graph Name: teamGraph 
Number of Nodes: 209 
Number of Edges: 781 
Persistent Graph: FALSE 
Node Properties: NAME 
Edge Properties: MARGIN 

PageRank

The reason we model the edges in this way is to make use of PageRank. PageRank was originally used by Google to rank webpages in search results. It provides a powerful way to derive a ranking, or 'reputation', of a webpage based on how many times it is quoted or linked by other webpages. A brief introduction to PageRank can be found in this link.
In the language of graphs, each webpage is modeled as a node, and a link from page A to page B is modeled as an edge from A to B. The PageRank of a node is higher when it is referenced by many nodes, and higher still when those nodes have high ranks themselves.
The same idea can be applied to college football ranking. We can model each college as a node and the match between two colleges as an edge. Here, the edge runs from the loser to the winner, so it means 'beaten by'. Therefore, if team A has more incoming edges, i.e. a higher in-degree, team A has beaten more teams. If the teams beaten by team A have a high PageRank, then team A will naturally have a high PageRank as well. In this way, we can use PageRank to rank the football teams.
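
To make the idea concrete, here is a minimal base-R sketch of PageRank by power iteration on a made-up 'beaten_by' graph. It is an illustration under stated assumptions, not the PGX implementation: it uses the normalized form (1-d)/N + d*(...) so that the scores sum to 1, and it spreads the rank of unbeaten teams (nodes without outgoing edges) evenly over all teams. The function name pagerank_sketch and the teams A to D are hypothetical.

```r
pagerank_sketch <- function(edges, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  loser  <- as.character(edges$loser)
  winner <- as.character(edges$winner)
  teams  <- sort(unique(c(loser, winner)))
  n <- length(teams)
  # M[i, j] = number of times team j lost to team i
  M <- matrix(0, n, n, dimnames = list(teams, teams))
  for (k in seq_along(loser)) {
    M[winner[k], loser[k]] <- M[winner[k], loser[k]] + 1
  }
  out_deg  <- colSums(M)      # number of losses per team
  dangling <- out_deg == 0    # unbeaten teams have no outgoing edge
  # Each loser splits its rank equally among the teams that beat it
  M[, !dangling] <- sweep(M[, !dangling, drop = FALSE], 2, out_deg[!dangling], "/")
  pr <- rep(1 / n, n)
  for (iter in seq_len(max_iter)) {
    new_pr <- (1 - damping) / n +
      damping * (as.vector(M %*% pr) + sum(pr[dangling]) / n)
    converged <- sum(abs(new_pr - pr)) < tol
    pr <- new_pr
    if (converged) break
  }
  sort(setNames(pr, teams), decreasing = TRUE)
}

# A beat B, C and D; B beat C and D; C beat D
edges <- data.frame(loser  = c("B", "C", "C", "D", "D", "D"),
                    winner = c("A", "A", "B", "A", "B", "C"),
                    stringsAsFactors = FALSE)
scores <- pagerank_sketch(edges)
print(round(scores, 4))
```

In this toy season the order comes out as A, B, C, D, matching the intuition that beating teams which have themselves beaten others is worth more.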
To rank the college football teams, let us compute PageRank and see what the result looks like. In OAAgraph, a single command can be called to compute PageRank.

pagerank(graph, error = 0.0001, damping = 0.2, maxIterations = 1000)

This line of code computes PageRank and attaches the score to each node. To retrieve the PageRank, we can use a PGQL query in R:

cursor <- oaa.cursor(graph,
                     query = "select n.NAME, n.pagerank  where (n) order by n.pagerank desc")
oaa.next(cursor, 30)

The query returns the top 30 teams sorted by PageRank. The ranking is:

                 n.NAME  n.pagerank
1               Clemson 0.006036588
2                Auburn 0.006022145
3       Central Florida 0.005866008
4            Iowa State 0.005833792
5               Georgia 0.005796430
6            Miami (FL) 0.005606519
7              Oklahoma 0.005581062
8          Fresno State 0.005540282
9       Louisiana State 0.005534469
10      Texas Christian 0.005520499
11           Pittsburgh 0.005513876
12      San Diego State 0.005478629
13          North Texas 0.005453995
14               Toledo 0.005428720
15              Memphis 0.005427890
16           Ohio State 0.005398382
17            Wisconsin 0.005387744
18           Notre Dame 0.005381033
19          Wake Forest 0.005376808
20              Alabama 0.005368288
21        Virginia Tech 0.005360752
22 North Carolina State 0.005338435
23          Boise State 0.005337153
24       South Carolina 0.005333423
25     Florida Atlantic 0.005315308
26   Southern Methodist 0.005298519
27     Central Michigan 0.005277379
28           Penn State 0.005271921
29             Stanford 0.005271275
30     Washington State 0.005264159

At the time of writing, the AP Poll ranking (week 14) is:

RK  TEAM
1   Clemson
2   Oklahoma
3   Wisconsin
4   Auburn
5   Alabama
6   Georgia
7   Miami
8   Ohio State
9   Penn State
10  TCU
11  USC
12  UCF
13  Washington
14  Stanford
15  Notre Dame
16  Memphis
17  LSU
18  Oklahoma State
19  Michigan State
20  Northwestern
21  Washington State
22  Virginia Tech
23  South Florida
24  Mississippi State
25  Fresno State

PageRank places similar teams at the top as the AP Poll does, but in quite a different order. Some teams are ranked very high by PageRank but much lower, or not at all, in the AP Poll, such as Central Florida, Iowa State, and Fresno State. Why is that? Let us take a look at Central Florida. We can run PGQL queries to find all the teams that lost to, or won against, Central Florida:

> cursor <- oaa.cursor(graph, 
+                      query ="SELECT f.NAME,  g.NAME WHERE (f )-[e:beaten_by]->(g WITH NAME = 'Central Florida')")
> oaa.next(cursor, 20)
                  f.NAME          g.NAME
1            Connecticut Central Florida
2  Florida International Central Florida
3     Southern Methodist Central Florida
4             Cincinnati Central Florida
5          East Carolina Central Florida
6               Maryland Central Florida
7                Memphis Central Florida
8                   Navy Central Florida
9          South Florida Central Florida
10                Temple Central Florida
11           Austin Peay Central Florida
> cursor <- oaa.cursor(graph, 
+                      query ="SELECT f.NAME,  g.NAME WHERE (f WITH NAME = 'Central Florida')-[e:beaten_by]->(g )")
> oaa.next(cursor, 20)
Error in oaa.next.default(cursorObj, n) : cursor is empty

The error indicates that the query for the teams that beat Central Florida returns nothing. In fact, Central Florida won all of its first 11 games! That is why it is ranked so high. In the AP Poll, however, Central Florida ranks only 12th.
Another big difference is Alabama, which should be ranked much higher but is only 20th here. One explanation is that PageRank places a lot of emphasis on the win/loss counts. The AP Poll, on the other hand, considers far more than win/loss counts, such as in-game statistics like the quarterback's running distance, interceptions, turnovers, etc.

Weighted PageRank – Score Margin

Let us add more data to the calculation and see if we can improve the ranking. One idea is to consider the score margin of each win. If a team tends to win by a large margin, that team should be ranked higher. We incorporate the score margin by using weighted PageRank. This algorithm allows a weight to be attached to each edge and ranks nodes higher when their incoming edges carry more weight.
In OAAgraph, the code can be written as

> pagerank(graph, 0.0001, 0.1, 1000, variant = 'weighted', weightPropName = 'MARGIN')
oaa.cursor over: ID, weighted_pagerank 
position: 0 
size: 209 
> 
> cursor <- oaa.cursor(graph, 
+                      query = "select n.NAME, n.weighted_pagerank where (n) order by n.weighted_pagerank desc")
> oaa.next(cursor, 30)
                 n.NAME n.weighted_pagerank
1               Clemson         0.006511112
2                Auburn         0.006478130
3               Georgia         0.006234016
4       Central Florida         0.006195107
5            Notre Dame         0.006013695
6            Ohio State         0.005951788
7       Texas Christian         0.005940757
8              Oklahoma         0.005940566
9            Iowa State         0.005790163
10    Mississippi State         0.005757828
11            Wisconsin         0.005735430
12              Alabama         0.005698574
13           Miami (FL)         0.005693818
14           Penn State         0.005689439
15               Toledo         0.005595128
16               Oregon         0.005588613
17     Florida Atlantic         0.005514597
18           Pittsburgh         0.005501906
19         Fresno State         0.005501542
20              Memphis         0.005467258
21      Louisiana State         0.005453052
22        Virginia Tech         0.005430484
23  Southern California         0.005430218
24           Washington         0.005387747
25      San Diego State         0.005382916
26             Stanford         0.005359146
27             Missouri         0.005350855
28           Louisville         0.005331221
29 North Carolina State         0.005315241
30          Boise State         0.005286847

The ranking looks much improved! Alabama is now ranked 12th. Another prominent difference is that Notre Dame ranks very high (5th), compared with 15th in the AP Poll. This is because Notre Dame won quite a few games by a large margin:
Temple (49-16), Boston College (49-20), Michigan State (38-18), Miami (OH) (52-17), USC (49-14).
Although the ranking still does not match the AP Poll, we can see that adding weights to the edges changes the ranking through the weighted PageRank algorithm. We believe the ranking can be improved further if more match statistics are added.
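
For readers who want to see how edge weights enter the computation, here is a minimal base-R sketch of a weighted variant. It rests on assumptions rather than on the PGX source: each loser's rank is split among the teams that beat it in proportion to the edge weight, the scores are normalized to sum to 1, and the rank of unbeaten teams is spread evenly. The function name weighted_pagerank_sketch and the teams L, X and Y are hypothetical.

```r
weighted_pagerank_sketch <- function(edges, weight, damping = 0.85,
                                     tol = 1e-10, max_iter = 1000) {
  loser  <- as.character(edges$loser)
  winner <- as.character(edges$winner)
  teams  <- sort(unique(c(loser, winner)))
  n <- length(teams)
  # W[i, j] = total weight of the edges from loser j to winner i
  W <- matrix(0, n, n, dimnames = list(teams, teams))
  for (k in seq_along(loser)) {
    W[winner[k], loser[k]] <- W[winner[k], loser[k]] + weight[k]
  }
  out_w    <- colSums(W)
  dangling <- out_w == 0
  # Normalize each loser's outgoing weights so that they sum to 1
  W[, !dangling] <- sweep(W[, !dangling, drop = FALSE], 2, out_w[!dangling], "/")
  pr <- rep(1 / n, n)
  for (iter in seq_len(max_iter)) {
    new_pr <- (1 - damping) / n +
      damping * (as.vector(W %*% pr) + sum(pr[dangling]) / n)
    converged <- sum(abs(new_pr - pr)) < tol
    pr <- new_pr
    if (converged) break
  }
  setNames(pr, teams)
}

# Team L lost to both X and Y; X won by a wide margin, Y by a narrow one
edges <- data.frame(loser = c("L", "L"), winner = c("X", "Y"),
                    stringsAsFactors = FALSE)
unweighted <- weighted_pagerank_sketch(edges, weight = c(1, 1))
weighted   <- weighted_pagerank_sketch(edges, weight = c(0.8, 0.2))
print(round(unweighted, 4))
print(round(weighted, 4))
```

With equal weights X and Y tie; with margins of 0.8 and 0.2, X moves ahead of Y, which is the effect seen for Notre Dame above.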

Adjustment with Number of Losses

One particular flaw in using PageRank to rank the teams is that the algorithm only looks at the teams that each team has beaten. Recall that PageRank is computed as
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where PR() is the PageRank score, d is the damping factor, T1, ..., Tn are the teams beaten by A, and C(Ti) is the number of teams that Ti has lost to.
From this formula, we can see that there is no information about the teams that beat team A! All the information used here is about teams that lost to A. That gives us a biased ranking: as long as a team has beaten excellent teams, it will receive a high ranking, no matter how many games it has lost.
This can be seen with Iowa State. This team is not ranked high in the AP Poll, but it receives a high PageRank in both the vanilla and the weighted variant. Let us take a look at this team.

> cursor <- oaa.cursor(graph, 
+                      query ="SELECT f.NAME,  g.NAME WHERE (f )-[e:beaten_by]->(g WITH NAME = 'Iowa State')")
> oaa.next(cursor, 20)
           f.NAME     g.NAME
1 Texas Christian Iowa State
2          Baylor Iowa State
3   Northern Iowa Iowa State
4        Oklahoma Iowa State
5           Akron Iowa State
6          Kansas Iowa State
7      Texas Tech Iowa State
> 
> 
> cursor <- oaa.cursor(graph, 
+                      query ="SELECT f.NAME,  g.NAME WHERE (f WITH NAME = 'Iowa State')-[e:beaten_by]->(g )")
> oaa.next(cursor, 20)
      f.NAME         g.NAME
1 Iowa State Oklahoma State
2 Iowa State           Iowa
3 Iowa State   Kansas State
4 Iowa State          Texas
5 Iowa State  West Virginia

It turns out that Iowa State has 5 losses, which explains why it does not rank high in the AP Poll. On the other hand, ISU beat highly ranked teams such as TCU and Oklahoma, as well as Texas Tech, which significantly boosts its PageRank score.
To avoid this defect, we can adjust the PageRank score by penalizing teams for their losses. The idea is to multiply the score by a factor that decreases monotonically with the number of losses.
Here we use an empirical formula:
PageRank / (a * number of losses + b)
The parameter b avoids a divide-by-zero error when the team has no losses. Both a and b can be chosen by design. Here we choose a and b such that the ranking looks as close as possible to the AP Poll.

Let us first calculate the out degree of each node:

degree(graph, "out", "nLost")

This value is attached to each node under the property name 'nLost', i.e. the number of losses. Then we calculate the PageRank score.

pagerank(graph, error = 0.0001, damping = 0.6, maxIterations = 1000)
cursor <- oaa.cursor(graph,
                     query = "select n.NAME, n.pagerank, n.nLost where (n) order by n.pagerank desc")
rank.df <- oaa.next(cursor, 30)

After the PageRank is obtained, we compute the adjusted score:

rank.df$SCORE <- rank.df$n.pagerank/(0.4*rank.df$n.nLost + 0.9)
rank.df[order(-rank.df$SCORE),]
                 n.NAME  n.pagerank n.nLost       SCORE
1               Clemson 0.016961654       1 0.013047426
2       Central Florida 0.011323213       0 0.012581347
3                Auburn 0.018990912       2 0.011171124
4             Wisconsin 0.008984433       0 0.009982703
5              Oklahoma 0.012484544       1 0.009603495
6            Miami (FL) 0.012093970       1 0.009303054
7               Georgia 0.011239293       1 0.008645610
8               Alabama 0.010776809       1 0.008289853
9       Louisiana State 0.012340969       3 0.005876652
10           Ohio State 0.009626590       2 0.005662700
11           Notre Dame 0.010977572       3 0.005227415
12           Washington 0.008674728       2 0.005102781
13      Texas Christian 0.008620726       2 0.005071015
14  Southern California 0.008586252       2 0.005050737
15           Iowa State 0.014511512       5 0.005003970
16           Penn State 0.008371257       2 0.004924269
17      San Diego State 0.008303874       2 0.004884632
18             Stanford 0.010089052       3 0.004804311
19     Washington State 0.010070243       3 0.004795354
20         Fresno State 0.008914050       3 0.004244786
21          Boise State 0.008815810       3 0.004198005
22       Michigan State 0.008096342       3 0.003855401
23        Virginia Tech 0.007767759       3 0.003698933
24       Oklahoma State 0.007471362       3 0.003557791
25             Syracuse 0.013993134       8 0.003412959
26 North Carolina State 0.008332988       4 0.003333195
27                 Iowa 0.009211545       5 0.003176395
28           Pittsburgh 0.011743373       7 0.003173885
29    Mississippi State 0.007887910       4 0.003155164
30          Wake Forest 0.008395616       5 0.002895040

The result looks much better. Iowa State is now ranked 15th and Alabama 8th. We believe we could get even closer to the AP Poll rankings by taking more match data into account, but that is beyond the scope of this blog.
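
The adjustment is easy to check by hand. The base-R sketch below recomputes the adjusted score for three rows of the table above, using the same a = 0.4 and b = 0.9 as in the code; the helper name adjust_score is hypothetical, and the input values are copied from the printed (rounded) output.

```r
# PageRank and loss counts copied from the table above
rank.df <- data.frame(
  NAME     = c("Clemson", "Central Florida", "Iowa State"),
  pagerank = c(0.016961654, 0.011323213, 0.014511512),
  nLost    = c(1, 0, 5),
  stringsAsFactors = FALSE
)

# Empirical penalty: PageRank / (a * number of losses + b)
adjust_score <- function(pagerank, n_lost, a = 0.4, b = 0.9) {
  pagerank / (a * n_lost + b)
}

rank.df$SCORE <- adjust_score(rank.df$pagerank, rank.df$nLost)
print(rank.df[order(-rank.df$SCORE), ])
```

Iowa State has a higher raw PageRank than Central Florida, but its five losses divide the score by 2.9, while the unbeaten Central Florida is divided only by 0.9, so the two swap places.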

Conclusion

In this blog, we demonstrated how to use OAAgraph to generate rankings for NCAA football teams. The results show that the top teams are close to those of the AP Poll, with a certain bias due to the lack of data. By adding the score margin, we also demonstrated the weighted PageRank algorithm and generated rankings that favor teams with higher score margins. By adjusting the PageRank score with the number of losses, we brought the ranking closer to the AP Poll. Perhaps one day AI rankings will become a primary ranking method for college football!
