Leaderboard


Ingix

Thanks again for the explanation. From the outside sometimes a bug looks like a strange feature, and sometimes a strange feature looks like a bug.

markus

I analyzed the full leaderboard at Scavenger (http://dominion.lauxnet.com/leaderboard/) and I noticed that the initial phi=2 seems to be chosen too high. Compared to chess, luck just makes it more difficult to beat opponents consistently – and there are fewer pros. So currently 95% of players (with at least 20 games) have a mu between [-1.95,1.68]. Whereas for a new player it is implicitly assumed that the 95% range is [-4,4].
This produces odd results for players with few games, who can reach very high or low mu. Currently, if someone new beats a mu>2 player just once, they'll end up with a mu around 2.5, which doesn't sound right given the luck involved in Dominion. (The change in mu is approximately phi^2*(wins-expected wins).)

Therefore, I'd suggest lowering the parameter to phi=1 for new players and also capping phi there. (Capping was suggested in Glicko-1, and it seems reasonable to me that someone who has played shouldn't have a higher rating deviation than someone new.)

After a couple of months, it would actually be possible to estimate the parameters for initial phi, sigma, and tau, that give the best results in predicting game outcomes.


I also have some thoughts on the definition of a good match. I think comparing levels is bad especially at the lower end of the leaderboard. There are some people with a very low mu and high phi, which results in low levels and makes them a supposedly bad match for many opponents. Whereas actually we are very uncertain that their mu is that low.

My preferred way would be to use expected win probabilities and let players set a range. The advantage would be that a certain winning probability is more understandable for the layman than some level difference.

If you want to keep a criterion closer to the current system, I would define the range of suitable opponents as [mu-phi-x, mu+phi+x] with some cutoff x (x=0.5 seems reasonable to me). That would mean that there are more possible opponents for a player with a high phi (the system doesn't know the skill well) than for a player with a low phi (good estimate of the skill).
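To make the two criteria concrete, here is a small sketch. The logistic win-probability curve, the cutoff x=0.5, and the 35-65% band are illustrative assumptions of mine, not the actual implementation:

```python
import math

def win_probability(mu_a, mu_b):
    # Predicted win probability from the mu difference (plain logistic curve,
    # ignoring the g(phi) correction for opponent uncertainty).
    return 1.0 / (1.0 + math.exp(-(mu_a - mu_b)))

def opponent_in_range(me, opp, x=0.5):
    # me, opp: (mu, phi) pairs. The interval criterion from the post:
    # opp's mu must lie in [mu - phi - x, mu + phi + x]. A real matchmaker
    # would probably require this check in both directions.
    mu, phi = me
    return mu - phi - x <= opp[0] <= mu + phi + x

def suitable_by_win_prob(me, opp, lo=0.35, hi=0.65):
    # Alternative criterion: players set an acceptable win-probability range
    # (the 35-65% band here is a hypothetical example).
    return lo <= win_probability(me[0], opp[0]) <= hi
```

Note how a high-phi player (phi=1.0) accepts an opponent at mu=1.2 under the interval criterion, while a well-estimated player (phi=0.2) at the same mu does not.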

A more sophisticated matching algorithm could also check the distribution of players that started a match say in the last 30 minutes to determine a good cutoff. (When there are more players and/or a player is more in the middle of the distribution, you can find a more equal opponent within a certain time than for players in the tail of the distribution.)

Polk5440

Quote from: markus on 26 April 2017, 12:13:54 PM
I analyzed the full leaderboard at Scavenger (http://dominion.lauxnet.com/leaderboard/) and I noticed that the initial phi=2 seems to be chosen too high....

Therefore, I'd suggest to lower the parameter to phi=1 for new players and also to cap phi there....

After a couple of months, it would actually be possible to estimate the parameters for initial phi, sigma, and tau, that give the best results in predicting game outcomes....

My preferred way would be to use expected win probabilities and let players set a range. The advantage would be that a certain winning probability is more understandable for the layman than some level difference.

I agree wholeheartedly with this.

markus

I looked a bit more into potential improvements of the algorithm. I think there should be enough data for rated 2p-games (more than 300,000) to estimate a better initial phi and sigma. If there's an easy way to provide a list of these game outcomes, I'm also happy to play around with this or the suggestions from Glicko-boost below.

I think that having a variable sigma doesn't do much. (That was the innovation from Glicko to Glicko-2.) In theory it's nice if more consistent players have a lower phi (more certainty about their rating), but in practice that's hard for the algorithm to pick up. funkdoc has the minimum sigma=0.0595 in the top 20 now. Relative to sigma=0.06, that means his phi after one day of not playing increases from 0.1597 to 0.1704 instead of 0.1706.
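A quick check of those numbers, using the overnight update phi_new = sqrt(phi^2 + sigma^2) from Glicko-2:

```python
import math

def phi_after_one_day(phi, sigma):
    # Overnight increase of the rating deviation in Glicko-2.
    return math.sqrt(phi ** 2 + sigma ** 2)

# The gap between the minimum observed sigma (0.0595) and the default (0.06)
# is tiny after one idle day:
print(round(phi_after_one_day(0.1597, 0.0595), 4))  # 0.1704
print(round(phi_after_one_day(0.1597, 0.06), 4))    # 0.1706
```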

Also, in the example in Figures 3 and 4 of Mark Glickman's paper (http://glicko.net/research/dpcmsv.pdf) it's apparent that sigma doesn't change much. And most of the difference between the top two panels comparing Glicko and Glicko-2 arises because sigma=0.01 in the constant-variance case, whereas it is initialized at sigma=0.05 in the stochastic-variance case.
The bottom line is that it doesn't hurt for now, but I think there are better ways in which making the algorithm more complex improves ratings.


I like the extensions that Glickman used in his more recent Glicko-boost algorithm (http://glicko.net/glicko/glicko-boost.pdf):

1)   First-mover advantage: in the long run it doesn't really matter for the rating, since players play roughly 50% of their games as first player, but in the short run it causes unnecessary rating fluctuations. This is overcome by adding something to the mu of the first player when evaluating the outcome. In the simplest case it would be a constant, but one could also estimate more complicated forms (e.g. depending on the difference in skill, or the absolute level of skill).

2)   Phi boost based on exceptional performance: the idea is that on average players could have a lower phi, making their ratings less swingy once they have stabilized. But if someone plays exceptionally strong, their phi is increased to make climbing the leaderboard easier and to reflect more uncertainty about their rating. This serves a similar purpose as variable sigma in the current system, but it apparently works better – at least Glickman used it for the more recent algorithm.

3)   Iterating on ratings update: that is useful mainly for games involving newer players. The idea is to not use the opponent's rating from the beginning of the day, but from the end of the day. So if the opponent I beat today also lost a lot of other games, I'll get less of a boost than if they won the other games. That should prevent some of the extreme mu's that we have seen after the first few days, when a player with a high phi beats a stronger player / loses to a weaker player a couple of times.

4)   Sigma depending on mu and phi: I'm not so sure about that one, but one could estimate the daily phi increase based on mu and phi. The example would be that someone with a high mu is more likely to have a stable skill, than someone who's still learning the game. So their phi would increase by less per day and thus be lower. This would stabilize the mu of better players. Of course, it might be that estimating the parameters leads to the opposite result – or doesn't make much of a difference, in which case it doesn't have to be implemented.
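To illustrate point 3, here is a toy, scalar sketch of the iteration. The constant gain k and the plain logistic expected score are simplifications of my own; Glicko-boost does this inside the full rating update:

```python
import math

def expected(mu_a, mu_b):
    # Logistic expected score on the mu scale.
    return 1.0 / (1.0 + math.exp(-(mu_a - mu_b)))

def iterated_day(mu0, games, k=0.5, passes=10):
    # mu0: dict player -> start-of-day mu. games: list of (i, j, s), meaning
    # player i scored s (1 win / 0 loss) against player j.
    # Each pass recomputes everyone's end-of-day rating against the opponents'
    # ratings from the PREVIOUS pass, instead of their start-of-day ratings.
    mu = dict(mu0)
    for _ in range(passes):
        new = {}
        for p in mu0:
            gain = 0.0
            for i, j, s in games:
                if i == p:
                    gain += k * (s - expected(mu0[p], mu[j]))
                elif j == p:
                    gain += k * ((1 - s) - expected(mu0[p], mu[i]))
            new[p] = mu0[p] + gain
        mu = new
    return mu

# A beats B; B's other game decides how much A's win is worth:
end1 = iterated_day({"A": 0, "B": 0, "C": 0}, [("A", "B", 1), ("C", "B", 1)])
end2 = iterated_day({"A": 0, "B": 0, "C": 0}, [("A", "B", 1), ("B", "C", 1)])
```

In the first scenario B also lost to C and ends the day lower, so A's gain for beating B is smaller than in the second scenario, where B beat C.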

Stef

Thanks for putting more thought into this. Your posts contain some suggestions that certainly make sense.

The lower initial phi of 1 instead of 2 seems good. It now seems to needlessly penalize new players. But can you back it up with some argument/example? Suppose a new account does pretty well on its first day, an 8-2 record against various random opponents, what would the resulting rating be for these two options?

I'm uncertain about using (mu - phi) over (mu - 2 * phi). While there is no need for quick degeneration of levels, I do actually believe you get worse at Dominion if you don't play for a while, and I don't want people that haven't played for half a year to still be near their old rating/rank.

I am tempted to change the system to immediate updates over daily updates. Not that I really prefer that myself, but people seem to like it and it prevents some questions about the current leaderboard.

----

Most of all I don't want to introduce new rules/parameters too often. Ideally you or someone could compose an actual short list of proposed changes and when there are no valid counterarguments we just do that. That would require filling in some more details.

markus

Quote from: Stef on 08 May 2017, 12:42:12 PM
The lower initial phi of 1 instead of 2 seems good. It now seems to needlessly penalize new players. But can you back it up with some argument/example? Suppose a new account does pretty well on its first day, an 8-2 record against various random opponents, what would the resulting rating be for these two options?

I wouldn't think of high phi as "penalizing" - that is "only" true for the rank in the leaderboard. High phi primarily means that mu changes more in response to under/overperforming.

For the 8-2 example that means (I take opponents to have mu=0 as well):
phi=2: new mu=1.47, if opponents are new (phi=2) / mu=1.11, if opponents have phi=0.4
phi=1: new mu=0.90, if opponents are new (phi=1) / mu=0.87, if opponents have phi=0.4
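For reference, these numbers can be reproduced with the standard Glicko-2 period update. The volatility update for sigma is skipped in this sketch, which shifts the results only marginally (giving about 1.46 instead of 1.47 in the first case):

```python
import math

def glicko2_update(mu, phi, results, sigma=0.06):
    # One rating-period update for a single player, Glicko-2 style.
    # results: list of (mu_j, phi_j, s) with s = 1 win / 0.5 draw / 0 loss.
    # The volatility update for sigma is omitted for brevity.
    g = lambda p: 1.0 / math.sqrt(1.0 + 3.0 * p * p / math.pi ** 2)
    E = lambda m, p: 1.0 / (1.0 + math.exp(-g(p) * (mu - m)))
    v = 1.0 / sum(g(p) ** 2 * E(m, p) * (1.0 - E(m, p)) for m, p, s in results)
    delta_sum = sum(g(p) * (s - E(m, p)) for m, p, s in results)
    phi_star = math.sqrt(phi ** 2 + sigma ** 2)              # pre-period increase
    phi_new = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    return mu + phi_new ** 2 * delta_sum, phi_new

# 8 wins and 2 losses against ten mu=0 opponents:
games = lambda opp_phi: [(0.0, opp_phi, 1.0)] * 8 + [(0.0, opp_phi, 0.0)] * 2
mu_a, _ = glicko2_update(0.0, 2.0, games(2.0))   # phi0=2 vs new opponents, ~1.46
mu_b, _ = glicko2_update(0.0, 1.0, games(1.0))   # phi0=1 vs new opponents, ~0.90
```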

Quote from: Stef on 08 May 2017, 12:42:12 PM
I'm uncertain about using (mu - phi) over (mu - 2 * phi). While there is no need for quick degeneration of levels, I do actually believe you get worse at Dominion if you don't play for a while, and I don't want people that haven't played for half a year to still be near their old rating/rank.
First, my hunch is that the (average) phi is currently too high because sigma is too high, especially at the top of the leaderboard. That makes subtracting 2*phi more important. I don't have a strong opinion on that, but I would continue to subtract at least 1.5*phi.

There is no degeneration of mu in Glicko – you could add that, but we don't have any observations yet of players who haven't played in a while, so it would be a bit arbitrary. But you could, for example, subtract 0.1 for players who haven't played at all, 0.09 for players with 1 game, ..., and 0 for 10 games or more. This could lead to an overall deflation of rankings, so you would have to boost everyone's mu by a tiny bit. (I actually noticed that the average mu in the current system keeps falling; it's currently at -0.38.)

To get a degeneration of the rank in the leaderboard, it's already enough if phi increases; sigma determines how strong that effect is. The current underlying assumption would be that after half a year of not playing, a third of people should have improved or lost skill mu by at least 0.8. That seems too much to me for players at the top, but maybe for the average player it's accurate.
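That 0.8 figure can be checked back-of-the-envelope: phi's variance grows by sigma^2 per idle day, so the implied standard deviation of skill drift after half a year is

```python
import math

# Implied skill drift after ~182 idle days with sigma = 0.06: variance grows
# by sigma^2 per day, so the drift's standard deviation is sqrt(days) * sigma.
sigma = 0.06
drift_sd = math.sqrt(182 * sigma ** 2)
print(round(drift_sd, 2))  # 0.81 -> roughly a third of players outside +/-0.8
```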

Quote from: Stef on 08 May 2017, 12:42:12 PM
I am tempted to change the system to immediate updates over daily updates. Not that I really prefer that myself, but people seem to like it and it prevents some questions about the current leaderboard.

Probably the way to go is, like Scavenger, to display an updated rating that is at least approximately the one you'll have at midnight – and use this for matchmaking purposes. (With the current algorithm it's possible to predict it exactly; with some refinements like the iteration suggested in 3) above, it wouldn't be exact anymore.)


Quote from: Stef on 08 May 2017, 12:42:12 PM
Most of all I don't want to introduce new rules/parameters too often. Ideally you or someone could compose an actual short list of proposed changes and when there are no valid counterarguments we just do that. That would require filling in some more details.

I agree that this shouldn't change every month. Now, it would make sense to do something, because we have actual data to base it on / try it out. If you can provide for example a CSV file with the rated games (day,Player1,Player2,result), I'm happy to play around and make some suggestions.

markus

I tried the above suggestions with the rated 2 player game results from the first 38 days and created the attached figures.

It was not surprising that most of the gain comes from choosing a more suitable initial phi instead of 2, which turned out to be too large. Sigma could also be lowered a bit, but that affects the results less and (hence) is harder to estimate. (Intuitively it measures how much a player's skill could change over time, which is hard to know after a month).

What I also liked is the iteration on the daily results: it means that you use the opponents' end-of-day ratings to calculate your rating change, which helps when you're matched with newer players, whose ratings still change a lot within a day.

Other improvements (boost for exceptional performances, letting sigma depend on mu and phi) didn't matter much. So I left them out here for the sake of keeping things simple.

The estimates are around initial phi=0.75 and sigma=0.05. Therefore, I'm plotting 3 versions:
1)   current system (phi0=2, sigma=0.06) in red,
2)   phi0=0.75, sigma=0.05 in blue,
3)   phi0=0.75, sigma=0.05 plus the iteration in green.

To get the estimates, I minimized the average "discrepancy" of rated games (the negative log-likelihood, for the experts), that is -s*log(p) - (1-s)*log(1-p), where s is the outcome of the game (0 / 0.5 / 1) and p is the predicted win probability. The most boring rating system would always predict a 50% chance of winning, which results in a discrepancy of 0.693. So that's where the curves start on day 1 in the top left panel. I discarded the first 2 weeks when estimating the parameters, because everyone starting from scratch is not going to be representative in the future.

You can see in the top left panel that all systems naturally get better over time. Interpreting the absolute values of discrepancy is difficult, because it depends on the (quality of) matchmaking: if everyone were matched against an equal opponent, you couldn't do better than predicting 50% and would be wrong 50% of the time (discrepancy 0.693). If a strong player plays someone weaker and you correctly predict an 80% win chance, the discrepancy is 0.500.
And if you predict, say, a 65% chance of winning while the true one is 70%, the (average) discrepancy will be 0.616 instead of 0.611 for the best possible prediction. I'm pointing this out because even though the differences between the curves might not seem big, they can mean a lot in terms of predictive power.
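The discrepancy measure and the numbers in these examples can be checked directly:

```python
import math

def discrepancy(s, p):
    # Discrepancy (negative log-likelihood) of outcome s (0 / 0.5 / 1)
    # under the predicted win probability p.
    return -s * math.log(p) - (1.0 - s) * math.log(1.0 - p)

def expected_discrepancy(p_true, p_pred):
    # Average discrepancy when the true win probability is p_true
    # but the system predicts p_pred.
    return p_true * discrepancy(1, p_pred) + (1.0 - p_true) * discrepancy(0, p_pred)

print(round(expected_discrepancy(0.5, 0.5), 3))   # 0.693 (always predicting 50%)
print(round(expected_discrepancy(0.8, 0.8), 3))   # 0.5   (correct 80% prediction)
print(round(expected_discrepancy(0.7, 0.65), 3))  # 0.616 (slightly off ...)
print(round(expected_discrepancy(0.7, 0.7), 3))   # 0.611 (... vs best possible)
```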

For the second panel in the top row, I use the number of days a given player has played on the x-axis (again discarding games in the first 2 weeks). So the first point is new players who only started playing rated games after 2 weeks. You can see that the optimized coefficients do better in the beginning; after all, most of the gains come from picking a better initial value for phi. Hopefully there will be many new players in the future as well, so it would be good to do a better job there.

The third panel in top row conditions on mu at the end of the 38 days (so how good the third model believes that players currently are). You can see that most of the gains come from players in the middle of the distribution. That's not surprising, because they account for the most games played, and I didn't give any higher weight to top players for example, when minimizing discrepancy.

The first panel in the middle row shows the bias of the predicted win probability. The red curve (current system) shows for example that players with an expected win probability between 70% and 80% actually won 5% fewer games than predicted. Ideally, you want to have these curves around 0, which the improved versions mostly achieve. (My intuition for the bias is that with high initial phi players with good/bad results in the first days over/under-shoot their actual skill mu, such that they get too high/low expected win probabilities in the following days.)

The middle panel shows the cumulative distribution function for mu at the end of the sample (what fraction of players is below a given value of mu). The optimized versions are much less dispersed than the current one. This is mostly due to players with only a few games, who stick closer to mu=0 when the initial phi is lower. It also means that a player who is truly very good/bad will take a bit longer to reach the appropriate mu, as each game's result has less influence on mu.

The right panel in the middle row shows the resulting cumulative distribution function for phi. In general, phi will be lower because of a lower initial value and a smaller daily increase (showing up mostly for players with many games in the bottom left). Phi is also capped to be at most 0.75.

The bottom row looks at the "rating deflation" that seems to be going on. I should say first that this is not a big problem, because only the differences in mu between players matter, so a shift of the whole distribution doesn't change predicted win probabilities, for example.

In the bottom left panel you can see that the average mu of players who have already played up to that day falls over time. This can happen in theory, because the gain in mu of the winner is generally not equal to the loss in mu of the loser. In particular, players with a higher phi (newer or less frequent players) get a bigger positive or negative adjustment of mu. Therefore, the decline of the average mu suggests that players with a high phi underperform expectations. That could happen because better players played from day 1 / earlier, and newer players are worse than the average older player. The good thing is that this shouldn't go on forever, because adding new players at mu=0 (higher than the average) counteracts the effect. And with a bit of fantasy one can see the red curve at least beginning to flatten out over time.

The middle and right panel in the bottom row show how the cut-offs for top 1% and top 10% / top 100 and top 1000 players evolved over time.

SirDagen

Dear markus,

I find your treatise/comments utterly fascinating but do understand very little of it (although I do know my math). Could you explain in a few words for a statistical layman, what sigma, phi and mu stand for?
Greetings,
SirDagen

markus

mu is the estimate of a player's skill: the difference between two players' mu primarily determines the win probability:

mu-difference: 0.5   1   1.5   2
prob. of win:  62%  73%  82%  88%


phi measures how certain the system is about the skill mu (95% of players are supposed to have their true skill between mu-2*phi and mu+2*phi). It decreases when games are played – more so if phi is high (uncertainty about one's own rating), if the opponent has a low phi (certainty about the opponent's rating), or if the opponents are evenly matched (a more informative game outcome).
In turn, phi affects how much mu changes, when the actual wins differ from the expected wins. Higher phi lets mu change by more. (change in mu is roughly phi^2*(wins-exp.wins))

sigma determines how much phi increases per day (after having potentially been decreased by playing games). This captures that we become more uncertain about a player's skill as time passes. In particular, phi_new = sqrt(phi^2+sigma^2) if a player doesn't play.
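In code, the first two relationships look roughly like this (plain logistic curve, ignoring the g(phi) correction for opponent uncertainty, so the probabilities match the table above only when both phis are small):

```python
import math

def win_prob(mu_diff):
    # Win probability from the mu difference (logistic curve).
    return 1.0 / (1.0 + math.exp(-mu_diff))

def mu_change(phi, wins, expected_wins):
    # Rough approximation from above: higher phi -> bigger rating swings.
    return phi ** 2 * (wins - expected_wins)

# Reproducing the table:
for d in (0.5, 1.0, 1.5, 2.0):
    print(d, round(win_prob(d), 2))   # 0.62, 0.73, 0.82, 0.88
```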

SirDagen

That is so helpful. Thank you, Markus.
Greetings,
SirDagen

Martin plays Piano

I just saw (after writing my text here) that Markus has replied in the meantime, and SirDagen said thank you, but let me try to explain the maths once more from a layman-to-layman perspective ...

As Stef stated in his initial post, when the leaderboard was announced in the first week of April, there are 3 essential parameters that influence our ratings. At the very end, the cumbersome result of this formula is translated into the levels we can all see in our daily leaderboard (currently from 68 down to 0).

The (Mu) is your estimated, dynamic Dominion skill – we all started with Mu = 0. As you can see in the leaderboard, the top players meanwhile mostly have a (Mu) > 2. The majority of players is around 0.5 to 0 – some others are far below 0.
Why is this? The (Mu) is recalculated after every game – it depends on the skill levels of your opponents and of course on the game results (win or loss) against those opponents. Let's put it like this (from layman to layman): the new (Mu) is calculated based on your own (Mu) versus the opponent's (Mu), which implies a calculated win probability. So the (Mu) will go up by beating good players and will go down by losing against weaker players. If you always win against much weaker players it will go up, but very slowly. You can only move up the leaderboard significantly by beating better players – and as far as I know, the biggest fear of the top players is a loss against a much weaker player.

There is a second parameter called (Phi), which is your deviation. This might be the most difficult part to explain here (and it is mostly the part Markus wants to change in the current formula). This (Phi) started for all of us at 2 and is supposed to go down and down, towards 0 (in theory). It shows more or less your consistency of play – the more often you play, the more the system knows about your predictability and about the probability of how the next game will end (considering the skill level of your opponent). So if you are a top player and have played some hundred games with quite good consistency (e.g. you won 70% of your games), your (Phi) has gone down to, say, 0.2 – and a loss against a weaker player doesn't destroy your whole ranking; it's more like a statistical outlier. If you haven't played many games so far (let's say only 5) and these games alternated between losses and wins, your (Phi) might still be very high – the system hasn't got a good profile of your gaming consistency yet. And finally there are also some players with 5 rated games who won all of them – so their (Phi) is medium-low and their (Mu) can be really high (even after only 5 games). I assume this is the effect Markus describes, where a newbie with 5 (accidental) wins in their first 5 games can show up in the top 20 (which is mathematically correct but not the intended result for the leaderboard) – normally the first loss will bring this newcomer back down to earth, and I guess Markus has suggested some adjustments to the initial (Phi) to get rid of those one-day shooting stars.

The third parameter, called (Sigma), is relatively unimportant – this is your volatility, which shows how often and how regularly you are playing. This is only relevant for players taking a break of some weeks or months, to make sure they don't remain in the top 100 forever; their ranking will fall due to the sinking (Sigma) the longer they don't play. Even if you don't play for a few days, the (Sigma) changes your rating very slightly – so being an active player is rewarded, or the other way around, being a lazy player with long stretches of inactivity is penalized.

Hopefully this helps your understanding – and hopefully my explanation was correct ...

Have fun
Martin

markus

Thanks, I just want to correct two things  8)
1) The daily decrease in phi in Glicko doesn't actually depend on the results of the games that you play.
2) Sigma affects everyone, not just those who haven't played. This is the reason why phi doesn't go close to 0 even after many games. If you play more games per day (!), it will converge to a lower level. What that level is depends on sigma and the quality of your opponents. If you then cut back on the number of games per day, phi will increase and converge to a higher level.
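A toy iteration illustrates this convergence: alternate the daily sigma increase with the information gained from n games per day. The per-game information value 0.24 is a made-up round number for a near-even matchup against an opponent with a well-known rating:

```python
import math

def steady_state_phi(games_per_day, sigma=0.06, info_per_game=0.24, days=2000):
    # Iterate the daily Glicko cycle: overnight, phi's variance grows by
    # sigma^2; then each game adds Fisher information to 1/phi^2.
    # info_per_game ~ g^2 * E * (1 - E) for an even matchup (hypothetical value).
    phi = 2.0
    for _ in range(days):
        phi = math.sqrt(phi ** 2 + sigma ** 2)            # daily uncertainty growth
        inv_var = 1.0 / phi ** 2 + games_per_day * info_per_game
        phi = math.sqrt(1.0 / inv_var)                    # information from games
    return phi
```

Playing 10 games a day settles phi at a lower level than playing 2 games a day, exactly as described: the equilibrium balances the daily sigma inflow against the information from games.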

Polk5440

Nice job, markus! Very interesting.

The first chart in the middle row is the most important one to me. It's what I would look at first when evaluating whether the rating system is doing a good job. It kind of supports the idea that it's risky for a strong player to play against someone rated much lower, because the strong player is more likely to lose than predicted.

I think you have presented some pretty strong evidence that some tweaks are needed.