It’s been 36 hours since the end of the first round of the LT Tracking Project (Volunteer your data for the LT Tracking Project! - MPQ General Discussion - 505 Go! Official Forums). If anyone has data they haven’t yet sent in, please do and I will add it to these results.
I have posted the raw preliminary data (with draws commingled and potentially identifying information stripped out) here: http://dropproxy.com/f/D8D
Note on Analysis
Before we get to what the data shows, let’s talk about what this data can tell us and how we know it. We have a data set and we want to determine whether or not we should question what we have been told and what we have reasonably assumed to be true about the game’s random draws. Here are a few basic, testable hypotheses:
- In this period (and ever since the release of Phoenix), the probability of drawing a 5* is fixed at 10%.
- LT pulls are fair. That is, no extraneous variables should affect cover pulls. It shouldn’t matter a) who is opening the LT, b) what time it was opened, c) through what means (token or CP) the LT was obtained, d) how much money the person has spent, e) how developed the roster is, or f) how the person plays.
- All covers of the same rarity have the same probability of being drawn, which should match the listed drop rate.
- Each character’s three colors should be evenly distributed.
We can only evaluate these probabilistically through our limited dataset. That is, we have to calculate how likely our results are if we assume these hypotheses are true. This likelihood is called the p-value (p-value - Wikipedia). For example, if we assume that the 5* drop rate is 10% but only got one 5* out of 20, we can calculate that permutations of 1 or 0 out of 20 make up around 40% of all permutations: slightly unlucky, but not enough for us to question the 10% assumption. If it was 1 out of 50, we would be below the bottom 5%. The threshold where we start to question the assumption is the significance level; in typical scientific studies, it’s somewhere around 1-5%, but it has to be tailored to the experiment. I’m not sure what the significance level for any of our “experiments” should be, but the lower the p-value, the more we should question the assumption.
Similarly, the more data we have, the more we can question unexpected results. For example, one 5* out of 20 and ten 5*s out of 200 are both 5% drop rates. But the first one is around the bottom 40%, while the second is lower than the bottom 1%. The p-value actually tells us whether our sample size is good enough: if we had too small a sample size, we’d mathematically be unable to reach our significance threshold.
Not being able to reach a low enough p-value does not mean that the assumptions are true, just that we do not have the data to question them.
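The example figures above can be reproduced with an exact binomial tail sum. What follows is a minimal Python sketch, not the spreadsheet actually used for this analysis; the function name `lower_tail` is made up for illustration, and it assumes every draw is independent with the same 10% chance of a 5*.

```python
from math import comb

def lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p): the chance of k or fewer hits in n draws."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Assumed 5* rate of 10%, as in hypothesis 1.
print(lower_tail(1, 20, 0.10))    # one or zero 5*s in 20 draws: about 0.39
print(lower_tail(1, 50, 0.10))    # one or zero 5*s in 50 draws: about 0.034
print(lower_tail(10, 200, 0.10))  # ten or fewer 5*s in 200 draws: under 0.01
```

The same observed 5% rate becomes far less plausible as the number of draws grows, which is why sample size matters so much here.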
The Data
There were 436 total LT draws. Two contributors (one of whom is me) supplied more than 75% of the data. I have validated the results of the other contributor against his or her in-game roster and found no discrepancies (as well as a perfect match in 5* covers). Therefore, most of the data appears reliable, but it leaves us unable to determine whether individual accounts (or devices or OS software) affect RNG (assumption 2a). If 2a is invalid, this puts the rest of the results into question.
This amount of data is sufficient to potentially falsify the listed 10% drop rate (assumption 1). If we accept a significance threshold of 5%, an aggregate drop rate of less than 7.5% or more than 12.5% would be sufficient for us to question this.
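As a rough check on those cutoffs, here is a Python sketch that searches for the 5* counts at which each tail of the distribution first drops to 5% or less. It assumes 436 independent draws at a true 10% rate and treats each tail separately; `critical_counts` is a hypothetical helper written for this post, so the exact counts it finds may differ slightly from the rounded 7.5% and 12.5% figures.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X == k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_counts(n, p, alpha):
    """Return (low, high): the largest count with P(X <= low) <= alpha and
    the smallest count with P(X >= high) <= alpha."""
    cdf = 0.0
    low, high = None, None
    for k in range(n + 1):
        upper = 1.0 - cdf              # P(X >= k), before adding P(X == k)
        if high is None and upper <= alpha:
            high = k
        cdf += binom_pmf(k, n, p)      # cdf is now P(X <= k)
        if cdf <= alpha:
            low = k
    return low, high

low, high = critical_counts(436, 0.10, 0.05)
print(f"question the 10% rate at {low} or fewer 5*s ({low / 436:.1%})")
print(f"question the 10% rate at {high} or more 5*s ({high / 436:.1%})")
```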
Many people also opened tokens in large chunks, which means that much of the time data is clustered. We also have significant missing data on the token/cp question. I’ve not yet had the chance to go through the survey data; that data may not be diverse enough either. So, I’m not sure that we have enough data to question the assumptions that hinge on these variables.
The Results
Ultimately, the overall 5* drop rate was 39 out of 436, or 8.9%. This is lower than the 10% rate, but this only puts us in the bottom 26.036 percentile of all permutations, so we do not have enough of a reason to question assumption 1.
This breaks down to 208 total Latest LTs and 228 Classic LTs. The 5* rate for the Latest is 7.7%, for Classic 10.1%. Neither is significant enough for us to question assumption 2c.
Kind of boring. But so far so good. The next part is more interesting.
The most common 4* character pulled is a three-way tie between Ghost Rider, Iceman, and Red Hulk, each at 23. The least common 4* character pulled is Star Lord at 4. The listed draw rate for all these characters is 3.5%. Since there is a 90% chance to pull a 4* and there are 26 valid 4*s (Quake and Devil Dino are not in the LT pool), this should actually be 3.46%.
23 pulls from 436 at 3.46% is top 96 percentile, which is a marginal result, close to the reasonable range of significance levels.
4 pulls from 436 at 3.46% is bottom 0.069 percentile, which is far below any reasonable significance level.
This is an astounding result. 4 Star-Lords out of 436 pulls strongly suggests that cover distribution is not even. We do not have sufficient data to determine whether the source of this unevenness is global or improperly biased by individual accounts, devices, or some other variable. But we can be pretty close to certain that something is wrong. The next two least common pulls are XF Wolverine and 4hor, each with 9 covers, putting them at the bottom 4.9 percentile, which is quite low, if only marginally significant.
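For anyone who wants to check the 4* character figures, here is a Python sketch of the tail calculations. It assumes each of the 436 draws independently lands on a given 4* with probability 0.90 / 26; the helper names are made up for illustration. Note that a "top 96 percentile" result corresponds to an upper-tail probability of roughly 4%.

```python
from math import comb

def lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - lower_tail(k - 1, n, p)

N_DRAWS = 436
P_CHAR = 0.90 / 26  # 90% chance of a 4*, split evenly over 26 characters

star_lord = lower_tail(4, N_DRAWS, P_CHAR)      # 4 or fewer covers
xf_wolverine = lower_tail(9, N_DRAWS, P_CHAR)   # 9 or fewer covers
ghost_rider = upper_tail(23, N_DRAWS, P_CHAR)   # 23 or more covers

print(f"Star-Lord, 4 or fewer:     {star_lord:.5f}")
print(f"XF Wolverine, 9 or fewer:  {xf_wolverine:.5f}")
print(f"Ghost Rider, 23 or more:   {ghost_rider:.5f}")
```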
The most common 5* character is Black Suit Spider-Man at 11 pulls out of 208 (Latest LT) at 3.33% (top 91 percentile). The least common 5* was OML at 4 pulls out of 228 (Classic LT) at 3.33% (bottom 8.705 percentile). These are marginal results as well. The problem with the 5* results is that the populations of Classic and Latest tokens are each roughly half the size of the total number of LTs; if we had more data, we might be able to say definitively whether or not 5* drop rates are as skewed as 4* ones seem to be. As it stands, it is merely very interesting.
ADDENDUM on Colors:
I’ve added a section to the results spreadsheet breaking draws down by character-color pairs to test distribution of colors as well. If we assume that every color is randomly distributed as well, there is a 1.15% chance to get any specific 4* character-color combination (1.11% for 5*).
The most common color-cover is Ghost Rider Red at 12 covers (top 99.472 percentile). The other Ghost Rider color-covers are black at 8 (top 86.637 percentile) and green at 3 (bottom 26.144 percentile). This means that just within the population of Ghost Rider draws, red at 12 is top 95.195 percentile assuming 33.33% and green at 3 is bottom 2.648 percentile.
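The within-character comparison can be sketched the same way. This Python snippet looks only at the 23 Ghost Rider covers (12 red, 8 black, 3 green) and assumes each one independently had a 1/3 chance of being any given color; the function names are hypothetical.

```python
from math import comb

def lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - lower_tail(k - 1, n, p)

GHOST_RIDER_COVERS = 12 + 8 + 3  # red + black + green = 23
P_COLOR = 1 / 3                  # assumed even split across three colors

green_low = lower_tail(3, GHOST_RIDER_COVERS, P_COLOR)   # 3 or fewer green
red_high = upper_tail(12, GHOST_RIDER_COVERS, P_COLOR)   # 12 or more red

print(f"3 or fewer green out of 23: {green_low:.4f}")
print(f"12 or more red out of 23:   {red_high:.4f}")
```

Conditioning on the character's own total this way separates the color question from the question of whether Ghost Rider himself was over-drawn.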
The lowest color-covers are XF Wolverine Black and X-23 Black at 1, both bottom 3.919 percentile.
On the 5* front, OML Yellow and Red, Surfer Purple, Phoenix Green, and Goblin Purple are all tied for least common at 1. For OML and Surfer, this is bottom 27.932 percentile. Most common is BSSM Purple at 6 (top 97.036 percentile).
Here’s the TLDR summary:
- Overall, we’re close enough to the 10% 5* drop rate that we do not have enough reason to seriously question it.
- There were a fair number more 5*s from Classic LTs than Latest LTs, but the difference is not significant.
- At least one 4* is significantly less likely to drop than average. Others have very high and low rates that are marginally significant. Several color-cover combinations dropped at rates which are also in the range of reasonable significance. Taken together, the distribution of both characters and colors is skewed enough that we should raise serious doubts as to whether or not character draws are actually evenly distributed.
RNGs can be tricky to get right and MPQ runs on a wide variety of platforms (and we’ve already seen how draws are determined client-side rather than server-side). I’d love to see a dev response on this, but the historical data on that happening strongly suggests that I should not hold my breath.