
Ed-tech companies tend to publish pilot results that show strong positive outcomes. We understand the incentive. We want to be honest anyway, because the parts of our pilot that didn't work the way we expected are more interesting — and more useful to schools considering the platform — than the parts that did.
The pilot ran from January 2024 (spring term) through November 2024 (autumn term) across four state primary schools in Lewisham, Hackney, Lambeth, and Southwark. Total enrolment: 340 students across Year 3 and Year 4 (ages 7–9). The control group comprised 112 students in the same year groups at two of the four schools; they followed standard classroom practice without the platform during the spring term, then switched to the platform in autumn.
Study Design and Limitations
This was not a randomised controlled trial. Student assignment to platform or control was determined by class, not by individual randomisation. Classes using the platform and control classes were in the same schools and taught by different teachers — which means teacher quality is a confound we cannot fully control for.
We're flagging this upfront because we've seen ed-tech companies present convenience-sample pilot data as if it were RCT evidence. It isn't. Our results are encouraging, but they do not constitute proof that the platform causes the outcomes we observed. We are in early discussions with UCL Institute of Education about a more rigorous study, and we'll report those results when they're available.
With that caveat clearly stated: here's what we found.
Primary Outcome: Multiplication Tables Check Scores
The Year 4 Multiplication Tables Check (MTC) was the primary assessment in this study. It's administered nationally by the Department for Education and produces a standardised score from 0 to 25. The national average score in 2024 was 19.8.
Platform group Year 4 students (n=183 who completed at least 80% of platform sessions): mean MTC score of 22.1 (SD 2.4). Control group Year 4 students (n=89): mean MTC score of 18.6 (SD 3.1). The difference (3.5 points) is statistically significant at p<0.001 using an independent samples t-test.
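For readers who want to check the unadjusted comparison, the summary statistics above are sufficient on their own. A minimal sketch in Python; the one assumption is the test variant (we show Welch's, which doesn't require equal variances; at this effect size the choice doesn't change the conclusion):

```python
# Reproduce the unadjusted t-test from the summary statistics reported above.
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=22.1, std1=2.4, nobs1=183,  # platform group (Year 4)
    mean2=18.6, std2=3.1, nobs2=89,   # control group (Year 4)
    equal_var=False,                  # Welch's variant (assumption)
)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")  # p is well below 0.001
```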
However, the platform group had a somewhat higher pre-pilot baseline on a diagnostic multiplication assessment administered in January 2024. When we run an ANCOVA controlling for January baseline scores, the adjusted between-group difference is 2.1 points — still statistically significant (p=0.008) but smaller than the unadjusted figure.
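To be concrete about what "controlling for baseline" means here: the ANCOVA is equivalent to an ordinary regression with the baseline score as a covariate. A minimal sketch, assuming a per-student DataFrame with hypothetical columns mtc_score, baseline, and group; our actual analysis pipeline differs in the details, but the shape is the same:

```python
# ANCOVA as OLS: regress the June MTC score on group membership plus the
# January baseline. The coefficient on group is the baseline-adjusted
# between-group difference (roughly 2.1 points in our data).
import pandas as pd
import statsmodels.formula.api as smf

def adjusted_group_effect(df: pd.DataFrame) -> float:
    # 'control' is the reference level (alphabetically first), so the
    # coefficient reads as platform minus control, adjusted for baseline.
    model = smf.ols("mtc_score ~ group + baseline", data=df).fit()
    return model.params["group[T.platform]"]
```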
We report the adjusted figure as our primary outcome. A 2.1-point improvement in MTC score, on a 25-point scale, with a national average of 19.8, represents meaningful movement for a primary-age student. But it's not as dramatic as the unadjusted headline number, and we think schools deserve to know both figures.
Secondary Outcome: Students Below the Expected Standard
The Department for Education defines "working at the expected standard" on the MTC as a score of 20 or higher. We tracked the proportion of students in each group scoring below 20.
Platform group: 18% of Year 4 students scored below 20 on the MTC. Control group: 37% scored below 20. The relative reduction in below-expected-standard outcomes is 51% — which is a more meaningful figure for many schools than an average score difference, because it speaks directly to the at-risk population.
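The 51% figure is simple arithmetic, but relative reductions are easy to misread, so here it is spelled out using only the two percentages reported above:

```python
# Relative reduction = drop in the below-threshold rate, expressed as a
# fraction of the control group's rate.
control_below = 0.37   # control group: share of Year 4s scoring below 20
platform_below = 0.18  # platform group: share of Year 4s scoring below 20

relative_reduction = (control_below - platform_below) / control_below
print(f"{relative_reduction:.0%}")  # 51%
```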
We're cautious about this figure because of the baseline confound described above. Students who start stronger are more likely to cross the 20-point threshold. Even so, the direction is consistent and the magnitude is notable enough to be worth reporting.
Usage Patterns: What Predicted Better Outcomes
Session frequency was the single strongest predictor of MTC score improvement within the platform group. Students who completed three or more sessions per week improved by a mean of 4.2 points between the January baseline and the June assessment. Students completing one to two sessions per week improved by a mean of 1.8 points. Students completing fewer than one session per week showed no significant improvement.
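A sketch of how this breakdown is computed, assuming a per-student DataFrame with hypothetical columns sessions_per_week and improvement (June score minus January baseline); again, our real pipeline is more involved but follows the same shape:

```python
# Mean score improvement within each session-frequency band.
import pandas as pd

def improvement_by_frequency(df: pd.DataFrame) -> pd.Series:
    bands = pd.cut(
        df["sessions_per_week"],
        bins=[0, 1, 3, float("inf")],
        labels=["<1/week", "1-2/week", "3+/week"],
        right=False,  # left-closed bands: [0,1), [1,3), [3,inf)
    )
    return df.groupby(bands, observed=True)["improvement"].mean()
```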
Three sessions per week appears to be the effective minimum. Schools that scheduled the platform as a consistent three-times-a-week activity saw dramatically better outcomes than those that used it opportunistically. This is consistent with the spacing literature — but it's still worth stating plainly, because several schools in the pilot initially planned to use the platform "when there's time," which turned out to mean once or twice a week on average.
Session length also mattered. The platform automatically adjusts session length based on student performance; the average session was 14.2 minutes. Students whose sessions averaged under 10 minutes (indicating they were completing tasks quickly and not being given enough challenge) showed lower improvement than students in the 12–18 minute range. This prompted us to add a difficulty extension setting that teachers can enable, so that faster students are presented with more challenging material rather than finishing sessions early.
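The analysis that surfaced the fast-finisher problem was straightforward. A sketch, assuming session logs with hypothetical columns student_id and session_minutes:

```python
# Flag students whose sessions average under 10 minutes; in our data these
# students showed lower improvement, suggesting insufficient challenge.
import pandas as pd

def flag_under_challenged(sessions: pd.DataFrame,
                          threshold_minutes: float = 10.0) -> pd.Index:
    mean_minutes = sessions.groupby("student_id")["session_minutes"].mean()
    return mean_minutes[mean_minutes < threshold_minutes].index
```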
What Didn't Work as Expected
Three findings from the pilot were surprises — and not positive ones.
Year 3 outcomes were weaker than Year 4 outcomes. Our pre-pilot hypothesis was that earlier intervention would produce stronger effects. Instead, Year 3 students in the pilot showed smaller MTC-equivalent score improvements than Year 4 students. Post-pilot analysis suggests the platform's question sequencing was not sufficiently calibrated for Year 3 ability ranges: the IRT model's item parameters, calibrated primarily on Year 4 students, predicted less accurately for Year 3 students, whose ability range is wider. We've revised the Year 3 module significantly for the 2025 rollout based on this finding.
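For readers unfamiliar with IRT: in the common two-parameter logistic (2PL) form, shown here purely for illustration (the platform's actual model is more elaborate), the probability of a correct response depends on the student's ability and two per-item parameters. If those parameters are estimated mostly from Year 4 responses, predictions degrade for students whose abilities sit outside the calibrated range, which is what happened with Year 3.

```python
# Two-parameter logistic (2PL) IRT model: probability of a correct response
# given student ability theta, item discrimination a, and item difficulty b.
# Estimates of a and b are only trustworthy near the ability range they
# were calibrated on.
import math

def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```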
Teacher dashboard engagement was lower than expected. We assumed teachers would regularly use the analytics view to guide their intervention decisions. In practice, four of the twelve teachers in the pilot checked the analytics view less than once per week. Exit interviews revealed that these teachers didn't distrust the data — they simply didn't have a consistent habit for when to look at it. This prompted us to add weekly digest emails that bring the three most actionable data points to teachers automatically, without requiring them to remember to log in to the analytics view.
One school showed no improvement. Hackney School B (pseudonymised) showed no significant improvement in either MTC scores or our internal assessments in either term. Teacher interviews and session log analysis revealed that the school's timetable placed platform sessions in unstructured "golden time" rather than dedicated numeracy slots: sessions were frequently interrupted, students were competing with alternative activities, and completion rates were low, averaging 1.4 sessions per week, well below the three-per-week threshold. This school's data is included in all the summary statistics above; we have not excluded it. It's an important reminder that the platform's effectiveness depends on how it's implemented, not just whether it's present.
Teacher Feedback Themes
End-of-term qualitative feedback from the twelve class teachers in the pilot produced consistent themes.
Positive: "I know which students need help before they tell me." Seven of twelve teachers mentioned unprompted that the dashboard changed how they thought about lesson planning — specifically that knowing which facts individual students were struggling with allowed them to design targeted group activities rather than whole-class instruction.
Positive: "Setup was genuinely easy." All twelve teachers mentioned setup time favourably. The actual setup time (from account creation to first live session) averaged 47 minutes. Three teachers said they'd expected it to take a full day. This matters for adoption — complex onboarding is a significant adoption barrier for primary school ed-tech.
Mixed: "The session reports are good but I don't always have time to read them." This connects to the dashboard engagement finding above. The information is there; the habit of accessing it consistently is harder to establish than we anticipated.
Critical: "The year 3 questions felt too hard in the first month." This confirms the Year 3 calibration issue described above. Two Year 3 teachers reported that some of their lower-attaining students became visibly frustrated with the platform in the first three weeks before the adaptive algorithm had enough data to correctly calibrate difficulty. We've addressed this by lowering the initial difficulty floor for Year 3 student accounts.
What Comes Next
We're planning a larger pilot for the 2025–26 academic year, targeting 15 schools across a wider geographic range, including schools outside London. The pre-seed funding from Fuel Ventures will fund both the development changes described in this report and the expanded pilot infrastructure.
We'll report those results with the same transparency we've applied here. If they show weaker effects than the South London pilot, we'll say so. If we identify new failure modes, we'll describe them. The alternative — publishing only the results that look good — would be less uncomfortable in the short term and worse for everyone in the long term, including us.