My PhD Thesis Outline - Planning for My Final Year as a Computer Science PhD Student

TL;DR

The remaining 15 months of funding translate into a target to finish core research work in about six months, leaving roughly nine months for thesis writing.

Briefing

With just 15 months of funding left, a computer science PhD student in machine learning lays out a tightly sequenced plan to finish research and shift into thesis writing. The aim is to “wrap up” core research within about six months so that roughly nine months can go to drafting, revising, and finalizing the thesis.

The thesis centers on marathon training analytics using Strava data. The work focuses on turning complex, messy time-series training logs, in which each run session is broken into 100-meter intervals with distance, time, pace, elevation, and heart rate, into models that can (1) predict race performance, (2) quantify how training disruptions affect outcomes, and (3) recommend training plans to improve performance. The dataset grows dramatically over the course of the PhD: early experiments used about 20,000 runners over 16 weeks, while later work expands to a couple of million runners and tens of millions of sessions. That expansion demands new data engineering skills: extracting, cleaning, and re-running analyses at a much larger scale.
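As a rough illustration of the session format described above, the following Python sketch collapses a list of 100-meter intervals into a few session-level numbers. The `Interval` fields, the `summarize_session` helper, and the choice of summary statistics are assumptions made for illustration; the source does not specify the actual schema or pipeline.

```python
from dataclasses import dataclass
from statistics import mean

SEGMENT_M = 100.0  # each interval covers 100 meters, as described in the text


@dataclass
class Interval:
    """One 100 m segment of a run. Field names are hypothetical."""
    seconds: float      # time taken to cover this segment
    elevation_m: float  # elevation at the end of the segment
    heart_rate: float   # heart rate over the segment, in beats per minute


def summarize_session(intervals):
    """Reduce a run's intervals to a small dictionary of summary features."""
    if not intervals:
        raise ValueError("a session needs at least one interval")
    distance_km = len(intervals) * SEGMENT_M / 1000.0
    total_minutes = sum(i.seconds for i in intervals) / 60.0
    # Elevation gain counts only the uphill changes between segments.
    gain = sum(
        max(0.0, b.elevation_m - a.elevation_m)
        for a, b in zip(intervals, intervals[1:])
    )
    return {
        "distance_km": distance_km,
        "pace_min_per_km": total_minutes / distance_km,
        "elevation_gain_m": gain,
        "mean_heart_rate": mean(i.heart_rate for i in intervals),
    }
```

Summaries like these are one plausible intermediate step between raw interval logs and the engineered features used for prediction; a real pipeline would likely keep far more detail per session.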

Early progress came quickly. In the first year, the student built features from training history to predict race times and produced two first-author conference publications, including work on training recommendations (what a “next week” might look like depending on whether a runner wants to improve or ease off). After the pandemic, the research focus shifted toward training disruptions, examining how breaks in training influence marathon performance. That work was supported by multiple mini-projects and a later journal submission.
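One simple way to make “breaks in training” concrete is to measure the longest run of days with no recorded session. The sketch below does that in Python; the `longest_break_days` helper and this particular definition of a disruption are assumptions for illustration, not the definition used in the thesis.

```python
from datetime import date


def longest_break_days(session_dates):
    """Return the longest stretch of full days without a session.

    session_dates: iterable of datetime.date objects, in any order.
    Two sessions on consecutive days count as a break of 0 days.
    """
    days = sorted(set(session_dates))
    if len(days) < 2:
        return 0
    return max((later - earlier).days - 1 for earlier, later in zip(days, days[1:]))
```

A measure like this could then be compared against race outcomes, although the source does not say which disruption metrics were actually analyzed.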

Structurally, the thesis is planned around the experiments rather than a purely linear “methods/results” flow. The outline includes an introduction chapter, a literature review, and a dedicated dataset chapter that documents how raw Strava data becomes usable modeling inputs, including the cuts and cleaning decisions required for a messy dataset. The core contribution is then split into three main chapters: (1) feature engineering and performance prediction, (2) training disruptions using more data-science-heavy analysis with some machine learning, and (3) training recommendations framed like a recommender system for marathon running.

A key practical challenge is validation for recommendations. Live user studies are unlikely, so validation is expected to rely on generating sample training plans and recruiting people to judge whether the plans are sensible and understandable—more “can humans follow this?” than “does it improve outcomes in the real world?”

For the remaining year, the student’s calendar is designed to front-load research during a less teaching-heavy summer. June priorities include finishing “camera-ready” material for an accepted conference paper and completing analysis for an additional training-disruptions piece that could become its own publication. The next phase targets feature engineering and representation learning for performance prediction, followed by a user-evaluation step to test explainability and usability of training plans. The final recommender-system work is scheduled for roughly October through December. Throughout, the student plans to write iteratively—capturing notes and drafts early—while avoiding premature full write-ups that might require rerunning models if the dataset changes again.

Cornell Notes

The PhD plan is built around a hard deadline: after passing a funding milestone, there are 15 months left to submit the thesis. The strategy is to finish most research in about six months, then devote roughly nine months to thesis writing. The research uses Strava time-series training data to (1) predict marathon race performance, (2) study how training disruptions affect outcomes, and (3) generate training recommendations. Work evolves from a smaller dataset (~20,000 runners) to a much larger one (millions of runners and tens of millions of sessions), requiring new data engineering and feature engineering. The thesis structure is experiment-driven, with a dedicated dataset chapter and a validation approach for recommendations based on human judgment rather than live trials.

Why does the thesis need a dedicated dataset chapter, and what does that chapter have to accomplish?

Because the Strava data is messy and time-series heavy, the thesis must document how raw logs become modeling-ready inputs. That includes the transformation pipeline, the specific “cuts” or filtering decisions, and how the dataset is cleaned and restructured so the resulting features can support prediction and recommendation experiments. The dataset chapter is positioned as fundamental to the thesis contribution, not just background.
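To give a sense of what documenting “cuts” might involve, here is a minimal Python sketch that filters session summaries and records how many sessions each rule removes. The `apply_cuts` helper, the dictionary keys, and every threshold are hypothetical; the source does not state which filters or cutoffs were applied.

```python
def apply_cuts(sessions, min_km=1.0, min_pace=2.5, max_pace=12.0):
    """Filter sessions and count how many each cut removes.

    sessions: list of dicts with 'distance_km' and 'pace_min_per_km'.
    The thresholds are illustrative placeholders, not values from the thesis.
    Returns (kept_sessions, dropped_counts_by_reason).
    """
    kept = []
    dropped = {}
    for session in sessions:
        if session["distance_km"] < min_km:
            reason = "too_short"
        elif not (min_pace <= session["pace_min_per_km"] <= max_pace):
            reason = "implausible_pace"
        else:
            kept.append(session)
            continue
        dropped[reason] = dropped.get(reason, 0) + 1
    return kept, dropped
```

Returning the per-rule counts alongside the cleaned data makes it easier to report, in a dataset chapter, how much each filtering decision shrank the dataset.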

How does the research tackle marathon performance prediction using training history?

It extracts training-related features from runners’ past sessions and uses them to predict race times. Early work used about 16 weeks of training data from roughly 20,000 runners, building models from engineered features. Later work returns to the larger dataset after developing better representations, aiming to improve both predictive performance and the downstream usefulness of those features for recommending training plans.
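The following Python sketch shows one way a 16-week block of sessions could be turned into a feature vector for a race-time model. The `training_features` helper, the input keys, and the specific features chosen are assumptions for illustration; the source does not list the features actually engineered.

```python
from collections import defaultdict


def training_features(sessions, n_weeks=16):
    """Build simple training-history features over an n_weeks window.

    sessions: list of dicts with 'week' (0 .. n_weeks - 1),
    'distance_km', and 'pace_min_per_km'. All keys are hypothetical.
    """
    in_window = [s for s in sessions if 0 <= s["week"] < n_weeks]
    if not in_window:
        raise ValueError("no sessions inside the training window")

    weekly_km = defaultdict(float)
    for s in in_window:
        weekly_km[s["week"]] += s["distance_km"]

    total_km = sum(s["distance_km"] for s in in_window)
    # Distance-weighted pace: total running minutes divided by total distance.
    total_min = sum(s["distance_km"] * s["pace_min_per_km"] for s in in_window)
    return {
        "mean_weekly_km": total_km / n_weeks,
        "peak_weekly_km": max(weekly_km.values()),
        "longest_run_km": max(s["distance_km"] for s in in_window),
        "mean_pace_min_per_km": total_min / total_km,
        "active_weeks": len(weekly_km),
    }
```

A dictionary like this could be fed to any regression model that predicts race time; which models the student compared is not detailed in the source.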

What changes when the dataset scales from tens of thousands of runners to millions?

The core modeling ideas remain, but the workflow changes: the student must learn to handle much larger raw data volumes, including extraction, cleaning, and processing at scale. The shift from ~20,000 runners to a couple million runners and tens of millions of sessions forces new data engineering skills and often requires re-running experiments with updated representations.

What is the planned approach to validating training recommendations without a live user study?

Instead of running a real-world trial where people follow the system, the plan is to generate sample training plans and ask people to evaluate whether the recommendations are sensible and understandable. The emphasis is on usability and interpretability—whether runners can follow what the system suggests—rather than measuring performance gains in a controlled longitudinal study.

How is the thesis organized to match the project’s experimental structure?

The outline is designed around multiple mini-projects and experiments. It includes an introduction and literature review, then a dataset chapter, followed by three main chapters aligned to the research goals: (1) feature engineering and performance prediction (with model comparisons and justification, including how deep learning models are handled), (2) training disruptions (more data-science-focused with some machine learning), and (3) training recommendations as a recommender system.

What is the student’s time-management logic for the final year?

The plan front-loads research during a summer period with fewer teaching hours, targeting completion of specific deliverables month-by-month: camera-ready conference updates, training-disruption analysis, feature engineering and representation work, a user-evaluation step, and then the larger recommender-system build in the later months. The goal is to avoid writing the thesis too early when dataset changes might force reruns of models and revisions of results.

Review Questions

What specific validation method is planned for training recommendations, and why does it replace live user studies?
How does the thesis’s chapter structure reflect the way the research was conducted (mini-projects and multiple experiments)?
What are the main technical shifts required when moving from ~20,000 runners to a dataset with millions of runners and tens of millions of sessions?

Key Points

1. The remaining 15 months of funding translate into a target to finish core research work in about six months, leaving roughly nine months for thesis writing.
2. Marathon performance prediction is built from Strava time-series training data by engineering features from runners’ past sessions.
3. Dataset scale increases from ~20,000 runners to a couple million runners, requiring new data extraction, cleaning, and processing skills.
4. The thesis is organized around three contribution areas: feature engineering/performance prediction, training disruptions, and training recommendations.
5. Training recommendations are expected to be validated through human judgment of generated plans rather than live user trials.
6. A dedicated dataset chapter is treated as essential because it documents how messy raw Strava data becomes modeling-ready inputs.
7. The final-year schedule is designed to front-load research during a less teaching-heavy summer to reduce the risk of rework from dataset changes.

Highlights

The thesis plan explicitly separates research and writing: finish research in ~6 months, then draft and revise for ~9 months.

Strava data drives three linked goals—race-time prediction, disruption impact analysis, and training-plan recommendations—using engineered representations of time-series sessions.

Scaling from 20,000 runners to millions forces a workflow shift: extraction and cleaning at scale become as important as model choice.

Recommendation validation is planned as “does this plan make sense to humans?” using generated samples, not a live trial.

The thesis structure is experiment-driven, with a dataset chapter that documents transformation decisions as part of the core contribution.

Topics

PhD Timeline
Marathon Prediction
Feature Engineering
Training Disruptions
Training Recommendations