Controlled Improvement and Error Analysis

Model	Accuracy	Macro F1	Zero-F1 classes
Majority class	23.0%	~0.003	112
TF-IDF + LogReg	54.2%	0.132	70
Your fine-tuned encoder	~56%	~0.20	~47

	Overtrained	Data-starved
Data size	Enough data, too many epochs	Not enough data, any epoch count
Fix	Early stopping	More data
Prediction quality	May or may not degrade	Hits a low ceiling

Class	Examples	Share of total loss
"Incorrect info on your report"	13,333	~23%
"Lost or stolen money order"	5	~0.009%

Approach	Max weight	Result
No weighting	1.0	47 classes at F1 = 0
Sqrt-inverse	3.6	Works — rescues 10 classes
sklearn `balanced`	128	Crashes training
Raw inverse	2,666	Don't even try

Class	Examples	Weight
"Incorrect info on your report"	13,333	0.06
"Problem with a purchase on your statement"	3,905	0.12
"Lost or stolen money order"	5	3.23
"Shopping for a line of credit"	4	3.61

	Accuracy	Macro F1	Zero-F1 classes
Without weighting	55.4%	0.199	49
With weighting	50.2%	0.216	39

Tier	Training examples	Typical F1 range
Head (6 classes)	≥ 2,000	0.5 – 0.7
Mid (~40 classes)	100 – 1,999	0.2 – 0.4
Tail (~67 classes)	< 100	Many at 0.0

Field	What to write
Variable changed	Exactly ONE thing
Held constant	List everything else
Prediction	What you expect — write this BEFORE running
Result	What actually happened
Meaningful?	Is this difference larger than noise?

Paper	Year	Key idea
Zhang et al. — Rethinking Generalization	2017	DNNs memorize yet generalize
Lin et al. — Focal Loss	2017	Reshape loss to focus on hard examples
Cui et al. — Class-Balanced Loss	2019	Effective number of samples
Nakkiran et al. — Deep Double Descent	2019	The U-curve has a second descent
Kang et al. — Decoupling	2020	Representations are fine; classifiers are biased
Menon et al. — Logit Adjustment	2021	You can't optimize accuracy and balanced error at once
Fang et al. — Minority Collapse	2021	Minority-class classifiers collapse in geometry
Mallinar et al. — Tempered Overfitting	2022	Benign vs tempered vs catastrophic

Model	Configuration	Key evidence
Charlie	1 epoch, full data (undertrained)	Only 1 epoch in logs, both curves still descending
Bravo	15 epochs, 5K subset (data-starved)	Most epochs but worst metrics. 157 batches/epoch vs 1808
Delta	3 epochs, full data (vanilla baseline)	Highest accuracy, matches Week 1 numbers
Alpha	3 epochs, full data, class weighting	Lower accuracy, higher F1, fewer zero-F1 classes

Welcome back. Last week you built your first neural model. You fine-tuned ModernBERT on consumer complaints and got numbers meaningfully better than TF-IDF. This week we ask two questions. Can you make the model better? And when a number moves, how do you know it's real? Those are the two skills: controlled improvement and diagnostic reasoning.

Here's where we are. The majority baseline gets 23 percent. TF-IDF gets 54 percent accuracy but only 0.132 macro F1 — 70 classes completely ignored. Your fine-tuned model improved both: roughly 56 percent accuracy, 0.20 macro F1, about 47 classes still at zero. You rescued about 23 classes. Real progress. But 47 classes still get nothing. The question: can we fix that? And just as important: if we change something and the metric moves by a point or two, is that real or noise?

This week is about two skills. First, diagnostic reasoning: if I hand you a trained model, can you look at the metrics and training curves and figure out what happened? Did it converge? Did it overfit? Was a specific intervention applied? Second, controlled improvement: when you make a change to the training pipeline, can you attribute the result to the change you made? The lecture covers both. In the lab, you'll practice diagnostic reasoning — figuring out what happened to models someone else trained. In the homework, you'll practice controlled improvement — training your own models and measuring the effects of specific changes.

Here's the plan. In the lecture, we'll cover the diagnostic tools — how to read training curves, what class weighting does, how to examine errors — and then the methodology of running controlled experiments. In the lab, you'll receive four pre-trained models with their training curves. All four are ModernBERT trained on the same data, but with different configurations. Some are good. Some have problems. You figure out which is which. Then in the homework, you train your own models.

Let's start with the most important diagnostic tool. If someone hands you a trained model and you can only look at one thing, look at the training curve. It tells you more about what happened during training than any single metric.

A training curve plots two values over the course of training. Train loss — the cross-entropy loss on the training batches — is what the optimizer directly minimizes. Val loss — the same metric on held-out data — tells you how the model performs on examples it has never seen. At the start, both are high. As training progresses, both should decrease. The diagnostic information is in what happens after that initial decrease. Do they both flatten? Does one keep going down while the other goes up? Does one stop at epoch 1? These patterns tell you different stories about what happened during training.

There are three regimes. Undertrained means you stopped too early. Both curves were still decreasing — the model had more to learn. The fix is simple: train for more epochs. Well-trained means you stopped at the right time. Training loss has stabilized, validation loss is at its lowest point. Overtrained means you went too far. Training loss has been driven to near zero — the model has memorized the training data — but validation loss is climbing. Now, I should flag: Nakkiran and colleagues showed in 2019 that this classical story is incomplete. If you keep training past the overfitting bump, test error can sometimes drop again — "deep double descent." The paper is in your readings. For today, the three regimes are your primary diagnostic tool, but keep in mind the picture is richer than this.

Here's how to read the curves in practice. On the right you see three example training runs. The first model — both curves still dropping. Undertrained. The second — both curves stable, val loss at its minimum. Well-trained. The third — train loss driven to near zero while val loss climbs steadily after epoch 3. Overtrained. The gap between the blue and red lines in that third panel is the overfitting gap. In the lab today, you'll see training curves from four different models and you'll use exactly this pattern recognition to diagnose them.

Here's a distinction that trips up even experienced practitioners. Both overtrained and data-starved models show rising validation loss. But the stories are completely different. An overtrained model had enough data but trained too long — early stopping fixes it. A data-starved model never had enough data to begin with — no amount of early stopping helps, because the model hit its ceiling on epoch 2 and everything after that is just noise. We saw this in our own experiments: a 149-million parameter model trained on just 5,000 examples learned what it could in the first few epochs, then the training loss kept dropping as it memorized, but its actual prediction quality plateaued. The val loss rose because the model got more confident, not because its predictions got worse. Mallinar and Nakkiran — yes, the same Nakkiran — published a taxonomy of overfitting in 2022. They call this regime "tempered" overfitting: some degradation in calibration, but much less than classical theory predicts. Not benign, not catastrophic, somewhere in between. Both papers are in your readings.

Let's work through one. Train loss drops steadily from 2.8 to 0.4 across 5 epochs. Val loss drops sharply in epoch 1, reaches its minimum around epoch 2, then starts creeping up. By epoch 5 it's back up to 1.6, above that minimum. This is a classic overtraining pattern. The best checkpoint is epoch 2, when val loss was lowest. What would you try next? Save that epoch 2 checkpoint and try early stopping. Or: ask whether you have enough data. If this is a small subset, maybe the issue isn't too many epochs but too few examples. You can't tell from the curve alone. You need to know the data size. This is the kind of reasoning I want you doing in the lab.
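To make the reading procedure concrete, here is a minimal sketch of the same pattern recognition in code. The function name `diagnose`, the tolerance `tol`, and the validation losses in the example are all hypothetical; only the train-loss endpoints (2.8 and 0.4) come from the worked example. It is a heuristic, and as noted above it cannot separate overtrained from data-starved without knowing the data size.

```python
def diagnose(train_loss, val_loss, tol=0.02):
    """Label a run from per-epoch losses. Heuristic sketch, not a formal test."""
    # Epoch (1-indexed) with the lowest validation loss: the checkpoint to keep.
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
    # Final gap between validation and training loss (the overfitting gap).
    gap = round(val_loss[-1] - train_loss[-1], 3)
    # Was validation loss still dropping by more than tol at the last epoch?
    still_falling = len(val_loss) < 2 or val_loss[-2] - val_loss[-1] > tol
    if best_epoch == len(val_loss) and still_falling:
        regime = "undertrained"
    elif val_loss[-1] - min(val_loss) > tol:
        # Overtrained OR data-starved: the curve alone cannot tell you which.
        regime = "val loss rising"
    else:
        regime = "well-trained"
    return best_epoch, regime, gap


train = [2.8, 1.9, 1.2, 0.7, 0.4]     # endpoints from the worked example
val = [1.9, 1.5, 1.55, 1.6, 1.65]     # hypothetical values for illustration
print(diagnose(train, val))
```

A "val loss rising" label is the cue to check how much data the run saw before reaching for early stopping.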

This is one of the most important papers in modern deep learning theory. Zhang and colleagues took a standard neural network and trained it on CIFAR-10 with the correct labels. It generalized fine — standard result. Then they replaced all the labels with random noise. The same network memorized the entire dataset perfectly — training loss went to zero. But of course it couldn't generalize, because there was nothing to generalize. Here's the puzzle: the network has enough capacity to memorize 50,000 random labels. So when you train it on real labels, why doesn't it just memorize those too? Why does it learn patterns instead of memorizing examples? The common distribution-free capacity measures — VC dimension, Rademacher complexity — are too loose to distinguish these cases. They predict that any model with enough capacity to fit random data should also overfit real data. But it doesn't. The paper's point isn't that all of statistical learning theory is dead — it's that the standard explanations based on explicit regularization and older capacity intuitions are inadequate for modern deep networks. This paper is the reason you should never trust a simple story about overfitting. The PDF is in your readings folder.

Now let's talk about the biggest problem with our dataset and the biggest lever for improving the model. Class imbalance. You've seen the numbers — 47 classes at zero F1. But numbers in a table are abstract. Let's make the problem concrete.

Let's look at the actual distribution. Your largest class, "incorrect information on your report," has over 13,000 training examples. Your smallest classes have just 4 or 5 examples each. Measured against a 5-example class, that's a 2,666 to 1 ratio. And 67 of your 113 classes, more than half, have fewer than 100 examples. This is the "tail." And this isn't some pathological dataset I constructed to make a point. This IS the real distribution of consumer complaints. Real-world classification is almost always like this. Medical diagnoses: a few common conditions dominate, hundreds of rare ones in the tail. Fraud detection: 99.9% of transactions are legitimate. Species identification: a few common species make up most observations. If you only ever work on balanced benchmarks like CIFAR-10, you'll never see this. But the moment you touch real data, you're in long-tail territory.

Let's be precise about the mechanism. Cross-entropy loss averages over all N training examples. Every example gets equal weight. But that means each class's contribution to the total loss is proportional to how many examples it has. The largest class contributes 23 percent of the total loss. The smallest contributes nine thousandths of a percent. Think about what that means for the optimizer. A single rare-class example produces a gradient just as large as a single common-class example, but summed over the dataset the rare classes' pull is negligible and the common classes' pull is enormous. The optimizer follows the gradient. And the gradient says: common classes matter, rare classes don't. Those 47 classes with F1 of zero? That's not a bug. That's the optimizer doing exactly what you told it to do.
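The arithmetic behind those shares can be sketched in a few lines. The two class counts come from the slide; the overall training-set size of about 58,000 is an assumption, back-computed from 13,333 examples being roughly 23 percent. The sketch also assumes every example carries about the same loss, which is roughly true at the start of training and less so later.

```python
def loss_share(class_counts):
    """Fraction of an unweighted, averaged loss attributable to each class,
    assuming all examples have roughly equal per-example loss."""
    total = sum(class_counts.values())
    return {name: count / total for name, count in class_counts.items()}


ASSUMED_TOTAL = 58_000  # assumption: inferred from 13,333 being about 23%
counts = {
    "Incorrect info on your report": 13_333,
    "Lost or stolen money order": 5,
    "(all other classes combined)": ASSUMED_TOTAL - 13_333 - 5,
}
shares = loss_share(counts)
for name, share in shares.items():
    print(f"{name}: {share:.3%}")
```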

The fix for class imbalance in cross-entropy is conceptually simple: multiply each example's loss by a weight that depends on its class frequency. Rare classes get higher weights, producing larger gradients, so the optimizer pays attention. But the critical question is: how much weight? And this is where people get into trouble. No weighting gives you 47 classes at zero. Sqrt-inverse, where each weight is proportional to one over the square root of the class count, gives a maximum weight of about 3.6x. That's enough to rescue 10 classes from zero without destabilizing training. Sklearn's "balanced" mode uses inverse frequency scaled by the number of classes (n_samples divided by n_classes times the class count), which gives weights up to 128x. We tried it. Training collapsed: the model oscillated wildly and never converged. Raw inverse frequency, measured relative to the largest class, gives 2,666x for the rarest class. That would be catastrophic. I should also mention focal loss, from Lin et al. 2017. Instead of reweighting by class frequency, focal loss down-weights well-classified examples regardless of class. Same problem, different angle. The paper is in your readings. We use class weighting in this course because it's simpler and directly addresses the class imbalance, but you should know focal loss exists.
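A rough sketch of how the two weighting schemes compare. `balanced_weights` follows the formula scikit-learn documents for `class_weight="balanced"`, n_samples / (n_classes * count). `sqrt_inverse_weights` rescales to a mean of 1, which is an assumption: the course's exact normalization is not stated here. The four counts are the classes shown on the weights slide, so absolute values will not match the full 113-class run, but the rarest-to-commonest ratio does not depend on the normalization.

```python
import math


def sqrt_inverse_weights(counts):
    """Weights proportional to 1/sqrt(count), rescaled to mean 1 (assumed)."""
    raw = [1 / math.sqrt(n) for n in counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def balanced_weights(counts):
    """n_samples / (n_classes * count), as documented for sklearn 'balanced'."""
    total, k = sum(counts), len(counts)
    return [total / (k * n) for n in counts]


def max_ratio(weights):
    """How many times more the rarest class counts than the commonest."""
    return max(weights) / min(weights)


counts = [13_333, 3_905, 5, 4]  # toy subset: four classes from the slide
print(f"sqrt-inverse ratio: {max_ratio(sqrt_inverse_weights(counts)):.1f}x")
print(f"balanced ratio:     {max_ratio(balanced_weights(counts)):.1f}x")
```

The square root is what tames the spread: the ratio between extremes becomes the square root of the count ratio rather than the count ratio itself.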

Cui and colleagues at Cornell and Google gave this problem a theoretical foundation. Their key insight: the effective number of samples for a class is less than the raw count. Why? Because as you add more examples, each new example increasingly overlaps with existing ones in feature space. The 100th example of "incorrect information on your report" tells the model less than the first example did. They formalize this with a single parameter, beta, that controls how fast diminishing returns kick in. When beta is near zero, every example counts equally — no reweighting. When beta is near one, you get full inverse-frequency weighting. The sweet spot is somewhere in between. Our sqrt-inverse weighting is a related heuristic — it shares the same intuition of diminishing returns with class size, but it's not a special case of their formula. Their weighting is based on the inverse of the effective number, which for beta near 1 behaves like inverse frequency, not inverse square root. Sqrt-inverse is a simpler, more conservative compromise that works well in practice. The paper gives you the theoretical framework for thinking about WHY raw inverse over-corrects: it treats every rare example as maximally informative, but in reality, even 5 examples have some overlap.
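For reference, the effective number of samples in the paper is E_n = (1 - beta^n) / (1 - beta), with class weights proportional to 1 / E_n. The sketch below shows the two limits described above. The function names and the mean-1 rescaling are illustrative choices, not the paper's code.

```python
def effective_number(n, beta):
    """Effective number of samples for a class with n examples (0 <= beta < 1)."""
    return (1 - beta ** n) / (1 - beta)


def class_balanced_weights(counts, beta):
    """Weights proportional to 1 / E_n, rescaled to mean 1 (illustrative)."""
    raw = [1 / effective_number(n, beta) for n in counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def ratio(counts, beta):
    weights = class_balanced_weights(counts, beta)
    return max(weights) / min(weights)


counts = [13_333, 5]
for beta in (0.0, 0.99, 0.9999, 0.999999):
    print(f"beta={beta}: rarest/commonest weight ratio = {ratio(counts, beta):.1f}")
```

At beta = 0 every class has an effective number of 1, so nothing is reweighted. As beta approaches 1 the ratio climbs toward the raw count ratio of about 2,666 without quite reaching it.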

Here are the actual weights from our dataset. The most common class — incorrect information on your report, with over 13,000 examples — gets a weight of 0.06. That means its loss contribution is reduced. The rarest class — shopping for a line of credit, with 4 examples — gets a weight of 3.61. That's a 60x ratio. The model will care 60 times more about getting a rare-class example right than a common-class example. That's significant but not insane. Full inverse-frequency weighting would give a ratio of over 2,000x. We tried that. Training collapsed. Sqrt-inverse is the practical middle ground.

Here's what happens when you apply class weighting. Look at these numbers from our experiments. Accuracy drops from 55.4 to 50.2 percent — a 5 point hit. But macro F1 improves from 0.199 to 0.216. And the number of zero-F1 classes drops from 49 to 39 — ten classes that the model previously ignored now get at least some correct predictions. This looks like a contradiction. How can the model get worse and better at the same time?

This chart makes the trade-off concrete. On the left, without class weighting, there's a big spike at F1 equals zero — those 49 classes the model ignores. The head classes cluster around 0.5 to 0.7. On the right, with class weighting, the zero spike shrinks. Ten classes that were at zero now have F1 between 0.1 and 0.3. They're not great, but they exist. The head classes drop slightly — they give up a few F1 points. The net effect on macro F1 is positive because rescuing a class from zero to 0.1 adds more to the average than losing a class from 0.7 to 0.65. But accuracy drops because the model now sometimes predicts a rare class where it used to correctly predict a common one. Kang and colleagues at Facebook AI showed in 2020 that on their benchmarks, representations learned on imbalanced data are largely fine — the problem is in the classifier layer. When we reweight the whole model, we may be distorting useful representations to help the classifier. That paper is in your readings and it's a good one.

This isn't just a quirk of our dataset. Accuracy and macro F1 are fundamentally different objectives. Accuracy asks: what fraction of examples did you classify correctly? Since most examples come from common classes, accuracy rewards getting common classes right. Macro F1 asks: what's the average per-class F1? Every class counts equally regardless of size. So macro F1 rewards covering rare classes even at the expense of common ones. When they disagree — as they do here — neither metric is "wrong." They're measuring different things. The question you have to answer is: what does your application need? A spam filter? You care about accuracy — most emails are legitimate, and you need to classify the common case correctly. A medical triage system? You care about macro F1 — missing a rare condition could be fatal. This is a design decision, not a statistical one.
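A toy example, with entirely made-up labels and predictions, shows how the two metrics can move in opposite directions. Model A always predicts the common class. Model B gives up some common-class hits in exchange for covering the two rare classes.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no hits scores 0."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: 90 common examples, 5 each of two rare classes.
y_true = ["common"] * 90 + ["rare_a"] * 5 + ["rare_b"] * 5
pred_a = ["common"] * 100
pred_b = (["common"] * 80 + ["rare_a"] * 5 + ["rare_b"] * 5   # the 90 common
          + ["rare_a"] * 3 + ["common"] * 2                    # the 5 rare_a
          + ["rare_b"] * 3 + ["common"] * 2)                   # the 5 rare_b

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, f"acc={accuracy(y_true, pred):.2f}",
          f"macro_f1={macro_f1(y_true, pred):.3f}")
```

Model B loses accuracy and gains macro F1, the same direction of movement as the class-weighting result.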

Menon and colleagues at Google Research proved something that practitioners had intuited but never formalized. If you want to optimize for balanced error — treating every class equally regardless of size — the optimal classifier isn't the standard argmax of the logits. You need to adjust the logits by log class priors. Common classes get penalized; rare classes get boosted. This can be applied post-hoc at inference time or baked into the training loss. The paper shows this is Fisher consistent for balanced error and unifies several prior approaches to handling imbalance. But the deeper insight is the one that matters for us: the optimal classifier for accuracy is a DIFFERENT function from the optimal classifier for balanced error. You cannot maximize both simultaneously. So when you see accuracy go down and macro F1 go up with class weighting, that's not a bug in your training — it's a mathematical necessity. You're moving from one optimal frontier to another.
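The post-hoc version of that adjustment is small enough to sketch: subtract tau times the log class prior from each logit before taking the argmax. The function name, the scaling parameter `tau`, and the example numbers are illustrative assumptions, not code from the paper or from this course.

```python
import math


def logit_adjusted_argmax(logits, class_priors, tau=1.0):
    """Predict argmax of (logit - tau * log prior).

    tau=0 recovers the ordinary argmax; tau=1 is the full adjustment.
    """
    adjusted = [z - tau * math.log(p) for z, p in zip(logits, class_priors)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Hypothetical two-class case: class 0 is common, class 1 is rare.
logits = [2.0, 1.0]
priors = [0.95, 0.05]   # assumed training-set class frequencies
print(logit_adjusted_argmax(logits, priors, tau=0.0))  # plain argmax
print(logit_adjusted_argmax(logits, priors, tau=1.0))  # prior-adjusted
```

Because log of a small prior is a large negative number, the rare class gets the bigger boost, which is exactly the trade of common-class accuracy for rare-class coverage.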

Class weighting helps rare classes. But which ones? And where does the model still fail? Error analysis tells you. This is the skill that separates someone who trains models from someone who understands them.

When you compute F1 for each class and group by frequency, the pattern is stark. Head classes — the 6 classes with 2,000 or more training examples — get F1 between 0.5 and 0.7. Mid-tier classes get 0.2 to 0.4. Tail classes — the 67 classes with fewer than 100 examples — many are at zero. Performance tracks class frequency almost perfectly. Class weighting helps the tail — it rescues some classes from zero. But it can't overcome data scarcity. A class with 5 training examples simply doesn't have enough signal for the model to learn a reliable decision boundary. No amount of loss weighting fixes that. Fang and colleagues published a paper in PNAS in 2021 showing that beyond a critical imbalance ratio, something they call "minority collapse" occurs: the last-layer geometry skews so that minority-class classifiers converge toward each other. The model can't tell those classes apart no matter what you do. Separately — and this is our observation, not the paper's claim — we've found that scaling model size from 395 million to 8 billion parameters doesn't help on our task. The paper is in your readings.
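One way to produce the tiered view, sketched with the thresholds from the slide (2,000 and 100 training examples). The per-class F1 values below are invented for illustration; in practice they would come from your own evaluation.

```python
from statistics import mean


def tier(n_train):
    """Frequency tier by training-example count (thresholds from the slide)."""
    if n_train >= 2000:
        return "head"
    if n_train >= 100:
        return "mid"
    return "tail"


def f1_by_tier(train_counts, per_class_f1):
    grouped = {"head": [], "mid": [], "tail": []}
    for label, n in train_counts.items():
        grouped[tier(n)].append(per_class_f1[label])
    return {
        t: {"classes": len(v), "mean_f1": round(mean(v), 3),
            "zero_f1": sum(f == 0 for f in v)}
        for t, v in grouped.items() if v
    }


# Hypothetical counts and scores for five classes.
train_counts = {"A": 13_333, "B": 3_905, "C": 450, "D": 5, "E": 4}
per_class_f1 = {"A": 0.68, "B": 0.55, "C": 0.31, "D": 0.0, "E": 0.0}
summary = f1_by_tier(train_counts, per_class_f1)
print(summary)
```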

Here's a question to think about. When the model misclassifies a tail-class example, where does the prediction land? There are three stories. Maybe the model defaults to predicting head classes because they dominate training. Maybe the model predicts a semantically similar class — it knows the general topic but can't distinguish subcategories. Or maybe errors are scattered randomly among tail classes. I want you to form a prediction now. You'll test it empirically in the homework by building a confusion matrix and categorizing where the tail-class errors actually go. The answer tells you something fundamental about the model's failure mode.
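A minimal sketch of the tally you would run to test your prediction: for every misclassified tail-class example, record which frequency tier the wrong prediction landed in. It only separates "head" from "mid" from "other tail"; deciding whether a landing class is semantically similar still requires reading the labels. All names and data below are hypothetical.

```python
from collections import Counter


def tail_error_destinations(y_true, y_pred, train_counts,
                            head_min=2000, tail_max=100):
    """Count which tier absorbs the errors made on tail-class examples."""
    def tier(label):
        n = train_counts[label]
        if n >= head_min:
            return "head"
        return "mid" if n >= tail_max else "tail"

    landing = Counter()
    for t, p in zip(y_true, y_pred):
        if tier(t) == "tail" and p != t:
            landing[tier(p)] += 1
    return landing


# Hypothetical labels and predictions.
train_counts = {"report": 13_333, "statement": 3_905,
                "money_order": 5, "credit_line": 4}
y_true = ["money_order", "money_order", "credit_line", "credit_line", "report"]
y_pred = ["report", "money_order", "report", "money_order", "report"]
landing = tail_error_destinations(y_true, y_pred, train_counts)
print(landing)
```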

After the confusion matrix, look at individual examples. Find the 5 to 10 predictions where the model was most confident and most wrong. These are gold. They tell you exactly what the model misunderstands. Common patterns you'll see: the true label and predicted label share vocabulary, or the complaint describes a situation that genuinely could be classified either way. Those aren't model failures — they're labeling ambiguities. Distinguishing between model errors and label ambiguity is a real skill, and one you'll practice in the homework.
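Pulling out those examples is a one-line sort once each prediction carries a confidence score. The record layout below (keys `text`, `true`, `pred`, `confidence`) is an assumed format for illustration, with confidence taken to be the probability assigned to the predicted class.

```python
def most_confident_errors(records, k=10):
    """Return the k wrong predictions with the highest confidence."""
    wrong = [r for r in records if r["pred"] != r["true"]]
    return sorted(wrong, key=lambda r: r["confidence"], reverse=True)[:k]


# Hypothetical predictions.
records = [
    {"text": "complaint 1", "true": "A", "pred": "A", "confidence": 0.99},
    {"text": "complaint 2", "true": "B", "pred": "A", "confidence": 0.97},
    {"text": "complaint 3", "true": "C", "pred": "A", "confidence": 0.41},
    {"text": "complaint 4", "true": "B", "pred": "C", "confidence": 0.88},
]
worst = most_confident_errors(records, k=2)
for r in worst:
    print(r["confidence"], r["true"], "->", r["pred"], "|", r["text"])
```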

Let me pull this together. Everything we've covered — training curves, class weighting, per-class metrics, error analysis — is part of a diagnostic mindset. A trained model is a body of evidence. The training curve tells you about optimization. Per-class F1 tells you about failure patterns. The confusion matrix tells you about error structure. Your job is to read this evidence and understand what happened. Not just "the number went up" — but why, and whether you can trust it.

Now the lab. This is different from last week. You'll receive four models. All four are ModernBERT-base fine-tuned on the same consumer complaints dataset. But each was trained differently. Each comes with its training log — per-epoch train loss, validation loss, accuracy, and macro F1. What you don't know is what training configuration produced each model. Your job: evaluate them on the validation set, build a comparison table, plot the training curves, form a hypothesis about what each config was, and bring your hypothesis to me with evidence. Not to the notebook — to me, in person. I'll confirm or correct one model at a time.

Now the second half. You've learned how to diagnose what happened to a model. In the homework, you'll make changes yourself. Let's talk about how to do that systematically.

The principle is dead simple. Change one variable. Hold everything else constant. Measure the effect. This is basic experimental design. And yet in practice, most ML work violates it. People change the learning rate and the scheduler and add class weighting in the same run, the number goes up, and they declare victory. But they don't know which change helped. Maybe two of the three made things worse. In the homework, every experiment you run uses a template that forces you to name exactly one variable you changed and list everything you held constant.

I want to share a real mistake we made while building this course's materials. We ran an experiment that changed both the learning rate and the learning rate scheduler at the same time. Macro F1 improved by 2 points. Great result, right? But then a reviewer asked: which change helped? And we couldn't answer. Maybe the new learning rate was better. Maybe the new scheduler was better. Maybe one helped and the other hurt, and the net effect happened to be positive. We had to throw the result away and rerun two separate experiments: one changing only the learning rate, one changing only the scheduler. The confounded experiment wasn't free. It cost GPU time and wall-clock time, and then we paid again for the two reruns. Controlled experiments feel slower in the moment. They're faster in total because you never have to redo them.

Here's the template. Five fields. Variable changed — exactly one thing. Held constant — list everything else explicitly. Prediction — what you expect to happen and why, and you write this before you run the experiment. Result — what actually happened. And meaningful — is the difference you observed large enough to trust, or could it be noise? That last field is the one most people skip. We'll talk about why it matters in a few slides.
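If you prefer to keep the template in code rather than prose, one possible shape is a small dataclass with the same five fields. The class name, the `is_complete` helper, and the example values are hypothetical; the homework notebook may structure this differently.

```python
from dataclasses import dataclass


@dataclass
class ExperimentRecord:
    variable_changed: str   # exactly ONE thing
    held_constant: list     # everything else, listed explicitly
    prediction: str         # written BEFORE running
    result: str = ""        # filled in after the run
    meaningful: str = ""    # is the difference larger than noise?

    def is_complete(self):
        """True once every field, including the post-run ones, is filled in."""
        return all([self.variable_changed, self.held_constant,
                    self.prediction, self.result, self.meaningful])


# Hypothetical entry, written before the run.
record = ExperimentRecord(
    variable_changed="batch size 32 -> 16",
    held_constant=["learning rate", "scheduler", "epochs", "seed", "data split"],
    prediction="Macro F1 rises slightly; more steps expose rare classes more often.",
)
print(record.is_complete())  # still missing result and meaningful
```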

Quick notes on two knobs. Learning rate — you covered this last week. 2e-5 is the standard for fine-tuning. But we found that ModernBERT at 1e-3 — 50x higher — still produced a working model. Modern architectures with AdamW are more robust than the original BERT guidance suggests. Schedulers — we tested cosine versus linear at the same learning rate. The difference was less than half a macro F1 point. On this task, with this training length, the scheduler barely matters. That's a real finding. Not every knob is worth turning. The skill is figuring out which ones actually move the needle and spending your time there.

Batch size is another knob. Smaller batches mean more optimizer steps per epoch, which means more chances for rare classes to show up in a batch and influence the gradient. We tested batch 16 versus 32. The effect was real but modest — sometimes it helped, sometimes the difference was within noise. The mechanism is debated. What matters: it's an empirical question. You'll test it in the homework and decide for yourself whether it helps on this task.

Early stopping. Instead of training for a fixed number of epochs, monitor a metric after each epoch and stop when it stops improving. The tension on long-tail tasks: validation loss and macro F1 can disagree. Val loss might start climbing because the model is becoming overconfident on common classes, while macro F1 is still improving because the model is getting better at rare classes. Use macro F1 for your stopping criterion — it's aligned with what you care about. But be cautious: macro F1 is noisier than loss, so small epoch-to-epoch movements might be noise. Use a patience of at least 2 epochs. And always save checkpoints so you can go back to the best one. Best checkpoint is rarely the last checkpoint.
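The stopping rule itself is simple enough to sketch without a training framework. This version replays a list of per-epoch macro F1 scores and reports which checkpoint to restore and when training would have stopped. The function name, `min_delta`, and the scores are illustrative assumptions.

```python
def early_stopping_epoch(scores, patience=2, min_delta=0.0):
    """Replay per-epoch scores (higher is better).

    Returns (best_epoch, stop_epoch), both 1-indexed: the checkpoint to
    restore and the epoch at which training would have halted.
    """
    best, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best + min_delta:
            best, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(scores)


# Hypothetical macro F1 per epoch.
macro_f1 = [0.15, 0.19, 0.20, 0.198, 0.199, 0.21]
print(early_stopping_epoch(macro_f1, patience=2))
print(early_stopping_epoch(macro_f1, patience=3))
```

With a patience of 2 this run halts at epoch 5 and restores epoch 3, missing the later improvement at epoch 6. A patience of 3 catches it, which illustrates why a noisy metric calls for more patience.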

One more thing before the lab. I want you to be able to recognize when training code is broken, not just when hyperparameters are suboptimal. Here's a real bug. PyTorch's cross-entropy loss expects raw logits, the unnormalized outputs of the model. Internally, it applies log-softmax for numerical stability. But if someone applies softmax to the logits before passing them to cross-entropy, the function sees values between 0 and 1 instead of values that could be any real number. The gradients become tiny because softmax compresses the range. The loss gets stuck around 4.5, just below 4.73, which is the negative log of 1/113, the entropy of a uniform distribution over our 113 classes. The model effectively never learns. In the lab, you'll train a small model on a data subset and you'll see this bug in action. The training curve is unmistakable: the loss flatlines. Once you've seen it, you'll never miss it again.
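You can see why the loss flatlines without a GPU. The sketch below reimplements a logits-based cross-entropy in plain Python (a stand-in for the PyTorch loss, not the library code) and feeds it a confident, correct prediction twice: once as raw logits, once after an extra softmax. Because the doubly-softmaxed inputs all lie between 0 and 1, even a perfect prediction cannot push the loss below log(1 + 112/e), about 3.74, and a clueless one sits at log(113), about 4.73.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(logits, target):
    """Log-softmax followed by negative log-likelihood, for one example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


K = 113                      # number of classes in the dataset
logits = [0.0] * K
logits[7] = 20.0             # a confident, correct prediction for class 7

correct = cross_entropy(logits, 7)             # logits passed as intended
buggy = cross_entropy(softmax(logits), 7)      # BUG: softmax applied first
floor = math.log(1 + (K - 1) / math.e)         # best loss the buggy path allows

print(f"correct usage: {correct:.4f}")
print(f"buggy usage:   {buggy:.4f}")
print(f"buggy floor:   {floor:.4f}   uniform guess: {math.log(K):.4f}")
```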

Everything we've talked about — metrics, comparisons, experiments — assumes the numbers mean something. But do they? Your validation set has 6,430 examples across 113 classes. That averages to about 57 per class. But the distribution follows the same long tail as the training set. Some classes have hundreds of validation examples. Some have very few. About 20 classes have 5 or fewer validation examples — for these classes, F1 is essentially random. About 10 classes have exactly 1 validation example — a literal coin flip. That's roughly 18 percent of the classes in your macro F1 average contributing noise, not signal. So when you see macro F1 move by one point — say, from 0.199 to 0.210 — you need to ask: did 93 classes get slightly better, or did a handful of coin flips land differently? The homework has you build a histogram of validation examples per class so you can see this for yourself. The takeaway isn't that metrics are useless. It's that you need to interpret them with awareness of their noise floor.
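A quick way to check the noise floor on your own split is to count validation examples per class and see how many classes fall under a support threshold. The function name, the threshold default, and the toy labels are assumptions for illustration; the homework's histogram shows the same information graphically.

```python
from collections import Counter


def support_summary(val_labels, n_classes, low=5):
    """Summarize how many classes have too few validation examples."""
    support = Counter(val_labels)
    low_support = sum(1 for n in support.values() if n <= low)
    singletons = sum(1 for n in support.values() if n == 1)
    missing = n_classes - len(support)   # classes absent from validation
    return {
        "mean_per_class": len(val_labels) / n_classes,
        "low_support": low_support,
        "singletons": singletons,
        "missing": missing,
        "noisy_fraction": (low_support + missing) / n_classes,
    }


# Hypothetical validation labels for a 5-class toy problem.
val_labels = ["A"] * 40 + ["B"] * 12 + ["C"] * 3 + ["D"] * 1
stats = support_summary(val_labels, n_classes=5)
print(stats)
```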

Last concept before we wrap up. Reproducibility. When you train the same model with the same config but a different random seed, three things change: the classification head's initialization, the order data is batched, and the dropout masks. Same config, different seed, different number. A 1-2 point variation in macro F1 between seeds is completely normal. This means a single run is anecdotal. The practical workflow given free T4 constraints: explore with one seed — try many ideas quickly. When you find something promising, confirm with three seeds. Report the mean and standard deviation. If your improvement disappears across seeds, it wasn't real.
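Reporting mean and standard deviation across seeds takes only the standard library. The scores below are hypothetical, and the comparison rule at the end (difference in means larger than the two standard deviations combined) is a crude rule of thumb assumed for this sketch, not a statistical test.

```python
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)


# Hypothetical macro F1 from three seeds per configuration.
baseline = [0.199, 0.191, 0.207]
variant = [0.216, 0.209, 0.221]

base_mean, base_sd = summarize(baseline)
var_mean, var_sd = summarize(variant)
print(f"baseline: {base_mean:.3f} +/- {base_sd:.3f}")
print(f"variant:  {var_mean:.3f} +/- {var_sd:.3f}")

# Crude heuristic (assumption): trust the gain only if it exceeds the spread.
survives = (var_mean - base_mean) > (base_sd + var_sd)
print("improvement survives seed noise:", survives)
```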

After the lab, the homework is where you train your own models. You'll reproduce the class weighting result on full data — you've seen what it does in the lab, now verify it yourself. You'll test batch size. You'll implement early stopping. You'll do a real error analysis — confusion matrix, per-class F1 by tier, hard-example inspection. And you'll discover how noisy your validation metrics actually are. The memo has five sections with prompts embedded in the notebook. Due Wednesday morning, HTML via Moodle. Plan for five to six hours.

Here's your reading list. Eight papers spanning 2017 to 2022 that cover the theoretical foundations of everything we discussed today. The three in bold — Zhang, Cui, and Menon — we covered in the lecture. The other five were referenced on individual slides. You are not expected to read all eight — that would be unreasonable for a one-week turnaround. Pick two or three that connect to whatever you find most interesting in your memo. If you write about the accuracy-F1 trade-off, read Menon and Kang. If you write about why the tail is hard, read Fang. If the training curve behavior surprised you, read Nakkiran and Mallinar. Engaging with the literature is what separates a graduate-level memo from a homework report.

One sentence. Next week we try something completely different. See you in the lab.

STOP HERE. Everything after this slide is the lab reveal. Do not advance until every student has submitted their hypotheses in the notebook. When ready, go one model at a time.

Show this slide when everyone has written their hypotheses. Go through one at a time. Charlie is the easiest — everyone should get "undertrained." Bravo is the dramatic one — 15 epochs but worst performance because of data scarcity. Delta is the control — matches what they built in Week 1. Alpha is the deepest lesson — the accuracy-F1 paradox from class weighting. Pause on Alpha and ask: why would you accept lower accuracy? When is that trade-off worth it?

Controlled Improvement and Error Analysis

ECBS5200 — Week 2

A model is only as good as your ability to understand why it works.

Where we left off

This week

Today's plan

Training Curves

The single most diagnostic artifact you can look at.

What is a training curve?

Three regimes

Reading the curves

Data-starved vs overtrained: a crucial distinction

Worked example: read this curve

The paper: Zhang et al. 2017

The Class Imbalance Problem

Before we fix it, let's see how bad it really is.

Your class distribution

What cross-entropy actually does

Class weighting: make the loss care

The paper: Cui et al. 2019 — Effective Number of Samples

What the weights look like

The accuracy-F1 paradox

Why the paradox happens

Metrics disagreement: which number do you trust?

The paper: Menon et al. 2021 — Logit Adjustment

Error Analysis

Where exactly does the model fail — and why?

Per-class F1 by frequency tier

Confusion matrix: where do errors land?

Hard-example inspection

The Diagnostic Mindset

The Lab: Diagnostic Forensics

Controlled Experiments

Changing things systematically

The principle

The mixing knobs warning

The experiment template

Learning rate and schedulers

Batch size

Early stopping

Common training bugs

Val set reliability

Seeds and reproducibility

Your homework this week

Week 2 Reading List

Next week

REVEAL BELOW

Do not advance past this slide until students have written their hypotheses.

Lab Reveal