The idea of distributions did not come immediately to L. Schwartz. The final definition came to him after several attempts, as he describes in his wonderful autobiography. Here I will give a non-historical account.
Weakly Differentiable Functions
There are many functions in mathematics, calculus, and engineering that are not differentiable at every point.
A simple example is
$$
F(x) = |x|.
$$
Its derivative equals $-1$ for $x<0$ and $1$ for $x>0$, but at $x=0$ the derivative does not exist.
However, if we consider
$$
f(x) = \operatorname{sgn}(x) :=
\begin{cases}
-1, & x<0,\\
0, & x=0,\\
1, & x>0,
\end{cases}
$$
we see that
$$
F(y)-F(x) = \int_x^y f(t)\, dt,
$$
for every $x,y \in \mathbb{R}$.
Thus $F$ behaves like a primitive of $f$ on $\mathbb{R}$, even though $f$ is not regular enough for the classical rules of calculus, where primitives are defined for continuous functions. In particular, we cannot conclude that $F'(0) = f(0)$, since $f$ is discontinuous at $0$.
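A quick check when $x<0<y$:
$$
\int_x^y \operatorname{sgn}(t)\,dt
= \int_x^0 (-1)\,dt + \int_0^y 1\,dt
= x + y
= |y| - |x|
= F(y) - F(x).
$$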
Nevertheless, in some weak sense, $f$ is the derivative of $F$. The idea of weak derivatives was introduced by Sobolev, prior to Schwartz.
This might look trivial, but there are natural situations where weak derivatives are convenient. A classical example arises in some of the earliest partial differential equations in the literature: the wave equation and the heat equation.
Consider someone playing a guitar. Under some (idealized) assumptions, the motion of the string is described by the wave equation
$$
u_{tt}(x,t) = c^2\, u_{xx}(x,t),
$$
where $u(x,t)$ is the height of the string at position $x$ and time $t$. Here, $u_{tt}$ and $u_{xx}$ are second-order partial derivatives in $t$ and $x$.
A simple difficulty occurs if the musician plucks the string: the initial shape of the string at $t=0$ may resemble the graph of $|x|$. Then $u(x,0)$ is not differentiable with respect to $x$ at some point. How can we use a differential equation if the initial shape is not differentiable?
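A first hint, to be made precise below: if the initial shape is $u(x,0)=g(x)$ and the initial velocity is zero, d'Alembert's classical formula
$$
u(x,t) = \tfrac{1}{2}\bigl(g(x+ct) + g(x-ct)\bigr)
$$
still makes perfect sense for $g(x)=|x|$, and distributions provide exactly the framework in which this $u$ is a genuine solution of the wave equation.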
The idea is to differentiate without ever differentiating directly, using the principle of integration by parts.
Integration by Parts
The classical integration by parts formula states that if $f$ and $\phi$ are continuously differentiable (say, $C^1$ on $[a,b]$) then
$$
\int_a^b f'(x)\phi(x)\, dx
= - \int_a^b f(x)\phi'(x)\, dx
+ f(b)\phi(b) - f(a)\phi(a).
$$
If $\phi$ vanishes outside $(a,b)$, the boundary terms disappear and this simplifies to
$$
\int_\mathbb{R} f'(x)\phi(x)\, dx
= - \int_\mathbb{R} f(x)\phi'(x)\, dx.
$$
The surprising fact is that the right-hand side still makes sense even if $f$ is not classically differentiable: it suffices that $f$ be locally integrable and that $\phi$ be smooth with compact support. Thus we can compute
$$
\int f'(x)\phi(x)\, dx
$$
without ever differentiating $f$ directly.
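To see this identity in action, here is a minimal numerical sketch (assuming NumPy and SciPy are available; the particular bump function and its center at $0.3$ are choices made only for this illustration). It checks, for $f(x)=|x|$, that $\int \operatorname{sgn}(x)\,\phi(x)\,dx = -\int |x|\,\phi'(x)\,dx$, without ever differentiating $|x|$:

```python
import numpy as np
from scipy.integrate import quad

# A smooth bump centered at 0.3, supported on (-0.7, 1.3),
# so that its support straddles the kink of |x| at the origin.
def phi(x):
    u = x - 0.3
    return np.exp(-1.0 / (1.0 - u * u)) if abs(u) < 1 else 0.0

# Classical derivative of the bump (chain rule; zero outside the support).
def dphi(x):
    u = x - 0.3
    return phi(x) * (-2.0 * u / (1.0 - u * u) ** 2) if abs(u) < 1 else 0.0

# Left side: the candidate weak derivative sgn(x), tested against phi.
lhs, _ = quad(lambda x: np.sign(x) * phi(x), -0.7, 1.3, points=[0.0])

# Right side: -integral of |x| phi'(x); note |x| itself is never differentiated.
rhs, _ = quad(lambda x: -abs(x) * dphi(x), -0.7, 1.3, points=[0.0])

print(lhs, rhs)  # the two values agree up to quadrature error
```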
Very Regular Functions as Probes
In practice, you never observe a function $f(x)$ directly. For example, if $f(x)$ is the temperature of a metal bar, you only measure temperature averages near a point. You are probing $f$ using a smoother function $\phi$. Formally, define
$$
T_f(\phi) := \int_\mathbb{R} f(x)\,\phi(x)\, dx.
$$
An interesting example is when $\phi$ is a $C^\infty$ function with compact support satisfying $\phi\ge0$ and $\int \phi \, dx = 1$. Then $T_f(\phi)$ is a weighted average of $f$ over the support of $\phi$.
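A standard example of such a test function is the bump
$$
\phi(x) =
\begin{cases}
c\, e^{-1/(1-x^2)}, & |x|<1,\\
0, & |x|\ge 1,
\end{cases}
$$
where the constant $c>0$ is chosen so that $\int \phi\, dx = 1$; this $\phi$ is $C^\infty$, nonnegative, and supported in $[-1,1]$.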
Importantly, if $f$ and $g$ are continuous and
$$
T_f(\phi) = T_g(\phi)
$$
for every $\phi\in C_c^\infty(\mathbb{R})$, then $f=g$. (Sketch: if $f(x_0)\neq g(x_0)$, then by continuity $f-g$ has a constant sign on a small interval around $x_0$; testing against a nonnegative $\phi$ supported in that interval gives $T_f(\phi)\neq T_g(\phi)$.) Here $C_c^\infty(\mathbb{R})$ denotes the vector space of all $C^\infty$ functions with compact support. Thus, the functional $T_f$ contains all the information about $f$, and we can identify $f$ with this linear functional on the space of test functions.
Moreover, if $f\in C^\infty$, integration by parts yields
$$
T_{f'}(\phi) = -\,T_f(\phi'), \tag{1}
$$
so we can recover the derivative of $f$ in this weak sense.
We define $g$ to be the weak derivative of $f$ if
$$
T_g(\phi) = -\, T_f(\phi')
$$
for every test function $\phi$.
The Breakthrough of Schwartz
Sobolev used weak derivatives very successfully. But in Sobolev’s approach, weak derivatives were still functions.
Schwartz went beyond this: he observed that identifying functions with their functionals $T_f$ on $C^\infty_c$ allowed him to keep differentiating arbitrary linear functionals, even if they do not come from any function.
He defined a distribution as any continuous linear functional
$$
T : C^\infty_c(\mathbb{R}) \to \mathbb{R},
$$
(continuity is with respect to a suitable notion of convergence of test functions, whose details we omit), and its derivative by
$$
T'(\phi) := -\,T(\phi').
$$
This allows us to differentiate objects that are not functions and to obtain derivatives that are also not functions.
For example, differentiating $F(x)=|x|$ in the sense of distributions gives
$$
T'_F = T_{\operatorname{sgn}},
$$
where $\operatorname{sgn}$ is the sign function defined above.
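This can be verified directly from the definition. Splitting the integral at the origin and integrating by parts on each half-line (the boundary terms vanish because $\phi$ has compact support),
$$
T'_F(\phi)
= -\int_{\mathbb{R}} |x|\,\phi'(x)\, dx
= \int_{-\infty}^{0} x\,\phi'(x)\, dx - \int_{0}^{\infty} x\,\phi'(x)\, dx
= -\int_{-\infty}^{0} \phi(x)\, dx + \int_{0}^{\infty} \phi(x)\, dx
= T_{\operatorname{sgn}}(\phi).
$$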
Differentiating again,
$$
T''_F = T'_{\operatorname{sgn}} = 2\,\delta_0,
$$
where $\delta_0$ is Dirac’s delta at $0$, defined by
$$
\delta_0(\phi) := \phi(0).
$$
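Indeed, splitting at the origin and using the fundamental theorem of calculus on each half-line,
$$
T'_{\operatorname{sgn}}(\phi)
= -\int_{\mathbb{R}} \operatorname{sgn}(x)\,\phi'(x)\, dx
= \int_{-\infty}^{0} \phi'(x)\, dx - \int_{0}^{\infty} \phi'(x)\, dx
= \phi(0) + \phi(0)
= 2\,\phi(0).
$$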
This distribution does not come from any classical function. It comes from a measure assigning weight $1$ to the point $0$. And we can keep differentiating ad infinitum.
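For instance, the derivatives of $\delta_0$ itself are
$$
\delta_0'(\phi) = -\phi'(0),
\qquad
\delta_0^{(k)}(\phi) = (-1)^k\,\phi^{(k)}(0),
$$
and these no longer come from functions or even from measures.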
The space $C_c^\infty$ is a classical space of test functions. We can also consider other test function spaces (such as the space of $C^\infty$ functions with rapid decay), which we omit here.
That is a major advantage of distributions: they place many disparate objects in the same space. $L^p$ functions, wildly discontinuous functions, continuous functions, $C^k$ functions, regular measures, and even more exotic objects are all distributions, that is, functionals on a suitable space of test functions.
Within this framework, we can with great freedom take linear combinations, form series, and differentiate these otherwise unrelated objects. Distributions allow us to treat all of them on equal footing.
Why Distributions Are Useful
Distributions are extremely useful because they allow you to break down the proof of the existence of solutions to problems involving functions and their derivatives (such as partial differential equations) into two steps:
Existence as a distribution: First, prove that a solution exists in the sense of distributions. At this stage, one can freely manipulate derivatives of arbitrary order without worrying about their existence in a strong sense or their regularity.
Regularity improvement: Then, step by step, establish that this distribution is more regular, eventually showing that it corresponds to an actual function with the desired regularity.
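A classical instance of this two-step strategy is Weyl's lemma: every distribution that solves Laplace's equation $\Delta T = 0$ in the sense of distributions is in fact given by a $C^\infty$ harmonic function.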
A Final Example
I have also encountered distributions in my own work (and with my students), not in the classical PDE setting, but in the study of functions similar to the Weierstrass function
$$
f(x)=\sum_{n=0}^\infty \frac{1}{2^n} \cos(2^n x).
$$
This belongs to Weierstrass's famous family of examples of continuous, nowhere differentiable functions. However, one is tempted to formally differentiate term by term:
$$
f'(x)=\sum_{n=0}^\infty -\,\sin(2^n x), \tag{2}
$$
which does not converge in the classical sense. Nevertheless, the right-hand side does define a distribution. Indeed, for a $2\pi$-periodic $C^\infty$ function $\phi$ (the natural test functions in the periodic setting),
$$
T'_f(\phi)
= - \int_0^{2\pi} f(x)\,\phi'(x)\, dx
= \sum_{n=0}^\infty \int_0^{2\pi} -\,\sin(2^n x)\,\phi(x)\, dx.
$$
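Here exchanging the sum and the integral is justified by the uniform convergence of the series defining $f$, and the resulting series converges absolutely: integrating by parts twice on the circle (the boundary terms cancel by periodicity) gives
$$
\int_0^{2\pi} \sin(2^n x)\,\phi(x)\, dx
= -\frac{1}{4^{n}} \int_0^{2\pi} \sin(2^n x)\,\phi''(x)\, dx,
$$
so the $n$-th term is bounded by $2\pi\,4^{-n}\max|\phi''|$.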
Thus, the formal derivative in (2) does make sense as a distribution, and this point of view can be used to study the regularity of the Weierstrass function and similar examples.