Epidemiology - Modeling the Spread of Diseases

March 18, 2020

Please link, rather than copy. Thank you! Thomas German version

Epidemiology is the science of the distribution, patterns, and determinants of diseases. We are interested in the distribution here. Like in many scientific fields mathematical modeling plays a large role. We look at a basic model for the spread of disease.

Note: I am not an epidemiologist but a mathematician and modeler. I hope this overview contributes a bit to spreading general understanding of the spreading of diseases like CoViD-19. In particular, some of the assumptions for parameters are "made up" and I try to keep the explanations understandable.

We aim to cover some ground here:

  • We get to know a classic epidemiological model, the SIR model. This is "too simple" but many of the contemporary models are modifications and extensions of this model.
  • We see what role key quantities like the basic reproduction numbers play and how the herd immunity threshold is derived.
  • We investigate the mathematics behind "flattening the curve" and in particular see which proportion of the population will be ill at the peak of the epidemic. This gives us an intuition for how much one has to reduce infection-relevant contacts for avoiding the breakdown of the medical system.
  • We have a browser-based SIR model implementation for your own experimenting with the parameters.
  • We look at limitations of the SIR model.
  • I have a Jupyter notebook on GitHub for further experimentation.

A simple epidemiological model

One of the most basic models in epidemiology is the SIR model (Susceptible to Infectious to Recovered) going back to Kermack und McKendrick (published in 1927).

The SIR model applies to diseases which will leave recovered people immune against a renewed infection.

We consider a partition of the population in three groups: - Susceptible (but presently uninfected), - Infectious (ill or infectious before the outbreak of the disease), - Recovered (or otherwise immune, e.g. through vaccination).

Every person is in one of the three groups at any given time. A migration between the three is possible from susceptible to infectious and from infectious to recovered. We assume that no other migrations are possible. In particular, any immunization beyond recovery (e.g. vaccination) has already happened at the start.

We assume that the total population consists of $N$ persons and that $N$ is large enough that we can use a continuous model. (We will drop $N$ in a bit, but we want to stay close to the typical formulation for now.)

We introduce the counts $S$ of susceptible persons, $I$ of infectious, and $R$ of recovered. These are functions in time (defined for times $t >= 0$) and we write $S(t)$, $I(t)$, and $R(t)$. The total population size $N$ is assumed to be constant in time.

One aspect of the population dynamics we are neglecting here are births and deaths (or people moving from and into the population). This is based on the assumption that the overall effect of them is very small in comparison to the dynamics between $S$, $I$, and $R$. This makes our model a closed system and we have $$S(t) + I(t) + R(t) = N.$$

We consider the rates of change of these quantities in terms of derivatives that we write as $\dot S$, $\dot I$, and $\dot R$. From the equation for the closed system we get the balance condition

$$\dot S(t) + \dot I(t) + \dot R(t) = 0.$$


We introduce a positive rate $\beta > 0$ that measures the number of persons an infectious person has infection-relevan contacts per unit of time. As often in mathematical modeling, we have rates as derivatives even if we might be more used to consider relative changes from one discrete timestep to the next. If you compare to compounding of interest, this here is the equivalent of continuous compounding. The rates $\beta$ and $\gamma$ do, however, have units $\frac{1}{\textrm{unit of time}}$, e.g. number per day.

If we assume that all persons of the population are the other side of the contact with equal probability, the probability of the contact person being susceptible is $S(t) / N$. This person will then migrate to infected. Because this happens for all $I(t)$ infected persons simultaneously, the rate of migrations, i.e. the decrease of the of the number of susceptible but uninfected is $\beta S(t)I(t) / N$, so we have

$$ \dot S(t) = - \frac{\beta S(t)I(t)}{N}. $$


There are several options for the modeling of recovery. We might make an assumption of how long people remain infectious. We take an even simpler approach and introduce a rate of recovery $\gamma$, the proportion of infectious persons that migrate to the recovered state per time interval. This increases the number $R$ of recovered persons, so we get

$$ \dot R(t) = \gamma I(t). $$

We will see how $\gamma$ relates to the duration of staying in the infectious state.

Equation for the infectious compartment

By substituting the equations for the change in susceptible and revocered compartments into the balance condition we get

$$ \dot I(t) = - \dot S(t) - \dot R(t) = \left(\beta \frac{S(t)}{N} - \gamma \right) I(t). $$

Invariance with respect to total population size

When we move from counts $S$, $I$, $R$ to proportions $s = S/N$, $i = I/N$ and $r = R/N$ the equations remain the same as for $N=1$. This invariance allows us to set $N=1$ and interpret $S$, $I$, and $R$ as proportions of the total population. We will do this for what folows.The model is - with the exception of the duration of the infectious state - very similar to a model for nuclear chain reaction: $S$ then is the proportion of fuel and $R_0$ is the neutron multiplication factor. $R$ is then the burnt fuel or other material. It is hard to make sense of $I$ in this intuitive translation, it is proportional to the intensity of the reaction. Most likely $\gamma$ would be very large. There is a famous illustration of nuclear chain reaction using mouse traps from the German TV show Sendung mit der Maus on nuclear power. It starts at approximately 7:40. This gives us some intuition why it is so hard to control an epidemic.

Typical duration of infectiousness

We started out with only infectious persons (and so no new cases are entering the $I$ state), we would have the equation

$$ \dot I(t) = - \gamma I(t). $$

This is the equation for exponential decay. Its solution is

$$ I(t) = I(0) exp(- \gamma t). $$

We can plot this:

time to recovery

If we read off time on the $x$ axis (e.g. in days), then the y axis gives the proportion of still infectious persons: After 2 days, approximately 65% of the infectious persons at the start are still infectious.

But we can also consider the area under the curve as horizontal lines to get durations of infectiousness: 65% of the people infectious at the beginning have a time to recovery of at least 2 days.

This means that the area under the curve is the mean time to recovery (or duration of infectiousness) $T_I$. We can compute the integral:

$$ T_I = \int_0^\infty \exp(-\gamma t) dt = - \frac{1}{\gamma} \exp(-\gamma t) \big|_{t=0}^\infty = \frac{1}{\gamma}. $$

If we want to convert the average time to recovery into a recovery rate $\gamma$, we would take the reciprocal of the time to recovery $\gamma = 1/T_I$. (But the model does make a structural assumption about the distribution, too.)

We might consider the following: If we look at the first known German CoViD-19 cluster ("Webasto-Cluster"), we might assume that people were infected on January 27 and have recovered by February 27, so we would get an approximate time to recovery of 30 days (this is, however, a maximal rather than average time to recovery). This translates into $\gamma \approx \frac{1}{30} \approx 0.033$. Unfortunately, things are not that simple and we would need more precise (i.e. clinical) observations.

The Robert-Koch-Institut links studies that hint at an average incubation period of 5-6 days and that patients remained infectious for 8 days after the first symptoms. This is, of course, much better than our free-hand estimate from third-hand sources and might give a recovery rate of $\gamma \approx \frac{1}{15} \approx 0.066$. But here is a weakness of the SIR model we are using: Ideally, we would only take the infectious time into account, not the latency. This would make another factor of approximately $2$ (and there are lower estimates for the infectious time than 8 days). If we do that, however, we would unerestimate the transmission rate (because at the beginning of the latency period there are fewer infections than at the end). For a disease with a substantial latency one would rather use a SEIR model adding an Exposed subpopulation that has caught the virus, but the virus is still latent.

A very rough estimate of the infection rate $\beta$

If we assume that recovery initially does play a crucial role for the development of the infectious subpopulation and almost all persons are susceptible, we can derive a very rough estimate of the infection rate $\beta$ from the increases in observed infections. This is because then the equation for the infectious reduces to that of exponential growth $$ \dot I = \beta I S - \gamma I \approx \dot I = \beta I, $$ so we have for some $k>0$ $$ \beta \approx \frac{1}{k} \ln \frac{I(t+k)}{t} $$ In Germany, from March 12 to 16 the reported infections increased from 1567 cases to 6012 cases. This would give an infection rate of $$ \beta \approx \frac{1}{5} \ln \frac{6012}{1567} \approx 0.26. $$

But if we assume an infectious period of $T_I = 8$ (days), we would likely need to take recovery into account. We might still be able to assume $S / N \approx 1$ so we would get $$ \dot{\ln I} = \frac{\dot I}{I} \approx \beta - \gamma \approx 0.26 + 0.125 \approx 0.385. $$ This gives us a $\beta$ that is corrected by adding $\gamma$, thus incorporating recovery. We would still run into trouble with the latency period, though.

A more earnest method of estimating the parameters runs simulations of the model and then varies the parameters to minimize the difference between simulated and observed cases. This seems to work well for influenza outbreaks, for CoViD, the data basis seems much more difficult to work with.

The Basic Reproduction Number

If we assume that infection-relevant contacts occur with a rate of $\beta$ and the average time to recovery is $T_I$, every infectious person has a total of $R_0 = \beta T_I = \frac{\beta}{\gamma}$ relevant contacts (if all contacts were with persons in $S$, they would be infections). This number is immensly important and is called the basic reproduction number. (Note: The naming of the basic reproduction number $R_0$ is not related to the proportion of recovered persons $R$. Likely the typical variable names come from different contexts.)

One can see that multiplying $\beta$ and $\gamma$ by the same factor causes a "time rescaling": If we double both, everything happens twice as fast, i.e. the model state at $t$ is the same as the original model's state at $2t$. We will see that the "shape" of the epidemic - in particular the maximum of simultaneous infections - crucially depends on the basic reproduction number $R_0$.

Looking at the equations, we observe that for $R_0 < 1$ the derivative of $I$ is always negative (because $S/N < 1$). The number of infectious people decreases in this case. Thus - according to the model and in the sense of epidemiology - everything is under control.

Wikipedia's entry for the basic reproduction number lists estimates of $R_0$, for CoViD-19 the estimate is in the range of $1.4$ to $3.9$. The Robert-Koch-Institut lists 2.4-3.3.

Compared to this, our estimate of $\hat R_0 = \frac{0.26}{0.033} \approx 8.1$ would be very high. If we used $\gamma = 0.125$ (8 days infectious from clinical observation) and the the correction by $\gamma$, we would have $\hat R_0 = \frac{0.385}{0.125} \approx 3.1$. This is roughly within expectations, but recall from above that we had doubts about leaving out latency.

This could be an indication that the number of undetected cases is low and leading to the SIR model having limited applicability here.

A SIR model for your own experiments

Here is a browser-based SIR model so you can experiment with the parameters:

For this choice of parameters the basic reproduction number is $R_0 = \beta / \gamma $ = (Wikipedia lists an estimate of 1.4-3.9 for CoViD-19). The peak proportion of infectious $I$ people is approximately .

The maximum for the number of simultaneously infectious

One key question is, how many people are infectious (or rather ill) at the same time. If we make an assumption that e.g. 5% of the infected need intensive medical care, we can see if the medical system has enough capacity to treat patients. If $5\% \max I > capacity$, we will be in trouble (other factors also play a role, e.g. the capacity is not constant, the base utilization for other diseases etc.).

We assume that the basic reproduction number is $R_0 > 1$, see above for $R_0 < 1$.

So let us calculate the maximum of simultaneously infectious $I_{max}$. One can show that the maximum is attained and we call the point in time when it is attained $t^*$ That the maximum is attained is a typical compactness argument. We would also need to consider uniqueness or specify that $t^*$ is the first such point in time, but it turns out that the maximum is indeed unique. In the end the root cause has a lot to do with the monotonicity of $S$ and $R$ that we exploit below.

We cannot directly compute the maximum $I(t^*)$, instead we will compute $S(t^*)$ and $R(t^*)$ and then use the equation $S+I+R = 1$ of the closed system. (Recall that we assumed $N=1$.)

This may be the mathematically most involved section. If you want, you can skip it in the first pass and just look at the result.

Proportion $S(t^*)$ of susceptible at the peak $I$

The maximum of $I$ is the point at which the change in $I$ switches signs from positive to negative or, more precisely, when the derivative $\dot I(t^*) = 0$. Plugging this into the equation for $\dot I(t)$ we get $$ 0 = \dot I(t^*) = \left(\beta S(t^*) - \gamma \right) I(t^*). $$

But because $I > 0$ (or we would not have any epidemic) we have $$ 0 = \beta S(t^*) - \gamma $$ or equivalently $$ S(t^*) = \frac{\gamma}{\beta} = \frac{1}{R_0}. $$

Thus we have determined $S(t^*)$.

Herd immunity

This calculation also gives us when we reach herd immunity, i.e. when the number of infectious people declines on its own. We often see the rearranged equation $S(t^*) R_0 = 1$ as a starting point. If we (for this throught only) assume that we have very small (negligible) proportion of infections, the proportion of immune persons is given by $R(t^*) = 1 - S(t^*)$. Thus the herd immunity threshold $HIT$ is $$ HIT = R(t^*) = 1-S(t^*) = 1-\frac{1}{R_0}. $$

If we assume $R_0 \approx 2.5-3.3$ for CoViD-19, the herd immunity threshold is $HIT \approx 60\%-70\%$.

My impression is that this is the 60%-70% of the population that are expected to be infected without interventions (e.g. by chancellor Merkel). But this assumes that, when we reach this immunity level, the number of infections can be pushed close $0$. However, if we reach the HIT at the height of the epidemic, the number of infectious people will decline, but a significant proportion of susceptible people (in the model possibly everyone) will be yet be infected.

Relation between number of susceptible $S$ and recovered $R$

To determine $R(t^*)$, we can establish a relation between the proportion of susceptible $S$ and recovered $R$. For this we eliminate $I$ from the equations for $\dot S$ and $\dot R$.

If we solve both the equations for $\dot S$ und $\dot R$ for $I$ and equate them, we get $$\dot S / (-\beta S) = \dot R / \gamma$$.

When we write the quotient $\dot S/S$ as derivative of the logarithm, this yields $$ \dot {(\ln S)} = \dot S / S = -\frac{\beta}{\gamma} \dot R. $$

The method of separation of variables lets us integrate in time on both sides. We obtain

$$ \ln S(t) - \ln S(0) = -\frac{\beta}{\gamma} (R(t) - R(0)) $$ and thus taking the exponential we have $$ S(t) = S(0) \exp \left( -\frac{\beta}{\gamma}{(R(t) - R(0))}\right) $$

Number $R(t^*)$ of recovered and $I(t^*)$ of infectious at the peak of $I$

Now if $S(t^*) = \frac{\gamma}{\beta}$, $R(0) \approx 0$, and $S(0) \approx 1$, we have $$ R(t^*) = R(0) -\frac{\gamma}{\beta} (\ln S(t_{crit} - \ln S(0)) \approx -\frac{\gamma}{\beta} \ln \frac{\gamma}{\beta} $$ and $$I_{max} = I(t^*) = 1 - S(t^*) - R(t^*) \approx 1-\frac{\gamma}{\beta} (1-\ln \frac{\gamma}{\beta} ).$$

This is the estimate of the maximum of $I$ we sought.

The dependence of the peak of infections on the basic reproduction number

As we saw, the (approximative) value of $I$ for the peak We saw that, under the assumptions that initially the entire population is susceptible and (which is impllied) almost noone is immune yet, the (approximate) value of $I$ at the peak depends only on the basic reproduction number. The correspondence is $$ I_{max} \approx 1-\frac{1}{R_0} (1 + \ln R_0). $$ We can plot this relation:

Maximaler Krankenstand vs. R0


Under the assumptions of the model (a substantial qualification) the peak number of simultaneously infected people $I(t^*)$ increases strongly with increasing basic reproduction number $R_0$. We can try to make a back of the envelope calculation of what this means for the medical system.

If we assume that Germany has approximately 25.000 ICU so approximately 3 per 10000 people. If we assume that we could make half of these, or 1.5 per 100000 people, available for treatment of CoViD patients that are critical cases. If we assume 1.5% of all cases to be critical, somewhat optimistic relative to the (outdated) 5% estimate for China (lower right figure in the Diagrams subsection), but maybe critical cases are only temporarily critical, I would not know. Then the maximum infection proportion of $\frac{1.5}{10000} / 1.5\% = 1\%$ would be critical. In the model this corresponds to $R_0 \approx 1.15$, much lower than the 2.4 - 3.3 figure given by the RKI.

What to do?

Impact of early recognition and isolation of known cases

It seems that Hong Kong, Singapore, South Korea (after some initial struggle) and Taiwan had very good success tracking cases and containing the disease. Bavaria (where yours truly lives) has apparently given up tracking by March 16 because they could not identify infection chains.

A speculative comment: It seems that (skiing) vacation play an important role as a source of infections and that schools were an important factor in spreading them. To me, the following pair of numbers seems to be a rather strong indication: On March 12, the day before the general closure of schools was announced, there were 500 identified cases and 125 schools had been closed as part of the policy to close schools with at least one identified case in them. Letting kids go to school for a week after the holidays and (only) then putting them under quarantine as done for South Tirol visitors probably was not ideal.

How would we (roughly) consider the effect of early detection in the model? If the time to recovery is $T_I$ but one diagnoses already after $T_D$ and (successfully) isolates identified cases, any infectious case will only have $\tilde R_0 = \beta T_D$ instead of $R_0 = \beta T_I$ infection-relevant contacts.

If we re-define $I$ as the proportion of non-diagnosed infectious, we plug this this lower $\tilde R_0$ or $\tilde \gamma = 1/T_D$ into our model. The actual proportion of infectious people is then given by $\frac{T_I}{T_D} I$, of these $I$ have not been diagnosed.

General isolation

Now many European countries, Germany included, attempt to generally reduce potentially infection-relevant contacts as much as possible. In the model we might assume that a reduction of contacts by a fixed factor implies the same reduction of $\beta$ (or $R_0$). But how easy is it to reduce contacts by a factor 3 to 4?

If we look at some of the flattening the curve illustrations, it sometimes seems that you need to get the numbers down by a factor of two (or try to increase capacity of the medical system). With the numbers we got, however, the difference is quite a bit more pronounced:We should say that our assumptions may or may not be accurate. But we did make assumptions consciously and made them transparent, so we can debate whether they are reasonable or not. I would add that the RKI also speaks of the need to reduce the number of contacts by a factor of 3 and that we have, sadly, seen overloaded medical systems.

flattening the curve

We should also say that with our assumptions, we need to reduce the number of contacts by 3-4 to get the number of infections to decline. For the weaker goal of flattening the curve to not overload the medical system, we need to reduce the number of contacts by 2.85-3.85. This will probably not lead to that different a policy.

We will see how things work out, let us hope for the best.

Limitations of the model

Some of the things left out of the model do not significantly limit the conclusions: The model does not include a case fatality rate. However, if we know the case fatality rate $q$, we expect (up to a small temporal error) $q R$ fatalities and $(1-q) R$ recovered. (But for Ebola, it seems that people also infected themselves from deceased people.) Our consideration of the use of tracking above works in a very similar way.

But we should question some of the implied assumptions:

  • The model is lacking a spatial component. In particular, infection is independent of spatial distance between individuals except for the implied effect through $\beta$ or $R_0$.
  • Unidentified cases are not modeled. Beyond the very crude assumption we made for the tracking this is very hard to include. I would expect this to be a cause for the difficulties of fitting the model to (publicly) reported numbers.
  • The stochastic nature of the spreading of the disease and the observation of cases is not considered. This may be OK for a very coarse model like this, but for more refined models (e.g. with unidentified cases, a spatial component) this might be a more significant limitation.

Maybe I will follow up with an exposition of a more advanced model.


Using the basic SIR model we have made an excursion through very elementary epidemiological considerations. In particular, we saw the role the basic reproduction number $R_0$ plays. We have also seen how the peak number of ill people can be determined approximatively from $R_0$. From this we estimated how far the reproduction number must be reduced to avoid overloading the medical system. In particular, it would seem that drastic measures are likely a good idea if relate the estimated number of severe cases to the capacity.

I hope this little overview can contribute to the general understanding of this type of models. Even if we only scratched the basics, we now have a little more background to the analysis of policy options that appears to drive govermental reaction to the pandemic.

I appreciate your comments, criticism, and suggestions for improvements. Please send them by mail to Thomas Viehmann tv@lernapparat.de.

Again, I would like to emphasize that I am not an epidemiologist but a Machine Learning and PyTorch trainer and consultant by profession.


Steve Lizcano has had very helpful comments for a draft of this blog. Naturally, all inaccuracies are my own doing.


That CoViD captures our attention likely does not help all other problems of this world, e.g. the fate of war refugees. CoViD will take quite a toll in terms of lives and economic impact, but they should not stop us to also think of others.