### question 1 (b)

@Dr Wischik: is there a missing bracket in log x - theta? Maybe log(x - theta), as that would make solutions possible?

UPD: After looking back at the notes: do we maximise each parameter by taking only partial derivatives, i.e. only d/dbeta? I feel like I'm missing a point when more parameters are involved.

### 2 Answers

There is no missing bracket. The equation is \[\log\hat F(x)\approx-\beta\log x-\gamma\max(\log x-\theta,\,0).\] Plotting this on log-log axes means taking the substitution \(\varphi=\log x\) and \(\psi=\log\hat F(x)\) and plotting \(\psi\) against \(\varphi\). You get \(\psi=-\beta\varphi-\gamma\max(\varphi-\theta,\,0)\). This is two straight lines that meet at \(\varphi=\theta\).

After wasting two hours trying to get a line of Python code to run, you can see that plotting the empirical distribution of the given dataset on a log-log scale does indeed produce these two lines.
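A minimal sketch of that plot (the `data` array here is a made-up stand-in for the dataset from the question; the empirical tail distribution \(\hat F(x)\) is just the fraction of observations \(\ge x\)):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# stand-in for the real dataset: heavy-tailed values rounded down to integers
rng = np.random.default_rng(0)
data = np.floor(rng.pareto(1.5, size=1000) + 1)

# empirical tail distribution: Fhat(x) = fraction of observations >= x
xs = np.sort(np.unique(data))
Fhat = np.array([np.mean(data >= x) for x in xs])

# log-log axes, i.e. phi = log x against psi = log Fhat(x)
plt.loglog(xs, Fhat, ".")
plt.xlabel("x")
plt.ylabel(r"$\hat F(x)$")
plt.savefig("tail_plot.png")     # the two straight-line regimes meet at phi = theta
```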

Note that as this is a fake maths course, we defined the distribution function as a function that decreases from 1 to 0. This is useful in this case, though it is not the conventional definition.

Now for the parameters: we defined the likelihood as \[\prod_{i=1}^nf(x_i)\] where \(f(x)=\mathbb P(X=x)\), so we need \(f\). Normally this would be zero for a continuous distribution, but in the empirical dataset the data is rounded down to integers (I guess), so we can take, for \(x\in\mathbb N\): \[f(x)=\mathbb P(x\le X<x+1)=\hat F(x)-\hat F(x+1)=...\] Write a python function to expand this.
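One way that function might look (a sketch: it treats the \(\approx\) in the model as an equality, and the names `Fhat` and `f` are my own):

```python
import numpy as np

def Fhat(x, beta, gamma, theta):
    """Tail distribution: exp(-beta*log x - gamma*max(log x - theta, 0))."""
    logx = np.log(x)
    return np.exp(-beta * logx - gamma * np.maximum(logx - theta, 0))

def f(x, beta, gamma, theta):
    """P(x <= X < x+1) for integer x >= 1, i.e. Fhat(x) - Fhat(x+1)."""
    return Fhat(x, beta, gamma, theta) - Fhat(x + 1, beta, gamma, theta)
```

(With \(\beta=1\), \(\gamma=0\) this reduces to \(\hat F(x)=1/x\), so e.g. \(f(1)=1-\tfrac12=\tfrac12\), a quick sanity check.)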

Then you can calculate the log-likelihood by adding together the logs of these values. The resulting function looks quite complicated with all the data in it; I wouldn't take partial derivatives of it.

You can use the scipy.optimize.fmin function (example at the end of the notebook) to maximise this over \(\beta\), \(\gamma\), \(\theta\) (note that fmin minimises, so hand it the negative log-likelihood).
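Putting the pieces together, roughly (again with a made-up `data` array standing in for the real dataset, and my own function names):

```python
import numpy as np
from scipy.optimize import fmin

rng = np.random.default_rng(1)
data = np.floor(rng.pareto(1.5, size=500) + 1)   # stand-in for the real dataset

def Fhat(x, beta, gamma, theta):
    """Tail distribution: exp(-beta*log x - gamma*max(log x - theta, 0))."""
    logx = np.log(x)
    return np.exp(-beta * logx - gamma * np.maximum(logx - theta, 0))

def neg_log_lik(params):
    beta, gamma, theta = params
    p = Fhat(data, beta, gamma, theta) - Fhat(data + 1, beta, gamma, theta)
    if np.any(p <= 0):
        return np.inf          # parameters that give non-probabilities: reject
    return -np.sum(np.log(p))

# fmin *minimises*, so we hand it the negative log-likelihood
beta, gamma, theta = fmin(neg_log_lik, x0=[1.0, 1.0, 1.0], disp=False)
```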


---

Differentiating gives you \( \frac{\text{d}}{\text{d}x}\hat{F}(x) = \hat{f}(x) \) which is the probability density function corresponding to the cumulative distribution function \( \hat{F}(x) \).

*(There will be a minus there if you are looking at the 'tail version' of the cumulative distribution function, i.e. the decreasing one. Obviously we want the density function to be nonnegative so it should be clear something's wrong if you forget the minus.)*
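For concreteness, carrying out that differentiation for the \(\hat F\) from the first answer (a sketch; note the minus for the tail version):

\[
\hat f(x) = -\frac{\text{d}}{\text{d}x}\hat F(x) =
\begin{cases}
\beta\, x^{-\beta-1} & \text{if }\log x<\theta,\\[2pt]
(\beta+\gamma)\, e^{\gamma\theta}\, x^{-(\beta+\gamma)-1} & \text{if }\log x>\theta,
\end{cases}
\]

since \(\hat F(x)=x^{-\beta}\) below the kink and \(\hat F(x)=e^{\gamma\theta}x^{-(\beta+\gamma)}\) above it.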

Now you are left with the task: 'find parameters \((\beta, \gamma, \theta) \) for which \( \hat{f}(x) \) generates the observed results with the highest probability'. This optimized \( \hat{f}_{\beta, \gamma, \theta}(x) \) will be called the max likelihood estimator for this model (two lines in log-log plot).

By hand, obtain the formula for \( \text{log likelihood}(\beta, \gamma, \theta) \); it is defined as \( \log \mathbb{P}(\text{dataset} \mid \beta, \gamma, \theta) \).

Then use functionality such as that offered by scipy.optimize.fmin to make your computer numerically minimise the *negative* log-likelihood (for a fixed dataset it's just a function of three variables).

The last two lines have their counterparts in the max likelihood estimation part of these notes: https://notebooks.azure.com/djw1005/libraries/founds/html/1-1__fit.ipynb.

**EDIT:** As people have noticed, this is incorrect, because for continuous distributions \( \mathbb{P}(\text{dataset} \mid \beta, \gamma, \theta) \) is \( 0 \).

Instead, in the continuous case one defines [src1, src2] the likelihood of the parameters as \( f_{\text{joint}}(\text{dataset} \mid \beta, \gamma, \theta) = f_{\text{joint}}(x_1, \ldots, x_n \mid \beta, \gamma, \theta) \), where \( f_{\text{joint}} \) is the joint probability density of the space the dataset comes from. In our case that is the density of \( (X_1, X_2, \ldots, X_n) \), where all the \( X_i \) are i.i.d., and hence:

\( f_{\text{joint}}(x_1, \ldots, x_n | \beta, \gamma, \theta) = \prod\limits_i f(x_i | \beta, \gamma, \theta) \).

You should be able to make it work from here as before.
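A sketch of this continuous version, using the piecewise density worked out from \(-\hat F'(x)\) (the `data` array is again a made-up stand-in, and the function names are mine):

```python
import numpy as np
from scipy.optimize import fmin

rng = np.random.default_rng(2)
data = rng.pareto(1.5, size=500) + 1      # stand-in continuous dataset, all >= 1

def neg_log_lik(params):
    beta, gamma, theta = params
    logx = np.log(data)
    # -d/dx Fhat(x) = (beta/x)*Fhat(x) below the kink, ((beta+gamma)/x)*Fhat(x) above
    slope = np.where(logx < theta, beta, beta + gamma)
    if np.any(slope <= 0):
        return np.inf                      # density must be nonnegative
    logf = np.log(slope) - logx - beta * logx - gamma * np.maximum(logx - theta, 0)
    return -np.sum(logf)

beta, gamma, theta = fmin(neg_log_lik, x0=[1.0, 1.0, 1.0], disp=False)
```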