| • Some of the important tools of statistics with applications in clinical studies in neurology are hazard ratio, hypothesis testing, confounding variables, P value, confidence interval, etc. |
| • Bayesian analysis requires the establishment of a prior probability; eg, to show that treatment A is superior to treatment B after a trial has been conducted, it is necessary to specify the probability of treatment A's superiority based on evidence available before the trial. |
Some of the concepts and techniques used to evaluate the results of clinical trials are as follows:
Hazard ratio. This represents the odds that an individual in the group with the higher hazard reaches the endpoint first. In a therapeutic trial examining time to disease resolution with a drug, it represents the odds that a treated patient will resolve symptoms before a control patient. In a trial to evaluate preventive effect, it describes the likelihood of progression of disease in the treatment group compared to the control group. For example, a hazard ratio of 0.66 in a clinical trial of use of statins or fibrates to lower serum cholesterol was associated with a one-third lower risk of stroke (01). With a hazard ratio of 1.12, no association was found between lipid-lowering drug use and coronary heart disease.
Hypothesis testing. A statistical procedure can be designed to test and possibly reject a "null hypothesis," a term used to state that no difference exists between outcomes resulting from treatments that are being compared. The results of such comparisons are seldom identical; there will usually be some difference in the outcomes between the experimental and the control groups. If the difference is statistically significant, the null hypothesis is rejected. The chi-square test, which compares observed data with the data expected under a specific hypothesis, is commonly used to test the null hypothesis.
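As an illustration of the chi-square test, the following minimal sketch computes the Pearson chi-square statistic (without continuity correction) and its P value for a 2 x 2 table. The function name and the patient counts are hypothetical and are not taken from any cited trial.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for a 2 x 2 table (1 degree of freedom).

    Layout (hypothetical):
        group 1: a with the outcome, b without
        group 2: c with the outcome, d without
    Returns (chi2, p_value).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 30 of 100 improve on treatment, 15 of 100 on control.
chi2, p = chi_square_2x2(30, 70, 15, 85)
print(f"chi-square = {chi2:.2f}, P = {p:.4f}")
```

In practice a continuity correction or Fisher exact test may be preferred when cell counts are small.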
Confounding variables. These are defined as the variables correlated (positively or negatively) with both the dependent variable and the independent variable so that the results do not reflect the actual relationship between the variables under study. For example, search for the causes of diseases in epidemiological studies is based on associations with various risk factors, but there may be other factors that are associated with the exposure and affect the risk of developing the disease that will distort the observed association between the disease and exposure under study. Various methods to modify a study design to exclude or control confounding variables, besides randomization, include restriction and matching based on age and sex.
Sample size. In a clinical trial, an important question is the number of subjects needed for a statistically significant result. There are several methods for sample size calculation, which are usually based on the statistics used in the analysis of the data. According to the "rule of thumb," an important component of sample size calculations is the power associated with a certain statistical procedure (34). Switching to a different statistical procedure during the data analysis may alter the anticipated power. Moreover, the estimated treatment effect may no longer be meaningful in the scale of the new statistic.
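One widely used approximation, shown here only as a sketch, gives the number of subjects per group needed to compare 2 means with a 2-sided test. It assumes a normal approximation, equal group sizes, and a common standard deviation; the function and parameter names are illustrative.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect a mean difference `delta`.

    Assumes a 2-sided test at level `alpha`, a common standard deviation
    `sd`, and the normal approximation
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical trial: detect a 5-point difference when the SD is 10 points.
print(n_per_group(delta=5, sd=10))              # 80% power
print(n_per_group(delta=5, sd=10, power=0.90))  # 90% power needs more subjects
```

Raising the desired power or shrinking the difference to be detected increases the required sample size.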
Errors. Errors occur when one incorrectly evaluates the difference in outcomes between the placebo and the treatment groups. Type 1 errors are the erroneous conclusion of difference when in fact no difference exists. The probability of a type 1 error (alpha) is usually set at 0.05. Type 2 errors occur when a false conclusion is made that the 2 outcomes are not significantly different when they actually are. Type 2 errors can result from erroneously failing to reject the null hypothesis. The probability of a type 2 error (beta) decreases as the sample gets larger or the statistical power (1-beta) increases.
P values. P values, or significance levels, indicate the probability of obtaining results at least as extreme as those observed if the null hypothesis were true, and they are used to assess the degree of dissimilarity between 2 or more sets of measurements or between 1 set of measurements and a standard. P values measure the strength of the evidence against the null hypothesis. The smaller the p value, the stronger the evidence against the null hypothesis. When calculating the p value, one chooses a measure of the quantity of interest and a range of measurements that will be computed.
P value is usually indicated as smaller than 0.05 (p < 0.05) or smaller than 0.01 (p < 0.01). When the p value is between 0.05 and 0.01, the result is usually considered to be statistically significant; if the p value is less than 0.01, the result is often considered to be highly statistically significant. The advantage of p values is that they provide a specific, objectively chosen level for the investigator to keep in mind. Furthermore, it is simpler to determine if the p value is larger or smaller than 0.05 than to compute the exact probability. The main disadvantage is that p values suggest a rather meaningless cutoff point that is not relevant to the investigation. P values convey meaningful information only if they are put into a clinical context. Confidence interval can be used to indicate the clinical significance of a p value. This is important because a small clinical difference may be statistically significant because of a large sample size, whereas a clinically important effect may appear statistically nonsignificant if the number of subjects studied is too small.
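The dependence of the p value on sample size can be sketched as follows. The example assumes a 2-sample z test with a known common standard deviation, and the numbers are hypothetical: the same 1-point difference is tested with 50 and with 5000 subjects per group.

```python
import math
from statistics import NormalDist

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p value for a difference in means (z test, known SD)."""
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same small difference (1 point, SD 10) at two sample sizes.
for n in (50, 5000):
    print(n, two_sided_p(diff=1, sd=10, n_per_group=n))
```

The difference is far from significant in the small trial and highly significant in the large one, although its clinical importance is unchanged.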
The P value remains controversial, and the idea that a single number can capture both the long-range outcomes of an experiment and the evidential meaning of a single result has been questioned. An arbitrary division of results into "significant" or "nonsignificant" according to the p value was not the intention of the founders of statistical inference. Some investigators suggest using the Bayesian approach instead because it enables the integration of background knowledge with statistical findings. Correlation of P values with Bayes factors suggests that 1 reason for the lack of reproducibility of scientific studies is the conduct of significance tests at unjustifiably high levels of significance. To address this problem, it has been proposed that evidence thresholds required for the declaration of a “significant” finding be increased to 25–50:1, and to 100–200:1 for a highly significant finding, ie, that tests be conducted at P values of 0.005 or 0.001 (22). In view of the prevalent misuses of and misconceptions concerning P values, some statisticians prefer to supplement or even replace P values with other approaches, including methods that emphasize estimation over testing, such as confidence intervals (38).
Confidence interval. A confidence interval is a range of values that is likely to include the true value. The confidence interval covers a large proportion of the sampling distribution of the statistic of interest. The 95% confidence interval for a sample mean is the range of values from the mean minus 1.96 standard errors to the mean plus 1.96 standard errors and is usually interpreted as a range of values that contains the true population mean with a probability of 0.95. The true effect of the treatment may be greater or lesser than what is observed, and confidence intervals tell us how much greater or smaller the true effect is likely to be. Confidence intervals can be calculated for various measures of association. Confidence intervals are recommended for better communication of clinical trial results and for evaluation of diagnostic tests.
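The calculation of a 95% confidence interval for a mean can be sketched as below. The measurements are hypothetical, and the multiplier 1.96 assumes a reasonably large sample; for small samples a t-distribution quantile is generally more appropriate.

```python
import math
import statistics

def mean_ci95(values):
    """Return (lower, upper) limits of the 95% CI: mean +/- 1.96 x SE."""
    mean = statistics.mean(values)
    # Standard error = sample standard deviation / sqrt(sample size)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean - 1.96 * se, mean + 1.96 * se

# Hypothetical measurements
low, high = mean_ci95([4, 6, 8, 10, 12])
print(f"95% CI: {low:.2f} to {high:.2f}")
```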
The t test. The t test, also called the "Student's t test," is used for measured variables when comparing 2 means. The t test, although valuable, does not give the probability that the 2 random samples have, in fact, come from the same population (14). The unpaired t test compares the means of 2 independent samples. The unpaired t test is parametric, and its nonparametric equivalent is the Mann-Whitney U test. The paired t test compares 2 paired observations on the same individuals or on matched individuals. Its nonparametric equivalent is the Wilcoxon matched pair test. The values for t are calculated as follows:
$$t_{\text{unpaired}} = \frac{\text{difference between means}}{\text{standard error of the difference}}$$

$$t_{\text{paired}} = \frac{\text{mean difference}}{\text{standard error of the difference}}$$
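The 2 ratios can be computed as in the following sketch. The unpaired version assumes equal variances in the 2 groups (pooled standard error), which is one common choice and not the only one; the data are hypothetical.

```python
import math
import statistics

def t_unpaired(x, y):
    """t = difference between means / standard error of the difference.

    Uses the pooled-variance (equal variances) standard error.
    """
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se_diff = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se_diff

def t_paired(before, after):
    """t = mean of the paired differences / standard error of that mean."""
    diffs = [a - b for a, b in zip(after, before)]
    se_diff = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se_diff

# Hypothetical data
print(t_unpaired([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
print(t_paired(before=[10, 12, 14, 16], after=[11, 14, 15, 18]))
```

The resulting t value is then referred to the t distribution with the appropriate degrees of freedom to obtain a P value.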
Analysis of variance (ANOVA) or multivariate analysis. This enables comparison among more than 2 sample means. A 1-way analysis of variance deals with a single categorical independent variable, whereas factorial analysis of variance can deal with multiple factors in several different configurations. Several statistical packages are available for ANOVA. Analysis of variance has an advantage over the t test when several drugs in different groups are being compared and numerous possible comparisons can be made. In this situation, the use of multiple t tests to do 2-way comparisons would be inappropriate, as it would lead to the loss of any interpretable level of significance. A 1-way analysis of variance is presented as a table that includes the sums of squares between groups and sums of squares within groups. It also contains the value of the F ratio, which is equal to the mean square (between) divided by the mean square (within). The larger the F ratio is, the more significant the results are. An extension of this approach called "factorial analysis of variance" enables the inclusion of any number of factors in a single experiment and looks at the independent effect of each factor without distorting the overall probability of chance difference.
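The quantities in a 1-way table (sums of squares between and within groups, and the F ratio) can be sketched as follows. The 3 groups are hypothetical, and the function name is illustrative.

```python
import statistics

def one_way_anova(groups):
    """Return (ss_between, ss_within, f_ratio) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of individual values around their own group mean
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(all_values) - len(groups))
    return ss_between, ss_within, ms_between / ms_within

# Hypothetical outcome scores in 3 treatment groups
ss_b, ss_w, f = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(ss_b, ss_w, f)
```

The F ratio is compared with the F distribution on (groups - 1) and (subjects - groups) degrees of freedom.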
Standard deviation as a measure of variability. The standard deviation, usually depicted by the abbreviation SD, is a measure of variability. When the standard deviation of a sample is calculated, it is an estimate of the variability of the population from which the sample was drawn. It should not be confused with standard error, which depends on both the standard deviation and the sample size; the standard error falls as the sample size increases. This forms the basis for calculation of sample size for a controlled trial. Although standard error falls as sample size increases, standard deviation tends not to change with the increase.
Measures of association. Measures of association refer to the use of statistical analysis to study the association between 2 variables within a group of subjects in a clinical trial. Correlation (with calculation of correlation coefficient) is the method used to study possible association between 2 continuous variables. Regression analysis enables the value of 1 variable to be predicted from any known value of the other variable. Traditionally linear regression methods have been preferred to nonlinear regression methods because of their inherent simplicity. With the widespread availability of computers and statistical software packages, nonlinear regression, an often more appropriate analysis, should be considered. Most computer programs for nonlinear regression analysis provide information necessary to perform all the calculations.
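For 2 continuous variables, the correlation coefficient and the least-squares regression line can be sketched as below. This covers only the simple linear case, and the data are hypothetical.

```python
import math
import statistics

def pearson_r_and_line(x, y):
    """Return (r, slope, intercept) for paired observations x and y."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
    slope = sxy / sxx                # least-squares slope
    intercept = my - slope * mx
    return r, slope, intercept

# Hypothetical dose (x) and response (y)
r, slope, intercept = pearson_r_and_line([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
predicted = intercept + slope * 5    # predict y for a new x value
print(r, slope, intercept, predicted)
```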
Odds ratio. The odds of an event are the ratio of the probability that the event of interest occurs to the probability that it does not. This is usually estimated by the ratio of the number of times that the event of interest occurs to the number of times that it does not. The odds ratio is the ratio of the odds in one group to the odds in another and is useful as a measure of association. It was originally used in analysis of case-control epidemiological studies for measuring relative risk, but it is now applied to randomized trials and meta-analysis. Supposing the odds of occupational exposure to aluminum were 1.3 times higher in cases with dementia than in controls, the odds ratio is said to be 1.3. An odds ratio greater than 1 indicates a positive association between exposure and disease; an odds ratio equal to 1 indicates no association; and an odds ratio less than 1 indicates a negative association. However, association is not a proof of causation. The calculation to find the approximate 95% confidence interval for an odds ratio is not difficult. The use of odds ratios has increased in medical reports for the following reasons (08):
| • They provide an estimate (with confidence interval) of the relationship between 2 binary ("yes or no") variables. |
| • They enable examination of the effects of other variables on that relationship, using logistic regression. |
| • They provide a convenient interpretation in case-control studies. |
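A sketch of the odds ratio and its approximate 95% confidence interval for a case-control table follows. It uses the standard error of the log odds ratio (often called the Woolf method), which is one common approximation; the counts are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI from a 2 x 2 table.

    a = exposed cases,   b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log)
    high = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, low, high

# Hypothetical study: 20 of 100 cases and 10 of 100 controls were exposed.
print(odds_ratio_ci(20, 10, 80, 90))
```

When the interval includes 1, as it does for these hypothetical counts, the data are compatible with no association.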
Comparison of survival rates. Survival rates of 2 groups of patients treated by different methods in clinical trials may need to be compared. A traditional method is plotting a curve for each of the 2 groups using time from start of observation and survival rate at various points of time. The limitation of this approach is that, although the survival curves may differ, this is not sufficient for the investigator to conclude that survival in 1 group is worse than in the other. It gives a comparison at some arbitrary point in time but does not provide a comparison of the total survival experience of the 2 groups. The Kaplan-Meier survival curve assumes that censoring is unrelated to prognosis, that the survival probabilities are the same for subjects recruited early and late in the study, and that the events happen at the times specified (07).
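A minimal sketch of the Kaplan-Meier product-limit estimate is given below. The follow-up times and event indicators are hypothetical; subjects censored at a given time are treated as still at risk at that time.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each subject
    events: 1 if the event (eg, death) occurred, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up in weeks; subjects 2 and 5 are censored.
print(kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]))
```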
The logrank test is used to test the null hypothesis that there is no difference between the groups in the probability of event (death) at any time point and the analysis is based on the times of deaths (09). For example, in a clinical trial of 50 patients, 1 patient in group I (20 patients) dies in week 3, so the risk of death in this week is 1/50. If the null hypothesis were true, the expected number of deaths in group I is 20 x 1/50 = 0.4. Similarly, in group II (30 patients) the expected number of deaths is 30 x 1/50 = 0.6. The same calculations are performed each time an event occurs. This way of handling censored observations is the same as for the Kaplan-Meier survival curve. The logrank test is most likely to detect a difference between groups when the risk of an event is consistently greater for 1 group than the other. It is unlikely to detect a difference when survival curves cross, a situation that may occur when comparing a medical with a surgical intervention.
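The expected-deaths arithmetic in the example above can be sketched as follows; the function name is illustrative. In a full logrank test these expected numbers are accumulated over every event time and compared with the observed numbers of deaths in each group.

```python
def expected_deaths(at_risk_by_group, deaths_total):
    """Expected deaths per group at one event time under the null hypothesis.

    at_risk_by_group: number of subjects still at risk in each group
    deaths_total:     total deaths observed at this time (all groups)
    """
    risk = deaths_total / sum(at_risk_by_group)  # eg, 1/50 in the example
    return [n * risk for n in at_risk_by_group]

# Week 3 of the example: 20 and 30 patients at risk, 1 death overall.
print(expected_deaths([20, 30], deaths_total=1))
```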
Logistic regression. This is usually carried out by professional statisticians and is playing an increasing role in clinical studies. It is relevant to studies when only 2 possible outcomes are of interest: recovery or death; improvement or no improvement; and presence or absence of side effects. Logistic regression can also be used to analyze the role of risk factors at the onset of a disease to test the independence of various parameters, such as discontinuation of antiepileptic medications and generalized background abnormalities on EEG as predictors of risk of status epilepticus.
Standard regression-based methods have been applied to statistically measure individual rates of impairment at several time points after concussion in college football players to decide when an athlete can return to competition (27). The data suggest that use of neuropsychological testing to detect subtle cognitive impairment is most useful once postconcussive symptoms have resolved.
Artificial neural networks are now being designed as an alternative to logistic regression to predict evolution of some events during a disease. In a retrospective study of a database of patients with confirmed aneurysmal subarachnoid hemorrhage, a simple artificial neural network model was more sensitive and specific than multiple logistic regression models for prediction of cerebral vasospasm (15).
Scientific inference from big data. Current research activity and collection of information is generating big data, ie, data sets whose heterogeneity, complexity, and size -- measured in terabytes or petabytes -- exceed the capability of traditional approaches to data processing, storage, and analysis. Analysis of big data is needed to identify complex patterns hidden inside volumes of data that could accelerate scientific discovery and development of beneficial technologies as well as products. For example, analysis of big data combined from a patient’s electronic health records, environmental exposure, activities, and genetic and proteomic information is expected to help guide the development of personalized medicine (28).
Bayesian statistical methods. In comparing results of treatment, Bayesian analysis begins with the observed differences between treatment A and treatment B and then asks how likely it is that treatment A is superior to treatment B. In other words, the Bayesian method induces the probability of the existence of the true, but as of yet unknown, underlying state. However, the Bayesian analysis requires the establishment of a prior probability. To obtain a Bayesian probability that treatment A is superior to treatment B after a trial has been conducted, it is necessary to specify the probability of treatment A's superiority based on evidence available before the trial. The first application of Bayesian methodology was in diagnostic medicine. As applied to diagnosis, Bayes’ theorem states as follows:
$$\text{Prob}(\text{disease} \mid \text{test positive}) = \frac{\text{Prob}(\text{test positive} \mid \text{disease}) \times \text{Prob}(\text{disease})}{\text{Prob}(\text{test positive})}$$

$$= \frac{\text{Prob}(\text{test positive} \mid \text{disease}) \times \text{Prob}(\text{disease})}{\text{Prob}(\text{test positive} \mid \text{disease}) \times \text{Prob}(\text{disease}) + \text{Prob}(\text{test positive} \mid \text{no disease}) \times \text{Prob}(\text{no disease})}$$
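As a sketch, Bayes' theorem for a diagnostic test can be evaluated numerically as follows. The prior probability and the test characteristics are hypothetical values chosen for illustration.

```python
def posterior_probability(prior, p_pos_given_disease, p_pos_given_no_disease):
    """Prob(disease | test positive) by Bayes' theorem.

    prior:                  Prob(disease) before testing
    p_pos_given_disease:    Prob(test positive | disease)
    p_pos_given_no_disease: Prob(test positive | no disease)
    """
    numerator = p_pos_given_disease * prior
    denominator = numerator + p_pos_given_no_disease * (1 - prior)
    return numerator / denominator

# Hypothetical: 10% prior probability, 90% true-positive rate,
# 5% false-positive rate.
print(posterior_probability(0.10, 0.90, 0.05))
```

The result shows how a positive test moves the prior probability to a higher posterior probability, by an amount that depends on the false-positive rate.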
Bayesian statistics has now permeated all the major areas of medical statistics, including clinical trials, epidemiology, meta-analyses and evidence synthesis, spatial modeling, longitudinal modeling, survival modeling, molecular genetics, and decision making in respect of new technologies.
Statistical evaluation of clinical measurements. Several clinical measurements are not precise and may vary according to the examiner and the technique used. Studies comparing 2 methods are common. The aim of these studies is to see if the results of the methods agree well enough for 1 method to replace the other. Another objective is to see if the results of 2 studies conducted by different observers using the same method agree with each other; this is termed "inter-rater agreement." Here, the problem is one of estimation rather than hypothesis testing. A simple approach to assessing agreement is to see how many exact agreements are observed. The weakness of this approach is that it does not take into account where the agreement is or what agreements would take place by chance. One measure of agreement, called kappa, has a value of 1.0 when the agreement is perfect, a value of 0 when agreement is no better than chance, and negative values when agreement is worse than chance.
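Kappa (here Cohen's unweighted kappa for 2 raters) can be sketched as below. The agreement table is hypothetical: rows are the categories assigned by the first rater and columns those assigned by the second.

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table.

    table[i][j] = number of subjects placed in category i by rater 1
                  and in category j by rater 2.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical: 2 examiners rate a sign as present or absent in 50 patients.
print(cohens_kappa([[20, 5], [10, 15]]))
```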
Statistical evaluation of diagnostic tests. Considerable clinical research is being done to evaluate and improve methods of diagnosis of disease. The terms "positive" and "negative" refer to the presence or absence of the condition of interest as confirmed by a definitive examination. An example is the polymerase chain-reaction-based rapid test for the detection of enteroviral RNA in the cerebrospinal fluid, where the results are later verified by culture of the cerebrospinal fluid. In a study, 96.3% of the patients who tested positive with polymerase chain reaction were later positive with culture as well, whereas 99.0% of those that had a negative polymerase chain reaction test were shown to have a negative culture. The 2 terms used to describe these situations are "sensitivity" and "specificity." Sensitivity is the proportion of positives correctly identified by the test, and specificity is the proportion of negatives correctly identified by the test.
Sensitivity and specificity do not tell us the probability of the test resulting in a correct diagnosis, whether it is positive or negative. For this we use positive and negative predictive values. Positive predictive value is the proportion of patients with positive test results who are correctly diagnosed; negative predictive value is the proportion of patients with negative test results who are correctly diagnosed. Positive and negative predictive values give a direct assessment of the usefulness of the test in practice. The 4 possibilities are as follows:
| (a) True positive: The test is positive, and the disease is present. |
| (b) False positive: The test is positive, but the disease is absent. |
| (c) False negative: The test is negative, but the disease is present. |
| (d) True negative: The test is negative, and the disease is absent. |
These quantities can be represented as follows:
| • Sensitivity = a/(a + c) |
| • Specificity = d/(b + d) |
| • Positive predictive value (PPV) = a/(a + b) |
| • Negative predictive value (NPV) = d/(c + d) |
Prevalence of the disease in the study can be calculated as (a + c)/n if the study is carried out in a definable group of patients.
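These quantities can be computed directly from the 4 cell counts, as in the following sketch; the counts are hypothetical.

```python
def diagnostic_measures(a, b, c, d):
    """Diagnostic test measures from a 2 x 2 table.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "prevalence": (a + c) / n,
    }

# Hypothetical study of 300 patients, 100 of whom have the disease.
for name, value in diagnostic_measures(90, 20, 10, 180).items():
    print(f"{name}: {value:.3f}")
```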
With the above information, positive predictive value and negative predictive value can be calculated as follows:
$$\text{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})}$$

$$\text{NPV} = \frac{\text{specificity} \times (1 - \text{prevalence})}{(1 - \text{sensitivity}) \times \text{prevalence} + \text{specificity} \times (1 - \text{prevalence})}$$
If sensitivity and specificity estimates are reported without a measure of precision, clinicians cannot know the range wherein the true values of the indices are likely to lie. Therefore, evaluations of diagnostic accuracy should be qualified with confidence intervals.
An example of application of the statistical evaluation is critical review of a publication claiming that MRI-assisted diagnosis of autistic spectrum disorder can be carried out with a sensitivity and specificity of up to 90% and 80%, respectively (16). A common misconception would be that the test is 90% (sensitivity) accurate. For autistic spectrum disorder, if the prevalence is 1 in 100, the diagnostic accuracy may be less than 5% (positive predictive value), ie, fewer than 5 in every 100 people with a positive test would have autistic spectrum disorder. Of those who do not have the disease, 80% (specificity) will test negative, but the problem is with the 20% who do not have the disease and yet test positive. This approach would not be worthwhile for screening a population with low prevalence of autistic spectrum disorder.
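The figures in this example can be checked with a short sketch that derives predictive values from sensitivity, specificity, and prevalence. The function names are illustrative; the inputs are the values quoted above.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from test characteristics and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from test characteristics and prevalence."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

# Sensitivity 90%, specificity 80%, prevalence 1 in 100
print(f"PPV = {ppv(0.90, 0.80, 0.01):.3f}")  # about 0.043, ie, under 5%
print(f"NPV = {npv(0.90, 0.80, 0.01):.3f}")
```

At this prevalence, false positives from the large unaffected population far outnumber true positives, which is why the positive predictive value is so low.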
Statistical significance. The term “statistical significance” has been used for decades in various publications. A proposal with support from other authors suggests retaining P values but abandoning ambiguous statements (significant/nonsignificant), suggests discussing “compatible” effect sizes, and points out that many effects are refuted on discovery or replication (03). There is considerable criticism of this proposal. Although the statistics of scientific work requires improvement, banning of statistical significance while retaining P values (or confidence intervals) will not improve the situation and may foster statistical confusion and create problematic issues with study interpretation. Therefore, the term “statistical significance” should be retained (21).