Which of the following Pearson coefficients is considered to have the strongest positive correlation?

Video Transcript

Which of the following Pearson's correlation coefficients indicates the strongest correlation? Is it (A) negative 0.14, (B) negative 0.87, (C) negative 0.88, or (D) negative 0.33?

We begin by recalling that Pearson's correlation coefficient measures the linear agreement between two variables. In effect, it tells us how close a set of points sits to a straight line. We know that points with a positive correlation will sit near to a straight line with positive slope or gradient, whereas points with a negative correlation will sit near to a straight line with negative slope, that is, sloping downwards from left to right. Points with a perfect positive linear correlation will sit exactly on a straight line with positive slope or gradient, and the Pearson's correlation coefficient will be equal to one, whereas points with a perfect negative linear correlation will sit entirely on a straight line with a negative slope, and the correlation coefficient will be equal to negative one.

It is important to note that a correlation coefficient of negative one does not mean that the gradient or slope of the line is equal to negative one; it just means that the points lie exactly on a straight line with negative gradient. Likewise, a correlation coefficient of one does not mean that the slope of the positive correlation line is one.

We can therefore see that Pearson's correlation coefficients vary from negative one to one, with a value of negative one indicating the strongest possible negative correlation and a value of positive one indicating the strongest possible positive correlation. All of the options in this question are negative. So, to find the value that indicates the strongest correlation, we're looking for the value which is closest to negative one. The correct answer is therefore option (C). Out of the four options, negative 0.88 indicates the strongest correlation.
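The selection rule from the transcript can be sketched in a few lines of Python (the option labels and values are taken from the question itself): the strength of a coefficient is its absolute value, so the strongest correlation is the one closest in magnitude to one.

```python
# Strength of a Pearson coefficient is its absolute value;
# the sign only indicates the direction of the relationship.
coefficients = {"A": -0.14, "B": -0.87, "C": -0.88, "D": -0.33}

strongest = max(coefficients, key=lambda k: abs(coefficients[k]))
print(strongest)  # C
```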

Selection of Variables and Factor Derivation

David Nettleton, in Commercial Data Mining, 2014

Correlation

The Pearson correlation method is the most common method to use for numerical variables; it assigns a value between − 1 and 1, where 0 is no correlation, 1 is total positive correlation, and − 1 is total negative correlation. This is interpreted as follows: a correlation value of 0.7 between two variables would indicate that a significant and positive relationship exists between the two. A positive correlation signifies that if variable A goes up, then B will also go up, whereas if the value of the correlation is negative, then if A increases, B decreases.

For further reading on the Pearson Correlation Method, see:

Boslaugh, Sarah and Paul Andrew Watters. 2008. Statistics in a Nutshell: A Desktop Quick Reference, ch. 7. Sebastopol, CA: O'Reilly Media. ISBN-13: 978-0596510497.

Considering the two variables “age” and “salary,” a strong positive correlation between the two would be expected: as people get older, they tend to earn more money. Therefore, the correlation between age and salary probably gives a value over 0.7. Figure 6.2 illustrates pairs of numerical variables plotted against each other, with the corresponding correlation value between the two variables shown on the x-axis. The right-most plot shows a perfect positive correlation of 1.0, whereas the middle plot shows two variables that have no correlation whatsoever between them. The left-most plot shows a perfect negative correlation of − 1.0.
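As a quick illustration of this interpretation, the correlation can be computed with NumPy. The age/salary pairs below are made up for the sketch (they are not data from the chapter), but they rise together, so a value above 0.7 is expected.

```python
import numpy as np

# Hypothetical age/salary pairs, chosen to rise together (illustrative only)
age = np.array([22, 28, 35, 41, 50, 58])
salary = np.array([28000, 35000, 47000, 52000, 61000, 64000])

r = np.corrcoef(age, salary)[0, 1]
print(r > 0.7)  # True: a strong positive correlation, as the text suggests
```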


Figure 6.2. Correlations between two numerical variables

A correlation can be calculated between two numerical values (e.g., age and salary) or between two category values (e.g., type of product and profession). However, a company may also want to calculate correlations between variables of different types. One method to calculate the correlation of a numerical variable with a categorical one is to convert the numerical variable into categories. For example, age would be categorized into ranges (or buckets) such as: 18 to 30, 31 to 40, and so on.
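A minimal sketch of this numeric-to-categorical conversion (the bucket boundaries follow the chapter's example; the function name is illustrative):

```python
def age_bucket(age):
    # Ranges follow the chapter's example: 18 to 30, 31 to 40, and so on
    if 18 <= age <= 30:
        return "18-30"
    if 31 <= age <= 40:
        return "31-40"
    if 41 <= age <= 50:
        return "41-50"
    return "51+"

print([age_bucket(a) for a in (19, 34, 45, 67)])  # ['18-30', '31-40', '41-50', '51+']
```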

As well as the correlation, the covariance of two variables is often calculated. In contrast with the correlation value, which must be between − 1 and 1, the covariance may assume any numerical value. The covariance indicates the grade of synchronization of the variance (or volatility) of the two variables.
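The contrast between the bounded correlation and the unbounded covariance can be checked with NumPy (toy values, assumed for illustration): rescaling a variable leaves the correlation unchanged but scales the covariance.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 100.0 * x  # same pattern as x, but on a much larger scale

r = np.corrcoef(x, y)[0, 1]   # correlation: scale-free, stays within [-1, 1]
cov = np.cov(x, y)[0, 1]      # covariance: grows with the units of the variables

print(round(r, 6))   # 1.0
print(cov)           # 250.0
```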

For further reading on covariance, see:

Boslaugh, Sarah and Paul Andrew Watters. 2008. Statistics in a Nutshell: A Desktop Quick Reference, ch. 16. Sebastopol, CA: O'Reilly Media. ISBN-13: 978-0596510497.

Table 6.2 shows correlations between four business variables taken from Table 6.1. The two variables that have the highest correlations are profession with income (US $), with a correlation of 0.85, and age with income (US $), with a correlation of 0.81. The lowest correlations are cell phone usage with income (0.25) and cell phone usage with profession (0.28). Hence the initial conclusion is that cell phone usage doesn’t have a high correlation with any other variable, so it could be considered for exclusion from the input variable set. Table 6.1 also shows that cell phone usage has a significantly lower reliability (0.3) than the other variables and this could have repercussions on its correlation value with the remaining variables. Also, profession only has a high correlation with income; however, it will be seen that this correlation pair (income, profession) is important to the type of business. Given that each variable has a correlation with every other variable, the values are repeated around the diagonal. Therefore, the values on one side of the diagonal can be omitted. Note that all the values are equal to 1 on the diagonal, because these are the correlations of the variables with themselves.

Table 6.2. Correlations between candidate variables

                   Age    Income (US $)   Profession   Cell Phone Usage
Age                1      0.81            0.38         0.60
Income (US $)      0.81   1               0.85         0.25
Profession         0.38   0.85            1            0.28
Cell phone usage   0.60   0.25            0.28         1
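A matrix like Table 6.2 can be produced with `numpy.corrcoef`. The sketch below uses invented records (not the book's data) and simply confirms the two properties discussed above: the values repeat around the diagonal, and the diagonal itself is all ones.

```python
import numpy as np

# Invented records for three candidate variables (illustrative only)
age    = [23, 35, 41, 52, 60, 28]
income = [21, 40, 55, 70, 80, 30]        # US $1000s
usage  = [300, 200, 180, 150, 120, 260]  # cell phone minutes per month

corr = np.corrcoef([age, income, usage])

print(np.allclose(corr, corr.T))        # True: values repeat around the diagonal
print(np.allclose(np.diag(corr), 1.0))  # True: each variable correlates 1 with itself
```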

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780124166028000066

29th European Symposium on Computer Aided Process Engineering

Sergio Medina-González, ... Lazaros G. Papageorgiou, in Computer Aided Chemical Engineering, 2019

4 Results and discussion

Using the Pearson correlation and three threshold values (0.91, 0.92, and 0.93), the adjacency matrices and the associated networks were constructed as described in section 2. Then, the Louvain algorithm was used to detect the communities within each network. Essentially, Louvain is a two-step algorithm that maximises the modularity metric: for a given network, the first step assigns nodes into clusters only if that increases the modularity value, whereas the second step creates a new network where each node represents a cluster from the previous step. These two steps are iterated until no further modularity improvement is possible (Blondel et al., 2008). After applying the Louvain algorithm to this problem, a total of 9, 13, and 31 communities were identified for the three thresholds, respectively. Analysing the obtained networks (Figure 1), it is evident that the density is inversely proportional to the threshold value. In particular, the larger the threshold value, the more isolated the scenario clusters produced, whereas small values yield a very densely connected network, which compromises the definition of cluster centroids. Such behaviour confirms the undesirable properties of the reduced scenario set at the extreme threshold points and stresses the need for a metric that balances the number of clusters against network density.
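A minimal sketch of the thresholding step (with random stand-in scenarios, not the paper's data): the adjacency matrix keeps an edge wherever the pairwise Pearson correlation exceeds the threshold, so raising the threshold can only make the network sparser.

```python
import numpy as np

rng = np.random.default_rng(0)
scenarios = rng.normal(size=(6, 20))  # 6 stand-in scenarios, 20 uncertain parameters

corr = np.corrcoef(scenarios)         # 6x6 pairwise Pearson correlation matrix

def adjacency(corr, threshold):
    adj = (corr > threshold).astype(int)
    np.fill_diagonal(adj, 0)          # no self-loops
    return adj

# A larger threshold removes edges, giving more isolated clusters
print(adjacency(corr, 0.93).sum() <= adjacency(corr, 0.91).sum())  # True
```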


Figure 1. Networks and clusters for different threshold values.

For the three generated networks, their centroids were identified using degree as the centrality metric. The probability of each cluster centroid is represented by the aggregation of the original probabilities of all the elements within their respective cluster, while the original uncertainty parameter information was kept. Finally, the MILP problem was implemented in GAMS 24.7 and solved for each one of the reduced sets of scenarios using CPLEX 12.6.3 to a relative optimality gap of 5%.

In order to illustrate the effectiveness of the proposed approach, SCENRED and OSCAR (the two most used scenario reduction approaches) were used as a comparison reference. The resulting set of scenarios from the three methods represent the input data to solve the MILP problem. Table 2 shows the optimal results for each case.

Table 2. Results for the optimisation of the MILP problem.

            Size/modules (m)   ExpProfit (€)   PostProcess Profit (€)   Gap (%)   CPU time (s)
FULL-SPACE  100                345,198         ---                      --        131,607

Th = 0.91, Modularity (Q = 0.205214)
OSCAR       9                  346,642         345,231                  0.4085    1,101
SCENRED     9                  348,304         342,374                  1.7322    1,193
SCANCODE    9                  357,655         345,500                  3.5183    1,137

Th = 0.92, Modularity (Q = 0.32264)
OSCAR       13                 343,475         345,115                  0.4752    1,459
SCENRED     13                 349,373         345,500                  1.1211    1,512
SCANCODE    13                 348,142         345,115                  0.8771    1,495

Th = 0.93, Modularity (Q = 0.43473)
OSCAR       31                 346,051         345,115                  0.2711    7,765
SCENRED     31                 348,562         345,115                  0.9987    7,812
SCANCODE    31                 344,156         343,545                  0.1779    7,818

Table 2 displays a significant dispersion between the expected performances (ExpProfit) for the reduced sets of scenarios and the full-space (no reduction), since the scenarios in the reduced set may be different for each cluster and strategy. Therefore, a post-process analysis was performed in order to promote a fair comparison (PostProcessProfit). To derive those values, the first-stage decisions obtained after optimising the MILP problem using the reduced set were fixed for the full-space problem. The gap between the expected value and its associated post-process outcome was also calculated, confirming that the three approaches approximate the profit (variation < 4%) and showing that SCANCODE is a feasible alternative to the current scenario reduction approaches. Despite this small gap, the approximation error for SCENRED and SCANCODE increases as a function of the reduction degree, while in the case of OSCAR a relatively steady gap was obtained regardless of the level of reduction. Such behaviour is due to the consideration of both the original uncertain parameters and their effect on the expected performance. Nonetheless, obtaining such information implies a pre-processing task, which hinders its application to large-scale problems. In any case, further research seeking to reduce the approximation gap in SCANCODE is crucial for its application in real-life problems with large scenario sets.


URL: https://www.sciencedirect.com/science/article/pii/B9780128186343501363

Split-Half Reliability

Robert L. Johnson, James Penny, in Encyclopedia of Social Measurement, 2005

Calculation of the Reliability Coefficient

Split-half reliability is typically estimated with the use of a Pearson correlation. Subsequently, the Spearman–Brown prophecy formula is applied to estimate the reliability of the full-length test. The Spearman–Brown method assumes that the two halves of the test are parallel. Parallelism requires that an examinee has the same true score across forms and that the mean, variance, and error are the same across forms. If not, the estimated full-length reliability from Spearman–Brown will be greater than that obtained by other measures of internal consistency.
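The Spearman–Brown step-up for two equal halves is simple enough to state as code. A sketch (here `r_half` stands for the Pearson correlation between the two half-tests):

```python
def spearman_brown(r_half):
    """Prophecy formula: full-length reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)

# A half-test correlation of 0.6 steps up to a full-test reliability of 0.75
print(round(spearman_brown(0.6), 2))  # 0.75
```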

Not all calculations of split-half estimations use the Pearson correlation. Rulon provided two split-half formulas that he attributed to John Flanagan. One formula is based on the standard deviation of difference scores between the half-tests. The formula is

(3)  r_w = 1 − σ_d² / σ_w²

where d = X_a − X_b and σ_w² is the variance for the whole test. Assumptions for this formula include (a) the difference between the two true scores for the two half-tests is constant for all examinees, and (b) the errors in the two half scores are random and uncorrelated.

The other formula is

(4)  r_w = 4 σ_a σ_b r_ab / σ_w²

where σ_a is the standard deviation of scores for one test half and σ_b is the standard deviation associated with the other test half. Unlike the Spearman–Brown formula, these formulas do not require equivalent halves with equal variances. Both assume experimentally independent halves. Neither reliability estimate requires the application of the Spearman–Brown prophecy formula. Guttman offered the following contribution to the estimation of split-half reliability:

(5)  r_w = 2(1 − (σ_a² + σ_b²) / σ_w²)

The terms σ_a² and σ_b² represent the variance associated with each test half.

If variances are equal on the two halves, the reliability estimate based on Spearman–Brown will be the same as achieved with the split-half procedures described by Rulon and Guttman. Moreover, the strict equality of variances is not required for convergence of reliability estimates across methods. According to Cronbach, if the ratio of the standard deviations for the two test halves is between 0.9 and 1.1, then Spearman–Brown gives nearly the same result as Eqs. (3) and (5).
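This convergence is easy to check numerically. The sketch below (toy values, assumed for illustration) computes all three estimates from the half-test standard deviations and their correlation, using var(whole) = σ_a² + σ_b² + 2·r_ab·σ_a·σ_b for the sum of the two halves; with equal half variances the estimates coincide.

```python
def reliability_estimates(sigma_a, sigma_b, r_ab):
    # Whole-test variance for the sum of the two halves
    var_w = sigma_a**2 + sigma_b**2 + 2 * r_ab * sigma_a * sigma_b
    spearman_brown = 2 * r_ab / (1 + r_ab)                  # assumes parallel halves
    flanagan = 4 * sigma_a * sigma_b * r_ab / var_w         # Eq. (4)
    guttman = 2 * (1 - (sigma_a**2 + sigma_b**2) / var_w)   # Eq. (5)
    return spearman_brown, flanagan, guttman

# Equal half-test variances: all three estimates agree
sb, fl, gu = reliability_estimates(2.0, 2.0, 0.6)
print(round(sb, 4), round(fl, 4), round(gu, 4))  # 0.75 0.75 0.75
```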


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000967

Understanding Your Data

Jules J. Berman, in Data Simplification, 2016

Pearson's Correlation

Similarity scores are based on comparing one data object with another, attribute by attribute, usually summing the squares of the differences in magnitude for each attribute, and using the calculation to compute a final outcome, known as the correlation score. One of the most popular correlation methods is Pearson's correlation, which produces a score that can vary from − 1 to + 1. Two objects with a high score (near + 1) are highly similar.18 Two uncorrelated objects would have a Pearson score near zero. Two objects that correlated inversely (ie, one falling when the other rises) would have a Pearson score near − 1 (See Glossary items, Correlation distance, Normalized compression distance).

The Pearson correlation for two objects, with paired attributes, sums the product of their differences from their object means, and divides the sum by the product of the squared differences from the object means (Fig. 4.15).


Figure 4.15. Formula for Pearson's correlation: \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum (x_i - \overline{x})^2}\,\sqrt{\sum (y_i - \overline{y})^2}}.

You will notice that the Pearson's correlation is parametric, in the sense that it relies heavily on the "mean" parameter for the two objects. This means that Pearson's correlation might have higher validity for a normal distribution, with a centered mean, than for a distribution that is not normally distributed, such as a Zipf distribution (See Glossary items, Nonparametric statistics, Zipf distribution).

Python's Scipy module offers a Pearson function. In addition to computing Pearson's correlation, the Scipy function produces a two-tailed p-value, which provides some indication of the likelihood that two totally uncorrelated objects might produce a Pearson's correlation value as extreme as the calculated value (See Glossary item, p-value).

Let's look at a short Python script, pearson.py, that calculates the Pearson correlation on two lists.

#!/usr/bin/python
from scipy.stats import pearsonr

a = [1, 2, 3, 4]
b = [2, 4, 6, 8]

print(pearsonr(a, b))

Here's the output of pearson.py

c:\ftp\py>pearson.py

(1.0, 0.0)

The first output number is the Pearson correlation value. The second output number is the two-tailed p-value. We see that the two list objects are perfectly correlated, with a Pearson's correlation of 1. This is what we would expect, as the attributes of list b are exactly twice the value of their paired attributes in list a. In addition, the double-tailed p-value is zero, indicating that it is unlikely that two uncorrelated lists would yield the calculated correlation value.

Let's look at the Pearson correlation for another set of paired list attributes.

#!/usr/bin/python
from scipy.stats import pearsonr

a = [1, 4, 6, 9, 15, 55, 62, -5]
b = [-2, -8, -9, -12, -80, 14, 15, 2]

print(pearsonr(a, b))

Here's the output:

c:\ftp\py>pearson.py

(0.32893766587262174, 0.42628658412101167)

In this case, the Pearson correlation is intermediate between 0 and 1, indicating some correlation. How does the Pearson correlation help us to simplify and reduce data? If two lists of data have a Pearson correlation of 1 or of − 1, this implies that one set of the data is redundant. We can assume the two lists have the same information content. For further explanation, see Section 4.5, Reducing Data, in this chapter.

If we were comparing two sets of data and found a Pearson correlation of zero, then we might assume that the two sets of data were uncorrelated, and that it would be futile to try to model (ie, find a mathematical relationship for) the data (See Glossary item, Overfitting).


URL: https://www.sciencedirect.com/science/article/pii/B9780128037812000047

Indispensable Tips for Fast and Simple Big Data Analysis

Jules J. Berman, in Principles and Practice of Big Data (Second Edition), 2018

Section 11.3. The Dot Product, a Simple and Fast Correlation Method

Our similarities are different.

Yogi Berra

Similarity scores are based on comparing one data object with another, attribute by attribute. Two similar variables will rise and fall together. A score can be calculated by summing the squares of the differences in magnitude for each attribute, and using the calculation to compute a final outcome, known as the correlation score. One of the most popular correlation methods is Pearson's correlation, which produces a score that can vary from − 1 to + 1. Two objects with a high score (near + 1) are highly similar [10]. Two uncorrelated objects would have a Pearson score near zero. Two objects that correlated inversely (i.e., one falling when the other rises) would have a Pearson score near − 1. [Glossary Correlation distance, Normalized compression distance, Mahalanobis distance]

The Pearson correlation for two objects, with paired attributes, sums the product of their differences from their object means and divides the sum by the product of the squared differences from the object means (Fig. 11.3).


Fig. 11.3. Formula for Pearson's correlation, for two data objects, with paired sets of attributes, x and y.

Python's Scipy module offers a Pearson function. In addition to computing Pearson's correlation, the scipy function produces a two-tailed P-value, which provides some indication of the likelihood that two totally uncorrelated objects might produce a Pearson's correlation value as extreme as the calculated value. [Glossary P value, Scipy]

Let us look at a short Python script, sci_pearson.py, that calculates the Pearson correlation on two lists.

from scipy.stats import pearsonr

a = [1, 2, 3, 4]

b = [2, 4, 6, 8]

c = [1,4,6,9,15,55,62,-5]

d = [-2,-8,-9,-12,-80,14,15,2]

print("Correlation a with b: " + str(pearsonr(a,b)))

print("Correlation c with d: " + str(pearsonr(c,d)))

Here is the output of sci_pearson.py

Correlation a with b: (1.0, 0.0)

Correlation c with d: (0.32893766587262174, 0.42628658412101167)

The Pearson correlation of a with b is 1 because the values of b are simply double the values of a; hence the values in a and b correlate perfectly with one another. The second number, “0.0”, is the calculated P value.

In the case of c correlated with d, the Pearson correlation, 0.329, is intermediate between 0 and 1, indicating some correlation. How does the Pearson correlation help us to simplify and reduce data? If two lists of data have a Pearson correlation of 1 or of − 1, this implies that one set of the data is redundant. We can assume the two lists have the same information content. If we were comparing two sets of data and found a Pearson correlation of zero, then we might assume that the two sets of data were uncorrelated, and that it would be futile to try to model (i.e., find a mathematical relationship for) the data. [Glossary Overfitting]

There are many different correlation measurements, and all of them are based on assumptions about how well-correlated sets of data ought to behave. A data analyst who works with gene sequences might impose a different set of requirements for well-correlated data than a data analyst who is investigating fluctuations in the stock market. Hence, many correlation measures are available to data scientists, including: Pearson, Cosine, Spearman, Jaccard, Gini, Maximal Information Coefficient, and Complex Linear Pathway score. The computationally fastest of the correlation scores is the dot product (Fig. 11.4). In a recent paper comparing the performance of 12 correlation formulas, the simple dot product led the pack [11].


Fig. 11.4. The lowly dot product. For two vectors, the dot product is the sum of the products of the corresponding values. To normalize the dot product, we would divide the dot product by the product of the lengths of the two vectors.

Let us examine the various dot products that can be calculated for three sample vectors,

a = [1,4,6,9,15,55,62,-5]

b = [-2,-8,-9,-12,-80,14,15,2]

c = [2,8,12,18,30,110,124,-10]

Notice that vector c has twice the value of each paired attribute in vector a. We'll use the Python script, numpy_dot.py to compute the lengths of the vectors a, b, and c; and we will calculate the simple dot products, normalized by the product of the lengths of the vectors.

from __future__ import division

import numpy

from numpy import linalg

a = [1,4,6,9,15,55,62,-5]

b = [-2,-8,-9,-12,-80,14,15,2]

c = [2,8,12,18,30,110,124,-10]

a_length = linalg.norm(a)

b_length = linalg.norm(b)

c_length = linalg.norm(c)

print(numpy.dot(a,b) / (a_length * b_length))

print(numpy.dot(a,a) / (a_length * a_length))

print(numpy.dot(a,c) / (a_length * c_length))

print(numpy.dot(b,c) / (b_length * c_length))

Here is the commented output:

0.0409175385118 (Normalized dot product of a with b)

1.0 (Normalized dot product of a with a)

1.0 (Normalized dot product of a with c)

0.0409175385118 (Normalized dot product of b with c)

Inspecting the output, we see that the normalized dot product of a vector with itself is 1. The normalized dot product of a and c is also 1, because c is perfectly correlated with a, being twice its value, attribute by attribute. We also see that the normalized dot product of a and b is equal to the normalized dot product of b and c (0.0409175385118): because c is a positive scalar multiple of a, it makes the same angle with b that a does, and the normalized dot product depends only on that angle.


URL: https://www.sciencedirect.com/science/article/pii/B978012815609400011X

Classification

Vijay Kotu, Bala Deshpande, in Data Science (Second Edition), 2019

Correlation similarity

The correlation between two data points X and Y is the measure of the linear relationship between the attributes X and Y. Pearson correlation takes a value from −1 (perfect negative correlation) to +1 (perfect positive correlation), with a value of zero meaning no correlation between X and Y. Since correlation is a measure of linear relationship, a zero value does not mean there is no relationship; it just means that there is no linear relationship, but there may be a quadratic or any other higher-degree relationship between the data points. Note that what is explored here is the correlation between one data point and another, which is quite different from the correlation between variables. Pearson correlation between two data points X and Y is given by:

(4.9)  Correlation(X, Y) = s_xy / (s_x × s_y)

where sxy is the covariance of X and Y, which is calculated as:

s_xy = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)

and s_x and s_y are the standard deviations of X and Y, respectively. For example, the Pearson correlation of two data points X (1,2,3,4,5) and Y (10,15,35,40,55) is 0.98.
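That figure can be reproduced directly from Eq. (4.9) using only the standard library (a sketch; `statistics.stdev` uses the same n−1 denominator as the sample covariance above):

```python
import statistics

X = [1, 2, 3, 4, 5]
Y = [10, 15, 35, 40, 55]

n = len(X)
mean_x, mean_y = statistics.mean(X), statistics.mean(Y)

# Sample covariance with the n-1 denominator, as in the text
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)

# Eq. (4.9): correlation = covariance / (s_x * s_y)
r = s_xy / (statistics.stdev(X) * statistics.stdev(Y))
print(round(r, 2))  # 0.98
```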


URL: https://www.sciencedirect.com/science/article/pii/B9780128147610000046

30th European Symposium on Computer Aided Process Engineering

Bianca Williams, ... Selen Cremaschi, in Computer Aided Chemical Engineering, 2020

3.2 Feature Selection Methods

We performed feature selection using the training data set in order to discover which of the bioreactor features were most influential on the cardiomyocyte content. The set of features considered consists of all the collected bioreactor features measured up until the seventh day of differentiation (dd7).

3.2.1 Correlations

The Pearson and Spearman correlations (Bonett & Wright, 2000) between the collected bioreactor features and the cardiomyocyte content were calculated. The Pearson correlation measures the strength of the linear relationship between two variables. It has a value between −1 and 1, with a value of −1 meaning a total negative linear correlation, 0 meaning no correlation, and +1 meaning a total positive correlation. The Spearman correlation measures the strength of a monotonic relationship between two variables, with the same scaling as the Pearson correlation.
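The difference between the two measures shows up on data that is monotonic but not linear. A small check with SciPy (toy data, assumed for illustration):

```python
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6]
y = [v**3 for v in x]  # strictly increasing, but far from linear

r_pearson = pearsonr(x, y)[0]
r_spearman = spearmanr(x, y)[0]

print(round(r_spearman, 3))  # 1.0: the monotonic relationship is perfect
print(r_pearson < 1.0)       # True: the linear relationship is not perfect
```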

3.2.2 Principal Component Analysis

Principal component analysis (PCA) converts a set of possibly correlated variables into a set of linearly uncorrelated ones through an orthogonal transformation (Hotelling, 1933). The resulting principal components (PCs) are linear combinations of the original set of variables.
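A minimal PCA sketch via SVD of the centered data matrix (toy data with two nearly collinear columns, assumed for illustration): because the columns are strongly correlated, almost all the variance lands in the first principal component.

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(50, 1))
noise = 0.01 * rng.normal(size=(50, 1))
X = np.hstack([base, 2 * base + noise])  # second column nearly a multiple of the first

Xc = X - X.mean(axis=0)                  # center each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)          # variance share of each principal component

print(explained[0] > 0.99)  # True: one PC carries almost all the variance
```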

3.2.3 Machine Learning Technique Built-In Feature Selection

Each of the machine learning techniques applied has its own method for selecting features and ranking their predictive importance. During the MARS model construction, a pruning pass is performed over the model that removes terms and features based on the level of their effect on the GCV criterion. For RF models, features are selected based on how well they improve the separation of the data at each decision node. GPR selects features using its built-in automatic relevance determination method.


URL: https://www.sciencedirect.com/science/article/pii/B9780128233771502743

Users and uses of Google's information

Elad Segev, in Google and the Digital Divide, 2010

Relationships between indices

Table 4.2 summarises the rankings of the EPV, VoU and SoS indices, indicating possible relationships.

Table 4.2. A summary of index rankings

EPV                  VoU                    SoS
Russia (0.67) Spain (0.74) Korea (0.99)
Germany (0.64) Denmark (0.72) India (0.98)
Sweden (0.64) Sweden (0.71) Australia (0.98)
France (0.63) Ireland (0.71) China (0.98)
Ireland (0.62) Germany (0.69) USA (0.98)
Spain (0.60) France (0.68) Ireland (0.97)
Finland (0.59) New Zealand (0.65) Canada (0.97)
Japan (0.59) Finland (0.63) Russia (0.96)
New Zealand (0.57) Russia (0.62) Italy (0.96)
India (0.56) India (0.62) Norway (0.95)
Denmark (0.55) Italy (0.59) Brazil (0.95)
Brazil (0.55) Japan (0.57) Germany (0.94)
Italy (0.54) UK (0.57) New Zealand (0.94)
UK (0.51) Norway (0.56) Netherlands (0.94)
Canada (0.49) Brazil (0.52) Japan (0.91)
China (0.49) Netherlands (0.52) UK (0.9)
Norway (0.49) Australia (0.5) Spain (0.89)
Australia (0.48) China (0.47) Finland (0.88)
Korea (0.47) Canada (0.47) France (0.87)
Netherlands (0.45) USA (0.46) Denmark (0.85)
USA (0.40) Korea (0.44) Sweden (0.83)

In theory, very high EPV index scores mean that most search queries are concentrated in economic and political-related categories. Similarly, very low EPV index scores mean that most search queries are concentrated in entertainment-related categories. In both extreme cases (of very high and low EPV scores) the VoU index is supposed to be low, as the spread of search queries is not even among the different categories. However, in practice, Table 4.2 implies a possible positive correlation between the EPV and the VoU indices. It indicates that countries with low EPV scores (e.g. the USA, Canada, Australia, Korea and China) also have low VoU scores, while countries with high EPV scores (e.g. Sweden, Ireland and Germany) usually also have high VoU scores. Yet, there are no countries in Table 4.2 with high EPV scores and low VoU scores. This is primarily due to the fact that there are no countries with a very high concentration of economic and political-related searches. The countries with the highest EPV scores (e.g. Russia, Germany, Sweden, France and Ireland) still have 20–40 per cent of entertainment-related searches, and thus display a greater variety of searches than other countries (i.e. greater VoU scores).

As a positive correlation between the two indices is expected, and there are no assumptions regarding their distribution, a Spearman23 single-tailed correlation test confirms that the EPV index and the VoU index have a strong positive correlation with a p-value of less than 0.01 (Table 4.3).

Table 4.3. Correlation between the EPV and the VoU indices

VoU_IND
Spearman’s rho EPV_IND Correlation coefficient 0.807**
Sig. (1-tailed) 0.000
N 21

**Correlation is significant at the 0.01 level (1-tailed).

A combination of the two correlated indices in one graph provides a vivid presentation of the differences between countries in terms of the content and the variety of searches (Figure 4.4).


Figure 4.4. Content vs. variety of searches

While the EPV index reflects the content aspect, the VoU and SoS indices reflect another two aspects of the digital divide in information uses: volume and control. Table 4.2 implies that many countries that scored highly on the VoU index (e.g. Sweden, Denmark, Spain and France) tend to have low SoS index scores. Similarly, countries with low VoU scores (e.g. USA, Canada, Korea and China) tend to have high SoS scores. Thus, a negative correlation between the two indices is expected. As there are no assumptions about their distribution, a Spearman24 single-tailed correlation test confirms that the VoU index and the SoS index have a strong negative correlation with a p-value of less than 0.01 (Table 4.4).

Table 4.4. Correlation between the VoU and the SoS indices

VoU_IND
Spearman’s rho SoS_IND Correlation coefficient –0.696**
Sig. (1-tailed) 0.000
N 21

**Correlation is significant at the 0.01 level (1-tailed).

This significant negative correlation suggests that countries with more specific search queries (i.e. high SoS index) will usually also display a lower variety of search topics (low VoU index) and vice versa. In other words, there is a certain trade-off between the variety and the specificity of searches. One possible reason for this is that entertainment-related search queries (e.g. ‘hilary duff’ or ‘green day’ which were popular in Canada in February 2005) tend to be more specific and focus on certain people or television programmes, while politics and economics-related search queries (e.g. ‘aftonbladet’ or ‘expressen’ which were popular in Sweden during 2004 and 2005) tend to refer to general news or shopping portals (in which users are often required to continue and search for more specific information). This assumption gets further support in a Spearman25 single-tailed correlation test that indicates a strong positive correlation between the SoS values and the percentage of entertainment-related searches in each country. Similarly, a strong negative correlation was indicated between the SoS values and the percentage of shopping-related searches, indicating that many shopping-related searches are more general (e.g. referring to general shopping portals rather than specific products and services).

While most countries with high SoS values tend to have a greater concentration of entertainment-related searches and thus less variety, findings also indicate that it is possible to maximise the two. A combination of the VoU and the SoS indices in one graph reveals the differences between countries in terms of the specificity and the variety of searches.

Figure 4.5 shows the negative relation between the indices. It suggests that countries with more specific search queries exercise a greater control and manipulation of online information, while countries with a greater variety of searches are exposed to a wider range of information, which means that they display a better understanding of the various applications of online information. Those who can maximise the opportunities of the search engine as an instrument for providing and retrieving information in a wider range of fields and with greater accuracy and depth display better information skills (see also Bonfadelli, 2002; Florida, 2002). Looking at the international level, the model indicates that countries above the best-fit line exercise a better politics of online information in terms of search accuracy and variety of information uses. In particular, search queries from Ireland and Germany exhibit a higher balance of variety and accuracy than searches from other countries. Although they are as varied as searches from Sweden, Denmark or France, they are also more accurate and specific. Thus, while news-related searches in Sweden and Denmark were for general portal-sites, in Germany and Ireland popular searches were more specific, for example, ‘george bush’, ‘pope’ (or ‘papst’ in German) or ‘vatican’ (or ‘vatikan’ in German).


Figure 4.5. The trade-off between variety and specificity


URL: https://www.sciencedirect.com/science/article/pii/B9781843345657500046

Spatial Autocorrelation

R.P. Haining, in International Encyclopedia of the Social & Behavioral Sciences, 2001

3.4 Information Loss in Statistical Tests

Spatial autocorrelation can undermine the use of classical statistical techniques. The correlation coefficient (r) is a common statistic for measuring the linear relationship between two variables (X and Y). The Pearson correlation coefficient varies between −1 and +1, with +1 signifying a perfect positive relationship between X and Y (as X increases, Y increases). The inference theory for the correlation coefficient is based on:

(4)  (n − 2)^(1/2) |r̂| (1 − r̂²)^(−1/2)

where n is the sample size and r̂ is the estimated correlation coefficient. Under the null hypothesis that the correlation in the population is zero, (4) is t distributed with (n−2) degrees of freedom. However, the amount of statistical information carried by the n observations when the data are (positively) spatially autocorrelated is less than would be the case were the n observations independent. In the terminology of Clifford and Richardson (1985), the ‘effective sample size’ is less than the actual sample size n, so the degrees of freedom for the test are less than n. The solution to the problem is to compute this effective sample size (N′), which is obtained from the spatial correlogram (see Sect. 3.2). Then in (4), N′ replaces n and the statistic is t distributed with (N′−2) degrees of freedom.


URL: https://www.sciencedirect.com/science/article/pii/B0080430767025110

Multitemplate-based multiview learning for Alzheimer’s disease diagnosis

M. Liu, ... D. Shen, in Machine Learning and Medical Imaging, 2016

9.4.3 Results of Feature Filtering-Based Method for AD/MCI Diagnosis

In this group of experiments, the balancing factor λ in Eq. (9.3) is set to 0.38. The SVM classifier used here is implemented by the LIBSVM library (Chang and Lin, 2011), using a linear kernel and C = 1 (the default cost). Finally, M = 1:1500 features are tested, and the best results are reported for quantitative comparison.

Table 9.2 first shows the results using a single template for AD/NC classification, to demonstrate the variability of classification results when using different templates even for the same classification task, where the best results are marked in boldface. Because the proposed FS method integrates not only the PC but also the “intertemplate” correlation from the multiple templates, two conventional FS methods are examined based on single templates. The first FS method is simply based on the ranking of PC, and the second method combines PC with SVM-RFE-based FS (Guyon et al., 2002) (as proposed in Fan et al. (2007)) for jointly considering multiple features in the selection. It should be noted that, in the single template case, the feature extraction performed in the proposed method is the same as COMPARE (Fan et al., 2007). Therefore in this chapter, the PC+SVM-RFE-based method using a single template is denoted as COMPARE.
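The first FS method, ranking by PC, reduces to scoring each feature by the absolute value of its Pearson correlation with the class label and keeping the top M. A minimal sketch on synthetic data (`select_top_by_pc` is an illustrative helper, not the chapter's implementation):

```python
# Sketch of PC-based feature ranking: keep the M features whose Pearson
# correlation with the label is strongest in absolute value.
import numpy as np

def select_top_by_pc(X, y, m):
    """X: (subjects, features) matrix; y: class labels (+1/-1).
    Returns indices of the top-m features by |Pearson correlation|."""
    Xc = X - X.mean(axis=0)                 # center each feature
    yc = y - y.mean()                       # center the labels
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    pc = (Xc.T @ yc) / denom                # Pearson correlation per feature
    order = np.argsort(-np.abs(pc))         # strongest correlations first
    return order[:m]

# Toy illustration: 20 subjects, 50 features, feature 3 carries the signal.
rng = np.random.default_rng(0)
y = np.repeat([1.0, -1.0], 10)
X = rng.normal(size=(20, 50))
X[:, 3] += 2 * y                            # inject a discriminative feature
top = select_top_by_pc(X, y, m=5)
print(top)
```

SVM-RFE then refines such a ranking by repeatedly training an SVM and discarding the features with the smallest weights, which is what distinguishes COMPARE from plain PC ranking here.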

Table 9.2. Results of AD Versus NC and pMCI Versus sMCI Classification Using Single Templates (A1–A10)

                AD vs. NC Classification                 pMCI vs. sMCI Classification
                PC                  COMPARE              PC                  COMPARE
Template  ACC   SEN   SPE    ACC   SEN   SPE     ACC   SEN   SPE    ACC   SEN   SPE
A1 84.09 78.33 88.40 83.16 75.33 89.17 68.93 64.62 73.18 71.03 68.79 73.18
A2 84.94 80.56 88.30 81.95 73.67 88.40 68.87 68.56 69.09 71.46 71.97 70.76
A3 83.12 77.33 87.56 84.50 78.44 89.17 69.34 65.15 73.41 69.81 69.47 70.08
A4 84.87 80.44 88.33 85.72 82.22 88.40 72.71 73.56 71.82 71.82 72.58 71.06
A5 85.85 82.56 88.46 84.05 76.22 90.00 70.66 69.39 71.82 71.93 71.21 72.80
A6 84.38 78.33 89.04 85.35 83.56 86.73 71.04 65.98 75.98 72.86 69.62 76.14
A7 82.23 77.22 86.09 87.07 81.33 91.54 71.08 73.94 68.18 74.56 70.8 78.64
A8 83.59 79.44 86.86 84.48 79.44 88.46 70.27 68.71 71.67 71.88 68.56 75.00
A9 83.65 77.33 88.40 82.27 78.44 85.38 68.55 66.36 70.68 71.10 66.97 75.15
A10 83.28 83.78 83.01 83.20 76.56 88.46 69.00 72.05 65.83 71.74 70.15 73.41

Table 9.2 reports the best classification accuracies (ACC) for each of the 10 templates using PC and COMPARE, along with their respective sensitivities (SEN) and specificities (SPEC). Note that the sensitivity and the specificity refer to the proportions of correctly identified AD patients and correctly classified NC subjects, respectively. From Table 9.2, it is clear that COMPARE outperforms PC when using their own best templates (ie, A5 for PC and A7 for COMPARE). However, for some templates (ie, A1, A2, A5, A9, and A10), the use of additional SVM-RFE-based FS (in COMPARE) cannot further improve the simple PC-based classification (in terms of the best classification accuracy). That is, the result improvement brought by SVM-RFE is limited, but at a cost of increased computational burden.
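The ACC/SEN/SPE figures in Tables 9.2 and 9.3 follow the standard definitions. A minimal sketch, with `acc_sen_spe` as an illustrative helper and toy labels (+1 for AD, −1 for NC):

```python
def acc_sen_spe(y_true, y_pred, positive=1):
    """ACC, SEN, SPE as used above: SEN over the positive class (patients),
    SPE over the negative class (controls)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    return (tp + tn) / len(y_true), tp / pos, tn / neg

# Toy check: 4 patients (3 caught) and 4 controls (all 4 correct).
print(acc_sen_spe([1, 1, 1, 1, -1, -1, -1, -1],
                  [1, 1, 1, -1, -1, -1, -1, -1]))
# → (0.875, 0.75, 1.0)
```

The sensitivity/specificity gap discussed below is visible directly in these definitions: a classifier biased toward the NC class inflates SPE at the expense of SEN.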

Furthermore, the results of AD versus NC and pMCI versus sMCI classification using multiple templates are given in Table 9.3. The proposed (multitemplate-based) FS method (namely MA_Proposed) that considers both PC and “intertemplate” correlation is compared with both PC- and COMPARE-based FS methods using either a single template (namely sa:PC and sa:COMPARE) or multiple templates (namely MA_PC and MA_COMPARE). For fair comparison, the averaged results of single-template-based methods (sa:PC and sa:COMPARE) across all 10 templates are reported. In MA_PC, all regional features extracted from 10 different templates are used, thus resulting in a feature representation with M × K = 15,000 dimensions for each subject; afterwards, the top 1500 features are selected out of the 15,000 features based on the PC, and M = 1:1500 features are subsequently selected and used for classification. In MA_COMPARE, the top 1500 features are first selected in the same way as in MA_PC, but SVM-RFE is additionally used to further refine the selected features before inputting them to the SVM for classification.

Table 9.3. Results of AD Versus NC and pMCI Versus sMCI Classification Using Single Templates (sa:PC, sa:COMPARE, sa:Proposed) and Multiple Templates (MA_PC, MA_COMPARE, MA_Proposed)

                  AD vs. NC                           pMCI vs. sMCI
Method            ACC (%)   SEN (%)   SPE (%)         ACC (%)   SEN (%)   SPE (%)
sa:PC 82.01 75.88 86.76 68.49 67.80 69.10
sa:COMPARE 81.52 77.11 84.92 70.06 68.08 72.02
MA_PC 85.91 81.56 89.23 72.78 74.62 70.91
MA_COMPARE 87.19 80.56 92.31 73.35 75.76 70.83
MA_Proposed 91.64 88.56 93.85 72.41 72.12 72.58

For both AD versus NC and pMCI versus sMCI classification, the best classification accuracies (ACC) as well as the corresponding sensitivities (SEN) and specificities (SPEC) of all methods are illustrated in Table 9.3. The results clearly show that MA_Proposed is better than any other methods in terms of all metrics. It should be noted that the sensitivities of sa:PC, sa:COMPARE, MA_PC, and MA_COMPARE are much lower in comparison to their corresponding specificities. A low sensitivity value indicates low confidence on AD diagnosis, which will greatly limit their practical usage. On the other hand, MA_Proposed gives a significantly improved sensitivity value. Together with its high specificity (93.85% for AD vs. NC classification), the MA_Proposed method produces more confident AD diagnosis results.

In addition, Fig. 9.11 illustrates the results of sa:PC, sa:COMPARE, MA_PC, MA_COMPARE, and MA_Proposed in AD versus NC and pMCI versus sMCI classification with respect to different numbers of top selected features. From Fig. 9.11, it is clear that the results of multitemplate-based methods (MA_PC, MA_COMPARE, and MA_Proposed) outperform the results of single-template-based methods (sa:PC and sa:COMPARE) by a significant margin. Specifically, in Fig. 9.11 (left), sa:PC and sa:COMPARE reach their best classification accuracy with a small portion of top selected features, and their performances decline rapidly when more features are included in AD versus NC classification. This indicates that many of their selected features are noisy and redundant, if using only a single template. In contrast, multitemplate-based methods consistently increase or maintain their performance with the increase of the number of features used, which demonstrates that the complementary information from different templates is aggregated together to improve the classification. In addition, with the assistance of SVM-RFE, the COMPARE-based methods (sa:COMPARE and MA_COMPARE) achieve better performance than the PC-based methods (sa:PC and MA_PC) in both cases of using single template and multiple templates. Fig. 9.11 (left) also demonstrates that MA_Proposed significantly outperforms all other comparison methods. Although only a small portion of features can give good classification accuracy for the single-template-based methods, the performance of the MA_Proposed method is consistently improved with use of more features (ie, 91.64% when using 1268 features for AD versus NC classification). This phenomenon shows that the redundant features from a single template can be integrated with the features from other templates (in an effective way) to yield more robust and discriminative representations.

From Fig. 9.11 (right), we can observe again that all three multitemplate-based methods (MA_PC, MA_COMPARE, and MA_Proposed) perform significantly better than the two single-template-based methods (sa:PC, sa:COMPARE) in pMCI versus sMCI classification, indicating the power of using multiple templates in aggregating more useful information for classification. Among all three multitemplate-based methods, MA_Proposed demonstrates comparable performance to both MA_PC and MA_COMPARE. When using the M = 500:1000 top selected features, the proposed method (MA_Proposed) gives the best overall classification results. On the other hand, MA_COMPARE gets its best results when using M = 1:500 features, and MA_PC achieves its best results when using M = 1000:1500 features.


Fig. 9.11. Results of sa:PC, sa:COMPARE, MA_PC, MA_COMPARE, and MA_Proposed in (left) AD versus NC classification and (right) pMCI versus sMCI classification.


URL: https://www.sciencedirect.com/science/article/pii/B9780128040768000098

Which of the following Pearson coefficients is considered to have the strongest positive correlation?

The correlation coefficient is a value between -1 and +1. A correlation coefficient of +1 indicates a perfect positive correlation.

What is the strongest positive correlation?

Correlations range from -1.00 to +1.00. The correlation coefficient (expressed as r ) shows the direction and strength of a relationship between two variables. The closer the r value is to +1 or -1, the stronger the linear relationship between the two variables is.
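Since strength is the absolute value of r and the sign only gives the direction, picking the strongest correlation from a list of candidates is a one-liner. Applied to the transcript's options:

```python
# Strength = |r|; sign = direction. The strongest of the four options
# from the transcript is the one whose absolute value is closest to 1.
candidates = [-0.14, -0.87, -0.88, -0.33]
strongest = max(candidates, key=abs)
print(strongest)  # → -0.88
```

This confirms option (C): −0.88 is closest to −1 and therefore indicates the strongest (negative) correlation of the four.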

Is 0.80 a strong positive correlation?

Correlation Coefficient = 0.8: A fairly strong positive relationship. Correlation Coefficient = 0.6: A moderate positive relationship. Correlation Coefficient = 0: No relationship. As one value increases, there is no tendency for the other value to change in a specific direction.

Is 0.95 a strong positive correlation?

The magnitude of the correlation coefficient indicates the strength of the association. For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of r = -0.2 suggests a weak, negative association. By that standard, r = 0.95 indicates a very strong positive correlation.