May 27, 2014

Playing the Long Game of Human Biological Variation

This post is presented in two parts: an epistemological discussion (I) and a technical one (II). Hopefully this will be informative.

I: Biological Variation: an epistemological tour
This month was a provocative one for those interested in human biological variation. Nicholas Wade's book "A Troublesome Inheritance: genes, race, and human history" is being roundly criticized by Anthropologists [1] and Evolutionary Biologists [2] alike. While I have not read the book, I have gleaned the basic outline from various reviews and recaps. From what I understand, the book is based on the notion that the study of human biological variation has not been treated fairly in the history of science. That is to say, people who are satisfied with the phrases "there is no such thing as race" or "races are sociopolitical entities" [3] are ignoring up-and-coming work in the area of human genetics. And indeed, recent findings based on technologies such as genome resequencing and functional genomics might suggest that an interpretive revision is necessary.

As Wade would have it, this research suggests that human subdivision is a real phenomenon, much like species or other biological classifications. This in and of itself would be an incomplete argument. But the way in which Wade links human genetic diversity to behavioral differences and confounds racial categories with cultural groups is a source of great controversy. In fact, a book review by Andrew Gelman [4] highlights both the preposterous and plausible nature of Wade's argument. The preposterousness of Wade's argument lies in its extramission theory-like reliance on genes enabling behavioral tendencies based on cultural identity. The plausibility lies in the recognition of population sub-structure, which will be discussed later in this post.

Spatial distribution of a single human phenotypic trait (skin color). Is there any significance to these geographic patterns, and how can we tell? Some people think a map like this contains self-evident truths, others do not.

My impression of Wade's argument boils down to this: the argument against human population subdivision (racial classification) is largely driven by political correctness, and all we need to do is apply the right science to see the truth. There is a circularity and self-evident quality to this argument that is somewhat disturbing. Unfortunately for Wade, he draws precious little insight from either Population Genetics or modern Evolutionary Anthropology. So the Evolutionary Anthropology/Biology critique is not merely based on enforcing conformity; it is substantive and necessary. And as we will later see, the interpretation of human biological variation is much more complicated than many people will admit.

I would be remiss not to mention that Wade has his supporters [5]. A more cynical eye might label him (Wade) and these supporters as "scientific racists" or "deniers" (the denier label has also been applied to Anthropologists by a likely Wade supporter). But the real problem is the dialogue between different scientific research groups and fields of study. In the study of variation and its consequences, what is obvious to a person in one field might take someone in another field completely by surprise. People who are unfamiliar with nuanced variation-dependent thinking are more likely to embrace the notion that there is something fishy about these ways of dealing with variation and instead choose a so-called common sense approach. But because science is so (unnecessarily) hyper-specialized, perhaps we should not expect a full accounting of variation from any one group or field. Instead, perhaps we should play the long game of interpreting variation.

II. How do we approach human structural variation?
The debate between Agustin Fuentes and Nicholas Wade [6] operationalizes this gap in understanding in two ways. First, Wade sets up a straw man in declaring that the standard Anthropological view only considers races as socio-political categories rather than biological ones. But since biological differences between human populations are self-evident (albeit at a superficial level) and consistent with so-called common sense, this view must be erroneous (or at least severely flawed). Yet this is a conceptual assumption on Wade's part -- racial categories are often based on superficial biological attributes, but rooted in social context and the fluid nature of identity over time. This is the essence of the Anthropological lesson: biological features and social context often cannot be disentangled. Second, Wade uses an example of several geographically-distinct "natural" groups being generated by a program called Structure [7]. One might think that a statistical argument would supersede any conceptual arguments that might be rooted in ideology. But as I will show you, so-called racial categories cannot be easily statistically disentangled from other structural features of human populations, either.

This is what a cluster analysis with Structure looks like (a generic example). Considering that this is an unsupervised form of machine learning and a rather exploratory form of data analysis, should this be the standard tool for the analysis of human genomic variation? COURTESY: R and Bioconductor manual.

So is Wade's argument typical of contemporary thinking on human biological variation? We can look to research groups located at Stanford (Pritchard Lab), Chicago (Lahn Lab) [8], and UC Santa Cruz (Haussler Lab) for instructive lessons. These groups have found both structured variation and intriguing differences between populations, but unlike Wade they do not associate these findings with broad cultural characterizations. Their hypothesis is that natural selection has acted extensively on modern human populations, owing largely to regional environmental adaptation [9]. Examples include adaptations for lactose tolerance and the discovery of human accelerated regions (HARs) [10]. Indeed, there is accumulating evidence that evolutionary adaptation has occurred on a regional basis in modern humans even within the last 10,000 years (e.g. fast evolutionary change). This has resulted in regional distinctions between populations [11], but whether these demographic changes warrant formal taxonomic distinction is another matter.

A standard PCA analysis shows that genotype can predict geographic origin. COURTESY: Gene expression blog.

Indeed, while these differences may serve to support the existence of biological races [12], it is unclear how these are precisely defined. Should it be based on the geographic distribution of variation [13], or should it be based on genealogical continuity? So even when we are not dealing with the naive theories of Wade, traditional taxonomic views of race are still problematic when dealing with the complexity of intraspecific variation. What kinds of lessons can we learn from all of this? I will now propose several points that suggest how we might address human variation in an informed manner that incorporates multiple points of view.

 Map showing selected examples of recent human evolution. COURTESY: Washington Post.

1. Geography does not equal natural subdivision.
In all of the reviews of Wade's book, it has never been mentioned why cluster analyses might not be the best way to uncover the true structure of intraspecific variation. So why is it incorrect to look for "natural subdivisions" using a cluster analysis? Because what Wade and others claim to be "natural" subdivisions of genetic diversity may actually be geographic artifacts.

In the Fuentes/Wade debate and in a blog post by Jennifer Raff [14], it is mentioned that a series of experiments can be conducted using k-means cluster analysis. The Structure program provides a k-means-like cluster analysis (or perhaps more accurately, a k-class cluster analysis) with corrections for the effects of admixture between groups [15]. k-means cluster analysis is, of course, a method in which the structure of a dataset is specified a priori. In this type of cluster analysis, the number of categories (of order k) is chosen by the analyst and should correspond to the number of actual sub-categories (or structure) in the data. Quite telling is that when the value for k is set above 5, the analysis produces clusters that are not geographically distinct (e.g. European and Asian populations in the same group).
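To make the role of k concrete, here is a minimal sketch (using scikit-learn's KMeans, not the Structure algorithm itself) in which the source populations, sample sizes, and allele frequencies are all invented. The point is only that the analyst's choice of k, rather than the data alone, determines how many "natural" groups appear.

```python
# A minimal sketch (scikit-learn's KMeans, NOT the Structure algorithm itself)
# showing how the analyst's choice of k pre-determines the number of apparent
# "natural" subdivisions. All populations and allele frequencies are invented.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_loci = 200

def simulate_population(base_freqs, n_individuals):
    """Draw 0/1/2 genotype counts for each individual at each locus."""
    return rng.binomial(2, base_freqs, size=(n_individuals, base_freqs.size))

freqs_a = rng.uniform(0.2, 0.8, n_loci)
freqs_b = np.clip(freqs_a + rng.normal(0, 0.05, n_loci), 0.01, 0.99)  # close to A
freqs_c = np.clip(freqs_a + rng.normal(0, 0.15, n_loci), 0.01, 0.99)  # more distinct

genotypes = np.vstack([simulate_population(f, 50) for f in (freqs_a, freqs_b, freqs_c)])

# Three source populations went in, but k is entirely up to the analyst.
for k in (2, 3, 5, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(genotypes)
    counts = [np.bincount(labels[i * 50:(i + 1) * 50], minlength=k).tolist() for i in range(3)]
    print(f"k={k}: cluster membership per source population -> {counts}")
```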

One way to correct for this bias is to perform a spatial decorrelation (e.g. spatial PCA) [16]. This would result in removing the similarity of individuals due to shared location [17]. This may seem counterintuitive at first, but consider why this might be important. While a race might be defined by singular traits (e.g. eye color or skin color), these are not robust enough to constitute meaningful population structure in and of themselves. If biological subdivisions do indeed exist, then we would want a classificatory scheme that includes as many traits (e.g. dimensions) as possible without picking up the consequences of traits being co-located in space.
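As a rough illustration of the idea (this is not the spatial PCA method of Novembre and Stephens [16], only a demonstration of removing location-driven similarity), one can regress the leading genetic principal components on geographic coordinates and cluster the residuals. Everything below, from the synthetic genotypes to the number of clusters, is an assumption made for demonstration purposes.

```python
# A rough illustration of removing location-driven similarity before clustering.
# This is NOT the spatial PCA of Novembre and Stephens [16]; it simply regresses
# the leading genetic principal components on synthetic coordinates and clusters
# the residuals. All data below are simulated for demonstration only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_individuals, n_loci = 300, 500

coords = rng.uniform(0, 10, size=(n_individuals, 2))       # synthetic lat/lon
cline = coords @ rng.normal(0, 0.02, size=(2, n_loci))     # geography-driven allele-frequency cline
genotypes = rng.binomial(2, np.clip(0.5 + cline, 0.05, 0.95))

pcs = PCA(n_components=10).fit_transform(genotypes)

# Remove the part of each principal component that is predictable from geography.
geo_model = LinearRegression().fit(coords, pcs)
residual_pcs = pcs - geo_model.predict(coords)

raw_clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pcs)
decorrelated = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(residual_pcs)
print("cluster sizes on raw PCs:      ", np.bincount(raw_clusters))
print("cluster sizes on residual PCs: ", np.bincount(decorrelated))
```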

2. Allele frequencies are not dynamic information in and of themselves
Another issue is that even though population structure exists among Homo sapiens, contemporary data are merely a static snapshot of our species. And in a cultural species such as Homo sapiens, the extensive population bottlenecks that define migration events and ethnogenesis play a large role in defining structure. These structural features have the potential to mimic more systematic gene flow restriction and racial group-specific evolution. Any hypothesis of systematic population structure in humans must also consider alternative explanations such as these.

An Analysis of Molecular Variance (AMOVA) [18] should generally reveal that there is more variation within than between many continental-level subdivisions of Homo sapiens. Thus, there is no more reason to believe that races (if they exist) have to be geographically distinct than there is to believe that continental groups are homogeneous. But perhaps they do not exist at all, and do not serve as a stable and reliable taxonomic category. A paper by Long and Kittles [19] considers four different race concepts based on patterns of the genetic fixation parameter Fst: typological, population, taxonomic, and lineage. While measurements of allele frequencies reveal more variation than expected using Fst, these data are still not consistent with formal taxonomic (e.g. racial) subdivisions.
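For intuition, here is a back-of-the-envelope sketch of Wright's Fst computed from simulated allele frequencies (not a full AMOVA in the sense of [18]). The three "continental" groups and their frequencies are invented, but the within-versus-between partition they illustrate is the one discussed above.

```python
# A back-of-the-envelope sketch of Wright's Fst (not a full AMOVA in the sense of [18]).
# For each locus, Fst = (H_T - H_S) / H_T, where H_T is the expected heterozygosity
# of the pooled population and H_S the mean within-population heterozygosity.
# The three "continental" groups and their allele frequencies are invented.
import numpy as np

rng = np.random.default_rng(2)
n_loci = 1000

p_global = rng.uniform(0.1, 0.9, n_loci)
pops = [np.clip(p_global + rng.normal(0, 0.1, n_loci), 0.01, 0.99) for _ in range(3)]

h_s = np.mean([2 * p * (1 - p) for p in pops], axis=0)   # mean within-group heterozygosity
p_bar = np.mean(pops, axis=0)
h_t = 2 * p_bar * (1 - p_bar)                            # heterozygosity of the pooled population

fst = (h_t - h_s) / h_t
print(f"mean Fst across loci = {fst.mean():.3f}")
print(f"share of diversity found within groups ~ {1 - fst.mean():.1%}")
```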

3. Environment is both the problem and the solution
A third issue involves the role of GxE, or the interaction of genes and environment. While there are of course formal GxE tests in modern genetics, my purpose here is to point out what is perhaps the true role of environmental influences on variation. To understand this, we must return to the standard Anthropological position on race. Paraphrased, this position states that in the scheme of history races are sociopolitical entities, the biological significance of which is diluted by the nature of identity. So as we can see, the environment for humans actually includes explicitly cultural adaptation, with its own structural features. This may give rise to an additional term: GxExC.

This cultural variance, unlike the environmental variance found in other mammals such as foxes or hedgehogs, can be substantial. But does culture act as a multiplier of genetic effects, or as a buffer from genetic effects? It really does depend on both the social and biological context. On the one hand, culture has allowed populations to adapt to new environments without the need for genomic adaptation. On the other hand, environmentally-driven genomic adaptations have allowed for cultural innovation. It may be best to consider this relationship between culture and genome as a dual-process model, the evolution of which can often be independent of any nominal genetic structure.

4. Variation is not straightforward
Finally, there is the question of how to compare groups with equal amounts of variation while keeping behavioral influences on migration and other factors [20] to a minimum. If you want to avoid comparing apples and oranges [21], you must keep this issue in mind. What type of group is a valid subdivision, particularly for comparative purposes? Is the fundamental level of subdivision continental, or based on ethnic group, or is it simply one of restricted gene flow? Due to their basis in migration and identity, human ethnic groups can be either contrived (e.g. of a polyglot nature) [18] or homogeneous. Even in cases where significant structure exists, the link to broader social relevance (e.g. cultural diversity and traits) is questionable at best [22].

What is the true structure of our species, and how does it matter for purposes of classification? One consideration is that even though some interbreeding has been found to occur between modern humans and archaic species, the trend of recent human expansion out of Africa (the RAO model) is still predominant [23]. This suggests that the apparent structure in human genetic data may not be all that deep, pointing towards strong contributions from so-called fast evolution. While fast evolution may be sufficient to drive racial population structure on its own, the nature of changing political boundaries and population patterns makes traditional biological classification somewhat superfluous for humans [24].

The tempo and broad overview of human demographic expansion, according to the RAO model.

5. Population genetics + Linnean taxonomy might not be the answer
There are also several problems with the concept of race as a taxonomic (a) and organizational (b) term. Let's look at each of these in more detail.

a) Linnean-style classification below the species level is problematic. As putative sub-categories of a species, races are not defined in the same way as species. Aside from there being many species definitions, the most popular (the Biological species concept) is based on molecular mechanisms for reproductive isolation and the genesis of discontinuous variation. This is true even among populations that evolve in a spatially commingled fashion. However, defining races (and their emergence) using a systematic approach is much trickier, and would presumably involve restricted gene flow within humans. But despite geographic barriers and the local concentration of certain adaptations and traits [25], the assumption that intraspecific structure should necessarily be nested or discrete is likely misguided [24].

b) To assess significant population structure, we often begin from a condition of panmixia. Panmixia is a situation in which every member of the species (or population) has an equal chance of breeding with any other. This is often thought of as random assortment, and so restrictions in gene flow will lead to a signature of population subdivision (groups which are called demes). The problem is that we typically use panmixia as the null hypothesis. A more plausible (to me) null model is something like a scale-free or even small-world network, complete with highly-connected populations and weak connectors between populations. This would not only account for all possible configurations of interbreeding relationships (which are heavily influenced by culture) over evolutionary time, but would also account for the dual inheritance and evolutionary processes of culture and biology simultaneously [26].
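To make the alternative null model concrete, the toy sketch below (using networkx, with arbitrary node counts and wiring parameters) contrasts a complete mating network (panmixia) with small-world and scale-free networks of populations linked by a few weak connectors.

```python
# A toy contrast between the usual null model (panmixia, a complete mating network)
# and the alternative suggested above: small-world or scale-free networks of
# populations with a few weak connectors. Node counts and wiring are arbitrary.
import networkx as nx

n = 100
models = {
    "panmixia":    nx.complete_graph(n),                              # everyone mates with everyone
    "small-world": nx.connected_watts_strogatz_graph(n, k=6, p=0.1),  # mostly local ties, a few long links
    "scale-free":  nx.barabasi_albert_graph(n, m=2),                  # a few highly connected hub populations
}

for name, g in models.items():
    print(f"{name:12s} mean degree={2 * g.number_of_edges() / n:6.1f}  "
          f"clustering={nx.average_clustering(g):.2f}  "
          f"mean path length={nx.average_shortest_path_length(g):.2f}")
```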

A scenario for modern human demographic expansion. Notice the role of migrations and population expansions. COURTESY: Figure 1 in [27].

So what does the long game of human biological variation involve? It helps to resist the temptation of approaching the issue from a solely reductionist perspective. While large-scale genetic data give us a formidable advantage in the understanding of human diversity [28], failing to make the link to social complexity and phenotype makes the technology and the science involved much less useful. But sometimes one source of data can help drive the entire enterprise forward. In Pritchard, Pickrell, and Coop [29], whole-genome data allow the authors to propose a new mechanism called polygenic adaptation, which permits rapid adaptation in a population without the need for selective sweeps. While this stands at odds with comparative morphology, physiology, and conventional population genetics, and thus might be incorrect, it also might help us make sense of the big question in a new light. New analytical tools and a unification of theoretical perspectives can also help, particularly in an area where the potential for mythology and bias is rife. Given that "nothing in the study of variation makes sense except in the light of complexity" [30], a vision that transcends discipline and investigative approach is absolutely required.


NOTES:
[1] Anthropologists chime in: Fuentes, A.   The Troublesome Ignorance of Nicholas Wade. HuffPo blog, May 19 (2014) AND Marks, J.   The Genes Made Us Do It: the new pseudoscience of racial difference. In These Times, May 12 (2014) AND Dunsworth, H.   If scientists were to make the arbitrary decision that biological race is real, can you think of a positive outcome? Mermaid's Tale blog, May 22 (2014).

[2] Evolutionary Biologists chime in: Orr, H.A.   Stretch Genes. NYTimes Review of Books, June 5 (2014) AND Coyne, J.   New book on race by Nicholas Wade: Professor Ceiling Cat says paws down. Why Evolution is True blog, May 14 (2014) AND Pigliucci, M.   On the Biology of Race. Scientia Salon blog, May 29 (2014) AND Yoder, J.   How A Troublesome Inheritance gets human genetics wrong. The Molecular Ecologist blog, May 29 (2014).

[3] Although this paper serves as a means to make sense of human variation from an Anthropological perspective that goes beyond sloganeering: Weiss, K.M. and Fullerton, S.M.   Racing around, getting nowhere. Evolutionary Anthropology, 14, 165-169 (2005).

[4] Gelman, A.   The Paradox of Racism. Slate Magazine, May 8 (2014).

[5] One take on the relative merits of Wade's book and support for his argument can be found here: VanBruggen, R.   Race Is Real. What Does That Mean for Society? RealClearScience blog, May 6 (2014).

[6] Rex   What Happened at the Fuentes-Wade Webinar. Savage Minds blog, May 14 (2014).

[7] Wade, Nicholas. Gene Study Identifies 5 Main Human Populations, Linking Them to Geography. NY Times, December 20 (2002). But also see this article as a follow-up: Rotimi, C.N.   Are medical and nonmedical uses of large-scale genomic markers conflating genetics and 'race'? Nature Genetics, 36(11), S43-S47 (2004).

[8] For more information, please see the following citations:
a) Raj, A., Stephens, M., Pritchard, J.K.   Variational Inference of Population Structure in Large SNP Datasets. Genetics, doi:10.1534/genetics.114.164350 (2014).

b) Atkins, C.E.   Bruce Lahn Interview. H+ Magazine, May 12 (2012).

c) Seed Interview: Bruce Lahn. Seed Magazine, September 11 (2006).

[9] McAuliffe, Kathleen. They Don't Make Homo Sapiens Like They Used To: Our species—and individual races—have recently made big evolutionary changes to adjust to new pressures. Discover Magazine, Feb. 2, 2009.

[10] Pollard, K.S., Salama, S.R., King, B., Kern, A.D., Dreszer, T., Katzman, S., Siepel, A., Pedersen, J.S., Bejerano, G., Baertsch, R., Rosenbloom, K.R., Kent, J., and Haussler, D.   Forces shaping the fastest evolving regions in the human genome. PLoS Genetics, 2(10), e168 (2006).

[11] Hsu, Steve. Demography and fast evolution. Information Processing, Aug. 9, 2011.

[12] More biologists chime in: Moran, L.A.   Do Human Races Exist? Sandwalk blog, March 1 (2012) AND Lahn, B.T. and Ebenstein, L. (2009) Let's celebrate human genetic diversity. Nature, 461, 726-728.

[13] For the integration of genealogy and geography, please see: Tishkoff, S.A. and Kidd, K.K. Implications of biogeography of human populations for 'race' and medicine. Nature Genetics, 36(11), S21 - S27 (2004).

Please also see the following reference: The Genographic Project. National Geographic.

[14] Another Anthropologist chimes in: Raff, J.   Nicholas Wade and race: building a scientific facade. Violent Metaphors blog, May 21 (2014).

[15] UPDATED 7/15/2014: If you are wondering as to the technical details of the clustering approach used in Structure, see the following blog post: Pontikos, D.   k-means and Structure. Dienekes' Anthropology Blog, July 15 (2014).

Technically, Structure does not actually use k-means. Instead of using k unimodal centroids (or assuming normally-distributed categories), the algorithm underlying Structure uses bimodally-determined classes to account for the existence of potential admixture. Again, this is done to correct for artifacts due to admixture and other sources of genealogical recombination. However, the issue of potential spatial artifacts remains.

[16] Here are two examples of how to treat spatially-dependent data:
a) Spatial PCA: Novembre, J. and Stephens, M.   Interpreting principal component analyses of spatial population genetic variation. Nature Genetics, 40(5), 646-649 (2008).

b) General Classification: Hariharan, B., Malik, J., and Ramanan, D.   Discriminative Decorrelation for Clustering and Classification. Lecture Notes in Computer Science (LNCS), 7575, 459-472 (2012).

[17] Epperson, B.K.   Geographical Genetics. Princeton University Press (2003).

[18] The AMOVA is a relative of the ANOVA (Analysis of Variance). For more information, please see: Excoffier, L., Smouse, P., and Quattro, J.   Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics, 131, 479-491 (1992).

[19] Long, J.C. and Kittles, R.A.   Human Genetic Diversity and the Nonexistence of Biological Races. Human Biology, 81(5-6), 777-798 (2009).

[20] Foster, S.A.   The Geography of Behavior: an evolutionary perspective. Trends in Ecology and Evolution, 14(5), 190-195 (1999).

[21] Frost, P.   Apples, Oranges, and Genes. Evo and Proud blog, November 5 (2011).

[22] Khan, R.   Human Races May Have Biological Meaning, But Races Mean Nothing About Humanity. Discover's The Crux blog, May 2 (2012). Please also see the November 2004 supplemental issue of Nature Genetics on human variation for many enlightening articles.


[23] Stringer, C.   Why we are not all multiregionalists now. Trends in Ecology and Evolution, 29(5), 248-251 (2014).

[24] Laden, G.   The Scientific, Political, Social, and Pedagogical Context for the Claim that "Race does not exist". Greg Laden's blog, November 29 (2008).

[25] Novembre, J., Galvani, A.P., and Slatkin, M.   The Geographic Spread of the CCR5 Δ32 HIV-Resistance Allele. PLoS Biology, 3(11), e339 (2005).

[26] Richerson, P.J. and Boyd, R.   Not By Genes Alone: How Culture Transformed Human Evolution. University of Chicago Press (2005).

[27] Balaresque, P.L.   Challenges in human genetic diversity: demographic history and adaptation. Human Molecular Genetics, 16(R2), R134-R139 (2007).

[28]  Tennessen, J.A., O'Connor, T.D., Bamshad, M.J., and Akey, J.M.   The Promise and Limitations of Population Exomics for Human Evolution Studies. Genome Biology, 12, 127 (2011).

[29] Pritchard, J.K., Pickrell, J.K., and Coop, G.   The Genetics of Human Adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Current Biology, 20(4), R208-R215 (2010).

[30] a play on the classic Dobzhansky quote. Dobzhansky's views on the race concept evolved over time, from seeing races as biological "clusters" fixed in space to seeing races as "Mendelian populations". For more, please see: Gannett, L.   Theodosius Dobzhansky and the Genetic Race Concept. Studies in History and Philosophy of Biological and Biomedical Sciences, 44(3), 250-261 (2013).

May 20, 2014

Starstuff Squared, Rubik's Cubed

Welcome to the 250th Synthetic Daisies post! This post consists of three subthemes cross-posted from Tumbld Thoughts. The first is in honor of the Google Doodle for the 40th Anniversary of the Rubik's Cube, while the latter two are the supplemental readings for the tenth and eleventh episodes of the Cosmos reboot. 

I. Rubik's 3-D CSS Cubes


Today is the 40th anniversary of the Rubik's Cube. Naturally, there is a Google Doodle. Aside from being an invention that sold 350 million copies, Rubik's Cube is also an example of a permutation puzzle that poses an interesting problem related to group theory. The doodle itself is unique in that it utilizes a technology called CSS 3-D transforms [1].

[1] Edidin, R.   How Google Built Its 3-D Interactive Rubik’s Cube Doodle. May 19 (2014). Also check out the Chrome Cube Lab, which uses this technology to render interactive cube-based puzzles beyond Rubik's namesake.



II. All I Want For Christmas is an Electric Charge


Here are the supplemental readings for the tenth episode of the Cosmos reboot ("The Electric Boy"). Readings are organized by theme.


The Electric Boy and his Legacy:

History of the Christmas Lectures. The Royal Institution.

Cody, D.   Social Class. The Victorian Web. July 12 (2002).

Burgess, M.P.D.   Semiconductor History: Faraday to Shockley. Transistor History (2008).

Williams, A.T.   Faraday vs. Maxwell and Faraday and the Ether. Consciousness, Physics, and the Holographic Paradigm (2010).

Electromagnetic Spectrum. NASA Goddard Space Flight Center.



Frog Legs and Televisions:
Galvani's animal electricity experiments. Institute of Engineering and Technology.

Luigi Galvani (1737-1798). Center for Integrating Research + Learning, Magnet Lab, Florida State University.

Borgens, R.B., Vanable, J.W., and Jaffe, L.F.   Bioelectricity and regeneration. I. Initiation of frog limb regeneration by minute currents. Journal of Experimental Zoology, 200(3), 403–416 (1977).

Iconoscope. Wikipedia, March 20 (2014).

Philo Farnsworth (1906-1971), Electronic Television. Inventor of the Week Archive, Lemelson-MIT (1999).


Researching Faraday Cages and Electromagnetic Fields on Google is a sad statement on Internet culture:
Chandler, N.   How Faraday Cages Work. How Stuff Works.

Trottier, L.   A Growing Hysteria. Committee for Skeptical Inquiry, CFI. October (2009).




Inventions/Discoveries of the Electric Boy:
Faraday's Inventions. Michael Faraday's World.

Homopolar Generator, Wikipedia. April 1 (2014).

Electrolysis, Wikipedia. May 7 (2014).

Faraday Cage, Wikipedia. March 19 (2014).

Electric Motor, Wikipedia. May 10 (2014).

Static Electricity, Wikipedia. May 5 (2014).


III. Leaving Nothing but Footprints, but Still Living On.


Here are the supplemental readings for the eleventh installment of the Cosmos reboot ("The Immortals"). As usual, readings are organized by theme.



Entropy Is Not Immortality, Time Can Be Written Down:
Matson, J.   What Keeps Time Moving Forward? Blame It on the Big Bang. Scientific American, January 7 (2010).

Mlodinow, L. and Brun, T.A.   Relation between the psychological and thermodynamic arrows of time. Physical Review E, 89, 052102 (2014).


Jones D.L.   Aging and the germ line: where mortality and immortality meet. Stem Cell Reviews, 3(3), 192-200 (2007).

Barksdale, M.   10 Methods of Measuring Time. Discovery TV: Relativity and Time.

Origins of Writing Systems. AncientScripts.com.



Fun With the Origins of DNA:
Akst, J.   RNA World 2.0. The Scientist, March 1 (2014).

Moran, L.A.   Changing Ideas About the Origin of Life. Sandwalk blog, August 7 (2012).

Joshi, S.S.   Origin of Life: the Panspermia Theory. December 2 (2008).

Klyce, B.   Cosmic Ancestry

Saenz, A.   Venter creates first synthetic self-replicating bacteria from scratch. SingularityHub, May 20 (2010).


Moving Life (via Dispersal):
Levin, S.A., Muller-Landau, H.C., and Nathan, R.   The Ecology and Evolution of Seed Dispersal: a theoretical perspective. Annual Review of Ecology, Evolution, and Systematics, 34, 575-604 (2003).

Gronstal, A.   Space Rocks Could Reseed Life on Earth. Astrobiology Magazine, May 15 (2008).




Civilization is (not) Forever:
Chandler, G.   Desertification and Civilization. Saudi Aramco World, 58, 6 (2007).

Arbesman, S.   210 Reasons for the Fall of the Roman Empire. Social Dimension blog, June 26 (2013).

Kunzig, R.   Geoengineering: How to Cool Earth--At a Price. Scientific American, November (2008).

Duncan, R.C.   The Olduvai Theory: sliding towards a post-industrial Stone Age. Institute on Energy and Man, June 27 (1996).

Math Program Cracks Cause of Venus Hell Hole. Space Daily, March 21 (2001).

May 12, 2014

Fireside Science: The Analysis of Analyses

This material is cross-posted to Fireside Science. This is part of a continuing series on the science of science (or meta-science, if you prefer). The last post was about the structure and theory of theories.


In this post, I will discuss the role of data analysis and interpretation. Why do we need data, as opposed to simply observing the world or making up stories? The simple answer: data give us a systematic accounting of the world in general, and of experimental manipulations in particular, independent of our sensory and conceptual biases (unlike, say, the apparition on a piece of toast). But as we saw in the theory of theories post, and as we will see in this post, it takes a lot of hard work and thoughtfulness. What we end up with is an analysis of analyses.

Data take many forms, so approach analysis with caution. COURTESY: [1].

Introduction
What exactly is data, anyways? We hear a lot about it, but rarely stop to consider why it is so potentially powerful. Data are both an abstraction of and an incomplete sampling (approximation) of the real world. While data are never absolute (e.g. you can always collect more data or sample the world more completely), they provide a means of generalization that is partially free from stereotyping. And as we can see in the cartoon above, not all of the data that influence our hypothesis can even be measured. Some of it is beyond the scope of our current focus and technology (e.g. hidden variables), while some of it consists of interactions between variables.

In the context of the theory of theories, data have the same advantage over anecdote that deep, informed theories have over naive theories. In the context of the analysis of analyses, data do not speak for themselves. To conduct a successful analysis of analyses, it is important to be both interpretive and objective. Finding the optimal balance between these gives us an opportunity to reason more clearly and completely. If this causes some people to lose their view of data as infallible, then so be it. Sometimes the data fail us, and other times we fail ourselves.

When it comes to interpreting data, the social psychologist Jon Haidt suggests that "we think we are scientists, but we are actually lawyers" [2]. But I would argue that this is where the difference between the untrained eyes sharing Infographics and the truly informed acts of analysis and data interpretation becomes important. The latter is an example of a meta-meta-analysis, or a true analysis of analyses.

The implications of Infographics are clear (or are they?) COURTESY: Heatmap, xkcd.

NHST: the incomplete analysis?
I will begin our discussion with a current hot topic in the field of analysis. It involves interpreting statistical "significance" using an approach called Null Hypothesis Significance Testing (or NHST). If you have ever done a t-test or ANOVA, you have used this approach. The current discussion about the scientific replication crisis is tied to the use (and perhaps overuse) of these types of tests. The basic criticism involves the inability of NHST statistics to handle multiple tests properly and to deal properly with experimental replication.
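The multiple-testing part of this criticism is easy to demonstrate: run many tests on pure noise and roughly five percent of them will come out "significant" at p < 0.05. The simulation below is a minimal sketch with arbitrary sample sizes.

```python
# A minimal simulation of the multiple-comparisons problem: run many t-tests on
# pure noise and roughly 5% of them come out "significant" at p < 0.05, even
# though no real effect exists. Sample sizes here are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_tests, n_per_group = 1000, 30

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(0, 1, n_per_group)   # both groups drawn from the same distribution
    b = rng.normal(0, 1, n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} null comparisons were 'significant' at p < 0.05")
```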

Example of the NHST and its implications. COURTESY: UC Davis StatWiki.

This has even led scientists such as John Ioannidis to demonstrate why "most significant results are wrong". But perhaps this is just to make a rhetorical point. The truth is, our data are inherently noisy. Too many assumptions/biases go into collecting most datasets, all for data with too little known structure. Not only are our data noisy, but in some cases they may also possess hidden structure that violates the core assumptions of many statistical tests [3]. Some people have rashly (and boldly) proposed that this points to flaws in the entire scientific enterprise. But, like most such claims, this does not take into account the nature of the empirical enterprise and the reification of the word significance.

A bimodal (e.g. non-normal) distribution, being admonished by its unimodal brethren. Just one case in which the NHST might fail us.

The main problem with the NHST is that it relies upon distinguishing signal from noise [4], but not always in the broader context of effect size or statistical power. In a Nature News correspondence [5], Regina Nuzzo discusses the shortcomings of the NHST approach and tests of statistical significance (e.g. p-values). Historical context for the so-called frequentist approach [6] is provided, and its connection to assessing the validity of experimental replications is discussed. One possible solution is the use of Bayesian techniques [7], along with more attention to statistical power. The Bayesian approach allows one to use a prior distribution (or historical conditioning) to better assess the meaningfulness of one's statistically significant result. But the construction of priors relies on the existence of reliable data. If these data do not exist for some reason, we are back to square one.
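As a flavor of the Bayesian alternative (this is a minimal beta-binomial sketch, not the full treatment discussed in [5] and [7]), the same hypothetical result can look quite different under a flat prior versus a skeptical one.

```python
# A minimal beta-binomial sketch of Bayesian updating (only the flavor of the
# approach, not the full treatment in [5] or [7]). The experiment (14 "hits"
# in 20 trials) and both priors are hypothetical.
from scipy import stats

successes, trials = 14, 20

priors = {
    "flat prior Beta(1, 1)":        (1, 1),
    "skeptical prior Beta(10, 10)": (10, 10),
}

for name, (a, b) in priors.items():
    posterior = stats.beta(a + successes, b + trials - successes)
    print(f"{name}: posterior mean = {posterior.mean():.2f}, "
          f"P(rate > 0.5) = {1 - posterior.cdf(0.5):.2f}")
```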

Big Data and its Discontents
Another challenge to conventional analysis involves the rise of so-called big data. Big data is the collection and analysis of very large datasets, which come from sources such as high-throughput biology experiments, computational social science, open-data repositories, and sensor networks. Considering their size, big data analyses should allow for good statistical power and the ability to distinguish signal from noise. Yet due to their structure, we are often required to rely upon correlative analyses. While correlation does capture relational information, it does not (and never has) equate to causation [8]. Innovations in machine learning and other data modeling techniques can sometimes overcome this limitation, but correlative analyses are still the easiest way to deal with these data.

IBM's Watson: powered by large databases and correlative inference. Sometimes this cognitive heuristic works well, sometimes not so much.

Given a large enough collection of variables with a large number of observations, correlations can lead to accurate generalizations about the world [9]. The large number of variables is needed to extract relationships, while the large number of observations is needed to understand the true variance. This can be a problem where subtle, higher-order relationships (e.g. feedbacks, time-dependent saturations) exist or when the variance is not uniform with respect to the mean (e.g. bimodal distributions).
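The flip side is easy to simulate: with many unrelated variables and too few observations, the single strongest pairwise correlation can look impressive purely by chance. The variable counts and sample sizes below are arbitrary.

```python
# With many unrelated variables and too few observations, the strongest pairwise
# correlation can look impressive purely by chance. Variable and sample counts
# below are arbitrary.
import numpy as np

rng = np.random.default_rng(4)

for n_obs in (20, 2000):
    data = rng.normal(size=(n_obs, 500))     # 500 mutually independent variables
    corr = np.corrcoef(data, rowvar=False)
    np.fill_diagonal(corr, 0)                # ignore self-correlations
    print(f"n_obs={n_obs:5d}: strongest spurious correlation = {np.abs(corr).max():.2f}")
```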

Complex Analyses
Sometimes large datasets require more complicated methods to find relevant and interesting features. These features can be thought of as solutions. How do we use complex analysis to find these features? In the world of analysis of analyses, large datasets can be mapped to solution spaces with a defined shape. This strategy uses convergence/triangulation as a guiding principle, but does so through the rules of metric geometry and computational complexity. A related and emerging approach called topological data analysis [10] can be used to conduct rigorous relational analyses. Topological data analysis takes datasets and maps them to a geometric shape (e.g. topology) such as a tree or in this case a surface.
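A toy version of this idea (a simplified Mapper-style sketch, not the actual pipeline used by Ayasdi or in [10]) can be written in a few lines: slice the data with overlapping intervals of a filter function, cluster within each slice, and connect clusters that share points. The resulting graph is the "shape" of the dataset.

```python
# A toy, Mapper-style sketch of the idea (a simplification, not the pipeline used
# by Ayasdi or in [10]): cover the data with overlapping slices of a filter
# function, cluster within each slice, and connect clusters that share points.
# The resulting graph summarizes the "shape" of the dataset; for a noisy circle
# it should itself form a loop.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 400)
data = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (400, 2))  # a noisy circle

filter_values = data[:, 0]                    # filter function: projection onto the x-axis
bin_edges = np.linspace(filter_values.min(), filter_values.max(), 6)
overlap = 0.3 * (bin_edges[1] - bin_edges[0])

clusters = []                                 # each graph node = one cluster of data points
for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
    idx = np.where((filter_values >= lo - overlap) & (filter_values <= hi + overlap))[0]
    if idx.size == 0:
        continue
    labels = DBSCAN(eps=0.25, min_samples=5).fit_predict(data[idx])
    clusters.extend(set(idx[labels == lab]) for lab in set(labels) - {-1})

# Connect clusters from overlapping slices that share data points.
graph_edges = {(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))
               if clusters[i] & clusters[j]}

print(f"{len(clusters)} graph nodes (clusters), {len(graph_edges)} graph edges (overlaps)")
```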


A portrait of convexity (quadratic function). A gently sloping dataset, a gently sloping hypothesis space. And nothing could be further from the truth...

In topological data analyses, the solution space encloses all possible answers on a surface, while the surface itself has a shape that represents how easy it is to move from one portion of the solution space to another. One common assumption is that this solution space is known and finite, while the shape is convex (e.g. a gentle curve). If that were always true, then analysis would be easy: we could use a moderately large dataset to get the gist of the patterns in the data, and any additional scientific inquiry would constitute filling in the gaps. And indeed, sometimes it works out this way.

One example of a topological data analysis of most likely Basketball positions (includes both existing and possible positions). COURTESY: Ayasdi Analytics and [10].

The Big Data Backlash.....Enter Meta-Analysis
Despite its successes, there is nevertheless a big data backlash. Ernest Davis and Gary Marcus [11] present us with nine reasons why big data are problematic. Some of these have been covered in the last section, while others suggest that there can be too much data. This is an interesting position, since it is common wisdom that more data always give you more resolution and insight. Yet insight and information can be obscured by noisy or irrelevant data. And even the most informative of datasets can yield misinformed analyses if the analyst is not thoughtful.

Of course, ever-bigger datasets by themselves do not give us the insights necessary to determine whether or not a generalized relationship is significant. The ultimate goal of data analysis should be to gain deep insights into whatever the data represent. While this does involve a degree of interpretive subjectivity, it also requires an intimate dialogue between analysis, theory, and simulation. Perhaps this dialogue is even more important, particularly in cases where the data are politically or socially sensitive. These considerations are missing from much contemporary big data analysis [12]. This vision goes beyond the conventional "statistical test on a single experiment" kind of experimental investigation, and leads us to meta-analysis.

The basic premise of a meta-analysis is to use a strategy of convergence/triangulation to converge upon a result using a series of studies. The logic here involves using the power of consensus and statistical power to arrive at a solution. The problem is represented as a series of experiments, each with its own effect size. For example, if I believe that eating oranges causes cancer, how should I arrive at a sound conclusion? With one study that has a very large effect size, or with many studies spanning various effect sizes and experimental contexts? According to the meta-analysis view, the latter should be most informative. In the case of potential factors in myocardial infarction [13], significant results that all point in the same direction (with minimal effect-size variability) lend the strongest support to a given hypothesis.
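The arithmetic behind this kind of pooling is simple inverse-variance weighting. The fixed-effect sketch below uses five invented studies and is only meant to show how per-study effect sizes and standard errors combine into a single estimate, in the spirit of forest plots like those in [13].

```python
# A minimal fixed-effect meta-analysis by inverse-variance weighting. The five
# "studies", their effect sizes, and their standard errors are invented; this is
# only the arithmetic behind forest plots like those in [13].
import numpy as np

effects = np.array([0.32, 0.18, 0.45, 0.25, 0.30])    # per-study effect sizes
std_errs = np.array([0.15, 0.10, 0.20, 0.08, 0.12])   # per-study standard errors

weights = 1.0 / std_errs**2                            # precision (inverse-variance) weights
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect = {pooled:.2f} +/- {1.96 * pooled_se:.2f} (95% CI)")
```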

Example of a meta-analysis. COURTESY: [13].

The Problem with Deep Analysis
We can go even further down the rabbit hole of analysis, for better or for worse. However, this often leads to problems of interpretation, as deep analyses are essentially layered abstractions. In other words, they are higher-level abstractions dependent upon lower-level abstractions. This leads us to a representation of representations, which will be covered in an upcoming post. Here, I will propose and briefly explore two phenomena: significant pattern extraction and significant reconstructive mimesis.

One form of deep analysis involves significant pattern extraction. While the academic field of pattern recognition has made great strides [14], sometimes the collection of data (which involves pre-processing and personal bias) is flawed. Other times, it is the subjective interpretation of these data that is flawed. In either case, the result is the extraction of patterns that make no sense but are then assigned significance. Worse yet, some of these patterns are also thought to be of great symbolic significance [15]. The Bible Code is one example of such pseudo-analysis. Patterns (in this case secret codes) are extracted from a database (a book), and then these data are probed for novel but coincidental pattern formation (codes formed by the first letter of every line of text). As this is usually interpreted as decryption (or deconvolution) of an intentionally placed message, significant pattern extraction is related to the deep, naive theories discussed in "Structure and Theory of Theories".

Congratulations! Your pattern recognition algorithm came up with a match. Although if it were a computer instead of a mind, it might do a more systematic job of rejecting it as a false positive. LESSON: the confirmatory criteria for a significant result needs to be rigorous.

But suppose that our conclusions are not guided by unconscious personal biases or ignorance. We might intentionally leverage biases in the service of parsimony (or making things simpler). Sometimes, the shortcuts we take in representing natural processes present difficulties in understanding what is really going on. This is a problem of significant reconstructive mimesis. In the case of molecular animations, this has been pointed out by Carl Zimmer [16] and PZ Myers [17]. In most molecular animations, processes occur smoothly (without error) and within full view of the human observer. Contrast this with the inherent noisiness and spatially-crowded environment of the cell, which is highly realistic but not very understandable. In such cases, we construct a model which consists of data, but that model is selective and the data are deliberately sparse (in this case smoothed). This is an example of a representation (the model) that informs an additional representation (the data). For purposes of simplicity, the model and data are somehow compressed to preserve signal and remove noise. And in the case of a digital image file (e.g. .jpg, .gif) such schemes work pretty well. But in other cases, the data are not well-known, and significant distortions are actually intentional. This is where big challenges arise in getting things right.
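The compression point can be made concrete with a one-step example: a moving average keeps the slow "signal" and discards most of the fast "noise", which is exactly the kind of deliberate sparsification described above. The signal and noise here are synthetic.

```python
# Deliberate sparsification in miniature: a moving average keeps the slow "signal"
# and discards most of the fast "noise". Signal and noise here are synthetic; real
# models make much stronger (and riskier) choices about what counts as noise.
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0, 4 * np.pi, 1000)
signal = np.sin(t)
noisy = signal + rng.normal(0, 0.5, t.size)

window = 25
smoothed = np.convolve(noisy, np.ones(window) / window, mode="same")

print(f"RMS error before smoothing: {np.sqrt(np.mean((noisy - signal) ** 2)):.2f}")
print(f"RMS error after smoothing:  {np.sqrt(np.mean((smoothed - signal) ** 2)):.2f}")
```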

A multi-layered abstraction from a highly-complex multivariate dataset? Perhaps. COURTESY: Salvador Dali, Three Sphinxes of Bikini.

Conclusions
Data analysis is hard. But in the world of everyday science, we often forget how complex and difficult this endeavor is. Modern software packages have made the basic and well-established analysis techniques deceptively simple to employ. In moving to big data and multivariate datasets, however, we begin to face head-on the challenges of analysis. In some cases, highly effective techniques have simply not been developed yet. This will require creativity and empirical investigation, things we do not often associate with statistical analysis. It will also require a role for theory, and perhaps even the theory of theories.

As we can see from our last few examples, advanced data analysis can require conceptual modeling (or representations). And sometimes, we need to map between domains (from models to other, higher-order models) to make sense of a dataset. This, the most complex of analyses, can be considered representations of representations. Whether a particular representation of a representation is useful or not depends upon how much noiseless information can be extracted from the available data. Particularly robust high-level models can take very little data and provide us with a very reliable result. But this is an ideal situation, and often even the best models, presented with large amounts of data, can fail to give a reasonable answer. Representations of representations also provide us with the opportunity to imbue an analysis with deep meaning. In a subsequent post, I will flesh this out in more detail. For now, I leave you with this quote:
“An unsophisticated forecaster uses statistics as a drunken man uses lampposts — for support rather than for illumination.” Andrew Lang.

NOTES:
[1] Learn Statistics with Comic Books. CTRL Lab Notebook, April 14 (2011).

[2] Mooney, C.   The Science of Why We Don't Believe Science. Mother Jones, May/June (2011).

[3] Kosko, B.   Statistical Independence: What Scientific Idea Is Ready For Retirement. Edge Annual Question (2014).

[4] In order to separate signal from noise, we must first define noise. Noise is consistent with processes that occur at random, such as the null hypothesis or a coin flip. Using this framework, a significant result (or signal) is a result that deviates from random chance to some degree. For example, a p-value of 0.05 means that, if chance alone were at work, a result at least as extreme as the one observed would be expected only 5% of the time. This is, of course, an incomplete account of the relationship between signal and noise. Models such as Signal Detection Theory (SDT) or data smoothing techniques can also be used to improve the signal-to-noise ratio.

[5] Nuzzo, R.   Scientific Method: Statistical Errors. Nature News and Comment, February 12 (2014).

[6] Fox, J.   Frequentist vs. Bayesian Statistics: resources to help you choose. Oikos blog, October 11 (2011).

[7] Gelman, A.   So-called Bayesian hypothesis testing is just as bad as regular hypothesis testing. Statistical Modeling, Causal Inference, and Social Science blog, April 2 (2011).

[8] For some concrete (and satirical) examples of how correlation does not equal causation, please see Tyler Vigen's Spurious Correlations blog.

[9] Voytek, B.   Big Data: what's it good for? Oscillatory Thoughts blog, January 30 (2014).

[10] Beckham, J.   Analytics Reveal 13 New Basketball Positions. Wired, April 30 (2012).

[11] Davis, E. and Marcus, G.   Eight (No, Nine!) Problems with Big Data. NYTimes Opinion, April 6 (2014).

[12] Leek, J.   Why big data is in trouble - they forgot applied statistics. Simply Statistics blog, May 7 (2014).

[13] Egger, M.   Bias in meta-analysis detected by a simple, graphical test. BMJ, 315 (1997).

[14] Jain, A.K., Duin, R.P.W., and Mao, J.   Statistical Pattern Recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4-36 (2000).

It is interesting to note that the practice of statistical pattern recognition (training a statistical model with data to evaluate additional instances of data) has developed techniques and theories related to rigorously rejecting false positives and other spurious results.

[15] McCardle, G.     Pareidolia, or Why is Jesus on my Toast? Skeptoid blog, June 6 (2011).

[16] Zimmer, C.     Watch Proteins Do the Jitterbug. NYTimes, April 10 (2014).

[17] Myers, P.Z.   Molecular Machines! Pharyngula blog, September 3 (2006).

May 9, 2014

From Worst to Most Variable? Technical paper now available

A while back, I wrote a blog post entitled "Are the Worst Performers the Best Predictors?". It was a data analysis intended to draw out distinctions between informed predictions and actual performance. It was also inspired by the observation that the most probable champion often does not actually win the championship. The analysis featured PredictWise predictions and performance data from the 2013 MLB and NFL seasons.


This paper was an emergent effort, but I believe that I have a good idea of what is going on. The technical paper ("From Worst to Most Variable? Only the worst performers may be the most informative") is now public [1]. For the technical paper, I added an additional analysis of 2014 NCAA Tournament data, and synthesized the resulting data using a hybrid phenomenological-predictive model. Using this somewhat unconventional approach, I was able to show that only the most consistently bad (and sometimes the consistently best) teams actually conform to expectation.


[1] Alicea, B.   From Worst to Most Variable? Only the worst performers may be the most informative. Figshare, (2014). Datasets and analyses are also available as Supplemental Information.

May 5, 2014

The Continuing Adventures of Starstuff

Here are the latest supplemental readings for the Cosmos reboot (as usual, cross-posted to Tumbld Thoughts). This post covers episodes 8 ("Sisters of the Sun", I) and 9 ("The Lost Worlds of Planet Earth", II). 

I. The Incredible Shape of Sunspots 


Here are the supplemental readings for the eighth installment of the Cosmos reboot ("Sisters of the Sun"). Readings are loosely organized by topic.


Observing the Night Sky:
Plait, P.   Black Skies, Smiling At Me. Bad Astronomy blog, April 23 (2014).

Klinkenborg, V.   Light Pollution. National Geographic, November (2008).

How to see the Big Dipper, and the famous stars Mizar and Alcor. EarthSky blog, March 24 (2013).

Emergence of a scientific field, in ways that were ahead of its time:
International Year of Astronomy - Annie Jump Cannon. The Museum of Flight.

Stellar Classification System. HyperPhysics.

Payne, C.   Stellar Atmospherics. Publications of the Astronomical Society of the Pacific, 38(221), 33. SAO/NASA Astrophysics Data System (ADS).

Biba, E.   Why the government should fund unpopular science. Popular Science, October 4 (2013).


Incredible Dynamics, Incredible Universe:
Mann, A.   Tatooine Times Two: Amateur Astronomers Find Planet in Four-Star System. Wired Science, October 15 (2012).

Planets with Two Suns Could Grow Black Trees. Space.com, April 18 (2011).

VLT Captures Stunning Stellar Explosion In 3D. RedOrbit, August 4 (2010).

The Great Eruption of Eta Carinae -- One of the Most Massive Stars in the Milky Way. Daily Galaxy blog, February 15 (2012).


But wait, there's more!
Steffens, M.   Australia's First Astronomers. Astronomy Basics, ABC Science.

Fuller, R.   The Kamilaroi and Euahalayi Emu in the Sky. Australian Indigenous Astronomy blog, March 31 (2014).

Ecology/Energy in Ecosystems. Chapter 14, Ecology. Wikibooks, October 22 (2013).



II. The More Things Change, the More They Are Subject to Uniformitarianism



Here are the supplemental readings for the ninth installment of the Cosmos reboot ("The Lost Worlds of Planet Earth"). A highly-selective (but still excellent) guided tour of the history of life on Earth. Readings are organized by theme.


Hundreds of Billions of _________ (your favorite noun here):
Carboniferous Period. National Geographic.

The Carboniferous. Palaeos.com.

Castro, J.   How do fossils form? LiveScience, June 26 (2013).

Permian-Triassic Extinction. PBS Evolution Library (2001).


The Great Dying. Science@NASA (2002).

Oskin, B.   Earth's Greatest Killer Finally Caught. LiveScience, December 12 (2013).


Plates, Oceans, and Deep Sea Surprises:



Oceans Atlas. HRW World Atlas (2006).

Atlantic Ocean Geophysical Map. National Geographic.


Pastore, R.   10 GIFs of Deep-Sea Creatures Encountering a Sub. Popular Science, May 2 (2014).

Villanueva, J.C.   Milankovitch Cycle. Universe Today, September 9 (2009).

