Archive

Archive for April 21, 2012

Get social and eco-friendly this Earth Day (infographic)

April 21, 2012

VentureBeat

via Get social and eco-friendly this Earth Day (infographic).

Earth Day is Sunday, April 22, and if you have no idea how to celebrate, we can help.

This weekend people will unplug their electronics, get out of their houses, and ride their bikes to a tree planting Earth Day celebration, hoping to make up for all those times they sent a water bottle to a landfill or poured paint down their drains. But if the thought of going outside and interacting with nature makes you shudder, you can do your eco-friendly part from the comfort of your couch.

Recyclebank, a company that rewards people for doing earth-friendly things like recycling and planting trees, put together an infographic of social actions you can take to be more eco-friendly.

If every one of Facebook’s 845 million users cut one minute off their shower time, the amount of water saved would fill 1.1 million Olympic-sized swimming pools. If every person on Twitter turned their computer off for one hour, the energy and carbon emissions savings would be equivalent to taking 9,128 cars off the road each year.
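
For the curious, the swimming-pool figure roughly checks out as a back-of-the-envelope calculation, assuming a shower uses about 2.5 gallons per minute, the one-minute cut is made every day for a year, and an Olympic-sized pool holds roughly 660,000 gallons (assumptions of mine, not numbers from the infographic):

```python
# Rough check of the shower claim. All constants below are assumptions,
# not figures taken from the Recyclebank infographic.
FACEBOOK_USERS = 845_000_000
GALLONS_PER_MINUTE = 2.5        # typical US shower-head flow rate
DAYS = 365                      # assume the habit lasts a full year
OLYMPIC_POOL_GALLONS = 660_000  # roughly 2,500 cubic meters

gallons_saved = FACEBOOK_USERS * GALLONS_PER_MINUTE * DAYS
pools = gallons_saved / OLYMPIC_POOL_GALLONS
print(f"{pools / 1e6:.1f} million pools")  # ~1.2 million, in line with the claimed 1.1 million
```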

Given Pinterest’s ever-growing popularity, if every person on Pinterest pinned a green tip to one of their boards, there would be 12 million green tips pinned every month. This tip may be the least energy-saving of the bunch, but spreading the green love around can’t hurt.

Whether you celebrate Earth Day indoors or out, you can reduce your carbon footprint by shutting down your computer, taking a shorter shower, and recycling those water bottles already.


Holding leaf and Earth image via Shutterstock

Categories: Environment

States of punishment

April 21, 2012

Graphic detail

via States of punishment.

Mapping the death penalty in America

FROM 2000 to 2011 there were, on average, five death-row exonerations a year in the United States, according to the Death Penalty Information Centre. North Carolina alone saw three exonerations in six months in 2008. The following year the state legislature passed the Racial Justice Act, which gives death-row inmates the chance to have their sentences commuted to life without parole if a judge rules the sentences were tainted by racial bias. (More than half of North Carolina’s death-row inmates are black.) The first ruling will be issued on April 20th, a decision that could set a precedent for other challenges based on race. Indeed, the ripples could be felt across the country, especially in Pennsylvania and Missouri, where similar legislation is pending. Other states are reconsidering capital punishment altogether. In November voters in California, which has more people on death row than any other state, will decide whether to repeal the death penalty. And Connecticut, where a repeal bill was recently passed, is set to become the 17th state to abolish capital punishment.

Categories: Law, US Observation

Integrating Genomes

April 21, 2012
D. R. Zerbino (1), B. Paten (1), D. Haussler (1,2,*)

Author Affiliations

1. Center for Biomolecular Sciences and Engineering, University of California, Santa Cruz, CA 95064, USA.
2. Howard Hughes Medical Institute, University of California, Santa Cruz, CA 95064, USA.

* To whom correspondence should be addressed. E-mail: haussler@soe.ucsc.edu

ABSTRACT

As genomic sequencing projects attempt ever more ambitious integration of genetic, molecular, and phenotypic information, a specialization of genomics has emerged, embodied in the subdiscipline of computational genomics. Models inherited from population genetics, phylogenetics, and human disease genetics merge with those from graph theory, statistics, signal processing, and computer science to provide a rich quantitative foundation for genomics that can only be realized with the aid of a computer. Unleashed on a rapidly increasing sample of the planet’s 10^30 organisms, these analyses will have an impact on diverse fields of science while providing an extraordinary new window into the story of life.

Since the first genome sequences were obtained in the mid-1970s (1, 2), computers have been necessary for processing (3) and archiving (2, 4) sequence data. However, the discipline of computational genomics traces its roots to 1980, when Smith and Waterman developed an algorithm to rapidly find the optimal comparison (alignment) of two sequences of length n among the more than 3^n possibilities (2, 5), and Stormo et al. built a linear threshold function to search a library of 78,000 nucleotides of Escherichia coli messenger RNA sequence for ribosome binding sites (6). What seemed like large data sets for biology then don’t seem so today, as high-throughput, short-read sequencing machines churn out terabytes of data (2, 7). We have seen a 10,000-fold sequencing performance improvement in the past 8 years, far outpacing the estimated 16-fold improvement in computational power under Moore’s law (8). Using genomics data to model genome evolution, mechanism, and function is now the heart of a lively field.
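
The dynamic-programming idea behind Smith and Waterman’s method is compact enough to sketch. The toy implementation below computes a best local-alignment score in time proportional to the product of the sequence lengths, rather than enumerating the exponentially many possible alignments; the match, mismatch, and gap scores are arbitrary illustrative values, not the original 1981 parameterization.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score of sequences a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # H[i][j]: best alignment ending at a[:i], b[:j]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))  # best local score under these toy parameters
```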

Every genome is the result of a mostly shared, but partly unique, 3.8-billion-year evolutionary journey from the origin of life. Diversity is created mostly by copy errors during replication. These create single-base changes, which are known as substitutions if spread to the whole population (fixed) or single-nucleotide polymorphisms (SNPs) if not uniformly present in the population (segregating). Replication errors also create insertions and deletions (collectively, indels), as well as tandem duplications where a short sequence is repeated sequentially. Chromosomes often exchange long similar segments through the process of homologous recombination. Specific sequences of DNA, known as transposable elements, have the capacity to replicate themselves within the cell, using machinery analogous to that found in certain viruses, leaving many copies (9). Rearrangements lead to patterns such as inversions, segmental deletions and duplications (causing copy number variants), chromosome fusion and fission, and translocations between chromosomes (10). At the largest scale, occasionally the whole genome is duplicated, greatly increasing its gene content (11). The present diversity of life was created gradually through these edits and is manifest in the germline genotype of each living individual. Starting from the germline genotype, the genomes of the somatic cells continue to experience similar edits during the lifetime of every individual, some undergoing a kind of evolutionary process called somatic selection that plays a role in cancer and immunity (2, 12).

Genomes are the core of the molecular mechanisms of cells, and of the physical properties (or phenotype) of organisms. They contain recipes for the active molecules of the cell, proteins, and their messenger RNAs, as well as other functional RNAs. Sequencing technology is used to determine RNA abundance (13), subcellular location (14), splicing isoforms (15), secondary structure (16), and rates of engagement with molecular complexes such as the ribosome (17). It is used to assay the epigenetic mechanisms that regulate RNA and protein production and function, including methylation (2, 18), histone modifications (19), transcription factor binding (20), chromatin accessibility (21), and chromatin three-dimensional interactions (2, 22). When applied to these data, computational genomics builds models of epigenetic mechanisms and gene regulatory networks (2, 23), articulating with the broader models of molecular systems biology such as protein signaling cascades, metabolic pathways, and regulatory network motifs (24).

Combining evolutionary, mechanistic, and functional models, computational genomics interprets genomic data along three dimensions. A gene is simultaneously a DNA sequence evolving in time (history), a piece of chromatin that interacts with other molecules (mechanism), and, as a gene product, an actor in pathways of activity within the cell that affect the organism (function). Molecular phenotypes from epigenetic state and RNA expression levels are the first stations on the road from genotype to organismal phenotype, where evolutionary selection acts. Beyond the basics of storing, indexing, and searching the world’s genomes, the three fundamental, interrelated challenges of computational genomics are to explain genome evolution, model molecular phenotypes as a consequence of genotype, and predict organismal phenotype.

Obtaining Genomic Sequences

Current methods in genome analysis start with genome assembly (2), the process of reconstructing an entire genome from relatively short random DNA fragments, called reads. Given sufficient read redundancy, or coverage depth, it is possible to detect read overlaps and thereby progressively reconstitute most of the genome sequence (2). However, this ideal scenario is complicated by the fact that genomes commonly contain large redundant regions (repeats), or regions where the statistical distribution of bases is significantly biased (low-complexity DNA), leading to coincidental, spurious read overlaps. These create complex networks of read-to-read overlaps that do not all reflect actual overlaps in the genome. The most persistent difficulty of assembly is to determine which overlaps are legitimate and which are spurious. This problem is NP-hard, which means that it is at least as hard as any problem in the class of problems that can be solved in nondeterministic polynomial time (2). Therefore, we expect that the only efficient solutions will be heuristic methods that are not guaranteed to find the optimal solution. For this reason, difficult regions of genomes are left as undetermined gaps (2), prone to errors (2), or costly to finish (2). Newer sequencing technologies, producing longer reads (2), may alleviate this problem.
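
As a rough illustration of the overlap step, the sketch below finds suffix-prefix overlaps between error-free reads by brute force. Real assemblers index k-mers instead of comparing every pair and apply graph-cleaning heuristics well beyond this, yet still face the repeat-induced ambiguity described above.

```python
def find_overlaps(reads, min_overlap=20):
    """Toy all-vs-all detection of suffix/prefix overlaps between error-free reads.
    Returns (i, j, length) meaning the last `length` bases of reads[i]
    equal the first `length` bases of reads[j]."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            for k in range(min(len(a), len(b)), min_overlap - 1, -1):
                if a.endswith(b[:k]):
                    edges.append((i, j, k))
                    break            # keep only the longest overlap per pair
    return edges
```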

After the first complete genome from a species is assembled (the reference genome), new genomes from that species or closely related species are generally not assembled de novo but are assembled using the reference genome as a template, exploiting similarities derived from the common evolutionary ancestor. Reads from the new genome are mapped (aligned) onto the reference genome (2), and systematic discordances are detected (2). This process may be used simply to enumerate the variants present in the new genome, or to guide the complete assembly of the new genome (called reference-based assembly) (2). However, with short reads, mapping algorithms may also be confounded by repeats (2), and both mappers and reference-based assemblers tend to bias toward the reference genome, occasionally treating genuine variants as noise (2). As technology improves toward longer reads, we expect improvements here too.
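
A minimal sketch of the mapping idea follows, assuming error-free exact seeds and substitutions only; real mappers handle sequencing errors, indels, base qualities, and the repeat and reference-bias problems noted above.

```python
from collections import defaultdict

def build_seed_index(reference, k=11):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k=11, max_mismatches=2):
    """Place a read by its first k-mer seed, then report mismatching positions
    as candidate substitutions relative to the reference."""
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) != len(read):
            continue
        mismatches = [(pos + i, window[i], read[i])
                      for i in range(len(read)) if window[i] != read[i]]
        if len(mismatches) <= max_mismatches:   # crude confidence filter
            return pos, mismatches
    return None, []
```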

Modeling the Evolution of Genotype

Genomes are compared by alignment: The bases of DNA are partitioned into sets (columns) that are putatively derived from a suitably recent common ancestor. From this, we can analyze what was conserved or changed during the evolution of the genomes from their common ancestor. At a large scale, alignments can indicate changes in segment order and copy number, and at a small scale they can indicate specific base substitutions (see Fig. 1).

Fig. 1

Assembly and alignment. (A) Assembly of a number of reads, grouped by pairwise sequence overlap. Because the genomic sequence contains a repeated sequence, the reads coming from the two copies of the repeat overlap and must be separated by the assembly software to produce a linear assembly. (B) Alignment of five sequences and an outgroup. Each row is a sequence; each column is a set of bases that descend from a common ancestral base. Six columns are highlighted. Column 4 contains a base that is fixed among the five sequences, whereas the other columns contain segregating SNPs. The trees on the sides represent two alternative ways of representing the phylogeny between the sequences. The left tree is optimal in terms of substitution complexity for columns 1, 2, and 3; the right tree is optimal for columns 4, 5, and 6. Given the difference between the two trees, a recombination event may have occurred between columns 3 and 5.

As in genome assembly, the primary challenge is to distinguish spurious sequence similarities from those due to common ancestry. Regions of genomes that are subject to purifying selection in which similarity of sequence is conserved, such as orthologous protein-coding regions, can often be reliably aligned across great evolutionary distances, such as between vertebrates and invertebrates. Regions that are neutrally drifting (i.e., not under positive or negative selection) diverge much more quickly, and can be reliably aligned only if they diverged recently (e.g., within the past 100 million years for two vertebrate genomes) (2). It is therefore common to distinguish alignments of subregions (local alignments) (2) from alignments of complete sequences (global alignments) (2) or even complete genomes (genome alignments) (2). Local alignments are typically used between conserved functional regions of more distantly related genomes (2). Conversely, full genome alignments become practical when comparing genomes from closely related species.

When applied to more than two species or to multiple gene copies within a species, phylogenetic methods provide an explicit order of gene descent through shared ancestry. When the model of evolution is restricted to consider only indels and substitutions (the most common events), the phylogeny is represented by a single tree in which the terminal (leaf) nodes represent the observed (present-day) sequences, the branches represent direct lines of descent, and the internal nodes represent the putative ancestral sequences (2). Finding the optimal phylogeny under probabilistic or parsimony models of substitutions (and also of indels) is NP-hard (2), and considerable effort has been devoted to obtaining efficient and accurate heuristic solutions.
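
Although finding the best tree is NP-hard, scoring a given tree is easy. The sketch below applies Fitch’s parsimony algorithm to a single alignment column on a fixed binary tree; the species names and bases are invented for illustration.

```python
def fitch_score(tree, column):
    """Minimum number of substitutions explaining one alignment column on a
    fixed binary tree. `tree` is a nested tuple of leaf names; `column` maps
    each leaf name to its observed base."""
    def walk(node):
        if isinstance(node, str):                       # leaf: its observed base
            return {column[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = walk(node[0]), walk(node[1])
        shared = left_set & right_set
        if shared:                                      # children can agree: no new change
            return shared, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return walk(tree)[1]

column = {"human": "A", "chimp": "A", "mouse": "G", "rat": "G"}
print(fitch_score((("human", "chimp"), ("mouse", "rat")), column))  # 1 substitution
```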

Phylogenetic analysis is complicated by homologous recombination, which creates DNA molecules whose parts have different evolutionary histories (Fig. 2). The coalescent model with recombination (2) models the evolutionary history of a gene with both substitutions and homologous recombination. Individual histories of parts are represented in an alignment with a separate phylogenetic (coalescent) tree for each base (2).

Evolutionary relationships between DNA sequences may also include balanced structural rearrangements that change the order and the orientation of the bases in the genome, as well as segmental duplications, gains, and losses that alter the number of copies of homologous bases (2). Unfortunately, these processes are usually modeled and treated separately from one another, and separately from substitutions and short indels. The construction of a mathematically and algorithmically tractable unified theory of genome evolution, in which stochastic processes jointly describe base substitution, recombination, rearrangement, and the various forms of duplication, gain, and loss, remains a major challenge for the field (2). With incomplete knowledge of the mathematical difficulties inherent in such a model, it is hard to predict when, if ever, such a model will be forthcoming. The only thing we are assured of is that projects such as the 1000 Genomes Project (2) will be producing massive amounts of data from which to build and test (via likelihood methods) a variety of approximate models.

Fig. 2

The dynamic processes that affect and are affected by the genome. Top: The genome changes as it is modified by random mutations. At the larger scale, homologous recombination events swap equivalent pieces of DNA, rearrangements reconnect different regions of DNA, and transposable elements can self-reproduce. At the finer scale, small modifications such as substitutions and insertion/deletion events occur. Bottom: The genome affects the molecular processes in the cell, namely the transcription of genes and functional RNA, which through pathways affect the phenotype of the organism by causing phenotypes such as disease and other specific traits. Through natural selection, the phenotypes condition the selective pressure on the genome favoring or disfavoring specific mutations.

From Genotype to Phenotype

Geneticists have correlated genomic mutations to phenotypic differences for many years, but today they do so at an unprecedented scale. Sequencing surveys across vertebrates [Genome 10K (2)], insects [i5K (2)], plants (2), microorganisms (2), cell lineages (2), and “metagenomes” (obtained by sequencing DNA from environmental samples containing an unknown collection of organisms) (2) present us with tens of thousands of genomes and challenge us to rework and deepen our methods. To date, such studies have given us concrete examples of the unfolding history and diversity of life, explored the ties between the body’s microbial populations and our health, and investigated the response of species to current environmental changes such as climate shift, disease, and competitors (2). Future studies could be coupled with experimental data derived from an expansion of cell culture resources for diverse species and tissues (2) and newer single-cell assay methodologies (2), allowing deeper comparisons.

When studying the population genetics of a single species, the recombination rate determines how likely it is that proximal sequence variants share the same coalescent tree (2). Lack of recombination leads to linkage disequilibrium, in which nearby segregating variants are correlated. This phenomenon is exploited in correlating specific segregating variants with phenotypic traits or diseases—for example, in genome-wide association studies conducted with microarrays or incomplete sequencing data (2). However, this same phenomenon limits the resolution of these approaches in finding the actual causal variant. Genome-wide association studies are also blind to the patterns of allele segregation in close relatives. Future genotype-phenotype studies using complete genomes will increasingly use genotypic context in related as well as unrelated cases and controls, combined with better prediction of the possible effects of genome variants, to identify causal variants (2).
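
A minimal sketch of the linkage-disequilibrium statistic r² between two biallelic sites, computed from phased 0/1 haplotypes (toy data; real analyses must first phase genotypes and handle missing data):

```python
def r_squared(hap_a, hap_b):
    """Linkage-disequilibrium r^2 between two biallelic sites, given phased
    0/1 haplotypes (one entry per chromosome at each site)."""
    n = len(hap_a)
    p_a = sum(hap_a) / n                  # allele frequency at the first site
    p_b = sum(hap_b) / n                  # allele frequency at the second site
    p_ab = sum(x * y for x, y in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b                  # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Two sites in perfect LD across six chromosomes:
print(r_squared([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 1, 0]))  # 1.0
```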

Large projects such as ENCODE (2), modENCODE (2), and the Epigenomics Roadmap (2) are providing data on the epigenome and the transcriptional machinery needed to construct models of molecular phenotypes involving epigenetic state, RNA expression, and (inferred) protein levels, requiring specialized analysis tools. Genome browsers such as Ensembl (2) and the UCSC Genome Browser (2) provide an integrated view of these data, along with background knowledge and various modeling results. Because many key elements of epigenetics, RNA expression, and protein production cannot be directly measured and therefore must be inferred, the mathematical models of these processes contain numerous latent (hidden) variables, often one for every site in the genome. Approaches include hidden Markov models (2), factor graphs (2), Bayesian networks (2), and Markov random fields (2). Model inference (parameter estimation) and model application (computation of conditional and marginal probabilities) are large-scale computational tasks.
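
As a small illustration of the latent-variable machinery, the sketch below runs the forward algorithm of a two-state hidden Markov model over a toy signal track; the states, emissions, and probabilities are invented for illustration rather than taken from any of the projects above.

```python
def forward(observations, states, start, trans, emit):
    """Total probability of an observation sequence under an HMM, summing over
    all paths through the hidden states (the latent variables)."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

states = ("active", "repressed")                      # hypothetical chromatin states
start = {"active": 0.5, "repressed": 0.5}
trans = {"active": {"active": 0.9, "repressed": 0.1},
         "repressed": {"active": 0.1, "repressed": 0.9}}
emit = {"active": {"high": 0.8, "low": 0.2},          # e.g. a binned ChIP-seq signal
        "repressed": {"high": 0.2, "low": 0.8}}
print(forward(("high", "high", "low"), states, start, trans, emit))
```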

Genotype determines phenotype via epigenetic, transcriptional, and proteomic state. Classification and regression methods that are used to predict phenotype from genotype can take advantage of estimates of these intermediate states as additional or alternate inputs (2). These methods include general linear models, neural networks (2), and support vector machines (2), preferred in part because of their ability to cope with very high-dimensional input feature spaces (i.e., with many measured variables). There is currently more to be gained in predicting phenotype by incorporating biological knowledge to improve the input feature space—for example, by substituting inferred transcript levels or inferred protein activity levels for raw gene expression measurements (2)—than by using yet more sophisticated techniques of classification and regression.
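
A minimal sketch of the prediction setup, using a generalized linear model (logistic regression from scikit-learn) on synthetic genotypes; the data, effect sizes, and library choice are assumptions made for illustration, not drawn from the studies cited above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy data: 200 individuals x 50 variant sites, genotypes coded 0/1/2.
genotypes = rng.integers(0, 3, size=(200, 50))
# Synthetic phenotype driven by two "causal" sites plus noise.
risk = 1.2 * genotypes[:, 3] + 0.8 * genotypes[:, 17] + rng.normal(0, 1, 200)
phenotype = (risk > risk.mean()).astype(int)

model = LogisticRegression(max_iter=1000).fit(genotypes, phenotype)
# The largest coefficients should point back at the simulated causal sites.
top_sites = np.argsort(np.abs(model.coef_[0]))[-2:]
print("top-ranked sites:", sorted(int(i) for i in top_sites))  # typically [3, 17] here
```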

Looking Ahead to Applications

Understanding the shared evolutionary history of life starts by storing, indexing, and comparing genomes. It requires tools to rapidly produce evolutionarily related segments of DNA according to a model of genome evolution when prompted with a query segment. How will this be accomplished as we collectively grow from petabytes (1015 bytes) of genome data today to exabytes (1018 bytes) tomorrow? One possibility may be to use differential compression based on the inferred evolutionary trajectory of genomes, where each sequence is represented as a set of differences from its inferred parent (2). This may allow us to create a new web of genetic information that is compact, rapidly searchable, and directly reflects the natural origin of genomic relatedness.
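
A toy sketch of the differential idea, storing a genome as single-base differences from its inferred parent (real schemes must also encode indels, rearrangements, and the inferred tree itself):

```python
def delta_encode(parent, child):
    """Represent `child` as the list of single-base differences from `parent`
    (substitutions only, for brevity)."""
    assert len(parent) == len(child)
    return [(i, c) for i, (p, c) in enumerate(zip(parent, child)) if p != c]

def delta_decode(parent, diffs):
    """Reconstruct the child sequence from the parent plus the stored differences."""
    seq = list(parent)
    for i, base in diffs:
        seq[i] = base
    return "".join(seq)

parent = "ACGTACGTACGT"
child  = "ACGTTCGTACGA"
diffs = delta_encode(parent, child)
print(diffs)                                  # [(4, 'T'), (11, 'A')]
assert delta_decode(parent, diffs) == child
```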

Genomics has had a profound effect on medicine and will continue to do so. Cancer therapeutics are expected to advance as a result, because genomic modifications are the source of nearly all cancers (2). Within the body’s somatic cells, genomic changes occur at random, from environmental impacts, or as a result of treatment; subpopulations of genetically distinct cancer cells expand and compete (2). Sequencing a sample of a cancer patient’s noncancerous tissue reveals the patient’s genome at birth (i.e., germline genome). Comparing this to the genome obtained from a tumor biopsy then reveals the mutations that have occurred subsequently in the patient’s cancer cells. Tracking tumor genomes in this manner from early disease through each stage of treatment will become the norm and will inform therapeutic decisions (2). Changes that are readily detectable only through computational methods in genotype, epigenetic state, gene expression pattern, and activated pathway structure will provide crucial information on the state of the tumor during initial tumor growth and during the emergence of resistance to therapy (2). Recurring tumor-specific genomic variants and intermediate molecular phenotypes that drive cancer and determine patient response to therapy will come more clearly to light (2) and will be translated into better-targeted cancer diagnosis and treatment (2).
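
To a first approximation, that comparison reduces to a set difference between the germline and tumor variant calls, as in the toy sketch below; the coordinates are invented, and real pipelines must also model sequencing error, tumor purity, and subclonal structure.

```python
# Toy variant calls as (chromosome, position, alternate_base) tuples.
germline_variants = {("chr1", 10_500, "T"), ("chr2", 48_230, "A")}
tumor_variants    = {("chr1", 10_500, "T"), ("chr2", 48_230, "A"),
                     ("chr3", 91_004, "G")}      # one tumor-only call

# Somatic mutations are, roughly, the calls present in the tumor but not the germline.
somatic = tumor_variants - germline_variants
print(somatic)                                   # {('chr3', 91004, 'G')}
```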

Other fields of medicine will also benefit from computational methods and findings. For example, immune cells undergo specific mutations through rounds of somatic selection (2), accompanied by changes in epigenetic state, gene expression pattern, and activated pathway structure. Deep sequencing of T cell receptors and B cell antibodies (2), coupled with genome-wide measurements of genetic variation, epigenetic state, and gene expression pattern in immune cells, will be used to model immune cell function and correlate immune response with antigen. High-throughput genomics data will be used in vaccine design (including cancer immunotherapy) and the treatment of infectious diseases (2), autoimmune diseases, and compromised immune systems resulting from chemotherapy, transfusions, transplants, and stem cell therapies (2).

Genomic variants, epigenetic state, and expression pattern play key roles in stem cell therapies and basic science applications of stem cells that can only be discerned through the use of computational tools (2). Induced pluripotent stem (iPS) cells and lineage-specific directly reprogrammed cells are made from somatic cells (2) that have already incurred somatic mutations and are cultured in conditions that may select for further mutations (2). These mutations will soon be assessed with whole-genome analysis. Measurements of epigenetic modification and gene expression will confirm the pluripotent or lineage-specific status of the reprogrammed cells and verify that the epigenetic memory of the tissue from which they were derived is erased. Because every batch of reprogrammed cells will show some unexpected genetic mutations, epigenetic changes, and expression differences on a genome-wide level, some with consequences, the interpretation of these data will be of critical importance. In summary, the future of research into cancer, immunology, and stem cells involves all three key challenges of computational genomics: explaining (somatic) evolution, modeling molecular phenotype, and predicting organismal phenotype.

In addition to other medical applications, similar scenarios are playing out in applications of genomics in a wide range of fields, such as agriculture (2) and the study of human prehistory (2). The increasing availability of data is leading to the development of elaborate multidimensional analysis tools incorporating DNA sequences, alignments, phylogenetic trees, lists of variants, epigenomic and functional assays, phenotypic changes, etc. To face the challenges of obtaining the maximum information from every sequencing experiment, we must borrow advances from a spectrum of different research fields and tie them together into foundational mathematical models implemented with numerical methods. There is a tension between the comprehensiveness of models and their computational efficiency. As this plays out, a comprehensive but computable model of genome evolution and its functional repercussions on organisms is taking shape, embodied in computational genomics. Yet we still await a formulation that is both simple and expressive enough to compare models, store information, and communicate results in an exabyte age. As a common language develops, shaped by our increasing knowledge of biology, we anticipate that computational genomics will provide enhanced ability to explore and exploit the genome structures and processes that lie at the heart of life.

Supplementary Materials

www.sciencemag.org/cgi/content/full/336/6078/179/DC1

Table S1

References and Notes

1. W. Fiers et al., Complete nucleotide sequence of bacteriophage MS2 RNA: Primary and secondary structure of the replicase gene. Nature 260, 500 (1976).
2. For a full list of references by subject, see table S1 in the supplementary materials.
3. T. R. Gingeras, J. P. Milazzo, D. Sciaky, R. J. Roberts, Computer programs for the assembly of DNA sequences. Nucleic Acids Res. 7, 529 (1979).
4. G. H. Hamm, G. N. Cameron, The EMBL data library. Nucleic Acids Res. 14, 5 (1986).
5. T. F. Smith, M. S. Waterman, Identification of common molecular subsequences. J. Mol. Biol. 147, 195 (1981). 10.1016/0022-2836(81)90087
6. G. D. Stormo, T. D. Schneider, L. Gold, A. Ehrenfeucht, Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res. 10, 2997 (1982).
7. E. R. Mardis, The impact of next-generation sequencing technology on genetics. Trends Genet. 24, 133 (2008).
8. L. D. Stein, The case for cloud computing in genome informatics. Genome Biol. 11, 207 (2010). 10.1186/gb-2010-11-5
9. C. Feschotte, Transposable elements and the evolution of regulatory networks. Nat. Rev. Genet. 9, 397 (2008).
10. L. Feuk, A. R. Carson, S. W. Scherer, Structural variation in the human genome. Nat. Rev. Genet. 7, 85 (2006).
11. P. Dehal, J. L. Boore, Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 3, e314 (2005).
12. L. M. F. Merlo, J. W. Pepper, B. J. Reid, C. C. Maley, Cancer as an evolutionary and ecological process. Nat. Rev. Cancer 6, 924 (2006).
13. A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, B. Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621 (2008).
14. J. C. Simpson, R. Wellenreuther, A. Poustka, R. Pepperkok, S. Wiemann, Systematic subcellular localization of novel proteins identified by large-scale cDNA sequencing. EMBO Rep. 1, 287 (2000). 10.1093/embo
15. C. Trapnell et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511 (2010).
16. J. G. Underwood et al., FragSeq: Transcriptome-wide RNA structure probing using high-throughput sequencing. Nat. Methods 7, 995 (2010).
17. N. T. Ingolia, S. Ghaemmaghami, J. R. S. Newman, J. S. Weissman, Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218 (2009).
18. A. L. Brunner et al., Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver. Genome Res. 19, 1044 (2009).
19. T. S. Mikkelsen et al., Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448, 553 (2007).
20. G. Robertson et al., Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods 4, 651 (2007).
21. A. P. Boyle et al., High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311 (2008).
22. M. J. Fullwood et al., An oestrogen-receptor-α-bound human chromatin interactome. Nature 462, 58 (2009).
23. P. J. Mitchell, R. Tjian, Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science 245, 371 (1989).
24. U. Alon, Network motifs: Theory and experimental approaches. Nat. Rev. Genet. 8, 450 (2007).
25. Acknowledgments: We thank D. A. Earl for designing the figures, and E. Green, M. Häussler, J. Ma, D. Earl, H. Zerbino, R. Kuhn, G. Hickey, T. Pringle, K. Pollard, A. Krogh, R. Shamir, M. Waterman, and R. Durbin for their corrections and comments. Supported by the Howard Hughes Medical Institute (D.H.), National Human Genome Research Institute Data Analysis Center for the Encyclopedia of DNA Elements grant U01 (B.P.), and the American Association for Cancer Research (Stand Up To Cancer/An Integrated Approach to Targeting Breast Cancer Molecular Subtypes and Their Resistance Phenotypes) (D.R.Z.).

Beyond Turing’s Machines

April 21, 2012

by Andrew Hodges

E-mail: andrew.hodges@wadh.ox.ac.uk

In marking Alan Turing’s centenary, it’s worth asking what was his most fundamental achievement and what he left for future science to take up when he took his own life in 1954. His success in World War II, as the chief scientific figure in the British cryptographic effort, with hands-on responsibility for the Atlantic naval conflict, had a great and immediate impact. But in its ever-growing influence since that time, the principle of the universal machine, which Turing published in 1937 (1), beats even this.

When, in 1945, he used his wartime technological knowledge to design a first digital computer, it was to make a practical version of that universal machine (2). All computing has followed his lead. Defining a universal machine rests on one idea, essential to Turing’s mathematical proof in 1936, but quite counter-intuitive, and bearing no resemblance to the large practical calculators of the 1930s. It put logic, not arithmetic, in the driving seat. This central observation is that instructions are themselves a form of data. This vital idea was exploited by Turing immediately in his detailed plan of 1945. The computer he planned would allow instructions to operate on instructions to produce new instructions. The logic of software takes charge of computing. As Turing explained, all known processes could now be encoded, and all could be run on a single machine. The process of encodement could itself be automated and made user-friendly, using any logical language you liked. This approach went far beyond the vision of others at the time.
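
The force of "instructions are themselves a form of data" is easy to see in miniature: in the toy interpreter below, the program is an ordinary data structure, so a single machine can run any encoded program, and programs can be built and modified by other programs. (This is a modern illustration, not Turing’s own notation.)

```python
def run(program, x):
    """A toy 'universal machine': the program is just data (a list of
    instruction tuples), so the same interpreter can run any of them."""
    for op, arg in program:
        if op == "add":
            x += arg
        elif op == "mul":
            x *= arg
        elif op == "neg":
            x = -x
    return x

double_then_shift = [("mul", 2), ("add", 3)]         # a program, stored as data
print(run(double_then_shift, 10))                    # 23
# Because programs are data, one program can be extended to build another:
print(run(double_then_shift + [("neg", None)], 10))  # -23
```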

Even more fundamental than the universal machine is the concept of computability that Turing defined in 1936. His first step was to ask for a precise definition of “mechanical process,” and he proceeded by analyzing what it means for someone to follow a rule. In modern terms, “anything that can be done by a computer program” was his answer, with the invention of the computer as a by-product. This leaves open the question of whether such an analysis would include absolutely everything that a material system (including brains) might be able to achieve. Ever since 1936, this nagging question has been on the agenda, and it is still there.

In 1939, Turing suggested that mathematical steps that are not rule-following, and so not computable, could be identified with mental “intuition” (3). However, he gave no discussion of whether the human brain was actually embodying uncomputable physical processes. Wartime experience led Turing in a different direction. His brilliant codebreaking algorithms, outdoing human guessing, stimulated the conviction that all mental operations must be computable, including those functions of the mind not apparently following methodical rules. His 1950 paper (4), most famous for the wit of the Turing test of intelligence (5), also included a careful discussion of computability. It set out a basic argument that if the brain’s action is computable, then it can be implemented on a computer, this being a universal machine. In defending this point of view, Turing referred to what would now be called chaotic effects in the brain and argued that these did not prevent computer simulation. Notably, at this time Turing was also founding a new branch of mathematical biology: He was applying the insights of an applied mathematician who was also one of the first to use a computer for simulating physical systems.

In 1951, however, Turing gave a radio talk with a different take on this question, suggesting that the nature of quantum mechanics might make simulation of the physical brain impossible. This consideration can be traced back in Turing’s thought to 1932, when he first studied the axioms of quantum mechanics [see (6)]. Turing then took up renewed interest in quantum theory and noted a problem about the observation of quantum systems (now known as the quantum Zeno effect). With his death, this train of thought was lost, but the serious question of relating computation to fundamental physics has remained.

Since the 1980s, quantum computing has given a practical technological arena in which computation and quantum physics interact excitingly, but it has not yet changed Turing’s picture of what is computable. There are also many thought-experiment models that explore what it would mean to go beyond the limits of the computable. Some rather trivially require that machine components could operate with boundless speed or allow unlimited accuracy of measurement. Others probe more deeply into the nature of the physical world. Perhaps the best-known body of ideas is that of Roger Penrose (7). These draw strongly on the very thing that motivated Turing’s early work—the relationship of mental operations to the physical brain. They imply that uncomputable physics is actually fundamental to physical law and oblige a radical reformulation of quantum mechanics.


Superficially, any such theory contradicts the line that Turing put forward after 1945. But more deeply, anything that brings together the fundamentals of logical and physical description is part of Turing’s legacy. He was most unusual in disregarding lines between mathematics, physics, biology, technology, and philosophy. In 1945, it was of immediate practical concern to him that physical media could be found to embody the 0-or-1 logical states needed for the practical construction of a computer. But his work always pointed to the more abstract problem of how those discrete states are embodied in the continuous world. The problem remains: Does computation with discrete symbols give a complete account of the physical world? If it does, how can we make this connection manifest? If it does not, where does computation fail, and what would this tell us about fundamental science?

References

1. A. M. Turing, Proc. London Math. Soc. S2-42, 230 (1937).
2. M. Davis, The Universal Computer: The Road from Leibniz to Turing (Taylor & Francis, Boca Raton, FL, Turing Centenary edition, 2012).
3. A. M. Turing, Proc. London Math. Soc. S2-45, 161 (1939).
4. A. M. Turing, Mind 59, 433 (1950).
5. R. M. French, Science 336, 164 (2012).
6. A. Hodges, Alan Turing: The Enigma (Princeton Univ. Press, Princeton, NJ, Turing Centenary edition, 2012).
7. R. Penrose, The Emperor’s New Mind (Oxford Univ. Press, Oxford, 1989).

New Julia Language Seeks to Be the C for Scientists

April 21, 2012

ACM TechNews

via New Julia Language Seeks to Be the C for Scientists.

Julia is a new programming language being developed for building technical applications. In its early stages, Julia has been used in such applications as image analysis and linear algebra research. The language is the brainchild of developer Stefan Karpinski, along with Jeff Bezanson and Viral Shah. InfoWorld Editor at Large Paul Krill recently got the lowdown on Julia during an interview with Karpinski.

InfoWorld: When did you develop Julia, and what was the main intention in developing it?

Karpinski: We started talking in August 2009. Viral put Jeff and me in contact, and we started talking about our frustrations with technical computing languages. Primarily, one of them being that you just have to use so many different languages for different things because they’re all good at one thing but not good at everything. You end up having to switch languages a lot.


InfoWorld: When you say technical computing, to what type of applications are you specifically referring?

Karpinski: It’s a broad category, but it’s pretty much anything that involves a lot of number-crunching. In my own background, I’ve done a lot of linear algebra but a fair amount of statistics as well. The tool of choice for linear algebra tends to be Matlab. The tool of choice for statistics tends to be R, and I’ve used both of those a great deal. But they’re not really interchangeable. If you want to do statistics in Matlab, it’s frustrating. If you want to do linear algebra in R, it’s frustrating.

InfoWorld: So you developed Julia with the intent to make it easier to build technical applications?

Karpinski: Yes. The idea is that it should be extremely high productivity. To that end, it’s a dynamic language, so it’s relatively easy to program, and it’s got a very simple programming model. But it has extremely high performance, which cuts out [the need for] a third language [C], which is often [used] to get performance in any of these other languages. I should also mention NumPy, which is a contender for these areas. For Matlab, R, and NumPy, for all of these options, you need to at some point drop down into C to get performance. One of our goals explicitly is to have sufficiently good performance in Julia that you’d never have to drop down into C.

InfoWorld: The Julia Web page says Julia is designed for cloud computing and parallelism. What’s the intent there?

Karpinski: The idea is that a lot of number-crunching these days ends up being done in the cloud environment, where you can allocate resources on demand. Traditional parallel computing for technical problems often uses MPI [Message Passing Interface], a very, very established and popular way of doing large-scale parallel applications. But it doesn’t have any facility for elasticity. If you want to add processors midcomputation, that can’t really be done.

It also has limited functionality for recovery from faults, and that’s a huge problem in cloud computing because resources are not dedicated hardware, they’re not big supercomputing machines that have extremely high reliability. They’re virtual machine instances in a cloud somewhere, and it’s not uncommon for a resource to just disappear.

InfoWorld: How does Julia overcome these issues with cloud computing?

Karpinski: Well, these aren’t solved problems. These are things that we’re still working on, but we have a simpler model for building large parallel applications via a global distributed address space. This means that you can hold a reference to an object that lives on another machine participating in a computation. You can manipulate references easily and pass them around between machines, making it easy to keep track of what’s being computed where. Also, you can add machines in midcomputation. We haven’t yet built the facility for removing machines midcomputation because we thought this was less of a pressing issue. But if you realize partway through a computation that you want more machines because you want the computation to go faster, you can do that.

InfoWorld: Would you use Julia for big data applications, data analysis, that sort of application?

Karpinski: It depends on what you mean by big data. “Big data” is a very overloaded term at this point. But I think a lot of, for example, predictive analysis, modeling problems are often linear algebra problems, and graph analysis problems are also something that Julia would be good at tackling. Certain kinds of applications are probably better addressed by a traditional technology like Hadoop. If you have a lot of things to count, Hadoop is perfect. If you want to compute a large matrix factorization, Hadoop is absolutely atrocious. That’s the kind of place where Julia would be a stellar choice.

InfoWorld: You wouldn’t use Julia to develop, say, an accounting application or a mobile application, would you?

Karpinski: We actually have plans to develop a compiler. And once you have a compiler, you could write Julia code for embedded and mobile applications.

InfoWorld: When would that compiler be ready?

Karpinski: Probably sometime in the next year.

InfoWorld: The Web page says, “Julia’s LLVM-based just-in-time compiler combined with the language’s design allow it to approach and often match the performance of C, C++.” What was your intent in approaching or matching the performance of C, C++?

Karpinski: C, C++, and I should also add Fortran, because it’s another old compiled language and is known for its performance. Those are the gold standard, that’s how you measure performance. If you can match or beat C++, you’ve made it.

InfoWorld: What’s the exact status of Julia at this point?

Karpinski: It’s a pre-1.0. We’re shooting for 1.0 soon. [Version] 1.0 is a big commitment. To me, at least, and to the other core developers, it means that we’re committing to a stable API.

InfoWorld: And this is available under an MIT license, correct?

Karpinski: Yes. The core is available under MIT, which is very permissive.

InfoWold: Why the name, Julia?

Karpinski: That’s everybody’s favorite question. There’s no good reason, really. It just seemed like a pretty name.

This article, “New Julia language seeks to be the C for scientists,” was originally published at InfoWorld.com.

Paul Krill is an editor at large at InfoWorld, focusing on coverage of application development (desktop and mobile) and core Web technologies such as HTML5, Java, and Flash.

ACM Names 2012-13 Athena Lecturer

April 21, 2012

ACM TechNews

via ACM Names 2012-13 Athena Lecturer.


Nancy Lynch, MIT [image courtesy ACM].
The Association for Computing Machinery’s Council on Women in Computing (ACM-W) yesterday named MIT’s Nancy Lynch its 2012-13 Athena Lecturer, recognizing Lynch for her advances in distributed systems enabling dependable Internet and wireless network applications. The Athena Lecturer award, which comes with a $10,000 honorarium provided by Google, “celebrates women researchers who have made fundamental contributions to computer science.”

According to the press release:

“Lynch’s work has influenced both theoreticians and practitioners,” said Mary Jane Irwin, who heads the ACM-W Athena Lecturer award committee. “Her ability to formulate many of the core problems of the field in clear and precise ways has provided a foundation that allows computer system designers to find ways to work around the limitations she verified, and to solve problems with high probability” [more following the link].

In a career spanning more than 30 years, Lynch identified the boundaries between what is possible and provably impossible to solve in distributed settings. She developed new distributed algorithms, created precise models for analyzing distributed algorithms and systems, and discovered limitations on what distributed algorithms can accomplish.

Lynch’s breakthrough research with M.J. Fischer and M.S. Paterson produced the “FLP” result. It defined as a mathematical problem the challenge of establishing agreement in asynchronous distributed systems (i.e., those with no timing assumptions) in the presence of failures. This innovation had a major impact on the design of fault-tolerant distributed data-management systems and communication systems.

Lynch’s textbook, Distributed Algorithms, is the definitive reference on the basics of the field. It introduces readers to the fundamental issues underlying the design of distributed systems, including communication, coordination, synchronization, and uncertainty. It integrates the results of distributed algorithms research using a common mathematical framework.

The Athena Lecturer award will be presented at ACM’s Annual Awards Banquet in mid-June.

Lynch will deliver the Athena Lecture at the 2013 PODC (Principles of Distributed Computing) conference.

To learn more about this year’s Athena Lecturer and view past recipients, see the ACM’s press release.

(Contributed by Erwin Gianchandani, CCC Director)

Inside view on ads review

April 21, 2012

The Official Google Blog

via Inside view on ads review.

 

This is the first in a series of posts that will provide greater transparency about how we make our ads safer by detecting and removing scam ads. -Ed.

A few weeks ago, we posted here about our efforts in fighting bad ads, and we shared a video with the basics of how we do it. Today I wanted to delve a little deeper and give some insight into the systems we use to help prevent bad ads from showing. Our ads policies are designed with safety and trust in mind—we don’t allow ads for malicious downloads, counterfeit goods, or ads with unclear billing practices, to name a few examples. In order to help prevent these kinds of ads from showing, we use a combination of automated systems and human input to review the billions of ads submitted to Google each year. I’m one of many engineers whose job is to help make sure that Google doesn’t show bad ads to users.

We’ve designed our approach based on a three-pronged strategy, each focused on a different dimension of the problem: ads, sites, and advertiser accounts. These systems are complementary, sharing signals among each other so that we can comprehensively attack bad ads.

For example, in the case of a site that is selling counterfeit goods, this three-pronged approach aims to look for patterns that would flag such a site and help prevent ads from showing. Ad review notices patterns in the ads and keywords selected by the advertiser. Site review analyzes the entire site to determine if it is selling counterfeit goods. Account review aims to determine if a new advertiser is truly new, or is simply a repeat offender trying to abuse Google’s advertising system. Here’s more detail on how we review each of these three components.

Ad Review
An ad is the snippet of information presented to a user, along with a link to a specific webpage, or landing page. The ads review system inspects individual ads and landing pages, and is probably the system most familiar to advertisers. When an advertiser submits an ad, our system immediately performs a preliminary examination. If there’s nothing in the ad that flags a need for further review, we tell the advertiser the ad is “Eligible” and show the ad only on google.com to users who have SafeSearch turned off. If the ad is flagged for further review, in most cases we refer to the ad as “Under Review” and don’t show the ad at all. From there, the ad enters our automated pipeline, where we employ machine learning models, a rules engine and landing page analysis to perform a more extensive examination. If our automated system determines an outcome with a high degree of confidence, we will either approve the ad to run on Google and all of our partners (“Approved”), approve the ad to show for appropriate users in specific locations (“Approved – Limited”) or reject the ad (“Disapproved”). If our automated system isn’t able to determine the outcome, we send the ad to a real person to make a final decision.
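
The pattern described here, deciding automatically only when the system is confident and escalating to a person otherwise, can be sketched generically. The thresholds, labels, and scoring function below are invented for illustration and are not Google’s actual system.

```python
def review_ad(ad, score_model, approve_at=0.95, reject_at=0.05):
    """Generic confidence-threshold routing: decide automatically only when the
    model is confident, otherwise queue the ad for a human reviewer."""
    p_compliant = score_model(ad)      # estimated probability the ad is policy-compliant
    if p_compliant >= approve_at:
        return "Approved"
    if p_compliant <= reject_at:
        return "Disapproved"
    return "Under Review"              # send to a real person for a final decision

# Example with a stand-in scoring function:
print(review_ad({"text": "cheap watches!!!"}, lambda ad: 0.40))  # Under Review
```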

Site Review
A site, often known as a domain, has many different pages, each of which could be pointed to by different ads. Our site review system identifies policy issues which apply to the whole site. It aggregates sites across all ads from all advertisers and regularly crawls them, building a repository of information that’s constantly improving as new scams and new sites are examined. We store the content of advertised sites and use both machine learning models and a rules engine to analyze the sites. The magic of the site review system is that it understands the structure of language on webpages in order to classify the content of sites. Site review will determine whether or not an entire site should be disabled, which would prevent any ads leading to that site from showing, from any account. When the automated system isn’t able to determine the outcome with a high degree of confidence, we send it to a real person to make a decision. When a site is disabled, we tell the advertiser that it’s in violation of “Site Policy.”

Account Review
An account is one particular advertiser’s collection of ads, plus the advertiser’s selections for targeting and bidding on those ads. An account may have many ads which may point to several different sites, for example. The account review system constantly evaluates individual advertiser accounts to determine if the whole account should be inspected and shut down for policy violations. This system “listens” to a variety of signals, such as ads and keywords submitted by the advertiser, budget changes, the advertiser’s address and phone number, the advertiser’s IP address, disabled sites connected to this account, and disapproved ads. The system constantly re-evaluates all accounts, incorporating new data. For example, if an advertiser logs in from a new IP address, the account is re-evaluated to determine if that new signal suggests we should take a closer look at the content of the advertiser’s account. If the account review system determines that there is something suspect about a particular account with a high degree of confidence, it automatically suspends the account. If the system isn’t sure, it stops the account from showing any ads at all and asks a real person to decide if the account should be suspended.

Even with all these systems and people working to stop bad ads, there still can be times when an ad slips through that we don’t want. There are many malicious players who are very persistent—they seek to abuse Google’s advertising system in order to take advantage of our users. When we shut down a thousand accounts, they create two thousand more using different patterns. It’s a never-ending game of cat and mouse.

We’ve put a great deal of effort and expense into building these systems because Google’s long-term success is based on the trust of people who use our products. I’ve focused my time and energy in this area for many years. I find it inspiring to fight the good fight, to focus on the user, and do everything we can to help prevent bad ads from running. I’ll continue to post here from time to time with additional thoughts and greater information about how we make ads safer by detecting and removing scam ads.

Posted by David W. Baker, Director of Engineering, Advertising

Categories: Google, IT Observation