Technology in Government (TIG) | Topics in Privacy (TIP)

Technology in Government (TIG) and Topics in Privacy (TIP) consist of weekly discussions and brainstorming sessions on all aspects of privacy (TIP) and uses of technology to assess and solve societal, political, and government problems (TIG). Discussions are often inspired by real-world problems being faced by the lead discussant, who may be from industry, government, or academia. Practice talks and presentations on specific techniques and topics are also common.

Topics are usually not posted earlier than the week before.

Schedule Spring 2013

Date  Discussant  Topic
2/11  Michael Fertik
2/25  Seth Stephens-Davidowitz  Using Google Data to Predict What Political Surveys Miss and Who Will Vote
3/4  Merce Crosas and Latanya Sweeney  Enabling Research on Very Big Data
3/11  Michael Bar-Sinai  Creating a Platform for Making Frequently Updated, Very Big Data Accessible for Research
3/18  Sarah Ellis  Privacy and Public-Use Mortgage Data
3/25  Natasha Singer, The New York Times  You for Sale: The Business of Consumer Data
4/1  Jerry Reiter, Duke University  Protecting Confidentiality by Releasing Simulated Public Use Datasets
4/15  Deborah Hurley and Richard Carback III  Better Democracy Through Random Sampling
4/22  Robert Gellman  Best Privacy Anywhere
4/29  Emmanual Guillory  Life on the Hill
4/29  Christine Choirat  The Analytic Hierarchy Process and the Theory of Measurement
5/6  Adam Tanner  The Business of Personal Data
5/20  Personal Genome Project and Silent Spring  Debriefing the Re-identification of the Personal Genome Project

Abstracts of Talks and Discussions

  1. Michael Fertik

    Michael Fertik founded with the belief that people and businesses have the right to control and protect their online reputation and privacy. Credited with pioneering the field of online reputation management (ORM), Fertik is lauded as the world's leading cyberthinker in digital privacy and reputation. Michael was most recently named Entrepreneur of the Year by TechAmerica, an annual award given by the technology industry trade group to an individual they feel embodies the entrepreneurial spirit that made the U.S. technology sector a global leader.

    He is a member of the World Economic Forum Agenda Council on the Future of the Internet, a recipient of the World Economic Forum Technology Pioneer 2011 Award, and through his leadership, the Forum named a Global Growth Company in 2012. Fertik is an industry commentator with guest columns in Harvard Business Review, Reuters, and Newsweek. He frequently appears on national and international television and radio, including BBC, Good Morning America, Today Show, Dr. Phil, CBS Early Show, CNN, Fox, Bloomberg, and MSNBC. He is also co-author of the bestselling book "Wild West 2.0" and of "The Reputation Economy" (Crown, forthcoming in 2013).

    Fertik founded his first Internet company while at Harvard College. He received his JD from Harvard Law School.

  2. Using Google Data to Predict What Political Surveys Miss and Who Will Vote

    This is a two-part discussion. Part I: How can we know how much racial animus costs a black candidate if few will admit such socially unacceptable attitudes to surveys? I suggest a new source to proxy an area's prejudice: Google search queries. I compare the proxy -- the percent of an area's Google searches that include racially charged language -- to Barack Obama's 2008 and 2012 vote shares, controlling for the vote share of the 2004 Democratic presidential candidate, John Kerry. Previous research using a similar specification but survey proxies for racial attitudes yielded little evidence that racial attitudes affected Obama. An area's racially charged search rate, in contrast, is a robust negative predictor of Obama's vote share. Continuing racial animus in the United States appears to have cost Obama roughly four percentage points of the national popular vote in both 2008 and 2012, giving his opponent the equivalent of a home-state advantage nationally.

    Part II: Google searches prior to an election can be used to predict turnout in different parts of the United States. Change in October search volume for "vote/voting" over a four-year period explains 20-40 percent of state-level change in turnout rates. The predictive power is little affected by changes in registration rates or early votes over the same period. This information might prove useful in predicting candidate performance beyond what is contained in polls.
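
    The "explains 20-40 percent" figure reads as the R-squared of a regression of turnout change on search-volume change. The following is a minimal sketch of that calculation for a single predictor. The numbers are made up for illustration and are not the talk's data; the function and variable names are likewise hypothetical.

```python
def r_squared(x, y):
    """Share of the variance in y explained by a one-predictor linear fit on x.

    For simple linear regression this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical state-level changes over four years (illustrative values only).
search_change = [0.10, -0.05, 0.20, 0.00, 0.15, -0.10]   # change in October "vote/voting" search share
turnout_change = [0.02, 0.01, 0.03, -0.01, 0.00, -0.02]  # change in turnout rate

print(f"R^2 = {r_squared(search_change, turnout_change):.2f}")
```

    A fuller specification would add controls such as changes in registration rates or early voting, which the abstract says have little effect on the predictive power.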

    Seth Stephens-Davidowitz is an economics PhD student at Harvard University.

  3. Enabling Research on Very Big Data

    Society is capturing and constructing data on scales that could not possibly have been imagined previously, so much so that many of these data flows are never-ending streams containing huge volumes of information. These are not merely what others refer to as "big data" but tend to include the largest of these, "very big data". In order for society to reap many of the potential benefits from very big data, we have to rethink the way researchers engage with data. For example, researchers have routinely worked with small, self-contained datasets that can be copied and shared online with ease. But you cannot just copy voluminous continuous real-time streaming data. Researchers need new tools, methods and practices. Attempting to use historical approaches is like trying to get a sip of water, not from a water fountain, but from a hydrant! In this TIG session, we brainstorm on the tools, methods, workflows, and protections needed.

    Discussant: Dr. Merce Crosas is the Director of Product Development for the Dataverse, an open-source application for sharing, citing, archiving, discovering and analyzing research data, created by the Institute for Quantitative Social Science (IQSS) at Harvard and used in installations throughout the world, including most recently at Harvard Library to assist researchers with curation and management of research data. The Dataverse at IQSS currently houses the world's largest collection of social science research data, hosting more than 51,000 studies comprising 719,000 files.

  4. Creating a Platform for Making Frequently Updated, Very Big Data Accessible for Research

    As very big data sets become increasingly common, the Dataverse Network application will need to support them to continue fulfilling its mission of making data accessible for research. In this discussion we'll brainstorm on technologies, workflows and architecture that can help to make this possible. Should computations run locally or in the cloud? Will new non-relational databases be useful (Cassandra, MongoDB, Neo4j, etc.)? Should it support different types of storage? What are the use cases for different data types and for different types of researchers? ... and what questions are we missing?

    Discussant: Michael Bar-Sinai is a software engineer and a PhD student at Ben-Gurion University of the Negev, Israel. He became interested in the moral implications of software systems after developing an evaluation system for a human resources department. He later insisted that the system would never be used.

  5. Privacy and Public-Use Mortgage Data

    In the wake of the 2008 real estate market crash and ensuing Great Recession, there have been growing calls for greater transparency and accountability in the US mortgage market. The policy response to these calls has taken the form of new and expanded public-use government mortgage datasets, comprised of individual loan-level data. Fannie Mae and Freddie Mac, the Securities and Exchange Commission, and the new Consumer Financial Protection Bureau have all been tasked with launching enhanced mortgage datasets for public consumption, which could include data on individual borrowers' credit scores, income, age, race, and gender.

    Though the expansion in publicly available government data on mortgage lending promises to enhance research on and supervision of market activity, including the incidence of discriminatory lending, the data may also introduce new privacy concerns. These concerns include an increased risk - and resulting harm - of re-identifying individual borrowers from government mortgage data. Government agencies therefore face a trade-off between releasing the most useful mortgage data and protecting consumer privacy. How should government agencies with public data mandates address these privacy issues, particularly in a world where government and non-government data on individuals is increasingly prevalent?

    Discussant: Sarah Ellis is a Master in Public Policy student at Harvard's Kennedy School of Government. She has previously worked on mortgage data policy at the Consumer Financial Protection Bureau. Her capstone research project focuses on addressing privacy risk in public use mortgage data.


  6. You for Sale: The Business of Consumer Data

    Executives in technology, retail, marketing and other industries like to say that data is "the new oil" or, at least, the fuel that powers the Internet economy. It is a metaphor that casts consumers as natural resources with no say over the valuable commodities that companies extract from them. Yet this data extraction is often opaque to consumers and largely unregulated. To give readers some insight into the data economy, The New York Times last year published an investigative series, called "You for Sale," in which we examined different industries that collect, analyze, use and sell information about consumers. The series helped prompt separate investigations by the United States House of Representatives, U.S. Senate, Government Accountability Office, and the Federal Trade Commission. This year, the series won an award for personal finance reporting from the Society of American Business Editors and Writers.

    Discussant: Natasha Singer is a reporter in the Sunday Business section of The New York Times where she covers the business of consumer data. She was previously a reporter in the Business section covering the pharmaceutical industry and professional medical ethics. In 2010, she was a member of a team of New York Times reporters whose series on cancer was a finalist for a Pulitzer Prize in explanatory reporting. @natashanyt


  7. Protecting Confidentiality by Releasing Simulated Public Use Datasets

    Statistical agencies that disseminate public use microdata, i.e., data on individual records, seek to do so in ways that (i) protect the confidentiality of data subjects' identities and sensitive attributes, (ii) support a wide range of analyses, and (iii) are easy for secondary data users to work with. One approach is to release data sets in which confidential values are replaced with draws from statistical models. These models are estimated so as to preserve as much structure in the data as possible. This approach has been used to release public use data in, for example, the Longitudinal Business Database, the Survey of Income and Program Participation, OnTheMap, and the American Community Survey group quarters data. In this talk, I review the ideas underpinning the approach, including a discussion of recent applications and methods for generating simulated data sets. I highlight open research areas related to disclosure risk estimation and utility assessment. In these contexts, I also describe some of the research activities of the Triangle Census Research Network, an NSF-sponsored research center developing new methodology for dissemination of federal (and other) statistical data.
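
    The idea of replacing confidential values with draws from an estimated model can be sketched in a few lines. This is a deliberately simplified illustration, not the methodology used for any of the data products named above: it assumes a single confidential variable (income) modeled by a one-predictor linear regression on a releasable variable (age), and it draws from the fitted model with plug-in estimates rather than accounting for parameter uncertainty. The records and names are hypothetical.

```python
import random
import statistics


def fit_simple_linear_model(x, y):
    """Ordinary least squares for y = intercept + slope * x, plus the residual std. dev."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = (sum(r * r for r in residuals) / (len(x) - 2)) ** 0.5
    return intercept, slope, sigma


def synthesize(x, intercept, slope, sigma, rng):
    """Replace each confidential value with a draw from the fitted model."""
    return [rng.gauss(intercept + slope * xi, sigma) for xi in x]


# Hypothetical toy records: age is treated as releasable, income as confidential.
ages = [25, 32, 41, 47, 53, 60]
incomes = [31000.0, 42000.0, 55000.0, 58000.0, 67000.0, 71000.0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
intercept, slope, sigma = fit_simple_linear_model(ages, incomes)
synthetic_incomes = synthesize(ages, intercept, slope, sigma, rng)

# The released file would pair the real ages with the simulated incomes.
for age, value in zip(ages, synthetic_incomes):
    print(age, round(value))
```

    The released incomes keep the age-income relationship captured by the model while no record carries its original confidential value. How well a richer model preserves structure, and how much disclosure risk remains, are the open questions the talk highlights.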

    Discussant: Jerry Reiter is the Mrs. Alexander Hehmeyer Associate Professor of Statistical Science at Duke University. He received his PhD in statistics from Harvard University in 1999. He works extensively with the U.S. Census Bureau and other federal agencies on methods for protecting confidentiality in public use data and on methods for handling missing/faulty data. He supervised the creation of the synthetic Longitudinal Business Database, the first establishment-level, unrestricted public use database on business establishments in the U.S.

  8. Better Democracy Through Random Sampling

    Our goal is to create more inclusive, meaningful civic participation to strengthen democratic governance and individual liberty. Voting is currently plagued by problems, including declining participation, voter disaffection, and tribulations at the polls. We propose a groundbreaking framework that we believe will provide results that will be more truly representative of the electorate. The system behind our approach is more dynamic, agile and flexible than existing systems and will permit the entry of new participants. It leverages traditional paper election systems with random sampling and modern end-to-end user-verified audit capabilities. Each voter can confirm that her vote was counted correctly. Anyone can verify the election results. These mechanisms increase accuracy, integrity, and confidentiality, while also encouraging voter participation and engagement. Our approach enables the will of the people to be expressed at much lower cost and with more validity and much greater discernment. These new practices will enhance democracy through innovations in governance, better participation in decision making, and improved self-determination and collective action.

    Discussants: Dr. Richard T. Carback III is a Principal Member of Technical Staff in the Network and Information Concepts group at Charles Stark Draper Laboratories. He has over a decade of R&D experience in computer security and was a key technical contributor to the Punchscan and Scantegrity voting systems. Punchscan was the first end-to-end voter-verifiable system implementation used in a binding election, which was held at the University of Ottawa in 2007. Scantegrity was the first such system to be used in a binding public election, held in Takoma Park, Maryland, in 2009.

    Deborah Hurley received the Namur Award of the International Federation of Information Processing in recognition of outstanding contributions, with international impact, to awareness of social implications of information technology. She is the author of Pole Star: Human Rights in the Information Society, "Information Policy and Governance" in Governance in a Globalizing World, and other publications. At the Organization for Economic Cooperation and Development, in Paris, France, she was responsible for drafting, negotiation and adoption of the OECD Guidelines for the Security of Information Systems. Hurley is Chair, Board of Directors, Electronic Privacy Information Center (EPIC). She directed the Harvard University Information Infrastructure Project and carried out a Fulbright study in Korea.

  9. Best Privacy Anywhere

    A Double Feature:

    • Best Privacy Anywhere. A modest proposal for a model state privacy law that will give state citizens the best privacy protections that companies offer their customers anywhere in the world.

    • Everything you wanted to know (and more!) about certificates of confidentiality for research. A certificate of confidentiality gives researchers a greater ability to resist subpoenas for their research records about individuals. Are certificates a protection for research subjects or a trap for the unwary researcher? Do you need one for your own research?

    Discussant: Robert Gellman is a privacy and information policy consultant in Washington, D.C., specializing in health confidentiality policy, privacy and data protection, and Internet privacy. Clients have included federal agencies, Fortune 500 companies, trade associations, advocacy groups, foreign governments, and others. A graduate of the Yale Law School, Gellman served for 17 years as chief counsel to the Subcommittee on Government Information in the House of Representatives. He maintains a webpage with many documents and other useful resources. He is coauthor of Online Privacy: A Reference Handbook, published by ABC-CLIO in 2011.

  10. Life on the Hill

    Emmanual Guillory is Staff Assistant to Rep. Joe Barton (TX-6) and talks about what life is like working in Congress.

  11. The Analytic Hierarchy Process and the Theory of Measurement

    Multiple-Criteria Decision-Making methods are commonly used when a complex decision, involving sometimes conflicting criteria, has to be made. Many structured methods have been developed. The AHP (Analytic Hierarchy Process) is one of the most used in practice and has been applied in fields such as public health, evaluation, e-government, and defense analysis. We argue that besides the classical AHP issues (such as rank reversal or lack of normative foundations), special care has to be taken to deal with the psychological distortions that affect the AHP pairwise comparisons.
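
    For readers unfamiliar with the pairwise comparisons at issue, here is a minimal sketch of the standard AHP calculation: priorities are taken from the principal eigenvector of a reciprocal comparison matrix (approximated here by power iteration), and a consistency ratio flags judgments that contradict one another. The three criteria and the comparison values are hypothetical, and the random index values are the commonly tabulated ones attributed to Saaty.

```python
# Commonly tabulated random consistency index values for matrix sizes 1 to 9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def multiply(matrix, vector):
    """Matrix-vector product for a square matrix given as a list of rows."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def ahp_priorities(matrix, iterations=100):
    """Approximate the normalized principal eigenvector by power iteration."""
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = multiply(matrix, weights)
        total = sum(product)
        weights = [p / total for p in product]
    return weights


def consistency_ratio(matrix, weights):
    """Consistency index divided by the random index; below 0.1 is the usual rule of thumb."""
    n = len(matrix)
    if n < 3:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    product = multiply(matrix, weights)
    lambda_max = sum(p / w for p, w in zip(product, weights)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical judgments: entry [i][j] says how strongly criterion i is preferred to j.
criteria = ["cost", "quality", "speed"]
comparisons = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]

weights = ahp_priorities(comparisons)
ratio = consistency_ratio(comparisons, weights)
for name, weight in zip(criteria, weights):
    print(f"{name}: {weight:.3f}")
print(f"consistency ratio: {ratio:.3f}")
```

    The psychological distortions the abstract mentions enter through the comparison matrix itself: the arithmetic is mechanical, so any bias in how people map their judgments onto the comparison scale flows directly into the weights.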

    Discussant: Christine Choirat is the Director of Research and an Associate Professor of Quantitative Methods at the School of Economics and Business Administration of the University of Navarra (Pamplona, Spain).

  12. The Business of Personal Data

    Gathering personal information on individual consumers has become the financial lifeblood of the Internet and, increasingly, of many business operations. Who are the people and firms gathering such information? What details are they putting together and how do they get so much information about us? Department of Government Fellow Adam Tanner is finishing a behind-the-scenes book on data gathering firms. Professor of Government and Technology in Residence Latanya Sweeney will offer analysis of how these firms use technology to enhance their data gathering.

    Discussant: Adam Tanner is a 2012-13 fellow in the Department of Government and a 2011-12 Nieman fellow. A long-time investigative reporter and foreign correspondent, he has served as bureau chief for Reuters in the Balkans and San Francisco as well as a correspondent posted in Moscow, Berlin and Washington D.C. His book "Behind the Data Curtain" will be published next year.

  13. Debriefing the Re-identification of the Personal Genome Project

    The Personal Genome Project (PGP) aims to sequence the genotypic and phenotypic information of 100,000 informed volunteers and display it publicly online in an extensive public database. The PGP operates under a privacy protocol it terms "open consent". Individual volunteers freely choose to disclose as much personal data as they want, often including identifying demographic data, such as date of birth, gender, and postal code (ZIP). Online, the profiles appear in a "de-identified state," being void of the direct appearance of the participant’s name or address. Last month members of the Data Privacy Lab conducted an experiment to determine how many of the profiles could be re-identified by name using public records and correctly identified 48% of the profiles, with 84-97 percent accuracy for those profiles for which they provided names. This discussion will be a debriefing of lessons learned and experiences. The short paper and more information are available online.
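
    The general linkage idea behind such an experiment can be sketched as a join on the demographic fields the abstract lists. This is an illustrative sketch only, not the Data Privacy Lab's actual procedure: it assumes a named public record list with the same three fields, and it reports a name only when exactly one record matches. All records and field names below are invented.

```python
from collections import defaultdict


def link_profiles(profiles, public_records):
    """Map profile IDs to names when (date of birth, gender, ZIP) matches exactly one record."""
    index = defaultdict(list)
    for record in public_records:
        key = (record["dob"], record["gender"], record["zip"])
        index[key].append(record["name"])

    matches = {}
    for profile in profiles:
        key = (profile["dob"], profile["gender"], profile["zip"])
        candidates = index.get(key, [])
        if len(candidates) == 1:  # ambiguous or missing matches are left unnamed
            matches[profile["profile_id"]] = candidates[0]
    return matches


# Invented "de-identified" profiles: no name or address, but demographics remain.
profiles = [
    {"profile_id": "P1", "dob": "1970-03-14", "gender": "F", "zip": "02138"},
    {"profile_id": "P2", "dob": "1985-11-02", "gender": "M", "zip": "02139"},
    {"profile_id": "P3", "dob": "1962-07-30", "gender": "M", "zip": "02140"},
]

# Invented named public records (for example, a voter list).
public_records = [
    {"name": "Alice Example", "dob": "1970-03-14", "gender": "F", "zip": "02138"},
    {"name": "Bob Sample", "dob": "1985-11-02", "gender": "M", "zip": "02139"},
    {"name": "Brian Sample", "dob": "1985-11-02", "gender": "M", "zip": "02139"},
]

print(link_profiles(profiles, public_records))
```

    In this toy data only P1 links to a single name: P2 matches two records and P3 matches none, which is why the share of profiles named and the accuracy of the names provided are reported as separate figures.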

    Discussants: Stakeholders from the Personal Genome Project and Silent Spring.


Copyright © 2012-2014. President and Fellows Harvard University.   |   IQSS   |    Data Privacy Lab