Comprehensive Data Management Planning & Services

Day of Data 2021 Schedule and Information

The 2021 Cornell Day of Data was held over two days: Wednesday and Thursday, January 27-28. The event was 100% virtual, and we hope we provided an environment that brought people together to share and learn effective techniques and tools for working with and managing their data. #cornelldayofdata

Questions? Please contact

Interested in the presentations? Check out full presentation recordings and find slides on the CDOD21 OSF Meeting site.

Code of Conduct | Commitment to Accessibility | General Event Information

Cornell Day of Data 2021 Agenda

| Date | Time (EST) | Event (Track 1) | Event (Track 2) |
| Wednesday, January 27th | 12:00 pm | Welcome Remarks | |
| | 12:10 pm | Keynote Speaker: Shane G. Henderson | |
| | 1:10 pm | Break | |
| | 1:20 pm | Panel: Collaboration in the Federal Statistical Research Data Center Program | Panel: Scaling and Integrating Pipelines for Bioacoustics Big Data Analysis |
| | 2:20 pm | Break | |
| | 2:30 pm | Lightning Talks, Round 1 | |
| | 3:20 pm | Break | |
| | 3:30 pm | Panel: Creating Reproducible Research at Cornell | Panel: Rescuing from the Past, Saving for the Future: An Archaeological Science Perspective on Collaborative Data |
| Thursday, January 28th | 12:00 pm | Opening Remarks | |
| | 12:10 pm | Themed Presentations: Social Factors Impacting Research | Themed Presentations: Communicating our Research |
| | 1:10 pm | Break | |
| | 1:15 pm | Lightning Talks, Round 2 | |
| | 1:55 pm | Break | |
| | 2:00 pm | Themed Presentations: Sharing Research | Workshop 1: Reproducibility and Collaboration When Your Data is Really Large or Confidential |
| | 3:00 pm | Break | |
| | 3:15 pm | Workshop 2: Introduction to Git | Workshop 3: A Functional Approach to Ordered Data: Statistics on Surfing Competitions |
| | 4:15 pm | Closing Remarks | |

Keynote Speaker:

Shane G. Henderson is the Charles W. Lake, Jr. Professor in Productivity in the School of Operations Research and Information Engineering at Cornell University. His overall professional goal is to contribute to both research and learning in the theory and application of stochastic simulation and applied probability, with emphasis on the interface between these areas and optimization. He is greatly interested in, and motivated by, applications with strong societal relevance, including bike sharing, medical scheduling and ambulance planning.

Dr. Henderson's keynote presentation is entitled "Yes, Data is Great, But...."

From Dr. Henderson: I use data every day in my work on decision making under uncertainty in a variety of settings. In this talk, I'll survey some of those decision contexts, including ambulance deployment, bike sharing in NYC, and the questions surrounding reopening Cornell's Ithaca campus for in-person instruction in the time of Covid-19. Data plays a central role in each of these settings, but care must be exercised to avoid pitfalls. I'll discuss some of those pitfalls and our efforts to avoid them. This talk is based on joint work with many people, but most prominently Andrew Mason (ambulances), David Shmoys (bike sharing and Covid-19 at Cornell), and Peter Frazier (Covid-19 at Cornell).


Panel Details:

Collaboration in the Federal Statistical Research Data Center Program, Amanda Eng, Nichole Szembrot, Hyunseob Kim

Wednesday, January 27, 1:20 pm, Track 1

Federal Statistical Research Data Centers allow researchers to access confidential federal data for research. The U.S. Census Bureau manages approximately 30 RDCs across the country, including one located at Cornell University. The FSRDC program facilitates collaboration both across research institutions and between academic researchers and Census Bureau staff. They provide access to restricted-use data from Census and several other federal agencies, including the National Center for Health Statistics, Bureau of Labor Statistics, Bureau of Economic Analysis, and Agency for Healthcare Research and Quality. Nichole Szembrot is the Cornell RDC administrator and will provide an overview of the program and data available to researchers. Hyunseob Kim will discuss his research using confidential business data, which involves collaboration with faculty at several other universities. Amanda Eng will explain how the RDC allowed her to join projects that were initiated by Census Bureau staff.

Scaling and Integrating Pipelines for Bioacoustics Big Data Analysis, Laurel Symes, Viviana Ruiz-Gutierrez, Stefan Kahl, Grant Van Horn, Drew Weber, Alvaro Vega-Hidalgo

Wednesday, January 27, 1:20 pm, Track 2

Effective monitoring of biodiversity provides key information on the status of wildlife populations and is critical for conservationists and decision-makers worldwide. One of the most cost-effective and non-invasive approaches to monitoring wildlife populations is bioacoustic monitoring. Currently, there are millions of hours of recordings worldwide. However, the immense volume of biological information that is captured in these bioacoustic datasets remains largely untapped. The analysis of big acoustic datasets presents challenges that are shared beyond acoustics, presenting opportunities for exchange and learning across domains. Bioacoustic analysis pipelines generally include multiple steps. Each of these steps must be scaled to enable analysis of massive datasets and connected to other steps to enable end-to-end analysis. In practice, the steps that make up these pipelines are often developed by different researchers and research groups, making explicit attention to data structure and conventions even more critical. The objective of this panel discussion is to bring together different groups at Cornell who are working on different aspects of the bioacoustic analysis pipeline, and discuss how to best align and integrate efforts. We will discuss multiple data problems related to bioacoustics, including interfacing human intelligence and artificial intelligence, managing overlapping sounds, and creating active learning data loops. We will have panelists from the Center for Conservation Bioacoustics and the Macaulay Library of Natural Sounds, both based at the Cornell Lab of Ornithology. Speakers: Laurel Symes, Viviana Ruiz-Gutierrez, Stefan Kahl, Grant Van Horn, Drew Weber and Alvaro Vega-Hidalgo.

Creating Reproducible Research at Cornell, Florio Arguillas, Wendy Kozlowski, Erika Mudrak, Lars Vilhuber

Wednesday, Jan. 27, 3:30 pm, Track 1

Research reproducibility is the idea that the results of a study or experiment should be achievable again when the same data and analysis/computational protocols are used. There are many actions that can be taken throughout the research process to make your work reproducible, from pre-registering your study design and employing good data management practices, to documenting your processing and analysis steps in a transparent and understandable way, to ensuring your data and code are independently understood by others, to making data available in a way that is findable, accessible, interoperable and reusable. This panel will bring together representatives of a few services available at Cornell University to support these efforts. Panelists will briefly discuss their services and how they fit into the practice of creating better, reproducible research. Speakers will include: Wendy Kozlowski, Data Curation Specialist at the Cornell University Library, speaking about preregistration, the OSF, and related management practices; Florio Arguillas, Research Associate at the Cornell Institute for Social and Economic Research, speaking about the Results Reproduction service; Erika Mudrak, from the Cornell Statistical Consulting Unit, speaking about ethical issues in data analysis; and Lars Vilhuber, Executive Director of the Labor Dynamics Institute and Senior Research Associate in the Department of Economics and the ILR School, speaking about data curation practices employed at the time of research publication.

Rescuing from the Past, Saving for the Future: An Archaeological Science Perspective on Collaborative Data, Annapaola Passerini, Rebecca Gerdes, Alice Wolff, Anna Whittemore, Sam Disotell, Jillian Goldfarb

Wednesday, Jan. 27, 3:30 pm, Track 2

Archaeology reconstructs the human past through the interpretation of its material remains. The last few decades have witnessed a boost in the application of STEM techniques to the study of the archaeological record, allowing for the investigation of a variety of materials (e.g. stone, metal, ceramics, organic residues, soil, plant and animal remains) at both the micro and macro level. However, as this subfield of archaeology progressively acquires distinct academic authority under the name of “Archaeological Science”, numerous challenges stem from its interdisciplinary nature, which straddles the worlds of humanities and natural sciences. The production of “Archaeological Science Data” brings necessary attention to issues of research design and sampling strategies, especially given the mismatch between the reproducibility of laboratory techniques to acquire scientific data on materials, and the non-reproducibility and, sometimes, less than ideal representativity of samples collected from the archaeological field. The correct interpretation of such data, therefore, requires a robust understanding of both the social, cultural, and environmental characteristics of the archaeological context of sampling, and the implicit limitations and affordances of the specific STEM technique used in the lab. The renewed interest in legacy data due to the conditions created by COVID-19 has spurred archaeologists to reflect on the limitations of interdisciplinary collaboration and the need to create datasets that could, at once, make sense in the present and be understandable in the future. This panel seeks to expose critical aspects of the production of collaborative data within the discipline of Archaeological Science. Salient points will include the role of research questions in shaping data, standardization, interdisciplinary communication, and storage sustainability. Case studies will be drawn from several specializations, including residue analysis, archaeobotany, zooarchaeology, bioarchaeology, and radiocarbon dating. Ultimately, this panel wishes to inspire more responsible and informed approaches to collaboration within archaeology and beyond. Speakers: Annapaola Passerini, Rebecca Gerdes, Alice Wolff, Anna Whittemore, Sam Disotell and Jillian Goldfarb.

Themed Presentations

Social Factors Impacting Research

Thursday, Jan. 28, 12:10 pm, Track 1

Qualitative methods and open science: Building more trustworthy research, Lee Humphreys, Neil Lewis Jr, Katherine Sender and Andrea Stevenson Won

Recent initiatives towards open science in communication have prompted vigorous debate. While gesturing to qualitative research, much of open science does not engage with the constructionist research paradigms from which many qualitative researchers draw (e.g. Dienlin et al, 2020; one thoughtful exception is Haven & Von Grootel, 2019). Nevertheless, shared standards and criteria exist for conducting trustworthy qualitative research (e.g. Lincoln & Guba, 1985). In this presentation, we draw on qualitative and interpretive research methods to expand the key priorities that the open science framework addresses, namely producing trustworthy and quality research. Furthermore, we articulate a general set of principles that can encompass a wide range of epistemologies and methodologies that foster trust and intellectual expansiveness. Qualitative research has demanded reflexive approaches to knowledge building (for example, positionality, partial knowledge, and attempts to offset implicit bias; see Denzin & Lincoln, 2000). Qualitative methods can help deductive scholars embrace the contextual and the reflexive nature necessary for rigorous and inclusive research (Davis, 2014; Sefa Dei, 2005). To be successful, the open science movement will need to engage with the breadth of epistemological, methodological, and ethical traditions that co-exist in the social sciences. It will need to wrestle with the hierarchy of knowledge production and the disparate impacts of that hierarchy in scientific evaluation and rewards (Nosek, Spies, & Motyl, 2012). It will also need to resist the pressure to endlessly publish (Nelson, Simmons, & Simonsohn, 2012) novel and transformative studies (Davis, 1971), since that pressure hinders efforts to build cumulative knowledge (Forscher, 1963; Lewis, 2020). We understand that these are difficult challenges to address, but these are issues that must be resolved for research to be truly open and inclusive, now and in the future.

Developmental research across continents during the time of COVID-19: A transnational study of children’s eyewitness memory and suggestibility, I-An (Amy) Su, Renee X. Zhang, Brett K. Hayes and Stephen J. Ceci

The proposed research aims to investigate how social and cultural factors influence children’s performance on measures of eyewitness memory and suggestibility by conducting a large-scale transnational comparative study in seven countries: the U.S., Taiwan, Australia, Germany, India, Israel, and Turkey. A total of 1,120 children (80 three-year-olds and 80 eight-year-olds from each nation) are shown a to-be-remembered event (a brief video), followed by a short break with/without filler tasks, and then they are presented with suggestive misinformation about the video. Following this, they are assessed on their responses to questions (neutral vs. misleading) about the video. Parents of these children are surveyed about their demographic information, parenting style (Buri, 1991; Reitman et al., 2002), and indexes of socioeconomic status (SES). Several measures are designed to ensure cultural consistency among all collaborating nations. For instance, bilingual team members complete translation and back-translation methods to secure the linguistic equivalence of research materials. Child participants receive different filler tasks and rewards according to local research norms. Interviewers are trained in a culturally sensitive fashion. As per different national laws and research regulations, IRB application adjustments are subject to requirements of local review boards. Research accommodations in response to the COVID-19-related changes are developed. A pilot study transitioning the child interview into a web-based modality is in progress to test the validity of online child research as an alternative interviewing method during the time of the coronavirus pandemic. Associated adaptations are also generated to react to stay-at-home lockdowns. The presenter will share challenges and lessons learned from administering a cross-national study amid the period of the virus outbreak. The implications of this presentation will shed light on developmental studies that involve cross-cultural child participants via internet-based communication tools.

Democratizing Data for Equitable Community Change, Russel Weaver

As the mass demonstrations and organizing efforts for racial justice in 2020 have shown, campaigns and movements for social change tend to grow and gain traction through gripping public narratives and highly visible trigger events. However, leveraging that momentum into policy and institutional change may hinge on the availability of reliable quantitative evidence regarding how the benefits of change outweigh the costs of maintaining the status quo. The reason for this data dependence is that powerholders exhibit quantitative biases, whereby they dismiss most calls for change if those calls are not backed up by convincing, verifiable, publicly supported empirical arguments. In practice, given the unevenness in the landscapes of data, data access, and data analysis capabilities, these dynamics can create unfair barriers between communities and the powerholders they are targeting for change. The end result is a bias toward the status quo. A new initiative of the Cornell ILR Buffalo Co-Lab is attempting to tear down these barriers and even out the landscape. The Co-Lab, whose mission is to advance an equitable economy and democratic community, recently launched a project called Data for Equitable Economic Development and Sustainability, or Good DEEDS. This presentation introduces participants to the Good DEEDS initiative and some of its early outputs. Broadly speaking, Good DEEDS seeks to democratize geographic, economic, environmental, and social data; provide training on how to use that data; conduct surveys to shed new light on regional workforce, civil society, economic and ecological health, and quality of life; and produce original research, policy memoranda, and reports to advance a High Road policy agenda of shared prosperity, ecological resilience, and participatory democracy. Projects featured in the presentation include a Digital Divide Dashboard for Western New York, an Erie County COVID-19 web mapping application, and a pilot Living Wage Atlas for New York State.

Communicating Our Research

Thursday, Jan. 28, 12:10 pm, Track 2

Gardens of the Roman Empire: interdisciplinary collaboration to assemble, organize, and publish online scholarship, Yifan Li, Shamika Ghate and Jane Millar

The Gardens of the Roman Empire Project (GREP) and the NYU Institute for the Study of the Ancient World (ISAW) are collaborating to produce a digital publication, gathering the known remains of around 1200 ancient Roman gardens and presenting many of them for the first time. GREP is opening up a vast repository of knowledge for future scholarship in a free online and open-access format, featuring content created by Cornell students in landscape architecture and archaeology. Over the course of an intensive four-week workshop, experts introduced students and scholars from several universities to the digital publication tools (Atom, Markdown, Hugo, and Zotero) and linked vocabularies and databases (Getty, TGN, Pleiades). GREP paired students with senior scholars of different professional and disciplinary backgrounds to enhance the exchange of ideas and learn new data visualization skills. Student interests were taken into consideration with the introduction to metadata and linked open data, editing and writing entries, conducting research on recent archaeological excavations, creating illustrations and artistic reconstructions, mapping, georeferencing, and more. Motivated students became the authors and/or editors of their entries. The study of ancient gardens is multiscalar and multidisciplinary, drawing on such fields as ancient history, urban planning, and environmental archaeology, as demonstrated by our diverse cohort. We faced the usual challenges of interdisciplinary collaboration: differences in scholarly vocabulary, unfamiliarity with ancient (archaeological and documentary) source material and publication standards, along with the steep learning curve of new web-based platforms. The pandemic made communication an extra challenge. Collaborators from China to California gathered in flexible small groups and over diverse communication platforms like Slack. The ISAW team was particularly prompt in solving queries and coordinating responses. By developing these flexible workflows, we are on schedule to publish in spring 2021, following previews at regional and national archaeological conferences.

Doing More With Less: Improving Community Livelihoods through Sustainable Management of Livestock and Wildlife in the Pamir Mountains of Tajikistan, Helen Lee

The Cornell Wildlife Health Center is leading an interdisciplinary project in Tajikistan’s Pamir Mountains to empower local communities to achieve a productive and harmonious future in landscapes shared with wildlife. The Pamir Mountains support a fragile ecosystem of limited resources shared by unique wildlife and a distinctive livestock herding culture. Wild goat and sheep species live alongside herds of domestic livestock, which provide food for herding communities and income through market sales in urban centers. The surrounding ecosystem is increasingly at risk from unsustainable numbers of livestock. Greater numbers of livestock reduce the food available for each animal, lowering their survival and productivity, with negative impacts on human livelihoods and the local economy. Grazing competition and disease also threaten wild ungulates and thus the endangered snow leopards that rely on them to survive. Together with partners in Tajikistan, our interdisciplinary team of experts in the fields of conservation, wildlife health, livestock health/husbandry, and economics/business is working with Pamir communities to look for ways that they can maintain or increase their benefit from having smaller but healthier and more productive livestock herds, with a reduced environmental footprint. Our research team conducted health assessments and a value chain analysis to identify areas where veterinary interventions or changes in livestock management and marketing could improve financial security. This was completed through household livelihood questionnaires, livestock animal sampling, laboratory analyses, and semi-structured interviews with stakeholders along the value chain. We will co-develop strategies with communities to improve their livestock productivity and financial security, in ways that conserve the wildlife and natural landscape of the Pamirs. If successful, this system could serve as a conceptual model for conservationists or development agencies to scale and adapt to other systems around the world, helping to improve economies in herding communities while supporting wildlife populations and environmental sustainability.

The Upworthy Research Archive: Insights gained from an open dataset of headline experiments, Marianne Aubin Le Quere and J. Nathan Matias

Companies frequently conduct A/B tests with their users to inform product decisions. Usually, companies use these experiments to test specific implementation details for one-time use (e.g. Which button colour increases clicks the most?). Once these experiments are aggregated, they may hold greater truths about the way citizens behave online. These experimental findings are hard to come by and costly in academia, yet companies keep them closely guarded, making it difficult for most researchers to access such data. Even academics who do gain access to company data are unable to share their data publicly, and potential scientific insights go unexplored. We have compiled and made accessible the Upworthy Research Archive, an open dataset of 30,000+ A/B tests of headlines conducted by the digital news publisher Upworthy from January 2013 to April 2015. At the time of release, it is the largest open-access collection of randomized behavioral studies available for research and education. Upworthy was an extremely popular digital publisher around the 2013-2015 time period, and they conducted a huge number of experiments. The Upworthy Archive includes the features of each article preview link that was tested by Upworthy during this time period, and the clickthrough rate of that article preview link. At Day of Data, we would like to present the Upworthy Archive, how it was made available through an exploratory-confirmatory process to researchers across disciplines, and one example method for analysis and preliminary findings. We hope that we can inspire attendees to think about how to share industry data to further scientific inquiry, and get involved with the Upworthy Archive. We also hope to spread the word that companies can donate experimental data for large-scale analysis in a way that safeguards user data and privacy and contributes to scientific advancement.
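For readers unfamiliar with how a headline A/B test is scored, the following is a minimal sketch. It assumes each tested headline is summarized by two counts, impressions and clicks; the function names and the numbers in the usage note are hypothetical and are not taken from the Upworthy Research Archive or from the presenters' analysis. The comparison uses a standard two-proportion z-test, which is one common choice and not necessarily the method discussed in the talk.

```python
from math import erfc, sqrt


def clickthrough_rate(clicks: int, impressions: int) -> float:
    """Fraction of impressions that resulted in a click."""
    return clicks / impressions


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Compare two clickthrough rates with a pooled two-proportion z-test.

    Returns the z statistic and the two-sided p-value under the
    normal approximation.
    """
    p_a = clickthrough_rate(clicks_a, n_a)
    p_b = clickthrough_rate(clicks_b, n_b)
    # Pooled rate under the null hypothesis that both headlines perform equally.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided tail probability of a standard normal variable.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value
```

As a usage example with made-up counts, `two_proportion_z_test(60, 1000, 40, 1000)` compares a headline clicked 60 times in 1,000 impressions against one clicked 40 times in 1,000 impressions.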

Sharing Research

Thursday, Jan. 28, 2:00 pm, Track 1

Use of remote sensing data for social scientists, Tarana Chauhan

Researchers in varied disciplines resort to using remote sensing data for forecasting/prediction, measuring outcomes, and analyzing trends. Apart from acquainting themselves with different kinds of data, they are also required to build knowledge of different principles and theories, which requires collaboration. This presentation will address and promote discussion around limitations of remote sensing data and possible ways to work around common issues.

Transforming breeding through the implementation of genomics tools and customized data management systems, Dongyan Zhao

Genomics and bioinformatics tools have transformed breeding for staple crops and animals over the past decade, yet most smaller breeding programs have been left behind due to limited resources and high costs of adoption of new technologies. Breeding Insight (BI) was established as a full-service conduit for the development and adoption of high-tech genotyping, phenotyping, and data management tools into small breeding programs.

Mapping the yeast epigenome: developing a methodology for data reproducibility and dissemination, Dr. William Lai

The nucleus of every organism on the planet is composed of DNA and interacting protein. The specific positioning and organization of these proteins constitutes the epigenome and is responsible for maintaining and proliferating life. Using the high-resolution ChIP-exo assay, we have generated thousands of unique datasets that describe this protein organization in Saccharomyces cerevisiae (yeast). The depth and complexity of the data generated required the development of novel systems to both track sample metainformation, and provide a mechanism for reproducible analysis. This presentation will describe the methodologies developed during this project to track, reproducibly analyze, and disseminate large-scale genomic data.

Lightning Talks

Round 1

Wednesday, Jan. 27, 2:30 pm, Track 1

  • Does New York City Feed Its Residents Well During COVID-19?, Yuqing Zhang
  • Identifying discriminatory language in capital case juror selection, Anna Effenberger
  • Furthering Child Research with Anonymization Methods, Rouzbeh Rahai
  • Introducing the Qualitative & Interpretive Research Institute, Lee Humphreys
  • Secure Research with Restricted Data, Elena Goloborodko

Round 2

Thursday, Jan. 28, 1:15 pm, Track 1

  • Harnessing Machine Learning and Multivariate Approaches to Understand Attention and Memory with Aging, Nicholas Cicero
  • CU-MSDSp: A Flexible Parallelized Reversible Jump Markov Chain Monte Carlo Method for Model Selection, John T. Chavis III
  • Importance of Theoretical Projects in Experimental Science, Pedro Guicardi
  • Scholarship through collaboration – Expanding active learning all the way from the field to empirical data, Marc Goebel
  • Change-Point Detection in Core-Periphery Networks: A Case Study on Detecting Financial Crises in the Interbank Market, Desheng Ma

Workshops


Reproducibility and collaboration when your data is really large or confidential, Lars Vilhuber and David Wasser

Thursday, Jan. 28, 2:00 pm, Track 2

Many new collaborative and often reproducible or dynamic tools are being developed or in use. One feature that they have in common is that it is hard to use them when the data being used is really large (you cannot put 1 TB of data into GitHub) or confidential (don’t even try to do that with GitHub). In this workshop, I will convey some tips and tricks on how to set up a reproducible environment that allows for such features of the data. Instructors: Lars Vilhuber and David Wasser
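One general way to keep a project reproducible when the data cannot live in the repository is to version a small manifest of file checksums instead of the data itself. The sketch below illustrates that idea only; it is an assumption on our part, not necessarily one of the techniques taught in the workshop, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that very large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its SHA-256 checksum."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(data_dir, manifest):
    """Return names of files that are missing or differ from the manifest."""
    current = build_manifest(data_dir)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

The manifest is small and contains no data values, so it can be shared alongside the code while the large or confidential files stay in their secure location. A collaborator with authorized access can then run `changed_files` to confirm they are analyzing the same inputs.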

Introduction to Git, Florio Arguillas

Thursday, Jan. 28, 3:15 pm, Track 1

This workshop will help participants get started with Git, a tool that helps keep track of changes made in project documents such as program files or source code, effectively versioning them, and allowing teams to collaborate via a central repository hub such as GitHub, Bitbucket, or GitLab. Git helps determine the changes made in a file, by whom and why, and what was changed exactly. This workshop covers creation and configuration of a Git repository; editing, staging, and committing files; determining differences between versions of files; retrieving previous versions of files; branching; and setting up a central repository on GitHub for individual and/or project team collaboration. This workshop will be taught via command line using Git Bash. No prior experience with Git is needed. Instructor: Florio Arguillas
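The local steps listed above (create and configure a repository, edit, stage, commit, compare versions, retrieve an earlier version, branch) can be sketched as follows. The workshop itself is taught at the Git Bash command line; this sketch drives the same standard `git` commands from Python so that it can run unattended in a scratch directory. The file name, commit messages, identity, and branch name are hypothetical, and the sketch is not the workshop's own material.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run one git command inside `repo` and return its standard output."""
    done = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return done.stdout


def demo_workflow():
    """Walk through init, commit, diff, retrieval, and branching in a scratch repo."""
    if shutil.which("git") is None:
        return None  # git is not installed, so there is nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        notes = repo / "analysis.txt"  # hypothetical project file

        # Create and configure the repository (identity applies to this repo only).
        git(repo, "init")
        git(repo, "config", "user.name", "Example User")
        git(repo, "config", "user.email", "user@example.com")

        # Edit, stage, and commit a first version.
        notes.write_text("step 1: load data\n")
        git(repo, "add", "analysis.txt")
        git(repo, "commit", "-m", "Add first analysis step")

        # Edit again and inspect the difference before committing it.
        notes.write_text("step 1: load data\nstep 2: clean data\n")
        diff = git(repo, "diff")
        git(repo, "commit", "-am", "Add cleaning step")

        # Retrieve the previous version of the file without rewriting history.
        previous = git(repo, "show", "HEAD~1:analysis.txt")

        # Start a branch for experimental work.
        git(repo, "checkout", "-b", "experiment")
        branch = git(repo, "rev-parse", "--abbrev-ref", "HEAD").strip()

        commits = git(repo, "log", "--oneline").splitlines()
        return {
            "diff": diff,
            "previous": previous,
            "branch": branch,
            "commits": len(commits),
        }
```

Each `git(repo, ...)` call corresponds to typing `git ...` at the command line inside the project folder, for example `git add analysis.txt` followed by `git commit -m "Add first analysis step"`. Connecting the repository to a central hub is left out because it requires an account and a remote URL.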

A Functional Approach to Ordered Data: Statistics on Surfing Competitions, Jojo Aboaf

Thursday, Jan. 28, 3:15 pm, Track 2

Ordered data is ubiquitous. Recommendation systems attempt to leverage information regarding one’s preferences to suggest new content (e.g. music, movies) or products (e.g. books). Ranked-choice voting is used for local, provincial/state, and national level elections across the globe. Even Cornell uses ranked-choice for its elections! In sports, orderings frequently determine tournament structures and season schedules, and in games or general forms of competition, ordered data is a natural way to express outcomes. While the expansion of application areas shows no signs of slowing down, it also reveals two main difficulties: 1) disunity of the theories that underpin available models and 2) computational issues that arise when dealing with permutations of k objects, the number of which scales factorially. Both of these difficulties contribute to the ad-hoc flavor of available methods as well as the relatively small body of work focused on inference. Using the most complete dataset on surfing competitions, we take a four-step approach to present the material: First, we define ordered data as a set of objects endowed with a strict order relation (i.e. a permutation) and discuss the various ways to represent ordered data mathematically and as a data structure. Second, we construct the main methods/models from the ground up and implement them using the surf competition data. Third, we present a simplified and portable probability model on permutations and demonstrate its effectiveness by identifying empirical distribution(s). Lastly, we present some of the hurdles encountered throughout this project that suggest areas for further collaboration, namely, how to leverage computational algebra systems (which I have used in this project) and how to visualize ordered data to effectively communicate insights. Instructor: Jojo Aboaf
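To make the ideas of "ordered data as a permutation" and "factorial scaling" concrete, here is a small sketch. It assumes each heat result is stored as a tuple of competitor labels in finishing order; the labels and results are invented for illustration and do not come from the workshop's surfing dataset. The Kendall tau distance shown is one standard way to compare two orderings, not necessarily the model presented in the workshop.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def rank_vector(ordering, items):
    """Finishing position (1 = first) of each item, listed in the order of `items`."""
    position = {item: place for place, item in enumerate(ordering, start=1)}
    return tuple(position[item] for item in items)


def empirical_distribution(orderings):
    """Relative frequency of each distinct ordering among the observations."""
    counts = Counter(orderings)
    total = sum(counts.values())
    return {ordering: count / total for ordering, count in counts.items()}


def kendall_tau_distance(a, b):
    """Number of item pairs that orderings `a` and `b` rank in opposite order."""
    position_in_b = {item: index for index, item in enumerate(b)}
    mapped = [position_in_b[item] for item in a]
    return sum(
        1
        for i in range(len(mapped))
        for j in range(i + 1, len(mapped))
        if mapped[i] > mapped[j]
    )


# Hypothetical competitors and heat results (finishing order, winner first).
surfers = ("A", "B", "C")
heats = [("A", "B", "C"), ("B", "A", "C"), ("A", "B", "C"), ("C", "A", "B")]

distribution = empirical_distribution(heats)
all_orderings = list(permutations(surfers))  # every possible outcome for 3 surfers
```

With three competitors there are only `factorial(3)` possible orderings, so enumerating them is easy. With ten competitors there are already `factorial(10)` orderings, more than 3.6 million, which is why methods that enumerate permutations become impractical quickly.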