Introduction: Data Wrangling
Data "wrangling" (or cleaning) may be required at different stages of the Research Data Management Lifecycle.
This process identifies and corrects errors or makes formatting more consistent.
It is often required to prepare data for analysis and/or visualization.
Data also needs to be cleaned before archiving to ensure it is preserved correctly, is not misinterpreted by other users, and is interoperable (one of the FAIR Data Principles).
See the list of Resources for some useful tools and tutorials.
Making It Easier to Re-Use Your Data
White et al. (2013) published an excellent paper, "Nine simple ways to make it easier to (re)use your data", in Ideas in Ecology and Evolution. The authors noted that much of the data shared in ecology and evolutionary biology is not easily reused because it does not follow best practices for data structure, metadata, and licences. Their nine specific recommendations are:
- Share your data.
- Provide metadata.
- Provide an unprocessed form of the data.
- Use standard data formats.
- Use good null values.
- Make it easy to combine your data with other datasets.
- Perform basic quality control.
- Use an established repository.
- Use an established and liberal license.
Their advice is on point and highly readable - and it includes some specifics on data wrangling.
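The "use good null values" recommendation can be illustrated with a short pandas sketch. The file contents and sentinel values below are invented for illustration: the point is to declare every missing-value convention when reading the data, rather than leaving magic numbers like -999 for later analysts to trip over.

```python
# Hypothetical example: a CSV where missing values were entered
# inconsistently ("-999", "N/A", and empty cells). Mapping them all
# to a single missing marker (NaN) at read time keeps analysis honest.
import io
import pandas as pd

raw = io.StringIO(
    "site,temp_c\n"
    "A,12.5\n"
    "B,-999\n"
    "C,N/A\n"
    "D,\n"
)

# Declare every null convention up front instead of leaving
# sentinel values buried in the data.
df = pd.read_csv(raw, na_values=["-999", "N/A"])
print(df["temp_c"].isna().sum())  # three rows are now recognised as missing
```

White et al. make the same point in reverse: a reader who does not know that -999 means "missing" will happily average it into a temperature column.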
Source: White, E., Baldridge, E., Brym, Z., Locey, K., McGlinn, D. & Supp, S. (2013) Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution, 6(2):1-10. doi:10.4033/iee.2013.6b.6.f
Data Organization in Spreadsheets for Ecologists
Data Carpentry lessons are designed for workshops but can also be used for self-guided learning. This lesson covers data organization, including data entry, common formatting mistakes, handling dates, and basic quality control and data manipulation. Other domain-specific lessons are available or in development.
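As a taste of the lesson's advice on dates: spreadsheet programs silently reinterpret ambiguous formats, so storing dates as ISO 8601 text (YYYY-MM-DD) keeps them unambiguous. A small Python sketch (input formats and values invented for illustration) that normalises mixed date entries:

```python
# Normalise inconsistently formatted dates to ISO 8601 text.
# Note the ambiguity this guards against: "3/4/2013" could mean
# March 4 or April 3 depending on locale; here we assume US order.
from datetime import datetime

messy = ["3/4/2013", "2013-03-04", "04 Mar 2013"]
formats = ["%m/%d/%Y", "%Y-%m-%d", "%d %b %Y"]  # the formats we expect to see

def to_iso(value):
    """Try each known input format and normalise to ISO 8601."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {value}")

print([to_iso(d) for d in messy])  # → ['2013-03-04', '2013-03-04', '2013-03-04']
```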
Data Quality Control and Assurance (DataONE)
A simple one-pager with some great advice on ensuring data quality.
Data wrangling with R and RStudio (webinar)
Before an R program can look for answers, your data must be cleaned up and converted to a form that makes information accessible. In this webinar, you will learn how to use the `dplyr` and `tidyr` packages to optimise the data wrangling process (53 mins.)
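The webinar works in R; for readers who prefer Python, the same filter/mutate/group-by/summarise pipeline can be sketched in pandas (the column names and values below are invented, not taken from the webinar):

```python
# A pandas analogue of a dplyr pipeline: filter rows, derive a
# column, then summarise by group.
import pandas as pd

surveys = pd.DataFrame({
    "species": ["DM", "DM", "PF", "PF", "DM"],
    "weight_g": [42, 45, 7, 12, 41],
})

result = (
    surveys[surveys["weight_g"] > 10]                   # dplyr::filter
    .assign(weight_kg=lambda d: d["weight_g"] / 1000)   # dplyr::mutate
    .groupby("species")["weight_kg"]                    # dplyr::group_by
    .mean()                                             # dplyr::summarise
)
print(result)  # mean weight in kg per species, light records dropped
```

The chained style mirrors R's `%>%` pipelines: each step takes a tidy table and returns one, which is what makes the wrangling process composable.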
Introduction to OpenRefine
Developed by Owen Stephens on behalf of the British Library, this is a great place to start learning about this amazing tool.
OpenRefine for Ecology Data
This Data Carpentry lesson teaches researchers how to use OpenRefine to clean and format data effectively and to track changes automatically. Librarians might appreciate: https://librarycarpentry.github.io/lc-open-refine/
This free, self-paced course covers the foundations of OpenRefine and its scripting language GREL. Lessons include transformations and fuzzy matching for quick but powerful data cleaning, complex transformations in GREL, calling APIs and parsing results, and more.
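To give a flavour of what OpenRefine's clustering does, here is a rough Python sketch of its "fingerprint" key-collision idea (species names invented): near-duplicate strings are normalised to a common key, so inconsistent entries fall into the same cluster.

```python
# Crude stdlib approximation of OpenRefine's fingerprint clustering:
# lower-case and collapse whitespace so trivially different spellings
# of the same value collide on one key.
names = [
    "Acacia dealbata",
    "acacia dealbata ",
    "Acacia  dealbata",
    "Eucalyptus regnans",
]

def fingerprint(s):
    """Normalise a value: lower-case, strip, collapse internal whitespace."""
    return " ".join(s.lower().split())

clusters = {}
for n in names:
    clusters.setdefault(fingerprint(n), []).append(n)

print({k: len(v) for k, v in clusters.items()})
```

OpenRefine's real fingerprint method also strips punctuation and sorts tokens, and its fuzzy variants go further (e.g. phonetic keys and nearest-neighbour matching), but the collide-on-a-key principle is the same.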
Quartz bad data guide
An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.
Using a spreadsheet to clean up a dataset
A Data Wrangling Handbook recipe created for the School of Data by Tactical Technology Collective. It looks at six common ways that a dataset is "dirty" and how to clean them.