Data curation and harmonisation


The Australian Data Archive (ADA) has developed a proposed workflow for the harmonisation of social survey data that takes account of the practical steps required to bring diverse content together in a machine-actionable way. This harmonisation workflow includes pre-processing of metadata consistent with established archival practices, and incorporates external registered, persistent content.

The workflow is oriented towards improving FAIR practices in the harmonisation process through the use of reusable, accessible metadata structures that can both improve processing consistency for current projects and be applied to future harmonisation projects.

Data and metadata need consistent pre-processing within repositories to reduce error handling in the harmonisation process. The ADA has developed an initial set of processing rules that will be implemented in CARAT.

Vocabulary publishing tool

The ADA has developed an internal process to partially automate the extraction of codelists from the ADA’s Colectica metadata registry and the preparation of these metadata for upload to ARDC Research Vocabularies Australia (RVA). As detailed in the user guide, sections of metadata can be selected and downloaded in .csv format. A Python script is then run to format the .csv data for compatibility with the RVA data editor, PoolParty.
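The formatting step described above can be pictured with a minimal sketch. This is not the ADA script: the input column names (`Value`, `Label`), the output column names (`notation`, `prefLabel`), and the function `reformat_codelist` are all hypothetical and do not reflect the actual Colectica export layout or the layout PoolParty expects. The sketch only shows the general shape of such a step: rename columns, trim whitespace, and drop rows that carry no code value.

```python
import csv
import io

# Hypothetical mapping from exported column names to target column names.
# Neither side reflects the real Colectica export or PoolParty import layout.
COLUMN_MAP = {"Value": "notation", "Label": "prefLabel"}


def reformat_codelist(src_text: str) -> str:
    """Rename columns, trim whitespace, and drop rows without a code value."""
    reader = csv.DictReader(io.StringIO(src_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(COLUMN_MAP.values()), lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Missing or empty cells are treated as empty strings before trimming.
        cleaned = {
            new: (row.get(old) or "").strip() for old, new in COLUMN_MAP.items()
        }
        if cleaned["notation"]:
            writer.writerow(cleaned)
    return out.getvalue()


if __name__ == "__main__":
    # Illustrative codelist only; not taken from an ADA dataset.
    exported = "Value,Label\n1, Employed \n2,Unemployed\n,\n"
    print(reformat_codelist(exported))
```

Keeping the column mapping in a single dictionary means the same function could be pointed at a different export layout by changing only that mapping.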

Vocabulary publishing GitHub repository

Data harmonisation tool


Aligning variables across datasets is often a manual and error-prone process. The ADA has developed variable harmonisation templates and script generators that harmonise variables across datasets and automate the integration of the harmonised data into integrated files.
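As a rough illustration of how a template-driven approach can work, the sketch below applies a small harmonisation template to two datasets and stacks the results. It is not the ADA's template format or generated script: the template fields (`dataset`, `source`, `target`, `recode`), the dataset and variable names, and the functions `harmonise` and `integrate` are all assumptions made for the example.

```python
# Hypothetical harmonisation template: one entry per source variable, naming
# the target variable and the recode from source codes to harmonised codes.
TEMPLATE = [
    {"dataset": "survey_a", "source": "emp", "target": "employment",
     "recode": {"1": "employed", "2": "unemployed"}},
    {"dataset": "survey_b", "source": "q5_work", "target": "employment",
     "recode": {"Y": "employed", "N": "unemployed"}},
]


def harmonise(dataset_name, rows, template=TEMPLATE):
    """Apply every template rule for one dataset to each of its rows."""
    rules = [rule for rule in template if rule["dataset"] == dataset_name]
    harmonised = []
    for row in rows:
        out = {"dataset": dataset_name}
        for rule in rules:
            raw = row.get(rule["source"])
            # Codes missing from the recode map become None so they can be
            # reviewed instead of being carried over unchanged.
            out[rule["target"]] = rule["recode"].get(raw)
        harmonised.append(out)
    return harmonised


def integrate(datasets, template=TEMPLATE):
    """Harmonise each dataset and stack the results into one list of rows."""
    combined = []
    for name, rows in datasets.items():
        combined.extend(harmonise(name, rows, template))
    return combined


if __name__ == "__main__":
    data = {
        "survey_a": [{"emp": "1"}],
        "survey_b": [{"q5_work": "N"}, {"q5_work": "X"}],
    }
    for record in integrate(data):
        print(record)
```

Holding the variable mappings in data rather than in code is what makes the process reusable: adding a dataset means adding template rows, not writing a new script.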

Data harmonisation GitHub repository


Curation and risk assessment tool

The ADA Curation and Data Risk Assessment Tool, developed as part of the associated IRISS project, is an R library and Shiny tool that guides users through assessing their own data against a series of metadata quality and privacy criteria in preparation for submission. It includes pop-up educational suggestions and definitions of data privacy risk, as well as example output that uses live synthesised data to reflect the de-identification options selected by the user.

Curation and Data Risk Assessment Tool GitHub repository