OpenRefine

OpenRefine

  • Home

›Operations

Getting Started

  • Index
  • Installing
  • Workspace Data
  • Records mode

Importing Data

  • CSV/TSV
  • JSON
  • XML

Operations

  • Transform
  • Add Column
  • Fill down
  • Blank down
  • Reconcile
  • Columnize by key/value

Facets

  • Text
  • Numeric
  • Scatterplot

Exporters

  • CSV/TSV
  • Google Sheets
  • ODS

GREL

  • cross
  • facetCount
  • toString

Columnize by key/value columns

This operation can be used to reshape a table which contains key and value columns, such that the repeating contents in the key column become new column names, and the contents of the value column are spread in the new columns. This operation can be invoked from any column menu, via Transpose → Columnize by key/value columns.

Overview

Consider the following table:

FieldData
NameGalanthus nivalis
ColorWhite
IUCN ID162168
NameNarcissus cyclamineus
ColorYellow
IUCN ID161899

In this format, each flower species is described by multiple attributes, which are spread on consecutive rows. In this example, the "Field" column contains the keys and the "Data" column contains the values. With this configuration, the Columnize by key/value columns operations transforms this table as follows:

NameColorIUCN ID
Galanthus nivalisWhite162168
Narcissus cyclamineusYellow161899

Entries with multiple values in the same column

If an entry has multiple values for a given key, then these values will be grouped on consecutive rows, to form a record structure.

For instance, flower species can have multiple colors:

FieldData
NameGalanthus nivalis
ColorWhite
ColorGreen
IUCN ID162168
NameNarcissus cyclamineus
ColorYellow
IUCN ID161899

This table is transformed by the operation as follows:

NameColorIUCN ID
Galanthus nivalisWhite162168
Green
Narcissus cyclamineusYellow161899

The first key encountered by the operation serves as the record key. The "Green" value is attached to the "Galanthus nivalis" name because it is the latest record key encountered by the operation as it scans the table. See the Row order section for more details about the influence of row order on the results of the operation.

Notes column

In addition to the key and value columns, a notes column can be used optionally. This can be used to store extra metadata associated to a key/value pair.

Consider the following example:

FieldDataSource
NameGalanthus nivalisIUCN
ColorWhiteContributed by Martha
IUCN ID162168
NameNarcissus cyclamineusLegacy
ColorYellow2009 survey
IUCN ID161899

If the "Source" column is selected as notes column, this table is transformed to:

NameColorIUCN IDSource : NameSource : Color
Galanthus nivalisWhite162168IUCNContributed by Martha
Narcissus cyclamineusYellow161899Legacy2009 survey

Notes columns can therefore be used to preserve provenance or other context about a particular key/value pair.

Extra columns

If the table contains extra columns, which are not used as key, value or notes columns, they can be preserved by the operation. For this to work, they must have the same value in all old rows corresponding to a new row.

Consider for instance the following table, where the "Field" and "Data" columns are used as key and value columns respectively, and the "Wikidata ID" column is not selected:

FieldDataWikidata ID
NameGalanthus nivalisQ109995
ColorWhiteQ109995
IUCN ID162168Q109995
NameNarcissus cyclamineusQ1727024
ColorYellowQ1727024
IUCN ID161899Q1727024

This will be transformed to

Wikidata IDNameColorIUCN ID
Q109995Galanthus nivalisWhite162168
Q1727024Narcissus cyclamineusYellow161899

If extra columns do not contain identical values for all old rows spanning an entry, this can be fixed beforehand by using the fill down operation.

Row order

In the absence of extra columns, it is important to note that the order in which the key/value pairs appear matters. Specifically, the operation will use the first key it encounters as the delimiter for entries: every time it encounters this key again, it will produce a new row and add the following other key/value pairs to that row.

Consider for instance the following table:

FieldData
NameGalanthus nivalis
ColorWhite
IUCN ID162168
NameCrinum variabile
NameNarcissus cyclamineus
ColorYellow
IUCN ID161899

The occurrences of the "Name" value in the "Field" column define the boundaries of the entries. Because there is no other row between the "Crinum variabile" and the "Narcissus cyclamineus" rows, the "Color" and "IUCN ID" columns for the "Crinum variabile" entry will be empty:

NameColorIUCN ID
Galanthus nivalisWhite162168
Crinum variabile
Narcissus cyclamineusYellow161899

This sensitivity to order is removed if there are extra columns: in that case, the first extra column will serve as root identifier for the entries.

Behaviour in records mode

In records mode, this operation behaves just like in rows mode, except that any facets applied to it will be interpreted in records mode.

← ReconcileText →
  • Overview
  • Entries with multiple values in the same column
  • Notes column
  • Extra columns
  • Row order
  • Behaviour in records mode
OpenRefine
Docs
Install OpenRefineExternal Resources
Community
Stack OverflowTwitter
More
BlogGitHubStar
Follow @OpenRefine
Copyright © 2020 OpenRefine contributors