Data Redundancy, Elimination, and Compression Tool (DREC)
Technologies
Python 3, DROID
Source code

The Data Redundancy, Elimination, and Compression Tool (DREC) was developed as part of the SHFT group's 2021 summer research that analyzed a collection of interactive projects from Carnegie Mellon's Entertainment Technology Center (ETC). During the analysis, it became apparent that a lot of the dataset was redundant. One source of redundancy was the lack of lossless compression for raw video files. Another involved the wholesale copying of project boilerplate code and dependencies between projects.

DREC was research associate Ethan Wolfe's attempt to streamline the data backups by automating the analysis of project folders. First, DREC conducted a file-level analysis to determine project file types using the DROID identification tool. After that, the resulting metadata was stored in a local database to preserve the locations of the original files. While the tool is not complete at this time, the goal is to reduce the data footprint for archived projects by removing and linking redundant files across a collection. It would also support a suite of compression tools to automatically reduce (without loss or within a tolerance) the size of larger media files. For the ETC dataset, it is estimated that the tool could reduce the dataset footprint by close to a third (or around 4TB).