{"id":1276,"date":"2014-06-19T11:39:29","date_gmt":"2014-06-19T11:39:29","guid":{"rendered":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/?p=1276"},"modified":"2014-06-19T11:39:29","modified_gmt":"2014-06-19T11:39:29","slug":"open-data-repository-file-size-analysis","status":"publish","type":"post","link":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/2014\/06\/19\/open-data-repository-file-size-analysis\/","title":{"rendered":"Open data repository &#8211; file size analysis"},"content":{"rendered":"<p>The University of Edinburgh&#8217;s open data sharing repository, <a href=\"http:\/\/datashare.is.ed.ac.uk\/\">DataShare<\/a>, has been running since 2009. \u00a0During this time, over 125 items of research data have been published online. This blog post provides a quick overview of the number, extent, and distribution of file sizes and file types held in the repository.<\/p>\n<p>First, some high-level statistics (as at March 2014):<\/p>\n<ul>\n<li>Number of items: 125<\/li>\n<li>Total number of files: 1946<\/li>\n<li>Mean number of files per item: 16<\/li>\n<li>Total disk space used by files: 76GB (0.074TB)<\/li>\n<\/ul>\n<p>DataShare uses the open source <a href=\"http:\/\/dspace.org\/\">DSpace<\/a> repository platform. As well as storing the raw data files that are uploaded, it creates derivative files such as thumbnails of images, and plain text versions of text documents such as PDF or Word files, which are then used for full-text indexing. 
\u00a0Of the files held within DataShare, about 80% are the original files, and 20% are derived files (including, for example, licence attachments).<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-1277\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/files\/2014\/06\/filetypes.png\" alt=\"filetypes\" width=\"562\" height=\"326\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/files\/2014\/06\/filetypes.png 562w, https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/files\/2014\/06\/filetypes-300x174.png 300w\" sizes=\"(max-width: 562px) 100vw, 562px\" \/><\/p>\n<p>When considering capacity planning for repositories, it is useful to look at the likely size of files that may be uploaded. \u00a0Often with research data, the assumption is that the file size will be quite large. \u00a0Sometimes this can be true, but the next graph shows the distribution of files by file size. \u00a0The largest proportion of files are under 1\/10th of a megabyte (100KB). \u00a0Ignoring these small files, there is a normal distribution peaking at about 100MB. \u00a0The largest files are nearer to 2GB, but there are very few of these.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-1279\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/files\/2014\/06\/filesizes.png\" alt=\"filesizes\" width=\"567\" height=\"411\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/files\/2014\/06\/filesizes.png 567w, https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/files\/2014\/06\/filesizes-300x217.png 300w\" sizes=\"(max-width: 567px) 100vw, 567px\" \/><\/p>\n<p>Finally, it is interesting to look at the file formats\u00a0stored in the repository. \u00a0Unsurprisingly, the largest number of files are plain text, followed by a number of Wave Audio files (from the <a href=\"http:\/\/datashare.is.ed.ac.uk\/handle\/10283\/155\">Dinka Songs collection<\/a>). 
\u00a0Other common file formats\u00a0include XML files, ZIP files, and JPEG images.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-1280\" src=\"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/files\/2014\/06\/fileformats.png\" alt=\"fileformats\" width=\"606\" height=\"428\" srcset=\"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/files\/2014\/06\/fileformats.png 606w, https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/files\/2014\/06\/fileformats-300x212.png 300w\" sizes=\"(max-width: 606px) 100vw, 606px\" \/><\/p>\n<p>Stuart Lewis<br \/>\nHead of Research and Learning Services, Library &amp; University Collections<\/p>\n<p>Data provided by the DataShare team.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The University of Edinburgh&#8217;s open data sharing repository, DataShare, has been running since 2009. \u00a0During this time, over 125 items of research data have been published online. This blog post provides a quick overview of the number, extent, and &hellip; <a href=\"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/2014\/06\/19\/open-data-repository-file-size-analysis\/\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":175,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false},"categories":[10],"tags":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/wp-json\/wp\/v2\/posts\/1276"}],"collection":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/wp-json\/wp\/v2\/users\/175"}],"replies":[{"embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/wp-json\/wp\/v2\/comments?post=1276"}],"version-history":[{"count":0,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/wp-json\/wp\/v2\/posts\/1276\/revisions"}],"wp:attachment":[{"href":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/wp-json\/wp\/v2\/media?parent=1276"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/wp-json\/wp\/v2\/categories?post=1276"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/libraryblogs.is.ed.ac.uk\/datablog\/wp-json\/wp\/v2\/tags?post=1276"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}