Data Storage Best Practices BILL COREY RESEARCH DATA MANAGEMENT LIBRARIAN UVA LIBRARY RESEARCH DATA SERVICES AND SCIENCES
Welcome! Introduce yourself, and tell us what you would like to get out of this workshop.
Introduction Data Storage is a critical part of every research project. This workshop will explore digital data storage best practices, storage options available from UVa through ITS, options from 3rd parties, data sharing, and data security requirements. We will look at UVaBox, Box, Dropbox, Google Docs, Google Drive, Amazon AWS, and Microsoft OneDrive, SharePoint, and Azure. Let us start with what data is/are, followed by a look at the history of data storage. We will visit a few websites that show the timeline for several different data types.
What is/are Data? data noun, plural in form but singular or plural in construction, often attributive da· ta \ˈdā-tə, ˈda- also ˈdä- \ Definition of data 1 : factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation 2 : information in digital form that can be transmitted or processed 3 : information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful https://www.merriam-webster.com/dictionary/data
Data Types Data generally fall into 5 categories: Observational: Captured in real time. Cannot be reproduced or recaptured. Sometimes called 'unique data'. Examples include sensor data, human observation, and survey results. Experimental: Data from lab equipment and under controlled conditions. Usually reproducible, but expensive to do so. Examples include gene sequences, chromatograms, spectroscopy. Simulation: Data generated from test models studying actual or theoretical systems. Models and metadata where the input may be of greater importance than the output. Examples include climate models, economic models, systems engineering. Derived or Compiled: The results of data analysis, or aggregated from multiple sources. Reproducible, but very expensive. Examples include text and data mining, compiled databases, 3D models. Reference or Canonical: Fixed or organic collection datasets, usually peer-reviewed, and often published and curated. Examples include gene sequence databanks, census data, chemical structures.
Data File Formats A file format is the structure of how information is stored in a digital file. There are literally hundreds of file formats. Many are obsolete or proprietary. You want to use ones you are comfortable with, or that are required by a specific instrument, process, or software program. Non-proprietary, standards-based, and open formats are preferred for data sharing and data preservation. Unencrypted and uncompressed formats are recommended for images, video, and audio files. Proprietary formats are sometimes required to preserve the richness of data. When you are working with data on a project, it makes sense to use the formats that work best with the software and instruments you need. Therefore, active storage should be format agnostic. When you have finished working with a data file, you should think about the correct format for long-term storage. Format obsolescence is a very real and serious problem. Data migration to newer formats and fresh media should be part of your data storage strategy.
Recommended Digital Data Formats Text, Documentation, Scripts: XML, PDF/A, HTML, Plain Text. Still Image: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG (digital negative), BMP, GIF. Geospatial: Shapefile (SHP, DBF, SHX), GeoTIFF, NetCDF. Graphic Image, raster formats: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG, BMP, GIF. Graphic Image, vector formats: Scalable Vector Graphics, AutoCAD Drawing Interchange Format, Encapsulated PostScript, Shapefiles. Graphic Image, cartographic: Most complete data, GeoTIFF, GeoPDF, GeoJPEG2000, Shapefile. Audio: WAVE, AIFF, MP3, MXF, FLAC. Video: MOV, MPEG-4, AVI, MXF. Database: XML, CSV, TAB. http://www.loc.gov/preservation/resources/rfs/index.html
History of Data Storage Devices
Data Format Timeline
Audio Format Timeline
Audio Formats for Music
Digital Data & Digital Media Digital data is stored on three types of media. Magnetic: Magnetic media include the hard drive on your laptop, external hard drives, network environments and servers, and magnetic tape (reel-to-reel and cartridge). Optical: Optical media include Compact Discs (CD, CD-ROM, CD-R, CD-RW), Digital Versatile Discs (DVD, DVD+R, DVD-R, DVD-RAM, DVD+RW, DVD-RW), Write-Once, Read-Many (WORM) disks, High-definition DVD (Blu-ray & HD-DVD), Smart Cards, and Optical tape. Solid State: Flash memory cards, USB Flash drives, and Solid State Hard Drives.
Digital Media Lifespan What is the lifespan of different types of digital media? It depends on many factors: original quality; storage conditions (temperature, humidity, air, moisture, light); age and handling; and frequency of access. Manufacturers will tell you that an HDD will last 30 years. Real-world usage shows only 3-5 years. Flash storage should last 5-10 years, but high usage (read-write cycles) will shorten that considerably. CDs and DVDs have a shelf life of 5-10 years unrecorded, but only 2-5 years recorded. Manufacturers said they would last 100 years! Magnetic tape will last 10-20 years under ideal conditions (stable temperature and humidity). The other part of the problem: media obsolescence. Can you access the older media with current technology? Do you have access to the older technology?
Best Practices: Data Storage Data Storage can seem pretty simple. Just store the data on your hard drive or in the Cloud, right? You should ask yourself some questions about the data before you start making decisions about storage. How important is the data? Do I need to keep this data? Can the data be reproduced, or is it unique? How long do I want or need to keep the data? How fast do I need to access the data? How secure do I need to keep the data? Do other people need to access the data? What institutional or funder requirements need to be adhered to?
Best Practices: First Steps Locate all of your data files that you want to store. Decide what you need to keep. Create a directory identifying your data files by name, format, size, and file location. Keep it current! Label the data containers, and the media in the containers. What happens if they get separated? Locate and/or create supplemental documentation (metadata) for each data file. Include variable names, descriptions, units, standards, instrument calibrations, codes, algorithms used to transform the data, software (including version and OS). What do you need to be able to use this data file if you do not remember? Organize the files. Document your organization methodology. Use it consistently.
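The directory of data files described above (name, format, size, and location) can be generated rather than typed by hand. The following is a minimal sketch, not a prescribed tool: the function name build_inventory, the CSV column names, and the choice of CSV as the output format are all assumptions made for illustration.

```python
import csv
import os
from pathlib import Path


def build_inventory(data_root, inventory_csv):
    """Walk data_root and record name, format, size, and location for every file.

    The column names and the CSV output are illustrative choices, not a standard.
    """
    rows = []
    for folder, _subfolders, filenames in os.walk(data_root):
        for filename in filenames:
            path = Path(folder) / filename
            rows.append({
                "name": path.name,
                # Use the file extension as a rough stand-in for the format.
                "format": path.suffix.lstrip(".").lower() or "unknown",
                "size_bytes": path.stat().st_size,
                "location": str(path.resolve()),
            })
    # Sort so that repeated runs produce the same row order.
    rows.sort(key=lambda row: row["location"])
    fieldnames = ["name", "format", "size_bytes", "location"]
    with open(inventory_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Re-running the function on a schedule is one way to keep the directory current. Write the CSV outside the data folder if you do not want the inventory to list itself.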
Best Practices: File Naming Be consistent. Use descriptive names. Keep names short (32 characters max); use CamelCase. Try to include time. Write dates as YYYYMMDD (this creates chronological order). Use version numbers. Don’t use special characters [*& ! ]. Don’t use spaces; use “-” or “_”. Don’t change default extensions. Identify different versions clearly. Add zeroes in front for large data sets (000001 instead of 01 if you expect 10000 images). http://www.phdcomics.com/comics/archive.php?comicid=1531
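The naming rules above can be applied mechanically. The sketch below is one possible reading of them, under stated assumptions: the function make_file_name, the field order (date, description, version, sequence), the underscore separator, and the decision to count the 32-character limit without the extension are all illustrative choices, not part of the original guidance.

```python
import re
from datetime import date


def make_file_name(description, created, version, sequence, expected_count, extension):
    """Build a name such as 20190307_SoilCore_v02_00017.tif (hypothetical layout)."""
    # Keep only letters and digits, then join the words in CamelCase.
    words = re.findall(r"[A-Za-z0-9]+", description)
    camel = "".join(word[:1].upper() + word[1:] for word in words)
    # YYYYMMDD sorts chronologically when file names are sorted as text.
    stamp = created.strftime("%Y%m%d")
    # Pad the sequence number to the width of the largest expected number.
    width = len(str(expected_count))
    name = f"{stamp}_{camel}_v{version:02d}_{sequence:0{width}d}"
    if len(name) > 32:
        raise ValueError(f"name is {len(name)} characters; shorten the description")
    return f"{name}.{extension.lstrip('.')}"


print(make_file_name("soil core", date(2019, 3, 7), 2, 17, 10000, "tif"))
# prints 20190307_SoilCore_v02_00017.tif
```

Putting the date first makes an alphabetical listing chronological; putting the description first would group files by subject instead. Either works as long as it is used consistently.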
Best Practices: File Organization Name folders for major functions/activities. Structure by date or event (especially subfolders). Names should be self-explanatory. Avoid duplication. Make it simple & consistent.
Best Practices: Primary Data Never work with your primary data file! Always make a copy to work with. Your computer hard drive or working environment should only store your current working data file. Your primary (master or raw) data file should be stored in a safe environment, and backed up. Disasters and accidents do happen: hardware failures, software problems, virus infections, corrupted data files, power failures, hacking, stolen computers, human error, natural disasters, and media degradation. Keep the original file as a read-only file. Give it a file name that can be used as the first part of all subsequent files related to it.
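The two habits above, working on a copy and keeping the original read-only, can be scripted. This is a minimal sketch under assumptions: the function name protect_and_copy and the "_working" suffix are hypothetical, and the read-only flag only guards against accidental edits. It is not a substitute for backups or access control.

```python
import shutil
import stat
from pathlib import Path


def protect_and_copy(primary_file, working_dir):
    """Create a writable working copy, then mark the primary file read-only."""
    primary = Path(primary_file)
    working_dir = Path(working_dir)
    working_dir.mkdir(parents=True, exist_ok=True)
    # The primary's name becomes the first part of the working copy's name.
    working = working_dir / f"{primary.stem}_working{primary.suffix}"
    if working.exists():
        raise FileExistsError(f"{working} already exists; not overwriting it")
    # copyfile copies contents only, so the copy does not inherit read-only bits.
    shutil.copyfile(primary, working)
    # Remove write permission from the primary for owner, group, and others.
    primary.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return working
```

On Windows, chmod only toggles the read-only attribute, which is sufficient for this purpose.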
Best Practices: 3-2-1 Rule & Threat Zones Data backups are very important. Follow the 3-2-1 Rule: Keep three copies of any important data files – a primary and two backups. Keep two copies on different digital media – an HDD and a Flash drive. Keep one copy offsite, or at least offline. This is sometimes called Here-Near-Far: working copy Here, primary backup Near, and second backup Far. What is a threat zone? A different geographical location from the one you are working in. If a natural disaster occurred in Virginia, and all of your data files are here, then you will probably lose everything. Put your primary, or primary backup, on a medium that is secure, and send it to a friend or colleague in another state for safe keeping. Can the Cloud be considered a different zone? Yes, if the servers aren’t in the same zone you are in. Some cloud storage providers allow you to specify where your data is physically stored.
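The 3-2-1 Rule is simple enough to check against a written list of where your copies live. The sketch below shows one way to do that, assuming you describe each copy yourself; the Copy record, its field names, and the check_3_2_1 function are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    label: str             # e.g. "working copy", "primary backup"
    medium: str            # e.g. "hdd", "flash", "tape", "cloud"
    offsite: bool          # stored in a different threat zone
    offline: bool = False  # disconnected from any computer or network


def check_3_2_1(copies):
    """Return a list of 3-2-1 Rule violations; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({copy.medium for copy in copies}) < 2:
        problems.append("fewer than two different media")
    if not any(copy.offsite or copy.offline for copy in copies):
        problems.append("no offsite or offline copy")
    return problems


plan = [
    Copy("working copy", "hdd", offsite=False),
    Copy("primary backup", "flash", offsite=False),
    Copy("second backup", "cloud", offsite=True),
]
print(check_3_2_1(plan))  # prints []
```

Whether a cloud copy counts as offsite depends on where the provider's servers are, so the offsite flag is a judgment you record, not something the code can verify.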
Best Practices: Backups UVa has a license for CrashPlan, a cloud-enhanced desktop backup service from Code42. It is available to staff, faculty, instructors, and degree-seeking graduate students. It is only to be used with public and moderately sensitive data. The CrashPlan FAQs provide information including getting started and backup-related sections. CrashPlan has an extensive knowledge base available with guides, troubleshooting, and configuration help. It is available for both Windows and Mac machines. UVa had made CrashPlan available to home users with a 25% off deal with Code42. However, as of 10/23, CrashPlan Home is no longer available. They recommend Carbonite or CrashPlan for Small Business. ITS also provides backup for the servers it manages. There are many providers of backup services available, both online and software-based. Automatic backups are a great option for ensuring that your data is always protected. Acronis, Paragon, and StorageCraft are some of the better companies in this space.
Best Practices: Backup Validation Manually check 5-10% of data files yearly. Is the file collection complete? Compare it to your directory of data files. Did the files transfer properly? Do a bit-by-bit comparison of random files. Use an MD5 checksum (hash). The UK Data Archive has a good exercise that will help you understand checksums. Write-once media validation. Did you create a validation hash when you created optical disks? If so, compare. Volume and directory validation. Check the media for directory or volume corruption. Storage media integrity. Run a media scan on your HDD for bad sectors. Visual inspection. Does the media look ok?
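The checksum comparison described above can be done with Python's standard hashlib module. This is a minimal sketch, assuming the original and the backup are both reachable as file paths; the function names md5_of_file, copies_match, and pick_sample are hypothetical, and the 5% default simply mirrors the low end of the 5-10% suggestion.

```python
import hashlib
import math
import random


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, backup):
    """True when both files have the same MD5 checksum."""
    return md5_of_file(original) == md5_of_file(backup)


def pick_sample(paths, fraction=0.05, seed=None):
    """Pick a random subset of paths to validate (at least one, if any exist)."""
    paths = list(paths)
    count = min(len(paths), max(1, math.ceil(len(paths) * fraction)))
    return random.Random(seed).sample(paths, count)
```

MD5 is adequate for detecting accidental corruption, which is the goal here. If you also need to guard against deliberate tampering, hashlib.sha256 can be swapped in with the same calling pattern.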
Best Practices: Network Security & Access Control Network security: Keep confidential or highly sensitive data off computers or servers connected to the internet. Physical security: Control access to buildings and rooms. Encryption: Provides protection by scrambling data, so only the owner of the key or password can read the data. This protects the confidentiality of the data so that if an unauthorized person gained access to the storage device or service, they would be unable to see the data. It also protects the integrity of the data so that it cannot be tampered with without the owner knowing it. Computer systems & files: Use strong passwords on files and systems; use virus protection (updated continuously and running!). Do not send personal or confidential data via email or FTP. Transmit as encrypted data and require data access agreements or confidentiality agreements from recipients. VPN: A VPN scrambles data as it is transmitted between your mobile device and a server. This allows you to access sensitive data securely stored on a remote server. UVa offers several levels of VPNs. Some are designed to handle secure data.
Best Practices: Common Sense Tips Desktops, tablets, laptops, and phones should NOT be used for storage of your raw, original, or only copy of data. Removable media (USB sticks, Flash cards, memory cards, CDs, DVDs, cassettes, DATs, portable external HDDs) should NOT be used for storage of your raw, original, or only copies of data. All removable media are subject to degradation and failure. It will happen. Removable media are all inherently vulnerable to temperature and humidity fluctuations, poor handling, air, moisture, light conditions, theft, mechanical breakdowns, and forgetfulness. Manage your stored data. Visit often, at least once a year. Migrate your data to new media on a pre-set schedule. Migrate to newer formats when possible. Migrate to newer software if possible. Always verify data consistency. Keep your directory up to date. Keep a current copy with your data storage options.
UVa Data Storage Options - Research Research Data Storage Visit the CADRE site for information about the various options available at UVa. They manage the Rivanna and HPC resources. https://cadre.virginia.edu/service-detail/storage
UVa Data Storage Options - Enterprise Enterprise Data Storage Visit the CADRE site for information about the various options available at UVa. They manage the Rivanna and HPC resources. https://cadre.virginia.edu/service-detail/storage
UVa Data Storage Options - ITS UVa’s ITS provides three levels of storage. All are fee-based. Standard: Standard Storage with No Data Recovery (Tier IV), Standard Storage with Data Recovery (Tier III), and Standard Storage with Data Recovery (Tier III) and Backups. Premium: NAS High-performance storage. Premium Storage with No Data Recovery (Tier III), Premium Storage with Data Recovery (Tier I), Premium Storage with Data Recovery (Tier I) & Backups, Windows Environment Premium Storage (ES1, ES3). Value: Self-managed for sensitive & non-sensitive UVa data. Academic Value Storage, and Research Value NAS Storage (researchers only). They provide RESSCU service to store mission-critical electronic data at a remote storage facility in Blacksburg, VA. This is a co-owned (with Virginia Tech) Hierarchical Storage Manager (HSM).
UVa Data Storage Options - Personal Personal computer; UVa Box (1 TB storage); UVaCollab (while not intended for data storage, it provides 4 GB of storage and is secure, behind NetBadge); Home Directory Service (4 GB storage); external hard drives (HDD), Flash, and USB drives; Amazon Web Services; OneDrive; SharePoint Online; Dropbox; Google Drive.
UVa Data Storage Options – UVa Box UVa Box is a free cloud-based storage and collaboration service available to all eligible UVa students, faculty, and staff. You can store up to 1 TB of non-sensitive to moderately sensitive data in it. It is accessible from anywhere you have internet access. You can integrate applications into Box. UVa Box is the same as Box – UVa has a signed agreement about data (which is why the medical folks don’t have access to it). But a non-UVa Box user can interact with a UVa Box user. The UVa Box FAQs can answer many of your questions. How to use UVa Box includes instructions for working with files, sharing & collaborating, and some of the Box features. The UVa Box User Responsibilities & Data Restrictions includes the information you need to understand what types of data you can put in your UVa Box account. The Box Community is a great place to learn about Box from the Knowledge Base and other users.
UVa Data Storage Options – UVa Collab UVa Collab is the LMS – Learning Management System – at UVa. It is the University of Virginia’s central online environment for teaching, learning, collaboration, and research. UVACollab partners with faculty, staff, and students in the work that sustains the Academical Village: engaging in interactive discussions, joining virtual meetings, securely storing and sharing materials, and much more. Each site has a storage capacity of 4 GB. You can have multiple sites. It is secure (behind NetBadge) and you can collaborate with colleagues at non-UVa institutions. It is not designed to be used as a storage option, but 4 GB is sufficient for a lot of folks. Great for smaller projects! You can link to external sites and resources, and it has several integrated tools including 3rd party ones - a WordPress Blog and Confluence Wiki. The Kaltura Media Gallery is a good tool if you work with images and videos. UVaCollab has an extensive Knowledge Base available when you need assistance.
UVa Data Storage Options – Home Directory UVa Home Directory provides: Convenient online file storage - 4 Gigabytes (GB). Backup copies of your documents - within certain time limits. Easy Web publishing - publish and manage webpages. Your personal webpage URL is http://people.virginia.edu/ Your UVA Computing ID. Easy access using a software program on your computer or via the Internet from any computer connected to the UVA network. There are FAQs for Windows users and Mac users. Files stored in Home Directory are backed up daily. File Recovery information can help you recover from a loss. Mapping a drive in Windows or Mac OS X is a useful feature that allows you to direct-connect to transfer files to and from your desktop.
UVa Data Storage Options – SharePoint Online SharePoint Online is part of the Office 365 suite of tools from Microsoft. ITS also offers Groups & Teams which include a limited version of SharePoint. The Jumpstart Guide for Microsoft Teams is a good place to start if you are interested. ITS also has information about the differences between the two products. The Microsoft SharePoint Online website provides a lot of information about this tool. SharePoint Online is great for: content management; customization of HTML/CSS to change look and feel; strong use of permissions and inheritance in a multi-site layout; and site templates other than the default (Publishing site, Communication Site). SharePoint Online is accessible from any internet connection, and the permissions allow great granularity to protect your data. It is integrated with the other Office 365 applications.
UVa Data Storage Options – External Drives There are many types of external hard drives available. Remember that if you are storing UVa-related data (funded project, UVa data, etc.) on one, you need to follow the UVa Data Protection Standards (3.0): https://security.virginia.edu/university-data-protection-standards
UVa Data Storage Options – AWS Amazon Web Services provides a suite of tools to do everything with your data. UVa has a signed agreement with DLT Solutions, a 3rd party provider who is a Premier Consulting and Managed Services Partner for AWS. You will need an AWS account set up through ITS to access their services. You will then be able to access the DLT Portal. DLT has conducted several training sessions here in the last 2 years, and I have copies of the content I can share by request. AWS provides a long-term storage option – Glacier – which is inexpensive, and would be a good option for data you want to park. There is a local AWS User Group at UVA. This group is dedicated to connecting UVA students, alumni, and the residents of Charlottesville. Founded and organized by UVA alumni and students, they explore all aspects of working with AWS. Learn about new services and features, hear from developers who are using services in new and exciting ways, learn how to build an engineering career using AWS, and enjoy the company of others who are eager to share experiences.
UVa Data Storage Options – OneDrive OneDrive is the Office 365 cloud storage service, available under UVa's agreement with Microsoft to UVa staff and faculty who use Office 365. Every user has 5 TB of storage available. It is integrated with Microsoft Office Online and many of the desktop applications. It is intended for University-related activities. You can get a personal account if you need one (or three). Users have access to assistance and training on the UVa Office 365 Resource Portal. There is an extensive list of FAQs and a page on File Sharing: Internal vs. External. The User Responsibilities & Data Restrictions page will provide you with the information you need to know about which data files you can store on OneDrive.
UVa Data Storage Options – Azure UVa does not currently have an agreement with Microsoft for the Azure Computing Platform. Azure is a set of cloud services similar to the AWS system.
Research Data Services Sciences I am part of the Research Data Services Sciences unit of the UVa Library. Our website has information on all of our services. The Research Data Management section includes a link to my Research Data Management Subject Guide. We offer an assortment of workshops each semester. We also have an archive of downloadable content from previous workshops. We are part of the CADRE group.
Thanks for attending! Do you have additional questions? Feel free to contact me at [email protected] to set up a consultation.