How to Approach a PDS4 Data Set

This page presents a basic description of what is included in a data set for those unfamiliar with the PDS4 archive structure. It also suggests a sequence of actions that new users may find helpful for getting familiar with the contents of a data set. You may work your way through a data set online or by downloading the entire collection.

Here is a typical sequence:

  1. Understanding XML in the Archive
  2. Understanding Archive Organization
  3. Inside the Archive
  4. Read the Collection Description and Documents
  5. Explore the Data and the Labels

1. Understanding XML in the Archive

One of the biggest changes in PDS4 is the use of eXtensible Markup Language (XML). XML is a markup language with a fixed set of syntax rules for structuring a document. Which elements a PDS4 document may contain, and how they nest, is defined by an XML Schema, so anyone who views the data has a standard reference for the document structure.

In the archive, XML is used in documents called 'labels', which describe the contents of one or more files. Labels store metadata and supplementary information about the files they reference. To ensure uniformity, every label references the XML Schema and also conforms to an explicit set of standards. The Standards Reference, which describes the standards for the labels, and the Data Dictionary, which describes the XML structure, can be found at the PDS4 documentation page.
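As a rough illustration of what reading a label involves, the Python sketch below parses a small, made-up label fragment with the standard library and pulls out two metadata fields. The fragment is a simplified assumption, not a complete or valid PDS4 label. The namespace URI and the Identification_Area, logical_identifier, and version_id element names are assumed here to match the PDS4 core schema; check them against the Data Dictionary before relying on them.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for the PDS4 core schema; verify it against your own labels.
PDS_NS = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}

# A heavily trimmed, illustrative fragment (not a complete PDS4 label).
SAMPLE_LABEL = """<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <Identification_Area>
    <logical_identifier>urn:nasa:pds:gbo-kpno:hyakutake_spectrum:offset_0_arcsec</logical_identifier>
    <version_id>1.0</version_id>
  </Identification_Area>
</Product_Observational>
"""


def read_identification(label_text):
    """Return (logical_identifier, version_id) from a label's XML text."""
    root = ET.fromstring(label_text)
    area = root.find("pds:Identification_Area", PDS_NS)
    if area is None:
        raise ValueError("label has no Identification_Area element")
    lid = area.findtext("pds:logical_identifier", namespaces=PDS_NS)
    vid = area.findtext("pds:version_id", namespaces=PDS_NS)
    return lid, vid


if __name__ == "__main__":
    lid, vid = read_identification(SAMPLE_LABEL)
    print(f"{lid}::{vid}")
```

For a label stored on disk, the same function could be given the text of the file, for example the result of reading the file with open(...).read().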


2. Understanding Archive Organization

Each data set is organized within an archive as a tree structure with a hierarchy of objects. 'Products' are the lowest-level objects in the archive; they include the data files and documents that are being stored. Groups of products are called 'collections', and they represent the next higher level of objects in the archive. Typically, the products inside a single collection are closely related. For example, there could be an observational data collection, a document collection, or a calibration data collection. The highest-level object that is paired with a label is the 'bundle', which is a group of collections. A small archive may have only a single bundle, while a larger archive may have multiple bundles separated in a convenient fashion. Note that when we refer to these objects, it is the XML label AND the files the label refers to that together make up an object.

Each of these objects is given a unique logical identifier (LID) and a version identifier (VID), both of which are stored in the associated label. Together they form a LIDVID, which typically looks like the following:

urn:nasa:pds:gbo-kpno:hyakutake_spectrum:offset_0_arcsec::1.0

The urn:nasa:pds fields are required in all PDS LIDs. The gbo-kpno portion is the bundle ID. The hyakutake_spectrum portion is the collection ID. The offset_0_arcsec portion is the product ID. Note that this is a product LID; a bundle or collection LID would simply stop at the bundle or collection portion, respectively. Finally, the ::1.0 appended at the end is the version identifier, which changes with each new version of the object.
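The breakdown above can be sketched in Python as follows. This is a minimal illustration that assumes the colon-separated layout described in this section; the LidVid type and the parse_lidvid function are hypothetical names introduced here, not part of any PDS tool.

```python
from typing import NamedTuple, Optional


class LidVid(NamedTuple):
    bundle_id: str
    collection_id: Optional[str]  # None for a bundle LID
    product_id: Optional[str]     # None for a bundle or collection LID
    version: Optional[str]        # None when no ::version is appended


def parse_lidvid(lidvid: str) -> LidVid:
    """Split a LID or LIDVID into the parts described in the text."""
    # The version, if present, follows a double colon.
    lid, _, version = lidvid.partition("::")
    fields = lid.split(":")
    if fields[:3] != ["urn", "nasa", "pds"]:
        raise ValueError(f"not a PDS LID: {lidvid!r}")
    ids = fields[3:]
    if not 1 <= len(ids) <= 3 or not all(ids):
        raise ValueError(f"expected 1 to 3 non-empty ID fields: {lidvid!r}")
    # Pad so that bundle and collection LIDs also yield three slots.
    padded = ids + [None] * (3 - len(ids))
    return LidVid(padded[0], padded[1], padded[2], version or None)


if __name__ == "__main__":
    print(parse_lidvid("urn:nasa:pds:gbo-kpno:hyakutake_spectrum:offset_0_arcsec::1.0"))
    print(parse_lidvid("urn:nasa:pds:gbo-kpno"))
```

The number of ID fields after urn:nasa:pds indicates the level of the object: one for a bundle, two for a collection, three for a product.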

If you would like to learn more, the Concepts Document is a good introduction to PDS4.


3. Inside the Archive

Now we can talk about what you would actually see if you were looking at a data set. At the top level of an archive you will see one or more folders, each of which is a different bundle in the archive. Each folder is named after the bundle ID of its bundle. Inside the bundle you will find a file called 'bundle.xml' and one or more folders. The XML file is the bundle label for this data set. The folders contain the different collections and are named after their collection IDs.

Inside one of the collections you will find four new files and several folders. The first file of note is 'collection.xml', which is the collection label. The next file of interest is 'collection_inventory.csv'. It contains a table of all the products in the collection and is required for PDS4 compliance. The remaining two files are 'collection_description.txt' and 'collection_description.xml': a short description of the contents of the collection and its corresponding label. The folders in the collection contain the products of the data set. Unlike the previous folders, these folders are NOT named after the product ID. Instead they have a short name describing their contents, such as 'data', 'documents', or some other similarly descriptive name.
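To give a feel for the inventory table, the Python sketch below reads a small, made-up inventory. It assumes each row holds a member-status flag (for example 'P') followed by a LID or LIDVID, with no header row; that layout is an assumption, so check the collection label for the actual field definitions of a given inventory. The second product in the sample is hypothetical.

```python
import csv
import io

# Illustrative inventory text; the row layout and the second entry are assumptions.
SAMPLE_INVENTORY = (
    "P,urn:nasa:pds:gbo-kpno:hyakutake_spectrum:offset_0_arcsec::1.0\r\n"
    "P,urn:nasa:pds:gbo-kpno:hyakutake_spectrum:offset_100_arcsec::1.0\r\n"
)


def read_inventory(text):
    """Return a list of (member_status, lidvid) tuples from inventory text."""
    members = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 fields, got {len(row)}: {row!r}")
        status, lidvid = (field.strip() for field in row)
        members.append((status, lidvid))
    return members


if __name__ == "__main__":
    for status, lidvid in read_inventory(SAMPLE_INVENTORY):
        print(status, lidvid)
```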

Finally, once inside one of the product-level folders, you will see the data or document files and their corresponding labels. The product ID and the label name are based on the file name of the data or document.
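Because a label's name is based on the name of the file it describes, the files in a product folder can be matched to their labels by name. The Python sketch below assumes that a label has the same base name as its data or document file and ends in '.xml'; individual archives may deviate from this, and the file names used are illustrative.

```python
from pathlib import PurePath


def pair_labels(filenames):
    """Group files under the label that shares their base name.

    Returns (paired, unlabeled): paired maps each label file name to the
    list of files with the same base name; unlabeled lists the rest.
    """
    # Map base name -> label file name for every .xml file.
    labels = {
        PurePath(name).stem: name
        for name in filenames
        if PurePath(name).suffix.lower() == ".xml"
    }
    paired = {label: [] for label in labels.values()}
    unlabeled = []
    for name in filenames:
        if PurePath(name).suffix.lower() == ".xml":
            continue  # labels themselves are the keys, not members
        label = labels.get(PurePath(name).stem)
        if label is None:
            unlabeled.append(name)
        else:
            paired[label].append(name)
    return paired, unlabeled


if __name__ == "__main__":
    names = ["offset_0_arcsec.tab", "offset_0_arcsec.xml", "notes.txt"]
    print(pair_labels(names))
```

To apply this to a real folder, the list of names could come from os.listdir on the product folder.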


4. Read the Collection Description and Documents

Now that you understand how the archives are organized, we can talk about what you should do. It is highly recommended that you begin by reading the collection description so that you gain a basic understanding of the collection you are looking at. The description is a simple text file and should be easy to view. Once you have read it, open the documents folder and look for useful documents. These include journal papers, manuals, and various other supplementary materials. If you have trouble viewing the documents, this page explains some of the ways you can view specific file types.


5. Explore the Data and the Labels

Now that you have a better background understanding of the products, you can begin to work with them directly. When approaching a product, it is recommended that you look first at the associated label, which will often contain valuable information, as will the collection and bundle labels. The SBN PDS4 Migration Wiki has a page on how to view XML files. It also contains useful tutorials, FAQs, and similar material, and is highly recommended as a resource. In particular, the wiki contains an excellent guide to the structure of the different label types and what the various entries mean.

After looking at the product label, you can delve into the data. The file types page explains some ways of viewing the data files you may encounter. Additionally, the SBN has a software suite known as ReadPDS, available in either IDL or Python, that reads in the data file and the XML label as a single structure. Be certain to download the PDS4 version.