CarLibrary.org -  Importing Spreadsheet Data into Greenstone

April 20, 2016

Importing an Inventory List into Greenstone

This webpage provides a basic step-by-step process about going from a "list" into a basic Greenstone collection of records, with metadata for each record, ready to build a digital library/archive.  It is the second CarLibrary.org webpage on Greenstone, each with a different method of creating an automotive-related "collection".

The following steps assumes you have some familiarity with Greenstone and Excel.  The Greenstone wiki has a link to  a tutorial which is very good and recommended.

The examples on this webpage are Excel files with real, working data imported into Greenstone.  This import technique was done many times until a valid collection was created!

This process also assumes you have some knowledge of library classification techniques.  If not, some research on this topic or consulting with an experienced librarian may be helpful.  One of the official Dublin Core websites is a good starting point, especially topic 4. "Elements" at the bottom of that webpage.  Basic categories such as "Title", "Description", "Subject" and "Keyword" can be confusing and tedious to correct after an import with many records.  It's best to get it (mostly) right in the beginning.

However: "The Perfect is the enemy of the Good (enough)"  - Voltaire  So don't hesitate too long to get started!

Why Make a Digital Collection?

For a collection of objects - cars, paintings, books, photographs - which is reasonably organized, there likely is a list (inventory) of these objects and some of their characteristics: "1948 Oldsmobile 88M, green, bought in 1985 from Ted Smith, loaned to the Oakville Museum, call Fred Friday for status" or "Snowy Winter, oil on canvas, by Patricia Jones, 26"x18", bought in 1997 for $175".  

If this collection ia slightly more organized, this list and its attributes may now be on a computer, in some type of database.  But what is to be done with the documents about the restoration of a 1948 Oldsmobile or the hundreds of photos at various car shows?  Linked references - or sometimes the actual document - can be put on a PC.  But when the documents grow to hundreds or thousands and direct, repeatable access is needed to any type of digital file related to your collection, a software program that goes beyond databases or photo organizers is recommended: the Greenstone Digital Library software is a good example.  The cost is right - it's open-source (free to use).  And it's not just for "libraries", but applicable for collections, archives, libraries and museums.

When a conversion of a (Excel) list to Greenstone is made, the attributes of each object become "metadata" in categories which can be used for access or classifying the current collection and future acquisitions.  Further, if the Greenstone's standard metadata sets are used, these classifications will be recognized and can be accessible to a wider audience.  A collection on Greenstone can remain private on a home or business computer, put on a network or made accessible to the Internet.

Lists, Databases and Excel

A good example of a "list" is a simple contact list of names, addresses and other personal or business information.  Such a list would look like this:

Record No First Name Last Name Street City State Comments
01 Tom Jones 122 Maple Lane Elm Gardens CA Tom is the lead salesperson from the Acme Chemical company and should be a first contact.
02 Sue Scott 432 Apple Avenue Scottsdale Arizona Sue is the CFO of Acme Chemical company and will make final decisions on all contracts 

Lists such as the example above have been hand-written and typed for generations, but computers allow us to put this data into Excel (or Word) tables or more advanced programs, such as a database.  For a useful table (or list) to be used as a "database", some rules apply:

1.  Each line is a single Record with data on the one person, item or other entity on that line.

2.  Each column is headed by a descriptor that is termed a "Field" and all the data in that column has similar characteristics.

Note in the example above, Record 01 uses "CA" as the standard abbreviation for "California" whereas Record 02 spells out the full state name.  This is not good database practice.

Excel was the software used to make the table/list/database for gathering records and data for import into Greenstone.  If you have typed lists, investigate using OCR to convert them to spreadsheet/Excel files.  Or use a good typist familiar with Excel!  If the database-type tables are in Word or a similar program, they usually can be copied into or imported into a spreadsheet-type program, such as Excel.

Excel provides many functions to help you review and clean-up the data.  Sorting, copying, pasting and moving cells of data will speed up any need to make the data uniform and adhere to good database practices.

For more information on using Excel as a database, see Using Excel as a Database or an Excel user book.  

This is another example of records and a database, more similar to data planned to bring into Greenstone:

Record No car.Manufacturer car.Make car.Year car.Serial_No Engine No Engine Type Car.Model
01 AFN Ltd. Frazer Nash 1948 420/E1

-

2-seater Special
02 AFN Ltd. Frazer Nash 1949 421/100/003 1051 85-Series Bristol Two-seater
03 AFN Ltd. Frazer Nash 1948 421/100/004 1055 85-Series Bristol Competition two-seater
04 AFN Ltd. Frazer Nash 1950 421/100/005 FNS 1/14 FNS-Series Bristol Fast Tourer
05 AFN Ltd. Frazer Nash 1948 421/100/006 1053 85-Series Bristol High Speed
06 AFN Ltd. Frazer Nash 1949 421/100/007 FNS 1/2 FNS-Series Bristol High Speed
07 AFN Ltd. Frazer Nash 1949 421/100/008 FNS 1/3 FNS-Series Bristol High Speed
08 AFN Ltd. Frazer Nash 1950 421/100/109 FNS 1/11 FNS-Series Bristol Le Mans Replica

A further example which shows how a archive catalog can be imported into Greenstone:

Category ID Cat ref: QTY Title Pages Description Size Date Condition Est Value Total Value Source
Sales-Promo 1 RAN/1 1 2 Litre cars 400, 401, 402 4 page folder colour illustrations of 400,401,402 + power unit 8.5 x 13 1950 A £120 £120 Gift of R. Smith
Sales-Promo 2 RAN/2 1 2 litre cars 400, 401, 402 4 page folder chassis & engine description 8.5 x 13 1950 B, tears £30 £30   
Sales-Promo 3 RAN/3 1 Brigand Beaufort, Beaufighter & Britannia Sales-Promo 4 page folder range description b & w photos 8 X 11.5   A £5 £5   
Sales-Promo 4 RAN/4 2 Beaufighter, Brigand & Britannia 8 page folder range description b & w & colour photos 8 x 11.5   A £15 £30   

Finally, the following are two examples of field names (columns headings) that have been used for Excel files to record data on cars (vehicles) and their owners, in separate files.  See below for further details:

Vehicle File Owner File
CarNo OwnNo
OwnerNo Last Name
Year First name
Manufacturer Salutation
Make Address1
Model Address2
ChassisNo City
EngineNo District/State
RegNo Postal Code
Former_RegNo Country
DeliveryDate Email Address
OriginalColor Home Phone
OriginalRegNo Work Phone
CurrentLocation Mobile
Remarks Fax
   Remarks

Create an Excel File/Database

Using the examples above, create or check the Excel file to ensure all the data for each record is on a single line, "like kind" data is in each column, variations in each data item (spelling and abbreviations) have been made uniform, and blank lines have been eliminated.  Blank cells are acceptable.

Dates can have many formats.  The can be either "text" or in an Excel date formats - which can look exactly like text.  A good method to fix dates is to sort the entire file on the "Date" field.  Dates in text format should be at the top and should be corrected to an Excel date format.  

The file intended as an import can have only a few records - probably best for initial trials - or thousands of records.  Greenstone imports are very quick!

Finally, create a new first column for the Excel file with a name such as "RecordID" - the exact field name is not critical.  Good practice is to make the data in this column a "name" that relates to the file data plus a number to ensure each records has a unique identifier.  For the Frazer Nash car file above, this would be something like "HighSpeed05", "Highspeed06", and "LeMansReplica08".  For the archive file above, this would be "Sales-Promo001".  If there are many records, this can be a tedious manual process -Excel's process to create a list of consecutive numbers can be used, format the numbers into a standard format (ie. "01", "02") and then use the "Concatenate" function to combine this number field with a data item from a different column.  This column of data will be very useful later in Greenstone.

Review Standard Greenstone Metadata Categories

Assuming there has been some basic Greenstone experience, there will be familiarity with the "Dublin Core Metadata Standard", which is the basic classification scheme used in Greenstone and widely recognized by digital libraries and other resources (including web pages).

The Dublin Core basically consists of these elements:

  1. Title

  2. Creator

  3. Subject

  4. Description

  5. Publisher

  6. Contributor

  7. Date

  8. Type

  9. Format

  10. Identifier

  11. Source

  12. Language

  13. Relation

  14. Coverage

  15. Rights

In Greenstone, each Dublin Core element is prefixed with "dc.", so they appear as dc.Title, dc.Creator, etc. Because these elements are widely accepted and recognized, it is a good idea to match the imported field names to the Dublin Core elements, insofar as that is possible.

For this example of an archive file, this mapping is used:

Archive File   Dublin Core Metadata
Title = dc.Title
Category = dc.Subject
Description = dc.Description
Date = dc.Date
ID = dc.Type
Cat ref: = dc.Identifier
Source = dc.Source
     dc.Format
     dc.Creator
     dc.Publisher
     dc.Contributor
     dc.Language
     dc.Relation
     dc.Coverage
     dc.Rights
QTY = item.Quantity
Size = item.Size
Pages = item.Pages
Condition = item.Condition
EstValue = item.Value
Total Value = item.TotalValue

Note that not all Dublin Core metatags are mapped from the file scheduled to be imported; unused metatags can be added after the import directly in Greenstone, as needed.  Note also that new "item" metatags have appeared.  These should be added to Greenstone before the import.  This process is described below.

Other metatags will be also added, such as "car.Make", "car.Model", etc., specifically because the data of this archive is from a car company and a car club.

Import Preparation

These are the fields/categories suggested for the Greenstone automotive digital collections:

"Car" Metadata

car.ID

car.Manufacturer

car.Make

car.Model

car.Year

car.Serial_No

car.Country

car.Keyword

car.Location

 

Dublin Core Metadata

dc.Title

dc.Subject

dc.Description

dc.Date

dc.Resource Type

dc.Resource Identifier

dc.Source

dc.Format

dc.Creator

dc.Publisher

dc.Contributor

dc.Language

dc.Relation

dc.Coverage

dc.Rights

Note that the Excel file destined for Greenstone does not initially need to have entries for all of these Dublin Core metatags; metatag data can be added after the import as needed, directly in Greenstone.  New metatag categories can be also created, such as "item.xxx" which could be useful to add physical attributes in an archive collections.  New metatags should be added to Greenstone before the import.  This process is described below.

The "mapping" of car collections Excel file to the planned Greenstone collection is very easy - rename column headings to the relevant Dublin Core element or to a new metatag category planned to be added to Greenstone.

For the the imports of the Excel example files shown above, a metadata set was created: "cars.xxx".  For a collection that was primarily an archive inventory, an "item.xxx" metadata set was added.

"Exploding" (Importing) the Spreadsheet Database

In Greenstone, "importing" an Excel file/database is not very difficult.  This process will create a (nul/empty) records from each record in the Excel file.  Each record's data elements will become a metatag, ready to be used to identify/categorize each car, or related document, photo or other digital item.  The Excel file must be converted to a "CSV" format and then "exploded" in Greenstone.  This "explosion" is very safe!

A Greenstone tutorial explains this - on that page, follow on from step 15.

An Excel file cannot be "exploded" directly, but such a file is easily saved as a "comma-delimited" (Comma Separated Values, "CSV") file.  

1. Save the Excel file you want to import in the "comma-delimited" format by "saving as" the "csv" format.

2. With Greenstone open, go to "File" and then "New".  Make "AustinCars" as the "Collection title" and base it on "Car Collections" picked from the list of collections.

 

3. In the "Gather" panel, drag the new "csv" file from the "Local Filespace" on the left into your Collection panel on the right.

4. Right click on the file and choose "Explode metadata database"

5. When the "Explode Metadata Database" menu/panel appears, uncheck the box next to "metadata_set" (see #6, next)
6. Check the box for "document field" and type in the exact field name of the first column, which was created above.  This should be "RecordID" if the first column in the Excel file was so headed.

7. Click "Explode"
8. When the "import" is finished, a panel "Merging Action Required" appears. 
9. Look to see the first "Source metadata element", select a "Target metadata set" such as "Car", but do not select a "Target metadata element", instead "Ignore" matching the "RecordID".  This is done so that  Greenstone names each record using the "RecordID" data.  Otherwise Greenstone will assign a serial number, such as "000001", 000002" etc. to the records.  The actual name of the car rather than a serial number makes the records and file more useful.

Other fields can be "Merged", if you chose an appropriate metatag category.

10. If the column headings of the Excel file match the metatag categories of "Car Collections", the import will complete with further intervention.  Otherwise Greenstone will check each field name, stop and require a match choice. Either "Merge" or "Ignore" through all the field names.
11. When the import is complete, there will be new folders in the Collection panel, with the folder names based on the name of your imported file.  Each folder will have 100 (or less) records.

12. In the "Enrich" panel, select any record   Each record should have correct metadata record in each metadata category.
13.

 

14.

The new folders created by the import process can be renamed - or create new ones - such as "Books", "Photos", etc. Then the records can be moved into the new or renamed folders.

Either select the metadata from the existing imported metadata for the new photo/document or  replace the "nul" record with the actual document/photo.  In either the Gather or Enrich panels, right click the "nul" record, select "Replace" and browse for the actual document/photo and confirm the replacement.

This completes a new Greenstone collection.  There is a third webpage which describes creating a collection entirely with "embedded metadata.  Or return to the Table of Contents.

Email me with any comments, suggestions or questions: Bob Schmitt, rgschmitt@gmail.com