Including A Repository
For a data repository to be included in the POLDER Federated Search, the indexer has to know where its data sets are and be able to retrieve metadata about them. The Federated Search App consumes JSON-LD metadata to make data sets searchable through its interface.
Broadly, if your data repository follows the POLDER Schema.org Best Practices (note: this document is still in progress), it will be a good fit for being included in the POLDER Federated Search. These Best Practices are based on the science-on-schema.org guidelines, and are summarized below.
Most of the work in getting a repository ready for the POLDER Federated Search is getting its metadata into a state where the app can consume it and use it in searches. Keep in mind that a field has to exist before it can be searched on: if you want people to be able to find your data set using, say, a date search, you have to attach temporal coverage information to it.
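As a minimal sketch of what such a record could look like, the Python snippet below builds a schema.org `Dataset` description as JSON-LD and attaches temporal coverage as an ISO 8601 interval. The function name `build_dataset_jsonld`, the example URL, and the sample values are hypothetical and chosen only for illustration; which fields your repository actually needs should be checked against the Best Practices document.

```python
import json


def build_dataset_jsonld(name, description, url, temporal_coverage=None):
    """Return a schema.org Dataset record as a plain dict (hypothetical helper)."""
    record = {
        # Assumption: the @vocab form of the context, as used in
        # science-on-schema.org examples.
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    # A date search can only match records that carry temporal coverage.
    if temporal_coverage is not None:
        record["temporalCoverage"] = temporal_coverage
    return record


# Illustrative values only; none of these come from a real repository.
example = build_dataset_jsonld(
    name="Example sea ice thickness measurements",
    description="Placeholder description of an example polar data set.",
    url="https://example.org/datasets/1",
    temporal_coverage="2012-09-20/2016-01-22",
)
print(json.dumps(example, indent=2))
```

A record built without `temporal_coverage` is still valid JSON-LD, but it would not be findable through a date search.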
Required Metadata Fields
Description (can be any length, although Google’s data search requires it to be between 50 and 5000 characters)
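If you also want your descriptions to satisfy the Google length range mentioned above, a small check like the following could be run over your records before publishing. This is a sketch only: the helper name `description_length_ok` is hypothetical, and treating the limits as inclusive and ignoring surrounding whitespace are assumptions.

```python
def description_length_ok(description, minimum=50, maximum=5000):
    """Check a description against the 50 to 5000 character range.

    Assumptions: the limits are inclusive, and leading or trailing
    whitespace does not count toward the length.
    """
    length = len(description.strip())
    return minimum <= length <= maximum


short_text = "Too short."
print(description_length_ok(short_text))
```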
Optional Metadata Fields
SameAs; if you’re a person who doesn’t like to see duplicate search results, this is for you!
Keywords are helpful for people doing text searches.
Version is not used right now, but in the future it could be used to display only the most current version of a data set. See also: SoSo’s provenance relationships guidelines.
Distribution (i.e., how to get the data) is useful if you offer a way to get the data that doesn’t just involve going to the data set’s landing page (i.e., the sitemap URL that was indexed to get this data set’s metadata).
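The optional fields above map onto the schema.org properties `sameAs`, `keywords`, `version`, and `distribution`. The sketch below shows one way they might be added to an existing record. The helper `with_optional_fields` and all URLs are hypothetical, and using a single `DataDownload` object for the distribution is an assumption about how your data is served.

```python
def with_optional_fields(record, same_as=None, keywords=None, version=None,
                         download_url=None, encoding_format=None):
    """Return a copy of a Dataset record with optional fields added."""
    enriched = dict(record)  # shallow copy, the input record is not modified
    if same_as:
        # Another URL describing the same data set.
        enriched["sameAs"] = same_as
    if keywords:
        enriched["keywords"] = list(keywords)
    if version:
        enriched["version"] = version
    if download_url:
        # A direct way to get the data, separate from the landing page.
        distribution = {"@type": "DataDownload", "contentUrl": download_url}
        if encoding_format:
            distribution["encodingFormat"] = encoding_format
        enriched["distribution"] = distribution
    return enriched


# Illustrative values only.
base = {"@type": "Dataset", "name": "Example data set"}
full = with_optional_fields(
    base,
    same_as="https://example.org/mirror/datasets/1",
    keywords=["sea ice", "thickness"],
    version="1.0",
    download_url="https://example.org/files/1.csv",
    encoding_format="text/csv",
)
print(full)
```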
Your metadata catalog should provide a sitemap so that harvesters like Gleaner know which pages to get information from. Alternatively, a robots.txt file that includes a list of sitemaps could work too.
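To illustrate what a harvester needs from these two files, here is a rough sketch that pulls page URLs out of a sitemap and sitemap locations out of a robots.txt file. It is not how Gleaner itself is implemented; the function names and sample content are hypothetical, and the sketch assumes the standard sitemaps.org XML namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return every <loc> URL found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def sitemaps_from_robots(robots_txt):
    """Return the URLs given on 'Sitemap:' lines of a robots.txt file."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


# Illustrative content only.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.org/datasets/1</loc></url>"
    "<url><loc>https://example.org/datasets/2</loc></url>"
    "</urlset>"
)
SAMPLE_ROBOTS = "User-agent: *\nAllow: /\nSitemap: https://example.org/sitemap.xml\n"

print(sitemap_urls(SAMPLE_SITEMAP))
print(sitemaps_from_robots(SAMPLE_ROBOTS))
```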
Things that are nice to have
Indexing works better and faster if your metadata is included directly in the data set landing page instead of being injected after the page loads.
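One way to see why: when the JSON-LD is present in the HTML as served, it can be read with a plain HTML parser and no browser. The sketch below uses only the Python standard library to collect `application/ld+json` script blocks from a page's static source. The class and function names are hypothetical, and the sample page is illustrative. Metadata injected later by a script would not appear in this output.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.records.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return the JSON-LD records embedded in the static HTML source."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Illustrative landing page with the metadata embedded directly.
STATIC_PAGE = (
    "<html><head>"
    '<script type="application/ld+json">'
    '{"@type": "Dataset", "name": "Example data set"}'
    "</script>"
    "</head><body>Landing page</body></html>"
)
print(extract_jsonld(STATIC_PAGE))
```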