Current Research on Common Terminology
The current research on Common Terminology (CT) aims to improve metadata interoperability between the metadata records of cooperating organizations: Europeana, Digital Public Library of America, National Library of Korea, Massachusetts Institute of Technology (MIT), and Harvard Library. The goals are to convert the provided records into the developed Common Terminology (CT), to build linked open data with the CT, to provide a multilingual service, and to provide an international digital library portal through which the public can freely access their quality collections.
February of 2017.
- The CT and IOPDL CT websites, ct.iopdl.org and iopdl.org/common-terminology-ct, were updated for the upgraded CT version 1.1.
January of 2017.
- The Common Terminology (CT) was upgraded to version 1.1. The updated CT version 1.1 changes or omits a few unnecessary qualifiers. These changes are based on analyses of CT usage in the conversions from EDM of Europeana, MAP of DPLA, National Library of Korea, Harvard and MIT metadata records. The upgrade also adapts to and embraces the metadata standards (e.g., MAP of DPLA, EDM of Europeana, National Library of Korea’s MODS, Harvard’s Library Cloud, and MIT library’s QDC) and forms (json, rdf, xlsx, csv, and xml) of cooperating organizations.
- The updated CT version 1.1 Schemas.
- The updated CT version 1.1 Crosswalks.
- The updated CT version 1.1 SKOS Crosswalks.
- Common Terminology usage is measured with the CT records converted from MAP of DPLA, EDM of Europeana, National Library of Korea’s MODS, Harvard’s Library Cloud, and MIT library’s QDC.
For 2016, developing CT SKOS crosswalk based on the analyzed usage of the provided records, and developing CT conversions to convert them into the developed Common Terminology (CT) in rdf/xml form.
- EuropeanaEDMtoCTConversion has been developed since September 2016 to convert Europeana Data Model (EDM) records of Europeana into the developed Common Terminology (CT). Europeana offered ways to access their data at http://labs.europeana.eu/api. Thanks to their kind offer, the HarvestEDM program was developed, which harvests their metadata records set by set via their OAI-PMH service. Thanks to God, at the beginning of December 2016, all sets of Europeana EDM records were harvested in rdf/xml form. The harvested EDM records were converted into the developed Common Terminology (CT) in rdf/xml form at the beginning of December 2016. The statistics of the conversion show excellent performance of CT: converted rate 99.99%; SKOS semantic exact match rate 67.16%; narrow match rate 22.98%; broad match rate 9.8%; and non converted rate 0.00980%.
- MITQDCtoCT Conversion was developed to convert the MIT DSpace records harvested in September 2015 into the developed Common Terminology (CT). The statistics of the conversion show excellent performance of CT: converted rate 100%; SKOS semantic exact match rate 89.5%; narrow match rate 10.5%; broad match rate 0%; and non converted rate 0%.
- NLKMODStoCT Conversion was developed to convert the MODS records for the ancient rare resources of the National Library of Korea, provided in August 2015, into the developed Common Terminology (CT). The statistics of the conversion show excellent performance of CT: converted rate 100%; SKOS semantic exact match rate 85.7%; narrow match rate 14.3%; broad match rate 0%; and non converted rate 0%.
- HarvardtoCT Conversion was developed to convert the provided Library Cloud dataset of Harvard Library into the developed Common Terminology (CT). The statistics of the conversion show excellent performance of CT: converted rate 100%; SKOS semantic exact match rate 83.7%; narrow match rate 16.3%; broad match rate 0%; and non converted rate 0%.
- A CT conversion was developed that converts Metadata Application Profile (MAP) records of DPLA into the developed Common Terminology (CT). The statistics of the conversion show very high performance of CT: converted rate 96.5%; SKOS semantic exact match rate 62.7%; narrow match rate 35%; broad match rate 2%; and non converted rate as low as 3.5%.
- CT SKOS crosswalks were developed to enhance the understanding and usage of Common Terminology, using the usages of element names/terms found in the provided metadata records; the crosswalks in csv format were designed in March 2016.
- The planned activities have started. A Simple Knowledge Organization System (SKOS) crosswalk is being developed for MARC, MODS, QDC, and DC. 18 million Europeana Data Model records of Europeana were harvested, and the usage of their element names in 31 of the 1827 OAI list sets was analyzed.
For 2013-2015, organizing International Open Public Digital Library (IOPDL), Inc., a not-for-profit, tax-exempt organization, and receiving the provided metadata records from cooperating organizations.
- International Open Public Digital Library (IOPDL), Inc. has been organized and operating since January 2015. The Common Terminology (CT) project has continued at IOPDL since January 2015.
- IOPDL was provided with, or harvested, the metadata records of cooperating organizations, and
- The usages of terms/element names in their records were analyzed again in January 2016.
For 2012-2014,
- Common Terminology (CT) of MARC, MODS, QDC, and DC was developed at University of Illinois at Urbana-Champaign (UIUC) library.
The updated CT version 1.1 Schemas
- ct1-1.xsd: CT XML Schema version 1.1 (pdf)
- ct1-1.rdf: CT RDF Schema version 1.1 (pdf)
The updated CT version 1.1 SKOS concept
Metadata Records
Metadata records are provided by cooperating organizations: Europeana, Digital Public Library of America, National Library of Korea, Massachusetts Institute of Technology (MIT), Harvard Library, and University of Illinois at Urbana-Champaign (UIUC) library. The provided original records are preserved, and are saved separately into new files to fit the IOPDL prototype project. The IOPDL prototype is to provide a single portal of cooperating national and Well-Designed Digital Libraries (WDDLs), so that the public can freely access their high quality collections.
- Europeana 18 million Europeana Data Model multilingual records from http://labs.europeana.eu/api. We harvested 31 sets out of 1827 OAI list sets, due to limitations of device speed and network stability.
- Digital Public Library of America (DPLA) 8 million Metadata Application Profile (MAP) records in json form.
- National Library of Korea 40,000 Metadata Object Description Schema (MODS) records for old rare books, maps, old selected classic novels of South Korea in xlsx form.
- Massachusetts Institute of Technology (MIT) 0.35 million Qualified Dublin Core records.
- Harvard Library 1.5 million records from Library Cloud’s Harvard catalog dataset that have links out to electronic resources.
- University of Illinois at Urbana-Champaign (UIUC) library 0.39 million MARCXML records. But these records are not used, since the electronic resources of UIUC are already included in DPLA records.
Metadata Element Names Usage Analyses
To analyze the usage of their element names, a usage-analysis Python program was developed for each metadata format and file format used:
- DPLA MAP metadata format in json file format;
- Europeana EDM format in rdf file format;
- National Library of Korea MODS format in xlsx file format;
- Harvard Library Cloud in tsv file format;
- MIT QDC format in xml file format.
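As an illustration of what such a usage analysis involves for the XML-based formats, the following is a minimal sketch, not the project's actual program. It counts how often each element name occurs in one XML record using Python's standard library. The function name `count_element_names` and the sample record are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_element_names(xml_text):
    """Count how often each element name occurs in one XML document.

    Hypothetical helper for illustration; not the original usage program.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():  # iter() visits the root and all descendants
        # ElementTree reports namespaced names as "{namespace-uri}local-name".
        counts[element.tag] += 1
    return counts


# Made-up sample record, used only to exercise the function.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A title</dc:title>
  <dc:creator>First author</dc:creator>
  <dc:creator>Second author</dc:creator>
</record>"""

usage = count_element_names(SAMPLE)
for name, count in usage.most_common():
    print(name, count)
```

Summing such counters over all files of a collection would give a per-format usage table; the json, xlsx, and tsv inputs would need their own readers.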
The element names found and their usages are as follows:
- National Library of Korea MODS element names usage
- Harvard Library Cloud element names usage
- MIT QDC element names usage
- Europeana EDM element names usage
- DPLA MAP element names usage.
CT SKOS Crosswalks
First, SKOS crosswalks were developed from the usages of element names/terms found in the provided metadata records.
- The updated CT version 1.1 SKOS Crosswalks: http://ct.iopdl.org/1.1/ctskos_Crosswalk7.pdf
- CT SKOS crosswalk for version 1.1: http://ct.iopdl.org/1.1/ctskos_CrosswalkIII-MARC_MODS_QDC_DC_QDCMIT.rdf.pdf
CT Crosswalks
- The updated CT crosswalk for current version 1.1 (pdf): http://www.ct.iopdl.org/1.1/CT_crosswalk7_doc.pdf
CT Conversions
CT conversion programs are being developed to convert Metadata Application Profile (MAP) of DPLA, the Library Cloud dataset of Harvard, QDC of MIT, MODS of National Library of Korea, and EDM of Europeana into the developed Common Terminology (CT). These conversions are based on CT version 1.2, which is slightly upgraded to adapt to the diverse metadata formats (MAP and EDM) and forms (json) of cooperating organizations.
- Metadata Application Profile(MAP) of DPLA to CT conversion
- Library of cloud dataset of Harvard to CT conversion
- QDC of MIT to CT conversion
- MODS of National Library of Korea to CT conversion
- EDM of Europeana to CT conversion
Metadata Application Profile(MAP) of DPLA to CT conversion
Using the Metadata Application Profile (MAP) metadata records, a CT conversion program was developed in the Python programming language that converts them into the developed Common Terminology (CT). The final result of the DPLA MAP to CT conversion and its statistics are:
The total Match rates of 8012390 records in the folder, C:\Python27\metadata\DPLA
The number of total Statement= 228248135
Converted rate= 96.4896142525
exactMatch rate= 62.6515663931
narrowMatch rate= 35.3335726678
broadMatch rate= 2.01486093913
noConverted rate= 3.51038574751
Not converted Element Names are {u’originalRecord’: 8012390}
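The way these rates relate to each other can be sketched as follows. This is an inference from the reported figures, not the documented formula of the conversion program: the exactMatch, narrowMatch, and broadMatch rates sum to 100, and the Converted and noConverted rates sum to 100. That suggests the match rates are shares of the converted statements, while the converted rate is a share of all statements. The function name and the counts below are hypothetical.

```python
def match_rates(exact, narrow, broad, not_converted):
    """Compute conversion rates (in percent) from statement counts.

    Assumption: match rates are relative to converted statements, and the
    converted/noConverted rates are relative to all statements.
    """
    converted = exact + narrow + broad
    total = converted + not_converted
    if converted == 0:
        raise ValueError("no converted statements to compute rates from")
    return {
        "Converted rate": 100.0 * converted / total,
        "exactMatch rate": 100.0 * exact / converted,
        "narrowMatch rate": 100.0 * narrow / converted,
        "broadMatch rate": 100.0 * broad / converted,
        "noConverted rate": 100.0 * not_converted / total,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
rates = match_rates(exact=60, narrow=30, broad=10, not_converted=25)
for name, value in rates.items():
    print(name, "=", value)
```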
The sample record of the original MAP
The Converted CT by DPLAMAPtoCTConversion
MAP to CT Conversion Program
Library Cloud of Harvard to CT Conversion
The HarvardtoCTConversion program converts the provided Library Cloud dataset of Harvard Library into the developed Common Terminology (CT).
HarvardtoCTConversion Match Rates
The total Match rates of 1525223 records in the folder, C:\Python27\metadata\harvard_library_cloud_urlsII
The number of total Statement= 38847221
HarvardtoCTConversion Converted rate= 100.0
exactMatch rate= 83.703912823
narrowMatch rate= 16.2956856039
broadMatch rate= 0.000401573126685
noConverted rate= 0.0
Not converted Element Names are {}
An Example of Original Harvard Record
Converted CT record in rdf/xml form by HarvardtoCTConversion
MITQDCtoCTConversion
MITQDCtoCTConversion converts the MIT DSpace records harvested in September 2015 into the developed Common Terminology (CT).
MITQDCtoCTConversion Conversion Rates
The total Match rates measured for the 358017 records in the folder, C:\Python27\metadata\MIT, are as follows:
The number of total Statement= 7343157
Converted rate= 100.0%
exactMatch rate= 89.4742955925
narrowMatch rate= 10.5257044075
broadMatch rate= 0.0
noConverted rate= 0.0
Not converted Element Names are {}
Findings
- A few records have no url identifier that makes the resource accessible online. For these records, the MITQDCtoCTConversion program generates a url identifier using the header:identifier, which starts with ‘oai:dspace.mit.edu’.
- Records marked as deleted in the header are not converted.
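The two findings above can be sketched in code. The source does not say what form the generated url takes, so the handle-style url below is purely an assumption for illustration. The constant `HANDLE_BASE` and both function names are hypothetical. The `deleted` status value is the one defined for record headers by OAI-PMH.

```python
OAI_PREFIX = "oai:dspace.mit.edu:"
# Assumption: the generated url is a handle-style link. The source does not
# state which url pattern MITQDCtoCTConversion actually produces.
HANDLE_BASE = "http://hdl.handle.net/"


def url_from_oai_identifier(oai_identifier):
    """Derive a url from an OAI header identifier, or None if it does not fit."""
    if not oai_identifier.startswith(OAI_PREFIX):
        return None
    local_part = oai_identifier[len(OAI_PREFIX):]
    return HANDLE_BASE + local_part


def is_deleted(header_status):
    """OAI-PMH marks deleted records with status="deleted" on the header."""
    return header_status == "deleted"


# Hypothetical identifier, used only to exercise the functions.
print(url_from_oai_identifier("oai:dspace.mit.edu:1721.1/12345"))
print(is_deleted("deleted"), is_deleted(None))
```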
An Example Original Record in xml form
Converted CT record in rdf/xml form by MITQDCtoCTConversion
NLKMODStoCTConversion
NLKMODStoCTConversion is a conversion program that converts MODS records of National Library of Korea to the developed Common Terminology (CT).
NLKMODStoCTConversion Conversion Rates
The total Match rates measured for the 43762 records in the folder, C:\Python27\metadata\NationalLibraryOfKorea, are as follows:
The number of total Statement= 893425
Converted rate= 100.0%
exactMatch rate= 85.7332176736%
narrowMatch rate= 14.2667823264%
broadMatch rate= 0.0%
noConverted rate= 0.0%
Not converted Element Names are {}
Findings
- The W3C RDF validation reports warnings like the ones below for some specific characters that are not in Unicode Normal Form. Because these are warnings, not fatal errors, and we do not yet know how to fix them, the warnings are not fixed.
- Warning: {W131} String not in Unicode Normal Form C: “不分卷1冊; 21.3 x 14.5 cm”[Line = 43, Column = 49] in nlk1.rdf (不 causes the warning).
- Warning: {W131} String not in Unicode Normal Form C: “그 女子의 戀人”[Line = 187, Column = 32]
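One possible way to address W131 warnings, offered as a sketch rather than as the project's method, is to normalize string values to Unicode Normal Form C before writing the rdf/xml. The example assumes the offending characters are CJK compatibility ideographs such as U+F967, which NFC replaces with the unified ideograph U+4E0D. The actual code points in the NLK records may differ. The helper names are hypothetical.

```python
import unicodedata


def to_nfc(text):
    """Return the text in Unicode Normal Form C."""
    return unicodedata.normalize("NFC", text)


def needs_nfc(text):
    """True if the text would change under NFC normalization."""
    return to_nfc(text) != text


# Assumed example: U+F967 is a CJK compatibility ideograph whose canonical
# equivalent is the unified ideograph U+4E0D.
compat = "\uf967"
print(needs_nfc(compat))           # the compatibility form is not NFC
print(hex(ord(to_nfc(compat))))    # code point after normalization
```

Note that normalization replaces the original code points, so whether it is acceptable depends on whether the exact characters of the source records must be preserved.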
An Example Original Record in xlsx form
Converted CT record in rdf/xml form by NLKMODStoCTConversion
EuropeanaEDMtoCTConversion
EuropeanaEDMtoCTConversion is a conversion program that converts Europeana Data Model (EDM) records of Europeana to the developed Common Terminology (CT).
Europeana offered ways to access their data at http://labs.europeana.eu/
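The records were harvested set by set over OAI-PMH. The following sketch shows how a ListRecords request for one set could be built. It is not the HarvestEDM program. `BASE_URL` is a placeholder rather than Europeana's real endpoint, and the `edm` metadata prefix is an assumption. The `verb`, `metadataPrefix`, `set`, and `resumptionToken` arguments are those defined by the OAI-PMH protocol.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration; not the real Europeana OAI-PMH URL.
BASE_URL = "https://example.org/oai"


def list_records_url(set_spec=None, metadata_prefix="edm", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is given,
    metadataPrefix and set must be left out.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return BASE_URL + "?" + urlencode(params)


first_page = list_records_url(set_spec="2048605_Ag_EU_DM2E_bbaw_dta")
next_page = list_records_url(resumption_token="abc")
print(first_page)
print(next_page)
```

A harvester would fetch the first URL, read the resumptionToken from the response, and repeat with that token until the response no longer contains one.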
EuropeanaEDMtoCTConversion Conversion Rates
The averages of the Match rates were calculated from the values measured for groups of sets. For example, the first group covers sets 1 to 30. The values measured for the 11271544 records in sets 1-30 are: Converted rate= 100.0; exactMatch rate= 69; narrowMatch rate= 22.18; broadMatch rate= 8.6; noConverted rate= 0.0%; Not converted Element Names are {}.
The Average of total Match rates of 39937489 EDM records of Europeana are:
The number of Statements= 1925054834.0
The Average of Converted rate= 99.9901938321
The Average of exactMatch rate= 67.1639836621
The Average of narrowMatch rate= 22.9867034728
The Average of broadMatch rate= 9.84931286504
The Average of closeMatch rate= 0.0
The Average of noConverted rate= 0.00980965763118
Not converted Element Names are {‘ore:Proxy/dcterms:isRequiredBy’: 12, ‘ore:Proxy/edm:isDerivativeOf’: 5, ‘edm:EuropeanaAggregation/edm:hasView’: 1282, ‘ore:Aggregation/edm:ugc’: 48500, ‘ore:Proxy/edm:isRepresentationOf’: 2, ‘ore:Proxy/edm:incorporates’: 21, ‘ore:Proxy/edm:isSuccessorOf’: 2}
Difficulties
There were some difficulties in the EuropeanaEDMtoCTConversion conversion. The main cause of the difficulties is the diversity of the values that providers described.
- Some records use language codes that the ISO 639 series does not define, which cause W3C RDF validation errors such as “{W116} RFC 3066 section 2.3 mandates the use of 'en' instead of 'eng'”. The language codes found that are not defined in the ISO series are ['als','ang','arz','ast','azb','bar','bcl','bjn','bpy','bxr','cas','cdo','ckb','diq','en-gb','en-us','eur','ext','frp','gag','gan','glk','gml','gom','gr','hak','hbs','hif','iten','japani','jp','jut','koi','ksh','lad','lbe','lij','lmi-2010','lmo','lrc','ltg','lzh','mhr','mo','mrj','mzn','nan','nap','nov','nrm','olo','osx','pcd','pdc','pfl','pih','pms','pnb','pnt','prg','ran','rgn','rmy','rue','sgs','sh','sk-SK','Spa','stq','szl','tcy','uri','vec','vep','vls','wuu','xmf','xxx','yue','zea','zh-hant','zh-latn-pinyin-x-hanyu','zh-latn-pinyin-x-notone','zh-latn-wadegile']
- Some languages that are used in SKOS concepts to provide the multilingual services cause W3C RDF validation warnings such as
“Warning: {W131} String not in Unicode Normal Form C: “(sl)pozidano območje, strnjeno naselje;(sk)zastavaná oblasť;(da)bebygget område;(eu)eremu eraiki; eraikitako eremu;(ro)zonă construită;(it)area edificata;(tr)yerleşim alanı;(mt)żona mibnija;(no)bebygd område;(hu)beépített terület;(lv)apbūvēta teritorija;(ar)منطقة مشيَّدة;(lt)apstatyta teritorija;(cs)území zastavěné;(de)Bebaute Fläche;(el)(πυκνο)δομημένη περιοχή/οικιστική περιοχή;built-up area;城市建成区;(fi)rakennettu alue, asutusalue;(pl)teren zabudowany;(pt)povoações;(bg)Застроен район;(fr)agglomération;(sv)tätbebyggelse;(en)built-up area;(ru)застроенная территория;(et)täisehitatud ala;(es)zona edificada;(nl)bebouwde kom”[Line = 22, Column = 783]”
- Diverse prefixes such as ‘odrl’ and ‘cc’ are used, as in ‘odrl:inheritFrom=”http://www.europeana.eu/rights/out-of-copyright-non-commercial/”’ and ‘cc:deprecatedOn=”2027-11-10″’.
- The rarely used terms that were omitted in the ct crosswalk such as ‘ore:Aggregation/edm:ugc.’
- HTML tags that include ‘>’ . For example,
‘<edm:isShownAt rdf:resource=”http://galenet.galegroup.com/servlet/ECCO?c=1&stp=Author&ste=11&<>af=BN&ae=T152600&tiPG=1&dd=0&dc=flc&docNum=CW119814160&vrsn=1.0< >&srchtp=a&d4=0.33&n=10&SU=0LRF”/>’
This causes significant semantic errors in particular, because I use ‘>’ as a separator to find the terms/element names and values in xml form. A ‘>’ inside a value causes the original value to be lost, and it results in broken links when ‘>’ is used in URLs. Changing the logic of the program fixes the problem and recovers the original values, but the broken link problem remains if the original value was already a broken link.
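An alternative to splitting on ‘>’ is to let an XML parser extract names and values. The sketch below is not the project's revised logic. It assumes the input is well-formed XML, in which a literal ‘>’ is allowed inside an attribute value while ‘&’ must be escaped. The sample element and the example.org URL are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Made-up element whose rdf:resource value contains a literal '>'.
SAMPLE = (
    '<edm:isShownAt xmlns:edm="http://www.europeana.eu/schemas/edm/" '
    'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'rdf:resource="http://example.org/view?a=1&amp;b=>2"/>'
)


def resource_of(xml_text):
    """Return the rdf:resource attribute value of the root element."""
    element = ET.fromstring(xml_text)
    # Namespaced attributes are addressed as "{namespace-uri}local-name".
    return element.get("{%s}resource" % RDF_NS)


link = resource_of(SAMPLE)
print(link)

# For comparison: a naive split on '>' cuts the value at the embedded '>'.
naive_pieces = SAMPLE.split(">")
print(len(naive_pieces))
```

The parser returns the whole value with the ‘>’ intact and the entity decoded, whereas the naive split yields more pieces than there are tags.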
- The broken links in the value and in the rdf:resource.
- A few files in a set have no records, such as <ListRecords></ListRecords>.
- A few records have no metadata description, only a ‘null’ value. For example,
<record><header><identifier>http://data.europeana.eu/item/2048605/data_item_bbaw_dta_30400</identifier><datestamp>2015-07-18T07:24:17Z</datestamp><setSpec>2048605_Ag_EU_DM2E_bbaw_dta</setSpec></header><metadata>null</metadata></record>
- A few records have neither a data provider nor a provider. In this case, the default provider is Europeana.
- In some records, a few descriptions have no values, such as ‘<dc:rights xmlns:dc=”http://purl.org/dc/elements/1.1/”></dc:rights>’
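Empty ListRecords files and records whose metadata is only ‘null’ can be filtered out before conversion. The following is a minimal sketch under the assumption that the harvested files use un-namespaced element names, as in the example record above; real OAI-PMH responses usually carry a namespace, which would have to be added to the tag names. The function name and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET


def usable_records(list_records_xml):
    """Return the <record> elements that carry a real metadata description."""
    root = ET.fromstring(list_records_xml)
    kept = []
    for record in root.iter("record"):
        metadata = record.find("metadata")
        if metadata is None:
            continue  # no metadata element at all
        has_children = len(metadata) > 0
        text = (metadata.text or "").strip()
        if not has_children and text in ("", "null"):
            continue  # empty or 'null' metadata: nothing to convert
        kept.append(record)
    return kept


# Made-up input: one 'null' record and one record with a description.
SAMPLE = (
    "<ListRecords>"
    "<record><header><identifier>a</identifier></header>"
    "<metadata>null</metadata></record>"
    "<record><header><identifier>b</identifier></header>"
    "<metadata><title>A title</title></metadata></record>"
    "</ListRecords>"
)

records = usable_records(SAMPLE)
print(len(records))
print(len(usable_records("<ListRecords></ListRecords>")))
```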