CT Conversions for DPLA, Harvard, MIT, National Library of Korea, and Europeana
CT conversion programs are being developed to convert the Metadata Application Profile (MAP) of DPLA, the Library Cloud dataset of Harvard, the QDC of MIT, the MODS of the National Library of Korea, and the EDM of Europeana into the Common Terminology (CT). These conversions are based on CT version 1.2, which was slightly upgraded to accommodate the diverse metadata formats (MAP and EDM) and serializations (JSON and RDF) of the cooperating organizations.
- Metadata Application Profile(MAP) of DPLA to CT conversion
- Library cloud dataset of Harvard to CT conversion
- QDC of MIT to CT conversion
- MODS of National Library of Korea to CT conversion
- EDM of Europeana to CT conversion
1. Metadata Application Profile(MAP) of DPLA to CT conversion
The CT conversion program, written in Python, converts DPLA Metadata Application Profile (MAP) metadata records into the Common Terminology (CT). The final statistics of the DPLA MAP to CT conversion are:
The total Match rates of 8012390 records of DPLA
The number of total Statement= 228248135
Converted rate= 96.4896142525
exactMatch rate= 62.6515663931
narrowMatch rate= 35.3335726678
broadMatch rate= 2.01486093913
noConverted rate= 3.51038574751
Not converted Element Names are {u'originalRecord': 8012390}
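The rates above can be read as simple ratios over statement counts. The following is a minimal sketch of that reading, not the actual code of the conversion program: the function name match_rates and its two dictionary arguments are hypothetical. It assumes that the Converted and noConverted rates are shares of all statements, and that the exactMatch, narrowMatch, and broadMatch rates are shares of the converted statements only. The reported DPLA figures are consistent with this reading: 8012390 unconverted "originalRecord" statements out of 228248135 is about 3.51%, and the three match rates sum to 100.

```python
def match_rates(match_counts, not_converted):
    """Return (total_statements, rates) from raw statement counts.

    match_counts:  dict mapping a match type (e.g. 'exactMatch') to the
                   number of statements converted with that match type.
    not_converted: dict mapping an element name to the number of
                   statements that were left unconverted.
    Both argument shapes are assumptions made for this sketch.
    """
    converted = sum(match_counts.values())
    skipped = sum(not_converted.values())
    total = converted + skipped
    if total == 0:
        raise ValueError("no statements counted")

    # Converted / noConverted are shares of all statements.
    rates = {
        "Converted": 100.0 * converted / total,
        "noConverted": 100.0 * skipped / total,
    }
    # Match-type rates are shares of the converted statements only.
    for match_type, count in match_counts.items():
        rates[match_type] = 100.0 * count / converted if converted else 0.0
    return total, rates
```

For example, match_rates({"exactMatch": 6, "narrowMatch": 3, "broadMatch": 1}, {"originalRecord": 10}) counts 20 statements, half of them converted, with exactMatch accounting for 60% of the converted ones.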
The Main Structure of DPLAMAPtoCTConversion
- The "originalRecord" information of MAP is not converted, because we believe that the MAP of DPLA already describes the core information of the original records well enough.
- CT has 12 common terms with qualifiers. Some terms are the same as the MAP terms, while others differ. To better preserve the source information, some MAP terms are kept inside the value of the CT term, since we think this may work well for building Linked Open Data and search engines. For example, <ct:description ct:provenance="(dataProvider) NMNH – Mineral Sciences Dept."/>
- Two statements are added to the transformed records for the RDF/XML format: (1) an 'rdf:Description rdf:about=url' statement is added to each record, using the isShownAt URL by default (if isShownAt is absent, the object, hasView/@id, or @id value of the DPLA record is used instead); (2) a 'ct:identifier ct:source="DPLA_dataProvider"' statement is added to each record (dataProvider is the default; provider is used if no dataProvider is given).
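The fallback order described in the last item can be sketched as follows. This is an illustration under stated assumptions, not the code of DPLAMAPtoCTConversion: the function names pick_about_url and pick_source are hypothetical, the record is assumed to be a DPLA MAP record already loaded from JSON into a Python dict, and hasView and provider are assumed to be either a dict or (for hasView) a list of dicts.

```python
def pick_about_url(record):
    """Return the URL to use for rdf:about, trying the fallbacks in order:
    isShownAt, object, hasView/@id, @id."""
    has_view = record.get("hasView")
    if isinstance(has_view, list):  # assumption: hasView may be a list
        has_view = has_view[0] if has_view else None

    candidates = [
        record.get("isShownAt"),
        record.get("object"),
        has_view.get("@id") if isinstance(has_view, dict) else None,
        record.get("@id"),
    ]
    for url in candidates:
        if url:
            return url
    return None


def pick_source(record):
    """Return the dataProvider value, falling back to provider."""
    for key in ("dataProvider", "provider"):
        value = record.get(key)
        if isinstance(value, dict):  # assumption: provider may be an object with a name
            value = value.get("name")
        if value:
            return value
    return None
```

A record without isShownAt and object but with hasView would get its rdf:about URL from hasView/@id, and a record without dataProvider would take its source from provider.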
The sample record of the original MAP
The Converted CT by DPLAMAPtoCTConversion
2. Library Cloud of Harvard to CT Conversion
The HarvardtoCTConversion program converts the Library Cloud dataset provided by Harvard Library into the Common Terminology (CT).
HarvardtoCTConversion Match Rates
The total Match rates of 1525223 records in the folder, C:\Python27\metadata\harvard_library_cloud_urlsII
The number of total Statement= 38847221
HarvardtoCTConversion Converted rate= 100.0
exactMatch rate= 83.703912823
narrowMatch rate= 16.2956856039
broadMatch rate= 0.000401573126685
noConverted rate= 0.0
Not converted Element Names are {}
An Example of Original Harvard Record
Converted CT record in rdf/xml form by HarvardtoCTConversion
3. MITQDCtoCTConversion
MITQDCtoCTConversion converts the MIT DSpace records harvested in September 2015 into the Common Terminology (CT).
MITQDCtoCTConversion Conversion Rates
The total match rates measured for the 358017 records in the folder C:\Python27\metadata\MIT are:
The number of total Statement= 7343157
Converted rate= 100.0%
exactMatch rate= 89.4742955925
narrowMatch rate= 10.5257044075
broadMatch rate= 0.0
noConverted rate= 0.0
Not converted Element Names are {}
Findings
- A few records have no URL identifier that makes them accessible online. For these records, the MITQDCtoCTConversion program generates a URL identifier from the header:identifier value, which starts with 'oai:dspace.mit.edu'.
- Records marked as deleted in the header are not converted.
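The two findings above can be illustrated with a short sketch. It is not the code of MITQDCtoCTConversion. It relies on the OAI-PMH convention that a deleted record carries status="deleted" on its header element. The function names, and in particular the handle resolver base URL used to build the fallback URL, are assumptions made for illustration; the source does not say what URL pattern the program actually generates.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
OAI_PREFIX = "oai:dspace.mit.edu:"
# Assumption: the resolver base below is illustrative only.
HANDLE_BASE = "http://hdl.handle.net/"


def records_to_convert(oai_xml):
    """Yield the <record> elements of an OAI-PMH response,
    skipping records whose header is marked as deleted."""
    root = ET.fromstring(oai_xml)
    for record in root.iter(OAI_NS + "record"):
        header = record.find(OAI_NS + "header")
        if header is None or header.get("status") == "deleted":
            continue
        yield record


def fallback_url(header_identifier):
    """Build a URL from an 'oai:dspace.mit.edu:...' header identifier,
    or return None if the identifier has a different form."""
    if not header_identifier.startswith(OAI_PREFIX):
        return None
    return HANDLE_BASE + header_identifier[len(OAI_PREFIX):]
```

The fallback would only be applied to records that have no URL identifier of their own in the metadata.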
An Example Original Record in xml form
Converted CT record in rdf/xml form by MITQDCtoCTConversion
4. NLKMODStoCTConversion
NLKMODStoCTConversion is a conversion program that converts MODS records of National Library of Korea to the developed Common Terminology (CT).
NLKMODStoCTConversion Conversion Rates
The total match rates measured for the 43762 records in the folder C:\Python27\metadata\NationalLibraryOfKorea are:
The number of total Statement= 893425
Converted rate= 100.0%
exactMatch rate= 85.7332176736%
narrowMatch rate= 14.2667823264%
broadMatch rate= 0.0%
noConverted rate= 0.0%
Not converted Element Names are {}
Findings
- The W3C RDF validation reports warnings like the ones below for some characters that are not in Unicode Normal Form C. Because these are warnings rather than fatal errors, and we do not yet know how to fix them, they have been left as they are.
- Warning: {W131} String not in Unicode Normal Form C: “不分卷1冊; 21.3 x 14.5 cm”[Line = 43, Column = 49] in nlk1.rdf (不 causes the error).
- Warning: {W131} String not in Unicode Normal Form C: “그 女子의 戀人”[Line = 187, Column = 32]
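One possible way to address the W131 warnings, offered here as a suggestion rather than something the conversion program does, is to normalize each literal to Unicode Normal Form C before it is written to the RDF/XML file. The sketch below uses the standard unicodedata module; the helper names to_nfc and needs_nfc are hypothetical. Note that for CJK compatibility ideographs, NFC replaces the compatibility code point with the corresponding unified ideograph, so the stored code point changes even though the character looks the same. Whether that is acceptable for the original records is a decision for the data owner.

```python
import unicodedata


def to_nfc(text):
    """Return the text in Unicode Normal Form C."""
    return unicodedata.normalize("NFC", text)


def needs_nfc(text):
    """Return True if the text is not already in Normal Form C."""
    return to_nfc(text) != text


# U+F967 is a CJK compatibility ideograph; NFC maps it to U+4E0D.
compat = "\uf967"
assert needs_nfc(compat)
assert to_nfc(compat) == "\u4e0d"
```

A converter could call needs_nfc first to log which values were changed, and then write to_nfc(value) instead of the raw value.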
An Example Original Record in xlsx form
Converted CT record in rdf/xml form by NLKMODStoCTConversion
5. EuropeanaEDMtoCTConversion
EuropeanaEDMtoCTConversion is a conversion program that converts Europeana Data Model (EDM) records of Europeana to the developed Common Terminology (CT).
Europeana offered ways to access their data at http://labs.europeana.eu/
EuropeanaEDMtoCTConversion Conversion Rates
The average match rates were calculated from the values measured for grouped sets. For example, the first group covers sets 1 to 30. The values measured for the 11271544 records in sets 1-30 are: Converted rate= 100.0; exactMatch rate= 69; narrowMatch rate= 22.18; broadMatch rate= 8.6; noConverted rate= 0.0%; Not converted Element Names are {}.
The Average of total Match rates of 39937489 EDM records of Europeana are:
The number of Statements= 1925054834.0
The Average of Converted rate= 99.9901938321
The Average of exactMatch rate= 67.1639836621
The Average of narrowMatch rate= 22.9867034728
The Average of broadMatch rate= 9.84931286504
The Average of closeMatch rate= 0.0
The Average of noConverted rate= 0.00980965763118
Not converted Element Names are {'ore:Proxy/dcterms:isRequiredBy': 12, 'ore:Proxy/edm:isDerivativeOf': 5, 'edm:EuropeanaAggregation/edm:hasView': 1282, 'ore:Aggregation/edm:ugc': 48500, 'ore:Proxy/edm:isRepresentationOf': 2, 'ore:Proxy/edm:incorporates': 21, 'ore:Proxy/edm:isSuccessorOf': 2}
Difficulties
The EuropeanaEDMtoCTConversion conversion ran into several difficulties, mainly because of the diversity of the values that providers supplied.
- Some records use language codes that the ISO 639 series does not define, which causes W3C RDF validation errors such as {W116} RFC 3066 section 2.3 mandates the use of 'en' instead of 'eng'. The language codes found that are not defined in the ISO series are ['als','ang','arz','ast','azb','bar','bcl','bjn','bpy','bxr','cas','cdo','ckb','diq','en-gb','en-us','eur','ext','frp','gag','gan','glk','gml','gom','gr','hak','hbs','hif','iten','japani','jp','jut','koi','ksh','lad','lbe','lij','lmi-2010','lmo','lrc','ltg','lzh','mhr','mo','mrj','mzn','nan','nap','nov','nrm','olo','osx','pcd','pdc','pfl','pih','pms','pnb','pnt','prg','ran','rgn','rmy','rue','sgs','sh','sk-SK','Spa','stq','szl','tcy','uri','vec','vep','vls','wuu','xmf','xxx','yue','zea','zh-hant','zh-latn-pinyin-x-hanyu','zh-latn-pinyin-x-notone','zh-latn-wadegile']
- Some languages used in SKOS concepts to provide multilingual services cause W3C RDF validation warnings such as
Warning: {W131} String not in Unicode Normal Form C: "(sl)pozidano območje, strnjeno naselje;(sk)zastavaná oblasť;(da)bebygget område;(eu)eremu eraiki; eraikitako eremu;(ro)zonă construită;(it)area edificata;(tr)yerleşim alanı;(mt)żona mibnija;(no)bebygd område;(hu)beépített terület;(lv)apbūvēta teritorija;(ar)منطقة مشيَّدة;(lt)apstatyta teritorija;(cs)území zastavěné;(de)Bebaute Fläche;(el)(πυκνο)δομημένη περιοχή/οικιστική περιοχή;built-up area;城市建成区;(fi)rakennettu alue, asutusalue;(pl)teren zabudowany;(pt)povoações;(bg)Застроен район;(fr)agglomération;(sv)tätbebyggelse;(en)built-up area;(ru)застроенная территория;(et)täisehitatud ala;(es)zona edificada;(nl)bebouwde kom"[Line = 22, Column = 783]
- Diverse prefixes such as 'odrl' and 'cc' are used, as in 'odrl:inheritFrom="http://www.europeana.eu/rights/out-of-copyright-non-commercial/"' and 'cc:deprecatedOn="2027-11-10"'.
- Rarely used terms that were omitted from the CT crosswalk, such as 'ore:Aggregation/edm:ugc'.
- HTML tags that include '>'. For example,
‘<edm:isShownAt rdf:resource=”http://galenet.galegroup.com/servlet/ECCO?c=1&stp=Author&ste=11&<>af=BN&ae=T152600&tiPG=1&dd=0&dc=flc&docNum=CW119814160&vrsn=1.0< >&srchtp=a&d4=0.33&n=10&SU=0LRF”/>’
This causes significant semantic errors in particular, because I use '>' as a separator to find the terms/element names and values in the XML. A '>' inside a value causes the original value to be lost, and it results in a broken link when it occurs in a URL. Changing the program logic fixed this problem and recovers the original values, but the link remains broken if the original value was already a broken link.
- Broken links in values and in rdf:resource.
- A few files in a set contain no records, for example <ListRecords></ListRecords>.
- A few records have no metadata description and carry a 'null' value instead. For example,
<record><header><identifier>http://data.europeana.eu/item/2048605/data_item_bbaw_dta_30400</identifier><datestamp>2015-07-18T07:24:17Z</datestamp><setSpec>2048605_Ag_EU_DM2E_bbaw_dta</setSpec></header><metadata>null</metadata></record>
- A few records have neither a data provider nor a provider. In this case, the default provider is Europeana.
- In some records, a few descriptions have no value, for example '<dc:rights xmlns:dc="http://purl.org/dc/elements/1.1/"></dc:rights>'
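Several of the difficulties above (a '>' inside a value, 'null' metadata, empty descriptions, and missing providers) can be handled together by reading the records with an XML parser instead of splitting on '>'. The sketch below is one way to do that with the standard xml.etree.ElementTree module; it is not the logic of EuropeanaEDMtoCTConversion. The function names are hypothetical, the input is assumed to be well-formed XML (a file that is not well-formed raises ET.ParseError, which the caller would need to catch), and preferring dataProvider over provider is an assumption.

```python
import xml.etree.ElementTree as ET

EDM = "{http://www.europeana.eu/schemas/edm/}"
RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"
DEFAULT_PROVIDER = "Europeana"


def iter_statements(metadata):
    """Yield (qualified_tag, value) pairs from a <metadata> element.

    A metadata element with no child elements (for example one that only
    contains the text 'null') yields nothing. Elements with neither an
    rdf:resource attribute nor text are skipped as empty descriptions.
    """
    if metadata is None or len(metadata) == 0:
        return
    for element in metadata.iter():
        if element is metadata:
            continue
        # The parser returns attribute values whole, so a '>' inside a
        # URL does not split the value.
        value = element.get(RDF_RESOURCE) or (element.text or "").strip()
        if value:
            yield element.tag, value


def pick_provider(statements):
    """Return the data provider, else the provider, else the default."""
    first_values = {}
    for tag, value in statements:
        first_values.setdefault(tag, value)
    return (first_values.get(EDM + "dataProvider")
            or first_values.get(EDM + "provider")
            or DEFAULT_PROVIDER)
```

With this approach the element name and the value come from the parser, so the converter never has to guess where a tag ends.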