Marketplaces for Digital Data: Quo Vadis?

The survey presented in this work investigates emerging markets for data and is the third of its kind, providing a deeper understanding of this type of market. The findings indicate that data providers focus on a limited set of business models and that data remains individualized and differentiated. Nevertheless, a trend towards commoditization can be foreseen for certain types of data, which allows an outlook on further developments in this area.


Introduction
The Internet enables almost ubiquitous transactions and exchanges of information. Increasingly, data is both supplied and demanded publicly on the Internet, which has led to the emergence of data marketplaces, i.e., virtual spaces of exchange between many actors on the supply and demand side (Muschalle, Stahl, Löser, & Vossen, 2013). This paper reports on the third iteration of studies on marketplaces, continuing the work of (Schomm, Stahl, & Vossen, 2013; Stahl, Schomm, & Vossen, 2014), and addresses the following questions:
• What manifestations do data providers choose to operate on data markets?
• Is there a progression of commoditization of data and, if this is the case, how far has it advanced?
• What is to be expected within the next three to five years?
The first question serves to identify whether certain forms of data provisioning are more reasonable than others and how providers deal with the issue of generating revenue from data. This relates to the topic of the data they sell, how they reduce buyers' uncertainty, and which means of differentiation they adopt. The second question is concerned with data itself as a good. Data is a rather abstract, digitized good, the value of which is difficult to assess. A process of product standardization, called commoditization, has the potential to facilitate value attribution. The state of commoditization of data indicates whether traded data is differentiated and unique, in which case a provider has only few direct competitors, or whether the data converges towards commodities, which entails a more perfect market. The last question builds on the first two and intends to provide an outlook on the direction(s) in which this area of business will move in the near future. This question will be answered with the results of all three surveys in mind.
Every new market is characterized by numerous participants entering and exiting while developing solutions and strategies for the challenges that every new business opportunity entails. The relatively high number of providers leaving the field in the past few years illustrates that data markets appear to be particularly challenging. Interviews with founders of the visualization tool Swivel, closed in 2010, revealed that a main obstacle to their business was that the number of users willing to pay for their services was "in the single-digit area" (Kosara, 2010). The Internet, the very medium that has led to the transformation of data markets in the first place, is also one of the major threats to their economy: Users are accustomed to having instant access to information for free, resulting in a low willingness to pay for data.
Despite a lot of discussion in the blogosphere, systematic research on the landscape of data marketplaces is still scarce. Some evaluations on a small scale have been performed, notable examples being (Dumbill, 2012; Gislason, 2011; Miller, 2012; O'Grady, 2011); however, several of the offerings discussed are already out of business. Until recently, a deficiency in the investigation of data marketplaces was the lack of a theoretical groundwork as well as the lack of clear terminology. In order to mitigate those issues, we have developed a theoretical framework and provide term definitions in (Stahl, Schomm, Vossen, & Vomfell, 2016) to transparently communicate the foundation of our surveys. In order to come to a clearer understanding of the market, it is crucial to analyze the solutions providers employ and the various business models. To this end, we next introduce the methodology. Findings are displayed in Section 3. Sections 4 and 5 discuss trends and future scenarios, respectively. Finally, Section 6 concludes this paper.

Methodology
The methodology of this survey is almost identical to those employed in previous iterations (Schomm et al., 2013; Stahl et al., 2014): Services that fulfill the provider definition are included in the sample, inspected by hand and then categorized along dimensions based on an analysis of their respective web sites. Our approach is substantiated both in the need for comparability as well as in resource limitations. Two modifications to the previous methodology have been made for this iteration and are explained in the following subsections.

Provider Definition & Acquisition
We comprehensively studied the definition of data marketplaces in (Stahl et al., 2016); here, we repeat the most important points:
1. The providers' primary business model needs to be providing data.
2. The providers offer an infrastructure to upload, browse or download machine-readable (e.g., RDF or XML) data to buy and sell. The data has to be hosted by the providers and it needs to be clear whether the specific data comes from the community or the operator.
3. The providers offer or sell proprietary data they host themselves. However, there must be transparency and traceability on the data sources. Providers of analyzed data must disclose the sources and methods of calculation.
4. The data analysis tools must be online tools and provide storable data as their main offering. They need to use proprietary data in their calculations; services like algorithm-based analyses of customers' data or the provision of crawling code do not classify for this type.
It can be argued that only machine-readable data successfully indicates the commoditization of data on marketplaces. Otherwise, the information is simply shared because users personally deem it useful without being concerned about allocability and effective distribution. This rule applies, for example, to Wikipedia: Its marketplace-like infrastructure allows users to upload or access information free of charge, but this information is not machine-readable.
Data vendors only linking to data locations without hosting them (like KDnuggets.com's list of data sets) also do not fulfill the criteria of this study. Similarly, providers that do not make their sources and methods transparent are excluded because no serious conclusions can be drawn on their trustworthiness, on the data origin, and sometimes even on what type of data is offered. Government agencies or non-government organizations providing free data are generally not considered data vendors, as they publish data as a side effect of their general purpose and are not set on commoditizing data or even finding an appropriate business model.
Finally, financial institutions like stock exchanges are excluded from the survey as well, due to their redundancy: The World Federation of Exchanges alone counted 64 official and 15 affiliate members as of October 2014, not including the countless futures and online exchanges like CMEgroup.com ("World Federation of Exchanges," 2014). Nearly all of them offer very similar data, which would heavily skew our results. Thus, they are discarded entirely.
The provider lists of (Schomm et al., 2013) and (Stahl et al., 2014) formed the basis for the provider sample of this survey. As the statistical methods applied require a sufficiently large number of cases, the sample was expanded from 47 to 72. To this end, ten keywords were identified by induction from their selection: "complex data analysis", "data crawler", "data market", "data marketplace", "data platform", "data provider", "data tagging", "data vendor", "data search engine", "sentiment analysis". They were taken as a basis for a keyword-based Web search and an analysis of the first 50 Google results. The results of the search and the lists of the previous surveys were matched against the criteria, resulting in 72 providers.
The provider selection just described has several restrictions. Including a certain offering means including all of its competitors and similar offerings, leading to a possibly very large sample size. Consequently, a compromise between finding as many online offerings that provide data as possible while remaining within a reasonable and manageable scope is necessary, which we achieve by drawing a clear line with the distinct definition given in (Stahl et al., 2016).
The providers are evaluated solely by analyzing their online presences. Their self-portrayal on the respective websites does not necessarily reflect an objective assessment, so the results of the survey can be biased. As personal testing of the offerings cannot be covered due to resource limitations, an evaluation based on the Web presence appears to be an appropriate solution. Four of the 15 dimensions are inherently subjective, so the results in these dimensions should be interpreted accordingly. Therefore, they are not included in the analyses and only serve to give an impression of the market and the providers.
Not all dimensions could be examined for every provider, so the data is not complete. Missing values are treated as N/A and disregarded in the analysis. Only 1.2% of all fields are counted as N/A, most of them in the Pricing, Data Access, and Data Output dimensions. Even though the evaluation is intended as a continuation of previous surveys (Schomm et al., 2013; Stahl et al., 2014), two changes lead to differences. First, the sample is significantly expanded while some of the previously surveyed providers are discarded: Some, like Uberblic, are no longer in business; some no longer fit the provider definition, such as the governmental providers. Secondly, modifications to the dimensions of the preceding surveys have been made. The dimension Website Language is removed entirely, while the related dimension Data Language is re-interpreted to refer strictly to the metadata available. Additionally, the dimension of Ownership has been added to allow for an analysis of the inherent bias of the providers. Depending on whether the operator allows other providers to participate on its platform or not, the business can be biased towards the operator (Stahl et al., 2016).

Statistical Analysis Methods
The survey consists of categorical variables, which only allow for positive (1) or negative (0) responses. As not all dimensions are exclusive, i.e., some dimensions are "tick all that apply" questions, they are analyzed with methods for multiple response categorical variables (MRCVs) (Bilder & Loughin, 2004). In a first step, two dimensions at a time are merged to show the combined responses to each category of the dimensions. The combinations of dimensions are picked based on three considerations: first, which dimension combinations return meaningful results at all; second, which combinations can provide answers to the provider manifestation question; and third, which ones can give indicators as to the commoditization process of data. Traditionally, the commoditization of data can be inferred from knowledge about the competition situation and the standardization of data quality. Due to the fact that neither can be construed from website evaluations, the associations between data domains serve to at least gauge information on the commoditization.
Based on the first consideration, subjective dimensions are left out. Additionally, not every combination of the 15 dimensions provides potential for meaningful insights. This is partly due to unrelated dimension combinations (like Ownership and Data Access) and partly due to inherent correlations; for example, the combination Data Output / Data Access returns a high number of providers offering CSV data via an API, which is not an insightful result.
The manifestations of providers as well as some indications on the competition situation can best be illustrated by the following combinations: Type / Domain shows whether certain business models are more likely to offer a certain type of data. Whether certain business models obtain data from a specific source is revealed by Type / Origin. As an additional reference, Audience / Pricing is analyzed to determine which pricing models are more likely to be employed for different customer groups. Type / Pricing shows whether certain pricing strategies make more sense for some business models. Regarding the standardization of data, the evaluation is more difficult. To this end, the source of certain data domains is regarded in Origin / Domain, which could provide explanations on the specificity of the data. Other combinations (e.g., Domain / Access) have been tested but no meaningful results were found. The tables in Section 3.2 are positive response tables computed with the MRCV package in R. As a provider in the sample may tick several categories per dimension, neither the absolute nor the percentage margins sum up to the population size or 100%.
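The construction of such a positive response table can be sketched in a few lines. The following Python snippet is only an illustration and not the MRCV implementation used for the survey; the function name and the five toy providers are hypothetical. It counts, for every pair of categories from two dimensions, the providers that tick both, and shows why the cell counts can exceed the sample size.

```python
def positive_response_table(w, y):
    """Count, for each item pair (i, j), the providers with W_i = 1 and Y_j = 1.

    w and y are lists of 0/1 rows, one row per provider (hypothetical layout).
    """
    n_w, n_y = len(w[0]), len(y[0])
    table = [[0] * n_y for _ in range(n_w)]
    for w_row, y_row in zip(w, y):
        for i, w_i in enumerate(w_row):
            for j, y_j in enumerate(y_row):
                table[i][j] += w_i * y_j  # adds 1 only if both items are ticked
    return table


# Toy sample of five providers, invented for illustration (not survey data).
# Columns of type_rows: Raw Data Vendor, Data Marketplace
# Columns of pricing_rows: Free, Pay-per-Use, Flat Rate
type_rows = [[1, 0], [1, 1], [0, 1], [1, 0], [0, 1]]
pricing_rows = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]

table = positive_response_table(type_rows, pricing_rows)
total_cells = sum(sum(row) for row in table)
```

Because the second and fifth toy providers tick several categories, the cell counts add up to more than the five providers in the sample, which mirrors the remark on margins above.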
Specifically, statistical independence among the individual variables is of interest. This is akin to asking the question "whether the probability of a positive response to each item changes depending on the responses to other questions" (Bilder & Loughin, 2014). If two variables are statistically independent, knowing the value of one variable does not help to predict the value of another variable (Seltman, 2014). In the case of two MRCVs, independence means that each manifestation of an MRCV is independent of each manifestation of the other variable and that this holds true for every response combination (Bilder & Loughin, 2004). This hypothesis is called simultaneous pairwise marginal independence (SPMI) (Bilder & Loughin, 2004). The independence is marginal because each manifestation is counted without regard to the other responses of the specific individual to the categories (Bilder & Loughin, 2014). When testing for independence, neither the presumed association direction nor the "roles" of the variables are relevant (Seltman, 2014). This means that it is not important to know beforehand which variable is the explanatory one and which is the outcome variable.
The test for SPMI is computed in R with the MRCV package and the MI.test(data, I, J, type, B=1999, summary.data=TRUE) command. The test structure is briefly explained here; for more detailed explanations see (Bilder & Loughin, 2014), from where the following notation is derived. Consider the case of two MRCVs, W with I items and Y with J items. For item $i$ of W and item $j$ of Y, the binary variables are referred to as $W_i$ and $Y_j$, respectively. Define the joint probability for $i = 1, \dots, I$ and $j = 1, \dots, J$ as $\pi_{ij} = P(W_i = 1, Y_j = 1)$. Additionally, $\pi_{i\cdot} = P(W_i = 1)$ and $\pi_{\cdot j} = P(Y_j = 1)$ denote the marginal probabilities for the positive responses for the items. Let then the hypotheses for SPMI be:
$H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$ for $i = 1, \dots, I$ and $j = 1, \dots, J$
$H_a$: for at least one $(i, j)$ pair the equality does not hold
where $\pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$ specifies marginal independence. The hypotheses can be tested by a variety of different testing methods, including Rao-Scott second-order adjustments, Bonferroni adjustments, and the bootstrap, a resampling algorithm under the assumption of independence (Bilder & Loughin, 2004). For a large number of binary categories, Rao-Scott adjustments are not realizable in R (Koziol & Bilder, 2014). Bonferroni adjustments sometimes return more conservative critical values for small sample sizes, while bootstrap p-values appear to have the highest power (Bilder & Loughin, 2004). In order to show the results for both approaches, the Bonferroni p-values as well as p-values obtained under bootstrap are used. If the returned p-value is below the significance level of $\alpha = 0.05$, the hypothesis of SPMI is rejected and an association between the two dimensions can be assumed; otherwise, the data provide no evidence against independence.
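To make the test structure more tangible, the following Python sketch mimics the two approaches on toy data. It is an illustration under stated assumptions and not the MRCV code: the function names and the toy data are hypothetical, the statistic is assumed to be the sum of the Pearson chi-square statistics of all I x J item-pair 2x2 tables, the bootstrap is assumed to resample the W and Y response vectors independently of each other, and no continuity constant is added for empty cells. A small p-value counts as evidence against SPMI.

```python
import math
import random


def pearson_2x2(w_col, y_col):
    """Pearson chi-square statistic for the 2x2 table of two binary items."""
    n = len(w_col)
    n11 = sum(a * b for a, b in zip(w_col, y_col))
    w_pos, y_pos = sum(w_col), sum(y_col)
    n10, n01 = w_pos - n11, y_pos - n11
    n00 = n - n11 - n10 - n01
    denom = w_pos * (n - w_pos) * y_pos * (n - y_pos)
    if denom == 0:
        return 0.0  # degenerate margin; this sketch treats it as no evidence
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def item_statistics(w, y):
    """Chi-square statistics for all item pairs (i, j) of two MRCVs."""
    w_cols, y_cols = list(zip(*w)), list(zip(*y))
    return [pearson_2x2(wc, yc) for wc in w_cols for yc in y_cols]


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def bonferroni_p(w, y):
    """Bonferroni-adjusted p-value: smallest pairwise p-value times I * J."""
    stats = item_statistics(w, y)
    return min(1.0, len(stats) * min(chi2_sf_1df(s) for s in stats))


def bootstrap_p(w, y, b=1999, seed=0):
    """Share of resamples whose summed statistic reaches the observed one."""
    rng = random.Random(seed)
    n = len(w)
    observed = sum(item_statistics(w, y))
    exceed = 0
    for _ in range(b):
        # Drawing W rows and Y rows independently imposes SPMI on the resample.
        w_star = [w[rng.randrange(n)] for _ in range(n)]
        y_star = [y[rng.randrange(n)] for _ in range(n)]
        if sum(item_statistics(w_star, y_star)) >= observed:
            exceed += 1
    return exceed / b


# Toy data with perfectly associated items, invented for illustration.
w = [[1, 0]] * 10 + [[0, 1]] * 10
y = [[1, 0]] * 10 + [[0, 1]] * 10
p_bonf = bonferroni_p(w, y)
p_boot = bootstrap_p(w, y, b=499)
```

On this deliberately extreme toy sample both p-values fall far below 0.05, so SPMI would be rejected. For real analyses, the MRCV package remains the reference; this sketch only shows the logic of the two adjustments.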

Findings
This chapter presents the findings of the survey in two parts. Section 3.1 describes the 15 dimensions along which the providers have been classified and presents the findings in bar chart form. Several trends are already pointed out and briefly compared to results of (Schomm et al., 2013) and (Stahl et al., 2014). Section 3.2 presents the findings from the marginal tables and the tests for SPMI.
The two earlier surveys (Schomm et al., 2013) and (Stahl et al., 2014) established the initial framework, which is used and developed further here. As will be seen, this framework allows meaningful comparisons of the results between the data providers over the years as well as predictions about future trends in this area.

Dimension Results in Distributions
The providers we considered are classified by 15 dimensions, each of which consists of several categories. These dimensions originate from the (Schomm et al., 2013) and (Stahl et al., 2014) surveys and are split up into objective and subjective dimensions. The quantifiable, objective dimensions structure the surveyed data offerings into different types, while the subjective dimensions aim at capturing an impression of the respective company. All values are strictly Boolean, as an offering either fulfills a category or not. Most categories are not mutually exclusive and a single offering can cover several categories. When categories are exclusive, it is pointed out in the respective description. The writing in the next two chapters is deliberate: When referring to a dimension, the name of that dimension is capitalized (e.g., Size). Categories are capitalized and italic (e.g., Economic Data). The setup of the bar charts is identical for all figures: The abscissa maps the categories of the respective dimension and the ordinate maps the absolute number of cases. For the non-mutually exclusive categories, the case numbers do not sum up to the 72 surveyed providers.
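The Boolean encoding described above can be illustrated with a small data structure. The following Python sketch is an assumption about one possible representation, not the authors' tooling; the provider names and their category assignments are hypothetical. It stores the ticked categories per dimension, derives the 0/1 row for a provider, and computes the per-category counts that a bar chart would display.

```python
from collections import Counter

# Hypothetical providers and category assignments, for illustration only.
providers = {
    "provider_a": {"Type": {"Raw Data Vendor"}, "Domain": {"Any"}},
    "provider_b": {
        "Type": {"Data Marketplace", "Raw Data Vendor"},
        "Domain": {"Geo", "Address Data"},
    },
    "provider_c": {"Type": {"Crawler"}, "Domain": {"Social Media"}},
}


def to_boolean_row(ticked, all_categories):
    """Encode the ticked categories of one dimension as a 0/1 list."""
    return [int(category in ticked) for category in all_categories]


def category_counts(sample, dimension):
    """Absolute number of providers per category (the bar heights)."""
    counts = Counter()
    for dimensions in sample.values():
        counts.update(dimensions.get(dimension, set()))
    return counts


type_counts = category_counts(providers, "Type")
row_b = to_boolean_row(
    providers["provider_b"]["Type"],
    ["Crawler", "Raw Data Vendor", "Data Marketplace"],
)
```

Since the second toy provider ticks two Type categories, the counts add up to four although the sample holds only three providers, which is the effect noted for non-mutually exclusive categories.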

Type
This dimension specifies the business model(s) of a data vendor. The categories are not mutually exclusive as one business model may cover several categories or one company may offer several services. Crawlers and Customizable Crawlers are offerings that search (crawl) a specific webpage or a set of webpages along links and extract data matching certain keywords into a given format. While Crawlers are bound to one domain of data, Customizable Crawlers can be set up by the customer to crawl for any content. A Search Engine returns lists of relevant content to the user's input of keywords. Raw Data Vendors offer data in a cleaned, formatted way, usually in tables, but without further analysis. Complex Data Vendors, in contrast, process the data available in some way, for example by integrating various data sources or using statistical analysis. Matching Data Services sell the verification of customer input, which they match against their own data, for example as address or business risk verification. When data is merged, matched, or compared to other data, it is enriched and its value increases. Enrichment services differ from Complex Data Vendors in that they enrich the data according to the customer's specification.
Enrichment - Tagging provides metadata to mostly textual data by tagging additional information like geo-coordinates to addresses or topics to Twitter posts. Enrichment - Sentiment services extract sentiments and opinions towards a certain product or topic, usually based on social media postings. Enrichment - Analysis services provide further additional information, using statistics or comparisons with historical data to enrich the data. Data Marketplace as a category does not refer to the infrastructural phenomenon that is the topic of the paper, but rather to the intuitive understanding of platforms with a high number of buyers and suppliers. When a marketplace operator also supplies its proprietary data on the marketplace, both Data Marketplace and the corresponding vendor category, most often Raw Data Vendor, are ticked.
Figure 1 details the distribution of Type. Raw Data Vendor is the most commonly encountered type in the survey with a share of 37.5%, followed by Data Marketplace and Enrichment - Analysis with 20.8% and 12.5%, respectively. Fifty-two (72%) of the providers classify for a single category, 17 (23.6%) correspond to two categories, and the remaining three correspond to three categories. This suggests that every category represents a sensible business model that can stand on its own. It could also be an indicator that most providers prefer to focus on a single offering without spreading their business model too far. The most common combinations of categories are among the enrichment services, which account for 23 counts with only 15 distinct providers.

Domain
This dimension describes the area of application or topic of the offered data. Whereas the dimension is not mutually exclusive, the Any category is exclusive to classify data vendors that sell a variety of data unrestricted to any domain. Economic Data is data about stock markets, company developments, product information like pricing, and about specific economic sectors. Scientific Data describes data on environmental, pharmaceutical, medical, or scientific work or research. Social Media refers to the capturing of posts, tweets, opinions, and trends on social media. Geo is any data relating to maps, landscapes, and the geographical position of businesses or individuals expressed in coordinates. Contact data in the form of address lists, email lists, or customer information is categorized in Address Data. (In contrast to (Schomm et al., 2013) and (Stahl et al., 2014), the Economic Data and the Scientific Data categories have been renamed to clarify their content; their meaning remains unchanged.)
The domain distribution in Figure 2 shows that data without domain restrictions makes up 29.2% of the data. Only 13.8% of the providers offer more than one domain of data, which implies that most data providers specialize in only one domain. This possibly reflects the limitation to only one business model. On the other hand, the results from (Schomm et al., 2013) and (Stahl et al., 2014) clearly suggest a trend towards Any data. As such, the results from this survey might be heavily influenced by the group of newcomers.
The combinations among the providers with more than one domain are evenly distributed on Economic / Geo / Address Data, Economic / Address Data, and Geo / Address Data.

Data Origin
The Internet as a data source means that providers manually or automatically collect data from other online sources and sell either the aggregation or further processing of the data. Self-generated sources refer either to services that assemble their data privately via patented generation and analysis methods or to services that gather the data from various other data sources not covered in the remaining categories, like news agencies. User-generated content means that users of the service need to provide some data input in order to receive the desired results. This category cannot stand on its own because all user input needs to be matched against some proprietary data. A service is categorized as Community when the data is supplied by the users, as in a marketplace or in a crowdsourcing service, or when the users can edit the supplied data. Data from Governments is official data collected by highly trustworthy sources like ministries or government agencies and distributed by the provider. Authority as a source describes data that is curated by some expert (organization), e.g., the Postal Office on addresses. Only institutional sources are recognized as authoritative; reputable journals like "Nature", for example, are not.
The results in Figure 3 show two opposing trends. On the one hand, reliable data from Authorities and governmental sources seems to be commonly used. On the other hand, despite their questionable reliability, self- and community-generated data is relied upon by an even higher number of providers. It should be noted that approximately one fifth of the 40 providers that have only a single data source rely on self-generated data alone. Only 12 (16.7%) of the vendors use three or more data sources. While data and metadata available on the Internet remain a main data source for all providers due to their relatively effortless exploitation, there seems to be a trend towards self-generation. This indicates that individualized data sources become a unique selling point.

Time Frame
The currentness of the data and whether it needs to be updated regularly to remain valid is observed in this dimension. The categories are overlapping because different types of data may be offered. We differentiate Static/Factual data (facts that are valid for longer periods of time) and Up to Date data (e.g., stock data that are only valuable for short periods).
Of the surveyed providers, 26.4% sell both data types and only 13.9%, less than a fifth of the remaining providers, sell Up to Date data only.This may be attributable to several constraints of Up to Date data: Its collection requires setups that are more sophisticated and it is often collected without finding a buyer immediately, therefore demanding capacities and effort, which might go to waste.

Pricing Model
Some services provide their data entirely for Free. In Freemium models, a part of the service can be used free of charge before paying for a premium account or service. This category cannot stand on its own and is always in combination with the remaining two categories: Pay-per-Use or Flat Rate. The former is based on the number of times a dataset is called via API queries or access clicks. The latter charges a monthly or annual fee for the data access, sometimes with an amount limit.
Most of the providers (54.2%) offer only one pricing model, 25% offer Freemium in combination with another model, and 6.9% offer three or more pricing models. With 61.1%, Freemium / Flat Rate is the most popular combination among the freemium models, followed by 27.8% offering both Flat Rate and Pay-per-Use in combination with Freemium and only two providers offering Freemium in combination with Pay-per-Use. Only one provider, the Microsoft Azure Marketplace, offers all pricing models.
For the non-freemium providers, Flat Rates still take front rank before Pay-per-Use with 59% over 41%, though with a narrower margin. The clear lead of Flat Rate over other pricing models suggests that continuous access to data may take precedence over granular pricing. This is most likely due to provider preference for Flat Rates, because those provide a higher certainty of revenue (Muschalle et al., 2013). Overall, the distribution was as follows: Free (15), Freemium (18), Pay-per-Use (23), and Flat Rate (39).
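The percentages in this subsection can be retraced from the reported counts. The short Python calculation below is one reading that reproduces the published figures, not a method stated by the authors: the counts of 11 and 5 freemium providers are assumptions back-calculated from 61.1% and 27.8% of the 18 Freemium offerings, and the non-freemium shares are assumed to be taken over Flat Rate and Pay-per-Use offerings after removing the freemium combinations. All variable names are hypothetical.

```python
# Counts reported in the text.
freemium_total = 18
flat_total = 39
ppu_total = 23
freemium_ppu_only = 2          # "only two providers" with Freemium / Pay-per-Use

# Assumptions: back-calculated from 61.1% and 27.8% of the 18 Freemium offerings.
freemium_flat_only = 11
freemium_flat_and_ppu = 5

# Shares among the freemium models, in percent with one decimal.
share_flat_only = round(100 * freemium_flat_only / freemium_total, 1)
share_flat_and_ppu = round(100 * freemium_flat_and_ppu / freemium_total, 1)

# Offerings left once the freemium combinations are removed.
flat_non_freemium = flat_total - freemium_flat_only - freemium_flat_and_ppu
ppu_non_freemium = ppu_total - freemium_ppu_only - freemium_flat_and_ppu

non_freemium = flat_non_freemium + ppu_non_freemium
share_flat_non_freemium = round(100 * flat_non_freemium / non_freemium)
share_ppu_non_freemium = round(100 * ppu_non_freemium / non_freemium)
```

Under these assumptions the three freemium combinations add up to the 18 Freemium offerings, and the remaining Flat Rate and Pay-per-Use offerings split in the reported 59% to 41% ratio.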

Data Access
The data access dimension determines how the user can display and access the data. Most services offer several options. An API (application programming interface) allows for seamless integration of the data provided into other software applications because it is not bound to a specific platform. Download requires no special prerequisites on the customer's side and provides clients with reliable data access in the form of downloadable files. Specialized Software developed by the data provider helps examine, analyze, or visualize the data via software clients, mobile apps, or desktop applications. A Web Interface allows the customers to directly explore and use the data from within a browser.
No clear trend towards a specific data access type can be identified: 18.1% of the providers offer only one data access type, 41.7% two, 27.8% three and only 8.3% offer all access types. Since about one third of the providers offer three or all data access types, data providers seem to identify a necessity to give customers more flexible data retrieval options. The relatively uniform distribution on API (47), Download (42), and Web Interface (45) shows that both average users and more technically versed users are targeted. In contrast, Specialized Software (23) is offered significantly less often.

Data Output
Most services do not rely on a single output format but rather offer a combination of several data display and retrieval options. The exchange formats XML and JSON are used for semi-structured data. RDF represents data in triples and is commonly used in the Semantic Web. Tabular data readable with most standard spreadsheet software is grouped in the CSV/XLS category. The Report category comprises all visualized data formats like PDF, DOC, or JPEG. Offerings that provide data in formats not covered by any of the output categories are treated as zero values in all categories.
The notion of flexible data access types is not confirmed by the observations depicted in Figure 4. Only 25% of the providers offer more than two data output formats. The most common combination is CSV / JSON, closely followed by CSV / Report. The high number of CSV data could possibly show that data providers aim at a convergence towards the mainstream market and that structured data maintains its high importance.

Data Language
This dimension refers to the language of the metadata of the offered data. This includes names of columns and tables as well as localizations of units (e.g., Fahrenheit vs. Celsius). It is not relevant whether the data refers to information in other languages; only the language of the metadata itself is observed. The predominance of English and German is due to the selection criteria that have been applied. All other languages available have been counted and the three most frequently encountered have been added to the dimension, namely Spanish, French, and Portuguese. The More category is used when a data provider offers the data in a further language.
Apparently, a national focus does not mean that providers also translate the metadata, which universally remains in English (71) as the primary language for most data-related technology. This is further supported by the fact that only one data provider did not offer English data. Of course, the acquisition of new providers is based on English keywords, which heavily skews their distribution, so these results should not be overemphasized. Nevertheless, other languages observed are German (14), Spanish (8), French (8), Portuguese (6), and Others (9).

Target Audience
The clients of the providers surveyed are of concern in this dimension. Business-to-business (B2B) services have other companies as their buyers and are categorized in Business. In business-to-consumer (B2C) services, the offering is geared towards private persons interested in specific information and categorized in Consumer.
Businesses remain the main customers of data providers (65). In contrast, 20 providers serve consumers, only eight of which target them exclusively. It should be noted that several consumer-oriented providers are excluded from the survey, namely wikis and institutional websites geared towards citizens, which skews the results. Just 18.1% of the providers target both customer types.

Ownership
The new dimension Ownership is introduced to evaluate whether the manifestation structures identified in (Stahl et al., 2016) occur in the data marketplace. Services that control the flow and price of the data offered are Private. Platforms operated by associations of several providers are categorized as a Consortium. Services simply providing the infrastructure for the marketplaces are Independent. In line with the model definitions, marketplaces where the provider also takes an active part on its own marketplace are not independent but are categorized as consortium marketplaces. Two providers, Dayta.com and eXelate.com, offer separate services: a privately operated data service and an independent marketplace infrastructure where they themselves are not active. Due to the different roles they assume in the dual offerings, they are included twice as distinct offerings.
Nearly all services are privately owned (54). Of the observed marketplaces, nine are independently operated and six are consortium-based. A further three independent operators run search engines.

Pre-Purchase Testability
The possibility of evaluating the offered services prior to a purchase is rated in this dimension. The categories here are mutually exclusive. With None, the buyer has to rely completely on the additional information without any means of previewing the data before buying. Restricted Functions means that only some functions of a tool are unlocked for the potential customer to preview. Restricted Number/Volume testability allows the customer access to the full functionality of the service, but is limited to a fixed number of operations or a timeframe. Complete access means that every user can use all functions and features of the final product immediately or after registering.
Our results show that the number of providers not offering any possibility to preview or test the data is surprisingly high (26). An additional group of providers relies on the assumption that a glimpse of their offering (i.e., Restricted Functions) (10) is enough to convince potential customers. Together, they make up 50% of the providers. The remainder provides at least limited access to the complete offering (16) or gives complete access (20). Considering that a portion of those are most likely free providers, the results clearly suggest that most providers hesitate to allow access to their data.
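The shares quoted above follow directly from the category counts. The following minimal sketch recomputes them; the counts are those reported in the text, while the variable and function names are illustrative and not part of the survey's tooling.

```python
# Counts per Pre-Purchase Testability category, as reported in the text.
testability = {
    "None": 26,
    "Restricted Functions": 10,
    "Restricted Number/Volume": 16,
    "Complete": 20,
}


def shares(counts):
    """Return each category's share of the total as a fraction."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


testability_shares = shares(testability)

# Providers offering no preview at all or only restricted functions.
little_or_no_preview = (
    testability_shares["None"] + testability_shares["Restricted Functions"]
)

for name, share in testability_shares.items():
    print(f"{name}: {share:.1%}")
print(f"None + Restricted Functions: {little_or_no_preview:.1%}")
```

Because the categories are mutually exclusive, the counts sum to the number of surveyed providers, so the same helper can be applied to the other mutually exclusive dimensions.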

Pre-Purchase Information
In this mutually exclusive dimension, the information available on the final product is examined. The amount of information, rather than an even more subjective notion of information quality, is the determinant for this subjective dimension. With Barely Any information, the potential customer has to guess the features of the service offered or, as with most services in this category, has to inquire after information via email. Sparse Medial Information refers to providers that give some information on the general features of their products without technical details or implementation instructions. Comprehensive Medial Information refers to services that provide a variety of information from demo videos to fact sheets, screenshots, or customer reviews. Two-thirds of the providers give out plenty of information on their precise offering and its functions in the form of videos and demonstrations (48). Only 26.4% (19) and 6.9% (5) of the services supply sparse or barely any information, respectively. In combination with the results regarding Pre-Purchase Testability, these results show that providers prefer to lower buyers' high uncertainty through information rather than through previews of the data.

Trustworthiness
This subjective dimension rates services on their trustworthiness, mainly based on the data sources. The detailed disclosure of data generation methods with named, reliable sources points towards a High trustworthiness. Services that only provide their general sources or rely on rather debatable sources indicate a Medium trustworthiness. Tagged as Low are offerings that do not even claim to provide complete and reliable data. Other factors include the level of sophistication of methods of data retrieval (basic crawling service vs. daily crawling with manual checks) and the reputation of the vendor. The categories are not mutually exclusive, in order to reflect different data qualities within one offering.
The fact that Low (23) and Medium (20) trustworthiness combined are just a little more than High (38) does not allow for meaningful conclusions. Only eight providers fall into more than one Trust category; they are evenly spread across Low / Medium and Medium / High.

Size of Vendor
As only the respective websites are evaluated, the estimation of a company's size is subjective; the categories are mutually exclusive. It should be noted that the size of the vendor refers to the company behind the concrete project so that a rather small project like Freebase.com is still categorized as "Global Player" because it is owned by Google. Startups have only recently been funded by investors. Medium refers to businesses that have left the startup phase and established themselves in the market, usually with one core product. Big refers to vendors that have a well-established market position and cover a big market share with a variety of products. Global Player refers only to the biggest companies in the internet market such as IBM, Google, or Yahoo.
A little more than a third of the providers are Startups (26). This indicates that the market still has gaps that first movers can fill. This potential for growth is balanced by the high share of established firms of all sizes (medium: 22, big: 15, global players: 9), which apparently still have sufficient possibilities for development. The combined results suggest a market in motion, which has not yet exhausted all innovation potential. It should be noted that 16 of the 35 newly included providers are Startups, which might skew the results in this dimension.

Maturity
This mutually exclusive dimension is subjective as well and refers to the stage of business development.
Research Projects are rarely commercialized and refer to trials of projects or proof-of-concept websites. Beta projects are in development and sometimes already commercialized. Medium offerings provide a sophisticated data or service supply. A High maturity refers to a range of different, refined products.
Of the surveyed providers, 59.7% (43) possess a High maturity. An additional 19.4% (14) have a Medium maturity, which indicates a generally high maturity among the offerings. Only eight classify as research projects and seven as beta. Combined with the results for size, the market could be tentatively characterized as innovative with sophisticated products.

Statistics
Table 1 shows the combinations of Type and Domain among the surveyed providers. The two most common combinations are Raw Data Vendor / Economic Data and Raw Data Vendor / Address Data. Raw Data Vendor / Any, Marketplace / Any, and Enrichment - Sentiment / Social Media tie for third place. Those results are somewhat expected since Raw Data Vendor and Economic Data are among the most often encountered categories. Most data domains are distributed over a variety of different business models with the exception of Scientific Data, which is distributed via only two distribution channels. Even though this domain may also be covered in the Any category (Thomson Reuters, for example, sells a variety of medical and pharmaceutical data), it is evident that Scientific Data is not only rarely sold as a standalone product but also through only a limited variety of providers. The combined results of Type / Origin in Table 2 confirm some intuitive speculations: Enrichment services and crawlers collect their information on the Internet while marketplaces provide mainly community-curated data. One result is somewhat misleading: It appears at first glance that the majority of Raw Data Vendors, the category that most providers match, collect their data themselves, which could indicate a demand for specialized, not yet publicly available data. However, only six providers depend on self-generated raw data alone, which means that the true majority aggregates online, federal, and institutional sources, which indicates a demand for aggregated and cleaned data.
The independence hypothesis can be rejected for three combinations, which means that the categories in these combinations are significantly associated. These combinations are Matching Data / User, Marketplace / Community, and Enrichment - Sentiment / Internet.
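The text does not state which test was used to single out individual combinations. Purely as an illustration, and not as a reconstruction of the authors' procedure, the sketch below tests each cell of a marginal table by collapsing the table into a 2x2 table (this category vs. all others), computing a chi-square p-value with one degree of freedom, and applying a Bonferroni adjustment. The function names and the toy counts are hypothetical; the toy counts are not survey data.

```python
import math


def cell_p_value(table, i, j):
    """Chi-square p-value (1 d.o.f., no continuity correction) for cell (i, j).

    The contingency table is collapsed into a 2x2 table:
    row i vs. all other rows, column j vs. all other columns.
    """
    n = sum(sum(row) for row in table)
    row_i = sum(table[i])
    col_j = sum(row[j] for row in table)
    a = table[i][j]            # in row i and column j
    b = row_i - a              # in row i, other columns
    c = col_j - a              # other rows, column j
    d = n - row_i - col_j + a  # other rows, other columns
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate margin: no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 d.o.f.
    return math.erfc(math.sqrt(chi2 / 2))


def bonferroni_significant(table, alpha=0.05):
    """Return the cells whose Bonferroni-adjusted p-value is below alpha."""
    cells = [(i, j) for i in range(len(table)) for j in range(len(table[0]))]
    m = len(cells)  # number of tests performed
    result = {}
    for i, j in cells:
        p_adj = min(1.0, cell_p_value(table, i, j) * m)
        if p_adj < alpha:
            result[(i, j)] = p_adj
    return result


# Hypothetical counts (NOT the survey data): rows are two provider types,
# columns are two data origins.
toy = [[18, 2],
       [3, 17]]
significant = bonferroni_significant(toy)
print(significant)
```

A significant cell only signals an association; the combination may be over- or under-represented relative to independence. With counts as small as those in a survey of this size, the chi-square approximation is unreliable, so an exact test would be the safer choice in practice.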

Table 3. Marginal Table Pricing / Domain
In Table 3, one can see that specialized domain data is rarely given away free of charge. With the exception of Scientific Data (80% free), virtually none of the other domains are distributed free of charge. The Any category shows no clear trend, as it is evenly distributed over the pricing models. Social Media and Economic Data tend to be priced in flat rates, which makes sense given that the majority of them need to be updated regularly.
The hypothesis of independence can be rejected at a significance level of α = 0.05, with Free / Any and Free / Scientific Data as the two most significant combinations.
Table 4 shows that less than a third of the offerings geared towards private customers charge for the data. Virtually all of the remaining services offer a Freemium model. Bearing in mind that only 9.7% of the providers serve exclusively private customers, it becomes apparent that the surveyed providers focus mainly on B2B relations. For Business customers, fees seem to be the norm. Considering that they are the most common combination, Freemium and Flat Rate models could represent a strategy to accustom customers to the data offering and make use of lock-in effects. The hypothesis of independence of the two dimensions can be rejected for all combinations between audience and pricing model.
In Table 5, it can be seen that some types of data providers prefer certain pricing models. Some of the previously identified associations between dimensions provide possible explanations for this: enrichment services mainly sell Social Media data from Internet sources (which in turn are closely associated as well), which favors Flat Rates (as evident from Tables 1 and 2). These cross-associations suggest that similar business models manifest in the same way across dimensions. The fact that the Bonferroni-adjusted p-value is not significant at a significance level of α = 0.05 and that no significant combination could be found indicates that the association between Pricing Model and Type is not strong.
Regarding the Raw Data Vendors, the clear trend towards Flat Rate and Freemium (which is mainly distributed on Flat Rates) indicates that a constant supply of data represents an important selling point. Marketplaces have the most diverse pricing models, with nearly half of these 15 providers offering their data free of charge, while the other half is evenly distributed between Pay-Per-Use and Flat Rate. The lack of certain results is also due to the methodology: Web Crawlers that provide their code free of charge are excluded from the survey due to the lack of proprietary data (and non-profit crawlers could not be found), so no combination of those two categories is observed.

Table 6. Marginal Table Origin / Domain
Table 6 shows that some domains draw from a variety of sources whereas others are rather restricted to a specific type of source. For most domains, this allocation is natural, i.e., only the Internet for Social Media or all data sources for Any. Economic Data is mainly derived from authoritative and individual data sources and only rarely from "freely" available sources. Again, this might either reflect the different data types contained in that category or rather point towards Self-Generated data as a distinguishing feature for competitive advantage.
The most common combination of Domain / Origin is Address Data and Self-Generated, which implies limited transparency regarding the sourcing of address data on the Internet. Any data mostly comes from Communities, which could indicate that low participation barriers lead to unrestricted data domains.
The hypothesis of independence can be rejected at a significance level of α = 0.05 for all methods, with Social Media / Internet as the only significant combination.

Trends
In this section, we present trends, incorporating results from our two preceding surveys reported in (Schomm et al., 2013) and (Stahl et al., 2014). When looking at the results of the surveys over the course of the last three years, five global trends can be identified: First, some provider manifestations seem to make more sense than others do: Enrichment providers often cover sentiment analysis and other enrichment services of social media, sourced from the Internet and sold through flat rates. Another common type is matching data services, which use user- and self-generated data to match addresses, geographical, and economic data. Generally, providers focus on only one category (73.6%) and limit themselves to only one domain (89%) and one data source (56.9%). This indicates that providers split themselves into two groups: Hierarchical ("vertical") providers with only a single domain offering and intermediate ("horizontal") platforms, where unrestricted domain data can be acquired. Community contributions on marketplaces result in data on a variety of topics.
Second, the growing significance of unique data is evident from the increase of self-generated data. Providers who specialize in one domain rarely give their data away for free and usually charge a fee. In light of the fact that the market is mainly a B2B one, this is hardly surprising. In the survey from 2012, Internet sources made up half of the observed sources, while the survey from 2013 finds a disproportionate increase of self-generated, community, and user sources. As self-generated data is rarely employed as the only source, those sources represent a point of differentiation among the competition. This development also indicates that providers decrease their efforts in reselling data available on the Internet and move to individualized data sources.
Third, the clear advancement of flat rates over pay-per-use is somewhat unexpected when compared to the previous surveys, where those two types lie level with each other. Providers clearly prefer flat rates due to their steadier revenues and usually combine them with freemium models to reduce uncertainty and take advantage of lock-in effects. Furthermore, pay-per-use models have not (yet) reached the level of sophistication necessary to prevent arbitrage exploitation. Research has been conducted to find technical and policy amendments (Balazinska, Howe, & Suciu, 2011; Koutris, Upadhyaya, Balazinska, Howe, & Suciu, 2012). Customers favor simpler pricing models as well and are not satisfied with granular pricing models that restrict unfocused data exploration. This is supported by the pricing development in the private TV sector: Flat rate models as offered by, e.g., Netflix are far more successful than models where customers have to decide on their willingness to pay for every single movie. Overall, there is a trend towards flat-rate-based pricing models for digital media content in general. This is evident when considering success stories such as Spotify for music streaming, Netflix for video streaming, or Amazon's Kindle Unlimited for eBooks. This trend will continue, such that flat rates will be even more dominant as a pricing model for data and data-related services.
Fourth, the results from the ownership dimension indicate that hierarchical ("vertical") relations still dominate the data market. The low number of intermediaries shows that the efficiency of the market is still limited and that data products are highly differentiated.
Fifth, the prevalence of data access types has shifted over recent years away from APIs, which originally dominated the field; Web exchange formats like JSON and XML gained importance, only to be surpassed by CSV data this year. Although this could be related to the sample, the likewise high number of report formats allows for two possible explanations. Either, as argued in (Stahl et al., 2014), these two results point towards more processed data or, when considering the high number of raw data vendors, this indicates that the providers aim at making the data more available to non-technical users. The development of data formats and access options suggests an orientation of the market towards a mainstream market that is also targeting non-technical companies and users: Many providers offer several, some even all, access possibilities but limit the number of data formats. The restriction to mostly standard formats like reports or CSV probably aims at reducing presuppositions on data use. The high number of API accesses indicates that this development most likely does not involve a withdrawal from the initial target group.
With regard to the size of the providers, an interesting observation can be made when looking at the progression across the surveys. Initially, the market consisted mainly of bigger, established companies originating from other software- and hardware-related industries. Over the years, this domination has diminished, as the market became more diverse with providers of different sizes and especially new companies participating. Through the extension of the sample, some of the new entrants are now included and surveyed as well, as evident from the high number of startups in the newcomer group. The combined findings allow for the suggestion that initially the market had rather high entry barriers. This gave advantages to established companies that could raise the necessary investments and quickly establish a relevant market share. Since the first survey, the entry barriers have clearly lowered, which now allows startups to form and join the market. Since electronic markets are considered to have low entry barriers, this is just one possible explanation, but probably the most likely one. This development is supported by a growing number of startups that consume data from data markets (see, for example, http://mlwave.com/ycombinator-2014-data-science-start-ups/).
The collective entry of startups does not contradict the finding of a growing and maturing market. Quite to the contrary, their development suggests that the trading of data through intermediaries is now established and that investors are willing to fund innovative new concepts. The tendencies in the maturity dimension confirm this. Despite the new providers, the direction towards a high maturity is constant throughout all surveys. This means that startups start with sophisticated business models. All this suggests that the market has settled from its initial launch phase into a more stable but still highly innovative phase, where both newcomers and established data providers find plenty of potential for development. This phase is accompanied by a high fluctuation of providers that enter and leave the market, which is also evident in the sample with seven closed services since 2012.
When applying the second explanation that presumes two data demands, commoditization gains relevance. A commoditization of individualized data with a high specificity is presumably undesirable for the consumers. As such, an intensification of commoditization for that group is unlikely. In the case of the first group, data of constant quality, a convergence towards commodities would likely accelerate and amplify its exchange. As presented in (Stahl et al., 2016), the more standardized a product is, the lower the costs of implementation are, and the more likely its purchase on a marketplace is. This would entail a more competitive market for that group.
The most important indicator for that development will probably be the development of the Ownership dimension. Intermediary platforms will proliferate and represent that development. Because competition and commoditization are highly interdependent, their parallel advancement would presumably catalyze the commoditization of data further.

Emerging Scenarios
The future development of data markets remains dependent on the resolution of Arrow's information paradox. This paradox states that if a customer wants to evaluate the quality and value of information, he needs to examine the information itself before purchasing it, which he cannot do as he would then have gotten the information for free (Arrow, 1962). Today, more than 50% of the providers offer at most an excerpt of their offering and are very reluctant to provide potential customers with previews of their data. Apparently, they are aware of this obstacle and aim to reduce the buyers' uncertainty by providing information on the data. However, the content of data is far more relevant than the functionality of the accompanying API, so it is rather unlikely that pre-purchase information supplied by the seller is sufficient to resolve all uncertainties completely.
One of the most interesting results is the seeming opposition between the trend towards processed data in (Stahl et al., 2014) and this year's trend towards raw data. The simplest reason for the occurrence of these results would be that the market has changed its direction. This explanation can be traced back to the development and expansion of the market in general in the last two years: Originally dominated by larger companies from other industries, the market is now diversified through a number of startups that entered the market. While this is a plausible suggestion, another suggestion is presented in (Muschalle et al., 2013). In interviews with data providers, two customer demands in data acquisition are distinguished: a first one in which customers expect complete, formatted, and reliable data, and a second one in which customers are not dependent on the quality of the data and rather wish for tendencies and answers to be integrated into the decision making of companies (Muschalle et al., 2013).
When extending this idea further, two different scenarios can be developed. In the first scenario, the data is used as a type of manufacturing input. In order to process the acquired data further and use it as a basis for the production of another good, its quality must be extremely high and the access to it must be reliable. Especially the growing importance of data in the medical and pharmaceutical sector supports this notion, as does, for example, the emerging area of 3D printed cars. In the second scenario, the data is considered an add-on and a specialized product that can be spot purchased whenever necessary or be acquired on a regular basis. Its quality is not of crucial importance compared to the importance of its specificity. An example of such a demand could be 3D printing files (other than for cars). In the add-on scenario, customers expect a higher individuality of the product to match their particular wishes, while data buyers in the first scenario would more likely expect a constant standard, which they can depend on. Examples of the first scenario are the financial data APIs offered by Xignite.com, BloombergPolarLake.com, or InteractiveData.com. The specialized inputs in the second scenario could be some enrichment services like CrowdSource.com, crawling services like 80legs.com, or address sellers like xDayta.com.
Under the assumption of those two scenarios, the opposing directions can be resolved and explained. Additionally, this explanation is backed by the continually inconclusive trustworthiness dimension, which takes shape in both high and low trustworthiness. In addition, the origin dimension, which shows an importance of both highly reliable sources like authorities as well as the strong increase in self-generated data, could point towards the second scenario.
Clearly, this explanation is not exhaustive. Several providers, such as the address validation tools, fall into neither category or are difficult to assign to one. Some are obviously spot-purchase oriented like VICO-Research.com, but others like Gnip.com could serve customers both as a regular pillar of information in business or be only an add-on information service. Nevertheless, it provides an interesting perspective on the different data types demanded and suggests that not only high-quality data is demanded.

Conclusions
In this paper, we have reported on the third iteration of our data marketplace study and compared its results with those obtained in previous years. Furthermore, we have identified trends and outlined future scenarios.
Concluding the third iteration of the data market survey, one result is obvious: Arrow's information paradox remains the major obstacle to data trading. The empirical and qualitative data confirm that providers are very reluctant to share information about their data before a business deal. As long as this issue prevails, the pricing of data remains far from its competitive price.
Regarding the next five years, the diversification of providers will continue; even today, this can be observed, as established providers mature and innovative new companies and business models emerge. Moreover, a trend towards the mainstream market for non-technical companies and, subsequently, non-technical staff can be observed: data formats that are easily accessible through downloads and Web interfaces are becoming more common. One business model that is currently emerging allows consumers to directly sell their personal data (e.g., from fitness trackers) for profit on platforms, with handshake.uk.com and datacoup.com being among the first companies to offer this.
Automated surveillance and analysis of social media data is the most promising business model to become the next "big thing" within the data community. As all social networks continue to grow and expand their offerings themselves, companies become increasingly reliant on observing what happens when it happens.
As such, attributing a value to data is likely to become normal, and expectations that "all information on the Internet is free" need to adjust (or will fade away anyway). For the time being, most business models on the Internet can easily be identified, as most of them embody the virtual translation of previously existing industries such as, for example, contact data selling or business partner verification. Contradicting the perception that entrepreneurs entering the data market will always be innovative, most providers so far stick with specific, consolidated business models that promise secure revenue opportunities, an observation that does not apply to the Internet at large. Although data procurement has moved to the market, as evident from the publicly accessible web sites surveyed, "real" intermediaries in the sense of open platforms are still rather rare. Most providers seem to prefer hierarchical relations.
Regarding commoditization, data products are still highly differentiated and not in direct competition with each other. Data is still a highly individualized good and it is hard to compare different data sets with each other. Nevertheless, the data format is being homogenized with the observed rise of standards like XML and JSON. This leads to easier data handling and processing for customers, and lower overall costs for data integration. A similar progression has been observed in international freight traffic, which has been substantially simplified through the usage of standardized containers, e.g., on ships and trucks. Furthermore, the high number of raw data vendors indicates that the market moves in that direction and that data will become more of a commodity. The data market will become more competitive and pricing models will become even more relevant. This is especially relevant for static and factual data because the marginal cost of an additional copy of the product is virtually zero, which potentially leads to existence-threatening price competition. The consequences this might have on the willingness to pay on the consumers' side will be interesting to observe.
As for the future of the data demands mentioned earlier, the demand for data as manufacturing input will remain within private business relationships between large-scale providers. Specialized data, on the other hand, has the potential to be provided and purchased on intermediary platforms. Statista's infographic service targeted towards newspapers is a good example, as publishers can purchase the specific information and re-use it for their purposes.

Figure 1. Distribution of Type

Figure 2. Distribution of Domain

Figure 3. Distribution of Data Origin

Figure 4. Distribution of Data Output

Table 1. Marginal Table Type / Domain

Table 2. Marginal Table Type / Origin

Table 4. Marginal Table Audience / Pricing Model

Table 5. Marginal Table Type / Pricing Model