‘Big data’ has been around since the 1960s, when the first data warehouses were established; however, the way in which data is handled, and the quantity in which it is collected, have changed beyond recognition since then. Using the six V’s of big data (volume, velocity, variety, veracity, value and variability) as a measure, several notable changes in the concept of big data can be seen over the past 60 years.
Nowadays, data is everywhere, and the quantity of data collected globally is growing exponentially. It is estimated that 2.5 quintillion (a million times a trillion) bytes of data are collected every day, and that 90% of the data stored globally has been collected in the past five years alone. The New York Stock Exchange, for example, generates about one terabyte of new trading data per day, and the photos, videos and comments posted on Facebook contribute over 500 terabytes of new data daily. That is a lot of data, and it forms the bedrock from which the term ‘big data’ has been derived.
There is some controversy as to how big data should be officially defined in the modern sense; in layman’s terms, however, it is best described as the aggregation of vast quantities of structured, semi-structured and unstructured data which can be sorted and analysed to produce relevant results.
Types of Data
Big data refers to an accumulation of many different small data sets that together form one larger set, whereas ‘small data’ consists of individual files or smaller amounts of data used by analysts on their own local systems and laptops.
Small data can be distinguished along many dimensions, the most important being its degree of organisation. On this basis, data can be classified into three types: structured, semi-structured and unstructured.
Structured data is most likely to come from an operational platform within a business and is presented in a highly organised format, often stored in a data warehouse. For example: the sales file for a given business stored in a relational database.
Unstructured data has no predefined organisational form or specific format. Examples include images (JPEG or PNG files), video (MP4 or M4A files), plain text files, Word files, PDF files and so on. Approximately 80% of the data collected by companies is of this format, despite the difficulty in both accessing and understanding it.
Semi-structured data has some degree of organisation. Think of a TXT file containing text with some structure (headers, paragraphs and so on). This data can be searched and organised, but with less efficiency than structured data.
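To make the distinction concrete, here is a minimal Python sketch (with illustrative, made-up data) of how each degree of organisation is typically accessed:

```python
import csv
import io
import json

# Structured: fixed schema, every record has the same fields (e.g. a sales table).
sales_csv = "date,product,amount\n2024-01-05,Widget,19.99\n"
rows = list(csv.DictReader(io.StringIO(sales_csv)))
print(rows[0]["amount"])          # every row is guaranteed to have an "amount" field

# Semi-structured: self-describing, but fields may vary between records (e.g. JSON).
record = json.loads('{"user": "alice", "tags": ["vip"], "note": "call back"}')
tags = record.get("tags", [])     # optional fields need defensive access

# Unstructured: no schema at all; meaning must be extracted (e.g. free text).
review = "Delivery was late but the product itself is excellent."
mentions_late = "late" in review.lower()   # crude keyword scan, not a field lookup
```

The structured row can be queried by field name with certainty; the semi-structured record needs defensive access because fields are optional; the unstructured text must be interpreted before it can be queried at all.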
All data forms are important, both collectively and individually, and their relative importance varies with each organisation and its objectives. As data storage and analysis become increasingly popular, data is being collected from more sources, and data that was until now considered irrelevant is being rediscovered as valuable.
Characterisation of Big Data
As discussed, the six V’s of big data represent the key components that define its effectiveness and usefulness. Certain components carry greater importance than others depending on an organisation’s demands; however, all play a role in ensuring data can be utilised as effectively as possible.
Volume quite simply refers to the amount of data being collected; to be considered big data, the volume must be significant, regardless of the form(s) the data takes.
Variety refers to the form(s) of data that are available: structured, semi-structured, unstructured or a combination of the three. This matters because unstructured and semi-structured data will require additional processing of metadata to make them understandable to analysts and to display them in a practical format.
Velocity refers to the speed at which data is collected and used. If data is streamed directly into memory rather than written to disk, its velocity is higher; operational analytics can then be processed far faster, providing near real-time results, though this also requires the means to evaluate the data in real time. Velocity is also the V that matters most for fields such as machine learning and artificial intelligence (AI).
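The in-memory idea behind high velocity can be sketched in Python with a generator, which hands each event to the consumer as it arrives instead of accumulating it on disk first (the event feed here is hypothetical; a real deployment would use a streaming platform):

```python
def event_stream(n):
    """Simulated high-velocity feed of incoming events (e.g. trades or readings)."""
    for i in range(n):
        yield {"id": i, "value": i * 0.5}

# In-memory path: each event is consumed the moment it is produced and never
# written to disk, so an up-to-date aggregate is available in near real time.
running_total = 0.0
for event in event_stream(1000):
    running_total += event["value"]

print(running_total)
```

The generator never holds the whole stream in memory at once, which is what allows the aggregate to stay current as events keep arriving.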
Veracity refers to the quality and accuracy of data sets. Low-quality and inaccurate data may cause issues such as flawed analysis and thus poor decision-making; therefore, verification of data is essential for producing valid results.
Value refers to how useful collected data actually is. Not all data collected is relevant, and it is important for an organisation to understand the value of the data it gathers so that it can be filtered. Irrelevant or unusable data can hinder the data analysis process, so it is important that all data collected is usable in the present or the future. ‘Cleansing data’ is the phrase used for separating relevant data from irrelevant data and eliminating anomalies in the useful data.
Much data can be formatted in different ways and used for multiple purposes. Variability is a measure of how versatile existing data is: the degree to which it can be put to those multiple uses.
What is big data used for?
Due to several factors, including the decreased cost and increased ease of data collection, big data is now gathered across every industry and by almost every business globally. Additionally, more and more devices are connected to the internet every day, gathering data on customer usage patterns and product or service performance.
This big data can be used for numerous purposes, including but not limited to: creating new products and functions, making informed business decisions, creating new opportunities, and improving operations and marketing campaigns. Businesses can analyse consumer and environmental data to predict patterns in human behaviour and our interaction with technology, in turn altering the way consumers behave.
Big data offers vast potential for those who utilise it effectively. In essence, the more you know about anything, the greater your ability to act or respond; organisations that use big data can therefore make faster and more informed business decisions. With over 1.6 billion websites online and around 100,000 new domains registered every day, commercial competition is only increasing, and organisations are constantly looking to innovate and gain a competitive advantage. Effective use of big data presents these organisations with opportunities they didn’t realise they had, and could provide the competitive advantage they seek to excel within the market.
The potential benefits of using big data are endless, as are the examples of where it is already used effectively. In a world of increasingly fine margins for both failure and success, the commercial benefits of managing big data well, enabling organisations to analyse, curate, manage and process data, cannot be ignored.
Difficulties and Issues
There are two major drawbacks of big data that concern the user or organisation (there are several more for the consumer, including privacy issues): the handling of data, and the storage and physical transportation of data (when it is stored on disk).
Unless data is handled appropriately and carefully, trying to make sense of it can be extremely complicated, time-consuming and ineffective. In enterprise scenarios, for example, big data is often either too large or moves too fast to be processed using traditional database and processing techniques. Many organisations lack the infrastructure, or the capital to develop it, limiting their ability to use personal consumer data (considered one of the most valuable forms of data for a business to possess). Furthermore, most organisations lack a workforce with the skills to handle data effectively, which again often requires both capital and operational investment.
The storage and transport of data in its current form (disk) is extremely inefficient, as a single disk holds 16-20 terabytes of data. While this may seem like a lot at first, 1 exabyte of data (1 billion gigabytes) would require storage across roughly 50,000-62,500 disks, and at the rate data is growing, processing on this scale is not an unrealistic expectation within the next 20 years or so. Even if this quantity of data could be processed by a single machine, attaching the required number of disks to it would be impractical. In fact, even with a bandwidth of 800 megabytes per second, transferring this data would take around 40 years, posing potential future holdbacks for data storage.
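The arithmetic behind those figures can be checked directly (decimal units are assumed throughout, i.e. 1 terabyte = 10¹² bytes):

```python
# Figures from the text: disks hold 16-20 TB; 1 exabyte = 1 billion gigabytes.
TB = 10**12                      # terabyte in bytes (decimal units)
EB = 10**18                      # exabyte in bytes

disks_low  = EB / (20 * TB)      # with the largest disks:  50,000 disks
disks_high = EB / (16 * TB)      # with the smallest disks: 62,500 disks

# Transfer time at a sustained bandwidth of 800 megabytes per second.
bandwidth = 800 * 10**6          # bytes per second
seconds = EB / bandwidth
years = seconds / (60 * 60 * 24 * 365)
print(round(years, 1))           # roughly 40 years
```

Even under these generous assumptions (sustained bandwidth, no overhead), a single machine moving an exabyte over one 800 MB/s link would take decades.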
So why is this relevant to Azquo?
Azquo puts data into a form that analysts can easily manipulate and format, with the power and rigour of a database behind it, without being a database in the traditional sense, as it has no dimensionality to it. This means the data-handling possibilities Azquo can provide are endless.
Azquo is incredibly efficient and does not use relational structure at all in its operations. It is a radically non-dimensional type of database that requires no interim processes to use the data and render it into a usable array: everything is conducted in memory, with direct joins between data. The only use of any form of relational storage within the Azquo ecosystem is for persistence.
Amongst other functions, Azquo can be used to deploy Excel as a programming language to surface usable data directly from a data warehouse or data lake such as Hadoop. Almost all of the processing occurs outside the spreadsheet, which serves only as a window into the big data set being accessed.
Azquo is a multifunctional platform that makes previously unusable data usable (improving the veracity and value of data). Its capabilities are limitless, as are the opportunities it provides for users to build applications that quickly surface previously inaccessible data.
See Case studies for real-life examples of how we make sense of big data and turn it from an opportunity into a usable reality.
If you’d like us to help you understand your data and make sure the power inside the data is unleashed, then please do give us a call. Tel. 0203 424 5023
Azquo, making sense of data.