Autodidact: August 2012

Monday, August 13, 2012

HANA Features

Features of HANA (Hybrid In-Memory Database):

No Disk

Modern servers provide up to multiple Terabytes of main memory and allow keeping the complete transactional data in memory. This eliminates multiple I/O layers and simplifies database design, allowing for high throughput of transactional and analytical queries.

In-Memory Computing

High performance in-memory computing will change how enterprises work. Currently, enterprise data is split into two databases for performance reasons. Disk-based row-oriented database systems are used for operational data (e.g. "insert or update a specific sales order") and column-oriented databases are used for analytics (e.g. “sum of all sales in China grouped by product”). While analytical databases are often kept in-memory, they can also be mixed with disk-based storage media.

It is possible to store data sets of entire Fortune 500 companies in main memory. At the same time, orders of magnitude faster performance than with disk-based systems can be achieved.

Read more at http://www.no-disk.com/

Active and Passive Data Store

By default, all data is stored in-memory to achieve high-speed data access. However, not all data is accessed or updated frequently and needs to reside in-memory, as this increases the required amount of main memory unnecessarily. This so-called historic or passive data can be stored in a specific passive data storage based on less expensive storage media, such as SSDs or hard disks, still providing sufficient performance for possible accesses at lower cost. The dynamic transition from active to passive data is supported by the database, based on custom rules defined as per customer needs.

We define two categories of data stores: active and passive: We refer to active data when it is accessed frequently and updates are expected (e.g., access rules). In contrast, we refer to passive data when this data either is not used frequently and neither updated nor read. Passive data is purely used for analytical and statistical purposes or in exceptional situations where specific investigations require this data. A possible storage hierarchy is given by memory registers, cache memory, main memory, flash storages, solid state disks, SAS hard disk drives, SATA hard disk drives, tapes, etc. As a result, rules for migrating data from one store to another need to be defined, we refer to it as aging strategy or aging rules. The process of aging data, i.e. migrating it from a faster store to a slower one, is considered as background tasks, which occurs on regularly basis, e.g. weekly or daily. Since this process involves reorganization of the entire data set, it should be processed during times with low data access, e.g. during nights or weekends.

Combined Row and Column Store

To support analytical and transactional workloads, two different types of database systems evolved. On the one hand, database systems for transactional workloads store and process every day’s business data in rows, i.e. attributes are stored side-by-side. On the other hand, analytical database systems aim to analyze selected attributes of huge data sets in a very short time.

SQL Interface on Columns and Rows

Column and row-oriented storage in HANA provides the foundation to store data according to its frequent usage patterns in column or in row-oriented manner to achieve optimal performance. Through the usage of SQL, that supports column as well as row-oriented storage, the applications on top stay oblivious to the choice of storage layout.

Minimal Projections

To increase the overall performance, it is required to select only the minimal set of attributes that should be projected for each query. This has two important advantages: First, it dramatically reduces the amount of accessed data that is transferred between client and server. Second, it reduces the number of accesses to random memory locations and thus increases the overall performance

Any Attribute as an Index

Traditional row-oriented databases store tables as collections of tuples. To improve access to specific values within columns and to avoid scanning the entire table, that is, all columns and rows, indexes are typically created for these columns. The offsets of the matching values are used as an index to retrieve the values of the remaining attributes, avoiding the need to read data that is not required for the result set. Consequently, complex objects can be filtered and retrieved via any of their attributes.

Insert-only

Insert-only or append-only describes how data is managed when inserting new data. The principle idea of insert-only is that changes to existing data are handled by appending new tuples to the data storage. Insert-only enables storing the complete history of value changes and the latest value for a certain attribute.For the history-based access control, insert-only builds the basis to store the entire history of queries for access decision. In addition, insert-only enables tracing of access decision, which can be used to perform incident analysis

Group-Key

A common access pattern of enterprise applications is to select a small group of records from larger relations. To speed up such queries, group-key indexes can be defined that build on the compressed dictionary. A group key index maps a dictionary-encoded value of a column to a list of positions where this value can be found in a relation.

No Aggregate Tables

Given the incredible aggregation speed provided by HANA, all aggregates required by any application can now be computed from the source data on-the-fly, providing the same or better performance as before and dramatically decreasing code complexity which makes system maintenance a lot easier.

On-the-fly Extensibility

The possibility of adding new columns to existing database tables dramatically simplifies a wide range of customization projects. In a column store database, such as HANA, all columns are stored in physical separation from one another. This allows for a simple implementation of column extensibility, which does not need to update any other existing columns of the table. This reduces a schema change to a pure metadata operation, allowing for flexible and real-time schema extensions.

Reduction of Layers

To avoid redundant data, logical layers, describing the transformations, are executed during runtime, thus increasing efficient use of hardware resources by removing all materialized data maintenance task. Moving formalizable application logic to the data it operates on results in a smaller application stack and increases maintainability by code reduction. (Use Stored Procedures & other Business Fuction Libraries)

Partitioning

We distinguish between two partitioning approaches: vertical and horizontal partitioning, whereas a combination of both approaches is also possible. Vertical partitioning refers to the rearranging of individual database columns. It is achieved by splitting columns of a database table into two or more column sets. Each of the column sets can be distributed on individual databases servers. This can also be used to build up database columns with different ordering to achieve better search performance while guaranteeing high-availability of data. In contrast, horizontal partitioning addresses large database tables and how to divide them into smaller pieces of data. As a result, each piece of the database table contains a subset of the complete data within the table. Splitting data into equivalent long horizontal partitions is used to support search operations and better scalability.

Lightweight Compression

For in-memory databases, compression is applied to reduce the amount of data that is transferred between main memory and CPU, as well as to reduce overall main memory consumption. However, the more complex the compression algorithm is, the more CPU cycles it will take to decompress the data to perform query execution. As a result, in-memory databases choose a trade-off between compression ration and performance using so called light-weight compression algorithms.

An example for a light-weight compression algorithm is dictionary compression. With dictionary compression, all value occurrences are replaced by a fixed length encoded value. This algorithm has two major advantages for in memory databases: First, it reduces the amount of required storage and second, it allows to perform predicate evaluation directly on the compressed data.

Bulk Load

Besides transactional inserts, HANA also supports a bulk load mode. This mode is designed to insert large sets of data without the transactional overhead and thus enables significant speed-ups when setting up systems or restoring previously collected data.

Multi-core and Parallelization

A CPU core can be considered as single worker on a construction area. If it is possible to map each query to a single core, the system’s response time is optimal. Query processing also involves data processing, i.e. the database needs to be queried in parallel, too. If the database is able to distribute the workload across multiple cores of a single system, this is optimal. If the workload exceeds physical capacities of a single system, multiple servers or blades need to be involved for work distribution to achieve optimal processing behavior. From the database perspective, the partitioning of data sets enables parallelization since multiple cores across servers can be involved for data processing.

MapReduce

MapReduce is a programming model to parallelize the processing of large amounts of data. HANA emulates the MapReduce programming model and allows the developer to define map functions as user-defined procedures. Support for the MapReduce programming model enables developers to implement specific analysis algorithms on HANA faster, without worrying about parallelization and efficient execution by HANA’s calculation engine.

Dynamic Multithreading within Nodes

partitioning database tasks on large data sets into as many jobs as threads are available on a given node. This way, the maximal utilization of any supported hardware can be achieved.

Single and Multi-Tenancy

To achieve the highest level of operational efficiency, the data of multiple customers can be consolidated onto a single HANA server. Multi-tenancy allows making HANA accessible for smaller customers at lower cost, as a benefit from the consolidation. Already today HANA is equipped with the technology to enable such consolidation while ensuring that no critical resources are contending between the customers sharing a server and while ensuring a reliant and highly-available storage of the customers data at the hosting site.

Analytics on Historical Data

For analytics the historical data is the key. In HANA, historical data is instantly available for analytical processing from solid state disk (SSD) drives. Only active data is required to reside in-memory permanently.

Text Retrieval and Exploration

Elements of search in unstructured data, such as linguistic or fuzzy search find their way into the domain of structured data, changing system interaction. Furthermore, for business environments added value lies in combining search in unstructured data with analytics of structured data.

Object Data Guides

The in-memory database improves the retrieving performance of a business object by adding

which is called object data guide and includes two aspects: In addition to the parent instance, every node instance can contain the link to the corresponding root instance. Using this additional attribute, it is possible to retrieve all nodes in parallel instead of waiting for the information from the parent level. Additionally, each node type in a business object can be numbered. Then, for every root instance, a bit vector (Object Data Guide) is stored, whose bit at position i indicates if an instance of node number i exists for this root instance. Using this bit vector, a table only needs to be checked if the corresponding bit is set, reducing the complexity of queries to a minimum. This is mainly for Sparse tree data.

Thursday, August 9, 2012

Features of the Column Storage Engine (HANA)

Both the row and column storage engine each have a built-in query capability optimized for the respective storage method. Consequently each of the storage engines has its strengths when used for differing tasks. The row engine supports all features provided by SAP HANA SQL. The column engine, however, has been optimized for analytics and supports a narrower set of SQL features. These features are much faster than the corresponding functions of the row store. Since not all features provided by SAP HANA SQL are natively supported by the column engine, the query optimizer sometimes creates an execution plan that involves both storage engines even if the data reside completely in the column store. On the other hand it is sometimes desirable to transfer intermediate results from the row store to the column store because subsequent calculation steps can be processed much faster there. As a general rule of thumb we recommend to use the "Explain Plan" feature of the SAP HANA studio to explore execution of critical SQL statements and optimize the data model and SQL statements accordingly.

The following list explains the query features of the column engine. The column engine can natively execute queries that are logically equivalent to queries that can be expressed in the following format.

SELECT

FROM

[ WHERE ]

[ GROUP BY ]

[ HAVING ]

[ ORDER BY

ASC | DESC separated by comma> ]

[ LIMIT [OFFSET ] ]

A column engine FROM specification can be one of the following:

· ( SELECT * FROM WHERE )

· INNER JOIN ON

· LEFT OUTER JOIN ON

· RIGHT OUTER JOIN ON

· FULL OUTER JOIN ON

The column engine supports following join conditions.

· =

§ Data types of two columns should be the same. Decimals with different precisions or scales are also treated as different data types.

· AND operations between column engine join conditions

§ For outer joins, cyclical joining is not supported. For example, T.a = S.a AND T.b = R.b is not supported if there is already a join condition between S and R.

The column engine supports the following filter conditions

· Binary comparison (=, <>, <, >, <= and >=) between column engine expressions

· LIKE, IS NULL and IN predicate on a column engine expression

· AND, OR, NOT operations over column engine filter conditions

The column engine supports the following scalar expressions.

· Constant values

· Binary arithmetic operation (+, -, * and /) between column engine expressions

· String concatenation (||) between column engine expressions

· CASE WHEN THEN

· expression> ... [ELSE ] END

· CASE WHEN THEN ... [ELSE ] END

· Functions listed below with column engine expressions as arguments

§ TO_DECIMAL, TO_NUMBER, TO_BIGINT, TO_REAL, TO_DOUBLE, TO_CHAR,

§ TO_NCHAR, TO_DATE, TO_TIMESTAMP and BINTOHEX/HEXTOBIN

§ Format strings are not supported for TO_CHAR, TO_NCHAR, TO_DATE and

§ TO_TIMESTAMP

§ LTRIM | RTRIM | TRIM, LENGTH, SUBSTR, INSTR and LOWER | UPPER

§ WEEKDAY, DAYS_BETWEEN, SECONDS_BETWEEN, ADD_DAYS, UTCTOLOCAL,

§ LOCALTOUTC, ISOWEEK and QUARTER

§ and FLOOR | CEIL

§ NULLIF, NVL, NVL2 and COALESCE

§ EXTRACT( YEAR | MONTH FROM )

· Functions without arguments are evaluated by the SQL parser. The evaluated results are passed to the column engine as constant values.

§ CURRENT_CONNECTION, CURRENT_SCHEMA, CURRENT_USER, SYSUUID, CURRENT_DATE/ CURRENT_TIME/ CURRENT_TIMESTAMP and CURRENT_UTCDATE/ CURRENT_UTCTIME/ CURRENT_UTCTIMESTAMP

§ Expressions equivalent to one of the listed above (e.g. CAST function)

· The column engine supports the following aggregation expressions.

§ MIN | MAX | COUNT | SUM | AVG ([DISTINCT] )

Since SQL provides multiple possible ways to formulate a query, an SQL query that is not in the above format can be processed natively by the column engine if the query is equivalent to a query that can be represented in the correct format. Below are examples of such equivalence. Equivalence is denoted as 􀑞.

SELECT * FROM T, S WHERE T.a = S.a

􀑞.SELECT * FROM T INNER JOIN S ON T.a = S.a

SELECT DISTINCT a, b, c FROM T

􀑞.SELECT a, b, c FROM T GROUP BY a, b, c

SELECT * FROM T WHERE EXISTS (SELECT * FROM S WHERE T.a = S.a)

􀑞.SELECT DISTINCT T.* FROM T INNER JOIN S ON T.a = S.a

Equivalent only if table T has a primary key.

SELECT * FROM T WHERE NOT EXISTS (SELECT * FROM S WHERE T.a =

S.a)

􀑞.SELECT T.* FROM T LEFT OUTER JOIN S ON T.a = S.a WHERE S.a

IS NULL

HANA Data Types Mapping to Java

UNIX supported locales for BOE

Locales supported for UNIX BOE Install:

Language	Browser Locales	Vesion	Solari 10	AIX Locales	Linux Locales	HP IA Locales
				(5.3 & 6.1)	RedHat 5.2 & 5.3 Suse 10 & 11	11.23 & 11.31
Tier 1	English	English (*)	en_US.UTF-8	EN_US.UTF-8	en_US.utf8	en_US.utf8
Tier 1	French	French (*)	fr_FR.UTF-8	FR_FR.UTF-8	fr_FR.utf8	fr_FR.utf8
Tier 1	Japanese	Japanese	ja_JP.UTF-8	JA_JP.UTF-8	ja_JP.utf8	ja_JP.utf8
Tier 1	Simlified Chinese	Chinese (China) [zh-cn] Chinese (Singapore) [zh-sg] Chinese [zh]	zh_CN.UTF-8@pinyin zh_CN.UTF-8@radical zh_CN.UTF-8@stroke	ZH_CN.UTF-8	zh_CN.utf8	zh_CN.utf8
Tier 1	German	German (*)	de_DE.UTF-8	DE_DE.UTF-8	de_DE.utf8	de_DE.utf8
Tier 1	Italian	Italian (*)	it_IT.UTF-8	IT_IT.UTF-8	it_IT.utf8	it_IT.utf8
Tier 1	Spanish	Spanish (*)	es_ES.UTF-8	ES_ES.UTF-8	es_ES.utf8	es_ES.utf8
Tier 1	Dutch	Dutch (*)	nl_NL.UTF-8	NL_NL.UTF-8	nl_NL.utf8	nl_NL.utf8
Tier 1	Russian	Russian	ru_RU.UTF-8	RU_RU.UTF-8	ru_RU.utf8	ru_RU.utf8
Tier 2	Korean	Korean	ko_KR.UTF-8@dict	KO_KR.UTF-8	ko_KR.utf8	ko_KR.utf8
Tier 2	Traditional Chinese	Chinese (Hong Kong SAR) [zh-hk] Chinese (Macau SAR) [zh-mo] Chinese (Taiwan) [zh-tw]	zh_TW.UTF-8@pinyin zh_TW.UTF-8@radical zh_TW.UTF-8@stroke zh_TW.UTF-8@zhuyin zh_HK.UTF-8@radical zh_HK.UTF-8@stroke	ZH_TW.UTF-8 ZH_HK.UTF-8	zh_TW.utf8 zh_HK.utf8	zh_TW.utf8 zh_HK.utf8
Tier 2	Portuguese	Portuguese (Brazil) [pt-br]	pt_BR.UTF-8	PT_BR.UTF-8	pt_BR.utf8	pt_BR.utf8
Tier 2	Swedish	Swedish [sv]	sv_SE.UTF-8	SV_SE.UTF-8	sv_SE.utf8	sv_SE.utf8
Tier-3	Polish	Polish	pl_PL.UTF-8	PL_PL.UTF-8	pl_PL.utf8	pl_PL.utf8
Tier-3	Danish	Danish	da_DK.UTF-8	DA_DK.UTF-8	da_DK.utf8	da_DK.utf8
Tier-3	Thai	Thai	th_TH.UTF-8	TH_TH.UTF-8	th_TH.utf8	No Locale
Tier-3	Norwegian	Bokmal	nb_NO.UTF-8	NO_NO.UTF-8	no_NO.utf8	no_NO.utf8
Tier-3	Czechoslovakia	Czech	cs_CZ.UTF-8	CS_CZ.UTF-8	cs_CZ.utf8	cs_CZ.utf8
Tier-3	Finnish	Finnish	fi_FI.UTF-8	FI_FI.UTF-8	fi_FI.utf8	fi_FI.utf8
Tier-3	Turkish	Turkish	tr_TR.UTF-8	TR_TR.UTF-8	tr_TR.utf8	tr_TR.utf8
Tier-3	Hungarian	Hungarian	hu_HU.UTF-8	HU_HU.UTF-8	hu_HU.utf8	hu_HU.utf8
Tier-3	Slovakian	Slovakian	sk_SK.UTF-8	SK_SK.UTF-8	sk_SK.utf8	sk_SK.utf8

Autodidact

Labels