How can I read the Parquet dictionary in Java?

Question

I have seen that the Parquet format uses dictionaries to store some columns, and that these dictionaries can be used to speed up filters if useDictionaryFilter() is set on the ParquetReader.

Is there any way to access these dictionaries from Java code?
I'd like to use them to build a list of the distinct values of a column, and I thought that reading only the dictionary values would be faster than scanning the whole column.

I have looked into the org.apache.parquet.hadoop.ParquetReader API but did not find anything.

Solution

The methods in org.apache.parquet.column.Dictionary allow you to:

  • Query the range of dictionary indexes: from 0 to getMaxId(), inclusive.
  • Look up the entry corresponding to any index; for an int field, for example, you can use decodeToInt().
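
The two calls above combine into a simple loop. The sketch below is only an illustration: IntDictionary is a hypothetical stand-in that mirrors getMaxId() and decodeToInt() so the example compiles without Parquet on the classpath, and the class name, method name, and dictionary contents are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class DictionaryValues {
    // Hypothetical stand-in for the two methods of
    // org.apache.parquet.column.Dictionary used here, so the sketch
    // compiles without Parquet on the classpath.
    interface IntDictionary {
        int getMaxId();            // highest valid index (inclusive)
        int decodeToInt(int id);   // entry stored at that index
    }

    // Collect every dictionary entry by walking indexes 0..getMaxId().
    static List<Integer> distinctInts(IntDictionary dictionary) {
        List<Integer> values = new ArrayList<>();
        for (int id = 0; id <= dictionary.getMaxId(); id++) {
            values.add(dictionary.decodeToInt(id));
        }
        return values;
    }

    public static void main(String[] args) {
        int[] entries = {7, 42, 1999};   // made-up dictionary contents
        IntDictionary dictionary = new IntDictionary() {
            public int getMaxId() { return entries.length - 1; }
            public int decodeToInt(int id) { return entries[id]; }
        };
        System.out.println(distinctInts(dictionary)); // prints [7, 42, 1999]
    }
}
```

With the real library, the same loop should work when the parameter type is org.apache.parquet.column.Dictionary; for other column types the matching decodeTo...() method would replace decodeToInt().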

Once you have a Dictionary, you can iterate over all indexes to get all entries, so the question boils down to getting a Dictionary. To do that, use ColumnReaderImpl as a guide:

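import java.io.IOException;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Dictionary;
import org.apache.parquet.column.page.DictionaryPage;
import org.apache.parquet.column.page.PageReader;

// Returns the chunk's dictionary, or null if it has no dictionary page.
static Dictionary getDictionary(ColumnDescriptor path, PageReader pageReader) throws IOException {
  DictionaryPage dictionaryPage = pageReader.readDictionaryPage();
  if (dictionaryPage == null) return null;
  return dictionaryPage.getEncoding().initDictionary(path, dictionaryPage);
}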

Please note that a column chunk may contain a mixture of data pages, some dictionary-encoded and some not. If the dictionary "gets full" (reaches the maximum allowed size), the writer outputs the dictionary page and the dictionary-encoded data pages, then stops using dictionary encoding for the remaining data pages. Values that first appear in those later pages are not in the dictionary, so for such a chunk the dictionary alone does not give the complete list of distinct values.
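
To make the consequence concrete, here is a toy model of that fallback. It is not Parquet's writer: the limit is counted in entries rather than bytes, the switch happens per value rather than per page, and every name in it is hypothetical. It only shows why a dictionary can end up with fewer entries than the column has distinct values.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DictionaryFallbackDemo {
    // Toy model, not Parquet's writer: values are added to the dictionary
    // until it holds maxEntries distinct values; once a new value no longer
    // fits, the rest of the column is written plainly and never reaches
    // the dictionary.
    static Set<Integer> dictionaryAfterWrite(List<Integer> column, int maxEntries) {
        Set<Integer> dictionary = new LinkedHashSet<>();
        boolean fellBack = false;
        for (int value : column) {
            if (fellBack) {
                continue;            // plain-encoded data: dictionary untouched
            }
            if (!dictionary.contains(value) && dictionary.size() == maxEntries) {
                fellBack = true;     // dictionary is full: switch encodings
                continue;
            }
            dictionary.add(value);
        }
        return dictionary;
    }

    public static void main(String[] args) {
        List<Integer> column = List.of(1, 2, 1, 3, 4, 2, 5);  // made-up data
        System.out.println(dictionaryAfterWrite(column, 3));  // prints [1, 2, 3]
        System.out.println(new LinkedHashSet<>(column));      // prints [1, 2, 3, 4, 5]
    }
}
```

The values 4 and 5 never reach the dictionary, so a distinct-value list built from the dictionary alone would miss them. A reader that wants a complete list would have to confirm that every data page in the chunk is dictionary-encoded, or scan the column for chunks where that is not the case.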
