ClickHouse系列教程二:使用航班飞行数据

ClickHouse系列教程: ClickHouse系列教程


根据官方提供的教程:ClickHouse Quick Start Guide

先下载数据:ontime.csv.xz — Yandex.Disk
压缩包大小是3G,解压后61G。
解压命令如下:

root@ubuntu:/home/zhang# xz -v  -d ontime.csv.xz
ontime.csv.xz (1/1)
  100 %      3,368.8 MiB / 61.6 GiB = 0.053   101 MiB/s      10:26   

然后在数据库中建表:

root@ubuntu:/home/zhang# clickhouse-client --password     --multiline
ClickHouse client version 19.9.3.31 (official build).
Password for user (default): 
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.9.3 revision 54421.

ubuntu :) CREATE TABLE ontime
(
    Year UInt16,
    Quarter UInt8,
    Month UInt8,
    DayofMonth UInt8,
    DayOfWeek UInt8,
    FlightDate Date,
    UniqueCarrier FixedString(7),
    AirlineID Int32,
    Carrier FixedString(2),
    TailNum String,
    FlightNum String,
    OriginAirportID Int32,
    OriginAirportSeqID Int32,
    OriginCityMarketID Int32,
    Origin FixedString(5),
    OriginCityName String,
    OriginState FixedString(2),
    OriginStateFips String,
    OriginStateName String,
    OriginWac Int32,
    DestAirportID Int32,
    DestAirportSeqID Int32,
    DestCityMarketID Int32,
    Dest FixedString(5),
    DestCityName String,
    DestState FixedString(2),
    DestStateFips String,
    DestStateName String,
    DestWac Int32,
    CRSDepTime Int32,
    DepTime Int32,
    DepDelay Int32,
    DepDelayMinutes Int32,
    DepDel15 Int32,
    DepartureDelayGroups String,
    DepTimeBlk String,
    TaxiOut Int32,
    WheelsOff Int32,
    WheelsOn Int32,
    TaxiIn Int32,
    CRSArrTime Int32,
    ArrTime Int32,
    ArrDelay Int32,
    ArrDelayMinutes Int32,
    ArrDel15 Int32,
    ArrivalDelayGroups Int32,
    ArrTimeBlk String,
    Cancelled UInt8,
    CancellationCode FixedString(1),
    Diverted UInt8,
    CRSElapsedTime Int32,
    ActualElapsedTime Int32,
    AirTime Int32,
    Flights Int32,
    Distance Int32,
    DistanceGroup UInt8,
    CarrierDelay Int32,
    WeatherDelay Int32,
    NASDelay Int32,
    SecurityDelay Int32,
    LateAircraftDelay Int32,
    FirstDepTime String,
    TotalAddGTime String,
    LongestAddGTime String,
    DivAirportLandings String,
    DivReachedDest String,
    DivActualElapsedTime String,
    DivArrDelay String,
    DivDistance String,
    Div1Airport String,
    Div1AirportID Int32,
    Div1AirportSeqID Int32,
    Div1WheelsOn String,
    Div1TotalGTime String,
    Div1LongestGTime String,
    Div1WheelsOff String,
    Div1TailNum String,
    Div2Airport String,
    Div2AirportID Int32,
    Div2AirportSeqID Int32,
    Div2WheelsOn String,
    Div2TotalGTime String,
    Div2LongestGTime String,
    Div2WheelsOff String,
    Div2TailNum String,
    Div3Airport String,
    Div3AirportID Int32,
    Div3AirportSeqID Int32,
    Div3WheelsOn String,
    Div3TotalGTime String,
    Div3LongestGTime String,
    Div3WheelsOff String,
    Div3TailNum String,
    Div4Airport String,
    Div4AirportID Int32,
    Div4AirportSeqID Int32,
    Div4WheelsOn String,
    Div4TotalGTime String,
    Div4LongestGTime String,
    Div4WheelsOff String,
    Div4TailNum String,
    Div5Airport String,
    Div5AirportID Int32,
    Div5AirportSeqID Int32,
    Div5WheelsOn String,
    Div5TotalGTime String,
    Div5LongestGTime String,
    Div5WheelsOff String,
    Div5TailNum String
)
ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);

CREATE TABLE ontime
(
    `Year` UInt16, 
    `Quarter` UInt8, 
    `Month` UInt8, 
    `DayofMonth` UInt8, 
    `DayOfWeek` UInt8, 
    `FlightDate` Date, 
    `UniqueCarrier` FixedString(7), 
    `AirlineID` Int32, 
    `Carrier` FixedString(2), 
    `TailNum` String, 
    `FlightNum` String, 
    `OriginAirportID` Int32, 
    `OriginAirportSeqID` Int32, 
    `OriginCityMarketID` Int32, 
    `Origin` FixedString(5), 
    `OriginCityName` String, 
    `OriginState` FixedString(2), 
    `OriginStateFips` String, 
    `OriginStateName` String, 
    `OriginWac` Int32, 
    `DestAirportID` Int32, 
    `DestAirportSeqID` Int32, 
    `DestCityMarketID` Int32, 
    `Dest` FixedString(5), 
    `DestCityName` String, 
    `DestState` FixedString(2), 
    `DestStateFips` String, 
    `DestStateName` String, 
    `DestWac` Int32, 
    `CRSDepTime` Int32, 
    `DepTime` Int32, 
    `DepDelay` Int32, 
    `DepDelayMinutes` Int32, 
    `DepDel15` Int32, 
    `DepartureDelayGroups` String, 
    `DepTimeBlk` String, 
    `TaxiOut` Int32, 
    `WheelsOff` Int32, 
    `WheelsOn` Int32, 
    `TaxiIn` Int32, 
    `CRSArrTime` Int32, 
    `ArrTime` Int32, 
    `ArrDelay` Int32, 
    `ArrDelayMinutes` Int32, 
    `ArrDel15` Int32, 
    `ArrivalDelayGroups` Int32, 
    `ArrTimeBlk` String, 
    `Cancelled` UInt8, 
    `CancellationCode` FixedString(1), 
    `Diverted` UInt8, 
    `CRSElapsedTime` Int32, 
    `ActualElapsedTime` Int32, 
    `AirTime` Int32, 
    `Flights` Int32, 
    `Distance` Int32, 
    `DistanceGroup` UInt8, 
    `CarrierDelay` Int32, 
    `WeatherDelay` Int32, 
    `NASDelay` Int32, 
    `SecurityDelay` Int32, 
    `LateAircraftDelay` Int32, 
    `FirstDepTime` String, 
    `TotalAddGTime` String, 
    `LongestAddGTime` String, 
    `DivAirportLandings` String, 
    `DivReachedDest` String, 
    `DivActualElapsedTime` String, 
    `DivArrDelay` String, 
    `DivDistance` String, 
    `Div1Airport` String, 
    `Div1AirportID` Int32, 
    `Div1AirportSeqID` Int32, 
    `Div1WheelsOn` String, 
    `Div1TotalGTime` String, 
    `Div1LongestGTime` String, 
    `Div1WheelsOff` String, 
    `Div1TailNum` String, 
    `Div2Airport` String, 
    `Div2AirportID` Int32, 
    `Div2AirportSeqID` Int32, 
    `Div2WheelsOn` String, 
    `Div2TotalGTime` String, 
    `Div2LongestGTime` String, 
    `Div2WheelsOff` String, 
    `Div2TailNum` String, 
    `Div3Airport` String, 
    `Div3AirportID` Int32, 
    `Div3AirportSeqID` Int32, 
    `Div3WheelsOn` String, 
    `Div3TotalGTime` String, 
    `Div3LongestGTime` String, 
    `Div3WheelsOff` String, 
    `Div3TailNum` String, 
    `Div4Airport` String, 
    `Div4AirportID` Int32, 
    `Div4AirportSeqID` Int32, 
    `Div4WheelsOn` String, 
    `Div4TotalGTime` String, 
    `Div4LongestGTime` String, 
    `Div4WheelsOff` String, 
    `Div4TailNum` String, 
    `Div5Airport` String, 
    `Div5AirportID` Int32, 
    `Div5AirportSeqID` Int32, 
    `Div5WheelsOn` String, 
    `Div5TotalGTime` String, 
    `Div5LongestGTime` String, 
    `Div5WheelsOff` String, 
    `Div5TailNum` String
)
ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)

Ok.

0 rows in set. Elapsed: 0.077 sec. 

现在我们有一个MergeTree类型的表。建议在生产中使用MergeTree表类型。这种表具有用于递增排序表数据的主键。这允许在主键的范围内快速执行查询。

然后加载数据:

cat ontime.csv | clickhouse-client --query="INSERT INTO ontime FORMAT CSV"  --password

我在加载数据的时候使用了8G内存,最后使用了14G的磁盘空间。

加载完成后进行测试,查询2015年最受欢迎的目的地:

SELECT
    OriginCityName,
    DestCityName,
    count(*) AS flights,
    bar(flights, 0, 20000, 40)
FROM ontime WHERE Year = 2015 GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20

结果如下:

root@ubuntu:/home/zhang# clickhouse-client --password     --multiline
ClickHouse client version 19.9.3.31 (official build).
Password for user (default): 
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.9.3 revision 54421.

ubuntu :) SELECT
:-]     OriginCityName,
:-]     DestCityName,
:-]     count(*) AS flights,
:-]     bar(flights, 0, 20000, 40)
:-] FROM ontime WHERE Year = 2015 GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20;

SELECT 
    OriginCityName, 
    DestCityName, 
    count(*) AS flights, 
    bar(flights, 0, 20000, 40)
FROM ontime 
WHERE Year = 2015
GROUP BY 
    OriginCityName, 
    DestCityName
ORDER BY flights DESC
LIMIT 20

┌─OriginCityName────┬─DestCityName──────┬─flights─┬─bar(count(), 0, 20000, 40)──────┐
│ San Francisco, CA │ Los Angeles, CA15116 │ ██████████████████████████████▏ │
│ Los Angeles, CA   │ San Francisco, CA14799 │ █████████████████████████████▌  │
│ New York, NY      │ Chicago, IL14734 │ █████████████████████████████▍  │
│ Chicago, IL       │ New York, NY14632 │ █████████████████████████████▎  │
│ Boston, MA        │ New York, NY13201 │ ██████████████████████████▍     │
│ New York, NY      │ Boston, MA13201 │ ██████████████████████████▍     │
│ New York, NY      │ Los Angeles, CA13113 │ ██████████████████████████▏     │
│ Los Angeles, CA   │ New York, NY13106 │ ██████████████████████████▏     │
│ Chicago, IL       │ Washington, DC12509 │ █████████████████████████       │
│ Washington, DC    │ Chicago, IL12310 │ ████████████████████████▌       │
│ Atlanta, GA       │ Chicago, IL12213 │ ████████████████████████▍       │
│ Chicago, IL       │ Atlanta, GA12103 │ ████████████████████████▏       │
│ Los Angeles, CA   │ Chicago, IL11111 │ ██████████████████████▏         │
│ Atlanta, GA       │ New York, NY11004 │ ██████████████████████          │
│ New York, NY      │ Atlanta, GA10986 │ █████████████████████▊          │
│ Miami, FL         │ New York, NY10790 │ █████████████████████▌          │
│ New York, NY      │ Miami, FL10779 │ █████████████████████▌          │
│ Chicago, IL       │ Los Angeles, CA10755 │ █████████████████████▌          │
│ Las Vegas, NV     │ Los Angeles, CA10657 │ █████████████████████▎          │
│ Boston, MA        │ Washington, DC10655 │ █████████████████████▎          │
└───────────────────┴───────────────────┴─────────┴─────────────────────────────────┘

20 rows in set. Elapsed: 13.848 sec. Processed 7.79 million rows, 359.82 MB (562.20 thousand rows/s., 25.98 MB/s.) 

速度非常快。

已标记关键词 清除标记
相关推荐
<p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">ClickHouse是一个用于OLAP的数据分析引擎,由俄罗斯搜索巨头Yandex公司开源。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">1、可支持PB级超大容量的数据库管理系统。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">2、基于SQL语句, 使用成本低。 </p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">3、超亿级数据量分析的秒级响应,计算性能横向扩展。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">4、海量数据即查即用。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">5、提供数据的预聚合能力,进一步提升数据查询的效率。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">6、列式存储, 数据压缩,降低磁盘IO和网络IO,提升计算性能,节约70%物理存储。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">7、支持副本, 实现跨机房的数据容灾。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;"><br style="font-family: 'Helvetica Neue', Helvetica, 'Hiragino Sans GB', Arial, sans-serif; margin: 0px; padding: 0px;" />课程基于ClickHouse的**的20.x版本,从理论到实践,由浅入深详解ClickHouse的原理和使用。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">1. 本课程从零开始,有Linux操作系统和SQL使用经验即可。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">2. 课程以案例驱动,包含大量的实践案例。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">3. 将ClickHouse与MySQL、Kafka、HDFS等第三方系统集成,增强ClickHouse的功能。</p> <p style="font-family: tahoma, 微软雅黑, arial, 宋体; margin: 0px; padding: 0px; box-sizing: border-box; border: 0px; outline: 0px; font-size: 14px; vertical-align: baseline; background: #ffffff; list-style-type: none; color: #333333; word-spacing: 1px;">4. 课程覆盖面广,既有基础功能的详细讲解, 也包含了集群部署、查询优化、运维管理等高级内容。</p>
©️2020 CSDN 皮肤主题: 编程工作室 设计师:CSDN官方博客 返回首页