图算法 #

一、图算法概述 #

1.1 什么是图算法 #

图算法是用于分析图结构数据的算法，用于发现图中的模式、结构和特征。

text

图算法分类：
├── 路径算法：最短路径、所有路径
├── 中心性算法：PageRank、度中心性
├── 社区发现算法：连通分量、标签传播
├── 相似度算法：Jaccard、余弦相似度
└── 预测算法：链接预测、节点分类

1.2 Neptune图算法支持 #

text

Neptune支持方式：
├── Gremlin内置算法
├── Neptune ML（机器学习）
├── 自定义算法
└── 外部工具集成

二、中心性算法 #

2.1 度中心性 #

gremlin

// 出度
g.V().hasLabel('user').project('name', 'outDegree').
  by('name').
  by(outE().count())

// 入度
g.V().hasLabel('user').project('name', 'inDegree').
  by('name').
  by(inE().count())

// 总度
g.V().hasLabel('user').project('name', 'degree').
  by('name').
  by(bothE().count())

// 高度节点
g.V().hasLabel('user').
  order().by(bothE().count(), desc).
  limit(10).
  project('name', 'degree').
    by('name').
    by(bothE().count())

2.2 PageRank模拟 #

gremlin

// 简化版PageRank（迭代计算）
// 初始化
g.V().hasLabel('page').property('rank', 1.0)

// 迭代更新
g.V().hasLabel('page').as('v').
  property('rank',
    0.15 + 0.85 * 
    __.inE('links').outV().values('rank').sum() / 
    __.inE('links').outV().outE('links').count()
  )

// 获取排名
g.V().hasLabel('page').
  order().by('rank', desc).
  limit(10).
  project('name', 'rank').
    by('name').
    by('rank')

2.3 接近中心性 #

gremlin

// 计算平均最短路径长度
g.V().hasLabel('user').as('v').
  project('name', 'avgDistance').
    by('name').
    by(
      repeat(both().simplePath()).emit().times(5).
      groupCount().by(loops()).
      project('total', 'count').
        by(select(values).unfold().sum()).
        by(select(values).unfold().count()).
      math('total / count')
    )

三、社区发现算法 #

3.1 连通分量 #

gremlin

// 查找连通分量
g.V().hasLabel('user').
  repeat(both().simplePath()).emit().
  dedup().
  group().by('componentId')

// 查找最大连通分量
g.V().hasLabel('user').as('v').
  repeat(both().simplePath()).emit().as('connected').
  select('v').groupCount().by(id).
  order(local).by(values, desc).
  limit(local, 1)

// 孤立节点
g.V().hasLabel('user').where(bothE().count().is(0))

3.2 标签传播模拟 #

gremlin

// 简化版标签传播
// 初始化：每个节点使用自己的ID作为标签
g.V().hasLabel('user').property('label', id())

// 迭代：每个节点采用邻居的主流标签
g.V().hasLabel('user').as('v').
  property('label',
    __.both().groupCount().by('label').
    order(local).by(values, desc).
    limit(local, 1).
    select(keys).unfold()
  )

// 获取社区
g.V().hasLabel('user').group().by('label')

3.3 三角计数 #

gremlin

// 计算三角形数量
g.V().hasLabel('user').as('a').
  out().as('b').
  out().where(eq('a')).as('c').
  where(out().where(eq('b'))).
  count()

// 每个节点的三角形数
g.V().hasLabel('user').as('v').
  project('name', 'triangles').
    by('name').
    by(
      out().as('b').
      out().where(eq('v')).
      where(out().where(eq('b'))).
      count()
    )

四、相似度算法 #

4.1 Jaccard相似度 #

gremlin

// 计算两个节点的Jaccard相似度
g.V('1').as('a').V('2').as('b').
  project('intersection', 'union').
    by(__.select('a').out().where(within(__.select('b').out())).count()).
    by(__.union(__.select('a').out(), __.select('b').out()).dedup().count()).
  math('intersection / union')

// 找相似用户
g.V().hasLabel('user').as('a').
  V().hasLabel('user').where(neq('a')).as('b').
  project('user1', 'user2', 'similarity').
    by(select('a').values('name')).
    by(select('b').values('name')).
    by(
      __.project('intersection', 'union').
        by(__.select('a').out().where(within(__.select('b').out())).count()).
        by(__.union(__.select('a').out(), __.select('b').out()).dedup().count()).
      math('intersection / union')
    ).
  order().by(select('similarity'), desc).
  limit(10)

4.2 余弦相似度 #

gremlin

// 计算余弦相似度
g.V('1').as('a').V('2').as('b').
  project('dot', 'normA', 'normB').
    by(__.select('a').out().as('x').
        select('b').out().where(eq('x')).
        count()).
    by(__.select('a').out().count().math('sqrt(_)')).
    by(__.select('b').out().count().math('sqrt(_)')).
  math('dot / (normA * normB)')

4.3 共同邻居 #

gremlin

// 共同邻居
g.V('1').out().where(within(g.V('2').out().fold())).dedup()

// 共同邻居数量
g.V('1').out().where(within(g.V('2').out().fold())).dedup().count()

// Adamic-Adar指数
g.V('1').out().where(within(g.V('2').out().fold())).as('common').
  group().by().by(in().count().math('1 / log(_)')).
  select(values).unfold().sum()

五、Neptune ML #

5.1 Neptune ML概述 #

text

Neptune ML功能：
├── 图神经网络（GNN）
├── 节点分类
├── 节点回归
├── 链接预测
└── 边分类

5.2 节点分类 #

text

节点分类流程：
├── 1. 数据准备
├── 2. 模型训练
├── 3. 模型推理
└── 4. 结果查询

5.3 链接预测 #

gremlin

// 使用Neptune ML进行链接预测
g.with('Neptune#mlEndpoint', 'endpoint-name').
  V('1').out('predicted_knows').has('prediction_score', gt(0.8))

六、实际应用示例 #

6.1 社交网络分析 #

gremlin

// 找出影响力最大的用户
g.V().hasLabel('user').
  project('name', 'followers', 'following').
    by('name').
    by(inE('follows').count()).
    by(outE('follows').count()).
  order().by('followers', desc).
  limit(10)

// 找出社区核心
g.V().hasLabel('user').as('v').
  project('name', 'communityScore').
    by('name').
    by(
      both().dedup().as('neighbor').
      both().where(within('neighbor')).
      count()
    ).
  order().by('communityScore', desc).
  limit(10)

6.2 欺诈检测 #

gremlin

// 检测异常交易网络
g.V().hasLabel('transaction').has('amount', gt(10000)).
  repeat(both('related').simplePath()).times(3).
  emit(has('flagged', true)).
  path().by('transactionId').
  dedup()

// 检测紧密连接的账户组
g.V().hasLabel('account').as('a').
  out('transferred_to').as('b').
  out('transferred_to').where(eq('a')).
  where(out('transferred_to').where(eq('b'))).
  select('a', 'b').by('accountId')

6.3 推荐系统 #

gremlin

// 基于协同过滤的推荐
g.V().has('userId', 'user_001').out('purchased').as('item').
  in('purchased').where(neq(g.V().has('userId', 'user_001'))).
  out('purchased').where(neq('item')).
  groupCount().by(id).
  order(local).by(values, desc).
  limit(local, 10).
  select(keys).unfold().
  valueMap('name', 'price')

// 基于相似度的推荐
g.V().has('userId', 'user_001').as('user').
  V().hasLabel('user').where(neq('user')).as('other').
  project('other', 'similarity').
    by('name').
    by(
      __.project('intersection', 'union').
        by(__.select('user').out('purchased').where(within(__.select('other').out('purchased').fold())).count()).
        by(__.union(__.select('user').out('purchased'), __.select('other').out('purchased')).dedup().count()).
      math('intersection / union')
    ).
  order().by('similarity', desc).
  limit(5)

七、算法优化 #

7.1 性能优化 #

text

性能优化建议：
├── 限制遍历深度
├── 使用索引过滤
├── 分批处理大数据
├── 使用缓存
└── 选择合适的算法

7.2 算法选择 #

text

算法选择建议：
├── 小图：使用Gremlin内置
├── 中等图：使用优化查询
├── 大图：使用Neptune ML
└── 复杂算法：使用外部工具

八、总结 #

图算法要点：

算法类型	用途	Gremlin实现
度中心性	影响力分析	outE().count()
PageRank	网页排名	迭代计算
连通分量	社区发现	repeat(both())
Jaccard相似度	相似度计算	集合运算
链接预测	推荐系统	Neptune ML

最佳实践：

选择合适的算法
限制计算范围
使用索引优化
分批处理大数据
考虑使用Neptune ML

下一步，让我们学习性能优化！